The Art of Observability

Jaeger Go All In on oTel | Coralogix Get Mobile

Welcome to Edition #23 of the Newsletter!

O11ympic Competition

The summer is here (for those of us in the Northern hemisphere) but it seems as if observability vendors are not taking a break. The increasing level of competition in the marketplace is seeing new features being released at a relentless rate. In our last edition we looked at advances in log management. This time around we cover major updates in features such as debugging the oTel Collector and managing ingestion pipelines, as well as a major upgrade for the Jaeger tracing tool.

There is a fair amount of speculation that the observability market may be saturated. This may or may not be true, but it is not deterring new vendors from entering the space. Whilst this means more competition, the pie also seems to be growing, so it is not necessarily a zero-sum game, and consumers can enjoy the fruits of innovation and price competition.

oTel Debugging At The Double

You wait ages for an OpenTelemetry debugging tool to come along and then, inevitably, two come at once. First, Lightstep developer Jacob Aronoff alerted us to his Tails utility, then just a few days ago Grafana announced real-time pipeline debugging in the latest version of Alloy. We look at both tools in this issue.

To The Beach!

There may be no rest for observability vendors. We, however, are off to enjoy a summer break for a few weeks. The next edition of the newsletter will be in early September - when we hope to be making a very exciting announcement!

Have a great August!

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

The Art of Observability

A lot of observability system marketing emphasizes pricing or capabilities such as ingestion or retention rates. Whilst these are important, observability systems are about understanding our networks and services and being able to troubleshoot effectively. The key to achieving this is a user experience that can visualise vast masses of data and guide users through well-designed flows.

This article on the Observability 360 website is an appreciation of the visual and user experience aspects of observability systems. It looks at iconic features such as Honeycomb's BubbleUp and Flame Graphs and explores the design process behind leading applications such as Grafana FrontEnd Observability and IBM’s SevOne.

Coralogix Add Mobile Observability

Image courtesy of Coralogix Blog

Following in the footsteps of vendors such as New Relic, Coralogix are the latest full-stack operator to add mobile observability to their portfolio with the rollout of their Mobile Real User Monitoring module. The solution tracks mobile application errors and groups them into four types such as Crash Error and Network Error. You can then drill down for further analytics on each type of error and see granular data such as the threads that were running at the time the error occurred. The feature also integrates with the Coralogix APM solution to offer a complete end-to-end picture of application performance.

Google Cloud Release oTel ‘Pipeline’ For GKE

Google have responded to growing demand for OpenTelemetry instrumentation of apps running in their GKE instances with the release of a “curated OpenTelemetry ingestion pipeline”. Whilst this sounds quite fancy, it is actually a set of YAML manifests for deploying a standard OpenTelemetry Collector. The ‘curated’ bit means that Google have fine-tuned the configuration to work optimally on GKE and communicate with GCP Observability backends.

Overall, this is obviously a good thing as manual configuration of oTel manifests can potentially be a rather intimidating prospect. It is worth pointing out that the pipeline does not perform auto-instrumentation - you need to instrument your apps yourself.

Grafana Release Alloy 1.3 - With Live Debugging

Alloy is Grafana’s own open source distribution of the OpenTelemetry Collector. As well as being fully compatible with the oTel Collector it also features a number of additional capabilities - including Components - a means of creating tasks and chaining them together into flows.

The big story in the latest release is the introduction of live debugging. Once this is enabled, you can connect to a web UI exposed by Alloy and drill down into a Live Debugging screen. From here you can see the telemetry flowing through the component. For each line of telemetry, you can see the raw data as it was received as well as its state after any transformations have been applied.

This seems to be emblematic of the Embrace and Extend approach that many of the major vendors are taking in relation to the oTel Collector - providing full support but also embedding it within an implementation including proprietary features.

Products

Tails - Standalone oTel Debugging!

In our last edition we featured the otel-desktop-viewer utility and hoped to see more such tooling emerging. Well, our wish was granted in double-quick time as we were pointed in the direction of Tails - an app that runs as a sidecar to your OpenTelemetry Collector. Its developers include Jacob Aronoff of Lightstep and Austin Parker of Honeycomb. The app itself is a lightweight web server that listens on a socket and streams live messages from a Collector.

The app supports logs, traces and metrics and also has some cool features such as Play/Pause mode and filtering. The oTel Collector is a great piece of engineering but it can also be a bit of a black box. This is a great tool for providing visibility into the Collector’s telemetry streams and will be going straight into our oTel toolbelt.

Metoro - Microservice Observability

Metoro is the latest addition to the roster of eBPF-powered observability solutions for microservices running on Kubernetes. Interestingly, they do not yet have their own eBPF sensor - they are currently using the Coroot distribution but are working on developing their own.

As well as the features you would expect - such as logging, tracing and metrics - the platform also provides automated performance regression monitoring and AI-driven root cause analysis.

The product has a very simple pricing model - charging $20 per node per month. There is a free “hobbyist” plan with a limit of one K8S cluster and two nodes if you would like to put it through its paces.

Jaeger V2 Rings The Changes

Jaeger V2 Binary Architecture

Distributed tracing stalwart Jaeger have announced some major strategic changes for the V2 release of the product. The big news is a major technical re-design which puts OpenTelemetry right at the heart of the Jaeger architecture. In fact, Jaeger are actually going all-in by directly importing the oTel Collector code as a library within the Jaeger V2 binary.

In a blog post on Medium, the creator of the platform, Yuri Shkuro, sketched out his vision for the future of the product. One of the most radical updates is native support for the OTLP format. This simplifies the internal design of the product by removing the layer which translated from OTLP to Jaeger format - and should also lead to performance improvements. This blog post is an essential read for anybody who uses Jaeger and seeks an understanding of its future direction.

From the Blogosphere (and the Chronosphere)

WASM - Ready For Prime Time

The buzz around WASM is steadily growing as the tooling continues to mature and more developers get on board. This article on the wasmCloud blog summarizes a recent discussion between Dotan Horovits of Logz.io and wasmCloud maintainer Taylor Thomas.

The article encapsulates really eloquently the revolutionary nature of the WebAssembly paradigm. It also covers key technical concerns such as WASI Observe, which exposes interfaces for tracing, metrics and logging, the WASM Component Model and the types of workloads best suited to the WASM model.

Keeping On Top Of Your Logging Strategy

As volumes increase and technological changes open up new possibilities, it is important to continually look at the big picture and review our overall logging strategies. This article is actually a transcript of a lively roundtable discussion led by Rachel Dines of Chronosphere (the page also links to individual video clips of the discussion). The backdrop to the talk is a survey carried out by Chronosphere which underscored the continuing challenge of log management. Amongst a sample of 120 engineers, they found an average of 2.5x log data growth over the past 12 months alone. Forty percent of respondents were generating over 100GB per day.

The conversation covers a number of key issues such as log relevance, routing and pipelines. The panellists also share insights from their experiences with customers in the field - including one banking client running a fleet of over 20,000 Fluent Bit and Fluentd agents. Naturally, there is some plugging of the Chronosphere product but there is also plenty of good general content to chew on.

OpenTelemetry

oTel/Prometheus Compatibility Survey

Whilst OpenTelemetry inexorably establishes itself as the industry standard for logs and traces, the picture with metrics is a little more complex. This is because of the monumental presence of Prometheus as the existing de facto standard. Even for systems which do not use a Prometheus server, the Prometheus protocol is something of a lingua franca. This essentially means we have two competing standards for metrics.

A large amount of work is going on to achieve interoperability between the two technologies and the results of this survey on OpenTelemetry/Prometheus compatibility highlight some of the differences in semantics and formatting that need to be addressed. One of the major sticking points is the use of dots vs underscores in metric names. Whilst this might seem an arcane point, many of us will have seen long and bruising battles raging over such questions at tech team meetings 😬
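To make the dots vs underscores point concrete, here is a minimal Python sketch of the kind of name translation involved. It assumes the classic Prometheus rule that metric names may only contain letters, digits, underscores and colons and may not start with a digit. The helper name `to_prometheus_name` is our own invention for illustration, and the sketch deliberately ignores other parts of the compatibility work, such as unit and counter suffixes.

```python
import re

# Anything outside the classic Prometheus metric-name alphabet
# (letters, digits, underscores, colons) counts as invalid here.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def to_prometheus_name(otel_name: str) -> str:
    """Hypothetical helper: replace every invalid character (dots included)
    in an OpenTelemetry-style metric name with an underscore."""
    name = _INVALID_CHARS.sub("_", otel_name)
    # Classic Prometheus names may not begin with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name


if __name__ == "__main__":
    for otel_name in ("http.server.request.duration", "system.cpu.time"):
        print(otel_name, "->", to_prometheus_name(otel_name))
```

The mapping is lossy: `a.b` and `a_b` both end up as `a_b`, which hints at why the naming question is harder to settle than it first appears.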

Visualising oTel Metrics in OpenSearch

Whilst OpenSearch may have started life as a fork of Elasticsearch, it is actually a fully-featured observability platform with support for Metrics and Traces. OpenSearch has integrated with OpenTelemetry since its inception and the Data Prepper (which is the OpenSearch data processing pipeline) has full support for OTLP.

In this article on the OpenSearch blog, SAP Observability specialist Karsten Schnitter provides a really clear and comprehensive guide to ingesting and viewing metrics in OpenSearch. The article starts off with the basics of configuring your apps to export telemetry to the Data Prepper as well as looking at the anatomy of a typical Kubernetes metric and discussing some semantic gotchas.

OpenSearch has a highly pluggable architecture and the main guts of the article looks at using the Discover and TSVB plugins. Although Discover does display histograms, it is mostly a tabular view of data. The UI has a number of options for filtering your data, but you can also run more advanced searches using Dashboards Query Language (DQL). TSVB is the Time-Series Visual Builder and the article walks through creating a number of different types of visualisation for time series data.

If you are not familiar with the OpenSearch platform, this is a great tutorial for getting some hands-on knowledge.

On the lighter side…

Some of us may seek fame, others, meanwhile, appear to have fame thrust upon them. One person who probably falls into the latter category is Kevlin Henney. A number of years ago he started collecting pictures like the one above - i.e. computer screens in public places displaying fatal error messages. He then started using the images in talks and training courses.

Over time, word spread and eventually a crashed screen in a public place became known simply as a Kevlin Henney, sealing his place in posterity alongside other great eponyms such as the Fosbury flop, the Panenka and the Barnum Effect. Naturally, the recent CrowdStrike outage gave us a moment of peak Kevlin Henney and in this unassuming and thoughtful article, Kevlin reflects both on the story of his own celebrity status and on the fallible nature of the software that pervades our daily lives. This is one to read and share.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from Nobel Prize laureate David H Hubel:

“We need above all to know about changes; no one wants or needs to be reminded 16 hours a day that his shoes are on.”