Convergence, consolidation and collaboration

K8S Time Travel | Inside the C++ Black Box

John Hayes
March 19, 2025

Welcome to Edition #32 of the newsletter!

Convergence, consolidation and collaboration

It has been yet another momentous and intriguing month in the observability world. There are three big stories which, in another month, might be headline-grabbers in their own right. First up, on the theme of collaboration, was this post about a link-up between mobile observability specialists Embrace and full stack vendor Observe. There has been very little fanfare about this, but we think it represents a growing awareness amongst vendors of the importance of the mobile user experience.

Next, a remarkable convergence story with the release of eBPF for Windows. eBPF has seemingly swept away everything in its path over the past couple of years, and a flavour for Windows had been trailed for some time. Even so, seeing the release actually being rolled out is still a bit of a pinch-yourself moment.

Finally, a highly significant instance of market consolidation with the surprise announcement last week that open source observability stack HyperDX has been acquired by SQL juggernaut ClickHouse. Is this the birth of a new observability powerhouse?

The saying that change is the only constant certainly applies to the observability market, and it continues to be a dynamic and highly unpredictable space.

Feedback

We love to hear your feedback. Let us know how we are doing at:

[email protected]

BlueSky

NEWS

ClickHouse Snap Up HyperDX

ClickHouse have established themselves as probably the fastest and most scalable database in the world, but they are also an increasing presence in the observability space. They are the backend for a number of vendors and have even published an OpenTelemetry Exporter. As commentators such as Josh Lee have shown, ClickHouse is an observability storage engine in need of a front end. Well now, they have one in the shape of HyperDX. HyperDX is an open source stack that has built a strong following amongst developers - and of course, they use ClickHouse as their backend. This is an acquisition which makes perfect sense and represents the creation of a potent new force in the market.

Observe Ramp Up Again

Last year, Observe made waves with news of some major funding rounds. This year they are really ramping up their tech stack as they become the latest vendor to roll out Front End Observability.

In an online world, the user experience is just as important as backend server reliability, and poor web site performance can have a damaging effect on a company’s bottom line. Naturally, the new feature includes the obligatory Core Web Vitals but also includes features such as Geo Mapping, Transaction Drill Downs and correlation with backend traces, to provide an end to end view of user interactions. Observe have also teamed up with mobile observability specialist Embrace to include dedicated diagnostics and analytics for mobile users.

Sound the Klaxons! It’s eBPF for Windows

It’s an announcement that might have seemed unthinkable not long ago, but the porting of the revolutionary eBPF technology to Windows is now a reality. The ability to bring safe programmability to the kernel has resulted in enormous gains in fields such as security, networking and observability for Linux hosts, so applying the same principle to the Windows ecosystem is obviously an attractive proposition. It is not, though, without its own difficulties. There were a lot of hurdles to overcome and, inevitably, given the differences in OS architecture, this is not a full-fidelity replica of the Linux implementation.

This possibly foundational article by Pavel Yosifovich guides you through the steps involved in boldly going where few have gone before and creating your first eBPF program for Windows. One paragraph in the article begins with the sentence “this is where things get a bit hairy” - for some that will likely be a challenge rather than a deterrent. This may not be cooking up nuclear fusion in your bedroom, but it does feel pretty radical.

SquaredUp Throw Down The Dashboarding Gauntlet

Whether or not you believe that “dashboards are dead”, for most of us they are an essential observability tool. Given that they can communicate complex and important metrics simply and effectively, they are also ideal for sharing. Recently, a number of vendors have re-vamped their pricing plans and licensing agreements to make dashboard sharing more expensive and more difficult.

SquaredUp have made a dramatic play to buck this trend by including dashboard sharing on their Free Plan and unlimited dashboarding on their Starter Plan. The pricing model means that, on the Starter Plan, 100 dashboard views cost around $1.

The Disclosure Bit!
As I mention in the footer of this newsletter, I am a Product Marketing Manager at SquaredUp.

Products

StepChange - An AI-driven Monitoring Platform

Coinbase’s $65m Datadog bill is now the stuff of observability legend. Well, meet the man who had to pay it - Niall O’Higgins. Niall was a Senior Engineering Manager at Coinbase for over four and a half years. His experience with Coinbase convinced him that there was potential for a tool that could manage performance and reliability without the bill shock. Together with fellow Coinbase alumnus Harry Tormey, he formed StepChange and the company has now released APX - which they describe as “an AI-enabled monitoring platform”.

The product is not designed to replace your existing observability stack. Instead, it integrates with platforms such as Datadog and Sentry and adds value by providing a layer of performance, reliability and cost management analytics. The product was only released in January of this year, but the company already boasts a pretty impressive roster of hyperscale clients - including the likes of LinkedIn, Apple, Google and Meta.

Chronosphere Roll Out Their Telemetry Pipeline

Along with Front End Monitoring, telemetry pipelines are one of the big trends in observability at the moment. Over the past few years, we have seen the rise of specialist providers such as Mezmo, Edge Delta and Vector (acquired by Datadog), whilst vendors such as Honeycomb and Middleware have built out their own implementations.

Chronosphere have now announced their own Telemetry Pipeline implementation, and, as is the case with other vendors, their proposition centres on three main benefits:

reducing telemetry volumes
ease of connectivity and routing
telemetry enrichment.

In terms of market differentiation, the company’s technical brief claims that customers can reduce log data costs by 30% and slash infrastructure resource usage by 95%. This sounds highly impressive, but the brief does not cite any benchmarks to back this up.
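The three benefits listed above map onto three simple processing stages. The following is a minimal sketch of that idea in plain Python. It is not Chronosphere's implementation or API: the `LogRecord` type, the function names, the drop rule and the routing rules are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    # Hypothetical record shape, for illustration only.
    service: str
    level: str
    message: str
    attributes: dict = field(default_factory=dict)


def reduce_volume(records, drop_levels=("DEBUG",)):
    """Reduce telemetry volume by dropping records at low-value levels."""
    return [r for r in records if r.level not in drop_levels]


def enrich(records, metadata):
    """Enrich each record with extra context looked up by service name."""
    for r in records:
        r.attributes.update(metadata.get(r.service, {}))
    return records


def route(records, rules, default="archive"):
    """Route each record to a destination chosen by its level."""
    destinations = {}
    for r in records:
        dest = rules.get(r.level, default)
        destinations.setdefault(dest, []).append(r)
    return destinations


def run_pipeline(records, metadata, rules):
    """Chain the three stages: reduce, then enrich, then route."""
    kept = reduce_volume(records)
    return route(enrich(kept, metadata), rules)


if __name__ == "__main__":
    sample = [
        LogRecord("cart", "DEBUG", "cache lookup"),
        LogRecord("cart", "ERROR", "payment failed"),
        LogRecord("web", "INFO", "page served"),
    ]
    result = run_pipeline(
        sample,
        metadata={"cart": {"team": "checkout"}},
        rules={"ERROR": "alerting"},
    )
    for destination, items in result.items():
        print(destination, [r.message for r in items])
```

Dropping records before enriching them means that no work is spent on data that will be discarded anyway, which is where much of the cost saving in a real pipeline tends to come from.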

In Search of Time Travel for K8S Monitoring

In a recent LinkedIn post, Kubernetes expert Pieter van der Giessen speculated on a hypothetical Kubernetes Wayback Machine. His thesis was that much of our tooling is great for troubleshooting an ongoing error but diagnosing an incident that happened yesterday at 1pm is more problematic. This is largely down to the toil involved in reconstructing the state of the system at the time of the error. Time travel is probably a feature that would be on quite a few SRE wish lists - but does it exist in the wild?

It turns out that the answer is ‘yes’. According to Andreas Prins, a Product Marketing VP at SUSE, his company’s Observability product is able to “capture all time series and track the entire tree of dependencies and their state” on a minute-by-minute basis.

You may be familiar with Robusta - which we crowned as best K8S Troubleshooting tool in the 2024 Ollys. Their CEO, Natan Yellin, also stepped forward with a link to this Loom video showing off his own product’s time travelling smarts.
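To make the "time travel" idea concrete, here is a minimal sketch of one way to reconstruct past state: keep a log of timestamped change events and replay it up to the moment of interest. This is an illustration only and does not describe how SUSE Observability or Robusta work internally. The event tuple shape, the resource names and the `state_at` function are all hypothetical.

```python
from datetime import datetime


def state_at(events, when):
    """Replay change events up to `when` and return the last known state
    of each resource.

    Each event is a (timestamp, resource, state) tuple. A state of None
    means the resource was deleted at that time.
    """
    snapshot = {}
    for ts, resource, state in sorted(events, key=lambda e: e[0]):
        if ts > when:
            break  # everything after this point is in the "future"
        if state is None:
            snapshot.pop(resource, None)
        else:
            snapshot[resource] = state
    return snapshot


if __name__ == "__main__":
    def at(hour):
        # Hypothetical date, used only to build sample timestamps.
        return datetime(2025, 3, 18, hour, 0)

    history = [
        (at(9), "pod/cart", "Running"),
        (at(10), "pod/web", "Running"),
        (at(11), "pod/web", None),
        (at(12), "pod/cart", "CrashLoopBackOff"),
        (at(14), "pod/cart", "Running"),
    ]
    # What did the cluster look like yesterday at 1pm?
    print(state_at(history, at(13)))
```

Replaying the full log is simple but gets slower as the log grows. A system working at scale would more likely store periodic snapshots and replay only the events recorded since the nearest one.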

This has got us thinking. Is there a hidden treasure in an observability product you use that you think the world should know more about? Let us know and hopefully we can compile your suggestions into a feature for the Observability 360 web site:
[email protected]

BlueSky

AI

The State of AI for SREs

As we all know, there has been some extremely wild hyping of the potential impact of AI applications on SRE and observability job roles. Whilst some vendors have succeeded in fluently integrating productive and useful AI features into their applications, others have delivered the trough of disillusionment.

Embedding AI into an observability solution obviously involves a lot more than just firing off queries to an LLM. In this LinkedIn article, Andrew Mallaband undertakes an in-depth look at some of the underlying patterns, principles and architectures involved in building AI models for observability contexts. The article also covers key questions such as data quality and completeness and automated vs human-in-the-loop remediation. This is a detailed and well-researched survey of the current state of play.

Elastic Unveil OpenAI Observability

Elastic took an early lead in placing AI at the heart of their observability strategy, and it is an investment that appears to be paying off as they have recently posted impressive financial results. The company has now ratcheted up its LLM observability game with the release of dedicated tooling for OpenAI observability.

Whilst there are a number of solutions for generic LLM observability, this is the first we know of that is tailored to a specific LLM. The LLM space is evolving rapidly, with each vendor offering a range of different models, services and price plans. It is therefore perhaps not surprising that vendor-specific observability solutions should arrive on the market.

So, what’s in the box? According to this article on the Elastic blog, the tooling breaks down analytics into separate data streams covering different OpenAI services such as completions, audio, vector stores etc. The tooling also ships with an out-of-the-box dashboard for key metrics such as invocation rates, token usage and model performance.

OpenTelemetry

The oTel Demo Gets An Upgrade

The OpenTelemetry demo application is a fantastic resource. Not only is it a great showcase for demonstrating OpenTelemetry in practice, it is also a very serviceable go-to for any developer looking for a sample microservices implementation. It also boasts tremendous ease of use - you can get up and running by deploying a Helm chart or just running a Docker command. It is not surprising that the Demo containers have been pulled over 12 million times.

Version 2.0 of the app has now been released and there are some notable updates. Although the web UI will look very familiar, there are some significant changes under the hood. Updates include the introduction of Flagd-ui for feature flag management, Redis being dropped in favour of Valkey for caching and the implementation of exemplars in the Cart service.

The OpenTelemetry Operator - A Deep Dive

If you ever wanted to get a clearer understanding of the OpenTelemetry Operator for Kubernetes, then look no further. This article by Kasper Borg Nissen on the Dash0 blog covers the subject with remarkable depth and clarity.

As well as discussing high level concepts, the article also includes step-by-step instructions and YAML scripts for creating Instrumentation resources. There is expert advice on the different modes for deploying the OpenTelemetry Collector as well as more advanced topics such as the OpenTelemetry Target Allocator - and whether it can take the place of Prometheus for metrics collection. Naturally, the article wraps up with a tie-in to the Dash0 product, but it mostly consists of technical content of high quality and great practical value.

Looking Inside The C++ Black Box

As well as rolling out their OpenAI observability solution, Elastic have also been very active within the OpenTelemetry project. C++ has a reputation for being something of a fearsome foe for observability practitioners. In this article on the Elastic blog, Haidar Braimaanie dons his protective gear and attempts to tame the beast with a soothing dose of OpenTelemetry instrumentation.

Unlike languages that run on frameworks such as .NET, C++ does not have a standardized runtime environment that supports dynamic instrumentation across all platforms and compilers. C++ also uses a variety of build systems such as Makefiles and CMake, so implementing instrumentation can be difficult and error-prone. In the article, Haidar looks at adding OpenTelemetry support to a C++ application running on Ubuntu 22.04. He also includes sample code for instrumenting the project with database spans and then observing the application in APM.

After reading this article you may want to give the C++ developer in your life a hug.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This month’s quote is from Alan Turing:

“Machines take me by surprise with great frequency.”

About Observability 360

Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.

The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.