Observability 360

A Turing Test for AI SREs

The Ollys - The Winners Revealed | The Efficiency Paradox

John Hayes
February 02, 2026

Welcome to Edition #41 of the newsletter!

One of the great joys of the observability space is that it is not an oligarchy. Yes, there are big players but they neither control the market nor the narratives.

Innovation in the space does not come with a billion-dollar cost of entry. In this edition we feature Clarvynn, a product built by a single developer, but which represents a whole new approach to instrumentation. It’s a great example of the open source ecosystem producing creative solutions to challenges in observability engineering.

As usual, it has been a pretty breathless fortnight in the observability world - so let’s dive in!

Feedback

We love to hear your feedback. Let us know how we are doing at:

[email protected]

BlueSky

NEWS

ClickHouse Ramp Up Again - $400m funding round, Langfuse acquisition, Managed Postgres Rollout

ClickHouse have got off to a stunning start in 2026 with a trifecta of major announcements. First was the news of a massive $400m funding injection. The financing will enable the company to drive forward with its strategy of building a unified platform spanning analytics, data warehousing, observability and AI.

The acquisition of Langfuse underscores the rapid emergence of LLMs as a first-class concern for observability systems. LLM observability is fast becoming a must-have for full-stack vendors. Combined with last year’s acquisition of LibreChat, it gives ClickHouse a formidable AI armoury.

But that’s not all. There was also the small matter of a new managed ClickHouse Postgres platform, underpinned by a new architecture designed to minimise IO bottlenecks. This is an announcement that covers quite a few bases and spells out a very bold ambition spanning multiple data and analytics verticals. It almost certainly won’t be the last such announcement from ClickHouse in 2026.

Lightrun Make a Bet on Context as King

Lightrun have always had a unique take on observability architecture, and, with the release of Runtime Context, they are pushing the envelope even further.

Runtime Context is Lightrun’s tilt at redefining the temporal viewpoint of observability - instead of a post-hoc inspection of outputs, runtime context is a real-time view into the live system - a kind of keyhole diagnosis for your applications.

This, they argue, is a perspective that breaks with the loop of checking logs, fixing and then re-checking outputs. This raises an obvious question - is this really observability or is it debugging on steroids? It certainly blurs the boundaries between the two domains.

The Ollys - Award Winners Revealed!

The Ollys are Observability 360’s annual awards for the Observability industry. They are our own idiosyncratic take on the best innovations, features, blogs and talks in the observability space over the past year.

This year saw the introduction of a number of new categories and there was a whole slew of new winners. The Ollys are not about crowning a “best” overall product. Instead, they highlight great implementations in particular domains. This year’s awards, therefore, included categories such as:

RUM
LLM Observability
Cost Management
Pipelining

There were also non-functional categories such as Best New Product and Best Open Source Project. If you want to know who scooped the awards, just hit the button below.

Products

Clarvynn - tackling telemetry bloat at source

The last few years have seen rapid development in technologies for telemetry pipelining. By capturing traffic in the cloud or at the edge, these pipelines can reduce costs as well as apply filtering and transformation before telemetry reaches vendor backends.

For Dheeraj Vanamala though, this still leaves users paying a processing tax at source - creating bloat and wasting CPU cycles by generating largely redundant telemetry. He has created Clarvynn, an open source tool which sits inside the application runtime and “curates signals before they are marshalled to protobuf/JSON”.

One drawback of telemetry pipelines is that they often involve SREs writing rules against services they have not written. Clarvynn puts devs back in control. With Clarvynn, devs do not need to explicitly instrument their code. Instead, they use a YAML configuration file to define policies which are then applied at runtime by Clarvynn.

This is a really interesting solution and as far as we know it is the only one that attempts to reduce bloat by applying sampling directly at source.
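
To make the idea of curating signals before serialisation more concrete, here is a minimal sketch in Python. It is not Clarvynn's actual API or configuration schema: the POLICY dictionary stands in for whatever a YAML policy file might express, and the names should_export and export_batch, as well as the policy keys, are hypothetical and chosen purely for illustration.

```python
import json
import random

# Hypothetical policy, standing in for what a YAML policy file might express.
# These keys are illustrative assumptions, not Clarvynn's real schema.
POLICY = {
    "always_keep_errors": True,    # never drop failed requests
    "slow_threshold_ms": 500,      # always keep spans at least this slow
    "baseline_sample_rate": 0.1,   # fraction of ordinary spans to keep
}


def should_export(span, policy, rng=random.random):
    """Decide whether a span is worth keeping, before any encoding happens."""
    if policy["always_keep_errors"] and span.get("status") == "ERROR":
        return True
    if span.get("duration_ms", 0) >= policy["slow_threshold_ms"]:
        return True
    # Everything else is sampled at the baseline rate.
    return rng() < policy["baseline_sample_rate"]


def export_batch(spans, policy, rng=random.random):
    """Serialise only the spans that pass the policy check."""
    kept = [span for span in spans if should_export(span, policy, rng)]
    # The serialisation cost is paid only for spans that survived the policy.
    return [json.dumps(span) for span in kept]


if __name__ == "__main__":
    spans = [
        {"name": "GET /health", "status": "OK", "duration_ms": 3},
        {"name": "POST /checkout", "status": "ERROR", "duration_ms": 120},
        {"name": "GET /search", "status": "OK", "duration_ms": 900},
    ]
    for line in export_batch(spans, POLICY):
        print(line)
```

The point of the sketch is the ordering: the keep-or-drop decision runs on the in-memory span, so spans that are dropped are never encoded or sent at all.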

From the Blogosphere

Observability - Isn’t Just About Observing

Observability is not just about gaining an understanding of our systems for the pure purpose of accumulating knowledge. There are probably less expensive ways of achieving technical wisdom. The “doing” bit is the subject of this article by Solution Architecture Manager Cristiano Messina.

His premise is that modern observability systems give us all the visibility that we need. That is a problem which has been solved. We are now at the point where we know what is happening. The next question is “what decision do we need to make?”

There must be something in the observability air, because this also chimes with the position that Andrew Mallaband has recently arrived at, where he sees decision support as increasingly becoming a function of observability.

Nobody has a crystal ball into the future of observability (or any other domain for that matter) but both the article by Cristiano and this article by Andrew are valuable aids in mapping out the possibilities.

Less Pain, More Gain - The Google Approach to SRE

Engineers at Google not only face the challenges of managing infrastructure at hyperscale. They also have the pressure of meeting punishing KPIs for performance and availability. When an incident occurs, the relevant team needs to acknowledge the fault within five minutes and is then under “extreme pressure” to reduce “Bad Customer Minutes”.

This means that the initial focus of the team is on mitigation rather than root cause analysis. The article is really illuminating for its description of how teams have tightly integrated Gemini into their incident response playbook. This contains a complete set of tools for helping to manage every part of the incident lifecycle.

Resilience and the Efficiency Paradox

1177 B.C. by Eric H. Cline is a fascinating historical account of the collapse of civilisation in the Late Bronze Age. What is remarkable in this story is that an accelerator of the downfall of this world was its very sophistication. It was a civilisation built on a network of interdependent states and economies. When one of those pillars fell, it unleashed a domino effect that brought down the entire edifice.

This article by Uwe Friedrichsen is a reflection on the nature of resilience in modern society, but there are striking parallels with Cline’s narrative. Uwe considers the apparently paradoxical situation where greater “efficiency” results in reduced resilience. I guess this is another of those iron triangles where something has to give. Greater efficiency means less redundancy, but less redundancy means less resilience.

OpenTelemetry

What 10,000 Slack Messages Reveal About the State of OpenTelemetry

OpenTelemetry is a vast project with a large community of developers and end users. What, though, are the concerns and pain points that people are experiencing as adoption grows? OllyGarden founder Juraci Paixão Kröhling decided to take a data-driven approach to answering this question - analysing the contents of nearly 10,000 messages in the #opentelemetry and #otel-collector Slack channels.

This article on the OllyGarden blog breaks down the analysis into two separate parts. Firstly, the most discussed Collector Components and then the Top 10 Pain Points. If you have ever spent an afternoon wondering why you can’t see your telemetry flowing through the Collector, you won’t be surprised to hear that Connection and Export Failures register the highest “frustration score”.

What if all your observability data was just Parquet files?

That is the subheading of a blog article by Clay Smith. As opening gambits go, it is pretty catchy and, we have to fess up, we were hooked. It’s kind of like the observability version of nuclear fusion in your back shed - who can resist?

The article is a pretty impressive walkthrough of a recipe for building your own OpenTelemetry lakehouse on a stack including Parquet, DuckDB and Iceberg.

Ultimately, this is more of a thought experiment than a manual for building your own competitor to Datadog, but it makes for an exhilarating tour of a number of key technologies. A case in point is DuckDB - a technology enjoying great popularity for its ability to apply SQL syntax over non-tabular data sources.

This is just a really dazzling trip through the landscape of a possible future Planet Observability.

OTLP - the protocol that does all the heavy lifting

There is a huge body of literature on OpenTelemetry and much of it covers topics such as architecture, instrumentation and OTel Collector configuration. Generally, not much thought is given to OTLP, the transport layer that actually moves our telemetry around. It is kind of taken for granted as part of the OTel furniture. It is, however, the engine room of observability architecture, responsible for ensuring that every telemetry message is delivered, every time.

If you have ever wondered what goes on under the hood when you are setting your Exporter protocol, this article on the SigNoz blog will hopefully put you in the picture.
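
As a rough illustration of what setting the Exporter protocol changes, the Python sketch below resolves the URL an exporter would send to, based on the standard OTEL_EXPORTER_OTLP_* environment variables. It is a simplified reading of the OpenTelemetry exporter configuration rules, not the code of any real SDK: the function name resolve_otlp_target is hypothetical, per-signal protocol overrides are ignored, and individual SDKs may choose a different default protocol.

```python
# Conventional OTLP defaults: port 4317 for gRPC, port 4318 for HTTP.
DEFAULT_ENDPOINTS = {
    "grpc": "http://localhost:4317",
    "http/protobuf": "http://localhost:4318",
    "http/json": "http://localhost:4318",
}


def resolve_otlp_target(env, signal="traces"):
    """Return (protocol, url) for a signal, given a dict of environment variables.

    Hypothetical helper for illustration only; real SDKs handle more cases.
    """
    # Assumption: fall back to http/protobuf when no protocol is configured.
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")
    if protocol not in DEFAULT_ENDPOINTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol}")

    # A signal-specific endpoint takes precedence and is used exactly as given.
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return protocol, specific

    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINTS[protocol])
    if protocol == "grpc":
        # gRPC routes by service name, so no URL path is appended.
        return protocol, base

    # For OTLP over HTTP, the per-signal path is appended to the base endpoint.
    return protocol, base.rstrip("/") + f"/v1/{signal}"


if __name__ == "__main__":
    print(resolve_otlp_target({}))
    print(resolve_otlp_target({"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"}))
```

Passing os.environ instead of a literal dictionary would apply the same rules to the variables set in your own shell.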

AI

A Turing Test For Your AI SRE?

The AI gold rush is in full swing, and automating the SRE role is a goal that many startups have in their sights. There are already numerous tools on the market that claim to be able to perform triage, RCA and even remediation.

Quesma, however, decided to raise the bar. Identifying errors and launching playbooks is now a well understood domain. How would LLMs cope with the more subtle and nuanced task of actually instrumenting code with OpenTelemetry?

The Quesma team evaluated how successfully 14 different LLMs were able to apply OpenTelemetry instrumentation across 11 different languages and frameworks. Have we reached the singularity yet? Hit the button below to find out.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from Alan Turing:

“The original question, 'Can machines think?' I believe to be too meaningless to deserve discussion.”

About Observability 360

Hi! I’m John Hayes - as well as publishing the Observability 360 newsletter, I am also an Observability Advocate at SquaredUp.

The Observability 360 newsletter is an entirely independent entity. All opinions expressed in the newsletter are my own.