Observability 360

ClickStack - The SQL Has Landed

The End of Bad Instrumentation? | Being Operationally Intelligent

John Hayes
June 30, 2025

Welcome to Edition #35 of the newsletter!

Good News for Humans!

If you are an observability engineer or an SRE, you probably knew this already, but just to reiterate: humans are still very much in the loop.

AI is making rapid and incredible advances, but it is better at consuming than producing. In this month’s newsletter we cover the Instrumentation Score - which underlines the value of high-quality instrumentation. We explore the theme further with an article on Intelligent Observability from the Observability 360 web site. To drive the point home, we also look at Ciroos - an AI tool which positions itself very much as a ‘teammate’ rather than a replacement for SREs.

Feedback

We love to hear your feedback. Let us know how we are doing at:

[email protected]

BlueSky

NEWS

ClickStack - The SQL Has Landed!

For the past few years, ClickHouse have been promoting their vision for SQL-based observability. Now, with the release of ClickStack, they are putting the theory into practice. ClickStack, as the name suggests, is not a monolithic platform. Instead, it is a loosely coupled three-layer stack consisting of an OpenTelemetry ingestion gateway, a ClickHouse backend and a HyperDX front-end.
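
To make the idea of SQL-based observability concrete, here is a minimal sketch of what querying telemetry with plain SQL can look like. It uses Python's built-in sqlite3 module purely as a stand-in for a real backend, and the `logs` table, its columns and the sample rows are all hypothetical - they are not the ClickStack or OpenTelemetry schema.

```python
import sqlite3

# Hypothetical, simplified log table held in an in-memory SQLite database.
# This is a stand-in for illustration only, not the ClickStack schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT, service TEXT, severity TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?, ?)",
    [
        ("2025-06-30T10:00:00Z", "checkout", "ERROR", "payment declined"),
        ("2025-06-30T10:00:05Z", "checkout", "INFO", "order placed"),
        ("2025-06-30T10:00:09Z", "cart", "ERROR", "item not found"),
        ("2025-06-30T10:00:12Z", "checkout", "ERROR", "payment timeout"),
        ("2025-06-30T10:00:20Z", "search", "INFO", "query served"),
    ],
)


def error_counts(connection):
    """Return (service, error_count) rows, noisiest service first."""
    query = """
        SELECT service, COUNT(*) AS errors
        FROM logs
        WHERE severity = 'ERROR'
        GROUP BY service
        ORDER BY errors DESC, service
    """
    return connection.execute(query).fetchall()


print(error_counts(conn))
```

The point of the sketch is simply that a question such as "which services are logging the most errors?" becomes an ordinary GROUP BY query rather than something that needs a vendor-specific query language.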

ClickHouse as a company already have an A-List of customers for the database product and have also recently announced a $350m cash injection. It is difficult to see them having any objective other than being a major player in the observability market.

This article on the ClickHouse blog introduces the product and outlines its salient features, whilst this article on the Observability 360 web site looks at the wider implications for the observability market.

Instrumentation Score - a Benchmark for Observability

In our last edition we trailed the work that Juraci Paixão Kröhling and his colleagues were carrying out at OllyGarden.

Now the wait is over, and the team have unveiled the fruit of their labours - the Instrumentation Score - a project which, we think, represents a huge step forward for SREs and observability engineers.

We were lucky enough to get a preview of the tool and this was our reaction:

We believe that it is one of the most exciting and valuable projects in the observability space right now. Doing observability and doing instrumentation properly is hard. The Instrumentation Score is not just about applying a benchmark to the quality of your telemetry. It is also a mentor giving you guidance on optimising your code and configuration.

For us, the clincher is that the OllyGarden Team are totally committed to running this project on open source principles and making sure that everybody has access to the score and its recommendations for doing good telemetry.

The project is supported by the likes of Grafana, Dash0, Datadog and many others. If you are interested in levelling up your observability skills just click on the link below and start sowing the seeds for great telemetry!

Embrace Extend their Platform to the Web

Over the past few years Embrace have established themselves as pioneers in the field of mobile observability. Their platform has been built on the principle that providing a smooth mobile experience is a critical business priority that is overlooked in many observability systems.

Perhaps not surprisingly, the company received feedback from their customers that they would like to offer their web users the same quality of service that Embrace offered to mobile users. The upshot of this is that Embrace have now rolled out a new RUM (Real User Monitoring) product aimed at web users.

Whilst this is in a sense a major strategic shift, it makes technical and economic sense. The company has built up hard-won expertise in crafting tooling that captures the nuances of the mobile experience, and it is a natural step to apply those lessons to understanding and improving the experience of web front-end users.

After Victoria Metrics and Logs - are Traces Next?

VictoriaMetrics have built up a huge following of users with their dedicated metrics platform. Last year, they extended the platform to include logging capabilities and, in a recent blog post, they teased the possibility that they may also support traces - and join the ranks of “full-stack” vendors.

The post is not your typical marketing blurb. Instead, it takes the form of a kind of internal technical dialogue that explores different architectural options for implementing traces. The result of this is a fascinating examination of the anatomy of a trace. If, conceptually, traces can be decomposed into log-like elements, then what are the possibilities for leveraging VM’s existing logging technology to support traces?
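
The idea of a trace as a set of log-like elements can be sketched in a few lines of Python. This is an illustration of the concept only, not VictoriaMetrics' design: the record fields (`trace_id`, `span_id`, `parent_id`, `name`, `duration_ms`) and the `build_trace` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical log-like records: one flat entry per span, as they might
# sit in a log store. Field names are assumptions made for this sketch.
records = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /cart", "duration_ms": 120},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "auth", "duration_ms": 15},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "db.query", "duration_ms": 80},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "name": "GET /health", "duration_ms": 2},
]


def build_trace(records, trace_id):
    """Rebuild one trace as a nested dict from flat span records."""
    children = defaultdict(list)
    root = None
    for rec in records:
        if rec["trace_id"] != trace_id:
            continue  # "query" step: keep only the records for this trace
        if rec["parent_id"] is None:
            root = rec
        else:
            children[rec["parent_id"]].append(rec)

    def attach(rec):
        # Recursively hang each span's children beneath it.
        return {
            "name": rec["name"],
            "duration_ms": rec["duration_ms"],
            "children": [attach(child) for child in children[rec["span_id"]]],
        }

    return attach(root) if root is not None else None


tree = build_trace(records, "t1")
print(tree["name"], [child["name"] for child in tree["children"]])
```

Each span is stored as an independent flat record, and the tree structure is only reassembled at read time from the parent identifiers. That is what makes it plausible, at least conceptually, to reuse a log store for traces.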

As well as the conceptual breakdown, the article looks at the practical logistics of backend storage (yes, ClickHouse does get a mention) and the options for visualization.

This is a great article and a fascinating insight into the engineering thought process.

Products

Gigapipe - a ‘polyglot observability’ Platform

With a name like Gigapipe you might be forgiven for expecting that this was going to be about Yet Another Telemetry Pipeline. In fact, it describes itself as a “polyglot observability platform” and its USP is that it ingests logs, metrics, traces and profiles into a single, unified datastore.

The platform can natively ingest a multitude of protocols including OpenTelemetry, Prometheus, Tempo, Datadog etc. and claims to have agents for thousands of telemetry sources. The backend itself is a mixture of ClickHouse and DuckDB.

The platform does not provide its own visualization layer - instead you need to plug in a tool such as Grafana or Perses.

The software is available as open source, but it also has an unusual pricing model for its paid plans. Rather than charging fees for ingestion, the plans are based on the cloud resource usage, with tiers offering differing levels of RAM, CPU, data transfer etc.

Ciroos - an ‘AI teammate’ for SREs

As the AI goldrush heats up, many vendors are at pains to reassure us that their products are here to help humans, not replace them. This is the messaging for Ciroos - an AI that is described as a ‘teammate for SREs’.

The premise of Ciroos - as with similar tools - is that it will relieve SREs of toil and alert fatigue, allowing them to get on with the important stuff.

Also like other tools such as Resolve AI, Ciroos is headless - it integrates with collaboration channels such as Slack and responds to alerts or user prompts. It will then investigate anomalies and provide solutions and remediations.

One interesting differentiator is that the company recognise that many teams may already be using AI agents, and Ciroos is designed to interact with these agents rather than displace them. The software is not yet GA but you can arrange a demo to see it in action.

AI

24 Principles for your AI Strategy

The AI revolution poses a massive strategic challenge for every organisation.

The pace of change in the field can be overwhelming - with existing technologies evolving at breakneck speed and new products entering the market at a seemingly exponential rate. The feverish level of activity makes it difficult to identify fundamentals and build out a coherent enterprise strategy.

In this highly accessible article, Bijit Ghosh distils 24 fundamental principles derived from his experience in building agentic AI platforms. Some of the key messages are:

the future is about agents, not monoliths
interoperability and orchestration will be key to successful implementations
AgentOps will become the new DevOps

Whether you are a DevOps Engineer, Engineering Manager or SRE this article contains a host of bite-sized takeaways to inform your organisation’s discussions on building an AI strategy.

Natural Language Querying for your Telemetry

Natural Language Querying is one of the most exciting applications for AI in observability practice. It is a tremendously democratising capability, enabling users at all levels to explore their telemetry without the learning curve of languages such as PromQL.

In this Medium article, Adriana Villela shows how you can use the Dynatrace MCP Server to enable natural language querying of OpenTelemetry data. The article uses VS Code as its MCP client, but the setup instructions can easily be applied to other clients such as Claude, Windsurf and Cursor.

Once the simple setup process is complete the results are pretty jaw-dropping. For example, you can retrieve a full list of services in your environment just by asking:

“How many unique services are running in Dynatrace”

Zero knowledge of any query language is required. This is obviously a very simple example, but the article goes on to look at more advanced queries, as well as covering some potential gotchas you might encounter. This is a really smooth MCP integration and you would expect other vendors to follow suit.

From the Blogosphere

Intelligent Observability is here - and humans are still in the loop.

There has been feverish (and vendor-inspired) speculation about SREs and observability engineers being replaced by AI bots and agents. As this article on the Observability 360 web site points out though, humans are still very much in the driving seat.

The reason for this is that whilst machines are great at ingesting and analysing telemetry they are far better at consuming than they are at producing.

There is a growing backlash against the “ingest everything” model and a move towards helping observability engineers to reduce telemetry volumes and instrument applications with greater skill and precision. This article looks at how the growth of the “wide events” model and the emergence of the Instrumentation Score return human skill and judgement to the centre of the observability stage.

Acting On Impulse - How Airbnb Do Load Testing

Load testing can be simple in theory, but in modern distributed architectures it involves a lot more than throwing requests at an individual service.

This article on the Airbnb engineering blog looks at how the company’s engineers use the Impulse load-testing framework to handle a number of more complex requirements such as:

dependency mocking
managing messaging and asynchronous calls
collecting upstream and downstream traffic

Unfortunately, Impulse is currently just an internal Airbnb framework, so you won’t be able to get your hands on it. Even so, the article provides a valuable blueprint for tackling advanced, real-world load testing scenarios.

OpenTelemetry

Keeping Your oTel Collector Secure

If you are using the Ingress API to secure your oTel Collector, it may be time to level up.

Deploying an oTel Collector to a Kubernetes cluster is a common use case. However, if you need to open your Collector to clients outside of the cluster you are creating a potential attack vector - so it is critical that your security configuration is spot on.

This article on the OpenTelemetry blog shows how you can use the Kubernetes Gateway API and mutual TLS (mTLS) to keep the bad guys out.

This is a hands-on and relatively technical article, and you will need familiarity with Helm, SSL certificates and Kubernetes configuration to apply this solution. If you are involved in deploying oTel Collectors to Kubernetes, it should definitely be on your reading list.

oTel SDKs - Measuring the Overhead

Over the past year or so, the debate around OpenTelemetry has shifted from “should I use it?” to “how should I use it?”. Naturally though, like any other tool or framework, OpenTelemetry is not perfect and using it implies trade-offs.

One concern that is frequently expressed is the performance overhead involved in using oTel SDKs. For most of us this may not be an issue, but it can be significant when working at large scale.

This article on the Coroot blog attempts to bring some empirical data to the discussion with some benchmarks for using the oTel SDK in a simple Go application. The numbers themselves are interesting, but the method used is also a useful template if you are interested in doing your own benchmarking.
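
As a rough illustration of the general approach of comparing a baseline run against an instrumented run, here is a minimal Python sketch. It is not the method from the Coroot article (which benchmarks a Go application) and it does not use the oTel SDK: `work`, `instrumented_work`, `best_time` and `overhead_percent` are hypothetical helpers, and the "instrumentation" is just some span-like bookkeeping standing in for a real SDK.

```python
import time


def work(n=2_000):
    """Stand-in for application logic."""
    return sum(i * i for i in range(n))


def instrumented_work(n=2_000, sink=None):
    """Same logic, wrapped with hypothetical span-like bookkeeping."""
    start = time.perf_counter()
    result = work(n)
    if sink is not None:
        sink.append({"name": "work", "duration_s": time.perf_counter() - start})
    return result


def best_time(fn, repeats=5, calls=200):
    """Best-of-N wall time for `calls` invocations, to damp scheduling noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(calls):
            fn()
        best = min(best, time.perf_counter() - start)
    return best


def overhead_percent(baseline_s, instrumented_s):
    """Relative overhead of the instrumented run, as a percentage."""
    return (instrumented_s - baseline_s) / baseline_s * 100.0


spans = []
baseline = best_time(work)
instrumented = best_time(lambda: instrumented_work(sink=spans))
print(f"overhead: {overhead_percent(baseline, instrumented):.1f}%")
```

The printed figure will vary from machine to machine and run to run, so treat it as a template rather than a result. A meaningful benchmark would swap in a real SDK and a realistic workload, and would also track CPU and memory rather than wall time alone.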

Podcasts

Operationally Intelligent - Ramping Up the RoI

Operational Intelligence is a term that is getting an increasing amount of traction. The essence of the concept is that it examines how we can leverage telemetry of all kinds - i.e. not just the classic quartet of Metrics, Events, Logs, Traces (MELT) but also a whole plethora of other signals - to help businesses improve processes in real time and achieve genuine RoI on their observability expenditure.

There are few people in the field who know more about the subject than Michael Hausenblas - who is not only a product lead at AWS but also an OpenTelemetry contributor and author of the book Cloud Observability in Action.

In this inaugural edition of the Operationally Intelligent podcast, Michael talks to Adam Kinniburgh of SquaredUp and shares insights on leveraging observability tooling to bring about real time improvements in performance and reliability.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This month’s quote is from Isaac Asimov

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny...’”

About Observability 360

Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.

The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.