The Long Road to O11y Maturity

Datadog's $170m bite | PromTel's Difficult Courtship

Welcome to Edition #36 of the newsletter!

In this month’s edition we cover the ManageEngine State of Observability survey. One of the questions it considers is observability maturity. We think that this is now a really salient issue. Observability is not a “fire and forget” deal. Like DevOps, it is not just a package that you buy in and then roll out across your business like a Windows Update.

As the metaphor of the OllyGarden implies, it is a domain that requires ongoing cultivation. Equally, as the OpenTelemetry O11y By Design philosophy argues, observability needs to be embedded within a wider strategy which defines standards and practices.

We do not go along with the critiques that observability has failed or is failing. Instead, we believe that many companies are still at a relatively early stage of maturity and, as they develop clearer strategies and build skill sets, more and more tangible benefits will be realised.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

OllyGarden Officially Emerges From Stealth

If you have been following the last few editions of the newsletter, you will be aware that we think that OllyGarden is one of the most important projects in the observability space. Last month, the OllyGarden team announced their Instrumentation Score and this month the OllyGarden product itself is out of stealth.

So, what is the difference between the Instrumentation Score and the OllyGarden product? Well, the score is an open source framework for evaluating the quality of your telemetry. It consists of a set of rules as well as guidelines for applying them. OllyGarden is a SaaS platform which will apply the framework to evaluate the quality of your telemetry.

The Instrumentation Score has very quickly attracted support from some of the biggest names in the observability space. Likewise, the OllyGarden product has financial backing not just from the usual investment vehicles but also from observability leaders such as Datadog Ventures, Grafana Labs and Dash0. For us, this is a safe bet as we think that OllyGarden will lead the way in defining best practice for oTel instrumentation.

ManageEngine State of Observability 2025 Survey

As you know, we have a bit of a weakness for State of Observability surveys. This is not just because we love lingering obsessively over stats - it's also because they are an empirical source of data on how observability professionals feel about the tooling they use and the challenges they face. ManageEngine, who are part of the Zoho group, have recently released their State of Observability 2025 report. They surveyed over 1,000 companies and the results give plenty of food for thought.

Interestingly, whilst many single-pane-of-glass observability vendors bemoan "tool sprawl" in their reports, ManageEngine actually find that organisations using 10 or more tools achieve greater MTTR reduction and higher productivity gains.

It is also really interesting to see that, for many companies, the return on investment in observability tooling does not derive purely from APM. It seems that improved system security and operational efficiency gains also emerge as important benefits.

Datadog face loss of $170m OpenAI account

OpenAI must be one of the most lucrative observability accounts on the planet, and the news that they are looking to build an in-house solution based on ClickHouse means that their current provider, Datadog, could be left with a $170m hole to fill. Investment brokers Guggenheim reacted by actually marking down the status of Datadog shares to Sell. Such is Datadog’s financial strength, however, that they are expected to weather the blow and continue to increase earnings.

This article on the ClickHouse blog sheds more light on some of the rationale behind OpenAI selecting ClickHouse for their solution. Obviously, performance and scalability were two of the major factors. There were, however, other important considerations such as the flexibility of ClickHouse indexes, allowing for optimisations for read operations. Equally, the open source nature of the product means that OpenAI engineers can easily tweak the codebase to quickly resolve bottlenecks.

HolmesGPT submitted to CNCF

Robusta Dev is one of our favourite products for Kubernetes monitoring and we were delighted to learn that they have now submitted their HolmesGPT AI engine for acceptance as a CNCF sandbox project.

Robusta are a relatively small company, but they have two key advantages - a deep understanding of Kubernetes and a close affinity with the needs of Kubernetes admins. This is the foundation on which they have built a powerful and intuitive product that holds its own in a highly competitive field. It is a testimony to the quality of the product that their submission is being supported by Microsoft.

Bringing HolmesGPT under the auspices of the CNCF will be a major win for the community. Hit the button below to view the submission - and maybe even give it your support!

Gartner Unveil 2025 Magic Quadrant

For an observability vendor, there is probably no greater accolade than being awarded a coveted Observability 360 Olly. A placing in the Gartner Magic Quadrant is not a bad consolation though, and the companies that have made the cut for this year’s grid have now been named.

In many ways, there is not much change from last year's rankings. The big six of New Relic, Datadog, Dynatrace, Splunk, Elastic and Grafana all resume their places in the Leaders box. For reasons best known to the Gartner ranking algorithm, ServiceNow and Logz.io have both dropped out of the quadrant altogether, whilst Apica, Coralogix and AIOps specialist ScienceLogic are new entrants into the Visionary sector.

Whilst the companies named in the quadrant may get all the bragging rights, there are also some notable logos in this year’s Honourable Mentions. Despite only officially launching last October, Dash0 get a mention for the enormous momentum they have built up. Equally, Kloudfuse, who have been busy building out a highly capable platform, also achieve recognition. If you want to find out more about Kloudfuse, then this in-depth article by Andrew Mallaband will give the full picture.

Dynatrace Unleash ‘Third Generation’ Platform

Dynatrace have tended to position themselves as being at the cutting edge of observability, with technologies such as their Davis AI Engine and their Grail data lakehouse. This is certainly a theme which is prevalent in the messaging for their latest major platform release. The company are referring to it as the third generation of the product.

According to the release blog, Dynatrace 3.0 is built on the three fundamental pillars of Knowledge, Reasoning and Actioning. In concrete terms, this seems to translate into building AI capabilities right across the product to enable faster root cause analysis, deeper automation and incident prevention. This is delivered through a number of new features such as AutomationEngine, AppEngine and OpenFeature.

We covered the Dynatrace MCP server in our last edition and we think it is indicative of how the AI revolution will play out in observability. Rather than being a big bang, it will be the accumulation of a number of incremental leaps.

Products

Sevii Emerge From Stealth With Pioneering Cyber-defence Platform

We first came across Stephen Collins a couple of years ago when he started up Logsail - a company harnessing AI to dynamically automate application logging. Since then, the company have pivoted to the security domain. They have now rebranded as Sevii and have just launched what they describe as the industry’s first Autonomous Defense and Remediation (ADR) platform.

The company recently completed a pre-seed funding round and is led by Curt Aubley - a former Intel CTO whose CV includes managing cybersecurity at companies such as CrowdStrike and Lockheed Martin.

A primary objective of the product is to reduce the time cycles for threat detection. There is often a significant time lag between CrowdStrike detecting an intrusion and a security team being able to assemble, triage and remediate the issue. The aim of Sevii is to overcome this lag by hooking into CrowdStrike APIs and unleashing its Cyber Warriors to instantly diagnose and, optionally, remediate any detected breaches.

If you ever wanted to see what the future of cyber attack remediation looks like, well maybe now you can.

Zymtrace - bringing tracing to the GPU

In Edition 34 of the newsletter, we covered Neurox - a startup specializing in GPU observability - a vitally important area given the exponential growth in AI workloads. Zymtrace is a new product which also offers GPU observability - albeit from a different perspective. Whereas Neurox has more of a focus on efficient utilisation, Zymtrace provides analytics at the code level. The team behind the product have already developed an eBPF-based profiler for CPUs and they have now applied their expertise to providing telemetry for code running in GPUs.

One of the major selling points of the product is that, by monitoring the full code journey across CPU and GPU, they can gain a deeper understanding of both bottlenecks and underutilisation. As well as highlighting cost savings, the product also measures environmental impact by providing metrics on CO2 impact and, given the scale of GPU compute, the numbers involved can be very substantial.

AI/LLMs

Observe Inc Roll Out LLM Observability

LLM observability is increasingly becoming a must-have capability for full stack systems and Observe Inc are the latest vendor to roll out the feature in their product.

The first iterations of LLM observability tooling tended to take a quantitative approach - mainly returning metrics for concerns such as latency, error rates and token counts. Vendors are now grappling with the equally important but rather more elusive question of response quality.
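To make the "quantitative approach" concrete, here is a minimal sketch of the kind of roll-up those early tools produce. The `LLMCall` record and `summarise` helper are hypothetical names invented for illustration, and the sample figures are made up. This is not any vendor's data model.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class LLMCall:
    """One recorded LLM call (hypothetical record shape, for illustration only)."""
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    error: bool = False


def summarise(calls: list[LLMCall]) -> dict[str, float]:
    """Roll a batch of calls up into the three classic quantitative signals."""
    total = len(calls)
    return {
        "median_latency_ms": median(c.latency_ms for c in calls),
        "error_rate": sum(c.error for c in calls) / total,
        "total_tokens": sum(c.prompt_tokens + c.completion_tokens for c in calls),
    }


# Made-up sample data: four calls, one of which failed.
calls = [
    LLMCall(120.0, 200, 50),
    LLMCall(340.0, 500, 120),
    LLMCall(95.0, 150, 0, error=True),
    LLMCall(210.0, 300, 80),
]
summary = summarise(calls)
print(summary)
```

Numbers like these tell you how fast and how expensive a call was, but nothing about whether the answer was any good, which is exactly the gap that response quality tooling is trying to fill.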

Observe are tackling this by capturing the full context of a session, so that users can drill down through the inputs and outputs of each LLM call. The Observe LLM Explorer gives full visibility of prompts and responses created at each step of an interaction enabling users to pinpoint the origin of errors in the LLM’s chain of thought. If you want to try out the feature, it is currently in public beta.

On a related note, Observe have also rolled out an MCP server that integrates with Claude, Cursor and Augment. Similar to the Dynatrace MCP server we looked at in last month’s edition, the Observe server allows users to run natural language queries such as “show me the top five http 500 errors in the last hour”. The productivity gains of NLQ are self-evident, but there is also a secondary benefit. By integrating with IDEs such as VS Code, they also help in breaking down the historic disconnect between devs and observability platforms.

Bringing Observability to MCP Servers

Model Context Protocol has established itself in software toolchains with incredible rapidity. Unfortunately, from an observability point of view, MCP servers tend to be something of a black box. Naturally, like any other service in your application landscape, they need to be monitored. At the moment, observability tooling is still playing catch-up and there are few out-of-the-box solutions.

This timely and succinct article on the SigNoz blog clearly sets out the rationale for MCP observability. It also explains how OpenTelemetry’s context propagation feature, as well as its support for multiple coding languages, makes it an ideal solution for MCP server instrumentation.
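For readers who have not seen context propagation up close, the sketch below shows the core idea using only the Python standard library: a client mints a W3C `traceparent` value, attaches it to a request, and the server continues the same trace with a fresh span id. It is a simplified illustration rather than the OpenTelemetry SDK or the approach taken in the SigNoz article. Carrying the value in the request's `_meta` field is our assumption, and the `search_logs` tool name is hypothetical.

```python
import json
import re
import secrets

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
# Simplified: this pattern does not reject the all-zero ids the spec forbids.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace id and 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Continue an existing trace: keep the trace id and flags, mint a new span id."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Client side: attach the context to a JSON-RPC tool call.
client_span = new_traceparent()
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_logs",                  # hypothetical tool name
        "arguments": {"query": "http 500"},
        "_meta": {"traceparent": client_span},  # assumed carrier location
    },
}
wire = json.dumps(request)

# Server side: extract the context and continue the same trace.
received = json.loads(wire)
server_span = child_traceparent(received["params"]["_meta"]["traceparent"])
print(client_span)
print(server_span)
```

Because both spans share a trace id, a backend can stitch the client call and the server-side work into a single trace, which is what turns the black box into something you can follow end to end.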

OpenTelemetry

Prometheus and oTel - it's complicated...

Although OpenTelemetry has firmly established itself as the lingua franca of observability, achieving a convergence with Prometheus metrics has always been something of a stumbling block. A couple of recent blog articles - the first published on the oTel blog and the second by Prometheus co-founder Julius Volz - really shed light on the frictions between the two frameworks.

In the oTel world, Resource Attributes play a fundamental role in standardising resource identification and enabling correlation. Unfortunately, translating them into Prometheus labels can cause cardinality explosions.
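A toy model makes the arithmetic behind the problem visible. The sketch below counts distinct label sets for a single metric, first with its own labels only and then with two resource attributes promoted to labels. The label names, pod counts and the `build_samples` and `distinct_series` helpers are all made up for illustration. This is not how either project computes cardinality.

```python
from itertools import product

METHODS = ["GET", "POST"]
STATUSES = ["200", "404", "500"]


def build_samples(pod_count: int, node_count: int) -> list[dict[str, str]]:
    """Simulate one HTTP metric emitted by every pod for every method/status pair."""
    samples = []
    for pod in range(pod_count):
        for method, status in product(METHODS, STATUSES):
            samples.append({
                "http_method": method,
                "status_code": status,
                # Resource attributes, flattened into label form:
                "service_instance_id": f"pod-{pod}",
                "k8s_node_name": f"node-{pod % node_count}",
            })
    return samples


def distinct_series(samples: list[dict[str, str]], label_keys: list[str]) -> int:
    """A time series is identified by its label set, so count the unique sets."""
    return len({tuple(sample[key] for key in label_keys) for sample in samples})


samples = build_samples(pod_count=50, node_count=10)
metric_only = distinct_series(samples, ["http_method", "status_code"])
with_resource = distinct_series(
    samples,
    ["http_method", "status_code", "service_instance_id", "k8s_node_name"],
)
print(metric_only, with_resource)
```

In this toy setup the same metric goes from 6 series to 300 once the resource attributes become labels, and every pod restart that mints a new instance id adds more. Multiply that across hundreds of metrics and the scale of the issue becomes clear.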

The oTel blog article discusses the results of some research into how these incompatibilities might be resolved. In the short term, there are a number of not particularly elegant workarounds. In the longer term, there are potential solutions such as saving oTel resource labels as metadata in Prometheus.

The Julius Volz article sets out the divergences between the two technologies in rather starker terms. The article includes a disclaimer that, as a co-founder of Prometheus, the author may be biased. Having got that formality out of the way, there follows a clinical dissection of oTel's handling of Prometheus metrics. Amidst the carefully argued critique, there is also this section, which highlights the philosophical divide between the two approaches in pretty unequivocal terms:

This is an important debate and one which does not look like going away any time soon.

oTel Weaver - Making Instrumentation a First Class Citizen

As we have said, we believe that the OllyGarden tooling is a huge step forward in helping engineers to improve the quality of their instrumentation. Whilst OllyGarden can help you to debug and fix bad telemetry, it is not a substitute for establishing standards and practices across your teams to ensure that good telemetry is a priority and not just an afterthought. This is the essence of the Observability by Design philosophy.

The centrepiece of this is the oTel Weaver, an open source tool designed to help teams to "define, validate, and evolve telemetry schemas". Achieving this consistency across multiple teams is easier said than done and the Weaver ships with a variety of features to assist with creating registries, generating documentation and validating that code complies with semantic conventions.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This edition’s quote is from the British statistician George E. P. Box:

Discovering the unexpected is more important than confirming the known.

About Observability 360

Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.

The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.