An End to Firefighting?
Elastic Want Your Metrics | The Truth About AI Agents

Welcome to Edition #47 of the newsletter!
The big story since the last edition of the newsletter is the graduation of the OpenTelemetry project. This is a significant landmark and everybody involved with the project deserves recognition for this achievement.
The great success of the project, though, lies not just in being recognised by the CNCF but in the adoption of OpenTelemetry in the wild. The custodians of the project have faced enormous technical and logistical challenges - as well as encountering no little scepticism and resistance along the way.
The fact that OpenTelemetry is now the default standard across the industry is the culmination of years of patient and meticulous effort from highly talented and dedicated developers, architects and engineering leaders.
Obviously, this is not the end of the road, just the end of a chapter. There are well-documented concerns around performance and operational complexity. However, with a community of 12,000 contributors across 2,800 companies, the momentum behind the project now seems unstoppable.
Feedback
We love to hear your feedback. Let us know how we are doing at:
NEWS
OpenObserve Secure $10m Funding Boost

If OpenObserve is one of those vendors that has been on your radar but remains a bit of an unknown quantity, then maybe now is the time to take a closer look. The company has been quietly building out a highly scalable open source observability platform and has now been boosted by a $10m Series A funding round.
This is an investment that seems to be built on a strong foundation. The company was only launched in 2022 but is in use at over 7,000 organisations and its GitHub repo has over 19,000 stars. The system was built in Rust from the ground up and ships as a single binary - so getting started is super-simple. You can be up and running in a matter of minutes.
At the same time as the funding announcement the company also launched a suite of AI-driven features under the rubric of Observability 3.0. Other vendors and commentators have already come up with their own definitions of Observability 3.0, but in the OpenObserve world it means tooling for AI SRE, Anomaly Detection and LLM Observability.
Honeycomb and Embrace Strike Up Strategic Alliance
As you probably know, Embrace began life as a mobile observability specialist, but over the past year have also developed a market-leading RUM platform. They have also been active in building alliances - notably collaborating with Chronosphere in their ambitious plan to create a multi-vendor composable observability platform.
Their latest collaboration is a strategic alliance with Honeycomb. If you are a regular reader of the newsletter, you will probably be aware that Honeycomb already rolled out their own Front End Observability implementation in 2024. So what gives? Well, the existing Honeycomb implementation is not being replaced. Instead, the approach will be additive, integrating Embrace’s comprehensive mobile analytics to provide complete end-to-end coverage for web and mobile contexts.
Elastic - the New Home of, erm Prometheus??

Elastic has long been the champion of advanced log search. One vendor flex we didn’t have on our 2026 bingo card was that they would position themselves as the ultimate destination for your Prometheus metrics.
This is a two-pronged assault which involves making PromQL a first-class citizen in the Elastic ecosystem whilst also outperforming the Prometheus server on ingestion and scalability. Although vendor metrics must always be taken with a pinch of salt, the company is publishing some pretty bold numbers - query times 30x faster than Mimir and 60% storage savings compared to Prometheus.
Ultimately, the play is not just about pitting Elastic against Prometheus - it is also about going toe-to-toe with other metrics powerhouses such as Grafana Mimir and ClickHouse.
ClickStack Launch Private Preview of Cloud Offering

It has been a busy year for ClickStack - the observability platform built on the ClickHouse database system. Having released a slew of enhancements such as Always-On Event Deltas, as well as rearchitecting their Ingestion Schemas, they have now announced the launch of ClickStack Cloud.
If you are familiar with the ClickStack ecosystem, you may be wondering how this differs from Managed ClickStack - which is also a cloud offering. Fundamentally, the difference lies in the management of the backend database. In the Managed ClickStack offering the ClickHouse database ran in the cloud but users still had responsibility for managing and configuring the cluster. Whilst this had the advantage of giving teams fine-grained control, it also involved a maintenance overhead.
The ClickStack Cloud service removes this burden and takes care of automatic tuning for observability workloads. The service has initially been launched as a private preview but places are still available for companies wishing to get involved.
Products
NoFire AI - Stop the Fire Before it Starts

Over the past few years dozens of AI SRE products have entered the market - bringing huge advances in Root Cause Analysis and reducing MTTR.
NoFire is a product which aims to take the game to the next level. As well as helping to resolve incidents, it also attempts to intercept breaking changes before they reach production. The product launched officially last month and is backed by $2.5m from a seed funding round led by Marathon Venture Capital.
In common with tools such as Traversal and RunWhen, it attempts to build up a picture of your entire production context. This goes beyond tracking your production applications and includes building a map of services, dependencies and changes. Armed with this context it is able to build a causal graph, identify risks and estimate the potential blast radius of a breaking change.
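To make the idea of a blast radius concrete, here is a minimal sketch in Python. It is not how NoFire works internally - the graph shape, the service names and the blast_radius function are all hypothetical. It simply assumes you already have a map of which services depend on which, and walks that map breadth-first to find everything downstream of a change.

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], changed: str) -> set[str]:
    """Return every service reachable downstream of `changed`.

    `dependents` maps a service to the services that depend on it directly.
    This is an illustrative structure, not a vendor data model.
    """
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            # Skip services already visited so cycles cannot loop forever.
            if downstream not in affected and downstream != changed:
                affected.add(downstream)
                queue.append(downstream)
    return affected


# Hypothetical dependency map: the orders and inventory services depend on db, etc.
example_graph = {
    "db": ["orders", "inventory"],
    "orders": ["checkout"],
    "checkout": ["web"],
    "inventory": [],
}
print(sorted(blast_radius(example_graph, "db")))
```

A real product would presumably weight edges by traffic, change history and criticality rather than treating every dependency as equal.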
Cardinality Guardian - Stay One Step Ahead of Cardinality Explosions

Cardinality explosions are the bugbear that haunts every observability engineer working at scale. A badly chosen attribute can cause astronomical spikes in metric volumes as well as burning a hole in your observability budget.
Cardinality Guardian is a tool which provides a neat solution to this problem. It runs as an OpenTelemetry Collector Processor and detects labels which appear to be exploding. When activated, it will strip out the offending label but leave the rest of the metric intact.
The tool is developed by software engineer Yasmine Elayyat and she is currently seeking users who will give feedback on the product. The code is well worth checking out - this is a sophisticated piece of kit.
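The general technique of detecting an exploding label and stripping it can be sketched in a few lines of Python. This is an illustration of the idea only, not the Cardinality Guardian implementation: the LabelCardinalityGuard class, its max_distinct_values threshold and the label names are all assumptions made for the example.

```python
from collections import defaultdict


class LabelCardinalityGuard:
    """Hypothetical guard that drops a label once it has too many distinct values."""

    def __init__(self, max_distinct_values: int = 1000) -> None:
        self.max_distinct_values = max_distinct_values
        self._seen: dict[str, set[str]] = defaultdict(set)  # label -> distinct values
        self._blocked: set[str] = set()  # labels judged to be exploding

    def process(self, labels: dict[str, str]) -> dict[str, str]:
        """Record the label values, then return the labels minus any blocked ones."""
        for name, value in labels.items():
            if name in self._blocked:
                continue
            seen = self._seen[name]
            seen.add(value)
            if len(seen) > self.max_distinct_values:
                self._blocked.add(name)
                seen.clear()  # no need to keep tracking a blocked label
        # The rest of the metric's labels are left intact.
        return {k: v for k, v in labels.items() if k not in self._blocked}


guard = LabelCardinalityGuard(max_distinct_values=3)
for i in range(5):
    cleaned = guard.process({"service": "checkout", "user_id": str(i)})
print(cleaned)
```

Holding every distinct value in memory is the simplest possible approach. A production processor would more likely use a probabilistic counter and a time window, which is part of what makes the real tool worth a look.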
Latitude - Streamlined LLM Issue Detection

The market for LLM observability tools has grown rapidly and includes major players such as Langfuse, Comet Opik and OpenLit. These are heavy-duty environments for testing, development and debugging.
Latitude is an open source LLM observability platform that takes a different approach, specialising in detecting and tracking production issues in your agentic AI systems. It helps teams to manage their AI applications by intelligently grouping failures into Issues and then tracking their progress from New to Resolved.
The system already has 4k stars on GitHub and boasts a pretty impressive roster of logos, including Wix, Nvidia and AWS.
AI
Datadog - The State of AI Engineering

Most engineering teams have either adopted a strategy for Gen AI monitoring - or are in the process of doing so. It is, however, a new and fast-moving field and knowing what to measure is not always obvious.
This recent report from Datadog is not a strategy document per se. It is, however, a really useful resource for understanding how teams are using AI in their production environments, and contains some really practical and actionable insights to help inform your AI observability strategy.
Some of the more interesting findings centre around framework adoption and caching. The authors argue that whilst frameworks such as Langchain can bring structure to AI Engineering, there can also be hidden costs and unexpected side-effects. They also found that many teams are not maximising the potential for prompt caching. As the age of discounted AI comes to an end, these inefficiencies could potentially have significant cost implications.
Unlike some similar reports, this one is eminently readable, as it condenses its findings into 7 concise ‘facts’. It is also based on empirical evidence derived from a large sample of Datadog customers.
Agent Reliability - the Reality Behind the Stats

Your agent tool calls have a 97% success rate. It’s not Five Nines, but is it good enough? Well, as this article by James A. Wondrasek argues, tool calls do not happen in isolation - they are executed in chains. With a 3% failure rate, and assuming failures are independent, a task that spans 30 tool calls has roughly a 40% chance of completing without a single failure. Worse, these failures are often swallowed silently rather than being surfaced to the user.
This article is a trenchant and clear-sighted trawl through some of the stark realities of running agents in production. It is not a disgruntled rant, more of a wake-up call that emphasises the need to ground AI deployments in best practices around architecture, governance, security and cost control.
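The 40% figure is simple compounding, and it is worth running the numbers for your own chain lengths. The short Python sketch below assumes that each tool call fails independently and with the same probability, which is a simplification. The function name is ours, not the article's.

```python
def chain_success_probability(per_call_success: float, num_calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independent calls."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be between 0 and 1")
    if num_calls < 0:
        raise ValueError("num_calls must not be negative")
    return per_call_success ** num_calls


# 0.97 ** 30 is roughly 0.40, the figure quoted in the article.
for calls in (1, 10, 30, 100):
    probability = chain_success_probability(0.97, calls)
    print(f"{calls:>3} calls: {probability:.1%} chance of a clean run")
```

In practice, retries and correlated failures will push the real figure up or down, but the shape of the curve is the point: per-call success rates that look healthy can still produce unreliable tasks.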
OpenTelemetry
OpenTelemetry Ecosystem Explorer - Navigating the Instrumentation Maze

In Edition 38 of the newsletter we covered this article by Grafana Engineer Jay de Luca, where he discussed the OpenTelemetry Ecosystem Explorer project - an attempt to document every instrumentation library across the vast Java ecosystem. This currently spans 251 libraries comprising a complete A-Z - from ActiveJ to ZIO HTTP.
Although the project is still a work in progress, the web site is now live at https://explorer.opentelemetry.io/. This means that Java Developers can now search for a particular library by name and drill down into detailed descriptions of which spans, metrics, and attributes each instrumentation emits.
In this article on the OpenTelemetry blog, Jay provides an update on the current state of the project as well as looking at some of the remaining challenges.
Blueprints for Success - Taking the Strain Out of OTel Architecture

One of the most common themes that emerges when users are providing feedback on OpenTelemetry is the complexity involved in architecting and building the necessary infrastructure.
Although you can spin up an OpenTelemetry Collector with a simple Docker command, actually running a fleet of Collectors in large-scale production environments involves careful architecting of resources such as load balancers, ingresses, daemonsets and more.
This article by Daniel Gomez Blanco on the OpenTelemetry blog introduces a new initiative called OpenTelemetry Blueprints. The aim is not to provide a single monolithic guide but to define individual blueprints for specific tasks, which can then be assembled in a modular fashion.
The blueprints will be backed up by Reference Implementations. These are documents where companies share the technical details of their own real-world implementations.
Gen AI Normalizer - Stopping Your Telemetry From Getting Lost In Translation

As you may know, there are a number of OpenTelemetry libraries for Gen AI, covering both generic concerns such as event and metric definitions as well as conventions for vendor-specific operations.
Although this lays a foundation for a common standard, the current landscape is still a sprawl of competing definitions. This is largely because the conventions have lagged behind the pace of vendor implementation, leaving a situation where different frameworks have implemented multiple names for the same attributes. For example, to track the number of prompt tokens being consumed OpenInference uses the term llm.token_count.prompt. At the same time OpenLLMetry uses llm.usage.prompt_tokens and LangChain uses llm.token_count.prompt.
This makes life complicated for teams running applications across multiple frameworks. To create consistent analytics, teams would need to implement large numbers of transformations of telemetry attributes. Gen AI Normalizer is an OpenTelemetry Collector Processor built by OpenSearch Engineer Kyle Hounslow to help teams avoid this manual toil.
The processor has a small set of simple configuration options and applies single-pass canonicalization to span and event attributes. It has recently been accepted and merged into the official opentelemetry-collector-contrib repo.
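To show what this kind of canonicalization involves, here is a minimal Python sketch. It is not the processor's code or configuration. The alias table uses the two attribute names quoted above, and the target name gen_ai.usage.input_tokens is an assumption chosen to mirror the OpenTelemetry Gen AI semantic conventions - the mapping the real processor applies may differ.

```python
# Illustrative alias table: framework-specific name -> assumed canonical name.
CANONICAL_ALIASES = {
    "llm.token_count.prompt": "gen_ai.usage.input_tokens",
    "llm.usage.prompt_tokens": "gen_ai.usage.input_tokens",
}


def normalize_attributes(attributes: dict[str, object]) -> dict[str, object]:
    """Rename known aliases to their canonical key in one pass over the attributes.

    If several keys map to the same canonical name, the first one seen wins.
    That conflict policy is an assumption made for this sketch.
    """
    normalized: dict[str, object] = {}
    for key, value in attributes.items():
        canonical = CANONICAL_ALIASES.get(key, key)
        normalized.setdefault(canonical, value)
    return normalized


span_attributes = {"llm.usage.prompt_tokens": 412, "llm.model_name": "example-model"}
print(normalize_attributes(span_attributes))
```

Doing this once in the Collector, rather than in every dashboard query, is what saves the toil: downstream analytics only ever see one name per concept.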
That’s all for this edition!
If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!
This week’s quote is from Nassim Nicholas Taleb’s seminal work, The Black Swan:
“We do not see the infinite variety of things that could have happened but didn't... We focus on the known and mistake the unobserved for the nonexistent.”
About Observability 360
Hi! I’m John Hayes. As well as publishing the Observability 360 newsletter, I am also an Observability Advocate at SquaredUp.
The Observability 360 newsletter is entirely autonomous. All opinions expressed in the newsletter are my own.