Observability Takes Centre Stage at KubeCon

Pipeline Explosion! | MCP Eats The World

Welcome to Edition #33 of the Newsletter!

Observability Shines at KubeCon

The European edition of KubeCon took place earlier this month and there was a full turnout from the observability sector. In the headcount of logos, sponsors and speakers, observability equalled or surpassed many other market sectors. It wasn’t just the big players such as Datadog and Dynatrace; there were also newcomers such as Dash0 and startups such as Control Theory. You can read our review in this article on the Observability 360 web site.

MCP: Transformation at Terminal Velocity

Like many commentators, we expected Agentic AI to be the big AI story of 2025. Within just a few months though, those predictions seem almost outdated as the MCP wave sweeps through AI development like wildfire. It has been hailed as the USB-C of AI application connectivity and the speed of adoption is phenomenal, opening up opportunities and challenges for developers and vendors alike.

Unsurprisingly, this has resulted in a blizzard of blog articles and commentary. In this month’s newsletter we have picked out two that we think are of value to observability practitioners. One is a practical overview of the theme by the venerable Austin Parker, and the other is an example of an MCP server in action courtesy of the folks at Last9.

Pipeline Explosion

Telemetry pipelines may not have quite the cachet of the latest developments in AI, but their impact on observability architecture has been equally profound. The momentum is also showing no sign of letting up - all three of the solutions in this month’s Products section are pipeline platforms of one kind or another. The pipeline for pipelines is in full flow!

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Grafana Release 2025 State of Observability Survey

Sometimes it feels like there are more State of Observability surveys than you can shake a stick at, but the Grafana report is one that we tend to pay attention to. Their user base is large and likely reflects a wider diversity of opinion than that of vendors whose customer base is concentrated at the large-enterprise end of the market.

It goes without saying that the report includes the usual vendor mutterings about tool sprawl and it is also no surprise to find that cost is cited as the leading criterion for system procurement. At the same time, this is a survey that feels like it has been produced by and for practitioners in the field, in contrast to others which feel like they are more oriented to the boardroom. Overall, it is a really fascinating snapshot of current practice and trends and, of course, the visualisations are gorgeous.

Kubernetes History Inspector - debugging made fun (well, almost)

Kubernetes is now the default option for enterprise application hosting, with around 66% of companies using the platform in production. One of the downsides of the platform is its complexity - debugging in particular can involve sifting through huge volumes of logs. The Google Cloud team have now released the Kubernetes History Inspector to lighten the load when troubleshooting K8S incidents.

The inspector runs as a Docker container, so it can be spun up with ease and it uses colour-coded blocks to highlight patterns in log activity. Users can then drill down through these blocks to see more granular details on errors and underlying behaviour. The biggest bonus though lies in the tool’s name - i.e. it shows you a history of state changes. This can be a great timesaver as it reduces the toil of threading together logs from different ephemeral instances of a given resource.

Dynatrace Beef Up Database Observability

Database observability can be a major concern and Dynatrace have now moved to strengthen their coverage in this area with the acquisition of Metis - an AI-driven database observability platform.

The importance of database observability is often overlooked, yet database bottlenecks can be the root cause of many of the errors and latency issues experienced by applications in production. Many organisations also do not have dedicated database specialists who are able to identify issues such as sub-optimal query design or missing or incorrectly configured indexes.

This is also a smart move on economic grounds. Standalone database monitoring systems can rack up considerable expense – especially for enterprises with large database estates. Building this functionality into the observability platform is a value-add that will be attractive to many organisations.

Products

Bitdrift Mobile Observability - Less is More

Mobile observability is a path that has been trailblazed by Embrace, who have pretty much defined the discipline. Given the strategic and economic importance of mobile e-commerce and gaming, it is not surprising to see more vendors entering the space. One of the first new entrants to follow in their footsteps is Bitdrift, a company formed by alumni of Twitter and Lyft.

If you read last year’s Observability 360 review of Embrace, you may recall that they practise full-fidelity ingestion (i.e. no sampling). The rationale is that sampling risks dropping critical information and allowing errors to go unnoticed. For customers with zero tolerance of faults this can be a deal-breaker.

Bitdrift take a different perspective, arguing that much of the data ingested by vendors is of little or no value. They therefore filter out this ‘noise’ at source. Ultimately, there are always trade-offs, and the value of each approach depends on the needs of the customer.

Sawmills - the Cutting Edge of Logging

Also on a mission to reduce observability costs are Sawmills. Their product functions as an upstream gateway that filters telemetry before it is ingested by a vendor, thereby reducing costs for the customer. Like many similar solutions, it is built around an implementation of the OpenTelemetry Collector.

The software uses AI and Machine Learning to analyse telemetry flows and then generates recommendations for eliminating redundant data. Engineers can then implement these recommendations with a single click.

As well as eliminating noise, the application can also perform optimisations via telemetry transformations - for example, by extracting metrics from log data.
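To make the log-to-metrics idea concrete, here is a minimal, self-contained Python sketch of the technique. It is not Sawmills’ implementation, and the log format, paths and metric names are invented for illustration - it simply shows how verbose request logs can be collapsed into a handful of compact series.

```python
import re
from collections import defaultdict

# Invented access-log lines; in a real pipeline these would arrive as a
# telemetry stream rather than a static list.
LOG_LINES = [
    "2025-04-12T10:01:07Z GET /checkout 200 182ms",
    "2025-04-12T10:01:09Z GET /checkout 500 941ms",
    "2025-04-12T10:01:11Z GET /search 200 87ms",
]

LINE_RE = re.compile(r"(?P<method>\S+) (?P<path>\S+) (?P<status>\d{3}) (?P<latency>\d+)ms$")

def logs_to_metrics(lines):
    """Collapse verbose log lines into compact per-endpoint request metrics."""
    counts = defaultdict(int)       # (path, status) -> request count
    latency_sum = defaultdict(int)  # (path, status) -> total latency in ms
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # unparseable lines could be routed elsewhere or dropped
        key = (match["path"], match["status"])
        counts[key] += 1
        latency_sum[key] += int(match["latency"])
    return counts, latency_sum

counts, latency_sum = logs_to_metrics(LOG_LINES)
for (path, status), count in counts.items():
    avg = latency_sum[(path, status)] / count
    print(f'http_requests_total{{path="{path}",status="{status}"}} {count} (avg {avg:.0f}ms)')
```

The economics follow directly: thousands of log lines reduce to a few series, which is where the savings come from.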

Better Observability Practice via Control Theory

The last in our trifecta of new pipelining solutions is Control Theory, who emerged from stealth at this month’s KubeCon, making a splash with their announcement of securing $5m in seed funding.

Like most pipelining solutions, the product is not intended to replace your existing stack. Instead, it is a gateway that provides a control plane for your telemetry. Also, like other pipelines, it aims to help customers cut down on their observability costs by enabling the filtering and transformation of telemetry flows. This is probably the standard functionality you would expect, but Control Theory does have a few aces up its sleeve.

One of the major causes of observability cost spikes can be high cardinality metrics. Control Theory mitigates this by identifying the sources of high cardinality, making recommendations and deploying filters. The application also aims to improve operational efficiency with automated log enrichment and attribution, which can assist with both debugging and internal organisational accounting.
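As a rough illustration of what high-cardinality detection involves - a toy Python sketch with invented series, not Control Theory’s actual algorithm - a pipeline can count the distinct values observed per label and flag any label that crosses a threshold:

```python
from collections import defaultdict

# Invented metric series, each identified by its label set. In practice these
# would be observed on the wire as metrics flow through the pipeline.
SERIES = [
    {"__name__": "http_requests_total", "service": "checkout", "status": "200", "user_id": "u-10293"},
    {"__name__": "http_requests_total", "service": "checkout", "status": "500", "user_id": "u-55821"},
    {"__name__": "http_requests_total", "service": "search", "status": "200", "user_id": "u-90417"},
    {"__name__": "http_requests_total", "service": "search", "status": "200", "user_id": "u-11111"},
]

CARDINALITY_THRESHOLD = 3  # unrealistically low, purely so the toy data trips it

def high_cardinality_labels(series, threshold):
    """Count distinct values per label and flag labels at or above the threshold."""
    values_per_label = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            if name != "__name__":
                values_per_label[name].add(value)
    return {name: len(vals) for name, vals in values_per_label.items() if len(vals) >= threshold}

for label, distinct in high_cardinality_labels(SERIES, CARDINALITY_THRESHOLD).items():
    print(f"label '{label}' has {distinct} distinct values - candidate for dropping or aggregation")
```

Here the user_id label is the offender - exactly the kind of unbounded dimension that quietly multiplies series counts and, with them, the bill.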

The application also ships with impressive fleet management capabilities for the OpenTelemetry Collector. This includes a feature called Elastic Telemetry Pipelines, which enables users to spin up and tear down telemetry pipelines from a simple UI. You can also visualise pipeline configurations and data flows.

From the Blogosphere

Perses - A Primer on Dashboards as Code

The Perses project has set itself the ambitious goal of establishing a standard for Dashboards as Code. The project has support from high-profile vendors such as Dash0 and Chronosphere and this talk by Nicolas Takashi and Antoine Thébaud received a warm reception at this month’s KubeCon in London.

If you are not familiar with the background and rationale for the project, this article in the New Stack serves as a concise and useful introduction to a number of the key concerns such as portability, scalability and adherence to GitOps principles.

If you are thinking to yourself “show me the codez”, then you might like to dip into this article on the SquaredUp blog. It takes a more hands-on approach, looking at the Perses specification and some sample Go code, and walking you through generating a dashboard as code and uploading it to a Perses server.
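The SquaredUp walkthrough uses Go and the Perses SDK. Purely to give a flavour of the dashboards-as-code workflow, here is a rough Python sketch that builds a stripped-down dashboard document and posts it to a Perses server over HTTP. The field names and endpoint path are approximations for illustration only - consult the Perses documentation for the exact schema and API.

```python
import json
import urllib.request

# A deliberately minimal dashboard document. The real Perses specification is
# richer (layouts, datasources, plugin specs); treat these field names as an
# approximation rather than the definitive schema.
dashboard = {
    "kind": "Dashboard",
    "metadata": {"name": "checkout-service", "project": "demo"},
    "spec": {
        "panels": {
            "latency": {
                "kind": "Panel",
                "spec": {
                    "display": {"name": "p99 latency"},
                    "plugin": {"kind": "TimeSeriesChart", "spec": {}},
                },
            }
        }
    },
}

# The endpoint path is an assumption for illustration; a real deployment would
# also need authentication and the correct API version.
PERSES_URL = "http://localhost:8080/api/v1/projects/demo/dashboards"

def upload(doc):
    request = urllib.request.Request(
        PERSES_URL,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print("Perses responded with HTTP", response.status)

if __name__ == "__main__":
    upload(dashboard)
```

Because the dashboard is just a document, it can live in Git, be reviewed in pull requests and be applied by CI - which is the whole GitOps argument in a nutshell.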

Tackling the Prometheus Scaling Challenge

Despite the rise and rise of OpenTelemetry and the advent of challengers such as VictoriaMetrics, Prometheus remains the standard for metrics collection and storage. For all its durability, though, the nature of Prometheus’ architecture means that it is not easy to scale horizontally.

Many articles have been written about these technical limitations - and their potential solutions - but few have as much clarity as this one by Gaurav Maheshwari on the oodle.ai blog.

This may only be a six-minute read, but it covers a lot of ground, including Functional Sharding and Federation, and discusses a range of scalability solutions such as Thanos, Cortex and Mimir. If you are just starting out on building a metrics collection solution, this is a useful overview of the terrain.
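To give a feel for what functional sharding means in practice, here is a small Python sketch of the underlying idea: hash each scrape target and assign it deterministically to one of N Prometheus shards, so that every shard scrapes a disjoint subset of the fleet. Prometheus itself achieves this with a hashmod relabelling action in its scrape configs; the hash function and target names below are purely illustrative.

```python
import hashlib

NUM_SHARDS = 3  # number of Prometheus replicas, each scraping a disjoint subset

# Invented scrape targets standing in for whatever service discovery returns.
TARGETS = [
    "checkout-0:9100", "checkout-1:9100",
    "search-0:9100", "search-1:9100",
    "payments-0:9100", "cart-0:9100",
]

def shard_for(target: str, shards: int) -> int:
    """Deterministically map a scrape target to a shard index."""
    # Not the hash Prometheus uses internally - just a stable hash for the demo.
    digest = hashlib.md5(target.encode("utf-8")).hexdigest()
    return int(digest, 16) % shards

assignment = {shard: [] for shard in range(NUM_SHARDS)}
for target in TARGETS:
    assignment[shard_for(target, NUM_SHARDS)].append(target)

for shard, targets in assignment.items():
    print(f"prometheus-shard-{shard} scrapes: {targets}")
```

The trade-off, as the article explains, is that queries spanning multiple shards then need a global view - which is where federation, Thanos, Cortex and Mimir come in.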

AI

MCP - Not Quite the Land of Milk and Honey(comb)

There is a lot of high-level hype about MCP and probably an equal volume of highly technical low-level detail. This article by Austin Parker of Honeycomb covers the pragmatic middle ground with an honest appraisal of both the potential of MCP and some of its limitations when applied to observability systems.

The article discusses a number of lessons learnt from building an MCP server at Honeycomb. The good news is that the MCP server allowed clients to dynamically query the Honeycomb API and even perform valuable tasks such as improving the instrumentation of a particular service or refactoring telemetry. The downside is that the Honeycomb API, like most others, was written for querying by a REST client, not an AI agent. More seriously, observability backends are a dense mesh of data points and attributes, which can overwhelm an agent and generate massive token outputs.

Overall, this is an excellent insight into the practical applications of MCP in observability and does not require any knowledge of the underlying LLM technology.
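If you are curious how small an MCP server can be, here is a minimal Python sketch using the official MCP Python SDK (the mcp package). The query_latency tool and its canned response are hypothetical stand-ins for a real backend call - this is not Honeycomb’s server - but it shows the basic shape: tools the agent can discover and invoke, with responses kept deliberately compact to avoid the token blow-ups described above.

```python
from mcp.server.fastmcp import FastMCP

# A toy MCP server. The single tool below is hypothetical - a real server would
# call an observability backend's query API and would need to summarise the
# results aggressively before handing them back to the agent.
mcp = FastMCP("observability-demo")

@mcp.tool()
def query_latency(service: str, minutes: int = 15) -> str:
    """Summarise request latency for a service over a recent time window."""
    # Stand-in for a real backend query; a short summary keeps token usage down.
    return f"p99 latency for {service} over the last {minutes} minutes: 412ms"

if __name__ == "__main__":
    # Serve over stdio so an MCP-capable client (such as an IDE agent) can attach.
    mcp.run()
```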

‘Vibe Monitoring’ with Last9 MCP Server

Whilst the Honeycomb MCP server is still a work in progress, Last9 have gone full throttle and made their MCP server generally available. This article on the Last9 blog is a very bullish pitch for their MCP server’s ability to provide the necessary context to assist engineers with debugging runtime exceptions.

How does it work? Well, when an exception occurs in your production environment, the Last9 MCP server captures the relevant state - including request parameters and environment variables. This enables a fully AI-driven debugging workflow that developers can execute from an IDE such as Cursor. The MCP server also supports remediation so that you can run commands such as “fix the last error in service X” from within your IDE.

According to the article, the tooling works with any deployment runtime and supports numerous IDEs, including VS Code and IntelliJ. This is obviously not going to replace your enterprise observability stack, but it certainly has great potential to enhance developer flow and productivity.

OpenTelemetry

OTTL Playground - Playing Nicely with oTel Transforms

Since its release in 2022, the OpenTelemetry Transformation Language (OTTL) has been widely adopted as a powerful and simple tool for transforming and enriching telemetry flows within the OpenTelemetry Collector. Unfortunately though, testing and debugging your transforms can be a slow and cumbersome process of generating telemetry, sending it to the Collector and then using different methods to check its outputs.

Luckily, Elastic have now released the OTTL Playground - an environment where you can test and evaluate your OTTL statements in real time and view the results instantly. The playground currently supports the transform and filter processors but additional evaluators are in the offing. The project is currently in beta but it seemed to be perfectly robust and functional when we tried it out.

oTel Developer Experience Survey Results

It is often remarked that the OpenTelemetry documentation seems more geared to defining specifications for vendors rather than helping end-users to configure the oTel Collector or instrument their code. This is a sentiment that certainly seems to be borne out in the latest survey carried out by the OpenTelemetry Developer Experience SIG.

219 responses were submitted for the survey and the results paint a really interesting picture both of Developer Experience and of variances in levels of observability maturity. Unsurprisingly, many users reported that they found the documentation difficult to navigate and wanted more examples for production usage of the SDK and the Collector.

There are also no prizes for guessing that difficulty in debugging the oTel Collector was a major concern for many developers. The sentence “it’s difficult to know why exporting isn’t working” will probably resonate with many of us. We have already seen vendors such as Grafana addressing some of these concerns with products such as Alloy and it will be really interesting to see how this feedback shapes the evolution of the oTel project.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This month’s quote is from the Computer Scientist Alan Perlis:

“Simplicity does not precede complexity, but follows it.”

About Observability 360

Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.

The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.