Life, the O11yverse and Everything

Join the BYOC Party | Observability Grows Up

Welcome to Edition #42 of the newsletter!

If you are a fan of the Hitchhiker’s Guide to the Galaxy, you will be aware of the significance of 42. Whilst we can’t quite claim to bring you the ultimate answer to life, the universe and everything, we will certainly have a stab at unpacking all the latest developments in the observability cosmos.

In this edition - the observability buying spree goes on, the first unicorn in the AI SRE kingdom, bringing your own cloud to the observability party and lots more. Strap in and don’t worry about the Panic Button - you won’t need it!

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Observability in 2026 - smarter, more mature and more expensive.

Elastic have released their latest survey of the observability landscape, and it sounds like we are maybe entering into those late teenage years - growing up, costing more and getting smarter.

One of the really striking stats is the big leap in the number of organisations describing their observability practice as being mature. The figure for this year was 60% - up from 41% last year (and just 14% in 2024). However, having invested in observability technologies, companies are turning their attention to reducing their running costs. 67% of organisations report experiencing cost overages and 96% are taking steps to reduce expenditure.

It also seems as if we are moving past the AI hype cycle and into a stage where Gen AI is genuinely bringing productivity gains - especially amongst more mature organisations. This is a highly readable report and an excellent yardstick for measuring your own observability maturity.

Resolve AI Join the Unicorns with $125m Funding Boost

The AI SRE sector is seeing huge growth as well as fierce competition - with upwards of 20 vendors entering the space in the past year or so. Resolve AI have established themselves in the leading pack and last week received a massive vote of confidence in the form of a $125m Series A funding round - an investment that places a $1bn valuation on the company.

Although Resolve only launched in August 2025, they have a client list of blue chip companies such as Coinbase, DoorDash, MongoDB and Salesforce. The funding will be invested into R&D, Product Development and Customer Success.

Obviously, a lot of smart investors are betting on AI SRE. The big question is how its positioning will play out - will it be a disruptor to existing observability vendors or a complement to existing solutions?

Dash0 Snap Up Serverless Specialists Lumigo

The observability acquisition spree continues apace. Dash0 recently announced their own funding round and have now invested some of that cash into the acquisition of Lumigo. Lumigo are specialists in tooling for serverless platforms such as AWS Lambda.

It’s a move that makes sense on a number of levels. Whilst many vendors focus on Kubernetes, serverless still accounts for a large proportion of applications running on the web. Recently, platforms such as Cloudflare have also gained traction as a low-cost and less complex alternative to K8S. Integrating Lumigo into their stack will enable Dash0 to offer end-to-end observability spanning K8S, serverless and LLMs.

There is also the small matter of acquiring a customer book that includes logos such as Taco Bell, Telegraph Media Group and Starling Bank. This can also be a two-way street - Lumigo users may now also gain visibility over areas such as K8S and RUM.

Dash0 are a company with big ambition and a formidable capacity for execution. This is their first acquisition but is unlikely to be the last.

OpenSearch 3.5 Rolls Out

Observability is not just about troubleshooting and root cause analysis. It is also about asking questions about your systems and their data - and this is a capability where systems such as OpenSearch excel. Version 3.5 of the platform has now been released, and it includes new and updated features across three main tracks - observability, vector search and AI.

The observability updates include a major expansion of Prometheus support - with the ability to view metrics alongside logs and traces in the visualisation UI. OpenSearch claim that their Piped Processing Language is particularly well suited to observability workloads. In this release it is enhanced with a number of new functions and commands for advanced manipulation of complex data structures.

Meanwhile, the Search Relevance Workbench is beefed up with the introduction of an LLM-as-Judge feature for evaluating search results, whilst there are also optimisations in query performance and storage efficiency. OpenSearch is also a platform for building your own agents and version 3.5 ships with capabilities such as persistent structured memory and context management.

A number of commentators have speculated that AI will mean the end of traditional SaaS applications. Monolithic enterprise UIs will give way to studios for agentic orchestration. Having built a platform strongly geared towards agent creation and testing, maybe OpenSearch will be better placed than others to weather this storm.

Products

Tsuga - Driving the BYOC Revolution

BYOC (Bring Your Own Cloud) is not a new concept. It has been around for quite a while. Initially, it may have felt like something of an experimental option, a model that had its advantages but also meant a lot of technical overhead.

However, as costs spiral and concerns around privacy and data sovereignty become ever more salient, the time could be ripe for products such as Tsuga - a BYOC observability stack built by alumni of luminaries such as Datadog, Cognition and Palantir.

Tsuga runs as a fully managed Kubernetes stack hosted in your own cloud environment. All of your data is kept in your private S3 store, so you do not pay vendor ingestion and storage costs.

We recently met up with company CEO Gabriel-James Safar to find out more about the Tsuga BYOC model and the company’s vision for “observability done right”. You can get the full story by clicking on the link below.

RunWhen - the multi-skilled AI SRE

As we have said earlier, AI SRE is a crowded and fiercely competitive marketplace. Not long ago, features such as automated triage, natural language querying and root cause analysis might have been considered cutting-edge - in AI SRE they are now table stakes.

The term AI SRE implies a tool with a specific and narrowly defined set of functions. This can make it harder for any individual tool to stand out. With a name like RunWhen though, you can guess that the philosophy of the product is steeped in the SRE tradition of Runbooks. In an AI context, these have evolved into a comprehensive library of agentic tools - over 3,400 at present.

Equally distinctive is the RunWhen approach to onboarding. AI SRE tends not to be a turnkey, one-size-fits-all solution. Instead, there is a learning curve (both for the AI and the customer). Although you can be up and running with RunWhen in an afternoon, RunWhen Engineers will take you through their 30 tools in 30 days program to get both humans and machines production-ready.

From the Blogosphere

Prometheus Remote Write V2 - The Incredible Shrinking Egress Costs!

If you are using Prometheus Remote Write to send your metrics to a Prometheus backend, this article on the Grafana Blog is a must-read.

Whilst Prometheus Remote Write V1 is a robust and widely used protocol, it does suffer from one major drawback - it tends to be rather verbose. When working at scale this can cause raw telemetry volumes to mount up and result in considerable egress costs when transporting loads across clouds.

In this article, Sam DeHaan and Kyle Eckhart explain how Remote Write V2 has been written with first-class support for metadata as well as vastly improved compression. When Grafana migrated their internal workloads to V2 last year they achieved a 50% reduction in egress costs (although this was partially offset by increases in CPU and memory utilisation).

Check out the article below for further technical details, as well as instructions on upgrading Alloy to V2.

Crunching the Numbers on AI SRE reliability

Achieving 99.999% uptime might be the ultimate goal for SREs, but the costs of achieving it may outweigh any benefits perceived by end users.

In this article on the Komodor blog, Udi Hofesh examines the considerations for building AI SREs that can achieve the sweet spot for system reliability - robust enough to meet customer experience KPIs, but also flexible enough to avoid becoming a drag on developer velocity.

Hypothetically, an AI SRE can run in an automated control loop, continually adjusting resources and processes to achieve a desired reliability goal. As Udi shows - this involves feeding in a substantial amount of system and organisational context. The SRE role is built on a large body of technical expertise, domain knowledge and awareness. Transferring this knowledge and these skills to an AI agent requires some careful calibration.
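
To put the 99.999% figure mentioned above into perspective, here is a minimal sketch of the arithmetic behind an availability target. It is purely illustrative and is not taken from the Komodor article - the function name and the use of an average 365.25-day year are our own assumptions.

```python
# Illustrative only: converts an availability target into a yearly downtime budget.
# Assumption: an average year of 365.25 days (i.e. leap years are averaged in).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year is whatever the target leaves over.
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which gives a feel for why the cost of the final nine can outweigh the benefit to end users.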

OpenTelemetry

Advisory - Sunsetting of the OTel Batch Processor

If you have ever configured an OTel Collector, you will probably be familiar with the Batch Processor. As you might expect, it does what it says on the tin. It batches up incoming data for efficient processing and forwarding. News of its deprecation may therefore come as something of a surprise.

As this article on the Dash0 blog explains though, there are sound architectural reasons for the change. It also does not mean that the principle of batching itself is being sidelined. Instead, the responsibility for batching will be delegated to the respective Exporters. This should provide greater robustness and traceability of signals in the event of failures. The article provides a really excellent dive into the details of the change.

The OBI Roadmap - An Epic Journey Into eBPF

eBPF is the technology that came, saw, conquered and then embedded itself into the fabric of observability in almost the blink of an eye.

As a technology which can reach inside the kernel to provide autoinstrumentation, as well as bring coverage to the parts other tools cannot reach, its impact on observability has been profound and its reach is pervasive.

Spare a thought then for OBI, the OpenTelemetry Special Interest Group dedicated to defining standards for eBPF instrumentation. This is an enormous undertaking and you can get some idea of the scale of the task that the SIG has set itself from a look at its hugely ambitious roadmap. Alternatively, you can get the executive summary from this article on the OpenTelemetry blog.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the Hitchhiker’s Guide to the Galaxy:

“Exactly!” said Deep Thought. “So once you do know what the question actually is, you’ll know what the answer means.”

About Observability 360

Hi! I’m John Hayes - as well as publishing the Observability 360 newsletter, I am also an Observability Advocate at SquaredUp.

The Observability 360 newsletter is an entirely independent entity. All opinions expressed in the newsletter are my own.