Observability's Secret Garden
Data Pipeline Observability | Devin - A Progress Report

Welcome to Edition #31 of the newsletter!
The Fog of AI
You may be familiar with the Gartner hype cycle, which charts the trajectory of new technologies from inflated expectations through disillusionment to pragmatic adoption. I wonder whether, in the case of AI, it might be worth adding a new stage - the fog of uncertainty.
There is no doubt that pretty much every business in the market wants to adopt AI, but the problem is figuring out what that actually means. This is not least because AI adoption does not merely mean adding new features or drawing up a new roadmap.
AI is also forcing companies to reinvent themselves and make decisions about job roles and staffing levels. What makes this even more difficult is that AI is evolving at a pace that even some of the most agile organisations cannot keep up with. Almost none of us are equipped to deal with paradigm shifts happening on a monthly basis. At the moment, companies are faced with the unenviable task of building a foundation while the ground beneath their feet is constantly shifting.
Feedback
We love to hear your feedback. Let us know how we are doing at:
NEWS
Datadog On-Call Goes GA

At their flagship Dash event last year, Datadog announced a whole slew of major releases. One of the most significant was On-Call - an obvious competitor to established products such as PagerDuty. The company have now announced that On-Call is out of preview and is generally available. This means that the platform now provides an integrated solution for monitoring, incident management and paging. Incident management can often be a fragmented experience for IT staff, with users having to hop between multiple systems. An integrated solution makes sense for users, but it obviously means that the standalone vendors may see their market share eroded.
The Secret Garden

Juraci Paixao Krohling was a Principal Engineer at both Red Hat and Grafana Labs, and is a CNCF Ambassador as well as a member of the OpenTelemetry Governing Board. When he announced at the end of 2024 that he was leaving Grafana to pursue a new venture, there was naturally considerable curiosity as to what he might be hatching. Well, we still don’t have too much detail, but a recent post on LinkedIn does offer some clues.
The post namechecks an A-list of Observerati professionals who have advised him and also includes a link to the OllyGarden website, which consists of little more than a banner with the words “Efficient Telemetry Pipelines”. The observability pipeline space has expanded rapidly in recent years as companies build architectures for managing large telemetry volumes. Will they be the next Vector or Mezmo? We’ll be watching with interest.
ObservIQ Take Wing as BindPlane

ObservIQ have announced that they are re-branding as BindPlane - the name of their leading OpenTelemetry management solution. As the company state in their blog post, the aim is to “align our identity with our flagship product”. We covered BindPlane in January of last year and were really impressed by their pioneering work in wrapping a layer of management capabilities around the OpenTelemetry Collector.
As well as doing some rebranding, the team at BindPlane have also been busy on the engineering front. The latest version of the product delivers some major new enhancements. These include a Supervisor, which provides messaging and resilience functions, as well as improved Data Routing and support for OpenTelemetry Connectors. It seems like 2025 is going to see all kinds of products building pipeline functionality around the oTel Collector, and this is a really interesting example.
Catchpoint State of SRE 2025 Report

Catchpoint are obviously not the kind of people to let the grass grow under their feet. The refrains of Auld Lang Syne had barely finished bidding farewell to 2024 when they released their 2025 State of SRE report. This may be the first State of… report of the year and it certainly won’t be the last. It is, however, one of the most insightful and best-researched of its kind.
The 2025 edition keeps up the standard of previous years. What really sets it apart is that it is not just a number-crunching exercise. Instead, it abstracts the numerical data into a number of analytical insights. This is a report which has been carefully thought out, and the authors exhibit both a genuine understanding of the field and the ability to draw valuable insights from their research. The great news is that you can download the report without having to register.
Products
MetricsHub - Dedicated Metrics Pipelining

Telemetry pipeline products are gaining increasing traction in the marketplace. MetricsHub, as the name suggests, specialises in collecting and forwarding metrics. The platform has been built by Sentry Software, the infrastructure monitoring specialists (not to be confused with the application monitoring vendor Sentry).
Like a number of other emerging pipeline solutions, MetricsHub has the oTel Collector at the heart of its architecture. It ingests telemetry from MetricsHub agents and then forwards it on to oTel-compliant backends such as ServiceNow, Datadog, Grafana and Splunk.
So why would you use MetricsHub instead of a native OpenTelemetry Collector instance? The principal reason would probably be ease of connectivity with a wide range of metrics sources. The platform ships with support for over 200 systems and apps and includes a YAML-based extensibility framework to cater for new sources. This removes a lot of time-consuming configuration and accelerates onboarding.
Definity - Observability for Data Pipelines

We have previously covered products such as meshIQ which provide observability for messaging systems such as Kafka. Another critical concern, especially for businesses working at scale, is visibility of their data pipelines. Many data pipelines are composed of a multitude of individual steps and the coverage provided by default monitoring logic can be patchy and superficial.
Definity is a product which aims to address this with an observability solution that covers three key areas of data pipeline operation - pipeline error detection, data quality and cost effectiveness. Definity runs as an agent with an almost invisible resource footprint and can provide feedback in real time. The company has only recently emerged from stealth and its founders have an impressive pedigree, hailing from tech giants such as PayPal and Worldpay.
Metoro - Kubernetes Observability

We have remarked a few times that Kubernetes observability is almost becoming a discipline in its own right. As the technology which essentially forms the foundation of the cloud native space, Kubernetes has unsurprisingly spawned its own market in tooling.
Metoro is one of a growing number of tools that use eBPF so that you can be up and running with zero instrumentation. It provides the APM features that you would expect - logs, traces, metrics and profiling - but also boasts a number of capabilities not found in other products in this niche. Two of the standout features are automatic regression monitoring, which detects performance degradations over time, and an autonomous root cause analysis function, which proactively monitors applications and investigates errors and anomalies.
The platform can ingest OpenTelemetry and Prometheus metrics and also boasts integrations with PagerDuty and Slack.
From the Blogosphere
How Complex Systems Fail

There is a whole branch of science dealing with complexity theory. This involves defining the characteristics of a complex system as well as the dynamics which make its behaviour non-deterministic. This obviously has great relevance to today’s distributed systems, and this article is a really interesting theoretical investigation of how complex systems fail.
Instead of delving into the technical detail of system failures, the article takes a high-level approach. It makes a number of observations about the nature of complex systems as well as putting forward some thought-provoking arguments. For example, the author suggests that “attribution to a root cause is fundamentally wrong” and actually stems from a social and cultural need to assign blame for negative outcomes. This is a valuable read for providing theoretical underpinnings to your SRE practice.
Federating Logs for Business Value

Enterprises today can generate vast volumes of log data. This article by Franz Knupfer of Hydrolix argues that companies can gain a competitive advantage by extracting business insights from this data. The problem is that the data itself is stored in observability backends which are often walled gardens not accessible to analytics tooling. According to Knupfer, the solution to this is to federate your logs into data lakes.
Yes, Hydrolix are indeed a vendor of such a federated data lake, but this is nevertheless a thought-provoking piece. Although it may be written from the perspective of a vendor, it still has general relevance in tackling the theme of composable observability and the potential synergies that can accrue from mining observability data for wider business insights.
AI
We Need To Talk About Devin…

You may remember that last year AI startup Cognition unveiled Devin, their AI-powered software engineering assistant. The release blog hinted at a longer-term ambition of a fully autonomous agent, with the initial iteration essentially being a junior colleague capable of triage and basic patching. The project recently exited a closed beta and it is really interesting to hear the conclusions of users who have put it to the test.
First up is this article in the Register, which concludes that Devin “appears to be rather bad at its job”. This judgement is based on a report from the Answer AI research lab, where three data scientists ran an investigation in which Devin completed only three out of twenty tasks successfully. Their conclusion was that “it rarely worked”.
Next is this review from AI software engineer Zhu Liang, who paid up his $500 for a month of Devin’s services. His conclusion was that Devin offered a great user experience but that the end results were only “passable”. This article has plenty of detail on the actual tasks that were set for Devin as well as an extensive summary of the positives and negatives in Devin’s performance.
How Sentry Tackle Alert Noise

One of the fundamental challenges for observability providers is avoiding alert spam, where monitoring systems are flooded with multiple alerts about the same underlying error. At first glance this might seem like a relatively trivial deduplication problem. In practice, the logic of attributing individual exceptions to a unique source requires some fairly heavy lifting, as well as some pretty sophisticated heuristics to understand the context of a particular message.
This article on the Sentry blog is a really fascinating insight into the sophisticated engineering involved in error normalisation. This is a pretty deep dive that involves some slightly esoteric concepts such as Levenshtein distance and Locality Sensitive Hashing, but it is also highly readable and informative. If you are interested either in the ways in which vendors are harnessing AI or in methods for solving the alert fatigue issue, you will certainly find this to be an enjoyable read.
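Sentry’s actual pipeline is, of course, far more sophisticated than anything we could reproduce here, but as a purely illustrative sketch of distance-based grouping, the snippet below clusters error messages whose Levenshtein distance to a group’s representative falls under a threshold. The messages and the threshold are invented for the example, and real systems lean on techniques like Locality Sensitive Hashing precisely to avoid this kind of pairwise comparison at scale.

```go
package main

import "fmt"

// levenshtein returns the edit distance between two strings using a
// two-row dynamic programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min(min(curr[j-1]+1, prev[j]+1), prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func min(x, y int) int {
	if x < y {
		return x
	}
	return y
}

// group assigns each message to the first existing group whose
// representative is within maxDist edits, otherwise it starts a new group.
func group(messages []string, maxDist int) [][]string {
	var groups [][]string
	for _, m := range messages {
		placed := false
		for i, g := range groups {
			if levenshtein(m, g[0]) <= maxDist {
				groups[i] = append(groups[i], m)
				placed = true
				break
			}
		}
		if !placed {
			groups = append(groups, []string{m})
		}
	}
	return groups
}

func main() {
	// Invented example messages: the first two differ only in a host name
	// and end up in the same group; the third starts a new one.
	msgs := []string{
		"timeout connecting to db-01",
		"timeout connecting to db-02",
		"nil pointer dereference in checkout",
	}
	for _, g := range group(msgs, 5) {
		fmt.Println(g)
	}
}
```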
OpenTelemetry
GoTel! Vendors Collaborate to Build oTel Instrumentation for Go

A recent article on the OpenTelemetry blog has announced that three different vendors - Datadog, Alibaba and Quesma - will be collaborating to define a common standard for OpenTelemetry instrumentation of Go applications.
As a compiled language, Go represents something of a challenge for auto-instrumentation. Both Datadog and Alibaba have opted to go down the road of compile-time instrumentation, which involves intercepting the build process and injecting instrumentation code before the application is compiled. Taking this approach can not only improve performance but can also make instrumentation more consistent and maintainable.
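To make the idea a little more concrete, here is a hedged sketch of what is being automated. The (entirely hypothetical) handler below uses the standard OpenTelemetry Go API to create spans by hand - compile-time instrumentation tooling aims to weave equivalent calls into your functions during the build, so that developers never have to write them.

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
)

// handleCheckout is a made-up example handler. With compile-time
// instrumentation, span creation like this would be injected at build
// time rather than written by the developer.
func handleCheckout(w http.ResponseWriter, r *http.Request) {
	ctx, span := otel.Tracer("shop").Start(r.Context(), "handleCheckout")
	defer span.End()

	processOrder(ctx)
	w.WriteHeader(http.StatusOK)
}

func processOrder(ctx context.Context) {
	// Child span, linked to the handler span via the propagated context.
	_, span := otel.Tracer("shop").Start(ctx, "processOrder")
	defer span.End()
	// ... business logic ...
}

func main() {
	http.HandleFunc("/checkout", handleCheckout)
	http.ListenAndServe(":8080", nil)
}
```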
In the past few months, both Datadog and Alibaba have offered to donate their source code to the OpenTelemetry project. As well as being great news for Go developers, this is also a great vindication of the vision of the OpenTelemetry project itself, as it represents vendors collaborating to build open source resources for the community as a whole.
Lambda Observability - Raising The Baa

If you are using Lambdas in AWS, then one of your motivations - apart from avoiding maintenance overhead - may be minimising cost, which naturally means minimising compute. From an observability point of view, this raises the question of what the most cost-effective way of collecting telemetry is. Do you push telemetry after each invocation, or do you keep your Lambda alive long enough for the next scrape interval to elapse? As you scale up, both of these options may have significant cost implications.
This article on the OpenTelemetry blog introduces the OpenTelemetry-Lambda extension layer, which functions as a local endpoint and decouples Lambda execution from the process of forwarding telemetry. Under the hood, the extension layer actually spins up a very stripped-down instance of the oTel Collector, which registers itself with the Lambda Extensions API. This is quite a technical article but will be of value if you really need to optimise your Lambda observability.
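As a rough sketch of the pattern - not a drop-in configuration, and the local endpoint address is our assumption rather than something taken from the article - a Go Lambda using the extension layer would simply point its OTLP exporter at the Collector running alongside it, rather than at a remote backend:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/lambda"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export to the Collector running inside the extension layer on the
	// same execution environment. localhost:4318 is an assumption - check
	// the layer's default OTLP receiver configuration.
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// A synchronous processor is fine here: the export is a cheap local
	// hop, and the extension handles buffering and forwarding to the
	// real backend outside the invocation path.
	tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
	otel.SetTracerProvider(tp)

	lambda.Start(func(ctx context.Context) (string, error) {
		_, span := otel.Tracer("handler").Start(ctx, "do-work")
		defer span.End()
		return "ok", nil
	})
}
```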
That’s all for this edition!
If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!
Our quote this time around is from author and entrepreneur Sukant Ratnakar
“Innovation is the outcome of a habit, not a random act.”
About Observability 360
Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp - creators of next-generation dashboarding software.
The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.