Grafana V12 - Firing on all Cylinders

Bad Telemetry | GPU Observability

Welcome to Edition #34 of the Newsletter!

Acceleration and specialization

In last month’s edition, we commented on the dizzying pace of change in AI development. Inevitably, in the intervening weeks, the pace has only quickened. Last month we were gobsmacked by MCP; this month, apparently, things have moved on - say hello to self-evolving agents. It seems almost impossible to predict how things will evolve, but this month we feature Roni Dover’s perspective on one possible future.

AI and the accelerating rate of change are not the only forces shaping observability. Another defining trend is an increasing degree of specialisation within the market. Grepr, Cardinal and Neurox are all great examples of tools that have moved rapidly to provide sophisticated solutions to specialist concerns such as GPU observability and high-velocity engineering.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Grafana V12 - Firing on all Cylinders

Grafana are not just an observability provider; they are an IT phenomenon that has staked out a unique identity as both an open source observability ecosystem and a major player in a multi-billion pound market.

They recently announced the release of Version 12 of their product, and it is not just another iteration - it’s a blockbuster. The list of new and updated features is pretty impressive. The biggest story is Observability as Code, which brings Infrastructure as Code workflows to building, deploying and maintaining your dashboards. For companies with large volumes of dashboards across multiple teams this is a big win. Obviously, Dashboards as Code are not unique to Grafana. As we have previously reported, the CNCF Perses project is also building out an open source standard for this model.
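To make the idea concrete, here is a minimal, hand-written sketch of the kind of dashboard definition that can live in version control alongside your application code. The field names follow the Grafana dashboard JSON model, but the values, query and schema version here are purely illustrative:

```json
{
  "uid": "checkout-service",
  "title": "Checkout Service Overview",
  "tags": ["team-payments"],
  "panels": [
    {
      "type": "timeseries",
      "title": "Request rate",
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
      "targets": [
        { "expr": "sum(rate(http_requests_total{service=\"checkout\"}[5m]))" }
      ]
    }
  ],
  "schemaVersion": 39
}
```

Once definitions like this are in git, dashboard changes go through the same review, diff and rollback workflow as any other code.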

Like every other vendor, Grafana are striving to simplify debugging, and this release sees the General Availability of Drilldowns, which enable point and click investigation of anomalies. There is also a Public Preview of Investigations, which provide an integrated view of Drilldowns across different signals. This release also sees the introduction of Dynamic Dashboards, which enable customisable layouts for different viewing needs.

In addition to this there are a host of updates to Grafana alerts, the plugin system and management features such as user authentication and onboarding.

Mezmo’s Pipelining Price Flex

Whilst AI smarts may grab the limelight, telemetry pipelines are still the basic plumbing of scalable observability systems, and they represent a major growth area, with a large number of vendors entering the space.

Mezmo are an established leader in this sector and have moved to consolidate their position with the announcement of a new pricing structure which promises massive reductions in ingestion costs. As well as reducing costs the new structure aims to make pricing simpler and more predictable. Under the pricing model, ingestion costs have been slashed from $1.80 to $0.20 per GB. You can find out more in this blog article as well as in this press release.
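As a back-of-the-envelope illustration of what that rate change means in practice (the daily volume below is invented; only the per-GB rates come from the announcement):

```python
# Compare Mezmo's old vs new per-GB ingestion rates for a hypothetical workload.
OLD_RATE = 1.80  # USD per GB ingested (previous pricing)
NEW_RATE = 0.20  # USD per GB ingested (new pricing)

def monthly_cost(gb_per_day: float, rate: float, days: int = 30) -> float:
    """Cost of ingesting a steady daily volume for one month."""
    return gb_per_day * days * rate

volume = 500  # GB/day - an illustrative figure, not a Mezmo benchmark
old = monthly_cost(volume, OLD_RATE)
new = monthly_cost(volume, NEW_RATE)
print(f"old ${old:,.0f}/mo, new ${new:,.0f}/mo, saving {1 - NEW_RATE/OLD_RATE:.0%}")
```

For this hypothetical 500 GB/day workload, the monthly bill drops from $27,000 to $3,000 - a saving of roughly 89%.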

Sentry Add (Selective) Logging Support

Sentry has long positioned itself as purely an application monitoring tool and has built up a large user base with its focus on detecting errors in code and application performance issues. The Sentry client runs as a handler which detects errors as they occur and sends the error details to the Sentry backend. Adopting this pattern has, up until now, meant that Sentry was a no-logs platform.

Well, that has now changed. Of course, being Sentry, they will be doing logs in their own idiosyncratic fashion. The idea is to avoid hoovering up massive volumes of log data on the basis that some of it might eventually be of use. Instead, they will aim to target log entries that precede an application error, and these can be used to provide users with more context for debugging. The feature is currently in open beta and, judging by this GitHub discussion, it has attracted plenty of interest in the community.
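The underlying pattern is easy to sketch. The snippet below is our own toy illustration of "keep only the recent lines, surface them with the error" - it is not the Sentry SDK API:

```python
import collections
import logging

class BreadcrumbHandler(logging.Handler):
    """Keep only the most recent log records in memory and surface them
    alongside an error, instead of shipping every line upstream.
    (A toy sketch of the pattern, not the actual Sentry SDK.)"""

    def __init__(self, capacity: int = 50):
        super().__init__()
        self.buffer = collections.deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(self.format(record))

    def context_for_error(self) -> list:
        # The lines that preceded the failure - extra debugging context.
        return list(self.buffer)

log = logging.getLogger("checkout")
handler = BreadcrumbHandler(capacity=3)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(handler)
log.setLevel(logging.INFO)

for i in range(5):
    log.info("step %d ok", i)
try:
    raise ValueError("payment declined")
except ValueError:
    # Only the last 3 lines survive to accompany the error report.
    print(handler.context_for_error())
```

The appeal is that the noise of a healthy system never leaves the process; only the lines adjacent to a failure get shipped.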

Lightrun’s $70m shot in the arm for Developer Driven Observability

There may be scepticism that the phrase Developer Driven Observability is anything more than a marketing slogan, but a group of investors have now swung behind the concept to the tune of $70m in the form of Series B funding for Lightrun.

Lightrun describe their product as an Autonomous Remediation Platform and it counts a number of Fortune 500 companies amongst its customer base. Features such as non-blocking snapshots and dynamic logging simplify debugging and can dramatically reduce MTTR.

There is naturally a question mark over whether this is really an observability tool or a debugging tool - a grey area that only widens as more tooling emerges and dividing lines blur. Enabling developers to generate their own ephemeral telemetry streams is a radical but clearly effective alternative to attempting to coax them into the big tent of the main corporate enterprise platform.

Products

Grepr - intelligent ingestion-on-demand

Yes - it’s another pipelining tool, but Grepr has a unique and innovative architecture for helping customers to reduce ingestion costs. Like many other similar tools, Grepr works as an intermediary between your telemetry sources and your observability backend. It differentiates itself by analysing your telemetry and only passing on valuable signals. The rest is parked in low-cost S3 storage.

If you need to investigate an incident, Grepr can then select the relevant telemetry and forward it on to your backend. An added benefit is that the ‘parked’ data is not siloed away - it is held in a searchable data lake fully under user control. We have seen a lot of great pipelining solutions, but we really love the simple ingenuity of this model.
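As a rough sketch of the park-and-replay model (our own illustration of the idea, with a deliberately naive notion of "valuable" - not Grepr's actual logic):

```python
# Toy sketch of a park-and-replay telemetry pipeline:
# forward high-value lines to the backend, park the rest cheaply,
# and replay parked lines on demand during an incident.
forwarded = []  # stand-in for the observability backend
parked = []     # stand-in for low-cost S3 / data-lake storage

def ingest(line: str) -> None:
    if "ERROR" in line or "WARN" in line:  # naive "valuable signal" rule
        forwarded.append(line)
    else:
        parked.append(line)

def replay(needle: str) -> list:
    """During an incident, pull matching parked lines back into the backend."""
    hits = [line for line in parked if needle in line]
    forwarded.extend(hits)
    return hits  # parked copies stay put - the data lake is not drained

for line in ["INFO cache warm", "DEBUG req id=7",
             "ERROR db timeout", "INFO req id=7 done"]:
    ingest(line)

print(len(forwarded), len(parked))  # before replay: 1 forwarded, 3 parked
replay("id=7")
print(len(forwarded), len(parked))  # after replay: the id=7 lines join the backend
```

The key property is the last comment: replaying enriches the backend without ever discarding the parked archive.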

Cardinal - fixing the stuff you break when you move fast and break stuff

We all know the mantra of “move fast and break things”. For teams that actually live that ideal, Cardinal is here to help. The people behind the project are battle-hardened veterans of Netflix engineering teams who have been at the forefront of hyper-scale observability engineering since the days before the term ‘observability’ was even coined.

The premise of Cardinal is that if you are coding at warp speed, you will also want to debug at the same tempo. At the heart of Cardinal is Chip - an AI agent that takes responsibility for instrumentation so that developers can focus on coding.

As well as auto-instrumentation, it can also autonomously enrich your telemetry. By continually observing and learning about your system it is also able to dynamically create SLOs. If you are a high velocity team looking for low-effort observability, this could be just the tool you need.

Parseable Steps out of the Shadows

Parseable may well be one of the best open source observability stacks you have never heard of. They may not be great at blowing their own trumpet but they have been quietly building an impressive product that combines high performance and low cost.

The latest release includes a complete rebuild of the system’s UI, with a focus on simplicity and clarity. As you probably know, we are great fans of SQL-based observability querying, so the Prism SQL Query Editor is a feature we are particularly fond of. The system is based on S3 but you can bring your own backend, so it has particular appeal to users who need both performance and data sovereignty.
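For a flavour of what SQL-based log querying looks like, here is the sort of query you might run against a log stream. This is purely illustrative - the stream and column names are our assumptions, not taken from Parseable’s docs:

```sql
-- Count server errors per status code over the last hour.
-- Stream and column names are invented for illustration.
SELECT status, COUNT(*) AS hits
FROM frontend_logs
WHERE event_time > NOW() - INTERVAL '1 hour'
  AND status >= 500
GROUP BY status
ORDER BY hits DESC;
```

For anyone who already thinks in SQL, being able to interrogate logs this way removes the need to learn yet another bespoke query language.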

Neurox - Taking Observability to the GPU

As we mentioned in our introduction, the observability market is evolving in multiple dimensions - and one of these is the growth of specialist observability tooling. Neurox is a really fascinating example of this trend.

There is already a vibrant and dynamic market in AI observability tooling. Much of this deals with application-level concerns such as tracing, token usage, performance and response quality. Neurox complement this with analytics that go right to the heart of AI workloads - i.e. GPU performance.

GPU costs can burn a serious hole in your budget, but tracking usage across multiple clusters and clouds can be difficult. The premise of Neurox is that it is the only tool on the market that combines GPU monitoring with FinOps smarts so that you can not only get detailed analytics on AI costs but also identify issues such as under-utilisation.

From the Blogosphere

Observability - a View From a VC

You probably don't have to spend too long in the observability space to discover that it is an incredibly dynamic and unpredictable market. For those of us who are practitioners or commentators this just adds to the inherent fascination of the discipline. Keeping pace with the speed of technological change is pretty exhausting. Spare a thought, then, for those who not only need to understand where the market is today but also place bets on where it is going.

Investment analysts have the time and resources for in-depth market research. They are also answerable to their clients, which means that their conclusions need to be evidence-based rather than grinding a particular axe. For us, this makes this article by Megan Reynolds, a Principal at Vertex Ventures, a really interesting perspective on current and future trends in the market. Overall, this is a savvy, vendor-neutral and well-researched investigation into the observability market, and articles that tick all of those boxes don’t come along too often.

Observability Spending - How Much is too Much?

In observability circles, the publication of a new thought piece by Charity Majors is something of a literary event. Her latest major article looks at the much-discussed topic of observability costs. This might seem like a topic that has been done to death, but this is actually an original and entertaining take. Instead of regurgitating the usual horror stories about bill shock, it asks the fundamental question of “How much should I be spending on observability?”

Like pretty much every other question in computing the answer is “it depends”, but this article provides the valuable service of defining some parameters for making a meaningful evaluation. It also re-frames the question by asking whether observability should be regarded as a cost or an investment. This, in turn, means not just thinking about raw numbers but taking a strategic view of the function of observability for your business. As an added bonus the article starts with a great vignette of how the internet can turn a throwaway comment into a full-scale meme.

Not all telemetry is good telemetry…

Instinctively, we would probably all think that telemetry, like eating your greens, is always going to be good for you. Surprisingly, former Grafana engineer Juraci has a different take. Not only is there bad telemetry - there is a lot of bad telemetry. And this is not just bad because it is redundant or duplicative. We are talking qualitatively bad - bad in and of itself.

Although the article does not get right down into the weeds of bad telemetry it does offer a number of signposts to point you in the right direction for pruning your garden and building an evergreen observability practice. We have been lucky enough to have a sneak preview of the project Juraci is working on, and we think that there really is something very beautiful growing in this o11ygarden. In the meantime, hopefully this article will give you a taster of what is to come.

AI

MCP and the Future of Observability Engineering

MCP (Model Context Protocol) is probably the hottest topic in AI at the moment. Which kind of makes it the hottest topic in IT. It gives AI agents a standard protocol for accessing pretty much anything and everything - databases, file systems, APIs. Using a simple set of abstractions, agents can potentially be given unlimited opportunities. It seems unstoppable because it offers the promise of connecting pretty much everything to everything with almost no friction. What could possibly go wrong - well, everything - but the convenience factor is impossible to resist.

Roni Dover is the CEO of Digma and in this exuberant article he sketches out a potential future for observability applications based on autonomous agents. We think that this is an important article because Roni is not just a commentator on the sidelines, he is actually in the process of pivoting his whole business to harness this new paradigm. This is heady stuff and the article is a scintillating read.

OpenInference - a Standard for AI Telemetry

If you are just grappling with rolling out OpenTelemetry, the last thing you may want to hear about is yet another tool for application instrumentation. Do not despair though - OpenInference is not a competitor or a replacement for your existing SDKs. Instead it is designed to be complementary to OpenTelemetry and its goal is to provide deeper context on LLM invocations.

The framework is based on a set of semantic conventions and open source plugins. A quick browse of the documentation reveals that it contains a really rich set of attributes for tracking your LLM calls. The packages are OSS and the GitHub project is maintained by the company behind the Arize AI engineering platform. At present there are libraries for every major AI provider as well as technologies such as Langchain, MCP, LlamaIndex and many more.
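To give a flavour, here is a sketch of the kind of attributes such conventions attach to a span for an LLM call. The key names are modelled on the OpenInference conventions but should be checked against the spec; the values are invented:

```python
# Illustrative OpenInference-style span attributes for a single LLM call.
# Key names are modelled on the published conventions (verify against the
# spec before relying on them); all values are invented.
llm_span_attributes = {
    "openinference.span.kind": "LLM",
    "llm.model_name": "gpt-4o-mini",
    "llm.token_count.prompt": 412,
    "llm.token_count.completion": 57,
    "llm.invocation_parameters": '{"temperature": 0.2}',
}

# In real code these would be set on the active OpenTelemetry span,
# e.g. span.set_attribute(key, value); here we just derive a total.
total_tokens = (llm_span_attributes["llm.token_count.prompt"]
                + llm_span_attributes["llm.token_count.completion"])
print(total_tokens)
```

Because the result is still an ordinary OTel span, it flows through your existing collectors and backends; the extra attributes simply make LLM calls queryable.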

OpenTelemetry

Is it a log or is it an event?

Application logging is an activity that may appear to be trivial and, at the level of the individual line, it probably is. As the travails experienced by the OTel logging SIG showed, however, trying to develop standards around logging can be a Herculean task.

A recent article on the OTel blog by Austin Parker adds a whole new dimension to the question by focusing not on the structure of a log, but on its purpose and proper location. One brief sentence in the article, “we believe that most log records should be events”, has potentially profound consequences for logging practice.

For a start, it means that logging (or event creation) is an activity we should treat with greater precision. It also has wider implications though. For example, unless your observability backend is able to surface and meaningfully correlate those events, they are of little value. This is only a brief and apparently innocuous article, but it possibly contains within it the seeds for some substantial refactoring.
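A tiny illustration of the distinction (our own example - the event name and fields are invented): the same fact can be recorded as a free-text log line or as an event, a named occurrence with structured attributes that a backend can actually filter and correlate.

```python
import json

# Log-shaped: fine for a human scanning a terminal, opaque to a query engine.
log_line = "user 42 checked out cart 9 in 183ms"

# Event-shaped: a name plus structured attributes, ready for correlation.
# (Event name and field names are invented for illustration.)
event = {
    "name": "cart.checkout",
    "attributes": {"user_id": 42, "cart_id": 9, "duration_ms": 183},
}
print(json.dumps(event, sort_keys=True))
```

Nothing is lost in the second form, but a backend can now ask questions like “all checkouts over 150ms for user 42” without resorting to regex archaeology.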

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the renowned systems theorist Russell L Ackoff:

“A system is never the sum of its parts; it’s the product of their interaction.”

About Observability 360

Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.

The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.