CostLens - Observability Costs Demystified!

Prometheus 3.0 Unveiled | SigNoz Deliver The Goods

Welcome to Edition #25 of the newsletter!

Introducing CostLens

In this fortnight’s edition we are very pleased to introduce CostLens, an ambitious new feature on the Observability 360 web site. If you have ever been through the process of procuring an observability system, you will know that getting a handle on costs can be a real chore. Read our summary below to find out how CostLens can ease the pain!

Open Source Observability Day

Next month sees the inaugural Open Source Observability Day - a major new addition to the observability events calendar - and I am very proud to share that Observability 360 has been selected as a Community Partner for the event. This is a one-day virtual event that has attracted a host of A-List speakers, including Liz Rice, Charity Majors and Romain Khavronenko. You can find out more and register here.

Elastic and OpenSearch Go Their Own Way

It seems like the resolution to the Elastic fork is going to be more of a Fleetwood Mac story than a Disney fairytale ending. Just like the legendary rock band, the two parties are saying that it is the end of the affair, and they will each Go Their Own Way. ElasticSearch has returned to the Open Source fold and OpenSearch, as we cover below, is striking out in a new direction under the auspices of the Linux Foundation.

Feedback 👍👎

We love to hear your feedback. Let us know how we are doing at:

NEWS

CostLens - Making Sense of Observability Costs

Over the past few months, we have been very hard at work building CostLens - a major new feature on the Observability 360 web site. Anybody who has ever looked into procuring an observability system will know that understanding observability system costs can be hard work.

This is not the fault of any particular vendor. It is simply a function of the fact that vendors have different architectures, different products and different business models. You cannot blame Vendor A for having a pricing model that is different to Vendor B. At the same time though, ploughing through the small print and policies of multiple vendors is a laborious exercise. CostLens is a unique service that provides you with the information you need and enables you to estimate costs for a number of the leading systems on the market.

CostLens consists of three principal components:

  • an overall guide to observability pricing

  • advice on pricing policy for a number of leading systems on the market

  • a unique observability costs calculator
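To give a feel for what a costs calculator has to deal with, here is a minimal Python sketch. It is not the CostLens implementation: the two pricing dimensions (a per-GB ingest charge and a per-host fee), the free allowance, and every price shown are hypothetical placeholders chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingModel:
    """Hypothetical pricing model; every rate here is invented for illustration."""
    name: str
    per_gb_ingested: float  # USD per GB of telemetry ingested per month
    per_host: float         # USD per monitored host per month
    free_gb: float = 0.0    # monthly ingest allowance that is not billed


def monthly_cost(model: PricingModel, gb_ingested: float, hosts: int) -> float:
    """Estimate a monthly bill from ingest volume and host count."""
    billable_gb = max(gb_ingested - model.free_gb, 0.0)
    return billable_gb * model.per_gb_ingested + hosts * model.per_host


# Two made-up vendors with different business models.
VENDOR_A = PricingModel("Vendor A", per_gb_ingested=0.30, per_host=0.0, free_gb=50.0)
VENDOR_B = PricingModel("Vendor B", per_gb_ingested=0.10, per_host=15.0)

if __name__ == "__main__":
    for vendor in (VENDOR_A, VENDOR_B):
        print(vendor.name, round(monthly_cost(vendor, gb_ingested=1050, hosts=20), 2))
```

Even in this toy version, the cheaper vendor depends entirely on the mix of data volume and host count, which is why comparing real price lists by hand is such a chore.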

If you have any feedback, we would love to hear from you!

If you are a vendor and you would like your system to be included, please get in touch!

OpenSearch Marches On

In our last edition, our lead story was the dramatic news that ElasticSearch was returning to Open Source. This naturally raised a question of what the future might hold for OpenSearch - the project that forked from ElasticSearch in 2021. Well, that question has been answered in emphatic style with the news that AWS, the primary project maintainer, has now transferred OpenSearch to the Linux Foundation, where it will be managed under the auspices of the OpenSearch Software Foundation.

The announcement was made at this week’s Open Source Summit in Vienna and marks the start of a new chapter in the history of the project. Up until now, AWS has acted as the custodian of OpenSearch, but the transfer to the Linux Foundation means that the project is now under vendor-neutral governance. Although AWS are still committed to supporting the project, they are hoping that this move will encourage more involvement from the open source community.

Meanwhile, the OpenSearch Engineering team have unveiled a comprehensive project roadmap setting out the strategic direction for the platform.

Middleware Reloaded!

Middleware is a “unified observability platform” that we first covered back in Edition 5 of the newsletter. The product first launched in September 2022 and, as a relative newcomer to the market, the company pride themselves on their agility and their ability to adapt the software to customer needs. Having listened to customer feedback the company have recently released a major upgrade to the product.

One of the principal concerns of many observability customers is managing logging volumes, and Middleware have extensively rebuilt their logging pipeline with additional flags and features. They have also supercharged the on-boarding process and claim that the system can be installed within one minute!

Like most vendors, they have moved beyond the basic MELTS paradigm and towards an Observability+ approach. This means bundling additional capabilities such as RUM and code-free dashboard creation. They have also joined some of the industry pacesetters by adding LLM Observability to their stack.

Prometheus 3.0 Beta Unveiled at PromCon

Probably the most important (but least unexpected) announcement at last week’s PromCon in Berlin was the Beta release of Prometheus 3.0. If anybody thought the project was running out of steam, then this release showed that Prometheus is still the 800lb gorilla of the metrics jungle.

Incredibly, it is now seven years since the release of Prometheus 2.0, and the latest version is the culmination of over 7,500 commits. The biggest and most immediately visible update is a complete revamp of the Prometheus UI. As Prometheus founder Julius Volz himself observed in a separate blog post, the existing UI dates back to 2019 and is now a little bit jaded.

Possibly the major strategic update is the set of enhancements to OpenTelemetry compatibility, as Prometheus positions itself as the default choice for storing OpenTelemetry metrics. There is also a new version of the Remote Write feature, which supports a number of new elements as well as reducing CPU usage and improving compression performance.

SigNoz Launch Week

These days it seems like all the coolest kids on the observability block are having launch weeks - and this week it’s open source champions SigNoz who are stepping up to show off their latest moves. Managing telemetry volumes is top of mind for customers across the board, and one of the first new features to be unveiled is “Ingest Guard”, which, as the name might suggest, enables customers to reduce costs by regulating and filtering telemetry flows at source.

SigNoz has always been natively oTel compliant, and they are now capitalising on this with their implementation of telemetry correlation. This enables a much more joined-up debugging experience as engineers can, for example, drill down directly from logs for a particular Kubernetes node and view its associated metrics.

For us, the most important feature announced so far is anomaly detection. Whilst SigNoz readily admit that this iteration is not the finished article, we believe that anomaly detection is a technology that can genuinely help observability systems deliver RoI and will soon become a must-have feature.

Launch Weeks can sometimes be a bit of a let-down - often amounting to a series of minor tweaks. In this case though, SigNoz have really pulled some rabbits out of the hat and delivered some major functional enhancements to the product.

Products

Edge Delta - Injecting AI Into Pipelines

As telemetry volumes have skyrocketed, pipelines have emerged as a critical component of observability architecture for any user that is exporting at scale. There are already a number of solutions on the market - provided either as part of full stack platforms such as Datadog and Chronosphere or stand-alone solutions such as Mezmo.

Naturally, all of these solutions provide features such as filtering to reduce volumes and routing to offload telemetry to more cost-effective backends. Edge Delta provides these capabilities but also ups the ante with features such as AI-powered anomaly detection. This uses a proprietary algorithm to monitor log streams for potentially problematic patterns - which can then be used as the basis for generating alerts.
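For readers new to pipelines, the filtering and routing described above can be sketched in a few lines of Python. This is a generic illustration, not Edge Delta's (or any vendor's) implementation: the record shape, the level sets and the backend names `hot_backend` and `archive_backend` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rule set: levels to discard, and levels worth a premium backend.
DROP_LEVELS = frozenset({"DEBUG", "TRACE"})
HOT_LEVELS = frozenset({"WARN", "ERROR", "FATAL"})


def process(records):
    """Filter out noisy records, then route the rest to a backend by severity."""
    routed = defaultdict(list)
    for record in records:
        level = str(record.get("level", "INFO")).upper()
        if level in DROP_LEVELS:
            continue  # filtering: this record never leaves the source
        # routing: urgent records go to the fast (expensive) backend,
        # everything else goes to cheaper storage
        destination = "hot_backend" if level in HOT_LEVELS else "archive_backend"
        routed[destination].append(record)
    return dict(routed)


if __name__ == "__main__":
    logs = [
        {"level": "debug", "msg": "cache miss"},
        {"level": "error", "msg": "payment failed"},
        {"level": "info", "msg": "user logged in"},
    ]
    print(process(logs))
```

Real pipelines apply far richer rules (sampling, aggregation, redaction), but the cost saving comes from the same two moves: drop what you do not need and send the rest to the cheapest backend that will do.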

Another very powerful feature is Edge Delta’s ability to streamline the enormous volumes of metrics generated by Kubernetes instances. They claim they can achieve reductions of up to 90% by eliminating redundant metrics.

From the Blogosphere

A Tracetest Refresher

If you have been following the newsletter for any length of time, you will know that we are huge fans of Tracetest. If you are not familiar with the product, it is based on the ingenious insight that oTel traces can be used not only to map the path of a request through a system but also to verify process outcomes. The result is that all the legwork of service discovery is eliminated and you can build tests with incredible ease.

This highly readable and well-structured article by Adnan Rahic assumes no prior knowledge of the product and will take you step by step through the process for setting up distributed testing in minutes rather than hours or days.

eBPF Goes Turing Complete!

We are probably at the stage where eBPF is now part of the mainstream rather than being a challenger technology. Naturally, a major reason for its adoption is not just its power but also the many safeguards built in to ensure that it does not compromise the Linux kernel. One downside of this is that the verifier needs to make deterministic analyses of eBPF code - which has limited the scope of the instruction set.

As this article explains, optimisations to the verifier now mean that the shackles are off! The complexity limit of the verifier has now been raised from four thousand instructions to one million. This means that eBPF programs can now be “Turing complete” - in the sense of being able to solve any arbitrarily complex problem.

This article not only gives great insight into some of the internals of eBPF, it is also highly accessible to those of us who are not kernel programmers. The clincher that makes this a must read is the simulation of Conway’s Game of Life. Prepare to spend your afternoon becoming an eBPF hacker!

Using ClickHouse As A Logging Platform

In some ways, logs are a bit like the magical brooms in the tale of The Sorcerer’s Apprentice. At first, they are employed to do a simple job, but then they become an overwhelming and uncontainable force. Companies operating at very large scale are finding many existing observability backends to be unaffordable.

This article on the ClickHouse blog looks at how trip.com migrated their log management operation from ElasticSearch to ClickHouse. The numbers involved are pretty eye-watering - 85 trillion rows and 50+PB of data. However, this is not just an article about dazzling the reader with big numbers - it is also something of a masterclass in logging architecture. It chronicles the evolution of the trip.com logging platform as well as clearly explaining the technical rationale behind the migration to ClickHouse. Although the migration was a huge undertaking, the stats for reduced TCO and improved performance would seem to indicate that the effort was worthwhile.

Who Monitors The Monitors?

It is a kind of philosophical conundrum of observability. You have a mission critical system which you need to monitor, but what happens if the monitoring system breaks? How can you monitor the monitor without opening up an infinite chain of regression? In this Medium article, Oren Shoval, who is an SRE at Broadcom, describes one solution for fixing observability’s own Incompleteness Theorem.

Oren breaks down the monitoring process into a four-layer stack and looks at how additional safeguards can be applied at each of these levels. As he also notes, a breakage in your monitoring tooling does not necessarily have to be something as catastrophic and visible as a system failure; it can also be the result of a more subtle issue such as the deployment of a mis-configured alerting rule. Overall, this is a really useful and practical contribution to the discussion on meta-observability.
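One widely used safeguard in this area is the "dead man's switch": the monitoring system emits a regular heartbeat, and a small independent watcher raises the alarm when the heartbeat stops. The Python sketch below illustrates that general pattern only. It is not taken from Oren's article and does not claim to match any of his four layers; the class name and the 60-second threshold are hypothetical.

```python
import time


class HeartbeatWatchdog:
    """Flags a monitoring system as silent if it stops sending heartbeats."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_beat = clock()

    def beat(self):
        """Called by the monitoring system each time it proves it is alive."""
        self._last_beat = self._clock()

    def is_monitor_silent(self):
        """True when no heartbeat has arrived within the allowed window."""
        return self._clock() - self._last_beat > self.max_silence_seconds


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(max_silence_seconds=60)
    watchdog.beat()
    print("monitor silent?", watchdog.is_monitor_silent())
```

The watcher is deliberately tiny: the less it does, the less there is to break, which is how this pattern avoids needing yet another monitor on top of it.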

AI

LLM Monitoring - A Conceptual Overview

There are a lot of great practical articles on the web with step-by-step guides for setting up LLM Monitoring and building out RAG tooling. This Medium article by Josh Poduska steps back and looks at some of the higher level issues. Having carefully researched a number of frameworks, he summarises the state of play and sets out some overarching principles for building an LLM observability strategy.

This article does veer into rather technical terrain - covering concepts and terminology such as K-Means and Hellinger Distance. Having said that, you do not need to be a data scientist to get value from this article. In particular, the distinctions between evaluating, tracking and monitoring LLMs provide a useful conceptual framework and there are also many useful references for further research.
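If Hellinger Distance is new to you, it is simply a measure of how far apart two probability distributions are, running from 0 (identical) to 1 (no overlap). The short Python sketch below shows the standard formula for discrete distributions. The function name and the example distributions are our own illustration and are not taken from the article.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    p and q are equal-length sequences of probabilities that each sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared_diff = sum(
        (math.sqrt(p_i) - math.sqrt(q_i)) ** 2 for p_i, q_i in zip(p, q)
    )
    return math.sqrt(squared_diff) / math.sqrt(2)


if __name__ == "__main__":
    # Hypothetical example: share of responses per topic last week vs this week.
    last_week = [0.5, 0.3, 0.2]
    this_week = [0.2, 0.3, 0.5]
    print(hellinger_distance(last_week, this_week))
```

A value that creeps upwards over time is a hint that the inputs or outputs being compared have drifted away from the baseline.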

Using OpenLLMetry and Splunk for LLM Monitoring

OpenLLMetry is, as the name may suggest, an oTel compliant tool for LLM observability. The software is built and maintained by Traceloop - who we first covered back in Edition 12 of the newsletter in March of this year. The system is built in Python and is designed with ease of use in mind - you can get started with just two lines of code.

The product has developed rapidly over the past six months and has built out support for a large number of LLMs, Vector DBs and frameworks. Since it is OpenTelemetry compliant, its telemetry can also be sent to a wide range of backends. This excellent and highly informative article on the Splunk blog looks at hooking OpenLLMetry up to a Splunk backend (surprise surprise!). It also includes code for a sample Python Flask app so that you can build your own LLM test.

Videos

The OpenObservability podcast by Dotan Horovits consistently sets the standard for the format. He has hosted a highly impressive array of guests and the conversations maintain a clear technical focus. As we mentioned in our news section, Prometheus 3.0 was unveiled last week, and this interview with Prometheus founder Julius Volz covers a huge amount of ground.

As well as discussing the history of Prometheus and the background to the 3.0 release, Julius also eloquently articulates the formidable obstacles to creating interoperability between the oTel and Prometheus approaches. For example, how do you manage traffic between endpoints when one is push-based and the other is pull-based? If you haven’t got time for the full video, Dotan has also helpfully summarised the discussion in this article.
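To see why push versus pull is awkward, consider the simplest possible bridge: something has to sit in the middle, accept pushed samples, and hold on to them until a scraper asks. The Python sketch below is a toy illustration of that idea only. It is not how Prometheus or the OpenTelemetry Collector implement it, and the class name and output format are hypothetical simplifications.

```python
class PushToPullBridge:
    """Buffers pushed samples so that a pull-based scraper can collect them."""

    def __init__(self):
        self._latest = {}  # metric name -> most recently pushed value

    def push(self, name, value):
        """Push side: the sender decides when data arrives."""
        self._latest[name] = float(value)

    def scrape(self):
        """Pull side: the scraper decides when data is read."""
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self._latest.items())
        )


if __name__ == "__main__":
    bridge = PushToPullBridge()
    bridge.push("requests_total", 3)
    bridge.push("requests_total", 5)  # overwrites the earlier sample
    bridge.push("queue_depth", 2)
    print(bridge.scrape())
```

Even this toy exposes the hard questions: samples pushed between scrapes are silently overwritten, and the bridge cannot tell a sender that has gone quiet from one that has died.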

Thomas Graf Unveils The eBPF Roadmap

Whilst PromCon was rocking some major metrics announcements in Berlin, there was also another major event creating a buzz of its own - the eBPF Summit. eBPF is not just a paradigm-shifting tech, it is also an ecosystem which is experiencing a very rapid rate of innovation.

The summit was an online event which provided a showcase of the latest applications and advances. In this relatively brief video, Isovalent co-founder and CTO Thomas Graf reveals some exciting milestones such as eBPF for GPUs and even the unthinkable - eBPF for Windows! You can also find a full YouTube playlist for the summit here.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from Clifford Stoll:

“Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.”