Demystifying Observability 2.0

SigLens Break Through the Petabyte Barrier | Apache SkyWalking Turns 10

Welcome to Edition #19 of the newsletter!

Observability 2.0

For many people, the term ‘observability’ is a relatively recent addition to the technological lexicon. They may be surprised, then, to find that the field is already rife with talk of Observability 2.0. Thanks to thought leaders at vendors such as Honeycomb, the term Observability 2.0 has actually been circulating for some time. In this edition, we feature an article by Adriana Villela, as she aims to flesh out this concept. We don’t think that this is the final word on the subject, but it is another valuable contribution to an ongoing discussion as professionals re-evaluate their observability practice.

Getting Smart With Costs

The issue of cost management is closely entwined with the discussion around Observability 2.0, especially as a number of vendors are now architecting platforms that offer almost unlimited cardinality at low cost. In this edition we include an article from VictoriaMetrics showing how costs can be contained at the tactical level, as well as coverage of Karpenter, which enables teams to achieve dramatic cost reductions at the infrastructure level.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Apache SkyWalking Reaches Version 10

Apache SkyWalking may not be one of the highest-profile observability tools on the market, but it boasts an impressive user base and a large and thriving ecosystem. It is also one of the more mature products in the space, having recently reached the milestone of its tenth major release. As befits such a milestone, the release boasts a slew of impressive features. First up is eBPF-based monitoring of Kubernetes network traffic: SkyWalking uses access logs to map traffic flows and builds some genuinely impressive dashboards from them. In addition, there is dedicated monitoring for ClickHouse and Apache ActiveMQ instances, as well as a major upgrade to SkyWalking’s unique layers model.

Demystifying Observability 2.0

Many textbooks on observability define the subject in terms of the classic three pillars of logs, metrics and traces. Many thinkers in the space regard this model as too narrow and mechanistic, arguing that observability encompasses a broader set of practices and goals. Charity Majors, CTO at Honeycomb, is often credited with coining the term Observability 2.0, and in this Medium article the multi-talented Adriana Villela aims to flesh out the concept. The article focuses on some key themes that define the essence of version 2.0, such as a more developer-centred approach, clearer alignment with service level objectives and a transition to an events-oriented approach to telemetry.

Embrace and Grafana Forge Mobile Observability Partnership

Leading mobile observability vendor Embrace have had a momentous start to the year and are keeping up the pace with the announcement of a “go-to-market” (GTM) agreement with Grafana. Collaboration between the companies is nothing new, as integrations between Grafana and Embrace have been in place for some time. Whilst the GTM agreement is not a legally binding arrangement, it clearly signals a mutually beneficial strategic alignment. For users, the benefit will be deeper technical integrations and the promise of correlating backend and mobile telemetry to achieve end-to-end context within a single UI.

SigLens Break Through the Petabyte Barrier

SigLens burst onto the observability scene earlier in the year, posting some amazing benchmarks for their eBPF-based stack. The company have now unveiled SigScalr, a technology capable of ingesting telemetry loads of up to one petabyte of data per day.

According to stats posted on the SigLens blog, the performance gains achieved by SigScalr would translate to a 98% reduction in infrastructure costs when importing 1 petabyte of data on AWS. Even with SigScalr, though, the cost of importing your 1PB of data would still stack up to $1k per day - so most of us will not get sign-off to try this out in our dev environment.
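Taking those figures at face value, a quick back-of-the-envelope check (a sketch based purely on the numbers quoted above, assuming the $1k per day is the post-reduction cost) gives a sense of the baseline being claimed:

```python
# Back-of-the-envelope check of the quoted SigLens figures.
# Assumes the $1,000/day is the cost *after* the claimed 98% reduction.
reduced_cost_per_day = 1_000   # USD per day for 1 PB/day with SigScalr
claimed_reduction = 0.98       # claimed 98% infrastructure cost reduction

implied_baseline = reduced_cost_per_day / (1 - claimed_reduction)
print(f"Implied baseline cost: ${implied_baseline:,.0f} per day")  # ~$50,000/day
```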

Whilst the 1PB per day figure is a great headline grabber, and most of us will never need to work at this scale, these benchmarks nevertheless highlight the ability of challenger companies to outshine established vendors on performance and scalability. If you want to do some benchmarking of your own, the company have also released an open source Test Data Generator so that you can throw as much data as you like at SigScalr and see how it performs.

Observability Practice

The CNCF’s Other Observability Jewels

The CNCF is well known for its stewardship of world-class projects such as OpenTelemetry and Kubernetes. However, incubation and oversight of software projects is not its only concern. It also hosts a number of TAGs (Technical Advisory Groups) which aim to promote the CNCF’s values and foster best practices in areas such as security, sustainability and, of course, observability. One of the interesting outputs of the Observability TAG is the Observability Whitepaper. Whilst the OpenTelemetry project focuses on implementation at the code level, this whitepaper provides a valuable theoretical complement. It serves as a reflection of the CNCF’s philosophy as well as a distillation of the experience of highly skilled practitioners. What’s more, as the document is open source, the whole community is able to contribute and share insights. If you would like to learn about the thinking of some of the experts who are charting the course for cloud native observability, then the whitepaper is an essential read.

The Pipeline Advantage

Telemetry pipelines have become a must-have for many organisations as a tool for routing, filtering and transforming telemetry flows. This article on the Chronosphere website is a very useful starting point for any organisation looking to incorporate best practice into their pipeline strategy. Based on feedback from hundreds of observability practitioners, it distils the most commonly requested capabilities into a practical wish list, drawing on the hard-won experience of professionals in the field.

Cost Management

Nailing K8S Costs With Karpenter

The recent OpenTelemetry user survey revealed that most oTel installations run on Kubernetes. Whilst this offers benefits for scalability and resilience, one obvious downside is cost - especially when systems have to be replicated in non-prod environments. In cloud environments, the use of spot instances can potentially yield massive cost savings, but they don’t play nicely with the default node provisioning tools. Karpenter has proven to be a really powerful solution to this problem, seamlessly handling spot interruptions and launching replacement nodes.

Whilst Karpenter has been a feature of the AWS landscape for a while, running it on Azure has been problematic. This article on the Microsoft Tech Community site provides a really useful guide to configuring and running Karpenter in an AKS environment.

VM’s Smart Guide to Reducing Metrics Costs

If you are familiar with VictoriaMetrics, you will know that they regularly produce great content on minimising observability costs. This highly practical article on the VM blog illustrates a number of techniques for identifying and remediating the bottlenecks and inefficiencies that can inflate metrics costs. The article suggests some very clever tweaks - such as reducing the number of significant figures in metric values - which can have a surprisingly powerful effect.

Even though the examples in the article use VictoriaMetrics, the author states that the techniques should also be applicable to other open source solutions such as Prometheus, Thanos and Mimir. If you love squeezing every ounce of performance out of your systems, then this article is for you.
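To see why trimming significant figures has such an outsized effect, consider a rough sketch of the idea in Python (illustrative only - the round_sig helper below is hypothetical and not taken from the article). Rounding samples before ingestion produces longer runs of similar values, which time-series databases compress far more efficiently:

```python
# Illustrative sketch: round metric samples to a fixed number of
# significant figures before ingestion. Lower-precision values compress
# much better in a time-series database, which is where the storage
# and cost savings come from.
import math

def round_sig(value: float, sig_figs: int = 3) -> float:
    """Round a sample to `sig_figs` significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, sig_figs - 1 - exponent)

samples = [0.123456789, 987654.321, 0.000123456]
print([round_sig(s) for s in samples])  # -> [0.123, 988000.0, 0.000123]
```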

OpenTelemetry

A Guided Tour of the oTel Demo

OpenTelemetry is not just a standard or a set of specifications; it also comprises multiple SDKs and powerful tooling in the form of the oTel Collector. One of the other great resources in the oTel locker is the OpenTelemetry Demo Application. The demo was built as a resource to assist developers with oTel instrumentation and is a realistic simulation of an enterprise microservice-based architecture. As well as the individual microservices, the demo harnesses technologies such as Kafka and Redis, along with instances of Jaeger and Prometheus. This in-depth article by DevOps Engineer Eromosele Akhigbe provides a step-by-step guide to getting started with the Demo on an AWS EC2 instance.

OpenLIT - oTel Compliant LLM Monitoring

In Edition 15 of the newsletter, we mentioned the Semantic Conventions for Generative AI being developed by the Semantic Conventions Working Group. In a recent article on the oTel blog, Ishan Jain of Grafana introduced OpenLIT, an open source tool for monitoring LLM applications which is built on OpenTelemetry and supports the oTel semantic conventions.

OpenLIT is powerful but also surprisingly simple - it enables you to instrument your LLM apps with a single line of Python. It supports all of the major LLMs as well as vector databases such as Pinecone, ChromaDB and others. The article provides an easy-to-follow guide to running OpenLIT, using Prometheus and Jaeger as backends for metrics and traces, whilst using Grafana for visualisation.
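For a flavour of how little code is involved, here is a minimal sketch (based on the OpenLIT documentation at the time of writing - the endpoint below is a placeholder, so check the article for the exact settings used with Prometheus and Jaeger):

```python
# Minimal sketch of instrumenting an LLM application with OpenLIT.
# The OTLP endpoint is a placeholder; point it at your own collector
# or backend as described in the article.
import openlit

openlit.init(otlp_endpoint="http://localhost:4318")

# From here on, calls to supported LLM providers and vector databases
# are traced and measured automatically, following the OTel GenAI
# semantic conventions.
```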

Videos

Building An Observability Culture

Arguably, one of the defining characteristics of Observability 2.0 is that it treats observability as a cultural practice embedded across the organisation, rather than a set of rules that an Observability/DevOps/Platform Engineering team imposes on development teams. Daniel Gomez Blanco is a Principal Engineer at Skyscanner as well as being the author of the excellent Practical OpenTelemetry textbook.

In this video, recorded at the London Observability Engineering Meetup, he provides fascinating insights into how he and his team transformed their observability architecture and fostered an observability mindset at Skyscanner. There are a number of valuable takeaways, both technical and cultural, from this recording.

Julius Volz - The Anatomy of a Histogram

Histograms are a simple but incredibly powerful tool for visualising the distribution of values for a particular metric. This video is a great opportunity to see Prometheus co-founder Julius Volz discussing the theory and practice of histograms and how they are implemented in Prometheus.

The video is short but highly informative and punctuated with clear and helpful examples. It walks through manually creating Prometheus histograms in Go and then querying them in PromQL whilst also revealing some of the inner workings of the scraping process.
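The walkthrough in the video is in Go, but the same idea can be sketched with the official Python client for a quick taste (a minimal sketch, not the code from the video):

```python
# Minimal sketch of a Prometheus histogram using the official Python
# client (the video builds the equivalent in Go).
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "demo_request_latency_seconds",
    "Simulated request latency",
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        REQUEST_LATENCY.observe(random.expovariate(5))  # record one sample
        time.sleep(0.1)

# A typical PromQL query over the resulting buckets:
#   histogram_quantile(0.95,
#     sum(rate(demo_request_latency_seconds_bucket[5m])) by (le))
```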

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the archivist and historian Robert S. Arrighi:

“The way you learn anything is that something fails, and you figure out how not to have it fail again”.