IBM Swipe Right on Coralogix

oTel Me How To Do It! | Say Hello To Observability People

Welcome to Edition #20 of the Newsletter!

Celebrating KuberTENes

Unless you have been living on a desert island for the past few weeks, you cannot have failed to notice that Kubernetes celebrated its 10th birthday this month. What started as an internal container orchestration project at Google is now pretty much the foundation of cloud-native computing. From an observability point of view, K8S can be seen as something of a double-edged sword. Whilst it affords high levels of resilience and scalability, it is also complex, resource hungry and expensive to monitor.

At the moment, the position of K8S seems unassailable. The growing popularity of serverless platforms such as Railway and the increasing maturity of WASM do offer glimpses of an alternative for application hosting, but for the foreseeable future, K8S reigns supreme.

Observability People

Many people I speak to agree that one of the great strengths of working in observability is not just the technologies and the many-splendored technical challenges involved, it is also the people. Both online and in real life, there is a real and cohesive sense of being part of a community. The Observability 360 web site has just launched a new feature called Observability People, where we invite professionals to share their thoughts, experiences and challenges with the rest of the community. Our first profile is Jay DeLuca - a Staff SRE at Toast, and it really is a fascinating snapshot of the multi-textured nature of the SRE role. If you would like to take part, then please get in touch!

A Formatting Glitch

You might have noticed in this edition of the newsletter that links to featured articles are just standard hyperlinks rather than the usual big red button. This is because, in testing, we found that the button was not rendering in some browser/email client combinations. We will revert to the normal format as soon as the issue is resolved.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

IBM Harnesses Coralogix for Cloud Logging

In a move that might be seen as something of a role-reversal, this week saw the announcement that IBM’s new Cloud Logs product would be based on the Coralogix logging platform. The deprecation of IBM’s Log Analysis and Cloud Activity Tracker services, and the rollout of an IBM Cloud Logs product, have actually been on the roadmap for quite some time, and the announcement marks the culmination of quite a long process.

The move highlights a key dynamic in the current observability market, where the new wave of cloud-native companies are able to achieve massive advantages by adopting decoupled storage strategies, whilst some of the larger vendors toil with classic technologies. The fact that Coralogix are able to create a line of business from selling a white label version of their product is also something of an eye-opener. How many other companies might step up and provide observability as a service to other vendors?

Cisco Unveil Observability Roadmap

Gary Steele and Chuck Robbins of Cisco

Cisco have had a busy year shopping in the observability market. The audacious acquisition of Splunk last summer was followed at the end of 2023 by the shrewd, and more low-key, acquisition of Isovalent - creators of the seemingly unstoppable Cilium juggernaut. Given that the company already has AppDynamics in its locker, there has been a lot of speculation as to how Cisco will align these products and achieve maximum synergy from its portfolio.

You may need to go through a registration process to read this article, but, if you are a Splunker or Cisco watcher, it is worth the effort. Given that Cisco spent $29bn on Splunk, it is not much of a spoiler to reveal that it will be at the centre of the new strategy, but the article does also answer questions about the place of the Full-Stack Observability and AppDynamics products in Cisco’s new observability vision.

GigaOm Radar - Scanning The Observability Horizon

Over the past year we have happily worked our way through a fair number of “radar”, “state of” and “pulse” reports. We have even featured a number of them in the newsletter because, even though they may be vendor-driven, they also offer up valuable insights about the market in general.

This GigaOm report is really the gold standard. In fact, it is not just a radar, it is a thorough and systematic CT scan of the subject matter. The author, Ron Williams, takes a methodical and forensic approach, evaluating products with tremendous rigor and precision. It is a highly valuable resource for anybody wanting to get a snapshot of the market, and also provides a useful set of criteria and classifications for anybody looking to undertake their own product evaluations.

The report does not cover every product on the market - instead it concentrates on a cross-section. Even so, it is one of the best, if not the actual best, market analyses we have ever read. You will need to register to download the report, but this is genuinely an indispensable read for anybody interested in observability market intelligence.

Products

Kerno - A Business Reliability Platform

The Kerno UI

As we have previously noted, there can often be a disconnect between observability platforms and the developer experience. Kerno positions itself as a developer business reliability tool which brings the power of observability technologies into a unified feedback and debugging environment.

One of the primary constituents of full-fat observability is context, and this is something that Kerno has in abundance. As well as collecting telemetry, it also builds up a picture of your team structure as well as integrating with CI and source control systems. This means that not only can it pinpoint issues quickly, it can also intelligently alert only the relevant developers and stakeholders.

The product is launching this week and is currently in public beta. There has been a certain amount of talk about observability shifting left (and right). In our view, what Kerno and similar tools signify is a shift to an era of pervasive observability.

IBM SevOne - K8S Network Observability

The Gorgeous SevOne UI

Since we started this newsletter, we have probably covered more K8S observability products than you can shake a stick at. The latest vendor to join the party is IBM, who have recently unveiled the cryptically named SevOne. IBM may be late to the party, but they have turned up in their finest bib and tucker - the visuals for SevOne are truly sumptuous. In fact, the UI is, in some ways, the essence of SevOne as it is pretty much a layer sitting on top of the Red Hat NetObserv Operator solution.

Interestingly, the NetObserv operator is itself a stand-alone (eBPF-powered) OSS product that can feed data into backends such as Prometheus and Loki. Whereas some K8S tools focus on application and resource management, the emphasis with SevOne seems to be more on monitoring traffic flows on the K8S network. Come for the networking insights, stay for the groovy aesthetics!

From the Blogosphere

The eBPF Effect

eBPF has had a seismic impact on the observability landscape. The ability to hook safely and securely into the Linux kernel has opened up a whole new world of possibilities. The Observability 360 web site has now published a two-part exploration of the impact of eBPF on observability engineering.

The first part takes a more generic, high-level view of eBPF, discussing its general behaviour, scope and limitations. The second part looks at functional aspects of the use of eBPF in a number of leading products and explores some of the technical challenges - especially around distributed tracing. Research for the article involved detailed discussions with vendors, and it has received positive feedback from some of the most respected engineers in the field.

How Many Nines Do You Need?

99.999% has become a magical number in reliability measurement - a pinnacle of availability. This article by Thomas Stringer, a Staff Software Engineer at Freenome, is a useful reminder of the costs (and shrinking marginal gains) of chasing five nines (and beyond). This is a very quick read, but it does pack a bit of a punch and makes its point about balancing costs and benefits extremely succinctly.
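To put the numbers in perspective, here is a minimal sketch of the arithmetic behind each extra nine. It is our own illustration rather than anything taken from Thomas's article, and the function name and the 365-day year are assumptions made purely for the example.

```python
# Illustrative arithmetic only: the yearly downtime budget for N nines.
# Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        budget = downtime_minutes_per_year(n)
        print(f"{n} nines ({availability:.3f}%): {budget:,.1f} minutes/year")
```

Each additional nine cuts the budget tenfold: three nines allows around 525 minutes of downtime a year, whereas five nines leaves a little over five.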

Why HyperDX Chose ClickHouse

HyperDX is a lightweight, open source observability solution that we featured way back in Edition 8 of the newsletter. It is a product that is steadily gaining in popularity and now boasts over 6,000 stars on GitHub. In this blog article, HyperDX founder Michael Shi traces through the decision-making process that led him to select ClickHouse rather than a more established candidate such as Elasticsearch.

The article is not just a rare opportunity to get inside the head of a company founder, it also eloquently describes how the quantum leap in telemetry volumes has left some storage architectures somewhat flat-footed. It also looks at the performance gains achieved by re-writing the rules of indexing. Not all of us will be building an observability stack from scratch. However, an understanding of backend capabilities can be critical in selecting the right product for your needs.

OpenTelemetry

NAV’s Journey Into OpenTelemetry

NAV’s oTel Adoption Graph

OpenTelemetry is a great framework and it seems to be building up an unstoppable momentum. Whilst the theory of building an oTel-based observability infrastructure is sound, the practice can be messy, and the road is not always smooth. This is an excellent article by Hans Kristian Flaatten, who works at NAV - Norway’s largest government agency.

Their story may have a familiar ring to it. NAV started out with a whole bunch of microservices without any standardised logging, and devs had to resort to digging through queries in Kibana for debugging and troubleshooting. The article charts their progress as they implemented an oTel-based observability solution and contains some really valuable insights and lessons.

oTel me How To Do It!

If you are starting out on an OpenTelemetry journey of your own, you may be interested in the results of the recent oTel Getting Started Survey. The survey was relatively short and sweet, so the results are a pretty quick read. The key takeaways were that respondents wanted better documentation and more tutorials and reference architectures.

This is an interesting finding, and it chimes with our own experience. The oTel web site documentation tends to be pitched at an abstract level and is, arguably, more geared to vendors. When looking for guidance on the practicalities of implementation, we have found resources such as Practical OpenTelemetry by Daniel Gomez Blanco and Learning OpenTelemetry by Ted Young and Austin Parker to be a valuable complement to the official documentation. If you would like to share experiences/resources from your own oTel journey, then feel free to join this Observability Engineering Slack thread.

oTel Shot Down By Sentry!

The Sentry UI

As you know, we are great supporters of the OpenTelemetry project. However, oTel also has its critics and we think it is always valuable to hear an opposing point of view. In this post on his own personal web site, Sentry CEO David Cramer lets rip with a pretty stinging salvo. The article expresses a number of frustrations, including lack of leadership, design by committee and the alleged self-interest of ‘BigMonitoring’.

Much of the meat of the article consists of a dissection of the oTel implementation of tracing. David’s style is pretty forthright, and he shoots from the hip. Obviously though, when the CEO of a major vendor with over 100k customers expresses concerns over the implementation of a key telemetry signal, the smart move would be to listen with an open mind. The article has drawn a well-argued response from Josh Lee of Altinity. We have appended our own thoughts on David’s article in a comment on Josh’s LinkedIn post.

Observability Practice

From RUM to DEM - Delighting Your Users

One of the trends that has emerged as observability has matured is an increasing focus on user experience. This is fuelled by the increasing richness of the data available as well as the need to remain competitive in the e-commerce arena. RUM (Real User Monitoring) has been around for a while, but the related practice of DEM (Digital Experience Monitoring) is now also gaining attention.

So, what is the difference between DEM and RUM? Firstly, DEM is not a replacement for RUM - it is an extension of it. Whilst RUM, as the name suggests, analyses the experience of real users on your site, DEM uses a wider range of techniques (such as synthetic testing) to ensure the best possible digital experience for visitors to your site. This article on the Dynatrace blog is a really useful and practical guide for licking your DEM strategy into shape.

Videos And Podcasts

Switching Off And Tuning Into The Big Picture

Sometimes, it is easy to get bogged down in the minutiae of day-to-day practice and lose sight of the bigger observability picture. This edition of the Explain IT podcast is a welcome opportunity to step back from tweaking PromQL queries or editing oTel YAML files and revisit some first principles. During the broadcast, Tom Rowley, a Chief Technologist at Softcat, and Mat Ryer, Principal Engineer at Grafana Labs, cover quite a bit of ground. They start off with a basic definition of observability, but it is worth bearing with them as they move on to discuss issues of culture, technological fragmentation, MTTRs, the need for a single source of truth and much more food for thought besides.

A Dash of RED With Mirko Novakovic

You may not be familiar with the name Mirko Novakovic, but as co-founder of the ground-breaking Instana platform, he has cemented a place in observability history. He is now back with a new observability venture in the form of Dash0 (creators of the oTelBin product, which we covered in Edition 7 of the newsletter). In this video, Mirko shoots the breeze with long-time collaborator Michele Mancioppi, himself a formidable authority on observability.

In recent months the Dash0 web site has produced some outstanding documentation, including a wish list for an ‘OpenTelemetry-native’ platform. This podcast is a very easy listen and also serves up some interesting clues as to what the Dash0 team might be cooking up to make that platform a reality.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the great Danish physicist Niels Bohr:

“No, no, you’re not thinking; you’re just being logical.”