Mapping the Observability Landscape

OeC - the New O11y Stack | OpenTelemetry Goes Mainframe!

Welcome to Edition #12 of the newsletter!

In this fortnight’s edition we introduce a new Careers and Professional Development section. This will be an occasional feature where we gather together information on jobs, training and certifications. This is in line with our aim of not just being a source of news but also providing content which will hopefully be of benefit to the community.

The observability domain is not just about technical practice. It is also the subject of analysis and theory to help inform and improve that practice. In this edition, we feature two surveys of the observability landscape. The first is from the perspective of investment analysts at Sapphire Ventures and the second looks at how three key technologies are re-shaping observability architectures.

The OpenTelemetry project continues to steam ahead, exerting an ever stronger gravitational pull on observability. We look at some of the latest news and views in our dedicated OpenTelemetry section.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Mapping the Observability Landscape

The observability space at the moment is a bit like the early universe - rapidly evolving and expanding in all directions. Attempting to define classifications and draw boundaries around the teeming constellations is a tall order. That, though, hasn’t prevented Casber Wang and his colleagues at Sapphire Ventures from trying. In this detailed analysis they draw out a summary of the current trends in the market, which is now apparently worth a cool $105bn. Some of the placings of vendors within the classification scheme may be questionable (SigNoz are probably not best described as an Infrastructure Monitoring vendor) but it is a valiant attempt and a well-researched report.

bpftop - Monitoring for eBPF!

With so many companies now developing programs leveraging eBPF, it was inevitable that at some point there would need to be a monitor for eBPF clients themselves. Whilst eBPF apps should have a minimal footprint, having multiple clients running on a server could potentially create a processing overhead - especially if those programs are not optimally performant. bpftop is the latest app to be open-sourced by the developers at the Netflix labs and it uses the built-in BPF_ENABLE_STATS command to create performance analytics across running eBPF instances.
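To give a feel for the kind of arithmetic involved, here is a minimal Python sketch. It assumes that, with statistics enabled, the kernel exposes two cumulative counters per program (total run time in nanoseconds and total run count) and that a monitor samples them at a fixed interval. The ProgSample type and the summarise function are hypothetical names for illustration only and are not taken from bpftop itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgSample:
    """One reading of a program's cumulative counters (hypothetical type)."""
    run_time_ns: int  # total time spent executing the program so far
    run_cnt: int      # total number of times the program has run so far


def summarise(prev: ProgSample, curr: ProgSample, interval_s: float) -> dict:
    """Derive per-interval statistics from two cumulative samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_ns = curr.run_time_ns - prev.run_time_ns
    delta_runs = curr.run_cnt - prev.run_cnt
    # Avoid dividing by zero when the program did not run in the interval.
    avg_runtime_ns = delta_ns / delta_runs if delta_runs > 0 else 0.0
    return {
        "avg_runtime_ns": avg_runtime_ns,
        "events_per_sec": delta_runs / interval_s,
        # Share of one CPU's time consumed by the program in the interval.
        "cpu_percent": 100.0 * delta_ns / (interval_s * 1e9),
    }
```

A monitor would call summarise once per refresh for each loaded program, passing in the previous and current samples. Working from deltas rather than raw totals means the figures reflect recent activity rather than the whole lifetime of the program.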

OeC - the New Observability Stack?

Up until recently the Observability stack mostly consisted of proprietary systems or composable stacks such as ELK or Fluent/Prometheus/Grafana. This article on the Observability 360 web site sketches out the rise of a new challenger stack based on OpenTelemetry, eBPF and ClickHouse. It also looks at the implications for the traditional “three pillars” model and asks whether we are entering a ‘post-instrumentation’ age.

Products

Traceloop - Observability for LLM Apps

Traceloop is one of the latest tools to support developers who are incorporating calls to LLM APIs into their applications. It is designed to help LLM app developers evaluate the quality of their prompts as well as providing tracing for API calls. It started off as an internal tool and has now been released as an open source product. It works with numerous LLM providers including OpenAI, LangChain and Anthropic and also integrates with Splunk, Grafana, Datadog and other observability platforms.

ddosify: Supercharge Your K8S Monitoring

These days it seems like Kubernetes monitoring tools are almost a dime a dozen. It takes something special to stand out from the crowd. ddosify truly raises the bar. As you initially scroll through the feature list nothing seems too out of the ordinary - but then the wow factor kicks in - and doesn’t really let go. Apart from a stunning UI, it supports load testing across 25 countries, latency testing, API support and a no-code Scenario Builder. It uses eBPF (of course) and, best of all, it is open source. This is destined to become an essential item in the observability toolkit.

The Olly ToolKit

Olly are an Observability services company that provide support for open source observability tools - with Prometheus being a specialism. In fact, one of the company’s co-founders is a Prometheus maintainer. As well as providing observability services, they have also developed a set of open source tools to help you “debug, augment, and manage your open source observability stack”. The tools include a Prometheus metrics linter, a PromQL parser and a Mimir resources estimator.

From the Blogosphere

Tetragon - Deep Network Observability

Tetragon is an eBPF-based network observability product originally created under the umbrella of the Cilium project. If you are familiar with Cilium, you might ask what the difference is between Tetragon and the Cilium Hubble module. Whilst there is some overlap between the two, Tetragon provides lower-level and more granular visibility into activity on your network hosts. Use cases include investigating latency, auditing file access and monitoring lateral movement. This article provides a clear overview as well as a really helpful walkthrough of the product in action.

An Engineer’s Personal Retrospective

This is a really engaging blog post by Infrastructure Engineer Jack Lindamood, where he reviews nearly every infrastructure decision he made over four years working at a start-up. Each choice is graded with a Regret, an Endorse or an occasional Unsure. Whilst not explicitly observability-related, it will, however, resonate with any engineer forced to make technological choices (which is probably all of us). The article contains much distilled wisdom and some strong opinions, as well as general observations on the challenges and trade-offs faced by infrastructure engineers. GitOps, Terraform vs CloudFormation, network meshes - read on to find out which got the thumbs-up.

OpenTelemetry

OpenTelemetry for Mainframes!

In the era of cloud-native and distributed computing, the mainframe might seem like something of an anachronism. However, these workhorses still play a vital role in industries such as banking and insurance and this has been recognised by the creation of an OpenTelemetry Mainframes SIG (Special Interest Group). The goals of the group include defining appropriate semantics and metrics for creating mainframe telemetry as well as introducing instrumentation for programming languages used in mainframes. We may yet see a FORTRAN OpenTelemetry SDK!

OpenTelemetry Collector Antipatterns

The OpenTelemetry Collector is the centrepiece of OpenTelemetry architecture. Whilst it can be easy to deploy, it can also be easy to misuse. In this post on the OpenTelemetry blog, Adriana Villela, one of the leading figures in the project, sets out five basic principles which may often be overlooked when deploying and managing OTel Collector instances. This is an easy-going read which does not get bogged down in technical details.

CAREERS AND PROFESSIONAL DEVELOPMENT

We kick off our first ever careers section with news of a Google-sponsored course aimed at aspiring SREs. The course offers a grounding in SRE principles and is led by Google Site Reliability Engineer Salim Virji. The duration is two weeks and places are limited.

Sticking with the SRE track, mysterious GitHub contributor Maksim has created a repo full of really excellent resources to help candidates prepare for SRE interviews. The links cover a wide range of themes including System Design, Troubleshooting, IaC and, of course, Interview Questions.

If you have ever thought that it would be great if there was a web site dedicated to Observability/SRE jobs, then say hello to OllyJobs. At the moment, there are only a small number of posts listed but hopefully this will grow over time as it is an idea whose time has come.

If you are looking to achieve accreditation for your Prometheus skills, the CNCF offers a Prometheus Certified Associate certification. Unfortunately, despite its rather hefty $250 price tag, it only describes itself as a confirmation of ‘foundational knowledge’.

Events RoundUp

First up in this fortnight’s roundup is the Women In Tech Global Conference. This is a mega-event aiming to attract 100,000 women both online and in-person over the course of three days. The event is split into four 'summits', each with their own set of tracks. There will be a mix of educational & training content, keynotes, panels, breakout rooms and technical workshops.

Last month, the London Observability MeetUp hosted a well-attended event featuring a discussion on the use of LLMs and you can see a recording here. This MeetUp is attempting to adapt its format to provide space for debate and discussion so that attendees are not just passive listeners. You can also keep the discussion going on their Slack channel. We think this is a great example for others to follow.

Apparently, some 85% of papers submitted to CNCFCON/KUBECON get rejected. Not to be downhearted, some of the rejectees will be gathering in Paris to present their papers at the Cloud Native Rejekts event. The event is free and boasts speakers from Microsoft, SUSE, Grafana and Google.

In April, Seattle will play host to Open Source Summit North America. The event runs over three days and comprises a number of ‘microconferences’, including LinuxCon, ContainerCon and CloudOpen. Living Legend Linus Torvalds will be in attendance as will the venerable Kelsey Hightower - so don’t forget to pack your selfie sticks.

📣 Reminder!

Don’t forget - you can find a fuller listing of events on the Observability 360 calendar.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is, perhaps improbably, from Vincent van Gogh:

“Great things are done by a series of small things brought together.”