Cisco - Observability's Secret Superpower

Free Synthetics with Prodzilla | Opening Up the Prometheus Black Box

Welcome to Edition #11 of the newsletter!

Observability is big business, and it is no surprise that big businesses have entered the market. We feature a Cisco press release that outlines their far-reaching observability ambitions.

Also in this edition - really exciting news from the WebAssembly ecosystem, extreme reliability with SAP, Slight Reliability with Stephen Townshend, new products, an events roundup and more!


We love to hear your feedback. Let us know how we are doing at:


Cisco - Observability's Secret Superpower

Traditionally, Cisco have been pretty much synonymous with routers and networking. In recent times, though, they have been getting seriously wired into software services. Their acquisition of Splunk last year made the headlines, but it wasn’t really a bolt from the blue. It was just the latest step in a long-term strategy that included the acquisition of AppDynamics in 2017 and then ThousandEyes in 2020. The article is a clear statement of Cisco’s ambition and of their view that observability is strategically important across multiple business functions.

WebAssembly Joins the OTel Revolution

WebAssembly (Wasm) is a tech we have been following for a while. wasmCloud is one of the major projects in the space, and its runtime allows components written in any language to run in any cloud and connect to each other frictionlessly. The team behind the project have recently announced that version 1.0 of the product, due to ship this quarter, will include full support for OpenTelemetry. The current release - 0.82 - already provides support for logging and tracing, with metrics to be bundled into 1.0. Although Wasm is not part of the mainstream today, it is a technology with enormous transformative potential.

Prometheus Black Box Explorer

Certificate expiry is one of the great bugbears for any infrastructure engineer - and it is especially painful (and embarrassing) when you are hosting public-facing systems. At the same time, keeping track of certificates can be a headache. In this blog article, Prometheus Jedi Julian Volze describes how the Prometheus Blackbox Exporter can be configured to probe HTTP endpoints and raise alerts for certificate expiry.
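
For readers who like to see the idea in code: a certificate expiry alert boils down to a threshold check on a timestamp. The sketch below is our own illustration, not the configuration from the article. It assumes the expiry time arrives as a Unix timestamp (which, as far as we know, is how the exporter's probe_ssl_earliest_cert_expiry metric reports it), and the 30-day window and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical warning window; pick whatever suits your renewal process.
WARNING_WINDOW = timedelta(days=30)


def time_to_expiry(expiry_unix: float, now: datetime) -> timedelta:
    """Return how long remains before the certificate expires."""
    expiry = datetime.fromtimestamp(expiry_unix, tz=timezone.utc)
    return expiry - now


def should_alert(expiry_unix: float, now: datetime,
                 window: timedelta = WARNING_WINDOW) -> bool:
    """True when the certificate expires within the warning window."""
    return time_to_expiry(expiry_unix, now) < window


if __name__ == "__main__":
    # Fixed example dates so the sketch does not depend on the clock.
    now = datetime(2024, 2, 1, tzinfo=timezone.utc)
    expiry = datetime(2024, 2, 15, tzinfo=timezone.utc).timestamp()
    remaining = time_to_expiry(expiry, now)
    print(f"{remaining.days} days left, alert={should_alert(expiry, now)}")
```

In a real Prometheus setup the same comparison would live in an alerting rule rather than in application code; the Python version is only meant to make the logic easy to read.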

Grafana Unveil K8S Alerting Tool

The Kubernetes ecosystem must be one of the most crowded spaces in the technosphere. Grafana have now thrown their hat in the ring with the release of their dedicated Kubernetes Monitoring product. The aim of the tool is to eliminate the complexity of K8S management and provide a means for rapid identification of actual or potential issues. If you are already running Grafana in a K8S environment, then this tool may boost your diagnostic capabilities with minimal overhead.


Prodzilla - Open Source Synthetic Monitoring

Synthetic monitoring is part and parcel of observability for any operation with public-facing endpoints - especially if you have strict SLAs. Unfortunately, running synthetic tests on dozens of services from multiple locations hundreds of times a day can become prohibitively expensive with commercial vendors. Prodzilla is an open source monitoring tool capable of handling complex user flows and even chained requests. On top of this, you can take advantage of free deployments on the Shuttle platform. Is there a kicker? Well, yes, sort of. There is currently no UI for the product, so you have to build your monitors in YAML. This is actually not that daunting, as the template specifications are very easy to work with.

DeepFlow: eBPF-driven Observability

If you read our piece on Groundcover in the last newsletter, then the premise of DeepFlow will have a familiar ring to it, as it is also an eBPF-based, full-stack observability platform. There are, however, a couple of intriguing differences from Groundcover. The first is that DeepFlow is open source, and the second is that it includes Wasm in its tech stack. It uses ClickHouse as its backend but also boasts a custom string encoding process which, it claims, offers 10 times greater compression than ClickHouse's LowCardinality type. DeepFlow is produced by Beijing-based Yunshan Networks. The product looks intriguing but, for an English speaker, the website is not always easy to follow.

From the Blogosphere

Scaling Prometheus With Thanos

Prometheus is still the standard for handling metrics. One major drawback of the product, though, is that it only scales vertically, not horizontally. Thanos is an open-source CNCF project that aims to overcome this fundamental limitation. This article looks at the practical impediments to scaling out Prometheus horizontally and explores how Thanos effectively wraps a scalability layer around Prometheus instances. It provides valuable insights and guidance on navigating the challenges of building a Thanos implementation.

Storing Profiling Data in ClickHouse

Profiling is often referred to as the fourth pillar of observability, and an increasing number of vendors are including profiling tooling in their portfolios. One issue with profiling is that it can generate huge volumes of telemetry, which means that care needs to be taken in storage design to ensure high compression rates whilst avoiding slow read times. In this article, from the excellent Coroot blog, Nikolay Sivko explains how his team met these challenges when creating a profiling solution that uses ClickHouse (them again!) as its backend.

An ‘OpenTelemetry-native’ Wish List

The term ‘OpenTelemetry-native’ has started to gain some currency recently. In this thought-provoking post, Mirko Novakovic and colleagues from Dash0 argue that the term is currently not very well defined and needs to be fleshed out. They go on to put together a wish list of features for their ideal product - including one of our favourites - a standard querying language for telemetry. If you are interested in joining the discussion, you can comment on this LinkedIn thread.


Extreme Reliability With SAP

It is an astonishing statistic, but apparently 77% of global transaction revenues pass through at least one SAP system. With those levels of throughput in critical business systems, cloud engineers have to make sure that their SAP platforms are rock solid. In this edition of the Dev Interrupted podcast, Conor Bronsdon talks to Guilherme Sesterheim - an SAP Site Reliability Engineer at AWS. The discussion affords a fascinating glimpse into the challenges of applying DevOps and Chaos Engineering principles to a monumental system in a risk-averse environment.

Slight Reliability - a YouTube Channel for SRE

Slight Reliability is a YouTube channel presented by Stephen Townshend, an SRE based in New Zealand. The channel has been running for a year and is the successor to the Performance Time channel. The tone is easy-going and down to earth but also well-informed and opinionated (in a good way). As well as full-length videos on issues such as SLOs and Rapid Incident Response, there are also bite-sized takes on topical DevOps issues and themes. We definitely agree with his views on MTTR and DORA metrics.

Events RoundUp

There is a great mix of events on the horizon for the next month or two. Londoners will want to clone themselves next week as there are two great meetups on February 28th. The ClickHouse MeetUp will be taking place in Camden Town, while in High Holborn the London Observability MeetUp group will be munching pizza and talking LLMs. The following day sees the LEAP 2024 conference (I’m not sure if we have to wait until 2028 for the next one).

Early Bird tickets have now sold out for GrafanaCON in Amsterdam in April. Standard price tickets are still available but will almost certainly sell out soon. SREcon Americas will be landing in San Francisco on March 18th, while on April 8th QCon 2024 will kick off three days of software development-focused talks and keynotes in London - the event also includes a dedicated Observability track.

📣 Reminder!

Don’t forget - you can find a fuller listing of events, meetups and webinars on the Observability 360 calendar.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from Allen Gannett in his book The Creative Curve:

“You can't have insights about things you don't know anything about”