A Milestone for OpenTelemetry

plus Azure Chaos, CloudWatch AI and much more

A Logging Milestone For OpenTelemetry

Welcome to newsletter number six and, as usual, there is a lot happening in the observability space. Probably the most significant story is the news that Logging in OpenTelemetry has now been declared stable 🍾

The giants of cloud technology have also been busy, and there are exciting releases from both the Azure and AWS stables. We also cover an eye-opening critique of verbosity in K8S metrics from the Open Observability team and much more besides.


As practitioners in the field, you will know that every good observability system needs a feedback loop. Let us know how we are doing at:


OpenTelemetry’s Logging Milestone

Possibly the biggest observability story at this year’s KubeCon was the announcement by Morgan McLean, an OpenTelemetry co-founder, that “OpenTelemetry Logging hit 1.0”. The OTel logging architecture consists of four main components, all of which have now been designated as stable. Achieving a consensus around logging was always going to be a monumental task, given its relatively loosely structured nature and the many varied implementations in the wild.

One of the major gains of the release is that Logging runs in the context of the Collector, meaning that logs can more easily be correlated with other telemetry such as traces and metrics. OpenTelemetry have a keen awareness that “logs are relatively computationally expensive to capture” and have attempted to tackle this issue by specifying a new logging data model, which aims to reduce both CPU consumption and storage requirements.
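To illustrate the correlation point, here is a minimal Python sketch. It is not the OpenTelemetry SDK: the `LogRecord` class and `logs_for_trace` helper are hypothetical names, and the fields are a simplified subset loosely modelled on the OpenTelemetry logs data model. The assumption is simply that each log record carries the trace and span identifiers of the request that produced it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LogRecord:
    """Simplified, hypothetical log record loosely based on the OTel logs data model."""
    timestamp_unix_nano: int
    severity_text: str
    body: str
    trace_id: str = ""   # empty when the log was emitted outside any trace
    span_id: str = ""
    attributes: Dict[str, Any] = field(default_factory=dict)


def logs_for_trace(records: List[LogRecord], trace_id: str) -> List[LogRecord]:
    """Return the records that share the given trace ID, oldest first."""
    matching = [r for r in records if r.trace_id == trace_id]
    return sorted(matching, key=lambda r: r.timestamp_unix_nano)


# Illustrative data only; the IDs and messages are made up.
records = [
    LogRecord(3, "ERROR", "payment declined", trace_id="trace-a", span_id="span-2"),
    LogRecord(1, "INFO", "checkout started", trace_id="trace-a", span_id="span-1"),
    LogRecord(2, "INFO", "health check ok"),
    LogRecord(4, "INFO", "search served", trace_id="trace-b", span_id="span-9"),
]

for record in logs_for_trace(records, "trace-a"):
    print(record.severity_text, record.body, record.span_id)
```

Because the identifiers travel with the record itself, a backend can join logs to traces with a simple lookup rather than by parsing message text.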

Azure Chaos Studio Goes GA

The Netflix Chaos Monkey has long been part of SRE folklore. Now Azure users can also officially start breaking things in the name of resilience testing, as Microsoft have announced the General Availability of Azure Chaos Studio. As you would expect, it offers a range of features for testing the behaviour of your distributed applications under different types of failure. Chaos Studio allows users to build and run ‘experiments’, in which they specify target resources and then define faults to test their resilience. The experiments can then be used in DR simulations or even incorporated into CI/CD pipelines.

Boost Your Mobile Coverage With Embrace

According to this study, native mobile applications account for nearly 90% of mobile device usage. This means that native apps are not only a key component of digital strategy but also need to be factored into overall observability workloads. Embrace describes itself as a mobile-first observability platform. As well as providing the typical application-level instrumentation, it captures mobile-specific diagnostics in areas such as networking, device and OS performance to provide a full picture of the user experience. The Embrace toolkit provides SDKs to integrate with all the major mobile development platforms, and its API supports integrations with Grafana, Datadog, New Relic and other providers.

AWS Logs Get Intelligent

AWS re:Invent is now in full swing and one of the new features on show is natural language querying for the AWS CloudWatch service. This means that users can now query their logs by asking questions such as “Show me the 10 slowest Lambda requests”. The feature also offers line-by-line query explanation as well as refinement of existing queries. The technology is currently in preview and it would be interesting to know how it works in practice. If you have tried it out, feel free to share your experience with us via email or Twitter.

DevOps Dozen Nominations Announced

The DevOps Dozen is a set of annual awards organised by DevOps.com. The nominations make for interesting reading – not least for the absence of some big names and the presence of lesser-known brands. There are some 18 products in the running for the Best Observability Solution category. Alongside established names such as New Relic, Grafana and Honeycomb, there are also nominations for newcomers such as Edge Delta and vFunction. Voting is open until 31st December and anyone can take part.

From the Blogosphere

Splunk Blog - Observability Shifts Right

Normally we hear about DevOps culture representing a shift to the left - so that practices such as testing occur early on in the development lifecycle. In this blog article, William Cappelli, a thought leader at Splunk, looks at the equally important process of shifting to the right and encompassing domains such as Service Management. His discussion of CMDBs may raise a few hackles amongst ITIL purists. In contrast to some more orthodox opinions, he argues that the IT estate of large enterprises has always been “too complex and volatile” to be captured in a CMDB and that CMDBs can at best only be loosely coupled with observability frameworks.

Measuring Service Mesh Performance

One of the choices that engineers need to make when spinning up a K8S cluster is whether to swap out the default CNI (Container Network Interface). There are a number of service mesh products to choose from, but each has different algorithms and functional priorities for features such as network traffic management. In this interesting study on the Trendyol Tech blog, Eman Aktas benchmarks network performance for different service mesh implementations. The tests compare the performance of Cilium, Calico and Flannel across a number of network traffic scenarios, both on bare metal and on a CloudStack instance.

Cost Management

Is K8S Talking Too Much?

If you are running Kubernetes clusters, you will know that they emit considerable volumes of metrics. The chances are that these are being funnelled into your observability system and potentially driving up ingestion and storage costs. In this episode of Open Observability Talks, Aliaksandr Valialkin, CTO of VictoriaMetrics, suggests that up to 75% of the metrics generated by Kubernetes could be superfluous for most purposes. This is a really instructive piece that calls for some standards around actionable metrics.
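One common way to act on this kind of finding is to keep only the metrics you actually alert or dashboard on. The Python sketch below is purely illustrative and is not taken from the episode: the `filter_metrics` helper, the allowlist and the sample data are all assumptions. In practice this filtering would normally be done in your collection pipeline rather than in application code.

```python
from typing import Dict, List, Set, Tuple

Sample = Dict[str, object]


def filter_metrics(samples: List[Sample], allowlist: Set[str]) -> Tuple[List[Sample], int]:
    """Keep samples whose metric name is allowlisted; also return how many were dropped."""
    kept = [s for s in samples if s["name"] in allowlist]
    return kept, len(samples) - len(kept)


# Hypothetical allowlist: the metrics a team has decided are actionable.
ACTIONABLE = {"container_cpu_usage_seconds_total", "kube_pod_status_phase"}

# Made-up samples for illustration; the values carry no meaning.
samples = [
    {"name": "container_cpu_usage_seconds_total", "value": 12.5},
    {"name": "kube_pod_status_phase", "value": 1},
    {"name": "example_unused_metric_a", "value": 0},
    {"name": "example_unused_metric_b", "value": 0},
    {"name": "example_unused_metric_c", "value": 0},
]

kept, dropped = filter_metrics(samples, ACTIONABLE)
print(f"kept {len(kept)} samples, dropped {dropped}")
```

An allowlist is the more conservative choice here: anything new is dropped by default until someone decides it is worth paying for.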


Building Resilience at Santander

In this presentation at RoachFest 2023, Thomas Boltze of Santander stresses both the technical and the cultural dimensions involved in building resilient systems. This video is only 18 minutes long but it covers a lot of ground, including the importance of “the right mindset”, as well as giving an overview of the Santander platform (AWS/Kafka/Cockroach). This is a really lively and informative talk that will be of interest not only to SREs but to anyone with an interest in the topic of resilience.

Predictive Analytics With InfluxDB

Your observability system will probably amass a huge volume of time series data. Naturally, this will provide great insights into current and past system performance. This video from InfluxDB looks at using tools such as Quix and Hugging Face to harness your data to build models for predicting future trends and identifying anomalies. One of the interesting takeaways from this video is that InfluxDB is used by CERN for processing data from the LHC. Chapeau!


Honeycomb’s LLM Journey

If you are a customer of Honeycomb – or if you are just interested in how LLMs can be leveraged in observability systems – then this webinar may be of interest. It looks at the journey of Honeycomb engineers as they built their AI-driven Query Assistant, a tool which has been well received by Honeycomb users.

Skill up in ClickHouse

Choosing the right backend storage system is a critical architectural decision for observability infrastructure. Full-stack providers such as SigNoz have achieved significant performance advantages by adopting the ClickHouse platform. It is also the storage system of choice for a number of hyper-scale businesses such as eBay, Uber and Cisco. You can now sign up for a free ClickHouse Fundamentals training course. This consists of 6 hours of expert-led tuition spread over two sessions. It should be of value for anybody with an interest in column storage technology or in observability architecture in general.

📣 Reminder!

Don’t forget - you can find a fuller listing of events on the Observability 360 calendar.

That’s all for this edition!

This week, we will leave you with this rather apt and succinct quote from UptimeRobot.

“Monitoring is like seeing the tip of the iceberg, while observability dives deep into the unseen layers”.