Observability 360

Finding The Unknown Unknowns!

OTel Survey Results | Observability For ML Models

John Hayes
May 16, 2024

Welcome to Edition #17 of the newsletter!

In the 10 months since we started this newsletter, one constant within the observability market has been proliferation. It seems as if we are in the midst of our own Cambrian explosion: whether it is a new full stack solution, a point product or a new plug-in, the rate at which new products are coming on to the market shows no sign of slowing down. In this edition alone, we will be featuring products from Volkov Labs, Antithesis and Ctrl B. For customers, this can be a double-edged sword as it means an almost bewildering array of choices, but it certainly testifies to the strength of observability as a technical imperative within IT.

In our News section this week, we feature some enhancements that Mezmo have delivered to their top-tier pipeline solution. We think that this is important for two reasons. Firstly, it once again underscores the importance of pipelines within observability architectures. Secondly, it highlights some of the potential benefits of stream analytics, which we feel will be an increasingly important and in-demand capability. On the subject of pipelines, we also feature a really neatly structured blog article by Nimbus developer Kevin Lin.

Feedback

We love to hear your feedback. Let us know how we are doing at:

[email protected]

https://twitter.com/TheObsGuy

NEWS

AWS Boost For EKS Observability

Two years ago, Amazon announced their AWS Observability Accelerator - which provided templates for building an AWS-centric observability stack using either Terraform Modules or CDK Patterns. They have now announced the release of a pre-built CloudFormation solution for managing Amazon EKS infrastructure. The solution is used to deploy and configure an Amazon Managed Grafana workspace as well as an Amazon Managed Service for Prometheus instance for metrics ingestion and storage. The aim of the solution is to simplify EKS management by creating a pre-defined set of metrics as well as providing a suite of ready-made dashboards. This article walks you through deploying the solution as well as calculating the potential costs.

Quickwit Goes Head-to-Head With Loki!

We have previously featured Quickwit as a lightweight and blazingly fast engine for querying telemetry at scale. Whilst the product is valued for its speed and low cost, it might not be clear where it stands in relation to similar products in the space. In this article on the Quickwit blog, company co-founder François Massot discusses the results of a benchmarking exercise which pits Quickwit against the might of Grafana Loki. He does a commendable job of attempting to be impartial and draws some very helpful conclusions about the use cases for each product and their relative strengths and weaknesses. As well as being interesting in its own right, the article is also valuable at a more general level in providing criteria for evaluating different log aggregation systems.

Mezmo Unveil ‘Intelligent’ Pipeline

As ingestion volumes grow in size and complexity, pipelines become an ever more critical component of observability architectures. Mezmo have been a leader in this space and have now raised the bar with the latest enhancements to their pipeline product. In a press release last week, the company announced the launch of ‘stateful alerts in stream’ - which they claim is an industry first. Mezmo pipelines are now able to analyse data whilst it is in flow and raise alerts on errors and misconfigurations which might result in spikes and excessive costs. Many commentators believe that real-time analytics will become an increasingly important observability requirement and Mezmo now seem well placed to take advantage of this. This is a really interesting glimpse into the possibilities of telemetry pipelines.
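To make the idea of a stateful, in-stream alert a little more concrete, here is a minimal Python sketch. It is not Mezmo's implementation, and the function name `spike_alerts` and its parameters are our own illustrative assumptions. It simply keeps a sliding window of recent event timestamps in memory and raises an alert the moment the volume in that window crosses a threshold, without first storing and querying the data.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def spike_alerts(
    events: Iterable[Tuple[float, str]],
    window_seconds: float = 60.0,
    max_events: int = 100,
) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, count) when the trailing window first exceeds max_events.

    Hypothetical helper: events are (timestamp, message) pairs in time order.
    """
    window: Deque[float] = deque()  # the state carried along the stream
    alerting = False
    for ts, _message in events:
        window.append(ts)
        # Evict timestamps that have fallen out of the trailing window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        over = len(window) > max_events
        # Alert once on the transition into a spike, not on every event.
        if over and not alerting:
            yield ts, len(window)
        alerting = over
```

Because the window is the only state, memory use is bounded by the number of events that fit inside it, which is what makes this style of check feasible while data is still in flow.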

Grafana Leverage k6 for Synthetics Testing

Synthetic testing is a pillar of any observability toolkit, as it provides feedback on how your services are responding to the outside world. It is critical, however, that synthetic tests go beyond the basics of checking availability and latency. Grafana first dipped their toes in the waters of Synthetics with their worldPing tool, released in 2015. Grafana Cloud Synthetic Monitoring is the company’s third generation of synthetics tooling, and it leverages the Grafana k6 load testing system to generate advanced synthetic testing for complex workflows. If you are evaluating your synthetic testing options, this is definitely a product to consider.

Products

CtrlB - A Bold New Pipelining Contender

CtrlB is a new offering in the observability pipeline space which aims to offer low-cost storage, flexible routing and high-speed querying. It is designed to complement rather than compete with your existing observability stack. With many providers, retaining all of your telemetry can be expensive. CtrlB allows you to route all of your telemetry to their low-cost backend, whilst also sending key telemetry streams to your main observability stack. They claim that this allows you to ingest and query at terabyte scale whilst also avoiding bill-shock from your main observability provider.
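The routing pattern described above can be sketched in a few lines of Python. This is purely an illustration of the fan-out idea, not CtrlB's API: the `route` function, the `is_key` predicate and the two list-based sinks are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]


def route(
    events: Iterable[Event],
    is_key: Callable[[Event], bool],
    archive: List[Event],
    primary: List[Event],
) -> None:
    """Send every event to the low-cost archive and only key events to the primary stack."""
    for event in events:
        archive.append(event)  # full-fidelity copy goes to cheap storage
        if is_key(event):
            primary.append(event)  # only selected streams incur primary-vendor cost


events = [
    {"level": "error", "message": "payment failed"},
    {"level": "debug", "message": "cache miss"},
    {"level": "warn", "message": "slow query"},
]
archive: List[Event] = []
primary: List[Event] = []
route(events, lambda e: e["level"] in {"error", "warn"}, archive, primary)
```

In this example all three events land in `archive`, while only the error and warning reach `primary`, so the bill from the main provider scales with the key streams rather than with total volume.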

Volkov Labs - Taking Grafana Further

One of the great attractions of Grafana is its open source nature - which has spawned a rich ecosystem for data visualisation tooling. Volkov Labs are one of the leading contributors to this ecosystem, producing a suite of high-quality, open-source plug-ins catering for a number of advanced business scenarios. The suite includes capabilities for calendaring, rendering PDFs and even updating source data. Our favourite component is the Business Charts Panel. This includes over 100 different chart types and leverages Apache ECharts to truly make your visualisations pop. If your role involves Grafana visualisations, it is definitely worth keeping Volkov Labs on your speed dial.

Antithesis - Finding The Unknown Unknowns

Whilst automated testing has resulted in a great leap forward in software quality, one limitation faced by developers and test engineers has been the infeasibility of testing every possible execution path in a complex, distributed system. Indeed, tests sometimes cover only a hypothetical Happy Path. Antithesis, as their name suggests, aim to rewrite the rule book on testing - sometimes by turning existing assumptions on their head.

They describe their product as a “continuous reliability platform”. It runs your code in a simulated environment and continually explores new execution paths to discover “unknown unknowns” across distributed systems. The platform also includes Chaos Engineering capabilities such as simulating network retries or node restarts. Whatever the business case, from an engineering point of view, the lure of charting every possible journey for your code is hard to resist.

From the Blogosphere

Observability Principles for ML Models

A survey carried out by McKinsey in 2021 found that 57% of respondents were already using Machine Learning to support at least one business function. ML is no longer a niche concern but is becoming a core component of development and CI/CD practices. As this post from the Datadog blog notes, the efficacy of ML models will inevitably degrade over time, so monitoring their performance and reliability is critical.

The article really drives home the point that ML is a domain with its own specific behaviours, and effective monitoring requires building out new processes, metrics and even infrastructure to cover issues such as Data Drift, Prediction Drift and Concept Drift. Whilst the article does use some specialist terms, it is a highly readable and practical guide to the subject of ML monitoring.
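As a rough illustration of what a Data Drift check can look like, here is a small Python sketch of the Population Stability Index (PSI), one widely used drift measure. We are not suggesting this is the method the Datadog article uses. The function name `psi`, the equal-width binning and the 1e-6 floor are our own assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample.

    Bins are equal-width over the reference range; out-of-range values are
    clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Comparing a feature's training-time values with last week's production values, a score of zero means the two distributions fall into the bins identically. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, although the right threshold depends on the model and the feature.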

Building A Telemetry Pipeline With Kevin Lin

We have previously mentioned Kevin Lin as the developer behind Nimbus - a tool for log optimisation and cost reduction. In this article, he dons his blogger hat to provide an ‘opinionated’ guide to pipeline management. He starts off with the example of a very simple two-part Vector pipeline and then, step-by-step, builds it out to add features such as transformations, aggregations and complex routing. Naturally, the article does include a plug for the Nimbus product, but it is still a great jumping off point if you are not yet familiar with the principles of telemetry pipelines.
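For readers new to the idea, the shape of such a pipeline can be mimicked in plain Python. This is an analogy rather than Vector configuration, and the function names and the `LEVEL service message` log format are hypothetical. A source feeds a transformation stage, which in turn feeds an aggregation stage.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Transformation: turn 'LEVEL service message' lines into structured events."""
    for line in lines:
        level, service, message = line.split(" ", 2)
        yield {"level": level.lower(), "service": service, "message": message}


def count_by_service(events: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Aggregation: collapse many events into a single count per service."""
    return dict(Counter(event["service"] for event in events))


raw_lines = [
    "ERROR api timeout talking to db",
    "INFO api request served",
    "WARN web slow response",
]
counts = count_by_service(parse(raw_lines))
```

Each stage only knows about its input and output, so further stages such as filtering or routing can be slotted in between without touching the others. That composability is the core principle the article builds on.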

OpenTelemetry

OTel Me More! Insights from the OTel Survey

The OpenTelemetry team have just published the results of a survey on how companies are using the OTel Collector. Whilst the sample is relatively small (186 respondents) and not necessarily representative, it has nevertheless turned up some really interesting findings. The majority of companies (100 of 186, or roughly 54%) deploy 10 or more Collectors - which hints at some pretty large-scale deployments. Unfortunately, the article does not provide extensive detail, so whilst the most sought after improvement in the product was stability, there is no further information on the types of issues users may have been experiencing. The report is relatively short but still represents a fascinating snapshot of real-world usage of the OTel Collector.

A Smooth Operator for the OTel Collector

The OpenTelemetry Collector comes in many different flavours, with several vendors publishing their own distros. As well as this, there are a number of different deployment patterns. In this blog article, leading OTel contributor Adriana Villela shares her experiences in deploying the Collector via the OpenTelemetry Operator For Kubernetes. Kubernetes Operators offer many advantages in deploying and maintaining resources running in a K8S cluster. The OTel Operator not only takes care of OTel Collector deployment but also provides services such as OpAMP integration and auto-instrumentation. If you are looking for expert advice on best practice for your OTel Collector implementation, then this article is highly recommended.

Fun Stuff!

We came across this animation (unfortunately we can only share it as a static image) on the wonderful xkcd web site and (jokingly!) commented on social media that it looked like the internals of some predictive AI modules we have used 😀 . However, it is not just a cool animation, it is actually editable, so everybody can make their very own virtual Rube Goldberg Machine. What will you come up with? Let us know and we will be happy to share your own wizard inventions on social media!

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the great English thinker John Ruskin:

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution.”