The SRE With 30 Million SLOs!

The State of SRE | The Bot That Got Too Smart!

Welcome to Edition #22 of the Newsletter!

Observability’s Numbers Game

There was a time when comparative advertising seemed to be rare. A vendor would say “our car is fast” rather than “our car is 20% faster than vendor B’s car”. It seems though that, as the observability market becomes increasingly competitive, the practice is becoming more commonplace. This has resulted in some terse exchanges on social media as vendors at the unfavourable end of the comparison hit back.

Our feeling is that this brand of marketing is a double-edged sword. The problem with marketing yourself as being “cheaper than <name of other brand>” is that a cheaper competitor invariably comes along. The same applies to measures such as ingestion rates, retention periods, compression rates etc. There is always someone bigger or faster waiting in the wings.

Ultimately, if you are buying a car, you will not make your purchase on the basis of a single criterion such as price, speed or storage space. You also want safety, reliability, low running costs etc. The same goes for observability systems. They are not just storage or ingestion engines. They are a tool for meeting specific business needs, for being able to answer questions, explore your data, identify anomalies, diagnose failures, prevent outages etc. There may be some value in the impressive stats used in product marketing, but they are only part of the story when it comes to procuring a system that will be the right fit.

Introducing The CatchUp

This edition of the newsletter features a new section - The CatchUp. This is where we provide a roundup of the latest news, features and versions of products which we have previously covered in the newsletter. We really hope it provides a useful reference for keeping up with developments in observability tooling.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

The State of SRE (X2)

You wait ages for a report on the state of SRE, and then two come along. First up is the Catchpoint SRE Report, which is probably worth reading for Steve McGhee’s introduction alone.

Unsurprisingly, given Catchpoint’s line of business, the report finds that monitoring third-party endpoints is really important (tbf - it probably is). Other, more vendor-neutral findings include that the percentage of work regarded as toil has dropped from 20% to 14% in the past year. Interestingly, the authors of the report do not ascribe this to the use of AI. Whilst some vendors push the “single pane of glass” paradigm, the report finds that the bulk of companies use 2-5 tools, with 10% using 6 or more. For other juicy stats, such as how many companies breached a contractual SLA, you can download the report.

The second report is Gartner’s SRE Hype Cycle. Unfortunately, the full report is only available to Gartner subscribers. You can, however, find an overview in this New Stack article by Ido Needham. On the basis of Ido’s summary, the Gartner report is heavy on analysis of automation within SRE - covering trends such as Monitoring as Code, Policy as Code and the usage of GitOps-type processes.

Log Querying Smartens Up

Grafana Explore Logs

For a long time, log querying UIs were almost an oasis of familiarity in an ever-changing world. You ticked a few options, entered a query and then waited until the screen filled up with rows of results. Gradually though, it seems as if the tides of AI and automation are creeping in. Last year, AWS announced support for AI-powered natural language querying in CloudWatch, and recently both Azure and Grafana have announced upgrades to their log querying capabilities.

Up until recently, querying logs in Azure Monitor required knowledge of Kusto - the Azure query language. With the introduction of Log Analytics Simple Mode, users can now easily filter by any field in a table using dynamically generated drop-down lists. It is also possible to generate simple aggregations just by pointing and clicking.

Log Analytics Simple Mode

Grafana have also introduced an Explore Logs feature, which is designed to help engineers use logs to resolve errors without the need for writing complex queries. This is a much more visual experience than the Microsoft update. Explore Logs presents an array of graphs representing logging flows. Users can then zoom in on anomalous activity. There are also extremely powerful built-in analytics such as Patterns, which group together logs with similar textual content so that they can either be investigated further or eliminated from the diagnostics process.
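To give a flavour of the general idea behind pattern-style grouping, here is a simplified Python sketch of our own - not Grafana’s actual algorithm - that masks the variable parts of log lines so that structurally similar entries collapse into a single template:

```python
import re
from collections import Counter

def to_pattern(line: str) -> str:
    """Collapse the variable parts of a log line into placeholders."""
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", line)   # long hex identifiers
    line = re.sub(r"\d+(\.\d+)?", "<num>", line)       # numbers and durations
    return line

logs = [
    "GET /orders/1234 completed in 31ms",
    "GET /orders/9876 completed in 12ms",
    "connection to db-7 refused",
    "connection to db-2 refused",
]

# Count how many raw lines fall under each template, most common first
for pattern, count in Counter(to_pattern(l) for l in logs).most_common():
    print(f"{count:>2}  {pattern}")
```

Even this toy version turns four raw lines into two templates, which hints at why the technique is so useful for cutting through noisy log volumes during an incident.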

VM Get MAD With Anomaly Detection

Anomaly Detection in time series data is notoriously complex. Detecting anomalies in observability metrics tends to be even more difficult. This is partly because the best results arise from training models on a single metric, whereas observability systems may ingest millions of different metrics. The patterns of the data are also likely to be more unpredictable and exhibit characteristics such as sparsity and extreme right skew.

Victoria Metrics are one of the leading vendors taking on this challenge, and last year they rolled out their Anomaly Detection feature for their enterprise customers. A recent post on their blog summarises a number of key updates to their toolkit. The first of these is Presets - which can target well-known metric types such as those generated by the Kubernetes Node Exporter. They have also released the MAD (Median Absolute Deviation) model, but warn that it is not suited to seasonal or trending data.
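For a feel of how MAD-based detection works in principle - this is a minimal sketch of the general statistical technique, not Victoria Metrics’ implementation - the snippet below flags points whose deviation from the median exceeds a multiple of the median absolute deviation:

```python
import numpy as np

def mad_anomalies(values: np.ndarray, threshold: float = 3.5) -> np.ndarray:
    """Flag points whose deviation from the median is extreme."""
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)
    # 0.6745 rescales MAD so the score is comparable to a z-score
    # for normally distributed data (the "modified z-score")
    scores = 0.6745 * (values - median) / mad
    return np.abs(scores) > threshold

latency_ms = np.array([12, 14, 13, 15, 12, 250, 13, 14, 16, 12], dtype=float)
print(mad_anomalies(latency_ms))   # only the 250ms spike should be flagged
```

Because the median here is a single, global baseline, a metric with a daily cycle or a steady upward trend would have its perfectly normal peaks flagged as anomalous - which is presumably why the VM team warn against using MAD on that kind of data.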

The VM web site has really excellent documentation, both on the tooling and on anomaly detection in general. At the same time, these are quite sophisticated instruments, and obtaining meaningful results probably requires both an understanding of the characteristics of your data sets and some fluency in data science principles.

The Bot That Got Too Smart

Robusta is a tool we have featured a few times. It harnesses AI for advanced K8S monitoring. A recent LinkedIn post by company founder Natan Yellin showed that sometimes the AI we use can be almost too smart. Natan’s team wanted to see if their backend LLM would correctly detect database errors. They therefore created a Java class to generate an error.

Even though the LLM was aware of the error, it did not flag it as an exception. Instead, it applied a higher level of logical reasoning. It read the Java code, detected that the error was the intended outcome of the code, and concluded that it should therefore not be classed as an exception.

Natan’s post may be brief, but it is a fascinating example of the way in which the law of unintended consequences can assert itself when interacting with AI systems.

Products

oTel Desktop Viewer

It may seem odd that, despite the increasingly widespread adoption of the OpenTelemetry framework, not much in the way of third-party tooling has emerged. There is Dash0’s very handy oTelBin validation tool and the otel-cli utility, but not much else besides. It was a pleasant surprise, therefore, to stumble across oTel Desktop Viewer - a tool that generates visualisations of OpenTelemetry traces on your local machine.

This is a really handy utility for those times when you want to view traces but have not yet got around to setting up an oTel collector or a third-party backend. If you are a strictly command-line oTel ninja, then fear not: GitHub user Y.Matsuda has created a version that runs in a terminal. It would be great to see more oTel tooling like this emerging. If you are aware of any - then please let us know!
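If you want to give the viewer a try, the collector-free workflow is essentially to point your OpenTelemetry SDK’s OTLP exporter at the locally running tool. Here is a minimal Python sketch, assuming the viewer is listening on the default OTLP/HTTP port 4318 (check the tool’s docs for the exact endpoint it exposes):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans straight to the locally running viewer instead of a collector
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", "1234")
    with tracer.start_as_current_span("charge-card"):
        pass  # a child span, so the trace has some structure to visualise
```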

Treblle - Observability for APIs

APIs are at the core of microservice architectures, and monitoring their health is of critical importance. Treblle aims to provide a platform which brings together API management tools that might previously have been spread across multiple different solutions.

The platform’s Observability module provides the usual metrics and APM features you would expect, but also includes API-specific tooling such as detailed request and response information and endpoint detection. There are also modules for user analytics, security, governance and generating API documentation.

Hydrolix - A Low-Cost Streaming Data Lake

As log volumes have exploded, a number of vendors have appeared on the market offering solutions for ingesting and storing logs at vast scale but at a fraction of the cost of some of the incumbents. The latest entrant to this growing sector is Hydrolix, who describe their product as a “streaming data lake”. Despite the “data lake” tag, Hydrolix does provide support for indexing and transformation on ingestion.

The technical spec is quite impressive - it offers integrations with Kafka and Kinesis, and querying via SQL and Spark APIs. The web site has a simple tool for calculating cost. When you use it, you notice that the result includes ‘Cloud Provider Expenses‘ - this is because Hydrolix is not a SaaS product; instead, you buy a licence to run it on a K8S cluster. Having said that, the running costs quoted by the calculator are extremely competitive.

The CatchUp

We first covered Tracetest back in Edition 9 and its use of oTel traces to run distributed tests is a real game-changer. Since then, the product has seen a raft of major enhancements, including automated provisioning of environments, secrets management and full support for running Playwright scripts. It is an incredible tool to help maximise the return on your investment in OpenTelemetry.

Like Tracetest, ClickHouse are rolling out updates at a ferocious pace. One of the biggest new features is ClickPipes - a no-code solution for creating data ingestion pipelines. They have also rolled out a major integration with Fivetran - which includes connectors to over 500 data sources. Finally, ClickHouse Cloud is now generally available as a service on the Microsoft Azure platform.

Groundcover, the eBPF-based observability stack we featured in Edition 10, is not only gaining traction in the engineering community. It has also caught the eye of investors and is included in The Generalist’s Future 50 list as “one of the world’s highest-potential startups for 2024“.

From The Blogosphere

The What and Why of Percentiles

Percentile values are ubiquitous in SRE and Observability, and in this blog post, SRE expert Alex Ewerlof takes a step back and asks “what are they?” and “why do we use them?”. This is a really excellent recap on something that we probably all take for granted. It provides a useful refresher on some basic statistical terms as well as some good tips on analysing data sets. As a bonus, there is also a link to Alex’s open source tool for illustrating the concepts he discusses in the article.
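As a quick refresher on the mechanics - a toy example of our own, not taken from Alex’s article - percentiles are only a few lines of Python, and a small data set makes it obvious why p99 tells a very different story from the mean:

```python
import numpy as np

# 90 fast requests plus a handful of slow outliers (latencies in ms)
latencies = np.concatenate([np.full(90, 20.0), np.linspace(200, 2000, 10)])

print("mean:", np.mean(latencies))            # dragged upwards by the slow tail
print("p50: ", np.percentile(latencies, 50))  # the "typical" request
print("p95: ", np.percentile(latencies, 95))
print("p99: ", np.percentile(latencies, 99))  # what your unluckiest users see
```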

Digging Inside The Blazor WebAssembly Sandbox

WebAssembly is a really exciting technology, with the potential to revolutionise how apps are built and deployed. However, because it effectively runs in a sandboxed environment in the browser, achieving visibility can present significant challenges. For example, it is not possible to make HTTP requests from within a WebAssembly component - so it cannot directly submit telemetry to a backend.

In this thorough and technically astute article, Harry Kimpel, Principal Developer Relations Engineer at New Relic, looks at how the challenge of instrumenting a Blazor WebAssembly app can be met through a mix of JavaScript invocations in the .NET code and the use of the New Relic browser agent. We strongly believe that wasm is a technology which is ready for prime time, and an awareness of its fundamentals will be of great value.

📢 If you are interested in WebAssembly you might also like to check out the latest OpenObservability Talks episode, where Dotan Horovits talks to Taylor Thomas of Cosmonic.

Obirdability - Fowl Play With Grafana!!

Grafana dashboards have been put to all sorts of uses over the years - for everything from space missions to monitoring milk production. In this fun but highly informative article, Ivana Huckova and Sven Grossman walk us through building an observability system for bird song. Whilst this might sound slightly quirky, the techniques could be applied to all manner of applications which need to record and analyse audio inputs.

The article is a great showcase for a number of Grafana capabilities - including installing Alloy on a Raspberry Pi and adding context to dashboard data by dynamically querying sources such as Wikipedia and the Open-Meteo weather information service. The example dashboard also makes use of some really cool features such as Mixed Data Sources and Recorded Queries.

We had a hoot playing along with this and, since the software is open source, we didn’t end up with a large bill :-)

Videos

Going from 30 to 30 Million SLOs

The prospect of managing 30 million SLOs sounds like the stuff that SRE nightmares are made of. Obviously though, when you work at Google, it’s all in a day’s work. Presenter Alex Palcuie has been an SRE in the Google Compute division for over seven years.

In this talk, delivered to the London Observability Engineering MeetUp group, he charts how his team initially measured reliability across a small number of axes and dimensions but, as the Compute function expanded, the number of dimensions grew exponentially, leading to the 30 million SLOs referred to in the title. There are some extremely valuable insights into how Alex and his team evaluate the customer impact of outages in a system running thousands of services.

Another Big Conversation With Dash0

Jonah Kowall has probably forgotten more about observability than many of us will ever know. His CV includes senior leadership roles at Logz.io, Cisco, Kentik and Aiven. In this wide-ranging and highly thoughtful interview with Mirko Novakovic, he shares valuable insights on a range of hot topics in observability. The discussion on using LLMs for querying is really fascinating and highlights the enormous potential for interoperability between query languages.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the master of observation, Mr Sherlock Holmes:

“It is a capital mistake to theorize in advance of the facts.”