The New Table Stakes of Observability
Taming Kubernetes | AI-driven Debugging | oTel Certification
Welcome to Edition #29 of the Newsletter!
A Pivotal Year
2024 has turned out to be a huge, and maybe even pivotal, year for observability. The market continues to grow, new vendors are entering and products are rapidly diversifying.
The increasing competition has led to some occasionally fractious exchanges as fault lines opened up over issues such as OpenTelemetry and vendor costs. Whilst the topic of cost has dominated the debate for much of the year, it shouldn’t overshadow some of the really exciting advances we are seeing in critical areas such as browser observability, root cause analysis and anomaly detection.
The Ollys
Whilst we are reviewing the year, it might be an opportune moment to mention the Ollys - our unique awards for the Observability industry. We ran these for the first time at the end of 2023 and the reaction was overwhelming. We will send out an email later in the month with our end of year review and a link to the page for the 2024 awards.
Feedback
We love to hear your feedback. Let us know how we are doing at:
NEWS
Not Just Metrics! VM Now Does Logs As Well!
What’s in a name? Having started out as an alternative to Prometheus, Victoria Metrics have taken a step towards being a full stack player with the launch of Victoria Logs. The software is open source and, like the metrics product, has a strong focus on resource efficiency. They claim to use 30 times less RAM and 15 times less disk space than comparable systems.
Amongst the other design goals of the product are ease of setup and speed - apparently it is capable of ingesting over 100 million logs per second. The system can receive logs from a range of collectors including Promtail, Vector, Filebeat and many more. It also supports a host of other enterprise-level features such as multi-tenancy, alerting and backfilling. Will they soon be adding traces to their portfolio? Will they one day change their name to Victoria MELTs?
Taming Kubernetes With Datadog Active Remediation
Kubernetes may have established itself as the dominant paradigm for app hosting, but configuring and managing K8S environments is not trivial. With skills in short supply, vendors are aiming to ease the pressure by providing solutions which can assist with tasks such as maintenance and debugging. AWS recently released their EKS Auto Mode and Datadog have now rolled out Kubernetes Active Remediation.
This is tooling that is explicitly designed to ease K8S troubleshooting for non-expert users. If you have ever heaved a weary sigh at the sight of the dreaded “CrashLoopBackOff” error then help is, apparently, at hand. The promise is that you can troubleshoot without having to run the trusty kubectl logs command. The tooling identifies problematic clusters and then provides guidance with root cause analysis and remediation.
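For readers who have not met “CrashLoopBackOff” in the wild, here is a minimal Python sketch of the manual triage step that this kind of tooling aims to replace. It is not Datadog's implementation. It simply walks the JSON structure that Kubernetes reports for pods and picks out containers whose waiting reason is CrashLoopBackOff. The function name and the sample pod data are hypothetical and exist only for illustration.

```python
def find_crash_looping(pod_list):
    """Return (namespace, pod, container, restarts) for every container
    currently waiting with reason CrashLoopBackOff.

    pod_list is the parsed JSON of a Kubernetes PodList, i.e. a dict
    with an "items" list of pod objects.
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # A container state has exactly one of: waiting, running, terminated.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append((
                    meta.get("namespace", "default"),
                    meta.get("name", ""),
                    cs.get("name", ""),
                    cs.get("restartCount", 0),
                ))
    return hits


# Hypothetical sample data, trimmed to the fields the function reads.
SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {"namespace": "shop", "name": "checkout-7d9f"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "api",
                        "restartCount": 12,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"namespace": "shop", "name": "cart-5c2b"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "web",
                        "restartCount": 0,
                        "state": {"running": {"startedAt": "2024-12-01T10:00:00Z"}},
                    }
                ]
            },
        },
    ]
}

crash_looping = find_crash_looping(SAMPLE_POD_LIST)
```

In practice you could feed the function the parsed output of kubectl get pods --all-namespaces -o json, and then run kubectl logs with the --previous flag against each pod it returns to see the output of the container's last failed run.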
Lightrun Launch Visual Studio Extension
The problem of the disconnect between developers and observability solutions is something of a recurring theme. Lightrun is one of a number of tools that seek to bridge the gap by bringing observability into the developer’s IDE.
The Lightrun plugin can dynamically insert logs and metrics and provides visibility across a number of platforms. At the recent AWS re:Invent conference, the company announced two major updates to the product. The first is an extension for Visual Studio, which is now available as a public beta. The second is the launch of Dynamic Traces. This is a rather different beast to the kinds of distributed trace we find in OpenTelemetry. Lightrun has the concept of Snapshots, and a Dynamic Trace is a way of grouping together these Snapshots so that you can track variables across a particular workflow.
Elastic Survey: “99% of companies face challenges implementing observability”
If you are struggling with the many and varied challenges of implementing an observability strategy, then it would appear you are not alone!
According to a survey of observability “decision-makers” sponsored by Elastic, pretty much everybody else is in the same boat. The survey was carried out amongst companies with 500 or more employees and, interestingly, the biggest challenge was not configuring Kubernetes or monitoring microservices but dealing with the competing requirements of different teams.
Not all surveys are the same and this one genuinely has a number of findings of interest to any practitioner defining or implementing an observability strategy.
Products
LogicMonitor Shines at Gartner’s London Excel Event
LogicMonitor are something of a veteran of the Observability scene. The company was founded way back in 2007 and they released a SaaS version of their product as early as 2014. As they were an exhibitor at the recent Gartner Infrastructure and Operations Conference in London, I had the chance to see the system in action, and it stood out as a sophisticated and polished platform.
The system is highly versatile, but one feature which really stood out was its agentless architecture. For many customers, this is a decisive consideration as it means they can be up and running with telemetry flowing through the system in minutes rather than days or weeks.
As a company with a wide customer base and a long history they are used to having to adapt the system to cover new technologies. On the strength of this they have developed their own extensibility model, which allows users to develop their own complex workflows and custom resources.
Relvy - Plugging AI Into Your Observability Stack
The model of composable observability is one that holds considerable promise and Relvy is a great example of the concept. It is an AI platform for incident discovery and remediation that taps into the telemetry in your existing stack. As well as the standard observability signals such as logs and metrics it can also ingest data from sources such as collaboration tools and code repositories, to gain additional context for debugging issues.
There is a sandbox on the company web site where you can see the system in action and our initial impression is that it has a clean UI with a really intuitive user experience. One item of particular note is the pay-as-you-debug pricing model, where you get charged for each incident discovered by its agents. You might want to sort out your technical debt before signing up!
From the Blogosphere
The New Table Stakes For Observability
The latest lead article on the Observability 360 web site looks at the increasingly competitive nature of the observability market and argues that the minimum requirements for a “full stack” observability system are being raised.
More and more full stack vendors are enhancing their products with valuable new features such as front end observability, anomaly detection and AI-driven Root Cause Analysis. The article looks at powerful new capabilities being built out by a whole swathe of vendors including Middleware, Kloudfuse, Logz.io, Victoria Metrics and more.
It seems as though we are reaching a stage where the traditional “three pillars” of Logs, Metrics and Traces are merely the foundation, and vendors who do not build additional value on top risk being left behind.
eBPF For Everyone!
We all know that eBPF has had a revolutionary impact on observability technologies. Allowing access to the Linux kernel has opened up dramatic new possibilities for lightweight and scalable application monitoring.
As this article on the Equinix blog shows, though, eBPF is not the exclusive preserve of observability vendors. Its power is also being harnessed as a tool in its own right. The article looks at use cases such as infrastructure monitoring, system security and network traffic analysis. Although the examples cited look at implementations at “tech giants”, you don’t necessarily have to be an Uber or a Facebook to leverage the power of eBPF in your own enterprise.
Elastic’s EDOT Gets End-to-end Kubernetes Support
Elastic’s big push towards OpenTelemetry continues apace. Having already contributed their Universal Profiling Agent to the project, they have now beefed up their Elastic Distributions of OpenTelemetry (EDOT) package with deep support for Kubernetes environments.
EDOT integrates with the OpenTelemetry Kubernetes Operator and installs Elastic’s own distribution of the oTel Collector into your cluster. It includes the necessary receivers and processors to collect telemetry from K8S and forward it to your Elastic instance. You can then monitor your cluster performance via a set of out-of-the-box dashboards. The idea is to provide an end-to-end experience with a minimum of configuration.
ClickHouse: The SQL Revolution, One Year On
A year ago this week we ran with the headline “SQL Strikes Back” and covered an article on the ClickHouse blog that outlined the company’s vision for a SQL-based observability stack. At the time, the idea of using a SQL database as a backend for an observability system might have seemed fanciful. It was pretty much axiomatic that SQL could not be used for ‘big data’ and that telemetry data should be queried using a proprietary language.
In the intervening 12 months though, the company’s success in normalising (😉) the SQL observability paradigm has been stunning. This in-depth article by Dale McDiarmid and Ryadh Dahimene charts the ClickHouse observability journey over the past year and highlights key developments such as building in full support for JSON as a native type. This is a lengthy article but it is a really valuable read for anybody interested in SQL-based observability.
OpenTelemetry
Linux Foundation Launch oTel Certification!
OpenTelemetry is rapidly establishing itself as a central plank of observability practice but many organisations are reporting a lack of the necessary skills. The OpenTelemetry Certified Associate (OTCA) qualification marks a significant step in recognising subject matter expertise and providing a framework for gaining familiarity with the technology.
The curriculum covers competencies such as The OpenTelemetry Collector and Maintaining and Debugging Observability Pipelines. Although you can purchase entry to the exam as of now, you will not be able to schedule an exam until January 2025.
Grafana Launch oTel Collector Fleet Management
Grafana were one of the first big vendors to provide a wrapper for the oTel collector with the release of Grafana Alloy back in April of this year. One feature which was missing was Fleet Management, but this has now been rectified in the latest iteration of the product.
Best practice dictates that you would not just run a single oTel collector. Instead, you would run multiple collectors, with each dedicated to a particular type of signal or service. As you scale up, this can mean running hundreds or even thousands of collectors. The Fleet Management solution provides the tooling for managing installations at this kind of scale. It is designed both to simplify configuration and to enable remote activation and deactivation of pipelines.
OpenTelemetry Tackle LLM App Development
With LLM application development rapidly emerging as a domain in its own right, there is an obvious need for standardisation around LLM telemetry. Whilst existing oTel patterns can cover processes such as connectivity with endpoints and runtime errors, there is a whole class of AI-specific behaviours that need to be modeled and codified.
This article on the OpenTelemetry blog is a really important update on the current state of play of OpenTelemetry coverage for Generative AI. Python has pretty much established itself as the lingua franca of AI development and the article looks at the Python Instrumentation Library for OpenAI, as well as discussing the equally critical issue of Semantic Conventions.
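To give a flavour of what Semantic Conventions mean in practice, here is a small, dependency-free Python sketch. It does not use the OpenTelemetry SDK or the OpenAI instrumentation library. It only builds the kind of attribute dictionary that an instrumented LLM call might attach to a span. The attribute keys follow the Generative AI conventions as I understand them at the time of writing. They are still marked as experimental, so check the current specification before relying on them. The helper names, model name and token counts are hypothetical.

```python
def genai_span_attributes(model, input_tokens, output_tokens,
                          system="openai", operation="chat"):
    """Build span attributes for a single LLM call.

    Keys are taken from the (experimental) OpenTelemetry GenAI semantic
    conventions; the values passed in here are illustrative only.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def genai_span_name(attributes):
    """Name the span as '<operation> <model>', the pattern the conventions suggest."""
    return "{} {}".format(
        attributes["gen_ai.operation.name"],
        attributes["gen_ai.request.model"],
    )


# Hypothetical example call: a chat completion that used 42 prompt tokens
# and produced 7 completion tokens.
attrs = genai_span_attributes("gpt-4o-mini", input_tokens=42, output_tokens=7)
name = genai_span_name(attrs)
```

The point of agreeing on keys like these is that any backend can then chart token usage or latency per model without knowing which library or vendor produced the span.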
COST MANAGEMENT
OpenCost Advances to CNCF Incubation
The issue of costs has been one of the hottest topics in observability over the past few years. If you are running services in Kubernetes, you will know that, despite bringing benefits in terms of resilience and scalability, it can also have major cost implications. OpenCost is an open source project aiming to help companies reduce their Kubernetes spend and it has now been accorded CNCF incubating status.
The OpenCost UI provides visualisations for costs on all of the major cloud providers and also has a range of plug-ins for tracking costs across numerous sources such as Datadog, MongoDB, OpenAI and more.
Videos
Object Storage Is All You Need
Whilst the title of this talk may have echoes of a classic Beatles song, getting someone to watch a video on Object Storage could be a hard sell even for the Fab Four.
If you can suspend judgement for a minute or two though, you will find that Justin Cormack’s exploration of S3 is more of a Magical Mystery Tour than a Hard Day’s Night. Justin is both a master of his subject and an excellent story-teller and this is a really informative and enjoyable watch.
Open Source Observability Day
The inaugural Open Source Observability Day took place in October, and videos of the sessions are now available on the OSODay YouTube channel. There were some really excellent presentations, with highlights including a keynote talk from Charity Majors outlining the principles of Observability 2.0 as well as sessions by Josh Lee and Ryadh Dahimene on SQL-based observability. I found the talk on the Netdata observability platform by Costa Tsaousis to be a real eye-opener. It is not a product I was familiar with and this is a fascinating dive into its philosophy and architecture.
That’s all for this edition!
If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!
This week’s quote is from artificial intelligence researcher Eliezer Yudkowsky:
“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.”
About Observability 360
Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.
The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.