Moving Faster and Fixing Things
Datadog Snap Up Quickwit | The O11ys - Find Out Who Won!
Welcome to Edition #30 of the newsletter!
2025 - The Year of Moving Faster and Fixing Things
Welcome to the first edition of 2025. As we have said previously, 2024 was a pivotal year for observability, with huge progress being made in OpenTelemetry and significant advances being made in fields such as Front End observability, LLM observability, anomaly detection and more.
With the slew of recent announcements around advances in AI, Quantum Computing and compute power, it seems as though we are not just at an inflection point, but at an acceleration point. We think that as more and more vendors leverage AI, we will see ever more disruption and an ever-increasing rate of change. With so many products dangling the carrot of automated remediation, it may even be the year of moving faster and fixing things.
X/Twitter
As a rule, we do not engage in political commentary, but we feel the need to make an exception. Self-expression and the rational exchange of views and ideas are the lifeblood of a democracy. Unfortunately, we cannot ignore the fact that X/Twitter is no longer a “Town Hall”. We have looked on with increasing concern as Elon Musk’s feed has transformed into a platform for misinformation, division and attacks on democracy itself. We have now migrated over to Bluesky - if you have an account there it would be great to connect: https://bsky.app/profile/observability-360.bsky.social
Feedback
We love to hear your feedback. Let us know how we are doing at:
NEWS
Datadog Snap Up Quickwit!
Log aggregation specialists Quickwit have capped an incredible few years with the sensational news that they are now "joining" Datadog. The company was only founded four years ago but has won many admirers with its large-scale ingestion capacity and lightning-fast query engine.
Last year the company made headlines when they ousted Elastic to become the logging backend for hyper-scaler Binance. The acquisition means that Datadog now have some serious technological firepower at their disposal. Having acquired pipelining specialist Vector a few years ago, they now have a product capable of running sub-second queries over hundreds of millions of logs. The good news is that the Quickwit platform remains under a permissive open source licence, so the technology is still available for the community to maintain and take forward.
The O11ys - the Awards for the Observability Industry
On the 31st December 2024 we ran the second edition of the O11ys - the awards for the observability industry. The spirit of the awards is not to call out one system as being better than another, but to celebrate achievements in different aspects of observability.
This year there were some 15 categories, including Best Front End Tooling, Best LLM Tooling, Best OpenTelemetry Implementation and many more. The observability space is so vibrant and teeming with innovation that it is very difficult to pick out a “winner” for any given category. What is remarkable though, is that some of the very best new products are being developed by relatively new companies with very small teams. We think this is fantastically encouraging.
Click on the button below to find out who scooped the coveted awards as well as Special Mentions and Ones To Watch for 2025.
2025 in Preview, Part One: B Cameron Gain
In the past week or two we have read more 2025 previews than you can shake a stick at, and this article by B Cameron Gain ranks among the best for its clarity of vision and breadth of industry knowledge. In common with pretty much the rest of the world, he sees AI and OpenTelemetry as being big themes. He also sees observability shifting both left and right, a view which chimes with Gartner analyst Matt Crossley’s recent prediction that observability would be extending its reach across a broader range of business domains in 2025. As ever, there are alternative viewpoints - Dynatrace CTO Bernd Greifeneder recently described shift left as “a disaster for enterprises”.
2025 in Preview, Part 2 - The View From Grafana
With features such as Alloy and Adaptive Logs, Grafana were at the forefront of observability innovation in 2024, so it is interesting to get a glimpse into their thinking on the trends to watch for in 2025. Whilst most forecasters (including ourselves) lead with the themes of oTel and AI, Grafana kick off their preview with the prediction of a convergence between traces and profiling. If you read the article, you will probably agree that their rationale makes a lot of sense. They also give their thoughts on the cloud repatriation trend as well as potential developments in the growing field of Platform Engineering.
Products
Google’s Gemini Assist - Observability as a Widget
Microsoft’s GitHub Copilot kicked off a revolution by bringing the power of LLMs right into the developer’s IDE. Gemini Code Assist is Google’s equivalent of Copilot, but they have now upped the game by incorporating vendor plugins into their development environment. Initial partners in the program include vendors such as Dynatrace and Elastic. This opens up a whole new world of possibilities for leveraging observability insights within custom applications.
At a practical level there is no great magic happening here - the tooling is essentially a wrapper around the existing vendor APIs. At the same time, it abstracts away much of the complexity of query APIs and makes it easier to build solutions based on the principle of composable observability.
Airia - Enterprise Platform For AI Management
LLM observability is becoming an increasingly important concern and in our O11y awards we highlighted dedicated tooling such as Langtrace and Datadog’s LLM Observability product. Airia is a rather different proposition - it is a full-blown platform for managing LLM application portfolios. It is capable of enforcing high-level policies as well as lower-level technical functions such as scanning workloads for potential leakage of PII or sensitive data. However, its overall capabilities extend way beyond performance monitoring and governance. It is a sophisticated, enterprise-level solution which orchestrates processes such as model deployment and routing, and building agentic workflows.
From the Blogosphere
Getting Started With Kafka Monitoring
Kafka is the backbone of many modern distributed systems. Like Kubernetes, it can play a mission-critical role in enterprise infrastructure, but also like Kubernetes, installation and maintenance can require considerable levels of expertise. As well as maintenance and configuration, observability is critical to ensure that messages are not being lost or delayed.
In this Medium article, Data Engineer M. Çağrı AKTAŞ defines a model for monitoring Kafka clusters using the JMX Exporter, Prometheus and Grafana. The walkthrough uses a simple example of a 3-broker Kafka cluster and covers configuring the JMX Exporter as a metrics endpoint as well as configuring Prometheus scraping. It is a really digestible introduction to the principles of Kafka monitoring.
Open Source - The Community Strikes Back
The Open Source movement has been rocked by some major convulsions over the past few years as major projects such as Redis and Terraform have gone over to “the dark side”. Dotan Horovits is a CNCF ambassador and a leading commentator on the state of open source. His writings on the subject have been very prescient and he has warned of the dangers of the licensing rug being pulled for a number of years.
In this article, Dotan enumerates some of the high-profile licensing switches of recent years and recounts the ways in which the open source community have responded - with hugely successful forks such as Valkey and OpenTofu. The article contains many insights of value to any FOSS practitioners.
Mass Migration - Stripe’s Shift to AWS
This is a story of a major IT migration which somewhat goes against the grain. The usual trajectory for many very large scale companies is to either migrate from established vendors to newer, more agile offerings, or in the case of companies such as Shopify, to develop their own composite solutions.
In what can only be seen as something of a coup for AWS, Stripe, one of the world’s leading payments services, migrated their observability infrastructure from an unnamed “legacy” system to the AWS platform. The company track over 300m metrics across 10 thousand dashboards and are now using services such as Managed Prometheus and Managed Grafana. This is a really interesting case study in large-scale migration with a number of valuable takeaways.
SRE
Google Update The SRE Runbook
Google pretty much defined the fundamental principles of SRE - SLOs, error budgets, eliminating toil etc. - over 20 years ago. Those principles have been remarkably resilient and effective, surviving two decades of constant innovation. Arguably though, the cloud, microservices, hyper-scaling and other trends have fundamentally altered the IT landscape and Google are responding to this with a recalibration of the theoretical underpinnings of SRE practice.
The new thinking is explained in a paper authored by Benjamin Treynor Sloss (who coined the term Site Reliability Engineering) and Google SRE team leader Tim Falzone. This is quite a theoretical piece, delving into concepts of Control Theory and Systems Theory and explaining Google’s adoption of the STAMP (System-Theoretic Accident Model and Processes) framework. Who is going to be the first to refer to this as SRE 2.0? Not us 😀
OpenTelemetry
Getting Added Value With The Sum Connector
If you are familiar with the OpenTelemetry Collector then you will probably be aware of concepts such as receivers and exporters. These are building blocks which can be flexibly combined to build the overall telemetry pipeline. Another useful feature built into the oTel architecture is the connector - which, as the name suggests, can be used to join pipelines together to create advanced telemetry flows.
The Sum Connector is a special kind of connector that can create aggregations across arbitrary sets of numerical values and emit the results downstream. This article on the Splunk blog shows how you can use the Sum Connector to aggregate values stored in custom business attributes. This means that we can easily turn data from logs and traces into metrics. The article uses the example of summarising totals for sales and discounts, but the process could be applied to any business metric. This is a really useful article and highlights an important strategy for helping to align observability practice with business goals.
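To make the idea concrete, here is a rough Python sketch of the kind of aggregation described above. It is not the Sum Connector itself, nor its configuration. It is only an illustration of summing a numeric attribute across telemetry records and grouping the totals by another attribute. The function name sum_attribute and the attribute names order.total, order.discount and order.region are all made up for the example.

```python
from collections import defaultdict


def sum_attribute(records, value_key, group_key):
    """Sum a numeric attribute across records, grouped by another attribute.

    Hypothetical helper for illustration only; in a real pipeline the
    Collector's connector does this work for you.
    """
    totals = defaultdict(float)
    for attrs in records:
        raw = attrs.get(value_key)
        if raw is None:
            continue  # this record does not carry the attribute
        try:
            value = float(raw)  # attribute values may arrive as strings
        except (TypeError, ValueError):
            continue  # skip values that are not numeric
        totals[attrs.get(group_key, "unknown")] += value
    return dict(totals)


# Span attributes as they might look for a checkout service (invented data).
spans = [
    {"order.region": "eu", "order.total": 120.0, "order.discount": 10.0},
    {"order.region": "eu", "order.total": 80.0},
    {"order.region": "us", "order.total": "45.5", "order.discount": 5.0},
    {"order.region": "us", "http.status": 200},
]

sales_by_region = sum_attribute(spans, "order.total", "order.region")
discounts_by_region = sum_attribute(spans, "order.discount", "order.region")
print(sales_by_region)
print(discounts_by_region)
```

Each resulting total corresponds to one metric data point, with the grouping attribute playing the role of a metric label.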
oTel Can Play Nice With Windows
An artist’s impression of an oTel Collector
There is often a perception that using Open Source software on Windows is a bit like eating soup with your dessert spoon. It is perfectly possible, but it is not quite the done thing. Certainly, many samples for using OpenTelemetry tend to be based on instrumenting Linux VMs or cloud-native hosts. This article by Martin Thwaites and Vivian Lobo on the Honeycomb blog is a useful reminder that, as well as being vendor-neutral, OpenTelemetry is also an Operating System-agnostic framework.
Many of us will have used Helm charts to deploy the Collector on K8S instances, but for Windows there is an installer which can run the Collector either as a normal program or as a Windows Service. The article contains all of the configuration needed to collect logs from Windows Event Logging as well as Host Metrics and Windows Performance Counters.
Events RoundUp
2025 is kicking off with some really attractive Open Source events. If KubeCon is the Caesars Palace of Open Source, then FOSDEM is probably the Burning Man. It is a free two-day gathering of over 8,000 open source engineers attending over 900 lectures and lightning talks. Instead of corporate logos, expect crowded lecture theatres, Belgian waffles and maybe a beer or two.
After FOSDEM, you can hop on the Eurostar to London and take in State of Open Con 25, which opens on Feb 4th in Paternoster Square, nestled alongside the iconic St Paul’s Cathedral. The event comprises seven tracks including AI Openness, Software and Security and The Future of Open Source. There will be speakers from AWS, Percona and even the UK House of Lords!
After waffles in Belgium and chips in London, why not head off to Copenhagen in March for some Smørrebrød at the Experts Live event? This is a one-day event dedicated to Microsoft technologies, including sessions on AI, Azure, Microsoft 365 and more. At the end of the day you can feel the hygge with a free movie and refreshments.
That’s all for this edition!
Wishing you all peace, happiness and fulfilment in the coming year!
If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!
As it’s a new year, let’s start off with a note of optimism from Nelson Mandela:
“It always seems impossible until it’s done.”
About Observability 360
Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.
The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.