Elastic's Open Source U-Turn

Say Hello To Dash0 | Automating Your Logs

Welcome to Edition #24 of the Newsletter!

A Summer Blaze!

Hello! Summer is meant to be a quiet season for news, but we went on holiday for a few weeks and the biggest observability story of the year broke! Last month, Elastic, who caused uproar in the Open Source community with their pivot to a proprietary license in 2021, dramatically announced that they were returning to the Open Source fold. Elastic CTO Shay Banon has said that he expects the move to be greeted with cynicism in some quarters and this will inevitably be the case. Our take is that this is a really positive development and maybe even a cause for celebration.

Dash0 Goes Beta

No sooner had we unpacked our suitcases than we were hit with more huge news. Dash0 - the latest project of Instana co-founder Mirko Novakovic - has gone into Public Beta. The initial buzz around the product suggests it is primed to be a significant player in the observability market and we will be providing further details in future editions of the newsletter and on the Observability 360 web site.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Elastic Announce Stunning Open Source U-Turn

Elastic last month broke one of the biggest observability stories of the year, announcing that they were making their flagship ElasticSearch product available under a license which conforms to the requirements of the Open Source Initiative.

In recent times, major vendors such as HashiCorp (with Terraform) and Redis have both incurred the wrath of the Open Source community by flipping over to proprietary licensing models. Elastic have now announced a stunning reversal of this trend by committing to place the ElasticSearch code base under a GNU AGPL licence (the same licence used by Grafana). In a highly effusive post on the Elastic blog, CTO and Founder Shay Banon declared that “open source is in my DNA” and revealed that the company had even healed their rift with their former arch-enemy, AWS.

In retrospect, one could maybe see hints that Elastic were laying the groundwork for a reconciliation with the Open Source community. Initiatives such as open-sourcing the Elastic Profiler and declaring their full commitment to the OpenTelemetry Project could be seen as harbingers of a strategic shift. Despite this, few would have predicted a full 180-degree turn back to the Open Source model.

Clearly, within the community, there will be some trust issues to be resolved. Having pulled up the ladder once before, Elastic will have to work hard to get the community back on board. One factor in their favour though, is the opportunity to leverage the power and sophistication of the current ElasticSearch code base - which has benefited from substantial investment and boasts a number of leading-edge capabilities.

Dash0 - A Major New Observability Challenger

You may not be aware of the name of Mirko Novakovic, but you will almost certainly be aware of Instana - the revolutionary observability product he co-founded, and which was later acquired by IBM. Having created one ground-breaking observability product, he has now returned with Dash0 - a full stack observability system based on a re-thinking of observability first principles.

The company have been trailing the new product in blog posts and videos on social media for some time, as well as demo-ing it at recent tech conferences. So, what can we expect? Well, some of the key themes of the product include:

  • native oTel support

  • transparent pricing

  • application speed

  • advanced filtering

  • fully integrated Real-User-Monitoring

  • a single query language for all telemetry signals

The initial feedback on the beta has been extremely positive and we think that Dash0 have all the attributes to establish themselves as a major player within a comparatively short time span.

Azure Introduce GPU Monitoring

With the increasing use of LLMs in corporate applications, we have seen the emergence of dedicated LLM observability tooling to monitor performance at the application and data level. Obviously, LLMs can also place huge demands on system resources, and Microsoft are responding to this with the inclusion of metrics such as Framebuffer Memory Usage, GPU Utilization, Tensor Core Utilization, and SM Clock Frequencies in the Azure Managed Prometheus offering.

Unfortunately, this is not just a simple, autonomous plug-in: it does require a certain amount of plumbing to gather, export and display the metrics. This article on the Azure Community Blog covers configuring the dependencies and using Helm charts to make the relevant updates to your Prometheus and Helm deployments. Watch those dials!
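As a rough illustration of what sits at the end of that plumbing, the sketch below parses a few lines of Prometheus text exposition and flags GPUs that are running hot. The metric names (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED) are an assumption based on the naming used by NVIDIA's DCGM exporter; the setup described in the article may expose different names. The sample values and the 90% threshold are made up for the example.

```python
import re

# Hypothetical sample of Prometheus text exposition output. The metric names
# are assumed to follow NVIDIA DCGM exporter conventions; values are made up.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 12
DCGM_FI_DEV_FB_USED{gpu="0"} 38912
DCGM_FI_DEV_FB_USED{gpu="1"} 1024
"""

# Toy pattern: handles only `name{gpu="N"} value` lines, not the full format.
LINE = re.compile(r'^(?P<name>\w+)\{gpu="(?P<gpu>\d+)"\}\s+(?P<value>[\d.]+)$')


def parse_gpu_metrics(text):
    """Return {metric_name: {gpu_id: value}} for the simple lines above."""
    metrics = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:  # skips comments and anything the toy regex can't read
            continue
        metrics.setdefault(match["name"], {})[match["gpu"]] = float(match["value"])
    return metrics


def busy_gpus(metrics, threshold=90.0):
    """List GPU ids whose utilization is at or above the threshold."""
    util = metrics.get("DCGM_FI_DEV_GPU_UTIL", {})
    return sorted(gpu for gpu, value in util.items() if value >= threshold)


if __name__ == "__main__":
    print(busy_gpus(parse_gpu_metrics(SAMPLE)))
```

In a real deployment you would more likely express the same check as an alerting rule or dashboard panel than as a script; the point is only that each GPU arrives as its own labelled series.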

Grafana’s Stock Rises Ever Higher

Our lead story in this edition of the newsletter is the dramatic return of Elastic to the open source fold. There may be many reasons behind this decision, but one can’t help but consider that they took into account the inexorable rise of Grafana in recent years. Grafana take a hybrid approach of fully open sourcing their software whilst also providing a managed cloud commercial model. They have also exhibited a relentless focus on product improvement, expansion and innovation which places them at the top end of observability vendors.

This twin-track strategy has been a spectacular success, enabling them to appeal to developers and major enterprises and rapidly gain market share. Investors are clearly impressed with this formula and a recent post on the company’s blog announced the completion of a $270m funding round, which values the company at $6bn. The post also reeled off some impressive numbers - Grafana now has over 5,000 paying customers and Annual Recurring Revenue in excess of $250m. As one respondent on LinkedIn commented - they have “come a long way since 2016”.

Inside The Gartner Observability Magic Quadrant

Gartner recently published their 2024 Magic Quadrant for Observability. This is a major report produced by a highly respected organisation and its approach of analysing company performance across particular dimensions creates a really interesting schematic. At the same time though, we think that the famous graph is open to misinterpretation and over-simplistic use for marketing purposes by organisations that make it into the much prized “Leader” Quadrant.

This article on the Observability 360 web site is a brief reminder that the Gartner Magic Quadrant is not a league table or a ranking system and should not be digested without understanding its overall context and philosophy.

Products

Patchwork: AI-Driven Logging Automation

Logs are obviously a crucial diagnostic tool. Unfortunately, logging standards may vary across development teams, leading to inconsistent or incomplete log data. Many of us will know the pain of trying to troubleshoot an issue where the only log message is “Error occurred”. This is where Patchwork comes in.

Patchwork is a tool that aims to take the slog out of logging by using AI to analyse application code and automatically generate structured logs for it. In this really interesting article on Hacker News, the team behind Patchwork provide more detail on the product whilst also discussing some of the profound technical challenges involved in developing it.

The product can currently be previewed in demo mode and is limited to updating existing logs, but dynamic generation of logs for new code should be coming soon.

At the moment, many businesses are paying large amounts of money for, essentially, exporting and storing vast amounts of useless logging data. We think that, in the longer term, tools which can apply intelligence to logging at source can be of enormous business value.
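To make the difference concrete, here is a minimal sketch of the kind of rewrite such a tool aims for: the same failure logged as a bare message and then as a structured JSON record. This is not Patchwork's output or API. The function and field names (charge_card, order_id, amount, context) are hypothetical, and only Python's standard logging and json modules are used.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a JSON object, merging in any `context` dict."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is a hypothetical attribute supplied through `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("payments")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def charge_card(logger, order_id, amount):
    """Hypothetical business function that always fails, for illustration."""
    try:
        raise TimeoutError("gateway did not respond")
    except TimeoutError as exc:
        # Before: logger.error("Error occurred")
        logger.error(
            "card charge failed",
            extra={"context": {
                "order_id": order_id,
                "amount": amount,
                "error": type(exc).__name__,
                "detail": str(exc),
            }},
        )


if __name__ == "__main__":
    charge_card(build_logger(sys.stdout), order_id="A-1001", amount=42.5)
```

The structured version costs a few extra lines at the call site, but every field becomes something you can filter and aggregate on downstream.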

Kloudfuse: Cloud-Native Observability And More

There are already a number of cloud-native platforms on the market, but Kloudfuse boasts a number of attributes to differentiate it from other products in the space. One of the primary differences is that it is not a SaaS product. Instead, you install the stack on a K8s cluster running on AWS, GCP or Azure. The minimum spec for running Kloudfuse is one node with 8 cores and 32GB of memory. The system architecture uses a number of open source components including Kafka, Apache Pinot and Grafana for UI.

The company claims that its product can provide significant cost savings by storing data in the customer’s own environment - although, naturally, cloud storage costs can quickly ramp up if they are not effectively and proactively managed. Although the data plane resides in the customer’s environment, the control plane runs in the cloud. As well as metrics, logs and traces, the system also supports events, SLOs and advanced analytics.

As a company, Kloudfuse seem to have a solid base, having been through two rounds of funding and attracted clients such as GE Healthcare and Workday. You can find an independent analysis of the company in this recent report by 451 Research.

From the Blogosphere

How Incident.io Do Observability

Incident.io are one of the leading providers of SaaS-based Incident Management solutions. As their service needs to be available 24/7, it is obviously important they have their own observability house in order. In this article on the Incident.io blog, Product Engineer Martha Lambert provides a number of really valuable insights into the company’s own observability strategy.

Obviously, there is no such thing as a one-size-fits-all approach to observability and SRE, but the opportunity to learn about the methodologies and principles that market-leading companies use to ensure high availability is always valuable.

Above all, we really like the way the author emphasises the value of fundamentals such as the importance of a good UX - so that every engineer can understand your system with a minimum of friction - as well as building your tooling in a structured manner that enables fast and intuitive drill-down.

How Zomato Souped Up Their Metrics With VM

Zomato is a restaurant aggregator and food delivery service that generates vast volumes of metrics. As their company grew, they adopted a Prometheus/Thanos-based architecture - running some 144 Prometheus servers. As metrics volumes continued to skyrocket, even this architecture started to creak and the Zomato SRE team began the search for an alternative solution.

In this article on the Zomato blog, the team explain why they opted to migrate to VictoriaMetrics and discuss a number of features of the system which enable them to achieve better performance, lower costs and greater scalability.

The technical challenges were pretty daunting - the project involved migrating over 800 dashboards, 300 microservices and 2.2 billion active time series. We would commend this article not just for its technical insights but also for taking a warts-and-all approach in documenting some of the technical limitations of the VM solution.

Deep Network Observability With Hubble

Isovalent’s eBPF-based Cilium application is rapidly establishing itself as the de facto solution for network management. Hubble is the observability component which ships with the Cilium package and, in this three-part series, Isovalent engineer Shedrack Akintayo provides a really useful primer for anybody looking to get started with the technology.

The series starts off with a clear overview of Hubble and then goes on to look at installation and configuration as well as identifying use cases. Even though the series is spread over three articles, they are all clearly written and easily digestible and can probably be read in one session without any great exertion.

Prometheus Exporters - Small Is Beautiful

There are probably few people on the planet who know more about Prometheus than Julius Volz, so when he gives advice on best practice for optimising metrics processing, we will certainly take notice. The Exporter is a critical component in Prometheus architecture and in this expert article, Julius explains the advantages of using multiple exporters, each aimed at a specific target, rather than a single multi-target exporter.

This is an extremely enlightening article that exposes a number of weaknesses in the single-exporter approach - such as losing granularity in target health monitoring. Excuse us as we head off to re-factor our metrics instrumentation 😯
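The health-granularity point can be shown with a toy model. The sketch below is not taken from the article and makes no Prometheus calls. It simply mimics how a scrape yields an `up` value of 1 or 0 per scrape target, and assumes a multi-target exporter that fails its whole scrape when any one backend fails (a real one might instead silently drop that backend's metrics). The backend names are hypothetical.

```python
def scrape(collect):
    """Mimic a Prometheus scrape: `up` is 1 if collection succeeds, else 0."""
    try:
        collect()
        return 1
    except Exception:
        return 0


def healthy():
    return {"requests_total": 10}


def broken():
    raise ConnectionError("backend unreachable")


# Hypothetical backends; the names are illustrative only.
BACKENDS = {"db-1": healthy, "db-2": broken, "db-3": healthy}


def up_with_exporter_per_target(backends):
    """One exporter (and so one scrape target) per backend: one `up` each."""
    return {name: scrape(collect) for name, collect in backends.items()}


def up_with_single_exporter(backends):
    """One exporter that polls every backend inside a single scrape."""
    def collect_all():
        for collect in backends.values():
            collect()
    return {"multi-exporter": scrape(collect_all)}


if __name__ == "__main__":
    print(up_with_exporter_per_target(BACKENDS))
    print(up_with_single_exporter(BACKENDS))
```

With one exporter per target, the failing backend is identifiable from `up` alone. With the single exporter, all you learn is that something behind it is unhealthy.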

OpenTelemetry

OpenTelemetry Governance Unveiled

Over the past year we have covered a whole raft of articles on technical aspects of OpenTelemetry - SDKs, sampling, file-logging etc. As the second largest project in the CNCF though, the project also has numerous organisational and administrative aspects. Add into the mix the fact that almost every big name in IT is involved in the project and you realise that there are also any number of corporate political dynamics at play.

In this article on the Grafana blog, Principal Engineer and OpenTelemetry Governance Committee member Juraci Paixão Kröhling pulls back the curtain on the work which goes on in the background to keep the project on track and how committee members need both technical expertise and a certain measure of diplomatic finesse. This is a really interesting insider view of the inner workings of the biggest project in observability.

Telemetry Transformations With The oTel SpanConnector

Last9 are the creators of Levitate - an observability platform specialising in managing high cardinality telemetry. They also have a blog with many high quality and pertinent articles on observability theory and practice.

Transformations are a great way of extracting additional data from your telemetry flows, and, in this blog article, Prathamesh Sonpatki looks at how metrics can be derived from traces using the oTel SpanConnector. This is particularly valuable if you are using an instrumentation SDK with little or no support for metrics. Like most of the technical articles on the Last9 blog this is written with great clarity and readability and pretty much zero vendor self-marketing.
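For readers new to the idea, deriving metrics from traces boils down to grouping finished spans and counting them. In the Collector this is done through configuration rather than code, so the Python below is only a conceptual sketch of the aggregation, not the connector's implementation. The span fields, sample values and bucket bounds are all assumptions made for the example.

```python
# Hypothetical finished spans; the field names are illustrative and do not
# follow the OpenTelemetry data model exactly.
SPANS = [
    {"service": "checkout", "name": "POST /pay", "duration_ms": 120, "error": False},
    {"service": "checkout", "name": "POST /pay", "duration_ms": 480, "error": True},
    {"service": "checkout", "name": "GET /cart", "duration_ms": 35, "error": False},
]

BUCKETS_MS = (50, 100, 250, 500)  # assumed histogram upper bounds


def new_series(buckets):
    """Create an empty metric series for one (service, operation) pair."""
    histogram = {bound: 0 for bound in buckets}
    histogram["+Inf"] = 0
    return {"calls": 0, "errors": 0, "duration_buckets": histogram}


def spans_to_metrics(spans, buckets=BUCKETS_MS):
    """Aggregate spans into call counts, error counts and cumulative
    duration histogram buckets, keyed by (service, span name)."""
    metrics = {}
    for span in spans:
        key = (span["service"], span["name"])
        series = metrics.setdefault(key, new_series(buckets))
        series["calls"] += 1
        series["errors"] += int(span["error"])
        # Cumulative buckets: a span counts in every bucket it fits under.
        for bound in buckets:
            if span["duration_ms"] <= bound:
                series["duration_buckets"][bound] += 1
        series["duration_buckets"]["+Inf"] += 1
    return metrics


if __name__ == "__main__":
    for key, series in spans_to_metrics(SPANS).items():
        print(key, series)
```

Request rate, error rate and latency distribution all fall out of data the tracing SDK was already producing, which is why the approach suits SDKs with weak metrics support.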

Observability People

Diana Todea

In the second profile in our Observability People series we are very proud to feature Diana Todea. Diana is a Senior SRE, Elastic alumna and regular speaker at tech conferences. She is a respected advocate for women in technology and has a strong interest in the potential impacts of Gen AI on observability practice. In this Q+A she talks about minimal tooling, the importance of layering and her work on an oTel certification!

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

Our quote for this edition of the newsletter is from the zoologist Paul A. Meglitsch:

“Nearly every great discovery in science has come as the result of providing a new question rather than a new answer.”