Datadog Unleashed!

Quickwit Go Stratospheric | Batch To The Future

Welcome to Edition #21 of the Newsletter!

The Road To OpenTelemetry

In the last edition of the newsletter, we covered David Cramer’s no-holds-barred takedown of the OpenTelemetry project. Whilst a few individuals in the community have echoed some of those sentiments, the consensus seems to be that oTel has won the day.

As things currently stand, almost every established vendor has now swung behind the project, and it is hard to think of any new entrants to the market that don’t support OpenTelemetry. For us, the question is no longer whether to adopt OpenTelemetry - it is now the standard. The discussion has moved on to how to implement it and how the technology will evolve. The road to oTel may not be paved with gold, but it is a better path than the maze which has gone before.

Observability In The Dock?

Is this the Summer of Discontent? As well as rumblings about oTel, there have also been mutterings about the value of Observability itself (see our Observability In The Dock feature below). Over the past few years, we have seen the field evolve from Monitoring to Observability and then to Observability 2.0, with some very impressive technical strides along the way. However, with observability consuming a significant proportion of IT budgets, there is no room for complacency, and it is healthy to keep asking whether observability systems are delivering results.

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Datadog Unleashed

DASH is the annual meeting for the Datadog community and, if the assembled faithful were hoping for some tasty morsels of news at last month’s event, then they were not disappointed. Attendees were hit with so many new features that they may have left the vast Javits Center feeling somewhat punch-drunk.

First up was the announcement of Datadog On-Call. Whilst this could be seen as parking a tank on PagerDuty’s lawn, it is also a product that makes sense in the context of the existing Datadog portfolio, which already boasts its own Incident Management module.

From a technical point of view, perhaps the key announcement was the news that the Datadog Agent will now ship with a fully embedded OpenTelemetry collector. Previously, the oTel collector and the Agent worked as separate telemetry channels within the Datadog landscape, and the two did not align either with each other or with the full range of Datadog’s backend functionality. With the latest release of the Agent, it seems as if this mismatch has been corrected. Another benefit for admins is that fleets of oTel Collectors can now be managed by the Datadog Fleet Automation tool.

The third major announcement was Log Workspaces, a product which brings ETL-like capabilities to Log Querying. With Log Workspaces, users can connect logs from heterogeneous data sources, run transformations on the data and then join the transformed data structures. This will enable users to gain rich insights by pooling data from across the enterprise.

Datadog may be the darlings of investors, but this event proved that they have certainly not turned into fat cats, and, in engineering terms, they are still running with the leaders of the pack. As well as building their portfolio outwards they have also shown a good nose for sniffing out the future direction of the core product.

We only have space to cover a small sample of the 19 new products or features that were announced at DASH. For a full summary, click on the button below.

Observability In The Dock?

In a recent LinkedIn post, Josh Grose, a former Product Manager at Splunk, expressed his frustration at the current state of observability, arguing that it had become “a compulsory tax on software with little to no accountability to deliver anything more than dashboards”. That is a pretty hard-hitting critique, but it seemed to strike a chord. The post attracted over 70 comments - many of them sympathetic - from a number of expert voices.

Many contributors expressed the familiar concern about spiralling observability costs, whilst others felt that the technologies themselves were not really delivering results. Interestingly, this chimes with a recent report by the Continuous Delivery Foundation, which asked whether DevOps is a movement running out of steam. Amongst other findings, the survey reported that, in some companies, Mean Time To Repair had actually increased compared with four years ago.

We have weighed in with our own comment on the LinkedIn post - putting the case for the defence. However, if there is a perception that observability systems are not delivering, then maybe there is a need to engage more closely with stakeholders and establish clearer goals and outcomes.

Microsoft Unveil Advanced Network Observability for AKS

Microsoft have announced the release of their Advanced Network Observability service, which is described as the inaugural feature of a comprehensive Advanced Container Networking Services (ACNS) suite. So - what is the functionality behind this brand-name salad? Well, it is a network observability stack which, essentially, decouples Hubble from Cilium at the base layer.

This means that users can obtain Hubble insights without having to use Cilium as their Container Network Interface (CNI). Telemetry is then fed from each AKS node to an Azure Managed Prometheus instance, which serves as the data source for a set of dashboards running in an Azure Managed Grafana instance.

Naturally, this does not come for free. Running ACNS on an 8-node AKS cluster would cost around $144 per month. On top of that, you would also need to factor in the costs of the Managed Prometheus and Grafana instances.

Quickwit - To Infinity And Beyond…

Quickwit may not be one of the biggest names in observability (yet), but a story on their company blog really signals their entry into the industry’s Premier League. Binance are the world’s leading crypto exchange - processing 21 million log lines per second. They are now using Quickwit as their logging backend, having migrated away from Elasticsearch. Using Quickwit, Binance are able to index 1.6 petabytes of logs per day and have slashed compute costs by 80% and storage costs by 95%.

Amazingly, the indexing and search performance achieved by Binance surpassed even the expectations of Quickwit’s own engineers, who had never previously had such a huge dataset for benchmarking their system. The article details how the Quickwit architecture was able to cope with enormous ingestion loads, as well as highlighting the ingenuity of the Binance engineers who pushed the system to its known limits - and beyond.

Products

Veriom - From Insights To Actions

One of the defining characteristics of Observability 2.0 is that it is a proactive and strategic practice which uses data-driven insights to improve performance across the organisation. This involves more than just gathering telemetry and requires looking at the bigger picture.

This is the aim of Veriom - a tool which combines telemetry from your observability systems with inputs from multiple sources across the enterprise to give an overall picture of system health. Not only does it generate reports and diagnostics, it can also provide detailed recommendations for action and even, in some cases, perform remediation itself. The scope of the tool is broad and ambitious, and it can play a role in intelligently turning the actionable insights gained from your data into concrete strategies.

MyDecisive - Building The ‘Observability OS’

MyDecisive is a company founded in 2023 by Ari Zilka - formerly CTO of Hortonworks and general manager of the New Relic incubator. The vision of MyDecisive is nothing less than to create an ‘Observability OS’, which will enable teams to “connect any data source, integrate context seamlessly, have full control over data storage, and automate actions with programmatic control”.

The project is still in its formative stages - it describes its status as “pre-Alpha”. The premise, however, seems to be a fascinating one and it will be really interesting to see how the product takes shape. The project is open source, so if you are curious to see the code behind a potentially paradigm-shifting platform, you can head over to their GitHub repo.

Micrometer - Getting The Measure of Java

Micrometer is not a new product; in fact it has been around since 2017 and has achieved great popularity within the Spring community. It was brought to our attention by Jay DeLuca in his recent Observability People profile, and we felt it merited a shout-out.

Micrometer is a specialist tool which does one job and does it well. It works as a facade for metrics observability within JVM-based applications and provides Java developers with really simple semantics for creating constructs such as Meters, Gauges and Timers. It also has bindings for instrumenting a range of endpoints, including Kafka, MongoDB and Jetty, as well as flexible options for routing metrics to a large number of observability backends.

From the Blogosphere

Logging in Golang - A Deep Dive

We have frequently mentioned that the SigNoz blog is a great source of observability content, and this very in-depth guide to logging in Golang certainly fits that description. Interestingly, the author, Aayush Sharma, decided to use the slog logging library rather than the OpenTelemetry SDK for Go. However, no two applications are the same, and different use cases may require different tooling.

The article provides useful detail on the benefits of slog and covers features such as structured logging, handlers, child loggers and a number of other advanced topics. There is also a sample app and guidance on setting up an oTel Collector to forward logs to a SigNoz endpoint.

Network Observability - And Why You Need It

For those of us who work almost exclusively in the cloud, it can be easy to get swept up in the cloud-native mindset and forget that vast swathes of IT infrastructure still exist in physical data centres and corporate LANs and WANs. In this article on the Kentik blog, Tech Evangelist Leon Adato reminds us that the network has not gone away, and that the on-premise and hybrid world represents a huge part of the observability space.

In this witty, engaging and informative article, Leon illustrates how knowledge of some fundamental networking principles is key to making sense of anomalies such as spikes and latencies, and to the effective management of network traffic in general. Leon is a really excellent communicator, and this is both an eye-opening and entertaining read.

Batch to The Future With S3

It has frequently been observed that one of the key differentiators in newer observability systems is their ability to leverage low-cost and highly scalable storage technologies such as S3. In this really illuminating article, Pablo Matias Gomez, a Senior Software Engineer at Embrace, details how his team achieved a 70% reduction in storage costs by switching from Cassandra to an S3-based batch object store.

As is often the case, though, the devil is in the detail. Whilst S3 is a technology which can deliver incredible economies, its pricing model means that highly write-intensive applications can turn out to be prohibitively expensive. This article provides a fascinating insight into how the Embrace team engineered their way around this challenge and will be of interest to any engineer working with S3 storage.

Robusta’s AI Plug-In: A Helping Hand For SREs

We are possibly at the peak of the hype cycle around AI and, in the context of observability, this has included speculation that human engineers might soon be replaced by AI-powered bots.

In this Medium article, Platform Engineer Artem Lajko takes a rather more grounded approach and looks at how the AI Plug-In in Robusta.dev (a K8S monitoring and admin tool that we featured in Edition 13 of the newsletter) can be harnessed to help reduce the load on Platform Teams. In particular, he looks at how the AI features can potentially make junior engineers more confident and productive.

The article is quite in-depth and covers a lot of ground, including Robusta architecture and installation as well as integrations with notification sinks such as Slack and Teams.

OpenTelemetry

Advisory - Hardening of the oTel Collector

The maintainers of the OpenTelemetry Collector have posted an update regarding an important change in oTel Collector behaviour. The change is effective from version 0.104.0, which was released last month. Prior to this version, the default behaviour for oTel Collector servers was to bind to the IP address 0.0.0.0. Unfortunately, this represents a potential security vulnerability and, from version 0.104.0, servers will now bind to localhost. If your OTLP receiver config looks like this:
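A typical configuration of the old style binds the OTLP receiver to all interfaces (4317 and 4318 are the standard OTLP gRPC and HTTP ports):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```

From 0.104.0 onwards, either bind explicitly to localhost (or to the specific interface you actually intend to expose) or rely on the new, safer default.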

Then you may need to update it. Please see the blog article for full details.

How Cloudflare Migrated From syslog-ng

In the last edition of the newsletter we covered the OpenTelemetry Getting Started Survey, which reported, amongst other findings, that users were seeking more information on case studies and reference architectures. When it comes to observability at scale, there are probably not many case studies better than this article from the Cloudflare blog.

Their global logging pipeline ingests millions of log events per second and the article chronicles the challenges they faced in migrating from syslog-ng to the OpenTelemetry Collector. The article contains some incredibly valuable insights into how Cloudflare built out their tooling as well as documenting the lessons they learnt from their migration.

Instrumenting PHP Applications With The oTel SDK

PHP may have dropped down the rankings of the most popular programming languages; however, it still powers around 75% of all websites whose server-side language is known, and represents a huge codebase that needs to be instrumented. This article on the Better Stack site is a really useful hands-on guide to instrumenting a sample distributed PHP system using the OpenTelemetry SDK.

The article is full of really useful detail on using the oTel SDK and integrating it with Monolog - regarded as the de facto logging standard for PHP applications. In addition to this, there is guidance on setting up an oTel Collector and connecting it to a logging backend such as - surprise, surprise - Better Stack.

CAREERS AND PROFESSIONAL DEVELOPMENT

We kick off this edition’s Careers roundup with the news that the Linux Foundation have introduced an Advanced Cloud Engineer IT Professional Program. The program consists of 6 courses spread over a 26-week period and also includes entry into the CKA (Certified Kubernetes Administrator) exam. As well as covering topics such as Containers and Kubernetes Fundamentals, the program also includes modules on monitoring and cloud native logging. The kicker is that enrolling on the course will cost $1,200.

If you are a New Relic user you can now show off your expertise with the Full Stack Observability Certification. Although the title sounds quite generic, the exam principally tests your knowledge of New Relic rather than general observability practices or concepts. The certification is free - you just need to register with the New Relic learning site.

Google pretty much wrote the book on SRE, so it is not surprising that they have some great resources for SRE professional development. They have now put together a ‘systems engineering syllabus’ consisting of high quality research papers, YouTube talks and workshops.

Last, but certainly not least, GitHub user Pavlos Ratis has put together a really comprehensive and expertly curated repository of SRE resources. It lists hundreds of articles, books, papers and videos, grouped into categories such as Reliability, Performance, Culture, Tools, Capacity Planning and many more. This is definitely one to add to your favourites and come back to again and again.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from someone who needs no introduction - Edwin Hubble.

“Observation always involves theory.”