The Front End Gets Front of Mind

Overmind's Crystal Ball | A Change of Tempo

Welcome to Edition #27 of the Newsletter!

A Change of Tempo

Over the past year, publishing the Observability 360 newsletter has been a remarkable and highly satisfying journey into the unknown. It has opened the door to all manner of possibilities, encounters and relationships that I could never have anticipated. Although observability has become big business, I have discovered that it is still a space with a strong sense of community, and it is a great privilege for me to be part of that community.

A guiding principle for the newsletter is to provide quality content. This means researching every article and watching every video I recommend and, where possible, meeting with vendors before I feature their products. Unfortunately, monetising the various strands of Observability 360 activity has been difficult, with returns being pretty modest.

Against this backdrop, I was recently offered a fantastic opportunity to work for a truly progressive and exciting company in the observability space. For the sake of myself and my family, this was an offer I could not refuse and I am very proud and delighted to share that I will next week be joining SquaredUp in a Product Marketing role.

Happily, this does not mean the end of the newsletter. Richard, the CEO of SquaredUp, is a fan of Observability 360 and we have agreed that I can allocate a portion of my working time to continuing the newsletter. Naturally this is an extremely generous arrangement, but it does not compromise the integrity of the newsletter, which remains a separate and fully independent enterprise.

I am not sure quite how the balance between full-time job, family commitments and newsletter will work. For the immediate future, therefore, I am going to experiment with switching the newsletter to a monthly cadence.

On a four-weekly cycle, news that occurs in week one will obviously be rather stale by publishing time. I will, however, still be covering the latest news and events on LinkedIn and Twitter/X - if you use those platforms, let's hook up! For other content, I will probably have to be a bit more selective - but that should certainly not be detrimental to the quality of the newsletter. The rhythm may change but, as Led Zeppelin said, The Song Remains The Same. I hope you will continue to enjoy the newsletter and stay on board for the next stage of the journey!

The Main Event

It is less than one week until Open Source Observability Day. This is a free, online event with a truly top quality lineup of speakers. Just to whet your appetite, the day kicks off with Charity Majors giving the definitive take on Observability 2.0 - who better than the person who coined the term? Check out the full schedule and block out some time in your diary!

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

Honeycomb Go On The Front Foot

For a long time, Front End observability has been something of a blind spot in observability systems. Whilst we are almost drowning in metrics about K8S memory usage and ingesting terabytes of application traces, our view of the user experience can be somewhat patchy. Honeycomb are one of the vendors attempting to address this deficit with the release of Honeycomb for Frontend Observability. This is not just some dressing up of the classic Core Web Vitals defined by Google (although it does include them). It is part of a coherent solution for end-to-end visibility with scope for unlimited custom attributes and sophisticated analytics.

The feature consists of two parts - an instrumentation package with a rich API for gathering data at a high level of granularity (and cardinality) and Launchpad - a visualisation layer for performing analytics on your telemetry. This is a major release which confirms the status of the Front End as a first class citizen in the modern observability stack.

Zabbix Get In With The Cloud Crowd

Zabbix are stalwarts of the enterprise observability sector, with a strong reputation for infrastructure monitoring. Their product has previously only been available as an on-prem installation, but they have finally jumped on the cloud bus with the rollout of their own SaaS offering. The cloud version will support all features in the on-prem version and will be available in no fewer than seven pricing tiers. The tiers are based on a maximum number of new values per second (NVPS). Once you exceed the maximum you can either upgrade to the next tier or the excess values will be discarded.
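
To make the NVPS cap concrete, here is a minimal sketch of the arithmetic described above. The tier names and limits are invented purely for illustration (they are not Zabbix's published tiers), and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    max_nvps: int  # hypothetical cap on new values per second


# Made-up tiers for illustration only; not Zabbix's real pricing tiers.
TIERS: List[Tier] = [Tier("small", 100), Tier("medium", 500), Tier("large", 2000)]


def accepted_and_discarded(incoming_nvps: int, tier: Tier) -> Tuple[int, int]:
    """Split an incoming rate into values kept and values dropped by the cap."""
    accepted = min(incoming_nvps, tier.max_nvps)
    return accepted, incoming_nvps - accepted


def smallest_fitting_tier(incoming_nvps: int, tiers: List[Tier] = TIERS) -> Optional[Tier]:
    """Return the cheapest tier whose cap covers the rate, or None if none does."""
    for tier in sorted(tiers, key=lambda t: t.max_nvps):
        if incoming_nvps <= tier.max_nvps:
            return tier
    return None
```

With these invented numbers, a workload producing 650 values per second on the "medium" tier would keep 500 and lose 150 each second, unless it moved up to "large".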

Products

Overmind: You Might Not Want To Deploy That….

The current state of play in observability is that we have highly sophisticated tooling for diagnostics, querying and correlation. Up until now, though, we have not had any tooling to prevent outages. Or maybe we do....

Overmind claims to be the first tool that can conduct a 'pre-mortem' on your Terraform IaC scripts, and warn you about changes that might break your AWS infrastructure. Overmind uses AI capabilities to define a blast radius for changes in your Terraform scripts.

The tool works by integrating with your source control workflow and generating a report when a Pull Request is submitted. A video on the Overmind web site looks at the example of a Pull Request for an apparently innocuous change to a port number. In a lot of code reviews this change might go through on the nod. In this case, however, Overmind correctly identifies that the current change would cause the Kubernetes Health Check to fail - obviously preventing the pod from loading and resulting in a service failure.

If you are not a Terraform/AWS user then fear not - support for Azure, GCP, Pulumi and other providers is in the works!

Pinot - Real Time Analytics At Scale

At the moment, it seems like the Apache Software Foundation is a hyperactive flywheel of software production. Certainly, they seem to be spitting out storage tech and frameworks like bullets from a machine gun.

Recently, we have seen the widespread adoption of the Iceberg and Parquet specifications and now, racing up on the inside rail is Pinot - a massively scalable real-time analytics platform. Amongst its adopters are Kloudfuse, who use Pinot as the backend for their full stack observability product. This article on the StarTree AI blog looks at how the technology is used at hyper scale by Uber. As well as achieving huge performance gains, they also realised massive savings in infrastructure costs - reducing the number of CPU cores by 80% - which presumably also translates into lower CO2 emissions.

Flying The Flagger For Seamless Releases

Kubernetes is a great tool for running zero downtime upgrades: it drains pods intelligently and guarantees service availability. This is great for running simple rolling updates, but if you need to run advanced or multi-stage upgrades with conditional logic and rollback options you will need a more specialist tool.

This is where Flagger comes in. Flagger is an industrial strength tool that gives you fine-grained control over every step of your Kubernetes deployments. It enables sophisticated scenarios such as canary rollouts, which are not possible with a vanilla Kubernetes deployment operation. You can also carry out in-flight tests and analytics and configure Flagger to either promote or roll back depending on the result.
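
As a rough illustration of the promote-or-roll-back idea, here is a toy sketch of a canary analysis loop. It is not Flagger's API or configuration format; the function name, parameters and thresholds are all hypothetical, and the metric source is a stand-in callable.

```python
from typing import Callable


def run_canary(
    success_rate_at: Callable[[int], float],  # hypothetical metric probe: traffic weight -> success %
    step_weight: int = 10,          # traffic percentage added after each passing check
    max_weight: int = 50,           # weight at which the canary is considered proven
    min_success_rate: float = 99.0, # a check passes at or above this success rate
    max_failed_checks: int = 2,     # total failed checks tolerated before rolling back
) -> str:
    """Shift traffic to the canary in steps, promoting or rolling back on metrics."""
    weight = 0
    failed = 0
    while weight < max_weight:
        candidate = min(weight + step_weight, max_weight)
        if success_rate_at(candidate) >= min_success_rate:
            weight = candidate  # check passed: keep the extra traffic on the canary
        else:
            failed += 1         # check failed: stay at the current weight and retry
            if failed >= max_failed_checks:
                return "rolled_back"
    return "promoted"
```

For example, `run_canary(lambda w: 99.9)` walks the weight up to 50% and promotes, whereas a canary whose success rate drops once it takes 30% of traffic fails two checks in a row and is rolled back.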

From the Blogosphere

Going Under The Bonnet Of An LLM App

LLM-based solutions are proliferating at an almost unsustainable rate. Not all of these solutions are likely to stay the course, and those which do survive the inevitable shakeout are likely to be the systems which leverage LLM functionality within robust architectures with rigorous training on specialised datasets.

In the last newsletter, we covered Resolve AI - a tool which uses AI to assist SREs and on-call engineers with triaging and remediating incidents. This illuminating article on the Resolve AI blog discusses the considerable engineering effort involved in providing LLMs with the necessary context to produce truly meaningful results in complex production environments. This covers everything from continually refreshing the graph of infrastructure and services to governance issues such as regulatory compliance and guardrails to protect against hijacks and jail-breaks. There really is a lot going on under the hood.

Another Win For Quickwit

If you are a regular reader, you will know that we are huge fans of the Quickwit platform - which offers a stunning combination of power and speed at an incredibly low cost. But don’t just take our word for it! In this Medium article, Noble Varghese, a software developer at Thena, describes how his company migrated to Quickwit from a Mezmo + Elastic solution. As well as the cost and performance benefits, Noble also covers features such as Grafana and Jaeger integration. As an added bonus, he recaps how the Thena solution harnessed a Vector pipeline for rock solid reliability.

OpenTelemetry

An OpenTelemetry Primer From Kloudfuse

Many organisations are just at the discovery phase of their oTel journey - where they are considering dipping their toes into the water. If that describes your own state of readiness, then this article from the Kloudfuse blog will provide a useful overview for taking your first steps. It looks at some of the high-level architectural questions - such as running the OpenTelemetry Collector in Daemon or Gateway mode - and provides a simple summary of pros and cons. It also provides an overview of some of the other fundamentals to consider in formulating your initial strategy - e.g. sampling, enrichment and data filters.

Timeline for an oTel Strategy

The Kloudfuse article featured above is a great resource for getting your bearings with OpenTelemetry and following good practice in instrumenting your code. If, however, you have progressed to the stage where you are aiming to migrate an existing codebase and telemetry infrastructure to OpenTelemetry, you may have quite a large project on your hands. OpenTelemetry itself is a large and ever-growing framework with SDKs, APIs and infrastructure components such as the Collector. Equally, re-instrumenting a large distributed codebase requires no little coordination and pre-planning.

The people at Honeycomb are veterans of a number of oTel migrations and they have summarised many of their key takeaways in this article. As well as sharing their own experiences, the document also includes a really handy timeline for planning your own implementation.

CAREERS AND PROFESSIONAL DEVELOPMENT

It is time for our occasional CPD round-up, and in this edition we have unearthed some really great resources.

Sajeeva Lakmal is currently one of only four people in the world to have passed every single CNCF certification. That is a pretty impressive record. Equally impressively, he has published this article where he shares his tips on how to pass the Cilium Certified Associate exam. This is a really valuable resource for anyone thinking about gaining this very desirable qualification.

Next up, a couple of SRE-related resources.

If you have ever wondered what it takes to ace an SRE interview at a hyper-scaler then this article by Krishna Vinnakota will be of interest. Krishna is an SRE at TikTok and in this piece on Cracking The SRE Interview, he provides some high level insights into the key skill sets he looks for in a candidate and gives examples of some of the kinds of questions you might expect.

If you are looking for some materials to prep for your big interview, you might want to dig into this reddit thread, where redditors provide some great suggestions to a user asking for guidance on SRE training materials.

Outreachy

We have also come across a really wonderful project called Outreachy, which is a diversity initiative run by the Software Freedom Conservancy. The initiative offers internships in open source and open science to people from under-represented groups, with a stipend of $7,000. If you are interested in supporting the initiative, there are also opportunities for mentors.

Publications

If you are anything like me, then your reading list is probably long enough to see you through to the end of the decade. At the same time though, it is hard to resist adding a few free e-books to that list.

Mezmo are one of the leading suppliers of observability pipeline solutions and this publication is a really excellent guide for anybody wanting to learn more about the subject. The report is aimed at DevOps, site reliability and security engineers but it is very clearly written and not overly technical so you don’t need any specialist knowledge. It is also refreshingly free of vendor self-promotion.

If you are interested in the Cilium certification that we mentioned earlier, then you will definitely want to get hold of this practical guide to managing Kubernetes network traffic with Cilium. The target audience for the publication is probably System and Network engineers as it assumes familiarity with core networking principles as well as an understanding of concepts such as BGP.

AI technologies seem to be evolving at breakneck speed and keeping up with the latest developments is a dizzying task. If you want to press the pause button and get an overview of the current state of play then this State of AI report from VC firm Air Street Capital might fit the bill. It is a Google Slides presentation that shoots from the hip with an incisive and occasionally irreverent snapshot of the market and the wider global economic and political context.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from the great theoretical physicist Richard Feynman:

“We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.”