Observability in Flames!


Observify! - Shopify's In-house platform | Solving the AI Measurement Problem

John Hayes
November 15, 2024

Welcome to Edition #28 of the Newsletter!

The Boom Rolls On

Over the past year there have been a number of bearish warnings that the observability bubble would burst as the market became saturated, interest rates rose and customers crashed into the trough of disillusionment. Whilst at some point there may (or may not) be a ‘correction’, the market seems to be in a healthy and vibrant state. Many of the major players are reporting steady growth whilst the launch of Dash0 shows that there are also opportunities for new entrants.

Raising The Table Stakes

Whilst the market remains healthy, it also continues to evolve quickly and is a highly competitive domain. In the motor industry there was a time when technologies such as Power Steering or ABS were considered to be premium features - nowadays they are standard.

The same logic applies within the observability space. As you will see from our segments on Dash0 and New Relic, the bar for being a full-stack player has risen sharply. MELTS is no longer enough. Whether it is Perses support, Anomaly Detection, RUM, auto-instrumentation or configurable pipelines, vendor after vendor is upping their game and defining a new standard.

Feedback

We love to hear your feedback. Let us know how we are doing at:

https://twitter.com/TheObsGuy

NEWS

Brendan Gregg - His Latest Flame

I totally understand this - honestly!

Even if you are not familiar with the name of Brendan Gregg, you are almost certainly familiar with the fruits of his labours. Brendan is the creator of the Flame Graph - one of the most important and iconic visualisations in the observability toolkit. We featured the Flame Graph in our recent tribute to the work of UX designers in the observability arena - but you should also visit Brendan’s website.

Brendan’s latest innovation is the AI Flame Chart. This is an evolution of the original flame graph and its ambitious aim is to help reduce the vast financial and environmental costs entailed in the use of LLMs. This means that whereas the original flame graph was focused on CPU cycles, the latest generation sets its sights on reducing GPU load. The article discusses the considerable complexities involved in mapping GPU programs back to their corresponding CPU stacks. The names of some of the instruction sets look intimidating to the uninitiated but the basic concept of the graph is quite simple - the wider the bar, the more resource it consumes.
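For readers who want to see the "wider bar, more resource" idea in miniature, here is a rough Python sketch. It is not Brendan's tooling and does nothing GPU-specific: it simply assumes we already have a list of sampled call stacks (the frame names below are made up for illustration), counts how many samples pass through each call path, and draws a bar whose width is proportional to that count.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse identical stacks into counts. Each sample is a sequence of frames, root first."""
    return Counter(tuple(sample) for sample in samples)

def frame_widths(folded):
    """For every call path (stack prefix), count the samples that pass through it."""
    widths = Counter()
    for stack, count in folded.items():
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += count
    return widths

def render(widths, total, columns=40):
    """Draw one text bar per call path; bar length is proportional to its sample count."""
    lines = []
    for path in sorted(widths):  # tuple ordering puts parents before their children
        bar = "#" * max(1, round(columns * widths[path] / total))
        indent = "  " * (len(path) - 1)
        lines.append(f"{indent}{path[-1]:<12} {bar} {widths[path]}")
    return lines

# Hypothetical samples: frame names are illustrative only.
SAMPLES = [
    ("main", "load_model"),
    ("main", "infer", "matmul"),
    ("main", "infer", "matmul"),
    ("main", "infer", "matmul"),
    ("main", "infer", "softmax"),
]

if __name__ == "__main__":
    widths = frame_widths(fold_stacks(SAMPLES))
    for line in render(widths, total=len(SAMPLES)):
        print(line)
```

In this toy data set the `matmul` path appears in three of the five samples, so it gets the widest leaf bar - which is exactly where you would start looking for savings.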

How Shopify Observify Themselves!

Shopify probably have one of the most fascinating back stories of any of the tech unicorns of the past few decades. Originally founded as an e-store for snowboarding goods, founder Tobias Lütke found that selling his e-commerce platform was more lucrative than actually selling physical goods - and so one of the largest and most successful e-commerce platforms on earth was born.

Now Shopify have built their own internal observability platform and have released a fascinating series of YouTube videos discussing the rationale for the project and describing each of the components in the system.

Shopify have achieved enormous success on a philosophy of innovation and top class engineering. This video series is a real treasure trove of insight and experience. The session on building a network monitoring architecture with eBPF, Vector and ClickHouse was the pick of the bunch for me.

Code RED!! Dash0 Launch ‘OpenTelemetry Native’ Observability Stack

After 18 months of intensive development, the Dash0 observability platform went live last week. The company founder, Mirko Novakovic, was previously a co-founder of IBM Instana, so it is not surprising that Dash0 appears to be a product which is both expertly engineered and has a cohesive and focused product identity.

Observability is a crowded space and that means it is vital for new entrants to have a very clear proposition. The Dash0 pitch appears to be strongly oriented towards developers - placing itself roughly in the same space as vendors such as Honeycomb and SigNoz. Certainly, the focus on speed and productivity seems designed to appeal to developers working in IDEs where they are used to keyboard shortcuts and minimal clicking around.

As you might expect from the company that brought us the very nifty oTelBin tool, OpenTelemetry support is built into the product from the ground up. In addition to this, we think that strategic choices such as Perses support, intuitive UI, ease of on-boarding and pricing transparency put Dash0 in a strong position.

Thoughtworks Technology Radar - Edition 31

As you are probably aware by now, we are keen watchers of the Thoughtworks Tech Radar. Whether or not you agree with their ratings, the Radar is a document which is highly influential and represents the opinions not just of analysts but also of practitioners in the field. Not surprisingly, AI and LLM tooling and technologies loom large over all of the quadrants. We are pleased to see WebAssembly getting a vote of confidence, with SpinKube being awarded an Assess in the Platforms quadrant.

From an observability point of view, the main talking point (apart from the lack of observability tooling in this edition) is the inclusion of Observability 2.0 in the Techniques quadrant. This is a term that is attributed to Charity Majors and the team at Honeycomb and the definition in the Radar document concentrates on the implications for telemetry and storage. I think that the conversation about evolving from Observability 1.0 should probably be more wide-ranging, and include themes such as incident resolution, developer tooling and enterprise integration.

Products

meshIQ - Getting The Message!

meshIQ are a company that fall into the exclusive category of Big Companies That Not Many People Have Heard Of. The company actually has a long pedigree, having been formed in 1994. Up until last year they traded under the name Nastel. They are a niche provider specialising in messaging observability and count financial giants such as Citi and UBS among their customers.

There are, of course, other tools for monitoring your Kafka or RabbitMQ instances but in sectors such as banking, high fidelity at very large scale is paramount. Large swathes of the banking system run on messaging and a single lost trade could have huge financial and regulatory repercussions.

Debugging and maintaining large and complex messaging implementations can be highly resource-intensive. The meshIQ platform offers ultra-deep visibility of message transport across the whole life-cycle for a range of platforms. In addition to providing high-granularity inspection, the platform also supports diagnostics for detecting slowdowns and bottlenecks.

Kloudfuse Ramp Up With 3.0 Release

Kloudfuse have now released version 3.0 of their observability platform as they establish themselves as serious contenders in the market. As we mentioned in our introduction, full stack systems now need to go beyond the MELTS (Metrics, Events, Logs, Traces) baseline to remain competitive and this release is a great example of the trend.

Two of the major new capabilities are RUM and Anomaly Detection - capabilities which we think will soon become standard for observability vendors. Front End observability has long been a blind spot for observability systems and Kloudfuse have joined a number of vendors such as Honeycomb, Middleware and Grafana who have sought to plug this gap. The Anomaly Detection feature is an implementation of the Facebook Prophet project and Kloudfuse claim that it is especially suited to analysing observability data. There is a lot more to this release than we can cover here.

Goldpinger - For Your K8S Odd Jobs!

If you are responsible for managing or monitoring microservices running on Kubernetes instances then you will be familiar with the challenge of investigating potential connectivity issues when requests fail. Whilst Isovalent’s Cilium/Hubble stack is a fantastic diagnostic tool, it is not the only game in town. The wonderfully named Goldpinger is a Kubernetes tool that was developed internally at media giant Bloomberg and has now been released under an open source licence. The tool actually has quite a pedigree. It was presented at KubeCon as far back as 2018 and its Docker image has over 1m pulls.

As well as doing the basics of connectivity diagnostics, Goldpinger can also tackle tasks such as arbitrary DNS resolution and external HTTP checks and even exposes Prometheus metrics. To seal the deal, it also produces beautiful visualisations like the one shown above.

New Relic Keep Their Foot On The Gas

At the moment we seem to be at a fascinating juncture where observability vendors are both extending the functional breadth of their products - e.g. by incorporating capabilities such as incident management and on-call paging - whilst also building increasing levels of sophistication into their core services. This is certainly evident across the latest slew of releases from New Relic.

Grouped together under the rubric of New Relic Intelligent Observability Platform, the new release includes an Agentic AI Engine, which powers anomaly detection, impact assessment and configuration recommendation capabilities. Another key strategic feature is Pathpoint Plus - which is designed to provide visibility across multiple business domains and seems to be conceptually similar to the Dynatrace Business Flow function that we covered in Edition 26 of the newsletter. Even if you are not a New Relic user, this article is valuable as a snapshot of the pace and breadth of innovation in the observability space at the moment.

From the Blogosphere

Getting The Skinny on Wide Events

Most observability vendors have realised that placing log, event and trace telemetry in walled-off silos is wasteful and counterproductive. Most systems now support the ability to create correlations across signals, which helps engineers to debug faster and gain richer insights into their telemetry. Whilst correlating disparate telemetry streams represents a way forward, Honeycomb have gone back to first principles and questioned the notion of splitting out your telemetry in the first place.

If you have ever familiarised yourself with the Honeycomb system or its philosophy, you will have come across the concept of the Wide Event. This is essentially an entity which captures all signal types into a single data structure and persists them in a single backend datastore. This obviates the need for subsequent correlation.

That is the theory - in practice it obviously requires a fundamental shift in the way that you think about your telemetry and instrument your systems. This article by Jeremy Morrell is a great resource if you are looking for guidance on creating the Wide Events that enable you to truly leverage the power of the Honeycomb system.
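To make the Wide Event idea a little more concrete, here is a minimal Python sketch. It is an illustration of the general pattern only, not Honeycomb's SDK or the approach from Jeremy Morrell's article: the `WideEvent` class, the `handle_checkout` function and all of the field names are hypothetical. The point is simply that everything we learn while handling one request is attached to a single record, which is emitted once at the end as one JSON line.

```python
import json
import time
import uuid

class WideEvent:
    """Accumulates every field for one unit of work and serialises it once."""

    def __init__(self, name):
        self.fields = {"name": name, "event_id": uuid.uuid4().hex}
        self._start = time.monotonic()

    def add(self, **fields):
        """Attach more context as it becomes known during the request."""
        self.fields.update(fields)

    def finish(self):
        """Record the duration and return the whole event as a single JSON line."""
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        return json.dumps(self.fields, sort_keys=True)

def handle_checkout(user_id, cart_items):
    """Hypothetical request handler: one request produces exactly one wide event."""
    event = WideEvent("checkout")
    event.add(user_id=user_id, cart_size=len(cart_items))
    try:
        total = sum(price for _, price in cart_items)
        event.add(cart_total=total, outcome="ok")
    except Exception as exc:
        # Errors become fields on the same event rather than a separate log line.
        event.add(outcome="error", error=repr(exc))
    return event.finish()

if __name__ == "__main__":
    print(handle_checkout("u-42", [("board", 300), ("wax", 12)]))
```

Because the user, the cart, the outcome and the timing all live in the same record, there is nothing to correlate afterwards - you can slice by any field directly.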

52 Weeks of SRE - An Online Learning Journey

There are a lot of books and training resources out there dangling the promise of instant expertise - whether it’s learning Python in a day or becoming a Kubernetes admin whilst waiting for your morning toast. In reality, true mastery requires putting in some hard hours over a prolonged period. In the latter spirit, Software Engineer and SRE João Pereira has launched 52 Weeks of SRE - a year long series of in-depth articles on all aspects of the SRE role.

The content is structured into four phases - Foundation, Intermediate, Advanced and Expert. The first four articles have already been published and the content and presentation are both excellent. There is also a GitHub repo to support readers who want to tackle the practical exercises and labs.

AI/LLMs

Galileo aim to Solve the “AI measurement problem”

The AI observability market has grown rapidly as more and more companies incorporate LLMs into their systems and applications. There are a large number of tools which are capable of running traces and providing valuable metrics on errors, latency, bottlenecks and cost. What is more problematic though is evaluating the quality of LLM responses.

At the moment, model responses are evaluated either by humans, who need to read through individual responses, or by other LLMs. As Galileo Chief Executive Vikram Chatterji says, these techniques are “expensive, slow and do not scale”. This is the “measurement problem” that Galileo are aiming to tackle with their Evaluation Intelligence Platform, which embeds evaluation into the AI development pipeline. The company has recently raised $45m in funding, with investors including Databricks, Citi Ventures and ServiceNow.

💎SWAG & BLING CORNER

With KubeCon running this week and Microsoft Ignite coming up next week, it’s an ideal time for style-conscious observability vendors to strut their stuff as they join battle to attract attendees to their booths. Our first exhibit, below left is a natty yet practical jacket being modelled by Kerno CEO Sean Madigan. Reducing developer toil has rarely looked so chic.

Kudos also goes to SigNoz (centre) for their T-Shirt which combines their love of oTel with a LOTR theme. Probably most eye-catching of all though are the threads being sported by Brooks Townsend of wasmCloud and Dave Gee of NATS (below right). The NATS jacket was apparently auctioned off but we’re not sure how you can get your mitts on the wasmCloud cycling top.

Lego and Star Wars merch are both highly desirable swag in their own right, but SquaredUp have really raised the bar by bringing the two together in their highly popular Lego Swag Store, which will be opening its doors once again at next week’s Ignite conference in Chicago. If you swing over to the SquaredUp stand, not only can you download a mobile app to order your own customised minifigure, you can also see dashboards visualising the progress of orders through the production pipeline.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is this wonderful nugget from Enrico Fermi:

“There are two possible outcomes: if the result confirms the hypothesis, then you've made a measurement. If the result is contrary to the hypothesis, then you've made a discovery.”

About Observability 360

Hi! I’m John Hayes - I’m an observability specialist and I publish the Observability 360 newsletter. I am also a Product Marketing Manager at SquaredUp.

The Observability 360 newsletter is an entirely autonomous and independent entity. All opinions expressed in the newsletter are my own.