The Rise and Rise of Grafana

AI Round-Up | Jobs & Certifications

Welcome to Edition #18 of the newsletter!

Grafana

In recent years the full stack observability space has been dominated by big names such as Splunk, Datadog, Dynatrace and New Relic. In previous editions we have documented how new players such as Observe and Chronosphere are revving up to compete with these big four for market share. Grafana have been growing rapidly in recent years, gaining market share whilst also building out their tech stack. The news of a $300m-plus funding round is both a reflection of Grafana’s own considerable ambitions and an indicator of the strength of the observability market in general.

AI/ML

As the hype around AI settles, we thought it might be useful to look at some of the practical enhancements vendors are incorporating into their products. Obviously, there is a large number of products on the market and a full review would require a lengthy article. We are therefore restricting ourselves to a sample of the features now being brought to market by leading vendors such as IBM, Logz.io and Elastic.

Jobs/Careers

This edition of the newsletter sees the return of our Careers/Professional Development section. This is an area where information is relatively thin on the ground, so if you are aware of any relevant resources, please let us know - we would love to feature them!

Feedback

We love to hear your feedback. Let us know how we are doing at:

NEWS

The Rise and Rise of Grafana

It is perhaps a sign of the times in observability that one of the biggest stories of the year originates in the business press. An article in Forbes magazine has revealed that Grafana are seeking to raise between $300m and $400m in additional funding. This would value the company at around $6 billion - placing it virtually neck and neck with New Relic. By comparison, the recent Chronosphere fundraising valued that company at around $1.6 billion. The article also includes some other interesting numbers on Grafana’s business - they now have more than 5,000 customers and over 1,000 employees.

It is not clear what the funding would be used for. Interestingly, however, the move follows on closely from the recent announcement that Grafana would be launching an application development platform which could potentially revolutionise the product - providing almost unlimited scope for extensibility and truly opening up a third party ecosystem around the company’s observability stack.

New Relic Roll Out Mobile Observability

We have previously mentioned Embrace and their dedicated mobile observability platform. New Relic have now also recognised the importance of providing observability functionality which adapts to the specific needs of mobile users with the rollout of their new Mobile User Journey product. As we mentioned in our feature on Embrace, mobile device failures can have a serious impact on user perception as well as hitting the bottom line for e-commerce applications. The product is billed as a ‘critical component’ of the company’s overall investment in DEM (Digital Experience Monitoring) and provides features such as user journey mapping, crash analysis and proactive optimisation.

Simplified Container Log Parsing for OTel

The recent OpenTelemetry Collector survey (which we covered in our last edition) highlighted that most installations were, not surprisingly, running on Kubernetes clusters and that collecting container logs was an important use case. One difficulty up until now has been that the different container runtimes each have their own logging format - which made configuration for the OpenTelemetry filelog receiver rather complex.

To resolve this issue, there is now a new container log parser, which has been implemented as an OpenTelemetry operator (not to be confused with a Kubernetes operator). This article on the OpenTelemetry blog provides context on the need for the new parser as well as technical detail on the rationale for implementing it as an operator rather than as a processor.
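To illustrate why the differing runtime formats are awkward to handle, here is a minimal Python sketch that normalises two common line shapes: the JSON lines written by Docker's json-file logging driver and the space-delimited CRI-style lines written by containerd and CRI-O. This is purely an illustration of the problem and not the implementation of the new parser; the function name, the record type and the sample lines are our own assumptions.

```python
import json
import re
from typing import NamedTuple


class LogRecord(NamedTuple):
    """Runtime-independent view of one container log line (hypothetical type)."""
    timestamp: str
    stream: str
    message: str


# CRI-style line: "<timestamp> <stream> <tag> <message>".
# The tag is "F" for a full line or "P" for a partial line.
CRI_LINE = re.compile(
    r"^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<tag>[FP]) ?(?P<msg>.*)$"
)


def parse_container_line(line: str) -> LogRecord:
    """Detect the line format and return a normalised record."""
    line = line.rstrip("\n")
    if line.startswith("{"):
        # Docker json-file format: one JSON object per line.
        entry = json.loads(line)
        return LogRecord(entry["time"], entry["stream"], entry["log"].rstrip("\n"))
    match = CRI_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised container log line: {line!r}")
    return LogRecord(match["time"], match["stream"], match["msg"])


if __name__ == "__main__":
    # Invented sample lines, one per format.
    docker_line = '{"log":"hello world\\n","stream":"stdout","time":"2024-03-30T08:31:20.545192187Z"}'
    cri_line = "2024-03-30T08:31:20.545192187Z stdout F hello world"
    print(parse_container_line(docker_line))
    print(parse_container_line(cri_line))
```

Both sample lines yield the same record, which is the kind of normalisation that previously had to be expressed by hand in the receiver configuration.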

Products

Logfire - Observability for Python Devs

In Edition 16 of the newsletter, we featured Digma - a developer-focused tool seeking to overcome the disconnect between developers and observability systems. Logfire, from Pydantic, aims to achieve a similar objective but with some crucial differences in scope and implementation. First of all, whereas Digma provides feedback within the developer’s IDE, Logfire analytics are displayed via its web portal. Secondly, Logfire focuses principally on the Python programming language. Logfire is OpenTelemetry compatible and collects logs, traces and metrics for your Python apps. It also supports features such as sampling, scrubbing of sensitive data and SQL-based querying. If you are interested in evaluating the product, it is free to use whilst it is in beta.

Highlight - User Centric Observability

Whilst traditional monitoring tooling tended to focus on server-side and backend systems and processes, there has been a recent shift to understanding and improving the user experience. Metrics such as page load times and client-side page errors are becoming as important as backend performance and server-side telemetry. Unfortunately, however, standards around front-end telemetry and user experience are still patchy and fragmented, whilst many systems are unable to correlate front and back-end issues.

This is the problem that Highlight aims to address. Highlight seeks to join the dots by uniting telemetry for session replay with server-side logging to provide a holistic picture of the user experience. Highlight is open source and describes itself as a full stack platform. As well as full session replay, it also supports logging, metrics, and traces.

From the Blogosphere

Internal Observability Uber Style

Stories about Uber architecture always seem to be interesting, not least because they always involve technology at huge scale - such as this trillion-record migration from DynamoDB. This article, however, is actually interesting on a number of levels. As well as being of technical interest, it also provides some fascinating insight into internal team topologies and management processes - which are also fundamentally important aspects of managing observability at scale. Whilst most organisations will only operate at a fraction of Uber’s scale, every organisation is seeking to minimise costs and improve service to users, and the article provides a number of insights which would be of interest to most observability practitioners.

Bloom Filters - An Evergreen Performer

As telemetry volumes grow exponentially, observability providers are facing ever greater challenges in storing and querying data efficiently. Technologies such as S3, Arrow and columnar databases have all led to huge improvements in performance and economy. Interestingly, the Bloom filter, a data structure that is more than fifty years old, appears to have found new popularity recently. As well as being implemented in the Apache Parquet file format, Bloom filters are also an experimental feature in the latest version of Grafana’s Loki log aggregation platform - delivering dramatic improvements in query performance. Although they can be powerful for certain query types, Bloom filters need to be used judiciously. This article on the InfluxDB blog takes an in-depth look at Bloom filters, exploring the trade-offs involved and the use cases where they offer the greatest performance advantage. This is a rigorously researched piece which serves as a really useful reference for query optimisation.
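For readers who have not met the data structure before, here is a minimal Python sketch of a Bloom filter. It is a teaching example under our own assumptions (the class name, the default sizes and the use of SHA-256 with double hashing are all illustrative choices) and does not reflect how Parquet or Loki implement theirs. The key property is that a lookup can return a false positive but never a false negative, which is why a filter can be used to skip data that definitely does not contain a search term.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force the step to be odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    for trace_id in ("abc123", "def456"):  # invented sample values
        seen.add(trace_id)
    print(seen.might_contain("abc123"))  # always True for an added item
```

The trade-off is between memory and accuracy: a smaller bit array or a fuller filter raises the false-positive rate, so more lookups fall through to the expensive scan that the filter was meant to avoid.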

Getting A Grip on Cardinality

The spectre of cardinality explosions looms large in the observability engineering mindset, as they are synonymous with huge spikes in telemetry volumes and, potentially, costs. If you are not familiar with the issues around high cardinality metrics then this post on the Observe blog will serve as a really great introduction. The article starts off with a clear definition of the term as well as an explanation of how cloud-native environments provide such fertile ground for cardinality explosions.

Naturally, the article is also a bit of a sales pitch for the Observe product and not everybody might agree that strategies adopted by other vendors are ‘sub-optimal’. More generally though, the article highlights how a number of newer observability vendors with cloud-native backends are able to differentiate themselves by offering large scale ingestion and storage at very low cost.
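To make the idea of a cardinality explosion concrete, here is a short Python sketch of the arithmetic. The upper bound on the number of time series for a metric is the product of the number of distinct values of each label, so one unbounded label can multiply the total dramatically. The label names and values below are invented for illustration and are not taken from the Observe article.

```python
from math import prod
from typing import Mapping, Sequence

# Hypothetical labels attached to a single request-count metric.
BASE_LABELS: Mapping[str, Sequence[str]] = {
    "service": ["checkout", "cart", "search"],
    "region": ["eu-west-1", "us-east-1"],
    "status_code": ["200", "404", "500"],
}


def max_series(label_values: Mapping[str, Sequence[str]]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(len(set(values)) for values in label_values.values())


if __name__ == "__main__":
    print(max_series(BASE_LABELS))  # 3 * 2 * 3 = 18

    # Adding one label with 500 short-lived pod names multiplies the bound by 500.
    with_pods = {**BASE_LABELS, "pod": [f"pod-{i}" for i in range(500)]}
    print(max_series(with_pods))  # 18 * 500 = 9000
```

In cloud-native environments labels such as pod or container identifiers change constantly, which is why the bound can grow so quickly.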

AI/ML

Intelligent Remediation in Instana

IBM has a long and illustrious track record in the fields of AI and Machine Learning - including the 1997 victory of Deep Blue over chess grandmaster Garry Kasparov and IBM Watson’s winning turn on the US TV quiz Jeopardy! in 2011. Watson has now been superseded by WatsonX, and this is the engine which powers the Intelligent Remediation feature in IBM’s Instana observability platform.

Intelligent Remediation is a preview technology which continuously monitors a system for faults and anomalies. As well as drawing upon system telemetry, it also uses expert knowledge for causal analysis and then suggests remediations. The remediations can be implemented using pre-built actions selected from a catalogue. As well as the Remediation feature, Instana also has AI-driven capabilities for summarisation, diagnostics and recommendations.

Anomaly Detection in Logz.io

Logz.io is a popular full-stack observability platform built on top of open source technologies such as OpenSearch, Prometheus and Jaeger. Although their platform has been equipped with AI tooling for some time, the company is careful not to overplay the current capabilities of AI. Whilst they recognise that AI can assist in areas such as reducing noise and summarising incidents, they do not make any claims in terms of causal analysis or remediation. You can learn more about the company’s AI posture from this really illuminating webinar.

The Logz.io platform ships with an Observability IQ Assistant, which harnesses AI to support natural language querying and chat-based analytics on your telemetry data. The most powerful AI feature in the Logz.io platform, though, is probably the Anomaly Detection tooling that is integrated into the App 360 module. One problem with anomaly detection is that it is not business-aware and, if not applied carefully, may end up creating yet more alert fatigue. To combat this, Logz.io Anomaly Detection allows users to target critical services and take a more SLO-driven approach.

Elastic - Supercharging Search With AI

The ELK Stack has been at the forefront of the log aggregation and analytics space for many years. Despite controversies over licensing, Elastic is still a hugely popular and influential product. Highly powerful search capabilities are at the core of its product offering and it is no surprise that this is a domain where Elastic seeks to differentiate itself from other platforms in terms of its AI tooling.

It seems, though, that Elastic’s ambitions extend far beyond log searching and it is positioning itself as a first-choice platform for advanced corporate data analytics. The centrepiece of this vision is the Search AI Lake, which incorporates RAG, search and security functions and is built on a cloud-native architecture. The company claims that this enables search over vast volumes of data at high speed and low cost.

CAREERS AND PROFESSIONAL DEVELOPMENT

Regular readers of the newsletter will know that we are huge fans of the ClickHouse DBMS, which powers a growing number of observability platforms and is the backend of choice for a number of leading hyperscale organisations. The company recently announced their first ever certification - the ClickHouse Certified Developer. As well as proving product-specific knowledge, this certification will also demonstrate expertise in managing data at scale.

Cilium is being ever more widely adopted as a technology for network security and observability and there is now a Cilium Certified Associate certification which is accredited by the Linux Foundation. Isovalent, the creators of Cilium, have also provided this study guide with a large number of resources grouped around each curriculum topic.

If you are interested in keeping track of job openings in the observability space then head over to the AdatoSystems web site, where prolific author and veteran observability specialist Leon Adato publishes a weekly listing of observability, SRE and related jobs. There are also links to other useful career resources.

That’s all for this edition!

If you have friends or colleagues who may be interested in subscribing to the newsletter, then please share this link!

This week’s quote is from former IBM CEO Ginni Rometty:

“Some people call this artificial intelligence, but the reality is this technology will enhance us. So instead of artificial intelligence, I think we’ll augment our intelligence.”