Cache me if you can!

A performance masterclass, a double RUM and an encounter with the Tetragon!

Welcome!

Welcome to the fourth edition of the Observability 360 newsletter!

Cache me if you can!

Managing performance at scale and within a budget is a challenge faced by many observability practitioners. At the recent P99 conference, Danny Kopping of Grafana Labs delivered a masterclass on the subject. We also cover new product releases and updates from Microsoft, Prometheus and Honeycomb, as well as a handy guide to Observability at KubeCon.

A Double RUM

Real User Monitoring is no longer a niche concern as businesses aim to unlock metrics to improve customer experience. Both Microsoft and Splunk have recently rolled out updates to their RUM tooling.

Feedback

As practitioners in the field, you will know that every good observability system needs a feedback loop. Let us know how we are doing at:

NEWS

Perses - a new visualisation tool for Prometheus

One of the big announcements at this year's PromCon was the unveiling of the Perses project. This represents something of a step change as it blurs the boundaries between Prometheus and Grafana by enabling fully-fledged metrics visualisation within Prometheus itself. Perses also seeks to bring DevOps principles to dashboarding by ushering in Dashboards as Code. This is definitely one to watch.


Honeycomb roll out dedicated K8S diagnostics

Honeycomb are a full-stack observability provider with a strong developer focus and a mission to provide actionable insights. Their platform now includes dedicated tooling for intelligently investigating the “is this incident an application or Kubernetes issue?” question. The best way to understand this is to see it in action for yourself. Honeycomb have helpfully provided a sandbox with a step-by-step guide to diagnosing a K8S failure by digging into log and metrics correlations.

Getting granular with Autometrics

Most observability platforms take a top-down view, starting with the aim of a single pane of glass over the whole infrastructure landscape. Autometrics, which describes itself as an “open source micro framework for observability”, takes the opposite approach. It works by installing a language-specific library and decorating your code with attributes and wrappers. Autometrics then takes care of generating telemetry, SLOs and alerts at the function level. At present there is official support for Go, Rust, TypeScript and Python, whilst there is also a community project to support C#.
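To make the "decorate your code" idea concrete, here is a minimal Python sketch of function-level instrumentation. It is not the Autometrics API: the `instrumented` decorator, the `CALLS` store and the `divide` function are all hypothetical names invented for illustration, and a real library would export to a metrics backend rather than keep counters in memory.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store, keyed by function name (illustration only).
CALLS = defaultdict(lambda: {"ok": 0, "error": 0, "seconds": 0.0})


def instrumented(func):
    """Record success count, error count and cumulative latency per function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = CALLS[func.__name__]
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            stats["error"] += 1
            raise  # re-raise so callers still see the failure
        else:
            stats["ok"] += 1
            return result
        finally:
            # Runs on both the success and the error path.
            stats["seconds"] += time.perf_counter() - start

    return wrapper


@instrumented
def divide(a, b):
    return a / b
```

Calling `divide` a few times and then inspecting `CALLS["divide"]` gives the raw material for a per-function error rate or latency objective, which is roughly the kind of signal a function-level framework builds on.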

Enter the Tetragon!!

Tetragon sounds way cool - like a cross between Tetris and a dragon. It is also the name of a new component in Cilium’s eBPF-based observability architecture for hardening K8S security. It is natively aware of K8S concepts such as namespaces and can be used to define and enforce policies. It can monitor files and low-level network events and even kill processes which violate policies.

Observe platform ramps up with gen AI

Another week and another observability platform with big financial backing enters the arena. This time it's Observe, who are fresh from receiving a $50m cash injection. One of their main selling points is the integration of AI-driven tools into user workflows. This includes O11y Help to accelerate learning, O11y Extract for generating regexes and the Opal Co-Pilot. Another differentiator is their Data Graph, which sits at the heart of their Observability Cloud architecture. This intelligently maps and visualises relationships between datasets ingested into the Data Lake.

Grafana Sift enters public preview

Grafana Incident was added to the Grafana stack over a year ago and has proven itself to be an effective tool for coordinating responses to incidents. It has a plethora of useful features such as Slack integration, document templating and activity timelines. The latest enhancement is Sift investigations, which use machine learning to provide suggestions when working to resolve an active incident. Sift works by performing analysis on a range of data sources related to the incident.

Take Five(s)

We all know about the Five Nines and the Four Golden Signals - now cloud security specialist Sysdig are proposing their 5/5/5 metric as a benchmark for incident response. They describe it as an “ambitious” framework - and that is no exaggeration. They are setting the bar at detection within 5 seconds. This is followed by 5 minutes of correlation, with an incident response being issued after a further 5 minutes. You may not wish to share this with your senior managers.
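For a sense of how tight that budget is, here is a small Python sketch that encodes the three thresholds described above and checks an incident against them. The names `meets_555` and `TOTAL_BUDGET` are hypothetical, chosen for illustration; the thresholds are taken from the description above and the code is not based on any Sysdig tooling.

```python
from datetime import timedelta

# Thresholds as described above: 5 seconds, 5 minutes, 5 minutes.
DETECT = timedelta(seconds=5)
CORRELATE = timedelta(minutes=5)
RESPOND = timedelta(minutes=5)

# End-to-end budget from first event to response: 605 seconds in total.
TOTAL_BUDGET = DETECT + CORRELATE + RESPOND


def meets_555(detect, correlate, respond):
    """Return True only if every phase is within its own threshold."""
    return detect <= DETECT and correlate <= CORRELATE and respond <= RESPOND
```

Note that each phase is checked separately: an incident detected in 6 seconds misses the benchmark even if the total time to respond is well under ten minutes.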

EVENTS

Unfortunately, we only have space to highlight a few of the many upcoming events relating to Observability. See the Observability 360 calendar for a fuller listing!

KubeCon North America, November 6-9, 2023

KubeCon North America is nearly upon us. To help you prepare, Mina Karamercan from Red Hat has put together this really handy guide to observability-related sessions. Highlights include a presentation on the Kepler observability framework.

Zabbix get back on the road

Fresh from their recent summit in Riga, Zabbix are back on their travels with a meetup in Sweden, where potential customers can meet the Zabbix team and find out more about the company and about Zabbix use cases and features.

From the Blogosphere

Edge Cases - Arduino meets Elastic

Most of us will be familiar with ingesting observability data from web applications, containers and VMs. The world of IoT opens up a whole vista of other use cases with metrics and logs being sent from distributed arrays of sensors and small devices. This blog article looks at using a client library on an Arduino board to connect to Elasticsearch over HTTP. Beware - this article may contain C++ code!

Why Node developers need Observability skills

In this article from the SigNoz blog, Nočnica Mellifera hammers home the point that observability can no longer be an afterthought but needs to be integrated into the software development lifecycle as a first-class citizen. She also looks at using OpenTelemetry auto-instrumentation to instrument a Node application and emit telemetry to (surprise, surprise) a SigNoz endpoint, for viewing in a, you guessed it, SigNoz dashboard.

RUM

Splunk’s RUM gets session replay

Splunk introduced their RUM tool over two years ago to provide a comprehensive and in-depth picture of the user experience. A notable omission from the product was session replay - and this has now been rectified to bring Splunk into line with players such as Datadog and New Relic (who announced the release of their session replay feature in September). The product ensures privacy by obfuscating PII and also integrates with the Splunk Waterfall.

Application Insights adds RUM for Java

This kind of went under the radar, but earlier in the year Microsoft released their Azure Monitor OpenTelemetry Distro for Java. The distro includes a “browserSdkLoader” configuration which improves the developer experience for adding RUM to Java apps. The configuration initialises a script which enables collection of client-side data. Once this is activated you can view user journeys, trends, funnels and other analytics in Application Insights.

VIDEOS & TUTORIALS

Cache Me If You Can!!

One of the highlights of the recent P99 conference was a masterclass in performance and reliability delivered by Danny Kopping of Grafana Labs. It was a fascinating investigation into the trade-offs involved in different storage and caching strategies and is a surprisingly gripping tale. You will need to register to view it, but it is well worth 20 minutes of your time.

Prometheus best practice with Julius Volz

Are you in danger of creating alert fatigue or unleashing cardinality bombs? Make sure that your Prometheus configuration is in good shape with this really clear and informative refresher from the maestro himself.

Community

📣 If you would like to publicise an Observability-related meetup/standup etc then please let us know and we will list it here.

OpenTelemetry End User Discussion Groups

These are online discussion groups where community members can share experiences on how they are using OTel. Outcomes are fed back to relevant project maintainers. There are monthly meetups across three geographical areas. November’s meetings are:

US: Nov 16 2023, 5pm GMT

Europe: Nov 21 2023, 11am GMT

APAC (Asia Pacific): Nov 22 2023, 6:30am GMT

The meetings are facilitated by the meetup.com platform.

ClickHouse community meetup, 12 December, NYC

ClickHouse build the database engine used by Uber, Cloudflare, eBay and a high-profile list of other companies working at hyper-scale. The meetup organisers promise that this gathering will be “a tech-filled fiesta that'll leave you inspired and buzzing with new ideas”.

That’s a wrap!

That’s all for this fortnight’s edition. See you in two weeks!