Observations on Observability - #O11ycon report

Aug 2, 2018
8/2/18 9:28 AM PST
Kicking off o11ycon, the Observability conference, with Honeycomb CEO Charity Majors.
8/2/18 9:32 AM PST
Christine Spang on why observability matters in software, especially when it comes to distributed systems based on 3rd party SaaS tools.
8/2/18 9:35 AM PST
From servers to services: software applications grow more complicated as more systems are added and scaling becomes automatic. “I manage two servers at 9am, and 100 at 2PM.”
8/2/18 9:39 AM PST
Today, we are going to spend some time defining observability.
8/2/18 9:39 AM PST
Software, by default, is opaque. To see what it’s doing, you have to write observation capability into it. This goes beyond logs and stepping through in a debugger - because you have to observe the live system, not your sandbox.
8/2/18 9:46 AM PST
The engineering view of who needs observability: engineering, customer support teams, and customers.
8/2/18 9:53 AM PST
Structured logs are key to observability. You can track times, servers, customers - all the things you need to use to slice and dice one incident in a billion.
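To make the idea concrete, here is a minimal sketch of a structured log event, written for this report rather than taken from the talk. The field names (`host`, `customer`, `endpoint`, `duration_ms`, `status`) and the helper functions are assumptions chosen for illustration; the point is only that named fields let you filter down to one incident.

```python
import json
import time


def make_event(host, customer, endpoint, duration_ms, status):
    """Build one structured log event as a flat dict of named fields.

    All field names here are hypothetical, picked for illustration.
    """
    return {
        "timestamp": time.time(),
        "host": host,
        "customer": customer,
        "endpoint": endpoint,
        "duration_ms": duration_ms,
        "status": status,
    }


def to_log_line(event):
    """Serialize the event as a single JSON object on one line."""
    return json.dumps(event, sort_keys=True)


def filter_events(events, **criteria):
    """Keep only events whose fields match every given key/value pair."""
    return [
        e for e in events
        if all(e.get(key) == value for key, value in criteria.items())
    ]
```

With events shaped like this, `filter_events(events, customer="acme", endpoint="/messages")` narrows a large pile of events to one customer on one endpoint, which is much harder with free-text log lines.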
8/2/18 9:55 AM PST
Filtering by client/customer and service endpoint can show problems in stark contrast. Instead of intuition you can use sensors to see what the problem is. This slide illustrates specifically tracing how one customer was using the @nylas API to get data from a mail server that was timing out. The customer couldn’t do anything about this, but by identifying a specific mailbox that was having problems, the Nylas team was able to triage the issue quickly.
8/2/18 10:01 AM PST
How do you troubleshoot something that is bad on average? Take the data and group by hostname, or API, or customer, and you can more easily see trouble spots. The problem shown in this slide was caused by one host (out of dozens) that was configured slightly differently. It would take an extremely obsessive engineer to figure this out the old-fashioned way, by examining each server individually. After 2-3 rounds of “hmm, this is working just right”, many people would shrug and say “this is Schrödinger’s bug, can’t replicate”.
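The group-by idea can be sketched in a few lines. This is an illustration written for this report, not the tooling shown on the slide: the event fields (`host`, `duration_ms`) and the "more than twice the median" cutoff are assumptions.

```python
from collections import defaultdict
from statistics import mean, median


def mean_by(events, key, value="duration_ms"):
    """Group events by one field and return the mean of `value` per group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event[value])
    return {name: mean(values) for name, values in groups.items()}


def outliers(group_means, factor=2.0):
    """Return group names whose mean exceeds `factor` times the median mean.

    The factor is an arbitrary illustrative threshold, not a recommendation.
    """
    typical = median(group_means.values())
    return sorted(
        name for name, m in group_means.items() if m > factor * typical
    )
```

Calling `outliers(mean_by(events, "host"))` points at the one odd host directly; swapping `"host"` for a customer or endpoint field slices the same data another way.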
8/2/18 10:21 AM PST
If you have a “fleet” of servers performing tasks, human oversight is insufficient for problem-solving. It can take hours to figure things out through log searches. Distributed systems need the right kind of instrumentation to even examine the problems. And there are multiple tools to use in different places.
8/2/18 10:22 AM PST
Christine also addresses some other things - like what might happen when the scale of your system is so large that keeping logs of everything is no longer cost-effective. “Everyone has docs on their system architecture and they’re always out of date”.
8/2/18 10:35 AM PST
The Pearl is a beautiful venue for a small conference (200 people). We are just getting ready for the OpenSpace part of this conference, where we can ask new questions.
8/2/18 10:43 AM PST
O11ycon folks creating topics for the open session.
8/2/18 10:52 AM PST
Topics!
8/2/18 10:53 AM PST
Moar topics. Testing in prod! That’s going to be a fun set of discussions.
8/2/18 11:26 AM PST
Somebody was very prepared for setup this morning.
8/2/18 11:30 AM PST
Lots of engineers in the “tying observability to business goals” talk.
8/2/18 12:40 PM PST
One of the neat-o art pieces decorating the venue, made by artist Alexis Laurent. Yes, this is a map of the London Underground.
8/2/18 12:42 PM PST
And now for the part of the conference where we learn from other people’s foul-ups.
8/2/18 2:37 PM PST
Testing in Production evolves into observability-driven development.
8/2/18 3:40 PM PST
Observabili-tea time.
8/2/18 3:46 PM PST
From BASIC to Containers and Kubernetes, a fireside chat with Charity Majors and Joe Beda. When your system is distributed enough to need Kubernetes, you are forced to think about observability as more than just SSHing into a system and tracing the code. “Right now Kubernetes is too exciting - we need to make it more boring”. Joe talks about the ongoing evolution of standardized practices.
8/2/18 3:57 PM PST
At Google, the people who write the software run the software. This is very different from typical Enterprise approaches, where things are thrown “over the wall”. A healthy environment empowers developers, devops, and sysadmins rather than facilitating blame.
8/2/18 4:01 PM PST
What does “observability” mean? Charity: it’s about being able to see what’s happening in a system by looking at it from outside, without shipping new code.
8/2/18 4:25 PM PST
Presenting findings from the Open sessions. Lots of discussion around different business functions and what is observed, tools, and more.
8/2/18 4:32 PM PST
The Testing in Production to-do list.
8/2/18 5:14 PM PST
Thank you Honeycomb for putting on a conference filled with great info!