Date and Time: 2026-01-20 12:30p ET

Hype for the AI SRE agent: "SRE" seems to mean somebody doing incident response
* Incidents are only part of the job, often a very small part

SRE confuses people because it is a safety role
* It has more to do with tracking what is changing in the system, and that doesn't show up in day-to-day work
* Demands a lot of independence: making your own goals about other people's goals
* Understanding the graphs: what does normal look like?

Experience report: entry into SRE was observability
* The biggest KPIs SREs are measured by are the MTTRs
* MTT detect (improved with good dashboards)
* MTT escalate (targeted escalation to the right person; knowing who can fix it)
* MTT fix (automation)
* The blameless retro is the most powerful tool
* Annotate time-series data with deployments
* Like athletes training: we only see the games, not the practices

Are SLOs the right way to look at this stuff?
* What if we look for cliffs in the time-series graph?

Different experience of SRE
* The SRE organization had authority to halt deployments or remove apps from production to close vulnerabilities
* Being consulted on the architecture helps reduce incidents early

Chaos engineering approach
* Many companies want reliability but nevertheless do not invest in it
* It has always been difficult to figure out where the audience or market is for SRE tooling
* People understand they need reliability
* Now trying to build a tool that gets into the developers' deploy path without being a block
* Reliability appears as an interruption in the path to getting new functions to customers
* Can we present context to developers about the impact of code decisions on production?

Sometimes there is a bigger payoff in listening to what the system is doing instead of creating more stuff

Reliability needs to be in the design; it cannot be added on later
* With no leader-election system, you cannot have a no-downtime deployment
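The MTTR breakdown discussed above (detect, escalate, fix) can be sketched as a small calculation over incident timestamps. This is a minimal illustration, not anyone's actual tooling; the field names (`started`, `detected`, `escalated`, `fixed`) are assumptions chosen to match the three phases in the notes.

```python
from datetime import datetime

def mttr_breakdown(incidents):
    """Average time-to-detect, time-to-escalate, and time-to-fix
    (in minutes) across a list of incidents. Each incident is a dict
    with 'started', 'detected', 'escalated', and 'fixed' timestamps
    (hypothetical field names for this sketch)."""
    n = len(incidents)

    def avg_minutes(start_key, end_key):
        total = sum((i[end_key] - i[start_key]).total_seconds()
                    for i in incidents)
        return total / n / 60

    return {
        "mtt_detect": avg_minutes("started", "detected"),
        "mtt_escalate": avg_minutes("detected", "escalated"),
        "mtt_fix": avg_minutes("escalated", "fixed"),
    }

# One example incident: 8 min to detect, 12 to escalate, 40 to fix.
incident = {
    "started": datetime(2026, 1, 20, 12, 0),
    "detected": datetime(2026, 1, 20, 12, 8),
    "escalated": datetime(2026, 1, 20, 12, 20),
    "fixed": datetime(2026, 1, 20, 13, 0),
}
print(mttr_breakdown([incident]))
# → {'mtt_detect': 8.0, 'mtt_escalate': 12.0, 'mtt_fix': 40.0}
```

Splitting MTTR this way makes it clearer which investment (dashboards, escalation routing, automation) moves which number.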
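"Annotate time-series data with deployments" from the notes above can be sketched as tagging each metric sample with the most recent deployment before it, so a latency spike can be read against the release that preceded it. This is a toy sketch; real setups would do this in the dashboarding layer (e.g. deployment markers on graphs), and the function and data shapes here are assumptions.

```python
from bisect import bisect_right
from datetime import datetime

def annotate_with_deployments(points, deployments):
    """Tag each metric point with the most recent deployment at or
    before it.

    points: list of (timestamp, value), sorted by timestamp.
    deployments: list of (timestamp, version), sorted by timestamp.
    Returns a list of (timestamp, value, version_or_None).
    """
    deploy_times = [t for t, _ in deployments]
    annotated = []
    for ts, value in points:
        i = bisect_right(deploy_times, ts)
        version = deployments[i - 1][1] if i > 0 else None
        annotated.append((ts, value, version))
    return annotated

# Latency samples straddling a hypothetical "v2" deploy at 12:05.
points = [(datetime(2026, 1, 20, 12, 0), 110),
          (datetime(2026, 1, 20, 12, 10), 450)]
deploys = [(datetime(2026, 1, 20, 12, 5), "v2")]
print(annotate_with_deployments(points, deploys))
```

The second sample (the 450 ms spike) comes back tagged with "v2", which is exactly the correlation the annotation practice is meant to surface.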
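The "look for cliffs in the time-series graph" idea above can be sketched as a simple scan for sudden fractional drops between consecutive samples. The threshold and the naive consecutive-sample comparison are assumptions for illustration; a production detector would smooth or window the data first.

```python
def find_cliffs(series, drop_threshold=0.2):
    """Return the indices where the series falls by more than
    drop_threshold (as a fraction of the previous value) between
    consecutive samples. drop_threshold is an assumed tunable."""
    cliffs = []
    for i in range(1, len(series)):
        prev, cur = series[i - 1], series[i]
        if prev > 0 and (prev - cur) / prev > drop_threshold:
            cliffs.append(i)
    return cliffs

# Availability hovering near 1.0, then a cliff at index 4.
availability = [0.999, 0.998, 0.999, 0.997, 0.62, 0.61]
print(find_cliffs(availability))  # → [4]
```

A cliff detector like this complements an SLO: the SLO tells you how much error budget burned over a window, while the cliff tells you the moment something broke.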