Date and Time: 2026-06-16 12:30p ET

System design, but for AI systems: what extra skills do “Classic” SREs need to be aware of (Observability Gaps, Governance & Safety, Capacity Planning, Regression, etc.)? Or Forward Deployed Engineers?
* Classic design ~ monthly users, storage; AI-enabled ~ MLOps, LLMOps, Data Model, regressions, token efficiency. Resources for learning?
* Forward Deployed Engineers – engineers assigned to customers. Somehow different from Solution Engineer, with observability and scaling responsibilities. Perhaps similar to Customer Reliability Engineer.
* Same as it ever was? Employers want employees to do more things plus AI.
* "Classic" software was deterministic and black and white for correctness, but with AI/LLM output there is variance. Do SLOs and SLIs work to measure AI response content?
* Enumerate broken user scenarios to learn desired behavior and anticipate possible conditions
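
One way to picture an SLI/SLO over AI response content, as a minimal sketch only: sample responses, judge each against a rubric per user scenario, and treat the pass fraction as the SLI. The names `EvalResult`, `quality_sli`, `meets_slo`, and the 0.95 target are hypothetical and were not part of the discussion; how a response gets judged (human review, automated checks) is left open.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One sampled AI response, judged against a (hypothetical) rubric."""
    scenario: str  # user scenario the response belongs to
    passed: bool   # True if the response met the rubric for that scenario


def quality_sli(results):
    """SLI: fraction of sampled responses that passed, in [0, 1]."""
    if not results:
        raise ValueError("no samples in the evaluation window")
    return sum(r.passed for r in results) / len(results)


def meets_slo(results, target=0.95):
    """SLO check: is the SLI at or above the target? (target is an assumption)"""
    return quality_sli(results) >= target


def failing_scenarios(results):
    """Scenarios with at least one failed sample, for the 'broken scenarios' list."""
    return sorted({r.scenario for r in results if not r.passed})


if __name__ == "__main__":
    window = [
        EvalResult("refund request", True),
        EvalResult("refund request", True),
        EvalResult("cancel subscription", False),
        EvalResult("password reset", True),
    ]
    print(quality_sli(window))
    print(meets_slo(window))
    print(failing_scenarios(window))
```

Because output varies from run to run, the SLI here is a rate over many samples rather than a per-response right/wrong, and the failing-scenario list feeds back into the enumerated user scenarios.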

How to get to a position to solve a ‘scaling’ issue? Or is that a myth, and the journey is solving tiny scale issues over time? >> OK, my real question is how to qualify for certain jobs that require having scaled something large.
* How do you get "scale" experience without having scaled yet?
* You don't. You have your scale challenge experiences.
* Is there something different about scaling up AI deployments?

AI chatbots substitute partially for conversations with teammates, but fail to produce social outcomes like consensus
* teammates are missing out on a shared mental model of the topic under discussion, which is useful context for future work on the project, and end up with divergent ones
* chatbot implicitly gets a vote by participating in lieu of asking a human
* chatbot is aggregate knowledge for multiple people which may be valued more than local context
* accountability or intent for AIs; maybe transitively for the humans who chose the chatbots

Vendors pushing AI SRE (Elastic Kubernetes Investigations Agent, New Relic SRE Agent). Any field reports?
* https://newrelic.com/platform/sre-agent
* https://www.elastic.co/observability-labs/blog/ai-powered-kubernetes-observability-elastic-mcp
* Linux Sysadmin AI: needs handholding, and it failed at how to install Secure Boot.
* Datadog Watchdog anomaly detection → Bits AI SRE https://www.datadoghq.com/dg/monitor/bits-ai-sre/ : is a hypochondriac. Can propose many possibilities without regard to cost.
* Decompensation – both in social mechanisms and technical ones. Provides ways to handle system stress, not obvious from raw source
** Fred Hebert writes in Decompensation and Cascading Failures https://resilienceinsoftware.org/news/11454232
* When the AI is On-Call - Thinking Out Loud with Niall Murphy https://resilienceinsoftware.org/networks/events/253432
** AI tools are increasingly showing up in SRE and incident response workflows, but what does that actually mean for the humans in the system? Not the benchmarks, not the vendor promises: the real questions about expertise, tribal knowledge, collaborative sensemaking, adaptive capacity, and what happens when these things get widely deployed.
** In this session, you are invited to a facilitated conversation with Niall Murphy to think out loud together on the bigger questions: what do these products contribute to — or take away from — the human side of the sociotechnical system? If you've found yourself with more questions than answers about AI in your practice, this is the session for you.