Date and Time: 2026-02-17 12:30p ET

Should we alert on all or some environments?
* No. Humans are likely to already be looking at environments under test.
* Yes. QA the logs and metrics and alerts
* Yes. Test everything bottom to top. Automate everything, including provisioning/deploying to environments. Allows comparison between environments
* Yes. Use different environments to test different risk vectors.

From the last meeting: someone suggested using a RACI (responsible, accountable, consulted, informed) matrix for production support after a program's initial deployment
* Business problem: unowned systems in production. Expect people to come and go.
* Use a service definition platform registry to record owners. A HOTS (handover/handoff to support) doc describes the plan at the time of deployment to production
* Instead of abstract RACI, describe scenarios of when to contact teams or individuals
* Have a bot monitor for zombies (owners who have left the company, or services that haven't been manually deployed recently)
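
A minimal sketch of the zombie-monitoring bot idea, assuming a hypothetical service registry that records owners and the last manual deploy date, plus a set of currently active employees. The `Service` fields, `find_zombies`, and the 180-day threshold are illustrative assumptions, not anything agreed in the meeting:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Service:
    # Hypothetical registry record; field names are assumptions.
    name: str
    owners: list[str]
    last_manual_deploy: date


def find_zombies(services, active_employees, today, max_age_days=180):
    """Return {service name: [reasons]} for services that look unowned or stale."""
    cutoff = today - timedelta(days=max_age_days)
    zombies = {}
    for svc in services:
        reasons = []
        # Zombie signal 1: every listed owner has left the company.
        if not any(owner in active_employees for owner in svc.owners):
            reasons.append("no active owner")
        # Zombie signal 2: nobody has manually deployed the service recently.
        if svc.last_manual_deploy < cutoff:
            reasons.append("no recent manual deploy")
        if reasons:
            zombies[svc.name] = reasons
    return zombies


services = [
    Service("billing", ["alice"], date(2026, 1, 10)),
    Service("legacy-report", ["bob"], date(2024, 5, 1)),
]
zombies = find_zombies(services, active_employees={"alice"}, today=date(2026, 2, 17))
```

A real bot would pull both inputs from the registry and the HR directory and post the results somewhere the on-call team already looks.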

Auto-documentation generation from code… and _from production_?
* Interest in an SRE-centric version of https://codewiki.google/ ?
* Gemini CLI has been used to look at source code and generate traditional docs (system architecture, flow diagrams)
* A team is experimenting with looking at services holistically including dependencies from production configs, logs, and incidents to generate a catalog of 1st party services, dependencies, and snippets about recent history of production
* This sounds more like dynamic observability than static documentation.
* This can be fed to AI agents for planning.
* Only 30% of tool calls into db were traditionally observable during incident response (70% were not!)

Disaster Recovery (Best practices, ideas, how to track this)
* Best tools from GCP, automated scraping of network topology, db-centric
* Guides or books to use as reference aside from Google's SRE books
* https://cloud.google.com/blog/topics/developers-practitioners/how-google-sres-use-gemini-cli-to-solve-real-world-outages
* Tabletop exercise
* Requires business context: what the company needs, plus the local constraints on people, time, and coordinating system states (e.g., flushing pipelines). One person used a big spreadsheet to track all the tables, teams, people, and features.
* suggestion: require app developers to provide a services page listing dependencies, perhaps custom HTTP response codes to indicate partial functionality or more verbose metadata about expected degradation per dependency
* suggestion: prioritize functions and user stories, not entire services or datastores
* practice taking the entire system down (e.g., during annual maintenance) to speed up recovery time
* Contact Artur later https://www.linkedin.com/in/arturmartins/ if you'd like to continue discussing DR
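
The services page suggestion above could look roughly like this. It is a sketch only: the `Dependency` fields, the `services_page` function, and the choice of 207 as a "partially functional" signal are all assumptions made for illustration (207 is borrowed from WebDAV; a team could pick any code it documents):

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass
class Dependency:
    # Hypothetical per-dependency metadata supplied by the app developers.
    name: str
    healthy: bool
    critical: bool      # True if the service cannot function without it
    degradation: str    # expected user-visible effect when it is down


def services_page(deps):
    """Build a (status_code, body) pair describing dependency health."""
    down = [d for d in deps if not d.healthy]
    if any(d.critical for d in down):
        status = HTTPStatus.SERVICE_UNAVAILABLE   # 503: a critical dependency is down
    elif down:
        status = HTTPStatus.MULTI_STATUS          # 207: assumed "partial functionality" signal
    else:
        status = HTTPStatus.OK                    # 200: everything is healthy
    body = {
        "degraded": [d.name for d in down],
        "dependencies": [
            {
                "name": d.name,
                "healthy": d.healthy,
                "critical": d.critical,
                # Only report the expected degradation while the dependency is down.
                "expected_degradation": None if d.healthy else d.degradation,
            }
            for d in deps
        ],
    }
    return int(status), body


deps = [
    Dependency("orders-db", True, True, "checkout unavailable"),
    Dependency("recommendations", False, False, "home page shows no suggestions"),
]
code, body = services_page(deps)
```

Listing the expected degradation per dependency also supports the suggestion to prioritize functions and user stories: the page shows which user-facing features are lost, not just which backend is down.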

Staging environment - good or bad idea?
* One example: 40 environments for compliance (zero for production!), just to demonstrate environmental differences. Depends on the use case.

Using Nexus https://www.sonatype.com/products/sonatype-nexus-repository - how does it reload libraries? Could this lead to potential silent build issues?
* Your team will put snapshots of libraries (artifacts) into Nexus.
* Apps can still fail at runtime even when they compile cleanly. Such failures are only detectable by exercising the application and verifying the _behavior_ you expect.
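
One way to exercise behavior after a library snapshot changes is a small smoke-check runner. This is a generic sketch, not tied to Nexus or any build tool; `run_behavior_checks` and the example checks are hypothetical stand-ins for calls into the real application:

```python
def run_behavior_checks(checks):
    """Run (name, callable, expected) checks and return a list of (name, problem)."""
    failures = []
    for name, fn, expected in checks:
        try:
            actual = fn()
        except Exception as exc:
            # The code loaded fine but blew up when actually exercised.
            failures.append((name, f"raised {type(exc).__name__}: {exc}"))
            continue
        if actual != expected:
            # The code ran, but the behavior silently changed.
            failures.append((name, f"expected {expected!r}, got {actual!r}"))
    return failures


# Stand-in checks; in practice each callable would hit the deployed app.
failures = run_behavior_checks([
    ("addition still works", lambda: 1 + 1, 2),
    ("slug keeps its case", lambda: "Release-Notes".lower(), "Release-Notes"),
    ("ratio is computed", lambda: 1 / 0, 0),
])
failed_names = [name for name, _ in failures]
```

Running checks like these against each environment after a dependency refresh surfaces the silent issues that a successful build cannot.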