Frank Wiles, @fwiles @revsys Slides will be online later.
Smells Like Teen Systems: Advice for raising healthy happy systems and getting to DevOps nirvana
People are fearful of change, so changes must be small at first. Baby steps. Be agile with a little a, not a big A: be spiritual, not fundamentalist; mandating… Just because you read it somewhere doesn’t mean you must do it if it doesn’t work for your organization. Have ammunition: managers need data and explanations to make decisions.
Apply metrics mentality to:
- change requests
- trouble tickets and bugs
- deployments
- outages of the smallest magnitude
- interoffice political fights
- approved and denied requests for equipment or funds
- hires, fires, and quits
- money ($$), labor hours, etc.
“We spend on average 19 hours per week requesting more information”
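A figure like the one quoted above needs very little tooling to produce. What follows is a minimal sketch, assuming a hypothetical time log where each entry records an ISO week and the hours spent going back to ask for more information. The data, the `time_log` name, and the `average_hours_per_week` helper are all illustrative assumptions, not something shown in the talk.

```python
from collections import defaultdict

# Hypothetical time log: (ISO week label, hours spent requesting more information).
# The numbers are made up for illustration only.
time_log = [
    ("2015-W01", 6.5),
    ("2015-W01", 11.0),
    ("2015-W02", 20.5),
    ("2015-W03", 9.0),
    ("2015-W03", 10.0),
]


def average_hours_per_week(entries):
    """Sum the hours for each week, then average those weekly totals."""
    weekly_totals = defaultdict(float)
    for week, hours in entries:
        weekly_totals[week] += hours
    if not weekly_totals:
        return 0.0  # no data yet, so nothing to report
    return sum(weekly_totals.values()) / len(weekly_totals)


if __name__ == "__main__":
    avg = average_hours_per_week(time_log)
    print(f"We spend on average {avg:.1f} hours per week requesting more information")
```

The same shape works for any item on the list: record each event with a date and a cost, group by week, and hand the manager one number.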
Guilt tripping — no other option to keep up.
“Once we put <insert system> in place, we realized we no longer needed that weekly meeting…”
DevOps: Develop Everything Visibly, Operate/Automate Paranoid Services
DEV: Develop Everything Visibly: “Everything has to happen out in the open”
OPS: Operate/Automate Paranoid Services “Automate everything with ridiculous amounts of monitoring and metrics”
Everything is version-controlled. Log of why things happened.
Everything is tracked. Ticketing; Trello; Bugs; etc.
Even more visibility:
- Level 1: Team Chat. Like Slack. Email is for outsiders.
- Level 2: Chat Ops <– mmmmmbot!
- Level 3: Have some fun <– Fun bots
Chat ops suggestions
- Deployments and config changes
- Status summaries: bot check load db3
- Maintenance: bot start maintenance file-server-1
- Display Alerts and Warnings
- Server boot/shutdown messages
- Ops logs: bot log Upgraded redis to 2.8.19
- Resolutions: bot resolve ticket #8 Ended up just needing to restart Apache
- Common actions: bot restart apache on production
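Commands like the ones above all share one shape: the bot's name, a verb, then free-form arguments. The sketch below shows one way to route them, using only the Python standard library. It assumes nothing about any particular chat platform or bot framework: the `HANDLERS` registry, the `command` decorator, and `dispatch` are hypothetical names, and the handlers only describe what they would do rather than touching real systems.

```python
# Registry mapping a command verb (e.g. "log") to the function that handles it.
HANDLERS = {}


def command(verb):
    """Decorator that registers a handler function under the given verb."""
    def register(func):
        HANDLERS[verb] = func
        return func
    return register


@command("log")
def log_entry(rest):
    # "bot log Upgraded redis to 2.8.19" -> keep the free-form text as an ops log line.
    return f"Logged: {rest}"


@command("check")
def check(rest):
    # "bot check load db3" -> metric "load" on host "db3".
    parts = rest.split()
    if len(parts) != 2:
        return "Usage: check <metric> <host>"
    metric, host = parts
    return f"Would check {metric} on {host}"


@command("restart")
def restart(rest):
    # "bot restart apache on production" -> service and environment.
    service, separator, environment = rest.partition(" on ")
    if not separator or not service or not environment:
        return "Usage: restart <service> on <environment>"
    return f"Would restart {service} on {environment}"


def dispatch(message, bot_name="bot"):
    """Route a chat message to its handler; return None if it is not for the bot."""
    words = message.split(maxsplit=2)
    if len(words) < 2 or words[0] != bot_name:
        return None
    verb = words[1]
    rest = words[2] if len(words) > 2 else ""
    handler = HANDLERS.get(verb)
    if handler is None:
        return f"Unknown command: {verb}"
    return handler(rest)


if __name__ == "__main__":
    for line in ["bot check load db3", "bot log Upgraded redis to 2.8.19", "lunch?"]:
        print(repr(dispatch(line)))
```

Because every command passes through `dispatch`, that one function is also the natural place to record who ran what and when, which is the visibility the chat channel is meant to provide.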
Tools: This is how we do it
- Python: scripting language that is relatively easy to learn and readable, with libraries for talking to everything. Fabric is highly recommended: shell scripting on steroids.
- SaltStack: a master plus salt (minion) code; as simple or as complicated as you want; fast communication even among hundreds of systems (ZeroMQ + AES); extensible via Python; can return data to the master for monitoring or metrics purposes; simple to crazy-complicated orchestration between systems. Examples of uses: targeting (/srv/salt/top.sls); pillars (/srv/pillar/*: config differences as data, such as…); templating.
- Consul: service discovery and monitoring: health checks; discover services via DNS or HTTP REST APIs; deadman health checks.
- ELK: Elasticsearch/Logstash/Kibana <– fast log searching for when you don’t.
- “Logs that aren’t centralized are rarely checked and logs that aren’t searchable are never correlated” -Frank Wiles
- Grafana: for metrics visualization; pretty graphs.
- Don’t capture exceptions in your inbox; put them in a system such as Exception.io or Rollbar. Rollbar also tracks deployments.
- What to capture? As much as you can store.
- general collectd system stats
- logins/signups/emails sent
- failed login attempts/emails bounced
- run time of crons and batch jobs
- backup run times and file size(s)
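Capturing the run time of crons and batch jobs can be as small as a wrapper around the job. The sketch below is one possible approach, not the speaker's setup. The `timed` context manager, the `recorded` list standing in for a metrics backend, and the metric names are all hypothetical. The line format follows Graphite's plaintext style (`<path> <value> <timestamp>`), chosen here as an assumption; the talk names collectd and Grafana but does not say how samples are shipped.

```python
import time
from contextlib import contextmanager

# Stand-in for a real metrics backend: samples are simply appended to this list.
recorded = []


def format_metric(name, value, timestamp):
    """Format one sample as '<path> <value> <timestamp>' (Graphite plaintext style)."""
    return f"{name} {value:.3f} {int(timestamp)}"


@contextmanager
def timed(name, sink=recorded.append, clock=time.time):
    """Measure how long the wrapped block runs and pass one metric line to `sink`."""
    start = clock()
    try:
        yield
    finally:
        # Record the run time even if the job raised, so failures are still visible.
        end = clock()
        sink(format_metric(f"{name}.run_seconds", end - start, end))


if __name__ == "__main__":
    with timed("cron.nightly_backup"):
        sum(range(1_000_000))  # placeholder for the real batch job
    print(recorded[-1])
```

Passing `sink` and `clock` as parameters keeps the timing logic testable without a network or a real clock; in practice `sink` would be replaced by whatever sends the line to the metrics system in use.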
Resistance. Route around it. If you don’t work with the process….
Maverick, Ricardo Semler (1993)
Turn resistance back on others: sometimes make it so cumbersome that it becomes a burden on their way of thinking.