Keynote at 2015 Kansas Linux Fest, hosted at Lawrence Public Library
Dave Lester @davelester
OSS Advocate at Twitter, Inc.
Apache Mesos and Aurora PMC Member
A lot of metadata are in tweets.
Twitter is a big proponent of Open Source — see their website. Some are on Github; some are not.
Front-end developing — Bootstrap; typeahead.js
Dave focuses on key infrastructure projects at Twitter: finagle, scalding; analytics and infrastructure
1. How is Twitter scaling?
What is scaling? See Wikipedia entry. Reaching beyond your current capacity — social and technical solutions.
Twitter numbers (2014):
- 500 million tweets /day
- 3.5 billion/week;
- 6000+ tweets/sec (steady state)
Twitter is “the pulse of the planet”. Can sometimes predict spikes (live, popular events, like the World Cup); sometimes can’t. Could throw 10x the servers at the problem OR improve scalability.
Remember the Fail Whale?
Previously, Twitter: ruby on rails, 200 engineers pushing code; needed a solution to isolate failure and isolate feature development
During 2010 World Cup — lots of issues keeping Twitter up; 2014 after scalability and OS projects, much more stable.
Breaking up monolithic applications into microservices. Common pattern among companies; see Groupon talk, “Breaking up the monolithic”
Today, building a distributed system.
2. Twitter’s Open Source infrastructure
- “Twitter Stack” including Apache Mesos, Aurora, Finagle
- Mesos: top-level software at Apache; began as research project at UC Berkeley; layer of abstraction between machines in a datacenter and applications that run: cluster manager & resource manager. Mesos actively monitors what’s happening across the cluster (Zookeeper). Addresses the problems of fault tolerance and resource efficiency and utilization.
- Design Challenges: each framework may have different scheduling needs; must scale to tens of thousands of nodes running hundreds of jobs with millions of tasks; must be fault-tolerant and highly available
- Master-Worker architecture + Zookeeper cluster
- Marathon scheduler
- A lot of #klf15 scalability preso is going over my head, but I do wonder what #kohails project could “get†from Mesos/Aurora scalability
- Why care about resource utilization? Fewer machines; less human resources.
- How to best reuse idling times? Early research
- Quasar — users specify performance target for applications instead of typical resource reservations; machine-learning used to predict resources usage and for cluster scheduling; research by Christina Delimitrou and Christos Kozyrakis at Stanford
- Google Borg — Google’s cluster management solution; AMP Lab, and John Wilkes spoke at MesosCon 2014.
- Aurora provides deployment and scheduling of jobs; rich DSL for defining services; health checking; one scheduler to rule them all: can manage both long-running services, as well as cron; can mark production and non-production jobs; production jobs can pre-empt non-prod jobs; has an additional priority system. Aurora has executor features — responsible for executive code on individual worker machines, sending status to Mesos when a task completes.
- Hundreds of separate services with different owners
- Managed by Site Reliability Engineer (SRE) teams
3. How and why OSS?
“many parts building on and amplifying each other” –Gordon Haff, Red Hat
Building an ecosystem.
Frameworks
Services: Aurora; Marathon; Kubernetes; Singularity
Big Data: Spark; Storm; Hadoop
Batch: Chronos; Jenkins
Framework bindings — C++, Java, Clojure, Haskell, Python, or write your own.
Resources for writing mesos frameworks — his slides will go online with links to this info.
Community > Code. Very very much true.
Let’s Scale in the Open: increased speed of innovation; more-reliable software; more-visible contributions and impact; broader peer group and sense of community.