Google's mission-critical platforms are maintained by teams of Site Reliability Engineers (SREs). SREs distinguish themselves from conventional system administrators by applying engineering and scientific methods to operational problems. Fundamentally, SRE is what you get when you treat operations as if it’s a software problem. Their philosophy is described in the book Site Reliability Engineering – How Google Runs Production Systems. The SRE book is an invaluable source of information and inspiration for teams responsible for operating mass-deployed IT services of any kind.
In this blog post I distill several key points that proved the most relevant for my team. Our primary task is the design and operation of a private database cloud for Oracle and Microsoft SQL Server databases. Operating by these principles over the last few years has not only helped us achieve a high level of customer satisfaction, but has also contributed significantly to creating a very attractive working environment.
Many midsize and large organizations prescribe a segregation of duties in the database environment. Accordingly, they define the following roles: “architect”, “engineer” and “administrator”. Usually, “architects” define blueprints. More often than not, they have only modest hands-on experience. Then, “engineers” set up a system in the lab, test it on synthetic data sets and automate some database tasks. Normally, they don’t have access to the production servers. At the end of the chain are “administrators” who operate and maintain application databases.
I’ve never seen this model work well in practice, because the chasm between the different roles is huge. The designed platforms often don’t meet customer needs, and “engineers” are too far away from production problems. Furthermore, “administrators” often lack scripting skills, which means hiring more people to do the same tasks over and over again. Besides that, those roles tend to be placed in different organisation units with somewhat different objectives, which gives rise to political discussions once things go wrong or the ever-increasing workload can’t be handled any more.
The “architect” role simply doesn’t exist at Google. Everyone in their engineering organization designs, implements and operates the system. Even SRE line managers are typically also technical contributors to their team, including being part of an on-call rotation.
Similarly, I prefer to assume total ownership of the platform and carry the end-to-end responsibility for it. In doing so, the team is empowered to take full control over every critical aspect of the service delivery. We appreciate the immediate feedback loops and are motivated to fix problems quickly as well as to put continuous effort into platform improvement.
By design, the SRE teams are focused on engineering. Without constant engineering, the operational load increases and teams need more people just to keep pace with the workload. Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more efficiently than a pure operations team. Not only does the lack of automation inhibit scaling, but repetitive work also becomes toxic when experienced in large quantities. Furthermore, automation is one of the best ways to prevent human errors and ensure consistency across multiple systems.
In our environment, at the outset we automated the most critical and frequently executed tasks, like database creation, copy-on-write snapshot and rollback, database cloning, patching and upgrades etc. Over the last 6 years we have managed to automate almost every repetitive task and are constantly looking for gaps.
For automation we use object-oriented Perl on the Oracle platform and PowerShell on the SQL Server platform.
SRE teams use frameworks, because they provide multiple upfront gains in terms of consistency and efficiency.
We applied the same principles for the automation of the database tasks. There is a shared set of features which needs to be implemented for each and every task. The canonical examples of such features are logging, alerting, multi-threading for parallel processing and command line parameter parsing. The reimplementation of the common feature set would be a waste of engineering effort and time.
We chose the Command software design pattern to enforce uniformity on every automated process. The Invoker class performs all of the common operations and the Receiver class focuses just on the database processing. Not only does the concept enable automation at a level not possible before, but it also facilitates the on-boarding of new contributors to the project, because developers can focus on implementing the database logic without needing to care about the common functionality that is inherent to the framework.
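The division of labour between Invoker and Receiver can be sketched as follows. This is a minimal illustration in Python rather than our Perl or PowerShell code, and every name in it (SnapshotReceiver, CreateSnapshotCommand, Invoker.run, the database name ORCL1) is hypothetical. The Receiver only returns a string where the real database work would happen, and logging stands in for the full set of common features.

```python
import logging
from abc import ABC, abstractmethod


class SnapshotReceiver:
    """Receiver: holds only the database-specific logic (hypothetical)."""

    def create_snapshot(self, database: str) -> str:
        # Placeholder for the real database work.
        return f"snapshot of {database} created"


class Command(ABC):
    """A request packaged as an object that the Invoker can execute."""

    @abstractmethod
    def execute(self) -> str:
        ...


class CreateSnapshotCommand(Command):
    """Binds a receiver to the arguments of one concrete request."""

    def __init__(self, receiver: SnapshotReceiver, database: str) -> None:
        self.receiver = receiver
        self.database = database

    def execute(self) -> str:
        return self.receiver.create_snapshot(self.database)


class Invoker:
    """Runs any command and supplies the behaviour shared by all tasks."""

    def __init__(self) -> None:
        self.log = logging.getLogger("dbtask")
        self.history: list[str] = []

    def run(self, command: Command) -> str:
        name = type(command).__name__
        self.log.info("starting %s", name)
        try:
            result = command.execute()
        except Exception:
            # A real framework would also raise an alert at this point.
            self.log.exception("%s failed", name)
            raise
        self.log.info("finished %s", name)
        self.history.append(name)
        return result


invoker = Invoker()
print(invoker.run(CreateSnapshotCommand(SnapshotReceiver(), "ORCL1")))
```

Adding a new task in this sketch means writing one receiver method and one small command class. The Invoker, and with it the logging and error handling, stays untouched.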
In order to work at scale, teams must be self-sufficient. For this reason, when our automation framework matured, we started building a self-service layer around it, so that application teams can execute database tasks in a safe and controlled manner. For example, an application developer can execute a complex database upgrade just by typing a single command and embed it as a substep in a more complex application release flow. The tasks are automated to the point that they don’t require any involvement by the engineers. For many applications which put emphasis on agility, this was a real game changer, because they never have to wait for the database team to complete a standard task. Besides that, it sped up the test cycles, because the application developers can provision and refresh their own development databases at any time. Some typical processes exposed as self-service are database cloning, copy-on-write snapshots and rollbacks as well as database patching and upgrades. They are truly automatic, and only require engineer involvement if and when problems arise.
Improvisation and Reverse Engineering
The Google systems are among the most complex software artifacts that have ever been built. Because of the complexity and change velocity, sometimes the unexpected happens. At that point, Google SREs will have to improvise. Moreover, in the course of their jobs, they will come across systems they’ve never seen before, so they need to have strong reverse engineering skills. As a consequence, improvisation and reverse engineering are production services at Google.
In our lives as database engineers we have to deal with externally developed applications, sometimes of dubious quality. At some point, an application accessing a database will inevitably break. In such moments, having improvisation and reverse engineering skills, and therefore being able to quickly identify the cause of the problem, can mean a significant advantage over the competition. For this reason, we take a broader view of the platform. A good database engineer will also have at least some understanding of the underlying operating system and the network protocols. By understanding the basic principles and knowing how to use diagnostic tools, they will be able to figure out how the application behaves when it fails. Often we managed to find a good workaround for a critical production problem even before getting a response from the application software vendor. Over time, by being involved in many incidents, the team has also accumulated “production wisdom” specific to our organisation and application landscape.
Google hires SREs who have both development and system engineering skills. Candidates have to pass several rigorous peer interviews. At the end, a committee decides whether to hire the applicant.
Although our bar for technical knowledge is set much lower than at Google, we still look for the following essential skills when hiring new employees: database administration experience (both Oracle and SQL Server), coding and troubleshooting skills. Furthermore, curiosity about the technology and the ability to learn are also required.
In the database world the talent pool with the ideal skills mix is scarce. An alternative option that has worked very well for us is to hire a younger person with some working experience as a database administrator and a solid education. Although these young people haven’t gathered any working experience as software developers, they learned this discipline at the university. On their education path they also learned the engineering way of thinking, and by mastering a difficult curriculum they proved to have cognitive skills needed for quick learning. Fully fledged software development projects within our team provide career development opportunities for these talents. The project work provides much-needed balance to the operational work, and can provide job satisfaction for engineers looking for more versatile roles.
Some people in our team also have complementary skills, like project management, organisational and collaboration skills. These team members have the opportunity to assume the role of “technical owner” (also known as TO), who acts as the IT infrastructure account manager and consultant for a given application. Our chances of successful collaboration across the organisation are improved by having more diversity in our team.
Google developed their own monitoring system which they dubbed Borgmon. What sets Borgmon apart from other monitoring tools is its architecture. Instead of executing scripts to detect system failures, it does mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup. The data points within a given period are stored in an in-memory database. Because collection is no longer in a short-lived process, the history of the collected data can be used for the alert computation as well. The alerting rules are stored and evaluated centrally outside of the monitored server.
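The idea of keeping recent samples in memory and evaluating alert rules against their history can be illustrated with a small sketch. It is not Borgmon's implementation or rule language. The names TimeSeriesStore and sustained_above, the metric cpu_busy_pct and all thresholds are assumptions made up for this example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


class TimeSeriesStore:
    """Keeps the samples of the last `window_seconds` in memory, per metric."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.series: Dict[str, Deque[Sample]] = {}

    def record(self, name: str, timestamp: float, value: float) -> None:
        points = self.series.setdefault(name, deque())
        points.append((timestamp, value))
        # Evict samples that have fallen out of the retention window.
        while points and points[0][0] < timestamp - self.window:
            points.popleft()

    def values(self, name: str) -> List[float]:
        return [value for _, value in self.series.get(name, ())]


def sustained_above(threshold: float, min_samples: int) -> Callable[[List[float]], bool]:
    """Rule: fire only if the last `min_samples` values all exceed the threshold."""

    def rule(values: List[float]) -> bool:
        recent = values[-min_samples:]
        return len(recent) == min_samples and all(v > threshold for v in recent)

    return rule


store = TimeSeriesStore(window_seconds=300)
for ts, busy in [(0, 40.0), (60, 95.0), (120, 96.0), (180, 97.0)]:
    store.record("cpu_busy_pct", ts, busy)

cpu_alert = sustained_above(threshold=90.0, min_samples=3)
print("alert:", cpu_alert(store.values("cpu_busy_pct")))
```

Because the rule sees a history rather than a single probe result, one short spike does not fire the alert, and no process has to be started on the monitored server to evaluate it.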
If you ever get a chance to design a monitoring solution from scratch, it might be worth analyzing the Borgmon architecture and adopting some of its basic principles. Never underestimate the overhead of managing threads and forked processes, particularly on a highly consolidated system. For example, take a look at the truss -c -f output for an Oracle Cloud Control Agent process:
    function          time(s)     count
    -------------------------------------
    lwp_park           22.465    173347
    lwp_unpark          2.805     93786
    ...
    forkx              87.612      1311
    ...
    yield              11.617    190621
    lwp_suspend       140.853   1115963
    lwp_continue        8.329   1117835
    ...
                     --------    ------
    sys totals:       304.582   3277516
    usr time:         215.874
    elapsed:         2106.830
In the example above, roughly 14% of the elapsed time (304.6 of 2106.8 seconds) is OS kernel CPU time, and about 90% of that kernel time is spent just on the thread management and forking calls listed!
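The proportions can be checked directly from the figures in the truss output. The short calculation below uses only the rows that are visible in the listing. The rows elided with "..." are not known, so the sum over the listed calls is a lower bound for the thread-related share.

```python
# Figures copied from the truss -c -f output (seconds).
sys_total = 304.582
elapsed = 2106.830

# Only the calls that are visible in the listing; elided rows are unknown.
listed_calls = {
    "lwp_park": 22.465,
    "lwp_unpark": 2.805,
    "forkx": 87.612,
    "yield": 11.617,
    "lwp_suspend": 140.853,
    "lwp_continue": 8.329,
}

listed_total = sum(listed_calls.values())
sys_share = sys_total / elapsed
listed_share = listed_total / elapsed

print(f"kernel CPU time: {sys_share:.1%} of elapsed time")
print(f"listed thread and fork calls: {listed_total:.3f} s, {listed_share:.1%} of elapsed time")
```

The total system time comes to about 14.5% of the elapsed time, and the listed thread and fork calls alone account for about 13%.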
Furthermore, there are open source monitoring solutions like Nagios which allow a lot of flexibility in design. To avoid scalability problems, evaluate the alternatives to launching large numbers of monitoring probes over ssh (or WinRS on MS Windows).
One trait that’s vital to the long-term health of an organization is how the people involved respond to an emergency. Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. Outages can put a lot of pressure on the operations staff. Therefore, it is important for on-call engineers to understand that they can rely on clear escalation paths and well-defined incident-management procedures.
It’s also important to make sure that everybody involved in the incident knows their role. The technical staff who are resolving the problem shouldn’t be trying to manage the incident or take care of the communication to stakeholders at the same time. If the load on a given team member becomes so excessive that they feel overwhelmed, that person needs to ask the planning lead for more staff. If appropriate, the incident managers can remove roadblocks that prevent technical staff from working most effectively.
A good collaboration tool will improve the quality of incident-related communication. Atlassian Jira proved to be a good choice for us. It has several advantages over e-mail communication. First, all of the parties affected by the incident can subscribe to the communication channel and get the latest information. Second, the tool facilitates the coordination of different activities during the incident, especially when the people involved are spread across different locations. Finally, the root cause analysis and the progress of bug fixing can be tracked over longer time periods.
Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy (eds.): Site Reliability Engineering – How Google Runs Production Systems.
David Hixson and Betsy Beyer: The Systems Engineering Side of Site Reliability Engineering, in ;login:, vol. 40, no. 3, June 2015.
Chris Jones, Todd Underwood and Shylaja Nukala: Hiring Site Reliability Engineers, in ;login:, vol. 40, no. 3, June 2015.