Site Reliability Engineer (SRE) – Defined

DevOps is a buzzword these days, and everyone is trying to hire DevOps people, but very few have heard of or understand the role of an SRE, and those who know of it often cannot differentiate between SRE and DevOps. So I am trying to bring some clarity to it, with a focus on SRE roles and responsibilities (RoR). This blog is a summarized version of multiple sources: Atlassian blogs, Google’s GCP SRE series by Seth Vargo and Liz Fong-Jones, the SREcon14 talk by Ben Treynor (Google), who coined the term SRE, and my own experience in the SRE domain. It relies more on graphical representation than on writing, since that helps understanding: the human mind remembers images more easily than words.

An SRE’s core responsibility lies in reliability, capacity planning, monitoring, events, alerting, postmortems, operations handling, and incident response management. SREs use DevOps skills to achieve that primary goal: they write Infrastructure as Code and release automation to avoid the deployment errors that manual intervention causes.

 

SREs are decision makers in release management for new code. They help schedule new application deployments while keeping multiple factors in mind, so that they can deliver a reliable solution without impacting services and SLAs. Reliability takes priority over releasing a new version, because reliability is tied to the SLA and directly affects the company’s reputation and metrics.

Google pioneered and is behind a growing movement called Site Reliability Engineering (SRE). SRE effectively ends the age-old battles between Development and Operations. It encourages product reliability, accountability, and innovation – minus the hallway drama you’ve come to expect in what can feel like Software Development High School.

Google’s mastermind behind SRE, Ben Treynor, still hasn’t published a single-sentence definition, but describes site reliability as “what happens when a software engineer is tasked with what used to be called operations.”

SREs have some extra responsibilities, such as defining availability, the level of availability, and a plan for how to recover in case of failure, so indirectly they have to maintain SLIs, SLOs, and SLAs.

  • SLIs are the key measurements that characterize a service, such as request latency, batch throughput, and failures per request.
  • A collection of SLIs forms an SLO, and the SLOs in turn drive your SLAs, so these are interlinked parameters that the SRE team has to maintain as part of its core responsibilities.
  • The SRE’s objective should be to maintain the SLA. Every organization has a Service Level Agreement (SLA) with its customers for each service, which defines the availability of the service in terms of 9’s. The SLA is derived from the Service Level Objective (SLO), which in turn is derived from Service Level Indicators (SLIs).
  • Besides SLIs, SLOs, and SLAs, there is overhead and toil that are part of day-to-day work, and you need to keep an eye on both and balance them. Overhead covers things like email communication, meetings, traveling, filing expense reports, and other such related work.
  • Toil (manual, repetitive operational work that could be automated) is a bigger concern than overhead. As an SRE or DevOps professional you need to observe and identify areas of toil and fix them as soon as possible; otherwise they will become roadblocks on your DevOps journey. So identifying and fixing toil should be an SRE’s priority alongside operations work.
  • So you need to differentiate between overhead, toil, and project work. For example, if your deployment process requires someone to manually modify the min/max/desired settings of an AWS Auto Scaling Group (ASG) to spin up new instances with newer code or a newer AMI, you should automate that step by writing deployment automation, and treat it as an automation improvement project that permanently eliminates that toil.
  • Error budgets are the last important factor in the SRE RoR. An error budget is the share of unreliability left over once the service uptime defined in the SLA has been accounted for. Most applications don’t achieve 100% uptime, so for each service the SRE team sets a service-level agreement (SLA) that defines how reliable the system needs to be for end users. If the team agrees on a 99.9% SLA, that gives them an error budget of 0.1%. An error budget is exactly what its name says: the maximum allowable threshold for errors and outages.
  • Errors can come from the network, storage, the CDN, or human mistakes. Anything that affects your application’s availability counts, so you can’t make excuses when negotiating the error budget.
  • There is one exception: when your application depends on another vendor or third-party service, such as an ISP or telecom vendor, outages on their side do not count against your error budget, and you can hold that third party responsible in order to maintain your SLA with your clients.
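  • If you are overspending your error budget, it means your SREs are not taking things seriously, and you have to reinforce the core RoR with the SRE team so that the organization can maintain the agreed SLAs with customers. The SLO and error budget map to uptime as follows: error budget = 100% - SLO, and allowed downtime = error budget × length of the measurement period. For a 99.9% SLO over 30 days, that is 0.1% of 43,200 minutes, or about 43.2 minutes.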
  • SREs are normally organized in the following hierarchy to manage multiple production-level applications. Google has more than 1300 SREs managing its infrastructure, each aligned to an independent service such as GCU, Gmail, or Google Search, but they all follow the same RoR and the important factors mentioned above to maintain SLAs.

Some of the key points of the SRE role that every one of us should keep in mind were defined by Ben Treynor.

SRE teams are actually staffed entirely with rock-star developer/sys-admin hybrids who not only know how to find problems, but fix them, too. They interface easily with the development team, and as code quality improves, are often moved to the development team if fewer SREs are needed on a project.

In fact, one of the core principles mandates that SREs can only spend 50% of their time on Ops work. As much of their time as possible should be spent writing code and building systems to improve performance and operational efficiency.

Basically, the development team handles 5% of all operations workload (handling tickets, providing on-call support, etc.). This allows them to stay closely connected to their product, see how it is performing, and make better coding and release decisions.

In addition, any time the operations load exceeds the capacity of the SRE team, the overflow always gets assigned to the developers. When the system is working well, the developers begin to self-regulate here as well, writing strong code and launching carefully to prevent future issues.

So in the end I will say that SREs are important members of an organization: they maintain the key metrics for your services, help the organization keep its service credits safe, and build its reputation by maintaining high service uptime and reliability, delivering things fast, handling incidents quickly, working on permanent fixes, providing Root Cause Analysis (RCA) through postmortems, and supporting the continuous improvement cycle.
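
As a rough sketch of how the SLI, SLO, and error-budget ideas from the list above fit together, the arithmetic can be written out in a few lines of Python. This is only an illustration under stated assumptions: the `Window` class, the function names, the request counts, and the 30-day period are all hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: Window) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic is treated as fully available
    return 1 - window.failed_requests / window.total_requests


def error_budget(slo: float) -> float:
    """Error budget is whatever the SLO leaves over, e.g. 0.999 -> 0.001."""
    return 1 - slo


def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime the error budget permits over a period of `days` days."""
    return error_budget(slo) * days * 24 * 60


def budget_consumed(window: Window, slo: float) -> float:
    """Fraction of the error budget used; above 1.0 means the SLO is missed."""
    allowed_failures = error_budget(slo) * window.total_requests
    if allowed_failures == 0:
        return float("inf") if window.failed_requests else 0.0
    return window.failed_requests / allowed_failures


if __name__ == "__main__":
    slo = 0.999  # a 99.9% availability target
    window = Window(total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {availability_sli(window):.4%}")
    print(f"Allowed downtime in 30 days: {allowed_downtime_minutes(slo):.1f} min")
    print(f"Error budget consumed: {budget_consumed(window, slo):.0%}")
```

With a 99.9% target, the budget works out to roughly 43.2 minutes of downtime in a 30-day period, and 400 failures out of a million requests would use about 40% of the budget. A real setup would read these counts from a monitoring system rather than hard-coding them.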