Evolving Web Infrastructure: The Rise of the Site Reliability Engineer

When I first began my career in the early 1990s, the server hosting a website was often just a single tower tucked away under a webmaster’s desk, with a UPS for emergency power. If any issues arose, a simple server reboot usually did the trick. Since it functioned primarily as an online catalog, everyone was content letting the webmaster (that would be me) manage things on their own.

However, as the internet began to gain traction, we quickly realized the untapped potential of this “web thing.” We decided to expand its capabilities by incorporating code and implementing a database backend. This was not just any database, but one capable of both displaying information and accepting user inputs. To ensure reliability, we clustered the database.

As our ambitions grew, so did the complexity of our infrastructure. We introduced a middle tier to encapsulate our business logic, adding another layer to our architecture. With a web server, database cluster, and middle tier in place, our setup became more sophisticated. Not content to stop there, we adopted modern load balancers and deployed multiple web and middle-tier servers, each clustered for redundancy. Security quickly became a top concern, leading us to segment our networks, place the web front end in a demilitarized zone (DMZ), and install robust firewalls.

Despite these efforts, we noticed performance issues. We added dedicated index storage for our SQL data, running a minimum of three servers for redundancy. Yet as our server count rose, so did the complexity of managing them.

Then virtualization came into the picture. We migrated our physical servers to virtual machines hosted on clusters. Server provisioning became faster, but the infrastructure’s intricacy skyrocketed.

We didn’t stop there. Microservices, directory catalog servers, and containers, lighter-weight and more efficient counterparts to our servers, started running in their own clusters. Furthermore, some of these clusters were operating as virtual machines (VMs) on host servers, adding yet another layer of complexity.

To improve security and performance, we integrated web application firewalls, intrusion detection systems, caching servers, and Redis clusters. With each new addition, however, the number of potential points of failure multiplied. Back in the day, the chance of a single server failing seemed minimal. Now, with numerous servers supporting our site, identifying and addressing failures has become a complex challenge. Departments, or “silos,” proliferated to manage the sprawling infrastructure, significantly increasing operational toil.

According to a survey by Google in 2023, 86% of companies reported that a single hour of downtime costs their business over $300,000, highlighting the critical importance of reliability and uptime for modern online services. This mounting complexity and the high cost of downtime have outstripped the capabilities of a lone webmaster, underscoring the necessity for Site Reliability Engineers (SREs). SREs specialize in building and maintaining reliable systems amidst this complexity. They focus on measurement, enhancing reliability, and breaking down silos to manage the intricacies of modern web ecosystems.
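That focus on measurement can be made concrete with a small sketch. One common SRE practice is translating an availability target (an SLO) into an error budget, the amount of downtime or failed requests the service is allowed before reliability work takes priority. The 99.9% target and 30-day window below are illustrative assumptions, not figures from this chapter:

```python
# A minimal sketch of the SRE measurement mindset: turning an
# availability SLO into a concrete error budget.
# The 99.9% target and 30-day window are illustrative assumptions.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def availability(successful: int, total: int) -> float:
    """Measured availability as the fraction of successful requests."""
    return successful / total if total else 1.0

if __name__ == "__main__":
    print(f"99.9% over 30 days allows {error_budget_minutes(0.999):.1f} min of downtime")
    print(f"Measured availability: {availability(999_543, 1_000_000):.4%}")
```

A 99.9% target over 30 days works out to roughly 43 minutes of allowed downtime, which is exactly the kind of budget a single webmaster never had to reason about, and an SRE team tracks continuously.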

In essence, as systems have grown in complexity, driven by user and business demands for 24/7 uptime, global access, and stringent security, the need for dedicated engineers like SREs to ensure their reliability and stability has become paramount. The 2021 DevOps Research and Assessment (DORA) State of DevOps report indicates that high-performing organizations implementing SRE practices are more likely to meet their reliability targets, reflecting the growing importance of SRE in the industry.

Now, the question isn’t just whether you’re aware of SRE; it’s whether you’re ready to embrace the practices and principles that define the future of reliable, scalable, and efficient web services.