Hurricane Sandy no disaster for Netflix
Netflix (Nasdaq: NFLX) apparently learned a lesson from others who have been impacted by cloud outages and was thus able to withstand a super-cloud of a different sort when superstorm Sandy--aka the Frankenstorm--roared up and over the East Coast Monday.
As a result of its preparedness for the storm, the streaming media superpower saw a viewing spike of about 20 percent among East Coast customers "with New York City, Washington, D.C., Boston, Philadelphia and Baltimore showing a lot of activity," a service spokesperson told GigaOM. "Initially we saw a lot of kids' titles being watched, a sign that the kids are staying home from school."
Traffic may have spiked in large part because Netflix was able to keep its service running during the hurricane, even as other cloud-based services experienced problems. That was because of lessons learned from Amazon.com (Nasdaq: AMZN) Web Services outages, according to a blog post by Jeremy Edberg, Netflix reliability architect, and Ariel Tseitlin, Netflix director of cloud solutions.
The bloggers boasted that while Amazon "experienced a service degradation," Netflix, "while not completely unscathed, handled the outage with very little customer impact."
According to a Netflix timeline, the first problems experienced by other websites were noticed around 8:30 a.m. Monday but "found to have no impact on our systems." Some Netflix customers "started to have intermittent problems" around 11 a.m. and, by 11:15 a.m., the problem had become significant enough for the company to open an internal alert and begin looking at what was causing it.
"We should have opened an alert earlier, which would have helped us narrow down the issue faster and let us remediate sooner," the bloggers wrote. As it was, Netflix techs were able to narrow down the problems to a single Availability Zone (AZ). Because of past experience with single-zone outages--and drills developed to simulate such outages--Netflix knew it had to evacuate the zone and was able to do so "in just 20 minutes and completely restore service to all customers," the bloggers wrote.
The whole episode, they said, showed that Netflix had learned its lessons from past outages and had developed "a mindset for designing in resiliency at the start" for high-availability services.
"One of the most important things that we do is we build all our software to operate in three Availability Zones. Right along with that is making each app resilient to a single instance failing," the bloggers said. Combined, those two factors "made zone evacuation easier for us. We stopped sending traffic to the affected zone and everything kept running."
While crowing about the company's quick success in restoring and maintaining service, the bloggers also noted that the storm provided a classroom to "look for opportunities to improve both the way our system is built and the way we detect and react to failure."
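The zone evacuation the bloggers describe, in which traffic stops going to the affected zone while the remaining zones keep serving, can be sketched roughly as follows. This is only an illustration of the idea, not Netflix's actual tooling: the ZoneAwareRouter class, its method names, and the zone and instance names are all hypothetical, and a simple round-robin stands in for whatever load balancing the company really uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ZoneAwareRouter:
    """Hypothetical round-robin router that can skip an entire zone."""

    # Maps a zone name to the instance IDs running in that zone.
    instances: Dict[str, List[str]]
    # Zones that should currently receive no traffic.
    evacuated: Set[str] = field(default_factory=set)
    _counter: int = 0

    def healthy_instances(self) -> List[str]:
        """Return instances in all zones that have not been evacuated."""
        return [
            inst
            for zone in sorted(self.instances)
            if zone not in self.evacuated
            for inst in self.instances[zone]
        ]

    def evacuate(self, zone: str) -> None:
        """Stop sending traffic to every instance in the given zone."""
        if zone not in self.instances:
            raise KeyError(f"unknown zone: {zone}")
        self.evacuated.add(zone)

    def restore(self, zone: str) -> None:
        """Resume sending traffic to a previously evacuated zone."""
        self.evacuated.discard(zone)

    def route(self) -> str:
        """Pick the next instance, round-robin, from the healthy pool."""
        pool = self.healthy_instances()
        if not pool:
            raise RuntimeError("no healthy zones left to serve traffic")
        target = pool[self._counter % len(pool)]
        self._counter += 1
        return target


# Three zones, as in the quoted design; all names are made up.
router = ZoneAwareRouter(
    {"zone-a": ["a1", "a2"], "zone-b": ["b1", "b2"], "zone-c": ["c1", "c2"]}
)
router.evacuate("zone-b")  # the troubled zone stops receiving requests
print([router.route() for _ in range(4)])
```

Because the software already runs in three zones, taking one out of rotation leaves the other two to absorb the traffic, which is the property the bloggers credit for keeping the service running.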