Innovation11 February 2016
Our journey to the cloud at Netflix began in August of 2008, when we experienced a major database corruption and for three days could not ship DVDs to our members. That is when we realized that we had to move away from vertically scaled single points of failure, like relational databases in our datacenter, towards highly reliable, horizontally scalable, distributed systems in the cloud. We chose Amazon Web Services (AWS) as our cloud provider because it provided us with the greatest scale and the broadest set of services and features. The majority of our systems, including all customer-facing services, had been migrated to the cloud prior to 2015. Since then, we've been taking the time necessary to figure out a secure and durable cloud path for our billing infrastructure as well as all aspects of our customer and employee data management. We are happy to report that in early January, 2016, after seven years of diligent effort, we have finally completed our cloud migration and shut down the last remaining data center bits used by our streaming service!
Moving to the cloud has brought Netflix a number of benefits. We have eight times as many streaming members than we did in 2008, and they are much more engaged, with overall viewing growing by three orders of magnitude in eight years:
The Netflix product itself has continued to evolve rapidly, incorporating many new resource-hungry features and relying on ever-growing volumes of data. Supporting such rapid growth would have been extremely difficult out of our own data centers; we simply could not have racked the servers fast enough. Elasticity of the cloud allows us to add thousands of virtual servers and petabytes of storage within minutes, making such an expansion possible. On January 6, 2016, Netflix expanded its service to over 130 new countries, becoming a truly global Internet TV network. Leveraging multiple AWS cloud regions, spread all over the world, enables us to dynamically shift around and expand our global infrastructure capacity, creating a better and more enjoyable streaming experience for Netflix members wherever they are.
We rely on the cloud for all of our scalable computing and storage needs — our business logic, distributed databases and big data processing/analytics, recommendations, transcoding, and hundreds of other functions that make up the Netflix application. Video is delivered through Netflix Open Connect, our content delivery network that is distributed globally to efficiently deliver our bits to members’ devices.
The cloud also allowed us to significantly increase our service availability. There were a number of outages in our data centers, and while we have hit some inevitable rough patches in the cloud, especially in the earlier days of cloud migration, we saw a steady increase in our overall availability, nearing our desired goal of four nines of service uptime. Failures are unavoidable in any large scale distributed system, including a cloud-based one. However, the cloud allows one to build highly reliable services out of fundamentally unreliable but redundant components. By incorporating the principles of redundancy and graceful degradation in our architecture, and being disciplined about regular production drills using Simian Army, it is possible to survive failures in the cloud infrastructure and within our own systems without impacting the member experience.
Cost reduction was not the main reason we decided to move to the cloud. However, our cloud costs per streaming start ended up being a fraction of those in the data center -- a welcome side benefit. This is possible due to the elasticity of the cloud, enabling us to continuously optimize instance type mix and to grow and shrink our footprint near-instantaneously without the need to maintain large capacity buffers. We can also benefit from the economies of scale that are only possible in a large cloud ecosystem.
Given the obvious benefits of the cloud, why did it take us a full seven years to complete the migration? The truth is, moving to the cloud was a lot of hard work, and we had to make a number of difficult choices along the way. Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company. Architecturally, we migrated from a monolithic app to hundreds of micro-services, and denormalized and our data model, using NoSQL databases. Budget approvals, centralized release coordination and multi-week hardware provisioning cycles made way to continuous delivery, engineering teams making independent decisions using self service tools in a loosely coupled DevOps environment, helping accelerate innovation. Many new systems had to be built, and new skills learned. It took time and effort to transform Netflix into a cloud-native company, but it put us in a much better position to continue to grow and become a global TV network.
Netflix streaming technology has come a long way over the past few years, and it feels great to finally not be constrained by the limitations we've previously faced. As the cloud is still quite new to many of us in the industry, there are many questions to answer and problems to solve. Through initiatives such as Netflix Open Source, we hope to continue collaborating with great technology minds out there and together address all of these challenges.
-Yury Izrailevsky, Stevan Vlaovic and Ruslan Meshenberg