Netflix problem? iPhone, Windows 10, Android users get help from super-fast Apache Druid

Whether you’re happy or annoyed with Netflix, a key driver behind the experience is the streaming giant’s use of open-source Apache Druid, a Java-based database that promises a new take on the data warehouse. 

Druid is widely-used among the world’s largest tech companies for big-data analytics, and Netflix uses it deliver updates that it hopes help rather than hinder its 170 million paid users who watch movies on PCs, Android phones, and iPads or iPhones. 

Netflix, which hosts most of its computing infrastructure in Amazon Web Services (AWS), is a major user of open-source technologies, including the popular programming language Python for its vast content distribution network. 

Last year, it also open-sourced its Metaflow library for Python, which it uses for building and deploying data-science workflows. More recently, it open-sourced AVIF, a potential replacement for the JPEG file-format that could be better for handling images for endpoints spanning smartphones, tablets, laptops, and smart TVs.       

Now it has revealed how it’s using Apache Druid to measure real-time logs from all these various devices to understand and quantify how well they handle browsing and playback. 

Netflix has been using Druid for a few years, revealing in 2016 that a Druid cluster indexes logs in real time to detect spikes in errors on different devices so that application owners can investigate and respond to issues as they arise.

While the analytics function depends on data from Netflix users’ TV, iOS, and Android devices, the company stresses that user details are anonymized.   

“This enables us to classify devices and view the data according to various aspects,” explains Netflix senior software engineer, Ben Sykes.  

“This in turn allows us to isolate issues that may only affect a certain group, such as a version of the app, certain types of devices, or particular countries.”

Like many software companies, Netflix employs A/B testing to assess how updates and changes impact various user groups. It uses the results to compare how the new version performs against the older version to tell whether users on different systems should get the update or not. 

According to Sykes, Netflix gets two million events per second, which creates challenges for a Druid database that generates over 115 billion rows each day.    

Sykes explains that Druid doesn’t follow the relational database structure of columns, which means there are no “joins” but rather groups that can be filtered by a datasource, such as time, dimensions, and metrics. 

The ultimate benefit is speed, which is essential for a service that needs to react to a massive number of users in near real time.  

“Everything in Druid is keyed by time. Each datasource has a timestamp column that is the primary partition mechanism. Dimensions are values that can be used to filter, query or group by. Metrics are values that can be aggregated, and are nearly always numeric,” explains Sykes

“By removing the ability to perform joins, and assuming data is keyed by timestamp, Druid can make some optimizations in how it stores, distributes, and queries data such that we’re able to scale the datasource to trillions of rows and still achieve query response times in the 10s of milliseconds.”

The design allows for faster queries, which are handled in parallel before being aggregated and sent back to the client as a result. 

Netflix’s experience lines up with Apache’s claims that Druid can offer lower latency for OLAP-style queries, streaming and teaching, as well as faster search and filtering compared with traditional data warehouses. 

To get a sense of how large Netflix’s data-warehousing requirements are, Sykes highlights that Netflix is currently querying a whopping 1.5 trillion rows in its Druid database to understand what teens, parents, and toddlers see when they tap the Netflix icon or log in from a browser. 

View original article here Source