TalkingData Case Study: Leading Data Broker in China Leverages Alluxio to Unify Terabytes of Data Across Disparate Data Sources

Zhitao Yan (TalkingData) Jun 25th, 2018

This is a guest blog from our friends at TalkingData.


TalkingData is China’s largest data broker, reaching more than 600 million smart devices on a monthly basis. TalkingData processes over 20 terabytes of data and more than one billion session requests every day. TalkingData products are powered by its massive proprietary data set and provide services to over 120,000 mobile applications and 100,000 application developers. TalkingData serves a wide range of clients in both Internet and traditional industries, including leading enterprises in the financial services, real estate, retail, travel, and government sectors.

Data is at the heart of what we do at TalkingData, and the ability to access, store, and manage data from a wide variety of data sources easily and efficiently is critical to our business. We gather data from many disparate data sources and use different types of application frameworks to process it. Data can be stored at the customer (on-premise or in the cloud) as well as in our own various storage systems. Before a department can perform analytics or process data, it must first make a request to central IT or the engineering department, which determines where the data is located and then fetches it from the various data sources into a centralized storage system. Together, these factors create a complex environment that our data scientists and analysts need to manage on a daily basis.

Taking these factors into account, it became very clear to us that we needed a solution with the following characteristics:

  • Single platform to manage data across disparate data sources: Data is collected from hundreds of millions of devices as well as directly from customers. Therefore, the data typically spans a wide variety of storage systems including HDFS, AWS S3, and Ceph. We need a system that removes the complexity associated with managing the data across these disparate data sources.
  • High performance: The speed with which we can transform data into value directly impacts the top line. We are always on the lookout for technologies that can empower our infrastructure and provide a competitive advantage. Avoiding ETL would remove a time-consuming step.
  • Integrated with best-of-breed big data technologies: We serve customers across many different industries, so we work with many types of data, from mobile data to financial data. We want the flexibility to use best-of-breed technology in order to match the right application to the required task.
  • Flexible deployment on-premise, in the cloud, or in a hybrid setup: We fetch data from a variety of sources, and data needs to be accessed seamlessly regardless of physical location. It is important that the system is able to work with these various deployment models.
  • Scalable: We are working with data at terabyte scale, and with the rise of IoT, we expect the volume of data to continue to grow exponentially. Therefore, it is important that the system is able to scale cost-effectively.

How TalkingData leverages Alluxio to unify data

We leverage Alluxio as a single platform to manage all the data across disparate data sources on-premise and in the cloud. Alluxio removes the complexity of our environment by abstracting the different data sources and providing a unified interface. Applications simply interact with Alluxio, and Alluxio manages data access to different storage systems on behalf of the applications. Alluxio effectively democratizes data access, allowing data scientists and analysts in various business units to accomplish their goals without needing to consider where the data is located or having to go to central IT or the engineering team to transfer or prepare the data.


Figure 1: TalkingData architecture with Alluxio unifying multiple storage systems
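
The unified interface described above can be illustrated with a minimal sketch. This is not Alluxio's API or implementation; it is a toy mount table, written under the assumption that a unified namespace maps logical path prefixes to backing-store locations. The class name `UnifiedNamespace`, its methods, and all host names, bucket names, and URI schemes below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UnifiedNamespace:
    """Toy mount table (hypothetical): logical path prefix -> backing-store URI."""

    mounts: Dict[str, str] = field(default_factory=dict)

    def mount(self, logical_prefix: str, backend_uri: str) -> None:
        # Normalize trailing slashes so "/hdfs/" and "/hdfs" name the same mount.
        self.mounts[logical_prefix.rstrip("/") or "/"] = backend_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Collect every mount point that contains the path ...
        candidates = [
            p for p in self.mounts
            if p == "/" or logical_path == p or logical_path.startswith(p + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {logical_path}")
        # ... and use the most specific (longest) one.
        prefix = max(candidates, key=len)
        remainder = logical_path if prefix == "/" else logical_path[len(prefix):]
        return self.mounts[prefix] + remainder


if __name__ == "__main__":
    ns = UnifiedNamespace()
    # All backend locations below are made-up examples.
    ns.mount("/hdfs", "hdfs://namenode.example:8020/warehouse")
    ns.mount("/s3", "s3://example-bucket/raw")
    ns.mount("/ceph", "ceph://example-cluster/archive")

    # The application only ever sees logical paths.
    for path in ("/hdfs/events/part-0", "/s3/2018/06/25/sessions.json"):
        print(path, "->", ns.resolve(path))
```

The point of the sketch is the division of labor: an application asks for `/s3/...` or `/hdfs/...` and never handles the storage-specific location itself, which is the role the article attributes to Alluxio.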

Alluxio meets all our needs as it is a single platform that manages data across disparate data sources at memory speed while providing a flexible deployment model. Additionally, Alluxio decouples storage from compute, which enables us to scale each resource independently. With Alluxio, we can work with application frameworks and storage systems of our choosing without any of the complexity. For us, that is a game changer.

Our goal is to build a smart data marketplace where companies in any industry can purchase the exact type of data they need and get value from it directly, without further processing, because the data has already been processed and analyzed. We view Alluxio as a key enabling technology for achieving this goal. We have deployed Alluxio on hundreds of nodes in production, and we plan to expand to more use cases and products and to deploy at a larger scale.