Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio Tiered Storage

Thai Bui Feb 20th, 2019

In this article, Thai Bui describes how Bazaarvoice leverages Alluxio to build a tiered storage architecture with AWS S3 that maximizes performance and minimizes the operating cost of running big data analytics on AWS EC2. This blog is an abbreviated version of a full-length technical whitepaper (coming soon) that provides the following takeaways:

  • Common performance and cost challenges in building an efficient big data analytics platform on AWS.
  • How to set up the Hive Metastore to leverage Alluxio as the storage tier for “hot tables”, with all tables on AWS S3 as the source of truth.
  • How to set up tiered storage within Alluxio, based on ZFS and NVMe on EC2 instances, to maximize read performance.
  • Benchmark results from micro and real-world workloads.

Our Big Data Workload and Platform on AWS

Bazaarvoice is a digital marketing company based in Austin, Texas. We are a software-as-a-service provider that allows retailers and brands to curate, manage and understand user-generated content (UGC), such as reviews, for their products. We provide services ranging from review hosting, collection and programmatic management of UGC to deep analytics on consumer behavior.

Our globally distributed services serve and analyze UGC for over 1,900 of the biggest internet retailers and brands. Data engineering at Bazaarvoice comes with the unique challenge of being a medium-size company while handling data at massive internet scale. Take the 2018 Thanksgiving weekend, for example: the combined 3-day holiday traffic generated 1.5 billion product page views and recorded $5.1 billion USD in online sales for our clients. Each month, our services host and record web analytics events for over 900 million internet shoppers worldwide.

To keep up with the web traffic, we host our services on Amazon Web Services. In particular, the big data platform relies on a fully open-source Hadoop ecosystem: Apache Hive and Spark for ETLs; Kafka and Storm for unbounded datasets and stream processing; Cassandra, Elasticsearch and HBase for durable datastores and long-term aggregations. The data generated by these various services is eventually stored in S3 in analytics-optimized formats such as Parquet and ORC.

Why Alluxio

With AWS S3, we are able to scale out our storage capacity effortlessly, but it imposes certain limitations for our needs and prevents us from further scaling “up”. For instance, you cannot access an S3 file faster than what S3 allows at the connection level. You cannot tell S3 to cache data in memory or co-locate it with an application for faster analytics and repeated queries. Worse, if you don’t provision enough hardware to process an S3 dataset in parallel, then scanning 1TB of data is simply too slow. S3 is also not open source, which limits our ability to customize it to fit our specific needs.

A simple solution to this problem would be to add more hardware and spend more money. However, with a limited budget in mind, we set out in late 2017 to find something better to solve the S3 scale-up challenge.

The goal was to accelerate data access on S3 via a tiered storage system, because we realized that not all data access needs to be fast: the access pattern of our workloads is often selective and changing. With that in mind, we set the following goals for this tiered storage:

  • It should be nimble: it should expand and contract as our business grows, without data movement or ETL;
  • It should provide the same level of interoperability with most of our existing systems;
  • It should allow us to simply “scale up” with better hardware when we have the budget for it. Thus, the system needs to be highly configurable and operationally cheap to reconfigure.

We picked Alluxio because it met many of our criteria. Initial micro benchmarks showed a great performance improvement over accessing S3 directly from Spark and Hive. For example, some of our tasks that scanned the same dataset repeatedly reduced their execution time by 10-15x on the second run.

Architecture with Alluxio

[Figure: Architecture with Alluxio]

We integrated Alluxio with the Hive Metastore as the basis for a tiered storage S3 acceleration layer. Clusters can be elastically provisioned with support for our tiered storage layer.

In each job cluster (Hive on Tez & Spark) or interactive cluster (Hive 3 on LLAP), a table can be altered so that its last few weeks’ or months’ worth of data is configured to use Alluxio, while the rest of the data still references S3 directly.

Since data is not cached in Alluxio unless it is accessed via a Hive or Spark task, there is no data movement and a table reconfiguration is extremely cheap. This has allowed us to be very nimble in adapting to changing query patterns.

For example, let us look at our page view dataset. A page view is an event recorded when an internet user visits one of the pages of our clients’ websites. The raw data is collected and converted into Parquet format for every new hour’s worth of data. This data is stored in S3 in a hierarchical year, month, day and hour structure:

s3://some-bucket/
 |- pageview/
   |- 2019/
     |- 02/
       |- 14/
         |- 01/
         |- 02/
         |- 03/
           |- pageview_file1.parquet
           |- pageview_file2.parquet

The data is registered to one of our clusters via the Hive Metastore and is available to Hive or Spark for analytics, ETLs and various other purposes. An internal system takes care of this registration task, updating the Hive Metastore directly for every new dataset it detects.

-- add partition from S3; equivalent to always referencing "cold" data
ALTER TABLE pageview
  ADD PARTITION (year=2019, month=2, day=14, hour=3)
  LOCATION 's3://<bucket>/pageview/2019/02/14/03';
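
For reference, the pageview table behind these partitions could be declared roughly as follows. This is a sketch only: the column list is illustrative, and only the partition keys, storage format and location follow from the examples in this post.

-- hypothetical table definition; not the actual production schema
CREATE EXTERNAL TABLE pageview (
  event_id   STRING,
  client_id  STRING,
  product_id STRING,
  event_ts   TIMESTAMP
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://<bucket>/pageview';

Because the S3 layout is date-based (2019/02/14/03) rather than Hive-style (year=2019/month=2/...), each partition’s location is registered explicitly, as in the ADD PARTITION statement above.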

In addition, our system is aware of the tiered storage and of a set of configurations provided for each specific cluster and table.

For instance, in an interactive cluster where analysts analyze the last few weeks of trending data, the pageview table is configured so that partitions less than a month old are read directly from the Alluxio filesystem. An internal system reads this configuration, mounts the S3 bucket to the cluster via the Alluxio REST API, and then automatically takes care of promoting and demoting tables and partitions using the following Hive DDLs:

-- promoting: repeated queries will cache the data from cold -> warm -> hot
ALTER TABLE pageview
  PARTITION (year=2019, month=2, day=14, hour=3)
  SET LOCATION 'alluxio://<address>/mnt/s3/pageview/2019/02/14/03';

-- demoting: data more than 1 month old goes back to the "cold" tier.
-- this protects our cache when people occasionally read older data
ALTER TABLE pageview
  PARTITION (year=2019, month=1, day=14, hour=3)
  SET LOCATION 's3://<bucket>/pageview/2019/01/14/03';
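
The alluxio:// locations above assume that the S3 bucket has already been mounted into the Alluxio namespace. Our internal system performs this step through the Alluxio REST API; the equivalent one-time step using the Alluxio CLI would look roughly like the following (the mount point and bucket name are illustrative):

# mount the S3 bucket read-only under /mnt/s3 in the Alluxio namespace
$ ./bin/alluxio fs mount --readonly /mnt/s3 s3a://<bucket>/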

This process happens asynchronously and continuously as newer data arrives, older data needs to be demoted, or the tiered storage configuration changes.

Micro and Real-world Benchmark Results

[Figure: Micro and real-world benchmark results]

Our Alluxio + ZFS + NVMe SSD read micro benchmark is performed on an i3.4xlarge AWS instance with up to 10 Gbit networking, 128GB of RAM and two 1.9TB NVMe SSDs. We mount a real-world, production S3 bucket to Alluxio and perform two read tests. The first test downloads about 5GB of Parquet data recursively to a ramdisk using the AWS CLI (to measure only read performance).

# using the AWS CLI to copy recursively to the RAM disk
$ time aws s3 cp --recursive s3://<bucket>/dir /mnt/ramdisk

The second test uses the Alluxio CLI to download the same Parquet data, also to the ramdisk. This time, we perform the test three times to get the cold, warm and hot numbers shown in the chart above.

# using the Alluxio CLI to copy recursively to the RAM disk
$ time ./bin/alluxio fs copyToLocal /mnt/s3/dir /mnt/ramdisk

Alluxio v1.7 with ZFS and NVMe takes about 78% longer than S3 to perform a cold read (66s vs. 37s). The subsequent warm and hot reads are 2.5 and 4.6 times faster, respectively, than reading directly from S3.
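
For context, the worker-side storage behind these numbers combines the two NVMe devices into a ZFS pool that backs an Alluxio storage tier. A minimal sketch of such a setup is shown below; the pool name, mount points, tier sizes and the RAM/SSD split are illustrative assumptions rather than our exact production values.

# create a striped ZFS pool over the two NVMe devices and mount a dataset for Alluxio
$ sudo zpool create -f alluxio-pool /dev/nvme0n1 /dev/nvme1n1
$ sudo zfs create -o mountpoint=/mnt/alluxio-ssd alluxio-pool/cache

# alluxio-site.properties: a RAM tier on top of the ZFS-backed SSD tier
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=64GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/alluxio-ssd
alluxio.worker.tieredstore.level1.dirs.quota=3TB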

However, micro benchmarks don’t always tell the whole story, so let’s take a look at a few real-world queries executed by actual analysts at Bazaarvoice.

We reran the queries on the tiered storage system (the default configuration) and on S3 (by turning off the tiered storage) on our production interactive cluster over a weekend. The production cluster consists of 20 i3.4xlarge nodes running Hadoop 3 and Hive 3.0 on Alluxio 1.7 and ZFS v0.7.12-1.

[Figure: S3 vs. tiered storage query benchmark]

Query 1 is a single-table deduplication query grouping by 6 columns. Query 2 is a more complex 4-table join query with aggregations and a group by on 5 columns. Query 1 processes 95M input records and 8GB of Parquet data from a 200GB dataset. Query 2 processes 1.2B input records and 50GB of Parquet data from datasets several TB in size.

Query 1, simplified for readability, is shown below. Compared to running the queries directly on S3, the tiered storage system sped up query 1 by 11x and query 2 by 6x.

-- query 1, simplified
SELECT dt, col1, col2, col3, col4, col5 
FROM table_1 n
WHERE lower(col6) = lower('<some_value>')
   AND month = 1
   AND year = 2019
   AND (col7 IS NOT NULL OR col8 IS NOT NULL)
GROUP BY dt, col1, col2, col3, col4, col5 
ORDER BY dt DESC
LIMIT 100;
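
Query 2 is not reproduced in this abbreviated post. Based on the description above, its shape is roughly the following; this is an illustrative skeleton only, and the table and column names are placeholders rather than the real query.

-- query 2, shape only: a 4-table join with aggregations, grouped by 5 columns
SELECT a.col1, a.col2, b.col3, c.col4, d.col5,
       count(*)      AS events,
       sum(a.metric) AS total_metric
FROM table_a a
JOIN table_b b ON a.key1 = b.key1
JOIN table_c c ON a.key2 = c.key2
JOIN table_d d ON b.key3 = d.key3
WHERE a.year = 2019 AND a.month = 1
GROUP BY a.col1, a.col2, b.col3, c.col4, d.col5;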

Conclusion

Alluxio, ZFS and the tiered storage architecture have saved a significant amount of time for analysts at Bazaarvoice. AWS S3 is easy to scale out, and by augmenting it with a tiered storage configuration that is nimble and cheap to adapt, we can focus on growing our business and “scale up” the storage as needed.

Today at Bazaarvoice, the current production configuration handles about 35TB of data in the cache and half a petabyte of data on S3. To grow tomorrow, we can simply add more nodes to cache more datasets, upgrade the hardware (i3.4xlarge -> i3.8xlarge -> i3.metal) for quicker localized access, or do a combination of both.