Store 1 Billion Files in Alluxio 2.0

Andrew Audibert Apr 9th, 2019

Alluxio 2.0 is designed to support 1 billion files in a single file system namespace. Namespace scaling is vital to Alluxio for a couple of reasons:

  1. Alluxio provides a single namespace where multiple storage systems can be mounted. So the size of Alluxio's namespace is the sum of the sizes of all mounted storages.
  2. Object storage is increasing in popularity, and object stores often hold many more small files compared with filesystems like HDFS.

In Alluxio 1.x, the namespace was limited to around 200 million files in practice. Scaling further would cause garbage collection issues due to the limit of the Alluxio master JVM heap size. Also, storing 200 million files would require a large memory footprint (around 200GB) of JVM heap.

To scale the Alluxio namespace in 2.0, we added support for storing part of the namespace on disk in RocksDB. Recently-accessed data is stored in memory, while older data ends up on disk. This reduces the memory requirements for serving the Alluxio namespace, and also takes pressure off of the Java garbage collector by reducing the number of objects it needs to deal with.

Setting up off-heap metadata

Off-heap metadata is the default in Alluxio 2.0, so no action is required to start using it. If you want to go back to storing all metadata in-memory, set

alluxio-site.properties

alluxio.master.metastore=HEAP # default is ROCKS for off-heap

The default location for off-heap metadata is ${alluxio.work.dir}/metastore, which is usually under the alluxio home directory. You can configure the location by setting

alluxio-site.properties

alluxio.master.metastore.dir=/path/to/metastore

In-memory metadata storage

While Alluxio has the ability to store all of its metadata on disk, this would result in reduced performance due to the relative slowness of disk compared to memory. To alleviate this, Alluxio keeps a large number of files in memory for fast access. When the number of files grows close to the maximum cache size, Alluxio evicts the less-recently-used files onto disk. Users can trade off between memory usage and speed by setting

alluxio-site.properties

alluxio.master.metastore.inode.cache.max.size=10000000

A cache size of 10 million requires about 10GB of heap.

Sizing

Memory

For the best performance, the metadata of the working set should fit within the master's memory. Estimate the number of files in your working set, and multiply by 1KB per file. For example, for 5 million files, estimate 5GB of master memory. To configure master memory, add a jvm option in conf/alluxio-env.sh

alluxio-env.sh

ALLUXIO_MASTER_JAVA_OPTS+=" -Xmx5G"

If you have the memory available, we recommend giving the master a 31GB heap in production deployments.

alluxio-env.sh

ALLUXIO_MASTER_JAVA_OPTS+=" -Xmx31G"

Update the cache size to match the amount of memory allocated:

alluxio-site.properties

# 31 million
alluxio.master.metastore.inode.cache.max.size=31000000

This will accommodate most workloads, and avoids the need to switch to 64-bit pointers once the heap reaches 32GB.

Disk

Alluxio stores inodes on disk more compactly than it stores them in memory. Instead of 1KB per file, it takes about 500 bytes. To support 1 billion files stored on disk, provision an HDD or SSD with 500GB of space, and make sure the rocksdb metastore is using that disk.

alluxio-site.properties

alluxio.master.metastore.dir=/path/to/metastore

Conclusion

Alluxio 2.0 significantly improves metadata scalability by taking advantage of disk resources for cold metadata. This prepares Alluxio to handle larger namespace volumes as people mount more storages into Alluxio and make greater use of object storage. Stay tuned for a followup blog on the technical details of how we leverage RocksDB for storing our metadata.