What’s new in Alluxio 1.1 Release

Gene Pang Jun 21st, 2016

The Alluxio 1.1 release includes many great features and improvements from the community. Alluxio would not be what it is today without the growing open source community, and we would like to thank everyone involved in this project. With the Alluxio 1.1 release, the community has continued to grow at a rapid pace, reaching over 250 contributors to Alluxio - nearly 3x growth over the last year!

1.1_release-contributors

The Alluxio 1.1 release brings many new features and improvements, and in this post, we will highlight a few of the developments: performance improvements, access control features, and usability and integration improvements.

Performance Improvements

Users can leverage Alluxio to achieve tremendous performance gains. In Alluxio 1.1, performance benefits are further augmented.

Alluxio Master Metadata Scalability

One of the major improvements in Alluxio 1.1 is optimized metadata scalability in the Alluxio Master. Efficient and scalable management of the master metadata improves the overall performance of the entire Alluxio system, and enables a greater number of concurrent users and applications to interact with Alluxio. In Alluxio 1.1, the file system metadata uses an efficient locking strategy with fine-grained, read-write locking. This enables more users to access the metadata concurrently, resulting in greater scalability. Alluxio 1.1 also improves how the master writes the journal, by avoiding journal I/O while locks are held. Finally, Alluxio 1.1 enables multiple journal entries to be batched and flushed at the same time, which can greatly improve the throughput of metadata operations.
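To make the journaling idea concrete, here is a minimal Python sketch of batched flushing. It is not Alluxio's implementation (the master is written in Java), and the `BatchedJournal` class, its `sink` callback, and the entry strings are all hypothetical names chosen for illustration. The sketch only shows the two ideas described above: entries are buffered so that several of them share one flush, and the slow write happens after the lock has been released.

```python
import threading


class BatchedJournal:
    """Toy journal (hypothetical, not Alluxio code) that flushes entries in batches."""

    def __init__(self, sink):
        self._sink = sink          # callable taking a list of entries; stands in for one journal I/O
        self._pending = []         # entries appended but not yet flushed
        self._lock = threading.Lock()
        self.flush_count = 0       # number of I/O calls issued (assumes a single flusher thread)

    def append(self, entry):
        # Cheap in-memory step, so the lock is held only briefly.
        with self._lock:
            self._pending.append(entry)

    def flush(self):
        # Swap the buffer out while holding the lock...
        with self._lock:
            batch, self._pending = self._pending, []
        # ...then perform the slow write without holding it.
        if batch:
            self._sink(batch)
            self.flush_count += 1
        return len(batch)


written = []
journal = BatchedJournal(written.extend)
for i in range(5):
    journal.append(f"create /file-{i}")
flushed = journal.flush()
```

In this toy run, five metadata updates are written with a single call to the sink, and threads calling `append` are never blocked behind the write itself.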

The master metadata changes in Alluxio 1.1 improve the scalability of metadata operations. Here are a few experiments that demonstrate the scalability improvements.

Listing Directories

In this experiment, many threads are concurrently listing various directories in Alluxio. Below is the global throughput of operations during the experiment, for versions 1.0.1 and 1.1.

1.1_release-list_directories

This chart shows that the Alluxio master in Alluxio 1.1 can support approximately 7 times greater throughput than Alluxio 1.0.1! This is due to the increased concurrency enabled by fine-grained locking.

Creating Empty Files

Alluxio 1.1 also improves the performance for metadata updates, which boosts the performance for applications creating files in Alluxio. In this scenario, many threads are creating empty files (files containing no data) in Alluxio. Since the files contain no data, this experiment stresses updating the metadata in the Alluxio Master. Below is the global throughput of empty files created during the experiment, for versions 1.0.1 and 1.1.

1.1_release-create_files_local

The results show that with the improvements in Alluxio 1.1, the master is able to support approximately 1.8 times greater throughput than Alluxio 1.0.1.

Next, the same experiment was performed but with the master configured to write to a remote journal (remote HDFS). The following chart shows the throughput results with Alluxio writing to a remote journal.

1.1_release-create_files_remote

The Alluxio Master in version 1.1 shows significantly greater throughput than 1.0.1: over 23 times greater! This is due to the journaling changes made in version 1.1.

Alluxio Worker Scalability

In addition to the Alluxio Master developments, changes were made to the Alluxio Worker in order to improve scalability. The Alluxio Worker stores and manages all the data, so it is involved in all reads and writes of data. Therefore, increasing the performance of the Alluxio Worker will improve the performance of all applications reading and writing files. In earlier versions of Alluxio, the performance of the worker would degrade as the worker managed more and more data. Our investigation of this behavior found that the main culprit was how the worker managed the block metadata during operations. As the number of blocks grew on the worker, managing the worker block metadata became more expensive. Therefore, changes were made in the worker to avoid accessing the block metadata when it is not needed.

To evaluate the improvements, we experimented with the Alluxio worker by continually writing data into the worker. As files and blocks were being written to the worker, the response times of the writes were measured. Below is a chart showing the response time of writes, as the number of blocks grew on the worker.

1.1_release-worker_scalability

The chart shows that in 1.0.1, the response times of writes grew as the number of blocks on the worker grew. This means writes became slower as the worker managed more data. With the changes to the worker in version 1.1, the response times stay low and constant. This shows that the worker can scale effectively as it manages more and more data.

Better Support for Random I/O (e.g., Parquet files)

Alluxio brings significant performance benefits when the data can be stored in memory, and applications can get direct access to that in-memory data. Therefore, being able to read data from Alluxio memory can greatly improve the performance of applications. However, in earlier versions of Alluxio, a block would only be stored in Alluxio memory if the block was fully read by the application. This means that if an application read a block only partially, or performed seeks within the block (random I/O), the block would not be stored in Alluxio.

Alluxio 1.1 introduces a feature that enables storing the entire block in memory even if the block was only partially read. This is particularly relevant to Parquet files, since reading Parquet files typically involves random I/O. With Alluxio 1.1, files accessed with random I/O, such as Parquet files, can be automatically stored in Alluxio, which could greatly improve performance for applications that use random I/O.
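The behavior can be pictured with a small Python sketch. This is only an illustration of the caching policy, not Alluxio's actual code path: the `CachingBlockReader` class is hypothetical, a dictionary stands in for the under storage, and another dictionary stands in for Alluxio memory.

```python
class CachingBlockReader:
    """Toy reader (hypothetical) that caches a whole block on any partial read."""

    def __init__(self, under_store):
        self._under_store = under_store   # dict: block id -> bytes, stands in for under storage
        self._cache = {}                  # dict: block id -> bytes, stands in for Alluxio memory
        self.under_store_reads = 0        # how many times the slow store was consulted

    def read(self, block_id, offset, length):
        block = self._cache.get(block_id)
        if block is None:
            # Fetch and keep the full block, even though the caller wants only a slice.
            block = self._under_store[block_id]
            self.under_store_reads += 1
            self._cache[block_id] = block
        return block[offset:offset + length]


reader = CachingBlockReader({"block-1": b"0123456789"})
first = reader.read("block-1", 4, 3)    # partial read in the middle of the block
second = reader.read("block-1", 0, 2)   # a seek backwards, served from the cache
```

After the first partial read, every later read of the same block, at any offset, is served from memory without going back to the slower store.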

Access Control Features

Alluxio 1.1 introduces a set of access control features, which enable users to secure access to files and directories in Alluxio. Thanks to contributors from Intel, Alluxio has initial support for authentication and authorization. Alluxio now has the concept of users and groups. In addition to the user and group concepts, Alluxio 1.1 includes a file system permission model. This permission model is similar to the common POSIX permission model, so it should be familiar to most.

With the access control feature of Alluxio 1.1, each file or directory is associated with an owner and a group. In addition to the owner and group association, each file or directory is associated with permissions, which control read, write, and execute actions. For a particular file or directory, the permissions can be set for the owner, the group, and everyone else.
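For readers less familiar with the POSIX model, the following Python sketch shows how such a check typically works. It is a generic POSIX-style illustration rather than Alluxio's own permission checker: the `Inode` class, the `is_allowed` function, and the user and group names are hypothetical, and details such as superuser handling or checks on parent directories are left out.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1   # the usual POSIX bit values


@dataclass
class Inode:
    owner: str
    group: str
    mode: int   # octal permission bits, e.g. 0o640


def is_allowed(inode, user, groups, want):
    """Return True if `user` (a member of `groups`) may perform the `want` action(s)."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 0o7   # owner bits
    elif inode.group in groups:
        bits = (inode.mode >> 3) & 0o7   # group bits
    else:
        bits = inode.mode & 0o7          # bits for everyone else
    return (bits & want) == want


report = Inode(owner="alice", group="analysts", mode=0o640)
owner_can_write = is_allowed(report, "alice", {"staff"}, WRITE)
group_can_read = is_allowed(report, "bob", {"analysts"}, READ)
group_can_write = is_allowed(report, "bob", {"analysts"}, WRITE)
other_can_read = is_allowed(report, "carol", set(), READ)
```

With mode `640`, the owner can read and write, members of the group can only read, and everyone else has no access.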

Alluxio 1.1 also includes command-line tools for managing the permissions of files and directories. The familiar commands chown, chgrp, and chmod have been added to the Alluxio shell.

Usability and Integration Improvements

Alluxio 1.1 provides easier usability and deployment.

Automatic Metadata Loading

In earlier versions of Alluxio, metadata from under file systems mounted to Alluxio had to be manually loaded into Alluxio. There was an explicit command to tell Alluxio to gather the metadata from the under file system, and store it in Alluxio. However, this could be cumbersome and confusing, since without invoking the loadMetadata command, files could not be found.

In Alluxio 1.1, the metadata loading is now seamless and done automatically. The metadata from the under file system will be loaded automatically on the first access of the file or directory. With this automatic metadata loading, it is no longer necessary to manually invoke a command to load metadata. This feature helps simplify how users use Alluxio.
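The load-on-first-access pattern looks roughly like the following Python sketch. It is a simplified illustration, not Alluxio's implementation: the `LazyNamespace` class, its `get_status` method, and the example path are hypothetical, and a dictionary stands in for the under file system.

```python
class LazyNamespace:
    """Toy namespace (hypothetical) that pulls metadata from the under file system on demand."""

    def __init__(self, under_fs):
        self._under_fs = under_fs    # dict: path -> metadata dict, stands in for the under file system
        self._metadata = {}          # metadata already loaded
        self.under_fs_lookups = 0    # how many times the under file system was consulted

    def get_status(self, path):
        if path not in self._metadata:
            # First access: consult the under file system instead of failing.
            self.under_fs_lookups += 1
            if path not in self._under_fs:
                raise FileNotFoundError(path)
            self._metadata[path] = dict(self._under_fs[path])
        return self._metadata[path]


namespace = LazyNamespace({"/data/a.csv": {"length": 1024}})
first = namespace.get_status("/data/a.csv")    # triggers the load
second = namespace.get_status("/data/a.csv")   # answered from the loaded metadata
```

The caller never issues a separate load step; the first lookup pays the cost of consulting the under file system, and later lookups do not.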

Sudo-less Deployment

In previous versions of Alluxio, sudo access was required in order to try out Alluxio. With Alluxio 1.1, if you are deploying Alluxio on Linux, sudo is no longer required. If sudo is not available but tmpfs is available, tmpfs will be used instead of mounting ramfs. However, using tmpfs does not guarantee in-memory access to data, so there are no performance guarantees. Using tmpfs can be a way to try out Alluxio in an environment without sudo access.

Simplified Configuration

Beginning with Alluxio 1.1, configuring the Alluxio system is simpler and easier to reason about. In earlier versions, configuration settings could be derived from many different places, and it was not always intuitive to set the right configuration in conf/alluxio-env.sh. Alluxio 1.1 eases this process by providing a more straightforward way to control Alluxio server and application configurations.

Environment variables can be used to set Alluxio server settings on each node, such as the IP address, port, and ramdisk, through the conf/alluxio-env.sh script on each node. The conf/alluxio-env.sh file is now very simple: it no longer contains any logic, only a minimal set of environment variables. We expect setting configuration through environment variables to meet most basic needs, and users may not even need to modify that file.

For all other configuration properties, users can set and modify properties files on each node. Site property files (e.g., alluxio-site.properties) control most parameters and are loaded automatically if they are in certain locations (e.g., ~/.alluxio/ or the classpath) on that node. This gives advanced users a general approach to customizing their servers and applications.

In addition, users can use application-specific configuration to configure Alluxio client behavior. For example, users can simply submit a Spark job with certain Alluxio properties using spark.executor.extraJavaOptions when running spark-submit. This configuration will be distributed to different Spark executors and take effect for the particular Spark job.

Google Cloud Platform

Thanks to contributors from Google, Alluxio 1.1 has integration with Google Cloud Platform. With Alluxio 1.1, an Alluxio cluster can be deployed on Google Compute Engine (GCE). This enables additional options for deploying Alluxio on different public clouds. In addition to being able to deploy Alluxio on GCE, Alluxio 1.1 also supports mounting data stored in Google Cloud Storage (GCS). This augments the ecosystem that integrates with Alluxio, and helps applications seamlessly access data in GCS.

This post covered many of the great developments in Alluxio from the community. You can download the latest version of Alluxio, try it out, and visit the documentation for Alluxio 1.1. We would again like to thank the community for making Alluxio what it is today. The great improvements to Alluxio in this version are a direct result of the vibrant and growing open source community! We look forward to future innovations in Alluxio!
