In a distributed application architecture, compute and storage are typically decoupled. Doing so allows the two to be scaled independently, keeping costs proportional to utilization. This approach, however, has a detrimental effect on application performance because of the increased distance between your data and the applications that consume it.
At the same time, when data is stored across multiple storage systems, accessing that data becomes challenging. Either you integrate each application with each storage system, which doesn’t scale, or the data must first be collected and transferred to a common location, which increases both overhead and lead time.
Alluxio addresses these problems by creating a storage layer between your applications and storage systems. Alluxio bridges the gap between applications and different data sources by providing a unified way to access your data at memory speed, improving application performance by orders of magnitude.
What is Alluxio?
Alluxio exposes a unified filesystem distributed across the local storage media of one or more machines; ideally this would be the local RAM of the nodes in your compute cluster. Alluxio then pulls in the data from your various existing storage systems on demand.
Once Alluxio is in place, your data is centralized and applications have a single common interface and namespace for data access. And with your data in memory, data access is accelerated. Alluxio is thus billed as a memory-centric virtual distributed storage system. Let’s dig further into each of these characteristics:
Memory-centric - When dealing with remote storage systems, read/write throughput is generally limited by network and disk speed to somewhere between 100 Mbps and 10 Gbps. With RAM, however, throughput can reach 10 to 100 Gbps. Ideally, Alluxio is co-located with applications, which in many cases allows reads and writes to be performed directly on local memory. We say ‘centric’ because you can also use SSDs and HDDs at the Alluxio layer with the tiered storage feature.
Virtual - Alluxio provides a single file system namespace (alluxio://) through which you can access a variety of underlying storage systems. An application only needs to communicate with Alluxio and then Alluxio communicates with the other filesystems that have been mounted to it. It also transparently persists data to under storage based on user configuration.
Distributed - Alluxio is designed to run on commodity hardware, which is easily scaled. In both standalone deployments and multi-node clusters, data is distributed across all the local storage that has been allocated to Alluxio. Alluxio provides the framework for nodes to perform reads and writes, while the compute framework ensures good data locality.
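To make the "Virtual" characteristic concrete, here is a minimal Python sketch of how a unified namespace could map paths to mounted under storage systems by longest-prefix matching. It is an illustrative toy model, not Alluxio's implementation: the MountTable class, its methods, and the example host and bucket names are all hypothetical.

```python
from typing import Dict


class MountTable:
    """Toy unified namespace: maps namespace prefixes to under storage URIs.

    Hypothetical illustration only; not Alluxio's real data structure.
    """

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, ufs_uri: str) -> None:
        # Normalize so "/s3/" and "/s3" are the same mount point; keep "/" as root.
        self._mounts[mount_point.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a namespace path into the under storage URI that backs it."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                # The longest matching mount point wins.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        relative = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base


# Hypothetical mounts: an HDFS cluster at the root and an S3 bucket under /s3.
table = MountTable()
table.mount("/", "hdfs://namenode:9000/data")
table.mount("/s3", "s3://example-bucket/logs")

print(table.resolve("/reports/q1.csv"))
print(table.resolve("/s3/2024/app.log"))
```

The application only ever sees paths in the single namespace; which storage system serves a given path is decided by the mount table, so swapping or adding a backend does not change application code.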
Where Alluxio Fits
Alluxio holds a unique place in the big data ecosystem. It resides between storage systems (such as Amazon S3, Google Cloud Storage, EMC ECS, Apache HDFS, or OpenStack Swift) and computation frameworks and applications (such as Apache Spark or Hadoop MapReduce).
It manages data access and fast storage, facilitating data sharing and locality between jobs, regardless of whether they are running with the same computation engine. The result is significant performance improvement for big data applications while providing a common interface for data access.
Alluxio also bridges the gap between big data applications and various storage systems. Since Alluxio abstracts the integration of storage systems from applications, any under storage can back all the applications and frameworks running on top of Alluxio. Coupled with the ability to mount multiple storage systems, Alluxio serves as a unifying layer for any number of varied data sources.
How Alluxio Works
Alluxio can be deployed on a single host or a multi-node cluster. The ideal deployment for Alluxio is one in which workers, a concept we’ll cover further down, are co-located with applications. This enables applications to have direct access to in-memory data.
To explain how Alluxio works, let’s walk through a read file operation. With Alluxio in place, all an application needs to do is invoke read(alluxio://<path>). From the application’s perspective, that’s it. The application only needs the data itself, not knowledge of where it originated. But under the hood there’s a little more going on.
If the file is in Alluxio and available locally, it can be read at memory speed. If it’s in Alluxio but on a different node, the client makes a remote call to the Alluxio worker on that node and the data will have to be read at the speed of the local network. If the file is not in Alluxio at all, Alluxio transparently determines where it’s stored and fetches it; once loaded, it will remain available for reuse. Incidentally, any output from the application may be written synchronously to Alluxio, under storage, or both depending on user configuration.
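The three cases in this read path can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Alluxio code: the Worker and ToyCluster names are hypothetical, whole files stand in for blocks, and caching a fetched file on the reader's local worker is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Worker:
    """Hypothetical worker holding cached file contents keyed by path."""
    name: str
    cache: Dict[str, bytes] = field(default_factory=dict)


class ToyCluster:
    """Toy read path: local worker, then remote workers, then under storage."""

    def __init__(self, workers: List[Worker], under_storage: Dict[str, bytes]) -> None:
        self.workers = {w.name: w for w in workers}
        self.under_storage = under_storage  # stands in for S3, HDFS, and so on

    def read(self, path: str, local_worker: str) -> Tuple[bytes, str]:
        """Return the data plus a label saying where it was served from."""
        local = self.workers[local_worker]
        if path in local.cache:
            return local.cache[path], "local"  # memory-speed read
        for worker in self.workers.values():
            if worker.name != local_worker and path in worker.cache:
                return worker.cache[path], "remote"  # network-speed read
        data = self.under_storage[path]  # KeyError if the file exists nowhere
        local.cache[path] = data  # assumption: keep it locally for reuse
        return data, "under_storage"


cluster = ToyCluster(
    workers=[Worker("w1"), Worker("w2", {"/b": b"bravo"})],
    under_storage={"/a": b"alpha", "/b": b"bravo"},
)
first = cluster.read("/a", local_worker="w1")    # miss: fetched from under storage
second = cluster.read("/a", local_worker="w1")   # now cached on w1
third = cluster.read("/a", local_worker="w2")    # served by w1 over the network
```

The caller uses the same read call in every case; only the source label, and in a real system the latency, differs.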
To accomplish the functionality described above, several components are involved, namely master, worker, and client.
Master - The Alluxio master is the process responsible for managing the global metadata of the system, for example, the file system tree. Clients interact with the master to read or modify this metadata. In addition, all workers periodically heartbeat to the master to maintain their participation in the cluster. The master does not initiate communication with other components; it only interacts with other components by responding to requests.
Alluxio may be deployed in one of two master modes: single-master mode or fault-tolerant mode.
Workers - Alluxio workers are responsible for managing local resources allocated to Alluxio. These resources could be local memory, SSD, and/or hard disk. Alluxio workers store data as blocks and serve read/write requests from clients. However, the worker is only responsible for the data in these blocks; the actual mapping from file to blocks is stored in the master.
Clients - The Alluxio client provides users a gateway to interact with the Alluxio servers via a filesystem API. It initiates communication with the master to carry out metadata operations, and with workers to read and write data in Alluxio. Data that exists in the under storage but is not available in Alluxio is accessed directly through an under storage client.
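The division of labor between the three components can be illustrated with a small Python model. Everything here is a hypothetical sketch (the ToyMaster, ToyWorker, and ToyClient classes, the round-robin placement, the tiny block size, and the heartbeat timeout are assumptions for illustration), but it mirrors the description above: the master holds only metadata and never calls out to other components, workers hold only block data, and the client talks to both.

```python
from typing import Dict, List, Tuple


class ToyMaster:
    """Metadata only: file -> block IDs, block -> worker, worker liveness."""

    def __init__(self, heartbeat_timeout: float = 10.0) -> None:
        self.file_blocks: Dict[str, List[int]] = {}
        self.block_location: Dict[int, str] = {}
        self.last_heartbeat: Dict[str, float] = {}
        self.heartbeat_timeout = heartbeat_timeout
        self._next_block_id = 0

    def heartbeat(self, worker_id: str, now: float) -> None:
        # Workers call this periodically; the master only records the time.
        self.last_heartbeat[worker_id] = now

    def live_workers(self, now: float) -> List[str]:
        return sorted(w for w, t in self.last_heartbeat.items()
                      if now - t <= self.heartbeat_timeout)

    def allocate_block(self, path: str, worker_id: str) -> int:
        block_id = self._next_block_id
        self._next_block_id += 1
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_location[block_id] = worker_id
        return block_id

    def locate(self, path: str) -> List[Tuple[int, str]]:
        return [(b, self.block_location[b]) for b in self.file_blocks[path]]


class ToyWorker:
    """Block data only: knows block IDs, not which file they belong to."""

    def __init__(self, worker_id: str) -> None:
        self.worker_id = worker_id
        self.blocks: Dict[int, bytes] = {}


class ToyClient:
    def __init__(self, master: ToyMaster, workers: List[ToyWorker],
                 block_size: int = 4) -> None:
        self.master = master
        self.workers = {w.worker_id: w for w in workers}
        self.block_size = block_size

    def write(self, path: str, data: bytes, now: float) -> None:
        targets = self.master.live_workers(now)
        if not targets:
            raise RuntimeError("no live workers")
        for offset in range(0, len(data), self.block_size):
            # Assumption: simple round-robin placement across live workers.
            worker_id = targets[(offset // self.block_size) % len(targets)]
            block_id = self.master.allocate_block(path, worker_id)
            chunk = data[offset:offset + self.block_size]
            self.workers[worker_id].blocks[block_id] = chunk

    def read(self, path: str) -> bytes:
        # Metadata from the master, then block data from the workers.
        return b"".join(self.workers[w].blocks[b]
                        for b, w in self.master.locate(path))


master = ToyMaster(heartbeat_timeout=10.0)
w1, w2 = ToyWorker("w1"), ToyWorker("w2")
master.heartbeat("w1", now=0.0)
master.heartbeat("w2", now=0.0)
client = ToyClient(master, [w1, w2], block_size=4)
client.write("/f", b"hello world", now=1.0)
```

Note that a worker that stops heartbeating simply drops out of live_workers once the timeout passes; the master never has to contact it.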
Alluxio is an in-memory storage solution that unifies your data, brings it closer to the applications that consume it, and makes it available at speeds much faster than disk access typically allows. Because Alluxio sits on top of your existing storage solutions, no data migration and minimal code changes are required, making it easily pluggable. For further reading, jump into our feature documentation.
Alluxio storage - Alluxio can manage memory and local storage such as SSDs and HDDs to accelerate data access. If finer-grained control is required, the tiered storage feature can be used to automatically manage data between different tiers, keeping hot data in faster tiers. Custom policies are easily pluggable, and a pin concept allows for direct user control.
Pluggable under storage - Alluxio persists in-memory data to the underlying storage system. This enables both fault tolerance and effective data management when multiple storage systems are in use. Popular storage backends including Amazon S3, Google Cloud Storage, OpenStack Swift, Apache HDFS, GlusterFS, and Alibaba OSS are supported.
Flexible file API - Our Alluxio Filesystem API is similar to that of the java.io.File class, providing InputStream and OutputStream interfaces and efficient support for memory-mapped I/O. We recommend using this API to get the best performance. Alternatively, we provide a Hadoop-compatible interface, allowing existing Hadoop MapReduce and Spark programs to use Alluxio in place of HDFS without any code changes.
Web UI & CLI - Users can browse the file system easily through the web UI. Under debug mode, administrators can view detailed information of each file, such as block locations and under storage path. Users can also use the ./bin/alluxio fs command-line client to interact with Alluxio; for instance, to copy data to and from Alluxio.
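The tiered storage behavior described under "Alluxio storage" above (hot data in faster tiers, with pinning for user control) can be illustrated with a toy Python model. It assumes a least-recently-used policy, capacities counted in blocks, and promotion to the fastest tier on every access; these choices, along with the ToyTieredStore class and its method names, are hypothetical and not Alluxio's actual policy or API.

```python
from collections import OrderedDict
from typing import List, Optional, Set, Tuple


class ToyTieredStore:
    """Toy tiered cache: fastest tier first, LRU demotion, pinned blocks exempt."""

    def __init__(self, tiers: List[Tuple[str, int]]) -> None:
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]  # capacity in blocks
        self.data = [OrderedDict() for _ in tiers]  # least recently used first
        self.pinned: Set[str] = set()

    def pin(self, key: str) -> None:
        # Pinned blocks are never chosen as demotion victims.
        self.pinned.add(key)

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.data):
            return  # past the slowest tier: the block leaves the cache
        tier = self.data[level]
        if len(tier) >= self.capacity[level]:
            victim = next((k for k in tier if k not in self.pinned), None)
            if victim is None:
                # Every resident block is pinned; place the block lower instead.
                self._insert(level + 1, key, value)
                return
            self._insert(level + 1, victim, tier.pop(victim))  # demote LRU block
        tier[key] = value

    def _pop(self, key: str) -> Optional[bytes]:
        for tier in self.data:
            if key in tier:
                return tier.pop(key)
        return None

    def put(self, key: str, value: bytes) -> None:
        self._pop(key)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key: str) -> Optional[bytes]:
        value = self._pop(key)
        if value is not None:
            self._insert(0, key, value)  # accessed data moves to the fastest tier
        return value

    def tier_of(self, key: str) -> Optional[str]:
        for name, tier in zip(self.names, self.data):
            if key in tier:
                return name
        return None


store = ToyTieredStore([("MEM", 2), ("SSD", 2)])
for name in ("a", "b", "c"):
    store.put(name, name.encode())
# "a" was the least recently used block, so it was demoted to SSD.
```

Reading a demoted block with get moves it back to the fastest tier and pushes the least recently used unpinned block down, which is one simple way to keep hot data in fast storage.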
For those who are new to Alluxio, a good place to start is installing Alluxio locally. We will also run through some basic operations.