Two Ways to Keep Files in Sync Between Alluxio and HDFS

Alluxio provides a distributed data access layer that lets applications such as Spark or Presto access different underlying file systems (UFS) through a single API in a unified file system namespace. If users interact with files in the UFS only through Alluxio, Alluxio knows about every change its clients make to the UFS and keeps the Alluxio namespace in sync with the UFS namespace (see the left figure below).

However, when a file in the UFS is changed without going through Alluxio, the UFS namespace and the Alluxio namespace can get out of sync. When this happens, a UFS Metadata Sync operation is required to synchronize the two namespaces (illustrated in the right figure).

In Alluxio 2.0, there are two ways to keep metadata in sync between Alluxio and the UFS.

Sync On-demand

Alluxio automatically caches metadata from the UFS so that subsequent metadata operations such as listStatus (or ls) do not need to access the UFS. This reduces the latency of these operations. However, the metadata in the UFS can sometimes change without Alluxio being notified. When that happens, the cache needs to be invalidated.

Since version 1.7.0, Alluxio has provided the option alluxio.user.file.metadata.sync.interval, which lets users control how often this metadata cache is refreshed. Whenever a client issues a metadata operation such as listStatus, it can set the interval to -1, 0, or a time value. When it is set to -1, Alluxio never fetches metadata from the UFS. When it is set to 0, Alluxio always fetches metadata from the UFS. When it is set to a time value, Alluxio fetches metadata from the UFS only if it has not done so within that amount of time.

Here is an example.

$ alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=0 /dir

This tells Alluxio to always fetch the metadata from the UFS.
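The three interval settings described above can be sketched as a small decision function. This is only an illustration of the semantics in this article, not Alluxio's implementation: the function name, the millisecond units, and the treatment of the exact boundary (a refresh once the full interval has elapsed) are assumptions.

```python
from typing import Optional

NEVER_SYNC = -1   # interval value meaning "never fetch metadata from the UFS"
ALWAYS_SYNC = 0   # interval value meaning "always fetch metadata from the UFS"


def should_fetch_from_ufs(interval_ms: int,
                          last_sync_ms: Optional[int],
                          now_ms: int) -> bool:
    """Decide whether a metadata operation should refresh from the UFS.

    Hypothetical helper: interval_ms mirrors the value of
    alluxio.user.file.metadata.sync.interval, last_sync_ms is the time of
    the previous sync for the path (None if it was never synced).
    """
    if interval_ms == NEVER_SYNC:
        return False
    if interval_ms == ALWAYS_SYNC:
        return True
    if last_sync_ms is None:
        # Assumption: a path that was never synced counts as stale.
        return True
    # Assumption: refresh once at least the full interval has elapsed.
    return now_ms - last_sync_ms >= interval_ms


if __name__ == "__main__":
    # With a 5 second interval, a sync 4 seconds ago is still fresh.
    print(should_fetch_from_ufs(5000, last_sync_ms=0, now_ms=4000))
    # With interval 0, every call goes to the UFS.
    print(should_fetch_from_ufs(0, last_sync_ms=3999, now_ms=4000))
```

Seen this way, the interval is a trade-off between metadata latency (larger values, fewer UFS calls) and freshness (smaller values, more UFS calls).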

Note that Alluxio never synchronizes with the UFS unless a client request touches that UFS. This can cause problems: the first time a particular client accesses the UFS, the extra cost of that access can slow down the client request. This calls for a mechanism that synchronizes the Alluxio namespace and the UFS namespace in the background, namely Active UFS Sync.

Sync Proactively

The Alluxio 2.0 preview release supports a new feature, “Active UFS Sync”. It lets users specify a directory to be synchronized between the Alluxio namespace and the UFS namespace at a regular interval, with a number of parameters to fine-tune the syncing behavior. Currently, Active UFS Sync is only supported between Alluxio and HDFS 2.7 or later. To use this feature, the user running Alluxio must be an HDFS admin user, because Alluxio needs to listen to the event stream that HDFS provides.

To enable active sync on a directory, issue the following Alluxio command on a directory that is backed by HDFS.

$ alluxio fs startSync /syncedDir

You can also stop active sync on a directory by using the following command.

$ alluxio fs stopSync /syncedDir

Note that the list of directories under active sync is preserved across master restarts. You can check which directories are under active sync by using the getSyncPathList command.

$ alluxio fs getSyncPathList

Optimizations

A few parameters are available to tune the Active UFS Sync behavior.

Sync interval: Users can control the active sync interval by changing the alluxio.master.activesync.interval option. The default is 30 seconds.

Quiet period: Syncing a directory while it is being heavily modified adds RPC load to the UFS. To avoid this, Active UFS Sync tries to sync only when the UFS is considered to be in a quiet period.

This quiet period is controlled by alluxio.master.activesync.maxactivity. Activity is a heuristic based on the exponential moving average of the number of events in a directory. For example, if a directory had 100, 10, and 1 events in the past three intervals (oldest first), with each older interval weighted down by a further factor of 10, its activity would be 100/(10*10) + 10/10 + 1 = 3. The property alluxio.master.activesync.maxactivity is the maximum activity a UFS directory can have and still be considered “quiet”.

However, if we only sync during quiet periods, we may have to wait a long time, and metadata in the Alluxio namespace can become stale. The property alluxio.master.activesync.maxage is the maximum number of intervals we will wait before synchronizing the UFS and the Alluxio namespace. The system guarantees that it will start syncing a directory if the directory is “quiet”, or if it has not been synced for a period longer than the max age.
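The activity heuristic and the sync decision can be worked through in a short sketch. It is not Alluxio's code: the function names, the decay factor of 10 (inferred from the 100, 10, 1 example), and the threshold values used in the usage lines are assumptions made for illustration.

```python
from typing import Sequence

DECAY = 10.0  # assumed decay factor, inferred from the 100, 10, 1 example


def activity(event_counts: Sequence[int], decay: float = DECAY) -> float:
    """Fold per-interval event counts (oldest first) into one activity score."""
    score = 0.0
    for count in event_counts:
        # Everything accumulated so far is one interval older: scale it
        # down by the decay factor, then add the newest interval's events.
        score = score / decay + count
    return score


def should_active_sync(score: float, max_activity: float,
                       intervals_since_sync: int, max_age: int) -> bool:
    """Sync if the directory is quiet or the last sync is older than max_age.

    max_activity and max_age stand in for the values of
    alluxio.master.activesync.maxactivity and alluxio.master.activesync.maxage.
    """
    quiet = score <= max_activity
    overdue = intervals_since_sync > max_age
    return quiet or overdue


if __name__ == "__main__":
    score = activity([100, 10, 1])
    print(score)  # 100/(10*10) + 10/10 + 1 = 3.0
    # Hypothetical thresholds: the directory is quiet, so it is synced.
    print(should_active_sync(score, max_activity=10,
                             intervals_since_sync=2, max_age=10))
```

Because older intervals are discounted so heavily, a burst of 100 events two intervals ago contributes no more to the score than a single event in the latest interval, which is what lets a directory become “quiet” again soon after a burst ends.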

Conclusion

When using Alluxio, it is important to keep the Alluxio namespace and the UFS namespace consistent. This article describes two ways to perform this synchronization. It can happen on a client call to Alluxio (On-demand) or in the background (Active UFS Sync), and each approach has its own advantages. On-demand UFS metadata sync happens only when a client calls Alluxio, so it lets administrators control precisely when sync happens. Active UFS Sync happens in the background, so it requires minimal configuration and management. Administrators can choose the right strategy based on their specific use case.