Developer Tip: Why Did My Job Fail with Error Message "Class alluxio.hadoop.FileSystem not found"?

Bin Fan Oct 30th, 2018

From time to time, a question pops up on the user mailing list about job failures with the error message "java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found". This post explains why the failure occurs and how to resolve it.
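A common cause of this error is that the Alluxio client jar is not visible on the application's classpath. As a rough diagnostic sketch (the helper name `find_jars_with_class` is hypothetical and not part of Alluxio), the following Python snippet scans classpath entries for a jar that contains the class named in the error:

```python
import os
import zipfile

# Path of the class inside a jar, derived from the class name in the error message.
CLASS_ENTRY = "alluxio/hadoop/FileSystem.class"


def find_jars_with_class(classpath, class_entry=CLASS_ENTRY):
    """Return the jar files on `classpath` that contain `class_entry`.

    `classpath` is a string of entries separated by os.pathsep.
    Directories and wildcard entries are not expanded in this sketch.
    """
    matches = []
    for entry in classpath.split(os.pathsep):
        # Only existing jar files are inspected.
        if not entry.endswith(".jar") or not os.path.isfile(entry):
            continue
        try:
            with zipfile.ZipFile(entry) as jar:
                if class_entry in jar.namelist():
                    matches.append(entry)
        except zipfile.BadZipFile:
            # Skip files that are named *.jar but are not valid archives.
            continue
    return matches
```

Calling `find_jars_with_class(os.environ.get("CLASSPATH", ""))` and getting an empty list suggests that no jar on that classpath provides the class, although frameworks such as Spark or Hadoop usually build their classpath from their own configuration rather than from this environment variable.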

How To Speed Up Alluxio Metadata Operations Up To 100X

David Zhu Oct 16th, 2018

One of the major values Alluxio provides is a simple and unified interface to manage files and directories on different underlying storage systems. Alluxio acts as an intermediate layer and exposes a file interface for applications to interact with, even though the underlying storage system might be an object store that has a different interface. This blog describes our experience in speeding up Alluxio metadata operations using fingerprints and Alluxio under-store bulk operations. These latest optimizations can be found in the 1.8.1 release.
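To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes a fingerprint is a hash over a file's under-store metadata and that one bulk listing returns metadata for many paths at once. The function names and metadata fields are hypothetical and do not reflect Alluxio's actual implementation.

```python
import hashlib


def fingerprint(status):
    """Build a compact fingerprint from under-store metadata.

    `status` is a dict of metadata fields; the field names are hypothetical.
    """
    parts = [f"{key}={status[key]}" for key in sorted(status)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def paths_needing_sync(cached_fingerprints, ufs_listing):
    """Return paths whose under-store metadata differs from the cached copy.

    `ufs_listing` maps path -> status dict and stands in for the result of one
    bulk listing call against the under store, rather than one request per file.
    """
    stale = []
    for path, status in sorted(ufs_listing.items()):
        # A missing or different fingerprint means the metadata must be reloaded.
        if cached_fingerprints.get(path) != fingerprint(status):
            stale.append(path)
    return stale
```

In this sketch, unchanged files are skipped with a single string comparison, so only new or modified paths trigger a metadata reload.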

New York Meetup Recap - September 2018

Gene Pang Sep 18th, 2018

On September 13th, we held our first New York City Alluxio Meetup! Work-Bench generously hosted the meetup in Manhattan. This was the first US Alluxio meetup outside of the Bay Area, so it was extremely exciting to meet Alluxio enthusiasts on the East Coast!

A Better Big Data Ecosystem with Hadoop and Hitachi Content Platform (Part 1)

Nick DeRoo (Hitachi Vantara) Aug 28th, 2018

In this guest blog from our friends in the Hitachi Content Platform team at Hitachi Vantara, Nick DeRoo explores the challenges customers are facing with storing data long term in Hadoop, and discusses what the Hitachi Content team is doing with Alluxio to help customers solve these challenges.

Starburst Presto + Alluxio = Better Together

Eric Whitlow (Starburst Data) Aug 20th, 2018

Welcome Eric Whitlow from our friends at Starburst Data... With more companies using Presto for reporting and analytics, we here at Starburst are seeing more use cases around operational reporting. These types of queries need to return in under a second and usually involve a small subset of the dataset. Presto was designed from the ground up to offer interactive analytics using a massively parallel processing SQL engine that can combine data from multiple sources using a variety of connectors. As more and more companies discover the power of “separation of storage and compute” along with querying the data where it lies, it’s no wonder Presto is being asked to add even more functionality.

Announcing Alluxio v1.8.0

Neena Pemmaraju Jul 31st, 2018

We are excited to announce the release of Alluxio Enterprise Edition (AEE), Community Edition (ACE), and Alluxio Open Source (AOS) v1.8.0. This release brings features and enhancements in Alluxio to simplify cloud adoption (including hybrid cloud and migration from HDFS to object storage) for analytics and machine learning, and to improve usability.

Data Location Awareness: Optimize Performance and Lower Cost with Tiered Locality

Andrew Audibert Jul 24th, 2018

Caching frequently used data in memory is not a new computing technique; however, Alluxio has taken the concept to the next level with the ability to aggregate data from multiple storage systems in a unified pool of memory. Alluxio's capabilities extend further to intelligently managing the data within that virtual data layer. Tiered locality uses awareness of network topology and configurable policies to manage data placement for performance and cost optimizations. This feature is particularly useful with cloud deployments across multiple availability zones. It can also be useful for cost savings in environments where cross-zone or cross-location traffic is more expensive than intra-zone data traffic.
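As an illustration of locality-aware selection, the sketch below picks the worker that matches the client at the most specific tier it can. The tier names (`node`, `rack`), their order, and the function `pick_worker` are assumptions made for this example, not Alluxio's actual policy code.

```python
def pick_worker(client_locality, workers, tier_order=("node", "rack")):
    """Pick the worker that matches the client at the most specific tier.

    `client_locality` and each worker's "locality" are dicts mapping a tier
    name to a value. Tiers are checked from most to least specific.
    Returns (worker_name, matched_tier). matched_tier is None when no tier
    matches and the first worker is used as a fallback.
    """
    for tier in tier_order:
        wanted = client_locality.get(tier)
        if wanted is None:
            continue
        for worker in workers:
            if worker["locality"].get(tier) == wanted:
                return worker["name"], tier
    if workers:
        return workers[0]["name"], None
    return None, None
```

In a multi-zone cloud deployment, a tier such as an availability zone could be added to `tier_order` so that same-zone workers are preferred before falling back to cross-zone reads.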

Asynchronous Caching in Alluxio - High Performance for Partial Read Caching

Calvin Jia Jul 10th, 2018

An Alluxio cluster caches data from connected storage systems in memory to create a data layer that can be accessed concurrently by multiple application frameworks. This greatly improves performance for many analytics workloads. On-demand caching occurs when clients read blocks of data using a ‘CACHE’ read type from persistent storage systems connected to the Alluxio cluster. Prior to Alluxio v1.7, on-demand caching was on the critical path of read operations, requiring a full block to be read before the data was available to the application. Workloads that read partial blocks, for example SQL workloads, would be adversely affected on initial reads from connected storage. For example, when reading the footer of a Parquet file, the client only requests a small amount of data, but the client reads the entire data block in order to cache it.
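The difference can be sketched in a few lines of Python. This is a simplified model, assuming a partial read returns only the requested bytes while the full block is cached by a background task. The class `AsyncCachingReader` and the tiny block size are hypothetical and are not Alluxio's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024  # illustrative block size in bytes, far smaller than a real block


class AsyncCachingReader:
    def __init__(self, storage):
        self._storage = storage  # bytes object standing in for a file in the under store
        self._cache = {}  # block index -> cached block bytes
        self._executor = ThreadPoolExecutor(max_workers=1)

    def _cache_block(self, block_index):
        # Runs in the background, off the critical path of the read.
        start = block_index * BLOCK_SIZE
        self._cache[block_index] = self._storage[start:start + BLOCK_SIZE]

    def read(self, offset, length):
        """Return (data, future) for a read assumed to lie within one block.

        On a cache miss, only the requested bytes are fetched and a future for
        the background caching task is returned. On a hit, the future is None.
        """
        block_index = offset // BLOCK_SIZE
        block = self._cache.get(block_index)
        if block is not None:
            start = offset - block_index * BLOCK_SIZE
            return block[start:start + length], None
        data = self._storage[offset:offset + length]
        future = self._executor.submit(self._cache_block, block_index)
        return data, future
```

Here a footer read of a few bytes returns immediately, and later reads of the same block are served from the cache once the background task has finished.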

TalkingData Case Study: Leading Data Broker in China Leverages Alluxio to Unify Terabytes of Data Across Disparate Data Sources

Zhitao Yan (TalkingData) Jun 25th, 2018

TalkingData is China’s largest data broker, reaching more than 600 million smart devices on a monthly basis. TalkingData processes over 20 terabytes of data and more than one billion session requests every day. TalkingData products are powered by its massive proprietary data set and provide services to over 120,000 mobile applications and 100,000 application developers. TalkingData serves a wide range of clients in both Internet and traditional industries, including leading enterprises in the financial services, real estate, retail, travel, and government sectors.

Myntra Case Study: Accelerating Analytics in the Cloud for Customized Mobile E-Commerce

Deepak Batra Jun 12th, 2018

Myntra, a division of Flipkart, is a leading fashion retailer in India offering customers a wide range of merchandise through a mobile application. An analytics pipeline in Amazon Web Services (AWS) cloud processes customer data to make recommendations, present ads, and deliver other aspects of a tailored experience. Myntra deployed Alluxio to provide a virtual data layer connecting AWS S3 to the analytics pipeline to accelerate data access and enable faster customer response and interactive business intelligence.

Tencent Case Study: Delivering Customized News to Over 100 Million Users per Month with Alluxio

Can He (Tencent) Apr 8th, 2018

Tencent is one of the largest technology companies in the world and a leader in multiple sectors such as social networking, gaming, e-commerce, mobile and web portal. Tencent News, one of Tencent’s many offerings, strives to create a rich, timely news application to provide users with an efficient, high-quality reading experience. To provide the best experience to more than 100 million monthly active users of Tencent News, we leverage Alluxio with Apache Spark to create a scalable, robust, and performant architecture.

MOMO: Accelerating Ad Hoc Analysis with Spark SQL and Alluxio

MOMO Team Mar 20th, 2018

The Hadoop ecosystem makes many distributed systems and algorithms easier to use and generally lowers the cost of operations. However, enterprises and vendors are never satisfied with that, so higher performance becomes the next issue. We considered several options to address our performance needs and focused our efforts on Alluxio, which improves performance with intelligent caching.

Lenovo Case Study: Analytics on Data from Multiple Locations and Eliminating ETL

Neena Pemmaraju Mar 12th, 2018

Lenovo is an Alluxio customer with a common problem and use case in the world of data analytics. They have petabytes of data in multiple data centers in different geographic locations. Analyzing it requires an ETL process to get all of the data in the right place. This is both slow, because data has to be transferred across the network, and costly, because multiple copies of the data need to be stored. Freshness and quality of the data can also suffer: the data may be out of date, and it may be incomplete because regulatory issues prevent certain data from being transferred.

New Whitepaper: Structured Big Data Federation

Gene Pang Feb 28th, 2018

Enterprises are adopting big data technologies to analyze and derive insight from their growing volumes of structured and unstructured data. A familiar problem is the requirement to analyze data from multiple independent storage silos concurrently. In order to consolidate the data, large enterprises typically use custom solutions or build a data lake. These approaches present additional challenges and can be costly and time consuming. Alluxio helps organizations handle their big data by providing a unified view of all of the data in the enterprise – on premise, in the cloud, or hybrid. Applications access data using a standard interface to a global virtual namespace. Alluxio also employs a memory-centric architecture to enable data access at memory speed. With the combined unification and performance benefits, Alluxio can effectively provide big data federation for organizations by acting as a virtual data lake. We just published a whitepaper that goes into further detail: Structured Big Data Federation Using Alluxio.

Enabling Decoupled Compute and Storage with Alluxio

Calvin Jia Feb 5th, 2018

The primary appeal of a coupled compute-storage architecture, an architecture where the computation is happening on the machines where the data resides, is the performance possible by bringing the compute engine to the data it requires; however, the costs of maintaining such tight-knit architectures are gradually overtaking the performance benefits. Especially with the popularity of cloud resources, being able to independently scale compute and storage results in large cost savings and cheaper maintenance. This post explores the benefits Alluxio brings in these environments...

Accelerating Cloud Pipelines with Alluxio and Fast Durable Writes

Gene Pang Feb 4th, 2018

Processing and storing data in the cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage is a growing trend. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running data processing pipelines while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times. Recently, organizations have been deploying Alluxio to support various cloud-based pipelines, to improve performance and reduce costs.

Announcing the Release of Alluxio Enterprise Edition and Community Edition v1.7.0

Andrew Audibert Calvin Jia Gene Pang Adit Madan Feb 2nd, 2018

We are excited to announce the release of Alluxio Enterprise Edition (AEE) and Community Edition (ACE) v1.7.0. This release brings enhanced caching policies, further ecosystem integrations, and significant usability improvements. One highlight is the Alluxio FUSE API, which provides users with the ability to interact with Alluxio through a local filesystem mount. Alluxio FUSE is particularly useful for integrating with deep learning frameworks such as TensorFlow. Learn more about using Alluxio for deep learning here, and stay tuned for additional articles highlighting our latest capabilities.

Flexible and Fast Storage for Deep Learning with Alluxio

Yupeng Fu Jan 30th, 2018

In the age of growing datasets and increased computing power, deep learning has become a popular technique for AI. Deep learning models continue to improve their performance across a variety of domains, with access to more and more data, and the processing power to train larger neural networks. This rise of deep learning advances the state-of-the-art for AI, but also exposes some challenges for the access to data and storage systems. In this article, we further describe the storage challenges for deep learning workloads and how Alluxio can help to solve them.

Kyligence leverages Alluxio to accelerate OLAP in the cloud

Shaofeng Shi Dec 1st, 2017

Alluxio enables effective data management across different storage systems through its use of transparent naming and mounting API. With Alluxio, KAP can gain a good balance between performance, cost and management effort in the Cloud.

Announcing the Release of Alluxio AEE v1.6.0 and ACE v1.6.0

Andrew Audibert Bin Fan Chaomin Yu Neena Pemmaraju Yupeng Fu Oct 11th, 2017

We are excited to announce Alluxio Enterprise Edition (AEE) 1.6.0 and Alluxio Community Edition (ACE) 1.6.0 releases. The AEE release brings a new embedded journal as well as enhancements in the areas of security and Fast Durable Write. In addition, both the AEE and the ACE releases bring new client support (Amazon S3 API and Python Client), major usability improvements, and enhanced integrations with the ecosystem.

Open Source Alluxio 1.5.0 Release Highlights

Adit Madan Andrew Audibert Bin Fan Jiri Simsa Jul 5th, 2017

Open source Alluxio 1.5.0 has been released with a large number of new features and improvements, particularly focused on ecosystem accessibility and compatibility.

Announcing the Release of Alluxio AEE v1.5.0 and ACE v1.5.0

Neena Pemmaraju Jun 26th, 2017

We are excited to announce Alluxio Enterprise Edition (AEE) 1.5.0 and Alluxio Community Edition (ACE) 1.5.0 releases. The AEE release brings enhancements in the areas of security, multi-tenancy as well as working with multiple under-stores. In addition, both the AEE and the ACE releases bring major usability and performance improvements as well as enhanced integrations with the ecosystem.

Alluxio and Mesosphere partner to enable fast on-demand analytics with Alluxio and DC/OS

Amelia Wong Mar 13th, 2017

Today, we’re excited to announce our partnership with Mesosphere to enable fast on-demand analytics with Alluxio via Mesosphere’s DC/OS in one click. This partnership is a natural extension of the synergy between Alluxio and DC/OS. Alluxio, the world's first system that unifies data at memory speed, allows enterprises to manage and analyze data stored across disparate storage systems on premise and in the cloud at memory speed. Mesosphere brings enterprises the power of cloud native technologies, with the control to run on any infrastructure - datacenter or cloud...

What's new in Alluxio 1.4.0

Adit Madan Calvin Jia Jiri Simsa Feb 8th, 2017

Alluxio 1.4.0 has been released with a large number of new features and improvements. This blog highlights some standout aspects of the release.

Arimo Leverages Alluxio’s In-Memory Capability, Improving Time-to-Results for Deep Learning Models

Arimo Team Nov 25th, 2016

Deep learning algorithms have traditionally been used in specific applications, most notably, computer vision, machine translation, text mining, and fraud detection. Deep learning truly shines when the model is big and trained on large-scale datasets. Meanwhile, distributed computing platforms like Spark are designed to handle big data and have been used extensively. Therefore, by having deep learning available on Spark, the application of deep learning is much broader, and now businesses can fully take advantage of deep learning capabilities using their existing Spark infrastructure.

Alluxio Launches Industry's First System to Unify Data at Memory Speed

Haoyuan Li Oct 24th, 2016

Today we’re excited to unveil our first products which enable organizations to turn data into value with unprecedented ease, flexibility, and speeds. We believe our new products will substantially advance Alluxio for both the community and our enterprise customers. In this blog, I will share with you the challenges that we see application developers and business line owners face today when working with big data, and show how Alluxio addresses these challenges.

Accelerating Data Analytics on Ceph Object Storage with Alluxio

Adit Madan Oct 16th, 2016

This is an excerpt from the Accelerating Data Analytics on Ceph Object Storage with Alluxio whitepaper. In addition to the reference architecture in this blog, the whitepaper provides a detailed implementation guide to reproduce the environment.

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Calvin Jia Sep 1st, 2016

Alluxio is the world's first memory-speed virtual distributed storage system that bridges applications and underlying storage systems, providing unified data access orders of magnitude faster than existing solutions. The Hadoop Distributed File System (HDFS) is a distributed file system for storing large volumes of data. HDFS popularized the paradigm of bringing computation to data and the co-located compute and storage architecture.

Alluxio Partners with Huawei to Deliver Big Data Storage Acceleration Solution

Neena Pemmaraju Aug 27th, 2016

We are excited to announce a big data storage acceleration solution with Huawei. This solution combines Huawei’s FusionStorage with Alluxio’s memory-speed virtual distributed storage system to dramatically enhance the speed and efficiency of big data analytics for the enterprise.

Effective Spark RDDs with Alluxio

Gene Pang Pei Sun Aug 24th, 2016

Organizations like Baidu and Barclays have deployed Alluxio with Spark in their architecture, and have achieved impressive benefits and gains. Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In this blog, we investigate how Alluxio can make Spark more effective, and discuss various ways to use Alluxio with Spark. Alluxio helps Spark perform faster, and enables multiple Spark jobs to share the same, memory-speed data.

Accelerating On-Demand Data Analytics with Alluxio

Calvin Jia Aug 19th, 2016

This is an excerpt from the Accelerating On-Demand Data Analytics with Alluxio whitepaper, which includes a detailed implementation guide in addition to this high level overview.

What’s new in Alluxio 1.1 Release

Gene Pang Jun 21st, 2016

The Alluxio 1.1 release includes many great features and improvements from the community. Alluxio would not be what it is today without the growing open source community, and we would like to thank everyone involved in this project. With the Alluxio 1.1 release, the community has continued to grow at a rapid pace, to reach over 250 contributors to Alluxio – nearly 3x growth over the last year!

Introducing Alluxio Open Source Project Governance

Haoyuan Li May 30th, 2016

Alluxio, formerly Tachyon, began as a research project at UC Berkeley’s AMPLab in 2012. This year we announced the 1.0 release of Alluxio, the world’s first memory speed virtual distributed storage system, which unifies data access and bridges computation frameworks and underlying storage systems. We have been working closely with the Alluxio community on realizing the vision of Alluxio to become the de facto storage unification layer for big data and other scale out application environments.

Getting Started with Alluxio and Spark

Calvin Jia Apr 5th, 2016

Alluxio, formerly Tachyon, provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. For example, global financial powerhouse Barclays made the impossible possible by using Alluxio with Spark in their architecture. Technology giant Baidu analyzes petabytes of data and realized 30x performance improvements with a new architecture centered around Alluxio and Spark.

Alluxio, formerly Tachyon, is Entering a New Era with 1.0 release

Haoyuan Li Feb 14th, 2016

Alluxio, formerly Tachyon, began as a research project when I was a Ph.D. student at UC Berkeley’s AMPLab in 2012. At the time, Spark and Mesos were taking off. We saw what Spark and Mesos could do for compute and resource management respectively, while the storage piece of this story was missing.
