Apr 16th, 2019
How to run deep learning frameworks like TensorFlow on remote storage efficiently? This article discussed different solutions.
Apr 16th, 2019
Alluxio provides a distributed data access layer for applications like Spark or Presto to access different underlying file system (or UFS) through a single API in a unified file system namespace. If users only interact with the files in the UFS through Alluxio, since Alluxio has knowledge of any changes the client makes to the UFS, it will keep Alluxio namespace in sync with the UFS namespace (see the left figure below).
However, where a file in the UFS is changed without going through Alluxio, the UFS namespace and the Alluxio namespace can potentially get out of sync. When this happens, a UFS Metadata Sync operation is required to synchronize the two namespaces (illustrated in the right figure).
Apr 13th, 2019
Thrift served well as a fast and reliable RPC framework powering the metadata operations in Alluxio 1.x. Its limitation in handling streamed data has led us to a journey in search of better alternatives. gRPC provides some nice features that help us in building a simpler, more unified API layer. In this post, we discussed some lessons learned to move from Thrift to gRPC, including performance tuning tips that helped us achieve comparable performance for both one-off RPC calls as well as data streams.
Apr 12th, 2019
This is a recap of the Two Sigma and Alluxio joint meetup hosted in New York. Two Sigma is a leading hedge fund that leverages cutting edge technology to train their models with petabytes of data in on-premise storage. Special thanks to Two Sigma for hosting. Here are the slides from the presentation.
Apr 10th, 2019
China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amounts of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements. The complicated the architecture of its incumbent computing platform created a lot of new challenges to effectively use resources.
Fortunately there is a new generation of distributed computation frameworks that can help China Unicom meet the enormous data challenges for a variety of business scenarios. To solve these problems, the company built a new software stack of Apache Spark, Alluxio, HDFS, Hive and Apache Kafka. They used Alluxio as the core component for a unified, memory-centric distributed data processing platform with consolidated resources, and improved computation efficiency.
Apr 9th, 2019
Alluxio 2.0 is designed to support 1 billion files in a single file system namespace. Namespace scaling is vital to Alluxio for a couple of reasons:
1. Alluxio provides a single namespace where multiple storage systems can be mounted. So the size of Alluxio's namespace is the sum of the sizes of all mounted storages.
2. Object storage is increasing in popularity, and object stores often hold many more small files compared with filesystems like HDFS.
Apr 1st, 2019
Alluxio VR client is the whole new way to enable easier and more efficient interaction with data, no matter the data is stored in on-premise, public cloud or hybrid cloud environment, and no matter data scientists and developers are at home, kitchen, bus stop or gym.
Mar 28th, 2019
In the early 2000s, big data was born, and technology companies were racing to create the next-gen compute frameworks or storage systems geared towards the requirements brought about by big data. By the time I was a first year Ph.D. student at UC Berkeley’s AMPLab in 2011, numerous advances in big data related technologies such as Apache Spark was emerging.
Mar 22nd, 2019
Apache Spark has brought significant innovation to Big Data computing, but its results are even more extraordinary when paired with Alluxio. Alluxio provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. Bazaarvoice uses the combination of Spark and Alluxio to provide a real time big data platform that has the ability to not only handle the intake of 1.5 billion page views during peak events like Black Friday, but also provide real time analytics against it ([read more](https://www.alluxio.com/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage)). At this scale, the gain in speed is an enabler for new workloads. We’ve established a clean and simple way to integrate Alluxio and Spark.
Mar 7th, 2019
We are thrilled and excited to announce the availability of Alluxio 2.0 Preview Release - the largest open source release with the most new features and improvements since the creation of the project. It is now available for download.
Mar 6th, 2019
Presto is an open source distributed SQL engine widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Alluxio is an open-source distributed file system that provides a unified data access layer at in-memory speed. The combination of Presto and Alluxio is getting more popular in many companies like JD, NetEase to leverage Alluxio as distributed caching tier on top of slow or remote storage for the hot data to query, avoiding reading data repeatedly from the cloud.
Feb 20th, 2019
In this article, Thai Bui describes how Bazaarvoice leverages Alluxio to build a tiered storage architecture with AWS S3 to maximize performance and minimize operating costs on running Big Data analytics on AWS EC2. This article aims to provide the following takeaways:
Common challenges in performance and cost to build an efficient big data analytics platform on AWS.
How to setup Hive metastore to leverage Alluxio as the storage tier for “hot tables” backed by all tables on AWS S3 as the source of truth.
How to setup tiered storage within Alluxio based on ZFS and NVMe on EC2 instances to maximize the read performance.
Benchmark results of micro and real-world workloads.
Feb 12th, 2019
The Alluxio sandbox is the easiest way to test drive the popular data analytics stack of Spark, Alluxio, and S3 deployed in a multi-node cluster in a public cloud environment. The sandbox cluster is fully configured and ready for users to run applications ranging from the hello-world example to the TPC-DS benchmark suite. Don’t take our word for it; kick off the benchmark yourself to see the performance benefits of running Spark jobs that interface through Alluxio on S3 compared to running Spark jobs directly on S3. It is extremely easy to request and launch a sandbox cluster as a playground for 24 hours at no cost to you.
Jan 22nd, 2019
Sometimes, end users see this error when using Alluxio with their applications. This error is a result of a recent feature called Alluxio Client-side Hadoop Impersonation. Since version v1.8.0, Alluxio enabled a client-side impersonation feature by default to better support using Alluxio in Hadoop environments. However, without the correct configuration, end users may see confusing errors.
In order to understand the error and how to fix it, we should first understand the feature and how it works.
Jan 17th, 2019
Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio we take testing very seriously as organizations across the world rely on our technology, therefore, a problem we want to solve is how to test at scale without breaking the bank. In this blog we are going to show how the maintainers of the Alluxio open source project build and test our system at scale cost-effectively using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and pervasive storage systems, such as HDFS and S3. Using Amazon AWS EC2, we are able to test 1000+ worker clusters, at a cost of about $16 per hour.
Jan 11th, 2019
As one of the world's leading online game company, Netease Games is the operator for many popular online games in China like "World of Warcraft" and "Hearthstone". Netease Games also has developed quite a few popular games on its own such as “Fantasy Westward Journey 2", "Westward Journey 2", "World 3", "League of Immortals”. The strong growth of the business drives the demand to build and maintain a data platform handling a massive amount of data and delivering insights promptly from the data. Specifically, everyday there is about 30TB of raw data collected in our data warehouse; the raw data is further processed to intermediate data in the operational data store (ODS) tables by ETL jobs and becomes several times larger. Given our data scale, it is very challenging to support high-performance ad-hoc queries to the data with results generated in a timely manner.
Reid Chan (Momo)
Dec 28th, 2018
The Apache Spark + Alluxio stack is getting quite popular particularly for the unification of data access across S3 and HDFS. In addition, compute and storage are increasingly being separated causing larger latencies for queries. Alluxio is leveraged as compute-side virtual storage to improve performance. But to get the best performance, like any technology stack, you need to follow the best practices. This article provides the top 10 tips for performance tuning for real-world workloads when running Spark on Alluxio with data locality giving the most bang for the buck.
Nov 19th, 2018
As the amount of data being collected and analyzed by Enterprises continues to grow unabated, more attention is being placed on managing the cost of storing the data relative to performance. Hadoop provides a scalable and fast way of storing and analyzing data, however, the cost of storing data in Hadoop is typically higher compared to alternative technologies like Object Stores.
Oct 30th, 2018
From time to time, a question pops up on the user mailing list referencing job failures with the error message "java.lang.ClassNotFoundException: Class
alluxio.hadoop.FileSystem not found". This post explains the reason for the failure and the solution to the issue when you see this error.
Oct 16th, 2018
One of the major values Alluxio provides is a simple and unified interface to manage files and directories on different underlying storage systems. Alluxio acts as an intermediate layer and exposes a file interface for applications to interact with, even though the underlying storage system might be an object store that has a different interface. This blog describes our experience in speeding up Alluxio metadata operations using fingerprint and Alluxio under store bulk operations. These latest optimizations can be found in the 1.8.1 release.
Sep 18th, 2018
On September 13th, we held our first New York City Alluxio Meetup! Work-Bench was very generous for hosting the Alluxio meetup in Manhattan. This was the first US Alluxio meetup outside of the Bay Area, so it was extremely exciting to get to meet Alluxio enthusiasts on the east coast! Continue reading...
Nick DeRoo (Hitachi Vantara)
Aug 28th, 2018
In this guest blog from our friends in the Hitachi Content Platform team at Hitachi Vanatra, Nick DeRoo explores the challenges customers are facing with storing data long term in Hadoop, and discusses what the Hitachi Content team is doing with Alluxio to help customers solve these challenges.
Eric Whitlow (Starburst Data)
Aug 20th, 2018
Welcome Eric Whitlow from our friends at Starburst Data...
With more companies using Presto for reporting and analytics, we here at Starburst are seeing more use cases around operational reporting. These types of queries need to be returned subsecond and usually involve a small subset of the dataset.
Presto was designed from the ground up to offer interactive analytics using a massively parallel processing SQL engine that can combine data from multiple sources using a variety of connectors. As more and more companies discover the power of “separation of storage and compute” along with querying the data where it lies, it’s not wonder Presto is being asked to add even more functionality.
Jul 31st, 2018
We are excited to announce the release of Alluxio Enterprise Edition (AEE) and Community Edition (ACE) and Alluxio Open Source (AOS) v1.8.0. This release brings features and enhancements in Alluxio to simplify cloud adoption (and hybrid cloud, and migration from HDFS to object storage) for analytics and machine learning and improve useability.
Jul 24th, 2018
Caching frequently used data in memory is not a new computing technique, however it is a concept that Alluxio has taken to the next level with the ability to aggregate data from multiple storage systems in a unified pool of memory. Alluxio capabilities extend further to intelligently managing the data within that virtual data layer. Tiered locality uses awareness of network topology and configurable policies to manage data placement for performance and cost optimizations. This feature is particularly useful with cloud deployments across multiple availability zones. It can also be useful for cost savings in environments where cross-zone or cross-location traffic is more expensive than intra-zone data traffic.
Jul 10th, 2018
An Alluxio cluster caches data from connected storage systems in memory to create a data layer that can be accessed concurrently by multiple application frameworks. This greatly improves performance for many analytics workloads. On-demand caching occurs when clients read blocks of data using a ‘CACHE’ read type from persistent storage systems connected to the Alluxio cluster.
Prior to Alluxio v1.7, on-demand caching was on the critical path of read operations, requiring a full block to be read before the data was available for the application. Workloads which read partial blocks, for example SQL workloads, would be adversely affected on initial reads from connected storage. For example, when reading the footer of a parquet file, the client only requests a small amount of data, but the client reads the entire data block in order to cache it.
Zhitao Yan (TalkingData)
Jun 25th, 2018
TalkingData is China’s largest data broker, reaching more than 600 million smart devices on a monthly basis. TalkingData processes over 20 terabytes of data and more than one billion session requests every day. TalkingData products are powered by its massive proprietary data set and provide services to over 120,000 mobile applications and 100,000 application developers. TalkingData serves a wide range of clients in both Internet and traditional industries, including leading enterprises in the financial services, real estate, retail, travel, and government sectors.
Jun 12th, 2018
Myntra, a division of Flipkart, is a leading fashion retailer in India offering customers a wide range of merchandise through a mobile application. An analytics pipeline in Amazon Web Services (AWS) cloud processes customer data to make recommendations, present ads, and deliver other aspects of a tailored experience. Myntra deployed Alluxio to provide a virtual data layer connecting AWS S3 to the analytics pipeline to accelerate data access and enable faster customer response and interactive business intelligence.
Can He (Tencent)
Apr 8th, 2018
Tencent is one of the largest technology companies in the world and a leader in multiple sectors such as social networking, gaming, e-commerce, mobile and web portal. Tencent News, one of Tencent’s many offerings, strives to create a rich, timely news application to provide users with an efficient, high-quality reading experience. To provide the best experience to more than 100 million monthly active users of Tencent News, we leverage Alluxio with Apache Spark to create a scalable, robust, and performant architecture.
Mar 20th, 2018
The hadoop ecosystem makes many distributed system/algorithms easier to use and generally lowers the cost of operations. However, enterprises and vendors are never satisfied with that, so higher performance becomes the next issue. We considered several options to address our performance needs and focused our efforts on Alluxio, which improves performance with intelligent caching.
Mar 12th, 2018
Lenovo is an Alluxio customer with a common problem and use case in the world of data analytics. They have petabytes of data in multiple data centers in different geographic locations. Analyzing it requires an ETL process to get all of the data in the right place. This is both slow, because data has to be transferred across the network, and costly because multiple copies of the data need to be stored. Freshness and quality of the data can also suffer as the data is also potentially out of date and incomplete because regulatory issues prevent certain data from being transferred.
Feb 28th, 2018
Enterprises are adopting big data technologies to analyze and derive insight from their growing volumes of structured and unstructured data.
A familiar problem is the requirement to analyze data from multiple independent storage silos concurrently. In order to consolidate the data, large enterprises typically use custom solutions or build a data lake. These approaches present additional challenges and can be costly and time consuming.
Alluxio helps organizations handle their big data by providing a unified view of all of the data in your enterprise – on premise, in the cloud, or hybrid. Applications access data using a standard interface to a global virtual namespace. Alluxio also employs a memory-centric architecture to enable data access at memory speed. With the combined unification and performance benefits, Alluxio can effectively provide big data federation for organizations by acting as a virtual data lake.
We just published a whitepaper that goes into further detail, you can access it here: Structured Big Data Federation Using Alluxio.
Feb 5th, 2018
The primary appeal of a coupled compute-storage architecture, an architecture where the computation is happening on the machines where the data resides, is the performance possible by bringing the compute engine to the data it requires; however, the costs of maintaining such tight-knit architectures are gradually overtaking the performance benefits. Especially with the popularity of cloud resources, being able to independently scale compute and storage results in large cost savings and cheaper maintenance. This post explores the benefits Alluxio brings in these environments...
Feb 4th, 2018
Processing and storing data in the cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage is a growing trend. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running data processing pipelines while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times. Recently, organizations have been deploying Alluxio to support various cloud-based pipelines, to improve performance and reduce costs.
Feb 2nd, 2018
We are excited to announce the release of Alluxio Enterprise Edition (AEE) and Community Edition (ACE) v1.7.0. This release brings enhanced caching policies, further ecosystem integrations, and significant usability improvements. One highlight is the Alluxio FUSE API which provides users with the ability to interact with Alluxio through a local filesystem mount. Alluxio FUSE is particularly useful for integrating with deep learning frameworks such as Tensorflow. Learn more about using Alluxio for deep learning here, and stay tuned for additional articles highlighting our latest capabilities.
Jan 30th, 2018
In the age of growing datasets and increased computing power, deep learning has become a popular technique for AI. Deep learning models continue to improve their performance across a variety of domains, with access to more and more data, and the processing power to train larger neural networks. This rise of deep learning advances the state-of-the-art for AI, but also exposes some challenges for the access to data and storage systems. In this article, we further describe the storage challenges for deep learning workloads and how Alluxio can help to solve them.
Dec 1st, 2017
Alluxio enables effective data management across different storage systems through its use of transparent naming and mounting API. With Alluxio, KAP can gain a good balance between performance, cost and management effort in the Cloud.
Oct 11th, 2017
We are excited to announce Alluxio Enterprise Edition (AEE) 1.6.0 and Alluxio Community Edition (ACE) 1.6.0 releases. The AEE release brings a new embedded journal as well as enhancements in the areas of security and Fast Durable Write. In addition, both the AEE and the ACE releases bring new clients support (Amazon S3 API and Python Client), major usability improvements as well as enhanced integrations with the ecosystem.
Jul 5th, 2017
Open source Alluxio 1.5.0 has been released with a large number of new features and improvements, particularly focused on ecosystem accessibility and compatibility.
Jun 26th, 2017
We are excited to announce Alluxio Enterprise Edition (AEE) 1.5.0 and Alluxio Community Edition (ACE) 1.5.0 releases. The AEE release brings enhancements in the areas of security, multi-tenancy as well as working with multiple under-stores. In addition, both the AEE and the ACE releases bring major usability and performance improvements as well as enhanced integrations with the ecosystem.
Mar 13th, 2017
Today, we’re excited to announce our partnership with Mesosphere to enable fast on-demand analytics with Alluxio via Mesosphere’s DC/OS in one-click. This partnership is a natural extension of the synergy between Alluxio and DC/OS. Alluxio, the world's first system that unifies data at memory speed, allows enterprises to manage and analyze data stored across disparate storage systems on premise and in the cloud at memory speed. Mesosphere brings enterprises the power of cloud native technologies, with the control to run on any infrastructure - datacenter or cloud...
Feb 8th, 2017
Alluxio 1.4.0 has been released with a large number of new features and improvements. This blog highlights some stand out aspects of the release.
Nov 25th, 2016
Deep learning algorithms have traditionally been used in specific applications, most notably, computer vision, machine translation, text mining, and fraud detection. Deep learning truly shines when the model is big and trained on large-scale datasets. Meanwhile, distributed computing platforms like Spark are designed to handle big data and have been used extensively. Therefore, by having deep learning available on Spark, the application of deep learning is much broader, and now businesses can fully take advantage of deep learning capabilities using their existing Spark infrastructure.
Oct 24th, 2016
Today we’re excited to unveil our first products which enable organizations to turn data into value with unprecedented ease, flexibility, and speeds. We believe our new products will substantially advance Alluxio for both the community and our enterprise customers.
In this blog, I will share with you the challenges that we see application developers and business line owners face today when working with big data, and show how Alluxio addresses these challenges.
Oct 16th, 2016
This is an excerpt from the Accelerating Data Analytics on Ceph Object Storage with Alluxio whitepaper. In addition to the reference architecture in this blog, the whitepaper provides a detailed implementation guide to reproduce the environment
Sep 1st, 2016
Alluxio is the world's first memory-speed virtual distributed storage system that bridges applications and underlying storage systems, providing unified data access orders of magnitudes faster than existing solutions. The Hadoop Distributed File System (HDFS) is a distributed file system for storing large volumes of data. HDFS popularized the paradigm of bringing computation to data and the co-located compute and storage architecture.
Aug 27th, 2016
We are excited to announce a big data storage acceleration solution with Huawei. This solution combines Huawei’s FusionStorage with Alluxio’s memory-speed virtual distributed storage system to dramatically enhance the speed and efficiency of big data analytics for the enterprise.
Aug 24th, 2016
Organizations like Baidu and Barclays have deployed Alluxio with Spark in their architecture, and have achieved impressive benefits and gains. Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In this blog, we investigate how Alluxio can make Spark more effective, and discuss various ways to use Alluxio with Spark. Alluxio helps Spark perform faster, and enables multiple Spark jobs to share the same, memory-speed data.
Aug 19th, 2016
This is an excerpt from the Accelerating On-Demand Data Analytics with Alluxio whitepaper, which includes a detailed implementation guide in addition to this high level overview.
Jun 21st, 2016
Alluxio 1.1 release includes many great features and improvements from the community. Alluxio would not be what it is today without the growing open source community, and we would like to thank everyone involved in this project. With the Alluxio 1.1 release, the community has continued to grow at a rapid pace, to reach over 250 contributors to Alluxio – nearly 3x growth over the last year!
May 30th, 2016
Alluxio, formerly Tachyon, began as a research project at UC Berkeley’s AMPLab in 2012. This year we announced the 1.0 release of Alluxio, the world’s first memory speed virtual distributed storage system, which unifies data access and bridges computation frameworks and underlying storage systems. We have been working closely with the Alluxio community on realizing the vision of Alluxio to become the de facto storage unification layer for big data and other scale out application environments.
Apr 5th, 2016
Alluxio, formerly Tachyon, provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. For example, global financial powerhouse Barclays made the impossible possible by using Alluxio with Spark in their architecture. Technology giant Baidu analyzes petabytes of data and realized 30x performance improvements with a new architecture centered around Alluxio and Spark.
Feb 13th, 2016
Alluxio, formerly Tachyon, began as a research project when I was a Ph.D. student at UC Berkeley’s AMPLab in 2012. At the time, Spark and Mesos were taking off. We saw what Spark and Mesos could do for compute and resource management respectively, while the storage piece of this story was missing.