One of the valuable features of the BlueData EPIC platform is IOBoost. IOBoost is an EPIC software component that improves I/O performance to physical storage devices in order to maximize the performance of the Hadoop jobs running in its virtual clusters. IOBoost had two key design goals: first, to provide the storage performance boost without requiring modification of existing Hadoop applications; and second, to leverage the data backup, replication, and high-availability features of the underlying storage device, so that users would not need to learn new skills or put new processes in place to guarantee the security and durability of their data.
I have been investigating the Tachyon Memory Centric Distributed Storage System to see how it complements IOBoost. Tachyon has been showing some truly impressive performance numbers compared to native HDFS, and I was looking for a way to provide our customers with a combination of the best features of both Tachyon and IOBoost.
I found that the primary difference between Tachyon and IOBoost is the purpose of the cache itself:
IOBoost is designed to improve the performance of a running Hadoop application. It does this by taking advantage of knowledge about how the application will access its data and changing the behavior of its cache to meet those access needs. Take two trivial examples. If the application is in a map phase, it is very likely reading data sequentially, so a read-ahead cache would be beneficial. If the application is writing data to HDFS, it is likely doing so sequentially, and a write-behind cache would be beneficial.
IOBoost is layered on top of an existing physical file system or other storage device, called a DataTap. Any data written to IOBoost is eventually pushed to the physical storage of the DataTap. Once the data has been pushed to the DataTap storage, no metadata within IOBoost is needed to read the file contents directly from the native storage device of the DataTap. The IOBoost cache is transparent and not designed to be durable. It follows the durability semantics of HDFS itself, where a write to an open file is considered persistent only once the file has been successfully closed or synced to disk. If the IOBoost cache is lost, all synced data can be recovered from the underlying file system.
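The caching behavior described above can be illustrated with a minimal Python sketch. It is not IOBoost code: the class names, block size, and read-ahead window are hypothetical, and an in-memory buffer stands in for the DataTap storage. The write-behind class holds writes in memory until sync() or close() is called, and the read-ahead class fetches the blocks that follow a cache miss on the assumption that access is sequential.

```python
import io


class WriteBehindFile:
    """Buffers writes in memory; data reaches the backing store only on sync() or close()."""

    def __init__(self, backing: io.BytesIO) -> None:
        self.backing = backing   # hypothetical stand-in for the DataTap storage
        self.pending = []        # cached writes that are not yet persistent

    def write(self, data: bytes) -> None:
        self.pending.append(data)  # fast path: no backing-store I/O

    def sync(self) -> None:
        # Push every cached write to the backing store, in order.
        for chunk in self.pending:
            self.backing.write(chunk)
        self.pending.clear()

    def close(self) -> None:
        self.sync()


class ReadAheadFile:
    """On a cache miss, fetches the requested block plus the next `window` blocks."""

    def __init__(self, backing: bytes, block_size: int = 4, window: int = 2) -> None:
        self.backing = backing
        self.block_size = block_size
        self.window = window
        self.cache = {}          # block index -> block contents
        self.backing_reads = 0   # number of round trips to the backing store

    def read_block(self, index: int) -> bytes:
        if index not in self.cache:
            self.backing_reads += 1
            # One round trip fills the cache for the sequential reads to come.
            for i in range(index, index + 1 + self.window):
                start = i * self.block_size
                self.cache[i] = self.backing[start:start + self.block_size]
        return self.cache[index]
```

In this sketch, anything still in the pending list when the cache is lost is gone, while everything already synced remains readable from the backing store without consulting the cache.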
Tachyon, on the other hand, is designed to be more than a data cache. It is a true in-memory file system. The Tachyon file system is backed by a persistent storage device, known as an UnderFS, whose functionality is comparable to that of a DataTap used by IOBoost. However, in addition to caching data in memory, a Tachyon Distributed Storage System persistently stores data in such a way that it can be recovered quickly in the event of a hardware failure, regardless of the type of UnderFS. Tachyon accomplishes this by storing some additional information about the data. This additional information describes the “lineage” of the data, in other words how more recent versions of the data can be regenerated from previous versions. Tachyon’s use of data lineage is described in numerous other blogs. Storing this additional data, called metadata, means that once a file is written to a Tachyon file system it can be accessed only through the Tachyon file system. The file data can no longer be accessed directly from the UnderFS. If the Tachyon metadata is lost, it may be difficult or impossible to recover the data directly from the UnderFS. In version 0.6.0 of Tachyon there is a way to “import” files on an existing UnderFS file system into Tachyon, but no automatic way to export new or modified files out of Tachyon and back into the UnderFS.
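The idea of regenerating data from its lineage can be sketched in a few lines of Python. This is a toy illustration, not Tachyon's API or implementation: the LineageStore class and its method names are hypothetical, and lineage is reduced to a parent name plus a function that rebuilds the derived data from that parent.

```python
from typing import Callable, Dict, List, Tuple

Data = List[int]
Transform = Callable[[Data], Data]


class LineageStore:
    """Keeps data in memory and remembers how each derived dataset was produced."""

    def __init__(self) -> None:
        self.memory: Dict[str, Data] = {}                     # in-memory copies
        self.lineage: Dict[str, Tuple[str, Transform]] = {}   # name -> (parent, transform)

    def put_source(self, name: str, data: Data) -> None:
        # Source data has no lineage; in this sketch it must stay available.
        self.memory[name] = list(data)

    def put_derived(self, name: str, parent: str, transform: Transform) -> None:
        # Record the lineage first, then materialize the result in memory.
        self.lineage[name] = (parent, transform)
        self.memory[name] = transform(self.get(parent))

    def evict(self, name: str) -> None:
        # Simulates losing the in-memory copy, for example after a node failure.
        self.memory.pop(name, None)

    def get(self, name: str) -> Data:
        if name not in self.memory:
            if name not in self.lineage:
                raise KeyError(f"{name} is lost and has no recorded lineage")
            parent, transform = self.lineage[name]
            # Recompute from the parent, which may itself need recomputing.
            self.memory[name] = transform(self.get(parent))
        return self.memory[name]
```

The sketch also shows why the metadata matters: if the lineage dictionary were lost along with the in-memory copies, the derived datasets could not be rebuilt.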
As mentioned above, the second design goal of IOBoost is that it does not become the storage device of record, but rather leaves that responsibility to the DataTap/UnderFS. In addition, for Hadoop applications to take maximum advantage of the Tachyon Distributed File System, the applications must be modified to use the Tachyon API. This is contrary to the first design goal of IOBoost. The “run-any-application” feature of Hadoop clusters on EPIC is a big benefit to our customers. I also found that there are currently no C library calls for accessing a Tachyon file system; the Java jar file must be used. In order to get maximum performance, the IOBoost system uses C libraries to communicate with its backing DataTap storage devices.
I recently met with Haoyuan Li (Hy), the architect and lead developer of Tachyon, and discussed these issues with him. We are working with Hy to formalize a set of projects and get them scheduled into the Tachyon roadmap. Specifically, we want to drive the following projects:
– write a C library for accessing a Tachyon file system, similar to the C API for HDFS
– enhance Tachyon to support the periodic or continual export of data back to the UnderFS, so that the data can be accessed both through the Tachyon file system and directly from the UnderFS. The data would still be delivered faster when accessed via Tachyon than when accessed via the UnderFS.
– lastly, and we know this one is quite a challenge, enhance Tachyon to determine the data lineage transformations without having to modify each Hadoop application to use the Tachyon API
We believe adding these features to Tachyon will increase its already considerable value as a performance-enhancing layer of the Hadoop infrastructure. If you agree, please consider joining us.
You can watch the BlueData with Tachyon product demonstration here.