Efficient Data Sharing

One of the main overheads when executing UDFs in SparkR is the time spent serializing input for the UDF from the JVM and then deserializing it in R. This process is repeated for the data output by the UDF and thus adds significant overhead to execution time. Recent memory management improvements have introduced support for off-heap storage in Spark, and we plan to investigate techniques that use off-heap storage to share data efficiently between the JVM and R. One of the key challenges here is to develop a storage format that can be parsed easily in both languages. In addition to the serialization benefits, off-heap data sharing can lower memory overhead by reducing the number of data copies required.
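To make the overhead concrete, the sketch below runs a UDF with SparkR's dapply(): each partition is serialized out of the JVM, deserialized into an R data.frame, transformed by the UDF, and the result is serialized back. Note this is illustrative only; dapply() and sparkR.session() belong to the Spark 2.x SparkR API, which may postdate this text, and earlier releases expose a different entry point.

    library(SparkR)
    sparkR.session()  # start (or attach to) a Spark session

    # 'faithful' is a built-in R data set; copy it into a Spark DataFrame.
    df <- createDataFrame(faithful)

    # The output schema of the UDF must be declared up front.
    schema <- structType(structField("eruptions", "double"),
                         structField("waiting", "double"),
                         structField("ratio", "double"))

    # dapply() ships each partition to an R worker process:
    # JVM -> serialize -> R (run UDF) -> serialize -> JVM.
    # These two crossings are the overhead discussed above.
    result <- dapply(df, function(part) {
      part$ratio <- part$eruptions / part$waiting
      part
    }, schema)

    head(collect(result))

With an off-heap format readable from both runtimes, the two serialization crossings above could in principle be replaced by shared-memory reads, avoiding the extra data copies.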


RELATED WORK

A number of academic and commercial projects have looked at integrating R with Apache Hadoop. SparkR follows a similar approach but inherits the functionality and performance benefits of using Spark as the execution engine. The high-level DataFrame API in SparkR is inspired by data frames in R, dplyr, and pandas. Further, SparkR’s data sources integration is similar to the pluggable backends supported by dplyr. Unlike other data frame implementations, SparkR uses lazy evaluation and Spark’s relational optimizer to improve performance for distributed computations. Finally, a number of projects such as DistributedR, SciDB, and SystemML have looked at scaling array- or matrix-based computations in R. In SparkR, we propose a high-level DataFrame API for structured data processing and integrate it with a distributed machine learning library to provide support for advanced analytics.
