Overview of Spark and R

In this section we first provide a brief overview of Spark and R, the two main systems on which SparkR is built. We then discuss common application patterns used by R programmers for large-scale data processing.

Apache Spark Apache Spark is a general-purpose engine for large-scale data processing. The Spark project first introduced Resilient Distributed Datasets (RDDs), an API for fault-tolerant computation in a cluster computing environment. More recently, a number of higher-level APIs have been developed in Spark. These include MLlib, a library for large-scale machine learning; GraphX, a library for processing large graphs; and SparkSQL, a SQL API for analytical queries. Since these libraries are closely integrated with the core API, Spark enables complex workflows where, say, SQL queries can be used to pre-process data and the results can then be analyzed using advanced machine learning algorithms. SparkSQL also includes Catalyst, a distributed query optimizer that improves performance by generating the optimal physical plan for a given query. More recent efforts [9] have looked at developing a high-level distributed DataFrame API for structured data processing. As queries on DataFrames are executed using the SparkSQL query optimizer, DataFrames provide both better usability and performance compared to using RDDs. We next discuss some of the important characteristics of data frames in the context of the R programming language.
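As a concrete illustration of the workflow described above, the following sketch shows how a SQL query can pre-process data before the result is pulled back into R for further analysis. This is a minimal example, assuming the SparkR package is installed and using the SparkR API of that era (sparkR.init, sparkRSQL.init, createDataFrame); it is illustrative rather than a definitive implementation.

```r
# Sketch: SQL pre-processing followed by analysis in R.
# Assumes the SparkR package is available on this machine.
library(SparkR)

sc <- sparkR.init(master = "local")       # initialize a local Spark context
sqlContext <- sparkRSQL.init(sc)          # initialize the SparkSQL context

# Create a distributed DataFrame from a local R data.frame
df <- createDataFrame(sqlContext, faithful)

# Use a SQL query to pre-process the data ...
registerTempTable(df, "faithful")
long_waits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 70")

# ... then collect the (smaller) result back into R for local analysis
local_df <- collect(long_waits)
summary(local_df)
```

Because the filtering query runs through the SparkSQL optimizer on the cluster, only the reduced result is shipped back to the R process, which is what gives the DataFrame API its performance advantage over operating on raw RDDs from R.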
