Spawning R Workers

The second part of SparkR’s design consists of support for launching R processes on Spark executor machines. Our initial approach was to fork a new R process each time we needed to run an R function. This is expensive because there are fixed overheads in launching the R process and in transferring the necessary inputs, such as Spark broadcast variables and input data. We made two optimizations that reduce this overhead significantly. First, we implemented


# Query 1
# Top-5 destinations for flights departing from JFK.
jfk_flights <- filter(flights, flights$Origin == "JFK")
head(agg(group_by(jfk_flights, jfk_flights$Dest),
         count = n(jfk_flights$Dest)), 5L)

# Query 2
# Calculate the average delay across all flights.
collect(summarize(flights,
    avg_delay = mean(flights$DepDelay)))

# Query 3
# Count the number of distinct flight numbers.
count(distinct(select(flights, flights$TailNum)))

Queries used for evaluation with the flights dataset.

support for coalescing R operations, which lets us combine a number of R functions that need to be run into a single invocation. This is similar to operator pipelining used in database execution engines. Second, we added support for a daemon R process that lives for the lifetime of a Spark job and manages the worker R processes using the mcfork feature of the parallel package. Together, these optimizations reduce the fixed overhead, cut the number of times a new R process must be launched, and help lower end-to-end latency.
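As a rough illustration of these two ideas, the following is a minimal sketch in Python (SparkR's actual implementation is an R daemon using parallel's mcfork; all names here are hypothetical). A long-lived daemon worker pays its startup cost once and then services tasks from a queue, and each task ships a chain of functions that the worker applies in one pass, mimicking operation coalescing.

```python
import multiprocessing as mp

def double_all(xs):
    return [x * 2 for x in xs]

def add_one(xs):
    return [x + 1 for x in xs]

def daemon_worker(tasks, results):
    # Long-lived worker: the fixed startup cost is paid once; every task
    # arriving on the queue reuses the same warm process.
    for funcs, arg in iter(tasks.get, None):   # None is the shutdown signal
        for f in funcs:                        # coalesced chain: several
            arg = f(arg)                       # functions, one invocation
        results.put(arg)

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=daemon_worker, args=(tasks, results))
    worker.start()

    # Three logical operations shipped to the worker as one coalesced task.
    tasks.put(((double_all, add_one, sum), [1, 2, 3]))
    print(results.get())                       # 15 == sum([3, 5, 7])

    tasks.put(None)                            # stop the daemon worker
    worker.join()
```

Without coalescing, each of the three functions would cost a separate round trip to the worker; with it, the per-invocation overhead is paid once per chain.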

