Parallel Computing
Process Interaction
- Shared Memory (SM): Java Threads, CUDA
Further consider the PRAM access models: EREW, CREW, CRCW, ERCW (Exclusive/Concurrent Read, Exclusive/Concurrent Write).
And ways to deal with contention:
deal with race conditions through locks,
semaphores and monitors (see the sketch after this list).
Note: EREW SM could be considered an MP system.
- Message Passing (MP):
Further consider network topology.
Star, grid, hypercube.
- Implicit Interaction: functional programming, no side effects. F#, Scala/Spark, MapReduce
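A minimal shared-memory sketch (not from the notes): two threads increment a shared counter; without synchronization updates can be lost, and a monitor (synchronized) removes the race condition.

object RaceConditionDemo {
  var unsafeCount = 0            // shared state, unprotected
  var safeCount = 0              // shared state, guarded by a monitor
  val lock = new Object

  def worker(): Thread = new Thread(new Runnable {
    def run(): Unit =
      for (_ <- 1 to 100000) {
        unsafeCount += 1                       // unprotected read-modify-write: race
        lock.synchronized { safeCount += 1 }   // mutual exclusion: no race
      }
  })

  def main(args: Array[String]): Unit = {
    val ts = Seq(worker(), worker())
    ts.foreach(_.start())
    ts.foreach(_.join())
    println(s"unsafe = $unsafeCount (may be < 200000), safe = $safeCount (always 200000)")
  }
}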
Problem Decomposition
- Task Parallelism: MIMD, MISD. Threads, message-passing systems
- Data Parallelism: MIMD, SIMD. Example: CUDA (see the sketch after this list)
- Implicit Parallelism: F#, Scala/Spark
- SISD: Classical computer
- SIMD: Program replication
- MIMD: Program partition
- MISD: Fault tolerance
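A small illustration of data parallelism (a sketch, not from the notes): the same operation is applied to different n/N-sized chunks of the data, here on threads via Scala futures.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataParallelSum {
  def main(args: Array[String]): Unit = {
    val a = Array.tabulate(1000000)(i => i.toLong)
    val N = 4                                        // number of "processors" (threads)
    // Same program, different data: each future sums one n/N-sized chunk.
    val chunks = a.grouped(a.length / N).toSeq
    val partials = chunks.map(chunk => Future(chunk.sum))
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(s"total = $total")
  }
}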
Analyzing Parallel Algorithms
P1,...,PN processors, N = number of processors
n = size of input
Running time: t(n)=number of steps (parallel or sequential) from start
to completion of algorithm
speedup = O(t_sequential(n))/O(t_parallel(n))
speedup = 1 means no advantage from the parallel algorithm
speedup = N means optimal advantage taken of each processor
would like speedup = N, meaning that using N
processors resulted in an N times faster computation
speedup > N possible!! Why?
cost = parallel running time * number of processors used
kind of like the amount of work done.
work = distance * force
cost = time * processors
Think of this as how much computational effort
was involved in the algorithm.
want cost to be same as cost of sequential algorithm
efficiency = running time of fastest sequential algorithm / cost of parallel algorithm
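A quick numeric check of these measures (the numbers are illustrative assumptions), using t_sequential = n and t_parallel = n/N + 2*log_2(N) as in the search example below:

object Measures {
  def main(args: Array[String]): Unit = {
    val n = 1000000.0        // input size (assumed)
    val N = 64.0             // number of processors (assumed)
    val log2N = math.log(N) / math.log(2)
    val tSeq = n
    val tPar = n / N + 2 * log2N
    val speedup = tSeq / tPar          // close to N when n/N dominates log N
    val cost = tPar * N                // parallel running time * processors
    val efficiency = tSeq / cost       // 1.0 would be ideal
    println(f"speedup = $speedup%.1f, cost = $cost%.0f, efficiency = $efficiency%.3f")
  }
}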
Models: NC (problems solvable in polylogarithmic parallel time using polynomially many processors)
- Big open question: NC = P? Important in the same way P=NP is important.
- Won't discuss this now!! Possibly come back to this later!
Example
Problem: given an array a of n integers, determine whether x is in a.
- Sequential Algorithm
int[] a = new int[n];  // n VERY LARGE!!!
Consider the SISD running time: O(n).
public boolean search(int x) {
    for (int i = 0; i < a.length; i++)
        if (a[i] == x) return true;
    return false;
}
- Parallel Algorithm 1
EREW SM SIMD/MP SIMD algorithm
The SIMD message-passing algorithm could just as easily be shared memory:
deposit the message for processor i in message[i].
1. P1 reads x
2. Parallel do
for(int j=0;j
- Parallel Algorithm 2
EREW SM SIMD/MP SIMD algorithm
The SIMD message-passing algorithm could just as easily be shared memory:
deposit the message for processor i in message[i].
Processors P1,...,PN
1. P1 reads x
2. P1 sends x to P2.
3. P1 and P2 send x to P3,P4
4. P1,P2,P3,P4 send x to P5,P6,P7,P8
...
// Question: How many parallel steps?
// check: each Pi is responsible for n/N elements of a
each Pi checks whether x is in its part of a
// gather result
1. PN/2+1,...,PN send their results back to P1,...,PN/2
2. P1,...,PN/2 compute
result = result || received result
3. repeat step 1 with half as many processors (PN/4+1,...,PN/2 send to P1,...,PN/4)
4. repeat step 2 for those Pi
...
// Question: how many repetitions of steps 1 and 2?
k. result is in P1 (after k = log_2(N) rounds)
Running time: t(n) = log_2(N) + n/N + log_2(N) = O(n/N) (assuming n/N dominates log_2(N))
speedup = O(sequential algorithm)/O(parallel algorithm)
= O(n)/O(n/N) = N
cost = parallel running time * number of processors used
cost = N * O(n/N) = O(n)
This is optimal: the same as the cost of the sequential algorithm.
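A shared-memory sketch of the same idea (not the notes' message-passing SIMD formulation; a thread pool stands in for the N processors): split a into n/N-sized chunks, search the chunks in parallel, and OR the partial results together.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSearch {
  // Returns true if x occurs in a, searching N chunks in parallel.
  def search(a: Array[Int], x: Int, N: Int): Boolean = {
    val chunkSize = math.max(1, (a.length + N - 1) / N)   // ceil(n / N)
    val partials: Seq[Future[Boolean]] =
      a.grouped(chunkSize).toSeq.map(chunk => Future(chunk.contains(x)))
    // "Gather": OR the partial results (the notes do this in log_2(N) rounds;
    // here a simple fold over the results stands in for the reduction tree).
    Await.result(Future.sequence(partials), 1.minute).exists(identity)
  }

  def main(args: Array[String]): Unit = {
    val a = Array.tabulate(1000000)(i => i * 2)   // even numbers only
    println(search(a, 123456, N = 8))             // true  (123456 is even)
    println(search(a, 123457, N = 8))             // false (odd)
  }
}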
Apache Spark
Setup
Setup with Docker: docker-compose.yml
docker stack deploy -c docker-compose.yml spark
docker container ls # looking for master!
docker exec -it CONTAINER_ID /bin/bash
Setup on your system
- Install Apache Spark (just download); needs a JDK
- Install sbt
- Install scala (optional)
Architecture
- Implicit Interaction, Implicit Parallelism: Scala (functional language), functional code + the Spark API
- Goal: Apply lots of compute to lots of data.
Spark uses a master/worker architecture: a driver talks to a single coordinator (the master), which manages the workers on which the executors run.
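In code, the driver names the master and sizes the executors when it builds its SparkSession; a sketch (the host name and resource sizes are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ArchitectureDemo")            // shown on the master's web UI
  .master("spark://master-host:7077")     // assumed standalone master URL
  .config("spark.executor.memory", "2g")  // assumed executor size
  .config("spark.cores.max", "4")         // assumed total cores across the workers
  .getOrCreate()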
Extract!!
./bin/spark-shell # Creates spark context
val textFile = sc.textFile("README.md")
textFile.count() // Number of items in this Dataset
textFile.first() // First item in this Dataset
textFile.take(100)
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// val means the binding is not modifiable (immutable reference)
textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
// access to Java API
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
val words = textFile.flatMap(line => line.split(" ")).groupBy(identity)
// pull all parts of the RDD back to the driver
words.collect()
words.count()
// words.reduce((a,b)=>a+b) // can't do that: the elements are (word, group) pairs, not numbers
words.map((x)=>1)                       // one 1 per distinct word
words.map((x)=>1).reduce((a,b)=>a+b)    // = number of distinct words
linesWithSpark.cache()   // mark the RDD for caching
linesWithSpark.count()   // first action computes and caches it
linesWithSpark.count()   // subsequent counts are served from the cache
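An equivalent word count using reduceByKey instead of groupBy (a sketch, reusing textFile from above):

val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()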
- Self contained application
# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23
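The application itself is not shown in the notes; a sketch along the lines of the Spark quick-start SimpleApp (the README.md path and the Scala/Spark versions in build.sbt are assumptions to adjust to your installation):

/* build.sbt */
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"

/* src/main/scala/SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md"   // any text file reachable by the workers
    val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}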
RDD
The main abstraction Spark provides is a resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Operations are either transformations (e.g. map, filter), which are lazy and build a new RDD, or actions (e.g. count, reduce, collect), which trigger computation and return a result to the driver.
./bin/spark-shell --master local[4]
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
val distFile = sc.textFile("data.txt")
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)             // transformation: lazy
val totalLength = lineLengths.reduce((a, b) => a + b)  // action: triggers the computation
lineLengths.persist()                                  // keep lineLengths in memory for later reuse
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect()
SQL Datasets and DataFrames
- A Dataset is a distributed collection of data.
- A DataFrame is a Dataset organized into named columns.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.select("name").show()
df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").count().show()
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
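The notes show DataFrames only; a typed Dataset can be built the same way (a sketch, reusing people.json and the spark.implicits._ import from above):

case class Person(name: String, age: Long)
// a Dataset from a local collection
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// a typed Dataset from the same JSON file as the DataFrame above
val peopleDS = spark.read.json("examples/src/main/resources/people.json").as[Person]
peopleDS.show()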
Playing With Genomic Data
val genome = sc.textFile("/home/arnold/courses/csc492/sequence2")
val c = genome.map(_=>1)
c.reduce(_+_)                             // number of lines in the file
val d = genome.map((_,1))                 // (line, 1) pairs (superseded just below)
val d = genome.map(x=>x.split(" ")(3))    // fourth whitespace-separated field of each line
val e = d.map(x=>(x,1))
e.take(100)
e.reduceByKey((x,y)=>x+y)                 // count occurrences of each distinct value
val f = e.reduceByKey((x,y)=>x+y)
f.collect()
Using Dataframes, Datasets and Spark SQL
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
import spark.implicits._
val genomeds = spark.read.format("csv").option("sep", " ").option("inferSchema", "true").option("header", "true").load("/home/arnold/courses/csc492/sequence2")
genomeds.printSchema()
val genomeds = spark.read.format("csv").option("sep", " ").option("inferSchema", "true").option("header", "false").load("/home/arnold/courses/csc492/sequence2")
genomeds.printSchema()                    // with header=false the columns are auto-named _c0, _c1, ...
genomeds.groupBy("_c3").count().show()    // same per-value counts as the RDD version above
val result = genomeds.groupBy("_c3").count()
result.createOrReplaceTempView("v")
spark.sql("select * from v").show()