Parallel Computing
Process Interaction
- Shared Memory (SM): Java Threads, CUDA
Further consider the PRAM access models: EREW, CREW, CRCW, ERCW (Exclusive/Concurrent Read, Exclusive/Concurrent Write).
And ways to deal with contention:
deal with race conditions through locks,
semaphores and monitors (see the sketch after this list).
Note: EREW SM could be considered an MP system.
- Message Passing (MP):
Further consider network topology.
Star, grid, hypercube.
- Implicit Interaction: functional programming, no side effects. F#, Scala/Spark, MapReduce
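A minimal shared-memory sketch (not from the notes): two threads increment a shared counter; without synchronization updates can be lost, and a monitor (synchronized) removes the race condition.

object RaceConditionDemo {
  var unsafeCount = 0            // shared state, unprotected
  var safeCount = 0              // shared state, guarded by a monitor
  val lock = new Object

  def worker(): Thread = new Thread(new Runnable {
    def run(): Unit =
      for (_ <- 1 to 100000) {
        unsafeCount += 1                       // unprotected read-modify-write: race
        lock.synchronized { safeCount += 1 }   // mutual exclusion: no race
      }
  })

  def main(args: Array[String]): Unit = {
    val ts = Seq(worker(), worker())
    ts.foreach(_.start())
    ts.foreach(_.join())
    println(s"unsafe = $unsafeCount (may be < 200000), safe = $safeCount (always 200000)")
  }
}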
Problem Decomposition
- Task Parallelism: MIMD, MISD. Threads, message-passing systems
- Data Parallelism: MIMD, SIMD. Example: CUDA (see the sketch after this list)
- Implicit Parallelism: F#, Scala/Spark
- SISD: Classical computer
- SIMD: Program replication
- MIMD: Program partition
- MISD: Fault tolerance
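A small illustration of data parallelism (a sketch, not from the notes): the same operation is applied to different n/N-sized chunks of the data, here on threads via Scala futures.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataParallelSum {
  def main(args: Array[String]): Unit = {
    val a = Array.tabulate(1000000)(i => i.toLong)
    val N = 4                                        // number of "processors" (threads)
    // Same program, different data: each future sums one n/N-sized chunk.
    val chunks = a.grouped(a.length / N).toSeq
    val partials = chunks.map(chunk => Future(chunk.sum))
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(s"total = $total")
  }
}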
Analyzing Parallel Algorithms
P1,...,PN processors, N = number of processors
n = size of input
Running time: t(n)=number of steps (parallel or sequential) from start
to completion of algorithm
speedup = O(t_sequential(n))/O(t_parallel(n))
speedup = 1 means no advantage from the parallel algorithm
speedup = N means optimal advantage taken of each processor
would like speedup = N, meaning that using N
processors resulted in an N times faster computation
speedup > N possible!! Why?
cost = parallel running time * number of processors used
kind of like the amount of work done.
work = distance * force
cost = time * processors
Think of this as how much computational effort
was involved in the algorithm.
want cost to be same as cost of sequential algorithm
efficiency = running time of fastest sequential algorithm / cost of parallel algorithm
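A quick numeric check of these measures (the numbers are illustrative assumptions), using t_sequential = n and t_parallel = n/N + 2*log_2(N) as in the search example below:

object Measures {
  def main(args: Array[String]): Unit = {
    val n = 1000000.0        // input size (assumed)
    val N = 64.0             // number of processors (assumed)
    val log2N = math.log(N) / math.log(2)
    val tSeq = n
    val tPar = n / N + 2 * log2N
    val speedup = tSeq / tPar          // close to N when n/N dominates log N
    val cost = tPar * N                // parallel running time * processors
    val efficiency = tSeq / cost       // 1.0 would be ideal
    println(f"speedup = $speedup%.1f, cost = $cost%.0f, efficiency = $efficiency%.3f")
  }
}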
Models: NC (problems solvable in polylogarithmic parallel time using polynomially many processors)
- Big open question: NC = P? Important in the same way P=NP is important.
- Won't discuss this now!! Possibly come back to this later!
Example
Problem: given an array a of n integers, determine whether x is in a.
- Sequential Algorithm
int[] a = new int[n];  // n VERY LARGE!!!
Consider the SISD running time: O(n).
public boolean search(int x) {
    for (int i = 0; i < a.length; i++)
        if (a[i] == x) return true;
    return false;
}
- Parallel Algorithm 1
EREW SM SIMD/MP SIMD algorithm
The SIMD message-passing algorithm could just as easily be shared memory:
deposit the message for processor i in message[i].
1. P1 reads x
2. Parallel do
for(int j=0;j
- Parallel Algorithm 2
EREW SM SIMD/MP SIMD algorithm
The SIMD message-passing algorithm could just as easily be shared memory:
deposit the message for processor i in message[i].
Processors P1,...,PN
1. P1 reads x
2. P1 sends x to P2.
3. P1 and P2 send x to P3,P4
4. P1,P2,P3,P4 send x to P5,P6,P7,P8
...
// Question: How many parallel steps?
// check: each Pi is responsible for n/N elements of a
each Pi checks whether x is in its part of a
// gather result
1. PN/2+1,...,PN send their results back to P1,...,PN/2
2. P1,...,PN/2 compute
result = result || received result
3. repeat step 1 with half as many processors (PN/4+1,...,PN/2 send to P1,...,PN/4)
4. repeat step 2 for those Pi
...
// Question: how many repetitions of steps 1 and 2?
k. result is in P1 (after k = log_2(N) rounds)
Running time: t(n) = log_2(N) + n/N + log_2(N) = O(n/N) (assuming n/N dominates log_2(N))
speedup = O(sequential algorithm)/O(parallel algorithm)
= O(n)/O(n/N) = N
cost = parallel running time * number of processors used
cost = N * O(n/N) = O(n)
This is optimal: the same as the cost of the sequential algorithm.
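A shared-memory sketch of the same idea (not the notes' message-passing SIMD formulation; a thread pool stands in for the N processors): split a into n/N-sized chunks, search the chunks in parallel, and OR the partial results together.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSearch {
  // Returns true if x occurs in a, searching N chunks in parallel.
  def search(a: Array[Int], x: Int, N: Int): Boolean = {
    val chunkSize = math.max(1, (a.length + N - 1) / N)   // ceil(n / N)
    val partials: Seq[Future[Boolean]] =
      a.grouped(chunkSize).toSeq.map(chunk => Future(chunk.contains(x)))
    // "Gather": OR the partial results (the notes do this in log_2(N) rounds;
    // here a simple fold over the results stands in for the reduction tree).
    Await.result(Future.sequence(partials), 1.minute).exists(identity)
  }

  def main(args: Array[String]): Unit = {
    val a = Array.tabulate(1000000)(i => i * 2)   // even numbers only
    println(search(a, 123456, N = 8))             // true  (123456 is even)
    println(search(a, 123457, N = 8))             // false (odd)
  }
}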
Apache Spark
Setup
Setup with Docker: docker-compose.yml
docker stack deploy -c docker-compose.yml spark
docker container ls # looking for master!
docker exec -it CONTAINER_ID /bin/bash
Setup on your system
- Install Apache Spark (just download); needs a JDK
- Install sbt
- Install scala (optional)
Architecture
- Implicit Interaction, Implicit Parallelism: Scala (functional language), functional code + the Spark API
- Goal: Apply lots of compute to lots of data.
Spark uses a master/worker architecture: a driver talks to a single coordinator (the master), which manages the workers on which the executors run.
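In code, the driver names the master and sizes the executors when it builds its SparkSession; a sketch (the host name and resource sizes are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ArchitectureDemo")            // shown on the master's web UI
  .master("spark://master-host:7077")     // assumed standalone master URL
  .config("spark.executor.memory", "2g")  // assumed executor size
  .config("spark.cores.max", "4")         // assumed total cores across the workers
  .getOrCreate()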
Extract!!
./bin/spark-shell # Creates spark context
val textFile = sc.textFile("README.md")
textFile.count() // Number of items in this Dataset
textFile.first() // First item in this Dataset
textFile.take(100)
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// val means the binding is not modifiable (immutable reference)
textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
// access to Java API
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
val words = textFile.flatMap(line => line.split(" ")).groupBy(identity)
// pull all parts of the RDD back to the driver
words.collect()
words.count()
// words.reduce((a,b)=>a+b) // can't do that: the elements are (word, group) pairs, not numbers
words.map((x)=>1)                       // one 1 per distinct word
words.map((x)=>1).reduce((a,b)=>a+b)    // = number of distinct words
linesWithSpark.cache()   // mark the RDD for caching
linesWithSpark.count()   // first action computes and caches it
linesWithSpark.count()   // subsequent counts are served from the cache
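An equivalent word count using reduceByKey instead of groupBy (a sketch, reusing textFile from above):

val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()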
- Self contained application
# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23
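The application itself is not shown in the notes; a sketch along the lines of the Spark quick-start SimpleApp (the README.md path and the Scala/Spark versions in build.sbt are assumptions to adjust to your installation):

/* build.sbt */
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"

/* src/main/scala/SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md"   // any text file reachable by the workers
    val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}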
RDD
The main abstraction Spark provides is a resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Operations are either transformations (e.g. map, filter), which are lazy and build a new RDD, or actions (e.g. count, reduce, collect), which trigger computation and return a result to the driver.
./bin/spark-shell --master local[4]
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
val distFile = sc.textFile("data.txt")
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)             // transformation: lazy
val totalLength = lineLengths.reduce((a, b) => a + b)  // action: triggers the computation
lineLengths.persist()                                  // keep lineLengths in memory for later reuse
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect()
SQL Datasets and DataFrames
- A Dataset is a distributed collection of data.
- A DataFrame is a Dataset organized into named columns.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
df.printSchema()
df.select("name").show()
df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").count().show()
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
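The notes show DataFrames only; a typed Dataset can be built the same way (a sketch, reusing people.json and the spark.implicits._ import from above):

case class Person(name: String, age: Long)
// a Dataset from a local collection
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// a typed Dataset from the same JSON file as the DataFrame above
val peopleDS = spark.read.json("examples/src/main/resources/people.json").as[Person]
peopleDS.show()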
Playing With Genomic Data
val genome = sc.textFile("/home/arnold/courses/csc492/sequence2")
val c = genome.map(_=>1)
c.reduce(_+_)                             // number of lines in the file
val d = genome.map((_,1))                 // (line, 1) pairs (superseded just below)
val d = genome.map(x=>x.split(" ")(3))    // fourth whitespace-separated field of each line
val e = d.map(x=>(x,1))
e.take(100)
e.reduceByKey((x,y)=>x+y)                 // count occurrences of each distinct value
val f = e.reduceByKey((x,y)=>x+y)
f.collect()
Using Dataframes, Datasets and Spark SQL
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
import spark.implicits._
val genomeds = spark.read.format("csv").option("sep", " ").option("inferSchema", "true").option("header", "true").load("/home/arnold/courses/csc492/sequence2")
genomeds.printSchema()
val genomeds = spark.read.format("csv").option("sep", " ").option("inferSchema", "true").option("header", "false").load("/home/arnold/courses/csc492/sequence2")
genomeds.printSchema()                    // with header=false the columns are auto-named _c0, _c1, ...
genomeds.groupBy("_c3").count().show()    // same per-value counts as the RDD version above
val result = genomeds.groupBy("_c3").count()
result.createOrReplaceTempView("v")
spark.sql("select * from v").show()