CSC409 Scalable Computing
Introduction
Basic Tasks
Distributed Systems
Example
Google
Facebook
Yahoo
URL Shortener: https://bitly.com/
Distributed Systems
See Distributed systems for fun and profit,
which supplied the outline for this part of the lecture.
Goals of Distributed Systems
Scalability
is the ability of a system, network, or process to handle a
growing amount of work in a capable manner, or its ability to
be enlarged to accommodate that growth.
- Size scalability: adding more nodes should make the
system linearly faster; growing the dataset should not increase
latency.
- Geographic scalability: it should be possible to use
multiple data centers to reduce the time it takes to respond to
user queries, while dealing with cross-data-center latency in
some sensible manner.
- Administrative scalability: adding more nodes should
not increase the administrative costs of the system (e.g. the
administrators-to-machines ratio).
Performance
is characterized by the amount of useful work
accomplished by a computer system compared to the time and
resources used.
- Low latency, measured as response time for a given piece of work.
Definition
Latency: the state of being latent; the delay between the
initiation of something and its occurrence.
This is an online concept: requests arrive as a stream. One can
measure the average, maximum, or minimum latency; a high
percentile is usually best for online applications, but even
that has pitfalls. See, for example,
How NOT to Measure Latency.
Examples

At 9:30, 99% of requests had less than 850ms latency. What about
the other 1%?
- High throughput (rate of processing work), measured
as Transactions Per Second. This is potentially an
offline/batch concept.
- Low utilization of computing resource(s)
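The percentile point above can be sketched in a few lines. This is an illustrative toy, not a production latency tracker (real systems use histograms such as HdrHistogram); the sample values are made up, and the 850ms figure from the example is just one of them:

```java
import java.util.Arrays;

// Sketch: nearest-rank percentile over a batch of latency samples.
// Sample values are invented for illustration.
public class Percentile {
    // Smallest sample such that at least p percent of samples are <= it.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {120, 95, 300, 850, 110, 105, 98, 2500, 130, 101};
        System.out.println("p50 = " + percentile(samples, 50) + "ms");
        System.out.println("p99 = " + percentile(samples, 99) + "ms");
    }
}
```

Note how the single 2500ms outlier dominates the p99 figure while leaving the median untouched; that is exactly the "what about the other 1%?" problem.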
Availability
is the proportion of time a system is in a functioning
condition. If a user cannot access the system, it is said to be
unavailable.
- Availability = uptime / (uptime + downtime)
- Availability is a measure of fault tolerance.
Example
Availability % How much downtime is allowed per year?
--------------------------------------------------------------
90% ("one nine") More than a month
99% ("two nines") Less than 4 days
99.9% ("three nines") Less than 9 hours
99.99% ("four nines") Less than an hour
99.999% ("five nines") ~ 5 minutes
99.9999% ("six nines") ~ 31 seconds
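The downtime column follows directly from the availability formula. A quick sketch deriving it (minutes per year, ignoring leap years):

```java
// Sketch: deriving the allowed-downtime column of the table above
// from Availability = uptime / (uptime + downtime).
public class Downtime {
    public static void main(String[] args) {
        double minutesPerYear = 365.0 * 24 * 60;  // 525,600
        double[] nines = {90, 99, 99.9, 99.99, 99.999, 99.9999};
        for (double a : nines) {
            double downMin = (1 - a / 100.0) * minutesPerYear;
            System.out.printf("%.4f%% -> %.1f minutes/year%n", a, downMin);
        }
    }
}
```

For example, "five nines" allows about 5.3 minutes of downtime per year, matching the table.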
Fault tolerance
is the ability of a system to behave in a well-defined manner
once faults occur.
Improving Performance
Performance Improvement Words of Wisdom
Simple platform optimization (no dev $ required!!):
- Better, faster CPU
- CPU with more cache
- CPU with more cores (let the OS take care of parallelizing)
- Faster, more RAM
- Faster, larger disk, SSD, but be aware ...
Network is the new storage bottleneck
- Faster, larger network
Algorithm optimization: O(large) -> O(small)
Data structure optimization
- http://bigocheatsheet.com/
- Lists to arrays/trees
- Arrays to dictionaries
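A minimal sketch of the "arrays to dictionaries" point: the same membership query is an O(n) scan against a list but an expected O(1) probe against a hash set. Sizes here are arbitrary and the nanosecond timings will vary by machine:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: same membership query, different data structure, different big-O.
public class LookupCost {
    public static void main(String[] args) {
        int n = 100_000;
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) { list.add(i); set.add(i); }

        long t0 = System.nanoTime();
        boolean inList = list.contains(n - 1);   // O(n) linear scan
        long t1 = System.nanoTime();
        boolean inSet = set.contains(n - 1);     // O(1) expected hash probe
        long t2 = System.nanoTime();

        System.out.println("list scan:  " + (t1 - t0) + " ns, found=" + inList);
        System.out.println("hash probe: " + (t2 - t1) + " ns, found=" + inSet);
    }
}
```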
Programming language optimization
Performance Tuning: C/Java
- Don't optimize/tune early.
- Profile first, then tune.
Example
Some different algorithms for determining Primality are in
Primes.java
javac Primes.java
java -Xverify:none Primes
jvisualvm
# Now connect to the running instance of Primes in jvisualvm
- Local is better
- Less I/O is better
- Loop Jamming
- Loop unrolling
- Inline small function calls
- Use processor efficient datatypes: int instead of float,
unsigned int instead of int.
- Compare to 0 instead of i<m (count down)
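Loop jamming and unrolling from the list above can be sketched as follows. Note that modern compilers and the JVM JIT often apply these transformations automatically, which is one more reason to profile before hand-tuning:

```java
// Sketch: loop jamming and loop unrolling on a toy array workload.
public class LoopTuning {
    public static void main(String[] args) {
        int n = 8;
        int[] a = new int[n], b = new int[n];

        // Before jamming: two separate passes over the same index range.
        // for (int i = 0; i < n; i++) a[i] = i * i;
        // for (int i = 0; i < n; i++) b[i] = a[i] + 1;

        // After jamming: one pass executes both loop bodies.
        for (int i = 0; i < n; i++) {
            a[i] = i * i;
            b[i] = a[i] + 1;
        }

        // Unrolling: four elements per iteration means fewer
        // loop-condition checks (assumes n is a multiple of 4).
        int sum = 0;
        for (int i = 0; i < n; i += 4) {
            sum += b[i] + b[i + 1] + b[i + 2] + b[i + 3];
        }
        System.out.println("sum = " + sum);
    }
}
```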
Multithreading:
Speeding up Disk I/O
Monitoring Overall System Performance
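As a minimal illustration of multithreading, here is a sketch of splitting one computation (a sum) across several worker threads so multiple cores share the work; the thread count and array size are arbitrary:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: partition an array sum across worker threads.
public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int nThreads = 4;
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[nThreads];
        int chunk = data.length / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? data.length : lo + chunk;
            workers[t] = new Thread(() -> {
                long local = 0;                 // accumulate locally ...
                for (int i = lo; i < hi; i++) local += data[i];
                total.addAndGet(local);         // ... one contended add per thread
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println("sum = " + total.get());
    }
}
```

Accumulating into a thread-local variable and touching the shared counter once per thread avoids contention on every element.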
DB/Persistent Store Optimization:
- The best I/O is no I/O
- Use RDBMS
- Change Schema
- Compress data
- Partition data based on use case
- Improve queries, use indexes, optimize schema
- Mirror database
- Change type of database
Time/Space tradeoffs:
- Spend some extra space to save some time
- Dynamic Programming vs Recursion
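The dynamic-programming-vs-recursion tradeoff can be sketched with Fibonacci: naive recursion recomputes subproblems exponentially often, while spending O(n) extra space on a memo table brings the time down to O(n):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the time/space tradeoff: memoization spends space to save time.
public class Fib {
    static long calls = 0;

    static long naive(int n) {
        calls++;                                   // count the redundant work
        return n < 2 ? n : naive(n - 1) + naive(n - 2);
    }

    static long memo(int n, Map<Integer, Long> seen) {
        if (n < 2) return n;
        Long cached = seen.get(n);                 // O(n) space buys O(n) time
        if (cached != null) return cached;
        long v = memo(n - 1, seen) + memo(n - 2, seen);
        seen.put(n, v);
        return v;
    }

    public static void main(String[] args) {
        System.out.println("fib(30) = " + naive(30) + " in " + calls + " calls");
        System.out.println("fib(30) = " + memo(30, new HashMap<>()) + " with memoization");
    }
}
```

The naive version makes millions of calls for fib(30); the memoized version makes well under a hundred.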
Distributed Systems
Add more...
- Add more CPU: GPU
- Add more hosts
- Add connectivity
- Add more storage
- But...
Diminishing Returns

Expensive Hardware Doesn't Pay
Replication vs Partitioning
Replication and Partitioning

Example: Cassandra
Architecture

- Many nodes cooperate to store data.
- SCALABLE STORAGE in terms of size (=total data stored),
latency and throughput.
- Client connects to any node in the cluster. The connected node is the
coordinator for requests.
- Coordinator hashes keys to find the main node responsible for
the data, and also knows the replica nodes.
- Load balancing done without any central server, since every node can act as coordinator and has the same hash function
as well as knowledge of the replication strategy.
- Replicated data results in FAULT TOLERANCE+AVAILABILITY
- Sacrifice strong consistency for eventual, tunable
consistency: a tradeoff with availability.
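The coordinator's key-to-node mapping above can be sketched with a consistent-hash ring. This is an illustrative toy, not Cassandra's actual implementation (Cassandra uses Murmur3 and virtual nodes); the node names and replication factor here are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: any node can map a key to its replica set,
// because every node shares the hash function and replication strategy.
public class Ring {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int replicationFactor;

    Ring(List<String> nodes, int replicationFactor) {
        this.replicationFactor = replicationFactor;
        for (String node : nodes) ring.put(hash(node), node);
    }

    static int hash(String s) {
        return s.hashCode() & 0x7fffffff;  // stand-in for a real hash like Murmur3
    }

    // Main replica: first node clockwise from the key's hash position.
    // Further replicas: the next distinct nodes around the ring.
    List<String> replicasFor(String key) {
        List<String> replicas = new ArrayList<>();
        for (String node : ring.tailMap(hash(key)).values())
            if (replicas.size() < replicationFactor) replicas.add(node);
        for (String node : ring.values())   // wrap around past the ring's end
            if (replicas.size() < replicationFactor && !replicas.contains(node))
                replicas.add(node);
        return replicas;
    }

    public static void main(String[] args) {
        Ring r = new Ring(List.of("node-a", "node-b", "node-c", "node-d"), 3);
        System.out.println("key1 -> " + r.replicasFor("key1"));
        System.out.println("key2 -> " + r.replicasFor("key2"));
    }
}
```

Because the mapping is pure computation over shared state, any node reached by the client can act as coordinator, which is why no central load balancer is needed.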
READ CONSISTENCY LEVELS
ALL Returns the most recent data from all replicas.
QUORUM Returns the most recent data from the majority of replicas.
TWO Returns data from the two closest replicas.
ONE Returns data from the nearest replica.
WRITE CONSISTENCY LEVELS
ALL Confirmed write on all replicas.
QUORUM Confirmed write on majority of replicas.
TWO Confirmed write on two replicas.
ONE Confirmed write on one replica.
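The read and write levels interact: with N replicas, read level touching R replicas, and write level touching W, a read is guaranteed to overlap at least one replica holding the latest write when R + W > N. A sketch of that rule, assuming a typical replication factor of 3:

```java
// Sketch: with N replicas, R readers and W writers overlap on at
// least one up-to-date replica exactly when R + W > N.
public class Quorum {
    static boolean stronglyConsistent(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;  // replication factor (illustrative)
        System.out.println("QUORUM/QUORUM (2+2): " + stronglyConsistent(n, 2, 2));
        System.out.println("ONE/ONE (1+1): " + stronglyConsistent(n, 1, 1));
        System.out.println("ONE/ALL (1+3): " + stronglyConsistent(n, 1, 3));
    }
}
```

QUORUM reads plus QUORUM writes give strong consistency; ONE/ONE maximizes availability and latency at the cost of possibly stale reads.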
See also A Primer on Database Replication