CSC409 Scalable Computing
Introduction
Basic Tasks
Distributed Systems
Example
Google
Facebook
Yahoo
URL Shortener: https://bitly.com/
Distributed Systems
See Distributed systems for fun and profit,
which supplied the outline for this part of the lecture.
Goals of Distributed Systems
Scalability
is the ability of a system, network, or process to handle a
growing amount of work in a capable manner, or its ability to
be enlarged to accommodate that growth.
- Size scalability: adding more nodes should make the
system linearly faster; growing the dataset should not increase
latency.
- Geographic scalability: it should be possible to use
multiple data centers to reduce the time it takes to respond to
user queries, while dealing with cross-data-center latency in
some sensible manner.
- Administrative scalability: adding more nodes should
not increase the administrative costs of the system (e.g. the
administrators-to-machines ratio).
Performance
is characterized by the amount of useful work
accomplished by a computer system compared to the time and
resources used.
- Low latency, measured as response time for a given piece of work.
Definition
Latency: the state of being latent; the delay between the
initiation of something and its occurrence.
This is an online concept: requests arrive as a stream. One can
measure the average, maximum, or minimum latency; a high
percentile is usually best for online applications, but even
that has pitfalls. See, for example,
How NOT to Measure Latency.
Examples

At 9:30, 99% of requests had less than 850ms latency. What about
the other 1%?
- High throughput (rate of processing work), measured
as Transactions Per Second. This is potentially an
offline/batch concept.
- Low utilization of computing resource(s)
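The percentile point above can be sketched in a few lines. This is an illustrative toy, not a production latency tracker (real systems use histograms such as HdrHistogram); the sample values are made up, and the 850ms figure from the example is just one of them:

```java
import java.util.Arrays;

// Sketch: nearest-rank percentile over a batch of latency samples.
// Sample values are invented for illustration.
public class Percentile {
    // Smallest sample such that at least p percent of samples are <= it.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {120, 95, 300, 850, 110, 105, 98, 2500, 130, 101};
        System.out.println("p50 = " + percentile(samples, 50) + "ms");
        System.out.println("p99 = " + percentile(samples, 99) + "ms");
    }
}
```

Note how the single 2500ms outlier dominates the p99 figure while leaving the median untouched; that is exactly the "what about the other 1%?" problem.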
Availability
is the proportion of time a system is in a functioning
condition. If a user cannot access the system, it is said to be
unavailable.
- Availability = uptime / (uptime + downtime)
- Availability is a measure of fault tolerance.
Example
Availability % How much downtime is allowed per year?
--------------------------------------------------------------
90% ("one nine") More than a month
99% ("two nines") Less than 4 days
99.9% ("three nines") Less than 9 hours
99.99% ("four nines") Less than an hour
99.999% ("five nines") ~ 5 minutes
99.9999% ("six nines") ~ 31 seconds
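The downtime column follows directly from the availability formula. A quick sketch deriving it (minutes per year, ignoring leap years):

```java
// Sketch: deriving the allowed-downtime column of the table above
// from Availability = uptime / (uptime + downtime).
public class Downtime {
    public static void main(String[] args) {
        double minutesPerYear = 365.0 * 24 * 60;  // 525,600
        double[] nines = {90, 99, 99.9, 99.99, 99.999, 99.9999};
        for (double a : nines) {
            double downMin = (1 - a / 100.0) * minutesPerYear;
            System.out.printf("%.4f%% -> %.1f minutes/year%n", a, downMin);
        }
    }
}
```

For example, "five nines" allows about 5.3 minutes of downtime per year, matching the table.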
Fault tolerance
is the ability of a system to behave in a well-defined manner
once faults occur.
Improving Performance
Performance Improvement Words of Wisdom
Simple platform optimization (no dev $ required!!):
- Better, faster CPU
- CPU with more cache
- CPU with more cores (let the OS take care of parallelizing)
- Faster, more RAM
- Faster, larger disk, SSD, but be aware ...
Network is the new storage bottleneck
- Faster, larger network
Algorithm optimization: O(large) -> O(small)
Data structure optimization
- http://bigocheatsheet.com/
- Lists to arrays/trees
- Arrays to dictionaries
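A minimal sketch of the "arrays to dictionaries" point: the same membership query is an O(n) scan against a list but an expected O(1) probe against a hash set. Sizes here are arbitrary and the nanosecond timings will vary by machine:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: same membership query, different data structure, different big-O.
public class LookupCost {
    public static void main(String[] args) {
        int n = 100_000;
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) { list.add(i); set.add(i); }

        long t0 = System.nanoTime();
        boolean inList = list.contains(n - 1);   // O(n) linear scan
        long t1 = System.nanoTime();
        boolean inSet = set.contains(n - 1);     // O(1) expected hash probe
        long t2 = System.nanoTime();

        System.out.println("list scan:  " + (t1 - t0) + " ns, found=" + inList);
        System.out.println("hash probe: " + (t2 - t1) + " ns, found=" + inSet);
    }
}
```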
Programming language optimization
Performance Tuning: C/Java
- Don't optimize/tune early.
- Profile first, then tune.
Example
Some different algorithms for determining Primality are in
Primes.java
javac Primes.java
java -Xverify:none Primes
jvisualvm
# Now connect to the running instance of Primes in jvisualvm
- Local is better
- Less I/O is better
- Loop Jamming
- Loop unrolling
- Inline small function calls
- Use processor efficient datatypes: int instead of float,
unsigned int instead of int.
- Compare to 0 instead of i<m (count down)
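Loop jamming and unrolling from the list above can be sketched as follows. Note that modern compilers and the JVM JIT often apply these transformations automatically, which is one more reason to profile before hand-tuning:

```java
// Sketch: loop jamming and loop unrolling on a toy array workload.
public class LoopTuning {
    public static void main(String[] args) {
        int n = 8;
        int[] a = new int[n], b = new int[n];

        // Before jamming: two separate passes over the same index range.
        // for (int i = 0; i < n; i++) a[i] = i * i;
        // for (int i = 0; i < n; i++) b[i] = a[i] + 1;

        // After jamming: one pass executes both loop bodies.
        for (int i = 0; i < n; i++) {
            a[i] = i * i;
            b[i] = a[i] + 1;
        }

        // Unrolling: four elements per iteration means fewer
        // loop-condition checks (assumes n is a multiple of 4).
        int sum = 0;
        for (int i = 0; i < n; i += 4) {
            sum += b[i] + b[i + 1] + b[i + 2] + b[i + 3];
        }
        System.out.println("sum = " + sum);
    }
}
```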
Multithreading:
Speeding up Disk I/O
Monitoring Overall System Performance
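As a minimal illustration of multithreading, here is a sketch of splitting one computation (a sum) across several worker threads so multiple cores share the work; the thread count and array size are arbitrary:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: partition an array sum across worker threads.
public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int nThreads = 4;
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[nThreads];
        int chunk = data.length / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? data.length : lo + chunk;
            workers[t] = new Thread(() -> {
                long local = 0;                 // accumulate locally ...
                for (int i = lo; i < hi; i++) local += data[i];
                total.addAndGet(local);         // ... one contended add per thread
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println("sum = " + total.get());
    }
}
```

Accumulating into a thread-local variable and touching the shared counter once per thread avoids contention on every element.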
DB/Persistent Store Optimization:
- The best I/O is no I/O
- Use RDBMS
- Change Schema
- Compress data
- Partition data based on use case
- Improve queries, use indexes, optimize schema
- Mirror database
- Change type of database
Time/Space tradeoffs:
- Spend some extra space to save some time
- Dynamic Programming vs Recursion
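The dynamic-programming-vs-recursion tradeoff can be sketched with Fibonacci: naive recursion recomputes subproblems exponentially often, while spending O(n) extra space on a memo table brings the time down to O(n):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the time/space tradeoff: memoization spends space to save time.
public class Fib {
    static long calls = 0;

    static long naive(int n) {
        calls++;                                   // count the redundant work
        return n < 2 ? n : naive(n - 1) + naive(n - 2);
    }

    static long memo(int n, Map<Integer, Long> seen) {
        if (n < 2) return n;
        Long cached = seen.get(n);                 // O(n) space buys O(n) time
        if (cached != null) return cached;
        long v = memo(n - 1, seen) + memo(n - 2, seen);
        seen.put(n, v);
        return v;
    }

    public static void main(String[] args) {
        System.out.println("fib(30) = " + naive(30) + " in " + calls + " calls");
        System.out.println("fib(30) = " + memo(30, new HashMap<>()) + " with memoization");
    }
}
```

The naive version makes millions of calls for fib(30); the memoized version makes well under a hundred.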
Distributed Systems
Add more...
- Add more CPU: GPU
- Add more hosts
- Add connectivity
- Add more storage
- But...
Diminishing Returns

Expensive Hardware Doesn't Pay
Replication vs Partitioning
Replication and Partitioning

Example: Cassandra
Architecture

- Many nodes cooperate to store data.
- SCALABLE STORAGE in terms of size (=total data stored),
latency and throughput.
- Client connects to any node in the cluster. The connected node is the
coordinator for requests.
- Coordinator hashes keys to find the main node responsible for
the data, and also knows the replica nodes.
- Load balancing done without any central server, since every node can act as coordinator and has the same hash function
as well as knowledge of the replication strategy.
- Replicated data results in FAULT TOLERANCE+AVAILABILITY
- Sacrifice strong consistency for eventual, tunable
consistency: a tradeoff with availability.
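The coordinator's key-to-node mapping above can be sketched with a consistent-hash ring. This is an illustrative toy, not Cassandra's actual implementation (Cassandra uses Murmur3 and virtual nodes); the node names and replication factor here are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: any node can map a key to its replica set,
// because every node shares the hash function and replication strategy.
public class Ring {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int replicationFactor;

    Ring(List<String> nodes, int replicationFactor) {
        this.replicationFactor = replicationFactor;
        for (String node : nodes) ring.put(hash(node), node);
    }

    static int hash(String s) {
        return s.hashCode() & 0x7fffffff;  // stand-in for a real hash like Murmur3
    }

    // Main replica: first node clockwise from the key's hash position.
    // Further replicas: the next distinct nodes around the ring.
    List<String> replicasFor(String key) {
        List<String> replicas = new ArrayList<>();
        for (String node : ring.tailMap(hash(key)).values())
            if (replicas.size() < replicationFactor) replicas.add(node);
        for (String node : ring.values())   // wrap around past the ring's end
            if (replicas.size() < replicationFactor && !replicas.contains(node))
                replicas.add(node);
        return replicas;
    }

    public static void main(String[] args) {
        Ring r = new Ring(List.of("node-a", "node-b", "node-c", "node-d"), 3);
        System.out.println("key1 -> " + r.replicasFor("key1"));
        System.out.println("key2 -> " + r.replicasFor("key2"));
    }
}
```

Because the mapping is pure computation over shared state, any node reached by the client can act as coordinator, which is why no central load balancer is needed.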
READ CONSISTENCY LEVELS
ALL Returns the most recent data from all replicas.
QUORUM Returns the most recent data from the majority of replicas.
TWO Returns data from the two closest replicas.
ONE Returns data from the nearest replica.
WRITE CONSISTENCY LEVELS
ALL Confirmed write on all replicas.
QUORUM Confirmed write on majority of replicas.
TWO Confirmed write on two replicas.
ONE Confirmed write on one replica.
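The read and write levels interact: with N replicas, read level touching R replicas, and write level touching W, a read is guaranteed to overlap at least one replica holding the latest write when R + W > N. A sketch of that rule, assuming a typical replication factor of 3:

```java
// Sketch: with N replicas, R readers and W writers overlap on at
// least one up-to-date replica exactly when R + W > N.
public class Quorum {
    static boolean stronglyConsistent(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;  // replication factor (illustrative)
        System.out.println("QUORUM/QUORUM (2+2): " + stronglyConsistent(n, 2, 2));
        System.out.println("ONE/ONE (1+1): " + stronglyConsistent(n, 1, 1));
        System.out.println("ONE/ALL (1+3): " + stronglyConsistent(n, 1, 3));
    }
}
```

QUORUM reads plus QUORUM writes give strong consistency; ONE/ONE maximizes availability and latency at the cost of possibly stale reads.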
See also A Primer on Database Replication