CS4HS Workshop: Algorithms and Data Structures
July 13–14, 2011
Department of Computer Science
University of Toronto
François Pitt <francois.pitt@utoronto.ca>
What to expect
Goal
- Review some useful algorithm design techniques and data structures.
Topics
- Motivated by programming contest questions.
- Selected from standard techniques and data structures.
- Open to adjustments based on interest and prior knowledge.
- Covered through general discussion and working on problems — on paper.
General references
- Wikipedia
is a good place to get started.
- Introduction to Algorithms
(2nd or 3rd edition) by
Cormen, Leiserson, Rivest & Stein
is known as "the bible" for algorithms and data structures!
However, beware: it is an awesome reference, but not
a good book for novice programmers to pick up and learn from.
Contact me!
- Please don't hesitate to contact me (see my email address above)
if you have any questions about algorithms or data structures.
I would be very happy to help you out,
whether it is with technical details or
to discuss issues of pedagogy for this material.
- At the same time, I would also love to hear from you
if you have any interesting experiences or ideas to share
regarding the material or its pedagogy!
Outline
Algorithms
Data Structures
Brute Force Algorithms
Discussion
What is it?
- Conceptually, "brute force" = "try every possibility".
Advantage(s)?
- Obvious correctness of solution.
Disadvantage(s)?
- It can be tricky to generate every possibility exactly once.
- Often, it is inefficient — more on this later.
Try it!
- Try the
"Stacks of Blocks"
problem from the DWITE programming contest.
- Ignore issues of input/output for now:
suppose you already have the data stored
in an appropriate data structure of your choice.
Regroup
High-level solution (in Python)?
H, block_list = process_input("DATA4.txt")
subset_list = generate_subsets(block_list)
minimum = 0
for subset in subset_list:
if sum(subset) == H and (minimum == 0 or len(subset) < minimum):
minimum = len(subset)
generate_output(minimum, "OUT4.txt")
Difficulty?
- The only complicated part is how to generate every possible subset of
blocks.
Solution (in Python)?
def generate_subsets(block_list):
if block_list == []:
return [[]]
else:
# Generate the list of subsets for every block except the last.
first_half = generate_subsets(block_list[:-1])
# Make a copy of every subset in the first half, but with the
# last block added to it.
second_half = []
for subset in first_half:
second_half.append(subset + [block_list[-1]])
return first_half + second_half
Further examples
- "Floor Plan"
problem from the DWITE programming contest.
[Back to outline]
Graph Traversal Algorithms
Discussion
What is this about?
- Graphs G = (V,E):
- Vertex (or node) set V =
{v1,...,vn}.
- Edge (or arc) set E =
{e1,...,em}, where
ej = (u,v)
for u,v ∈ V.
- Directed graph:
(u,v) ≠ (v,u),
self-loops (v,v) allowed, e.g.,

- Undirected graph:
(u,v) = (v,u),
self-loops disallowed, e.g.,

- Terminology: u is neighbour of v
if (v,u) ∈ E — note: different from
(u,v) ∈ E.
- Traversal algorithm:
systematic way to "visit" each vertex in G.
Two standard traversals:
depth-first search (DFS),
breadth-first search (BFS).
Depth-first search
- Given graph G = (V,E)
and first node to visit.
- Initialization:
- Mark each node "undiscovered".
- Push first node onto empty stack.
- While stack not empty:
- Remove top node v from stack.
- If v is "discovered", go back to start of loop.
- Mark v "discovered".
- Visit v (application-specific).
- Push all undiscovered neighbours of v onto stack.
Breadth-first search
- Given graph G = (V,E)
and first node to visit.
- Initialization:
- Mark each node "undiscovered".
- Append first node to empty queue.
- While queue not empty:
- Remove front node v from queue.
- If v is "discovered", go back to start of loop.
- Mark v "discovered".
- Visit v (application-specific).
- Append all undiscovered neighbours of v to queue.
Try it!
- Try the
"Portals Redux"
problem from the DWITE programming contest.
- Ignore issues of input/output for now:
suppose you already have the data stored
in an appropriate data structure of your choice.
Regroup
High-level solution?
- Read input into a 2D array.
- Scan input to generate graph representation of connections:
- For each number,
"flood-fill" to find all portal entrances
and store in a list.
- For each portal exit,
"flood-fill" to find all numbers and portal entrances
and store in a list.
Result is adjacency list representation of graph:
list of neighbours for every node.
Adjacency lists for sample input:
- 1: [a, e]
- 2: [f]
- 3: [c]
- 4: []
- 5: [b]
- a: [2, f]
- b: [3, c]
- c: [4]
- d: [d]
- e: [3, c]
- f: [1, a, e]
- Perform BFS/DFS on graph to generate
lists of reachable points of interest — but modified so that points of interest cannot be used
as intermediary nodes.
- Print sorted lists, in order.
What's "flood-fill"?
- Recursive algorithm to find all points reachable from a position.
- Initialize a separate array
visited
with the same dimensions as the floor plan,
with values False everywhere.
Then, call flood_fill(position).
flood_fill(start):
visited[start] = True
if left(start) is not a wall and is not visited:
flood_fill(left(start))
if up(start) is not a wall and is not visited:
flood_fill(up(start))
if right(start) is not a wall and is not visited:
flood_fill(right(start))
if down(start) is not a wall and is not visited:
flood_fill(down(start))
(Really, this is just a DFS.)
Try it!
- Try Problem J5 (Knight Hop) from the
2010
CCC programming contest.
- Ignore issues of input/output for now:
suppose you already have the data stored
in an appropriate data structure of your choice.
Regroup
High-level solution?
- Perform BFS — not DFS — on the implicit graph defined by knight moves.
What's an "implicit graph"?
- A graph whose edges are not stored explicitly, but rather
computed from other information.
Why BFS and not DFS?
- Both are guaranteed to find all reachable nodes,
but BFS guarantees to do this using the smallest possible
number of edges — as required by the problem.
[Back to outline]
Dynamic Programming
Discussion
What is it?
- A general method for optimization problems whose solution exhibits
the following properties.
- Recursion:
- the problem can be solved by breaking it up into subproblems,
solving those subproblems recursively, and combining the
sub-solutions back into an overall solution.
- Optimal substructure:
- an optimal solution to the problem can always be constructed from
optimal solutions to subproblems, i.e.,
it is never necessary to solve some subproblems sub-optimally
in order to achieve an overall optimal solution.
- Simple subproblems:
- subproblems can be characterized precisely
using a fixed number of "indices".
- Overlapping subproblems:
- subproblems recur multiple times as part of
optimal solutions of the top-level problem.
Classic example: Fibonacci numbers
- Given n, compute Fn, where
F0 = 0, F1 = 1, and
for n ≥ 2, Fn =
Fn-1 +
Fn-2.
- Not an optimization problem, but it does exhibit
- Recursion:
- by definition!
- Optimal substructure:
- (not applicable).
- Simple subproblems:
- each subproblem is specified by the single value n.
- Overlapping subproblems:
- Fn depends on
Fn-1 and
Fn-2, which depend on
Fn-2 and
Fn-3 (twice) and
Fn-4, etc.
Why not just use recursion?
def fib(n):
if n == 0: return 0
if n == 1: return 1
return fib(n-1) + fib(n-2)
Problem!
fib(5):
fib(4):
fib(3):
fib(2):
fib(1)
fib(0)
fib(1)
fib(2):
fib(1)
fib(0)
fib(3):
fib(2):
fib(1)
fib(0)
fib(1)
- Notice how some calls are evaluated multiple times,
because of overlapping subproblems.
In fact, it can be shown that fib(1) will be called
an exponential number of times
(as a function of n) — this is called combinatorial explosion.
Solution
- Idea: store "solutions" to subproblems in a table,
so they can be looked up instead of recomputed.
- Table layout depends on problem, but will typically be an array.
- Details of what is stored for each solution matter:
don't want to run out of memory,
so usually don't store complete solution.
- Values are typically computed in a "bottom-up" fashion:
fill in solutions for small subproblems first,
and work our way up to the overall problem.
- In the case of Fibonacci numbers, we get the following algorithm.
def fib2(n):
L = [0, 1]
for i in range(n): # for i = 0,1,...,n-1
L.append(L[-2] + L[-1])
return L[n]
Try it!
- Try the
"Stacks of Blocks"
problem again.
- Ignore issues of input/output for now:
suppose you already have the data stored
in an appropriate data structure of your choice.
Regroup
High-level solution?
- Is it recursive?
- Yes: solution either does not use last block
(fewer blocks, same target height) or it uses last block
(fewer blocks, reduced target height).
- Does it have optimal substructure?
- Yes:
if optimal solution does not use last block,
then it is optimal for rest of blocks;
if optimal solution uses last block,
then it is optimal for rest of blocks
and reduced height.
- Are subproblems "simple"?
- Yes:
each subproblem characterized by
target height (range [0..H]) and
index of last block available (range [0..S]).
- Do subproblems overlap?
- Yes:
small subproblems (e.g., with S = 1)
must be evaluated multiple times as part of
solving larger problems.
Step 1: define a table of solutions.
- In general, table indexed by subproblem.
In this case,
use indices
h ∈ [0..H] and
s ∈ [0..S]
(so table is 2D-array with
H+1 rows and S+1 columns).
- In general, array values = quantity to optimize.
In this case,
table[h,s] =
minimum number of blocks that stack to exactly h,
using only blocks 1..s —
table[h,s] = 0
if h cannot be reached exactly
using only blocks 1..s.
Step 2: write recursive equations for table values.
- This step is the most important — and
difficult!
- Start with base cases — values
that can be computed directly.
In this case,
- table[0,s] = 0
for s = 0,...,S — no block required to reach height 0;
- table[h,0] = S+1
for h = 1,...,H — height h unreachable without any block.
- Next, write a "recurrence" (i.e., recursive equation)
for the general case.
In this case,
for h = 1,...,H
and s = 1,...,S:
- let bs = height(block s);
- if bs > h,
let table[h,s] =
table[h,s−1] — block s cannot be used;
- else,
table[h,s] =
min(table[h,s−1],
table[h−bs,s−1] + 1) — either don't use block s, or use it
and do the best possible with the remaining blocks.
Step 3: write the algorithm.
- This is comparatively simple:
use nested loops to compute the table values
according to the recursive equations from the previous step.
- Make sure that entries are filled in so that
by the time an entry is computed,
the other entries it depends on have already been filled in.
In this case,
simply use an outer loop over s with
an inner loop over h.
Step 4: reconstruct the solution.
- By the time the algorithm is done,
what information do we have?
The entry table[H,S] tells us
the smallest number of blocks that stack up to height H
(or is greater than S if this is not possible).
- This is exactly the answer required for the problem!
But in general, we may want to know what blocks to use
to actually reach height H.
- This information is not stored anywhere directly.
But it can easily be reconstructed, because of
the way the entries in the table are computed:
- if table[H,S] =
table[H,S−1],
then block S was not used;
- else, block S was used.
- So the solution can be reconstructed with a simple loop like this:
blocks = []
h = H
for s = S,S-1,...,1:
if table[h,s] != table[h,s-1]:
blocks.append(s)
h = h - height of block s
Try it!
- Try the
"Playlist"
problem from the DWITE programming contest.
- Ignore issues of input/output for now:
suppose you already have the data stored
in an appropriate data structure of your choice.
Regroup
High-level solution?
- Is it recursive?
- Yes: solution either does not use last song or it does
(in which case capacity is reduced accordingly).
- Does it have optimal substructure?
- Yes:
if optimal solution does not use last song,
then it is optimal (highest total rating) for rest of songs
and original capacity;
if optimal solution uses last song,
then it is optimal for rest of songs
and reduced capacity.
- Are subproblems "simple"?
- Yes:
each subproblem characterized by
capacity (range [0..C]) and
index of last song considered (range [0..N]).
- Do subproblems overlap?
- Yes:
small subproblems (e.g., with N = 1)
must be evaluated multiple times as part of
solving larger problems.
Step 1: define a table of solutions.
- For indices
c ∈ [0..C] and
n ∈ [0..N],
let table[c,n] =
maximum total rating of songs selected from 1,...,n
that fit within capacity c.
Step 2: write recursive equations for table values.
- Base cases:
- table[0,n] = 0
for n = 0,...,N — no song can fit within capacity 0;
- table[c,0] = 0
for c = 0,...,C — no song to pick.
- General case:
for c = 1,...,C
and n = 1,...,N:
- let sn = size(song n)
and rn = rating(song n);
- if sn > c,
table[c,n] =
table[c,n−1] — song n cannot be used;
- else,
table[c,n] =
max(table[c,n−1],
table[c−sn,n−1] + rn) — either don't use song n,
or use it and do best possible with remaining capacity.
Step 3: write the algorithm.
- Simple nested loops (outer loop on n)
to compute table values according to the recurrence.
Step 4: reconstruct the solution.
- Not necessary for problem,
but would be done just like for previous problem.
Further examples
- Making change problem (see definition below).
[Back to outline]
Greedy Algorithms
Discussion
What is it?
- Construct solution piece-by-piece.
- At each step, make best "local" decision — ignore trade-offs.
Example: making change
- Input:
-
Coin denominations
1 = d1 < d2 <
... < dN;
amount A.
- Output:
-
Exact change for A, using as few total coins as possible.
Algorithm
- Give as many coins of denomination
dN as possible,
then as many coins of denomination
dN−1 as possible,
...
Problem
- Works fine for "standard" denominations
(d1 = 1¢,
d2 = 5¢,
d3 = 10¢,
d4 = 25¢).
But does not achieve minimum number of coins for other denominations
(e.g., 1¢, 10¢, 25¢).
Solution?
- Dynamic programming will work, based on the following recursion:
- either optimal solution does not use any coin with value
dN,
- or optimal solution uses one coin with value
dN, along with
a minimum number of coins for amount
A − dN.
How do we know when greedy will work?
- Required property:
for every denomination dn and
for every amount a with
dn ≤ a
< dn+1,
one of the best ways of making change for a uses
at least one coin of denomination
dn.
- This can be checked, on a tedious case-by-case basis,
for the standard denominations
(1¢, 5¢, 10¢, 25¢).
Try it!
Regroup
High-level solution
- Translate input into a graph representation.
Careful! We want a graph where each pen is a "node",
the outside is also one "node", and there are "edges"
between any two pens that share an edge.
- Pick edges with smallest cost, until every pen is connected — skip edges between pens that are
already connected.
Minimum Spanning Trees
Definition
- Given graph with edge costs (see example graph below),
find subset of edges that connects all vertices
and whose total cost is minimum.
Kruskal's algorithm
- Initialization:
- Sort the edges by non-decreasing cost.
- Initialize an empty tree (subset of edges).
- Repeat n−1 times
(where n is the number of vertices):
- Select an edge with smallest cost
whose endpoints are not already connected,
and add it to the partial tree.
Prim's algorithm
- Initialization:
- Pick a start vertex s.
The entire graph is partitioned into two sides:
vertices connected to s, and the rest.
- Initialize an empty tree (subset of edges).
- Repeat n−1 times:
- Select a minimum-cost edge between some vertex connected to
s and some unconnected vertex, and add it to the
tree.
Runtime
- O(n m), where
m is the number of edges in the graph.
- Can be improved to O(n log n)
through the use of appropriate data structures.
Argument of correctness
- Why do we need this?
Think about making change problem,
or minimum-cost path problem for graph above,
starting at node a...
- Hinges on property that the choice made by greedy at each step
is guaranteed to lead to an optimal solution.
- Formally, argued through an "exchange argument".
[Back to outline]
Divide-and-Conquer Algorithms
Discussion
What is it?
- Recursive algorithm where solution to overall problem is obtained
by splitting it up into subproblems,
solving each subproblem recursively,
and combining sub-solutions back into overall solution.
Example: Mergesort
The Master Theorem
- If T(n) is the worst-case running time
of a divide-and-conquer algorithm, it usually satisfies a
"recurrence relation" (recursive equation) of the form:
T(n) = O(1)
if n ≤ K
T(n) =
a T(n/b)
+ O(nd)
otherwise
where a, b, d, K
are constants determined by the algorithm.
- Then,
T(n) =
O(nd)
if d >
logb a,
T(n) = O(nd log n)
if d =
logb a,
T(n) = O(nlogb a)
if d <
logb a.
Try it!
- Consider problem of multiplying two large integers,
given as sequences of bits:
X =
xn−1 xn−2 ... x1 x0
Y =
yn−1 yn−2 ... y1 y0
- Standard iterative algorithm involves repeated additions
of appropriately shifted copies of X,
according to bits of Y.
This takes time O(n2).
- Idea:
let X1 =
xn−1 ... xn/2,
X0 =
xn/2−1 ... x0,
and define Y1 and Y0
similarly.
(If n is odd, pad X and Y with
an extra 0 on the left before doing this.)
- Now we have
X =
2n/2 X1
+ X0
Y =
2n/2 Y1
+ Y0
where multiplication by a power of 2 indicates shifting of bits.
- Based on this,
X Y = 2n X1 Y1
+ 2n/2 (X0 Y1
+ X1 Y0)
+ X0 Y0
- Recursive algorithm:
multiply(X, Y):
if n == 1:
return x * y # multiplication of 1-bit integers
else:
set lists X1, X0, Y1, Y0 as above
p1 = multiply(X1, Y1)
p2 = multiply(X1, Y0)
p3 = multiply(X0, Y1)
p4 = multiply(X0, Y0)
return 2**n * p1 + 2**(n/2) * (p2 + p3) + p4
- Runtime?
Satisfies recurrence T(n) =
4 T(n/2) +
O(n).
So by Master Theorem, T(n) =
O(n2) — no better than iterative algorithm!
Regroup
Trick
- Consider product
(X1 + X0) (Y1 + Y0).
This computes p1 + p2 + p3 + p4
(using variables from the program above)
with only one recursive multiplication.
Resulting algorithm and runtime
- Algorithm:
multiply(X, Y):
if n == 1:
return x * y # multiplication of 1-bit integers
else:
set lists X1, X0, Y1, Y0 as above
p1 = multiply(X1, Y1)
p2 = multiply(X0+X1, Y0+Y1)
p3 = multiply(X0, Y0)
return 2**n * p1 + 2**(n/2) * (p2 - p1 - p3) + p3
- Runtime?
Satisfies recurrence T(n) =
3 T(n/2) +
O(n).
So by Master Theorem, T(n) =
O(nlog2 3).
- This is much better than the previous algorithm, because
log2 3 = 1.58...
How did this happen?
- Saving one recursive call propagates through all of the sub-calls
and results in substantial savings,
as long as this does not increase the time taken
outside recursive calls by more than a constant factor.
[Back to outline]
Hash Tables
Discussion
Direct-address table
- Problem: count frequencies of characters in a text file?
- Easy solution: use an array with 256 positions
(one for each possible character, assuming 8-bit ASCII)
and increment counts as each character is read.
- Feature: each character provides a direct address into the array.
Problem
- Consider analogous situation:
count frequencies of integer values in a data file?
- Direct-address table not practical anymore:
232 = 4,294,967,296 possible integer values
would require an array of that size — even if data file only contains a few hundred values!
Solution
- Use smaller array (hash table)
and hash function that maps each integer value
to a valid array index.
Hash functions
- In general, two-step:
arbitrary data ⇒ number ⇒ table index.
- Last step usually a simple scaling of the number
into the range [0..M−1]
(for a hash table with M slots).
- Term comes from cooking, where "hash" is all cut up and mixed up — a good hash function tries to "mix"
the bits in the representation of its input so that (ideally)
changing even just one bit of the input makes roughly half
the bits in the hash value different
(this property is called "avalanching").
Collisions
- No matter how good the hash function,
collisions (different inputs hashing to same slot)
are unavoidable.
- Closed addressing uses a "probing sequence" to look for
another slot that can store the value.
- Open hashing stores a simple linked list of values
at each slot.
Complexity
- Theoretically, can always degenerate
if every input happens to hash to the same slot.
- In practice, this never happens.
As long as load factor (fraction of table filled)
is kept low and hash function is reasonably good,
no more than a constant number of collisions is ever encountered
while searching for a value.
[Back to outline]
Heaps and Priority Queues
Discussion
Motivation
- A priority queue is a queue
(a first-in-first-out (FIFO) data structure)
where each element has a "priority" (a number)
and elements with smaller priority values move ahead of
elements with larger priority values — elements with the same priority are handled in FIFO order.
- A standard queue can be implemented simply and efficiently using
an array or a linked list.
But this does not work well for priority queues because of
the need to move elements around based on their priority.
Definition and properties
- A heap is a binary tree that satisfies two properties:
- The tree is complete, i.e.,
every level is full except (possibly) the last one,
and every leaf on the last level is as far left as possible.
- The values are heap-ordered, i.e.,
every value is less than or equal to its children.
- For example, the tree below is a heap:
but not this one (it is not complete):
nor this one (it is not heap-ordered):
- Intuitively, the heap-ordering property means that a heap
is partially ordered, with smaller values near the root
and larger values near the leaves — in fact,
the minimum value is guaranteed to always be at the root.
Implementation
- Binary trees are generally implemented using nodes
with pointers to children.
- But because heaps are complete trees,
a neat trick can be used to store all of the values
directly in an array, without any nodes or pointers:
store the root at index 1,
the root's children at 2 and 3,
etc.
This is pictured below:
- With this trick,
every heap node is represented as an index i,
and:
- parent(i) = i/2 (rounded down)
- left_child(i) = 2i
- right_child(i) = 2i + 1
(This can be modified to start at index 0,
though it makes the calculations a little more complicated.)
Operations and complexity
- To insert a new value into a heap:
- Add it to the end of the array
(at the first available leaf position
at the bottom of the tree),
and increment the heap size.
- This will preserve the "complete tree" structure of the heap.
- It may violate the heap-ordering property,
but this can be restored by
percolating the value up:
if the value is smaller than its parent,
swap it with its parent and repeat.
- To remove a value from a heap:
- Swap the value with the very last leaf — this preserves the "complete tree" structure — and decrement the heap size.
- Percolate the new value (at the position that was deleted)
up or down, depending on how it compares with its parent and
children, until heap-ordering is restored.
- Both operations take time O(log n)
because they travel along one path in the tree.
[Back to outline]
© Copyright 2011 François Pitt
<francois.pitt@utoronto.ca>
last updated at 17:13 (EDT) on Sat 16 Jul 2011