CSC 469F / CSC 2208F | Assignment 1 | Fall 2007
Due: 4 p.m., October 29, 2007.
- Introduction
- Assignment Logistics
- Submission Instructions
GOAL: To explore advanced synchronization and
cache effects through the development of a parallel memory allocator.
Overview
The allocation and management of blocks of virtual memory is a key
function required by nearly all modern programs. Multi-threaded
parallel programs are no exception, however, the use of a memory
allocator designed for sequential programs can become a serious
bottleneck limiting the performance and scalability of the
application. Fast operation and efficient use of memory are the
typical criteria by which a sequential allocator is evaluated;
parallel allocators must also take scalability into account.
The simplest way to adapt a sequential allocator for use by
parallel programs -- placing a single lock around the bodies of the
malloc and free functions -- serializes all memory
allocation and deallocation requests, limiting scalability. Even if
the allocator is designed to allow individual malloc and free requests
to proceed (mostly) in parallel, scalability problems can remain due
to cache effects.
In their paper on Hoard, Berger et al. list the following features that are required for
scalable and memory-efficient allocators (paraphrased from the paper):
- Speed: A parallel memory allocator should perform malloc and free
operations about as fast as a good serial allocator, thereby guaranteeing good
performance for single-threaded programs, or when multi-threaded programs run
on a single processor.
- Scalability: The performance of the allocator should scale linearly with the number
of processors in the system to ensure scalable application performance.
- False sharing avoidance: The allocator should not introduce false
sharing of cache lines. False sharing arises when threads on distinct
processors inadvertently share data on the same cache line. This can
happen if the allocator satisfies requests from threads on different
processors with blocks of memory that fall on the same cache line.
Two types of false sharing are of interest. Active false
sharing happens when there is no sharing in the application but
the allocator causes cache lines to be shared. Passive false
sharing happens when the application first spreads blocks on the
same cache line across multiple processors, the blocks are freed, and
the allocator subsequently allows threads on these processors to reuse
blocks that they freed. This pattern maintains the false sharing due
to future allocations long after the original sharing introduced by
the application has ended.
- Low fragmentation: Fragmentation is defined as the maximum amount
of memory allocated from the OS divided by the maximum amount of
memory required by the application. Excessive fragmentation can hurt
performance by causing poor data locality, thereby increasing cache
misses and in extreme cases, leading to paging. Parallel applications
will typically have higher memory demands, since multiple threads will
have some memory allocated at the same time. Some parallel allocators
however exhibit extreme fragmentation, increasing the memory
requirements from between a factor of P (the number of processors) to
unbounded consumption.
Your task in this assignment is to design and implement a parallel
memory allocator that (attempts to) provide the four features listed above.
1. Code Provided
We are providing a package with two sample serial allocators that have
been made thread-safe by the use of a single lock that surrounds all
malloc and free operations. The package also contains
some sample benchmarks that you can use to evaluate your allocator, and
some utility routines that you can and/or must use in your implementation.
The package can be found on cdf at /u/csc469h/fall/pub/A1/a1_distrib.tgz; the
package contains the following directories:
- allocators: contains one subdirectory for the implementation of each
allocator. Provided are (a) kheap - an allocator based on the OS/161 kernel heap allocator,
(b) cmu_malloc - an allocator developed for a CMU assignment using explicit free lists and
coalescing, and (c) libc_wrapper - thin wrappers around the standard libc implementations
of malloc and free
- benchmarks: contains one subdirectory for each benchmark.
Provided are: (a) threadtest - tests basic scalability of the
allocator, (b) cache-scratch - tests for passive false sharing, (c)
cache-thrash - tests for active false sharing, and (d) larson - simulates a server by having
each thread allocates and deallocate objects and then transfers some objects to other
threads to be freed. Each benchmark subdirectory (except larson) contains scripts to
run and graph the performance of the allocators provided, and a Makefile to create a
version of the benchmark linked to each of the different allocators. You will need to
modify these files to also build, run and plot your new allocator(s).
- util: contains utility routines for use in the benchmarks and allocators.
In mm_thread.c are operations related to threads (initializing a pthread_attr_t for
thread creation, determining the number of processors on a system, getting a thread id,
and pinning a particular thread to a particular CPU). In tsc.c are the cycle
counter routines from the last assignment. Finally in memlib.c are the routines
that your allocator must use to manage its heap.
- include: header files used by the benchmarks, allocators, and utility routines.
2. Specifications
Your allocator consists of the following three functions, which are declared in include/malloc.h:
int mm_init(void);
char *mm_malloc(size_t size);
void mm_free(void *ptr);
You will add a subdirectory to allocators for your
implementation, and fill in these function bodies in a file named
malloc.c. You will also need to modify the Makefile in the
allocators subdirectory to build your new allocator. Feel
free to add multiple new subdirectories as you experiment with
different designs. The only thing you will hand in, however, is the
malloc.c file for your final implementation. All your
code must be contained in the file named malloc.c; you are
not allowed to change any of the interfaces to the three functions
above. Of course you can define as many helper routines as you
like within this file.
Your allocator interacts with an arbitrary application program in
the following way. As part of its initialization phase, the
application calls your mm_init function to perform
initialization of the heap. You must allocate the necessary initial
heap area (starting with a call to mem_init to create the
chunk of memory that you can use), and initialize all structures that
you need. The application then makes a series of calls to
mm_malloc and mm_free. You may assume that any
programs we test with will always call mm_init before any
calls to mm_malloc/mm_free and before creating any threads.
You are allowed to use any of the functions from memlib.c
in your allocator:
- mem_init: Use this function to create the chunk of memory
that forms your heap. This functions uses the libc malloc to
obtain a 40MB chunk. All of the benchmark programs we use fit easily
within this space, but you may run into trouble if your allocator
allows unbounded fragmentation.
- mem_sbrk You use this
function to expand your heap area (within the overall 40MB limit!).
The lower and upper boundaries of the heap area are stored in the
global variables dseg_lo and dseg_hi respectively.
You are allowed to read these variables in your allocator, but you are
not allowed to modify them in any way. You must call
mem_sbrk to change the upper bound. This function accepts a
positive integer argument, which is the number of bytes by which to
increase the upper bound. The return value is the beginning of the
newly added heap area, or NULL if there wasn't enough memory left.
The interface of mem_sbrk is very similar to the
sbrk system call, and the general effect is the same.
However, you cannot decrease the heap area in size, only increase it,
so be careful how you call mem_sbrk. In effect, each time
you call mem_sbrk the value of dseg_hi is
incremented by the amount you request. It may be convenient for your
allocator to work with blocks of memory that are guaranteed to be
pagesize-aligned (or some other size that you choose). To help with
this, the initial heap is aligned on a page boundary, and you can
call mem_pagesize to find out the system page size.
- mem_usage: This is simply a shorthand that returns the current
size of the heap in bytes.
The functions in memlib.c are not generally thread-safe.
They are safe in the current allocators because they are only called
from within routines that are themselves protected by a lock. If
you use finer-grained locking, you must ensure that calls to
mem_sbrk are serialized by grabbing a suitable lock before
the call and releasing it after the return. You are not allowed
to modify the functions in memlib.c themselves.
The functions you are to implement are the following:
- mm_init: Before calling mm_malloc or
mm_free, the application program calls mm_init to
perform any necessary initializations, including the allocation of the
initial heap area. The return value should be -1 if there was a
problem with the initialization, 0 otherwise.
- mm_malloc: The mm_malloc routine returns a
pointer to an allocated region of at least size bytes. The
pointer must be aligned to 8 bytes, and the entire allocated region
should lie within the memory region from dseg_lo to
dseg_hi.
- mm_free: The mm_free routine is only guaranteed
to work when it is passed pointers to allocated blocks that were
returned by previous calls to mm_malloc. The
mm_free routine should all the block to teh pool of
unallocated blocks, making the memory available to future
mm_malloc calls.
The two allocators provided, kheap and cmu_malloc, both meet the
specification provided above, but they are deficient as parallel
allocators. Both will actively induce false sharing and neither is
scalable. The baseline sequential performance and memory utilization
of both could be improved.
3. Rules of Engagement
- As noted earlier, you must put all of your code in a file named malloc.c,
as this will be the only code that you hand in.
- You may not statically allocate large data structures for your heap. (For instance,
the original OS/161 kheap.c created a page-sized array of references to pages in the
heap as a global variable. In the implementation provided with this assignment, the
space for these references is also allocated from the heap.) In the extreme, all
state could be stored in the heap itself, at the beginning, with dseg_lo
providing the pointer to the heap's own data structures (This is the approach taken in
the cmu_malloc implementation). For this assignment, some globals variables are
acceptable, but more than 1K worth will be penalized.
- You may read any and all publications relating to parallel (or sequential) memory
allocation to help you understand the problem and your design. Here are a few suggestions
to get you started:
- You may not read the source code for any memory allocators other than the ones
provided in the assignment distribution. Hoard, Lea's allocator, serial and parallel
implementations of the libc allocators, and others are all publicly available. You must,
however, implement your own allocator; using ideas from the literature is fine, using
other people's code is not.
4. Grading Criteria
Your allocator will be graded on correctness (30%), performance (50%), and style (10%).
You must also provide a writeup describing your allocator, other alternatives that you
considered or tried, and an analysis of your allocator's performance. This writeup will
be worth the final 10%.
The performance of your allocator will be evaluated with respect
to the four features listed at the beginning of this document: speed
on sequential programs, scalability, false sharing avoidance, and
fragmentation. Scalability will be given slightly greater weight than
the other criteria; the exact formula to be used will be added later.
Since obtaining good measurements of your allocator's performance on a shared
SMP server will be difficult, if not impossible, we are setting up an 8-core timing
server for this assignment. The physical hardware of this machine is identical to
the CDF redwolf and greywolf servers, however access to it will be restricted to
ensure that your tests run with a minimum of interference. The timing server will
be available on Saturday. You will submit your malloc.c file to the server,
the provided benchmarks will be run, and you will be emailed the results. Details
on how to submit to the timing server will be provided once the server is operational.
What
- A report describing your allocator, alternatives considered or tried, and an
analysis of its performance.
- Your malloc.c file.
How
- Submit hardcopy of your report using the assignment drop box in BA 2220
- Submit your tar file using 'submit -c csc469h -a A1 malloc.c' on CDF.