Assignment 1: Concurrent Hash Tables

Check main assignment page for updates, corrections, and clarifications.

Please submit a printout of your code and documentation (comments in code explaining your strategy is fine, as is a separate document), using the dropbox in BA 2220.

Please also submit softcopy using the submit command on CDF (e.g. create a gzipped tarfile of all your code and any README or other documentation files, and then "submit -c csc469h -a A1 a1.tgz").

Objective

The goal is to look at synchronization and concurrency, and different ways of implementing it, with an eye on performance and correctness. To this end, you are asked to implement a concurrent hash table, intended to support a particular use described below.

Problem Description

For programs with data sets that exceed the capacity of physical memory, time stalled waiting for I/O (due to page faults) is a significant performance factor. One strategy to reduce the stall time is prefetching - predict the pages that will be needed and initiate the I/O ahead of time, such that no stall occurs when the page is actually used.

There are many ways to predict pages for prefetching; one approach relies on compiler analysis to automatically insert hints (in the form of prefetch system calls) into the application source code. Based on estimates of how long it takes to fetch a page from disk, the compiler analysis will attempt to insert prefetch instructions in the code such that they are expected to complete shortly before the original access to the page occurs. When this works successfully, the prefetches are effective, the stall time is hidden, and performance improves. When compiler-based prefetching is not successful, however, it can be very difficult to figure out what went wrong.

The main difficultly is that the prefetch requests are inserted into the program code by a static compiler analysis, but their behavior depends on dynamic interactions with the entire memory management system. Unexpected behavior could be the result of a flaw in the compiler or in the OS handling of the requests inserted by the compiler. Some possible problems include:

To gain insight into the causes of ineffective prefetching (and also to understand why something worked), we want to be able to track a page prefetch request from the point where the user-level application issues the prefetch, to the point where the application uses the prefetched page, or the prefetched page is evicted from memory (whichever comes first). During the "lifetime" of a prefetched page, it will go through several state changes in response to asynchronous events. The plan is to associate a small structure, called a prefetch record (or pf_rec), with each prefetched page. Every event that affects the page state will look up the page in a hash table (the page number is the hash key), and record some information in its prefetch record. The states, and the events that cause transitions are shown below:

 

Labels on arcs are events. In italics are responses that should be generated in handling an event, in addition to causing a state change. Not every possible transition is shown in this diagram, to reduce clutter and improve clarity. The states are:

  1. INITIAL - at the user-level, the application handles prefetch requests by passing work to a set of prefetching threads, which issue prefetch system calls. It is possible that the application hands off work to the prefetch threads and then touches a page that it wanted to prefetch before the prefetch thread even makes the system call. We would like to be able to observe this case, so we issue a special "SETUP" system call which initializes the state for the pages we expect to prefetch. (We do not initiate the prefetches themselves with this syscall because we want the main application thread to proceed to its regular work, and the prefeches to happen in parallel). Due to various limitation in the compiler analysis, we may attempt to prefetch the same page multiple times. A "SETUP" event for a page that already has a prefetch record should increment a counter in the existing record. Repeated "SETUP" events could occur with a prefetch record in any state but they do not cause state transitions. Exception A SETUP event when the pf_rec is in the RECLAIMED state will put the record in the INITIAL state again (and re-initialize the record). This transition is not shown in the diagram.
  2. INPROGRESS - A prefetch system call event for a page with a pf_rec in the INITIAL state will initiate I/O to bring the data into memory, and move the pf_rec into the INPROGRESS state.
  3. PF_READY - I/O for a prefetched page (where the prefetch started the I/O) has completed.
  4. BYPASSED - A page fault event occurred for a page with a pf_rec in the INITIAL state, indicating that the prefetch system call was not issued before the application touched the missing page. The I/O will be started by the page fault handler in this case.
  5. NOPF_READY - I/O for a faulted page (where the fault handler started the I/O) has completed. The expected prefetch system call has still not been issued.
  6. PF_STALLED - A page fault event occurred for a page and the I/O was already started by a prefetch system call, but the prefetch was too late and has not completed yet.
  7. NOPF_STALLED - A prefetch system call has occurred for a page between the time that a page fault started the I/O, and the time when that I/O completes. Note that PF_STALLED and NOPF_STALLED are equivalent (both represent a page with a prefetch and a fault, and still waiting for I/O). The only difference is the order in which the fault and the prefetch occurred.
  8. RECLAIMED - A page reclaim event (page selected as victim by paging daemon) occurred for a page that still had a pf_rec. This is very bad news, as it suggests we wanted to prefetch the page (we setup the pf_rec), but the page has been accessed, brought into memory, and now replaced and we still haven't seen the prefetch system call.
  9. CLEANUP - A very temporary state of affairs where we know that the pf_rec is no longer needed (for example, because we've matched the prefetch with a fault and the I/O is done). In CLEANUP, all we have to do is free the pf_rec (and remove it from the hash table). Note that CLEANUP is the only state with multiple entries.

To help with the tracking, the compiler includes an identifier, or tag with every prefetch instruction that it inserts in the code. One prefetch instruction can be responsible for a large number of prefetch requests (for example, if it is added to the body of a loop). While it is useful to know that (for example) 50% of the prefetches issued were too late, it is even more useful to know that 100% of prefetches issued by prefetch 1 were effective, and 100% of the prefetches issued by prefetch 2 were too late. Thus, we will maintain counters on a per-tag basis.

For each tag, we have a structure that counts the number of times each state was entered, the number of times CLEANUP was entered from each prior state (i.e., what was the state immediately before CLEANUP), and the number of times a prefetched page was reclaimed without being used (transitions from PF_READY to CLEANUP due to RECLAIM events).

We are willing to tolerate some stale data because in the real implementation many things can effect the exact order of events. (Also, in the real implementation, we would collect timestamps on all the transitions and maintain histograms of the durations of events to help understand events that happen close together. For example, if prefetches complete at nearly the same time as the page is touched, some will be counted as "PF_STALLED" and some will be "PF_READY" depending on the exact order of events. The time data would show that the "PF_READY" prefetches were not ready long before they were used, and the "PF_STALLED" ones were not late by much -- postprocessing of the data would probably group both into a "Nearly Correct" category. You do not have to worry about these details for the assignment)

We are not willing to tolerate the loss of events, or gross mis-categorization, however. For example, if a prefetch request has started (a prefetch record has been initialized), a page fault event should not be able to conclude that the page was not prefetched (unless the prefetch completed and the page has already been reclaimed). Similar, under no circumstances should two handlers initiate an I/O for the same page at the same time.

Basic Design

We will simulate a prefetching system using a "driver" process and a "handlers" process. The "handlers" represent the operating system responses to various events. The "driver" controls the simulation, and represents all the possible event sources (including system calls from the "application", I/O completion notices from the disk drives, and page replacements caused by the paging daemon). The "driver" and the "handlers" communicate with each other using System V IPC (msgsnd, msgrcv, etc.) to send messages for each event.

The "handlers" process consists of one or more threads to handle each type of event. Specifically, there is one handler thread for events associated with the main user-level application (SETUP system calls and page faults), one or more handler threads for prefetch requests, and a handler thread representing the paging daemon that deals with page reclamation events.

The "driver" process sends messages encoding an event to the appropriate handler thread. Messages will include a page number (if relevant, additional information like a prefetch tag and an action to be taken by the handler will be included). Using the page number, each handler will lookup the prefetch_record in the shared hash table. If one exists, the appropriate fields in the record will be updated (mostly this means updating the state field), and counters associated with the prefetch tag may also be updated. If there is no prefetch_record, most handlers will update counters of non-prefetched events (only the SETUP event will initialize a record if one does not already exist).

Some events are logically dependent on others (an "IODONE" cannot happen before the page is prefetched or faulted; a page cannot be reclaimed before it is either faulted into memory or the prefetch completes). To enforce this, your event handlers should send messages back to the simulation driver to let the driver know that dependent events can proceed.

Traces will end with a special "STOP" event, sent to the handler threads. When all the handler threads have exited, the parent thread in the handlers process will print overall statistics gathered during the run.

Synchronization:

There are may "correct" ways to deal with concurrent access to the hash table. There are also several types of updates that you should think about. For example, updates to the hash table structure itself (insertion of new prefetch_records and deletion of prefetch_records) and maintaining the integrity of the table vs. updates to the content of the prefetch records themselves, vs. updates to the per-tag counters. Some (but certainly not all) possibilities include:
  1. A global lock on the entire table, held whenever any access or update is made to the table or the prefetch_records stored in it.
  2. A lock for each bucket in the hash table, held whenever any access or update is made to the table, or the prefetch_records stored in it.
  3. Locks (global or per bucket) are held during insertions, deletions, and searches of the hash table, but searches return a pointer to a prefetch record, still stored in the table; no table locks are held during updates of the record itself (atomic updates are used instead). This will require some discipline to decide when it is safe to delete a prefetch_record. Natural quiescent states exist for each thread at the end of processing each event, so you might consider trying RCU's deferred deletion, or you might try a reference counting strategy.
  4. Locks (global or per bucket) are held during insertions, deletions, and searches of the hash table, but a successful search removes the prefetch_record from the table and returns it, so no other references can be obtained while the record is being updated. (This one actually isn't correct according to our requirements... Why?)

Start by getting the simulation working with the simplest possible locking discipline (the big giant lock on the entire hash table), to make sure you understand the states, events, and the actions to take in handling each event. Then try to improve the synchronization scheme.

Evaluation

We'll provide some sample traces for testing. We will evaluate your code on these, and other, traces. The primary criterion is correctness (defined as all events accounted for). A secondary criterion is synchronization overhead - for full marks you must improve the synchronization strategy beyond the big giant lock approach.

Starter Code

On CDF at ~csc469h/fall/pub/A1/a1_starter.tar we provide a framework that implements the driver, the thread creation and messaging for the handlers, and defines the prefetch record structure (pf_rec), the states, events, and message actions, as well as a possible hash table structure (which you are free to change), and the summary counters. Everything else is up to you.

Cautions

System V IPC message queues are kernel structures. There are system limits on the number you can create, and the maximum number of bytes in the queue, which you cannot change. Currently, the driver/handlers framework uses a single queue for all messages, with different handlers receiving different types of messages. Both use blocking sends and receives. If the message queue fills up, deadlock could result (driver blocks waiting to receive a specific kind of message but handler cannot send it because the queue is full, or vice versa). The driver currently yields (sched_yield) after sending out each event, to give the handlers a chance to run. If, however, you run into deadlocks due to full message queues, you may need to take other actions (for example, use more than 1 message queue, or have the handlers pull multiple messages off the queue at one time, and queue up pending work in a local data structure).


Last modified: Tue Oct 17 00:56:32 EDT 2006