CSC-2232: Some project ideas


     

    One of the main themes of this class is that in order to model or simulate a system accurately and obtain realistic results, it is crucial to understand the real-world characteristics of system parameters, such as workload or failure behavior. In practice, this is often hard because such data can be very difficult to obtain. In particular, companies are often very reluctant to share information related to system failures. In my research, I have been able to obtain a number of interesting data sets from a range of large sites. Many of these provide interesting material for course projects. Below is a summary of the available data and of possible directions for course projects.


    Data sets

    • Data on node outages collected over nearly a decade on more than 20 large high-performance computing clusters at Los Alamos National Labs. These are logs created by administrators and contain information such as when the outage started, when it was repaired, which node was affected, what the root cause was, etc.
    • Data on the physical layout of the machines in the machine room for many of the clusters at Los Alamos National Labs (to be used in conjunction with the data in the bullet above).
    • Workload information for several large clusters at Los Alamos National Labs. These are job logs that record when a job was started, how many nodes (and which nodes in the cluster) the job was running on, the completion time of the job, and its exit code (i.e. did the job complete normally, did it die because of a system problem, was it killed, etc.).
    • Event logs for a number of large clusters at several organizations, including some of the world's fastest supercomputers. These data sets contain log entries of system errors, but also entries of other events in the system.
    • Data on file system statistics at more than 3000 workstations at Los Alamos National Labs.
    • Storage system traces: Various types of traces from storage systems provided by SNIA.
    • The Usenix failure data repository: A repository with a number of different traces related to system errors and failures.
    • Failure Trace Archive: Another repository with failure traces.
    • Think of interesting data that you could collect yourself!

    Ideas for projects

    Project type 1: Measurement study of failure and workload data

    Each of the data sets listed above provides interesting material for measurement studies. For example, questions one could ask about node outages include the following:

    • How often do nodes fail?
    • What are the most common root causes?
    • Are there many correlated failures (several nodes going down at the same time)?
    • Are failures correlated in time, i.e. does a high failure rate in one hour mean that the following hour also has a higher probability of failure?
    • Are some machines in a cluster more likely to fail than others?
    • Are there symptoms that precede a node failure, e.g. messages in the error log, that could be used to predict a failure?
    • How do failures correlate with "sensor values", e.g. temperature, utilization, etc.?
    • How do node outages correlate with workload (job logs)?
    • Are nodes in some areas of the cluster/machine room more likely to fail? Are nodes that are close together more likely to fail together? (The physical layout information can be used for this.)
    • What other questions can you think of?
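
    As a starting point for questions like these, the following is a minimal Python sketch, not a prescribed method. It assumes the outage log has already been parsed into (node id, outage start time) pairs; the record layout, the sample values in OUTAGES, and all function names are hypothetical and not taken from the actual data sets. It counts failures per node and estimates the lag-1 autocorrelation of hourly failure counts, which is one simple way to check whether a busy hour tends to be followed by another busy hour.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical outage records: (node id, outage start time).
# These values are made up for illustration; a real study would parse them
# from the administrators' outage logs.
OUTAGES = [
    ("n1", datetime(2005, 1, 1, 0, 10)),
    ("n2", datetime(2005, 1, 1, 0, 40)),
    ("n1", datetime(2005, 1, 1, 1, 5)),
    ("n3", datetime(2005, 1, 1, 5, 30)),
]


def failures_per_node(outages):
    """Count how many outages each node experienced."""
    return Counter(node for node, _ in outages)


def hourly_counts(outages, start, hours):
    """Number of outages starting in each one-hour bin after `start`."""
    counts = [0] * hours
    for _, t in outages:
        idx = (t - start) // timedelta(hours=1)  # integer bin index
        if 0 <= idx < hours:
            counts[idx] += 1
    return counts


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation; returns 0.0 if it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


if __name__ == "__main__":
    counts = hourly_counts(OUTAGES, datetime(2005, 1, 1), 6)
    print(failures_per_node(OUTAGES).most_common())
    print(counts, lag1_autocorrelation(counts))
```

    A value clearly above zero on the real data would suggest that failures cluster in time. With a log this small the number means nothing; a proper study would also need to account for trends, such as failure rates that change over a cluster's lifetime.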

    Similarly, one could ask almost the same questions in the context of system errors (rather than node outages/failures). E.g. how often do errors happen, are they correlated across nodes, are they correlated in time, etc.

    Project type 2: Use the data to build better tools, algorithms, system designs, ...

    • Construct tools for visualizing operational data (errors, failures, job logs) to support system administrators.
    • Construct and evaluate novel tools for system administration (e.g. configuration, diagnosis, etc.).
    • Pick a sample application, evaluate its performance under real data, and compare with known results on synthetic data. A sample application could, for example, be algorithms for replica placement in a cluster or methods for error log pre-processing. Are there any relevant applications in your own research?
    • Which aspects of the data are interesting with respect to your own research? Can you look at some of your research questions in the context of this data?
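
    To illustrate the replica placement example, here is a rough Python sketch under stated assumptions. It assumes each node's outages are available as disjoint (start, end) intervals in hours, and it scores a placement by the total time during which every replica is down at once. The DOWNTIME values, the metric, and the function names are hypothetical choices for illustration, not part of the data sets or of any particular placement algorithm.

```python
# Hypothetical per-node downtime, as disjoint (start, end) intervals in hours.
# A real evaluation would derive these from outage start and repair times.
DOWNTIME = {
    "n1": [(0, 4), (10, 12)],
    "n2": [(2, 6)],
    "n3": [(11, 15)],
}


def intersect(a, b):
    """Intersect two sorted lists of disjoint intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def unavailable_time(placement, downtime):
    """Total time during which all nodes in `placement` are down at once."""
    nodes = list(placement)
    if not nodes:
        return 0
    common = sorted(downtime.get(nodes[0], []))
    for node in nodes[1:]:
        common = intersect(common, sorted(downtime.get(node, [])))
    return sum(hi - lo for lo, hi in common)


if __name__ == "__main__":
    for placement in (["n1", "n2"], ["n1", "n3"], ["n2", "n3"]):
        print(placement, unavailable_time(placement, DOWNTIME))
```

    Running the same scoring function on a real failure trace and on a synthetic one (for example, independent failures with exponentially distributed inter-arrival times) would show whether a placement strategy that looks good under the synthetic model still holds up when failures are correlated.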

    Project type 3: Collect your own measurements

    Instead of using one of the existing data sets you could also collect your own data and analyze it. Some examples include the following:

    • Some of the tools that were developed to collect the data sets above have been made publicly available. This includes, for example, the tool used to gather file system statistics. You could use this tool to collect and analyze data on one of the clusters at UofT.
    • Collect and analyze interesting operational data together with a local admin group, or on a cluster you have access to.
    • While we have talked mostly about performance and reliability of systems, another important aspect is cost. The total cost of ownership for today's systems is actually dominated by human cost (administrators managing the machines). Can you work with local admin groups (e.g. by "shadowing" admins) to construct and analyze a breakdown of where their time goes?
    • What other types of data can you think of that might be interesting? What is relevant for your own research?