Subject: "Bounty proposal: `preload'ing the desktop" WHO Behdad Esfahbod http://behdad.org/ behdad behdad.org NAME preload - an intelligent readahead agent SPEC VERSION 1.0 SUBMITTED TO Google Summer of Code, June 2005 MENTORING ORGANIZATION Fedora Core SYNOPSIS preload is an intelligent agent that monitors applications that a user runs, and from analyzing this data, predicts what applications the user might run in the near future, and fetches those binaries and their dependencies (shared objects) into memory for faster startup times. When run by the root user, preload takes into account probabilities that individual users may login now, and proceeds with prediction for faster login times. preload is desktop-environment agnostic. GOALS The goal here is to make application startup times shorter. We exploit the fact that most users have a very static set of applications that they use on a regular basis. For me for example, these applications are gnome-terminal, firefox, gaim, xmms, and skype. I start gnome-terminal and firefox as soon as I login and start gaim and xmms soon after. Add to these the GNOME desktop processes like nautilus, gnome-panel, panel applets, and shared objects they depend on, and you get a few hundred megabytes of binary objects that can be happily prefetched into page cache when GDM is waiting for me to login. preload makes this happen. That should drastically decrease the login time, and may have positive effects later on too. PROJECT DETAILS There are two main parts to preload, the data gathering agent, and the predictor agent. The two agents may or may not be implemented in the same process. The data gathering agent is started when a user logs in. A good place to start it is to drop it into /etc/X11/xinit/xinitrc.d. When started, it will gather information about running applications regularly, say once a minute, load permitting. The list of running applications is taken from the output of xlsclients(1), and for each application, the list of its maps is fetched from /proc. The latter can be performed lazily and results would be cached: There's no need to fetch maps for xmms every single minute, only once per run. Anyway, this data is gathered and sorted out and dumped into a database (a binary file format) for use by the predictor agent. There are a bunch of ways to make the database compact in size. Data acquisition can be stopped when screensaver is detected to be running, to not credit the running applications too much during idle times. The predictor agent is an intelligent agent that upon reading the gathered data and the list of currently running applications, predicts which application is most probably going to be run first in the coming minutes. We model the problem as a Bayesian system, and assign probabilities to each application, and then considering these probabilities and shared objects that each application depends on, we sort the shared objects in a priority list to be readahead, and perform that. Enough care is taken for the agent to not get in the way of the user. It monitors user activity, system load, data and page cache memory size, etc, before deciding to jump in. It is worth noting that the predictor agent has an exponentially fading memory, such that if a user's habits change and he starts using epiphany instead of firefox, after a few days the predictor understands that although firefox has been running every single session for a few months now, but that's part of history and epiphany is to be expected more often. The problem can be seen as an stochastic Markov chain whose states are members of the power set of the set of all applications that the user may run. So for example the Markov chain would tell us that most of the time the user starts at the state corresponding to the set of running applications {gnome-session}, and from that state almost always goes to the state corresponding to the set of running applications {gnome-session, gnome-settings-daemon, metacity, gnome-volume-manager, gnome-panel, nautilus, eggcups, wnck-applet, gweather-applet, stickeynotes-applet, multiload-applet, clock-applet, ...}, and from that state, most of the time goes to the state that has the same set of applications union {firefox, gnome-terminal}, or some other times simply {xmms}... If we can build this Markov chain, we can proceed with predictions, but training this Markov chain based on our training data is something better not tried. This model is too complex for the problem and will surely overfit. To overcome this shortcoming, we make independency assumptions and model every two applications in a four-state Markov chain. So for example we have a Markov chain for the pair of applications firefox and gnome-terminal, another for firefox and gnome-panel, another for firefox and xmms, and each of these Markov chains models the correlation between the state of the two applications involved. When we want to predict the probability that firefox is run, we simply multiply (using log-probabilities, we add) the individual probabilities from individual Markov chains that firefox forms with all other involved applications. For example, if gnome-terminal is run but not firefox, the Markov chain modeling these two applications will tell us that most of the time when gnome-terminal is launched, firefox is launched soon after... The above description is a simplification of the model I have in mind. In the more sophisticated model, it is not only a single probability that is assigned to edges in the Markov chains, but the time that takes for the state to change is fit with a probability distribution, so we can have an idea whether the state change is going to happen soon, and we take this into consideration when predicting. The probability distribution that we fit most probably would be a memory-less distribution, Poisson is a good candidate. Finally, the whole model is subject to change, should experiments show that it falls short. When preload is run as the root user (by GDM or another DM, or during the boot process), it takes into consideration the probabilities that each user may show up and login, and from there continues with normal prediction. Again, being a Bayesian system, it doesn't pick one user, but all active users, with probabilities. So for example if both active users of the machine run GNOME desktop and firefox, these applications and their dependencies will be fetched first, and after that applications used by the most active user of the machine. To summarize, preload is a generic readahead agent. It helps the page cache warm up in a learnable manner. A GNOME-specific readahead agent can be written without digging into machine learning, by simply looking at what gnome-session is going to run, but the approach proposed here works for all applications and desktops and even a mixture of them. I believe that an agent like what is proposed here can drastically decrease the login time, should it have had a few seconds to fetch stuff before the user logs in (but that is almost always the case, be it the time the user is typing user/password, or simply the login screen is sitting there waiting for the user to come back.) However whether it has any measurable effects after the login peek is passed and the cache is warmed up is something to wait and see. But then again, decreasing the login time is enough a motivation to experiment. The bad news is that simply monitoring what applications are currently running is too small information to predict what the user is going to run next, but this level of genericity is one of the design goals and should not be sacrificed. PROJECT SCHEDULE A prototype will be coded using Bash and Python scripts, using the basic model described here. This should not take more than two weeks. For the next five weeks the model will be improved, until the desired level of prediction is achieved. Alternate models will be examined in this period. After working out the model and not later than eight weeks from the beginning of the project, an efficient implementation in C will be started and finished in the remaining time. TARGETED BIO I am a Graduate student in Computer Science. Since I passed a graduate Machine Learning course last Fall, I have been thinking about using ML techniques in Open Source software. When thinking about ways to reduce the login time, I came up with the ideas presented here, and I think they are worth giving a try. I have been programming since 1990, and developing Open Source software since 2000.