Restructuring Legacy Software to Reduce Build Time and Improve Productivity

Large legacy C/C++ software systems typically consist of header files (.h files) and implementation files (.c files). Header files typically contain declarations of the symbols used in the implementation files. Including header files through preprocessing directives (#include statements) creates dependencies between files. Ideally, an implementation file includes only the declarations that it will use. However, a header file can be included by multiple files, and as such it may contain declarations and definitions that are not used by every implementation file that includes it. In such cases, false dependencies are created. Another problem is that symbols may be declared in more than one place. As systems evolve, such redundant declarations tend to become common.

Redundancies and false dependencies do not affect the functionality of a system, but they do affect the efficiency of the development process. The longer the build process takes, the longer developers have to wait to integrate their changes. Large software systems that contain millions of lines of code may take several hours to build. Redundancies increase the size of the code and may cause inconsistencies. A false dependency between an implementation file and its header exacerbates the problem by causing unnecessary compilation of the implementation file when an independent part of the header file has changed. This problem is particularly important in light of the popularity of the sync-and-stabilize development paradigm [1], where software systems undergo frequent, often daily, builds.
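The cost of a false dependency can be illustrated with a small model. The following is a minimal sketch, not the tool described on this page: the type aliases, function names, and file names (IncludeMap, SymbolUses, recompiledFiles, affectedFiles, common.h) are hypothetical, and only direct includes are considered. It contrasts the files a timestamp-based build would recompile when a header changes with the files that actually use the changed symbol.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model: implementation file -> headers it includes directly.
using IncludeMap = std::map<std::string, std::set<std::string>>;
// Hypothetical model: implementation file -> header symbols it really uses.
using SymbolUses = std::map<std::string, std::set<std::string>>;

// Files that a timestamp-based build recompiles when `header` is modified:
// every file that includes the header, whether or not it uses the change.
std::set<std::string> recompiledFiles(const IncludeMap& includes,
                                      const std::string& header) {
    std::set<std::string> result;
    for (const auto& entry : includes) {
        if (entry.second.count(header) > 0) {
            result.insert(entry.first);
        }
    }
    return result;
}

// Files that actually use `symbol`, i.e. the only ones that truly need
// to be rebuilt when the declaration of `symbol` changes.
std::set<std::string> affectedFiles(const SymbolUses& uses,
                                    const std::string& symbol) {
    std::set<std::string> result;
    for (const auto& entry : uses) {
        if (entry.second.count(symbol) > 0) {
            result.insert(entry.first);
        }
    }
    return result;
}
```

If editor.c and search.c both include common.h but only search.c uses the symbol Pattern, then a change to Pattern recompiles both files although only search.c is affected. The difference between the two sets is the unnecessary compilation caused by the false dependency.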

Traditional approaches to improving the efficiency of the build process focus on removing false target dependencies in makefiles. However, these approaches do not consider the internal details of implementation files. In this project, we take a novel preprocessing approach to the removal of redundant declarations and false dependencies, based on analysis of the program units inside header files. The main steps of this approach are:

1. Construct a program units graph and a file dependency graph.
2. Partition the graphs to remove redundancies and false dependencies between files.
3. Reorganize the files into directories to reduce coupling and improve cohesion.
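The first two steps can be sketched on a toy model. This is an illustrative simplification under stated assumptions, not the partitioning algorithm from the cited papers: the UseGraph alias and the partitionDeclarations function are hypothetical names, program units are reduced to symbol names, and declarations are simply grouped by the exact set of implementation files that use them.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model of the program units graph: each implementation
// file maps to the set of declared symbols (program units) it uses.
using UseGraph = std::map<std::string, std::set<std::string>>;

// Group declarations by the exact set of implementation files that use
// them. Each group is a candidate header: every file that includes it
// needs every declaration inside it, so no false dependency remains.
// Key: set of user files. Value: symbols to place in that header.
std::map<std::set<std::string>, std::set<std::string>>
partitionDeclarations(const UseGraph& uses) {
    // Invert the graph: symbol -> files that use it.
    std::map<std::string, std::set<std::string>> usersOf;
    for (const auto& entry : uses) {
        for (const auto& symbol : entry.second) {
            usersOf[symbol].insert(entry.first);
        }
    }

    // Symbols with identical user sets end up in the same header.
    // A symbol declared several times appears once (redundancy removed);
    // a symbol used by no file never enters the result.
    std::map<std::set<std::string>, std::set<std::string>> headers;
    for (const auto& entry : usersOf) {
        headers[entry.second].insert(entry.first);
    }
    return headers;
}
```

For example, if a.c uses foo and shared while b.c uses bar and shared, the sketch yields three headers: one with foo for a.c, one with bar for b.c, and one with shared for both. The number of groups can grow quickly with the number of distinct usage patterns, which is why a further reorganization step is needed.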

Step 2 provides an initial restructuring of header files which eliminates redundancies and false dependencies [3], but at the cost of producing a great number of header files, which makes the system more difficult to maintain. We created a softgoal dependency graph based on the NFR framework [4, 5, 6] in order to investigate the tradeoff between improving maintainability and improving productivity. We used this graph to arrive at an operationalization in which the number of header files is controlled by organizing the software system’s files into directories and refactoring the header files into these directories. We applied this algorithm in several case studies. In one study, we examined the effect of header restructuring on the open-source text editor Vim (Vi IMproved) [2]. Applying this technique alone, we found that the build time with the "-O2" option turned on was reduced by 56%, a speedup of slightly more than 2x.

What is New since CASCON03

Recently we have verified that this compilation tuning technique is orthogonal to other tuning techniques used for parallel compilation (e.g. DISTCC) and compilation caching (e.g. CCACHE, the precompiled headers option): with 8 CPUs, the speedup rose to 40x when these three techniques were applied together, while the net speedup of our technique on top of the other techniques is up to 8x. The major reason for the super-linear speedup is that our technique reduces by a factor of 3 the network traffic needed to send the precompiled form among machines in the compiler farm. The previous approach, reported at CASCON 2003, was based on heavyweight fact extraction using an abstract syntax graph (see the Datrix C/C++ schema). Our recent development replaces this cumbersome fact extraction with a lightweight extension of an existing parser that adds only a small overhead, and it can generate the restructured code on the fly. The overhead of the precompilation for restructuring is now smaller than the compilation time saved, even if the code base is compiled only once. If the code base is compiled N times, the precompilation overhead per build is divided by N. We have tried both GCC and the Intel C/C++ compiler and verified that the precompilation results are compatible with both and independent of the choice of C/C++ compiler. The new algorithm is implemented using the GCC 3.4.0 parser; see the technical report. For confidentiality reasons, we cannot disclose the name of the industrial software for IBM.
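The amortization argument can be made concrete with a short calculation. The sketch below is only an illustration: the function names amortizedBuildTime and buildSpeedup are hypothetical, and any build times passed in are placeholder values rather than measurements from the case studies. It assumes the precompilation runs once and its cost is spread evenly over N subsequent builds.

```cpp
#include <cassert>

// Average cost of one build of the restructured code when a one-time
// precompilation overhead is spread evenly over `builds` builds.
// All times must be in the same unit (e.g. seconds).
double amortizedBuildTime(double restructuredBuildTime,
                          double precompilationOverhead,
                          int builds) {
    assert(builds > 0);  // at least one build is required
    return restructuredBuildTime + precompilationOverhead / builds;
}

// Speedup of the restructured build over the original build, with the
// precompilation overhead included in the restructured cost.
double buildSpeedup(double originalBuildTime,
                    double restructuredBuildTime,
                    double precompilationOverhead,
                    int builds) {
    return originalBuildTime /
           amortizedBuildTime(restructuredBuildTime,
                              precompilationOverhead, builds);
}
```

As a usage example, a build time reduced by 56% corresponds to buildSpeedup(100.0, 44.0, 0.0, 1), which is 100/44, or about 2.27. With a placeholder overhead of 10 time units spread over 5 builds, amortizedBuildTime(44.0, 10.0, 5) gives 46 units per build, so the overhead matters less the more often the code base is rebuilt.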

Resources

"Removing false code dependencies" (PDF) and the related results for VIM 6.1.

"Light-weight Fine-grain C/C++ precompilation" and the related results for VIM 6.2.

References

1: M. A. Cusumano and R. W. Selby. How Microsoft builds software. Communications of the ACM, 40(6), June 1997.
2: Bram Moolenaar. Vim 6.1, http:
3: Y. Yu, H. Dayani-Fard, and J. Mylopoulos. Removing False Code Dependencies to Speedup Software Build Processes. (PDF) In Proceedings of CASCON 2003, Toronto, Canada, Oct 2003.
4: H. Dayani-Fard. Quality-based software release management. PhD thesis, Queen’s University, 2003.
5: L. Chung, B. A. Nixon, E. Yu, and J. Mylopoulos. Non-Functional Requirements in Software Engineering. Kluwer Academic Publishing, 1999.
6: L. Tahvildari and K. Kontogiannis. Requirements-Driven Software Re-engineering Framework. WCRE’01, 2001, pp. 71-80.