Root - A History and Evaluation of System R
© Zhou Qingqing 2002, 2005
http://www.cs.toronto.edu/~zhouqq
[Note: This review was written on 2002 and there are various misstatements in
it. For the sake of history, I keep it without any editorial work. -- When I
reread it, it is interesting to know they have encountered "convoy problem" in
more than 20 years ago. It is called "context switch storm" nowadays, which
hurts performance, especially for SMP machines. Now (2005.10.22) PostgreSQL
hackers are struggling to solve it, though the cause of the problem might be
different.]
This paper gives us an overview of the history of system R, which is the first
integrated relational system guided by E. F. Codd's relational model. Authors
conclude the observations and lessons learnt during the implementation of system
R.
กก
Reference
[1] Chamberlin, et al. "A history and evaluation of system R", Communications of
ACM, Vol. 24, No. 10, October 1981, pp. 632-646.
[2] http://www.mcjones.org/System_R/ - A website to memorize the history events
and the people of system R working in IBM San Jose research center (now Almaden
center). There are many big names in the people list. For instance, a great
picture you must have a look:
http://www.mcjones.org/System_R/gray.html
กก
Basic points
* Goals of System R
It is important to design the practical and convincing goals for building a
system. The goals of it include:
- SQL interface;
- Both transaction and ad-hoc query support;
- Dynamic (updatable) environment support;
- Concurrent operations support;
- Recovery support;
- Views and authorization;
- Performance;
We have to praise for the keen eyesight of the System R designers. Almost all
the fundamental aspects of DBMS have been considered as a whole in it.
* Basic stages of System R
The central goal of system R is the data independence. E. F. Codd presents the
relational model to solve this problem and system R is the first integrated
database system based on system R. The significance of data independence and
relational model are not rephrased here which have been introduced by another
review on Codd's paper.
There are 3 stages in System R:
- Stage 0 (1974 - 1975) - single user initial prototype
- Stage 1 (1976 - 1977) - multi-user and function enhanced prototype
- Stage 2 (1978 - 1979) - evaluation
* Stage 0
In this stage, the main purpose of system R is trying to demonstrate the
usefulness of relational model by building it. The main achievements include:
- Single user strategy. This is important in stage 0 because (1) actually
unified data model is the key concept of relational model distinguished from
other data models like hierarchical or network model. And this could be verified
in the single user prototype. (2) This strategy alleviates the task of stage 0;
- Use XRM as the storage component. Though people found that it is not very
suitable for the real relational system (for instance, it stores every attribute
of a single relation separately and associated them by TID links which degrades
the performance heavily), but for the same reason of choosing single user
implementation, this is good in stage 0;
- Treat system catalog as common tables, which simplify the design and
implementation;
- Identify query optimization as a challenging problem, esp. the Join cost. This
is still not completed solved till today;
Stage 0 has a good start of System R and boosts the project.
* Stage 1
In this stage, the researchers try to make the system a full functional,
multi-user DBMS achieving the following goals:
- Separate storage layer and query layer (and other non-core function
components);
- RSS (research storage system) as storage layer supporting multi-users;
- Lock system (see another review of paper by by J.Gray, esp. pay attention to
the "predicate locks" problem);
- Recovery system: it utilizes the "shadow page + check point" technology to do
recovery. Finally, they find it is painful to maintain the information needed
for the recovery and they suggested that only log with WAL protocol. Note: In
another paper by Mohan who represented the ARIES method to do recovery handles
this problem better by the method system R suggested. This paper is also a
classic one collected by Stonebraker's book.
- DSS (relational data system) as optimizing SQL processor layer;
- Compilation approach: the basic idea is separate the SQL parse and execution
stage, so that a SQL can be (1) pre-compiled into binary code; (2) repeatedly
evaluate a SQL without redundant parse. One more word about this is that System
R maintains some "independencies" to assure the pre-compiled execution plan is
correct since some index or tables can be updated in the fly. From here we may
guess out why the ODBC standard has: SQLExecDirect() and SQLParse()/SQLExec()
two different methods of evaluating a SQL;
- Join is still a headache. There are two methods presented, the first way is
treat two relations in join as inner relation and outer relation and the SQL
engine holds the former in the main memory; The second way is sort before join.
Actually, these methods are still active in the nowadays join processing;
- Views and access control;
* Stage 2
The task for this stage is to evaluate of system R by users and many function
enhancement and improvements from users' experiences.
- SQL language: add EXISTS, LIKE predicate, outer join type. Outer join type is
a counter part of inner join. The latter is the ordinary join we commonly use.
The former is the join that will return the results with all left/outer relation
of the join operand together with the attribute set NULL. You can find left
outer join, right outer join or full outer join;
- Optimizer is incomplete: this is not a fault of system R. It is the inherent
hard problem of storage system and SQL language. An extreme example is that a
user writes a 3-SAT query, then no reasonable time algorithm could be found for
the evaluation of this query. IMO, the main difficulties lie on: (1) the
tradeoffs between updates and query w.r.t the maintenance of statistics
information; (2) the inherent difficulty of different dimensionality storage
mapping; (3) the inherent hardness of relational model based on the first order
logic. Note: We can feel the hardness of this problem in another paper appeared
in PODS "An Overview of Query Optimization in Relational Systems" by S.
Chaudhuri addressed the current state of optimizer technology after 20 years of
System R.
- Convoy phenomenon: this is the state that most CPU time is used in switching.
In a DBMS system, if we use thread instead of process to do execution, this will
be alleviated since thread context switch is trivial. A solution to this problem
is use non-preemptive schedule policy mentioned in Blasgen's paper "The convoy
phenomenon".
กก
Conclusion
System R is cool! :-)
กก