University of Toronto - Spring 2000
Department of Computer Science
CSC 228: File Structures and Data Management
Advice on Assignment 2
Diane Horton
In office hours, I have seen many students who are confused about how to
represent a node in a file, how to represent it in memory, and how to read
and write between the file and memory. This posting is intended to help you
with this.
Representing a node in a file
Let's consider a node from one of the linked lists in the URL file.
What does it contain? Some URLs, and a "pointer" to the next node.
How are these things represented in the file? This is completely up to
you, and here are some of the issues you'll have to decide on:
- Will the URLs each take a fixed amount of space in the file, or will
they use just as much space as they need?
- Will the pointer be a byte offset or a RRN?
- In either case, the pointer is just a number.
Will it be written as a sequence of characters, or in binary?
- Will there be any delimiters between the fields of a node, or
at the end of a node (to separate it from the one that falls next
in the file)?
Whatever choices you make, try to picture each node as a sequence of
bytes, and to picture the nodes as falling one after another in the
file. In fact, draw yourself a picture of this, as I did in class.
Note that the URL file does not contain a single linked list;
there is one linked list for each keystring in the index file.
Representing a node in memory
How do you
store the contents of a node in memory?
You simply need to:
- Define some variables that are appropriate to hold the information
in a node.
For example, you could use an int to store a RRN,
or a string to store a URL. Or perhaps an array of strings to store
the bunch of URLs that are in a single node.
- Write a function for reading a node into these variables, and
one to write from them to a file.
The output function will have to produce output that conforms to your
design for the file layout.
And the input function will have to expect to find data in the file
to be in that format.
As a result, the input function will be able to read any node that the
output function wrote.
Now that you have some variables and two functions that are all related,
they are an obvious candidate to glue together in a class.
You might call it URLFileNode. Now a URLFileNode knows how to read and
write itself, and any code that uses such nodes doesn't have to think about
the grungy details.
You could even overload the "<<" and ">>" operators, so that you can
say things like:
fstream fs;
URLFileNode node;
...
fs >> node;
By the way, using a struct to glue the variables together isn't as nice.
Some details regarding reading and writing to a file
You can "write" a bunch of stuff at once
You know by now that you can use "write" to write something out in binary.
Eg:
fs.write (&i, sizeof(int));
But "write" can also be used to write out composite structures, such as
arrays.
You simply tell it what memory address to start at, and how many bytes to
write out. For example, if you had an array of 100 integers called "A",
you could say:
fs.write (A, 100*sizeof(int));
This works because the contents of an array are guaranteed to be contiguous
in memory.
(You don't explicitly take the address of the array because an array name
can be used as a pointer to its first element.)
You can even use "write" to write out a struct or an object.
For example, I could say:
fstream fs;
URLFileNode node;
...
fs.write( &node, sizeof(URLNode) );
This will write out, in binary, the instance variables inside node.
Note that you should never try to compute the size of an object by adding
up the sizes of its instance variables, because some extra memory may be
used just to make things line up with convenient boundaries in memory.
Watch out if there are any pointers inside!
There is one major limitation with using "write":
Whether you're writing a single variable, an array, a struct, or an object,
if it contains a pointer to something in memory, all that will be written
to the file concerning that thing is the pointer itself.
For example, if you have an array of "char *"s and you write it out,
all that will go into the file is the pointers, not the character arrays
that they point to. So the information about those character arrays is
lost as soon as the program ends.
On top of that, the pointers that were written to the file are completely
useless the next time you read them in,
because they point to memory that you were using the last time you
ran the program. Who knows what's there now?!
So if you have an array, struct or object that contains a pointer,
you'll have to create a function that writes out it's components individually,
making sure to follow any pointers and print out the thing pointed to,
rather than the pointer itself.
Any or all of this can still be written in binary format, it just
has to be done piece by piece.
If you are using an object of a class that you didn't write, you should
assume it contains pointers and don't write it using a single "write"
with sizeof the classname.
Instead, use one of the class's own output functions.
What about "empty" spaces?
Some students were concerned about what would happen if they used "write"
on something that hadn't been completely filled with values.
As a simple example, imagine we have an array of 100 ints and we've put
values into only the first 36 slots.
If we use "write" to write out the entire array -- all 100 slots -- there
is no problem. The last 64 slots have something in them,
it's just meaningless junk, and that's what will get written out.
We can later read the array back into memory and we'll get the
first 36 slots filled with our integers, and the last 64 filled with
exactly the same meaningless junk.
This is not a problem, as long as we know not to look past the first 36
slots; and we needed to know this even when we were just dealing with
the array in memory.
Of course, once we write the array out to the file and stop the program,
we've lost track of the number 36.
So we probably would want to write
it to the file also.
Miscellaneous
At any one time, there should be only one node from the index file
in memory, and one node from the URL file.
You should absolutely not read the whole binary tree (or any of the linked
lists) into memory -- they will not fit.
This means that there is no need for any field that is a "URLFileNode *"
in your URLFileNode, for example.
Back to the
St. George csc228 home page