University of Toronto - Spring 2000
Department of Computer Science

CSC 228: File Structures and Data Management

Advice on Assignment 2

Diane Horton

In office hours, I have seen many students who are confused about how to represent a node in a file, how to represent it in memory, and how to read and write between the file and memory. This posting is intended to help you with this.

Representing a node in a file

Let's consider a node from one of the linked lists in the URL file. What does it contain? Some URLs, and a "pointer" to the next node. How are these things represented in the file? This is completely up to you, and here are some of the issues you'll have to decide on:

Will the URLs each take a fixed amount of space in the file, or will they use just as much space as they need?
Will the pointer be a byte offset or a RRN?
In either case, the pointer is just a number. Will it be written as a sequence of characters, or in binary?
Will there be any delimiters between the fields of a node, or at the end of a node (to separate it from the one that falls next in the file)?

Whatever choices you make, try to picture each node as a sequence of bytes, and to picture the nodes as falling one after another in the file. In fact, draw yourself a picture of this, as I did in class. Note that the URL file does not contain a single linked list; there is one linked list for each keystring in the index file.

Representing a node in memory

How do you store the contents of a node in memory? You simply need to:

Define some variables that are appropriate to hold the information in a node.
For example, you could use an int to store a RRN, or a string to store a URL. Or perhaps an array of strings to store the bunch of URLs that are in a single node.
Write a function for reading a node into these variables, and one to write from them to a file. The output function will have to produce output that conforms to your design for the file layout. And the input function will have to expect to find data in the file to be in that format. As a result, the input function will be able to read any node that the output function wrote.

Now that you have some variables and two functions that are all related, they are an obvious candidate to glue together in a class. You might call it URLFileNode. Now a URLFileNode knows how to read and write itself, and any code that uses such nodes doesn't have to think about the grungy details. You could even overload the "<<" and ">>" operators, so that you can say things like:

      fstream fs;
      URLFileNode node;
      ...

      fs >> node;

By the way, using a struct to glue the variables together isn't as nice.

Some details regarding reading and writing to a file

You can "write" a bunch of stuff at once

You know by now that you can use "write" to write something out in binary. Eg:

     fs.write (&i, sizeof(int));

But "write" can also be used to write out composite structures, such as arrays. You simply tell it what memory address to start at, and how many bytes to write out. For example, if you had an array of 100 integers called "A", you could say:

     fs.write (A, 100*sizeof(int));

This works because the contents of an array are guaranteed to be contiguous in memory. (You don't explicitly take the address of the array because an array name can be used as a pointer to its first element.) You can even use "write" to write out a struct or an object. For example, I could say:

      fstream fs;
      URLFileNode node;
      ...

      fs.write( &node, sizeof(URLNode) );

This will write out, in binary, the instance variables inside node. Note that you should never try to compute the size of an object by adding up the sizes of its instance variables, because some extra memory may be used just to make things line up with convenient boundaries in memory.

Watch out if there are any pointers inside!

There is one major limitation with using "write": Whether you're writing a single variable, an array, a struct, or an object, if it contains a pointer to something in memory, all that will be written to the file concerning that thing is the pointer itself.

For example, if you have an array of "char *"s and you write it out, all that will go into the file is the pointers, not the character arrays that they point to. So the information about those character arrays is lost as soon as the program ends. On top of that, the pointers that were written to the file are completely useless the next time you read them in, because they point to memory that you were using the last time you ran the program. Who knows what's there now?!

So if you have an array, struct or object that contains a pointer, you'll have to create a function that writes out it's components individually, making sure to follow any pointers and print out the thing pointed to, rather than the pointer itself. Any or all of this can still be written in binary format, it just has to be done piece by piece. If you are using an object of a class that you didn't write, you should assume it contains pointers and don't write it using a single "write" with sizeof the classname. Instead, use one of the class's own output functions.

What about "empty" spaces?

Some students were concerned about what would happen if they used "write" on something that hadn't been completely filled with values. As a simple example, imagine we have an array of 100 ints and we've put values into only the first 36 slots. If we use "write" to write out the entire array -- all 100 slots -- there is no problem. The last 64 slots have something in them, it's just meaningless junk, and that's what will get written out. We can later read the array back into memory and we'll get the first 36 slots filled with our integers, and the last 64 filled with exactly the same meaningless junk. This is not a problem, as long as we know not to look past the first 36 slots; and we needed to know this even when we were just dealing with the array in memory. Of course, once we write the array out to the file and stop the program, we've lost track of the number 36. So we probably would want to write it to the file also.

Miscellaneous

At any one time, there should be only one node from the index file in memory, and one node from the URL file. You should absolutely not read the whole binary tree (or any of the linked lists) into memory -- they will not fit. This means that there is no need for any field that is a "URLFileNode *" in your URLFileNode, for example.

Back to the St. George csc228 home page