CSC148 Introduction to Computer Science
Assignment 2, B+Trees, Lists, TestDriven Development, M-Way intersection

Due:
  • TestBPTree.py :
    • Wednesday March 27, 11:50 PM 10% Bonus
    • Thursday March 28, 11:50 PM, 0% Bonus
    • Friday March 29, 11:50 PM, -10% Bonus
  • BPTree.py, TestBPTree.py : Thursday April 4, 11:50 PM
  • BPTree.py, TestBPTree.py, WebSearch.py, TestWebSearch.py : Monday April 8, 11:50 PM, up to 48 hours late with no penalty.
Late penalty: BPTree.py (Thursday April 4) can't be handed in late, in case people need solutions, for part 3.
Hand in: Electronic submit here. Make sure to include members.txt.
Groups: Work in groups of size at most three.
Environment:We will test your code under python in the lab.
  1. [30 Marks] Test Driven Development: Writing Tests
    Lecture on Monday March 25 will cover this.
    Due: Wednesday March 27, 11:50 PM

    This part of the assignment has you write test cases for the supplied BPTree.py classes. If you remember, the Test Driven Development approach has you ...

    1. Frame the classes: Frame out the collection of classes and their methods first.
    2. Write test cases: next, you cement your understanding of these classes and their methods in a set of unittests. Your test cases should be exhaustive. This means that you should be very surprised to find that code with bugs could pass your test cases. Make sure you clearly document the test cases so that the TAs can easily see what you are testing.
    3. Write code: Finally, you write code, running tests as you go to make sure your code conforms to your understanding of what it should be doing.

    What you will do...

    1. Frame the classes: I have already started this for you in BPTree.py. You can add methods, but do not change the signatures of the existing methods.
    2. Write test cases: For python, we will use pyunit. This also has the effect of driving your development of correct code. You should do the following for each class C: create a class TestC for appropriate method m in C: add a test_m to TestC See the code I have started for you TestBPTree.py. You will hand in these test cases early. Your early test cases will be marked.
    3. Write code: Complete the BPTree classes so that they pass your test cases. (see part 2)

  2. [40 Marks] Writing BPTree classes
    Lecture on Wednesday March 27 will cover this.
    Tutorial on Thursday March 28 will cover this.
    Due: Thursday April 4, 11:50 PM

    Hand in your completed BPTree code, which passes all of your tests. For those that do not finish BPTree.py, I will hand out a solution on midnight. No late assignments accepted.

    Marking specifics: BPTree.py (40 marks)

  3. [30 Marks] Loading your BPTree and executing searches
    Due: Monday April 8, 11:50 PM

    Note: Starter code is BPTree_soln.py, and WebSearch.py. You are now finally in a position to use your BPTree to do something useful. We will use it to conduct some simple searches. First load the BPTree with keys that look like (keyword, documentNumber). Because of the way that the BPTree works and the way tuples compare, these will appear in sorted order in the BPTree. So for example, if 'canada' appears in documents 5, 55, 17, 22, 109, 33, then the keys

    ('canada', 5), ('canada', 55), ('canada', 17), ('canada', 22), ('canada', 109), ('canada', 33) will appear in the BPTree in the following order ... ('canada', 5), ('canada', 17), ('canada', 22), ('canada', 33) ('canada', 55), ('canada', 109) To find all documents that have 'canada' in them, you search for ('canada', -1) and then walk the keys until you run out of ('canada',?) keys. The result would be [5,17,22,33,55,109].

    To find documents that have all of the keywords 'canada', 'history', '1722', find the intersection of the lists for 'canada', 'history' and '1722'. You might want to take a look at python iterators as a neat, efficient way to do all of this. Create an iterator for each of the keywords you are looking for and then intersect them, in much the same way as one might merge them. Remember, they will appear in documentNumber order.

    You should hand in WebSearch.py as well as TestWebSearch.py, which verifies the functioning of your WebSearch class. An instance of WebSearcher is initialized with a collection of (keyword,documentNumber) pairs. It uses this to build a BPTree. Once built, the WebSearcher can respond with queries such as ws.intersect([ws.search("canada"), ws.search("history"), ws.search("1722")]). You may have to add a few methods to your BPTree to get this done well. You can get started building this class early, by simulating the BPTree search results, returning sorted lists of integers, and then finding the intersection of these lists.

    Marking specifics: BPTree.py (in case you added modifications), WebSearch.py (25 marks)

    Marking specifics: TestWebSearch.py (5 marks)

  4. Questions and Answers

    Question:
    Do I have to write a lot of code for BPTree.py?
    Answer:
    My BPTree.py, which does not include stats, is 175 lines long. Your starter BPTree.py is 128 lines long, though I have added some additional comments to your starter code. The solution is simple, clear and obvious, though not necessarily easy to develop. Think hard, draw pictures, write little, but correct code.
    Question:
    Are the tests cases in your TestBPTree.py complete?
    Answer:
    No, for example, you might want to check that the pointer returned from a split leaf points to an instance of BPTreeLeaf. I did not check the capacity of the newly created split leaf.
    Question:
    How do I let you know the members of our group?
    Answer:
    Please fillin and submit members.txt. One submission/group please.
    Question:
    Hi, Could please explain what setUP method is suppose to test in the TestBPTree.py.
    Answer:
    This is for setup of some structures before executing individual test cases. The way I have organized the tests so far, you do not need it.
    Question:
    Question about The insert method in BPTreeNode. When inserting do we insert the key as a new node, or follow the right pointer then insert the key into a leaf?
    Answer:
    For insert, the BPTreeNode has to have something underneith it. The simplest would be a BPTreeLeaf. One approach is the following: You can setup a scenario by holding onto a BPTreeLeaf, continually inserting until the leaf splits and then place the key and pointers in a new BPTreeNode. You can now continue testing by executing inserts on the BPTreeNode.
    Question:
    I am just wondering how would you know the three of us are in a group. Do we submit the same work?
    Answer:
    One submit/group please. Please fill out and include members.txt with your submission..
    Question:
    Hi professor i have a couple of questions for u regarding A2, first in test b tree there some codes such as class TestBPTreeNode(unittest.TestCase): def setUp(self): pass def test_init(self): pass def test_insert(self): pass
    1. They have only pass, do we have to write codes or test cases for this pass codes in the first part of the assgigment?
    2. Second question, do we have to make additional cases if yes how many for each of three clases given?
    3. and third do we have to fix all codes that u already given in test bst tree or they are already working codes ??
    Thanks in advance.
    Answer:
    1. Yes, write test cases to ensure that the complete BPTree modules classes work properly. All classes, as many methods as make sense.
    2. Enough to break down the world of inputs in a reasonable way and sample each category to have confidence that code that passes the tests really works.
    3. My code is a start, you probably want to change and improve it.
    Question:
    Hi professor: Question about insert method of BPTreeNode: def insert(self, key): ''' insert key into this subtree return (None, None) if there is nothing to promote return (pkey, ppointer) if self splits ''' # figure out which pointer we should follow, that is, follow # self._pointers[index] where # self._keys[index-1]<=key<self._keys[index] # ask that subtree to insert the key # if the subtree promoted (pkey, ppointer) we have to insert_here pass Are the subtrees same as leaves?
    Answer:
    A BPTreeNode has something below it. They may all be nodes or they may all be leaves. The subtree is all of the things are at this Node and below.
    Question:
    Can you explain method M from test_M in TestBPTree.py?
    Answer:
    Please see the supplied BPTree.py.
    Question:
    The code mentions pointers, what does that mean in terms of the code?
    Answer:
    Here is an example, draw the picture associated with this and I think you will understand. p3=BPTreeLeaf(['X','Y','Z'],None,7) p2=BPTreeLeaf(['P','Q','R'],p3,7) p1=BPTreeLeaf(['F','G','H'],p2,7) p0=BPTreeLeaf(['A','B','C'],p1,7) pointers=[p0,p1,p2,p3] keys=['F','P','X'] bptn=BPTreeNode(keys, pointers, 7)
    Question:
    My code raises an exception at certain places. I want to test that the exception is raised at the appropriate time, how do I do this?
    Answer:
    try: # do something I expect to cause an exception except MyException as e: pass # everything is as it should be else: assert... # code should have raised an exception alternatively, take a look at assertRaises
    Question:
    Do we need to test all methods of all BPTree classes?
    Answer:
    Yes.
    Question:
    so if the tree is like following: BPTree Node |3| Node |2| |5| Leaf |1| |2|3| |4| |5|6| Is this tree's height 3? (2 layer of Node+ Leaf)
    Answer:
    The tree you drew above has height=3, numNodes=3, numLeaves=4, numKeys=6.
    Question:
    If i read the return value of stats, the order changed from (height, number of nodes, number of keys, number of leaves)to(self.height(), self.numNodes(), self.numLeaves(), self.numKeys()). Which one is correct? (height, numNodes, numKeys, numLeaves)? or (height, numNodes, numLeaves, numkeys)? def stats(self): ''' return a tuple consisting of (height, number of nodes, number of keys, number of leaves) in the tree ''' ''' you may want to create separate height, numNodes, numLeaves, numKeys methods and collect their information to return stats, for example, return (self.height(), self.numNodes(), self.numLeaves(), self.numKeys()) this way, all partners can work on a separate problem. We won't mark efficiency of this just that the code is simple, clear, obvious and works. ''' pass
    Answer:
    Lets use ... return (self.height(), self.numNodes(), self.numLeaves(), self.numKeys()
    Question:
    What do you mean by ... class BPTreeNode(object): ... def keys(self): ''' return a list of all keys in the leaves of this subtree ''' pass ... it seems to conflict with the definition in BPTreeLeaf.
    Answer:
    You are correct. The comment in BPTreeNode is not correct. A better comment would be, Return a list of all keys in the linked list of leaves which are greater than or equal to the least key in this subtree. In other words, starting at the least key in this subtree, ending at the end of the linked list. The intention is, if you start at the root of the tree, that you get all of the keys in the tree.
    Question:
    Is there an error in the provided constructor for BPTree?
    Answer:
    There was, the new, empty leaf did not have its capacity initialized. The corrected code is below and now in the file. class BPTree(object): def __init__(self, capacity): ''' create an empty BPTree where each node has capacity keys and capacity+1 pointers. Each leaf has capacity keys. ''' self._tree=BPTreeLeaf([],None, capacity) self._capacity=capacity
    Question:
    How come the BPTree.stats() method takes no arguments.
    Answer:
    This is an error, it should have self as an argument... def stats(self): ''' return a tuple consisting of (height, number of nodes, number of leaves, number of keys) in the tree ''' ''' you may want to create separate height, numNodes, numLeaves, numKeys methods and collect their information to return stats, for example, return (self.height(), self.numNodes(), self.numLeaves(), self.numKeys()) this way, all partners can work on a separate problem. We won't mark efficiency of this just that the code is simple, clear, obvious and works. ''' pass
    Question:
    Any other advice about testing?
    Answer:
    Remember to re-submit working test cases Advice for testing: for init: minimal testing, just make sure that the state of the created structure is correct. Only interesting for the BPTree class, where you should verify that the _tree pointer points to an empty leaf of the same capacity, with _next=None. Other inits, just check that the keys, pointers, next, capacity are as was placed. BPTreeNode.insert_here: they can fake the pointers and keys (just place any lists of appropriate size). They should check... (range=2,6/7)x(insert=first, middle, last)x(split=yes, no) put in keys, pointers, run test and then verify return and contents of resulting BPTreeNodes A similar approach works for BPTreeLeaf.insert The difficult test is BPTreeNode.insert. Setup a minimal tree, for example, BPTreeNode with a set of BPTreeLeaves. Then insert and verify the return and the picture that results. The verification should ensure that the contents of the BPTreeNode/Nodes are correct, this will mean, verifying that they point to the correct BPTreeLeaves as well.
    Question:
    When creating a new BPTreeNode or BPTreeLeaf shouldn't you copy the arguments, rather than setting them? I check for this in my Unit tests and your code fails.
    Answer:
    If you work from the assumption that the BPTreeNode and BPTreeLeaf classes will not copy the argument lists, the original code will run correctly and faster. If you assume that the lists will be copied, you can use the this alternate solution. The difference between this is ... apps0:~/public_html/148/13s/assignments/02> diff BPTree_soln.py BPTree_soln2.py 45,46c45,46 < self._keys=keys # in sorted order < self._pointers=pointers --- > self._keys=keys[:] # in sorted order > self._pointers=pointers[:] 126c126 < self._keys=keys # in sorted order --- > self._keys=keys[:] # in sorted order
    Question:
    Can you post the modified WebSearch.py code from Fridays lecture.
    Answer:
    Done.
    Question:
    Can you post examples where I go through a generator (the result of yield) one element at a time.
    Answer:
    Done, see the lecture notes. I have added an advanced example. This is the idea behind the union or intersection code you need to write in the third part of the assignment. The code I have posted only handles a pair of generators. You will need to handle a general list of generators. This is hard code to write well, and in general. See ... yield4.py and yield5.py. These run in python3.2 and above, not in python 2.
    Question:
    1. In your 'canada' example: ('canada', 5), ('canada', 17), ('canada', 22), ('canada', 33) ('canada', 55), ('canada', 109) To find all documents that have 'canada' in them, you search for ('canada', -1) and then walk the keys until you run out of ('canada',?) keys. The result would be [5,17,22,33,55,109]. What do you mean by the "-1"? Does the search method need another parameter in order to perform the search? 2. What is returned when you search for a key that is not in the tree?
    Answer:
    1. The lower level search methods need the tweak to search for the first in the list.
    2. Essentially the empty list.
    Question:
    Do we need to test for invalid capacity inputs in the test_init method of BPTree? (E.g. If a user tried to create a BPTree with capacity = 1, should we return an error?)
    Answer:
    No, though you could have thrown an exception for such cases.
    Question:
    I have a question, for intersection and union do I just use the built in python functions for those methods? or do I implement the intersection and union algorithms on my own?
    Answer:
    For the purposes of the assignment, you should re-implement these. See my advanced yield exercises. Most importantly is the understanding of the advantage of using something other than the builtins. You can just retrieve the first result, for example, without pulling all of them.
    Question:
    Sorry to bother you, but also, you wrote that our code should respond to commands like this one on the assignment page: ws.search(["canada", "history", "1722"])
    Answer:
    Please conform to the WebSearch.py example.
    Question:
    Could you please tell me, are we supposed to test all of the methods?� Or just intersect, union, search (also in the BPTree) and perhaps _clean?
    Answer:
    Search, intersect and union are sufficient.
    Question:
    Can you outline the union and intersection algorithms again?
    Answer:
    union(L): maintain data associated with each list while at least one of the lists is non-empty: yield the minimum advance all that have the minimum value intersect(L): maintain data associated with each list while all of the lists are non-empty: if the minimum of all lists matches: yield the minimum advance all that have the minimum value
    Question:
    Why does the sample yield5 code not use next, instead of __next__.
    Answer:
    OK, now it uses next.
    Question:
    I don't like the design of the search method in class BPTree (said Arnold).
    Answer:
    I have to agree with myself. This was a poor design choice. Adding tuples to the BPTree as keys is fine. Making the BPTree deal with it specifically is out of the scope for the BPTree. Here is a better design: Modify the search method in BPTree so that it lists all keys greater than or equal the specified key. Now it is the responsibility of whoever is calling this method to deal with the stream of keys coming out of the BPTree.search generator. WebSearch.search can pull the keys from BPTree.search and deal with the fact that they are tuples, stopping whenever it runs out of keys. Maybe later!
    Question:
    I don't like the application of BPTree to the WebSearch in general. I have a better data structure in mind.
    Answer:
    There are other possibilities, Meiying suggested that she keep a B+Tree of keywords for partial keyword matching as well as a collection of separate B+Trees, one per keyword, for just document numbers. These are the values in a dictionary, indexed by key. This design IS better in a number of ways, adding only two additional data structures. I am curious which you find easier to understand.
    Question:
    You said you were going to post TestWebSearch.py
    Answer:
    ArnoldsTestWebSearch.py has been posted. I had to modify my WebSearch.py. There were two issues, first, my previous intersect code in WebSearch.py did not work (but you replaced that anyway). It always involved the empty set, so all intersections were empty. Second, to simplify testing, I wanted to get the next document number to be added to the collection of documents. I added that method. class WebSearch(object): ... def get_next_document_number(self): return len(self._documents) ... Please make sure that your code works with my initial test suite. We will be running this on your code for part of your marks.