| 1 | +// Adapted from an assignment used Fall 2014 in EE 264 under Prof. Lu |
| 2 | + |
| 3 | +// READING ASSIGNMENT: Chapter 20 of the online PDF text covers binary trees. |
| 4 | + |
| 5 | + |
| 6 | +// ~ Overview ~ // |
| 7 | + |
| 8 | +Recursion is an elegant method for solving many problems, and many data |
| 9 | +structures are actually easier to manipulate with recursive functions. |
| 10 | +In this assignment, we will see how a binary tree (actually any tree |
| 11 | +data structure) can utilize the power of recursion. |
| 12 | + |
| 13 | +// ~ Learning Goals ~ // |
| 14 | + |
| 15 | +(0) The main goal: learning the programming environment tools |
| 16 | +(1) Recursion |
| 17 | +(2) Binary (search) trees |
| 18 | +(3) Large datasets |
| 19 | +(4) Memory management |
| 20 | + |
| 21 | +Your solution will be a single file: |
| 22 | +(1) answer09.c |
| 23 | + |
| 24 | +We provide a tester program for FINAL EVALUATION to be run as follows: |
| 25 | + |
| 26 | + > ./tester |
| 27 | + |
| 28 | +Don't be tempted to run this program before constructing your own tests. |
| 29 | + |
| 30 | + |
| 31 | + |
| 32 | +// ~ Binary (Search) Trees ~ // |
| 33 | + |
| 34 | +Practically, you may have enough background to solve the assignment |
| 35 | +without reading this section of the README, but it is required reading |
| 36 | +for deepening your understanding of binary trees. |
| 37 | + |
| 38 | +From Wikipedia: A binary tree is a tree data structure in which each |
| 39 | +node has at most two children (referred to as the left child and the |
| 40 | +right child). |
| 41 | + |
| 42 | +This is similar to (and more general than) "linked lists", where nodes have |
| 43 | +at most one child. |
| 44 | + |
| 45 | +Linked List, 4 examples of increasing length: |
| 46 | + |
| 47 | + O head/tail O head O head O head |
| 48 | + | | | |
| 49 | + V V V |
| 50 | + O tail O O |
| 51 | + | | |
| 52 | + V V |
| 53 | + O tail O |
| 54 | + | |
| 55 | + V |
| 56 | + O tail |
| 57 | + |
| 58 | +Binary Tree, 4 examples of increasing depth: |
| 59 | + |
| 60 | + O root O root O root O root |
| 61 | + / \ / \ / \ |
| 62 | + O O O O O O |
| 63 | + left right / \ / \ / \ / \ |
| 64 | + O O O O O O O O |
| 65 | + / \ / \ / \ / \ |
| 66 | + O O O O O O O O |
| 67 | + |
| 68 | +The above diagrams show *full* binary trees, where every node has |
| 69 | +either zero or two children. It is perfectly acceptable to have a node |
| 70 | +with one child somewhere in the tree: |
| 71 | + |
| 72 | +More Binary Trees: |
| 73 | + |
| 74 | + O root O root O root O root |
| 75 | + / \ / \ / \ |
| 76 | + O O O O O O |
| 77 | + \ / \ \ |
| 78 | + O O O O |
| 79 | + |
| 80 | +Note that every linked list is (conceptually) also a binary tree. |
| 81 | + |
| 82 | +NOTE: at each O in the above diagrams, data can be stored. We are |
| 83 | +just showing the linkage structure relating the nodes, not any contents. |
| 84 | + |
| 85 | +These work well with recursion because the very definition of the data |
| 86 | +structure is actually recursive. Consider this definition: |
| 87 | + |
| 88 | +A "Linked List" is either the NULL pointer, or a node whose next |
| 89 | +pointer points to a "Linked List". |
| 90 | + |
| 91 | +This defines the term "Linked List" in terms of itself. There is a |
| 92 | +formal meaning for such definitions which we will not cover this |
| 93 | +semester, but in fact, it is formally interpreted to be exactly the |
| 94 | +type "Linked List" that you expect. The recursive definition tells |
| 95 | +you that any list processing function can be defined for all linked |
| 96 | +lists by defining it for NULL and for a node pointing to a linked |
| 97 | +list. IF the latter case can be defined by using the same function, |
| 98 | +voila, a recursive function emerges. |
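As a minimal sketch of this idea, the function below computes the length of a list by defining the result for NULL and for a node in terms of the rest of the list. The ListNode type and the list_length name are hypothetical, chosen only for illustration; they are not part of the assignment files.

```c
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
typedef struct ListNode {
    int value;
    struct ListNode * next;
} ListNode;

/* Length of a list: 0 for NULL, otherwise 1 plus the length of the rest. */
size_t list_length(const ListNode * head)
{
    if(head == NULL)
        return 0;
    return 1 + list_length(head->next);
}
```

The two branches of the function mirror the two cases of the recursive definition one for one.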
| 99 | + |
| 100 | +Now consider the recursive definition of binary tree: |
| 101 | + |
| 102 | +A "binary tree" is either the NULL pointer, or a node with left and |
| 103 | +right children that are "binary trees". |
| 104 | + |
| 105 | +Again, the definition implies the construction of a recursive function |
| 106 | +that processes trees. |
| 107 | + |
| 108 | +As long as the desired function on trees (or lists) can be naturally |
| 109 | +defined in terms of the results of the SAME function called on the |
| 110 | +children (or next pointer), then a recursive function is natural. The |
| 111 | +body of the function itself only has to consider one node and use |
| 112 | +recursion to handle the rest. |
| 113 | + |
| 114 | +That is, if a recursive program can handle one node, then it can handle the |
| 115 | +entire tree. |
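A small sketch of what "handle one node and recurse for the rest" can look like on a tree is given below. The TreeNode type and the tree_size and tree_depth names are assumptions made for this example, not the types or functions required by the assignment.

```c
#include <stddef.h>

/* Hypothetical tree node, for illustration only. */
typedef struct TreeNode {
    int value;
    struct TreeNode * left;
    struct TreeNode * right;
} TreeNode;

/* Number of nodes: 0 for NULL, otherwise this node plus both subtrees. */
size_t tree_size(const TreeNode * root)
{
    if(root == NULL)
        return 0;
    return 1 + tree_size(root->left) + tree_size(root->right);
}

/* Depth: number of nodes on the longest path from the root to a leaf. */
size_t tree_depth(const TreeNode * root)
{
    if(root == NULL)
        return 0;
    size_t l = tree_depth(root->left);
    size_t r = tree_depth(root->right);
    return 1 + (l > r ? l : r);
}
```

Each function body looks at a single node only; the recursive calls on the left and right children take care of everything below it.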
| 116 | + |
| 117 | +The relevant theory for why a binary tree is more efficient than a |
| 118 | +regular list is covered in future courses, but for a teaser, think |
| 119 | +about this: if you have a collection of N elements in a linked list, |
| 120 | +and you want to see if element X is in the list, then you may need to |
| 121 | +compare X to every element in the list, up to a total of N |
| 122 | +comparisons. However, if you use a binary search tree, you only need |
| 123 | +to traverse down a single branch, and on average will make around |
| 124 | +log_2(N) comparisons (i.e., log-base-2 of N). What it means to |
| 125 | +"traverse down a single branch" will become clear as you write the |
| 126 | +code for this assignment. In more technical terms, the worst-case |
| 127 | +behavior of lookup in a binary-search tree involves comparing at each |
| 128 | +node in the longest path from the root to a leaf, i.e. the worst-case |
| 129 | +runtime grows with the "depth" of the tree. If the tree is well |
| 130 | +balanced (leaf nodes occur at similar depths on all paths), then this |
| 131 | +depth will be about log_2(N), because at each node when you choose a child |
| 132 | +you eliminate about half the candidates, those that are below the |
| 133 | +other child. |
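To make the comparison count concrete, here is a minimal sketch of a lookup in a binary search tree of ints that also counts how many nodes it visits. It uses ints rather than business names to keep the example short; IntNode, bst_find and the visited counter are hypothetical names introduced for this illustration.

```c
#include <stddef.h>

/* Hypothetical node of a binary search tree of ints. */
typedef struct IntNode {
    int key;
    struct IntNode * left;
    struct IntNode * right;
} IntNode;

/* Look for key, following a single branch from the root downwards.
 * *visited is incremented once for every node that is compared. */
const IntNode * bst_find(const IntNode * root, int key, size_t * visited)
{
    if(root == NULL)
        return NULL;                 /* fell off the tree: not found */
    ++*visited;
    if(key == root->key)
        return root;
    if(key < root->key)
        return bst_find(root->left, key, visited);
    return bst_find(root->right, key, visited);
}
```

On a balanced tree holding the keys 1 through 7 with 4 at the root, a search for 7 visits 3 nodes (4, 6, 7), whereas a 7-element list could need up to 7 comparisons.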
| 134 | + |
| 135 | +// ~ The Yelp Database as a Tree ~ // |
| 136 | + |
| 137 | +This assignment combines the power of binary search trees with a set of data |
| 138 | +that is as close to the real world as possible: Data about local businesses. |
| 139 | +Users of the website Yelp (http://www.yelp.com) can post reviews and |
| 140 | +recommendations about their local restaurants and share them with the world. |
| 141 | +Yelp hosts about 57 million reviews for businesses catering to 132 million |
| 142 | +users a month, who constantly query their servers for information about their |
| 143 | +local businesses. |
| 144 | + |
| 145 | +One fundamental problem for companies with huge datasets like Yelp is |
| 146 | +retrieving information from a database. The user experience of the Yelp website |
| 147 | +would be unacceptable if, hypothetically, all of the data describing the |
| 148 | +businesses were stored in a simple array or linked list: it would mean that |
| 149 | +for every search, the server would have to compare every single entry in the |
| 150 | +database to the search term. |
| 151 | + |
| 152 | +Of course, this is not how it is done in the real world, and this is where tree |
| 153 | +data structures like the binary search tree described above come in. |
| 154 | + |
| 155 | +The dataset you'll be working with is a small fraction of data pulled from |
| 156 | +Yelp's servers. The information available to you is: |
| 157 | + |
| 158 | + - the name of the business |
| 159 | + - the address of the business |
| 160 | + - an average rating of the business on a scale of 1 through 5 |
| 161 | + |
| 162 | +You will construct a binary tree from this dataset that is ordered according |
| 163 | +to the names of the businesses. Recall that there exists a certain C library |
| 164 | +that provides functions to compare two strings to each other. This will create |
| 165 | +a tree data structure with the following properties: |
| 166 | + |
| 167 | +(1) Each node's left subtree contains nodes with names "less than" or "equal to" |
| 168 | + its own name |
| 169 | + |
| 170 | +(2) Each node's right subtree contains nodes with names "greater than" its |
| 171 | + own name |
| 172 | + |
| 173 | +(3) All left and right subtrees are binary search trees themselves. |
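One way to read properties (1) and (2) in code, assuming the comparison is done with strcmp from <string.h>, is sketched below. The helper name goes_left is hypothetical and is not one of the functions the assignment asks for; it only shows how the sign of the strcmp result maps to a subtree.

```c
#include <string.h>

/* Returns 1 if new_name belongs in the left subtree of a node named
 * node_name ("less than" or "equal to"), and 0 if it belongs in the
 * right subtree ("greater than"). */
int goes_left(const char * node_name, const char * new_name)
{
    return strcmp(new_name, node_name) <= 0;
}
```

Keep in mind that strcmp compares character codes, so in ASCII every uppercase letter sorts before every lowercase letter: "Zed" is "less than" "apple".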
| 174 | + |
| 175 | + |
| 176 | +// ~ The Yelp Dataset File ~ // |
| 177 | + |
| 178 | +The dataset you'll be working with is a small fraction of the Yelp Academic |
| 179 | +dataset, which is itself a fraction of Yelp's database that was made public for |
| 180 | +use in academia. More information and a download for the full dataset can be |
| 181 | +found here: http://www.yelp.com/dataset_challenge |
| 182 | + |
| 183 | +The data was converted from the JSON format to a text file filled with |
| 184 | +tab-separated values with the extension '.tsv'. There is a lot more information |
| 185 | +in the full dataset, but for this assignment, you will only use the average |
| 186 | +rating, the name, and the address. The file you're provided with, called |
| 187 | +'yelp_businesses.tsv' has information about 42153 businesses in the United |
| 188 | +States and should be read line by line. |
| 189 | + |
| 190 | +One line in the .tsv file lists one business and is structured like this: |
| 191 | + |
| 192 | + [rating]\t[name]\t[address]\n |
| 193 | + |
| 194 | +In order to fill the fields of one node, you need to separate a line according |
| 195 | +to the delimiter '\t'. Remember your explode() function from PA03? It could |
| 196 | +come in very handy for this assignment. If you want to use your explode |
| 197 | +function from PA03, then you will need to copy and paste it somewhere into |
| 198 | +your answer09.c file. Of course, you don't HAVE to use explode(...) here, as |
| 199 | +long as you create the nodes properly and without memory leaks. |
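If you prefer not to reuse explode(), the sketch below shows one possible way to split a line of the form shown above using only strchr and strcspn. The function name split_business_line and the fields array are assumptions made for this example. It modifies the line in place, so the fields point into the caller's buffer and must be copied before that buffer is reused.

```c
#include <string.h>

/* Split "[rating]\t[name]\t[address]\n" in place into three fields.
 * The tabs and the line ending are overwritten with '\0', and each
 * fields[i] points into line. Returns 1 on success, 0 if a tab is
 * missing. */
int split_business_line(char * line, char * fields[3])
{
    line[strcspn(line, "\r\n")] = '\0';   /* drop the line ending, if any */
    for(int i = 0; i < 3; ++i) {
        fields[i] = line;
        if(i == 2)
            break;                        /* address runs to end of line */
        char * tab = strchr(line, '\t');
        if(tab == NULL)
            return 0;                     /* fewer than three fields */
        *tab = '\0';
        line = tab + 1;
    }
    return 1;
}
```

Unlike strtok, which treats consecutive delimiters as one, this approach keeps an empty field (for example a missing address) as an empty string.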
| 200 | + |
| 201 | +// ~ Hints ~ // |
| 202 | + |
| 203 | +Chapter 20 of the online PDF text covers binary trees. |
| 204 | + |
| 205 | +---------------------------------------------------------------------- |
| 206 | + |
| 207 | +(*) Even though a tree is not an array, it is still easy to "iterate" |
| 208 | +over all of the elements. Iteration means you want to visit every |
| 209 | +element once, and only once. You already know how to do this with an |
| 210 | +array: |
| 211 | + |
| 212 | +int myints[] = { 5, 3, 6, 7 }; |
| 213 | +for(int ind = 0; ind < 4; ++ind) |
| 214 | + do_something_with(myints[ind]); // see, visit each element once and only once |
| 215 | + |
| 216 | +With trees, you choose either pre-order, in-order, or post-order |
| 217 | +traversal to do the same thing. Please look these up in the class notes. |
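As a rough sketch of how the three orders differ, the functions below visit every node of a small tree of ints exactly once and record the order of the visits in an array. The Node type and the function names are hypothetical and only serve this illustration; the class notes remain the reference.

```c
#include <stddef.h>

/* Hypothetical tree node, for illustration only. */
typedef struct Node {
    int value;
    struct Node * left;
    struct Node * right;
} Node;

/* Each function appends the visited values to out[] starting at index
 * count and returns the new count. */
size_t pre_order(const Node * n, int * out, size_t count)
{
    if(n == NULL) return count;
    out[count++] = n->value;                 /* node first */
    count = pre_order(n->left, out, count);
    return pre_order(n->right, out, count);
}

size_t in_order(const Node * n, int * out, size_t count)
{
    if(n == NULL) return count;
    count = in_order(n->left, out, count);
    out[count++] = n->value;                 /* node between its subtrees */
    return in_order(n->right, out, count);
}

size_t post_order(const Node * n, int * out, size_t count)
{
    if(n == NULL) return count;
    count = post_order(n->left, out, count);
    count = post_order(n->right, out, count);
    out[count++] = n->value;                 /* node last */
    return count;
}
```

For a tree with root 5, left child 3, right child 6, and 7 as the right child of 6, pre-order yields 5 3 6 7, in-order yields 3 5 6 7, and post-order yields 3 7 6 5. On a binary search tree, in-order traversal visits the values in sorted order.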
| 218 | + |
| 219 | +---------------------------------------------------------------------- |
| 220 | + |
| 221 | +(*) If you get stuck getting started, then try writing just |
| 222 | +create_node(...), and a print_tree(...) function. You should |
| 223 | +then be able to do the following: |
| 224 | + |
| 225 | +// Step 1 // |
| 226 | +Create and print trees like so: |
| 227 | + |
| 228 | +int main(int argc, char * * argv) |
| 229 | +{ |
| 230 | + // calls to create_node below need to use ustrdup() on each string, omitted for clarity |
| 231 | + BusinessNode * root = create_node("5.0", "random name", "random address"); |
| 232 | + root->left = create_node("3.5", "another name", "another address"); |
| 233 | + root->right = create_node("4.0", "yet another name", "some address"); |
| 234 | + root->left->right = create_node("1.5", "name 3", "address 3"); |
| 235 | + print_tree(root); |
| 236 | + return 0; |
| 237 | +} |
| 238 | + |
| 239 | +// Step 2 // |
| 240 | +Write destroy_tree(...). You should now have no memory leaks or |
| 241 | +errors. |
| 242 | + |
| 243 | +// Step 3 // |
| 244 | +Write tree_insert(...) and make sure it |
| 245 | +always works no matter what is thrown at it. |
| 246 | + |
| 247 | +// Step 4 // |
| 248 | +Write load_tree_from_file(...), which calls insert in a loop. |
| 249 | + |
| 250 | +// Step 5 // |
| 251 | +Write tree_search_name(...) and try to search for some businesses that you know |
| 252 | +are in the tree. |
| 253 | + |
| 254 | +At this stage you will be reasonably close to completing the assignment. |
| 255 | + |
| 256 | +---------------------------------------------------------------------- |
| 257 | + |
| 258 | +(*) To test the load_tree_from_file(...) function, it makes sense to work with |
| 259 | +a smaller subset of the data. To read the first 5 lines of |
| 260 | +'yelp_businesses.tsv' into a new file called 'shortfile.tsv', use the following |
| 261 | +command: |
| 262 | + |
| 263 | + > cat yelp_businesses.tsv | head -n 5 > shortfile.tsv |
| 264 | + |
| 265 | +Use "man cat" and "man head" to understand these two commands. This |
| 266 | +simple shell command also uses "pipes" (|) and "output redirection" |
| 267 | +(>), both of which are easy to understand and very |
| 268 | +useful. |