Commit 2a5d26e

Add PA 09
1 parent 9e81696 commit 2a5d26e

4 files changed

+42736
-0
lines changed

pa09/README

+268
@@ -0,0 +1,268 @@
// Adapted from an assignment used Fall 2014 in EE 264 under Prof. Lu

// READING ASSIGNMENT: Chapter 20 of the online PDF text covers binary trees.


// ~ Overview ~ //

Recursion is an elegant method for solving many problems, and many data
structures are actually easier to manipulate with recursive functions.
In this assignment, we will see how a binary tree (actually any tree
data structure) can utilize the power of recursion.

// ~ Learning Goals ~ //

(0) The main goal: learning the programming environment tools
(1) Recursion
(2) Binary (search) trees
(3) Large datasets
(4) Memory management

Your solution will be a single file:
(1) answer09.c

We provide a tester program for FINAL EVALUATION to be run as follows:

> ./tester

Don't be tempted to run this program before constructing your own tests.



// ~ Binary (Search) Trees ~ //

Practically, you may have enough background to solve the assignment
without reading this section of the README, but it is required reading
for deepening your understanding of binary trees.

From Wikipedia: A binary tree is a tree data structure in which each
node has at most two children (referred to as the left child and the
right child).

This is similar to (and more general than) a "linked list", where each
node has at most one child.

Linked List, 4 examples of increasing length:

   O head/tail    O head     O head     O head
                  |          |          |
                  V          V          V
                  O tail     O          O
                             |          |
                             V          V
                             O tail     O
                                        |
                                        V
                                        O tail

Binary Tree, 4 examples of increasing depth:

  O root       O root          O root                O root
              / \            /   \                /     \
             O   O          O     O            O           O
          left   right     / \   / \         /   \       /   \
                          O   O O   O       O     O     O     O
                                           / \   / \   / \   / \
                                          O   O O   O O   O O   O

The above diagrams show *full* binary trees, where every node has
either zero or two children. It is perfectly acceptable to have a node
with one child somewhere in the tree:

More Binary Trees:

  O root       O root          O root          O root
              / \            /   \           /   \
             O   O          O     O         O     O
              \            / \                     \
               O          O   O                     O

Note that every linked list is (conceptually) also a binary tree.

NOTE: at each O in the above diagrams, data can be stored. We are
just showing the linkage structure relating the nodes, not any contents.

These work well with recursion because the very definition of the data
structure is actually recursive. Consider this definition:

    A "Linked List" is either the NULL pointer, or a node whose next
    pointer points to a "Linked List".

This defines the term "Linked List" in terms of itself. There is a
formal meaning for such definitions which we will not cover this
semester, but in fact, it is formally interpreted to be exactly the
type "Linked List" that you expect. The recursive definition tells
you that any list processing function can be defined for all linked
lists by defining it for NULL and for a node pointing to a linked
list. IF the latter case can be defined by using the same function,
voila, a recursive function emerges.

Now consider the recursive definition of binary tree:

    A "binary tree" is either the NULL pointer, or a node with left and
    right children that are "binary trees".

Again, the definition implies the construction of a recursive function
that processes trees.

As long as the desired function on trees (or lists) can be naturally
defined in terms of the results of the SAME function called on the
children (or next pointer), then a recursive function is natural. The
body of the function itself only has to consider one node and use
recursion to handle the rest.

i.e., if a recursive program can handle one node, then it can handle the
entire tree.

The relevant theory for why a binary search tree is more efficient than a
regular list is covered in future courses, but for a teaser, think
about this: if you have a collection of N elements in a linked list,
and you want to see if element X is in the list, then in the worst case
you need to compare X to every element in the list, a total of N
comparisons. However, if you use a balanced binary search tree, you only
need to traverse down a single branch, and will make around
log_2(N) comparisons (i.e., log-base-2 of N). What it means to
"traverse down a single branch" will become clear as you write the
code for this assignment. In more technical terms, the worst-case
behavior of lookup in a binary search tree involves a comparison at each
node on the longest path from the root to a leaf, i.e., the worst-case
runtime grows with the "depth" of the tree. If the tree is well
balanced (leaf nodes occur at similar depths on all paths), then this
depth will be about log_2(N), because each time you choose a child
you eliminate about half the candidates: those that are below the
other child.

// ~ The Yelp Database as a Tree ~ //

This assignment combines the power of binary search trees with a set of data
that is as close to the real world as possible: data about local businesses.
Users of the website Yelp (http://www.yelp.com) can post reviews and
recommendations about their local restaurants and share them with the world.
Yelp hosts about 57 million reviews for businesses catering to 132 million
users a month, who constantly query their servers for information about their
local businesses.

One fundamental problem for companies with huge datasets like Yelp is
retrieving information from a database. The user experience of the Yelp website
would be unacceptable if, hypothetically, all of the data describing the
businesses were stored in a simple array or linked list: it would mean that
for every search, the server would have to compare every single entry in the
database to the search term.

Of course, this is not how it is done in the real world, and this is where tree
data structures like the binary search tree described above come in.

The dataset you'll be working with is a small fraction of data pulled from
Yelp's servers. The information available to you is:

- the name of the business
- the address of the business
- an average rating of the business on a scale of 1 through 5

You will construct a binary tree from this dataset that is ordered according
to the names of the businesses. Recall that there exists a certain C library
that provides functions to compare two strings to each other. This will create
a tree data structure with the following properties:

(1) Each node's left subtree contains nodes with names "less than" or
    "equal to" its own name

(2) Each node's right subtree contains nodes with names "greater than" its
    own name

(3) All left and right subtrees are binary search trees themselves.


// ~ The Yelp Dataset File ~ //

The dataset you'll be working with is a small fraction of the Yelp Academic
dataset, which is itself a fraction of Yelp's database that was made public for
use in academia. More information and a download for the full dataset can be
found here: http://www.yelp.com/dataset_challenge

The data was converted from the JSON format to a text file filled with
tab-separated values with the extension '.tsv'. There is a lot more information
in the full dataset, but for this assignment, you will only use the average
rating, the name, and the address. The file you're provided with, called
'yelp_businesses.tsv', has information about 42153 businesses in the United
States and should be read line by line.

One line in the .tsv file lists one business and is structured like this:

    [rating]\t[name]\t[address]\n

In order to fill the fields of one node, you need to separate a line according
to the delimiter '\t'. Remember your explode() function from PA03? It could
come in very handy for this assignment. If you want to use your explode
function from PA03, then you will need to copy and paste it somewhere into
your answer09.c file. Of course, you don't HAVE to use explode(...) here, as
long as you create the nodes properly and without memory leaks.

// ~ Hints ~ //

Chapter 20 of the online PDF text covers binary trees.

----------------------------------------------------------------------

(*) Even though a tree is not an array, it is still easy to "iterate"
over all of the elements. Iteration means you want to visit every
element once, and only once. You already know how to do this with an
array:

    int myints[] = { 5, 3, 6, 7 };
    for (int ind = 0; ind < 4; ++ind)
        do_something_with(myints[ind]); // visit each element once and only once

With trees, you choose either pre-order, in-order, or post-order
traversal to do the same thing. Please look these up in the class notes.

----------------------------------------------------------------------

(*) If you get stuck getting started, then try writing just
create_node(...), and a print_tree(...) function. You should
then be able to do the following:

// Step 1 //
Create and print trees like so:

    int main(int argc, char * * argv)
    {
        // calls to create_node below need to use strdup() on each string, omitted for clarity
        BusinessNode * root = create_node("5.0", "random name", "random address");
        root->left = create_node("3.5", "another name", "another address");
        root->right = create_node("4.0", "yet another name", "some address");
        root->left->right = create_node("1.5", "name 3", "address 3");
        print_tree(root);
        return 0;
    }

// Step 2 //
Write destroy_tree(...). You should now have no memory leaks or
errors.

// Step 3 //
Write tree_insert(...) and make sure it
always works no matter what is thrown at it.

// Step 4 //
Write load_tree_from_file(...), which calls insert in a loop.

// Step 5 //
Write tree_search_name() and try to search for some businesses that you know
are in the tree.

At this stage you will be reasonably close to completing the assignment.

----------------------------------------------------------------------

(*) To test the load_tree_from_file(...) function, it makes sense to work with
a smaller subset of the data. To copy the first 5 lines of
'yelp_businesses.tsv' into a new file called 'shortfile.tsv', use the following
command:

> cat yelp_businesses.tsv | head -n 5 > shortfile.tsv

Use "man cat" and "man head" to understand these two commands. This
simple shell command also uses "pipes" (|) and "output redirection"
(>), both of which are easy to understand and very useful.
