For this problem set, we will be using real data. We will analyze the height and weight of the athletes at the 2012 London Olympics. You can find the CSV for this dataset here.
- Use `DictReader` from the `csv` module (`import csv`) to read the CSV data file into a list of dictionaries named `athletes`, where each row is a dictionary.
- Create a list named `ages` that is a simple list of integers of all the ages in our file.
- Create two lists named `ages_female` and `ages_male` that are simple lists of integers of the ages of female and male athletes.
- Create three lists `weights`, `weights_female`, and `weights_male`, much like parts 2 and 3, that are simple lists of integer values of the weights from `athletes`.
- Create three lists `heights`, `heights_female`, and `heights_male`, much like parts 2 and 3, that are simple lists of integer values of the heights from `athletes`.
- Create a list called `bmi`, which is a list of the body mass index (BMI) values for each athlete in our list. (HINT: BMI = weight {kg} / (height {meters} * height {meters}).)
- Much like part 5, create two lists `bmi_female` and `bmi_male`, which include just the BMI values for the female and male athletes respectively.
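As a starting point, the steps above can be sketched as follows. This is a minimal sketch, not a reference solution: the two sample rows are made up, an in-memory string stands in for the real CSV file, and it assumes every row has non-empty `Age`, `Sex`, `Weight (kg)`, and `Height (cm)` fields (the column names are taken from the example rows later in this problem set). The helper `bmi_of` is a hypothetical name.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = """Name,Age,Sex,Weight (kg),Sport,Height (cm)
Athlete A,23,F,60,Rowing,170
Athlete B,31,M,80,Judo,180
"""

# With the real data, pass an open file object to DictReader instead
# of the StringIO used here.
athletes = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# DictReader yields strings, so every number must be converted.
ages = [int(a["Age"]) for a in athletes]
ages_female = [int(a["Age"]) for a in athletes if a["Sex"] == "F"]
ages_male = [int(a["Age"]) for a in athletes if a["Sex"] == "M"]


def bmi_of(athlete):
    """Return BMI = weight in kg / (height in m) squared for one row."""
    weight_kg = float(athlete["Weight (kg)"])
    height_m = float(athlete["Height (cm)"]) / 100  # cm -> m
    return weight_kg / (height_m * height_m)


bmi = [bmi_of(a) for a in athletes]
bmi_female = [bmi_of(a) for a in athletes if a["Sex"] == "F"]
bmi_male = [bmi_of(a) for a in athletes if a["Sex"] == "M"]
```

The weight and height lists follow the same pattern as the age lists. If the real file contains blank cells, the conversions will raise `ValueError`, so those rows would need to be filtered out first.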
NOTE: This problem set deals with the BMI because it is easy to calculate for this particular data set. However, the BMI has many limitations, and it does not fully represent the health of the human body.
- Find the mean and standard deviation of `ages`, `ages_female`, and `ages_male`. What do you now know about the age of Olympic athletes? Is this what you expected?
- Find the mean and standard deviation of `heights`, `heights_female`, and `heights_male`. We probably expect the average man to be somewhat taller than the average woman. Is that true for Olympic athletes?
- Find the mean and standard deviation of `weights`, `weights_female`, and `weights_male`. We probably expect the average man to be somewhat heavier than the average woman. Is that true for Olympic athletes?
- Find the mean and standard deviation of `bmi`, `bmi_female`, and `bmi_male`. What is a typical BMI for an Olympic athlete?
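One way to compute these summaries is with the standard library's `statistics` module. The sketch below uses a made-up list of ages rather than the real data, and the helper `summarize` is a hypothetical name. Note that there are two standard deviations to choose from: `pstdev` treats the list as the whole population, while `stdev` treats it as a sample.

```python
import statistics

# Hypothetical ages, used only to illustrate the calls.
ages_sample = [20, 22, 24, 26, 28]


def summarize(values):
    """Return (mean, population standard deviation) for a list of numbers."""
    return statistics.mean(values), statistics.pstdev(values)


mean_age, pop_sd = summarize(ages_sample)

# Sample standard deviation (divides by n - 1 instead of n).
sample_sd = statistics.stdev(ages_sample)

print(f"mean={mean_age}, population sd={pop_sd:.3f}, sample sd={sample_sd:.3f}")
```

If you use NumPy instead, `numpy.std` computes the population version by default, so its result corresponds to `pstdev` here.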
- How do the geometric mean and harmonic mean compare for `heights_female`?
- How do the geometric mean and harmonic mean compare for `weights_male`?
- Build a 10-bin histogram from the `bmi` list.
- Build a histogram for the `heights_female` and `heights_male` lists, starting at 120 cm and going up to 220 cm in 10 cm increments.
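The sketch below shows one possible approach, using made-up heights. It uses `statistics.geometric_mean` and `statistics.harmonic_mean` from the standard library (Python 3.8 or later) and `numpy.histogram` for the binning; the course may intend different tools, so treat the choice of functions as an assumption.

```python
import statistics

import numpy as np

# Hypothetical heights in cm, used only to illustrate the calls.
heights_sample = [150, 160, 170, 180, 190]

arithmetic = statistics.mean(heights_sample)
geometric = statistics.geometric_mean(heights_sample)
harmonic = statistics.harmonic_mean(heights_sample)

# For positive data: harmonic <= geometric <= arithmetic.
print(harmonic, geometric, arithmetic)

# A 10-bin histogram: NumPy picks 10 equal-width bins spanning the data.
counts, edges = np.histogram(heights_sample, bins=10)

# Fixed bins from 120 cm to 220 cm in 10 cm steps (11 edges, 10 bins).
fixed_edges = np.arange(120, 230, 10)
fixed_counts, _ = np.histogram(heights_sample, bins=fixed_edges)
print(fixed_counts)
```

`numpy.histogram` only returns the counts and bin edges. To draw the histogram, the same `bins` argument can be passed to a plotting function such as `matplotlib.pyplot.hist`.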
If Angelina Jolie and Brad Pitt were in the `athletes` list above, here is what their lines would look like:
{'Name': 'Angelina Jolie', 'Age': '40', 'Sex': 'F', 'Weight (kg)': '56.5', 'Sport': 'Acting', 'Height (cm)': '173'}
{'Name': 'Brad Pitt', 'Age': '52', 'Sex': 'M', 'Weight (kg)': '78', 'Sport': 'Acting', 'Height (cm)': '180'}
- What percentile is Angelina Jolie's weight, compared to the `weights_female` list?
- What percentile is Brad Pitt's height, compared to the `heights_male` list?
- What percentile would Angelina and Brad fall into in `bmi_female` and `bmi_male` respectively?
- What percentile would YOU fall into for height, weight, and BMI among athletes of your sex? (No judgements!)
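A percentile lookup of this kind can be sketched with `scipy.stats.percentileofscore`, which reports where a score falls within a list of values. The weights below are made up, and using this particular function is an assumption about the intended approach.

```python
from scipy import stats

# Hypothetical female weights in kg, used only to illustrate the call.
weights_sample = [50, 55, 60, 65, 70, 75, 80, 85]

# Values in the example rows are strings, so convert before comparing.
angelina_weight = float("56.5")

# Two of the eight sample weights lie below 56.5 kg.
pct = stats.percentileofscore(weights_sample, angelina_weight)
print(f"{pct:.1f}th percentile")
```

The function also accepts a `kind` argument that controls how ties are counted when the score equals values in the list.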
Let's try to fit our data. First, we will try to interpolate between the age and the BMI of our Olympic athletes. As it happens, interpolation is meant for the situation where each X value has exactly one Y value. Since we have many duplicate ages among our athletes, this is not a good fit. While taking a small sample of the data is fine for education, it is probably not what we would do with this data in real life.
- Use `dict` and `zip` to make a dictionary of the first 25 athletes in your `ages` and `bmi` lists. Name your dictionary `bmi_by_age`.
- Create an ordered list, named `age_keys`, of the ages in `bmi_by_age`. (Use `sorted` and `.keys()`.)
- Create a list, named `bmi_values`, of the BMI values associated with each age in `age_keys`. (Use a `for` loop and your `age_keys` along with `bmi_by_age`.)
- Create a function `f_linear` that is an interpolation of `age_keys` and `bmi_values`. (Use `interp1d`.)
- Create a function `f_cubic` that is a cubic interpolation of `age_keys` and `bmi_values`. (Use `interp1d` along with `kind='cubic'`.)
- Try different ages in your `f_linear` and `f_cubic` functions. How well do they match each other? How well do they match the data? Do they make sense?
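The interpolation steps above can be sketched as follows, using six made-up (age, BMI) pairs in place of the first 25 athletes. Note what `dict(zip(...))` does with the duplicate age: only the last BMI seen for that age survives, which is exactly why interpolation is an awkward fit for this data.

```python
from scipy.interpolate import interp1d

# Hypothetical values; age 23 appears twice on purpose.
ages_sample = [19, 23, 23, 27, 31, 35]
bmi_sample = [20.5, 21.0, 24.0, 22.5, 23.0, 25.5]

# Duplicate keys collapse: age 23 keeps the later value, 24.0.
bmi_by_age = dict(zip(ages_sample, bmi_sample))

age_keys = sorted(bmi_by_age.keys())

bmi_values = []
for age in age_keys:
    bmi_values.append(bmi_by_age[age])

# Linear is the default kind; cubic needs at least four points.
f_linear = interp1d(age_keys, bmi_values)
f_cubic = interp1d(age_keys, bmi_values, kind="cubic")

# Both functions pass through the data points but differ in between.
for age in (21, 25, 27, 33):
    print(age, float(f_linear(age)), float(f_cubic(age)))
```

By default, `interp1d` raises a `ValueError` for ages outside the range of `age_keys`, so try values between the youngest and oldest age in your dictionary.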
Let's try to analyze all of our data points (athletes) in a slightly more realistic way. A good start would be to use a more general curve-fitting approach.
Just to help you through the process, here is the data you're trying to fit:
- Convert the following from lists to `numpy.array`: `ages_female`, `ages_male`, `bmi_female`, and `bmi_male`.
- Create a function named `linear` that takes `x`, `a`, and `b` and returns ax + b.
- Use `curve_fit` and your `linear` function to fit the data where female athletes' ages are the x-values and female athletes' BMIs are the y-values. Do you think your fitted function matches the plot above?
- Use `curve_fit` and your `linear` function to fit the data where male athletes' ages are the x-values and male athletes' BMIs are the y-values. Do you think your fitted function seems reasonable? How could you test that?
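A minimal sketch of the curve-fitting steps, assuming `curve_fit` refers to `scipy.optimize.curve_fit`. The data here is synthetic and noise-free (generated from a known line), so the fit should recover the slope and intercept almost exactly; real athlete data will be far noisier. The array names `ages_arr` and `bmi_arr` are hypothetical stand-ins for your converted lists.

```python
import numpy as np
from scipy.optimize import curve_fit


def linear(x, a, b):
    """Straight line with slope a and intercept b."""
    return a * x + b


# Synthetic data generated from a known line: bmi = 0.1 * age + 20.
ages_arr = np.array([18, 22, 26, 30, 34, 38], dtype=float)
bmi_arr = 0.1 * ages_arr + 20.0

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(linear, ages_arr, bmi_arr)
a_fit, b_fit = params

# Residuals (data minus fitted line) are one way to judge the fit.
residuals = bmi_arr - linear(ages_arr, a_fit, b_fit)
print(a_fit, b_fit, np.abs(residuals).max())
```

Plotting the fitted line over the scatter of points, or checking whether the residuals look like random scatter around zero, are two simple ways to test whether a linear model is reasonable.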