Skip to content

Latest commit

 

History

History
168 lines (109 loc) · 4.69 KB

problem_set_1_solutions.md

File metadata and controls

168 lines (109 loc) · 4.69 KB

Data Analysis with Pandas - Solutions

Reading in Data

1.Read in the comma-separated file "client_list.csv". Assign as variable df1.

df1 = pd.read_csv("client_list.csv")
  1. Read in the delimted file "client_list.table". Assign as variable as df2.
df2 = pd.read_csv("client_list.table", sep=';')
  1. Read in the fixed-width file "client_list.txt". Assign as variable df3
df3 = pd.read_fwf("client_list.txt")
  1. Read in the comma-separated file "client_list.csv", skip the first 3 rows, and ignore the header. Do not assign to variable (just return a view).
pd.read_csv("client_list.csv", skiprows=3, header=None)
  1. Read in the comma-separated file "client_list.csv". Set the column headers in all caps. Assign as variable df.
df = pd.read_csv("client_list.csv")
df.columns = [x.upper() for x in df.columns]
  1. Read in the comma-separated file "client_list_practice.csv" and only extract the columns ["FIRST_NAME","AGE","EYE_COLOR"]. Do not assign to a variable.
pd.read_csv("client_list_practice.csv", usecols=["FIRST_NAME","AGE","EYE_COLOR"])

Slicing a Data Set

  1. Slice rows 5 through 11 of df. Can you provide two ways of doing this?
df[4:11]
df.loc[4:10, :]
  1. Return only the columns ['LAST_NAME','AGE','HAIR_COLOR'] for df. Can you provide two ways of doing this?
df[['LAST_NAME','AGE','HAIR_COLOR']]
df.loc[:, ['LAST_NAME','AGE','HAIR_COLOR']]
  1. Combine problems 7 and 8: return rows 5 though 11 and columns ['LAST_NAME','AGE','HAIR_COLOR'] for df. Can you provide two ways of doing this?
df[4:11][['LAST_NAME','AGE','HAIR_COLOR']]
df.loc[4:10, ['LAST_NAME','AGE','HAIR_COLOR']]

Simple Queries

  1. Find the subset of df where the client's last name is "Smith".
df[df.LAST_NAME=='Smith']
df.loc[df.LAST_NAME=='Smith', :]
  1. Find the subset of df where the client's hair color is not black.
df.loc[df.HAIR_COLOR!='black',:]
df.loc[~(df.HAIR_COLOR=='black'), :]
df[df.HAIR_COLOR!='black']
df[~(df.HAIR_COLOR=='black')]
  1. Find the subset of df where the client's hair color is red and reset the values to "ginger".
df.loc[df.HAIR_COLOR=='red', 'HAIR_COLOR'] = "ginger"

Complex Queries

  1. Find the subset of df where the clients are females older than 30 years.
df[(df.AGE>30) & (df.GENDER=='F')]
df.loc[(df.AGE>30) & (df.GENDER=='F'), :]
  1. Repeat problem 13, but return only the hair color and eye color.
df[(df.AGE>30) & (df.GENDER=='F')][['HAIR_COLOR','EYE_COLOR']]
df.loc[(df.AGE>30) & (df.GENDER=='F'), ['HAIR_COLOR','EYE_COLOR']]
  1. Find the unique combination of hair and eye color for women older than 25 years.
df.loc[(df.GENDER=='F') & (df.AGE>25), ['HAIR_COLOR','EYE_COLOR']].drop_duplicates()
df[(df.GENDER=='F') & (df.AGE>25)][['HAIR_COLOR','EYE_COLOR']].drop_duplicates()

Additional Dataframe Operations

  1. Perform a merge using "client_list.csv" and "customer_id_list.csv". Assign the resulting dataframe as clients.
df = pd.read_csv("client_list.csv")
df.columns = [x.upper() for x in df.columns]
ids = pd.read_csv("customer_id_list.csv")
clients = pd.merge(left=df, right=ids, how='left', on=['LAST_NAME','FIRST_NAME','GENDER','AGE'])
  1. Perform a merge using clients and "purchase_log.csv" and limit the subset to only clients who made purchases. Assign the resulting dataframe as detailed_sales.
sales = pd.read_csv("purchase_log.csv")
detailed_sales = pd.merge(left=clients, right=sales, how='inner', on=['CUSTOMER_ID'])
  1. Use groupby to find the client who spent the most money on purchases. Determine how much he/she spent. HINT: save the intermediate dataframe from using groupby as spenders before applying slicing to determine the client who spent the most money on purchases.
spenders = detailed_sales.groupby(['CUSTOMER_ID','FIRST_NAME','LAST_NAME','GENDER','AGE'], as_index=False)['PRICE'].sum()
spenders[spenders.PRICE==spenders.PRICE.max()]
  1. (BONUS) Modify the answer to problem 18 slightly to determine exactly what items where purchased by the top spending client.
top_spender_id = spenders[spenders.PRICE==spenders.PRICE.max()].reset_index(drop=True).loc[0,'CUSTOMER_ID']
sales.loc[sales.CUSTOMER_ID==top_spender_id, 'ITEM_DESCRIPTION']

Writing Files

  1. Save detailed_sales as a csv file named "df_out.csv" with no indices.
detailed_sales.to_csv("df_out.csv", index=False)
  1. Save detailed_sales to a pickle file named "df_out.p"
detailed_sales.to_pickle("df_out.p")

Back to Problem Set