machine_learning.org


system setup

yaourt intel-mkl-slim

The non-slim variant of the above fails to install, possibly because the 4 GB /tmp partition fills up during the tgz extraction.

Not sure whether the following is needed:

yaourt intel-opencl-sdk

GPU

https://wiki.archlinux.org/index.php/GPGPU

dell desktop:

yaourt intel-opencl-runtime

samsung laptop:

pacman -S beignet

Learning

Dragan Rocks

https://dragan.rocks/

tutorial - Fast, Native Speed, Vector Computations in Clojure

https://neanderthal.uncomplicate.org/articles/tutorial_native.html

7 Gate

pre-start links

The Harvard CS109 class of 2015 has hands-on examples of the above concepts. Work through the material below on anything you have not fully grasped.

The following sessions address general tooling such as using the command line, Python (NumPy, Matplotlib, Pandas, Seaborn), Jupyter Notebooks, Git (and GitHub), and sending HTTP requests. You must be comfortable with these before attending the classes. The following sessions may help with that:

(Note that Jupyter Notebook has evolved into Jupyter Lab since the sessions were recorded; we will be using the latter in the class.)

1. The following sessions are concept refreshers on cohort prerequisites:

   b. Lecture 2 for Pandas (scraping part optional)

   c. Lab 4 on Regression video and notebook

Step 3: Data Science Presentations

Study the academy's pre-course presentations and make sure you search online for any concepts you are not familiar with.

jupyter, python, etc

Start Jupyter Lab:

jupyter-lab 

python: numpy, matplotlib, pandas, seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
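
A minimal sketch (my own, not from any tutorial) showing how the four libraries fit together; the column name is made up:

values = np.random.normal(size=100)           # numpy: generate some data
df = pd.DataFrame({"measurement": values})    # pandas: wrap it in a DataFrame
sns.histplot(df["measurement"])               # seaborn: plot a histogram
plt.show()                                    # matplotlib: display the figure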

ML Notes

Linear Regression - usually fit by ordinary least squares, i.e. minimising the squared error between predictions and targets.
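
A minimal sklearn sketch of fitting a least-squares line, using made-up numbers:

from sklearn.linear_model import LinearRegression
import numpy as np

# made-up data: y is roughly 2*x + 1 with a little noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept found by least squares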

Linearly Separable

Two classes of data are linearly separable if you can draw a straight line (or, in higher dimensions, a hyperplane) that separates them.

KNN - k nearest neighbours

Just save all the training data; for a future query, look up the k stored points closest to it and use their values (e.g. a majority vote or an average) to figure out what the answer should be.
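
A minimal sklearn sketch of that idea, with made-up points:

from sklearn.neighbors import KNeighborsClassifier

# made-up 2D points belonging to two classes
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 neighbours
knn.fit(X_train, y_train)                  # "training" just stores the data
print(knn.predict([[1, 1], [6, 6]]))       # classify new points by their nearest stored neighbours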

Cross Validation

1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   - Take the group as a hold-out or test data set
   - Take the remaining groups as a training data set
   - Fit a model on the training set and evaluate it on the test set
   - Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores.
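
A minimal sklearn sketch of that procedure, using made-up data and a decision tree:

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

X = np.random.rand(100, 3)   # made-up features
y = np.random.rand(100)      # made-up target

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle and split into k groups
scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])              # fit on the training groups
    preds = model.predict(X[test_idx])                 # evaluate on the hold-out group
    scores.append(mean_absolute_error(y[test_idx], preds))
print(np.mean(scores))  # summarize the evaluation scores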

Bias

High bias means the model makes overly strong simplifying assumptions, so it systematically misses the true relationship and you get more error in your predictions (underfitting).

Cost/Loss/Objective function

A function that measures how far we are away from the right answer; training tries to minimise it.
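
For example, mean squared error is a common cost function; a minimal sketch:

import numpy as np

def mse(y_true, y_pred):
    # average of the squared differences between predictions and the right answers
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.4]))  # small value means predictions are close to the targets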

gradient descent

Iteratively move the parameters a small step in the direction that reduces the cost function (the negative gradient), and repeat until the cost stops improving.

learning rate

The size of the step we take is called the learning rate.
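
A minimal numpy sketch of gradient descent for a one-parameter least-squares fit, with made-up data; lr is the learning rate:

import numpy as np

# made-up data: y is roughly 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])

w = 0.0    # the parameter we are fitting
lr = 0.01  # learning rate: the size of each step

for _ in range(1000):
    preds = w * x
    grad = 2 * np.mean((preds - y) * x)  # gradient of the mean squared error w.r.t. w
    w -= lr * grad                       # step downhill
print(w)  # ends up close to 2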

jupyter, python, etc

Load a dataset with pandas and explore it:

import pandas as pd
# tab separated, column 0 is an index column
the_data = pd.read_csv("mydata.tsv", sep="\t", index_col=0)

# get min, max, std, mean info about the data
the_data.describe()

# get the column names (NOTE: property not function):
the_data.columns

# general info about the data, like how much memory it uses:
the_data.info()

# retrieve the first rows of the data to have a look at it
the_data.head()

Modeling

This is following the kaggle intro to machine learning tutorial

https://www.kaggle.com/dansbecker/your-first-machine-learning-model

First, let's select the thing we want to predict:

y = the_data.hearing_damage

Next, choose the features that will be used as inputs to the predictions. We select multiple features by providing a list of column names inside brackets:

the_data_features = ['source_intensity', 'horizontal_distance', 'vertical_distance', 'insulation']

Now let's get a pandas DataFrame with just these columns:

X = the_data[the_data_features]

from sklearn.tree import DecisionTreeRegressor

# Define the model. Specify a number for random_state to ensure the same results each run
data_model = DecisionTreeRegressor(random_state=1)

# Fit the model on all of the data
data_model.fit(X, y)

# Predict for the first few rows and compare with the actual values
data_model.predict(X.head())
X.head()

Let's check how good the model is:

from sklearn.metrics import mean_absolute_error

# in-sample error: predicting on the same data we trained on
predicted_values = data_model.predict(X)
mean_absolute_error(y, predicted_values)

split up the data

It's not good to use the same data to train AND test with.

from sklearn.model_selection import train_test_split

# split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
data_model = DecisionTreeRegressor()

# Fit model
data_model.fit(train_X, train_y)

# get predictions on the validation data
val_predictions = data_model.predict(val_X)

print(mean_absolute_error(val_y, val_predictions))

vary decision tree depth

We can control the size of the decision tree (how many leaf nodes it is allowed to grow) with a line like:

model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)

We can test a variety of tree sizes and their MAE with:

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
    
random forests

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, forest_preds))

pandas

dataframes

import pandas as pd

# initialize a list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
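
A quick follow-up sketch of using the resulting DataFrame:

print(df.head())           # look at the rows
print(df['Age'].mean())    # average age: 13.0
print(df[df['Age'] > 12])  # filter rows: keeps nick and juli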

Inference

For now, though, we are only dealing with inference, the process of computing the output using the given structure, input, and whatever weights there are.
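
A minimal numpy sketch of inference for a single dense layer, with made-up weights and input:

import numpy as np

x = np.array([1.0, 2.0])                    # input
W = np.array([[0.5, -1.0], [0.25, 0.75]])   # made-up weights (the "structure" here is one dense layer)
b = np.array([0.1, -0.2])                   # made-up bias
output = np.maximum(W @ x + b, 0.0)         # weighted sum plus bias, then ReLU
print(output)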

functions

uncomplicate.neanderthal.native

dge rows columns - Creates a GE (general dense) matrix using the double precision floating point native CPU engine

dv - Creates a vector using the double precision floating point native CPU engine

uncomplicate.neanderthal.core

(axpy! alpha x y) - "alpha times x plus y", destructive: multiplies the elements of vector/matrix x by the scalar alpha, then adds the result into vector/matrix y.

mm! - matrix-matrix multiplication. Arities: (mm! alpha a b), (mm! alpha a b c), (mm! alpha a b beta c). Multiplies matrix a by b and scales the product by alpha. The result is put into whichever of a/b is the GE matrix; if c is supplied, the result is put there instead. If the scalar beta is supplied, c is first scaled by it.

(mrows a) Returns the number of rows of the matrix a.

mv! - matrix-vector multiplication. (mv! m1 x1 y) multiplies matrix m1 by vector x1 and adds the result into vector y.

(ncols a) Returns the number of columns of the matrix a.

rk! - rank-1 update. (rk! alpha x y a) multiplies vector x with the transposed vector y (an outer product), scales the resulting matrix by alpha, and adds the result into a.

uncomplicate.neanderthal.vect-math

fmax - keeps the max value of each pair of elements from 2 vectors: (let [v1 (dv [1 2 3]) v2 (dv [0 2 7])] (fmax v1 v2)) ;; => [1.00 2.00 7.00]
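
These are BLAS-style operations; as a rough NumPy analogue (my sketch, not from the Neanderthal docs):

import numpy as np

alpha, beta = 2.0, 0.5
x, y = np.array([1.0, 2.0, 3.0]), np.array([0.0, 2.0, 7.0])
a = np.ones((3, 3))
b = np.ones((3, 3))
c = np.zeros((3, 3))

axpy = alpha * x + y             # axpy!: alpha times x plus y
mm = beta * c + alpha * (a @ b)  # mm! with alpha, a, b, beta, c
mv = a @ x + y                   # mv!: matrix-vector product added to y
rk = a + alpha * np.outer(x, y)  # rk!: scaled outer product added to a
fm = np.maximum(x, y)            # fmax: elementwise max of two vectors
print(fm)                        # => [1. 2. 7.]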

kaggle tutorial