
Preprocessing Data

Simone Carsey edited this page Aug 14, 2025 · 1 revision

Functions that prepare data for modeling (e.g. train-test split, group shuffle-split) are in modeling_prep.py. Many of them read a .csv data file, such as the ones saved in the Importing Data to Python step above. These .csv files are available in the .\processed_data directory of this repo, and the example code references the copies there.

Example code for a train-test split with county as the grouping variable:

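from modeling_prep import *
import pandas as pd  # explicit import, so pd does not depend on the star import

# Raw string so the backslashes in the Windows path are not read as escape sequences
train_data = pd.read_csv(r'.\processed_data\oregon_train_timeseries.csv', header=0, index_col=1)
split_county_data = county_grouped_shufflesplit(train_data)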
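The internals of county_grouped_shufflesplit are not shown on this page, so the following is only a minimal sketch of the general idea behind a grouped shuffle-split: whole counties are assigned to either the training or the test set, so no county appears in both. The function name grouped_shuffle_split, the column name county, and the toy data are assumptions made for illustration; they are not taken from modeling_prep.py.

```python
import numpy as np
import pandas as pd


def grouped_shuffle_split(df, group_col="county", test_size=0.2, seed=0):
    """Split rows into train/test so that no group appears in both sets.

    Hypothetical helper for illustration; not the repo's implementation.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the distinct group labels, not the individual rows.
    shuffled = rng.permutation(df[group_col].unique())
    # Hold out at least one whole group for testing.
    n_test = max(1, int(round(test_size * len(shuffled))))
    is_test = df[group_col].isin(shuffled[:n_test])
    return df[~is_test], df[is_test]


# Toy data (assumed layout): three rows for each of five county codes.
toy = pd.DataFrame({
    "county": np.repeat([41001, 41003, 41005, 41007, 41009], 3),
    "value": np.arange(15),
})
train, test = grouped_shuffle_split(toy, test_size=0.2, seed=0)
```

With five counties and test_size=0.2, one full county (three rows) is held out and the other four remain in the training set. scikit-learn's GroupShuffleSplit provides the same kind of split with more options.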

For a train-test split without a grouping variable:

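from modeling_prep import *
import pandas as pd  # explicit import, so pd does not depend on the star import

# Raw string so the backslashes in the Windows path are not read as escape sequences
train_data = pd.read_csv(r'.\processed_data\oregon_train_timeseries.csv', header=0, index_col=1)
split_data_dict = train_test_split_default(train_data)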
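For comparison, an ungrouped split assigns individual rows at random, so rows from the same county can land on both sides. The sketch below illustrates that behavior under stated assumptions: the function name simple_train_test_split, the 'train' and 'test' dictionary keys, and the toy data are hypothetical and may not match what train_test_split_default actually returns.

```python
import numpy as np
import pandas as pd


def simple_train_test_split(df, test_size=0.2, seed=0):
    """Randomly assign individual rows to train or test.

    Hypothetical helper for illustration; not the repo's implementation.
    """
    rng = np.random.default_rng(seed)
    n_test = int(round(test_size * len(df)))
    # Use a positional mask so the split also works when the index has duplicates.
    is_test = np.zeros(len(df), dtype=bool)
    is_test[rng.choice(len(df), size=n_test, replace=False)] = True
    return {"train": df[~is_test], "test": df[is_test]}


# Toy data (assumed layout): ten rows with a repeated index, as in a time series.
toy = pd.DataFrame({"value": np.arange(10)}, index=[0, 1, 2, 3, 4] * 2)
split = simple_train_test_split(toy, test_size=0.2, seed=0)
```

The positional mask is a deliberate choice here: dropping test rows by index label would remove extra rows whenever the index is not unique.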

To examine one county in particular, use single_oregon_county. This requires an existing dictionary of DataFrames, e.g. the result of oregon_import().

Regarding single_oregon_county:

| Inputs | Type | Notes | Default Value(s) |
| --- | --- | --- | --- |
| data_dict | dict | A dictionary of DataFrames; typically the result of oregon_import | None |
| county_code | int | FIPS code corresponding to the desired county; must be included in each table of the data_dict provided | None |

Example code for use:

from data_import import *

oregon_data_dict = oregon_import()
# 41067 is the FIPS code for Washington County, Oregon
wa_dict = single_oregon_county(oregon_data_dict, 41067)
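As a rough picture of what a per-county selection over a dictionary of DataFrames can look like, here is a minimal sketch. It assumes each table has a column holding the county FIPS code; the function name filter_tables_by_county, the column name county_fips, and the toy tables are hypothetical and are not taken from the repo.

```python
import pandas as pd


def filter_tables_by_county(data_dict, county_code, county_col="county_fips"):
    """Return a new dict with each table restricted to one county.

    Hypothetical helper for illustration; not the repo's implementation.
    """
    return {
        name: table[table[county_col] == county_code].copy()
        for name, table in data_dict.items()
    }


# Toy tables (assumed layout): every table carries the county code column.
toy_dict = {
    "table_a": pd.DataFrame({"county_fips": [41067, 41067, 41051], "x": [1, 2, 3]}),
    "table_b": pd.DataFrame({"county_fips": [41067, 41051], "y": [10, 20]}),
}
washington = filter_tables_by_county(toy_dict, 41067)
```

The result keeps the same keys as the input dictionary, with only the rows for county 41067 left in each table.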
