Revisit linear model spec #269

dfsnow · 2024-12-06T22:46:17Z

Per the ask from @ccao-jardine, let's give the baseline linear model a good once-over to see if we can improve its performance and assumptions. You can see the current model specification here. It uses the Tidymodels recipes framework to perform pre-processing steps. You can find the list of available transformations/steps in the recipes reference. You can find even more possible steps here.

You'll probably want to get started by just editing the recipe specification, then re-running the linear model locally and checking RMSE (or other performance metrics) against the unaltered recipe. I would take a sample to get started to make it run a bit faster.
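To make that comparison loop concrete, here is a minimal, language-agnostic sketch in Python. It is not the pipeline's R code: `rmse`, `sample_rows`, and the toy numbers are all hypothetical. The only point is that the baseline and the altered recipe get scored on the same sampled rows.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def sample_rows(rows, frac, seed=2024):
    """Take a reproducible random subset of rows to speed up iteration."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * frac))
    return rng.sample(rows, k)


# Hypothetical toy data: (sale price, baseline prediction, altered-recipe prediction)
rows = [
    (200_000, 210_000, 205_000),
    (350_000, 330_000, 345_000),
    (125_000, 140_000, 130_000),
    (500_000, 470_000, 490_000),
]

# Score both specifications on the same subset so the comparison is fair
subset = sample_rows(rows, frac=0.5)
actual = [r[0] for r in subset]
baseline_rmse = rmse(actual, [r[1] for r in subset])
altered_rmse = rmse(actual, [r[2] for r in subset])
print(f"baseline RMSE: {baseline_rmse:.0f}, altered RMSE: {altered_rmse:.0f}")
```

Fixing the seed keeps the sample identical between runs, so a change in RMSE reflects the recipe change rather than a different draw.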

Some possible things to look at:

Interactions - We've barely done any exploration here (only sqft * township indicator), but there are a lot of possible interactions in this model. Start with basic ones like yrblt * sqft, # beds * township, etc.
Alternative to one-hot encoding - Right now, the categorical variables are one-hot encoded, resulting in super wide (high P) input data. It's worth trying some other encodings like mean/median, hashing, etc.
Imputation - We currently use the simplest possible imputation strategy (mode for categoricals, median for numeric). Let's try the bagging imputation built into recipes.
Transforms - It's probably worth transforming and normalizing some of the more skewed numeric features e.g. the ACS vars.
Engine - Right now, the linear model just uses the basic lm function as a backend. That means no regularization. Something like glmnet would probably perform better and be faster.
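As one illustration of the encoding idea above, here is a small Python sketch of mean encoding, where each categorical level is replaced by the average outcome seen for that level in the training data. This is a simplified, hypothetical version (function names and toy data are made up, and there is no smoothing for rare levels), not how recipes or any of its extension packages implement it.

```python
from collections import defaultdict


def fit_mean_encoding(categories, targets):
    """Learn a category -> mean(target) mapping from training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return means, global_mean


def apply_mean_encoding(categories, means, global_mean):
    """Replace each category with its learned mean; unseen levels get the global mean."""
    return [means.get(cat, global_mean) for cat in categories]


# Hypothetical toy training data
train_township = ["A", "A", "B", "B", "B", "C"]
train_price = [100.0, 200.0, 300.0, 300.0, 600.0, 50.0]

means, global_mean = fit_mean_encoding(train_township, train_price)

# "D" never appears in training, so it falls back to the global mean
encoded = apply_mean_encoding(["A", "B", "D"], means, global_mean)
print(encoded)
```

The result is a single numeric column no matter how many levels the variable has, versus one column per level with one-hot encoding. The mapping should be fit on training data only, otherwise the outcome leaks into the predictors.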


ssaurbier · 2024-12-20T15:55:43Z

Can you state the goal here?

"it could be fun to try make the linear model really good"

Is this the goal? If you are trying to predict, there is no point in using a linear model. Is there an assigned task to perform inference? If so, please create an inferential model issue and I will take it.

For inference, I also highly recommend a Bayesian approach, like lace: https://github.com/promised-ai/lace. Joint priors will be critical in this housing context, and efforts to linearize this model would verge on procrustean.

Still - it is not clear to me why linear models would be pursued in the first place - fiddling with feature engineering does not move the needle for prediction, and I have not seen any inferential issues.

Please advise

dfsnow · 2024-12-20T16:04:59Z

The linear model included in the pipeline is purely for reference. It's only used for comparison to the boosted tree model. Making the model specification better is just a low-priority training task for our junior employees.

dfsnow · 2025-02-07T21:45:55Z

@SiennaWang12 I filled in some of the details on this issue! Let me know if you have any questions.

dfsnow added the "method" label (ML technique or method change) Dec 6, 2024

dfsnow assigned Damonamajor and wagnerlmichael Dec 6, 2024

dfsnow assigned SiennaWang12 and unassigned Damonamajor and wagnerlmichael Feb 4, 2025
