Revisit linear model spec #269

Open
dfsnow opened this issue Dec 6, 2024 · 3 comments
Labels
method: ML technique or method change

Comments

@dfsnow
Member

dfsnow commented Dec 6, 2024

Per the ask from @ccao-jardine, let's give the baseline linear model a good once-over to see if we can improve its performance and assumptions. You can see the current model specification here. It uses the Tidymodels recipes framework to perform pre-processing steps. You can find the list of available transformations/steps in the recipes reference. You can find even more possible steps here.

The easiest way to get started is to edit the recipe specification, re-run the linear model locally, and compare RMSE (or other performance metrics) against the unaltered recipe. I'd work from a sample of the data at first so that each run finishes faster.
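The compare-against-baseline loop described here can be sketched in a few lines. The pipeline itself is R/Tidymodels, so this plain-Python version is only a language-neutral illustration of the logic. Everything in it is an assumption made for the example rather than project code: the synthetic `sqft`/price data, the `fit_simple_ols` and `holdout_rmse` helpers, and the square-root transform that stands in for "a change to the recipe".

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_simple_ols(x, y):
    """Fit y = a + b * x by least squares and return a predict function."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda xs: [intercept + slope * xi for xi in xs]


# Hypothetical data: price depends on the square root of sqft, plus noise.
rng = random.Random(42)
full_data = []
for _ in range(20000):
    sqft = rng.uniform(500, 5000)
    price = 1000 * math.sqrt(sqft) + rng.gauss(0, 500)
    full_data.append((sqft, price))

# Work on a sample so each experiment runs quickly.
sample = rng.sample(full_data, 2000)
train, test = sample[:1500], sample[1500:]


def holdout_rmse(transform):
    """Fit on train and score on test after applying `transform` to sqft."""
    predict = fit_simple_ols(
        [transform(s) for s, _ in train], [p for _, p in train]
    )
    return rmse([p for _, p in test], predict([transform(s) for s, _ in test]))


baseline_rmse = holdout_rmse(lambda s: s)  # unaltered specification
altered_rmse = holdout_rmse(math.sqrt)     # candidate change
print(f"baseline: {baseline_rmse:.0f}  altered: {altered_rmse:.0f}")
```

Both specifications are scored on the same held-out rows from the same sample, so any difference in RMSE comes from the change being tested and not from a different draw of the data.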

Some possible things to look at:

  • Interactions - We've barely done any exploration here (only sqft * township indicator), but there are a lot of possible interactions in this model. Start with basic ones like yrblt * sqft, # beds * township, etc.
  • Alternative to one-hot encoding - Right now, the categorical variables are one-hot encoded, resulting in super wide (high P) input data. It's worth trying some other encodings like mean/median, hashing, etc.
  • Imputation - We currently use the simplest possible imputation strategy (mode for categoricals, median for numeric). Let's try the bagging imputation built into recipes.
  • Transforms - It's probably worth transforming and normalizing some of the more skewed numeric features e.g. the ACS vars.
  • Engine - Right now, the linear model just uses the basic lm function as a backend. That means no regularization. Something like glmnet would probably perform better and be faster.
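To make the "Alternative to one-hot encoding" item concrete, here is a minimal sketch contrasting one-hot encoding with mean encoding of a categorical predictor. It is plain Python for illustration only and does not use the recipes steps the pipeline would actually rely on. The `one_hot` and `mean_encode` helpers, the township codes, and the prices are all hypothetical.

```python
from collections import defaultdict


def one_hot(values):
    """Return one indicator column per distinct level (wide output)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels


def mean_encode(values, target):
    """Replace each level with the mean target value observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    level_means = {level: sums[level] / counts[level] for level in sums}
    return [level_means[v] for v in values], level_means


# Hypothetical toy data: a township code and a sale price per row.
township = ["A", "B", "A", "C", "B", "A"]
price = [100.0, 200.0, 120.0, 300.0, 220.0, 110.0]

wide, levels = one_hot(township)
narrow, level_means = mean_encode(township, price)

print(len(wide[0]), "one-hot columns for levels", levels)
print("mean-encoded column:", narrow)
```

One-hot encoding adds a column per level, while mean encoding always yields a single numeric column. Because mean encoding uses the outcome, the level means should be estimated from training rows only, and levels with very few rows will give noisy estimates.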
dfsnow added the "method" label (ML technique or method change) on Dec 6, 2024
@ssaurbier

Can you state the goal here?

"it could be fun to try make the linear model really good"

Is this the goal? If you are trying to predict, there is no point in using a linear model. Is there an assigned task to perform inference? If so, please create an inferential model issue and I will take it.

For inference, I also highly recommend a Bayesian approach, like lace: https://github.com/promised-ai/lace. Joint priors will be critical in this housing context, and efforts to linearize this model would verge on Procrustean.

Still - it is not clear to me why linear models would be pursued in the first place - fiddling with feature engineering does not move the needle for prediction, and I have not seen any inferential issues.

Please advise

@dfsnow
Member Author

dfsnow commented Dec 20, 2024

The linear model included in the pipeline is purely for reference. It's only used for comparison to the boosted tree model. Making the model specification better is just a low-priority training task for our junior employees.

@dfsnow
Member Author

dfsnow commented Feb 7, 2025

@SiennaWang12 I filled in some of the details on this issue! Let me know if you have any questions.
