Per the ask from @ccao-jardine, let's give the baseline linear model a good once-over to see if we can improve its performance and how well it meets its assumptions. You can see the current model specification here. It uses the Tidymodels recipes framework to perform pre-processing steps. You can find the list of available transformations/steps in the recipes reference. You can find even more possible steps here.
You'll probably want to get started by just editing the recipe specification, then re-running the linear model locally and checking RMSE (or other performance metrics) against the unaltered recipe. I would start with a sample of the training data so that each run is a bit faster.
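As a rough illustration of that loop, here is a minimal sketch that fits the same `lm` model with a baseline recipe and an altered recipe on a sample, then compares holdout RMSE. The data is synthetic and the column names (`sale_price`, `sqft`, `yrblt`, `township`) are placeholders, not the pipeline's actual variables or recipe.

```r
library(tidymodels)  # loads recipes, parsnip, workflows, rsample, yardstick, dplyr

set.seed(123)

# Synthetic stand-in for the training data; all column names are hypothetical
n <- 2000
dat <- tibble(
  sqft     = runif(n, 600, 4000),
  yrblt    = sample(1900:2020, n, replace = TRUE),
  township = factor(sample(c("A", "B", "C"), n, replace = TRUE))
) %>%
  mutate(
    sale_price = 50000 + 120 * sqft + 300 * (yrblt - 1900) +
      0.5 * sqft * (yrblt - 1900) + rnorm(n, sd = 20000)
  )

# Work on a sample first so each iteration runs quickly
split <- initial_split(slice_sample(dat, n = 1000), prop = 0.8)
train <- training(split)
test  <- testing(split)

# Baseline: one-hot style dummy variables only
rec_base <- recipe(sale_price ~ ., data = train) %>%
  step_dummy(all_nominal_predictors())

# Altered: the baseline plus one candidate interaction
rec_new <- rec_base %>%
  step_interact(~ yrblt:sqft)

# Fit a plain lm workflow with the given recipe and return holdout RMSE
fit_rmse <- function(rec) {
  workflow() %>%
    add_recipe(rec) %>%
    add_model(linear_reg() %>% set_engine("lm")) %>%
    fit(data = train) %>%
    predict(new_data = test) %>%
    bind_cols(test) %>%
    rmse(truth = sale_price, estimate = .pred)
}

results <- bind_rows(
  baseline = fit_rmse(rec_base),
  altered  = fit_rmse(rec_new),
  .id = "recipe"
)
results
```

Keeping the model and the split fixed while swapping only the recipe means any change in RMSE can be attributed to the pre-processing change.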
Some possible things to look at:
Interactions - We've barely done any exploration here (only sqft * township indicator), but there are a lot of possible interactions in this model. Start with basic ones like yrblt * sqft, # beds * township, etc.
Alternative to one-hot encoding - Right now, the categorical variables are one-hot encoded, resulting in super wide (high P) input data. It's worth trying some other encodings like mean/median target encoding, feature hashing, etc.
Imputation - We currently use the simplest possible imputation strategy (mode for categoricals, median for numeric). Let's try the bagging imputation built into recipes.
Transforms - It's probably worth transforming and normalizing some of the more skewed numeric features e.g. the ACS vars.
Engine - Right now, the linear model just uses the basic lm function as a backend. That means no regularization. Something like glmnet would probably perform better and be faster.
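To make a few of the ideas above concrete, here is a hedged sketch of a recipe that combines bagged-tree imputation, a skew-reducing transform, normalization, and two interactions, then fits a regularized model through the `glmnet` engine. The data is synthetic, every column name (including `acs_income`) is hypothetical, and the `penalty`/`mixture` values are arbitrary placeholders that would need tuning. It is not the pipeline's real specification.

```r
library(tidymodels)  # recipes, parsnip, workflows; the glmnet package must also be installed

set.seed(42)

# Synthetic stand-in data; every column name here is hypothetical
n <- 500
dat <- tibble(
  sqft       = runif(n, 600, 4000),
  yrblt      = sample(1900:2020, n, replace = TRUE),
  beds       = sample(1:6, n, replace = TRUE),
  acs_income = rlnorm(n, meanlog = 11, sdlog = 0.6),  # right-skewed on purpose
  township   = factor(sample(c("A", "B", "C"), n, replace = TRUE))
) %>%
  mutate(
    sale_price = 40000 + 110 * sqft + 15000 * beds +
      0.8 * acs_income + rnorm(n, sd = 25000)
  )

# Punch some holes in the predictors so the imputation step has work to do
dat$sqft[sample(n, 25)] <- NA
dat$township[sample(n, 25)] <- NA

rec <- recipe(sale_price ~ ., data = dat) %>%
  step_impute_bag(all_predictors()) %>%            # bagged-tree imputation
  step_YeoJohnson(acs_income) %>%                  # reduce skew
  step_normalize(all_numeric_predictors()) %>%     # center and scale
  step_dummy(all_nominal_predictors()) %>%         # still one-hot style here
  step_interact(~ yrblt:sqft) %>%
  step_interact(~ beds:starts_with("township_"))   # beds x township indicators

# Elastic net via glmnet; penalty and mixture are placeholder values
mod <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod) %>%
  fit(data = dat)

preds <- predict(wf_fit, new_data = dat)
head(preds)
```

The interactions come after `step_dummy()` because `step_interact()` works on numeric columns, so the township indicators have to exist first. Alternative encodings are left out of the sketch; as far as I know, likelihood-style encodings such as `step_lencode_glm()` live in the separate embed package and `step_dummy_hash()` in textrecipes, so trying them would add a dependency.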
It could be fun to try to make the linear model really good
Is this the goal? If you are trying to predict, there is no point in using a linear model. Is there an assigned task to perform inference? If so, please create an inferential model issue and I will take it.
For inference, I also highly recommend a Bayesian approach, like lace: https://github.com/promised-ai/lace. Joint priors will be critical in this housing context, and efforts to linearize this model would verge on Procrustean.
Still - it is not clear to me why linear models would be pursued in the first place - fiddling with feature engineering does not move the needle for prediction, and I have not seen any inferential issues.
The linear model included in the pipeline is purely for reference. It's only used for comparison to the boosted tree model. Making the model specification better is just a low-priority training task for our junior employees.