Handling categorical variables with a large number of levels #1482

Open
parsifal9 opened this issue Feb 21, 2025 · 0 comments

Hi All,

I have a data set whose variables are a mix of numeric and factor variables, several with hundreds of levels.
I can one-hot encode the factor variables, but this greatly increases the number of variables and reduces the amount of information available for each level.

What I would do in ranger is use the option respect.unordered.factors = TRUE. This orders the levels of the factor
according to the mean of the response variable. It is known that this gives the same result as the "partition" method if it is done at
every node. In ranger it is only done once, at the root node, but it still seems to give a good approximation and is the recommended method.
Where that option is not available, I can build an X matrix myself, with each factor variable encoded the way respect.unordered.factors = TRUE would encode it, and use that matrix to fit the random forest.
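
As a minimal sketch of that manual encoding (written in Python rather than R purely for illustration; the function names order_levels_by_mean and encode are hypothetical, and this is not ranger's internal code), each level is replaced by its rank when the levels are sorted by mean response:

```python
from collections import defaultdict


def order_levels_by_mean(levels, response):
    """Return a dict mapping each factor level to an integer rank,
    where rank 1 is the level with the smallest mean response."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lev, y in zip(levels, response):
        sums[lev] += y
        counts[lev] += 1
    means = {lev: sums[lev] / counts[lev] for lev in sums}
    # Ties in the mean are broken by the level name so the result is deterministic.
    ordered = sorted(means, key=lambda lev: (means[lev], str(lev)))
    return {lev: rank for rank, lev in enumerate(ordered, start=1)}


def encode(levels, mapping):
    """Replace each level by its rank, giving one numeric column."""
    return [mapping[lev] for lev in levels]


# Toy data (made up for the example): one factor column and a response.
levels = ["b", "a", "c", "a", "b", "c"]
y = [3.0, 1.0, 10.0, 2.0, 5.0, 8.0]

mapping = order_levels_by_mean(levels, y)
codes = encode(levels, mapping)
print(mapping)
print(codes)
```

The same function could take any per-row score in place of y, so the ordering variable can be swapped without changing the encoding step. Levels that do not appear in the training data would need separate handling, which this sketch does not attempt.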

However, causal_forest requires numeric variables. It first fits two regression forests,
forest.Y and forest.W (to estimate Y and W), and then calls causal_train to estimate the causal effect.

So far I have considered

  1. making an X matrix with the factor variables ordered on the response Y; this seems right for the forest.Y regression
  2. the forest.W regression is only predicting two levels, so the same encoded matrix should be fine
  3. the causal_train forest is more complex, and I am not sure how to handle the factor variables there.

Alternatively

  1. I use ranger with respect.unordered.factors = TRUE to estimate tau
  2. I then make a matrix with the factor levels ordered by the estimated tau
  3. I use that matrix in causal_forest

Does either of these approaches seem right? What are other people doing?
