
Decouple tf-idf transform on training set from tf-idf of test set #121

@aschmu

This is not an issue or bug per se with the FeatureHashing package, but I'm wondering whether it's possible to train a model using the tf-idf option with the split function in hashed.model.matrix without computing the tf-idf transform on the combined training + test datasets.
In many realistic scenarios we don't know in advance which words the test set will contain, so the tf-idf weights should be estimated on the training set alone and then applied unchanged to the test set.
Normally, at prediction time, one would keep only the words that appeared in the training set and discard the others when constructing the tf-idf matrix, prior to applying the hashing trick.
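
To make the request concrete, here is a minimal sketch in plain Python of the decoupling I have in mind. It does not use or describe FeatureHashing's API: the names `fit_idf`, `transform`, `bucket` and `N_BUCKETS`, the whitespace tokenizer, the CRC32 hash, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for illustration. The idf weights are estimated on the training documents only, and the same weights are then reused for the test documents, with unseen words dropped before hashing.

```python
import math
import zlib
from collections import Counter

N_BUCKETS = 16  # hypothetical hash size, chosen only for this example


def bucket(token, n_buckets=N_BUCKETS):
    """Map a token to a column index with a deterministic hash (CRC32)."""
    return zlib.crc32(token.encode("utf-8")) % n_buckets


def fit_idf(train_docs):
    """Estimate idf weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each token once per document.
        doc_freq.update(set(doc.split()))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def transform(docs, idf, n_buckets=N_BUCKETS):
    """Build hashed tf-idf rows, reusing idf weights fitted elsewhere."""
    rows = []
    for doc in docs:
        # Discard tokens that never appeared in the training set.
        tokens = [t for t in doc.split() if t in idf]
        row = [0.0] * n_buckets
        for tok, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Colliding tokens simply add up in the same bucket.
            row[bucket(tok, n_buckets)] += tf * idf[tok]
        rows.append(row)
    return rows


train = ["the cat sat", "the dog barked", "the cat purred"]
test = ["the zebra sat"]

idf = fit_idf(train)              # fitted on the training set only
X_train = transform(train, idf)
X_test = transform(test, idf)     # "zebra" is dropped, no refit needed
```

With this split, the test set never influences the document frequencies, so the test matrix can be built at prediction time without access to the full corpus.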
