Simplify BaseEncoder mapping #403

@julia-kraus

Description

Current State
The class variable mapping of the BaseEncoder class is a dictionary with the following items:

  • 'col' -> column name (str)
  • 'mapping' -> pd.Series containing the mapping from category to encoding

Suggestion
Avoid pandas objects and just use dictionaries instead. These are faster, easier to read and easier to manipulate, and mappings could be chained more easily. Moreover, in many of the encoders the mappings are converted to dicts anyway.
Use a nested dictionary of the structure {colname: {'cat_a': mapping_val_a, 'cat_b': mapping_val_b, 'cat_c': mapping_val_c, ...}}

Actually the whole encoder could just yield the mapping dictionary, because the column names can be retrieved by mapping.keys().
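
To make the proposal concrete, here is a minimal sketch of converting the current structure into the suggested nested dictionary. It assumes the encoder holds one {'col': ..., 'mapping': ...} entry per column in a list; the column names, categories, and the helper to_nested_dict are made up for illustration and are not part of the library.

```python
import pandas as pd

# Hypothetical example of the current structure: one entry per column,
# each with a 'col' name and a 'mapping' pd.Series (category -> encoding).
current_mapping = [
    {"col": "color", "mapping": pd.Series({"red": 1, "green": 2, "blue": 3})},
    {"col": "size", "mapping": pd.Series({"S": 1, "M": 2, "L": 3})},
]


def to_nested_dict(mapping):
    """Collapse the per-column entries into {colname: {category: encoding}}."""
    return {item["col"]: item["mapping"].to_dict() for item in mapping}


nested = to_nested_dict(current_mapping)

# The column names no longer need a separate 'col' entry.
columns = list(nested.keys())

# Encoding a few values is a plain dictionary lookup.
encoded = [nested["color"][value] for value in ["green", "red"]]
```

With this layout, looking up a column's mapping is nested[colname] instead of a search through the entries for a matching 'col'.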

This would really help development and readability, but it would take work to update and test all subclasses. I think it is unnecessary for users to be able to supply both dictionaries and pd.Series as mappings, because they can run pd.Series.to_dict() themselves.
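
For users who already have a pd.Series, the conversion is a one-liner. The sketch below is only an illustration; the column name and categories are invented.

```python
import pandas as pd

# A user-supplied mapping that happens to be a pd.Series.
user_series = pd.Series({"low": 0, "medium": 1, "high": 2})

# Convert it once on the user's side and pass a plain nested dict instead.
user_mapping = {"priority": user_series.to_dict()}
```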
