[Bug]: type casting outcome_variable and treatment_variable(s) #232

Open
@hjk612

Describe the bug

This is more of a nitpick :) There seems to be an implicit assumption that outcome_variable and treatment_variable(s) are of type float. If we pass DoubleMLData a dataframe in which those columns hold Decimal values, the partialling-out step fails with the error shown below. This comes up especially when reading parquet files.

TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 dml_plr.fit(n_jobs_cv = -1)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml.py:605, in DoubleML.fit(self, n_jobs_cv, store_predictions, external_predictions, store_models)
    602         ext_prediction_dict[learner] = None
    604 # ml estimation of nuisance models and computation of score elements
--> 605 score_elements, preds = self._nuisance_est(self.__smpls, n_jobs_cv,
    606                                            external_predictions=ext_prediction_dict,
    607                                            return_models=store_models)
    609 self._set_score_elements(score_elements, self._i_rep, self._i_treat)
    611 # calculate rmses and store predictions and targets of the nuisance models

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml_plr.py:231, in DoubleMLPLR._nuisance_est(self, smpls, n_jobs_cv, external_predictions, return_models)
    226     g_hat = {'preds': external_predictions['ml_g'],
    227              'targets': None,
    228              'models': None}
    229 else:
    230     # get an initial estimate for theta using the partialling out score
--> 231     psi_a = -np.multiply(d - m_hat['preds'], d - m_hat['preds'])
    232     psi_b = np.multiply(d - m_hat['preds'], y - l_hat['preds'])
    233     theta_initial = -np.nanmean(psi_b) / np.nanmean(psi_a)

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'
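
The root cause can be reproduced without DoubleML at all: Python's `decimal.Decimal` does not support arithmetic with `float`. The following is a minimal sketch using only the standard library; the variable names `d` and `m_hat` only mirror the traceback, and the values are made up for illustration.

```python
from decimal import Decimal

# Illustrative values only: a Decimal treatment value and a float prediction.
d = Decimal("1.25")
m_hat = 0.75

# Mixing Decimal and float in arithmetic raises TypeError.
try:
    residual = d - m_hat
except TypeError as exc:
    message = str(exc)
    print(message)

# Converting the Decimal to float first makes the subtraction well defined.
residual = float(d) - m_hat
print(residual)
```

The same thing happens element-wise when a column of `Decimal` objects is subtracted from an array of float predictions.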

Minimum reproducible code snippet

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from doubleml import DoubleMLData, DoubleMLPLR

# Parquet file whose outcome and treatment columns are of type Decimal
df = pd.read_parquet("/...")

x_cols = [x for x in df.columns if "pre_" in x]
d_col = "event_action"
y_col = "post_outcome"

dml_data = DoubleMLData(df, y_col=y_col, d_cols=d_col, x_cols=x_cols)

learner = RandomForestRegressor(n_jobs=-1)
lasso = LassoCV()
dml_plr = DoubleMLPLR(dml_data, ml_l=learner, ml_g=learner, ml_m=lasso, score="IV-type", n_folds=2)
dml_plr.fit(n_jobs_cv=-1)  # fails with the TypeError shown above

Expected Result

Model should fit successfully.
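
A possible user-side workaround is to cast the affected columns to float before building the DoubleMLData object. This is only a sketch under assumptions: the helper `cast_to_float` is hypothetical (not part of DoubleML), and the small dataframe stands in for the parquet data, reusing the column names from the snippet above.

```python
from decimal import Decimal

import pandas as pd


def cast_to_float(df, cols):
    """Hypothetical helper: return a copy of df with `cols` cast to float64."""
    return df.astype({col: "float64" for col in cols})


# Stand-in for the parquet data: outcome and treatment stored as Decimal.
df = pd.DataFrame({
    "post_outcome": [Decimal("1.5"), Decimal("2.0"), Decimal("0.5")],
    "event_action": [Decimal("1"), Decimal("0"), Decimal("1")],
    "pre_x1": [0.1, 0.2, 0.3],
})

clean = cast_to_float(df, ["post_outcome", "event_action"])
print(clean.dtypes)
```

The resulting `clean` dataframe could then be passed to DoubleMLData in place of `df`. Note that casting Decimal to float may lose precision for values that float64 cannot represent exactly.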

Actual Result

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'

Versions

Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
DoubleML 0.7.1
Scikit-Learn 1.3.2

Labels: bug (Something isn't working)
