Description
Describe the bug
This is more of a nitpick :) I think there is an implicit assumption that the outcome_variable and treatment_variable(s) are of type float. If we pass DoubleMLData a dataframe in which those variables are of type Decimal, the partialling out step fails with the error shown below. This comes up especially when reading parquet files.
Minimum reproducible code snippet
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from doubleml import DoubleMLData, DoubleMLPLR

# Parquet file whose outcome and treatment columns are stored as Decimal
df = pd.read_parquet("/...")
x_cols = [x for x in df.columns if "pre_" in x]
d_col = "event_action"
y_col = "post_outcome"
dml_data = DoubleMLData(df, y_col=y_col, d_cols=d_col, x_cols=x_cols)

learner = RandomForestRegressor(n_jobs=-1)
lasso = LassoCV()
dml_plr = DoubleMLPLR(dml_data, ml_l=learner, ml_g=learner, ml_m=lasso, score="IV-type", n_folds=2)
dml_plr.fit(n_jobs_cv=-1)  # raises the TypeError shown under "Actual Result"
Expected Result
Model should fit successfully.
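In the meantime, a possible workaround on the user side is to cast the Decimal columns to float before building DoubleMLData. The sketch below is only an illustration: the toy dataframe stands in for the parquet data (values are made up, "pre_feature" is a hypothetical covariate name), and decimal_columns_to_float is a hypothetical helper, not part of DoubleML or pandas.

```python
from decimal import Decimal

import pandas as pd

# Toy frame standing in for the parquet data. Column names follow the
# snippet above where possible; the values are made up.
df = pd.DataFrame({
    "post_outcome": [Decimal("1.5"), Decimal("2.25"), Decimal("0.75")],
    "event_action": [Decimal("1"), Decimal("0"), Decimal("1")],
    "pre_feature": [0.1, 0.2, 0.3],
})


def decimal_columns_to_float(frame):
    """Return a copy in which object columns holding only Decimal values are float64.

    Hypothetical helper for illustration. Columns containing anything other
    than Decimal instances (including None) are left untouched.
    """
    out = frame.copy()
    for col in out.columns:
        is_object = out[col].dtype == object
        if is_object and out[col].map(lambda v: isinstance(v, Decimal)).all():
            out[col] = out[col].astype(float)
    return out


df_float = decimal_columns_to_float(df)
print(df_float.dtypes)
```

Passing df_float instead of df to DoubleMLData should then avoid the Decimal/float arithmetic, at the cost of the usual float rounding.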
Actual Result
TypeError Traceback (most recent call last)
Cell In[36], line 1
----> 1 dml_plr.fit(n_jobs_cv = -1)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml.py:605, in DoubleML.fit(self, n_jobs_cv, store_predictions, external_predictions, store_models)
602 ext_prediction_dict[learner] = None
604 # ml estimation of nuisance models and computation of score elements
--> 605 score_elements, preds = self._nuisance_est(self.__smpls, n_jobs_cv,
606 external_predictions=ext_prediction_dict,
607 return_models=store_models)
609 self._set_score_elements(score_elements, self._i_rep, self._i_treat)
611 # calculate rmses and store predictions and targets of the nuisance models
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml_plr.py:231, in DoubleMLPLR._nuisance_est(self, smpls, n_jobs_cv, external_predictions, return_models)
226 g_hat = {'preds': external_predictions['ml_g'],
227 'targets': None,
228 'models': None}
229 else:
230 # get an initial estimate for theta using the partialling out score
--> 231 psi_a = -np.multiply(d - m_hat['preds'], d - m_hat['preds'])
232 psi_b = np.multiply(d - m_hat['preds'], y - l_hat['preds'])
233 theta_initial = -np.nanmean(psi_b) / np.nanmean(psi_a)
TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'
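The same error can be reproduced without DoubleML, which suggests the cause is plain NumPy arithmetic between an object array of Decimal values and a float array. A minimal sketch, assuming the treatment column reaches the score computation as an object-dtype array (the names d and m_hat_preds mirror the traceback, the numbers are made up):

```python
from decimal import Decimal

import numpy as np

# Treatment values as they might arrive from a Decimal-typed column:
# an object-dtype array holding decimal.Decimal instances.
d = np.array([Decimal("1.0"), Decimal("0.0"), Decimal("1.0")], dtype=object)

# Nuisance predictions are ordinary float64 values (made-up numbers).
m_hat_preds = np.array([0.7, 0.2, 0.9])

try:
    d - m_hat_preds  # element-wise Decimal - float is not defined
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Casting to float first makes the same arithmetic work.
residuals = d.astype(float) - m_hat_preds
print(residuals)
```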
Versions
Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
DoubleML 0.7.1
Scikit-Learn 1.3.2