Questions about Interpreting Feature Importance #16
Comments
Hi, thanks for the suggestion on the formula computation - it can definitely be optimized (ideally even moved to GPU). As for the second question, I would refer you to Section 3 (page 5) of the RETAIN paper, specifically formula 5. Perhaps "importance" is bad naming and it should be renamed to "contribution" to the final logit value instead. Raw beta values fail to account for the relative size/value of the embedding and for non-binary values input to the model.
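For concreteness, here is a minimal sketch of what that contribution formula computes, assuming visit_alpha and output_weights roughly correspond to the variables in retain_interpretations.py; visit_beta, emb_k, and x_ik are illustrative placeholders for the beta gates, the feature's embedding column, and its (possibly non-binary) input value:

```python
import numpy as np

# Rough sketch of formula 5: the contribution of feature k at visit i to the
# final logit. visit_alpha is the scalar visit attention, visit_beta the
# per-dimension gates, emb_k the embedding column of feature k, output_weights
# the final linear layer weights, and x_ik the input value of feature k.
# All names except visit_alpha and output_weights are illustrative placeholders.
def contribution(visit_alpha, visit_beta, emb_k, output_weights, x_ik):
    return visit_alpha * np.dot(output_weights, visit_beta * emb_k) * x_ik
```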
Thanks for your quick reply! For the first question, what I actually mean is that values_mask isn't correctly applied to output_scaled. If numeric values aren't used in training, values_mask will be all ones, so it is fine to use alpha_scaled[:, 0] as importance_feature (it is effectively the same as 1 * visit_alpha * output_scaled, i.e. no masking applied). However, if numeric values are used in training, values_mask will not be all ones; it will look like [1, 1, 1, ..., 1, numeric value 1, numeric value 2, ...], and since alpha_scaled is not correctly multiplied by values_mask, there will be a problem. For the second question, I'll go back to the paper and double-check the logic.
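To make this concrete, here is a hypothetical illustration of why the issue goes unnoticed with binary-only inputs (the shapes are assumed, inferred from the (len(values_mask), len(values_mask)) result described below):

```python
import numpy as np

# When values_mask is all ones, every column of the broadcast product is
# identical, so alpha_scaled[:, 0] happens to equal the intended
# element-wise result and the bug is invisible.
values_mask_binary = np.ones(4)
visit_alpha = 0.5
output_scaled = np.array([[0.1], [0.2], [-0.3], [0.05]])  # assumed column vector

alpha_scaled = values_mask_binary * visit_alpha * output_scaled  # shape (4, 4)
assert np.allclose(alpha_scaled[:, 0],
                   visit_alpha * output_scaled.reshape(-1))  # holds only for an all-ones mask
```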
Ah, sorry for the misunderstanding. I am not 100% sure there is an error, since we have validated that the sum of logits matches the prediction score of the model, but it is worth investigating more.
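For anyone wanting to reproduce that validation, this is a rough sketch of the kind of check meant here, assuming per-feature contributions, the output bias, and the model's raw logit are available under these hypothetical names:

```python
import numpy as np

# Hypothetical sanity check: the per-feature contributions should add up
# (plus the output bias, if the model has one) to the model's raw logit.
def check_decomposition(contributions, output_bias, model_logit, tol=1e-5):
    return np.isclose(contributions.sum() + output_bias, model_logit, atol=tol)
```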
Hi!
Thanks for answering my question about temporal data last time. I have some questions about the feature importance implementation this time.
Here is the feature importance shown in df_visit (line 193; retain_interpretations.py):
'importance_feature': alpha_scaled[:, 0]
and alpha_scaled is computed in this way.
alpha_scaled = values_mask * visit_alpha * output_scaled
I think what you originally intended was to first apply the visit attention to output_scaled and then apply the value mask to it. But the product ends up being a 2D matrix with shape (len(values_mask), len(values_mask)), and alpha_scaled[:, 0] only retrieves its first column (see the sketch after the proposed fix below).
Maybe the correct way to compute alpha_scaled is like this:
alpha_scaled = np.multiply(values_mask * visit_alpha, output_scaled.reshape(-1))
and importance_feature would just be
'importance_feature': alpha_scaled
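For reference, a small sketch of the shape issue with made-up numbers, assuming output_scaled is a column vector of per-feature terms and visit_alpha is a scalar:

```python
import numpy as np

values_mask = np.array([1.0, 1.0, 3.2])          # hypothetical: two codes + one numeric value
visit_alpha = 0.5                                 # scalar visit attention
output_scaled = np.array([[0.1], [0.2], [0.3]])   # assumed column vector, shape (3, 1)

# Current computation: broadcasting a (3,) row against a (3, 1) column
# produces a (3, 3) matrix, and [:, 0] only picks its first column.
alpha_scaled = values_mask * visit_alpha * output_scaled
print(alpha_scaled.shape)   # (3, 3)
print(alpha_scaled[:, 0])   # [0.05 0.1  0.15] -- the mask value 3.2 is never applied

# Proposed fix: flatten output_scaled so the product stays element-wise.
alpha_fixed = np.multiply(values_mask * visit_alpha, output_scaled.reshape(-1))
print(alpha_fixed)          # [0.05 0.1  0.48]
```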
Besides the implementation question, I am also confused about why feature importance is computed this way. I don't understand why beta_scaled is dotted with output_weights and the mask is then applied on top of it. I thought beta_scaled could already represent the importance along the embedding dimension, and that summing it up would give the feature importance.
Thanks for your time! Hope you can help answer the questions.
Best,