Questions about Interpreting Feature Importance #16

Closed
fredchen000 opened this issue Aug 6, 2020 · 3 comments

Comments

@fredchen000

Hi!

Thanks for answering my question about temporal data last time. I have some questions about the feature importance implementation this time.

Here is the feature importance shown in df_visit (line 193 of retain_interpretations.py):
'importance_feature': alpha_scaled[:, 0]
and alpha_scaled is computed this way:
alpha_scaled = values_mask * visit_alpha * output_scaled

I think what you originally intended was to first apply the visit attention to output_scaled and then apply the value mask to the result.
But the expression above ends up as a 2D matrix of shape (len(values_mask), len(values_mask)), and alpha_scaled[:, 0] only retrieves its first column.

Maybe the correct way to compute alpha_scaled would be something like this:
alpha_scaled = np.multiply(values_mask * visit_alpha, output_scaled.reshape(-1))
and importance_feature would just be
'importance_feature': alpha_scaled
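
To make the shape issue concrete, here is a small self-contained sketch using my assumed shapes (values_mask a 1-D array of length n, visit_alpha a scalar attention weight, output_scaled a column vector of shape (n, 1); the real shapes in retain_interpretations.py may differ):

import numpy as np

n = 4
values_mask = np.array([1.0, 1.0, 0.5, 2.3])  # binary features first, then numeric values
visit_alpha = 0.7
output_scaled = np.random.rand(n, 1)

current = values_mask * visit_alpha * output_scaled  # (n,) * (n, 1) broadcasts to (n, n)
print(current.shape)  # (4, 4)
# current[:, 0] equals values_mask[0] * visit_alpha * output_scaled[:, 0],
# i.e. only the mask of the first feature gets applied, to every feature.

proposed = np.multiply(values_mask * visit_alpha, output_scaled.reshape(-1))
print(proposed.shape)  # (4,)
# proposed[k] equals values_mask[k] * visit_alpha * output_scaled[k, 0]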

Besides the implementation question, I am also confused about why the feature importance is computed this way. I don't understand why beta_scaled is dotted with output_weights and the mask is then applied on top. I thought beta_scaled already represents the importance along the embedding dimension, so summing over that dimension should give the feature importance.

Thanks for your time! I hope you can help answer these questions.

Best,

@tRosenflanz
Contributor

Hi, thanks for the suggestion on the formula computation - it can definitely be optimized (ideally even moved to the GPU).

As for the second question, I would refer you to section 3 (page 5) of the RETAIN paper, specifically formula 5. Perhaps "importance" is a poor name and it should be renamed to "contribution" to the final logit value instead. Raw beta values fail to account for the relative size/value of the embedding and for non-binary values input to the model.
Let me know if you have specific questions about the logic outlined in the paper.
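
For concreteness, here is a rough single-visit sketch of that contribution formula (the variable names are only illustrative, not the ones used in retain_interpretations.py):

import numpy as np

emb_dim, n_features = 8, 4
W_emb = np.random.rand(emb_dim, n_features)  # embedding matrix
w_out = np.random.rand(emb_dim)              # output (logit) weights
beta = np.random.rand(emb_dim)               # beta attention for this visit
alpha = 0.7                                  # alpha attention for this visit
x = np.array([1.0, 1.0, 0.5, 2.3])           # input values (binary and numeric)

# contribution of feature k to this visit's logit: alpha * w_out . (beta * W_emb[:, k]) * x[k]
contribution = np.array([alpha * w_out @ (beta * W_emb[:, k]) * x[k]
                         for k in range(n_features)])
# Summing raw beta over the embedding dimension would ignore w_out and the input
# values, which is why the contributions are computed this way.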

@fredchen000
Author

Thanks for your quick reply!

For the first question, what I actually mean is that values_mask isn't correctly applied to output_scaled. If numeric values aren't used during training, values_mask is all ones, so using alpha_scaled[:, 0] as importance_feature is fine (it is effectively 1 * visit_alpha * output_scaled, i.e. no masking at all). However, if numeric values are used during training, values_mask is not all ones; it looks like [1, 1, 1, ..., 1, numeric value 1, numeric value 2, ...], and since alpha_scaled does not correctly multiply with values_mask, the result will be wrong.
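
A tiny sketch of the two cases (again just with my assumed shapes: values_mask of shape (n,), visit_alpha a scalar, output_scaled of shape (n, 1)):

import numpy as np

n = 4
visit_alpha = 0.7
output_scaled = np.random.rand(n, 1)

# No numeric values: values_mask is all ones, so column 0 happens to be correct.
mask_binary = np.ones(n)
col0 = (mask_binary * visit_alpha * output_scaled)[:, 0]
assert np.allclose(col0, visit_alpha * output_scaled[:, 0])

# With numeric values: column 0 silently ignores the mask.
mask_numeric = np.array([1.0, 1.0, 0.5, 2.3])
col0 = (mask_numeric * visit_alpha * output_scaled)[:, 0]
fixed = np.multiply(mask_numeric * visit_alpha, output_scaled.reshape(-1))
assert not np.allclose(col0, fixed)
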
I'm not entirely sure I understand the code correctly, but never mind, I'll just close the issue. Thanks again for your reply.

For the second question, I'll refer to the paper again and double-check the logic behind it.

@tRosenflanz
Contributor

Ah, sorry for the misunderstanding. I am not 100% sure there is an error, since we have validated that the sum of the logits matches the model's prediction score, but it is worth investigating further.
