Cockatiel #149
base: master
Conversation
Signed-off-by: Frederic Boisnard <[email protected]>
This patch separates the methods dedicated to image processing/plotting into separate subclasses, so that Cockatiel and the NLP-related classes do not have to inherit them.
Currently the fit() method of Craft uses a hardcoded sampler by default. This patch allows Craft to use a sampler passed as a parameter.
Force-pushed the branch from f266d96 to 120052a.
Force-pushed the branch from 120052a to 3326a23.
The conflicts must come from the first patch, "test moving the EPSILON & similar to free some RAM". I will remove it from the PR since it was mostly an attempt to improve RAM management on my side (I am not sure we want to take it, and if we do, it would be better in a separate PR).
I have only read part of the PR for now. We will have to discuss the architecture of the code. It seems that Cockatiel forces us to introduce NLP attributions.
@@ -16,4 +16,5 @@
from .object_detector import BoundingBoxesExplainer
from .global_sensitivity_analysis import SobolAttributionMethod, HsicAttributionMethod
from .gradient_statistics import SmoothGrad, VarGrad, SquareGrad
from .nlp_occlusion import NlpOcclusion
We should discuss where to place this, but in my view, it should not be in the same folder as the other attribution methods. My suggestions:
xplique.attributions.nlp.occlusion
xplique.nlp.attributions.occlusion
@@ -40,10 +40,6 @@ class GradCAMPP(GradCAM):
        If a string is provided it will look for the layer name.
        """

# Avoid zero division during procedure. (the value is not important, as if the denominator is
# zero, then the nominator will also be zero).
EPSILON = tf.constant(1e-4)
Why was it necessary to move this?
def explain(self,
            sentence: str,
            words: List[str],
I would allow an easy way not to specify words so that all the words are occluded, such as passing None.
I think we should talk about this one also because it requires sentence-splitting, and there are many possibilities. Do we include punctuation? Do we split possessive parts? Do we allow the user to decide?
In my view, we should find a simple library that does this and refer to their API for possible parameters. However, this raises the question: do we include such a library in Xplique's initial requirements, or do we create an additional set of NLP requirements?
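The `None` case discussed above could be sketched as follows; the regex splitter and the `words=None` default are illustrative assumptions, not the PR's actual API:

```python
import re

# Hedged sketch: if `words` is None, derive the word list from the sentence
# itself so that every word gets occluded. The `resolve_words` name and the
# regex-based splitter are assumptions, not the PR's behaviour.
def resolve_words(sentence, words=None):
    if words is None:
        # \w+(?:'\w+)? keeps possessives ("cat's") together and drops punctuation.
        words = re.findall(r"\w+(?:'\w+)?", sentence)
    return words

resolve_words("The cat's bowl is empty.")
# → ["The", "cat's", "bowl", "is", "empty"]
```

A real implementation would likely delegate this splitting to a dedicated NLP library, which is exactly the requirements question raised above.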
perturbated_words = NlpOcclusion._apply_masks(words, masks)

perturbated_sentences = [sentence]
perturbated_sentences.extend(
Here the perturbed sentences lack words from the initial sentence: you have just concatenated the perturbed words. For example, if I give the following inputs:
sentence = "My cat is in the garden."
words = ["cat", "garden"]
separator = " "
the perturbed_sentences value will be ["My cat is in the garden.", "garden", "cat"].
I do not think this is what we expect.
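A minimal sketch of the behaviour one would expect instead, under the assumption that a perturbed sentence should keep the non-occluded words in their original positions; all names here are illustrative, not the PR's code:

```python
# Build each perturbed sentence by removing the occluded words from the full
# sentence, instead of concatenating only the perturbed words.
def perturb_sentence(sentence, words, mask, separator=" "):
    occluded = {w for w, keep in zip(words, mask) if not keep}
    tokens = sentence.split(separator)
    # Strip trailing punctuation before comparing, so "garden." matches "garden".
    return separator.join(t for t in tokens if t.strip(".,!?") not in occluded)

sentence = "My cat is in the garden."
words = ["cat", "garden"]
perturbed = [perturb_sentence(sentence, words, m) for m in [(0, 1), (1, 0)]]
# → ["My is in the garden.", "My cat is in the"]
```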
def explain(self,
            sentence: str,
            words: List[str],
            separator: str) -> np.ndarray:
How do you handle the case where the user provides sentences with punctuation? This may completely change the meaning of a sentence. If you just concatenate words with spaces between them, the final sentence may not be readable.
return WordExtractor()
if extract_fct == "excerpt":
    return ExcerptExtractor()
raise ValueError("Extraction function can be only 'clause', \
You did not include "flaire_clause" and "spacy_clause" in the error message, but you left "clause", which is not a possibility.
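A sketch of the corrected error path, assuming the supported values are the ones mentioned in this thread ("word", "excerpt", "flaire_clause", "spacy_clause"); the helper name and the exact set of values are assumptions, not the PR's code:

```python
# Hypothetical helper: validate extract_fct and raise an error that lists the
# actual possibilities ("clause" alone is not one of them).
VALID_EXTRACT_FCTS = ("word", "excerpt", "flaire_clause", "spacy_clause")

def check_extract_fct(extract_fct):
    if extract_fct not in VALID_EXTRACT_FCTS:
        raise ValueError(
            f"Extraction function can only be one of {VALID_EXTRACT_FCTS}, "
            f"got '{extract_fct}'."
        )
```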
Factory for extractor classes.
"""
@staticmethod
def get_extractor(extract_fct="sentence", tagger=None, pipeline=None, clause_type=None):
I think that language should be an argument of this function.
@@ -7,7 +7,7 @@
import tensorflow as tf

from ..types import Callable, Optional
from ..utils_functions.object_detection import _box_iou, _format_objects, _EPSILON
I agree that the epsilon should not be loaded at the same time as the operators. But is it a good idea to initialize it at each run of the tf function?
Parameters
----------
tokenizer
    The tokenizer function to be used for tokenization. It must be
This description needs to be clarified as no model is defined here. Does this correspond to what you define in the nlp_commons?
If it corresponds to the tokenizer-model pair, as in HuggingFace, then I recommend having a class that treats them both at the same time. It is easier to understand and the API would be more natural.
from ..types import Union, Tuple, Callable, List, Dict


class NlpPreprocessor(ABC):
This class is called NLP... Should it not be in the nlp commons?
You wrote NlpProcessor; is it not more natural to call it NLPProcessor, as NLP is an acronym?
Great work Fred, this is a huge PR! A complex one at that, which raises several points where we should all agree before merging. I listed several general remarks:
- There is no description of the PR; it would be great to clarify your choices and what you are trying to do.
- There is no notebook or tutorial; I cannot review the plot function or the API without examples. Furthermore, one will be necessary for the final version.
- I cannot launch the tests (from GitHub); I think it is because of the conflicts.
- You did not base your code on the last version of master (1.3.3). I think the memory problems you had were solved by the last pull requests; rebasing should resolve the conflicts at the same time.
- You did not write the documentation; it might be better to discuss the choices made before writing it. It will save you some time.
- You redid a lot of indentation (automatically, I think), but it created many weird code parts. The line limit is 100 in Xplique.
- We will have to discuss two choices:
  - Where do we put the NLP parts? In my view an xplique.nlp module is pertinent.
  - The structure of the code (multiple inheritance).
- The code you added introduces a lot of dependencies on other libraries (flair, nltk, transformers (for tests)); we should discuss how to treat this. Maybe have an additional nlp_requirements.txt file to install when calling xplique.nlp? Furthermore, we should check the licenses of these libraries; they should be MIT, otherwise we cannot use them.
self,
input_to_latent_model: Callable,
latent_to_logit_model: Callable,
preprocessor: NlpPreprocessor,
Here it seems that the user has to wrap their tokenizer using NlpPreprocessor. Is it possible for the user to give us their tokenizer and for us to wrap it ourselves? Otherwise, it adds a step for the user. I would prefer the API to be as simple as possible.
Yes, I understand your concern; it is possible, but one of the goals of NlpPreprocessor is to gather both the tokenizer and its parameters in one place, in order to 'proxy' the calls to the tokenizer. Then we can use this feature in Cockatiel but also in external parts such as nlp_batch_predict():
- cockatiel._latent_predict() calls self.preprocessor.tokenize(), which passes the right arguments to the tokenizer
- in torch_operations.nlp_batch_predict(preprocessor), we use preprocessor.preprocess(), which again calls preprocessor.tokenize() to pass the right arguments to the tokenizer
If we pass the tokenizer alone to cockatiel.init(), then we will have to provide a variable number of additional parameters (tokenizer arguments, which can vary from one tokenizer to another) and duplicate these parameters in nlp_batch_predict().
Or perhaps there is a better or easier way to implement this?
Maybe I should rename NlpPreprocessor to something like TokenizerHelper to better show that it simply encapsulates tokenizer calls?
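The idea described above could be sketched like this; TokenizerHelper, its methods, and the dummy tokenizer are illustrative names, not the actual PR code:

```python
# Hypothetical helper that binds a tokenizer to its keyword arguments, so every
# call site (Cockatiel, nlp_batch_predict, ...) gets the same pre-bound settings
# without duplicating them.
class TokenizerHelper:
    def __init__(self, tokenizer, **tokenizer_kwargs):
        self.tokenizer = tokenizer
        self.tokenizer_kwargs = tokenizer_kwargs

    def tokenize(self, texts):
        # Forward the pre-bound arguments on every call.
        return self.tokenizer(texts, **self.tokenizer_kwargs)

# Usage with a dummy tokenizer standing in for e.g. a HuggingFace one:
def dummy_tokenizer(texts, padding=False, truncation=False):
    return [t.split() for t in texts]

helper = TokenizerHelper(dummy_tokenizer, padding=True, truncation=True)
tokens = helper.tokenize(["My cat is in the garden"])
```

This keeps the tokenizer arguments in one place, which is the stated goal; whether the wrapping happens inside Cockatiel's init or is done by the user is the open API question above.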
super().__init__(input_to_latent_model, latent_to_logit_model,
                 number_of_concepts, batch_size, device=device)
self.preprocessor = preprocessor
self.patch_extractor = ExcerptExtractor()
You are forcing the patch_extractor here; should it not be an argument of the init function?
self.preprocessor = preprocessor
self.patch_extractor = ExcerptExtractor()

def _latent_predict(self, inputs: List[str], resize=None) -> torch.Tensor:
It seems that in this function you make very similar computations to torch_operations.nlp_batch_predict. What is the reason for not using it, or extracting the common part between the two?
    return self._to_np_array(activations)

def _extract_patches(self, inputs: List[str]) -> Tuple[List[str], np.ndarray]:
The name is not clear enough because it also computes embeddings. I suggest using _crop_and_embed, _split_text_and_embed, or _from_inputs_to_latent_patches.
Solved in previous pull requests, you should rebase on the last version of master.
Same, previously solved.
assert np.array_equal(masks, expected_mask)

def test_apply_masks():
I detected a problem with the Occlusion explain function. If you change the _apply_mask method so that it also takes the sentence as input and returns the perturbed_inputs, then you will be able to test it more precisely.
import torch.nn as nn

from torch.nn import MSELoss
from transformers import RobertaPreTrainedModel, RobertaModel
This creates quite a big dependency for Xplique's tests. The way to manage dev requirements should also be discussed.
nb_excerpts = 2
nb_most_important_concepts = 2

best_sentences_per_concept = cockatiel_explainer_pos.get_best_excerpts_per_concept(
How can you be sure of which sentence will correspond to which concept?
First PR about the integration of Cockatiel in Xplique.