[Slot Filling] Improve data augmentation to make sure possible tag transitions are well represented #728

Open
ClemDoum opened this issue Dec 14, 2018 · 0 comments

@ClemDoum
Collaborator

Problem description

Short description

Under certain conditions, some CRF tag transitions can be missing or underrepresented after data augmentation.
We must ensure that every possible tag transition appears in the augmented dataset, so that inference does not fail systematically on the corresponding examples.

Example

Given a dataset with 1 intent and 3 slots: slot_1, slot_2, slot_3

Suppose only 5% of the utterances in the dataset follow the pattern bla bla [slot_1] [slot_2] bla bla, and only 5% of the slot_1 entity values have length 1 (the other 95% have length 2). Then, when augmenting the data, the probability that a generated utterance contains the transition B-slot-1 B-slot-2 is 5% * 5% = 0.0025, so this transition will probably be missing from the training data.
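To make the arithmetic concrete, here is a minimal sketch that computes the per-utterance probability and the chance that the transition never shows up in an augmented set. It assumes the pattern and the entity value are sampled independently, and the augmented dataset size `n_augmented` is a made-up number used only for illustration.

```python
# Figures taken from the example above.
p_pattern = 0.05  # share of utterances where [slot_1] is directly followed by [slot_2]
p_len_1 = 0.05    # share of slot_1 entity values that are one token long

# Assumption: the pattern and the entity value are drawn independently.
p_transition = p_pattern * p_len_1


def prob_transition_absent(n_utterances, p=p_transition):
    """Probability that none of n independently generated utterances
    contains the B-slot-1 -> B-slot-2 transition."""
    return (1.0 - p) ** n_utterances


n_augmented = 200  # hypothetical size of the augmented dataset
print("per-utterance probability: %.4f" % p_transition)
print("chance the transition is absent from %d utterances: %.2f"
      % (n_augmented, prob_transition_absent(n_augmented)))
```

With these assumed numbers the transition is more likely to be absent than present, which is the situation described above.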

Now say slot_1 has the value word_1 and slot_2 has the value word_2 word_3. If the CRF sees "word_1 word_2 word_3", it will tag it as "B-slot-1 I-slot-1 B-slot-2" instead of "B-slot-1 B-slot-2 I-slot-2", because it has never seen the B-slot-1 B-slot-2 transition in the training data.
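The failure can be seen by listing the transitions each tagging relies on. The sketch below is only an illustration: `tag_transitions` is a hypothetical helper, and `training_tags` is an assumed augmented utterance in which slot_1 always has a length 2 value.

```python
def tag_transitions(tags):
    """Return the set of adjacent (previous_tag, next_tag) pairs in a tag sequence."""
    return set(zip(tags, tags[1:]))


# Assumed training utterance: slot_1 only ever appears with a two-token value.
training_tags = ["O", "B-slot-1", "I-slot-1", "B-slot-2", "I-slot-2", "O"]
seen = tag_transitions(training_tags)

expected = ["B-slot-1", "B-slot-2", "I-slot-2"]   # correct tagging of "word_1 word_2 word_3"
predicted = ["B-slot-1", "I-slot-1", "B-slot-2"]  # tagging produced by the CRF

# Transitions each tagging needs that never occur in the training data.
unseen_in_expected = tag_transitions(expected) - seen
unseen_in_predicted = tag_transitions(predicted) - seen

print(unseen_in_expected)
print(unseen_in_predicted)
```

The correct tagging depends on a transition that is absent from the training data, while the wrong one uses only transitions the CRF has already seen.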

Now suppose that, unluckily, users pick the length 1 value of slot_1 95% of the time. The CRF will then fail systematically in 95% * 5% = 4.75% of cases, which is pretty high.

Potential solutions

  • Make sure that all possible tag transitions are in the augmented dataset
  • Boost the proportion of rare tag transitions (this might hurt performance, since the CRF transition weights might be affected :s)
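
As a starting point for the first option, here is a rough sketch of a coverage check. It is not based on the existing augmentation code: all function names are hypothetical, utterance patterns are modelled as lists where `None` stands for one non-slot token, and "possible" is assumed to mean "reachable by filling a pattern from the dataset with any available value length". Slot names are spelled as in the tags above (`slot-1`).

```python
from itertools import product


def bio_tags(slot, length):
    """BIO tags for a chunk of `length` tokens; slot=None means non-slot text."""
    if slot is None:
        return ["O"] * length
    return ["B-" + slot] + ["I-" + slot] * (length - 1)


def tag_transitions(tags):
    """Set of adjacent (previous_tag, next_tag) pairs in a tag sequence."""
    return set(zip(tags, tags[1:]))


def possible_transitions(patterns, value_lengths):
    """Transitions reachable by filling every pattern with every
    combination of entity value lengths."""
    possible = set()
    for pattern in patterns:
        # A non-slot chunk is treated as a single "O" token.
        options = [[1] if chunk is None else sorted(value_lengths[chunk])
                   for chunk in pattern]
        for lengths in product(*options):
            tags = [tag for chunk, n in zip(pattern, lengths)
                    for tag in bio_tags(chunk, n)]
            possible |= tag_transitions(tags)
    return possible


def missing_transitions(patterns, value_lengths, augmented_tag_sequences):
    """Possible transitions that never occur in the augmented dataset."""
    seen = set()
    for tags in augmented_tag_sequences:
        seen |= tag_transitions(tags)
    return possible_transitions(patterns, value_lengths) - seen


# Toy data mirroring the example: slot-1 was only sampled with length 2 values.
patterns = [[None, "slot-1", "slot-2", None]]
value_lengths = {"slot-1": {1, 2}, "slot-2": {1, 2}}
augmented = [
    ["O", "B-slot-1", "I-slot-1", "B-slot-2", "O"],
    ["O", "B-slot-1", "I-slot-1", "B-slot-2", "I-slot-2", "O"],
]
missing = missing_transitions(patterns, value_lengths, augmented)
print(missing)
```

The augmentation could then generate at least one extra utterance for each missing transition. How many to add is the open question raised in the second option, since oversampling rare transitions shifts the CRF transition weights.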