Migrate OpenNLP 'ant train-test-models' to Gradle #14198

msfroh · 2025-02-05T01:18:56Z

Description

This resurrects the OpenNLP model training task from the old Ant build (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/analysis/opennlp/build.xml#L52-L84), porting it to Gradle.

I ran the new regenerate task to regenerate the models with the new OpenNLP 2.5.3, which broke some of the chunking tests. I don't really understand the chunking output, so I just updated the test to reflect the new output. (The differences are pretty small.)

Edit: I checked the OpenNLP chunker docs and I think I understand the difference now. I think the older model wasn't properly identifying the verb group. So whereas the first sentence was previously a single long noun group:

Sentence number 1 has 6 words.

Now it's three groups (because the verb phrase gets its own group, which is consistent with the example in the OpenNLP manual):

Sentence number 1
has
6 words.

Resolves #13002

dweiss · 2025-02-06T11:01:42Z

lucene/analysis/opennlp/build.gradle

+  }
+}
+
+tasks.register('regenerate') { dependsOn 'trainTestModels' }


Those regenerate tasks are often more complex than just dependencies - they have inputs and outputs to detect when the generation task can be skipped. Is the output from this opennlp always the same (deterministic)? If so, we could plug it into the regenerate pipeline - I can help, probably. But if it's not deterministic then I'd remove this regenerate line entirely and leave the task to be manually invoked when needed.
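For readers unfamiliar with the inputs/outputs point above, here is a minimal sketch of how a plain Gradle task declares them so that Gradle can skip it when nothing changed. The task name and both directory paths are placeholders invented for illustration; they are not the paths used in this PR, and this shows only stock Gradle up-to-date checking, not Lucene's own regenerate wiring.

```groovy
// Sketch only: 'trainModelsSketch' and both paths are hypothetical.
tasks.register('trainModelsSketch') {
  // Declared inputs: Gradle snapshots these files between runs.
  inputs.dir(file('src/tools/training-data'))   // placeholder path

  // Declared outputs: if inputs and outputs are unchanged since the
  // last successful run, Gradle reports the task as UP-TO-DATE.
  outputs.dir(file('src/test-files/models'))    // placeholder path

  doLast {
    // The actual model training invocations would go here.
  }
}
```

Whether wiring like this is worth having depends on the question asked above: if two runs over identical inputs produce different bytes, any check that compares generated files against committed ones will keep reporting changes.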

Ahh... the LLM that helped me to migrate the Ant build to Gradle (after I spent an hour trying to do it by hand) copied the regenerate task from this line: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/analysis/opennlp/build.xml#L117

I think the output is deterministic, but will change when a) OpenNLP is upgraded again or b) someone updates the training files.

Assuming neither of those happens very often (as has been the case recently), I think leaving it manual makes sense.

Nope -- I checked by running the trainTestModels task again and the model files all had something change. So, I'm going to remove this line and we can stick to manual training as needed. 👍
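One way to repeat that check is to print a digest for each model file, rerun the training task, and compare the two listings. The following is a rough sketch under stated assumptions: the task name, the directory, and the `*.bin` pattern are all hypothetical and would need to match wherever the models actually live.

```groovy
import java.security.MessageDigest

// Sketch only: task name, directory, and include pattern are assumptions.
tasks.register('printModelChecksums') {
  // Collect candidate model files lazily; resolved when the task runs.
  def models = fileTree(dir: 'src/test-files', include: '**/*.bin')

  doLast {
    models.files.sort().each { f ->
      // SHA-256 over the raw file bytes, printed as hex.
      def digest = MessageDigest.getInstance('SHA-256').digest(f.bytes)
      println "${digest.encodeHex()}  ${f.name}"
    }
  }
}
```

Running this before and after a second training run and diffing the output shows which files changed. It does not say why they changed.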

We can retrain models manually if/when needed.

Migrate OpenNLP 'ant train-test-models' to Gradle

6fda640

github-actions bot added the module:analysis label Feb 5, 2025

dweiss reviewed Feb 6, 2025

dweiss assigned dweiss and msfroh Feb 6, 2025

Remove regenerate task

c80d6eb

We can retrain models manually if/when needed.
