Skip to content

Add best-title picking logic in mergeCandidates method #12872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

liamsebestyen
Copy link

@liamsebestyen liamsebestyen commented Apr 1, 2025

Closes #11999

I modified the mergeCanditates method to merge the "best" title candidate. This consisted of also adding a helper method to this process, calculateTitleScore. This method scores titles based on a variety of heuristics, such as file path endings, ending with .(chars), and the number of words contained with a title.

mergeCanditates method effectively merges all candidates, and then overrides the previously merged title with the chosen best title.

Rationale: This reintroduces the ability to pick the best title from multiple candidates, addressing the regression where titles were simply overwritten in a last-wins manner. Now the best title according to the heuristics is selected.

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • [/] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • [/] Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (if change is visible to the user)
  • [/] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • [/] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@jabref-machine
Copy link
Collaborator

Your code currently does not meet JabRef's code guidelines. We use OpenRewrite to ensure "modern" Java coding practices. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / OpenRewrite (pull_request)" and click on it.

The issues found can be automatically fixed. Please execute the gradle task rewriteRun from the rewrite group of the Gradle Tool window in IntelliJ, then check the results, commit, and push.

Copy link

trag-bot bot commented Apr 1, 2025

@trag-bot didn't find any issues in the code! ✅✨

1 similar comment
Copy link

trag-bot bot commented Apr 1, 2025

@trag-bot didn't find any issues in the code! ✅✨

@jabref-machine
Copy link
Collaborator

JUnit tests are failing. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / Unit tests (pull_request)" and click on it.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good first start

  • test cases missing
  • logic needs clean up
  • comment on magic numbers - or remove that magic

@@ -161,17 +161,67 @@ private void fetchData(BibEntry candidate, StandardField field, IdBasedFetcher f
}

private static BibEntry mergeCandidates(Stream<BibEntry> candidates) {
final BibEntry entry = new BibEntry();
candidates.forEach(entry::mergeWith);
// Convert the stream to a list so we can iterate over the list twice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "why" is missing.

May be better: Move the "mergeWith" to your loop. Comment that you need to write to a variable and therefore streams cannot be used easily


// Retain online links only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this comment (move it down to line 194)

entry.clearField(StandardField.FILE);
entry.addFiles(onlineLinks);

return entry;
}

private static int calculateTitleScore(String title) {
//for every word in the title, plus one point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on the why - i think, the comment is off.

//for every word in the title, plus one point
int wordCount = title.trim().split("\\s+").length;
if(wordcount > 35){
wordcount = -2; //super long titles are less favourable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a magic numer. If there are 100 words - what makes the difference to 98 words?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. This could easily be changed with new logic.

Comment on lines +210 to +211
//if the title ends in .ccc or .cccc where c is any alphabetic char, minus 10 points
int endsInExtension= title.matches(".*\\.[a-zA-Z]{3,4}") ? -10 : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is magic - why this? What is your test data?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catch any file extension endings not caught by the other logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your test data?

I never saw titles of papers having file extensions.

Try to use Google Scholar and search for some papers to get an impression...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file you gave in the original example had a title which ends in such file extension, and would be marked down by our scoring system. "Microsoft Word - ieee_on_how_we_teach_jul_01.docx"

Other such files have this sometimes too, like ending in .pdf or .word ect. The scoring system definitely could use some work but it will correctly mark down these types of "bad" titles

Copy link
Member

@ThiloteE ThiloteE Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filenames should not be parsed as title, agreed. This part of the code is not good to generally rule out titles containing file endings though. The artificial limititation to 3 or 4 characters is odd. There are file endings that have less and more than just 3 or 4 characters. See https://en.wikipedia.org/wiki/List_of_file_signatures for a list of file extensions. If you remove that part with 3 or 4 characters, then you will penalize all titles containing a dot. While it is unlikely that titles contain dots, it cannot be ruled out. Imagine a scientific with a title such as "What it means to use a . as a character in your file signature and are there any better options?"

Copy link
Member

@ThiloteE ThiloteE Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the same argument could be made to avoid hardcoding well known file extensions, such as pdf|docx|odt|txt|jpg|png.

So we now know there will definitely be edge cases, if this is merged and also if it is not merged and it is about choosing the lesser evil.

I did a Google Scholar search for titles containing a . and I found none. That means, either none exist or Google filters them out. For us it means, the number of edge cases (especially if ".*\\.[a-zA-Z]{3,4}" is retained) should be fairly low and this rule that adds a small penalization could be ok.

@koppor
Copy link
Member

koppor commented Apr 1, 2025

It seems, the team did not use IntelliJ... - At least, a proper Java compiler is needed. - Please re-check https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace/intellij-11-code-into-ide.html - and take this excersice to level-up your implementation skills. One needs to spend time to get better than an AI.

/home/runner/work/jabref/jabref/src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java:206: error: cannot find symbol
        if(wordcount > 35){
           ^
  symbol:   variable wordcount

@liamsebestyen
Copy link
Author

liamsebestyen commented Apr 1, 2025

Thanks for the feedback. Our team did use IntelliJ, I believe that last point you mentioned regarding the word count was just added when one of our team members tried to fix the code naming convention issues automatically detected when the pull request was created.

Thank you for the feedback, our team can address this feedback.

@WillMohr858
Copy link

Hi, I am with the team that submitted this pull request (sorry for putting the comment in the wrong spot at first, that one has been deleted now). I've attached below the results of some manual testing that we did on the Jabref app.

Before our changes if one launched the app and went File->Import->Import into new library and then attached the se2paper.pdf file they would get the following:

before_change

After our change doing the exact same thing you get the following:

after_change


int endsWithFileExtension = 0;

if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){
Copy link
Member

@ThiloteE ThiloteE Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){
if (title.matches("(?i).*(\\.(pdf|doc|docx|odt|txt|jpg|jpeg|png))$")){

Is the ? a typo?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RegEx.

Pattern.compile should be used

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean: docx? matches both doc and docx, which are both valid Word extensions.

However, this functionality is not needed here. Only in very, very seldom cases, titles contain file names. -- Maybe the students found some interesting cases; but then hey should provide test cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve BibTeX-from-PDF import
6 participants