-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Add best-title picking logic in mergeCandidates method #12872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ethod, and modified mergeCanditates method.
src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java
Outdated
Show resolved
Hide resolved
…adataImporter.java Co-authored-by: Ethan S <[email protected]>
…adataImporter.java Co-authored-by: Ethan S <[email protected]>
…adataImporter.java Code case fixes Co-authored-by: Ethan S <[email protected]>
…adataImporter.java fix code case Co-authored-by: Ethan S <[email protected]>
Your code currently does not meet JabRef's code guidelines. We use OpenRewrite to ensure "modern" Java coding practices. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / OpenRewrite (pull_request)" and click on it. The issues found can be automatically fixed. Please execute the gradle task |
@trag-bot didn't find any issues in the code! ✅✨ |
1 similar comment
@trag-bot didn't find any issues in the code! ✅✨ |
JUnit tests are failing. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / Unit tests (pull_request)" and click on it. You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good first start
- test cases missing
- logic needs clean up
- comment on magic numbers - or remove that magic
@@ -161,17 +161,67 @@ private void fetchData(BibEntry candidate, StandardField field, IdBasedFetcher f | |||
} | |||
|
|||
private static BibEntry mergeCandidates(Stream<BibEntry> candidates) { | |||
final BibEntry entry = new BibEntry(); | |||
candidates.forEach(entry::mergeWith); | |||
// Convert the stream to a list so we can iterate over the list twice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "why" is missing.
May be better: Move the "mergeWith" to your loop. Comment that you need to write to a variable and therefore streams cannot be used easily
|
||
// Retain online links only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep this comment (move it down to line 194)
entry.clearField(StandardField.FILE); | ||
entry.addFiles(onlineLinks); | ||
|
||
return entry; | ||
} | ||
|
||
private static int calculateTitleScore(String title) { | ||
//for every word in the title, plus one point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on the why - i think, the comment is off.
//for every word in the title, plus one point | ||
int wordCount = title.trim().split("\\s+").length; | ||
if(wordcount > 35){ | ||
wordcount = -2; //super long titles are less favourable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really a magic numer. If there are 100 words - what makes the difference to 98 words?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback. This could easily be changed with new logic.
//if the title ends in .ccc or .cccc where c is any alphabetic char, minus 10 points | ||
int endsInExtension= title.matches(".*\\.[a-zA-Z]{3,4}") ? -10 : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is magic - why this? What is your test data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catch any file extension endings not caught by the other logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your test data?
I never saw titles of papers having file extensions.
Try to use Google Scholar and search for some papers to get an impression...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file you gave in the original example had a title which ends in such file extension, and would be marked down by our scoring system. "Microsoft Word - ieee_on_how_we_teach_jul_01.docx"
Other such files have this sometimes too, like ending in .pdf or .word ect. The scoring system definitely could use some work but it will correctly mark down these types of "bad" titles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filenames should not be parsed as title, agreed. This part of the code is not good to generally rule out titles containing file endings though. The artificial limititation to 3 or 4 characters is odd. There are file endings that have less and more than just 3 or 4 characters. See https://en.wikipedia.org/wiki/List_of_file_signatures for a list of file extensions. If you remove that part with 3 or 4 characters, then you will penalize all titles containing a dot. While it is unlikely that titles contain dots, it cannot be ruled out. Imagine a scientific with a title such as "What it means to use a . as a character in your file signature and are there any better options?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the same argument could be made to avoid hardcoding well known file extensions, such as pdf|docx|odt|txt|jpg|png
.
So we now know there will definitely be edge cases, if this is merged and also if it is not merged and it is about choosing the lesser evil.
I did a Google Scholar search for titles containing a .
and I found none. That means, either none exist or Google filters them out. For us it means, the number of edge cases (especially if ".*\\.[a-zA-Z]{3,4}"
is retained) should be fairly low and this rule that adds a small penalization could be ok.
It seems, the team did not use IntelliJ... - At least, a proper Java compiler is needed. - Please re-check https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace/intellij-11-code-into-ide.html - and take this excersice to level-up your implementation skills. One needs to spend time to get better than an AI.
|
Thanks for the feedback. Our team did use IntelliJ, I believe that last point you mentioned regarding the word count was just added when one of our team members tried to fix the code naming convention issues automatically detected when the pull request was created. Thank you for the feedback, our team can address this feedback. |
Hi, I am with the team that submitted this pull request (sorry for putting the comment in the wrong spot at first, that one has been deleted now). I've attached below the results of some manual testing that we did on the Jabref app. Before our changes if one launched the app and went File->Import->Import into new library and then attached the se2paper.pdf file they would get the following: After our change doing the exact same thing you get the following: |
|
||
int endsWithFileExtension = 0; | ||
|
||
if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){ | |
if (title.matches("(?i).*(\\.(pdf|doc|docx|odt|txt|jpg|jpeg|png))$")){ |
Is the ?
a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RegEx.
Pattern.compile should be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean: docx? matches both doc and docx, which are both valid Word extensions.
However, this functionality is not needed here. Only in very, very seldom cases, titles contain file names. -- Maybe the students found some interesting cases; but then hey should provide test cases
Closes #11999
I modified the mergeCanditates method to merge the "best" title candidate. This consisted of also adding a helper method to this process, calculateTitleScore. This method scores titles based on a variety of heuristics, such as file path endings, ending with .(chars), and the number of words contained with a title.
mergeCanditates method effectively merges all candidates, and then overrides the previously merged title with the chosen best title.
Rationale: This reintroduces the ability to pick the best title from multiple candidates, addressing the regression where titles were simply overwritten in a last-wins manner. Now the best title according to the heuristics is selected.
Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if change is visible to the user)