Add best-title picking logic in mergeCandidates method #12872

liamsebestyen · 2025-04-01T03:31:29Z

Closes #11999

I modified the mergeCandidates method to merge the "best" title candidate. This also required adding a helper method, calculateTitleScore, which scores titles using several heuristics, such as whether the title ends like a file path, whether it ends with a dot followed by a few characters (.(chars)), and the number of words in the title.

The mergeCandidates method merges all candidates and then overrides the previously merged title with the chosen best title.

Rationale: This reintroduces the ability to pick the best title from multiple candidates, addressing the regression where titles were simply overwritten in a last-wins manner. Now the best title according to the heuristics is selected.
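The selection described above can be sketched roughly as follows. This is a minimal illustration and not the code from this pull request: the class name `TitleScoreSketch`, the method `pickBest`, the penalty of 10, and the cap of 35 rewarded words are all assumptions chosen for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical stand-in for the scoring helper; all weights are illustrative only.
final class TitleScoreSketch {
    // A title ending in a dot plus 3 or 4 letters (for example ".docx") looks like a file name.
    private static final Pattern FILE_LIKE_ENDING = Pattern.compile(".*\\.[a-zA-Z]{3,4}");
    private static final int FILE_LIKE_PENALTY = 10;
    // Words beyond this count earn no extra points (an assumption, not the PR's rule).
    private static final int MAX_REWARDED_WORDS = 35;

    static int score(String title) {
        String trimmed = title.trim();
        if (trimmed.isEmpty()) {
            return Integer.MIN_VALUE; // an empty title never wins
        }
        int words = trimmed.split("\\s+").length;
        int score = Math.min(words, MAX_REWARDED_WORDS);
        if (FILE_LIKE_ENDING.matcher(trimmed).matches()) {
            score -= FILE_LIKE_PENALTY;
        }
        return score;
    }

    // Returns the highest-scoring title, or an empty Optional if there are no candidates.
    static Optional<String> pickBest(List<String> titles) {
        return titles.stream().max(Comparator.comparingInt(TitleScoreSketch::score));
    }
}
```

With these assumed weights, a file-name-like candidate such as "Microsoft Word - ieee_on_how_we_teach_jul_01.docx" scores below any ordinary multi-word title, so the ordinary title would be chosen.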

Mandatory checks

I own the copyright of the code submitted and I license it under the MIT license
[/] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
[/] Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (if change is visible to the user)
[/] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
[/] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

jabref-machine · 2025-04-01T06:33:07Z

Your code currently does not meet JabRef's code guidelines. We use OpenRewrite to ensure "modern" Java coding practices. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / OpenRewrite (pull_request)" and click on it.

The issues found can be automatically fixed. Please execute the gradle task rewriteRun from the rewrite group of the Gradle Tool window in IntelliJ, then check the results, commit, and push.

trag-bot · 2025-04-01T06:33:30Z

@trag-bot didn't find any issues in the code! ✅✨

trag-bot · 2025-04-01T06:33:34Z

@trag-bot didn't find any issues in the code! ✅✨

jabref-machine · 2025-04-01T06:34:27Z

JUnit tests are failing. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / Unit tests (pull_request)" and click on it.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

koppor

Good first start

test cases missing
logic needs clean up
comment on magic numbers - or remove that magic

koppor · 2025-04-01T06:33:57Z

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java

@@ -161,17 +161,67 @@ private void fetchData(BibEntry candidate, StandardField field, IdBasedFetcher f
    }

    private static BibEntry mergeCandidates(Stream<BibEntry> candidates) {
-        final BibEntry entry = new BibEntry();
-        candidates.forEach(entry::mergeWith);
+        // Convert the stream to a list so we can iterate over the list twice


The "why" is missing.

Maybe better: move the "mergeWith" call into your loop, and add a comment that you need to write to a variable and therefore streams cannot easily be used.
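One way to read this suggestion is a single loop that merges each candidate and tracks the best title at the same time. The sketch below is only an illustration under stated assumptions: a `Map<String, String>` stands in for BibEntry, `putIfAbsent` stands in for mergeWith (the real merge semantics may differ), and `score` is a placeholder for calculateTitleScore.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical sketch: plain maps replace BibEntry so the example is self-contained.
final class MergeLoopSketch {
    // Placeholder scoring (more words is better); stands in for calculateTitleScore.
    static int score(String title) {
        return title.trim().split("\\s+").length;
    }

    static Map<String, String> mergeCandidates(Stream<Map<String, String>> candidates) {
        Map<String, String> merged = new LinkedHashMap<>();
        String bestTitle = null;
        int bestScore = Integer.MIN_VALUE;

        // A plain loop instead of Stream.forEach: bestTitle and bestScore are reassigned,
        // and a lambda may only capture effectively final local variables.
        Iterator<Map<String, String>> iterator = candidates.iterator();
        while (iterator.hasNext()) {
            Map<String, String> candidate = iterator.next();
            // Assumed merge rule: the first value seen for a field is kept.
            candidate.forEach(merged::putIfAbsent);
            String title = candidate.get("title");
            if (title != null && score(title) > bestScore) {
                bestScore = score(title);
                bestTitle = title;
            }
        }
        // Override whatever title the merge produced with the best-scoring one.
        if (bestTitle != null) {
            merged.put("title", bestTitle);
        }
        return merged;
    }
}
```

Because the stream is consumed once through its iterator, there is no need to collect it into a list first.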

koppor · 2025-04-01T06:34:20Z

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java


-        // Retain online links only


Keep this comment (move it down to line 194)

koppor · 2025-04-01T06:34:38Z

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java

        entry.clearField(StandardField.FILE);
        entry.addFiles(onlineLinks);

        return entry;
    }

+    private static int calculateTitleScore(String title) {
+        //for every word in the title, plus one point


Comment on the why. I think the comment is off.

koppor · 2025-04-01T06:35:06Z

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java

+        //for every word in the title, plus one point
+        int wordCount = title.trim().split("\\s+").length;
+        if(wordcount > 35){
+            wordcount = -2; //super long titles are less favourable


This is really a magic number. If there are 100 words, what makes the difference to 98 words?

Thank you for the feedback. This could easily be changed with new logic.

koppor · 2025-04-01T06:35:27Z

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java

+        //if the title ends in .ccc or .cccc where c is any alphabetic char, minus 10 points
+        int endsInExtension= title.matches(".*\\.[a-zA-Z]{3,4}") ? -10 : 0;


This is magic - why this? What is your test data?

Catch any file extension endings not caught by the other logic.

What is your test data?

I never saw titles of papers having file extensions.

Try to use Google Scholar and search for some papers to get an impression...

The file you gave in the original example had a title that ends in such a file extension and would be marked down by our scoring system: "Microsoft Word - ieee_on_how_we_teach_jul_01.docx"

Other such files sometimes have this too, for example titles ending in .pdf or .word, etc. The scoring system could definitely use some work, but it will correctly mark down these types of "bad" titles.

Filenames should not be parsed as title, agreed. This part of the code is not good for generally ruling out titles containing file endings, though. The artificial limitation to 3 or 4 characters is odd. There are file endings that have fewer or more than 3 or 4 characters. See https://en.wikipedia.org/wiki/List_of_file_signatures for a list of file extensions. If you remove that part with 3 or 4 characters, then you will penalize all titles containing a dot. While it is unlikely that titles contain dots, it cannot be ruled out. Imagine a scientific paper with a title such as "What it means to use a . as a character in your file signature and are there any better options?"

I guess the same argument could be made to avoid hardcoding well known file extensions, such as pdf|docx|odt|txt|jpg|png.

So we now know there will definitely be edge cases, whether or not this is merged, and it is about choosing the lesser evil.

I did a Google Scholar search for titles containing a . and I found none. That means, either none exist or Google filters them out. For us it means, the number of edge cases (especially if ".*\\.[a-zA-Z]{3,4}" is retained) should be fairly low and this rule that adds a small penalization could be ok.
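The trade-off discussed in this thread can be made concrete by running both kinds of rule against a few titles. The sketch below is illustrative only: the class `ExtensionRegexComparison`, the explicit extension list, and the sample titles other than the two quoted in this thread are assumptions, not test data from the pull request.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two rules discussed in this thread.
final class ExtensionRegexComparison {
    // The broad rule from the diff: the title ends in a dot followed by 3 or 4 letters.
    static final Pattern BROAD = Pattern.compile(".*\\.[a-zA-Z]{3,4}");
    // An explicit list of well-known extensions (the list itself is illustrative).
    static final Pattern EXPLICIT =
            Pattern.compile(".*\\.(pdf|docx?|odt|txt|jpe?g|png)", Pattern.CASE_INSENSITIVE);

    static boolean broad(String title) {
        return BROAD.matcher(title).matches();
    }

    static boolean explicit(String title) {
        return EXPLICIT.matcher(title).matches();
    }
}
```

Under these assumptions, "Microsoft Word - ieee_on_how_we_teach_jul_01.docx" is flagged by both rules, and the example title containing " . " in the middle is flagged by neither, because both rules only look at the end of the string. A title such as "Building Web Applications with ASP.NET" would be flagged by the broad rule but not by the explicit list, while a name like "backup.gz" slips past both, which matches the point about extensions shorter than 3 characters.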

koppor · 2025-04-01T06:37:58Z

It seems the team did not use IntelliJ. At the very least, a proper Java compiler is needed. Please re-check https://devdocs.jabref.org/getting-into-the-code/guidelines-for-setting-up-a-local-workspace/intellij-11-code-into-ide.html and take this exercise to level up your implementation skills. One needs to spend time to get better than an AI.

/home/runner/work/jabref/jabref/src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java:206: error: cannot find symbol
        if(wordcount > 35){
           ^
  symbol:   variable wordcount

liamsebestyen · 2025-04-01T06:54:26Z

Thanks for the feedback. Our team did use IntelliJ. I believe the word count error you mentioned was introduced when one of our team members tried to fix the naming convention issues that were automatically detected when the pull request was created.

Our team can address this feedback.

WillMohr858 · 2025-04-01T06:59:42Z

Hi, I am with the team that submitted this pull request (sorry for putting the comment in the wrong spot at first; that one has been deleted now). I've attached below the results of some manual testing that we did on the JabRef app.

Before our changes, if one launched the app, went to File->Import->Import into new library, and then attached the se2paper.pdf file, they would get the following:

After our change doing the exact same thing you get the following:

ThiloteE · 2025-04-02T14:08:46Z

src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java

+
+        int endsWithFileExtension = 0;
+
+        if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){


Suggested change

if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){

if (title.matches("(?i).*(\\.(pdf|doc|docx|odt|txt|jpg|jpeg|png))$")){

Is the ? a typo?

RegEx.

Pattern.compile should be used

What I mean: docx? matches both doc and docx, which are both valid Word extensions.

However, this functionality is not needed here. Only in very, very rare cases do titles contain file names. -- Maybe the students found some interesting cases; but then they should provide test cases.
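To illustrate the two points raised in this thread (the optional `x` in `docx?` and the use of Pattern.compile), here is a small sketch. The class and method names are hypothetical; the two regular expressions are the ones quoted above, minus the leading `.*` group, which is not needed when `find()` is used.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: both patterns are compiled once and reused,
// whereas String.matches compiles the regex again on every call.
final class SuggestedPatternSketch {
    // Pattern from the PR: "docx?" makes the final x optional, so ".doc" and ".docx" both match.
    static final Pattern FROM_PR =
            Pattern.compile("\\.(pdf|docx?|txt|jpg|png)$", Pattern.CASE_INSENSITIVE);
    // Pattern from the suggested change: spells out doc and docx, and adds odt and jpeg.
    static final Pattern SUGGESTED =
            Pattern.compile("\\.(pdf|doc|docx|odt|txt|jpg|jpeg|png)$", Pattern.CASE_INSENSITIVE);

    static boolean endsWithExtension(Pattern pattern, String title) {
        // find() combined with the trailing $ checks only the end of the trimmed title.
        return pattern.matcher(title.trim()).find();
    }
}
```

Both patterns accept "thesis.doc" and "thesis.DOCX", so the `?` is not a typo. The practical difference is that only the suggested pattern also covers ".odt" and ".jpeg".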

koppor · 2025-05-12T20:55:25Z

Closing this issue due to inactivity 💤 Please ping us if you intend to resume work on this one.

Fixed issue where poor title is selected. added calculateTitleScore m…

cbe9d0a

…ethod, and modified mergeCanditates method.

eswain99 reviewed Apr 1, 2025 (four times, each on src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMetadataImporter.java; all four threads are marked outdated)

liamsebestyen and others added 5 commits March 31, 2025 23:28

Update src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMet…

50b2a65

…adataImporter.java Co-authored-by: Ethan S <[email protected]>

Update src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMet…

6dc434c

…adataImporter.java Co-authored-by: Ethan S <[email protected]>

Update src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMet…

ca6c832

…adataImporter.java Code case fixes Co-authored-by: Ethan S <[email protected]>

Update src/main/java/org/jabref/logic/importer/fileformat/PdfMergeMet…

0fe2a26

…adataImporter.java fix code case Co-authored-by: Ethan S <[email protected]>

Merge branch 'main' into issue-11999

351c508

koppor requested changes Apr 1, 2025

ThiloteE reviewed Apr 2, 2025

koppor closed this May 12, 2025
