Skip to content

Add best-title picking logic in mergeCandidates method #12872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Original file line number Diff line number Diff line change
Expand Up @@ -161,17 +161,67 @@ private void fetchData(BibEntry candidate, StandardField field, IdBasedFetcher f
}

private static BibEntry mergeCandidates(Stream<BibEntry> candidates) {
final BibEntry entry = new BibEntry();
candidates.forEach(entry::mergeWith);
// Convert the stream to a list so we can iterate over the list twice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "why" is missing.

May be better: Move the "mergeWith" to your loop. Comment that you need to write to a variable and therefore streams cannot be used easily

List<BibEntry> candidateList = candidates.toList();

// Retain online links only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this comment (move it down to line 194)

List<LinkedFile> onlineLinks = entry.getFiles().stream().filter(LinkedFile::isOnlineLink).toList();

BibEntry entry = new BibEntry();

// Score titles to find the "best" title among candidates
int bestTitleScore = -1;
String bestTitle = null;

for (BibEntry candidate : candidateList) {
Optional<String> candidateTitle = candidate.getField(StandardField.TITLE);
if (candidateTitle.isPresent()) {
int score = calculateTitleScore(candidateTitle.get());
if (score > bestTitleScore) {
bestTitleScore = score;
bestTitle = candidateTitle.get();
}
}
}

// Merge all fields from the candidates, same as previous method
candidateList.forEach(entry::mergeWith);

// Override the best title we found
if (bestTitle != null) {
entry.setField(StandardField.TITLE, bestTitle);
}


List<LinkedFile> onlineLinks = entry.getFiles().stream()
.filter(LinkedFile::isOnlineLink)
.toList();
entry.clearField(StandardField.FILE);
entry.addFiles(onlineLinks);

return entry;
}

private static int calculateTitleScore(String title) {
//for every word in the title, plus one point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on the why - i think, the comment is off.

int wordCount = title.trim().split("\\s+").length;
if(wordcount > 35){
wordcount = -2; //super long titles are less favourable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a magic numer. If there are 100 words - what makes the difference to 98 words?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. This could easily be changed with new logic.

}

//if the title ends in .ccc or .cccc where c is any alphabetic char, minus 10 points
int endsInExtension= title.matches(".*\\.[a-zA-Z]{3,4}") ? -10 : 0;
Comment on lines +210 to +211
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is magic - why this? What is your test data?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catch any file extension endings not caught by the other logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your test data?

I never saw titles of papers having file extensions.

Try to use Google Scholar and search for some papers to get an impression...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file you gave in the original example had a title which ends in such file extension, and would be marked down by our scoring system. "Microsoft Word - ieee_on_how_we_teach_jul_01.docx"

Other such files have this sometimes too, like ending in .pdf or .word ect. The scoring system definitely could use some work but it will correctly mark down these types of "bad" titles

Copy link
Member

@ThiloteE ThiloteE Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filenames should not be parsed as title, agreed. This part of the code is not good to generally rule out titles containing file endings though. The artificial limititation to 3 or 4 characters is odd. There are file endings that have less and more than just 3 or 4 characters. See https://en.wikipedia.org/wiki/List_of_file_signatures for a list of file extensions. If you remove that part with 3 or 4 characters, then you will penalize all titles containing a dot. While it is unlikely that titles contain dots, it cannot be ruled out. Imagine a scientific with a title such as "What it means to use a . as a character in your file signature and are there any better options?"

Copy link
Member

@ThiloteE ThiloteE Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the same argument could be made to avoid hardcoding well known file extensions, such as pdf|docx|odt|txt|jpg|png.

So we now know there will definitely be edge cases, if this is merged and also if it is not merged and it is about choosing the lesser evil.

I did a Google Scholar search for titles containing a . and I found none. That means, either none exist or Google filters them out. For us it means, the number of edge cases (especially if ".*\\.[a-zA-Z]{3,4}" is retained) should be fairly low and this rule that adds a small penalization could be ok.


int endsWithFileExtension = 0;

if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){
Copy link
Member

@ThiloteE ThiloteE Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){
if (title.matches("(?i).*(\\.(pdf|doc|docx|odt|txt|jpg|jpeg|png))$")){

Is the ? a typo?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RegEx.

Pattern.compile should be used

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean: docx? matches both doc and docx, which are both valid Word extensions.

However, this functionality is not needed here. Only in very, very seldom cases, titles contain file names. -- Maybe the students found some interesting cases; but then hey should provide test cases

//Check for some common file extensions, remove points if contains these common filepath endings.
endsWithFileExtension = -10; // subtract ten more points for file extension ending, very undesirable.
}
return wordcount + endsinExtension + endsWithFileExtension;
}




/**
* Imports the BibTeX data from the given PDF file and relativized the paths of each linked file based on the context and the file preferences.
*/
Expand Down