-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Add best-title picking logic in mergeCandidates method #12872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
cbe9d0a
50b2a65
6dc434c
ca6c832
0fe2a26
351c508
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -161,17 +161,67 @@ private void fetchData(BibEntry candidate, StandardField field, IdBasedFetcher f | |||||
} | ||||||
|
||||||
private static BibEntry mergeCandidates(Stream<BibEntry> candidates) { | ||||||
final BibEntry entry = new BibEntry(); | ||||||
candidates.forEach(entry::mergeWith); | ||||||
// Convert the stream to a list so we can iterate over the list twice | ||||||
List<BibEntry> candidateList = candidates.toList(); | ||||||
|
||||||
// Retain online links only | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Keep this comment (move it down to line 194) |
||||||
List<LinkedFile> onlineLinks = entry.getFiles().stream().filter(LinkedFile::isOnlineLink).toList(); | ||||||
|
||||||
BibEntry entry = new BibEntry(); | ||||||
|
||||||
// Score titles to find the "best" title among candidates | ||||||
int bestTitleScore = -1; | ||||||
String bestTitle = null; | ||||||
|
||||||
for (BibEntry candidate : candidateList) { | ||||||
Optional<String> candidateTitle = candidate.getField(StandardField.TITLE); | ||||||
if (candidateTitle.isPresent()) { | ||||||
int score = calculateTitleScore(candidateTitle.get()); | ||||||
if (score > bestTitleScore) { | ||||||
bestTitleScore = score; | ||||||
bestTitle = candidateTitle.get(); | ||||||
} | ||||||
} | ||||||
} | ||||||
|
||||||
// Merge all fields from the candidates, same as previous method | ||||||
candidateList.forEach(entry::mergeWith); | ||||||
|
||||||
// Override the best title we found | ||||||
if (bestTitle != null) { | ||||||
entry.setField(StandardField.TITLE, bestTitle); | ||||||
} | ||||||
|
||||||
|
||||||
List<LinkedFile> onlineLinks = entry.getFiles().stream() | ||||||
.filter(LinkedFile::isOnlineLink) | ||||||
.toList(); | ||||||
entry.clearField(StandardField.FILE); | ||||||
entry.addFiles(onlineLinks); | ||||||
|
||||||
return entry; | ||||||
} | ||||||
|
||||||
private static int calculateTitleScore(String title) { | ||||||
//for every word in the title, plus one point | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Comment on the why - i think, the comment is off. |
||||||
int wordCount = title.trim().split("\\s+").length; | ||||||
if(wordcount > 35){ | ||||||
wordcount = -2; //super long titles are less favourable | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is really a magic numer. If there are 100 words - what makes the difference to 98 words? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the feedback. This could easily be changed with new logic. |
||||||
} | ||||||
|
||||||
//if the title ends in .ccc or .cccc where c is any alphabetic char, minus 10 points | ||||||
int endsInExtension= title.matches(".*\\.[a-zA-Z]{3,4}") ? -10 : 0; | ||||||
Comment on lines
+210
to
+211
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is magic - why this? What is your test data? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Catch any file extension endings not caught by the other logic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is your test data? I never saw titles of papers having file extensions. Try to use Google Scholar and search for some papers to get an impression... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The file you gave in the original example had a title which ends in such file extension, and would be marked down by our scoring system. "Microsoft Word - ieee_on_how_we_teach_jul_01.docx" Other such files have this sometimes too, like ending in .pdf or .word ect. The scoring system definitely could use some work but it will correctly mark down these types of "bad" titles There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Filenames should not be parsed as title, agreed. This part of the code is not good to generally rule out titles containing file endings though. The artificial limititation to 3 or 4 characters is odd. There are file endings that have less and more than just 3 or 4 characters. See https://en.wikipedia.org/wiki/List_of_file_signatures for a list of file extensions. If you remove that part with 3 or 4 characters, then you will penalize all titles containing a dot. While it is unlikely that titles contain dots, it cannot be ruled out. Imagine a scientific with a title such as "What it means to use a . as a character in your file signature and are there any better options?" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess the same argument could be made to avoid hardcoding well known file extensions, such as So we now know there will definitely be edge cases, if this is merged and also if it is not merged and it is about choosing the lesser evil. I did a Google Scholar search for titles containing a |
||||||
|
||||||
int endsWithFileExtension = 0; | ||||||
|
||||||
if (title.matches("(?i).*(\\.(pdf|docx?|txt|jpg|png))$")){ | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Is the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RegEx. Pattern.compile should be used There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What I mean: docx? matches both doc and docx, which are both valid Word extensions. However, this functionality is not needed here. Only in very, very seldom cases, titles contain file names. -- Maybe the students found some interesting cases; but then hey should provide test cases |
||||||
//Check for some common file extensions, remove points if contains these common filepath endings. | ||||||
endsWithFileExtension = -10; // subtract ten more points for file extension ending, very undesirable. | ||||||
} | ||||||
return wordcount + endsinExtension + endsWithFileExtension; | ||||||
} | ||||||
|
||||||
|
||||||
|
||||||
|
||||||
/** | ||||||
* Imports the BibTeX data from the given PDF file and relativized the paths of each linked file based on the context and the file preferences. | ||||||
*/ | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "why" is missing.
May be better: Move the "mergeWith" to your loop. Comment that you need to write to a variable and therefore streams cannot be used easily