One of the most challenging issues arising from those sample queries is misspellings. To accommodate frequent misspellings, you need to handle domain-specific terms that are not English dictionary words (e.g. tnf). For instance, you could spell-check only English words, but then you need to distinguish general English misspellings from non-English terms like 'tnf'.
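A minimal sketch of that classification step in Python, assuming a plain English wordlist and a hand-curated list of domain terms (both filenames are hypothetical placeholders):

```python
# Sketch: separate likely English misspellings from domain-specific terms.
# Assumes english-words.txt (one dictionary word per line) and
# domain-terms.txt (e.g. "tnf", "ampk") exist; both filenames are hypothetical.

def load_wordset(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

english = load_wordset("english-words.txt")
domain = load_wordset("domain-terms.txt")

def classify(term):
    t = term.lower()
    if t in english:
        return "english"
    if t in domain:
        return "domain"              # e.g. 'tnf': not English, but not a typo
    return "possible-misspelling"    # only this bucket goes to spell check

for term in ["receptor", "sizure", "tnf", "propofol"]:
    print(term, "->", classify(term))
```

Only the "possible-misspelling" bucket would be routed to a spelling corrector; note that drug names like "propofol" are also absent from English dictionaries, so the domain list has to cover more than acronyms.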
Target Users
General
Our target users are biological and medical researchers, that is, “domain experts”. Their primary need is to stay informed about advances in their research field. They may also be curious about notable efforts at the margins of, or outside, their expertise, whether because such work could inspire their own or because of the attention it attracts, but these interests are secondary.
These researchers are very familiar with PubMed and use it regularly to search for articles. Because they are interested in collecting information related to a topic, they would typically use PubMed by submitting “informational queries”, unlike “navigational queries”, which are used to locate a specific record or set. It is important to note that, like most PubMed visitors, our target users have no background or specialized training in databases, knowledge representation, or search.
Given these needs, when our researchers have a set of search criteria, they would expect to receive a set of articles ordered by date, akin to a feed. It would not necessarily be problematic for a system to return zero new results (the outcome of roughly 10% of PubMed queries). Domain experts are also quite tolerant of noise in sets of informational search hits, because they appreciate the qualitative difference between results from bibliographic indexes and results from, say, Google. Indeed, their expectations are unlike those for Google or Amazon, where one expects a ranked list of possibly millions of results.
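As a concrete illustration of this feed-style expectation, here is a minimal sketch that asks PubMed for its newest matches via the public NCBI E-utilities esearch endpoint (the query term is just a sample from Table 1; error handling and rate limiting are omitted):

```python
# Sketch: fetch PubMed IDs for a query, newest first, like a feed.
# Uses the public NCBI E-utilities esearch endpoint; "delta opioid receptor"
# is simply one of the sampled queries from Table 1.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "db": "pubmed",
    "term": "delta opioid receptor",
    "sort": "pub_date",    # order results by publication date
    "retmax": "20",        # a feed-sized page, not millions of hits
    "retmode": "json",
})
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

ids = data["esearchresult"]["idlist"]
print(f"{len(ids)} articles:", ids)    # an empty list is a valid outcome
```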
Our users are accustomed to submitting queries consisting of a few terms. As with PubMed and general search engines, the overwhelming majority (>90%) of users submit a single query in a brief session, with only a minority going on to refine an initial search. Target users are error-prone: a significant proportion of input search terms are misspelled (10-20%), in line with observations from PubMed. On average, target users enter three (3) terms in a typical query, and the majority (70%) enter up to four (4) terms to define the content they see. Being non-experts in search with informational needs, they will not entertain annotating terms with specialized tags or syntax (e.g. operators) in queries, and are typically not interested in advanced search functions. All of this notwithstanding, it should be noted that a large proportion of PubMed queries in general include an author name (35%).
Individual Users
Rather than define prototypical users a priori and derive hypothetical queries, we do the inverse: randomly sample queries from one day of PubMed logs (Wilbur WJ, Kim W, Xie N. 2006. Spelling correction in the PubMed search engine. Information Retrieval 9; logs accessible at https://ftp.ncbi.nlm.nih.gov/pub/wilbur/DAYSLOG/) and attempt to infer their intentions. For Table 1 below, I randomly sampled 1000 lines of the PubMed log using the command:
```
$ shuf -n 1000 pubmed-queries.txt > sample-pubmed-queries.txt
```

From that sample, I have (arbitrarily) selected 30 queries of interest.

Table 1. Informational query samples
| Query | Spelling correction |
|---|---|
| Michael Elowitz | |
| delta opioid receptor | |
| propofol, sizure | seizure |
| top 10 deseases for males | diseases |
| Christopher Barrett | |
| fetal hemoglobin | |
| congeintal heart disease | congenital |
| hypoxia sex hormones | |
| rhabdomyolysis and labetalol | |
| swi promoter escape | |
| Kawasaki | |
| jessell t | |
| Thymic selection | |
| gene silencing plant | |
| ampk | |
| reproducibility | |
| muscle myogenesis | |
| Bra1 brain protein | |
| psychometric properties hawaii early learning profile | |
| rheumatoid arthritis X-rays scoring methods reliability | |
| cornea and oxidative stress | |
| Lipossomes cancers Karposi Sarcoma | Liposomes; Kaposi |
| brteast cancer positive nodes | breast |
| tamoxifen MBC | |
| cyclosporin, expoliative dermatitis | exfoliative |
| pancreatitis c reactive peptide | |
| depression and fluoxetine | |
| Hac1 yeast mammanlian | mammalian |
| vegf and prostate | |
| cutaneous effects of topamax | |

Background
Opposing retrieval models (a toy contrast is sketched below):
- exact-match (Boolean)
- best-match (information retrieval)
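To make the contrast concrete, here is a toy sketch of both models over the same inverted index (the documents echo queries from Table 1, and a bare term-overlap count stands in for the TF-IDF/BM25 weighting a real IR system would use):

```python
# Sketch: exact-match (Boolean AND) vs. best-match (ranked retrieval)
# over a toy corpus. The scoring is a simple term-overlap count, a stand-in
# for real relevance weighting such as TF-IDF or BM25.
from collections import defaultdict

docs = {
    1: "fetal hemoglobin in congenital heart disease",
    2: "hypoxia and fetal sex hormones",
    3: "congenital heart disease outcomes",
}

index = defaultdict(set)                 # term -> doc ids (inverted index)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def boolean_and(query):
    """Exact match: the unordered set of docs containing every query term."""
    sets = [index[t] for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

def best_match(query):
    """Best match: every doc sharing any term, ranked by overlap count."""
    scores = defaultdict(int)
    for t in query.lower().split():
        for doc_id in index[t]:
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(boolean_and("fetal heart disease"))  # {1}: all terms required
print(best_match("fetal heart disease"))   # [(1, 3), (3, 2), (2, 1)]: graded
```

The essential difference is the contract with the user: the Boolean model returns an unranked set defined exactly by the criteria, which suits the feed-like expectation described above, whereas the best-match model always yields a graded list, even for partial matches.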
Notes on library science tradition vs information & computer paradigm
**Criticisms of Boolean**
After these refutations, only criticism (1) seems valid.
On IR and the popularity of Google
There is no question that systems such as Google, based on a kind of scoring system, are easy to use and highly popular. But much of the popularity of contemporary search engines may also be attributed to the easy pickings afforded by the first generation of Internet full-text based systems (owing to the cheap cost of digital storage capacity after 1990): no doubt it is good to have all text on the web indexed and made searchable—and often with free access. However, when the easy pickings have been utilized, more complex strategies (and more humanistic approaches) may be needed to make further progress.
PubMed Log analysis
References