One of the most challenging issues arising from those sample queries is misspellings. To accommodate frequent misspellings, you need to handle domain-specific terms that are not English dictionary words (e.g. tnf). For instance, you could spell-check only English words, but then you need to distinguish general English misspellings from non-English terms like 'tnf'.
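A minimal sketch of that classification step in Python, assuming a plain English wordlist and a hand-curated list of domain terms (both filenames are hypothetical placeholders):

```python
# Sketch: separate likely English misspellings from domain-specific terms.
# Assumes english-words.txt (one dictionary word per line) and
# domain-terms.txt (e.g. "tnf", "ampk") exist; both filenames are hypothetical.

def load_wordset(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

english = load_wordset("english-words.txt")
domain = load_wordset("domain-terms.txt")

def classify(term):
    t = term.lower()
    if t in english:
        return "english"
    if t in domain:
        return "domain"              # e.g. 'tnf': not English, but not a typo
    return "possible-misspelling"    # only this bucket goes to spell check

for term in ["receptor", "sizure", "tnf", "propofol"]:
    print(term, "->", classify(term))
```

Only the "possible-misspelling" bucket would be routed to a spelling corrector; note that drug names like "propofol" are also absent from English dictionaries, so the domain list has to cover more than acronyms.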
Target Users
General
Our target users are biological and medical researchers, that is, “domain experts”. Their primary need is to stay informed about advances in their research field. They may also be curious about notable efforts at the margins of, or outside, their expertise, whether because such work could inspire their own or because of the attention it attracts, but these interests are secondary.
These researchers are very familiar with PubMed and use it regularly to search for articles. Because they are interested in collecting information related to a topic, they would typically use PubMed by submitting “informational queries”, unlike “navigational queries”, which are used to locate a specific record or set. It is important to note that, like most PubMed visitors, our target users have no background or specialized training in databases, knowledge representation, or search.
Given these needs, when our researchers have a set of search criteria, they would expect to receive a set of articles ordered by date, akin to a feed. It would not necessarily be problematic for a system to return zero new results (the outcome of roughly 10% of PubMed queries). Domain experts are also quite tolerant of noise in sets of informational search hits, because they appreciate the qualitative difference between results from bibliographic indexes and results from, say, Google. Indeed, their expectations are unlike those for Google or Amazon, where one expects a ranked list of possibly millions of results.
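As a concrete illustration of this feed-style expectation, here is a minimal sketch that asks PubMed for its newest matches via the public NCBI E-utilities esearch endpoint (the query term is just a sample from Table 1; error handling and rate limiting are omitted):

```python
# Sketch: fetch PubMed IDs for a query, newest first, like a feed.
# Uses the public NCBI E-utilities esearch endpoint; "delta opioid receptor"
# is simply one of the sampled queries from Table 1.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "db": "pubmed",
    "term": "delta opioid receptor",
    "sort": "pub_date",    # order results by publication date
    "retmax": "20",        # a feed-sized page, not millions of hits
    "retmode": "json",
})
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

ids = data["esearchresult"]["idlist"]
print(f"{len(ids)} articles:", ids)    # an empty list is a valid outcome
```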
Our users are accustomed to submitting queries consisting of a few terms. As with PubMed and general search engines, the overwhelming majority (>90%) of users submit a single query in a brief session, with only a minority going on to refine an initial search. Target users are error-prone: a significant proportion of input search terms are misspelled (10-20%), in line with observations from PubMed. On average, target users enter three (3) terms in a typical query, and the majority (70%) enter up to four (4) terms to define the content they see. Being non-experts in search with informational needs, they will not entertain annotating terms with specialized tags or syntax (e.g. operators) in queries, and are typically not interested in advanced search functions. All of this notwithstanding, it should be noted that a large proportion of PubMed queries in general include an author name (35%).
Individual Users
Rather than define prototypical users a priori and derive hypothetical queries, we do the inverse: randomly sample queries from one day of PubMed logs (Wilbur WJ, Kim W, Xie N. 2006. Spelling correction in the PubMed search engine. Information Retrieval 9; logs accessible at https://ftp.ncbi.nlm.nih.gov/pub/wilbur/DAYSLOG/) and attempt to infer their intentions. For Table 1 below, I randomly sampled 1000 lines of the PubMed log using the command:
```
$ shuf -n 1000 pubmed-queries.txt > sample-pubmed-queries.txt
```

From that sample, I have (arbitrarily) selected 30 queries of interest.

Table 1. Informational query samples
| Query | Spelling correction |
|---|---|
| Michael Elowitz | |
| delta opioid receptor | |
| propofol, sizure | seizure |
| top 10 deseases for males | diseases |
| Christopher Barrett | |
| fetal hemoglobin | |
| congeintal heart disease | congenital |
| hypoxia sex hormones | |
| rhabdomyolysis and labetalol | |
| swi promoter escape | |
| Kawasaki | |
| jessell t | |
| Thymic selection | |
| gene silencing plant | |
| ampk | |
| reproducibility | |
| muscle myogenesis | |
| Bra1 brain protein | |
| psychometric properties hawaii early learning profile | |
| rheumatoid arthritis X-rays scoring methods reliability | |
| cornea and oxidative stress | |
| Lipossomes cancers Karposi Sarcoma | Liposomes; Kaposi |
| brteast cancer positive nodes | breast |
| tamoxifen MBC | |
| cyclosporin, expoliative dermatitis | exfoliative |
| pancreatitis c reactive peptide | |
| depression and fluoxetine | |
| Hac1 yeast mammanlian | mammalian |
| vegf and prostate | |
| cutaneous effects of topamax | |

Background
Opposing retrieval models (a toy contrast is sketched below):
- exact-match (Boolean)
- best-match (information retrieval)
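To make the contrast concrete, here is a toy sketch of both models over the same inverted index (the documents echo queries from Table 1, and a bare term-overlap count stands in for the TF-IDF/BM25 weighting a real IR system would use):

```python
# Sketch: exact-match (Boolean AND) vs. best-match (ranked retrieval)
# over a toy corpus. The scoring is a simple term-overlap count, a stand-in
# for real relevance weighting such as TF-IDF or BM25.
from collections import defaultdict

docs = {
    1: "fetal hemoglobin in congenital heart disease",
    2: "hypoxia and fetal sex hormones",
    3: "congenital heart disease outcomes",
}

index = defaultdict(set)                 # term -> doc ids (inverted index)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def boolean_and(query):
    """Exact match: the unordered set of docs containing every query term."""
    sets = [index[t] for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

def best_match(query):
    """Best match: every doc sharing any term, ranked by overlap count."""
    scores = defaultdict(int)
    for t in query.lower().split():
        for doc_id in index[t]:
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(boolean_and("fetal heart disease"))  # {1}: all terms required
print(best_match("fetal heart disease"))   # [(1, 3), (3, 2), (2, 1)]: graded
```

The essential difference is the contract with the user: the Boolean model returns an unranked set defined exactly by the criteria, which suits the feed-like expectation described above, whereas the best-match model always yields a graded list, even for partial matches.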
Notes on library science tradition vs information & computer paradigm
**Criticisms of Boolean**
After these refutations, only criticism (1) seems valid.
On IR and the popularity of Google
There is no question that systems such as Google, based on a kind of scoring system, are easy to use and highly popular. But much of the popularity of contemporary search engines may also be attributed to the easy pickings afforded by the first generation of Internet full-text based systems (owing to the cheap cost of digital storage capacity after 1990): no doubt it is good to have all text on the web indexed and made searchable—and often with free access. However, when the easy pickings have been utilized, more complex strategies (and more humanistic approaches) may be needed to make further progress.
PubMed Log analysis
References