Communication is an essential component of the scientific endeavor, yet the relationship between the textual properties of scientific papers and their reception by the scientific community remains poorly understood. As a component of the Science of Success project, this project explores whether specific lexical factors are indicative of the attention an article receives, as measured by normalized citation indices, by combining analytical tools from natural language processing, data science, and the science of science. The initial stages of the study focus on the temporal and disciplinary variance in the length and syntactic features of article titles in the Web of Science, a massive dataset containing over 50 million articles published since 1900. The project also aims to develop a quantitative model that estimates the impact a scientific article is likely to have on the community. This quantitative model of article presentation will be useful for maximizing the impact of future articles and thereby accelerating scientific growth.
Installing dependencies for the project
sudo apt-get install python3-tables
sudo pip3 install seaborn pandas networkx
- These notebooks show the data preparation steps for the analysis.
- These notebooks explore the temporal distribution of structures in the titles of Applied Physics articles, conditioned on the journals in which they appear.
- These notebooks show how linear and weighted linear regression models are used to predict log c5 (the log of the citation count five years after the year of publication).
- These notebooks show how new (interesting) words emerge in the literature (titles) and how their usage decays over time.
- These notebooks show how the parts of speech used in titles have varied over the years in different disciplines.
- These notebooks show different methods of selecting concepts from the titles of publications and how concepts grow and decay in different disciplines.
- These notebooks show how much word usage fluctuates. We also try to fit these fluctuations to known distributions.
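
As a rough illustration of the regression step mentioned above, here is a minimal sketch of a weighted linear fit predicting log c5 from a single title feature. It is not the code from the notebooks: the function name `weighted_linear_fit`, the choice of title length as the predictor, the weights, and the toy numbers are all made up for illustration, and `log(c5 + 1)` is used instead of `log(c5)` only as an assumption to avoid taking the log of zero citations.

```python
import math


def weighted_linear_fit(x, y, w):
    """Fit y = intercept + slope * x by weighted least squares (closed form).

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    total_w = sum(w)
    # Weighted means of the predictor and the response.
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted variance of x and weighted covariance of x and y.
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Toy data, invented for this example only.
title_lengths = [5, 8, 10, 12, 15, 20]     # words per title (assumed feature)
c5 = [30, 22, 18, 12, 9, 4]                # citations five years after publication
weights = [1.0, 1.0, 2.0, 2.0, 1.0, 0.5]   # assumed per-article weights

# Assumption: log(c5 + 1) so that articles with zero citations are defined.
log_c5 = [math.log(c + 1) for c in c5]

intercept, slope = weighted_linear_fit(title_lengths, log_c5, weights)

# Predicted log c5 for a hypothetical 11-word title.
predicted = intercept + slope * 11
print(f"intercept={intercept:.3f} slope={slope:.3f} predicted={predicted:.3f}")
```

Setting all weights to the same value reduces this to an ordinary (unweighted) linear regression, which makes it easy to compare the two variants on the same data.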