Experiment code for the AAAI'15 paper:
A Neural Probabilistic Model for Context Based Citation Recommendation
Please note that the code is experimental. It contains two main parts: learning paper embeddings and calculating scores (indexing).
The unprocessed data (SQL dump) of the citation contexts and the cited papers is available at: https://psu.box.com/v/refseer
-
You are welcome to use the code under the terms of the license; however, please acknowledge its use by citing: W. Huang, Z. Wu, C. Liang, P. Mitra, and C. Lee Giles. A Neural Probabilistic Model for Context Based Citation Recommendation. In the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15), 2015.
-
Instructions: The shared data is a SQL dump of the CiteSeerX database with three tables: citations, citationContexts, and papers.
- Important fields of table papers:
- id: each PDF has a different id; this id is referred to as paperid in table citations;
- cluster: the same paper (which may have multiple PDFs in our database) has a unique cluster number.
- Important fields of table citations:
- id: this id is referred to as citationid in table citationContexts;
- cluster: the cluster number of the cited document;
- paperid: the id of the citing document.
- Important fields of table citationContexts:
- citationid: links to the citations table.
- context: the citation context; the cited-paper mention is surrounded by =-= and -=-.
-
Please use MySQL to import the data; I was told there were some problems when importing 'citationContexts.sql' into Postgres.
-
After the database is imported, these steps may help you:
- create the new data format, removing the citation anchors (surrounded by =-= and -=-); each line pairs a citation context with the cluster of the cited paper:
CitationContext  Cluster (cited paper)
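A minimal sketch of this extraction step, assuming a local MySQL database named citeseerx, the pymysql package, and a tab-separated output file (none of which are prescribed by the original code):

    import re
    import pymysql

    # Hypothetical connection settings; adjust to your own MySQL import.
    conn = pymysql.connect(host="localhost", user="root", password="", database="citeseerx")
    marker = re.compile(r"=-=.*?-=-")  # the cited-paper mention inside the context

    with conn.cursor() as cur, open("context_cluster.tsv", "w") as out:
        cur.execute(
            "SELECT cc.context, c.cluster "
            "FROM citationContexts cc JOIN citations c ON cc.citationid = c.id"
        )
        for context, cluster in cur:
            clean = marker.sub(" ", context)              # drop the citation anchor
            out.write("%s\t%s\n" % (" ".join(clean.split()), cluster))
    conn.close()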
-
- learn word embeddings from the citation contexts
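A minimal sketch of this step, assuming gensim is installed and the tab-separated file from the previous step; the original experimental code uses its own word2vec-style training, so treat this only as an illustration:

    from gensim.models import Word2Vec

    # One tokenized citation context per line (first column of context_cluster.tsv).
    sentences = []
    with open("context_cluster.tsv") as f:
        for line in f:
            context, _cluster = line.rstrip("\n").split("\t")
            sentences.append(context.lower().split())

    w2v = Word2Vec(sentences, vector_size=200, window=5, min_count=5, sg=1, epochs=5)
    w2v.save("word_embeddings.model")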
-
- learn paper embeddings from the citation contexts (initial paper embeddings)
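One plausible way to obtain the initial paper embeddings is to average the word vectors of every context that cites a paper (cluster); this is an assumption for illustration, not necessarily the exact initialization used in the paper:

    from collections import defaultdict
    import numpy as np
    from gensim.models import Word2Vec

    w2v = Word2Vec.load("word_embeddings.model")
    dim = w2v.vector_size

    sums, counts = defaultdict(lambda: np.zeros(dim)), defaultdict(int)
    with open("context_cluster.tsv") as f:
        for line in f:
            context, cluster = line.rstrip("\n").split("\t")
            for w in context.lower().split():
                if w in w2v.wv:
                    sums[cluster] += w2v.wv[w]
                    counts[cluster] += 1

    # Initial embedding of each cited paper (cluster) = mean of its context word vectors.
    paper_init = {c: sums[c] / max(counts[c], 1) for c in sums}
    np.save("paper_embeddings_init.npy", paper_init)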
-
- learn word embeddings and paper embeddings simultaneously.
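As a rough analogy for this joint step (not the paper's actual neural probabilistic model), gensim's Doc2Vec can train word vectors and per-cluster "document" vectors together when each context is tagged with its cited cluster:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = []
    with open("context_cluster.tsv") as f:
        for line in f:
            context, cluster = line.rstrip("\n").split("\t")
            docs.append(TaggedDocument(words=context.lower().split(), tags=[cluster]))

    model = Doc2Vec(docs, vector_size=200, window=5, min_count=5, dm=1, epochs=10)
    model.save("joint_embeddings.model")
    # model.dv[cluster] is the paper embedding; model.wv holds the word embeddings.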
-
- when learning paper embeddings, use only adjective and noun words in the citation context
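A minimal sketch of the part-of-speech filtering, assuming NLTK's averaged-perceptron tagger; the original code may use a different tagger or tag set:

    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def nouns_and_adjectives(context):
        # Keep only tokens tagged as nouns (NN*) or adjectives (JJ*).
        tokens = nltk.word_tokenize(context)
        return [w.lower() for w, tag in nltk.pos_tag(tokens)
                if tag.startswith("NN") or tag.startswith("JJ")]

    print(nouns_and_adjectives("We use a neural probabilistic model for citation recommendation."))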
-
- when learning paper embeddings, I assigned a normalized weight to each noun and adjective word in a context. For example, for one pair of citation context and cited paper

  w_1, w_2, ..., w_{n-1}, w_n  ->  p_i

  when learning the embedding of paper p_i, the words w_1, w_2, ..., w_n have different learning weights. I use the co-occurrence of the word and the paper over the whole corpus as the weight.
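A minimal sketch of that weighting as I read it: count corpus-wide co-occurrences of each (word, cited cluster) pair, then normalize within each context so the weights sum to 1; the exact normalization in the original code may differ, and POS filtering is omitted here for brevity:

    from collections import defaultdict

    cooc = defaultdict(int)        # (word, cluster) -> corpus-wide co-occurrence count
    contexts = []                  # (cluster, [words]) pairs
    with open("context_cluster.tsv") as f:
        for line in f:
            context, cluster = line.rstrip("\n").split("\t")
            words = context.lower().split()
            contexts.append((cluster, words))
            for w in words:
                cooc[(w, cluster)] += 1

    def context_weights(cluster, words):
        # Normalized learning weight of each word when updating paper p_i = cluster.
        raw = [cooc[(w, cluster)] for w in words]
        total = float(sum(raw)) or 1.0
        return [r / total for r in raw]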
-
Should you have more questions, please email me at my Gmail address, which starts with harrywy.
All code is owned by Penn State and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.