Team JEM: Reproducing Paper on Causal Topic Mining
Zhangzhou Yu (leader) Matthew McCarty Jack Ma
A presentation and demonstration of installing and running the application is available at https://mediaspace.illinois.edu/media/t/1_yra0qvjp .
This repository contains code to replicate an experiment done in a paper regarding causal topic mining with time series feedback:
Hyun Duk Kim, Malu Castellanos, Meichun Hsu, ChengXiang Zhai, Thomas Rietz, and Daniel Diermeier. 2013. Mining causal topics in text data: Iterative topic modeling with time series feedback. In Proceedings of the 22nd ACM international conference on information & knowledge management (CIKM 2013). ACM, New York, NY, USA, 885-890. DOI=10.1145/2505515.2505612
The intent of this paper is to develop a method to consider additional contextual data (specifically, in the form of a time series) to supplement topic mining. The paper discusses two scenarios (presidential elections and stock prices); we chose to replicate the former.
The specific experiment that was replicated involves determining topics from New York Times (NYT) articles from May-October 2000, with the additional context of betting odds for Bush and Gore winning the 2000 Presidential election. Two files are used as input for the Python code. One is the time series data for the betting odds, located in `time_series.csv` (`Iowa2000PresidentOdds.csv` is the raw data). The second input file, `consolidated_nyt.tsv`, is a list of NYT articles between May and October 2000. The NYT articles were filtered by 'Bush' and 'Gore' keywords to ensure that irrelevant documents were not considered for topic generation. The article date is included with the article content, so that the time context of the article's publication can be considered alongside the presidential odds time series.
The output of the program is a list of topics and the top three words within each topic. Unlike the plain vanilla PLSA algorithm, these topics highlight words that are highly correlated with the change in betting odds for Bush or Gore winning the election. The number of topics is determined by a parameter `tn`, and the paper discusses the performance of the algorithm with varying values of `tn`. For the purposes of our experiment reproduction, we chose `tn=10`.
The experiment was reproduced in Python (version 3.8.6) with the help of several libraries, which are listed below:

- `numpy` - for general linear algebra operations
- `gensim` - for generating a mapping between token ids and the normalized words they represent
- `statsmodels` - for the time-series causality test
The algorithm itself is a modified version of the PLSA algorithm, which was initially implemented for a homework assignment (MP3) in CS410 at UIUC. The `plsawithprior.py` file contains a `plsa` class which contains many variables of use, some of which are highlighted below:

- `term_doc_matrix` - word count of terms in a given document
- `document_topic_prob` - the probability p(z | d), where z represents a specific topic and d represents a specific document
- `mu` - the strength of the prior probability (when `mu=0`, the result matches PLSA with no prior)
- `prior` - the prior probability p(w | z), where w represents a word and z represents a specific topic
- `topic_word_prob` - the posterior probability p(w | z)
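To give intuition for how `mu` and `prior` interact, the prior can enter the M-step by augmenting the expected topic-word counts before renormalization. The following is a simplified sketch under that assumption; the function and variable names are ours, not necessarily those used in `plsawithprior.py`:

```python
import numpy as np

def update_topic_word_prob(topic_word_counts, prior, mu):
    """M-step update for p(w | z) with a prior on the topic-word
    distribution (simplified illustration, not the repository's code).

    topic_word_counts: (num_topics, num_words) expected counts from the E-step
    prior:             (num_topics, num_words) prior p(w | z) for each topic
    mu:                prior strength; mu=0 reduces to plain PLSA
    """
    # Expected counts are augmented by mu * prior, then each topic's row
    # is renormalized into a probability distribution over words.
    augmented = topic_word_counts + mu * prior
    return augmented / augmented.sum(axis=1, keepdims=True)
```

With `mu=0` the prior term vanishes and the update is the standard PLSA maximum-likelihood estimate; larger `mu` pulls each topic's word distribution toward the prior.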
Additional descriptions of the software programs are given below:

- `Granger_Casuality_Test.py` - cleans the raw presidential betting odds time series `Iowa2000PresidentOdds.csv` and outputs the cleaned data `time_series.csv`; also implements the Granger causality test function used in `main.py`
- `calc_prior.py` - calculates the prior from the significant words within significant topics identified by the Granger causality test significance levels
- `sanitize_nyt.py` - extracts, filters, and cleans the text of the NYT articles from May-October 2000, and outputs `consolidated_nyt.tsv`, where each line contains the documents with 'Bush' and 'Gore' keywords from one day (the number of lines in this cleaned text data matches the number of rows in the cleaned time series data, so it is ready for use in the Granger causality test)
- `main.py` - consolidated main program that includes all of the functions discussed above
Future modifications could be made to the algorithm to change how the prior is generated (e.g., based on other time-series or non-document data sources).
- Run `git clone https://github.com/enaena633/CourseProject.git` to clone the code repository.
- Install Python 3.
- Install the following Python libraries (via `pip`, etc.): `numpy`, `gensim`, `statsmodels`
- Run `python main.py` in the repository directory.
The following list shows the top three words in each of the ten topics mined from the New York Times documents:
ad win ralli
night lehrer trillion
econom recent try
support governor alaska
state governor alaska
governor clarenc right
night win tuesdai
wetston abm recent
offic men try
win ralli church
These results are different from the paper's results, which are included below:
tax cut 1
screen pataki giuliani
enthusiasm door symbolic
oil energy prices
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra
news w top
This can be explained by the following:
- Several elements of the algorithm (the Granger causality test, PLSA, etc.) were implemented in Python, whereas the paper used R.
- We used the `gensim` package to perform stemming of words (which causes words like `econom` to appear instead of `economy` or `economic`). `gensim` was also used to remove stop words. The paper does not specify whether a background language model was used in its implementation of PLSA or if any stop word removal was done.
- The EM algorithm is guaranteed to converge to a local (but not necessarily global) maximum, which causes output to differ even with the same implementation when different random starting values are used.
- Certain parameters in the paper are not specified (e.g., the threshold value gamma for the significance cutoff for words at the topic level; we used 90%).
All team members were engaged and involved in reproducing the experiment from the paper sourced above. In addition to weekly meetings where everyone contributed, individual team members were responsible for the following:
Zhangzhou Yu (leader) - Time series data retrieval/cleaning, Granger causality test, administrative/organizational tasks
Matthew McCarty - Text data retrieval/cleaning, library research, documentation, presentation/demo recording
Jack Ma - PLSA augmentation to include use of contextual time series data, prior implementation, consolidate/structure software programs