wenyanglyu/Hot-Topic-Detection


This repository contains code for analyzing and detecting topics in blog data using various natural language processing and topic modeling techniques, as described in the research paper "Hot Topic Detection with Topic Modeling Methods" by Wenyang Lyu, Henry Hu, and Parma Nand.

Overview

This project applies several topic modeling approaches to a large dataset of 19,320 XML-formatted blog files from an anonymous blogging site, covering a period from 2001 to 2004. The blogs are segmented by demographic categories (gender, age, and student status). The implementation includes comprehensive pre-processing of text data, application of multiple topic extraction methods, and visualization of the discovered topics.

[Figure: NMF-Topic]

The aim is to identify the two most popular topics discussed within specific demographics:

  • Males
  • Females
  • Age ≤ 20
  • Age > 20
  • Students
  • General population

Features

  • Data Preparation: Unzips and segments blog data into demographic categories
  • Text Pre-processing: Implements robust text cleaning (a minimal sketch follows this list), including:
    • Lowercasing text
    • Replacing non-ASCII characters
    • Tokenization
    • Removing stop words and punctuation
    • Spell checking
    • Stemming and lemmatization
    • Parallel processing with Dask for large datasets
  • Topic Detection Methods:
    • Noun counting
    • Grammatical role analysis (subjects, direct objects, prepositional objects)
    • TF-IDF analysis
    • n-Gram extraction
    • Latent Dirichlet Allocation (LDA) with optimal parameter tuning
    • Non-negative Matrix Factorization (NMF)
  • Visualization:
    • Word clouds for topic visualization
    • Termite plots for topic term distribution
    • Topic distribution plots for most dominant topics
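
As a minimal sketch (not the notebook's exact code), the cleaning steps listed above can be combined roughly as follows; the helper name preprocess and the sample texts list are illustrative, and the NLTK resources from the Installation section are assumed to be downloaded:

import unidecode
import nltk
import dask.bag as db
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from spellchecker import SpellChecker

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
spell = SpellChecker()

def preprocess(text):
    """Clean one blog post: lowercase, strip non-ASCII, tokenize,
    drop stop words/punctuation, spell-check, and lemmatize."""
    text = unidecode.unidecode(text.lower())                             # lowercase + replace non-ASCII
    tokens = nltk.word_tokenize(text)                                    # tokenize
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]  # drop stop words and punctuation
    tokens = [spell.correction(t) or t for t in tokens]                  # spell checking (slow on a large corpus)
    return [lemmatizer.lemmatize(t) for t in tokens]                     # lemmatize

# Parallelise over a large corpus with Dask; `texts` is a small stand-in for the real blog posts
texts = ["An example blog post about music and school.",
         "Another post about travel and work."]
tokenized_docs = db.from_sequence(texts, npartitions=2).map(preprocess).compute()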

Theoretical Background

The project applies two main topic modeling techniques:

  • Latent Dirichlet Allocation (LDA): A generative probabilistic model that treats documents as mixtures of topics and topics as mixtures of words.

  • Non-negative Matrix Factorization (NMF): A matrix decomposition method that breaks down document-term matrices into components representing topics.
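
As a toy illustration (not from the paper) of this decomposition, NMF factors a document-term matrix V into non-negative document-topic weights W and topic-term weights H so that V ≈ W·H:

import numpy as np
from sklearn.decomposition import NMF

# Toy 4-document x 5-term count matrix
V = np.array([[3, 2, 0, 0, 1],
              [2, 3, 0, 1, 0],
              [0, 0, 4, 3, 0],
              [0, 1, 3, 4, 0]], dtype=float)

model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(V)   # document-topic weights (4 x 2)
H = model.components_        # topic-term weights   (2 x 5)
print(np.round(W @ H, 1))    # approximately reconstructs V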

Topic quality is evaluated using coherence scores, which measure the semantic similarity between the top words in a topic. One standard formulation scores a topic's top N words by the average normalized pointwise mutual information (NPMI) over all word pairs:

  C = 2 / (N(N − 1)) · Σ_{i<j} NPMI(w_i, w_j)

where NPMI(w_i, w_j) = log( P(w_i, w_j) / (P(w_i) P(w_j)) ) / (−log P(w_i, w_j)) is the normalized pointwise mutual information between words w_i and w_j.
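
For illustration, coherence can be computed with gensim's CoherenceModel; here tokenized_docs is the pre-processed corpus (token lists, as in the sketch above) and topics is a list of each topic's top words from a fitted LDA or NMF model, both assumed to already exist:

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

dictionary = Dictionary(tokenized_docs)   # word <-> id mapping built from the corpus

cm = CoherenceModel(
    topics=topics,            # e.g. [["love", "heart", ...], ["school", "work", ...], ...]
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence="c_v",          # "c_npmi" computes the plain NPMI average shown above
    topn=10,                  # number of top words per topic to score
)
print("Coherence:", cm.get_coherence())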

Methods Compared

The project compares several topic modeling approaches:

  1. Counting Nouns: Identifies the most frequent nouns in the corpus
  2. Grammatical Role Analysis: Counts subjects, direct objects, and prepositional objects
  3. TF-IDF Analysis: Identifies important terms using TF-IDF scoring
  4. n-Gram Analysis: Extracts common bi-grams and tri-grams
  5. LDA with CountVectorizer: Tests various numbers of topics (10, 15, 20, 30, 40, 50, 60)
  6. NMF with TF-IDF Vectorizer: Tests various numbers of topics (10, 15, 20, 40)
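
A compact sketch of how approaches 5 and 6 can be set up with scikit-learn; the vectorizer settings are illustrative, and cleaned_docs is assumed to hold the pre-processed posts joined back into strings (e.g. " ".join(tokens) over the pre-processing output):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

def top_words(model, feature_names, n_top=10):
    """Return the n_top highest-weighted words of every topic."""
    return [[feature_names[i] for i in comp.argsort()[:-n_top - 1:-1]]
            for comp in model.components_]

# 5. LDA on raw term counts, sweeping the number of topics
count_vec = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
counts = count_vec.fit_transform(cleaned_docs)
for k in (10, 15, 20, 30, 40, 50, 60):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    print("LDA", k, top_words(lda, count_vec.get_feature_names_out())[:2])

# 6. NMF on TF-IDF weights
tfidf_vec = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
tfidf = tfidf_vec.fit_transform(cleaned_docs)
for k in (10, 15, 20, 40):
    nmf = NMF(n_components=k, init="nndsvd", random_state=0).fit(tfidf)
    print("NMF", k, top_words(nmf, tfidf_vec.get_feature_names_out())[:2])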

Key Findings

  • LDA Performance: Best coherence score (0.52) achieved with 20 topics, contrary to common belief that higher topic numbers yield better coherence
  • NMF Performance: Best coherence score (0.59) achieved with 15 topics
  • Optimal Approach: NMF with TF-IDF vectorization provided the most coherent and interpretable topics
  • Demographic Insights: Each demographic showed distinct dominant topics:
    • Students: Internet technologies; Daily life reflection (0.5986 coherence)
    • Males: Job and mental health; Travel experiences (0.6036 coherence)
    • Females: Love and emotional journey; Urban life and reflections (0.5766 coherence)
    • Age > 20: Pressure and political critique; Heartbreak and emotional struggles (0.6602 coherence)
    • Age ≤ 20: Heartbreak and love; Daily activities and reflections (0.6034 coherence)
    • Everyone: Life events and reflections; Citizenship and societal expectations (0.6101 coherence)

Installation

Install all required packages using pip:

pip install nltk spacy unidecode pyspellchecker "dask[distributed]" pyLDAvis gensim scikit-learn matplotlib seaborn textblob wordcloud

You'll also need to download necessary NLTK and spaCy resources:

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # English stop word list
nltk.download('wordnet')    # lexical database required by the WordNet lemmatizer

import spacy
spacy.cli.download('en_core_web_sm')  # small English pipeline for POS and dependency analysis

Usage

The notebook is organized into six main parts:

  1. Dataset Preparation: Loading and unzipping files
  2. Pre-Processing: Text cleaning and preparation
  3. NMF with TF-IDF: Topic detection by demographic
  4. Basic Topic Detection: Using noun counting, TF-IDF, n-Gram methods
  5. LDA Optimization: Testing various topic numbers with CountVectorizer
  6. NMF Optimization: Testing various topic numbers with TF-IDF

Methodology Workflow

  1. Data Segmentation: Files are categorized by demographics based on file naming patterns
  2. Pre-processing: Text is cleaned and standardized through multiple steps
  3. Topic Modeling: Different methods are applied to extract topics
  4. Clause Extraction: Top documents containing dominant topics are extracted for analysis
  5. Topic Interpretation: Word clouds and human judgment are used to interpret topics
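
A hedged sketch of steps 4 and 5, assuming nmf, tfidf, and tfidf_vec are a fitted NMF model, its TF-IDF document-term matrix, and the vectorizer (as in the Methods Compared sketch):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

W = nmf.transform(tfidf)               # document-topic weights, one row per document
dominant_topic = W.argmax(axis=1)      # dominant topic of each document

topic_id = 0
top_doc_idx = W[:, topic_id].argsort()[::-1][:5]   # the 5 documents most dominated by this topic
print("Top documents for topic", topic_id, ":", top_doc_idx)

# Word cloud built from the topic's term weights, for human interpretation
weights = dict(zip(tfidf_vec.get_feature_names_out(), nmf.components_[topic_id]))
wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(weights)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()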

Reflections and Lessons Learned

  • Understanding the theoretical principles of methods before parameter tuning is crucial
  • NMF produces more specific and actionable topics compared to counting methods and LDA
  • Topic coherence doesn't necessarily increase with higher topic numbers
  • A combination of automated metrics and human judgment provides the best topic evaluation

References

  1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
  2. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.
  3. Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15).

Acknowledgments

The implementation is based on popular NLP libraries and techniques in topic modeling literature.
