
A tool for tweet topic analysis written in Python. It uses PySpark to handle high-volume data.


Andreus00/tweepic


Latest commit: 1981ab9 · Jan 25, 2024



Tweepic: A Novel Approach to Tweet Clustering

Tweepic is a project for clustering live tweets. Our goal is to collect tweets, determine which ones discuss the same topic, and group them accordingly. The name 'Tweepic' is a portmanteau of 'tweet' and 'topic', reflecting the project's core objective.

Unlike traditional methods that group tweets based solely on hashtags, Tweepic also considers the proximity of sentences and the similarity of words, both measured through their embeddings. The process has three steps. First, we define a proximity measure over sentences, words, and hashtags. Second, using this measure, we construct a graph in which each vertex is a tweet and each edge connects a tweet to one of its k nearest tweets, weighted by their distance. Third, a classifier decides which edges to cut because the tweets they connect differ too much. In the resulting graph, each connected component is a cluster of tweets discussing the same topic.
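The graph-based steps described above can be illustrated with a minimal sketch in plain Python. This is not the project's implementation. The function names (`knn_edges`, `cut_edges`, `connected_components`), the toy 2-D vectors standing in for tweet embeddings, the Euclidean distance, and the fixed distance threshold standing in for the edge classifier are all assumptions made for illustration. A real pipeline would presumably run on PySpark with a learned classifier.

```python
import math
from collections import defaultdict


def knn_edges(vectors, k):
    """Link each item to its k nearest items; returns {(i, j): distance}."""
    edges = {}
    for i, v in enumerate(vectors):
        # Distances from item i to every other item, nearest first.
        dists = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i
        )
        for d, j in dists[:k]:
            # Store edges as undirected pairs with the smaller index first.
            edges[(min(i, j), max(i, j))] = d
    return edges


def cut_edges(edges, keep_edge):
    """Keep only the edges accepted by keep_edge (a stand-in classifier)."""
    return {e: d for e, d in edges.items() if keep_edge(d)}


def connected_components(n, edges):
    """Group vertex indices 0..n-1 into components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Hypothetical toy "embeddings": three tweets on one topic, two on another.
toy_vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

edges = knn_edges(toy_vectors, k=2)
# Assumed stand-in for the classifier: cut any edge longer than 1.0.
kept = cut_edges(edges, keep_edge=lambda d: d < 1.0)
clusters = connected_components(len(toy_vectors), kept)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

With k=2, tweets 3 and 4 are each forced to link to a distant tweet in the other group. The cutting step removes those long edges, which is why the two topics end up in separate connected components.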

Team Members

Links
