This repository contains a collection of my projects in data science, covering a variety of domains and techniques. These projects were completed as part of my studies and professional experiences, aiming to showcase my skills in data analysis, predictive modeling, and data visualization.

MehatlaRanim/NLP_Project


NLP-Project

The system comprises several interconnected classes: TextPreprocessor(), PositionalInvertedIndex(), Query(), BooleanSearch(), PhraseSearch(), and TFIDF().

1-Tokenization and Stemming

The TextPreprocessor() class handles text preprocessing: stop-word removal, tokenization, and stemming. Tokenization breaks text into tokens, typically words or sentences. We used the Natural Language Toolkit (nltk) library for tokenization, specifically word-level tokenization with word_tokenize(). Stemming, the process of reducing words to their base or root form, was done with the Porter Stemmer algorithm available in nltk. This step standardizes words so that different inflections of the same word match during search operations.
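
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's TextPreprocessor() code. The function name `preprocess`, the tiny `STOP_WORDS` set, and the regex tokenizer are assumptions made for the example. The regex stands in for nltk's word_tokenize() so that the sketch runs without downloading nltk data, and the stemmer falls back to a no-op if nltk is not installed.

```python
import re

# Optional dependency: use nltk's Porter stemmer when available,
# otherwise fall back to returning the word unchanged.
try:
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    def stem(word):
        return word

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    # Regex tokenizer used here as a stand-in for nltk.word_tokenize().
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The cats are running in the garden")
print(tokens)
```

The same function would typically be applied to both documents and queries, so that a query term is stemmed the same way as the indexed terms.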

2-Inverted Index Implementation

The PositionalInvertedIndex() class serves as the backbone of the system, constructing an inverted index for the corpus. The index maps each term to the documents in which it appears, and stores the term frequency and the positions of the term within each document.
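
As a rough sketch of the data structure described above, the index can be held as a nested dictionary of the form term -> {doc_id: [positions]}. The function name `build_positional_index` and the toy documents are hypothetical, and the actual PositionalInvertedIndex() class may store its data differently.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} from doc_id -> list of tokens."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Toy corpus of already preprocessed tokens (illustrative only).
docs = {
    1: ["new", "home", "sale"],
    2: ["home", "sale", "new", "home"],
}
index = build_positional_index(docs)

# The term frequency in a document is the length of its position list.
home_tf_in_doc2 = len(index["home"][2])
print(dict(index), home_tf_in_doc2)
```

Storing positions rather than bare counts costs more memory, but it lets term frequency and phrase matching be derived from the same structure.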

3-Search Function Implementations

Boolean Search

The BooleanSearch() class lets users perform Boolean operations (AND, OR, NOT) on the inverted index. It processes user queries and retrieves the documents that satisfy the Boolean logic applied to the index. This functionality is the basis of simple information retrieval.

Phrase Search

The PhraseSearch() class handles more complex queries by considering the positions of terms in documents. It lets users search for sequences of words or phrases within documents, using the positional information stored in the inverted index.
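
Both kinds of search can be illustrated over a small positional index. The sketch below is an assumption-laden example, not the code of BooleanSearch() or PhraseSearch(): the names `INDEX`, `ALL_DOCS`, `docs_with`, `boolean_and`, `boolean_or`, `boolean_not`, and `phrase_search` are hypothetical, and query parsing is left out.

```python
# Hypothetical toy index: term -> {doc_id: [positions]}.
# Doc 1: "new home sale", doc 2: "home sale new home", doc 3: "rent".
INDEX = {
    "new": {1: [0], 2: [2]},
    "home": {1: [1], 2: [0, 3]},
    "sale": {1: [2], 2: [1]},
    "rent": {3: [0]},
}
ALL_DOCS = {1, 2, 3}

def docs_with(term):
    """Return the set of document ids that contain the term."""
    return set(INDEX.get(term, {}))

def boolean_and(a, b):
    return docs_with(a) & docs_with(b)

def boolean_or(a, b):
    return docs_with(a) | docs_with(b)

def boolean_not(a):
    # NOT needs the full document set to take a complement.
    return ALL_DOCS - docs_with(a)

def phrase_search(terms):
    """Return documents where the terms occur at consecutive positions."""
    if not terms:
        return set()
    # Only documents containing every term can contain the phrase.
    candidates = set.intersection(*(docs_with(t) for t in terms))
    hits = set()
    for doc_id in candidates:
        for start in INDEX[terms[0]][doc_id]:
            # The i-th term must appear exactly i positions after the first.
            if all(start + i in INDEX[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

print(boolean_and("new", "sale"), boolean_not("home"))
print(phrase_search(["sale", "new"]))
```

Boolean operators reduce to set operations on document ids, while phrase search first intersects the candidate documents and then checks position offsets.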

4-TF-IDF Calculation

The TFIDF() class calculates TF-IDF (Term Frequency-Inverse Document Frequency) scores for terms in the corpus. The score weighs how often a term occurs in a document against how many documents in the corpus contain it, so terms that are frequent in one document but rare elsewhere score highest. TF-IDF scores are used to rank documents by relevance to a user query.
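
The README does not state which TF-IDF variant the TFIDF() class uses. The sketch below assumes one common form, with tf as the term count divided by document length and idf as log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return doc_id -> {term: score} for docs given as doc_id -> token list.

    Assumed weighting: tf = count / doc length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

# Toy corpus of preprocessed tokens (illustrative only).
docs = {1: ["home", "sale"], 2: ["home", "rent"]}
scores = tf_idf(docs)
print(scores)
```

Under this weighting, a term that appears in every document gets an idf of zero, so it contributes nothing to the ranking.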

**Web Scraping**

Web scraping is a technique for gathering data from websites and extracting the information they contain. In our project, we aimed to analyze the Israel-Palestine conflict by collecting data from different sources, including news articles and tweets.
