- Project Description
- Features
- Requirements
- Installation
- Running the Application
- Functions
- Usage
- Special Cases and Incorrect Input
- Contributions
## Project Description

This project implements a TF-IDF (Term Frequency-Inverse Document Frequency) based search engine. It processes a collection of documents, calculates TF-IDF scores, and enables querying to find the most relevant document for a given query. The project is implemented in Python and uses the NLTK library for text preprocessing.
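As background, the standard TF-IDF weighting can be sketched in a few lines. This is a generic illustration on a toy corpus, not the exact formulas or data structures used in the script:

```python
import math

# Toy corpus: three tiny pre-tokenized "documents" (illustrative data only).
docs = {
    "doc1.txt": ["the", "people", "govern"],
    "doc2.txt": ["the", "people", "vote"],
    "doc3.txt": ["war", "and", "peace"],
}

N = len(docs)

def idf(term):
    # Inverse document frequency: log10(N / df), where df is the number
    # of documents containing the term; -1 signals an unseen term.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(N / df) if df else -1

def tf_idf(doc, term):
    # Raw term frequency weighted by IDF (logarithmic tf is also common).
    tf = docs[doc].count(term)
    return tf * idf(term) if tf else 0.0

print(idf("people"))              # "people" appears in 2 of the 3 documents
print(tf_idf("doc1.txt", "people"))
```

Variants abound (smoothed IDF, log-scaled term frequency, normalization); the script's exact choices are defined in its `getidf` and `tfidf` functions.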
## Features

- Document Preprocessing: Tokenization, stemming, and stop-word removal.
- TF-IDF Calculation: Computes TF-IDF scores for terms in each document.
- Query Processing: Processes user queries to find and rank relevant documents.
- Special-Case Handling: Gracefully handles missing files and incorrect inputs.
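The preprocessing pipeline (tokenize, drop stop words, stem) can be sketched with NLTK as follows. The inline `STOPWORDS` set is a tiny stand-in for NLTK's `stopwords.words('english')`, which requires `nltk.download('stopwords')`; the actual script's pipeline may differ in detail:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Minimal stand-in for NLTK's English stop-word list (illustrative only).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

tokenizer = RegexpTokenizer(r"[a-zA-Z]+")   # keep alphabetic tokens only
stemmer = PorterStemmer()

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())            # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stop words
    return [stemmer.stem(t) for t in tokens]             # stem

print(preprocess("The People of the United States"))
```

Using `RegexpTokenizer` avoids needing the `punkt` tokenizer data; the Porter stemmer itself requires no downloads.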
## Requirements

- Python 3.7 or higher
- NLTK library
## Installation

1. Clone the repository:

   ```sh
   git clone https://github.com/meggitt/TFIDF-Based-Search.git
   cd TFIDF-Based-Search
   ```

2. Install the required packages:

   ```sh
   pip install nltk
   ```

3. Download the NLTK data files by uncommenting and running the following line in the script:

   ```python
   # nltk.download()
   ```

4. Prepare the corpus: place your text files in a directory named `US_Inaugural_Addresses`.
## Running the Application

Execute the script:

```sh
python tfidf_search.py
```
## Functions

- `preProcessDocuments`: Preprocesses the documents in the corpus.
- `getidf`: Calculates the inverse document frequency for each term.
- `tfidf`: Calculates the TF-IDF score for each term in each document.
- `get_weight`: Retrieves the TF-IDF weight for a specific term in a specific document.
- `get_idf`: Retrieves the IDF value for a specific term.
- `query`: Processes a query to find the most relevant document based on TF-IDF scores.
- `get_weight_q`: Retrieves the weight of a term in the query.
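A common way for a `query` function like the one above to rank documents is cosine similarity between the query vector and each document's TF-IDF vector. The sketch below illustrates that idea on hand-made vectors; the file names echo the corpus, but the values and the scoring method are assumptions, not the script's actual internals:

```python
import math

# Toy TF-IDF vectors per document (illustrative values only).
doc_vectors = {
    "19_lincoln_1861.txt": {"constitut": 0.4, "peopl": 0.2},
    "23_hayes_1877.txt":   {"public": 0.5, "peopl": 0.1},
}

def cosine(q, d):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def query(q_vector):
    # Return the (document, score) pair with the highest similarity.
    return max(((doc, cosine(q_vector, vec)) for doc, vec in doc_vectors.items()),
               key=lambda pair: pair[1])

print("(%s, %.12f)" % query({"peopl": 1.0}))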
## Usage

```python
search = TFIDFSearch("./US_Inaugural_Addresses")  # change according to your folder
search.preProcessDocuments()
search.getidf()
search.tfidf()
print("%.12f" % search.get_idf('children'))
print("%.12f" % search.get_idf('foreign'))
print("%.12f" % search.get_idf('people'))
print("%.12f" % search.get_idf('honor'))
print("%.12f" % search.get_idf('great'))
print("--------------")
print("%.12f" % search.get_weight('19_lincoln_1861.txt', 'constitution'))
print("%.12f" % search.get_weight('23_hayes_1877.txt', 'public'))
print("%.12f" % search.get_weight('25_cleveland_1885.txt', 'citizen'))
print("%.12f" % search.get_weight('09_monroe_1821.txt', 'revenue'))
print("%.12f" % search.get_weight('05_jefferson_1805.txt', 'press'))
print("--------------")
print("(%s, %.12f)" % search.query("pleasing people"))
print("(%s, %.12f)" % search.query("war offenses"))
print("(%s, %.12f)" % search.query("british war"))
print("(%s, %.12f)" % search.query("texas government"))
print("(%s, %.12f)" % search.query("cuba government"))
print("--------------")
print("\n\nSpecial Cases, Incorrect input\n\n")
print("%.12f" % search.get_idf('AT&T'))
print("%.12f" % search.get_weight('007_JJ.txt', 'UTA'))
print("%.12f" % search.get_weight('05_jefferson_1805.txt', 'AT&T'))
print("(%s, %.12f)" % search.query("arlington texas"))
```
## Special Cases and Incorrect Input

The script includes handling for special cases and incorrect inputs. For example:
- Non-existent files are handled gracefully with error messages.
- Terms not found in the documents return a TF-IDF weight of -1.
- Incorrect or malformed input tokens are handled without causing crashes.
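A hedged sketch of how such guards might look; the function names mirror the API above, but the bodies and toy statistics are illustrative, not the repository's actual code:

```python
import math

# Toy statistics (illustrative only): document frequencies and per-file weights.
DOC_FREQ = {"peopl": 12, "govern": 20}
WEIGHTS = {"05_jefferson_1805.txt": {"press": 0.31}}
N_DOCS = 30

def get_idf(term):
    # Unknown terms return -1 instead of raising a KeyError.
    if term not in DOC_FREQ:
        return -1
    return math.log10(N_DOCS / DOC_FREQ[term])

def get_weight(filename, term):
    # Missing files and unknown terms both degrade to -1 with no crash.
    if filename not in WEIGHTS:
        return -1
    return WEIGHTS[filename].get(term, -1)

print(get_idf("AT&T"))                  # -1: term not in the corpus
print(get_weight("007_JJ.txt", "UTA"))  # -1: file not in the corpus
```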
## Contributions

Contributions are welcome! Please create an issue or submit a pull request with your changes.