- Project Description
- Features
- Requirements
- Installation
- Running the Application
- Functions
- Usage
- Special Cases and Incorrect Input
- Contributions
## Project Description

This project implements a TF-IDF (Term Frequency-Inverse Document Frequency) based search engine. It processes a collection of documents, calculates TF-IDF scores, and enables querying to find the most relevant document for a given query. The project is implemented in Python and uses the NLTK library for text preprocessing.
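As background, the standard TF-IDF weighting can be sketched in a few lines. This is a generic illustration on a toy corpus, not the exact formulas or data structures used in the script:

```python
import math

# Toy corpus: three tiny pre-tokenized "documents" (illustrative data only).
docs = {
    "doc1.txt": ["the", "people", "govern"],
    "doc2.txt": ["the", "people", "vote"],
    "doc3.txt": ["war", "and", "peace"],
}

N = len(docs)

def idf(term):
    # Inverse document frequency: log10(N / df), where df is the number
    # of documents containing the term; -1 signals an unseen term.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(N / df) if df else -1

def tf_idf(doc, term):
    # Raw term frequency weighted by IDF (logarithmic tf is also common).
    tf = docs[doc].count(term)
    return tf * idf(term) if tf else 0.0

print(idf("people"))              # "people" appears in 2 of the 3 documents
print(tf_idf("doc1.txt", "people"))
```

Variants abound (smoothed IDF, log-scaled term frequency, normalization); the script's exact choices are defined in its `getidf` and `tfidf` functions.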
## Features

- Document Preprocessing: Tokenization, stemming, and stop-word removal.
- TF-IDF Calculation: Computes TF-IDF scores for terms in each document.
- Query Processing: Processes user queries to find and rank relevant documents.
- Special-Case Handling: Gracefully handles missing files and incorrect inputs.
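The preprocessing pipeline (tokenize, drop stop words, stem) can be sketched with NLTK as follows. The inline `STOPWORDS` set is a tiny stand-in for NLTK's `stopwords.words('english')`, which requires `nltk.download('stopwords')`; the actual script's pipeline may differ in detail:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Minimal stand-in for NLTK's English stop-word list (illustrative only).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

tokenizer = RegexpTokenizer(r"[a-zA-Z]+")   # keep alphabetic tokens only
stemmer = PorterStemmer()

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())            # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stop words
    return [stemmer.stem(t) for t in tokens]             # stem

print(preprocess("The People of the United States"))
```

Using `RegexpTokenizer` avoids needing the `punkt` tokenizer data; the Porter stemmer itself requires no downloads.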
## Requirements

- Python 3.7 or higher
- NLTK library
## Installation

1. Clone the repository:

   ```sh
   git clone https://github.com/meggitt/TFIDF-Based-Search.git
   cd TFIDF-Based-Search
   ```

2. Install the required packages:

   ```sh
   pip install nltk
   ```

3. Download the NLTK data files by uncommenting and running the following line in the script:

   ```python
   # nltk.download()
   ```

4. Prepare the corpus: place your text files in a directory named `US_Inaugural_Addresses`.
## Running the Application

Execute the script:

```sh
python tfidf_search.py
```
## Functions

- `preProcessDocuments`: Preprocesses the documents in the corpus.
- `getidf`: Calculates the inverse document frequency for each term.
- `tfidf`: Calculates the TF-IDF score for each term in each document.
- `get_weight`: Retrieves the TF-IDF weight for a specific term in a specific document.
- `get_idf`: Retrieves the IDF value for a specific term.
- `query`: Processes a query to find the most relevant document based on TF-IDF scores.
- `get_weight_q`: Retrieves the weight of a term in the query.
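A common way for a `query` function like the one above to rank documents is cosine similarity between the query vector and each document's TF-IDF vector. The sketch below illustrates that idea on hand-made vectors; the file names echo the corpus, but the values and the scoring method are assumptions, not the script's actual internals:

```python
import math

# Toy TF-IDF vectors per document (illustrative values only).
doc_vectors = {
    "19_lincoln_1861.txt": {"constitut": 0.4, "peopl": 0.2},
    "23_hayes_1877.txt":   {"public": 0.5, "peopl": 0.1},
}

def cosine(q, d):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def query(q_vector):
    # Return the (document, score) pair with the highest similarity.
    return max(((doc, cosine(q_vector, vec)) for doc, vec in doc_vectors.items()),
               key=lambda pair: pair[1])

print("(%s, %.12f)" % query({"peopl": 1.0}))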
## Usage

```python
search = TFIDFSearch("./US_Inaugural_Addresses")  # change according to your folder
search.preProcessDocuments()
search.getidf()
search.tfidf()
print("%.12f" % search.get_idf('children'))
print("%.12f" % search.get_idf('foreign'))
print("%.12f" % search.get_idf('people'))
print("%.12f" % search.get_idf('honor'))
print("%.12f" % search.get_idf('great'))
print("--------------")
print("%.12f" % search.get_weight('19_lincoln_1861.txt', 'constitution'))
print("%.12f" % search.get_weight('23_hayes_1877.txt', 'public'))
print("%.12f" % search.get_weight('25_cleveland_1885.txt', 'citizen'))
print("%.12f" % search.get_weight('09_monroe_1821.txt', 'revenue'))
print("%.12f" % search.get_weight('05_jefferson_1805.txt', 'press'))
print("--------------")
print("(%s, %.12f)" % search.query("pleasing people"))
print("(%s, %.12f)" % search.query("war offenses"))
print("(%s, %.12f)" % search.query("british war"))
print("(%s, %.12f)" % search.query("texas government"))
print("(%s, %.12f)" % search.query("cuba government"))
print("--------------")
print("\n\nSpecial Cases, Incorrect input\n\n")
print("%.12f" % search.get_idf('AT&T'))
print("%.12f" % search.get_weight('007_JJ.txt', 'UTA'))
print("%.12f" % search.get_weight('05_jefferson_1805.txt', 'AT&T'))
print("(%s, %.12f)" % search.query("arlington texas"))
```
## Special Cases and Incorrect Input

The script includes handling for special cases and incorrect inputs. For example:
- Non-existent files are handled gracefully with error messages.
- Terms not found in the documents return a TF-IDF weight of -1.
- Incorrect or malformed input tokens are handled without causing crashes.
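A hedged sketch of how such guards might look; the function names mirror the API above, but the bodies and toy statistics are illustrative, not the repository's actual code:

```python
import math

# Toy statistics (illustrative only): document frequencies and per-file weights.
DOC_FREQ = {"peopl": 12, "govern": 20}
WEIGHTS = {"05_jefferson_1805.txt": {"press": 0.31}}
N_DOCS = 30

def get_idf(term):
    # Unknown terms return -1 instead of raising a KeyError.
    if term not in DOC_FREQ:
        return -1
    return math.log10(N_DOCS / DOC_FREQ[term])

def get_weight(filename, term):
    # Missing files and unknown terms both degrade to -1 with no crash.
    if filename not in WEIGHTS:
        return -1
    return WEIGHTS[filename].get(term, -1)

print(get_idf("AT&T"))                  # -1: term not in the corpus
print(get_weight("007_JJ.txt", "UTA"))  # -1: file not in the corpus
```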
## Contributions

Contributions are welcome! Please create an issue or submit a pull request with your changes.