KPlanisphere/TF-IDF-Cosine-Similarity

TF-IDF Cosine Similarity

This repository contains a Python script that computes the cosine similarity between the TF-IDF vectors of documents and queries, then ranks the documents for each query by that similarity.

Features

  • TF-IDF Matrix Handling: Load TF-IDF matrices for documents and queries from .txt files.
  • Cosine Similarity Calculation: Compute cosine similarity values between vectors.
  • Ranking Results: Rank documents for each query based on similarity scores.
  • Output: Save the ranked results to a specified output file.

Files

  • lab6.py: Main Python script for processing the TF-IDF matrices and calculating similarities.
  • matriz-TFIDF-docs.txt: Input file containing the TF-IDF matrix for documents.
  • matriz-TFIDF-query.txt: Input file containing the TF-IDF matrix for queries.
  • NPL_tf_idf_rels.txt: Output file containing the ranked document-query relevance scores.

How It Works

  1. Load TF-IDF Matrices: The script reads TF-IDF matrices from text files and converts them into Python lists for processing.

  2. Compute Cosine Similarity: For each query vector $\mathbf{A}$, the cosine similarity with each document vector $\mathbf{B}$ is calculated using the formula:

    $\text{Cosine Similarity} = \frac{\text{Dot Product}(\mathbf{A}, \mathbf{B})}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|}$

  3. Rank Documents: Documents are sorted in descending order of similarity for each query.

  4. Output Results: The results are written to a file in the format:

    <Query_ID> <Document_ID> <Similarity_Score>
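
The four steps above can be sketched in a few functions. This is a minimal illustration and not the code in lab6.py: the function names, the 1-based IDs taken from row positions, the six-decimal score format, and the choice to score an all-zero vector as 0 are all assumptions made for the example.

```python
import math


def load_matrix(path):
    """Read one vector per line; values are whitespace-separated floats."""
    with open(path, encoding="utf-8") as handle:
        return [[float(value) for value in line.split()]
                for line in handle if line.strip()]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector gets similarity 0
    return dot / (norm_a * norm_b)


def rank_documents(queries, docs):
    """Return (query_id, doc_id, score) tuples, best match first per query.

    Assumption: IDs are the 1-based row positions in the input matrices.
    """
    results = []
    for query_id, query in enumerate(queries, start=1):
        scored = [(doc_id, cosine_similarity(query, doc))
                  for doc_id, doc in enumerate(docs, start=1)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.extend((query_id, doc_id, score) for doc_id, score in scored)
    return results


def write_results(results, path):
    """Write one '<Query_ID> <Document_ID> <Similarity_Score>' line per result."""
    with open(path, "w", encoding="utf-8") as handle:
        for query_id, doc_id, score in results:
            handle.write(f"{query_id} {doc_id} {score:.6f}\n")
```

For example, `rank_documents([[1.0, 0.0]], [[0.0, 1.0], [2.0, 0.0]])` puts document 2 first: it points in the same direction as the query, so its similarity is 1 regardless of its length, while document 1 is orthogonal to the query and scores 0.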
    

Usage

  1. Clone the repository:

    git clone https://github.com/yourusername/tf-idf-cosine-similarity.git
    cd tf-idf-cosine-similarity
  2. Place matriz-TFIDF-docs.txt and matriz-TFIDF-query.txt in the same directory as lab6.py, or update the file paths in lab6.py.

  3. Run the script:

    python lab6.py
  4. Check the output file NPL_tf_idf_rels.txt for the ranked results.
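
To inspect the results without scrolling through the whole file, something like the following could pull out the best few documents per query. It is a sketch that assumes each output line holds exactly three whitespace-separated fields, as in the format shown under How It Works. The function name `top_k_per_query` is hypothetical, and IDs are kept as strings because the README does not say what form they take.

```python
from collections import defaultdict


def top_k_per_query(lines, k=3):
    """Group '<Query_ID> <Document_ID> <Similarity_Score>' lines by query
    and keep the k highest-scoring documents for each one."""
    by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        query_id, doc_id, score = parts
        by_query[query_id].append((doc_id, float(score)))
    return {
        query_id: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]
        for query_id, pairs in by_query.items()
    }
```

The function only iterates over its argument, so an open handle on NPL_tf_idf_rels.txt can be passed directly in place of a list of strings.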

Requirements

  • Python 3.10+

Notes

  • Ensure the TF-IDF matrices are in the correct format, with each line representing a vector of float values separated by spaces.
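  • Large matrices may take a while to process: every query is compared against every document, so runtime grows with the number of queries times the number of documents times the vector length.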
  • The script includes a safety limit to avoid infinite loops during query processing.
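
Before running a large job, the input format described in the first note can be checked with a short helper. This is a sketch under assumptions, not part of lab6.py: the name `check_matrix_lines` is hypothetical, blank lines are ignored, and every row is expected to hold the same number of values, since a dot product needs vectors of equal length.

```python
def check_matrix_lines(lines):
    """Return (row_count, column_count), or raise ValueError naming the first bad line."""
    width = None
    rows = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            values = [float(token) for token in line.split()]
        except ValueError as error:
            raise ValueError(f"line {number}: non-numeric value") from error
        if width is None:
            width = len(values)  # the first row fixes the expected length
        elif len(values) != width:
            raise ValueError(
                f"line {number}: expected {width} values, found {len(values)}"
            )
        rows += 1
    return rows, width or 0
```

Running it on both input files and comparing the two column counts would also catch a document matrix and a query matrix built over different vocabularies.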

License

This project is licensed under the MIT License. See the LICENSE file for more details.