This repository contains a Python implementation that computes the cosine similarity between the TF-IDF vectors of documents and queries, then ranks the most relevant documents for each query by similarity score.
- TF-IDF Matrix Handling: Load TF-IDF matrices for documents and queries from .txt files.
- Cosine Similarity Calculation: Compute cosine similarity values between vectors.
- Ranking Results: Rank documents for each query based on similarity scores.
- Output: Save the ranked results to a specified output file.
- lab6.py: Main Python script for processing the TF-IDF matrices and calculating similarities.
- matriz-TFIDF-docs.txt: Input file containing the TF-IDF matrix for documents.
- matriz-TFIDF-query.txt: Input file containing the TF-IDF matrix for queries.
- NPL_tf_idf_rels.txt: Output file containing the ranked document-query relevance scores.
- Load TF-IDF Matrices: The script reads TF-IDF matrices from text files and converts them into Python lists for processing.
- Compute Cosine Similarity: For each query vector, the cosine similarity with each document vector is calculated using the formula:
  $\text{Cosine Similarity} = \frac{\text{Dot Product}(\mathbf{A}, \mathbf{B})}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|}$
- Rank Documents: Documents are sorted in descending order of similarity for each query.
- Output Results: The results are written to a file in the format:
  <Query_ID> <Document_ID> <Similarity_Score>
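
The similarity, ranking, and output steps above can be sketched roughly as follows. This is a minimal illustration, not the code in lab6.py: the function names, the 1-based IDs, the six-decimal score format, and the choice to score an all-zero vector as 0.0 are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query, docs):
    """Return (doc_id, score) pairs sorted by descending score.

    Document IDs are 1-based row positions (an assumption for this sketch).
    """
    scores = [(doc_id, cosine_similarity(query, doc))
              for doc_id, doc in enumerate(docs, start=1)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def format_results(queries, docs):
    """Build '<Query_ID> <Document_ID> <Similarity_Score>' lines for every query."""
    lines = []
    for query_id, query in enumerate(queries, start=1):
        for doc_id, score in rank_documents(query, docs):
            lines.append(f"{query_id} {doc_id} {score:.6f}")
    return lines

# Tiny hand-made matrices, not real TF-IDF data.
docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
queries = [[0.0, 1.0]]
for line in format_results(queries, docs):
    print(line)
# 1 3 1.000000
# 1 2 0.707107
# 1 1 0.000000
```

Document 3 ranks first because it points in the same direction as the query, even though its magnitude differs: cosine similarity depends only on the angle between vectors, not on their length.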
- Clone the repository:
  git clone https://github.com/yourusername/tf-idf-cosine-similarity.git
  cd tf-idf-cosine-similarity
- Place your TF-IDF matrices in the same directory or update the file paths in lab6.py.
- Run the script:
  python lab6.py
- Check the output file NPL_tf_idf_rels.txt for the ranked results.
- Python 3.10+
- Ensure the TF-IDF matrices are in the correct format, with each line representing a vector of float values separated by spaces.
- Large matrices may take a while to process, because every query vector is compared against every document vector.
- The script includes a safety limit to avoid infinite loops during query processing.
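
To check that an input file matches the expected format (one vector of space-separated floats per line), a small parser along these lines may help. It is a sketch under assumptions: `parse_matrix` is a hypothetical helper, not a function from lab6.py, and skipping blank lines and rejecting rows of differing length are choices made for this example.

```python
import io

def parse_matrix(lines):
    """Parse an iterable of text lines into a list of float vectors.

    Blank lines are skipped (an assumption). A ValueError is raised for
    non-numeric fields or for rows whose lengths differ.
    """
    matrix = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        try:
            matrix.append([float(field) for field in fields])
        except ValueError as exc:
            raise ValueError(f"line {line_no}: expected space-separated floats") from exc
    widths = {len(row) for row in matrix}
    if len(widths) > 1:
        raise ValueError(f"rows have differing lengths: {sorted(widths)}")
    return matrix

# In-memory stand-in for a matrix file, so the example runs without any input files.
sample = io.StringIO("0.0 0.5 1.2\n\n0.3 0.0 0.7\n")
print(parse_matrix(sample))
# [[0.0, 0.5, 1.2], [0.3, 0.0, 0.7]]
```

Because the function accepts any iterable of lines, an open file handle for matriz-TFIDF-docs.txt or matriz-TFIDF-query.txt could be passed to it directly.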
This project is licensed under the MIT License. See the LICENSE file for more details.