TF-IDF Cosine Similarity

This repository contains a Python implementation that calculates the cosine similarity between the TF-IDF vectors of documents and queries. For each query, it ranks the documents from most to least similar.

Features

TF-IDF Matrix Handling: Load TF-IDF matrices for documents and queries from .txt files.
Cosine Similarity Calculation: Compute cosine similarity values between vectors.
Ranking Results: Rank documents for each query based on similarity scores.
Output: Save the ranked results to a specified output file.

Files

lab6.py: Main Python script for processing the TF-IDF matrices and calculating similarities.
matriz-TFIDF-docs.txt: Input file containing the TF-IDF matrix for documents.
matriz-TFIDF-query.txt: Input file containing the TF-IDF matrix for queries.
NPL_tf_idf_rels.txt: Output file containing the ranked document-query relevance scores.

How It Works

Load TF-IDF Matrices: The script reads TF-IDF matrices from text files and converts them into Python lists for processing.
Compute Cosine Similarity: For each query vector, the cosine similarity with each document vector is calculated using the formula:

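$\text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$

where $\mathbf{A}$ is a query vector, $\mathbf{B}$ is a document vector, $\mathbf{A} \cdot \mathbf{B}$ is their dot product, and $\|\cdot\|$ is the Euclidean norm.

Rank Documents: Documents are sorted in descending order of similarity for each query.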
Output Results: The results are written to a file in the format:
```
<Query_ID> <Document_ID> <Similarity_Score>
```
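
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in lab6.py: the function names (`cosine_similarity`, `rank_documents`, `format_rows`), the 1-based row numbers used as IDs, the six-decimal score format, and the choice to score all-zero vectors as 0.0 are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector gets similarity 0.0
        # instead of raising a division-by-zero error.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(queries, docs):
    """Return (query_id, doc_id, score) rows, best match first per query.

    Assumption: IDs are the 1-based row positions in each matrix.
    """
    rows = []
    for q_id, q_vec in enumerate(queries, start=1):
        scored = [
            (d_id, cosine_similarity(q_vec, d_vec))
            for d_id, d_vec in enumerate(docs, start=1)
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        rows.extend((q_id, d_id, score) for d_id, score in scored)
    return rows


def format_rows(rows):
    """Render rows as '<Query_ID> <Document_ID> <Similarity_Score>' lines."""
    return [f"{q_id} {d_id} {score:.6f}" for q_id, d_id, score in rows]
```

For one query `[1.0, 0.0]` and three documents `[0.0, 1.0]`, `[1.0, 1.0]`, `[1.0, 0.0]`, the documents are ranked 3, 2, 1: document 3 points in the same direction as the query, document 2 is at 45 degrees to it, and document 1 is orthogonal to it.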

Usage

Clone the repository:

git clone https://github.com/yourusername/tf-idf-cosine-similarity.git
cd tf-idf-cosine-similarity

Place your TF-IDF matrices in the same directory or update the file paths in lab6.py.
Run the script:
```
python lab6.py
```
Check the output file NPL_tf_idf_rels.txt for the ranked results.

Requirements

Python 3.10+

Notes

Ensure the TF-IDF matrices are in the correct format, with each line representing a vector of float values separated by spaces.
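Every query is compared against every document, so the running time grows with the number of queries multiplied by the number of documents and by the vector length. Large matrices may therefore take a while to process.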
The script includes a safety limit to avoid infinite loops during query processing.
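
As an illustration of the expected input format, the sketch below parses lines of space-separated floats into a list of rows and rejects rows of inconsistent length. The helper names `parse_matrix` and `load_matrix`, the skipping of blank lines, and the UTF-8 encoding are assumptions for this example and may differ from what lab6.py does.

```python
def parse_matrix(lines):
    """Parse rows of whitespace-separated floats into a list of lists.

    Assumption: blank lines are ignored and every row must have the
    same number of values as the first row.
    """
    matrix = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            row = [float(token) for token in line.split()]
        except ValueError as exc:
            raise ValueError(f"line {line_no}: non-numeric value") from exc
        if matrix and len(row) != len(matrix[0]):
            raise ValueError(
                f"line {line_no}: expected {len(matrix[0])} values, "
                f"found {len(row)}"
            )
        matrix.append(row)
    return matrix


def load_matrix(path):
    """Read a matrix file such as matriz-TFIDF-docs.txt."""
    with open(path, encoding="utf-8") as handle:
        return parse_matrix(handle)
```

With these helpers, `load_matrix("matriz-TFIDF-docs.txt")` would return one list of floats per document, and a malformed row is reported with its line number rather than failing later during the similarity calculation.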

License

This project is licensed under the MIT License. See the LICENSE file for more details.
