This repository contains the code for the diploma project focused on vectorizing scientific texts using the Top2Vec algorithm.
The aim of the project is to develop a module for analyzing scientific texts, identifying development trends, searching for similar documents, and determining the pace of development of thematic groups. The practical application of the module is intended for the analysis of individual thematic groups in the field of computer science.
- Automated data collection from the arXiv platform.
- Vectorization of scientific texts using the Top2Vec algorithm.
- Identification of keywords for thematic groups.
- Clustering and analysis of thematic groups.
- Trend analysis and identification of key development directions.
- Visualization of analysis results.
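The vectorization, keyword, and similar-document features listed above could be sketched with the `top2vec` package roughly as follows. This is a minimal illustration, not the code from the project notebook: the helper names (`clean_abstracts`, `train_topic_model`, `describe_topics`, `similar_documents`) are hypothetical, and the `speed` and `workers` values are assumed defaults chosen for the example. Top2Vec also needs a reasonably large corpus before it can find topics.

```python
def clean_abstracts(raw_abstracts):
    """Collapse runs of whitespace and drop empty abstracts."""
    cleaned = (" ".join(text.split()) for text in raw_abstracts)
    return [text for text in cleaned if text]


def train_topic_model(abstracts, workers=4):
    """Train a Top2Vec model on a list of abstract strings."""
    # Third-party dependency (pip install top2vec), imported lazily so the
    # pure-Python helper above can be used without it.
    from top2vec import Top2Vec

    # speed="learn" and the default doc2vec embedding are illustrative
    # choices, not settings confirmed by the project.
    return Top2Vec(documents=abstracts, speed="learn", workers=workers)


def describe_topics(model, top_n_words=10):
    """Map each topic number to its highest-ranked keywords."""
    topic_words, _word_scores, topic_nums = model.get_topics()
    return {
        int(num): list(words[:top_n_words])
        for num, words in zip(topic_nums, topic_words)
    }


def similar_documents(model, doc_id, num_docs=5):
    """Return (document id, similarity score) pairs closest to doc_id."""
    _docs, scores, doc_ids = model.search_documents_by_documents(
        doc_ids=[doc_id], num_docs=num_docs
    )
    return list(zip(doc_ids, scores))
```

A typical call sequence would be `model = train_topic_model(clean_abstracts(abstracts))`, followed by `describe_topics(model)` to inspect the keywords of each thematic group.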
The model was trained on a dataset of scientific paper abstracts obtained from arXiv. The dataset covers a range of topics in the field of computer science from 2010 to 2024.
The dataset is available as arxiv_papers_cs.
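Abstracts of this kind could be collected through arXiv's public query API, which returns an Atom feed. The sketch below is an assumption about how such a collector might look, not the project's actual script: the function names, the returned field names, and the example category `cs.AI` are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category, start=0, max_results=100):
    """Build a query URL for one arXiv category, e.g. 'cs.AI'."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Extract id, year, title, and abstract from an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        published = entry.findtext("atom:published", default="", namespaces=ATOM)
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            # The first four characters of the timestamp are the year.
            "year": int(published[:4]) if published[:4].isdigit() else None,
            # Titles and abstracts are hard-wrapped in the feed; join them.
            "title": " ".join(title.split()),
            "abstract": " ".join(summary.split()),
        })
    return papers


def fetch_papers(category, start=0, max_results=100):
    """Download one page of results (performs a network request)."""
    url = build_query_url(category, start, max_results)
    with urllib.request.urlopen(url) as response:
        return parse_feed(response.read())
```

Larger collections would page through results by increasing `start`, with a pause between requests to stay within arXiv's usage guidelines.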
The main project code is contained in the main.ipynb Jupyter notebook. To work with the notebook:
- Open the notebook in Jupyter Lab or Jupyter Notebook.
- Execute the cells sequentially to run each stage of the analysis.
Here are some examples of the model's output for the thematic group "UAVs in Natural Disasters and Emergencies":
This graph shows the trend of interest in the use of UAVs in situations of natural disasters and emergencies over time.
Analysis for the Thematic Group: Natural Disasters and Emergencies
| Year | Number of Publications | Growth Acceleration | Change in Number of Publications | Relative Growth |
|---|---|---|---|---|
| 2010 | 19 | 0 | 0 | 0.00% |
| 2011 | 15 | -4 | -4 | -21.05% |
| 2012 | 28 | 17 | 13 | 86.67% |
| 2013 | 38 | -3 | 10 | 35.71% |
| 2014 | 28 | -20 | -10 | -26.32% |
| 2015 | 47 | 29 | 19 | 67.86% |
| 2016 | 63 | -3 | 16 | 34.04% |
| 2017 | 94 | 15 | 31 | 49.21% |
| 2018 | 173 | 48 | 79 | 84.04% |
| 2019 | 266 | 14 | 93 | 53.76% |
| 2020 | 337 | -22 | 71 | 26.69% |
| 2021 | 380 | -28 | 43 | 12.76% |
| 2022 | 453 | 30 | 73 | 19.21% |
| 2023 | 509 | -17 | 56 | 12.36% |
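The derived columns in the table above appear to follow from the yearly publication counts alone: the change is the difference from the previous year, the acceleration is the difference between consecutive changes, and the relative growth is the change divided by the previous year's count. These definitions are inferred from the table values rather than taken from the notebook, and the function and key names below are hypothetical. A minimal sketch:

```python
def growth_metrics(counts):
    """Compute per-year growth metrics from a {year: count} mapping."""
    rows = []
    prev_count = None
    prev_change = 0
    for year in sorted(counts):
        count = counts[year]
        if prev_count is None:
            # The first year has no predecessor, so every metric is zero.
            change, acceleration, relative = 0, 0, 0.0
        else:
            change = count - prev_count
            acceleration = change - prev_change
            relative = round(100 * change / prev_count, 2) if prev_count else 0.0
        rows.append({
            "year": year,
            "publications": count,
            "change": change,
            "acceleration": acceleration,
            "relative_growth_pct": relative,
        })
        prev_count, prev_change = count, change
    return rows


counts = {2010: 19, 2011: 15, 2012: 28, 2013: 38}
for row in growth_metrics(counts):
    print(row)
```

With the first four counts from the table, the 2012 row comes out as a change of 13, an acceleration of 17, and a relative growth of 86.67%, which matches the published figures.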
We welcome contributions to the top2vec_scientific_texts model. If you have suggestions, improvements, or encounter any issues, please feel free to open an issue or submit a pull request.
The results of the analysis include the identification of thematic groups, trend analysis, and visualization of the dynamics of interest in various topics over the years. These results are presented as tables and graphs inside the main.ipynb notebook.