TASS News Scraper

A powerful and free TASS web scraper designed to efficiently aggregate news articles from tass.com. This data extraction tool enables journalists, researchers, and organizations to gather large datasets of russian state media content for analysing propaganda patterns and narratives.

🔥 Features

Choose how many articles to scrape per category
Multiple news categories
Built-in top 10 words analysis algorithm
Export to both JSON and CSV formats
Configurable concurrent workers for faster scraping
Automatic retry mechanism for failed requests
Adjustable parameters for best performance and personal customization
No Python installation required: download the compiled version and run it
Control it through your computer's terminal / command prompt

💿 Setup

⊞ Windows

Download tass_scraper from the releases folder
Open Command Prompt or PowerShell
Navigate to the download location of tass_scraper
Run the executable using the instructions below

 macOS

Download tass_scraper from the releases folder
Open Terminal
Navigate to the download location of tass_scraper
Make the file executable in your terminal. The chmod +x command grants the necessary execute permission to run the program:
```
chmod +x tass_scraper
```
Run the executable using the instructions below

🗂️ Compile from source

Clone this repository or copy the code from the source file
Install Python 3.8 or higher
Install the required libraries for the scraper:
```
pip install requests beautifulsoup4 lxml
```
Install PyInstaller:
```
pip install pyinstaller
```

Create an executable file:

pyinstaller --name tass_scraper --onefile --console --clean tass_scraper.py

🐍 Run as a Python file

Alternatively, you can run the scraper directly as a Python script in your own environment.

You must have Python installed (3.8 or higher)
You must install the required requests, beautifulsoup4, and lxml libraries
For ease of use, run the .py file from your terminal. For example:

python3 tass_scraper.py --headlines 100 --categories politics

🛠️ How to use TASS scraper

1. Navigate to the scraper folder

First, navigate to the folder where tass_scraper is located on your computer, for example:

On Windows:

cd C:/Users/my-user/projects/tass_news

On macOS:

cd /Users/my-user/projects/tass_news

2. Run the TASS scraper

Basic usage with default settings (scrapes 20 news articles from each category):

./tass_scraper

To see the help page, run ./tass_scraper -h.

Note

The scraper writes its output to the directory you run it from. For example, if you run the scraper from C:/Users/my-user, the news_data folder is created in that same path.

Example with custom parameters:

./tass_scraper --headlines 50 --categories world politics defense --workers 5 --csv --top-words

Caution

Use a reasonable value for --workers. The default is 2 workers, which is good for most tasks. A good rule of thumb is to set workers to the number of CPU cores for optimal performance. Too many workers can overload your system and reduce efficiency. Additionally, TASS may block your IP address if you make too many requests per minute.
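
As a rough illustration of how the worker, delay, and retry settings interact, the sketch below runs a stand-in fetch function through a thread pool. It is not the scraper's actual code: fetch_with_retry, scrape_all, and fake_fetch are hypothetical names, and treating --max-retries as the total number of attempts is an assumption.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, max_retries=3, min_delay=0.2, max_delay=1.0):
    """Call fetch(url), pausing a random interval before each attempt."""
    last_error = None
    # Assumption: max_retries is the total number of attempts.
    for _ in range(max_retries):
        # A random pause keeps requests from arriving at a fixed rhythm.
        time.sleep(random.uniform(min_delay, max_delay))
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
    raise RuntimeError(f"giving up on {url}") from last_error


def scrape_all(fetch, urls, workers=2, **retry_options):
    """Fetch every URL with at most `workers` threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda u: fetch_with_retry(fetch, u, **retry_options), urls)
        )


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for an HTTP request, so no network access is needed.
        return f"<html>{url}</html>"

    pages = scrape_all(
        fake_fetch, ["a", "b", "c"], workers=2, min_delay=0.0, max_delay=0.01
    )
    print(pages)
```

In a layout like this, each worker sleeps independently, so the overall request rate grows roughly in proportion to the number of workers. That is one reason a small --workers value combined with a non-zero delay is the safer choice.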

You can combine as many parameters as you need.

To save the data in a custom folder, use the --output-dir flag, for instance:

./tass_scraper --output-dir /Users/my-user/Documents/my-custom-folder
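
A relative output folder such as the default ./news_data is resolved against the directory you launch the scraper from, which is why the data ends up wherever you ran it. Here is a minimal sketch of that resolution, assuming the tool follows standard Python path handling. resolve_output_dir is a hypothetical helper, not part of the scraper.

```python
from pathlib import Path


def resolve_output_dir(output_dir="./news_data"):
    """Return the absolute output folder and its logs subfolder, creating both."""
    # A relative path is resolved against the current working directory.
    base = Path(output_dir).expanduser().resolve()
    logs = base / "logs"
    logs.mkdir(parents=True, exist_ok=True)  # creates base as well
    return base, logs
```

Calling resolve_output_dir() from C:/Users/my-user would yield C:/Users/my-user/news_data, while passing an absolute path returns that path unchanged.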

⚙️ All Parameters

Parameter	Default	Description
`-h`	-	View the help page
`--headlines`	`20`	Number of headlines to scrape per category
`--categories`	`all`	Categories to scrape (see available categories below)
`--csv`	`false`	Save output in CSV format instead of JSON
`--workers`	`2`	Maximum number of concurrent workers
`--output-dir`	`./news_data`	Output directory for scraped data
`--top-words`	`false`	Enable top 10 words analysis
`--min-delay`	`0.2`	Minimum delay between requests in seconds
`--max-delay`	`1.0`	Maximum delay between requests in seconds
`--max-retries`	`3`	Maximum number of retry attempts
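
For readers who want to wrap or reimplement the command line, the table above maps naturally onto Python's argparse. The sketch below mirrors the documented flags and defaults. The real parser in the source file may be organized differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the flags and defaults listed in the table."""
    parser = argparse.ArgumentParser(
        prog="tass_scraper",
        description="Scrape news articles from tass.com",
    )  # argparse adds -h automatically
    parser.add_argument("--headlines", type=int, default=20,
                        help="number of headlines to scrape per category")
    parser.add_argument("--categories", nargs="+", default=["all"],
                        help="categories to scrape")
    parser.add_argument("--csv", action="store_true",
                        help="save output in CSV format instead of JSON")
    parser.add_argument("--workers", type=int, default=2,
                        help="maximum number of concurrent workers")
    parser.add_argument("--output-dir", default="./news_data",
                        help="output directory for scraped data")
    parser.add_argument("--top-words", action="store_true",
                        help="enable top 10 words analysis")
    parser.add_argument("--min-delay", type=float, default=0.2,
                        help="minimum delay between requests in seconds")
    parser.add_argument("--max-delay", type=float, default=1.0,
                        help="maximum delay between requests in seconds")
    parser.add_argument("--max-retries", type=int, default=3,
                        help="maximum number of retry attempts")
    return parser


# Parse the custom-parameters example from the usage section.
args = build_parser().parse_args(
    "--headlines 50 --categories world politics defense "
    "--workers 5 --csv --top-words".split()
)
print(args.headlines, args.categories, args.workers, args.csv, args.top_words)
```

Flags that take no value (--csv, --top-words) are modeled as store_true switches, which matches their documented default of false.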

📤 Output

The scraper creates a directory named news_data (or your specified output directory) containing:

One file per category (JSON or CSV) with scraped articles (examples of both are in the tass output examples folder)
A logs subdirectory with detailed execution logs

JSON Output Format

See an example JSON file with 20 headlines and analysed top words.

[
    {
        "...": "..."
    },
    {
        "title": "Lavrov to hold online news conference on December 26 — spokeswoman",
        "description": "Maria Zakharova underlined that there is a lot of topics",
        "date": "2024-12-24 15:00:33",
        "link": "https://tass.com/politics/1892567",
        "content": [
            "Maria Zakharova underlined that there is a lot of topics",
            "MOSCOW, December 24. /TASS/. Russian Foreign Minister Sergey Lavrov will hold an online news conference for foreign journalists on December 26, Russian Foreign Ministry Spokeswoman Maria Zakharova said.",
            "\"The day after tomorrow (December 26 - TASS), the top Russian diplomat will speak with foreign correspondents,\" she told the Rossiya-24 television channel.",
            "\"It is going to be hot because there is a lot of topics. He will outline the conclusions on some aspects of the international situation,\" she said, adding that Lavrov’s plans for December 25 also include an interview with the 60 Minutes program on the Rossiya-1 television channel."
        ]
    },
    {
        "...": "..."
    }
]
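
To work with this format in Python, a short loader along the following lines may help. It is a sketch that assumes the field names shown above. load_articles and the added published and text keys are illustrative, and the embedded sample is an abbreviated copy of one record from the example.

```python
import json
from datetime import datetime

# Abbreviated copy of one record from the example output.
SAMPLE = """
[
    {
        "description": "Maria Zakharova underlined that there is a lot of topics",
        "date": "2024-12-24 15:00:33",
        "link": "https://tass.com/politics/1892567",
        "content": [
            "Maria Zakharova underlined that there is a lot of topics"
        ]
    }
]
"""


def load_articles(text):
    """Parse scraper JSON, adding a parsed datetime and the joined body text."""
    articles = json.loads(text)
    for article in articles:
        article["published"] = datetime.strptime(
            article["date"], "%Y-%m-%d %H:%M:%S"
        )
        # "content" is a list of paragraphs; join it for full-text analysis.
        article["text"] = "\n".join(article["content"])
    return articles


articles = load_articles(SAMPLE)
for article in articles:
    print(article["published"].date(), article["link"])
```

To read a real output file instead of the sample, pass the file's contents, for example the result of Path(...).read_text(encoding="utf-8").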

CSV Output Format

See an example CSV file with 20 headlines and analysed top words.

Each article is flattened into a single row with columns:

title
description
date
link
content
top_word_1 to top_word_10 (if enabled)
top_word_1_count to top_word_10_count (if enabled)
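
The column layout above can be reproduced with a word counter and csv.DictWriter. The following is only a sketch of the idea, not the scraper's own algorithm: the stop-word list, the tokenization rule, the way content paragraphs are joined, and the names top_words, flatten, and to_csv are all assumptions.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stop-word list; the tool's real filtering is not documented here.
STOP_WORDS = {"the", "and", "that", "there", "for", "will", "with", "said"}


def top_words(paragraphs, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", " ".join(paragraphs).lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return counts.most_common(n)


def flatten(article, n=10):
    """Turn one article dict into a flat row with top_word_N columns."""
    row = {key: article[key] for key in ("title", "description", "date", "link")}
    row["content"] = " ".join(article["content"])
    ranked = top_words(article["content"], n)
    for i in range(1, n + 1):
        # Leave cells empty when the article has fewer than n distinct words.
        word, count = ranked[i - 1] if i <= len(ranked) else ("", "")
        row[f"top_word_{i}"] = word
        row[f"top_word_{i}_count"] = count
    return row


def to_csv(articles, n=10):
    """Serialize articles to CSV text using the column order listed above."""
    fields = ["title", "description", "date", "link", "content"]
    fields += [f"top_word_{i}" for i in range(1, n + 1)]
    fields += [f"top_word_{i}_count" for i in range(1, n + 1)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(flatten(a, n) for a in articles)
    return buffer.getvalue()


# Placeholder article used only to exercise the functions.
example = {
    "title": "Example title",
    "description": "Example description",
    "date": "2024-12-24 15:00:33",
    "link": "https://tass.com/politics/1892567",
    "content": ["Lavrov Lavrov speaks", "Lavrov news news"],
}
print(to_csv([example], n=2))
```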

This tool is meant to step up the fight against russian propaganda by making it easier to rapidly gather large-scale propaganda data for analysis.

Feel free to grab the scraper code, make it yours, and plug it into your workflow.
