A powerful and free TASS web scraper designed to efficiently aggregate news articles from [tass.com](https://tass.com). This data extraction tool enables journalists, researchers, and organizations to gather large datasets of Russian state media content for analysing propaganda patterns and narratives.
- Specify the number of articles to scrape
- Choose from multiple news categories
- Built-in top-10-words analysis
- Export to both JSON and CSV formats
- Configurable concurrent workers for faster scraping
- Automatic retry mechanism for failed requests
- Adjustable parameters for performance tuning and personal customization
- No Python required: download the compiled version and run it
- Controlled entirely from your computer's terminal / command prompt
On Windows:

- Download `tass_scraper` from the releases folder
- Open Command Prompt or PowerShell
- Navigate to the download location of `tass_scraper`
- Run the executable using the instructions below
On macOS:

- Download `tass_scraper` from the releases folder
- Open Terminal
- Navigate to the download location of `tass_scraper`
- Make the file executable. The `chmod +x` command grants the execute permission needed to run the program: `chmod +x tass_scraper`
- Run the executable using the instructions below
- Clone this repository or copy the code from the source file
- Install Python 3.8 or higher
- Install the required libraries for the scraper: `pip install requests beautifulsoup4 lxml`
- Install PyInstaller: `pip install pyinstaller`
- Create an executable file: `pyinstaller --name tass_scraper --onefile --console --clean tass_scraper.py`

Alternatively, you can simply run it as a Python script in your environment.
- You must have Python installed (3.8 or higher)
- You must install the required `requests`, `beautifulsoup4`, and `lxml` libraries
- For ease of use, run the `.py` file from your terminal, for example:

```
python3 tass_scraper.py --headlines 100 --categories politics
```
First, navigate to the folder where `tass_scraper` is located on your computer, for example:

On Windows:

```
cd C:/Users/my-user/projects/tass_news
```

On macOS:

```
cd /Users/my-user/projects/tass_news
```
Basic usage with default settings (scrapes 20 news articles from each category):

```
./tass_scraper
```

To see the help page, run `./tass_scraper -h`.
> **Note:** The scraper writes its output to the directory from which you run it. For example, if you run the scraper from `C:/Users/my-user`, the `news_data` folder will be created there.
Example with custom parameters:

```
./tass_scraper --headlines 50 --categories world politics defense --workers 5 --csv --top-words
```
> **Caution:** Use a reasonable value for `--workers`. The default is `2` workers, which is enough for most tasks. A good rule of thumb is to set the number of workers to the number of CPU cores for optimal performance. Too many workers can overload your system and reduce efficiency. Additionally, TASS may block your IP address if you make too many requests per minute.
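If you are running the Python source and want to see how the worker pattern fits together, here is a minimal sketch using `ThreadPoolExecutor`; the `fetch` helper and the second URL are illustrative assumptions, not the scraper's actual internals (the first URL comes from the sample output below).

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    """Fetch one article page; each worker handles one URL at a time."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

# Hypothetical article URLs following the link format in the sample output.
urls = [
    "https://tass.com/politics/1892567",
    "https://tass.com/world/1892568",
]

# Rule of thumb from the caution above: at most one worker per CPU core.
workers = min(os.cpu_count() or 2, len(urls))

with ThreadPoolExecutor(max_workers=workers) as pool:
    pages = list(pool.map(fetch, urls))
```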
You can use as many parameters as you need.
If you want to specify a custom folder where all the data is saved, use the `--output-dir` flag, for instance:

```
./tass_scraper --output-dir /Users/my-user/Documents/my-custom-folder
```
| Parameter | Default | Description |
|---|---|---|
| `-h` | - | View the help page |
| `--headlines` | `20` | Number of headlines to scrape per category |
| `--categories` | `all` | Categories to scrape (see available categories below) |
| `--csv` | `false` | Save output in CSV format instead of JSON |
| `--workers` | `2` | Maximum number of concurrent workers |
| `--output-dir` | `./news_data` | Output directory for scraped data |
| `--top-words` | `false` | Enable top 10 words analysis |
| `--min-delay` | `0.2` | Minimum delay between requests in seconds |
| `--max-delay` | `1.0` | Maximum delay between requests in seconds |
| `--max-retries` | `3` | Maximum number of retry attempts |
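For intuition, here is a minimal sketch of how `--min-delay`, `--max-delay`, and `--max-retries` plausibly work together: a random pause before each request plus a bounded retry loop. The function name and structure are assumptions for illustration, not the scraper's actual code.

```python
import random
import time

import requests

MIN_DELAY, MAX_DELAY, MAX_RETRIES = 0.2, 1.0, 3  # the documented defaults

def polite_get(url):
    """Fetch a URL with a randomized delay and a bounded retry loop."""
    for attempt in range(1, MAX_RETRIES + 1):
        # Sleep a random interval so requests are not sent in a burst.
        time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == MAX_RETRIES:
                raise  # give up after the last failed attempt
```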
- `politics`: Russian Politics & Diplomacy
- `world`: World
- `economy`: Business & Economy
- `defense`: Military & Defense
- `science`: Science & Space
- `emergencies`: Emergencies
- `society`: Society & Culture
- `pressreview`: Press Review
- `sports`: Sports
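For example, to scrape 30 headlines each from only the business and science sections:

```
./tass_scraper --headlines 30 --categories economy science
```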
The scraper creates a directory named `news_data` (or your specified output directory) containing:

- One file per category (JSON or CSV) with scraped articles (see both examples here)
- A `logs` subdirectory with detailed execution logs
See an example JSON file with 20 headlines and analysed top words:
```json
[
  {
    "...": "..."
  },
  {
    "title": "Lavrov to hold online news conference on December 26 — spokeswoman",
    "description": "Maria Zakharova underlined that there is a lot of topics",
    "date": "2024-12-24 15:00:33",
    "link": "https://tass.com/politics/1892567",
    "content": [
      "Maria Zakharova underlined that there is a lot of topics",
      "MOSCOW, December 24. /TASS/. Russian Foreign Minister Sergey Lavrov will hold an online news conference for foreign journalists on December 26, Russian Foreign Ministry Spokeswoman Maria Zakharova said.",
      "\"The day after tomorrow (December 26 - TASS), the top Russian diplomat will speak with foreign correspondents,\" she told the Rossiya-24 television channel.",
      "\"It is going to be hot because there is a lot of topics. He will outline the conclusions on some aspects of the international situation,\" she said, adding that Lavrov’s plans for December 25 also include an interview with the 60 Minutes program on the Rossiya-1 television channel."
    ]
  },
  {
    "...": "..."
  }
]
```
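To work with the JSON output in Python, you can load a category file directly. The filename below is an assumption about how the scraper names its per-category files; check your `news_data` folder for the actual names.

```python
import json

# Assumed filename: one JSON file per category inside news_data/.
with open("news_data/politics.json", encoding="utf-8") as f:
    articles = json.load(f)

for article in articles:
    print(article["date"], article["title"])
    # "content" is a list of paragraphs; join it for full-text analysis.
    full_text = " ".join(article["content"])
```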
See an example CSV file with 20 headlines and analysed top words.

Each article is flattened into a single row with the following columns:

- `title`
- `description`
- `date`
- `link`
- `content`
- `top_word_1` to `top_word_10` (if enabled)
- `top_word_1_count` to `top_word_10_count` (if enabled)
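The top-words analysis is easy to reproduce on your own data. Here is a minimal sketch using `collections.Counter`; the tokenization and the tiny stop-word list are assumptions for illustration, not necessarily what the scraper's built-in algorithm does.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; the scraper's actual list may differ.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "that", "is"}

def top_words(text, n=10):
    """Return the n most common non-stop-words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Example: feed it the joined "content" paragraphs of one article.
print(top_words("Lavrov will hold an online news conference, Lavrov said."))
```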
This tool is here to help step up the fight against Russian propaganda by making it easier than ever to rapidly gather large-scale propaganda data for analysis.

Feel free to grab the scraper code, make it yours, and plug it into your workflow.