Stack Exchange archive reader
This project is needed for convenient access to the stackexchange.com archive. Using a simple API, you can index archives and retrieve data directly. The main goal of this project is to use it to train artificial intelligence.
The main goal was to be able to read the archive stackoverflow.com. The archive weighs 20 gigabytes and approximately more than 70 million records, since all data is recorded in a bzip2 archive this allows them to be read in stream mode
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
Open env_config
with any txt editor and set you configs
Example config:
count_workers = 32
archive_folder = "data/archives"
database_folder = "data/archives"
host = "0.0.0.0"
port = "8000"
- use
/archive/list
to find all files in archive folder - use
/archive/load
or/archive/load/all
for preload archive files (It is worth understanding that large files require preliminary indexing) - use
/indexing/process
or/indexing/process/all
for index content in archive - use
/archive/get/post
or/archive/get/posts
for read posts
- make faster reader for stackoverflow.com
- remade database worker class
- full text search?
- make a variation for static operation