StackExAR

Stack Exchange archive reader

This project provides convenient access to the stackexchange.com archive. Through a simple API, you can index archives and retrieve data directly. Its main intended use is preparing data for training artificial intelligence models.

The initial target was reading the stackoverflow.com archive. That archive is about 20 gigabytes in size and holds more than 70 million records. Because all data is stored in a bzip2 archive, it can be read in streaming mode.
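The streaming idea can be illustrated with a minimal sketch using only the Python standard library. It is not the reader used by this project. The function name `iter_rows` is hypothetical, and the sketch assumes the archive is a bzip2-compressed XML file whose records are `<row ... />` elements, as in the public Stack Exchange data dumps.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def iter_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield the attributes of each <row .../> element, one record at a time.

    Hypothetical helper: assumes a bzip2-compressed XML file with <row> records.
    """
    # bz2.open decompresses lazily, so the whole archive is never held in memory.
    with bz2.open(path, "rb") as stream:
        # iterparse reads the stream incrementally instead of building the full tree.
        context = ET.iterparse(stream, events=("start", "end"))
        _event, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                yield dict(elem.attrib)
                # Drop rows that were already processed so memory use stays flat.
                root.clear()
```

Because the function is a generator, a caller can stop early (for example after the first thousand records) without decompressing the rest of the file.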

Installation

python3 -m venv venv
source venv/bin/activate 
pip3 install -r requirements.txt

Open env_config in any text editor and set your configuration values.

Example config:

count_workers = 32
archive_folder = "data/archives"
database_folder = "data/archives"
host = "0.0.0.0"
port = "8000"
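To make the value types in the example explicit, here is a small sketch that parses `key = value` lines of this shape. How the project itself loads env_config is not shown in this README, so `parse_config` is a hypothetical helper for illustration only. Note that `port` is written as a quoted string in the example, while `count_workers` is an integer.

```python
import ast
from typing import Any, Dict


def parse_config(text: str) -> Dict[str, Any]:
    """Parse simple `key = value` lines into a dictionary (hypothetical helper)."""
    config: Dict[str, Any] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _sep, raw_value = line.partition("=")
        # literal_eval safely turns `32` into an int and `"8000"` into a str.
        config[key.strip()] = ast.literal_eval(raw_value.strip())
    return config


EXAMPLE = """
count_workers = 32
archive_folder = "data/archives"
database_folder = "data/archives"
host = "0.0.0.0"
port = "8000"
"""

config = parse_config(EXAMPLE)
print(config["count_workers"], config["host"], int(config["port"]))
```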

Usage

Use /archive/list to list all files in the archive folder.
Use /archive/load or /archive/load/all to preload archive files (note that large files require preliminary indexing).
Use /indexing/process or /indexing/process/all to index the content of an archive.
Use /archive/get/post or /archive/get/posts to read posts.
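The endpoints above could be called from Python roughly as follows. This is a sketch under assumptions: the README does not state the HTTP method or any request parameters, so `GET` with no parameters is a guess, and `build_url` and `call` are hypothetical helpers. The default port matches the example config. A host of 0.0.0.0 is a bind address, so the client connects to 127.0.0.1 instead.

```python
import json
import urllib.request
from typing import Any


def build_url(path: str, host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the full URL for an endpoint path such as /archive/list."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def call(path: str, method: str = "GET", host: str = "127.0.0.1", port: int = 8000) -> Any:
    """Send a request to the running service and return the decoded response.

    Assumption: the HTTP method is GET and no parameters are needed; check the
    service itself for the real method and parameters of each endpoint.
    """
    request = urllib.request.Request(build_url(path, host, port), method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)  # return parsed JSON when the body is JSON
    except json.JSONDecodeError:
        return body  # otherwise return the raw text
```

With the service running, `call("/archive/list")` would request the list of archive files. The other endpoints can be tried the same way once their expected method and parameters are known.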

TODO

Make a faster reader for stackoverflow.com
Rework the database worker class
Full-text search?
Add a variant for static operation


Repository: Taruu/StackExAR
