ForumMediaScraper package

A simple Python 3.6 application that scrapes media content from the 9Gag hot forum and stores it in a local database. It is part of a larger system for detecting reposts on the 9Gag forum.

Getting started

The scraper needs three things to run. A geckodriver executable must be on PATH or at the location given by WEBDRIVER_EXECUTABLE_PATH in the config used to create the scraper instance. A Firefox executable is required for Selenium. Finally, a Mongo database is needed to store the scraped data; its connection settings can be set in the config.
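The requirements above can be checked before the scraper is started. The following is a minimal sketch using only the Python standard library; `find_executable` and `mongo_is_reachable` are hypothetical helper names that are not part of the ForumMediaScraper package, and `localhost` is an assumed host.

```python
import shutil
import socket


def find_executable(name, configured_path=None):
    """Return a usable path for an executable, or None if none is found.

    The configured path (if any) is tried first, then PATH is searched.
    """
    if configured_path:
        found = shutil.which(configured_path)
        if found:
            return found
    return shutil.which(name)


def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    This only shows that something is listening on the port; it does not
    check credentials or that the listener is really a Mongo server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("geckodriver:", find_executable("geckodriver", "./geckodriver"))
    print("firefox:", find_executable("firefox"))
    print("mongo reachable:", mongo_is_reachable())
```

A `None` result for geckodriver or Firefox means the matching path option in the config has to point at the executable explicitly.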

Configuration options:

Option name	Data type	Description	Default
MONGO_INITDB_ROOT_USERNAME	str	Username used to connect to the Mongo database server	None
MONGO_INITDB_ROOT_PASSWORD	str	Password used to connect to the Mongo database server	None
MONGO_INITDB_HOST	str	Hostname of the Mongo database server (e.g. localhost)	None
MONGO_INITDB_PORT	int	Port the Mongo database server is listening on	27017
SCRAPER_FORUM_NAME	str	Name of the forum to scrape	9gag
SCRAPER_MAX_SCROLL_SECONDS	int	Number of seconds the scraper keeps scrolling through the 9gag hot page	60
SCRAPER_CREATE_SERVICE_LOG	bool	If True, the scraper writes its logs to a file instead of stdout	False
SCRAPER_HEADLESS_MODE	bool	Sets MOZ_HEADLESS; if True, Firefox runs in headless mode	True
WEBDRIVER_LOGDIR	str	Path the geckodriver logs are written to	./log/geckodriver.log
WEBDRIVER_EXECUTABLE_PATH	str	Path to the geckodriver executable	./geckodriver
WEBDRIVER_BROWSER_EXECUTABLE_PATH	str	Path to the Firefox executable	None
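When the scraper runs in a container, it can be convenient to take these options from environment variables. The sketch below shows one way to do that; it assumes the variables carry the same names as the options, and `options_from_env` is a hypothetical helper that is not part of the package. The option names and data types are copied from the table above. Options that are not set are left out, so that the scraper's own defaults apply.

```python
import os

# Option names and data types, copied from the options table.
OPTION_TYPES = {
    "MONGO_INITDB_ROOT_USERNAME": str,
    "MONGO_INITDB_ROOT_PASSWORD": str,
    "MONGO_INITDB_HOST": str,
    "MONGO_INITDB_PORT": int,
    "SCRAPER_FORUM_NAME": str,
    "SCRAPER_MAX_SCROLL_SECONDS": int,
    "SCRAPER_CREATE_SERVICE_LOG": bool,
    "SCRAPER_HEADLESS_MODE": bool,
    "WEBDRIVER_LOGDIR": str,
    "WEBDRIVER_EXECUTABLE_PATH": str,
    "WEBDRIVER_BROWSER_EXECUTABLE_PATH": str,
}

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def _to_bool(raw):
    """Convert a textual flag such as 'true' or '0' to a bool."""
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"not a boolean value: {raw!r}")


def options_from_env(environ=None):
    """Return a dict with every known option that is set in the environment."""
    environ = os.environ if environ is None else environ
    options = {}
    for name, kind in OPTION_TYPES.items():
        raw = environ.get(name)
        if raw is None:
            continue  # not set: leave it out so the default applies
        options[name] = _to_bool(raw) if kind is bool else kind(raw)
    return options


if __name__ == "__main__":
    # Print only the option names, never the credential values.
    print(sorted(options_from_env()))
```

The resulting dict has the same shape as the one passed to ScraperConfig in the Setup section.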

Setup

Use git clone to download the repository:

$ git clone https://github.com/jesseVDwolf/ForumMediaScraper.git

Install the package using pip. Make sure you're in the same directory as the setup.py file:

$ pip install .

Start using the ForumMediaScraper:

from ForumMediaScraper.Scraper import ScraperConfig, SeleniumScraper

config = ScraperConfig({
    'WEBDRIVER_EXECUTABLE_PATH': './drivers/geckodriver-win.exe',
    'MONGO_INITDB_ROOT_USERNAME': 'admin',
    'MONGO_INITDB_ROOT_PASSWORD': 'password123',
    'SCRAPER_CREATE_SERVICE_LOG': True,
    'SCRAPER_HEADLESS_MODE': False,
    'SCRAPER_MAX_SCROLL_SECONDS': 40,
    'WEBDRIVER_BROWSER_EXECUTABLE_PATH': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
})
scraper = SeleniumScraper(config)
scraper.run()
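If the scraper cannot reach the database, it can help to test the same Mongo settings with a separate client. The sketch below builds a standard MongoDB connection string from the config options. `mongo_uri` is a hypothetical helper, the fallback host `localhost` is an assumption, and this is not necessarily how the scraper itself connects.

```python
from urllib.parse import quote_plus


def mongo_uri(options):
    """Build a mongodb:// connection string from scraper config options."""
    host = options.get("MONGO_INITDB_HOST") or "localhost"  # assumed fallback
    port = options.get("MONGO_INITDB_PORT") or 27017
    user = options.get("MONGO_INITDB_ROOT_USERNAME")
    password = options.get("MONGO_INITDB_ROOT_PASSWORD")
    if user is None or password is None:
        return f"mongodb://{host}:{port}/"
    # Reserved characters such as '@' and ':' in the credentials
    # must be percent-encoded inside a connection string.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


if __name__ == "__main__":
    print(mongo_uri({"MONGO_INITDB_HOST": "localhost"}))
```

The returned string can be passed to any Mongo client that accepts a connection URI.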
