A simple Python 3.6 application that scrapes media content from the 9Gag hot page and stores it in a local database. It is part of a larger system for detecting reposts on 9Gag.
The scraper needs three things to run. First, a geckodriver executable, either on PATH or at the location specified in the config used to create the scraper instance. Second, a Firefox binary (used by Selenium). Third, a MongoDB database to save its data to. The MongoDB connection settings can be set in the config.
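Before starting the scraper it can help to confirm these prerequisites are in place. The following is a minimal sketch using only the standard library, not part of ForumMediaScraper itself; the helper names `find_executable` and `port_is_open` are hypothetical, and the host `localhost` in the usage lines is an assumption.

```python
import os
import shutil
import socket


def find_executable(name, explicit_path=None):
    """Return a usable path for an executable, or None if it cannot be found."""
    if explicit_path:
        # An explicitly configured path takes precedence over a PATH lookup.
        if os.path.isfile(explicit_path) and os.access(explicit_path, os.X_OK):
            return explicit_path
        return None
    # Fall back to searching the directories listed in PATH.
    return shutil.which(name)


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("geckodriver:", find_executable("geckodriver") or "not found")
    print("firefox:", find_executable("firefox") or "not found")
    # Assumption: MongoDB runs locally on its default port.
    print("mongo reachable:", port_is_open("localhost", 27017))
```

An open port only shows that something is listening there; it does not verify credentials or that the listener is actually MongoDB.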
The config accepts the following options:
Option name | Data type | Description | Default |
---|---|---|---|
MONGO_INITDB_ROOT_USERNAME | str | Username used to connect to the mongo database server | None |
MONGO_INITDB_ROOT_PASSWORD | str | Password used to connect to the mongo database server | None |
MONGO_INITDB_HOST | str | Hostname of the mongo database server (e.g. localhost) | None |
MONGO_INITDB_PORT | int | Port the mongo database server is listening on | 27017 |
SCRAPER_FORUM_NAME | str | Name of forum to scrape | 9gag |
SCRAPER_MAX_SCROLL_SECONDS | int | How many seconds the scraper keeps scrolling the 9Gag hot page | 60 |
SCRAPER_CREATE_SERVICE_LOG | bool | If True, the scraper writes its logs to a file instead of stdout | False |
SCRAPER_HEADLESS_MODE | bool | Sets MOZ_HEADLESS; if True, Firefox runs headless | True |
WEBDRIVER_LOGDIR | str | Path to write geckodriver logs to | ./log/geckodriver.log |
WEBDRIVER_EXECUTABLE_PATH | str | Path to the geckodriver executable | ./geckodriver |
WEBDRIVER_BROWSER_EXECUTABLE_PATH | str | Path to the Firefox executable | None |
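To illustrate how the defaults in the table combine with user-supplied values, here is a small sketch. It is not the package's own implementation: `DEFAULTS`, `resolve_options` and `mongo_uri` are hypothetical names, the fallback to `localhost` when no host is set is an assumption, and the URI layout follows the standard `mongodb://user:password@host:port/` form with credentials percent-encoded.

```python
from urllib.parse import quote_plus

# Defaults copied from the option table above.
DEFAULTS = {
    "MONGO_INITDB_ROOT_USERNAME": None,
    "MONGO_INITDB_ROOT_PASSWORD": None,
    "MONGO_INITDB_HOST": None,
    "MONGO_INITDB_PORT": 27017,
    "SCRAPER_FORUM_NAME": "9gag",
    "SCRAPER_MAX_SCROLL_SECONDS": 60,
    "SCRAPER_CREATE_SERVICE_LOG": False,
    "SCRAPER_HEADLESS_MODE": True,
    "WEBDRIVER_LOGDIR": "./log/geckodriver.log",
    "WEBDRIVER_EXECUTABLE_PATH": "./geckodriver",
    "WEBDRIVER_BROWSER_EXECUTABLE_PATH": None,
}


def resolve_options(overrides):
    """Return a new dict in which user overrides replace the defaults."""
    options = dict(DEFAULTS)
    options.update(overrides)
    return options


def mongo_uri(options):
    """Build a MongoDB connection URI from the resolved options."""
    # Assumption: fall back to localhost when no host is configured.
    host = options["MONGO_INITDB_HOST"] or "localhost"
    port = options["MONGO_INITDB_PORT"]
    user = options["MONGO_INITDB_ROOT_USERNAME"]
    password = options["MONGO_INITDB_ROOT_PASSWORD"]
    if user and password:
        # Percent-encode credentials so characters like '@' or '/' stay valid.
        return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"
    return f"mongodb://{host}:{port}/"


if __name__ == "__main__":
    opts = resolve_options({
        "MONGO_INITDB_ROOT_USERNAME": "admin",
        "MONGO_INITDB_ROOT_PASSWORD": "p@ss/word",
    })
    print(mongo_uri(opts))  # mongodb://admin:p%40ss%2Fword@localhost:27017/
```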
Use git clone to download the repository:
$ git clone https://github.com/jesseVDwolf/ForumMediaScraper.git
Install the package using pip. Make sure you're in the same directory as the setup.py file:
$ pip install .
Start using the ForumMediaScraper:
from ForumMediaScraper.Scraper import ScraperConfig, SeleniumScraper
config = ScraperConfig({
    'WEBDRIVER_EXECUTABLE_PATH': './drivers/geckodriver-win.exe',
    'MONGO_INITDB_ROOT_USERNAME': 'admin',
    'MONGO_INITDB_ROOT_PASSWORD': 'password123',
    'SCRAPER_CREATE_LOGFILE': True,
    'SCRAPER_HEADLESS_MODE': False,
    'SCRAPER_MAX_SCROLL_SECONDS': 40,
    'WEBDRIVER_BROWSER_EXECUTABLE_PATH': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
})
scraper = SeleniumScraper(config)
scraper.run()
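Hard-coding credentials as in the example above is fine for a quick test, but the options can also be collected from environment variables. The sketch below is one possible approach, not something the package is known to provide: `OPTION_TYPES`, `_to_bool` and `options_from_env` are hypothetical names, the types are taken from the option table, and the set of strings treated as true is an assumption.

```python
import os

# Option names and data types as listed in the option table.
OPTION_TYPES = {
    "MONGO_INITDB_ROOT_USERNAME": str,
    "MONGO_INITDB_ROOT_PASSWORD": str,
    "MONGO_INITDB_HOST": str,
    "MONGO_INITDB_PORT": int,
    "SCRAPER_FORUM_NAME": str,
    "SCRAPER_MAX_SCROLL_SECONDS": int,
    "SCRAPER_CREATE_SERVICE_LOG": bool,
    "SCRAPER_HEADLESS_MODE": bool,
    "WEBDRIVER_LOGDIR": str,
    "WEBDRIVER_EXECUTABLE_PATH": str,
    "WEBDRIVER_BROWSER_EXECUTABLE_PATH": str,
}


def _to_bool(value):
    """Interpret common textual spellings of true; everything else is False."""
    return value.strip().lower() in ("1", "true", "yes", "on")


def options_from_env(environ=None):
    """Collect known options from the environment, converted to their types."""
    environ = os.environ if environ is None else environ
    options = {}
    for name, kind in OPTION_TYPES.items():
        if name not in environ:
            continue  # unset options are left out so defaults can apply
        raw = environ[name]
        options[name] = _to_bool(raw) if kind is bool else kind(raw)
    return options


if __name__ == "__main__":
    print(options_from_env())
```

Since `ScraperConfig` takes a plain dict in the example above, the result of `options_from_env()` could presumably be passed to it directly.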