Merge pull request #20 from BlipRanger/dev
v1.4
BlipRanger authored May 29, 2021
2 parents f1f163c + 8496984 commit 7087dd1
Showing 29 changed files with 779 additions and 203 deletions.
8 changes: 6 additions & 2 deletions .gitignore
@@ -1,7 +1,11 @@
.vscode/settings.json
.vscode/
/venv/
.idea
html/
temp/
__pycache__/
config/
.vscode/.ropeproject/config.py
.vscode/.ropeproject/objectdb
output/
input/
config/user_configs
14 changes: 2 additions & 12 deletions Dockerfile
@@ -9,24 +9,14 @@ COPY ./bdfrtohtml/ ./bdfrtohtml
COPY ./templates/ ./templates
COPY ./start.py ./start.py
COPY ./requirements.txt ./requirements.txt
COPY ./config/config.yml ./config/config.yml
COPY ./config/default_bdfr_config.cfg ./config/default_bdfr_config.cfg

ENV BDFR_FREQ=15
ENV BDFR_IN=/input
ENV BDFR_OUT=/output
ENV BDFR_RECOVER_COMMENTS=True
ENV BDFR_ARCHIVE_CONTEXT=True
ENV BDFR_LIMIT=1100
ENV RUN_BDFR=False
ENV BDFRH_DELETE=False
ENV BDFRH_LOGLEVEL=INFO

EXPOSE 5000
EXPOSE 7634

RUN pip install -r requirements.txt

RUN mkdir input
RUN mkdir output
RUN mkdir config

CMD python start.py
74 changes: 52 additions & 22 deletions README.md
@@ -1,37 +1,67 @@
# bdfr-html
Converts the output of the [bulk downloader for reddit (V2)](https://github.com/aliparlakci/bulk-downloader-for-reddit) to a set of HTML pages.
BDFR-HTML is a companion script that turns the output of the incredibly useful [bulk downloader for reddit](https://github.com/aliparlakci/bulk-downloader-for-reddit) into a set of HTML pages, with an index, that can be easily viewed in a browser. It also provides a number of other handy tools, such as the ability to grab the context for saved comments or pull deleted posts from Pushshift. The HTML pages are rendered with jinja2 templates and can be easily modified to suit your needs. The script currently requires that you run both the archive and the download portions of the BDfR bulk downloader and that the names of the downloaded files contain the post id (this is the default). All of this can be automated using the included start.py script or the Docker container.

Currently requires that you run both the archive and the download portions of the BDfR bulk downloader script and that the names of the downloaded files contain the post id (this is default).
Currently only supports the json version of the archive output from BDfR V2.
## Table of Contents
- [bdfr-html](#bdfr-html)
- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Usage](#usage)
- [Docker](#docker)
- [Contributing](#contributing)
- [Planned Features](#planned-features)

**Usage**
## Installation

`python -m bdfrtohtml --input ./location/of/archive/and/downloads --output /../html/`
You can simply clone this repo and run the script from that folder, or you can install the package with setuptools using the following command:
`python setup.py install`
(This is still a work in progress.)
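
For reference, a typical from-source setup might look something like this sketch (it assumes Python 3 and pip are already available on your system):

```
git clone https://github.com/BlipRanger/bdfr-html.git
cd bdfr-html
pip install -r requirements.txt
python setup.py install
```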

Use `python -m bdfrtohtml --help` for a list of options
## Usage

**Docker-Compose**
To run the script with defaults:
`python -m bdfrtohtml` (by default it looks in the `input` folder and writes results to the `output` folder)

For ease of use for both bdfr and bdfr-html in an automated fashion, I have included a docker-compose file which will spin up both an automation container and a web server container. The automation container will run bdfr and then subsequently bdfr-html, producing a volume or mounted folder containing the generated html files. The web server container shares the output volume and hosts the generated files. Currently this is tasked to only save "Saved" user content, however this might be changed in the future. If you would prefer to populate bdfr-html with your own reddit json/media files from bdfr, you can use a similar docker-compose file, but mount the folder where you have saved your content to the `BDFR_IN` folder (/input by default) and set the env variable `RUN_BDFR` to false (default).
`python -m bdfrtohtml --input_folder ./location/of/archivedFiles --output_folder /../html/`

**Options**
```
--input_folder TEXT The folder where the download and archive results have been saved to.
--output_folder TEXT Folder where the HTML results should be created.
--recover_comments BOOLEAN Should we attempt to recover deleted comments?
--recover_posts BOOLEAN Should we attempt to recover deleted posts?
--generate_thumbnails BOOLEAN Generate thumbnails for video posts? (deprecated by index_mode)
--archive_context BOOLEAN Should we attempt to archive the contextual post for saved comments?
--delete_media BOOLEAN Should we delete the input media after creating the output?
--index_mode [default|lightweight|oldreddit] What type of templated index page should be generated?
--write_links_to_file [None|Webpages|All] Should we write the links from posts to a text file for external consumption?
--config FILENAME Read in a config file
--help Show this message and exit.
```
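
As an illustration, a run that recovers deleted content, builds the lightweight index, and writes out webpage links might look like this (the paths are placeholders, not defaults taken from the repo):

```
python -m bdfrtohtml --input_folder ./input --output_folder ./output \
    --recover_comments True --recover_posts True --archive_context True \
    --index_mode lightweight --write_links_to_file Webpages
```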

**start.py**
The start.py script is what currently powers the Docker container's automation, stepping through bdfr and then bdfr-html in sequence at timed intervals.
Instead of running bdfrtohtml alone, you can run `python start.py` in a cloned copy of this repo to start the automated process.
The configuration for both bdfrtohtml and the start.py script itself can be found in the [config folder](https://github.com/BlipRanger/bdfr-html/tree/main/bdfrtohtml/config).
The script also includes multi-user support, which is configured in the same config file.
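
The exact layout of that config file is defined in the repo, not here; as a rough sketch it appears to be a YAML file (the Dockerfile ships a `config/config.yml`) whose keys mirror the command-line options listed above. The key names and values below are illustrative assumptions, so check the config folder for the real format:

```
# Illustrative sketch only - see the files in the config folder for the actual layout.
input_folder: ./input
output_folder: ./output
recover_comments: true
recover_posts: false
archive_context: true
delete_media: false
index_mode: default
write_links_to_file: "None"
```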

## Docker

To make it easy to run both bdfr and bdfr-html in an automated fashion, a docker-compose file is included which spins up both an automation container and a web server container. The automation container runs bdfr and then bdfr-html, producing a volume or mounted folder containing the generated HTML files. The web server container shares the output volume and hosts the generated files. Currently this setup only saves "Saved" user content, although this may change in the future. If you would prefer to populate bdfr-html with your own reddit json/media files from bdfr, you can use a similar docker-compose file: mount the folder where you have saved your content to the input folder (bdfrtohtml/input by default) and set the config variable `RUN_BDFR` to false.

Since BDFR 2.1.1 you should be able to complete the OAuth flow properly within the docker container; the port needed for validation is exposed in the docker-compose file.
If you are running the docker container on a different machine, replace `localhost` in the returned OAuth url with the address of the docker host.

To run the compose file, simply clone this repo and run `docker-compose up`.
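
The docker-compose.yml shipped with this repo is the authoritative version; conceptually it wires things together roughly like the sketch below. The service names, the web server image, and the host ports here are made-up placeholders, while 5000 and 7634 are the ports the Dockerfile exposes:

```
# Illustrative sketch only - use the docker-compose.yml shipped with this repo.
version: "3"
services:
  bdfr-html:
    build: .
    ports:
      - "7634:7634"   # presumably the OAuth validation callback
    volumes:
      - ./config:/config
      - html_output:/output
  webserver:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - html_output:/usr/share/nginx/html:ro
volumes:
  html_output:
```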

**Additional Features**
The config file of the docker container can be mounted and modified just like the one mentioned above for the start.py script.

## Contributing
I am open to any and all help or criticism on this project! Please feel free to create issues as you encounter them and I'll work to get them fixed. I have a set idea of the scope of this project, but I am always open to new feature suggestions or improvements to my code. Also, if you have code you'd like to contribute, just open a PR and I'll take a look!

- Use the --archive_context option to pull the related contextual post for downloaded comments.
- Use the --recover_comments and --recover_posts options to have the script attempt to pull deleted comments and posts from Pushshift.
- The script now actively avoids reprocessing inputs by storing a list of processed ids in the output folder.
- Produces an ID file of processed posts which can be fed to bdfr to avoid re-downloading content.
- Templated HTML using jinja2 which can be easily modified to suit your needs.
- Use the --write_links_to_file option to write a list of URLs for webpages and/or media referenced by posts to a file for use in other processes
- Posts are sorted in chronological order

**Planned Features**
## Planned Features

- Improved HTML templates
- Config file instead of just arguments
- Better docs
- Additional possible docker-compose configs
- Better documentation, including a lessons learned page
- The ability to output more data/metrics
- Docker support for automatically archiving subreddits/users
- PyPi + Dockerhub package support
2 changes: 1 addition & 1 deletion bdfrtohtml/__init__.py
@@ -1,4 +1,4 @@
# __init__.py
__author__ = "BlipRanger"
__version__ = "1.3.0"
__version__ = "1.3.1"
__license__ = "GNU GPLv3"
66 changes: 47 additions & 19 deletions bdfrtohtml/__main__.py
@@ -9,24 +9,40 @@
import shutil
from bdfrtohtml import filehelper
from bdfrtohtml import posthelper
from bdfrtohtml import util
import logging
import copy

logger = logging.getLogger(__name__)


@click.command()
@click.option('--input', default='./', help='The folder where the download and archive results have been saved to')
@click.option('--output', default='./html/', help='Folder where the HTML results should be created.')
@click.option('--recover_comments', default=False, type=bool, help='Should we attempt to recover deleted comments?')
@click.option('--recover_posts', default=False, type=bool, help='Should we attempt to recover deleted posts?')
@click.option('--archive_context', default=False, type=bool,
@click.option('--input_folder', default=None, help='The folder where the download and archive results have been saved to')
@click.option('--output_folder', default=None, help='Folder where the HTML results should be created.')
@click.option('--recover_comments', type=bool, help='Should we attempt to recover deleted comments?')
@click.option('--recover_posts', type=bool, help='Should we attempt to recover deleted posts?')
@click.option('--generate_thumbnails', type=bool, help='Generate thumbnails for video posts?')
@click.option('--archive_context', type=bool,
help='Should we attempt to archive the contextual post for saved comments?')
@click.option('--delete_input', default=False, type=bool, help='Should we delete the input after creating the output?')
def main(input, output, recover_comments, recover_posts, archive_context, delete_input):
output = filehelper.assure_path_exists(output)
input = filehelper.assure_path_exists(input)
@click.option('--delete_media', type=bool, help='Should we delete the input media after creating the output?')
@click.option('--index_mode', type=click.Choice(['default', 'lightweight', 'oldreddit']), help='Generate an index with no playing media and shrunk images?')
@click.option('--write_links_to_file', type=click.Choice(['None', 'Webpages', 'All'], case_sensitive=False),
help='Should we write the links from posts to a text file for external consumption?')
@click.option('--config', type=click.File('r'), help='Read in a config file')
@click.pass_context
def main(context: click.Context, **_):

if context.params.get('config'):
config = util.load_config(context.params.get('config'))
else:
config = util.generate_default_config()
config = util.process_click_arguments(config, context)
logging.debug(config)

output = filehelper.assure_path_exists(config['output_folder'])
input = filehelper.assure_path_exists(config['input_folder'])
filehelper.assure_path_exists(os.path.join(output, "media/"))


# Load all of the json files
all_posts = filehelper.import_posts(input)
@@ -37,33 +53,45 @@ def main(input, output, recover_comments, recover_posts, archive_context, delete
post = copy.deepcopy(entry)
try:
post = posthelper.handle_comments(post)
if recover_comments:
if config['recover_comments']:
post = posthelper.recover_deleted_comments(post)
if archive_context:
if config['archive_context']:
post = posthelper.get_comment_context(post, input)
if recover_posts:
if config['recover_posts']:
post = posthelper.recover_deleted_posts(post)

post = posthelper.get_sub_from_post(post)
filehelper.find_matching_media(post, input, output)

if config['generate_thumbnails']:
filehelper.generate_thumbnail(post, output)
if config['index_mode'] != "default":
filehelper.generate_light_content(post, output)

filehelper.write_post_to_file(post, output)
posts_to_write.append(post)
except Exception as e:
logging.error("Processing post " + post["id"] + " has failed due to: " + str(e))
logging.error(f"Processing post {post['id']} has failed due to: {str(e)}")

posts_to_write = sorted(posts_to_write, key=lambda d: d['created_utc'], reverse=True)
filehelper.write_index_file(posts_to_write, output)

filehelper.write_index_file(posts_to_write, output, config['index_mode'])
filehelper.write_list_file(posts_to_write, output)
shutil.copyfile('./templates/style.css', os.path.join(output, 'style.css'))
filehelper.populate_css_file(output)


if archive_context:
if config['write_links_to_file'] != "None":
filehelper.write_url_file(posts_to_write, output, config['write_links_to_file'])
if config['archive_context']:
filehelper.empty_input_folder(os.path.join(input, "context/"))
if delete_input:
if config['delete_media']:
filehelper.empty_input_folder(input)

logging.info("BDFRToHTML run complete.")

logging.info("BDFR-HTML run complete.")


if __name__ == '__main__':
logging.basicConfig(level=logging.DEBUG)
LOGLEVEL = os.environ.get('BDFRH_LOGLEVEL', 'INFO').upper()
logging.basicConfig(level=LOGLEVEL)
main()
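
The `util` helpers called above are not part of this diff, so as a rough mental model only (the function bodies and the YAML assumption below are mine, not the repo's actual code), the config handling in `main()` boils down to something like:

```
# Sketch of the config precedence implied by main() above - not the real bdfrtohtml.util module.
import yaml


def generate_default_config():
    # Defaults that mirror the documented option defaults.
    return {
        'input_folder': './input',
        'output_folder': './output',
        'recover_comments': False,
        'recover_posts': False,
        'generate_thumbnails': False,
        'archive_context': False,
        'delete_media': False,
        'index_mode': 'default',
        'write_links_to_file': 'None',
    }


def load_config(config_file):
    # Start from the defaults, then overlay whatever the config file provides.
    config = generate_default_config()
    config.update(yaml.safe_load(config_file) or {})
    return config


def process_click_arguments(config, context):
    # Options given explicitly on the command line win over the config file.
    for key, value in context.params.items():
        if key != 'config' and value is not None:
            config[key] = value
    return config
```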