Merge pull request #20 from BlipRanger/dev
v1.4
BlipRanger authored May 29, 2021
2 parents f1f163c + 8496984 commit 7087dd1
Showing 29 changed files with 779 additions and 203 deletions.
8 changes: 6 additions & 2 deletions .gitignore
@@ -1,7 +1,11 @@
.vscode/settings.json
.vscode/
/venv/
.idea
html/
temp/
__pycache__/
config/
.vscode/.ropeproject/config.py
.vscode/.ropeproject/objectdb
output/
input/
config/user_configs
14 changes: 2 additions & 12 deletions Dockerfile
@@ -9,24 +9,14 @@ COPY ./bdfrtohtml/ ./bdfrtohtml
COPY ./templates/ ./templates
COPY ./start.py ./start.py
COPY ./requirements.txt ./requirements.txt
COPY ./config/config.yml ./config/config.yml
COPY ./config/default_bdfr_config.cfg ./config/default_bdfr_config.cfg

ENV BDFR_FREQ=15
ENV BDFR_IN=/input
ENV BDFR_OUT=/output
ENV BDFR_RECOVER_COMMENTS=True
ENV BDFR_ARCHIVE_CONTEXT=True
ENV BDFR_LIMIT=1100
ENV RUN_BDFR=False
ENV BDFRH_DELETE=False
ENV BDFRH_LOGLEVEL=INFO

EXPOSE 5000
EXPOSE 7634

RUN pip install -r requirements.txt

RUN mkdir input
RUN mkdir output
RUN mkdir config

CMD python start.py
74 changes: 52 additions & 22 deletions README.md
@@ -1,37 +1,67 @@
# bdfr-html
Converts the output of the [bulk downloader for reddit (V2)](https://github.com/aliparlakci/bulk-downloader-for-reddit) to a set of HTML pages.
BDFR-HTML is a companion script that turns the output of the incredibly useful [bulk downloader for reddit](https://github.com/aliparlakci/bulk-downloader-for-reddit) into a set of HTML pages, with an index, that can be easily viewed in a browser. It also provides a number of other handy tools, such as the ability to grab the context for saved comments or pull deleted posts from Pushshift. The HTML pages are rendered with jinja2 templates and can be easily modified to suit your needs. The script currently requires that you run both the archive and the download portions of the BDfR bulk downloader and that the names of the downloaded files contain the post id (this is the default). All of this can be automated using the included start.py script or the Docker container.

Currently requires that you run both the archive and the download portions of the BDfR bulk downloader script and that the names of the downloaded files contain the post id (this is default).
Currently only supports the json version of the archive output from BDfR V2.
## Table of Contents
- [bdfr-html](#bdfr-html)
- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Usage](#usage)
- [Docker](#docker)
- [Contributing](#contributing)
- [Planned Features](#planned-features)

**Usage**
## Installation

`python -m bdfrtohtml --input ./location/of/archive/and/downloads --output /../html/`
You can simply clone this repo and run the script from that folder, or you can install the package with setuptools using the following command:
`python setup.py install`
(This is still a work in progress.)
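
For reference, a typical from-source setup might look something like this sketch (it assumes Python 3 and pip are already available on your system):

```
git clone https://github.com/BlipRanger/bdfr-html.git
cd bdfr-html
pip install -r requirements.txt
python setup.py install
```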

Use `python -m bdfrtohtml --help` for a list of options
## Usage

**Docker-Compose**
To run the script with defaults:
`python -m bdfrtohtml` (by default it looks in the `input` folder and writes results to the `output` folder)

For ease of use for both bdfr and bdfr-html in an automated fashion, I have included a docker-compose file which will spin up both an automation container and a web server container. The automation container will run bdfr and then subsequently bdfr-html, producing a volume or mounted folder containing the generated html files. The web server container shares the output volume and hosts the generated files. Currently this is tasked to only save "Saved" user content, however this might be changed in the future. If you would prefer to populate bdfr-html with your own reddit json/media files from bdfr, you can use a similar docker-compose file, but mount the folder where you have saved your content to the `BDFR_IN` folder (/input by default) and set the env variable `RUN_BDFR` to false (default).
`python -m bdfrtohtml --input_folder ./location/of/archivedFiles --output_folder /../html/`

**Options**
```
--input_folder TEXT The folder where the download and archive results have been saved to.
--output_folder TEXT Folder where the HTML results should be created.
--recover_comments BOOLEAN Should we attempt to recover deleted comments?
--recover_posts BOOLEAN Should we attempt to recover deleted posts?
--generate_thumbnails BOOLEAN Generate thumbnails for video posts? (deprecated by index_mode)
--archive_context BOOLEAN Should we attempt to archive the contextual post for saved comments?
--delete_media BOOLEAN Should we delete the input media after creating the output?
--index_mode [default|lightweight|oldreddit] What type of templated index page should be generated?
--write_links_to_file [None|Webpages|All] Should we write the links from posts to a text file for external consumption?
--config FILENAME Read in a config file
--help Show this message and exit.
```
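
As an illustration, a run that recovers deleted content, builds the lightweight index, and writes out webpage links might look like this (the paths are placeholders, not defaults taken from the repo):

```
python -m bdfrtohtml --input_folder ./input --output_folder ./output \
    --recover_comments True --recover_posts True --archive_context True \
    --index_mode lightweight --write_links_to_file Webpages
```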

**start.py**
The start.py script is what currently powers the Docker container's automation, stepping through bdfr and then bdfr-html in sequence at timed intervals.
Instead of running bdfrtohtml alone, you can run `python start.py` in a cloned copy of this repo to start the automated process.
The configuration for both bdfrtohtml and the start.py script itself can be found in the [config folder](https://github.com/BlipRanger/bdfr-html/tree/main/bdfrtohtml/config).
The script also includes multi-user support, which is configured in the same config file.
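
The exact layout of that config file is defined in the repo, not here; as a rough sketch it appears to be a YAML file (the Dockerfile ships a `config/config.yml`) whose keys mirror the command-line options listed above. The key names and values below are illustrative assumptions, so check the config folder for the real format:

```
# Illustrative sketch only - see the files in the config folder for the actual layout.
input_folder: ./input
output_folder: ./output
recover_comments: true
recover_posts: false
archive_context: true
delete_media: false
index_mode: default
write_links_to_file: "None"
```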

## Docker

To make it easy to run both bdfr and bdfr-html in an automated fashion, a docker-compose file is included which spins up both an automation container and a web server container. The automation container runs bdfr and then bdfr-html, producing a volume or mounted folder containing the generated HTML files. The web server container shares the output volume and hosts the generated files. Currently this setup only saves "Saved" user content, although this may change in the future. If you would prefer to populate bdfr-html with your own reddit json/media files from bdfr, you can use a similar docker-compose file: mount the folder where you have saved your content to the input folder (bdfrtohtml/input by default) and set the config variable `RUN_BDFR` to false.

Since BDFR 2.1.1 you should be able to complete the OAuth flow properly within the docker container; the port needed for validation is exposed in the docker-compose file.
If you are running the docker container on a different machine, replace `localhost` in the returned OAuth url with the address of the docker host.

To run the compose file, simply clone this repo and run `docker-compose up`.
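
The docker-compose.yml shipped with this repo is the authoritative version; conceptually it wires things together roughly like the sketch below. The service names, the web server image, and the host ports here are made-up placeholders, while 5000 and 7634 are the ports the Dockerfile exposes:

```
# Illustrative sketch only - use the docker-compose.yml shipped with this repo.
version: "3"
services:
  bdfr-html:
    build: .
    ports:
      - "7634:7634"   # presumably the OAuth validation callback
    volumes:
      - ./config:/config
      - html_output:/output
  webserver:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - html_output:/usr/share/nginx/html:ro
volumes:
  html_output:
```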

**Additional Features**
The config file of the docker container can be mounted and modified just like the one mentioned above for the start.py script.

## Contributing
I am open to any and all help or criticism on this project! Please feel free to create issues as you encounter them and I'll work to get them fixed. I have a set idea of the scope of this project, but I am always open to new feature suggestions or improvements to my code. Also, if you have code you'd like to contribute, just open a PR and I'll take a look!

- Use the --archive_context option to pull the related contextual post for downloaded comments.
- Use the --recover_comments and --recover_posts options to have the script attempt to pull deleted comments and posts from Pushshift.
- The script now actively avoids reprocessing inputs by storing a list of processed ids in the output folder.
- Produces an ID file of processed posts which can be fed to bdfr to avoid re-downloading content.
- Templated HTML using jinja2 which can be easily modified to suit your needs.
- Use the --write_links_to_file option to write a list of URLs for webpages and/or media referenced by posts to a file for use in other processes
- Posts are sorted in chronological order

**Planned Features**
## Planned Features

- Improved HTML templates
- Config file instead of just arguments
- Better docs
- Additional possible docker-compose configs
- Better documentation, including a lessons learned page
- The ability to output more data/metrics
- Docker support for automatically archiving subreddits/users
- PyPi + Dockerhub package support
2 changes: 1 addition & 1 deletion bdfrtohtml/__init__.py
@@ -1,4 +1,4 @@
# __init__.py
__author__ = "BlipRanger"
__version__ = "1.3.0"
__version__ = "1.3.1"
__license__ = "GNU GPLv3"
66 changes: 47 additions & 19 deletions bdfrtohtml/__main__.py
@@ -9,24 +9,40 @@
import shutil
from bdfrtohtml import filehelper
from bdfrtohtml import posthelper
from bdfrtohtml import util
import logging
import copy

logger = logging.getLogger(__name__)


@click.command()
@click.option('--input', default='./', help='The folder where the download and archive results have been saved to')
@click.option('--output', default='./html/', help='Folder where the HTML results should be created.')
@click.option('--recover_comments', default=False, type=bool, help='Should we attempt to recover deleted comments?')
@click.option('--recover_posts', default=False, type=bool, help='Should we attempt to recover deleted posts?')
@click.option('--archive_context', default=False, type=bool,
@click.option('--input_folder', default=None, help='The folder where the download and archive results have been saved to')
@click.option('--output_folder', default=None, help='Folder where the HTML results should be created.')
@click.option('--recover_comments', type=bool, help='Should we attempt to recover deleted comments?')
@click.option('--recover_posts', type=bool, help='Should we attempt to recover deleted posts?')
@click.option('--generate_thumbnails', type=bool, help='Generate thumbnails for video posts?')
@click.option('--archive_context', type=bool,
help='Should we attempt to archive the contextual post for saved comments?')
@click.option('--delete_input', default=False, type=bool, help='Should we delete the input after creating the output?')
def main(input, output, recover_comments, recover_posts, archive_context, delete_input):
output = filehelper.assure_path_exists(output)
input = filehelper.assure_path_exists(input)
@click.option('--delete_media', type=bool, help='Should we delete the input media after creating the output?')
@click.option('--index_mode', type=click.Choice(['default', 'lightweight', 'oldreddit']), help='Generate an index with no playing media and shrunk images?')
@click.option('--write_links_to_file', type=click.Choice(['None', 'Webpages', 'All'], case_sensitive=False),
help='Should we write the links from posts to a text file for external consumption?')
@click.option('--config', type=click.File('r'), help='Read in a config file')
@click.pass_context
def main(context: click.Context, **_):

if context.params.get('config'):
config = util.load_config(context.params.get('config'))
else:
config = util.generate_default_config()
config = util.process_click_arguments(config, context)
logging.debug(config)

output = filehelper.assure_path_exists(config['output_folder'])
input = filehelper.assure_path_exists(config['input_folder'])
filehelper.assure_path_exists(os.path.join(output, "media/"))


# Load all of the json files
all_posts = filehelper.import_posts(input)
@@ -37,33 +53,45 @@ def main(input, output, recover_comments, recover_posts, archive_context, delete
post = copy.deepcopy(entry)
try:
post = posthelper.handle_comments(post)
if recover_comments:
if config['recover_comments']:
post = posthelper.recover_deleted_comments(post)
if archive_context:
if config['archive_context']:
post = posthelper.get_comment_context(post, input)
if recover_posts:
if config['recover_posts']:
post = posthelper.recover_deleted_posts(post)

post = posthelper.get_sub_from_post(post)
filehelper.find_matching_media(post, input, output)

if config['generate_thumbnails']:
filehelper.generate_thumbnail(post, output)
if config['index_mode'] != "default":
filehelper.generate_light_content(post, output)

filehelper.write_post_to_file(post, output)
posts_to_write.append(post)
except Exception as e:
logging.error("Processing post " + post["id"] + " has failed due to: " + str(e))
logging.error(f"Processing post {post['id']} has failed due to: {str(e)}")

posts_to_write = sorted(posts_to_write, key=lambda d: d['created_utc'], reverse=True)
filehelper.write_index_file(posts_to_write, output)

filehelper.write_index_file(posts_to_write, output, config['index_mode'])
filehelper.write_list_file(posts_to_write, output)
shutil.copyfile('./templates/style.css', os.path.join(output, 'style.css'))
filehelper.populate_css_file(output)


if archive_context:
if config['write_links_to_file'] != "None":
filehelper.write_url_file(posts_to_write, output, config['write_links_to_file'])
if config['archive_context']:
filehelper.empty_input_folder(os.path.join(input, "context/"))
if delete_input:
if config['delete_media']:
filehelper.empty_input_folder(input)

logging.info("BDFRToHTML run complete.")

logging.info("BDFR-HTML run complete.")


if __name__ == '__main__':
logging.basicConfig(level=logging.DEBUG)
LOGLEVEL = os.environ.get('BDFRH_LOGLEVEL', 'INFO').upper()
logging.basicConfig(level=LOGLEVEL)
main()
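
The `util` helpers called above are not part of this diff, so as a rough mental model only (the function bodies and the YAML assumption below are mine, not the repo's actual code), the config handling in `main()` boils down to something like:

```
# Sketch of the config precedence implied by main() above - not the real bdfrtohtml.util module.
import yaml


def generate_default_config():
    # Defaults that mirror the documented option defaults.
    return {
        'input_folder': './input',
        'output_folder': './output',
        'recover_comments': False,
        'recover_posts': False,
        'generate_thumbnails': False,
        'archive_context': False,
        'delete_media': False,
        'index_mode': 'default',
        'write_links_to_file': 'None',
    }


def load_config(config_file):
    # Start from the defaults, then overlay whatever the config file provides.
    config = generate_default_config()
    config.update(yaml.safe_load(config_file) or {})
    return config


def process_click_arguments(config, context):
    # Options given explicitly on the command line win over the config file.
    for key, value in context.params.items():
        if key != 'config' and value is not None:
            config[key] = value
    return config
```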