edx-crawler is a Python-based cross-platform tool for mining text data from the enrolled edX and Edge edX courses available on a user's dashboard. It was developed by teaching assistants at the Tokyo Tech Online Education Development Office as an extension of edx-dl.
Python libraries and modules:
- Python - version 3.5+
- beautifulsoup - a Python library for pulling data out of HTML and XML files
- webvtt-py - a Python module for reading/writing WebVTT caption files
- youtube-dl - command-line program to download videos from YouTube.com
- ffmpeg-python - a Python wrapper around the ffmpeg command line, used here for video (mpeg) file analysis
Multimedia framework:
- ffmpeg - command-line program to record, convert and stream audio and video.
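The Python dependencies can typically be installed with pip (the package names below are the usual PyPI names and may differ on your system):

pip install beautifulsoup4 webvtt-py youtube-dl ffmpeg-python

ffmpeg itself is not a Python package and must be installed separately, e.g. through your operating system's package manager.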
Run the Python script edx_crawler.py, passing the edX course link (-url), username (-u), and password (-p) as parameters; a complete example follows the option list below.
python edx_crawler.py -url [course_url] -u [edx_user_name] -p [edx_user_password]
-url, --course-urls  Specify the target course URL(s) taken from your edX dashboard
-u, --username       Specify your edX username (email)
-p, --password       Specify your edX password
-d, --html-dir       Specify the directory in which to store the data
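For example, an invocation with all options might look like the following (the course URL, credentials, and output directory are placeholders; the URL format follows the usual edX course pattern):

python edx_crawler.py -url https://courses.edx.org/courses/course-v1:OrgX+Course101+2024/course/ -u user@example.com -p mypassword -d output_data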
The output contents are stored in .json format as follows:
- all text components -> all_textcomp.json
- all problem components -> all_probcomp.json
- all video components -> all_videocomp.json
- all components (text, quizzes, videos) -> all_comp.json
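As a minimal post-processing sketch, the JSON output can be read with Python's standard json module; the exact record schema is not documented here, so the snippet below only inspects the top-level structure:

    import json

    # Load the combined output file produced by the crawler
    with open("all_comp.json", encoding="utf-8") as f:
        data = json.load(f)

    # The record schema is not documented here; just report the top-level shape
    if isinstance(data, list):
        print(len(data), "components loaded")
    elif isinstance(data, dict):
        print("top-level keys:", list(data)[:10])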
The raw HTML files corresponding to each Unit are backed up in sourcefile.tar.gz.
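The archive can be unpacked with a standard tar command, e.g.:

tar -xzf sourcefile.tar.gz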
transcript_error_report.txt contains information about video transcripts that are not provided by edX or YouTube.