edx-crawler is a Python-based cross-platform tool for mining text data from the enrolled edX and Edge edX courses available on a user's dashboard. It was developed by teaching assistants at the Tokyo Tech Online Education Development Office as an extension of edx-dl.
Python libraries and modules:
- Python - version 3.5+
- beautifulsoup - a Python library for pulling data out of HTML and XML files
- webvtt-py - a Python module for reading/writing WebVTT caption files
- youtube-dl - command-line program to download videos from YouTube.com
- ffmpeg-python - a Python wrapper around the ffmpeg command line, used here for video (mpeg) file analysis
Multimedia framework:
- ffmpeg - command-line program to record, convert and stream audio and video.
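The Python dependencies can typically be installed with pip (the package names below are the usual PyPI names and may differ on your system):

pip install beautifulsoup4 webvtt-py youtube-dl ffmpeg-python

ffmpeg itself is not a Python package and must be installed separately, e.g. through your operating system's package manager.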
Run the Python script edx_crawler.py, passing the edX course link (-url), username (-u), and password (-p) as parameters; a complete example follows the option list below.
python edx_crawler.py -url [course_url] -u [edx_user_name] -p [edx_user_password]
-url, --course-urls  Specify the target course URL(s) taken from your edX dashboard
-u, --username       Specify your edX username (email)
-p, --password       Specify your edX password
-d, --html-dir       Specify the directory in which to store the data
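For example, an invocation with all options might look like the following (the course URL, credentials, and output directory are placeholders; the URL format follows the usual edX course pattern):

python edx_crawler.py -url https://courses.edx.org/courses/course-v1:OrgX+Course101+2024/course/ -u user@example.com -p mypassword -d output_data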
The output contents are stored in .json format as follows:
- all text components -> all_textcomp.json
- all problem components -> all_probcomp.json
- all video components -> all_videocomp.json
- all components (text, quizzes, videos) -> all_comp.json
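As a minimal post-processing sketch, the JSON output can be read with Python's standard json module; the exact record schema is not documented here, so the snippet below only inspects the top-level structure:

    import json

    # Load the combined output file produced by the crawler
    with open("all_comp.json", encoding="utf-8") as f:
        data = json.load(f)

    # The record schema is not documented here; just report the top-level shape
    if isinstance(data, list):
        print(len(data), "components loaded")
    elif isinstance(data, dict):
        print("top-level keys:", list(data)[:10])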
The raw HTML files corresponding to each Unit are backed up in sourcefile.tar.gz.
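The archive can be unpacked with a standard tar command, e.g.:

tar -xzf sourcefile.tar.gz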
transcript_error_report.txt contains information about video transcripts that are not provided by edX or YouTube.