This repository contains two Python scripts for crawling web pages and GitHub repositories to extract and store relevant content. Below is a brief overview of each script's capabilities.
The `web_crawler.py` script crawls a specified website, extracts page content, and saves the data in JSON format. It respects `robots.txt` rules and allows customization of crawl depth and chunk size.
- Domain-Specific Crawling: Crawls only the specified domain and path.
- Robots.txt Compliance: Respects the rules defined in the website's `robots.txt` file (see the sketch after this list).
- Chunked Output: Saves crawled data in JSON chunks for easier processing.
- Customizable Depth: Allows setting a maximum crawl depth.
- Exclusion Rules: Excludes URLs with specific patterns (e.g., login pages, static assets).
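Deciding whether a URL should be fetched combines the domain/path restriction with the site's `robots.txt` rules. The following is a minimal sketch of such a check using Python's standard library; the function name `is_allowed` and the exact matching rules are illustrative assumptions, not the script's actual implementation.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, start_url, user_agent="*"):
    """Illustrative check: stay on the starting domain/path and obey robots.txt."""
    start, target = urlparse(start_url), urlparse(url)
    # Domain-specific crawling: only follow links on the same host and path prefix.
    if target.netloc != start.netloc or not target.path.startswith(start.path):
        return False
    # Robots.txt compliance: consult the site's robots.txt before fetching.
    parser = RobotFileParser()
    parser.set_url(f"{start.scheme}://{start.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)
```

In a real crawl the parsed `robots.txt` would typically be cached rather than re-read for every candidate URL.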
- Run the script and provide the website URL.
- Optionally, set the chunk size and maximum crawl depth.
- The script will save the crawled data in the `web_crawled_data` directory (a sketch of this chunked output follows the list).
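Chunked output amounts to splitting the collected records across several JSON files. Below is a minimal sketch, assuming the pages are gathered as a list of dictionaries; the `chunk_*.json` naming and the default chunk size are assumptions for illustration, not the script's exact behavior.

```python
import json
import os

def save_chunks(pages, out_dir="web_crawled_data", chunk_size=50):
    """Write crawled pages to numbered JSON files of at most `chunk_size` records each."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(pages), chunk_size):
        path = os.path.join(out_dir, f"chunk_{i // chunk_size}.json")  # assumed naming scheme
        with open(path, "w", encoding="utf-8") as f:
            json.dump(pages[i:i + chunk_size], f, ensure_ascii=False, indent=2)
```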
The `gitHub_docs_crawler.py` script crawls a GitHub repository, specifically targeting documentation files (e.g., `.md`, `.txt`, `.html`). It extracts file content and metadata, saving the data in JSON format.
- GitHub API Integration: Uses the GitHub API to fetch repository contents.
- File Type Filtering: Targets specific file extensions (e.g., `.md`, `.html`).
- Rate Limit Handling: Automatically pauses when GitHub API rate limits are approached (see the sketch after this list).
- Chunked Output: Saves crawled data in JSON chunks for easier processing.
- Customizable Depth: Allows setting a maximum directory recursion depth.
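The GitHub contents API (`GET /repos/{owner}/{repo}/contents/{path}`) returns a directory listing that can be walked recursively while filtering by extension and watching the rate-limit headers. The sketch below shows one way this could look; the function name `list_doc_files`, the extension list, and the depth limit are assumptions and not the script's exact code.

```python
import time
import requests

DOC_EXTENSIONS = (".md", ".txt", ".html")

def list_doc_files(owner, repo, path="", token=None, depth=0, max_depth=5):
    """Illustrative recursive walk of a repository via the GitHub contents API."""
    headers = {"Authorization": f"token {token}"} if token else {}
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    resp = requests.get(url, headers=headers)
    # Rate limit handling: wait until the reported reset time when the quota is exhausted.
    if int(resp.headers.get("X-RateLimit-Remaining", "1")) == 0:
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        time.sleep(max(reset - time.time(), 0))
    files = []
    for item in resp.json():
        if item["type"] == "dir" and depth < max_depth:
            # Customizable depth: stop recursing past max_depth directories.
            files += list_doc_files(owner, repo, item["path"], token, depth + 1, max_depth)
        elif item["type"] == "file" and item["name"].endswith(DOC_EXTENSIONS):
            files.append(item["path"])
    return files
```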
- Run the script and provide the GitHub repository URL.
- Optionally, provide a GitHub token for authenticated requests.
- The script will save the crawled data in the `github_api_crawled_data` directory (a usage example follows the list).
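Assuming a helper like the `list_doc_files` sketch above, a token-authenticated run could look like the following; the `GITHUB_TOKEN` environment variable and the `octocat/Hello-World` repository are placeholders for illustration.

```python
import os

# Hypothetical usage: read a personal access token from the environment, if set,
# and list documentation files in a public example repository.
token = os.environ.get("GITHUB_TOKEN")
doc_paths = list_doc_files("octocat", "Hello-World", token=token)
print(f"Found {len(doc_paths)} documentation files")
```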
- The GitHub Documentation Crawler is inspired by the original work by rsain/GitHub-Crawler.