TextHarvester · Domain-specific web crawler and downloader
#Python #PyPI #YouTube-DL #AsyncIO #Multithreading
Project Summary
TextHarvester is an easy-to-use tool that crawls the web to collect URLs and then downloads the content of those pages into a text file. It can be used to gather large amounts of text efficiently for general-purpose NLP.
TextHarvester crawls depth-first, in two stages:
- Collect URLs:
  - Collect the links found on the starting URL.
  - Randomly choose links from the previous step (based on your parameters) and repeat the previous step for each of them.
- Download the content of the collected URLs.
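The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not TextHarvester's actual code or API: the names `collect_urls`, `download_content`, `max_depth`, and `links_per_page` are hypothetical, and the link extractor and page fetcher are passed in as plain functions so the sketch runs without network access.

```python
import random
from typing import Callable, Iterable, List, Optional, Set


def collect_urls(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    links_per_page: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Collect URLs depth-first, following a random sample of links per page.

    All parameter names are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    seen: Set[str] = {start_url}
    collected: List[str] = [start_url]

    def visit(url: str, depth: int) -> None:
        if depth >= max_depth:
            return
        # Drop duplicate links (keeping order) and pages already collected.
        candidates = [u for u in dict.fromkeys(get_links(url)) if u not in seen]
        chosen = rng.sample(candidates, min(links_per_page, len(candidates)))
        for link in chosen:
            if link in seen:  # may have been reached through an earlier branch
                continue
            seen.add(link)
            collected.append(link)
            visit(link, depth + 1)  # go deeper before trying the next sibling

    visit(start_url, 0)
    return collected


def download_content(
    urls: Iterable[str],
    fetch_text: Callable[[str], str],
    out_path: str,
) -> None:
    """Write the text of every collected URL into one text file, one page per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(fetch_text(url).strip() + "\n")


# A toy in-memory "website" standing in for real pages (hypothetical data).
TOY_SITE = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

urls = collect_urls(
    "a",
    lambda u: TOY_SITE.get(u, []),
    max_depth=2,
    links_per_page=2,
    rng=random.Random(0),
)
print(urls)
```

In a real crawl, `get_links` would fetch a page and extract its links, and `fetch_text` would return the page's visible text. Injecting both keeps the traversal logic separate from networking and easy to test.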
Install with:
git clone https://github.com/techboy-coder/Textharvester.git
cd Textharvester && pip install --upgrade -r requirements.txt -q && pip install .
More information can be found in the Docs and Example Usage, or in the GitHub repo.