Textharvester · Domain-specific web crawler and downloader

#Python #PyPI #YouTube-DL #AsyncIO #Multithreading


Project Summary

TextHarvester is an easy-to-use tool that crawls the web to collect URLs and then downloads the content of the collected pages into a text file. It can be used to gather large amounts of text efficiently for general-purpose NLP.

Textharvester Algorithm

TextHarvester crawls depth-first and works in two stages:

  1. Collect URLs:
    1. Collect the links found on the starting URL's page.
    2. Randomly choose links from the previous step (according to your parameters) and repeat that step on each chosen link.
  2. Download the content of the collected URLs.
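
The two stages above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not TextHarvester's actual code or API: the names `PageParser`, `collect_urls`, `download_content`, `max_depth`, and `links_per_page` are hypothetical, and `fetch_html` is an assumed callable (URL in, HTML string out) that is faked here with an in-memory dictionary so the example runs without network access.

```python
import io
import random
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects absolute link targets and non-empty text nodes from one page.

    Script and style contents are not filtered out in this sketch.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def parse(url, fetch_html):
    parser = PageParser(url)
    parser.feed(fetch_html(url))
    return parser


def collect_urls(start_url, fetch_html, max_depth=2, links_per_page=2, rng=None):
    """Stage 1: depth-first URL collection with random link sampling."""
    rng = rng or random.Random()
    collected = [start_url]
    visited = set()

    def visit(url, depth):
        if url in visited or depth > max_depth:
            return
        visited.add(url)
        links = parse(url, fetch_html).links
        for link in links:
            if link not in collected:
                collected.append(link)
        # Follow a random subset, descending fully before trying the next link.
        for link in rng.sample(links, min(links_per_page, len(links))):
            visit(link, depth + 1)

    visit(start_url, 0)
    return collected


def download_content(urls, fetch_html, out):
    """Stage 2: write the text of each collected page as one line."""
    for url in urls:
        out.write(" ".join(parse(url, fetch_html).text) + "\n")


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/c">C</a>',
    "https://example.test/b": "<p>Page B</p>",
    "https://example.test/c": "<p>Page C</p>",
}


def fake_fetch(url):
    return PAGES.get(url, "")


urls = collect_urls("https://example.test/", fake_fetch, rng=random.Random(0))
out = io.StringIO()  # a real run would open a text file instead
download_content(urls, fake_fetch, out)
print(urls)
print(out.getvalue())
```

In this sketch the random sampling limits how many links are followed from each page, while `max_depth` bounds how far the crawl descends; both stand in for whatever parameters the real tool exposes.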

Install from source:

git clone https://github.com/techboy-coder/Textharvester.git
cd Textharvester && pip install --upgrade -r requirements.txt -q && pip install .

More information can be found in the Docs and Example Usage, or in the GitHub repo.

Textharvester Docs