Textharvester · Domain-specific web crawler and downloader


TextHarvester: Effortlessly Gather Text Data for NLP

TextHarvester is a Python tool designed to streamline the process of collecting and downloading website content for Natural Language Processing (NLP) projects.

How it Works:

Depth-First Crawling: TextHarvester uses a depth-first algorithm to explore websites, starting from a seed URL that you provide.
URL Collection & Downloading: It collects links from the seed URL and from each page it visits afterwards, and writes the text content of every page to a text file that you specify.
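
The two steps above can be illustrated with a minimal sketch. It is not TextHarvester's actual code: the names `LinkAndTextParser` and `crawl_depth_first`, and the `fetch` callable, are assumptions made for illustration. The sketch uses only the Python standard library and takes the page fetcher as an argument, so it can be tried without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl_depth_first(seed_url, fetch, max_depth=2):
    """Return a list of (url, text) pairs in depth-first visiting order.

    `fetch` is a hypothetical callable that maps a URL to an HTML string
    (or None on failure), so the network layer can be swapped out.
    """
    visited = set()
    pages = []
    stack = [(seed_url, 0)]  # an explicit stack gives depth-first order
    while stack:
        url, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        pages.append((url, " ".join(parser.text_parts)))
        if depth < max_depth:
            # Push in reverse so the first link on the page is visited first.
            for href in reversed(parser.links):
                absolute, _fragment = urldefrag(urljoin(url, href))
                if absolute not in visited:
                    stack.append((absolute, depth + 1))
    return pages
```

To save the result, the `(url, text)` pairs can be written one per line to the output file of your choice. For real crawling, `fetch` could wrap `urllib.request.urlopen` and return the decoded response body.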

Key Features:

Simplicity: Designed with ease of use in mind, TextHarvester requires minimal configuration to get started.
Customizability: Adjust crawling parameters like the maximum depth and random link selection to tailor your data collection process.
Versatility: Ideal for gathering large text datasets for a wide range of NLP applications, such as training language models, sentiment analysis, or topic modeling.
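
As a rough illustration of the random link selection mentioned under Customizability, the sketch below keeps at most a fixed number of randomly chosen links per page. The function name `select_links` and its `max_links` and `rng` parameters are hypothetical and do not reflect TextHarvester's real option names; see the project docs for those.

```python
import random


def select_links(links, max_links=None, rng=None):
    """Deduplicate links, then optionally keep a random subset.

    `max_links` and `rng` are illustrative parameters, not TextHarvester options.
    """
    rng = rng or random.Random()
    unique = list(dict.fromkeys(links))  # drops duplicates, keeps first-seen order
    if max_links is None or max_links >= len(unique):
        return unique
    return rng.sample(unique, max_links)
```

Passing a seeded `random.Random` instance makes a crawl reproducible, which can help when comparing datasets collected with different depth limits.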

Figure: Textharvester algorithm

Getting Started:

To install, clone the repository and install it with pip:

git clone https://github.com/techboy-coder/Textharvester.git
cd Textharvester && pip install --upgrade -r requirements.txt -q && pip install .

Dive deeper into the project documentation and explore practical examples:

Github Repo
Docs & Example Usage
