- About
- Development setup
- Production setup
- Environment variables
- Rest API and Queues board
- Working with scenarios
- Working with scenario schedulers
- Tutorial: Creating the first scenario
- Integrations
- License
## About

Crawler is a standalone application written in Node.js, built on top of Express.js, Crawlee, Puppeteer and BullMQ, that allows you to crawl data from web pages by defining scenarios. Everything is controlled through the Rest API.
## Development setup

Requirements:

- Docker compose
- Make
```sh
$ git clone https://github.com/68publishers/crawler.git crawler
$ cd crawler
$ make init
```

HTTP Basic authorization is required for API access and administration, so we need to create a user to access the application:
```sh
$ docker exec -it crawler-app npm run user:create
```

## Production setup

Requirements:

- Docker
- Postgres >=14.6
- Redis >=7
For production use, the following Redis settings must be applied:

- Configure persistence with the Append-only-file strategy - https://redis.io/docs/management/persistence/#aof-advantages
- Set Max memory policy to noeviction - https://redis.io/docs/reference/eviction/#eviction-policies
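Both settings can be applied, for example, when starting Redis in Docker (the image tag and container name below are illustrative):

```sh
$ docker run -d \
  --name crawler-redis \
  redis:7 \
  redis-server --appendonly yes --maxmemory-policy noeviction
```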
First, you need to run the database migrations with the following command:
```sh
$ docker run \
  --network <NETWORK> \
  -e DB_URL=postgres://<USER>:<PASSWORD>@<HOSTNAME>:<PORT>/<DB_NAME> \
  --entrypoint '/bin/sh' \
  -it \
  --rm \
  68publishers/crawler:latest \
  -c 'npm run migrations:up'
```

Then download the seccomp file, which is required to run Chrome:
```sh
$ curl -C - -O https://raw.githubusercontent.com/68publishers/crawler/main/.docker/chrome/chrome.json
```

And run the application:
```sh
$ docker run \
  --init \
  --network <NETWORK> \
  -e APP_URL=<APPLICATION_URL> \
  -e DB_URL=postgres://<USER>:<PASSWORD>@<HOSTNAME>:<PORT>/<DB_NAME> \
  -e REDIS_HOST=<HOSTNAME> \
  -e REDIS_PORT=<PORT> \
  -e REDIS_AUTH=<PASSWORD> \
  -p 3000:3000 \
  --security-opt seccomp=$(pwd)/chrome.json \
  -d \
  --name 68publishers_crawler \
  68publishers/crawler:latest
```

HTTP Basic authorization is required for API access and administration, so we need to create a user to access the application:
```sh
$ docker exec -it 68publishers_crawler npm run user:create
```

## Environment variables

| Name | Required | Default | Description |
|---|---|---|---|
| APP_URL | yes | - | Full origin of the application, e.g. https://www.example.com. The variable is used to create links to screenshots etc. |
| APP_PORT | no | 3000 | Port on which the application listens |
| DB_URL | yes | - | Connection string to the Postgres database, e.g. postgres://root:root@localhost:5432/crawler |
| REDIS_HOST | yes | - | Redis hostname |
| REDIS_PORT | yes | - | Redis port |
| REDIS_AUTH | no | - | Optional Redis password |
| REDIS_DB | no | 0 | Redis database number |
| WORKER_PROCESSES | no | 5 | Number of workers that process the queue of running scenarios |
| CRAWLEE_STORAGE_DIR | no | ./var/crawlee | Directory where the crawler stores runtime data |
| CHROME_PATH | no | /usr/bin/chromium-browser | Path to the Chromium executable |
| SENTRY_DSN | no | - | Logging into Sentry is enabled if the variable is set |
| SENTRY_SERVER_NAME | no | crawler | Server name that is passed to the Sentry logger |
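For illustration, a minimal production configuration could set only the required variables (all values below are placeholders):

```sh
# Placeholder values - adjust to your environment
export APP_URL=https://crawler.example.com
export DB_URL=postgres://crawler:secret@postgres:5432/crawler
export REDIS_HOST=redis
export REDIS_PORT=6379
```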
## Rest API and Queues board

The specification of the Rest API (Swagger UI) can be found at the endpoint /api-docs - usually http://localhost:3000/api-docs in the case of a development setup. You can try out all the endpoints there.

Alternatively, the specification can be viewed online.
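Since HTTP Basic authorization is required for API access, requests must include the credentials of a user created earlier. For example, a check against a development setup might look like this:

```sh
$ curl -u <USER>:<PASSWORD> http://localhost:3000/api-docs
```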
BullBoard is located at /admin/queues. Here you can see all the scenarios that are currently running or have already run.
## Working with scenarios

@todo

## Working with scenario schedulers

@todo

## Tutorial: Creating the first scenario

@todo
## Integrations

- PHP Client for Crawler's API - 68publishers/crawler-client-php
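The client can presumably be installed via Composer (assuming the package is published under the same name as the repository):

```sh
$ composer require 68publishers/crawler-client-php
```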
## License

The package is distributed under the MIT License. See LICENSE for more information.
