This framework is built on top of Selenium and provides the necessary features to make web crawling easier and more efficient.
- supports as many threads as the machine can handle
- distributed web crawling design for scaling across machines
- automatic proxy rotation (private proxies are also supported)
- automatic error detection (blocking, format checks, etc.)
- cookie and local storage insertion
- pause and resume from where you left off
- supports browser extensions
Selenium mimics how an actual human browses a website, which makes it harder for target websites to detect a crawling robot. Traditional web scraping can only crawl static pages; with Selenium, dynamic page scraping is also possible.
/selenpy/common.settings
Modify the config file to connect to your own database server:
DATABASE = {
    'host': '000.000.0.00', # your MongoDB address
    'port': '00000',        # your MongoDB port
    'username': None,       # database username, if any
    'password': None,       # database password, if any
    'db': 'mydb'            # database to select
}

/main.py
Define a class that extends the BaseCrawler class (* marks required properties).
Below are the variables that can be specified:
| variables | Description | type |
|---|---|---|
| collection * | table name to be stored in the database | str |
| domain * | target website domain | str |
| targets * | target ids/urls | array |
| fURL | url pattern | format str |
| log_source | save the entire html source | boolean |
| Proxy | use proxies if specified | class |
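As a rough sketch of how these fit together, a crawler definition might look like the following. The class name, collection, domain, targets, and URL pattern are illustrative placeholders, not part of the framework.

```python
from selenpy.Crawler import BaseCrawler

class BookCrawler(BaseCrawler):
    # required properties
    collection = "books"                        # table name to be stored in the database
    domain = "books.example.com"                # target website domain
    targets = ["101", "102", "103"]             # target ids/urls

    # optional properties
    fURL = "https://books.example.com/item/{}"  # url pattern, formatted with each target
    log_source = False                          # set True to save the entire html source
```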
Below are the methods that can be specified:
| methods | Description | input | output |
|---|---|---|---|
| spider * | page scraping script | SeleniumWebElement | dict |
| validateBlock | validate whether the browser is blocked by the target server | SeleniumWebElement | boolean |
| validatePage | validate whether the page is displaying the correct content | SeleniumWebElement | boolean |
| validateTask | validate whether a task is in the correct format | str | boolean |
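Continuing the same illustrative sketch, the methods could be overridden roughly as follows. The selectors, return values, and the assumption that each method receives the page as a Selenium web element (per the table above) are examples only, not the framework's exact API.

```python
from selenium.webdriver.common.by import By
from selenpy.Crawler import BaseCrawler

class BookCrawler(BaseCrawler):
    ...

    def spider(self, page):
        # page scraping script: return a dict of fields to store
        return {
            "title": page.find_element(By.CSS_SELECTOR, "h1.title").text,
            "price": page.find_element(By.CSS_SELECTOR, "span.price").text,
        }

    def validateBlock(self, page):
        # True if the target server appears to have blocked the browser
        return "captcha" in page.text.lower()

    def validatePage(self, page):
        # True if the page is displaying the expected content
        return len(page.find_elements(By.CSS_SELECTOR, "h1.title")) > 0

    def validateTask(self, task):
        # True if the task string is in the expected format (a numeric id here)
        return task.isdigit()
```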
For more details, please see the example in demo.py, which crawls Amazon products.
To set up proxies, define a class named Proxy inside your own crawler class.
from selenpy.Crawler import BaseCrawler
from selenpy.common.variables import Mode
from selenpy.Reader import Loader

class MyCrawlerClass(BaseCrawler):
    ...
    class Proxy:
        proxies = Loader.read_json("<your_proxies_location>.json")
        type = Mode.DEFAULT # public proxy
    ...

proxies should be a list with the following naming convention (host is a required property, while region is optional):
[
{"host":"888.88.88.888:23056", "region":"Boca Raton Florida"},
{"host":"111.11.111.111:23056", "region":"Boston Massachusetts"},
...
]

If you are using a private proxy, you need to specify type = Mode.PRIVATE instead.
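In that case, the Proxy class from the example above would simply switch modes (same placeholder proxies file as before):

```python
class Proxy:
    proxies = Loader.read_json("<your_proxies_location>.json")
    type = Mode.PRIVATE  # private proxies
```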
You would also need to modify the config file.
/selenpy/common.settings
PRIVATE_PROXY = {
    'username': '12345678',
    'password': '12345678',
    ...

Type python manage.py start in your terminal to start crawling. Below are some extra arguments.
| options | Description | Default |
|---|---|---|
| --headless | browser headless mode | False |
| -t{} | number of threads assigned for the task | 4 |
| -b{} | slice tasks (beginning index) | 0 |
| -e{} | slice tasks (ending index) | end |
| --enable_image | hide images | True |
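For example, a headless run with eight threads over the first 100 tasks might be invoked as follows; the concrete values are placeholders, and the flag forms follow the table above.

```
python manage.py start --headless -t8 -b0 -e100
```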
