Multifarious Scrapy examples with integrated proxy and user-agent support, so you can write a spider comfortably.
Don't use it to do anything illegal!
- Run `./startproject.sh <PROJECT>` to start a new project.
  It generates most of the boilerplate automatically; the only files left for you to write are:
  - PROJECT/PROJECT/items.py
  - PROJECT/PROJECT/spiders/spider.py
 
A modified items.py with the additional fields `url` and `description`:
```python
from scrapy.item import Item, Field

class exampleItem(Item):
    url = Field()
    name = Field()
    description = Field()
```
A modified spider.py with start URLs, link rules, and CSS rules (only the class `exampleSpider` is shown here):
```python
# CommonSpider, Rule, sle (a link-extractor alias), and info() are
# provided by scrapy and this repo's misc/ helpers.
class exampleSpider(CommonSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.com/",
    ]
    # The crawler starts from start_urls and follows the links
    # allowed by the rules below.
    rules = [
        Rule(sle(allow=["/Arts/", "/Games/"]), callback='parse', follow=True),
    ]
    css_rules = {
        '.directory-url li': {
            '__use': 'dump',  # dump data directly
            '__list': True,   # the selector matches a list of nodes
            'url': 'li > a::attr(href)',
            'name': 'a::text',
            'description': 'li::text',
        }
    }

    def parse(self, response):
        info('Parse ' + response.url)
        # parse_with_rules is implemented here:
        #   https://github.com/geekan/scrapy-examples/blob/master/misc/spider.py
        # Return its result so scrapy collects the generated items.
        return self.parse_with_rules(response, self.css_rules, exampleItem)
```
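`parse_with_rules` is the workhorse behind `css_rules`. As a rough, hypothetical sketch of what it does with that dict (the real implementation in misc/spider.py, linked above, handles more options, such as nested rules and the `__use` key):

```python
class RulesParserSketch:
    # Hypothetical simplification of CommonSpider.parse_with_rules:
    # for each top-level CSS selector, build one item per matched node
    # when '__list' is set (or one item from the first match otherwise),
    # filling every field whose value is a plain CSS query.
    def parse_with_rules(self, response, rules, item_class):
        for selector_str, field_rules in rules.items():
            nodes = response.css(selector_str)
            if not field_rules.get('__list'):
                nodes = nodes[:1]  # single item from the first match
            for node in nodes:
                item = item_class()
                for field, css in field_rules.items():
                    if field.startswith('__'):  # meta keys: __use, __list, ...
                        continue
                    item[field] = node.css(css).get()
                yield item
```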
## Real spider example: doubanbook
#### Tutorial
```bash
git clone https://github.com/geekan/scrapy-examples
cd scrapy-examples/doubanbook
scrapy crawl doubanbook
```
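If you want the scraped items written to a file, Scrapy's standard output option works here, e.g. `scrapy crawl doubanbook -o items.json`.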
#### Depth
The spider crawls at several depths; the real data is extracted at depth 2.
- Depth 0: the entrance is http://book.douban.com/tag/
- Depth 1: URLs like http://book.douban.com/tag/外国文学 ("foreign literature"), followed from depth 0
- Depth 2: URLs like http://book.douban.com/subject/1770782/, followed from depth 1
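To make that depth structure concrete, here is a hedged sketch of link rules that would express it with stock Scrapy classes; the class name and regex patterns are illustrative, and the real doubanbook spider may differ:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DoubanBookDepthSketch(CrawlSpider):
    # Hypothetical sketch; not the repo's actual spider.
    name = "doubanbook_sketch"
    allowed_domains = ["book.douban.com"]
    # Depth 0: the crawl starts at the tag index page.
    start_urls = ["http://book.douban.com/tag/"]

    rules = [
        # Depth 1: follow tag listing pages like /tag/外国文学.
        Rule(LinkExtractor(allow=[r"/tag/[^/]+$"]), follow=True),
        # Depth 2: book detail pages like /subject/1770782/ carry the data.
        Rule(LinkExtractor(allow=[r"/subject/\d+/$"]), callback="parse_book"),
    ]

    def parse_book(self, response):
        # Extraction logic would live here in the real spider.
        pass
```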
## Available Spiders
- linkedin
- tutorial
- dmoz_item
- douban_book
- page_recorder
- douban_tag_book
- doubanbook
- hrtencent
- sis
- zhihu (it needs redis)
- alexa
- alexa.cn
 
## Advanced
- Use `parse_with_rules` to write a spider quickly. See the dmoz spider for more details.
- Proxies
  - If you don't want to use a proxy, just comment out the proxy middleware in settings.py (see the sketch below).
  - If you want to customize it, hack misc/proxy.py yourself.
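For reference, the toggle usually lives in PROJECT/PROJECT/settings.py and looks roughly like this; the `misc.proxy.ProxyMiddleware` path below is an assumption, so match it against what your generated settings actually contain:

```python
# settings.py (sketch): enabling/disabling the proxy middleware.
DOWNLOADER_MIDDLEWARES = {
    # Comment out the next line to crawl without a proxy.
    # 'misc.proxy.ProxyMiddleware' is a hypothetical path; use the
    # middleware name that appears in your generated settings.
    'misc.proxy.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```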
 
