Tools to easily generate an RSS feed that contains each scraped item, using the Scrapy framework.
Install

Install scrapy_rss using pip:

```shell
pip install scrapy_rss
```

or using pip for a specific interpreter, e.g.:

```shell
pip3 install scrapy_rss
```

or using setuptools directly:

```shell
cd path/to/root/of/scrapy_rss
python setup.py install
```

or using setuptools for a specific interpreter, e.g.:

```shell
cd path/to/root/of/scrapy_rss
python3 setup.py install
```
Add parameters to the Scrapy project settings (settings.py file)
or to the custom_settings attribute of the spider:
Add an item pipeline that exports items to the RSS feed:

```python
ITEM_PIPELINES = {
    # ...
    'scrapy_rss.pipelines.FeedExportPipeline': 900,  # or another priority
    # ...
}
```
Add the required feed parameters:

- FEED_FILE: the absolute or relative file path where the resulting RSS feed will be saved, e.g. feed.rss or output/feed.rss;
- FEED_TITLE: the name of the channel (feed);
- FEED_DESCRIPTION: a phrase or sentence that describes the channel (feed);
- FEED_LINK: the URL of the HTML website corresponding to the channel (feed).

```python
FEED_FILE = 'path/to/feed.rss'
FEED_TITLE = 'Some title of the channel'
FEED_LINK = 'http://example.com/rss'
FEED_DESCRIPTION = 'About channel'
```
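As an alternative to settings.py, the same values can be set on a single spider through its custom_settings attribute. A minimal sketch (the spider name and start URL are hypothetical):

```python
import scrapy

class FirstSpider(scrapy.Spider):
    # hypothetical spider; adjust the name and start_urls to your site
    name = 'first_spider'
    start_urls = ['http://example.com']

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy_rss.pipelines.FeedExportPipeline': 900},
        'FEED_FILE': 'feed.rss',
        'FEED_TITLE': 'Some title of the channel',
        'FEED_LINK': 'http://example.com/rss',
        'FEED_DESCRIPTION': 'About channel',
    }
```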
Declare your item directly as RssItem():

```python
import scrapy_rss

item1 = scrapy_rss.RssItem()
```

Or use the predefined item class RssedItem with an RSS field named rss that is an instance of RssItem:
```python
import scrapy
import scrapy_rss

class MyItem(scrapy_rss.RssedItem):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...

item2 = MyItem()
```

Set or get item fields as follows. The case-sensitive attributes of RssItem() correspond to RSS elements, and attributes of RSS elements are case-sensitive too. If your editor supports autocompletion, it will suggest attributes for instances of RssedItem and RssItem. Any subset of RSS elements may be set (e.g. title only). For example:
```python
from datetime import datetime

item1.title = 'RSS item title'  # set value of <title> element
title = item1.title.value       # get value of <title> element

item1.description = 'description'

item1.guid = 'item identifier'
item1.guid.isPermaLink = False  # set value of attribute isPermaLink of <guid> element,
                                # isPermaLink is True by default
is_permalink = item1.guid.isPermaLink  # get value of attribute isPermaLink of <guid> element
guid = item1.guid.value                # get value of <guid> element

item1.category = 'single category'
category = item1.category
item1.category = ['first category', 'second category']
first_category = item1.category[0].value  # get value of the <category> element with multiple values
all_categories = [cat.value for cat in item1.category]

# direct attributes setting
item1.enclosure.url = 'http://example.com/file'
item1.enclosure.length = 0
item1.enclosure.type = 'text/plain'

# or dict-based attributes setting
item1.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
item1.guid = {'value': 'item identifier', 'isPermaLink': True}

item1.pubDate = datetime.now()  # works correctly with Python datetimes

item2.rss.title = 'Item title'
item2.rss.guid = 'identifier'
item2.rss.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
```

All allowed elements are listed in scrapy_rss/items.py. All allowed attributes of each element, with their constraints and default values, are listed in scrapy_rss/elements.py. You can also read the RSS specification for more details.
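Items declared this way are yielded from a spider callback like any other Scrapy item, and the pipeline writes them to the feed file. A minimal sketch, assuming scrapy and scrapy_rss are installed (the URL and CSS selectors below are hypothetical and must be adapted to the target site):

```python
import scrapy
import scrapy_rss

class NewsSpider(scrapy.Spider):
    # hypothetical spider; adjust the URL and selectors to your site
    name = 'news'
    start_urls = ['http://example.com/news']

    def parse(self, response):
        for entry in response.css('article'):
            item = scrapy_rss.RssItem()
            item.title = entry.css('h2::text').get()
            item.link = response.urljoin(entry.css('a::attr(href)').get() or '')
            item.description = entry.css('p::text').get()
            yield item  # FeedExportPipeline exports it into FEED_FILE
```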
You can extend RssItem to add new XML fields, namespaced or not. Namespaces can be specified in attribute and/or element constructors. A namespace prefix can be specified either in the attribute/element name, using double underscores as a delimiter (prefix__name), or in the attribute/element constructor using the ns_prefix argument. The namespace URI can be specified using the ns_uri argument of the constructor.
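The double-underscore convention conceptually splits a Python attribute name into a namespace prefix and a local name. The helper below only illustrates that naming rule; it is not the library's actual parser:

```python
def split_ns_name(attr_name):
    """Illustrative only: map 'prefix__name' to (prefix, name),
    and a plain 'name' to (None, name)."""
    if '__' in attr_name:
        prefix, name = attr_name.split('__', 1)
        return prefix, name
    return None, attr_name

print(split_ns_name('prefix3__attr3'))  # ('prefix3', 'attr3')
print(split_ns_name('attr0'))           # (None, 'attr0')
```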
```python
from scrapy_rss.meta import ElementAttribute, Element
from scrapy_rss.items import RssItem

class Element0(Element):
    # attributes without special namespace
    attr0 = ElementAttribute(is_content=True, required=True)
    attr1 = ElementAttribute()

class Element1(Element):
    # attribute "prefix2:attr2" with namespace xmlns:prefix2="id2"
    attr2 = ElementAttribute(ns_prefix="prefix2", ns_uri="id2")
    # attribute "prefix3:attr3" with namespace xmlns:prefix3="id3"
    prefix3__attr3 = ElementAttribute(ns_uri="id3")
    # attribute "prefix4:attr4" with namespace xmlns:prefix4="id4"
    fake_prefix__attr4 = ElementAttribute(ns_prefix="prefix4", ns_uri="id4")
    # attribute "attr5" with default namespace xmlns="id5"
    attr5 = ElementAttribute(ns_uri="id5")

class MyXMLItem(RssItem):
    # element <elem1> without namespace
    elem1 = Element0()
    # element <elem_prefix2:elem2> with namespace xmlns:elem_prefix2="id2e"
    elem2 = Element0(ns_prefix="elem_prefix2", ns_uri="id2e")
    # element <elem_prefix3:elem3> with namespace xmlns:elem_prefix3="id3e"
    elem_prefix3__elem3 = Element1(ns_uri="id3e")
    # yet another element <elem_prefix4:elem3> with namespace xmlns:elem_prefix4="id4e"
    # (does not conflict with the previous one)
    fake_prefix__elem3 = Element0(ns_prefix="elem_prefix4", ns_uri="id4e")
    # element <elem5> with default namespace xmlns="id5e"
    elem5 = Element0(ns_uri="id5e")
```

Access to elements and their attributes is the same as with simple items:
```python
item = MyXMLItem()
item.title = 'Some title'
item.elem1.attr0 = 'Required content value'
item.elem1 = 'Another way to set content value'
item.elem1.attr1 = 'Some attribute value'
item.elem_prefix3__elem3.prefix3__attr3 = 'Yet another attribute value'
item.elem_prefix3__elem3.fake_prefix__attr4 = ''  # non-None value is interpreted as assigned
item.fake_prefix__elem3.attr1 = 42
```

Several optional settings are allowed for namespaced items:
- FEED_NAMESPACES: a list of tuples [(prefix, URI), ...] or a dictionary {prefix: URI, ...} of namespaces that must be declared in the root XML element;
- FEED_ITEM_CLASS (or FEED_ITEM_CLS): the main class of feed items, either a class object MyXMLItem or a path to a class "path.to.MyXMLItem". Default value: RssItem. It is used to extract all possible namespaces that will be declared in the root XML element. Feed items do NOT have to be instances of this class or its subclasses.
If these settings are not defined, or only some of the namespaces are defined, then the remaining namespaces will be declared either in the <item> element or in its subelements when these namespaces are not unique. Each <item> element and its subelements always contain only the namespace declarations of non-None attributes (including attributes that are interpreted as element content).
If you want to change other channel parameters (such as language, copyright, managingEditor, webMaster, pubDate, lastBuildDate, category, generator, docs, cloud, ttl, image, rating, textInput, skipHours, skipDays), define your own exporter that inherits from the FeedItemExporter class and, for example, modify one or more children of the self.channel element (camelCase attribute naming):
```python
from datetime import datetime
from scrapy_rss.rss import channel_elements
from scrapy_rss.exporters import FeedItemExporter

class MyRssItemExporter(FeedItemExporter):
    def __init__(self, *args, **kwargs):
        super(MyRssItemExporter, self).__init__(*args, **kwargs)
        self.channel.generator = 'Special generator'
        self.channel.language = 'en-us'
        self.channel.managingEditor = 'editor@example.com'
        self.channel.webMaster = 'webmaster@example.com'
        self.channel.copyright = 'Copyright 2025'
        self.channel.pubDate = datetime(2025, 9, 10, 13, 0, 0)
        self.channel.category = ['category 1', 'category 2']
        self.channel.category.append('category 3')
        self.channel.category.extend(['category 4', 'category 5'])
        # initialize image from dict
        self.channel.image = {
            'url': 'https://example.com/img.jpg',
            'description': 'Image link hover text',
        }
        # or initialize image from ImageElement
        self.channel.image = channel_elements.ImageElement(url='https://example.com/img.jpg')
        # or initialize image attribute by attribute
        self.channel.image.url = 'https://example.com/img.jpg'    # required attribute of image
        self.channel.image.title = 'Image title'                  # optional
        self.channel.image.link = 'https://example.com/page'      # optional
        self.channel.image.description = 'Image link hover text'  # optional
        self.channel.image.width = 140                            # optional
        self.channel.image.height = 350                           # optional
        self.channel.docs = 'https://example.com/rss_docs'
        self.channel.cloud = {
            'domain': 'rpc.sys.com',
            'port': '80',
            'path': '/RPC2',
            'registerProcedure': 'myCloud.rssPleaseNotify',
            'protocol': 'xml-rpc'
        }
        self.channel.ttl = 60
        self.channel.rating = 4.0
        self.channel.textInput = channel_elements.TextInputElement(
            title='Input title',
            description='Description of input',
            name='Input name',
            link='http://example.com/cgi.py'
        )
        self.channel.skipHours = (0, 1, 3, 7, 23)  # initialize list from iterable
        self.channel.skipHours = 12                # or initialize list with a single value
        self.channel.skipDays = 14                 # initialize list with a single value
        self.channel.skipDays = [1, 14]            # or initialize list from a list
```

Or modify the kwargs arguments (snake_case argument naming):
```python
from scrapy_rss.exporters import FeedItemExporter

class MyRssItemExporter(FeedItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['generator'] = kwargs.get('generator', 'Special generator')
        kwargs['language'] = kwargs.get('language', 'en-us')
        kwargs['managing_editor'] = kwargs.get('managing_editor', 'editor@example.com')
        kwargs['category'] = kwargs.get('category', ('category 1', 'category 2'))
        kwargs['image'] = kwargs.get('image', {'url': 'https://example.com/img.jpg'})
        # etc.
        super(MyRssItemExporter, self).__init__(*args, **kwargs)
```

Then add the FEED_EXPORTER parameter to the Scrapy project settings or to the custom_settings attribute of the spider:
```python
FEED_EXPORTER = 'myproject.exporters.MyRssItemExporter'
```

Since version 1.0.0 some classes have been renamed, but the old-named classes have been kept and marked as deprecated for backward compatibility, so they can still be used. However, some elements of RssItem have had attributes renamed in a backward-incompatible way: almost all content attributes (the text content of an XML tag after exporting) have been renamed to value to improve code readability. So if you do not want to update your code expressions (such as an old-style item.title.title to the new-style item.title.value, or item.guid.guid to item.guid.value), you can simply import the old-style classes

```python
# old-style classes
from scrapy_rss.rss.old.items import RssItem, RssedItem
```

instead of the new-style ones

```python
# new-style classes
from scrapy_rss.items import RssItem, RssedItem
```

respectively.
The examples directory contains several Scrapy projects that demonstrate scrapy_rss usage. They crawl this website, whose source code is here. Just go to a Scrapy project directory and run the commands

```shell
scrapy crawl first_spider
scrapy crawl second_spider
```

Thereafter the feed.rss and feed2.rss files will be created in the same directory.