
Parsing WXR (WordPress eXtended RSS) XML file data #1

@aharonium

Description

Cross-posted from https://groups.google.com/d/msg/opensiddur-tech/BwTd-_7yZgk/7agCfWQ_BQAJ

I mentioned in the previous email that the site data for opensiddur.org is now available in downloadable WXR (WordPress eXtended RSS) XML files.

Making these files publicly accessible is mainly intended as a way for researchers to access the site data without having to scrape opensiddur.org. Aside from RSS, there's really no public API (that I know of) for accessing all 960+ posts on opensiddur.org.

But I have another objective for beginning to move our site data into XML. For all of our transcribed text on opensiddur.org, I want to separate our data from its presentation.

A digression:
Such a goal should be no surprise to folk watching this project from its early days. Efraim and I envisioned the Open Siddur as a database by which we could serve liturgists sharing new prayers, scholars researching liturgy, and crafters compiling collections of prayers and related work into new prayerbooks. By separating data from its presentation, that data could be presented in an infinity of ways in an infinity of variations. Our project was founded with great hope in 2009 with this in mind. However, by late 2010, it was clear we wouldn't be realizing this vision soon. So, I began to do something simple and useful with my own modest skills -- just to help collect and curate liturgical content contributed by our community on the wordpress site that had up till then mainly served as a blogspace. In that way, opensiddur.org became the CMS it is today. Meanwhile, development continued on our collaborative transcription environment and siddur building web application at app.opensiddur.org.

Back to these WXR files. By themselves they are large, unwieldy XML files containing the raw HTML and postmeta data of every one of the posts and pages of opensiddur.org. It seems to me that the next step in making this data accessible is to parse these files into 960+ individual post files containing both the raw HTML data and relevant postmeta data such as title, author, co-author(s), content license, date published, categories, and tags -- and to do as much as we can to provide that as structured data. (Further steps can link these files to the manifests of page images linked to the Internet Archive, make them into nice JLPTEI-conforming XML, and write some XSLT to display them once again in HTML.)
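As a sketch of what that structured output could look like, here is a minimal, standard-library-only approach (not the wxr2txt.py script itself): it walks the <item> elements of a WXR export and writes one JSON file per post, keeping the raw HTML body alongside the title, author, date, categories, tags, and whatever <wp:postmeta> key/value pairs are present. The input file name is just a placeholder, and the namespace URIs assume a WXR 1.2 export -- check the namespace declarations on the <rss> element of your own file and adjust to match.

```python
# Hedged sketch: split a WXR export into one JSON file per post,
# preserving the raw HTML body plus the postmeta we care about.
# Assumes WXR 1.2 namespaces; adjust NAMESPACES to match your export.
import json
import xml.etree.ElementTree as ET

NAMESPACES = {
    'wp': 'http://wordpress.org/export/1.2/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'content': 'http://purl.org/rss/1.0/modules/content/',
}

def text(el):
    """Return an element's text, or '' if the element is missing/empty."""
    return el.text if el is not None and el.text else ''

def parse_wxr(path):
    """Yield one dict of structured data per <item> in the WXR file."""
    tree = ET.parse(path)
    for item in tree.getroot().iter('item'):
        categories = item.findall('category')
        yield {
            'title': text(item.find('title')),
            'author': text(item.find('dc:creator', NAMESPACES)),
            'date': text(item.find('wp:post_date', NAMESPACES)),
            'status': text(item.find('wp:status', NAMESPACES)),
            'categories': [c.text for c in categories
                           if c.get('domain') == 'category'],
            'tags': [c.text for c in categories
                     if c.get('domain') == 'post_tag'],
            # raw HTML body of the post, untouched
            'body': text(item.find('content:encoded', NAMESPACES)),
            # every <wp:postmeta> key/value pair (e.g. co-authors, license)
            'postmeta': {
                text(m.find('wp:meta_key', NAMESPACES)):
                    text(m.find('wp:meta_value', NAMESPACES))
                for m in item.findall('wp:postmeta', NAMESPACES)
            },
        }

if __name__ == '__main__':
    # 'opensiddur.wordpress.xml' is a placeholder name for the posts WXR file.
    for i, post in enumerate(parse_wxr('opensiddur.wordpress.xml'), 1):
        with open('post-%04d.json' % i, 'w', encoding='utf-8') as f:
            json.dump(post, f, ensure_ascii=False, indent=2)
```

The same loop could just as easily emit YAML front matter plus HTML, or feed an XSLT pipeline, once we settle on a target format.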

I've had some success in parsing the WXR posts file into individual text files containing the body of each post, using a wxr2txt Python script I found here: https://gist.github.com/ruslanosipov/b748a138389db2cda1e8

Unfortunately, that script doesn't copy over the postmeta data along with the HTML in the post body. So I'm still trying to figure out what I need to add to the script to parse the WXR file more completely. (I also noticed that the script seems to choke on the pages WXR file.) So there's room for improvement for folks who want to help out and flex their Python skills. The HTMLParser module should come into service here, as can be seen in this fork of the script:
https://gist.github.com/aegis1980/4d00c381b0eb67f83cf93365cd7b69ad

(For some reason, HTMLParser isn't working for me in my Python install, so if you can get the above fork to work, let me know.)
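One guess, not a verified diagnosis: this may be a Python 2 vs. Python 3 mismatch. The module is named HTMLParser in Python 2 but html.parser in Python 3, and its unescape() method was removed in later 3.x releases; the standard-library replacement for turning entities back into characters is html.unescape(). A small example of the replacement, assuming that is what the fork relies on:

```python
# Assumes Python 3: html.unescape() stands in for the Python 2 idiom
# HTMLParser.HTMLParser().unescape(), which the linked fork likely uses.
import html

raw = 'birkat ha-mazon &amp; psalms: &quot;grace after meals&quot;'
print(html.unescape(raw))  # birkat ha-mazon & psalms: "grace after meals"
```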

So have fun experimenting with the site data and this wxr2txt.py script -- and let me know what success you have in parsing the site data.
