PPP_Scraper

This scraper is built on the scraperwiki package. It fetches the previous month's troop contribution report published by the UN Department of Peacekeeping Operations (DPKO) on its website. The report URL follows the pattern http://www.un.org/en/peacekeeping/contributors/<year>/<3 or 4 letter month abbr><last 2 digits of year>_3.pdf. For instance, March 2014's file would be at http://www.un.org/en/peacekeeping/contributors/2014/mar14_3.pdf.
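As a rough illustration of the URL pattern, the sketch below builds the report URL for the month preceding a given date. The names `MONTH_ABBR`, `previous_month`, and `build_report_url` are hypothetical and not taken from scraper.py. It also assumes three-letter abbreviations for every month; the pattern allows four-letter ones, but which months use them is not stated here, so the table may need adjusting.

```python
import datetime

# Assumed abbreviations: three letters for every month. Some months may use
# a four-letter form on the DPKO site; edit the entries below if so.
MONTH_ABBR = ("jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec")

BASE_URL = "http://www.un.org/en/peacekeeping/contributors"


def previous_month(today):
    """Return (year, month) for the calendar month before `today`."""
    first_of_month = today.replace(day=1)
    last_of_previous = first_of_month - datetime.timedelta(days=1)
    return last_of_previous.year, last_of_previous.month


def build_report_url(today):
    """Build the report URL for the month preceding `today`."""
    year, month = previous_month(today)
    return "{0}/{1}/{2}{3:02d}_3.pdf".format(
        BASE_URL, year, MONTH_ABBR[month - 1], year % 100)
```

Called with `datetime.date(2014, 4, 15)`, this yields the March 2014 URL given above. A date in January rolls back to December of the previous year.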

The app then converts the PDF to XML and uses each text element's position on the page to parse it into a SQLite database. Country names are positioned between 130 and 140 px from the left, mission designations between 276 and 280 px from the left, and data is anything more than 350 px from the left. The app is built to run on the Morph.io platform, which provides an API that is used by the PPP_Loader app. It is also possible to run scraperwiki locally, though I've had trouble getting it to work.
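The position rules can be sketched roughly as follows. This is not the code in scraper.py: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, it assumes the converted XML contains `text` elements carrying a `left` attribute, and it treats the 130-140 and 276-280 ranges as inclusive. The function names and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def classify(left):
    """Map a left offset in px to a field type, or None if it matches no rule."""
    if 130 <= left <= 140:
        return "country"
    if 276 <= left <= 280:
        return "mission"
    if left > 350:
        return "data"
    return None


def classify_text_elements(xml_string):
    """Return (field_type, text) pairs for every recognised text element."""
    root = ET.fromstring(xml_string)
    rows = []
    for element in root.iter("text"):
        kind = classify(int(element.get("left")))
        if kind is not None:
            # itertext() also picks up text nested in child tags such as <b>.
            rows.append((kind, "".join(element.itertext()).strip()))
    return rows


# Made-up sample input, not real report data.
SAMPLE_XML = """<pdf2xml><page number="1">
<text top="100" left="20" width="80" height="10" font="0">Page header</text>
<text top="120" left="135" width="60" height="10" font="0">Country A</text>
<text top="120" left="278" width="50" height="10" font="0">MISSION-X</text>
<text top="120" left="400" width="20" height="10" font="0">12</text>
</page></pdf2xml>"""

for kind, text in classify_text_elements(SAMPLE_XML):
    print(kind, text)
```

Elements that match none of the three ranges, such as the page header in the sample, are skipped.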

Morph.io runs this script directly from GitHub once daily, and data is only inserted if it does not already exist in the database.
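One way to get the "insert only if missing" behaviour is a uniqueness constraint combined with `INSERT OR IGNORE`, sketched below with the standard library's sqlite3 module. This is an assumption about how it could be done, not a description of how scraper.py does it. The table name, column names, and sample row are all hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS contributions (
               report_date TEXT NOT NULL,
               country     TEXT NOT NULL,
               mission     TEXT NOT NULL,
               value       TEXT,
               UNIQUE (report_date, country, mission)
           )""")
    return conn


def save_if_new(conn, row):
    """Insert `row`; a row with the same unique key is silently skipped."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO contributions "
            "(report_date, country, mission, value) "
            "VALUES (:report_date, :country, :mission, :value)",
            row)


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM contributions").fetchone()[0]


conn = open_db()
# Made-up sample row, not real report data.
sample = {"report_date": "2014-03", "country": "Country A",
          "mission": "MISSION-X", "value": "12"}
save_if_new(conn, sample)
save_if_new(conn, sample)  # a repeated daily run does not add a duplicate
print(count_rows(conn))
```

With this approach a daily run that sees an already stored report leaves the table unchanged.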

Dependencies

scraperwiki - Web scraping library. Its PDF-to-XML functionality converts the report into a parsable format.
Requests - Python HTTP library, used to check the connection
lxml - Python XML library, used to navigate the XML tree
Python standard library modules: time, datetime, calendar, urllib2
Data source: DPKO website

