OpenPrice

Web software for scraping financial data (e.g. for price comparison or analytics) from online shops and similar sites.

Scraping with JavaScript

Scraping with JavaScript allows a broad range of processing units (anything from desktop browsers and VPS/cloud servers to mobile browsers and low-cost embedded servers), with virtually no limit on their number, to scrape data concurrently into one distributed graph database.

Stack

Front-end:
- Semantic UI
- HighCharts/HighStock
Back-end: Laravel 5
Database: Neo4j
Cassidy: JavaScript
- jQuery
- AngularJS
Sundance: NodeJS (Express)

Configuration

An example config entry looks like this:

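ngApp.constant('CassidiBlueprints', {
  'ebay.com': {
    // Full form: a selector plus a callback that converts the matched element.
    price: {
      selector: '#prcIsum',
      // Strip everything except digits and dots, then parse as a number.
      callback: function (e) { return parseFloat(e.text().replace(/[^\d.]/g, '')); }
    },
    // Shorthand: a plain selector string returns the element's text.
    name: '#itemTitle',
    image: {
      selector: '#icImg',
      callback: function (e) { return e.attr('src'); }
    }
  }
});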

There are two options when adding a new key for parsing.

When the value is a string, it is treated as a DOM selector and the text contained in the matched element is returned.

This is shorthand for the full object form with selector and callback keys. The callback is a function that is called for each match on every page and receives the jQuery element as its sole argument. It must return a non-null value for further processing, or null in case of an error. Data accuracy depends on callbacks reporting errors properly.
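The relationship between the two forms can be sketched as follows. This is not taken from the Cassidy source: `normalizeEntry` and `safePrice` are hypothetical names, and the sketch assumes that the string shorthand behaves like a callback returning the element's text and that an error is signalled with `null`.

```javascript
// Hypothetical helper: expand the string shorthand into the full object form.
function normalizeEntry(entry) {
  if (typeof entry === 'string') {
    return {
      selector: entry,
      // Assumed default: return the text contained in the matched element.
      callback: function (e) { return e.text(); }
    };
  }
  // Already in full form: use it unchanged.
  return entry;
}

// Hypothetical price callback that reports a failed parse with null
// instead of letting NaN flow into further processing.
function safePrice(e) {
  var value = parseFloat(e.text().replace(/[^\d.]/g, ''));
  return isNaN(value) ? null : value;
}
```

A callback written like `safePrice` could be used in place of the inline price callback in a blueprint, so that a page without a readable price is reported as an error rather than stored as a number.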

Usage

Single URL

Once a config exists for a given domain, you can scrape URLs on that domain. With AngularJS it's as simple as:

Cassidi.steal(url).then(function(swag){ if(swag) console.log(swag.price); })

The steal method returns an Angular promise that is resolved asynchronously. The callback passed to then receives a single argument: an object with the same keys as the parser for that domain. Each key holds either the parsed value or false. If every key fails to parse, boolean false is passed instead of an object, so be sure to check for that.
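Since both the whole result and each individual key can be false, it can help to separate the two cases before using the data. The helper below is a minimal sketch, not part of Cassidy; `splitResult` is a hypothetical name.

```javascript
// Hypothetical helper: split a steal() result into parsed and failed keys.
function splitResult(swag) {
  // Boolean false is passed when every key failed to parse.
  if (swag === false) {
    return { ok: false, parsed: {}, failed: [] };
  }
  var parsed = {};
  var failed = [];
  Object.keys(swag).forEach(function (key) {
    // Strict comparison, so a legitimate value such as 0 is not treated as a failure.
    if (swag[key] === false) {
      failed.push(key);
    } else {
      parsed[key] = swag[key];
    }
  });
  return { ok: true, parsed: parsed, failed: failed };
}
```

It could be called inside the then callback, for example `Cassidi.steal(url).then(function (swag) { var result = splitResult(swag); })`, after which `result.failed` lists the keys whose selectors may need attention.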

(Client) Queue

However, you will usually want to run many scrape operations (possibly hundreds or even thousands) without blocking the UI. That is the purpose of the queue method:

Cassidi.queue([url1, url2, ...], function(swag){ console.log(swag.price); }, function(){ alert('All done!'); })

It's pretty self-explanatory, but for the sake of clarity:

The first argument is an array of URLs.
The second argument is an optional function which will be used as a callback for each processed URL.
The third argument is an optional final callback, i.e. a simple function to be run when all URLs have been processed.

Usually you'd want to fetch a few items from the Queue REST API (see below) and run them in the queue method while fetching the next items.
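One way to overlap fetching and scraping is sketched below. It is an illustration only: `pump` and `fetchBatch` are hypothetical names, `fetchBatch(cb)` is assumed to call `cb` with an array of URLs (an empty array meaning there is nothing left), and `runQueue` is assumed to have the same argument order as the queue method.

```javascript
// Hypothetical driver: run batches through a queue function while the next
// batch is already being fetched.
function pump(fetchBatch, runQueue, onItem, onFinished) {
  function runBatch(urls) {
    if (!urls.length) { onFinished(); return; }
    var next = null;        // the prefetched batch, once it has arrived
    var queueDone = false;  // whether the current batch has been processed

    // Continue only when both the prefetch and the current batch are done.
    function maybeContinue() {
      if (queueDone && next !== null) runBatch(next);
    }

    // Start fetching the next batch before the current one is processed.
    fetchBatch(function (nextUrls) { next = nextUrls; maybeContinue(); });
    runQueue(urls, onItem, function () { queueDone = true; maybeContinue(); });
  }
  fetchBatch(runBatch);
}
```

With Cassidy, `runQueue` could be `Cassidi.queue.bind(Cassidi)`. The sketch assumes that `fetchBatch` does not hand out items that are still being processed; whether that holds depends on how items are locked on the server.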

Client-side (Butch Cassidy)

Cassidy is an AngularJS service for data scraping, driven by a per-domain parser configuration (the Blueprints).

Server-side (Sundance Kid)

Sundance is a lightweight NodeJS script that can run scraping queues on the server. It's designed to be stackable and scalable.

Queue

The queue has a simple workflow. Using the REST API, a typical worker cycle looks like this:

GET /api/v1/queue returns a list of IDs of products that have no recent price (within the last 8 hours) and no recently updated :Promise.
GET /api/v1/lock/{product} assigns the given product a new :Promise and returns its id.
POST /api/v1/lock/{promise} adds the value parameter to the parent product's prices and deletes the promise on success (otherwise it touches updated_at).
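The three calls can be strung together roughly as follows. This is a sketch under several assumptions that the description above does not confirm: the queue endpoint is assumed to return an array of product IDs, the lock endpoint is assumed to return the promise id directly, and `send` and `scrape` are hypothetical injected functions (`send(request)` performs the HTTP call and returns a promise of the decoded response; `scrape(productId)` returns a promise of a price, or null on failure).

```javascript
// Base path taken from the endpoints listed above.
var API = '/api/v1';

// Describe the three requests of one worker cycle as plain objects.
function queueRequest() {
  return { method: 'GET', url: API + '/queue' };
}
function lockRequest(productId) {
  return { method: 'GET', url: API + '/lock/' + encodeURIComponent(productId) };
}
function reportRequest(promiseId, value) {
  return {
    method: 'POST',
    url: API + '/lock/' + encodeURIComponent(promiseId),
    body: { value: value }
  };
}

// Hypothetical worker cycle: fetch the queue, then lock, scrape and report
// each product in turn.
function workOnce(send, scrape) {
  return send(queueRequest()).then(function (productIds) {
    // Process products one after another to keep the sketch simple.
    return productIds.reduce(function (chain, productId) {
      return chain.then(function () {
        return send(lockRequest(productId)).then(function (promiseId) {
          return scrape(productId).then(function (value) {
            // How a failed scrape should be reported is not specified above;
            // this sketch simply skips the POST in that case.
            if (value === null) return null;
            return send(reportRequest(promiseId, value));
          });
        });
      });
    }, Promise.resolve());
  });
}
```

Keeping the request descriptions separate from the transport means the same cycle could be driven by Angular's `$http` in a browser or by an HTTP client in Sundance.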
