Crawler and content extractor for building a full text index of a website's contents. Uses Ferret for indexing.

jkraemer/rdig

RDig provides an HTTP crawler and content extraction utilities to help build a site search for web sites or intranets. Internally, Ferret is used for full text indexing. After creating a config file for your site, you can build the index with a single call to rdig.

RDig depends on Ferret (>= 0.10.0) and, for parsing HTML, on either Hpricot (>= 0.4) or the RubyfulSoup library (>= 1.0.4). As I know of no way to specify such an OR dependency in a gem specification, the gem depends on Hpricot. If this is a problem for you, install the gem with --force and then manually run gem install rubyful_soup.

  • create a config file based on the template in doc/examples

  • to create an index:

    rdig -c CONFIGFILE

  • to run a query against the index (just to try it out):

    rdig -c CONFIGFILE -q 'your query'

    this will dump the first 10 search results to STDOUT

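To search the index from your own Ruby code:

    require 'rdig'
    require 'rdig_config'   # load your config file here
    query = 'your query'    # the search string to run against the index
    search_results = RDig.searcher.search(query)

See RDig::Search::Searcher for more information.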

  • in a Rails application, add to config/environment.rb:

    require 'rdig'
    require 'rdig_config'
    
  • place rdig_config.rb into the config/ directory.

  • build index:

    rdig -c config/rdig_config.rb

  • in your controller that handles the search form:

    search_results = RDig.searcher.search(params[:query])
    @results = search_results[:list]
    @hitcount = search_results[:hitcount]
    

Use the :first_doc and :num_docs options to implement paging through search results. :num_docs defaults to 10, so without these options only the first 10 results are retrieved.
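As a rough sketch of how paging might be wired up, the helpers below turn a 1-based page number into the :first_doc and :num_docs options and compute the number of pages from the hit count. Both paging_options and total_pages are hypothetical helpers, not part of RDig, and the commented-out usage assumes that the options hash is passed as the second argument to search and that the page number arrives as params[:page].

```ruby
# Hypothetical helper (not part of RDig): map a 1-based page number to the
# :first_doc / :num_docs search options.
def paging_options(page, per_page = 10)
  page = [page.to_i, 1].max   # treat nil, 0 and negative values as page 1
  { :first_doc => (page - 1) * per_page, :num_docs => per_page }
end

# Hypothetical helper: number of pages needed to show hitcount results.
def total_pages(hitcount, per_page = 10)
  (hitcount + per_page - 1) / per_page   # integer ceiling division
end

# Assumed usage in a controller (options hash as second argument to search):
#   search_results = RDig.searcher.search(params[:query], paging_options(params[:page]))
#   @results  = search_results[:list]
#   @hitcount = search_results[:hitcount]
#   @pages    = total_pages(@hitcount)
```

Keeping the offset arithmetic in one place means the controller and any pagination links in the view agree on the page size.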

A sample configuration is provided in doc/examples/config.rb. The tag_selector properties are called with a BeautifulSoup instance as their parameter. See the RubyfulSoup site for more info about this cool lib. You can also have a look at the html_content_extractor unit test.
