ealdent/uea-stemmer

Ruby implementation of the UEA-Lite stemmer for conservative stemming in search and indexing workloads.

UEA-Lite uses a rule set to normalize suffixes while avoiding aggressive stemming.

The stemmer operates on a single token at a time and returns a stemmed token.
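
Because the stemmer works on one token at a time, callers have to split text into tokens themselves. The following is a minimal sketch, assuming plain whitespace tokenization is good enough; stem_tokens is a hypothetical helper, not part of the gem, and it accepts any object that responds to stem.

```ruby
# Hypothetical helper (not part of uea-stemmer): stem every
# whitespace-separated token in a string, one token at a time.
# `stemmer` is any object that responds to #stem, such as UEAStemmer.new.
def stem_tokens(text, stemmer)
  # String#split with no arguments splits on runs of whitespace
  # and drops leading/trailing whitespace.
  text.split.map { |token| stemmer.stem(token) }
end

# Usage, assuming the gem is installed:
#   require "uea-stemmer"
#   stem_tokens("helpers scarred", UEAStemmer.new)
```

Real indexing pipelines usually need a tokenizer that also handles punctuation; whitespace splitting is used here only to keep the sketch short.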

Notable behavior of this implementation:

  • possessive apostrophes are removed

  • contractions are expanded by default (for example, don't becomes do not)

  • tokens beginning with uppercase letters are preserved, and pluralized acronyms ending in a lowercase s are singularized

  • pure numbers, and tokens containing hyphens/underscores, are passed through unchanged
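
The pass-through rule in the last bullet can be approximated outside the gem, for instance to predict which tokens will come back unchanged. This is an illustrative sketch only: passthrough_token? is a hypothetical name, "pure number" is assumed to mean digits only, and the gem's own rules may treat edge cases (decimals, signs) differently.

```ruby
# Illustrative only: approximates the documented pass-through rule.
# Assumption: a "pure number" is one or more ASCII digits and nothing else.
# The gem's actual rule set may differ for decimals, signs, and similar cases.
def passthrough_token?(token)
  token.match?(/\A\d+\z/) || token.match?(/[-_]/)
end

passthrough_token?("2024")        # digits only
passthrough_token?("well-known")  # contains a hyphen
passthrough_token?("helpers")     # ordinary word, goes through the rules
```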

This is a Ruby port of Richard Churchill's Java port of the original Perl script by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.

Install the gem:

gem install uea-stemmer

Install from source:

git clone https://github.com/ealdent/uea-stemmer.git
cd uea-stemmer
bundle install
bundle exec rake test
bundle exec rake install

Basic usage:

require "uea-stemmer"
stemmer = UEAStemmer.new

stemmer.stem("helpers")   # => "helper"
stemmer.stem("dying")     # => "die"
stemmer.stem("scarred")   # => "scar"

You can extract the matching rule with stem_with_rule:

result = stemmer.stem_with_rule("invited")
result.word      # => "invite"
result.rule_num  # => 22.3
result.rule      # => #<UEAStemmer::Rule ...>
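
One use for stem_with_rule is to see which rules fire most often on a word list. The sketch below is a hypothetical helper, not part of the gem; it assumes only what the example above shows, namely that stem_with_rule returns an object responding to rule_num.

```ruby
# Hypothetical helper (not part of uea-stemmer): count how often each
# rule number is reported across a list of words. `stemmer` must respond
# to #stem_with_rule and return an object with #rule_num.
def rule_counts(words, stemmer)
  words.each_with_object(Hash.new(0)) do |word, counts|
    counts[stemmer.stem_with_rule(word).rule_num] += 1
  end
end

# Usage, assuming the gem is installed:
#   require "uea-stemmer"
#   rule_counts(%w[invited helpers dying], UEAStemmer.new)
```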

Disable contraction expansion:

UEAStemmer.new(nil, nil, skip_contractions: true).stem("don't")
# => "don't"

Use the singleton instance:

DefaultUEAStemmer.instance.stem("running")  # => "run"

Contributing:

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add or update tests.

  • Run bundle exec rake test.

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2005 by the University of East Anglia and authored by Marie-Claire Jenkins and Dr. Dan J. Smith. Jason Adams ported it to Ruby from the Java port by Richard Churchill.

This project is distributed under the Apache 2.0 License. See LICENSE for details.
