View Full Version : KeyWord Stemming & Word Forms
First, a little disclaimer:
I am avidly following Orion's excellent discussion regarding Keywords Co-occurrence and Semantic Connectivity (http://forums.searchenginewatch.com/forum/showthread.php?t=48), and I'm aware that he's just about to start talking about word stemming and prefixes, so I hope I don't preempt him too much here. Sorry, Orion, if I do! ;-)
For those who don't know what word stemming is, I'll describe it very unscientifically here. Stemming is the conversion of a word to its simplest part. For example, stemming would convert to stem, simplest would convert to simple, etc. There is quite a good summary of the process in What is Stemming? (http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm).
I am currently working on a PHP application which involves word stemming, but I'm finding the code and algorithms available on the net a little confusing. It seems that most of the algos have inherent flaws in them, and what's more the flaws seem to be intentional. Obviously, since I'm just starting to understand the process, I'm sure these 'flaws' are more to do with my understanding of the process and how it should be applied in practice.
My main confusion is based on certain stemming errors (http://www.comp.lancs.ac.uk/computing/research/stemming/general/stemmingerrors.htm) which are common to all versions of stemmer currently available on the net to various degrees. For example, I am using a PHP implementation of the Porter stemming algorithm (http://snowball.tartarus.org/english/stemmer.html) which states in the sample vocabulary table (http://snowball.tartarus.org/english/stemmer.html) that conspicuous should convert to conspicu, conspired to conspir, etc.
Why is it that the stemmed word forms are not real English words?
This is really bugging me, because I am trying to use the stemmer in an English teaching environment. For a great many word choices the stemmer actually does a great job of reducing them to the correct base form. However, I would like to offer people the chance to 'view all word forms' and other goodies, but of course I can't present stemmed words that are actually non-words.
Another confusion I have is in the modification of these algorithms to work better on a wider range of words. It seems that many of the code examples rely on placing exceptions in the code to account for very specific circumstances and this feels a little like they're 'winging it' a little.
If anyone has a knowledge of stemming, and practice at modification of the algos, I could do with a little push in the right direction. ;)
Some handy stemming references:
The Porter Stemming Algorithm (http://www.tartarus.org/~martin/PorterStemmer/index.html)
PHP Porter Stemming Class (http://www.chuggnutt.com/stemmer.php)
Creating the perfect word stemmer (http://www.phpscripts.com/item/36/catid/3) (Blog Entry)
A couple more PHP stemming resources (http://bugs.tutorbuddy.com/)
Lancaster (Paice/Husk) stemming algorithm (http://www.comp.lancs.ac.uk/computing/research/stemming/)
Snowball - A language for stemming algorithms (http://snowball.tartarus.org/)
Thanks in advance,
Red5
runarb
06-18-2004, 03:04 AM
A stemmer is not design to convert a word to its linguistically correct root, but to represent different variants of a word by a representative form. As described here (http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm) :
For IR purposes, it doesn't usually matter whether the stems generated are genuine words or not – thus, "computation" might be stemmed to "comput" – provided that (a) different words with the same 'base meaning' are conflated to the same form, and (b) words with distinct meanings are kept separate.
You need at "lemmatiser".
I would like to offer people the chance to 'view all word forms'
I have worked on something like that my self. To make a stemmer that not only can give you "car" from "cars", but "cars" from "car".
See http://www.boitho.com/cgi-bin/lab/porterExpander/index.cgi?query=image&language=ENG here for the word "image"
The numbers after the words is the number of hits in the Boitho database (a search engine) for that word, and can be used to discard words that are incorrect because they occur rarely in the database.
Dodger
06-18-2004, 03:52 AM
I think while the base word "comput" looks funny linguisticly, it does not look funny progmatically. Computerese and English are two different languages.
Reducing the word to even a more basic representation facilitates easier algorithms to form derivatives of the word. Now all you have to do to the word "comput" is add -ing to form the word computing, or -ation for computation. Or -e for the compute. It is easier to add those suffixes progmatically than it is to remove the E, then add the suffix. This does not make for efficient coding.
The letter E is probably an unusual example at any rate. The letter is silent and is used mostly to signify the 'long' vowel sound that it follows. So basicly, for all intents and purposes "comput" is really the basic form of the word in all actuality.
You need at "lemmatiser".Aaaahhhh! Now that's what I needed to hear! I was beginning to think the case was hopeless, but we were simply barking up the wrong tree. For example, the following pages on my site display the stems for "verbs (http://www.usingenglish.com/resources/wordcheck/index.php?word=verbs&test=1)" correctly, as does "amazingly (http://www.usingenglish.com/resources/wordcheck/index.php?word=amazingly&test=1)", but just look what happens to the word stem indicated for "military (http://www.usingenglish.com/resources/wordcheck/index.php?word=military&test=1)".
So, I'll have to get my head around lemmatising then. I'm looking forward to it.
Thanks for your help.
Regards,
red5
Reducing the word to even a more basic representation facilitates easier algorithms to form derivatives of the word. Now all you have to do to the word "comput" is add -ing to form the word computing, or -ation for computation. Or -e for the compute. It is easier to add those suffixes progmatically than it is to remove the E, then add the suffix. This does not make for efficient coding.
Very true, and thank you for filling in the blanks for me. In all the documantation that I've read online there was never any form of explanation as to why computing would reduce to comput. Now it all starts to make a lot more sense.
Dodger
06-18-2004, 06:15 PM
Very true, and thank you for filling in the blanks for me. In all the documantation that I've read online there was never any form of explanation as to why computing would reduce to comput. Now it all starts to make a lot more sense.
It is probably much more complicated either way you look at it. I was just looking at your "military" stem example above ... what in the heck is "milit"? That is probably because of words like "milit-ia" maybe?
I guess an easy way to look at it is that the "stem" of a word is not synonomous with the "base" of the word.
orion
06-19-2004, 11:32 AM
Hello Red5.
Just stopping by the thread. Sorry I couldn't do it before.
I welcome very much this thread on stemming. I was planning on touching the basics, but now I think, one thread is more than enough. I extend an invitation to readers to follow this thread.
Excellent reference material, you and others of the thread are posting.
Orion
I was planning on touching the basics, but now I think, one thread is more than enough. I extend an invitation to readers to follow this thread.
Thanks Orion, but I can't help feeling that I've somehow stolen your thunder a little, which of course I had no intention of doing. Oh, and thanks for the positive comments. :)
It is probably much more complicated either way you look at it. I was just looking at your "military" stem example above ... what in the heck is "milit"? That is probably because of words like "milit-ia" maybe?It could be that the modifications to the algorithm I made have caused this, or it could be that the Porter stemmer simply over-stems words such as that. Over-stemming is certainly a known issue (link (http://www.lancs.ac.uk/ug/oneillc1/stemmer/general/stemmingerrors.htm)), and for example, the original Porter stemmer incorrectly reduces words beginning gener (source (http://snowball.tartarus.org/english/stemmer.html)).
I think I might set up a test page and/or revert back to the original source code just to see what effect that might have. I could always set up yet another special case, but I certainly can't do that for every error; it would be an impossible task!
Edited to add:
I've just found a reasonable summary of the differences between a stemmer and a lematiser:
A stem is generally seen as the morphological root of an inflected (or sometimes derived) word form. Stemming is mostly used in Information Retrieval to refer to approaches that strip off suffixes (or what looks like suffixes) and return the remainder as stem. For example in Dutch, the strings "vangen", "vangt" and "gevangen" should be attributed the stem "vang".
A lemma is something quite similar, but still slightly different: Lemmatisation takes a word and returns its baseform or citation form (the canonical dictionary entry form). This means that the same type (word form) can be assigned different lemmas, e.g. "aaltje" (small eel) with lemma "aal" (eel) vs. "aaltje" (sort of worm) with lemma "aaltje" (sort of worm).
From: http://odur.let.rug.nl/~tanja/code.html
Sadly the site is geared for Dutch, not English. :(
Just getting back to lemmatisers again, does anyone know if code for such a device has been made available in PHP or Perl? I've found a java version, but I can't stand java. ;)
Dodger
06-19-2004, 01:59 PM
Just getting back to lemmatisers again, does anyone know if code for such a device has been made available in PHP or Perl? I've found a java version, but I can't stand java. ;)
Where is the Java version at? Can it be ported?
Here is a page with more links you might be interested in. Don't know if you have stumbled upon this yet or not. There is section in here for Tagger and Lemmatisers.
http://193.10.182.145/.WK/.SS/Kur/KL/KL.LinkDirectory.html
Where is the Java version at? Can it be ported?
Here's where I found it: http://www.canoo.com/wmtrans/developerzone/lemmJavaREADME.html
Here is a page with more links you might be interested in. Don't know if you have stumbled upon this yet or not. There is section in here for Tagger and Lemmatisers.
http://193.10.182.145/.WK/.SS/Kur/KL/KL.LinkDirectory.html
That looks quite interesting, and no, I hadn't seen it. Thanks, I'll browse it now. ;-)
Dodger
06-19-2004, 02:14 PM
I did a search at Google for syntactic parser and found links like this one in the results ( http://www.link.cs.cmu.edu/link/ ) which is a small program written in C that breaks sentences down into its syntactic structure and assigns labels to the parts (noun phrases, verb phrases). It comes with a 60,000 word dictionary as well.
I only offer this up because I think you need to be able to actually identify what kind of word it is before you can properly stem it.
Dodger
06-19-2004, 02:26 PM
Here's where I found it: http://www.canoo.com/wmtrans/developerzone/lemmJavaREADME.html
You have to order that thing, I guess.
I was digging around over at SourceForge ... nothing that I can find there. I think some searching of Universities might be the way to go -- there has to be something more out there than this.
Nice find! I'm quickly being overwhelmed with links to look at, pages to read and coding tasks. The more I delve the more interesting this becomes. :)
PS - I've just found another site related to content analysis or natural language processing which looks quite interesting: http://www.text-mining.org/
... I think you need to be able to actually identify what kind of word it is before you can properly stem it.Possibly, but there is a fudge that I can think of...
If you take a word such as military and stem it (presumably to milit), then create a list of new words by adding every common suffix (-ing, -ly, -s etc.), one could then compare them to a dictionary and only display the correct forms (although it might be a bit processing intensive and a somewhat "round-the-houses" way of going about it).
Dodger
06-19-2004, 02:38 PM
Another search turned this up -- nice demo http://www.connexor.com/demos/tagger_en.html
Here is a good search (http://www.google.com/search?hl=en&lr=&ie=UTF-8&c2coff=1&q=syntactic+parser&as_q=Lemmatiser&btnG=Search%C2%A0within%C2%A0results) at Google
I found a few like that as well:
ENGCG: Constraint Grammar Parser of English (http://www.lingsoft.fi/cgi-bin/engcg)
ENGTWOL: English morphological analyser (http://www.lingsoft.fi/cgi-bin/engtwol)
and the following list from http://www.staff.amu.edu.pl/~przemka/corplink.html
(Developers of ) downloadable software for corpus work:
Mike Scott's Web (Liverpool University, UK) (http://www.liv.ac.uk/~ms2928/index.htm)(WordSmith Tools)
University of Stuttgart (http://www.ims.uni-stuttgart.de/) - computational linguistics and corpus exploration (IMS Corpus Workbench / Unix - free)
Barlow's MonoConc (http://www.athel.com/) (free demo) & ParaConc (http://www.ruf.rice.edu/~barlow/parac.html) (free beta for Windows)
WinConcord (http://www.linglit.tu-darmstadt.de/downloads/wconcord.zip) (free concordancer for Win3.x and later)
ConcApp (http://vlc.polyu.edu.hk/PUB/concapp/) (free concordancer and vocabulary profiler for Windows)
KWiCFinder (http://miniappolis.com/KWiCFinder/) (free Web concord)
LDC's online collocation tester (http://morph.ldc.upenn.edu/cgi-bin/mutualform) (MI & T-score) (limited non-member access)
The TOSCA Research Group for Corpus Linguistics, Nijmegen, Holland (http://lands.let.kun.nl/TSpublic/tosca/) (free deep POS tagger for DOS)
INTEX (http://www.nyu.edu/pages/linguistics/intex/#Download) - free mulltilingual analyser
UNIX tagger (PC version to appear) (http://sls-www.lcs.mit.edu/~flammia/Nb.html)
Xerox Research Centre (http://www.xrce.xerox.com/research/mltt/themes.html) (text analysis researc: on-line demos of tools for various lgs, incl. Polish)
A Xerox site with taggers to experiment with (http://www.xerox.fr/grenoble/mltt/home.html) (plus articles for downloading)
Georgetown University Natural Language Processing (http://www.georgetown.edu/compling/index.html) (including Parser Modularity Demo page)
Generating taggers/lemmatisers. Maintaining lexica (WordManager) (http://www.idsia.ch/wordmanager.html)
Eric Brill's page AND tagger (http://www.cs.jhu.edu/~brill/)
SIL Computing resources (http://www.sil.org/computing/)
Experimental Part-of-Speech Service (http://www-clg.bham.ac.uk/tagger.html)(University of Birmingham)
E-mail part-of speech tagger (http://www.scs.leeds.ac.uk/amalgam/) (University of Leeds)
A site with state-of-the-art parsers (http://www.georgetown.edu/compling/parsinfo.htm)
Other sites discussing parsers: 1. http://www.cl.cam.ac.uk./ftp/nltools (http://www.cl.cam.ac.uk./ftp/nltools); http://www.sil.org/pcpatr (http://www.sil.org/pcpatr) (Summer Institute of Linguistics, Inc.)
Dodger
06-19-2004, 02:49 PM
If you take a word such as military and stem it (presumably to milit), then create a list of new words by adding every common suffix (-ing, -ly, -s etc.), one could then compare them to a dictionary and only display the correct forms (although it might be a bit processing intensive and a somewhat "round-the-houses" way of going about it).
From what I have been reading so far ... they all have been using dictionaries. So you are on the right track there I think. Here is one from Leeds, but you need to email a request for it.
From http://www.ltg.ed.ac.uk/helpdesk/faq/Tools-html/0075.html
George Demetriou's Lemmatiser
George Demetriou at Leeds (george@scs.leeds.ac.uk) has a lemmatisation lexicon compiled from various sources (LDOCE, Wordnet, OALDE, CELEX) for the roots and parts of speech of about 95,000 words.
If you want this thing to be fast though -- you would probably have to build it into its own server. I don't think this is something that can be slopped into a perl or php file and run with any degree of speed.
Dodger
06-19-2004, 02:52 PM
I found a few like that as well:
ENGCG: Constraint Grammar Parser of English (http://www.lingsoft.fi/cgi-bin/engcg)
ENGTWOL: English morphological analyser (http://www.lingsoft.fi/cgi-bin/engtwol)
<snip> etc. etc. etc.
I can see you were looking at the same things I was (the links are mostly colored purple -- visited ) :D
That oughta keep you busy for a while. It appears that some or all of the parts are there, just need to piece them together.
Dodger
06-19-2004, 03:10 PM
This appears to be written in C, look down on the page where it mentions that you can still get his part of speech tagger http://research.microsoft.com/%7Ebrill/
It has a small lexicon dictionary inside of it. BTW, this guy was hired by MicroSoft and is in their Natural Language Group.
From what I have been reading so far ... they all have been using dictionaries. So you are on the right track there I think.Here's a test page (which btw I'm still working on): http://www.usingenglish.com/resources/expander/index.php?w=simple
Obviously, you can test other words by replacing the word simple in the url.
It is using the php class available from http://www.chuggnutt.com/stemmer.php
The Porter algo seems to be a little under-aggressive; see 'necessarily' for an example where it should produce 'necess'. Regardless, it looks as if the concept's working. ;-)
Dodger
06-19-2004, 04:59 PM
Your test page uses a dictionary?
The "necessary" appears to know the -y, but not -ary. Can you expand upon this? I noticed that it does this on "necessity" also (-y vs. -ity).
The site forum has a spell check feature which uses a dictionary of ~ 200,000 words including suffixed forms. The range of suffixes can be expanded over time. ;-)
The code is as follows:
<?
include("db.php");
include("/path/to/class.stemmer.php");
$Stemmer = new Stemmer;
$stem = $Stemmer->stem($w);
$suffix = array('ing',
's',
'able',
'ible',
'ly',
'ed',
'er',
'est',
'ness',
'tion',
'ian',
'ive',
'ity',
'sis',
'ist',
'ous',
'en',
'or',
'ify',
'ings',
'ably',
'ibly'
);
?>
<p><b>Original Word:</b> <? echo $w ?><br>
<b>Stem:</b> <? echo $stem?></p>
<p><b>Other Word Forms:</b></p>
<ul>
<?
foreach( $suffix as $s )
{
$newword = $stem . $s;
$sql = "SELECT * FROM phpbb_spelling_words WHERE word = '".$newword."'";
$result = mysql_query($sql);
if(!$result)
{
echo mysql_error();
exit;
}
$num_results = mysql_num_rows($result);
if($num_results)
{
echo "<li>$newword</li>";
}
}
?>
</ul>
The "necessary" appears to know the -y, but not -ary. Can you expand upon this? I noticed that it does this on "necessity" also (-y vs. -ity).
Check this out: http://www.usingenglish.com/resources/wordcheck/index.php?word=necessary
:D
Dodger
06-20-2004, 01:03 AM
Lots of suffixes, eh? Looks like you need to find a list, plus a prefix list also.
Here is a place that explains how it is done simplistically called Morpheme Processing (http://www.thunderstone.com/site/texisman/node193.html). Of course, you already came across the Porter's algorithm (http://telemat.det.unifi.it/book/2001/wchange/download/stem_porter.html) (this was the original whitepaper). And this search turned up all sorts of interesting results worth looking at: suffix lists (http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=suffix+lists), including a lisp program for lemmatizing nouns (http://www.umiacs.umd.edu/~resnik/programming/assignments/program1/program1.lisp)
findme
06-24-2004, 04:40 PM
I have heard that stemming on a Web search engine scale can be resouce intensive (cost prohibitive in some cases). Is most of this due to processing or disk I/O?
And why are we trying to reinvent the wheel when many dictionairies have words stemmed already? Can't we just do a look up? Or is disk I/O much more costly than processing a stem list on the fly?
Or is it more a question of the increase in the number of search terms will balloon the working set beyond what is reasonable in terms of relevance and excessive recall?
searchtools
06-29-2004, 04:27 PM
My experience with stemming comes from evaluating enterprise search engines. I find that aggressive stemming (searching for all forms of a word) tends to be confusing and annoying rather than helpful. For example, the default version of ht://Dig expands the simple word "rings" to "rings or ring or ringed or ringing or ringings or ringer or ringers".
In my heuristic tests and user experience tests, I've found that people don't really like all this expansion, especially if it means that the exact match doesn't come first. In many cases, the search engine can't show exact matches first, because the stemmed version is stored in the index. This may save disk space, but it annoys users no end. This is particularly a problem with verbs. For example, if someone's looking for the movie "Run, Lola, Run" they would probably not want to find the samurai movie "Ran".
I don't mean to say that all stemming is bad. I like lightweight pluralization, so if you look for "shoe" you also get "shoes", if you look for "goose" you also get "geese". In my tests, I find that most people don't even notice the fact that they are getting plurals, and no one has ever been upset to get them. It's easy enough to perform the stemming in the query and search for multiple versions of the word, rather than storing the stemmed version only.
So, for most practical purposes, pluralization is great, but extensive stemming overwhelms users who want a good match, not comprehensive recall.
Avi Rappoport
SearchTools.com