Search Engine Watch
Old 06-28-2005   #1
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Tools, Formulas or Software for Primary Topic Extraction

I've been spending a lot of time looking into solutions for analyzing a web page document and retrieving the most important terms or phrases - much in the same way that search engines do.

I'm wondering if there are practical tools that exist on the web, or anything someone has written that can extract the primary topics (both single words and phrases) from a document to create an ordered list of the relevant topics of the page.

I'm most interested to learn how this can be done, what kinds of formulas are used and the "how" of the process. Any ideas?
Old 06-29-2005   #2
xan
Member
 
Join Date: Feb 2005
Posts: 238
I can't think of any simple methods, or any tools that are user friendly.

SourceForge has some things. A lot of preprocessing needs to be done before any analysis, otherwise all the results are wrong.

Look into text preprocessing first and then have a look at string tokenization and such things.
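
A minimal sketch of what that preprocessing and tokenization step could look like, assuming Python and plain regular expressions. The function name `preprocess` and the tiny stopword list are illustrative assumptions, not taken from any tool mentioned in this thread; a real pipeline would use a proper HTML parser and a much fuller stopword list.

```python
import re
from html import unescape

# Illustrative stopword list (an assumption); real lists are far longer.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})

def preprocess(html_text):
    """Turn raw page markup into a list of lowercase content words."""
    # Drop script and style blocks entirely, since their contents are not page text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html_text)
    # Replace any remaining tags with a space so adjacent words stay separated.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; and normalize case.
    text = unescape(text).lower()
    # Tokenize: runs of letters, optionally with an internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    # Remove stopwords and single-letter tokens.
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]
```

Calling `preprocess("<p>The Search engines rank the pages.</p>")` gives the four content words with markup, case and stopwords removed. Anything done after this point (counting, weighting, phrase detection) depends on this step being right.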

You can maybe try this:
A little tool, with an explanation, in Java

Very little of what is used in the research world is pretty, user-friendly or simple to use unless you already have a good knowledge of programming, linguistics and all that lovely stuff, because we just want the output to use as a base layer, if you like.

A fair bit of research goes into this area. You should be able to find some papers or something to help you along.

Maybe someone else will have a better suggestion for you, Randfish.
Old 06-29-2005   #3
imsaunders
Newbie
 
Join Date: Mar 2005
Location: UK
Posts: 3
Tools, Formulas or Software for Primary Topic Extraction

The tools that you describe are exactly those which we have spent considerable time and money developing with our Sense Engine technology.
To understand the value of the terms on a webpage, and therefore the concepts, it is necessary to have a linguistic understanding of the contextual distinctiveness of each and every word on the page. This process cannot be delivered by algorithmic analysis, as it is necessary to identify not only the term but also the sense of the word that is being used. A little-known fact is that each word in the English language has, on average, 2.5 alternative meanings. If brand names, proper names and domain names are factored in, this increases to an average of four possible alternative meanings for each word. To illustrate, type 'depression' into any search engine. Depression has at least four alternative meanings, but because search fails to recognize the context, you will only receive information relating to depression in a mental health context for the first few pages. Most of the clustering engines will also fail to find the alternative senses, and will instead offer information on variants of depression in a mental health context.
A solution to this problem is the very basis of our patented Textonomy technology. We have spent considerable time building a semantic network that leverages both language and knowledge to create a composite framework within which all words can reside. The technology layer adopts this framework and analyses the core concepts within a webpage; as a result, within 10 milliseconds you will have all the concepts and associated terms returned to you.
Old 06-29-2005   #4
xan
Member
 
Join Date: Feb 2005
Posts: 238
Agreed.

We are constantly making these, and some methods get used in IR applications, as well as in machine translation and CLIR, for example.

Some people have spent 10 years or more (some since the '50s, like Karen Spärck Jones) learning how to do this stuff and how to make it better and better, and it's never over either.

It's a vast area. We have linguists, computational linguists, programmers, researchers, engineers, IR experts, multilingual people, hardware people (lots of power needed for some of these things), and more.

To be honest, a good start is to do a PhD!

For just a rough and quick look, you can use tools like the one I gave you above.
Old 06-29-2005   #5
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
I appreciate both of your replies. It appears that the cause may be hopeless for someone like me, with a small company, seeking only to return results as part of a free toolset...

I'm very sorry to hear that, but will press on. I'm aware of Yahoo's term extraction project - http://developer.yahoo.net/content/V...xtraction.html - but sadly, it appears to return results only very rarely when queried.

I wonder if there is an easier approach - perhaps a very simplistic way of extracting the most pertinent words of a document, then comparing them to one another to get a general theme.
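
As a rough sketch of that simplistic idea, assuming Python: count single words and adjacent two-word phrases, skip stopwords, and rank by frequency with a small bonus for phrases. The name `top_topics`, the stopword list and the scoring rule are all illustrative assumptions, and this is plain frequency counting with no understanding of word sense, so it is nowhere near what a search engine or a semantic network does.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption); real lists are far longer.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})

def top_topics(text, limit=5):
    """Return up to `limit` (term, score) pairs, highest score first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    prev = None
    for word in words:
        # A stopword or very short word breaks any phrase in progress.
        if word in STOPWORDS or len(word) < 3:
            prev = None
            continue
        counts[word] += 1
        if prev is not None:
            counts[prev + " " + word] += 1  # adjacent two-word phrase
        prev = word
    scored = {}
    for term, n in counts.items():
        is_phrase = " " in term
        # Ignore phrases seen only once; they are usually accidental pairs.
        if is_phrase and n < 2:
            continue
        # Score = frequency times number of words, so repeated phrases
        # outrank their individual words.
        scored[term] = n * len(term.split())
    # Sort by score (descending), then alphabetically for stable output.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

For example, on a page that repeats "search engine" three times, `top_topics` ranks the phrase "search engine" above the separate words "engine" and "search". Weighting against a background corpus (such as TF-IDF) would be the natural next step, so that words common to every page stop ranking highly.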

I'd certainly be interested in licensing someone else's technology if it's affordable - let me know if you have any ideas.
Old 06-29-2005   #6
xan
Member
 
Join Date: Feb 2005
Posts: 238
Well done so far, Rand.

Stay informed, and keep doing what you're doing!