Search Engine Watch
SEO News

Go Back   Search Engine Watch Forums > General Search Issues > Search Technology & Relevancy

Closed Thread
 
Old 02-12-2005   #161
xan
Member
 
Join Date: Feb 2005
Posts: 238
Couldn't help joining in, and wanted to add my thoughts as well!

"Fuzzy set theory" means that a thing can belong to a group to some degree. For search engines, this means a massive database of content can be used to identify the relationships between words, rather than relying on a thesaurus or dictionary. Simply put, the idea is to build an index of word relationships by measuring how often words are used together and in what context. That index is a fuzzy ontology of term associations, which serves as one of the search engine's sources of knowledge. Pompous scientists refer to it as "FuzzONT".
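To make the degree-of-membership idea concrete, here is a minimal sketch in Python (my own toy example with invented membership values, not anything a real engine uses):

```python
# Toy fuzzy set: each term belongs to the concept "fruit" to some degree
# between 0 and 1. The degrees here are invented for illustration.
fruit = {"banana": 1.0, "tomato": 0.6, "rhubarb": 0.3, "carrot": 0.05}

def membership(term, fuzzy_set):
    """Degree to which a term belongs to the fuzzy set (0.0 if absent)."""
    return fuzzy_set.get(term, 0.0)

def fuzzy_union(a, b):
    """Classic fuzzy union: take the max membership degree per element."""
    return {t: max(a.get(t, 0.0), b.get(t, 0.0)) for t in set(a) | set(b)}

print(membership("tomato", fruit))   # 0.6: partially a fruit
print(membership("granite", fruit))  # 0.0: not a fruit at all
```

The point is only that membership is graded rather than binary, which is exactly what Haack objects to further down.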

The work I have seen here is generally based on the idea of an information system expanding a query, so when you put in "banana" you would also get results with terms related to it. Then user refinement comes into play. To my knowledge, Google and MSN don't use this.
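A crude sketch of that expansion step (the association index and its weights are hypothetical, invented purely for illustration):

```python
# Hypothetical term-association index, standing in for a learned
# "fuzzy ontology" of term relationships. Weights are made up.
associations = {
    "banana": [("fruit", 0.9), ("plantain", 0.7), ("yellow", 0.4)],
}

def expand_query(terms, index, threshold=0.5):
    """Add associated terms whose association weight clears the threshold."""
    expanded = list(terms)
    for term in terms:
        for related, weight in index.get(term, []):
            if weight >= threshold:
                expanded.append(related)
    return expanded

print(expand_query(["banana"], associations))
# ['banana', 'fruit', 'plantain'] -- 'yellow' falls below the threshold
```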

Almost all IR systems make use of WordNet:
"WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications."

wordnet online

This project is heavily supported as well, and may be useful to look at:

CYC project

Fuzzy sets have been heavily criticised, especially by Haack, who argues that "True and False are discrete terms. For example, 'The sky is blue' is either true or false; any fuzziness to the statement arises from an imprecise definition of terms, not out of the nature of Truth."

But:

Fox retaliated by saying that many of Haack's objections "stem from a lack of semantic clarity, and ultimately fuzzy statements may be translatable into phrases which classical logicians would find palatable."

Using fuzzy systems in a dynamic control environment raises the likelihood of encountering difficult stability problems.

For those who want to know more:
People are also working on semantic relationships using natural gradient descent (NGD) with neural networks or SVMs, as these have learning capabilities.

Quick reference to the man who introduced it: Professor Lotfi Zadeh, "Fuzzy Sets", 1965.

My job mostly involves getting machines to understand meaning and to interact and react appropriately to user input. Semantics are quite important, but so are coherence and cohesion.

Last edited by xan : 02-12-2005 at 02:49 PM.
Old 02-16-2005   #162
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044

Term co-occurrence has less to do with fuzzy set theory and more to do with semantics. More about this at SES, NY. Venn diagrams are often used as visualization aids, no more, no less.
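For readers following along, document-level term co-occurrence is just counted over a collection; a minimal sketch (toy documents, a standard construction rather than any engine's actual code):

```python
from itertools import combinations
from collections import Counter

# Toy collection: each document is the set of terms it contains.
docs = [
    {"hawaii", "beach", "surf"},
    {"hawaii", "volcano"},
    {"beach", "surf"},
]

# Count how many documents each (sorted) term pair co-occurs in.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

print(pair_counts[("beach", "surf")])  # 2: the pair appears in two docs
```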

Orion

Last edited by orion : 02-16-2005 at 10:50 PM.
Old 02-17-2005   #163
xan
Member
 
Join Date: Feb 2005
Posts: 238
Quote:
Originally Posted by AussieWebmaster
I asked a senior engineer over at Google to have a look at this thread and see what he thought or what comments he could give. He came back with an interesting reply.

My apologies for not being more helpful
here; I'd like to respectfully decline to comment. Any comment that might
eventually find its way to a message board would only get me into trouble.
;-) This is GoogleGuy's area, so I leave that fully to him.

Agreed, matey. And BTW, stochastic systems are always used in IR. Semantics seem to be considered the holy grail around here, and they are important, but there's a lot more to linguistics and computational linguistics than just that, even with term co-occurrence. I mentioned fuzzy set theory because Orion did and I wanted to clear that up with my own opinion, that's all.


8. Semantic Classes. You would be surprised how many times I have explained that co-occurrence is demographically/culturally driven, as is the use of hyphens in different languages (or the same language in different countries).

Punctuation is always ignored. The only time it is a problem is for cohesion, like trying to find out where the end of a sentence is for machine translation purposes.

6. Stop Words. We use two kinds of filters: (a) a stop word list and (b) a library of regular expressions. You may want to check my On-Topic Analysis thread and experiment paper.

Stopword lists vary greatly with each particular use of them. And regex, why would you eliminate stopwords using that as well? Once the stops are gone, they're gone; this is ultra easy to do, and a first-year programmer could work it out. I might not understand what you mean, though; I may well have got your intention wrong.
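For what it's worth, a minimal two-stage filter of the kind being discussed might look like this (the stop list and patterns are illustrative, not anyone's production setup):

```python
import re

# Stage (a): a literal stop-word list. Stage (b): a small library of
# regular expressions catching token classes rather than literal words.
STOP_WORDS = {"the", "a", "of", "and", "in"}
PATTERNS = [
    re.compile(r"^\d+$"),     # bare numbers
    re.compile(r"^[^\w]+$"),  # pure punctuation tokens
]

def filter_tokens(tokens):
    """Drop tokens matching either the stop list or any regex class."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in STOP_WORDS:
            continue
        if any(p.match(low) for p in PATTERNS):
            continue
        kept.append(tok)
    return kept

print(filter_tokens(["The", "use", "of", "2", "filters", "!!"]))
# ['use', 'filters']
```

The regex stage earns its keep on token *classes* (numbers, punctuation, codes) that no finite literal list could enumerate, which may be why one would use both.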

4. Cluster Analysis. We use co-occurrence in preliminary experiments, and the results are taken as input for cluster analysis (LSI, dendrograms, etc.), to be specific. We do carry out cluster analysis.

This is a slow method. Pure pattern matching using SVMs or similar methods is efficient and very fast. It involves less linguistic ploughing-through and more mathematical pattern matching, where terms are assigned weights which are then normalized.
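As a sketch of the "weights which are then normalized" step (a generic L2 normalization, my own illustration rather than anyone's specific method):

```python
import math

def l2_normalize(weights):
    """Scale a dict of term weights to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

# Invented raw term weights for a document vector.
vec = l2_normalize({"banana": 3.0, "fruit": 4.0})
print(vec)  # {'banana': 0.6, 'fruit': 0.8}
```

Normalizing this way makes documents of different lengths directly comparable by dot product, which is what kernel methods like SVMs operate on.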

7. Stemming. At the beginning of this thread one of my intentions was to discuss stemming using Associative Clusters. Then someone started a thread on stemming and I decided to let them discuss it. If you search the SEWF forums for stemming you should find the thread.

Best not to always stem, and use different stemmers for different jobs (and languages, of course).

9. My Calculator. Overhead processing is always there and is a problem. I can handle several thousand records and would like to push for millions. In the On-Topic paper I presented a crude sample using a few hundred. I do agree with you on this. I need more power. A university machine-power center is what I am currently shopping around for. Good things are coming for 2005! But I cannot say more.

Orion, can't you actually use university resources? It's much easier. You should have affiliations; I don't know how it works in your country. We have them here, but power is not a major issue for us at this point, maybe because we have extra resources.
Old 02-17-2005   #164
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044

Quote:
Originally Posted by xan
Agreed matey. And stochastic systems BTW are always used in IR, plus semantics seem to be considered the holy grail round here, and they are important, but there's a lot more to linguistics and computational linguistics than just that, even with term co-occurence. I mentioned Fuzzy set theory because Orion did and I wanted to clear that up using my own opinion, thats all.
Aussie and Xan,

Semantics and co-occurrence, I love that.


Punctuation and demographics. Actually, punctuation does matter in some cases. One is the research we have carried out in the area of hyphenation. In Google, for example, hyphenation in FINDALL (AND) queries introduces a degree of ordering. This degree of ordering acts as a localized EXACT mode within the FINDALL mode. Hyphenation rules may differ even within the same language (e.g., US English vs. UK English). In the case of Spanish, we have many regionalisms. So punctuation as contextuality produces many different meanings for a given term. This is where machine translation fails miserably.

Stop Words. Actually, our experiments show many terms still escape the regexp filters. No matter how much granularity we put into the library, we keep finding cases in which words fail to be recognized. So we opted for general classes and subclasses and stopped there. Then we put into a bag of words those that escape the "literal" library of filters. This approach helps a lot with both English and Spanish terms (especially very unique terms with high IDF), and it can be applied to almost all other languages. Words that escape both filters are easily pinpointed with our on-topic analyzer software.


Cluster Analysis. This line was in response to Claus. Cluster analysis is in fact a slow method. We don't use cluster analysis for the implementation you have suggested. We do use it to identify data structures, and it is not that time-consuming for that purpose.

Stemming. Agreed. Again this was in response to a direct question from Claus. For what we do, in many instances we don't need stemming.

My Calculator. The resources we need for other parts of our research require access via university grants. I'm currently at La Jolla, San Diego, CA. At this point, power is a major issue for our research.

Cheers


Orion

Last edited by orion : 02-17-2005 at 12:53 PM.
Old 02-17-2005   #165
xan
Member
 
Join Date: Feb 2005
Posts: 238
You seem to be doing mostly SEO work, Orion. I was talking about it from a pure IR point of view. Of course I use stops, but they change all the time depending on what you're doing.
Machine translation fails on far more serious things than punctuation. I work cross-language as well (I speak French and German too), and I can tell you that the rules change in every language. Oriental ones aren't affected by the same things. We don't really use punctuation to assess contextuality at my end either. What your stop method does is pretty standard, and that's a decent route to follow.

As for power...hehehehe, you need a big private grant!
Old 02-17-2005   #166
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044

Quote:
Originally Posted by xan
You seem to be doing mostly SEO work Orion.
No.

Quote:
As for power...hehehehe, you need a big private grant!
Yes. He,He.

Orion
Old 03-01-2005   #167
xan
Member
 
Join Date: Feb 2005
Posts: 238
Hi guys,

I know you are busy with the SEW conference in NY, and reading the info around it gives good insight into what is going on there. I'm going to a purely research-based IR conference soon, so maybe I can give feedback from that end of the world and we can compare.

I have a question, and I am not suggesting anything is wrong or right; I genuinely would like to know:

It appears to me that the EF-ratio and c-index do not yield much information. How am I wrong?
Old 03-04-2005   #168
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044

Hi, xan.


The merits, virtues, and applications of the c-indices and EF-ratios were well covered at SES NY. At the conference I presented (in "virtual space") many applications and chart analytics of c-indices and EF-ratios, sharing the podium with world experts like Mike Grehan and Rahul Lahiri (Ask Jeeves).

Many firms I spoke to are using both metrics for their in-house research. Dan Thies, a pioneer and icon in keyword research, spent a nice few minutes discussing the merits of the metrics from the marketing and research standpoint.

I wish you were there. Maybe in another SES. I'll probably be at SES Canada and the W3C Japan conference, unless plans change.

Cheers

Orion

Last edited by orion : 03-04-2005 at 08:43 PM.
Old 03-04-2005   #169
xan
Member
 
Join Date: Feb 2005
Posts: 238
Thank you Orion.

It does seem like there was a lot of fun to be had anyway!
Perhaps I shall venture to one of the other events.
Old 04-11-2005   #170
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044
Advanced C-Indices

Chris Sherman has mentioned the link technology of Become.com. http://searchenginewatch.com/searchd...le.php/3496571

Chris writes,

"And unlike PageRank, which essentially pays no attention to the content of a web page, relying entirely on link analysis to compute a ranking, AIR analyzes the content of a page and only indexes it if it can determine that the page contains shopping-related content. It also only crawls on-topic links, discarding non-shopping or spam links."


The portion that interests me the most is the last line of the quote, since it brings back flashbacks of one of my old posts from last summer in this thread: c-indices applied to links and email addresses. It is time now to revisit these.


C-Indices as estimates of link co-occurrences

Link-Keyword Co-Occurrence

Let L1 be number of links containing term k1
Let L2 be number of links containing term k2
Let L12 be number of links containing both terms

Then the Link-Keyword Co-Occurrence is given by

clk-12 = L12/(L1 + L2 - L12)*1000

And computed as standard c-indices. To indicate this relates to links and keywords, I added the l and k subscripts to the notation so we preserve the overall nomenclature.


Link-URL Co-Occurrence

Let L1 be number of links pointing to url 1
Let L2 be number of links pointing to url 2
Let L12 be number of links pointing to url 1 and 2

Then the Link-URL Co-Occurrence is given by

clu-12 = L12/(L1 + L2 - L12)*1000


Documents-Email Co-Occurrence


Let D1 be the number of documents containing email 1
Let D2 be the number of documents containing email 2
Let D12 be the number of documents containing email 1 and 2

Then the Doc-Email Co-Occurrence is given by

cde-12 = D12/(D1 + D2 - D12)*1000


These estimators should help business intelligence analysts and those interested in link-building strategies. The approach can be extended to 3 or more non-mutually exclusive events.
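The three formulas above share a single shape, a Jaccard-style overlap scaled by 1000, so one helper covers all three cases. A small sketch (my own helper with invented counts, not Orion's calculator):

```python
def c_index(n1, n2, n12, scale=1000):
    """Co-occurrence index: n12 / (n1 + n2 - n12) * scale.

    n1 and n2 are the counts for each event alone; n12 is the joint count.
    The same computation applies whether the counts are links containing
    keywords, links pointing at URLs, or documents containing email
    addresses.
    """
    denom = n1 + n2 - n12
    return n12 / denom * scale if denom else 0.0

# Invented example: 40 links contain k1, 25 contain k2, 10 contain both.
print(round(c_index(40, 25, 10), 1))  # 181.8
```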

One immediate application of these c-indices is in spotting link spammers and in email research.

Though in the wrong hands, such c-index studies could serve the opposite purpose, or even more obscure ones.



Orion

Last edited by orion : 04-21-2005 at 05:57 PM. Reason: typos
Old 04-21-2005   #171
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044
Yahoo! Patent

For those who still don't get this thing about co-occurrence theory, Bill at this other forum http://www.cre8asiteforums.com/viewtopic.php?t=23819 discusses the new Yahoo! patent, in which they use co-occurrence theory and membership theory. They even use a "hawaii" example similar to the one I used at SES NY.

I see this patent as a rehash of the standard co-occurrence theory I have been using for years.

Orion

Last edited by orion : 04-21-2005 at 05:57 PM.
Old 04-22-2005   #172
xan
Member
 
Join Date: Feb 2005
Posts: 238
Quote:
Originally Posted by orion

I see this patent as a rehash of the standard co-occurrence theory I have been using for years.

Orion
Agreed Orion!
Old 06-20-2005   #173
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044
Thanks

I want to thank you all for your participation in this thread,
which now spans a great many posts.

I was planning on closing it several months ago, but due to lack
of time I was not able to. Once closed, I hope it can be used as
a crude reference on c-indices. It is now time to move on to
advanced issues in co-occurrence.

I'm upgrading and expanding the series on c-indices at my site.
What motivates me to do this upgrade is that I'm finding some
well-intentioned marketers publishing about c-indices,
co-occurrence, and even on-topic analysis without a clear
understanding of the underlying theory.

So far I have updated article 1 only. I plan to upgrade subsequent
articles. Each update will include new information and advances
others have made in the area of co-occurrence theory. I hope
you like the effort and use these concepts in your marketing mix.

Again, thank you all for participating with so much value-added
feedback and recommendations. Hope to see you around,
at a conference or event.


Keep up the hard work. Cheers,


Orion