|
#1
|
Latent Semantic Indexing
This topic has come up recently, so I thought I'd start a list of further research reading that already deals with the specific issues of Latent Semantic Indexing (LSI):
Latent Semantic Indexing at Tennessee Uni
Latent semantic indexing from NITLE
Constructing and Examining Personalized Cooccurrence-based Thesauri on Web Pages
Telcordia Latent Semantic Indexing
From Latent Semantics to Spatial Hypertext
Using Latent Semantic Indexing for Information Filtering
Thanks especially to Marcia, who provided a lot of these references. (No, I don't mean she wrote the papers.) |
|
#2
|
These are well-known research papers.
Could you provide sample step-by-step calculations of a current commercial search engine using this for indexing/ranking web pages?
Orion |
|
#3
|
Quote:
![]()
__________________
The SEO Book |
|
#4
|
Thanks, Aaron.
I rather prefer those with evidence to prove their case, so they can shine. This gives me a flashback of term vector theory (TVT). Many SEOs/SEMs merely talked about TVT in some well-known forums, but few were able to show numerically how to do the calculations or how the computations could be used by SEOs/SEMs as an optimization strategy. Even some pagerankers have talked about the "complexities" of the theory. That's why I debunked the apparent complexities of TVT schemes, since TVT as an IR technique can be used as an optimization strategy as well.
Now one can even come across other forums in which, in an effort to make a buck or two, some are selling "training programs" on these and similar advanced topics I have discussed, without having all the facts or the required background. It's a shame how easy it is to deceive the industry and its followers.
No thanks. This time I prefer that others demystify LSA/LSI in connection with search engine ranking/indexing. No more free rides!
Orion
Last edited by orion : 02-03-2005 at 10:42 PM. |
|
#5
|
Quote:
In the same way as these types of books help people, your comments help a ton of people (or at least me), but I am not certain that most people need to know the math to effectively understand what is going on. Back when I was in school I was exceptional at math, but now my math skills are less than stellar. If I studied a bit I am sure I could do a bunch of the math, but the fact that I have not done it yet does not necessarily make my knowledge inadequate. Quote:
I also do not think the average webmaster has to fully understand many of the more advanced topics of SEO to be able to do a decent job of promoting their sites. The truth is that Yahoo! and MSN are still super easy to rank in. Google takes a bit more spend or effort, but I think it is still possible to rank decently with a limited knowledge of many of the deeper concepts. If it is such a shame that so many people are misled, then maybe there should be a well-known database of information and examples. I do think that forums like these help lots of people out (including me).
__________________
The SEO Book Last edited by seobook : 02-03-2005 at 11:20 PM. Reason: I am the most worstestest at grammer ;) |
|
#6
|
Well said, Aaron and for that I applaud you.
Cheers. Orion |
|
#7
|
Quote:
If LSI is important now, then it's important for SEOs and webmasters to perhaps research the topic a little and draw their own conclusions from their own observations of movements in the SERPs. Is that something you disagree with? |
|
#8
|
Quote:
(see http://www.cre8asiteforums.com/viewtopic.php?t=593 from Dec 2002)
Orion sensibly questions whether search engines are going to bother with doing all this work to build an LSI index. However, Google already bought such an index, ready-built and functional, when acquiring Applied Semantics. The immediate value was for AdSense, of course. But they acquired an expert staff in the deal. Having paid for it, do you think they'd fail to use it, at least at some level?
Better yet, Google don't even need to actually do much more processing than they already do. You already know that their supplemental results are pages that haven't been spidered yet, but are deemed relevant just because of the anchor text in links to the URL. Well, now just look at the backlinks to a page and you find a whole raft of semantically equivalent terms, but in small, link-text sized chunks that are easy to process. In other words, apply LSI purely to the anchor-text analysis algos, rather than to the full-text page index.
As a side benefit of that method, you can detect where backlinks don't have a typical level of variety of terms in their anchor text, and suspect that the links are artificial or auto-submitted.
See also: Using Semantic Analysis to Classify Search Engine Spam (pdf)
Incidentally, the opposite approach to Latent Semantic Indexing is to look at human-assigned, or 'active', semantic alternatives, which include directories, hubs, and other human-assigned relationship methods. Take a look at this group-effort project and see specifically who's involved and what is cited.
Last edited by Black_Knight : 02-04-2005 at 08:32 AM. |
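To make the anchor-text variety idea above a little more concrete, here is a minimal Python sketch. It is purely illustrative and not a description of how Google or any other engine works: the choice of Shannon entropy as the "variety" measure, the function names, the sample anchor lists and the `min_entropy` threshold are all assumptions made for the example.

```python
import math
from collections import Counter

def anchor_text_entropy(anchors):
    """Shannon entropy (in bits) of the anchor-text distribution.

    Low entropy means most backlinks use the same wording.
    """
    counts = Counter(a.strip().lower() for a in anchors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_artificial(anchors, min_entropy=1.0):
    """Flag a backlink profile whose anchor text has suspiciously little variety.

    min_entropy is an arbitrary, hypothetical cut-off chosen for this sketch.
    """
    return anchor_text_entropy(anchors) < min_entropy

# Hypothetical backlink profiles for two pages.
natural = ["cheap flights", "flight deals", "Acme Travel",
           "book airfare online", "click here", "cheap flights"]
uniform = ["cheap flights"] * 6

print("natural:", round(anchor_text_entropy(natural), 2), looks_artificial(natural))
print("uniform:", round(anchor_text_entropy(uniform), 2), looks_artificial(uniform))
```

A real system would presumably compare each page against what is typical for similar pages rather than against a fixed cut-off, but the sketch shows why identical anchor text across many links is cheap to spot.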
|
#9
|
This thread has additional information on whether or not the implementations are feasible for a commercial search engine. http://forums.searchenginewatch.com/showthread.php?threadid=4009
Some folks in AI research have joined the discussion. On a personal note, in AI/IR, at least at the university research level, there are actually two schools of thought: those that favor LSA and those that do not. I have colleagues who don't buy LSA at all, and I am in communication with pioneers in the field who strongly defend it. As things stand now, a combination of techniques is the way to go. Personally, I see LSA as another value-added tool, but not the only tool. On-topic analysis and fractal semantics (the two fields I love the most) are just other value-added tools as well. There are some things one cannot do with one tool but can do with the others.
Orion
Last edited by orion : 02-04-2005 at 06:10 PM. Reason: typos |
|
#10
|
Quote:
That's not how the Scientific Method works. My recommendation is: use the Scientific Method as briefly described in post #29 of this SEWF thread http://forums.searchenginewatch.com/...4&page=1&pp=40
Anyone can "discover" old reference literature and quote it. And? Still, it proves nothing. Consider the case of the old javelina papers: all talk and a few old graphics. When it was time to present numerical calculations, poof! No computations at all from the "authors". I still want to see calculations showing how a commercial search engine successfully implements LSI for 8 billion documents.
Please, let's not be blind followers. Test, test, and test. Use the Scientific Method until your opinions become hypotheses and your hypotheses turn into a theory.
Orion
Last edited by orion : 02-04-2005 at 08:42 PM. |
|
#11
|
Hey Brian - much thanks for the input! I always do appreciate your comments here in the forums, and there are lots more resources bookmarked than those listed; but the critical, pressing issue for us all is how to implement the wisdom we glean from reading these papers, on a practical, utilitarian level.
How do we adapt sites to the possibility of ranking by semantically accurate, focused and relevant criteria? The theoretical, academic and scientific side is one thing; the practical application of the principles, implemented from a primarily commercial standpoint, is quite another matter and, as you are undoubtedly aware, is worthy of serious consideration. Quote:
Being a fan of Susan Dumais, which goes without saying: http://www.marciahoo.com/archives/20...mais-lsi-find/
IMHO we need to look into this whole topical area from an objective, pragmatic viewpoint as well as from an intuitive, subjective level of analysis, as it relates to the practical application of the principles on an empirical level. What it may come down to in the final analysis is whether SEO is an objective science or a subjective, intuitively discerned art, per NFFC, who states: SEO is an ART. I concur.
How about you? Which is it? SEO: Science or Art? In the final countdown, which will prevail?
Last edited by Marcia : 02-05-2005 at 04:21 AM. |
|
#12
|
Quote:
Certainly SEOs need to be cautious about rushing into judgements, but the experienced ones usually are. I acutely noticed after Florida how a lot of the experienced SEOs kept their mouths firmly shut. It was left to young upstarts like myself to evangelise Hilltop: a very old paper to be sure, but with concepts that appeared to match some post-Florida and Austin observations. Hilltop didn't provide solutions. It provided a template set of hypotheses SEOs could evaluate their observations against, as a way of devising useful strategies. Getting links from separate C class IP ranges is now standard SEO link-building practice, and the concept of site “authority” is now commonly discussed.
Commercial SEOs try to be aware of what actual research developments are in place that search engines can potentially utilise. Forewarned is forearmed. That's why Mike Grehan gives such a large amount of room to search R&D in his "Search Engine Marketing: The Essential Best Practice Guide", and that's why we're grateful to him for that.
However, SEOs are not usually privy to the contemporary scientific applications of information retrieval theory, as used by commercial search engines. We can try and form ideas of what may be happening, based on our own observations, but we remain speculative at best. For quantitative certainty you would have to be working in the engineering dept of the search engines themselves.
Search engines like Google appear to employ a number of methodologies, and rigorous investigation of how any single one may be applied is going to be hampered by the large number of uncontrolled variables, coupled with the sheer dynamism of the ever-changing search engine world. Start a six-month study now, and in six months' time the conclusions would be in danger of being out of date and irrelevant.
After all, Quantum ElectroDynamics is one of the most accurate scientific applications of theory to real world observations. Can you imagine how far QM studies would get if the Planck Constant was modified and tweaked every few weeks? Or if new constants were periodically added? How about if wave equations updated every few months?
In SEO we have our ever-changing observations, and we try and match them to an ever-increasing knowledgebase of what may and may not be applicable to any such observations. It *is* scientific method, but SEO can never be anything but a very reactive practice, responding to the application of the actual search engine theory from the labs of the search engines. Is LSI actually being applied right now? Only Google can say. Is it important for SEOs to be aware of such research? Absolutely.
Last edited by I, Brian : 02-05-2005 at 04:48 AM. Reason: Latent Semantic Corrections. :) |
|
#13
|
Quote:
Quote:
If it wasn't for reasons of furthering their capability of semantic analysis/indexing, then what the heck else was it for? Google may be on the cutting edge of *some* aspects of search, but do they have it all down? Who, ultimately, is to judge? You, us or "them", for God's sake? Who is it that ultimately judges the quality of search results in the long term? |
|
#14
|
Quote:
|
|
#15
|
I agree with the use of the scientific method. Taguchi methods lend themselves to this type of analysis. But it takes a lot of time and data.
While I don't discount the posted speculation, and it is a great read, I wonder if something simpler is going on here. I have not had any changes on my sites, so I can only go by what others have mentioned. I read on another forum that the poster saw drops in ranking only on sites where he knew he had site-wide inbound text links. Now that is an easy algorithm for Google and makes some sense. Anyone else seeing the same? |
|
#16
|
Thanks, Nacho. The Scientific Method starts with the gathering of observations before even trying to formulate a hypothesis or a theory.
I have seen many SEOs doing this the other way around. As for how easy it is to deceive other SEOs/SEMs, consider the case of many in the field of keyword research or even link building. Anyone can invent out of thin air a keyword research or link model based on mere opinions and flawed "research" and lead others into deception. Then the very same blind people end up questioning their "success".
As for acquisitions, let's keep in mind that business decisions do not always go hand in hand with research. There is more under the hood.
As for LSI, a dimensionality reduction technique, I still want to see someone come up with step-by-step computations before I provide tutorial samples or how-tos on the subject. I also want to see step-by-step calculations showing that these have been successfully implemented in a commercial search engine with an index as large as Google's.
True, there is research pointing at some of the successes of LSI. A good example is NIST and the TREC9 tests. However, when one drills down into the research one finds that
1. large queries are used (how do 20, 30 or 50 terms in a query sound to you?)
2. relatively small database collections are used (a few million)
3. unambiguous terms are used (terms with precise meanings)
4. no vested interests, commercial noise, business strategies, spam or SEO trickery are present.
Compare that with commercial SE collections, in which
1. small queries are the norm, often consisting of 2 or 3 terms
2. large databases are used (Google, around 8 billion plus)
3. ambiguous terms can be used (more often than you think)
4. vested interests, noise, business strategies, spam or SEO trickery are often present.
I can give anyone many valid reasons or discuss many limitations of LSI in commercial systems. Here is just one.
In the Local Context Analysis paper, discussed in the LCA thread, Bruce Croft (Distinguished Professor and Chair, Department of Computer Science, and Director, Center for Intelligent Information Retrieval, University of Massachusetts, Amherst) writes:
2.2 Dimensionality Reduction
"….Despite the potential claimed by its advocates, retrieval results using LSI so far have not shown to be conclusively better than those of standard vector space retrieval systems. As with term clustering, word ambiguity is also a problem with dimensionality reduction techniques. If a query term is ambiguous, terms related to different meanings of the term will have similar reduced representations. This is equivalent to adding unrelated terms to the query."
So, using LSI as a term discovery technique, or even for assessing semantics and relevancy, is problematic with ambiguous terms. LSI, being a concept matching technique (as opposed to query string matching), fails here.
I can provide you all with more reasons or cases where LSI cannot be successfully implemented, but this one should be enough to put the above posts in the right perspective. For those interested in the subject: query-sensitive similarity measures and fractal semantics seem more promising.
Orion
Last edited by orion : 02-07-2005 at 05:04 PM. |
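For anyone who wants to see the ambiguity problem in numbers, the following is a toy Python sketch of LSI as a truncated SVD. It is not the step-by-step commercial calculation asked for in this thread, and nothing in it comes from a real search engine: the five terms, four documents, raw-count weighting, the choice of k=2 and the helper names are all assumptions made for illustration, and it assumes NumPy is installed.

```python
import numpy as np

terms = ["jaguar", "car", "engine", "cat", "jungle"]
# Toy term-document matrix: rows are terms, columns are documents d1..d4.
# d1, d2 are about cars; d3, d4 are about animals; "jaguar" is ambiguous.
A = np.array([
    [1, 0, 1, 0],  # jaguar
    [1, 1, 0, 0],  # car
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # cat
    [0, 0, 1, 1],  # jungle
], dtype=float)

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def lsi_similarities(A, query, k):
    """Score a query against every document in a rank-k LSI space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k, :].T            # one k-dimensional row per document
    q_k = (query @ U_k) / s_k       # fold the query into the same space
    return [cosine(q_k, d) for d in docs_k]

# One-term query: "jaguar".
query = np.array([1, 0, 0, 0, 0], dtype=float)

raw = [cosine(query, A[:, j]) for j in range(A.shape[1])]
lsi = lsi_similarities(A, query, k=2)

print("raw term matching:", np.round(raw, 3))
print("rank-2 LSI       :", np.round(lsi, 3))
```

With plain term matching, d2 and d4 score zero because they never mention "jaguar". In the reduced space both get a clearly positive score, and the car and animal documents become indistinguishable for this query, which is the effect the Croft quote describes as equivalent to adding unrelated terms to the query.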
|
#17
|
I think the argument about whether LSI is or is not applied is missing a wider picture.
Whether or not any particular paper has been implemented is not the key issue - it's whether we can read up on such papers and learn something of how Google may or may not be applying specific concepts to search filtering/ranking - in the now, and in the future.
Go back to a year ago and we could have had ourselves a great 200+ thread arguing whether Hilltop, for example, had been applied at Florida. Yet as I tried to indicate previously with the Hilltop example, the point was never about whether such an old paper had been applied exactly, as much as whether there were concepts involved in the paper that appeared to match post-Florida observations - concepts that could be useful to know, and react to.
The same with LSI - the papers should be read by serious commercial SEOs, as with general search engine theory, because it's important to be familiar with the background of the field. SEOs are affected by IR theory, whether they like it or not. That was my single point here. |
|
#18
|
Quote:
Orion |
|
|