Old 10-21-2004   #1
orion

Temporal Link Analysis

The general analogy, in which link citation is treated as analogous to literature citation (e.g., Garfield's Impact Factors), makes no sense. Literature citation is driven by peer review and editorial policies. On the Commercial Web, where anyone can buy or sell links, add or delete links at will, link citation is mostly driven by commercial and vested interests and strategic alliances of all kinds.

It can be argued that link-based models or marketing strategies based on the above analogy are questionable. It can be demonstrated that the dynamical nature and value of literature citation and link citation are completely different. Back in 2002, an Italian group presented a generalized Web Page Scoring System (WPSS) framework in which the dynamical nature of links and the Web was taken into consideration. Back then I wrote a non-technical review of this paper and of another paper on the futility of link tools and link-based metrics, and a few SEOs were quick to react without knowing all the facts.

The WPSS paper points out several theoretical flaws embedded from the start in link-based models, mostly because of the temporal nature of the Web and the fact that web traffic consists of two components: random (by chance) and deterministic (not by chance). A link model in which a user is modeled as a pure random walker (or a pure deterministic walker) does not match the reality and experience of average web surfers. The same can be said about models in which only one atomic action from the user is taken into consideration (e.g., "users don't click back"). To sum up, while under controlled IR lab conditions links may be a measure of citation importance (votes), on the commercial Web this is most likely not the case.

Despite the fact that the Web is a dynamical system, few works have been published with regard to the temporal behavior of links. In 2003, during a presentation at Haifa, IBM researcher Einat Amitay discussed Temporal Link Analysis. Her presentation was enlightening:

"In fact, a journal will be considered more prominent the higher its citation half-life is (i.e., how old in years are most of the papers currently cited in the literature that were previously published in this journal). Combined with another measure called impact-factor (the frequency with which the average article in a given journal has been cited in a particular year), libraries determine the value of a certain journal to their collection. Since the value of journals can change over time, this evaluation is carried out in many libraries on an annual or bi-annual basis. Furthermore, authors learn about the importance of their acceptance to a journal or the citation of their work in a certain journal based on such evaluations.

In contrast, when plotting similar measures for citations on the Web, the reverse behaviour is exhibited: the more time passes the more citations a page receives. Furthermore, unlike the publications studied in co-citation analysis, pages on the Web are modified and updated with respect to real world events. There have been numerous attempts to make use of time to predict trends on the Web. However all of those studies emphasised the detection of the change itself and not the temporal nature of the data studied. None of these studies looked into how to incorporate time into the processes that are currently used for ranking web pages, computing link-based measures of site popularity, and link analysis in general. In fact, to the best of our knowledge, the Web Information Retrieval community has never proposed such a temporal approach.

In this talk I will discuss several aspects and uses of temporal data in the context of Web IR. The main contribution of this work is first and foremost in raising the issue of utilizing the time dimension in the context of link analysis. I will demonstrate the benefits of this approach by showing how we incorporated this additional dimension into two applications. The first application measures the activity within a topical community as a function of time. The second application is an adaptation of link-based ranking schemes that captures timely authorities, the authorities that are on the rise today and should be ranked over the resources of days past."

End of the quote.

Let's discuss Temporal Link Analysis in the context of business intelligence and search engine marketing strategies.

Orion

References
Temporal Link Analysis (research paper)
http://techunix.technion.ac.il/~uriw...k_analysis.pdf

Temporal Link Analysis of Linked Entities (USPTO patent)
http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PG01&s1=20040128273.PGNR.&OS=DN/20040128273&RS=DN/20040128273

A paper somewhat related (from the on-topic standpoint)
Knowledge Encapsulation for Focused Searches from Pervasive Devices
http://www10.org/cdrom/papers/436/

Old 10-21-2004   #2
rustybrick
Orion,

I think you have changed the level of what an SEM forum should expect.

Outstanding, thought provoking and enlightening posts.

Now, it would be very interesting to see test cases and how much they differ from the current algorithms. In fact, I can see this working well on the Web, as it states, but even better on old, ancient works, including the Old Testament, Roman philosophy, etc. Not that I am big in those areas.
Old 10-21-2004   #3
orion

Quote:
Originally Posted by rustybrick
Orion,

I think you have changed the level of what an SEM forum should expect.

Outstanding, thought provoking and enlightening posts.

Now, it would be very interesting to see test cases and how much they differ from the current algorithms. In fact, I can see this working well on the Web, as it states, but even better on old, ancient works, including the Old Testament, Roman philosophy, etc. Not that I am big in those areas.
Thanks, Rusty.

Cases:

1. See References.
2. Time series for links: Construct a time-series control chart at two sigma levels. Plot y vs. x, where y = link popularity and x = time (days, weeks, months, years, etc.).
3. Time series for co-occurrence: As above, but using y = c-index calculated for titles, allintitle counts, anchor links, etc.

This should reveal trends; a rough sketch of case 2 follows.
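Below is a minimal sketch of case 2, assuming Python with numpy (my choice, not anything prescribed above); the weekly link counts are invented for illustration, and the control limits are simply the sample mean plus or minus two sample standard deviations.

[code]
# Hypothetical sketch: a two-sigma control chart for weekly link-popularity counts.
# The data below is made up for illustration only.
import numpy as np

# y = link popularity (e.g., new inlinks observed per week), x = week index
y = np.array([12, 15, 11, 14, 13, 16, 12, 40, 15, 14], dtype=float)

mean = y.mean()
sigma = y.std(ddof=1)        # sample standard deviation
upper = mean + 2 * sigma     # upper control limit (two sigma)
lower = mean - 2 * sigma     # lower control limit (two sigma)

for week, value in enumerate(y, start=1):
    flag = "SPIKE" if value > upper or value < lower else ""
    print(f"week {week:2d}  links={value:5.1f}  {flag}")

print(f"mean={mean:.1f}  UCL={upper:.1f}  LCL={lower:.1f}")
[/code]

The same chart with y replaced by a c-index series covers case 3.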

Orion
Old 10-21-2004   #4
Mike Grehan
Member
Orion,

Thanks for the heads-up on this thread.

I'm a huge fan of Einat Amitay. She went to university in Edinburgh, which is the next major city north of where I live. And she was hugely helpful with the research work for my second edition.

In fact, both she and Ellen Spertus (once winner of the sexiest geek alive award, believe it or not!) have carried out remarkable work.

Temporal tracking is something I'm covering more in depth in the next edition. I stumbled upon the problem in Web link bibliography when I spoke to Craig Silverstein at Google some time ago. We were discussing the subject of the most popular movies. At that time, Titanic was the most popular movie ever... but you would expect to get Lord of the Rings for that search now, regardless of the ancient linkage data.

I do believe that, with the use of support vector machines (learning machines), some of the problems I highlighted in a recent article I wrote about evolving networks, as they relate to the Web, will become less of an issue.

But as I'm currently finishing a report for a client right now... I'll have to come back to this.

As ever, superb topic for discussion by the way. Wish I could hang around longer!

Cheers!

Mike.
Old 10-21-2004   #5
orion

Thanks, Mike, for stopping by. I know you are a very busy person.

Quote:
Originally Posted by Mike Grehan
...and Ellen Spertus (once winner of the sexiest geek alive award, believe it or not!) ...
1. So I guess Kim Komando no longer holds the title of the geek babe. That's sad.

2. The example you give about movies is similar to the one given in Amitay's paper; i.e.,

"For example, a concept like “Monica Lewinsky” yielded completely different results during and after Bill Clinton’s presidency. During Clinton’s presidency, most of the top 100 results from the major search engines were related to the news item itself and the opinions and buzz it created. After President Clinton left office, most of the top 100 documents returned were (and still are) about the jokes, humour, and folklore the event created within and outside the USA."

Point 2 suggests looking at link bombs before and after certain events or periods (e.g., changes in results for "miserable failure" or "kerry waffles" before and after election day; "who's your daddy" before or after the 2004 New York-Boston seventh game, etc.).

Now a bit more serious:

The DIP

According to the paper, a DIP for a concept C is computed by submitting a query describing C to a search engine, which returns a set of n pages, P. Thus P(n, C). For each of the top n pages, the pages linking to it are determined and their "last modified" values are checked. These inlinking pages and their last-modification dates form the dated inlinks profile (DIP) for the concept C. This computation is carried out over time and a DIP-time curve is obtained. The results are normalized (from 0 to 1) to properly detect large and small changes.

Thus, according to the paper, the DIP considers the following (a rough sketch of this bookkeeping appears after the list):

1. The dates when every page was created and last modified. Every time a page is crawled, the crawler checks its HTTP "last modified" header field (see below). If this information is not available but the engine's repository detects that the page has changed since the last time it was crawled (for example, by the methods cited in the paper), the page's date of last modification is set to the date of the crawl. In particular, this procedure sets the page's date of creation when the page is crawled for the first time.

2. The date when a page was detected as deleted. This date is set, for example, when receiving 404 codes for previously seen pages, or when a page cannot be accessed for long periods of time.

3. Dates of creation and deletion of links. In the ideal implementation, the search engine should track the additions and removals of hyperlinks in each page, and tag creation and deletion dates to the links in a similar manner to that described above for the pages.
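As a concrete illustration of points 1 and 2, here is a minimal Python sketch (my own, not taken from the paper) of the "last modified" bookkeeping: for each inlink URL it records the Last-Modified HTTP header when the server provides one, and falls back to the crawl date otherwise. The URLs are placeholders.

[code]
# Minimal sketch of the date bookkeeping described above (not the paper's code).
# For each inlink we keep its Last-Modified header if present, else the crawl date.
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def dated_inlink(url):
    """Return (url, timestamp), using the Last-Modified header when available."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            last_modified = response.headers.get("Last-Modified")
            if last_modified:
                return url, parsedate_to_datetime(last_modified)
    except Exception:
        pass  # e.g., a 404 for a previously seen page (point 2 above)
    return url, datetime.now(timezone.utc)  # fall back to the crawl date

# Placeholder inlink URLs pointing to the top-n pages for concept C
inlinks = ["http://www.example.com/a.html", "http://www.example.org/b.html"]
dip = [dated_inlink(u) for u in inlinks]  # entries of a dated inlinks profile
for url, stamp in dip:
    print(stamp.date(), url)
[/code]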

Questions

1. What is your take on the following: "where search engines trace and store temporal data for each of the pages in their repository"? Do you think it is possible to detect and expose significant events and trends?

2. How do you think a DIP curve could be gamed?

3. What do you think about DIPs as a homeland security tracking tool? Check Figure 4, where (I should say) "dipping" was applied to the query

Ussama/Usama/Ossama/Osama Bin Laden

Similar curves can be obtained with c-index values in which term co-occurrences are measured over time. A huge spike in on-topic co-occurrences could suggest anomalous activity. In general, any time series that shows significant spikes is trying to tell the observer something.


Orion

Old 10-22-2004   #6
Nacho
Orion,

Once again you take us to the next level in link analysis and help us keep exploring new frontiers. Thank you!

Quote:
1. What is your take on the following: "where search engines trace and store temporal data for each of the pages in their repository"? Do you think it is possible to detect and expose significant events and trends?
Yes, but only by replicating the process and matching the results. In fact, once mastered, it could be used as an ideal methodology for forecasting events and trends (such as the next recession in a country's economy, for example).

Quote:
2. How do you think a DIP curve could be gamed?
Yes, I do think it can be gamed; however, it would require the webmaster/marketer to invest very long periods of time (years), to the point that it is no longer a "quick buck" game but rather turns them into a new competitive player in the particular industry in question.

Quote:
3. What do you think about DIPs as a homeland security tracking tool?
Fascinating. It could help detect and forecast future alerts in time.

Now it's time for bed for me. I look forward to hearing from you tomorrow.

Saludos!
Old 10-22-2004   #7
projectphp
What The World, Needs Now, Is Love, Sweet Love
Quote:
On the Commercial Web, where anyone can buy or sell links, add or delete links at will, link citation is mostly driven by commercial and vested interests and strategic alliances of all kinds.
You had me until the word "mostly". I doubt the truth of such a qualifier, given the massive number of pages and sites.

The problem with this sort of link analysis is that it costs overhead, both in enlarging DB sizes and in processing time per page.

The question therefore becomes, in my head at least: is the extra time required to check a page's last-modified date (when headers are not offered) a better way to spend that time, or would JavaScript parsing be more effective for most searches? Ditto a better crawler that can parse alternate filetypes, or fill in forms and utilise cookies.

Quote:
Let's discuss Temporal Link Analysis in the context of business intelligence and search engine marketing strategies
In terms of SEM strategies, if this is a future direction, then the best links are long-term, stable links on big sites. Archives become vital, and places that archive articles become huge. A link on a syndicated article that is forgotten about for years grows increasingly more valuable by the day.

Similarly, good old-fashioned PR (public relations) takes on great importance, as a long-lasting comment in the press, archived forever and a day, gets increasingly better.

It also makes the highly charged issue of the sandbox a killer. New sites will struggle big time if this is applied uniformly.

IMHO, this sort of analysis will work brilliantly for one subset of searches and be the worst thing ever for another. The trick will be for SEs to create semantic filters to work out exactly how much such factors apply at query time. Different queries for different searches, using different combinations of factors weighted differently.
Old 10-22-2004   #8
Nacho
The other problem I see with temporal link analysis and DIP, in comparison with literature citation, is that the time a page was first created and the time it was first crawled could be anywhere from a day to maybe years apart, unless ALL programmers were to insert a "creation date" into every page as a standard. However, not all programmers follow every coding standard. Therefore, if one didn't add this "creation date", what would search engines do? Would the search engine's crawler have to consider its "creation date" the same as its "first index date"? If so, what happens to all those edits and links taken in and out of those pages; would there be a way to pass the search engines a log of the entire build history? And YES, I agree with projectphp: dbase sizes will SUPER size, so what are the potential problems with that?
Old 10-22-2004   #9
orion

Hi, Nacho and projectphp. Thank you for stopping by.

I hope this helps.

Nacho

1. I'm fascinated and at the same time afraid of a search engine technology that can store all pages with a tracking history back in time. Imagine that goldmine of data in the hands of marketers, spammers, or the government! True, it could be used for monitoring trends. DIP curves are used mostly for looking back in time for trends and spikes of information. But you are right, Nacho: the technology can be used to predict trends. They provide several examples (e.g., the Harry Potter movie release).

2. DIP curves are based on "last modified" stamps. If no stamp is available, the crawl's date stamp is used. These are the points gamers can target: fake the stamps or deceive the crawler.

3. "Dipping Osama". See Figure 4 of the Temporal paper. A lot of research is being conducted in co-words and topic analysis for identifying trends and usage of word patterns. These are used for monitoring sites of high activities (chat rooms, discussion forums, etc).

4. Date/time stamp concerns. See below.

Projectphp

1. When I use the expression "Commercial Web" I mean documents/sites driven by commercial interests, not every document or site on the WWW. Excluded are online documents that are normally found through scientific, academic, and government IR systems. The problem with most link-based models is that they work fine on documents free from commercial noise but fail miserably in the presence of this noise. So, I stand by "mostly" in that context.

2. Overhead. Check Section 3 of the Temporal paper and post #1. To build a DIP curve you only need to collect the top N ranked pages, capture the "last modified" stamp while crawling the pages (or use the default crawl date/time stamp), and then conduct the analysis. This is already done by some crawlers and is even less time-consuming than crawling all meta tags, links, and document content. You only need to add a new instruction to the current crawler. Analysis and DIP curves can then be constructed offline.

3. I do agree with the rest of your post.


Challenging Questions

1. How do you think this could impact filesharing applications?
2. How could this be used to track topic-focused online communities?


Orion

Old 10-22-2004   #10
Nacho
3. How would this impact allowable content duplication (eg. press releases)?
Old 10-22-2004   #11
carpediem
Member
Our site has been doing great over the past 3 1/2 years and has continued to get more links each month. We started a new site about 2 years ago, and about 12 months ago that site started doing very well, even though the original site had more links (and of higher quality). Then about 4-5 months ago we saw a competitor start doing very well (their site is about 14 months old). I have been trying to analyze why this (relatively) new site has done so well. I have had a lot of difficulty identifying reasons why they have started to rank better than us. We clearly have many more links (and better quality), and we seem to have similar on-page optimization.

One idea that popped into my head when analyzing this was the fact that they had a much higher rate of obtaining new links. I quickly abandoned this idea since I had no idea how Google would implement this in the algo and why it would help produce relevant results. But this thread brought up the idea again and explains why the rate of links could possibly lead to relevant results.

I am not proposing that this is already in effect, and it really doesn't matter if our situation is applicable or not (I highly doubt this is the reason for anything going on right now). Regardless, I wanted to see if people thought that the rate of gaining/losing links could have any effect on things.

The above posts seem to suggest it could be applied at some point. Mike has mentioned the fact that the longer a site is around, the more links it should obtain naturally. So maybe Google expects an increase of x links per month for sites that have attained a certain status.
Old 10-22-2004   #12
orion

Quote:
Originally Posted by carpediem
....the longer a site is around, the more links it should obtain naturally.
The rich-get-richer concept in materials science can be traced back to the 70s, when a diffusion-limited aggregation (DLA) model was proposed by Witten and Sander (U of Michigan) for the growth and evolution of naturally occurring phenomena.

The DLA model incorporates a random walker sticking to a growing, tree-like fractal pattern. The growth rate does play a role.

Immediately after the publication of the DLA model, in the 80s and 90s, physicists found that crystal growth and cluster aggregation tend to mimic the effect in which rich growing branches get richer and poor branches die away. Many random processes are governed by this natural "principle". Back then I participated in several conferences on the subject.

The rich-get-richer phenomenon is based on probability averages taken from randomly selected samples. Just because a site is old does not mean that it will get rich. Since the effect is based on probabilistic averages, there will be cases in which many old sites do not get richer. A toy illustration follows below.
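To illustrate that last point, here is a toy Python simulation (my own sketch, using a simple preferential-attachment rule rather than the DLA model itself): each new link attaches to an existing site with probability proportional to its current inlink count plus one, so a few old sites end up rich while other equally old sites stay poor.

[code]
# Toy "rich get richer" simulation: 10 equally old sites, 1,000 new links added
# one at a time, each attaching in proportion to a site's current inlink count + 1.
import random

random.seed(7)
inlinks = [1] * 10                          # ten old sites, one link each
for _ in range(1000):
    weights = [count + 1 for count in inlinks]
    chosen = random.choices(range(len(inlinks)), weights=weights)[0]
    inlinks[chosen] += 1

for rank, count in enumerate(sorted(inlinks, reverse=True), start=1):
    print(f"site rank {rank}: {count} inlinks")
[/code]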

Orion

Old 10-23-2004   #13
orion

Quote:
Originally Posted by Nacho
3. How would this impact allowable content duplication (eg. press releases)?
Good question, Nacho. Let A, B, and C be different domains, each one with a document D of the same content. D has a "last modified" stamp.

Link A >> http://www.domainA.com/D.html
Link B >> http://www.domainB.com/D.html
Link C >> http://www.domainC.com/D.html

This is different from saying that A, B, and C point to D; i.e.,

http://www.domainD.com/D.html

Since they construct DIP curves by mapping links to date/time stamps, I'm inclined to think that individual URIs count for each site when someone points to them.

Orion

PS Sorry for too much editing. Lack of sleep

Old 10-23-2004   #14
rustybrick
Orion,

I read the research paper just a few hours ago. I was hoping that you could give me your understanding of Section 4.3, "Tracing Concepts Over Time". I got lost each of the two times I read that section.

As for the rest of the paper, I think it would really work wonders. Based on the results of the preliminary tests conducted in the study, the "Timely Authorities" results seemed much more relevant to me.

It is very interesting how you can watch a pattern on a specific topic of interest fluctuate over time.

Also, wouldn't you feel that storing all the linkage data over time would be very costly? I mean, storing the date of all inlinks found, the past inlink dates, the topics and communities they belong to, etc. Of course they mention many of the challenges with using the header to determine the last updated date, but even so....
Old 10-28-2004   #15
orion

The way I understand TLA is as follows.

Let P be the set of the top n pages retrieved by submitting a query associated with a concept. Let Q be the set of q pages linking to the top n pages. Thus, we have the two sets

P = {p1, p2, p3, ..., pn}
Q = {u1, u2, u3, ..., uq}

Each u is a timestamped URL linking to pages in P. One constructs a DIP curve consisting of the points (u, t), with u normalized and associated with a time range and time interval. Consider a range from 1995 to 1998 with intervals of one year, and let's assume that for the query associated with the "hotel" concept we have

Range: From Jan 1995 to Dec 1998 in increments of 1 year

Jan 1995-Dec 1995: 10 timestamped links point to pages in P
Jan 1996-Dec 1996: 10 timestamped links point to pages in P
Jan 1997-Dec 1997: 20 timestamped links point to pages in P
Jan 1998-Dec 1998: 60 timestamped links point to pages in P
Total = 100 links

Thus, 10/100, 10/100, 20/100, 60/100
and SUM = 0.1 + 0.1 + 0.2 + 0.6 = 1.0

The DIP curve is given by the normalized (u, t) points (a short code version of this example follows them):

(0.1, 1995), (0.1, 1996), (0.2, 1997), (0.6, 1998)
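The same worked example as a short Python sketch; the yearly counts are the illustrative numbers above, not real data.

[code]
# Normalize yearly counts of timestamped inlinks so the DIP values sum to 1.
counts = {1995: 10, 1996: 10, 1997: 20, 1998: 60}
total = sum(counts.values())                 # 100 links in the range

dip_curve = [(count / total, year) for year, count in sorted(counts.items())]
print(dip_curve)  # [(0.1, 1995), (0.1, 1996), (0.2, 1997), (0.6, 1998)]
assert abs(sum(value for value, _ in dip_curve) - 1.0) < 1e-9
[/code]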

I don't see an overhead issue here, since (a) the data is already public or can be collected as specified by a user, and (b) the computation is done offline.

The disjoint technique looks to me like a method for resolving important portions of the response curve. We could do the same using time-decomposition/delay techniques from nonlinear dynamics. Adding time as a variable opens the door to the injection of nonlinear dynamical tools into link models.

A few days ago, Dr. Einat Amitay, a co-author of the TLA paper, sent me new work the group is working on. This new work is about to be published and shows new temporal trends and developments that change the concepts first described in their previous paper.

I sent back some questions on this new work since, honestly, there are some aspects I don't understand or that are not clear to me, and I don't want to speculate. I'm waiting for feedback. Since this new work is not yet published, it would not be ethical for me to comment without their permission or before the publication date (however, we can discuss things already published).

This is outstanding work, and I can see many applications of TLA in Web analytics, intelligence, and marketing.

Orion

Old 11-01-2004   #16
rustybrick
Orion, thank you for clarifying Timely Authorities.

And I am excited to see Dr. Einat Amitay's new work. I was thinking of visiting her this weekend while in Israel, but she wouldn't want to talk to me.
Old 11-09-2004   #17
orion
Conspiracy Theories and TLA

This is a recap of Temporal Link Analysis. In the TLA model:

1. First, a query about a concept C is submitted to a search engine.

2. Next, the set P consisting of the top n ranked pages is collected. So,

P = {p1, p2, p3,.... pn}

3. Now the set Q, consisting of the q URLs across the Web that point to pages in P, is collected. So,

Q = {u1, u2, u3, .... uq}

4. Each URL is a timestamped link. This timestamp data is readily available from the HTTP headers, the document itself, or the date of the crawl.

5. For a given time range divided into specific time intervals, one constructs a curve of the number of timestamped URLs vs. time. The y-axis (# timestamped URLs) is normalized to run from 0 to 1.

The resultant curve describes the time evolution of the number of links associated with the queried concept C. This curve does not tell us how relevant the individual URLs are. This is important, since one can see several sites and forums spreading speculative and conspiracy theories (sandbox, BLOOD, TLD vs. TLA, etc.) about temporal link analysis and about the age of links.

To illustrate, the SOCEngine (SeoSurvey) site, in the article "The Sandbox, the March Filter & BLOOD vs. TLD", misquotes this thread by stating:

"This argument is the exact opposite of the theory described by Dr. Garcia (Orion) at SearchEngineWatch in a thread titled - Temporal Link Analysis - which claims that the most relevant links are those that are new, fresh or on pages that are frequently updated."

This quote is unfortunate. Not only have such claims never been made in this thread, but they are incorrect and cannot be found in the original TLA paper. For the record, the only public statements on temporal link analysis I have made outside this thread are found in a short note I wrote to Rusty (Barry) at the SeoRoundTable site. I'm reproducing that post below.

--------------

"Temporal Link Analysis

I want to expand on the statement

"The basic premise is that the more often AND the more recent those citations are, the more important the journal is."

It is true that the more time passes, the fewer hard copy citations a paper receives. Unlike hard copy citation, the more time passes the more link citations a Web page receives.

This is one of the reasons that make the literature-link citation analogy a fallacy. Of course, there are other reasons that deal directly with the commercial "intention" and "perception" of link citation and reduce the above analogy to a caricature of reality.

Since the inception of PageRank on the Web scene, several SEO "experts", SEM "discussion" forums, and marketing firms with vested interests have used the analogy merely as a sales point for their products and services. Even some well-known researchers fueled this fallacy. Back in 2002 I exposed this, but the usual suspects were quick to react."

--------------
End of the quote.

The line that reads "Unlike hard copy citation, the more time passes the more link citations a Web page receives" says it all.

The emphasis is on the page the timestamped URLs link to, not on the age (date) of the timestamped links themselves. As a first and crude approximation, we can think of TLA as a temporal link-popularity-like model. However, it is more than this.

The idea is to track the time evolution of the number of links pointing to the top-ranked pages relevant to a concept. As given by IBM's current TLA model, this is not concerned with the relevancy of the timestamped URLs with respect to the concept that has been queried.

Note that the semantic content of the timestamped URLs is not taken into consideration. The timestamped URLs may not necessarily discuss or have the queried concept C as their main topic. I believe this is an area in which the current TLA could be improved. Still, the current model provides important information.

DIP curves could be used to compare changes in the activity levels of communities discussing related topics. Effectively, we can track or monitor the activities of such communities over time and conduct interesting intel or even seasonal studies.

There is another type of analysis in the TLA paper, which consists of examining the number of timestamped URLs a given domain or web document receives. This provides a DIP curve for that particular domain or page.

Now, if we incorporate weights for the timestamped URLs into a ranking algorithm, we can go from authorities to timely authorities; a toy illustration of one possible weighting follows.
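To make the idea concrete, here is a toy Python illustration of one way a temporal weight might enter a link-popularity-style score; the exponential decay, the tau parameter, and the dates are my own assumptions, not the weighting used in the IBM paper.

[code]
# Toy temporal weighting: each dated inlink contributes exp(-age/tau) instead of 1,
# so a page with fewer but fresher inlinks can outscore one with many stale inlinks.
import math
from datetime import date

def timely_score(inlink_dates, today=date(2004, 11, 9), tau_days=180.0):
    """Sum of exponentially decayed weights over a page's dated inlinks."""
    score = 0.0
    for stamp in inlink_dates:
        age_days = max((today - stamp).days, 0)
        score += math.exp(-age_days / tau_days)
    return score

old_page = [date(1999, 5, 1)] * 50        # many inlinks, all stale
fresh_page = [date(2004, 10, 15)] * 20    # fewer inlinks, all recent
print(round(timely_score(old_page), 2))   # close to zero
print(round(timely_score(fresh_page), 2)) # well above the old page
[/code]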

At this SeoChat thread it is claimed that the age of a page in Google affects how the page ranks. At the time of writing, there is no scientific or research evidence of such claims or of claims about something called “temporal link devaluation” (TLD). Regardless of the validity of such claims, these concepts are not what timely authorities are about.

There is an upcoming research paper on Temporal Link Analysis from the IBM research group. This paper modifies and greatly improves the model, and provides excellent examples of TLA in the real world. The paper expands on timely authorities, how time could in theory affect a ranking algorithm, and how temporal weights are assigned.

Again, in this new work the emphasis is on the page receiving the timestamped URLs, not on the relevancy of these URLs with respect to the queried concept or on a supposed relationship between the age of these URLs and how they rank. I received a copy of this new work several weeks ago. Once it is officially published we can discuss it. In the meantime, what is left is what is already in the public domain.


Orion

Old 11-10-2004   #18
randfish
Member
As the author of that article, I apologize. I made a glaring generalization rather than specifically noting what you said. Obviously, I have misinterpreted the specifics, if not the general meaning, of the ideas behind TLA.

Orion, certainly you grasp that there are two sides of the field from which people who post and read these boards come: the SEO business side, and the search engine engineers, students, and academics. I certainly admit to being from the first, less educated group; however, when reading over the discussion here, I cannot help but be struck again by the purpose of TLA, which I interpret to mean:

The analysis of the time-relevance of a particular web page to a particular query, based on the links it receives.

Many factors are obviously being taken into account in measuring the links: their authority, source, "timestamp", etc. However, the purpose from a search engine's point of view appears to be as I described: to increase the relevancy of the documents returned based on their timeliness.

In writing the sentence you quoted, I clearly made an error. Perhaps you could offer assistance in mending it. Based on a re-read and some thought, albeit probably less than is warranted, I would say:
"This argument conflicts with the theory described by Dr. Garcia (Orion) at SearchEngineWatch in a thread titled Temporal Link Analysis (TLA). TLA purports to help search engines return more relevant results by adding a time analysis component to the value of a link. However, if speculation about the 'sandbox' factor holds true, it would suggest that TLA is not yet being included in Google's algorithm, or that sites suffering from sandboxing are not benefiting from it."
My point in the article was not to suggest that TLA was taking into account the age of sites, but that the idea of 'devaluing recent links' directly conflicts with the idea that new links are more relevant (provided they are from a reputable source).

Thanks for pointing out my error and taking the time to comment on it. The last thing I want to do is spread misinformation. Get back to me when you have time; I will amend the article immediately.

Regarding the lack of evidence for Google's current preference for older sites, I would agree that it is largely circumstantial. However, I'm not sure what kind of evidence could be amassed to help confirm or deny the hypothesis. I made a quick sampling at http://socengine.com/seo/guide/age-of-sites.html; perhaps someone could suggest ideas for a larger study that would help make a more complete analysis.

Old 11-10-2004   #19
randfish
Member
Quote:
Originally Posted by projectphp
In terms of SEM strategies, if this is a future direction, then the best links are long-term, stable links on big sites. Archives become vital, and places that archive articles become huge. A link on a syndicated article that is forgotten about for years grows increasingly more valuable by the day.
This seems to be the complete opposite of what TLA is about, unless I'm missing something. As I read through the papers, they suggested to me that those sites that are "timely authorities" would be receiving new links frequently, whereas sites whose links are primarily in old archives or in old, stale documents would suffer.

This would also suggest to me that the "Google prefers older sites" concept is directly at odds with TLA. Orion, Mike, Rusty, et al., hopefully you can tell me what I'm missing.
Old 11-10-2004   #20
orion

Hi, Randfish. It is an honor to have you in this thread. Please feel at home.

I salute you. About your explanation, that's fair enough. I know it wasn't your intention to misinform members of either forum. Feel free to clarify at the SeoChat forum if you think it is necessary. BTW, some SeoChat users are posting good and interesting observations. Let's see how I can address some of these.

Timestamped URLs

Let's say we query a search engine for Osama and inspect the top 30 documents, and suppose we find that 600 URLs across the Web link to these 30 documents. There are two possible treatments.

1. We can sort these 600 URLs based on their timestamp data and then group the URLs into specific time intervals. Next we normalize the counts by dividing by 600, so the counts run from 0 to 1. Then we plot the URL-time curve.

2. We proceed as in "1", but we predefine a time range to be monitored, ignoring any of the 600 collected URLs whose timestamps are not within the range.

In either case the resultant curves monitor the linking activity associated with the query "Osama". Similar URL-time curves can be constructed for a given domain; in that case the curves give a record of the linking activity associated with that domain. A significant spike in a URL-time curve "tells" the user that important link activity took place around a particular date or time interval. A small sketch of treatment 2 is given below.
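Here is a small Python sketch of treatment 2, assuming the timestamped URLs have already been collected; the URLs, dates, and monitoring range are invented placeholders.

[code]
# Treatment 2: keep only timestamped URLs inside a predefined monitoring range,
# bucket them by month, and normalize the counts by the in-range total.
from collections import Counter
from datetime import date

timestamped_urls = [
    ("http://www.example.com/1.html", date(2001, 8, 20)),
    ("http://www.example.com/2.html", date(2001, 9, 14)),
    ("http://www.example.com/3.html", date(2001, 9, 28)),
    ("http://www.example.com/4.html", date(2001, 10, 3)),
    ("http://www.example.com/5.html", date(1999, 1, 1)),   # outside the range
]

start, end = date(2001, 8, 1), date(2001, 12, 31)          # monitored range
in_range = [(u, t) for u, t in timestamped_urls if start <= t <= end]

buckets = Counter((t.year, t.month) for _, t in in_range)
total = len(in_range)
for (year, month), count in sorted(buckets.items()):
    print(f"{year}-{month:02d}: {count / total:.2f}")
[/code]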

I can see many applications for TLA, to mention only two.

1. Intelligence: TLA curves could be correlated with, for example, historical events in time (e.g., September 11, 2001) or used to monitor linkage patterns and trends around a given concept.

2. Marketing: TLA curves could be used to correlate seasonal and fashion trends around a given product or brand.

As it stands, I can also see several areas in which the current TLA model could be improved. Two of these are

1. Documents retrieved by the query may not be on-topic; it is assumed that the top-ranked documents are relevant to the concept C, which is not necessarily the case, especially in the presence of noise (e.g., bloggers, link bombs, relevancy tricks).

2. Timestamped URLs may not be on-topic; i.e., the content and relevancy of the timestamped URLs with respect to the initial query are not taken into consideration.

Although there are other areas that deserve improvement, TLA is a promising intel model.


Google using TLA?

I can only refer readers to the public information available. The evidence suggests this is unlikely for three reasons.

1. IBM holds a patent on TLA (published just this Summer).

2. Check the TLA paper, Section 4.4 "From Authorities to Timely Authorities", Tables 2 and 3. When TLA was incorporated into a ranking model, the IBM group found that new, recent, and fresh sites tend to rank higher, not lower as suggested by proponents of the sandbox, TLD, BLOOD, and other conspiracy theories.

3. Models that try to explain spatio-temporal behaviors on the Web are far from being fully explored, developed, and implemented. Welcome to the world of Non Linear Dynamics (Chaos) and Fractals.

Orion
