Search Engine Watch
11-11-2004   #1
Jeff Martin
Join Date: Jun 2004
Location: Dallas, Texas
Posts: 364
Microsoft Scraping Google and Yahoo! SERPS?

Hot in my inbox is the WebProWorld article
"Microsoft Crawling Google Results For New Search Engine?"

They already have an interesting thread going at WPW.

Quote:
I was questioned today by a developer who was watching a particular IP address scan his site.

The behavior it demonstrated made it look like a crawler, especially since it was spidering URLs that were no longer in existence...and doing so at the rate of 1 page every 3 - 5 seconds.

So now you're saying, so what, big deal. But this really is a big deal. It's a big deal not only because the URLs this visitor was requesting no longer exist, but because the only place these URLs can be found is in Google's search results using site:www.sitename.com.
You have to agree that, if this is being done, it would save time hunting down quality (subject to interpretation, of course) pages, examining link structure, etc.
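
A rough way to check a server log for the pattern described above is sketched below. It is only an illustration under assumptions: the `(ip, timestamp, status)` tuple format, the function name `flag_crawler_like_ips`, and every threshold are hypothetical, and evenly paced requests for missing pages do not prove who is behind an IP address.

```python
from collections import defaultdict
from statistics import mean


def flag_crawler_like_ips(hits, min_hits=5, min_gap=3.0, max_gap=5.0,
                          min_404_share=0.8):
    """Return IPs whose requests are evenly paced and mostly hit missing pages.

    hits: iterable of (ip, unix_timestamp, http_status) tuples.
    This tuple layout is a hypothetical, already-parsed log format.
    """
    by_ip = defaultdict(list)
    for ip, ts, status in hits:
        by_ip[ip].append((ts, status))

    flagged = []
    for ip, entries in by_ip.items():
        if len(entries) < min_hits:
            continue  # too few requests to say anything about pacing
        entries.sort()
        times = [ts for ts, _ in entries]
        # Seconds between consecutive requests from this IP.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        share_404 = sum(1 for _, status in entries if status == 404) / len(entries)
        if min_gap <= mean(gaps) <= max_gap and share_404 >= min_404_share:
            flagged.append(ip)
    return sorted(flagged)


# Made-up sample data: one IP requests a dead URL every 4 seconds,
# another makes a few ordinary, irregular visits.
sample_hits = [("192.0.2.10", t, 404) for t in range(0, 24, 4)]
sample_hits += [("198.51.100.7", 0, 200), ("198.51.100.7", 60, 200),
                ("198.51.100.7", 300, 200)]

if __name__ == "__main__":
    print(flag_crawler_like_ips(sample_hits))
```

A match only says the traffic looks automated; tying it to a particular company would still need a reverse DNS or WHOIS check on the address.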

One thing you can't say about MS is that they aren't resourceful.
__________________
Jeff Martin - SEW Moderator
Vericlix

Last edited by Jeff Martin : 11-11-2004 at 03:03 PM.
11-11-2004   #2
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
No comment

Well, except for one.

Quote:
You have to agree that, if this is being done, it would save time hunting down quality (subject to interpretation, of course) pages, examining link structure, etc., and save a considerable amount of time.
What a very true statement that is. Good point, Jeff!

Last edited by Nacho : 11-11-2004 at 03:03 PM.
11-11-2004   #3
rustybrick
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Seems a bit unethical to me, if true.
11-11-2004   #4
seomike
M·d_Rewrite Guru
Join Date: Jun 2004
Location: Dallas, Texas but forever a Floridian!
Posts: 627
They should buy Fantomaster's spiderspy list and redirect their spiders LOL.

That would be the definition of irony.
11-11-2004   #5
rustybrick
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
I was thinking of discussing this topic at the blog I write for. But I do not think it's really worth it, unless this topic gets really hot. I know Jason; I actually speak with him on a regular basis. I respect him, the company he works for and his colleagues.

OK, that being said, I think MSN would never even consider this.

Couldn't they seed their index with the Yahoo! results they paid for and are still paying for to run the non beta version of MSN Search? Why would they use the Google API or screen scrape when they have access to Yahoo!'s index?

Of course, everyone needs something to seed the index. I believe most new engines start off with the Yahoo! Directory and ODP. I am sure other, smaller ones use Jason's method. But MSN? I highly doubt it.
11-11-2004   #6
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
From my understanding, search engines feed their crawlers from just about any source they can get their hands on to find new relevant documents. I've talked with SE engineers about this, and it amazed me to learn where they sometimes start looking.

Quote:
Originally Posted by RustyBrick
Couldn't they seed their index with the Yahoo! results they paid for and are still paying for to run the non beta version of MSN Search? Why would they use the Google API or screen scrape when they have access to Yahoo!'s index?
Yes, I would assume they are, but I have nothing to prove that belief.

How they get what you call "seed" pages for the crawlers to fetch is totally different from what the SEs use to analyze and store in their index modules.

From looking at results with limited testing, I can tell their algorithms are unique and clearly different from Google's.

They have a lot of tweaking on their hands to do, but I'd rather not speculate about what MSN Search (beta) is doing unless I have enough testing to prove any theories.

If MSN Search is doing this, then I would pay attention to this comment:

Quote:
It makes sense from a business case but I wonder if there are any legal issues there.
11-11-2004   #7
Jeff Martin
Join Date: Jun 2004
Location: Dallas, Texas
Posts: 364
Quote:
It makes sense from a business case but I wonder if there are any legal issues there.
Legal issues? How?

G and Y! don't own any of the content.....we do.

If we, the internet as a whole, placed 4 lines of text in our robots.txt files, we would shut down the two most powerful search properties in a matter of months.
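
For illustration, the "4 lines" might look like the robots.txt in the sketch below, which uses Python's standard `urllib.robotparser` to show how a compliant crawler would read them. The user-agent tokens `Googlebot` and `Slurp`, the helper name `allowed`, and the example.com URL are assumptions, and the rules only bind crawlers that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical four-line robots.txt; the user-agent tokens are assumptions.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
User-agent: Slurp
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a robots.txt-compliant crawler may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/page.html"
    for agent in ("Googlebot", "Slurp", "msnbot"):
        print(agent, allowed(agent, page))
```

A crawler that is not named in the file, such as the `msnbot` token used here, falls through to the default and remains allowed.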

Since G and Y! don't own any of the content, on what legal ground could they stand?

SEOMike's got it right (wonder where that idea came from, Mike?): get a subscription to Ralph's list and block 'em.
__________________
Jeff Martin - SEW Moderator
Vericlix

Last edited by Jeff Martin : 11-11-2004 at 06:29 PM.
11-11-2004   #8
rustybrick
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
So MSN Search can be considered the "Black Hat Search Engine"?
11-11-2004   #9
Jeff Martin
Join Date: Jun 2004
Location: Dallas, Texas
Posts: 364
Quote:
Originally Posted by rustybrick
So MSN Search can be considered the "Black Hat Search Engine"?
Hats? RB are you trolling?????

I suppose we could place this method into the grey shaded area that quite a bit of SEO falls into.
__________________
Jeff Martin - SEW Moderator
Vericlix
11-11-2004   #10
I, Brian
Whitehat on...Whitehat off...Whitehat on...Whitehat off...
Join Date: Jun 2004
Location: Scotland
Posts: 940
Quote:
Originally Posted by Jeff Martin
Legal issues? How?

G and Y! don't own any of the content.....we do.
Google claim ownership of their own SERPs:
http://www.google.co.uk/terms_of_service.html
Quote:
"You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site."
So it would be unlikely that MSN would be dumb enough to scrape Google.

Unethical? No, just commercial stupidity for a billion dollar corporation - if they did.
11-11-2004   #11
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Brian,

IF MSN is doing it, then it is NOT doing what this quote suggests:

Quote:
"You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site."
They would ONLY be taking the URLs to feed their crawlers, NOT reformat, NOT display them, NOT mirror the Google home page or results pages on their website.

From this quote, the technique discussed in this thread is NOT breaking any laws that I'm aware of, UNLESS Google has, either legally (terms & cond.) or through accepted methods (i.e. robots.txt), disallowed SE crawlers from going through its results pages with a scraping method. The Google API has its own terms & cond. that must be met, so I'm sure MSN would not be this dumb. It would be like spitting up at the sky, wouldn't you think?
11-11-2004   #12
Jeff Martin
Join Date: Jun 2004
Location: Dallas, Texas
Posts: 364
Quote:
They would ONLY be taking the URLs to feed their crawlers, NOT reformat, NOT display them, NOT mirror the Google home page or results pages on their website.
That's exactly what I think is happening. They can get the "quality" pages and ranking recommendations from G, then send the bots, tweak their findings, and publish them.

Quote:
"You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site."
I would like to see this go through the courts, as search companies own an insignificant share of the web properties they display in their SERPs. Again, we own the content, so it seems to me the only claim G could make would be to its algorithm and the style in which the SERPs are rendered.
__________________
Jeff Martin - SEW Moderator
Vericlix
11-11-2004   #13
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Quote:
Originally Posted by Jeff Martin
That's exactly what I think is happening. They can get the "quality" pages and ranking recommendations from G, then send the bots, tweak their findings, and publish them.
Yes, my friend. IF THEY ARE, then what they are effectively doing is just getting a higher-quality index universe. Call them thoroughbred seeds if you wish.

Again, IF it's legally and accessibly possible, then I think it's a genius strategy.
11-11-2004   #14
hardball
Member
Join Date: Oct 2004
Posts: 83
What's an SE to do with "no results found"?

Log the query and go scrape.
11-11-2004   #15
Mikkel deMib Svendsen
Join Date: Jun 2004
Location: Copenhagen, Denmark
Posts: 1,576
Not being a lawyer, I am still pretty sure that databases and collections of data are protected under copyright law, as I know they are in most of Europe. You can own the collection (the index) without owning the content (the URLs), and it is illegal (at least here) to reuse the entire data collection.
11-12-2004   #16
Elisabeth
Join Date: May 2004
Location: the wasatch front
Posts: 987
Note MSNdude's response to this topic within the thread at wmw:
http://www.webmasterworld.com/forum97/229.htm

Very key answer there, and GG doesn't appear to believe this is the case himself, either.
11-12-2004   #17
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
So far, and up to post #31, MSNDude has only said:

Quote:
Originally Posted by msndude
Also regarding relevance, there has been some speculation on some online forums about MSNBot using Google search result pages to build our index. Let us set the record straight – that is simply not true. We respect robots.txt and as a result we will not crawl Google’s search result pages.
But who has said that you can only get Google's search results by crawling www.google.com???

I would expect a much better statement than that to cover something like . . . . we respect Google.com and as a result we will not use any of Google’s search results to build our index.

We'll just have to wait and see if MSN really clears this up.
11-12-2004   #18
Lance Housley
Looking from the Searcher's Angle
Join Date: Jun 2004
Location: Canterbury, England, UK
Posts: 24
Quote:
Originally Posted by Mikkel deMib Svendsen
I am still pretty sure that databases and collections of data are protected under copyright law - as I know they are in most of Europe. You can own the collection (the index) without owning the content (the URLs) and it is illegal (at least here) to reuse the entire data collection.
It's part of Intellectual Property law, alongside copyright and patents, and is usually known as Database Right. Within the European Union, it is enshrined in legislation, though I don't know the situation elsewhere.
To a large extent Database Right grew up as a development of copyright to make a distinction between owning the intellectual endeavour involved in creating original material and owning the intellectual endeavour involved in creating the compilation.
Since SE databases are basically compiled by computers, does that demonstrate that computers are intellectuals, I wonder?
11-12-2004   #19
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
There are plenty of software packages that will screen scrape search results in order to create search fodder for those trying to generate AdSense or other traffic.

It's entirely possible that MSN has simply crawled one of these pages. So yes, it would have crawled Google search results -- but these could have been Google search results that were copied and transferred to a different site.

That's far more likely than the idea that MSN is somehow scraping Google. I mean, what, MSN starts jumping over to Google, entering site:someonessite.com commands for umpteen million sites to do some guesswork on harvesting sites? Farfetched. Much more likely it ran across the results as I've described.

The actual story is also just incorrect. MSN never required a fee to be spidered. MSN still, on the flagship site, partners with Yahoo for its search results. Yahoo has operated a paid inclusion program but as many will attest, has also spidered pages for free aside from this. MSN dropped paid inclusion pages back in July -- but despite this, they already were and still are crawling the web for free via Yahoo (and via themselves, on the beta site).

And the fastest way to get relevant pages is to crawl Google for every page listed from a site? Not. You'd instead do what the other crawlers do, harvest links from across the web and start indexing the ones you see most often.
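
The "index the ones you see most often" idea can be sketched as a simple in-link count. This is a toy illustration, not a description of how any particular engine works; the `pages` mapping, the function name `rank_urls_by_inlinks`, and the example URLs are all hypothetical.

```python
from collections import Counter


def rank_urls_by_inlinks(pages):
    """Order discovered URLs by how many distinct pages link to them.

    pages: dict mapping an already-crawled page URL to the list of URLs
    found on it (a hypothetical, in-memory stand-in for crawl data).
    """
    counts = Counter()
    for source, targets in pages.items():
        # Count each target once per linking page, and skip self-links,
        # so a single page cannot inflate a URL's score.
        for target in set(targets):
            if target != source:
                counts[target] += 1
    # Most frequently linked URLs first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up crawl data for three pages.
sample_pages = {
    "http://a.example/": ["http://c.example/", "http://b.example/",
                          "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://d.example/": ["http://c.example/", "http://b.example/"],
}

if __name__ == "__main__":
    for url, inlinks in rank_urls_by_inlinks(sample_pages):
        print(inlinks, url)
```

The crawler would then fetch URLs from the top of this list first, which needs no access to another engine's results.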
11-12-2004   #20
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Quote:
Originally Posted by dannysullivan
And the fastest way to get relevant pages is to crawl Google for every page listed from a site? Not. You'd instead do what the other crawlers do, harvest links from across the web and start indexing the ones you see most often.
Excellent points Danny! However, as a crawling strategy, you could get outstanding entry points into the web like this. Then, let the spiders find the new stuff.