Microsoft Scraping Google and Yahoo! SERPS?
Jeff Martin
11-11-2004, 02:50 PM
Hot in my inbox is the WebProWorld article
"Microsoft Crawling Google Results For New Search Engine?"
They already have an interesting thread going at WPW, link (http://webproworld.com/viewtopic.php?t=31604)
I was questioned today by a developer who was watching a particular IP address scan his site.
The behavior it demonstrated made it look like a crawler, especially since it was spidering URLs that no longer exist, and doing so at the rate of 1 page every 3 to 5 seconds.
So now you're saying, so what, big deal. But this really is a big deal. It's a big deal not only because the URLs this visitor was requesting don't exist any longer, but because the only place these URLs can be found is in Google’s search results using site:www.sitename.com.
You have to agree that, if this is being done, it would save time hunting down quality (subject to interpretation, of course) pages, examining link structure, etc.
One thing you can't say about MS is that they aren't resourceful. :D
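For anyone who wants to check their own logs for this pattern, here is a minimal sketch of the idea, not what the developer actually ran. The log records, the IP addresses and the thresholds (max_gap, min_404_share, min_requests) are all made up for illustration; a real access log would need parsing first.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical, simplified log records: (ip, ISO timestamp, path, HTTP status).
SAMPLE_LOG = [
    ("192.0.2.10", "2004-11-11T14:00:00", "/old/page1.html", 404),
    ("192.0.2.10", "2004-11-11T14:00:04", "/old/page2.html", 404),
    ("192.0.2.10", "2004-11-11T14:00:07", "/old/page3.html", 404),
    ("192.0.2.10", "2004-11-11T14:00:12", "/old/page4.html", 404),
    ("198.51.100.7", "2004-11-11T14:00:01", "/index.html", 200),
    ("198.51.100.7", "2004-11-11T14:09:30", "/contact.html", 200),
]


def crawler_like_ips(records, max_gap=5.0, min_404_share=0.5, min_requests=4):
    """Return IPs whose requests are steady, fast and mostly hit missing pages."""
    by_ip = defaultdict(list)
    for ip, stamp, _path, status in records:
        by_ip[ip].append((datetime.fromisoformat(stamp), status))

    flagged = []
    for ip, hits in by_ip.items():
        if len(hits) < min_requests:
            continue  # too few requests to say anything
        hits.sort()
        # Seconds between consecutive requests from this IP.
        gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(hits, hits[1:])]
        share_404 = sum(1 for _, status in hits if status == 404) / len(hits)
        if median(gaps) <= max_gap and share_404 >= min_404_share:
            flagged.append(ip)
    return flagged


print(crawler_like_ips(SAMPLE_LOG))
```

A steady few-second gap plus a high share of 404s only suggests a crawler working from a stale URL list; it says nothing about where that list came from.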
Nacho
11-11-2004, 02:58 PM
No comment :rolleyes:
Well, except for one.
You have to agree that, if this is being done, it would save time hunting down quality (subject to interpretation, of course) pages, examining link structure, etc., and save a considerable amount of time.
What a very true statement that is. Good point Jeff!
rustybrick
11-11-2004, 03:05 PM
Seems a bit unethical to me, if true.
seomike
11-11-2004, 03:20 PM
They should buy Fantomaster's spiderspy list and redirect their spiders LOL.
They would be the definition of irony :D
rustybrick
11-11-2004, 04:28 PM
I was thinking of discussing this topic at the blog I write at. But I do not think it's really worth it, unless this topic gets really hot. I know Jason, I actually speak with him on a regular basis. I respect him, the company he works for and his colleagues.
OK, that being said, I think MSN would never even consider this.
Couldn't they seed their index with the Yahoo! results they paid for and are still paying for to run the non-beta version of MSN Search? Why would they use the Google API or screen scrape when they have access to Yahoo!'s index?
Of course everyone needs something to seed the index. I believe most new engines start off with the Yahoo! Directory and ODP. I am sure others, smaller ones, use Jason's method. But MSN? I highly doubt it.
Nacho
11-11-2004, 04:56 PM
From my understanding, search engines feed their crawlers from just about any source they can get their hands on to find new relevant documents. I've talked with SE engineers about this and it amazed me :eek: to learn where they sometimes start looking.
Couldn't they seed their index with the Yahoo! results they paid for and are still paying for to run the non-beta version of MSN Search? Why would they use the Google API or screen scrape when they have access to Yahoo!'s index?
Yes, I would assume they are, but I have nothing to prove that belief.
How they get what you call "seed" pages for the crawlers to fetch is totally different from what the SEs use to analyze and store in their index modules.
From looking at results with limited testing, I can tell their algorithms are completely unique and different from Google's.
They have a lot of tweaking on their hands, but I'd rather not speculate about what MSN Search (beta) is doing unless I have enough testing to prove any theories.
If MSN Search is doing this, then I would pay attention to this comment:
It makes sense from a business case but I wonder if there are any legal issues there.
Jeff Martin
11-11-2004, 05:14 PM
It makes sense from a business case but I wonder if there are any legal issues there.
Legal issues? How?
G and Y! don't own any of the content.....we do.
If we, the internet as a whole, placed 4 lines of text in our robots.txt files, we would shut down the two most powerful search properties in a matter of months.
Since G and Y! don't own any of the content, on what legal ground could they stand?
SEOMike's got it right (wonder where that idea came from, Mike?), get a subscription to Ralph's list and block 'em :cool:
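To make the robots.txt point concrete, here is a rough sketch using Python's standard urllib.robotparser. The post doesn't spell out which four lines are meant, so the ones below are my assumption: two records naming Googlebot and Slurp (the Google and Yahoo! crawlers). The allowed() helper and the msnbot check are only there for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the "4 lines" are two user-agent records like these.
# The original post does not list them.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
User-agent: Slurp
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("Googlebot", "Slurp", "msnbot"):
    print(bot, allowed(bot, "http://www.sitename.com/page.html"))
```

This only affects crawlers that choose to honor robots.txt, and a bot not named in the file (msnbot here) is still allowed in.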
rustybrick
11-11-2004, 05:19 PM
So MSN Search can be considered the "Black Hat Search Engine"? :D
Jeff Martin
11-11-2004, 05:26 PM
So MSN Search can be considered the "Black Hat Search Engine"? :D
Hats? RB are you trolling????? :D
I suppose we could place this method into the grey shaded area that quite a bit of SEO falls into.
I, Brian
11-11-2004, 05:42 PM
Legal issues? How?
G and Y! don't own any of the content.....we do.
Google claim ownership of their own SERPs:
http://www.google.co.uk/terms_of_service.html
"You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site." So it would be unlikely that MSN would be dumb enough to scrape Google.
Unethical? No, just commercial stupidity for a billion dollar corporation - if they did. :)
Nacho
11-11-2004, 06:15 PM
Brian,
IF MSN is doing it, then it is NOT doing what this quote suggests:
"You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site."
They would ONLY be taking the URLs to feed their crawlers, NOT reformat, NOT display them, NOT mirror the Google home page or results pages on their website.
From this quote, the technique discussed in this thread is NOT breaking any laws that I'm aware of, UNLESS Google has disallowed SE crawlers from going through its results pages with a scrape method, either legally (terms & cond.) or by accepted methods (i.e. robots.txt). The Google API has its own terms & cond. that must be met, so I'm sure MSN would not be this dumb. It would be like spitting up at the sky, wouldn't you think?
Jeff Martin
11-11-2004, 06:38 PM
They would ONLY be taking the URLs to feed their crawlers, NOT reformat, NOT display them, NOT mirror the Google home page or results pages on their website.
That's exactly what I think is happening. They can get the "quality" pages and a recommendation on ranking from G, then send the bots, tweak their findings and publish them.
"You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site."
I would like to see this go through the courts, as search companies own an insignificant share of the web properties they display in their SERPS. Again, we own the content, so it seems to me the only claim G could make would be to its algorithm and the style in which the SERPS are rendered.
Nacho
11-11-2004, 06:48 PM
That's exactly what I think is happening. They can get the "quality" pages and a recommendation on ranking from G, then send the bots, tweak their findings and publish them.
SI mi amigo. IF THEY ARE, then what they are effectively doing is just getting a higher quality index universe. Call them thoroughbred seeds if you wish.
Again, IF it's legally and accessibly possible, then I think it's a genius strategy.
hardball
11-11-2004, 09:21 PM
What's an SE to do with "no results found"?
Log the query and go scrape.
Mikkel deMib Svendsen
11-11-2004, 09:49 PM
Not being a lawyer, I am still pretty sure that databases and collections of data are protected under copyright law - as I know they are in most of Europe. You can own the collection (the index) without owning the content (the URLs), and it is illegal (at least here) to reuse the entire data collection.
Elisabeth
11-12-2004, 01:22 AM
Note MSNdude's response to this topic within the thread at wmw:
http://www.webmasterworld.com/forum97/229.htm
Very key answer there, and GG doesn't appear to believe this is the case himself, either.
Nacho
11-12-2004, 01:40 AM
So far, up to post #31, MSNDude has only said:
Also regarding relevance, there has been some speculation on some online forums about MSNBot using Google search result pages to build our index. Let us set the record straight – that is simply not true. We respect robots.txt and as a result we will not crawl Google’s search result pages.
But who has said that you can only get Google's search results via crawling www.google.com???
I would expect a much better statement than that, something like . . . . we respect Google.com and as a result we will not use any of Google’s search results to build our index.
We'll just have to wait and see if MSN really clears this up for real.
Lance Housley
11-12-2004, 06:12 AM
I am still pretty sure that databases and collections of data are protected under copyright law - as I know they are in most of Europe. You can own the collection (the index) without owning the content (the URLs), and it is illegal (at least here) to reuse the entire data collection.
It's part of Intellectual Property law, alongside copyright and patents, and is usually known as Database Right. Within the European Union, it is enshrined in legislation, though I don't know the situation elsewhere.
To a large extent Database Right grew up as a development of copyright to make a distinction between owning the intellectual endeavour involved in creating original material and owning the intellectual endeavour involved in creating the compilation.
Since SE databases are basically compiled by computers, does that demonstrate that computers are intellectuals, I wonder? ;)
dannysullivan
11-12-2004, 07:28 AM
There are plenty of software packages that will screen scrape search results in order to create search fodder for those trying to generate AdSense or other traffic.
It's entirely possible that MSN has simply crawled one of these pages. So yes, it would have crawled Google search results -- but these could have been Google search results that were copied and transferred to a different site.
That's far more likely than the idea that MSN is somehow scraping Google. I mean what, MSN starts jumping over to Google, entering site:someonessite.com commands for umpteen million sites to do some guesswork on harvesting sites? Farfetched. Much more likely it ran across the results as I've described.
The actual story is also just incorrect. MSN never required a fee to be spidered. MSN still, on the flagship site, partners with Yahoo for its search results. Yahoo has operated a paid inclusion program but as many will attest, has also spidered pages for free aside from this. MSN dropped paid inclusion pages back in July -- but despite this, they already were and still are crawling the web for free via Yahoo (and via themselves, on the beta site).
And the fastest way to get relevant pages is to crawl Google for every page listed from a site? Not. You'd instead do what the other crawlers do, harvest links from across the web and start indexing the ones you see most often.
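As a rough illustration of "index the ones you see most often", here is a toy sketch. The PAGES data and the most_linked() helper are invented for the example; it is not a description of how any particular engine ranks its crawl queue.

```python
from collections import Counter

# Hypothetical crawl data: page -> outbound links found on that page.
PAGES = {
    "a.example/": ["c.example/", "d.example/"],
    "b.example/": ["c.example/"],
    "e.example/": ["c.example/", "d.example/", "f.example/"],
}


def most_linked(pages, top_n=2):
    """Return the top_n link targets, ordered by how many pages link to them."""
    counts = Counter()
    for source, links in pages.items():
        # Count each target once per source page and ignore self-links.
        counts.update(set(links) - {source})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _count in ranked[:top_n]]


print(most_linked(PAGES))
```

The pages with the most distinct inbound links rise to the top of the queue without any search engine's results being consulted.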
Nacho
11-12-2004, 12:03 PM
And the fastest way to get relevant pages is to crawl Google for every page listed from a site? Not. You'd instead do what the other crawlers do, harvest links from across the web and start indexing the ones you see most often.
Excellent points Danny! However, as a crawling strategy, you could get outstanding entry points into the web like this. Then, let the spiders find the new stuff.
orion
11-12-2004, 12:20 PM
Actually, one way to find topic-specific links is to use an initial reference results page (from vertical portals, topic-specific directories, ODP, etc.) as the initial seed, then use it as Nacho says to find new stuff on the destination servers. This is an open secret.
As with any harvesting strategy, this has some pros and cons, since the process carries an error due to the initial seed.
Orion
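Here is a toy sketch of the seed idea, assuming a small in-memory link graph instead of real HTTP fetches. LINK_GRAPH, the page names and discover() are all hypothetical.

```python
from collections import deque

# Hypothetical link graph: page -> links found on it.
LINK_GRAPH = {
    "seed-directory/": ["site-a/", "site-b/"],
    "site-a/": ["site-a/new-page", "site-c/"],
    "site-b/": [],
    "site-a/new-page": [],
    "site-c/": ["site-a/"],
    "site-z/": ["site-a/"],  # nothing reachable from the seed links here
}


def discover(seeds, graph, max_pages=100):
    """Breadth-first discovery: start at the seeds, then follow links outward."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(discover(["seed-directory/"], LINK_GRAPH))
```

Note that site-z/ is never found: whatever the seed cannot reach stays invisible, which is one way to read the "error due to the initial seed".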
orion
11-13-2004, 10:43 AM
How about MSN Beta scraping old material that is no longer available or no longer exists?
I just found out yesterday that some images from one of my clients show up in the results page for image searches. These images were in the first version of the site back in 2001-2.
Since then, that site has been updated and all old pages and images removed from the server. So, how to explain that? Are they crawling instances from their own old repositories or from old external repositories? Which way is up?
Orion
Nacho
11-13-2004, 12:01 PM
So, how to explain that? Are they crawling instances from their own old repositories or from old external repositories? Which way is up?
Hold on a sec while I go check my 2001-02 logs . . . .
.
.
.
.
.
.
.
.
I'm back . . . Nope, no records of MSN Bots around. :confused:
orion
11-13-2004, 12:40 PM
Today its search results are so wild and different from yesterday, but this is understandable since they are in beta. I just finished today's stability analysis for my test pages and the results are not encouraging.
In the interest of fairness, I think we need to wait and see how MSN comes out from this beta phase. I sincerely hope they do come out with a good face to show, honestly.
Orion
mcanerin
11-13-2004, 01:08 PM
I think Danny is right (and beat me to it).
I occasionally do searches on my name (which is unusual) just to see what I've been up to or accused of recently :)
Since it's also in my company and domain, I usually get a fairly large list that's related to me. One thing I see all the time are fake "directories" made with automatic software that takes a keyword, does a search on it, scrapes the page then adds their own link at the top, or whatever.
Today, they are more likely to change the URL to each listing to point to their own site, but a while ago they actually kept the original link. I ended up on a lot of "directories" simply because I was in the top ten in some popular searches.
There is a lot of software that does this and more is being created every day. If MSN indexed these then yes, the net result would be the equivalent of a scrape from Google. The sites would be whatever ones were showing up at the time the original scrape was made, so even old sites that are no longer in the current G index would show up.
I see this as a much more likely scenario.
Ian
orion
11-13-2004, 01:36 PM
If MSN indexed these then yes, the net result would be the equivalent of a scrape from Google. The sites would be whatever ones were showing up at the time the original scrape was made, so even old sites that are no longer in the current G index would show up.
I see this as a much more likely scenario.
Ian
Agree. This scenario, however, raises the bar for search engines, as they would need to find a way to filter out the old, spurious material/results. Otherwise, except for conducting interesting research experiments on archiving and intelligence, why should the average user be compelled to use a system that returns answers (files) that no longer exist on the intended server?
Let's not be paranoid, but I think the big boys may need to start thinking about a workaround for the described scenario.
Orion
orion
11-14-2004, 11:43 AM
Hold on a sec while I go check my 2001-02 logs . . . .
.
.
.
.
.
.
.
.
I'm back . . . Nope, no records of MSN Bots around. :confused:
Actually, chances are you will see nothing in the 2001-2 logs:
1. if the crawls were not conducted those days. (most likely since this is a beta engine).
2. if they use search results as discovery routes to find new material. (remember, the open secret).
3. if other repositories (containers) across the web have instances of the old material and those repositories were crawled. (“repositories” here does not necessarily mean other search engine collections)
Orion
jpanski
09-29-2006, 08:57 PM
I have created an application/website that is essentially a search engine that takes as input search results and outputs the same results in a different order(sounds stupid i know, but its not). right now I use G API, but I know I can't leave it that way becuase of their license that says I can't use it for a commerical product that is primarily for search. Y! essentially has similar restrictions on their API.. and reading this thread, it says that scraping google results is a violation of their IP as well.. how about scraping Y!? is there anyway I can legally get the results returned by one of these services and use it as input to something else? (Is there a difference if I use the search results as input to a webpage as opposed to use the info as input to an application or firefox extension?) thanks