Search Engine Watch
Old 01-11-2005   #1
orion
 
Join Date: Jun 2004
Posts: 1,044
Fractal Spam?

Disclaimer. This thread is a spinoff of a work on the Fractal Nature of Semantics I'm conducting. Since this version focuses on spam, SEWF “no-spam” reporting policies apply. Due to time constraints I may not be able to follow/respond to this thread as much as I want. Feel free to comment/post your thoughts on this or similar on-topic issues. Please keep in mind SEWF policies.


Fractal Spam?

According to Benoit B. Mandelbrot, a fractal is a shape made of parts similar to the whole in some way (1-3). Thus, the term fractal describes shapes that are irregular, rough, and fragmented.

While making travel arrangements for the coming weeks, I was thinking about this and related research I am currently conducting on the fractal nature of relevancy, Web nodes and IR systems. Then it occurred to me that there should be an iterative procedure I could study, strictly in connection with relevancy induction or, let's say, for spamming a search engine.

Then I decided to "command search" Google for 100 results on several topics I have discussed with search engine optimization and marketing specialists (SEOs/SEMs). I came across some SERPs in Google showing entries (title, portion of text) that appeared to be relevant to my query. Clicking a result sent me to a page (no redirection involved) about a commercial product. The content wasn’t related at all to my query or the SERPs. How could this be possible?

After several Google AdWords ads and other marketing material on the page, I saw a carbon copy of Google's SERPs pasted into the document and containing the top 10 results for my intended query. But among those top 10 entries I found no trace of the text that the SERPs had displayed for this document. Confused? Stay with me, please.

Then I checked the cache version and realized that the old copy contained similar marketing material as before and a carbon copy of Google's SERPs, but this time with an entry related to my post, which is the text shown in the original query results page. WOW! I can spam a system with its own output as input.

I have many examples of this. Then I decided to start a thread on this at the Search Engine Watch Forums. Since SEWF has a "no spam report" policy, I’ll not provide names, and I hope that if you post at SEWF you will keep in mind that outing someone is out. I will not add anything else except this.

This is not a novelty; it has been occurring for a while with static and dynamic pages, and with single documents or small database-driven documents. It is not a mere case of affiliated search results embedded into documents. Are these cases simply falling through the cracks? I don't know. Still, a nice recursive way of spamming a system.

Recursively spamming an IR system or search engine with their own top N search results entries, links, titles, urls, and descriptive texts that are already relevant to the initial query.... Hum.


References

1. B. B. Mandelbrot, The Fractal Geometry of Nature, W. H. Freeman, New York (1983).
2. J. Feder, Fractals, p. 11, Plenum Press, New York (1988).
3. H.-O. Peitgen, H. Jürgens, and D. Saupe, Fractals for the Classroom, Parts One and Two, Springer-Verlag, New York (1992).


Orion

Last edited by orion : 01-11-2005 at 01:22 PM. Reason: Removing last line/typos
Old 01-11-2005   #2
I, Brian
Whitehat on...Whitehat off...Whitehat on...Whitehat off...
 
Join Date: Jun 2004
Location: Scotland
Posts: 940
Sounds like you're talking about scrapers there, Orion.

Adding Mandelbrot to the equation doesn't make the practice any prettier, though.
Old 01-12-2005   #3
Nacho
 
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Orion,

Very interesting post! I can see how a web developer might dynamically generate a page, perhaps using the Google API for the keyword being optimized, and use the results as content and outgoing links to gain rankings on the SERPs. If this "output" is re-organized in such a way, yes, Google would of course see it as relevant content with outgoing links to more relevant content, and would therefore award it points toward a higher ranking. Are these points enough? Maybe in a non-competitive environment, but most likely not in a highly competitive one.

Is it all bad? Perhaps. Maybe this is a very creative way to help website owners find and present to their users other good sources similar to their content after it has been through a decent editorial review. But maybe this technique was initiated to abuse the search engine. In that case it helps the wrong website, one that was seen as “the best” (by being in the top 10 of Google's SERPs), gain even more power when it really shouldn't, and it makes Google's results less relevant by catching Google in its own trap.

This is why I continue to believe that today's algorithms are still in their toddler years, assigning way too much weight to factors that can be easily gamed with the right resources. Let's not forget the "How Fair is the Link Popularity Algorithm?" thread and the great points made in the "Threats and Opportunities of Search Engine Marketing" thread as well.

So if this loop trap is hurting the relevancy of the search engines, then YES, it is SPAM, and search engines should find a way to spot it and take effective measures against it. IMO, I wouldn’t go as far as banning a site for doing this, but perhaps assign devaluation points against its content and outgoing links that match the SERP results (actual or historic).
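
The devaluation idea above can be sketched roughly as follows. This is only an illustration: the function names, the 0.8 threshold and the 0.5 penalty are assumptions made up for the example, not anything a search engine is known to use. It measures how much of a SERP's top-N URL list reappears among a page's outgoing links and scales the page's score down when the overlap is high.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to lowercase host plus path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def serp_overlap(page_links, serp_urls):
    """Fraction of the SERP's URLs that also appear among the page's outgoing links."""
    serp = {normalize(u) for u in serp_urls}
    if not serp:
        return 0.0
    page = {normalize(u) for u in page_links}
    return len(serp & page) / len(serp)


def devalued_score(base_score, overlap, threshold=0.8, penalty=0.5):
    """Scale the score down once the overlap reaches the (hypothetical) threshold."""
    if overlap >= threshold:
        return base_score * penalty
    return base_score
```

Matching against historic SERPs, as suggested above, would amount to calling serp_overlap once per stored result list and keeping the highest value.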
Old 01-12-2005   #4
I, Brian
Whitehat on...Whitehat off...Whitehat on...Whitehat off...
 
Join Date: Jun 2004
Location: Scotland
Posts: 940
Scraping has been going on for some time - I don't believe the sites in question aim to rank high, as much as net a wide range of traffic for advertising.

The issue of scraping has already been discussed in a few threads at SEW:
Search Engine scrapers
Change in backlinks showing scraped pages
Tool bar hacks and scrapers

In fact, yesterday Daniel Brandt of Google-Watch (in)fame released the source code for a scraper that takes Google's results for use on other websites: http://www.scroogle.org/gscrape.html

Hope that was helpful, and not a distraction from the topic of fractal patterns in relevancy.

Last edited by Nacho : 01-12-2005 at 12:05 PM. Reason: Fixed links to forum threads
Old 01-12-2005   #5
orion
 
Join Date: Jun 2004
Posts: 1,044
Just stopping by

I'm just stopping by before I no longer have time to follow the SEW forums over the next few days.

Both posts above are good and on-topic.

This thread http://forums.searchenginewatch.com/...ead.php?t=1014 shows how to grab bare-bones results from Google. No proxy needed and you don't need to enable cookies.

To spam a system, one could be tempted to design a page advertising a product/service, then copy/paste some SERP (play with L) for your intended keywords and check how it ranks. One of many experiments....

The goal is not exactly to spam an engine but to test how recursion with resized copies of SERP can be used to test the scoring performance of an IR system to

1. make it better
2. understand its recall/precision behavior
3. see how it scores SERPs from other IR systems or engines

It just happens that the very same experiments can be used to expose the "infancy" that affects crude scoring-based systems, as Nacho rightly pointed out.
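
A toy version of such an experiment might look like the sketch below. Everything in it is an assumption for illustration: L is read here as the number of SERP entries pasted into the page, and tf_score is a deliberately crude term-frequency scorer, not the scoring of any real engine. It only shows how a naive score responds as more of the system's own output is fed back as input.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_score(query, document):
    """Crude relevance score: share of document tokens that are query terms."""
    doc = tokens(document)
    if not doc:
        return 0.0
    terms = set(tokens(query))
    hits = sum(1 for t in doc if t in terms)
    return hits / len(doc)


def page_with_serp(ad_copy, serp_snippets, L):
    """Unrelated ad copy followed by the first L result snippets."""
    return " ".join([ad_copy] + serp_snippets[:L])


def score_by_length(query, ad_copy, serp_snippets):
    """Score of the doctored page for L = 0 .. len(serp_snippets)."""
    return [
        tf_score(query, page_with_serp(ad_copy, serp_snippets, L))
        for L in range(len(serp_snippets) + 1)
    ]
```

Running score_by_length against a scorer of interest, rather than this toy one, is the part that would actually test the system.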

I feel recursive approaches can be used to expose and improve not only retrieval models but also link models; i.e., a web surfer moving through a pre-patterned link structure that is also statistically self-similar (reduced copies of itself at limited length scales) can hardly behave as a pure random walker, especially if the links are on-topic.


Orion

Last edited by orion : 01-12-2005 at 01:10 PM. Reason: typo
Old 01-12-2005   #6
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Spam analysis and devaluation by the major search engines shouldn't be too hard. If you analyze 100 "spam" pages against 100 non-spam pages, you'll find characteristics that catch 90% of the spam pages - grammatical consistencies between pages, use of certain types of punctuation or lack thereof, stopword frequency, use of style sheets and types of html, duplicate content, etc.

I think a careful analysis would reveal dozens of "spam" trends that could then be used to filter out these pages. Hopefully, this is in the works.
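
A few of those characteristics are easy to compute. The sketch below is a minimal illustration, assuming a tiny made-up stopword list and 3-word shingles for the duplicate-content check. The feature names are hypothetical, and no thresholds are given because none are stated in this thread.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def words_of(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z']+", text.lower())


def page_features(text):
    """Stopword frequency and punctuation share for one page of text."""
    words = words_of(text)
    chars = [c for c in text if not c.isspace()]
    stop = sum(w in STOPWORDS for w in words)
    punct = sum(c in string.punctuation for c in chars)
    return {
        "stopword_ratio": stop / len(words) if words else 0.0,
        "punctuation_ratio": punct / len(chars) if chars else 0.0,
    }


def shingles(text, k=3):
    """Set of k-word sequences occurring in the text."""
    words = words_of(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def duplicate_ratio(text_a, text_b, k=3):
    """Jaccard similarity of k-word shingles, a simple near-duplicate measure."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Comparing these numbers across a labelled set of spam and non-spam pages is the analysis being proposed; the code only produces the raw features.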
Old 01-12-2005   #7
seobook
I'm blogging this
 
Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
Quote:
Originally Posted by randfish
Spam analysis and devaluation by the major search engines shouldn't be too hard. If you analyze 100 "spam" pages against 100 non-spam pages, you'll find characteristics that catch 90% of the spam pages - grammatical consistencies between pages, use of certain types of punctuation or lack thereof, stopword frequency, use of style sheets and types of html, duplicate content, etc.

I think a careful analysis would reveal dozens of "spam" trends that could then be used to filter out these pages. Hopefully, this is in the works.
many good spam techniques are patterned after WHAT IS WORKING RIGHT NOW. sure old spam will stick out, or some of the search results snippets might stick out, but there will always be some techniques that are not easy to spot.
__________________
The SEO Book
Old 01-15-2005   #8
orion
 
Join Date: Jun 2004
Posts: 1,044
Patterns, everywhere

Good points, randfish and seobook.

BTW, I have updated and expanded content of link in post #1.

Once we have an eye for fractals, discovery, analysis and interpretation can be rationalized in terms of the underlying motifs and IFS. This is a useful approach since apparently dissimilar objects and processes can be derived from slightly modified initiators.

That we are surrounded by fractals and motifs is evident. In the particular case of Web patterns, these can be found not only in spam techniques but also in

1. structures and processes that conform to binary decision trees.

2. Web architectures such as themed sites/documents, categorized directories, search databases, etc.

3. clustering techniques that require the use of dendrograms, multidimensional matrices, etc.
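
For readers new to the term, an IFS (iterated function system) is a set of contraction maps applied over and over to a starting shape, the initiator. The sketch below is a minimal illustration, with a Sierpinski-style motif of three half-size copies chosen purely as an example; the names make_map, SIERPINSKI and iterate_ifs are made up for it.

```python
def make_map(scale, dx, dy):
    """Affine contraction: shrink a point by `scale`, then shift it by (dx, dy)."""
    return lambda p: (scale * p[0] + dx, scale * p[1] + dy)


# Motif: three half-size copies arranged as a triangle.
SIERPINSKI = [
    make_map(0.5, 0.0, 0.0),
    make_map(0.5, 0.5, 0.0),
    make_map(0.5, 0.25, 0.5),
]


def iterate_ifs(maps, initiator, depth):
    """Apply every map to every point, `depth` times over."""
    points = list(initiator)
    for _ in range(depth):
        points = [f(p) for f in maps for p in points]
    return points


# Each pass replaces the current point set with three reduced copies of itself.
points = iterate_ifs(SIERPINSKI, [(0.0, 0.0)], 4)
```

Changing the maps or the initiator slightly and re-running iterate_ifs gives a different-looking object from the same procedure, which is the point made above about modified initiators.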

Cheers

Orion

Last edited by orion : 01-15-2005 at 10:09 AM. Reason: typos
Old 01-15-2005   #9
orion
 
Join Date: Jun 2004
Posts: 1,044
Done, already.

Quote:
Originally Posted by Nacho
Very interesting post! I can see how a web developer may dynamically generate a page perhaps using the Google API with respect to the relevant keyword in question for optimization to use it as content and outgoing links to gain rankings on the SERPs. If this "output" is re-organized in such a way, yes Google would of course see it as relevant content and outgoing link relationships to more relevant content. Therefore awarding it points for a higher ranking. Are these points enough?
Already happening. Try a search for “fractal spam” (with/without quotes). SEs can be spammed with/without an API, too. I have seen other cases in which even the SERP links are rendered inactive in the "relevant documents", or the embedded results are disguised/mixed to look like blog entries. I’m not sure if this started after this thread, but some are getting creative.


On a research note that has nothing to do with spam (and before I keep packing): for those interested in fractal geometry applied to IR, an example is given in the Visualization Tools for Self-Organizing Maps paper. I have a collection of similar papers.


The referenced paper abstract reads

"Self-organizing category map is identified as a powerful tool for information summarization. However, visualizing a large-scale self-organizing map in a restricted size of window is difficult. For smaller regions, displaying labels is infeasible. In this paper, two visualization tools, fisheye view and fractal view, are presented. It assists users to visualize a large-scale self-organizing map geographically and semantically."


One of the researchers writes

“The fractal views is developed based on fractal theory. By regarding the information structures of the complex objects, the objects are abstracted. Users select the threshold to control the amount of information displayed….The current fractal view is computed by physical method, that is, the physical distances among neighbors. The future fractal view will add semantics method, so the logical Eucledian distances among category keywords are used.” (http://www.cs.hku.hk/~yang/visualization/frac.htm) More info available at his home page http://www.cs.hku.hk/~yang/


Orion
Old 01-15-2005   #10
Mikkel deMib Svendsen
 
Join Date: Jun 2004
Location: Copenhagen, Denmark
Posts: 1,576
Although not directly related to the topic, I think it is necessary to point out that this sort of scraping of SERPs and republishing, even in a reorganised format, violates the copyright of both Google and the websites in the results (that you scrape) in several countries I am aware of.

It is, however, very likely that you would get away with this for quite some time, but if a person like Orion, me or (one of many) others with the qualifications backtracks this for a client, you are F...'ed. I know how much I've got out of a few copyright cases here - how much it cost the offenders. This is very REAL.

So, if you do this you better have a good lawyer and plenty of cash in the bank if you live or have assets in any of the countries that have copyright laws that cover this situation.
Old 01-15-2005   #11
mcanerin
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
Mikkel is correct, I believe.

This is very similar to the telephone book cases. A long time ago, there were numerous lawsuits over telephone books - many "entrepreneurs" would basically take the listings in a phone book and, believing that the phone numbers and personal information were not owned by the phone company, reprint them and sell them with their own ads.

This went to court.

The courts have held, in every jurisdiction that I am aware of, that although the actual phone numbers and names were the property of the people in question, the act of collecting, processing, compiling and presenting this information was protected.

If the competing phone book companies went out and collected the names themselves (door to door, or whatever), then the original phone company could not do anything about it (there were cases about that, too), but you can't just take someone else's hard work, reprint it, and sell ads on it without permission. The act of stealing it from the original compiler and making a few changes was not considered sufficiently unique.

I see a direct analogy to search engines - the individual website owners own their websites, domain, and content. A search engine is allowed to collect publicly accessible information (i.e. anything not behind a login or prevented by robots.txt, etc.) and compile it.

There is no problem with a second search engine doing the same, as long as they use their own spider and compilation system, even if the result may be identical at times.

There is a HUGE problem with taking the listings from a search engine and using them without permission, however. That's like photocopying someone's book and trying to resell it as your own.

IMO, results scrapers are illegal in most jurisdictions, and will probably be considered illegal in the rest as soon as the courts there get around to making a ruling. You are directly benefiting at the expense of the people doing all the work. Put advertising on there and you are in real trouble - that revenue belongs to whoever you stole the content from.

I'm not sure how this applies to meta search engines - I think it's pretty borderline. Technically they are using their own algo to re-organize the results, so that may be sufficiently unique, but it's iffy. They would be best off getting permission (and probably revenue sharing).

A search engine is different from a phone book, but I think that there are enough similarities for a court to apply stare decisis based on established copyright law.

My opinion,

Ian
__________________
International SEO
Old 01-15-2005   #12
Mikkel deMib Svendsen
 
Join Date: Jun 2004
Location: Copenhagen, Denmark
Posts: 1,576
As far as I understand from following local cases (not being a lawyer), I believe there are two possible violations in play if you use this scraping/republishing:

1) The search engine. In Denmark, as well as other countries, collections of data - databases - are protected, even though the data (phone numbers, websites or similar) does not actually belong to the DB owner. It is the collection that is protected.

2) Copyright owners of text in search results. You do have the right to quote (I think the laws here might differ a bit from country to country) but, at least here, it requires an act of "good intention", which this will absolutely not fall under. So, even though you would have the right to hand-pick the same title and description of a specific search result and use it as a quote in your blog, for example, I believe that using it in an automated fashion in this way would not be legal. In the Newsbooster case the "repetitive and automated" nature was in fact one of the key issues that made the quoting Newsbooster did illegal. And in that case Newsbooster actually provided value to the users, which I think should have given them the right to "good intentions", but it did not. So the SERP/scraper spam is, as far as I see, not going to pass either.

On top of this comes the brand damage it can do to your company - or your client. I definitely do NOT recommend this sort of tactics for any major, and valuable, brand. If you really have to do this, then at least set it up so that there is NO WAY to track it back to your company.

If you want to rob a bank, at least wear a mask

Last edited by Mikkel deMib Svendsen : 01-15-2005 at 04:51 PM.
Old 01-16-2005   #13
orion
 
Join Date: Jun 2004
Posts: 1,044
Spam and ethics

And my pc said to me "stop what you're doing and post".

Excellent points, Ian and Mikkel. Intention is the right word.

I’m neither in favor of nor against scrapers, nor do I justify spam. As a scientist I need to see both sides and formulate an unbiased examination. Still, at times I see a bit of human hypocrisy in search engines and choke on the words “collecting” and “scraping”.

In my opinion, search engines are the big scrapers of information and data. Without asking for permission, they are scraping servers and content owned by others all over the place. Yeah. And some still believe in the robots.txt farce (oops, file).

Some search engines are specialized scrapers that indeed scrape and republish images, news, headlines, phones, addresses, brands, even SSNs (if you know how to search for them), etc., and still get away with all their “crapping scraping”.

Then along come some guys and do the same to them. Then we see different territories and court systems taking sides and hammering some but not others. What hypocrisy, and what a mess!

Let X = SEs and Y = average Joes

The truth is the “good” and big X and the “bad” and little Y scrapers will always find their way to get what they want.

Recursively spamming a search engine with reduced copies of its own search results posted after spam material in a document is just one way of illustrating the misery of both sides. Each side owns its misery because

1. current SEs are not smart enough
2. current spammers are miserable enough

This also shows how much work needs to be done to prevent abuse from both sides and from information placed in servers.

As for working around legal issues, here is another trick that is not new. Someone could query a search engine, check the top N search results. Instead of copying/pasting relevant results from SERPs, he/she could use a “search engine approach”. He/she could visit the listed sites and “collect” -as a SE would do- the important elements/info.

Then -and as a SE would do- he/she may reformat, sort to his/her heart's content and include the “collected” data or a portion of it in the intended spam document. No need to do this repetitively and automatically. If a search engine (the same target or another) now scrapes his/her server, indexes the doctored document, and scores relevancy based on data interpreted as “relevant”, who is to blame?

A lot of hand work, but easy to do with a “crawler” (“scraper”!?) in “no auto” mode. No need to copy/paste SERPs at all, and the final result is not a carbon copy of SERPs. At the end of the day I wonder if the effect is not the same.

Let’s not be hypocritical, please. How many are doing this right now? I don’t know. I wonder if some small vertical directories/portals/site map architectures or enterprise databases were/are constructed using a derivative of this or using home-made or form-based “crawlers” (“scrapers”!?).

To me the glass looks half-full/half-empty. I see this and similar spam tricks more as an ethical issue than anything else. One in which, as Mikkel excellently pointed out, the “INTENTION” is what matters. So a lawyer may want to prove that there was this or that intention, and the defense may want to prove that the client had a different intention during his/her spare time.

Collecting, scraping, publishing, spamming, crawlers, and scrapers. What a mess.


Orion
Old 01-16-2005   #14
mcanerin
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
It's a very valid point that many people will find links they like through a search engine, and then link to that site, which creates a bit of a feedback loop.

Link building springs immediately to mind.

What do you do if you want to show up better in Google? Why, you get links - good links.

How do you know what a good link is? Why - you ask Google. After all, that's the only opinion that matters if your stated intention is to rank well with them.

So where do you find links Google likes? On Google, of course! It's not fractal in this case (not identical listings, which was the OP's reason for posting) but self-referencing nonetheless.

I seem to remember a recent article about the rich getting richer. Remember Mike Grehan's article "Filthy Linking Rich"? It addressed this issue.

http://www.e-marketing-news.co.uk/Oc...chLinking.html

Ian
__________________
International SEO
Old 01-16-2005   #15
Mikkel deMib Svendsen
 
Join Date: Jun 2004
Location: Copenhagen, Denmark
Posts: 1,576
I agree with you on much of your post, Orion, except for your risk evaluation of the real legal risks at stake. There have been at least a couple of companies completely closed down in Europe over this or similar issues. It is very real, and I am not sure you could rightfully label it an equal fight.

One thing is to discuss this as theory - I think that's perfectly fine. But certain things should just not be tested (unless you protect yourself really well or are ready to pay the fine). Hacking is a good example. Just recently a guy made a post in a public forum in Denmark (owned by IDG Publishing) about a security hole at a company. Other readers of the forum exploited the hole - used it. In the later case the guy that posted the security hole went free, but the guys that followed his advice did not.

I don't think there is a problem with the discussion of acts that might be illegal but I think it is wise to let people know of the possible consequences if they take it from there and start doing this for real.
Old 01-16-2005   #16
orion
 
Join Date: Jun 2004
Posts: 1,044

Fair enough, Mikkel. One, as X, Y is responsible for what he/she/it does. No advice given to the contrary.

Orion

Last edited by orion : 01-16-2005 at 02:23 PM. Reason: tyos
Old 01-16-2005   #17
Mikkel deMib Svendsen
 
Join Date: Jun 2004
Location: Copenhagen, Denmark
Posts: 1,576
> No advice given to the contrary

I know. Just wanted to emphasize it
Old 01-17-2005   #18
orion
 
Join Date: Jun 2004
Posts: 1,044
Recycled Relevancy Scenarios

I know, Mikkel. You and Ian are well intentioned and want the best for clients and the industry in general. Regarding the line that reads “your risk evaluation of the real legal risks at stake”:

No “risk evaluation” was given. Hypothetical scenarios were described and these should not be taken for any “risk evaluation” or “legal advice”. The scenarios above were properly qualified, I think. Legal is probably Ian’s waters or he could point others in the right direction as he probably no longer practices, I think.


“Collecting”, “Scraping”

Ian, you have mentioned a good point; i.e., using a search engine to find relevant links. Some call this link building. Fine with me.

“It's a very valid point that many people will find links they like through a search engine, and then link to that site, which creates a bit of a feedback loop.”

Scraping SERPs, either in a selective fashion or in full mode, is still scraping, in my view.

Building programs for the purpose of scraping search results in a systematic and automated fashion is even done by IR researchers. No secret here. In Hilltop: A Search Engine based on Expert Documents, Bharat and Mihaila write

“We then used a script to submit each query to all four search engines and collect the top 10 results from each engine, recording for each result the URL, the rank, and the engine that found it.”

Intention is what counts here. That is, what is done with the result. But this is a very subjective area. Why, when, how or by whom this happens is someone else’s guess. If you ask me I would probably say that “collect”, “scrape”, and similar terms make no difference, at least to me.

Some purists may argue this is done strictly for research to achieve/assess content relevancy or for the purpose of building or improving a commercial product. A little guy may argue he does it to provide the very same relevant content to his/her visitors. Some may even argue that settings and intentions do not necessarily go hand in hand.

I feel that at the end both the IR scientist and the marketing researcher get the same thing in the form of relevant links, urls, keywords, etc. Scraping the scrapers makes no difference to me. However, I would not promote, suggest, or advise others to spam, game or deceive a search engine or IR system. No way. If someone thinks this is/was the thesis of this thread, he/she is reading the wrong thread. Each individual or company is responsible for his/her/its actions.


RECYCLED RELEVANCE and The blog connection

It is evident these days that recycled relevance (not to be confused with “relevance feedback”, which is a query expansion technique) affects search engines, whether in the form of fractal spam or of looped relevance from selectively scraping SERP content.

Some bloggers have discovered how to game SEs. I seem to remember a recent scenario.

A blogger finds that a document (from a site or a forum) is a hot topic. The document also ranks high for a given query. Then he/she enters the link into his/her blog or a new blog document. Soon the entry is replicated through the intended blog network. It only takes a few days to see that his/her new or doctored document ranks high for the original query, pushing down the other site’s rank. Whether this is done by systematically or selectively scraping the target search engine for the intended query, or done unintentionally (no spam or gaming intentions involved, but as a sincere service), is irrelevant, in my view.

Whether the blogger’s intentions were good or bad, the truth is that often the target SE ends up recycling relevance. In some cases no search engine optimization services are needed to rank the indexed document high, just a distributed link network for the relevancy loop to do its damage. Fractal spam? Not in this case. Still, a search for “fractal spam” or for your favorite hot forum (with/without quotes) illustrates the point.
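
To make the loop concrete, here is a toy model under loud assumptions: pages are ordered purely by the number of distinct pages linking to them. That is a stand-in for illustration only, not how any real engine ranks, and all page names and link counts are invented.

```python
from collections import Counter
from typing import List, Tuple

Link = Tuple[str, str]  # (source page, target page)


def rank_by_inbound_links(links: List[Link]) -> List[str]:
    """Order pages by how many distinct pages link to them (toy signal only)."""
    inbound = Counter(target for _source, target in set(links))
    return [page for page, _count in inbound.most_common()]


# The hot forum thread has three ordinary inbound links, plus one from
# the blogger's new post that cites it.
links_before: List[Link] = [
    ("site1", "forum/thread"),
    ("site2", "forum/thread"),
    ("site3", "forum/thread"),
    ("blog/post", "forum/thread"),
]

# The entry is then replicated across a hypothetical five-blog network,
# each copy linking back to the blogger's post.
replicated: List[Link] = [(f"network-blog-{i}", "blog/post") for i in range(5)]

print(rank_by_inbound_links(links_before))
print(rank_by_inbound_links(links_before + replicated))
```

In this toy setup the blog post ends up with five inbound links against the thread's four, so it moves ahead of the document it borrowed its relevance from, regardless of what the blogger intended.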

Re-scaled or recycled relevancy is now everywhere. How to stop or detect it? Those are questions for another twenty bucks.


Orion

Last edited by orion : 01-17-2005 at 12:13 PM.
Old 01-17-2005   #19
Mikkel deMib Svendsen
 
Join Date: Jun 2004
Location: Copenhagen, Denmark
Posts: 1,576
I don't think we really disagree, Orion

But to add to the confusion around the legal aspect, I have often found that the strangest arguments are accepted in court cases and often seem to win them. The Newsbooster case from Denmark was such a wonderful example of that, so I will use it again.

The CEO of one of the largest newspapers testified (I was there!) that even though Newsbooster only showed a title and snippet (exactly like any search engine) with proper linking back to the newspaper, they did not like it. In fact, he said (something like, translated): "We lose money on the website, so in effect they are stealing money from us with every visitor they send" - and, there is no law against having a money-losing website, so that became part of the reason Newsbooster lost.

I know this example doesn't have much to do with scraping, and I am sorry for that. It is just such a good example of how absurd these kinds of cases can be and what stupid arguments can win them. It makes it very hard to estimate in advance how any case will go. I am sure Ian knows examples to illustrate this even better.
Old 01-17-2005   #20
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
Quote:
If the phone book companies went out and collected the names (door to door, or whatever) then the original phone company could not do anything about it (there were cases about that, too) but you can't just take someone elses hard work, reprint it, and sell ads on it without permission. The act of stealing it from the original compiler and making a few changes was not considered sufficiently unique.
I disagree with mcanerin's spin on this issue. Let me try my own spin.

You have a rather one-sided interpretation of the famous 1991 Supreme Court decision, Feist Publications, Inc. v. Rural Telephone Service Co., Inc.

The Supreme Court said, "Rural's white pages do not meet the constitutional or statutory requirements for copyright protection. While Rural has a valid copyright in the directory as a whole because it contains some forward text and some original material in the yellow pages, there is nothing original in Rural's white pages. The raw data are uncopyrightable facts, and the way in which Rural selected, coordinated, and arranged those facts is not original in any way. Rural's selection of listings -- subscribers' names, towns, and telephone numbers -- could not be more obvious, and lacks the modicum of creativity necessary to transform mere selection into copyrightable expression. In fact, it is plausible to conclude that Rural did not truly "select" to publish its subscribers' names and telephone numbers, since it was required to do so by state law. Moreover, there is nothing remotely creative about arranging names alphabetically in a white pages directory. It is an age-old practice, firmly rooted in tradition and so commonplace that it has come to be expected as a matter of course."

Google would have to argue that their ranking is the original feature. They would have a hard time arguing that their crawling -- the act of programming bots to go out and scrape 8 billion pages, most of which are copyrighted, without the express prior permission required by U.S. copyright law -- deserves to be applauded by the courts.

But link popularity itself is not that original -- it came from the academic world. Remember, we're not talking about a scraper stealing Google's algorithm, we're only talking about a scraper displaying the results of that algorithm. Moreover, Google has already claimed in a prior court case that PageRank is merely their "opinion" of a page, with no real objectivity behind it, and therefore protected by the First Amendment. Finally, their ranking is nothing to write home about these days -- other engines are doing it much better.

What's original then -- attaching ads to the results? Okay, maybe, but I doubt it. Besides, I'm not scraping the ads -- I hate them and avoid them like the plague.

Therefore, the case is already weak for Google. Now add the "fair use" provisions on top of it. A certified nonprofit scrapes Google in a very limited fashion, for noncommercial reasons that have to do with Google's violation of the public trust through their outrageous privacy policies. Now you can see why Google is not likely to sue me. Canadian and European law may be different, but Google is in the U.S. and so am I. And as far as I know, the IP addresses I'm using for Google are U.S.-based.

So sue me, Google. With any luck I'll be able to investigate, through discovery proceedings, the extent to which their ranking is original, and the extent to which Google is collecting and storing personal information on those who use their search engine.