Search Engine Watch Forums > General Search Issues > Search Industry Growth & Trends > Search & Legal Issues
07-13-2005   #1
Jenstar
Join Date: Jun 2004
Location: Starbucks!
Posts: 345
Implications of the Internet Archive lawsuit

Archive.org was sued over its caching of historic versions of webpages.
Quote:
The Internet Archive was created in 1996 as the institutional memory of the online world, storing snapshots of ever-changing Web sites and collecting other multimedia artifacts. Now the nonprofit archive is on the defensive in a legal case that represents a strange turn in the debate over copyrights in the digital age.
NYTimes
Search Engine Watch Blog
Link to the actual complaint
From Search Engine Watch:
Quote:
At issue, a court case on trademarks where evidence of past usage was found through the Internet Archive. Healthcare Advocates said copies of its pages were made without permission. In particular, Healthcare Advocates says that despite making use of a robots.txt file, there were 92 occasions when its pages still managed to be accessed.

In a further twist, the company claims the law firm getting those pages violated the Digital Millennium Copyright Act provisions of "circumventing" the robots.txt file exclusion.
The Wayback Machine is an important tool for researching copyright infringement, and for proving the first publication date of an article on a specific website. It is the site I recommend to people all the time, and although it is not foolproof, it is a great reference for both trademark violations and copyright infringement.

If you are wondering how to know whether people are checking out your site via archive.org, you can check the referrers for images on your pages, since the archive hotlinks all images in the page copies it indexes. You might be surprised to see how many people are peeking at the older copies of your pages; I have spotted the IPs of many competitors in those image referrals to archive.org.
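For anyone who wants to automate that referrer check, here is a minimal sketch in Python. It assumes your server writes an Apache combined-format access log; the regex, the image extensions, and the function name are all illustrative, not anything archive.org documents.

```python
# Hypothetical sketch: tally which client IPs fetched images from your server
# with a Wayback Machine page as the referrer. Assumes Apache's combined log
# format; adjust the regex and extensions for your own server.
import re
from collections import Counter

# ip ident user [date] "METHOD path protocol" status bytes "referrer"
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "([^"]*)"'
)

def wayback_viewers(log_lines):
    """Return a Counter mapping client IP -> image hits referred by archive.org."""
    hits = Counter()
    for line in log_lines:
        m = LINE.match(line)
        if not m:
            continue
        ip, path, referrer = m.groups()
        if "web.archive.org" in referrer and path.lower().endswith(
            (".gif", ".jpg", ".jpeg", ".png")
        ):
            hits[ip] += 1
    return hits
```

Each hit means someone was viewing an archived copy of one of your pages, because the archived page pulls its images straight from your server.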

I am guessing this is how Healthcare Advocates knew those pages were still available on archive.org for competitors to access, since they state exactly how many pages their specific competitor accessed. If archive.org no longer hotlinked images, it would not be apparent who accessed those historic pages or how often, so proving access would be a lot more difficult, although the archived pages also wouldn't render as well as they currently do.

This will be a lawsuit to watch, to see how it affects how things are done at archive.org and how it keeps older versions of webpages, particularly the hotlinked image situation. It would be unfortunate if it made this tool less valuable for those researching trademark and copyright infringement.

Last edited by Jenstar : 07-13-2005 at 05:53 PM.
07-14-2005   #2
massa
Member
Join Date: Jun 2004
Location: home
Posts: 160
> see how it will affect how things are done at archive.org <

I'm more interested in seeing how it affects others who don't see themselves as violating copyright laws by caching other people's content without any explicit permission.

Wayback Machine is a really neat little website and a lot of fun, but in my opinion, the suit has some merit. At the very least it is a question that needs an answer.
07-14-2005   #3
rogerd
Member
Join Date: Jun 2004
Posts: 109
This is an interesting case in a couple of ways. First, it may create some law on the validity of the robots.txt file. Up to this point, robots.txt has been mainly an "honor system", with well-behaved bots checking it frequently and obeying its directives and bad bots ignoring it completely. If this case turns on the fact that robots.txt was ignored, then it might open up additional grounds for suits, e.g., a search engine displays proprietary content because it didn't check robots.txt (admittedly, a dumb way to protect non-public material), excessive bandwidth consumption by bad bots, etc.
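That "honor system" is easy to see in code. Python's standard library ships a robots.txt parser that tells a crawler what the file asks; nothing in the protocol enforces the answer. The rules below are illustrative, not Healthcare Advocates' actual file.

```python
# The robots.txt honor system in miniature: the parser reports what the file
# requests, but a bot is free to never consult it at all.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: ia_archiver",   # the Alexa / Internet Archive crawler
    "Disallow: /",               # ask it to stay out entirely
    "",
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved bot consults the parser and obeys the answer:
assert not rp.can_fetch("ia_archiver", "http://example.com/page.html")
assert rp.can_fetch("SomeOtherBot", "http://example.com/page.html")
assert not rp.can_fetch("SomeOtherBot", "http://example.com/private/x.html")
# A badly behaved bot simply never calls can_fetch() in the first place.
```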

Equally interesting is the possibility that the court will weigh in on the caching of content from other sites. I'd hate to lose the Wayback Machine, and I'm guessing Google (and others) would hate to lose the ability to store cached versions of pages.

This is probably a good case for those who think caching is OK - the Internet Archive is quite a benign application, and it would be hard to argue about loss of advertising revenue, etc., as one might with Google's cache.
07-14-2005   #4
rcjordan
There are a lot of truths out there. Just choose one that suits you. -Wes Allison
Join Date: Jun 2004
Posts: 279
Everything credible I've seen tends to dismiss the robots.txt part of the complaint because it's a voluntary standard. But a point mentioned in the comments quoted below is where I think the courts are most likely to go hmmmmm, if not in this case, then soon: opt-in vs. opt-out. If anything, I think caching will be forced to ask permission, which will largely kill the practice.

Quote:
I don't care for the complaint as I think that they got it legally and technically wrong.

That said, however, I think that there is something important that is often overlooked in these archiving schemes which does not sit right. Under the WBM's terms of use, the author has to opt out, not opt in. Whether you like the DMCA or not, that doesn't sound like traditional copyright at all. The WBM isn't just excerpting sections; it is copying everything verbatim and redistributing it. Worse yet, it may be "taking" content and authors may not even know it.
Patry Copyright Blog

Research credit goes to techdirt
07-14-2005   #5
PhilC
Member
Join Date: Oct 2004
Location: UK
Posts: 1,657
Quote:
Originally Posted by massa
I'm more interested in seeing how it affects others who don't see themselves as violating copyright laws by caching other people's content without any explicit permission.
That was my first thought too, Bob. I've always held the opinion that any site that displays other people's pages within its own site (e.g. search engine caches) without permission is committing outright theft. Regardless of whether or not it is against any laws, it is outright theft. I've had a big chip on my shoulder about it for a long time.

I hadn't thought about it in the same way with archive.org, but I suppose the same has to be said about that.
07-15-2005   #6
Jenstar
Join Date: Jun 2004
Location: Starbucks!
Posts: 345
Does archive.org send a spider to each page before it displays the requested page to each visitor who wishes to see an archived copy? I had always thought that a robots.txt exclusion took effect on the next scheduled crawl of the site, not immediately. But this complaint seems to say otherwise.

I was reading the actual complaint, and it stems from Healthcare Advocates putting a robots.txt on the site on July 8, 2003, with most of the unauthorized accesses taking place the following day. The complaint also states that the Harding, Earley law firm "hacked" archive.org in order to see those pages, when in actuality archive.org admitted that the fault was its own: its robots.txt checking mechanism was broken.

The complaint does have some interesting info in it, if you are up to reading a 48-page legal document.
07-15-2005   #7
The Generator
Member
Join Date: Aug 2004
Posts: 71
Stepping away from the realm of Internet marketing for a hot minute, Generator thinks that US doctors should simply skip med school and go directly for their MBAs. In this country, you would literally be left to die in the streets if you didn't have insurance. This ignominious state of the medical industry leads me to believe that healthcare professionals are no longer altruistic, but are rather businessmen who are even more hardball than the most hardball of hedge fund managers. Shame on that company for starting that lawsuit, especially considering that the Internet Archive is most likely more useful to society as a whole than they are.

Last edited by The Generator : 07-15-2005 at 10:57 AM.
07-15-2005   #8
massa
Member
Join Date: Jun 2004
Location: home
Posts: 160
>Shame on that company for starting that law suit<

Oh brother! Here we go with the "frivolous lawsuit" rhetoric again.

There is no such thing as a frivolous lawsuit. Given the serious implications of the decision and the resources that have to be committed to even the simplest suit, trust me, it is not frivolous to the parties involved, or to their attorneys, or to the courts.

It is also not something we can shame anyone for taking part in. It is like saying shame on you for feeling that you need to seek legal protection because you feel your livelihood or security is being threatened.

It is also not about "getting your way" or "getting even". It is about protecting the rights of the general public within the confines of the laws on record, because those laws were created to solve a previous problem that threatened the public's well-being.

I don't have any idea who will win this case, if it is even a case that has a winner or loser (it doesn't look to me like it does), but I do know that someone feels wronged, and if we're talking caching, there is no question there is a problem, or at least a need for a look.

Technology has certainly crossed a lot of perceived lines in relation to copyright without so much as a glance at the legality. If technology has a right to ignore accepted practices (maybe even laws) because no one wants to take the time to understand the technology, then what other laws are the rest of us bound to that they are not?

Last edited by massa : 07-15-2005 at 12:17 PM.
07-15-2005   #9
Rob
Canuck SEM
Join Date: Jun 2004
Location: Kelowna, BC
Posts: 234
This is an interesting debate, because here in Canada our lawmakers are reviewing a bill which would essentially do the same thing: make it illegal for services like Google and Yahoo to cache web pages. This would then also make it illegal for the Internet Archive to do the same.

I'm not sure what the impact would be - my thinking is that the site owner would have to bear the burden of proof as well as take any actions necessary to get the cached copies removed before legal action could be taken.
07-15-2005   #10
PhilC
Member
Join Date: Oct 2004
Location: UK
Posts: 1,657
Well done Canada - if they make a law like that, or even clarify existing laws to cover it.

I don't think there would be any burden of proof on the owners though, as the sites that show caches claim that the pages they display belong to other people.
07-15-2005   #11
Everyman
Member
Join Date: Jun 2004
Posts: 133
I know a couple of things about the Archive that may be news to some of you. I'm copying below the paragraphs that I posted on Threadwatch, for your information.

I had a domain blocked for years with an ia_archiver exclusion in robots.txt. Then I sold the domain. The new owner doesn't use robots.txt. All my old pages suddenly appeared at Archive.org.

The crawling for the Archive comes from Alexa, about six months later. You need to do a route-table block on Alexa, because they don't honor any exclusion protocol that I'm aware of. Brewster Kahle founded Alexa, and then sold it to Amazon, but he retains some influence based on the terms of the sale. His Archive is a nonprofit spin-off from Alexa. The funding is complex, but you basically have web services from Alexa going to the Archive, in addition to the crawling after six months.

In my opinion, from looking at Archive.org's Form 990, Kahle's nonprofit is a sneaky nonprofit. Good luck tracing the funds.

When you request a page at Archive.org, it does a real-time check for a robots.txt and provides the page if it doesn't find an exclusion. It also provides the page after a 20-second timeout if it cannot connect. When you think about it, that's just about the only way it could work, because the Archive is in the business of stashing a lot of obsolete pages that will never connect.
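That description suggests a display-time policy along these lines. This is a guess at the logic from the post above, not the Archive's actual code; the function names are mine, and only the ia_archiver token and the 20-second timeout come from the thread.

```python
# Sketch of a display-time robots.txt check: serve the archived copy unless
# the live site's current robots.txt excludes it, and serve it anyway if the
# live site cannot be reached (dead domains have no live exclusion).
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

def may_display(robots_txt, url, user_agent="ia_archiver"):
    """Decide whether an archived copy of `url` may be shown.

    robots_txt is the text of the live site's current robots.txt,
    or None if the live site could not be reached.
    """
    if robots_txt is None:
        return True  # unreachable or dead domain: nothing excludes the page
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def fetch_live_robots_txt(domain, timeout=20):
    """Fetch the live robots.txt; return None on any failure or timeout."""
    try:
        req = urllib.request.urlopen(f"http://{domain}/robots.txt", timeout=timeout)
        with req as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError, UnicodeError):
        return None
```

On this model an exclusion takes effect on the very next request, and pages from dead domains are always served; the complaint's 92 accesses would then come down to the broken checking mechanism Jenstar mentions above.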

It's a privacy violation, in my opinion. I own the domain archive-watch.org, but haven't built a site yet.

I sent a fax to Kahle almost a year ago and asked him to block all of my pages on all of my domains -- past, present, and future. Within two days they were all blocked. He knows he's on shaky ground.
07-16-2005   #12
Mel
Just the facts ma'm
Join Date: Jun 2004
Location: Malaysia
Posts: 793
So what you are saying, Everyman, is that ia_archiver does in fact ignore the robots.txt protocol, spidering and storing the page even when there is a robots.txt exclusion, but then showing the page only if there is no exclusion in place at the time of the request?

This would seem to agree with a couple of sites I looked at. I know they were archived regularly before they were placed off limits by robots.txt, yet at present nothing is shown except a notice that the sites are blocked by a robots.txt exclusion. Strangely, though, the archive offers to show (and in fact does show) a copy of each site's robots.txt, which should itself be off limits, since it is part of the excluded site.
__________________
Mel Nelson
Expert SEO Don't settle for average SEO
Singapore Search Engine Optimization and web design
07-16-2005   #13
Everyman
Member
Join Date: Jun 2004
Posts: 133
Yes, that's what I'm saying. The domain that I sold was pir.org, which I had owned since 1996 (I'm Public Information Research, Inc.), and which I sold to Public Interest Registry when they asked me and agreed to my price. I asked Kahle in my fax to block pir.org when, after the domain transfer, all my old pages started showing up under pir.org. In fact, I could even pull up the old robots.txt for pir.org on the Archive, and it showed that ia_archiver was properly blocked in that file! That's about as close to a "smoking gun" as you'll find on the Internet. The pages were showing up because Public Interest Registry wasn't using a robots.txt.

I asked Kahle to block all my domains, past, present, and future, and asked him to block pir.org up to the date it was transferred. They all got blocked. This kind of blocking is all or nothing per domain: you will note that pir.org is still blocked entirely, even though my fax didn't ask for anything past the transfer date, since that's now none of my business.