Time For An Indexing Summit?
dannysullivan
01-05-2005, 06:57 AM
Bloggers continue to be upset over comment spam. Forum owners, those with guest books and others have dealt with the problem of link drops for ages. Web site owners have a host of other indexing problems and needs they'd like dealt with. The last time search engines came together to consider how to index the web in a coordinated fashion was 1996! Isn't it time for new developments to be considered? And what types of special tags or features would you like?
My blog post today (Comment Spam? How About An Ignore Tag? How About An Indexing Summit! (http://blog.searchenginewatch.com/blog/050105-055807)) looks at the issue in more depth. Feel free to read if you want more background. But definitely please contribute here about what you'd like to see offered to site owners.
I, Brian
01-05-2005, 08:33 AM
I still think the entire issue boils down to one single feature: publisher responsibility.
Search engines will simply index what is published - it is entirely up to webmasters to take responsibility for what they publish - and if there is a problem with the process then it is up to the software developers themselves to help make it easier for users on the issue of perceived spam.
Publicly publishing content of your own to the internet is one part of the responsibility - but when anybody creates opportunities for third-parties to upload their own material to be published on a site, then it is completely up to the site webmaster to continue to take responsibility for it on their site.
That means working actively and proactively.
I admin multiple forums - part of the remit for myself and the forum staff is to provide a moderated community, with responsibility for what is published - that means removing obvious advertising, objectionable material, and anything else in clear and persistent violation of whichever applicable TOU is in force. I accept responsibility for what I publish on my domains.
Blog software has been a big problem - firstly, because the blog developers never took it upon themselves to install key default safety features such as redirected URLs and disallowing HTML in comments.
Six Apart recently moved to try and reduce the load on their servers from comment spam - especially automated scripts - but they didn't really tackle the key issues of how to prevent the problem in the first place - just load issues.
Ad tags and the like I find to be an utter distraction, and simply a way in which software developers and end users can avoid taking proper responsibility for their own publishing.
dannysullivan
01-05-2005, 09:56 AM
I still think the entire issue boils down to one single feature: publisher responsibility.
But how can you be responsible when some of the things you want to do might be viewed as spam?
I might want to be responsible by not "publishing" my site navigation or blog comments to a search engine. I can't do that. If it's on a page, I'm supposed to show the entire page to the search engine. If I strip portions of it, then I get accused of cloaking :)
Search engines have provided virtually nil input into helping content management systems and webmasters become more creative or sophisticated with what they deliver.
greenleaves
01-05-2005, 12:35 PM
I might want to be responsible by not "publishing" my site navigation or blog comments to a search engine. I can't do that. If it's on a page, I'm supposed to show the entire page to the search engine. If I strip portions of it, then I get accused of cloaking
Search engines have provided virtually nil input into helping content management systems and webmasters become more creative or sophisticated with what they deliver.
The thing is, if the SEs allow people to publish certain parts of a page, and others not, then that opens up a loophole which spammers can take advantage of. The SEs are looking out for themselves. They try to deliver the most relevant results in the most convenient manner (by convenient I mean for themselves and for the end-user). They owe us nothing. Why would they go out of their way to help publishers... it's in our interest (more than theirs) to be in their index. 6,000,000,001 or 6,000,000,000 pages indexed, what's the difference to them?
Publishers and SEs need each other, but it is the SE that has the upper hand, always will. It would be nice of them to give us more options, but I don't think they are going to be willing to spend the man-hours developing and researching something that, as far as I can see, would not bring them any direct benefit, only potential losses.
Mikkel deMib Svendsen
01-05-2005, 01:11 PM
Danny, I certainly think such a summit would be great - I am just not sure the time is right. I am afraid that the general trust between the different participants in such an event is just not good enough at the moment. There are too many hidden agendas, general mistrust and bad behaviour (on both sides). I think we need a better environment for a summit like that to actually come out with valuable results. But, I like the idea :)
Chris Boggs
01-05-2005, 01:50 PM
Danny, I certainly think such a summit would be great - I am just not sure the time is right. I am afraid that the general trust between the different participants in such an event is just not good enough at the moment. There are too many hidden agendas, general mistrust and bad behaviour (on both sides). I think we need a better environment for a summit like that to actually come out with valuable results. But, I like the idea :)
Thanks again Mikkel for a piercingly accurate description of the state of our industry. I too wonder who could be a part of this summit, other than perhaps Danny Sullivan who may be the only person in the world truly considered to be "impartial" about search engine rankings and companies within the industry.
I am of the opinion that mature and business-evolved people can be found to participate in such a summit. Perhaps it could be moderated by the new voice of Google: Leslie Stahl :p
rustybrick
01-05-2005, 02:45 PM
I think there is more than one individual in this world who is impartial and unbiased in this regard.
I think the time is right and let Danny Sullivan pick those who should manage/lead the summit. ;)
Mikkel deMib Svendsen
01-05-2005, 03:05 PM
I don't think it has so much to do with being impartial - I don't think speakers or panelists of such a summit have to be that but they have to WANT the summit and WANT to truly share great ideas for the better of all. If we don't have the right degree of trust and honest will between the "brains" in the summit I am just not sure how good the outcome will be.
I very much like Noel McMichaels "Search Ecosystem" ideas but the realities of the industry right now are that some participants are pissing in the water others have to drink... so to speak :)
But having said that I would not hold anyone back that wanted to start such an event or argue against actually doing it. We just have to be realistic about the results. Who knows, hopefully in time we WILL get a better environment than we have now - some might say, a more mature one :)
I, Brian
01-05-2005, 05:43 PM
But how can you be responsible when some of the things you want to do might be viewed as spam?
That's why I mentioned about developers helping webmasters make those decisions. Someone may wish to admin a blog, but not have time to deal with the sometimes time-consuming task of removing comment spam - that's why software developers need to address these issues.
If all blog software came with:
redirected URLs
no HTML allowed
registered users only
and there was a campaign to move all users to the new versions, you could strangle the practice of comment spam within 18-24 months.
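As a rough illustration of those three defaults, here is a minimal Python sketch of how a blog package might render a visitor comment. It is only a sketch under stated assumptions: the `render_comment` function and the `/out?u=` redirect path are hypothetical names invented for this example, not taken from any real blog software, and the redirect script itself (which would also need to be kept away from spiders) is not shown.

```python
import html
import re
from urllib.parse import quote

# Hypothetical redirect endpoint; a real blog package would choose its own.
REDIRECT_PREFIX = "/out?u="

# Deliberately simple URL matcher, good enough for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def render_comment(raw_text, is_registered):
    """Render a comment with no HTML allowed and all URLs redirected."""
    # Default 3: registered users only.
    if not is_registered:
        raise PermissionError("only registered users may comment")

    parts = []
    last_end = 0
    for match in URL_PATTERN.finditer(raw_text):
        # Default 2: escape everything the visitor typed, so markup is
        # displayed literally instead of being interpreted as HTML.
        parts.append(html.escape(raw_text[last_end:match.start()]))
        url = match.group(0)
        # Default 1: link through the local redirect script, not directly.
        parts.append(
            '<a href="{}{}">{}</a>'.format(
                REDIRECT_PREFIX, quote(url, safe=""), html.escape(url)
            )
        )
        last_end = match.end()
    parts.append(html.escape(raw_text[last_end:]))
    return "".join(parts)
```

Called with a comment such as `<b>hi</b> http://spam.example/x`, the tags come out escaped and the link points at the redirect script rather than at the spammer's page.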
The question is really why have the developers not addressed the concerns as yet?
Even the recent moves by Six Apart to make comments moderated by default will not apply to the vast number of Movable Type blogs already in operation.
I might want to be responsible by not "publishing" my site navigation or blog comments to a search engine. I can't do that. If it's on a page, I'm supposed to show the entire page to the search engine. If I strip portions of it, then I get accused of cloaking :)
Flash or Javascript your navigation, and make your comments area accessible for registered users only.
That's effectively addressed the spiders, but it hasn't at all addressed the practice.
Search engines have provided virtually nil input into helping content management systems and webmasters become more creative or sophisticated with what they deliver.
It's not the search engines who need to be there - just the software developers with an understanding of how and why their software distributions may be targeted for search engine purposes, and how they can help webmasters responsibly address such issues more easily.
Or am I off-topic now? :)
KeywordMonkey
01-06-2005, 06:35 AM
Just found this - FyberSearch Fights Comment SPAM with a New HTML Tag (http://www.fybersearch.com/blog/news-link.php?post_filename=01-05/2-post.html)
(Hat tip to wherever it was posted first, can't remember where).
dannysullivan
01-06-2005, 10:03 AM
The thing is, if the SEs allow people to publish certain parts of a page, and others not, then that opens up a loophole which spammers can take advantage of.
Spammers already do this. They can use noscript tags, cloaking, noframes areas and so on. I'm sure they can abuse new techniques. But the problem is people who don't want to abuse things but want certain capabilities don't have them at all. Look back at my piece and about Brad Choate. Here's a guy who is cloaking but had no idea he was doing it -- and really, shouldn't even care. It's his content. Yep, we get traffic from search engines. But we should be able to push for changes that will really help content owners, as well.
I don't think they are going to be willing to spend the man-hours developing and researching something that, as far as I can see, would not bring them any direct benefit, only potential losses.
If only on the comment spam thing, Google will need to act or continue to be (unfairly) the whipping boy of all the bloggers upset over it. So they have a public relations interest to begin with. Secondarily, as more and more content spills into the web, they've got an economic reason to try and work more directly with publishers. And they are -- look at Yahoo with its new RSS enhancements, or Google with its print partnerships. What they aren't doing is talking to each other or to the publishers as a whole for any type of overriding stuff that might be needed.
You could strangle the practice of comment spam within 18-24 months.
And that helps the bloggers. Guest book people? People with other indexing problems? To me, the blogger complaints are the visible tip of the iceberg. We have publisher needs, and I think the search engines need to get more coordinated on some publisher solutions.
Tell you what, thinking about it more, I'm almost certainly going to do a session on this at our next show. I'll try to get some reps together on a panel, then let the audience throw out what they want and see if we can't at least discuss some of the problems and practicalities of what's raised. I'm not foolish enough to assume that we'll see immediate changes -- but I'd like to at least see some ideas put out there and be discussed.
I, Brian
01-06-2005, 11:56 AM
Tell you what, thinking about it more, I'm almost certainly going to do a session on this at our next show. I'll try to get some reps together on a panel, then let the audience throw out what they want and see if we can't at least discuss some of the problems and practicalities of what's raised. I'm not foolish enough to assume that we'll see immediate changes -- but I'd like to at least see some ideas put out there and be discussed.
If you do, it would be great to get software development companies up there - holding search engines responsible for dealing with spam issues could be dangerously close to asking them to censor parts of the net - a far more dangerous issue, IMO - better to empower webmasters to "vote" responsibly via linkpop than to ask for their votes to be completely discounted.
2 opinionated c.
orion
01-06-2005, 03:06 PM
I found the following to be great ideas.
1. an indexing summit
2. a special tag for not indexing specific links in a document, as mentioned in the blog.
My comments:
#1 is much needed.
About #2, I looked at the W3C specifications but could not find anything for this type of specific tag. I will check a bit deeper later. Meanwhile, a possible solution occurs to me.
As part of some old HTML DOM and SERPs experiments, back in 1998 I played with making links illegal and checking the effects on SERPs. One way to do this is using nested anchor tags. IE renders the inner link as active, but nested links are still illegal according to the W3C specification. See Nested links are illegal (http://www.w3.org/TR/REC-html40/struct/links.html#h-12.2.2).
For instance, the following links are illegal. To render the urls, I removed the prefix part, but you get the idea (the http prefix and www part should be used in a real example)
<a href='****://***.google.com/'><a href='****://***.msn.com/'>MSN</a></a>
<a href=' '><a href='****://***.msn.com/'>MSN</a></a>
My data is old, so I'm not sure which modern engines are ignoring nested links. Still the inner link (MSN) is active and clickable in IE. (I don't know about other browsers).
Just a plausible solution to the problem of links placed in forums that we don't want to be indexed by search engines.
However, I cannot say whether this idea could or could not be used for spamming, too. :) And due to lack of time, I don't have current data to argue one way or the other.
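For anyone who wants to check their own pages for this construct, the following is a minimal Python sketch that flags anchors opened inside another anchor. It only detects the markup; it says nothing about how any browser or search engine actually treats such links. The class and function names are made up for this illustration, and the example hostnames in the usage note are placeholders.

```python
from html.parser import HTMLParser


class NestedAnchorFinder(HTMLParser):
    """Collect the href of every <a> opened while another <a> is still open."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of currently open <a> elements
        self.nested_hrefs = []  # hrefs of anchors found inside other anchors

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.depth > 0:
                self.nested_hrefs.append(dict(attrs).get("href"))
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1


def find_nested_links(markup):
    """Return the hrefs of all illegally nested anchors in the markup."""
    finder = NestedAnchorFinder()
    finder.feed(markup)
    finder.close()
    return finder.nested_hrefs
```

Feeding it `<a href='http://outer.example/'><a href='http://inner.example/'>MSN</a></a>` reports the inner href, while two anchors placed side by side report nothing.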
Orion
mcanerin
01-06-2005, 03:07 PM
One issue is that there has to be something in it for the software people to want to do it. Generally, the push in software is to add more functionality and options, not less.
First, it gives a competitive edge selling to people who often buy the package with the biggest laundry list of abilities because they are not sure what they want yet, and second, it keeps the programmers employed...
Ian
Mikkel deMib Svendsen
01-06-2005, 03:26 PM
There are plenty of ways to hide a link from being indexed - I don't think that's the issue. The issue is that the methods available could all suffer from editorial or algorithmic penalties. What we need (if we need it) is a recognised tag that we know we can use and recommend clients to use without the fear of getting penalised for the use of it.
Most of the enterprise search solutions (on-site search) I've seen and used have such a tag available. This is the way they most often exclude navigation or other site-wide information that doesn't contribute to the direct understanding of each page.
telNform
01-06-2005, 04:52 PM
An Indexing Summit is a great idea ... I would love to help out.
telNform
greenleaves
01-06-2005, 05:32 PM
Tell you what, thinking about it more, I'm almost certainly going to do a session on this at our next show. I'll try to get some reps together on a panel, then let the audience throw out what they want and see if we can't at least discuss some of the problems and practicalities of what's raised. I'm not foolish enough to assume that we'll see immediate changes -- but I'd like to at least see some ideas put out there and be discussed.
I will be glad to assist. And if any of the topics brought up during the event cause a change (for the better) for publishers, I'll be twice as glad. I guess I just can't help being cynical.
orion
01-06-2005, 07:10 PM
There are plenty of ways to hide a link from being indexed - I don't think that's the issue.
1. I don't think this is about hiding links but about selective indexing according to editorial guidelines and instructing an engine not to index a visible link.
2. Still, I concede such a summit could expose even broader and more important issues than mere tag instructions.
3. Timing. I do feel the time for that kind of summit is right, but it needs to be carefully crafted. The question is how many would be inclined to attend/contribute to it. I know I'm more than inclined to attend.
This post is given in the spirit of Danny's invitation.
My blog post today (Comment Spam? How About An Ignore Tag? How About An Indexing Summit! (http://blog.searchenginewatch.com/blog/050105-055807)) looks at the issue in more depth. Feel free to read if you want more background. But definitely please contribute here about what you'd like to see offered to site owners.
Orion
figment88
01-06-2005, 07:49 PM
I'm not so sure an indexing summit is such a good idea. I think the main problem with SPAM of all types is that the engines already work too much alike.
Back in the olden days when none of the engines worked very well, each at least had very different indexing rules and ranking algorithms. A site could do great on one search engine and be nowhere on all the others.
Now all the engines have pretty much the same secret sauce. Google has toned down the importance of links, but still all major engines favor them, leading to all the types of link spam mentioned. Well, if there was a major engine or engines that didn't care about links there would be less incentive to link spam.
Search engines should look at things like comment spam as golden opportunities to differentiate themselves from the pack.
I, Brian
01-07-2005, 06:50 AM
One issue is that there has to be something in it for the software people to want to do it. Generally, the push in software is to add more functionality and options, not less.
First, it gives a competitive edge selling to people who often buy the package with the biggest laundry list of abilities because they are not sure what they want yet, and second, it keeps the programmers employed...
Ian
When vBulletin were informed of the "memberlist spamming" technique, they very quickly implemented a solution where vBulletin admins could automatically close the memberlist or else only show members who had made x posts. Both empowered the webmaster to combat "memberlist spamming" according to different requirements.
glenn
01-07-2005, 07:49 AM
Yes, a summit and meeting of the minds would be great, but who represents the SEO community? Clarification on this and many other "acceptable or unacceptable" techniques is the issue. We need active communication with the search vendors and a process to document and approve best practices. Does anyone have any sort of online knowledge library plan? SEMPO...SMA-UK/EU...SEOBy...SEW?
dannysullivan
01-07-2005, 09:13 AM
So to clarify a bit more, our next SES show is happening in New York, Feb. 28-March 3.
I'm 99 percent certain I'll add an "Indexing Summit" to the agenda. I'm going to invite the major search engines to sit on the panel. Can't guarantee that any or all will take part, but we have a pretty good track record on panels in general.
I think we'd put to the panel suggestions that come out of both this thread and from the audience. I wouldn't expect them to say -- "Yes, we'll do it!" to anything. But I would ask them for any immediate feedback and thoughts, plus we can go back to the audience as well. The idea would be to at least start the conversation going and help the reps go back with thoughts. Then maybe they might decide to take the next step of actually working together on some solutions.
As for representation, the conference has plenty of SEO folks in attendance. So you can be sure they'll be well represented :)
Pulling from this thread is another way to get viewpoints and though it may also be heavily SEO-influenced, others may contribute.
Overall, I say skip the idea that this is an "SEO thing" or a "blogger" thing or whatever. It's a publisher thing. Anyone who publishes online has various concerns on how they are indexed (indexed -- not ranked. there are concerns about ranking, as well -- but indexing is talking about what actually gets recorded and how that is done).
So as said, a fairly informal start -- but that's because it is just a start.
Amanda W
01-07-2005, 03:38 PM
Having a Summit is a great idea, if for no other reason than to bring the issues to the forefront.
Couldn't agree more that comment spam is an indexing issue brought into view by blogs, just as guest book and forum spam are brought into view by those media.
As the thread has shown there are many issues here:
1) Tweedle-dee - tweedle-dum algos from the engines
2) Publishers urged to clean up their own mess
3) Software for blogs and forums that allow the creation of spam
4) The lack of standards overall, which raises the obvious point that, as with so many other things, there are those who don't care if there are any.
5) Other issues of concern . . .
IMHO, the "Summit" at this point should really provide the contours of the indexing landscape. The solutions rest with all of the parties, but at this point each party sees the problem through the lens of their specific interest -- hence the question on who represents the seo community, the deep dive to html coding solutions, etc. With a 30,000 foot view additional problems/issues might emerge. This might make it easier for all of the parties to see where they fit and how they can contribute to solutions.
Alan Perkins
01-09-2005, 06:01 PM
Here are some ideas for various topics that could be raised. Since search engines basically request pages, index text on those pages and follow links from and to those pages, that's where I think we should start. I believe that the provision of mechanisms like this would allow people and publishers to take more control over the abuse of their sites and systems.
Indexing text: Idea for partially preventing content being indexed
Some search engines already support partial indexing of page content. For example, AtomZ uses the non-standard <noindex> and Fluid Dynamics uses a comment, <!-- robots content="noindex" -->, for standards compliance.
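To make the idea concrete, here is a minimal Python sketch of how an indexer might drop marked regions before indexing the rest of a page. It is an assumption-laden illustration: it assumes the region is closed with a matching `</noindex>` tag, it uses a simple regular expression rather than a real HTML parser, and the `indexable_text` name is invented for this example. It does not describe how any particular engine implements the feature.

```python
import re

# Matches a <noindex>...</noindex> region, assuming a matching closing tag.
NOINDEX_BLOCK = re.compile(r"<noindex>.*?</noindex>", re.IGNORECASE | re.DOTALL)


def indexable_text(page_html):
    """Return the page markup with every <noindex> region removed."""
    return NOINDEX_BLOCK.sub("", page_html)
```

A page whose navigation is wrapped in `<noindex>` would then be indexed on its article text alone, while visitors still see the whole page.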
Following Links: Idea for partially preventing links being followed
A simple attribute added to the anchor tag would help:
<A HREF="http://site/page.htm" ROBOTS="nofollow">
Indexing text and following links: Idea for labelling ads
Either or both of the above ideas could be extended to labelling ads:
<!-- robots content="ad" -->
<A HREF="http://site/page.htm" ROBOTS="ad">
Requesting pages: URL translation
Every time I modify a dynamic site to help search engines cope with it, I wish that I could just instruct search engines how to cope with it directly.
RFC 2396 defines Generic Syntax of Uniform Resource Identifiers (URI) (http://www.ietf.org/rfc/rfc2396.txt). The syntax of a fully qualified Web address is something like this:
http://domain[:port]/[path-to-resource][;path-parameters][?query-parameters]
The main problems search engines have are with the path parameters and the query parameters. These problems could be overcome by, for example, extending robots.txt to specify how to use these parameters using commands such as "assign", "index", "noindex", "follow" and "nofollow".
Example: a search engine sees a link to http://www.domain.com?page=123&value=10&sessionid=123456 and http://www.domain.com/robots.txt contains the following commands:
#define query parameters
#assign value 20
#noindex sessionid
#nofollow sessionid
In this case, the search engine would fetch and index http://www.domain.com?page=123&value=20.
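The translation in that example could be sketched in Python roughly as follows. This is one possible reading of the proposal, not a spec: it assumes "assign" overrides a parameter's value and that a parameter marked "noindex" or "nofollow" is simply dropped from the fetched URL. The function names `parse_rules` and `translate_url` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def parse_rules(robots_txt):
    """Read the proposed comment-style commands from robots.txt text."""
    assigned = {}    # parameter name -> value the site wants used
    dropped = set()  # parameters to leave out of the fetched URL
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue  # ordinary robots.txt lines are not handled here
        words = line.lstrip("#").split()
        if len(words) == 3 and words[0] == "assign":
            assigned[words[1]] = words[2]
        elif len(words) == 2 and words[0] in ("noindex", "nofollow"):
            dropped.add(words[1])
    return assigned, dropped


def translate_url(url, robots_txt):
    """Rewrite the query string of a URL according to the parsed rules."""
    assigned, dropped = parse_rules(robots_txt)
    parts = urlsplit(url)
    query = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in dropped:
            continue
        query.append((name, assigned.get(name, value)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Given the four robots.txt lines and the link from the example, `translate_url` returns the URL with `value=20` and no `sessionid` parameter.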
Now is not the time to publish a full spec, but the use of such techniques would, I believe, remove the need for mod rewrite, ISAPI filters and 404 handlers which are implemented solely to create search engine friendly URLs.
Requesting pages: predefined HTTP request header fields
The robots.txt file could be expanded to define fields to supplement the search engine's HTTP request header, allowing options like Accept-Language and Cookie to be defined by the site rather than basic defaults to be used by the search engine.
Mikkel deMib Svendsen
01-09-2005, 06:27 PM
Good post Alan, you show how simple it could really be.
... the use of such techniques would, I believe, remove the need for mod rewrite, ISAPI filters and 404 handlers which are implemented solely to create search engine friendly URLs.
I do not agree that's the only reason for using URL rewriting. I have done rewriting on a couple of .NET sites primarily to end up with a more user-friendly structure and file naming. The sites could have been indexed without the rewriting but it was done anyway. In other cases it's just been an added benefit. I have also used rewriting on some sites to be able to remap generic areas to IDs that periodically may change, so the URL stays the same even if the system ID changes (for example as a new node in a CMS). On top of this comes all the "funny" stuff you can also do, like dealing with image theft.
Alan Perkins
01-09-2005, 06:34 PM
Yes, I didn't mean it how you took it. :)
I meant the use of such techniques would, I believe, remove the need for mod rewrite, ISAPI filters and 404 handlers on sites where the only reason those things are used is to create search-friendly URLs. I agree they can still be used for other purposes.
mcanerin
01-09-2005, 09:20 PM
I really do think that a "do not index" tag or attribute would be not only very useful to designers, SEO's, site owners and search engines alike, it's also in harmony with the whole idea behind the robots.txt file and robots meta tag.
It's a good thing all around.
The key, I think, is to get the W3C in on this - many designers are (rightfully) proud of having compliant code, and although technically this could be seen as compliant I think a W3C blessing would be incredibly helpful. Additionally, a W3C opinion would carry a lot of weight with designers and programmers alike.
May I suggest that the W3C be told we are thinking about talking about this at the SES? They may be interested in attending, since the experts in the SE industry will be there and a lot could be accomplished very quickly with the right people in attendance.
One of the key goals I'll be pushing for the SMA to do is to join the W3C (or vice versa) for just this type of thing - I already discussed this informally a couple of weeks ago with some members of the NA working group, but it's not something an unelected working group can do.
In the meantime, I'd like to call upon anyone who has a phone number or email address of someone at the W3C who could be useful regarding this to help the industry out. Don't spam them, just let them know.
Even if the result of the meeting is that we should not have this type of tag (and I doubt that will be the result) that would be good information, too.
My personal opinion,
Ian
Alan Perkins
01-19-2005, 05:45 AM
Following Links: Idea for partially preventing links being followed
A simple attribute added to the anchor tag would help:
<A HREF="http://site/page.htm" ROBOTS="nofollow">
Interesting development: Google, Yahoo, MSN Unite On Support For Nofollow Attribute For Links (http://blog.searchenginewatch.com/blog/050118-204728)
For example, this is how the HTML markup for an ordinary link might look:
<a href="http://www.site.com/page.html">Visit My Page</a>
This is how the link would look after the nofollow attribute has been added:
<a href="http://www.site.com/page.html" rel="nofollow">Visit My Page</a>
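On the publishing side, emitting that attribute for visitor-submitted links is a small change. A minimal Python sketch, assuming a hypothetical `comment_link` helper that a blog or forum package might call when rendering a link left by a visitor:

```python
import html


def comment_link(url, text):
    """Build an anchor for a visitor-submitted link, marked rel="nofollow"."""
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        html.escape(url, quote=True),  # escape quotes so the attribute stays intact
        html.escape(text),             # escape the label so it cannot inject markup
    )
```

Calling `comment_link("http://www.site.com/page.html", "Visit My Page")` produces the same markup as the nofollow example above.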
Mikkel deMib Svendsen
01-19-2005, 05:50 AM
Yes, but it just won't stop comment spam as announced :)
Did spam filters stop spam? They probably did limit it some. At least they limited the effect on me, but as long as some people keep responding to spam e-mail and actually buy from it, we will have someone trying to "break" into my inbox.
The same goes for comment spam. As long as there are unprotected blogs, guestbooks and similar available you will have comment spammers. It may work a little less effectively tomorrow than it did yesterday but I am sure that for a long long time there will be plenty of open blogs to still make it a pretty profitable strategy. It's a numbers game - just like e-mail spam. And, the numbers are very high :)
Alan Perkins
01-19-2005, 06:24 AM
Agreed that it won't stop blogs being spammed anytime soon. But in a lot of cases the spam may stop being effective quite quickly. :)
Mikkel deMib Svendsen
01-19-2005, 06:35 AM
I agree, Alan. The question is who it will limit and how much of the comment spam they account for.
I will claim that, just like with e-mail spam, the majority of comment spam is generated by a few very advanced spammers that send out millions of comment spam posts. Today this activity is VERY profitable for them. With this new tag it might get a little less profitable, but as long as it IS in fact profitable such "pro comment spammers" will continue what they do - just like e-mail spammers.
Smaller, less professional, comment spammers might not even know about this tag. How would they? So they may just keep on spamming ... Unless this tag is promoted so much that most people know about it, but I doubt that will ever happen. Look at how long robots.txt has been around and then go and check how many actually know how it works :)
AussieWebmaster
01-19-2005, 02:56 PM
I think if the approach outlined by Alan were merged with Ian's view, and a greater set of instructions were developed for the robots.txt file, it may make it easier for people to install these... otherwise we are going to have to incorporate them into the css pages etc.
There is the need to develop more uniformity, there is the need for search engines, software providers, marketers and publishers all to develop a symbiotic relationship to make everyone's job easier... unfortunately there will always be the 'black hats'... but we know that, and while the battle against that is ongoing it would be great to have a more unified and simplified methodology for those trying to work within the guidelines.
We need better guidelines!!!!
Alan Perkins
01-19-2005, 05:58 PM
The robots.txt file can be extended just by using comments (i.e. lines beginning with a #). All the robots.txt lines I posted in my earlier post were valid. :)
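A quick way to see this is to run such a file through a standard parser. The Python sketch below uses the standard library's `urllib.robotparser`; the `User-agent` and `Disallow` lines, the `/private/` path, the "AnyBot" name and the `can_crawl` helper are assumptions added only so the file has an ordinary rule to test against.

```python
from urllib.robotparser import RobotFileParser

# The first two lines are assumed for illustration; the comment lines are
# the ones from the earlier post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
#define query parameters
#assign value 20
#noindex sessionid
#nofollow sessionid
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_crawl(url):
    """Ask the standard parser whether a generic robot may fetch the URL."""
    return parser.can_fetch("AnyBot", url)
```

The parser skips the comment lines without complaint, so the ordinary rule still applies: `/private/` pages are refused and everything else is allowed. A robot that understood the extra commands could read them from the same file.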
dannysullivan
02-10-2005, 02:43 PM
Thread now closed, as a new thread has been started for ideas to put to the panel. Please see Ideas For The Indexing Summit (http://forums.searchenginewatch.com/showthread.php?t=4139) to contribute.