Ideas For The Indexing Summit
dannysullivan
02-10-2005, 02:37 PM
At SES NY, we're having an Indexing Summit where the four major search engines will hear and discuss ideas on what marketers want in terms of better indexing. For more background, see Big Four Confirmed For Indexing Summit (http://blog.searchenginewatch.com/blog/050210-134417). In this thread, please share what you'd like considered.
How do they feel about the future of search and search engines in an ever-changing search space?
randfish
02-10-2005, 04:16 PM
Orion recently made a post about a paper from Baeza-Yates on PageRank's (and other global popularity systems') preference for older sites. It might be more appropriate for an engineering summit, but perhaps you could ask if the search engines are addressing methods for making search more temporally aligned with content.
Chris_D
02-10-2005, 10:22 PM
An email address where a webmaster can send an email titled
"Is www.examplesite.com banned?"
and get an answer......
Robert_Charlton
02-11-2005, 01:56 AM
Danny - Thanks for the opportunity to present input.
It would be nice to have some uniform standards across engines on a number of indexing issues. Several have been widely discussed, and you touch on them in your Indexing Summit article....
- the treatments of 302s and meta refreshes...
Yahoo seems to have solved it (I'm not certain of that, but I think they have). No link from another domain should be capable of knocking your url out of the index. How about the rest of the engines going along? Can you imagine the web in a few years if this isn't uniformly resolved?
- the indexing of parked domains...
Registrars and ISPs generally "point" parked domains with 302s to your main domain, but this is not a good idea. I'm sure all of your panelists will understand why, and they're the ones who need to fix it.
- the inclusion in the index of links to blocked pages...
There should be a way to keep these links from being listed publicly without having to block the pages that contain the links. I don't believe there's any clear understanding among some very knowledgeable webmasters, let alone a standard, of how Google, eg, treats the robots meta tag and robots.txt with regard to indexing links. This needs to be uniform across engines, I feel.
At the least, an engine's individual treatment of the above should be public knowledge and displayed on its help pages.
- the use of the no follow attribute...
There's been much discussion of whether it's legit to use this attribute for anything beyond blocking blog spam. I agree with all the concern about using such a link for hoarding PR, screwing your link partners, and distorting the ecology of the web, etc, etc... but, I also think there's a need for a link that is just a link. Lots of clients want to cross-link sites or pages a lot more than I'd recommend, and I feel there ought to be a way for them to do whatever they want in that regard without fear of crosslinking penalties. They'd get no benefits, but they'd suffer no harm. This is not a cut and dried issue, I understand, and I could argue both sides of it myself. In any event, we should know what the no follow attribute can be used for.
- Referral ID strings and referrer info...
Ammon has started an excellent thread on this topic, and I'll simply point a link to it...
http://forums.searchenginewatch.com/showthread.php?t=3796
- hidden scraper links...
I'm not sure that this is an indexing question per se, but it's one that's related to whether links from other sites can hurt you, and as such might be appropriate to the panel. I'm seeing lots of scraper sites that are hiding a lot of outbound links (ie, hidden text) in their content, probably to raise their hub scores while not providing alternatives to the AdSense links on their pages. How worried do any of us need to be about these hidden links if they point at our sites?
- www/non-www ...
And yes, as you mention, the www/non-www issue is a drag. While the mirror site issue can be controlled if you are able to use mod_rewrite, the majority of sites and a great many hosting situations don't offer that control. Even though a simpler way would end up costing me some income, I'd be all for it. There are other problems I'd prefer to spend my time on.
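For anyone without mod_rewrite, the same canonical-host rule can be applied in application code. The following is only a minimal sketch in Python, assuming the www form is the preferred one; the function names `canonical_host_url` and `redirect_for` are made up for illustration, and any user:password part of the URL is dropped for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_host_url(url: str, prefer_www: bool = True) -> str:
    """Return url with its host normalised to the www or the non-www form."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lower-cased
    bare = host[4:] if host.startswith("www.") else host
    wanted = f"www.{bare}" if prefer_www else bare
    # Keep an explicit port if the original URL had one.
    netloc = wanted if parts.port is None else f"{wanted}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path or "/", parts.query, parts.fragment)
    )


def redirect_for(url: str):
    """Return (301, target) when url is not canonical, otherwise None."""
    target = canonical_host_url(url)
    return (301, target) if target != url else None


print(redirect_for("http://examplesite.com/page?x=1"))
print(redirect_for("http://www.examplesite.com/"))
```

A permanent (301) redirect is used in the sketch rather than a 302, since the 302 handling is exactly what is being questioned above.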
Wish there was a paid support system where you could get express answers on indexing questions?
You've mentioned this before, and I've posted that I thought it was unworkable, and that it would greatly un-level the playing field.
You don't know how many times during the sandbox, though, that I've wished there was such a system, but how could you possibly get your real indexing questions answered? If you can make some plausible concrete suggestions, I'm sure they'd make for a great discussion on the forum.
storyspinner
02-11-2005, 09:28 AM
i've been pondering two areas that seem to be growing in seo
1) how to have my site optimized to take advantage of the image searches (and now that Google's adding pictures at the top of some SERPs)... basically... are the rest of the 3 going to follow suit and how do they index for these results?
2) how to have my site optimized to have my site appear in all of the "Answers" areas that are now cropping up. Example: "How Do I Make Beef Vegetable Soup" ... we've got recipes, how can we optimize to work with these areas?
Thanks for the opportunity for giving input! :)
I, Brian
02-11-2005, 09:39 AM
Real channels of communication would be great.
At the moment Google especially lacks communication on how it functions - other than to instruct how webmasters will use the internet, or else put marketing spin on any issues forced into public media.
For example, Google has applied very serious changes to its search engine results. The only public word I've seen from Google is GoogleGuy posting in WebMasterWorld of a GoogleGroups address to e-mail comments to, and even then, he tried to spin it as "spam clearance". No admission that there may be technical issues, no requests for patience while Google engineers figure out where actual issues are.
Webmasters build the internet, Google processes information relating to that - but Google seems to treat the process as entirely one way. Google doesn't seem to realise that a lot of webmasters would be much more supportive of Google if they actually engaged - or held open channels - for dialogue on Google issues. We would have a working partnership - but instead, Google simply issues dictates and we will obey, or else be damned. After all, how could a billion-dollar company "do evil"?
That creates two separate sets of intent for the internet - what Google wants; what webmasters will do. That leads to absolute crap being dumped into the internet in order to satisfy two separate agendas, precisely because there is no agreed convergence of interests.
While Microsoft and Yahoo! have their own aloofness, at least they have human faces in names such as Scoble and Zawodny - whilst all Google has is the mysterious mask of "Googleguy" who drifts in and out of forums, leaving a trail of the most appallingly sycophantic dribble from other webmasters in his wake. That doesn't make for constructive dialogue.
Ultimately, there needs to be some ground laid out for common action and shared interest among Search Engines, SEO's, and Webmasters, because at the moment there are separate sets of interests between all, and this is leading to the complete dilution of quality content on the internet.
And the only way to even create a common ground for action and shared interest is to have constructive channels of dialogue open - no matter the limits.
If Search Engines, SEO's, and Webmasters can only enter constructive dialogue, then perhaps we can all agree certain basics together, which share all of our interests - Search Engines are better helped to separate the wheat from the chaff; SEO's as having real working guidelines where search engines accept that search engines are a real market effect that needs to be considered; and webmasters in general having a clearer idea of how to present useful content in the most useful way.
There are real issues at the moment in various topics, not least that Google guidelines seem to imagine that webmasters should pretend that search engine spiders do not exist - and problems of what constitutes compliance, especially in areas of advertising where useful text link adverts with crawlable links fall in the grey area of "manipulation" as opposed to being genuine means of advertising.
And while it's great that members of search engines will sit on an "indexing committee" and maybe even exchange pleasantries with the plebeian masses, is it simply because they need to find out how to promote their own set of interests, at the exclusion of all others? Or is it a genuine step towards some form of working dialogue, where search engines can acknowledge the formative effect they have on the internet, and work with those who build the internet so as to have a common goal that benefits the internet, search engines, and webmasters? Or are we going to see session spin, where webmasters must obey or else be labelled "spammers" for the smallest of infractions?
The internet is a global community, accessible by nearly 2 billion people on this earth. On the internet we are all equal as surfers or webmasters, regardless of income, race, religion, culture. Yet the keys to the information processing of that equality are held by billion-dollar multinationals.
If these companies wish to promote their services and the internet together, I'm sure the internet majority would be happy to help. But if such companies are seen as authoritarian - even oppressive - in their demands, then the interests of every party will diverge, and we will all ultimately suffer for it.
Even some limited degree of dialogue and communication are the first defence against that.
2c.
I, Brian
02-11-2005, 09:54 AM
Also - get Mike Grehan's complaints about Filthy Linking Rich (http://www.e-marketing-news.co.uk/Oct04/RichLinking.html) up there. A lot of newer sites and webmasters feel highly pressured to attempt more and more risky strategies, simply on the basis that it's harder for good new content to be visible.
Google's apparent sandboxing of new sites is an especially driving factor behind this, and simply encourages more desperate means just to get any kind of visibility - especially where established sites even mentioning a subject in passing makes it rank higher than in-depth commentaries on the same subject.
Also - someone please point out to search engines that if we really wanted to search the Yahoo! directory, Google Directory, or DMOZ, then we would have tried searching those instead of having to rely on a search engine. :)
I would like to know how the up-and-coming trend for geo-redirection will affect globally targeted (and globally useful) sites, and whether there is a sensible way for sites with universal appeal to escape being confined to the index of their host's country.
projectphp
02-11-2005, 12:01 PM
My $0.02:
- A few (more than one) URL variables that are essentially invisible. Don't care what they are, but it would help both parties, as tracking would be better, and there would be less rubbish URLs in the indexes.
- More options to control things e.g. better robots.txt options, and / or a few more metatags, e.g. <meta name="newsbot" content="noindex" /> and no image tags.
- Ratification and standardisation of filetype handling / banning, e.g. the ability to ban crawling of specific filetypes that is agreed upon and not vendor specific.
- Ratification and standardisation of a few vendor specific commands like the MSN robots.txt crawl-delay.
- Ratification and standardisation of current de facto standards e.g. noarchive (never ratified anywhere I can find).
- Some crawler base standard feature that SE compare themselves to, e.g. a document that outlines all the official features (tags etc) and what each does, that an SE can then state which they do and do not support. As an example, the nofollow link attribute would be a yes for G, Y and MSN, but a no for Ask, while noarchive would be a yes for G and Y etc etc. This would help site owners compare what control they do and don't have with a specific engine in a common format. This should be prominently linked to, and should be a proforma that all have the same format to make it easy to understand.
- For all SE to adopt tools similar to what Google has here: http://services.google.com:8882/urlconsole/controller?cmd=reload&lastcmd=login, i.e. the ability to have pages removed within 24 hours. Well handy in more than a few situations, and this should be something we all demand of the SEs.
- A voice for webmasters when making these decisions. We should have some input into this stuff, and the opportunity to be involved in drafting these ideas.
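To make the first item in that list concrete: if a few URL variables were agreed to be "invisible", an indexer could drop them before deciding whether two URLs are the same page. This is only a sketch in Python under that assumption. The parameter names `ref` and `trk` are hypothetical, since no engine has agreed on any such names, and `strip_ignored_params` is an illustrative helper, not any engine's actual behaviour.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical "invisible" variables; nothing like this has been standardised.
IGNORED_PARAMS = {"ref", "trk"}


def strip_ignored_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return url with the agreed tracking variables removed from its query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Both URLs collapse to the same indexable address.
print(strip_ignored_params("http://www.examplesite.com/page.php?id=7&ref=newsletter"))
print(strip_ignored_params("http://www.examplesite.com/page.php?id=7"))
```

The site's own logs would still see the full URL, so tracking keeps working while the index holds a single copy of the page.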
St0n3y
02-11-2005, 12:20 PM
Also - someone please point out to search engines that if we really wanted to search the Yahoo! directory, Google Directory, or DMOZ, then we would have tried searching those instead of having to rely on a search engine.
I second that, this is one of my primary areas of concern.
I also think that the emphasis placed on "authoritative" sites is far too great. When an authoritative site gets a top ranking for keywords it's really unrelated to (from a simple mention on the page), it shows that too much weight is being assigned on that basis alone. Authoritative is great, but it still needs to be relevant.
Black_Knight
02-11-2005, 02:26 PM
Well, Robert already mentioned the Referrer IDs in query strings issue I have (Thanks Bob) so I'll go straight to my other major gripe of the past year:
Could the engines perhaps remember that the important thing about the Robots Exclusion Standard is for it to be standard? Unilateral extensions to the robots.txt commands are breaking the standard, and could all too easily lead to a robots.txt being invalid through having non-standard additions, such as the throttling control ( crawl-delay: (http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm#D) ) added by MSN or the several extensions added by Google (including wildcard support and an explicit ALLOW statement (http://www.google.com/webmasters/faq.html#1)).
If a unilateral addition is necessary, then find a way to ensure the non-standard addition cannot affect robots looking for a fully valid, standard robots.txt
In other words, use the comment demarcations for non-standard additions, just as JavaScript in HTML uses comment tags to hide it from user-agents that don't support JavaScript.
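A quick way to see why the comment approach is safe is to feed such a file to an existing parser. The sketch below uses Python's standard-library `urllib.robotparser` purely as a stand-in for "a robot that only knows the standard" (an assumption for illustration; the paths and the bot name `AnyBot` are made up). Because everything after `#` is discarded, the commented-out extension has no effect on the standard rules.

```python
from urllib.robotparser import RobotFileParser

# A non-standard addition hidden inside a comment line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
# Crawl-delay: 20
"""


def load(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = load(ROBOTS_TXT)
print(parser.can_fetch("AnyBot", "/private/page.html"))  # standard rule applies
print(parser.can_fetch("AnyBot", "/public/page.html"))
print(parser.crawl_delay("AnyBot"))  # the commented-out line is never seen
```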
krisval
02-11-2005, 04:01 PM
A published, somewhat detailed list of best practices covering what NOT to do.
An expanded list similar to what Google has now, but Yahoo and MSN should have one too, along with more detail.
- Don't use this type of redirect
- Don't create duplicate content
- Don't do whatever
I have seen a couple of friends of mine in the local real estate business who had a site created for them, then accidentally did a couple of no-no's like a bad redirect in the site. Innocent, but penalized.
Chris Boggs
02-11-2005, 04:02 PM
I know this is about what marketers want, but perhaps offering the chance for actual users to "Rate" SERPS based on whether the search was useful. If I perform a search and find that three out of five links are not really relevant, I would like to be able to comment on it. This could help improve SERPS one page at a time? Perhaps this would be too micro-level, but I think it could help the SE's get a better idea of trends.
I also agree with St0n3y and I Brian that Google should not come up with a Yahoo! directory page as its number one result...I could have gone there if I wanted that.
projectphp
02-12-2005, 08:23 AM
Some crawler base standard feature that SE compare themselves to ...to make it easy to understand.
Ok, that made very little sense. Let me try again.
It would be super nice if there was some agreed format that outlined what commands each SE has and has not implemented, both standards based and proprietary.
There are currently a multitude of such commands that can be issued to a robotic crawler by webmasters, from standard based robots.txt commands like disallowing folders, to proprietary commands like noarchive and MSNBot's crawl-delay:. Such features are documented at a Search Engine's discretion.
NB: not implying they hide these commands, just that there is no standard way to document them.
Having a standard format for robotic features would make it easier to understand the features a specific crawler has implemented, and have a known place documenting when.
The w3c has a bunch of these documents to outline various doctypes, e.g. http://www.w3.org/TR/REC-html40/loose.dtd.
As part of this initiative, crawlers would be encouraged to include the URL of such a page as part of their User-Agent.
IMHO having one place to find out all this stuff would simply be better house keeping, and would make new feature additions more transparent.
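As a rough illustration of what such a common-format document might hold, here is a small Python sketch. The entries are only the examples given in this thread (nofollow, noarchive, crawl-delay), not a verified survey of the engines; anything the thread does not state is left as unknown, and the names `SUPPORT`, `supports` and `render` are invented for the example.

```python
# Support claims as stated in this thread only; absent entries mean "not stated".
SUPPORT = {
    "nofollow": {"Google": True, "Yahoo": True, "MSN": True, "Ask": False},
    "noarchive": {"Google": True, "Yahoo": True},
    "crawl-delay": {"MSN": True},
}
ENGINES = ["Google", "Yahoo", "MSN", "Ask"]


def supports(feature: str, engine: str):
    """Return True/False if stated, or None when nothing is documented."""
    return SUPPORT.get(feature, {}).get(engine)


def render() -> str:
    """Render the matrix as plain text: one row per feature, one column per engine."""
    labels = {True: "yes", False: "no", None: "?"}
    rows = ["feature".ljust(12) + "".join(e.ljust(8) for e in ENGINES)]
    for feature in SUPPORT:
        cells = "".join(labels[supports(feature, e)].ljust(8) for e in ENGINES)
        rows.append(feature.ljust(12) + cells)
    return "\n".join(rows)


print(render())
```

The point is less the code than the shape: one fixed list of features, one answer per engine, and an explicit "unknown" where an engine has not documented anything.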
As far as the timeliness of reindexing, SE's can follow MSN's new example. I have no complaints about MSN's new search engine, or how quickly it reindexes. Very current and accurate. Good relevant algo, too.
carpediem
02-14-2005, 03:13 PM
I believe being able to get feedback from Google would be extremely beneficial. We had a problem with a site in Yahoo! We filled out an online form at Yahoo! and a rep got back to us within 1 day telling us they felt our site violated one of their guidelines (and actually listed the possible issues). That was amazing and very helpful!
We had a similar problem with Google in Sept. I spent 2 months trying to figure out what the issue was only to later find out that it was a manual penalty because our company owned more than one site at the top of the SERPs.
The sites did not interlink, nor did they violate any of the other listed guidelines (the content was unique and the sites had vastly different features). Our site was affected by something that they do not publish.
And I am fine with the entire situation if this is a guideline established by Google - and even if they don't post it, then at least give me a way of determining an issue. I was just pure lucky to actually talk to a Google Rep at an SES conference that was aware of our situation.
I was frustrated by this because I simply want to be on a level playing field. I could point out a dozen other sites that should be penalized for the same thing. And if the algo was the only thing determining a penalization, then we would either all be hit or none of us would – at least it would be a level playing field. But when someone goes in manually and takes action against a specific site, I think it would be appropriate for me either to make a case for why these 2 sites should continue to rank, or to be able to say, "If I violated this unwritten guideline, can you look at these other sites doing the same thing and take the appropriate action?"
And if Google disagrees with me, then fine. In fact I have no complaints; I know that by doing SEO I have no guarantees. If Google does not think my site should rank (either by a manual decision or an algorithmic one), then it won’t rank well and I will accept that. But the total lack of any communication, combined with penalizing sites that violate unclear (and in some cases incomplete) guidelines, is the problem.
If we don't know what the rules are, how can we play by them?
I do have one major concern with Yahoo especially. Although, I am glad they take steps to provide feedback to site issues as the previous poster mentioned. Google should as well.
My issue with Yahoo is their inclusion of their Directory in SERPs. It doesn't help anyone, but it can be a hindrance for everyone. It's great if they use a directory, for example as Google uses DMOZ, but that should only be a factor in where to point the indexing bot (in Yahoo's case, the 'slurp').
It doesn't do the searcher or webmaster much good when a directory is listed in the serp and there's little the webmaster can do to alter its description to better match the site's most recent modifications, thus describing its relevancy. As it stands now, if a homepage of a site is included in the directory, it may sit stagnant in SERP's (in no-man's land) even though the site may have changed and is relevant to a slightly different query. Site owner's prerogative. Content changes, whether slightly or by leaps and bounds, site owner makes the call as to how it's described in tags and on-page content. That's normally what SERP's display (and should) -- title, description and/or page content that matches the query. When a directory listing shows up in a SERP "as was", the webmaster (and searcher) is at the mercy of Yahoo to change that description. Therefore, it can hurt relevancy unless the webmaster has access to change the description at any given moment to inform searchers what the site is really about these days. For now, the webmaster may fill out a form to "request" a change in the directory description, but Yahoo has a disclaimer: (paraphrased) don't hold your breath for it to be done in a timely manner. Or they may not change it at all if they fail to see the need for a change. They're only seeing it from one perspective, that such changes are merely for SEO. While true in some cases, overall I think it's still a bad practice (directory in serps) that hurts the searcher. Meanwhile, the other SE's have already reindexed the site and list it in SERPs with an up-to-date title and description, freshly obtained from the site itself, in its current state.
The way I see it, search relevancy is accomplished in 2 parts -- in the search engine looking at the current site, and the webmaster's ability (and accessibility) to show relevancy.
Otherwise, it's almost a deterrent to recommend being included in Yahoo's directory.
Solution -- take directory listings out of organic search results, and instead display the site (homepage) as the bot sees it, in its current state. Then let your algo make the call if it is relevant.
Chris_D
02-15-2005, 02:10 AM
<mod hat off>
I personally think the issue of spammy scraper sites - sites/ directories built using SERPs as content and serving adsense - is the 'next' major spam issue facing web search.
I'm seeing tons of these kinds of scraped serp sites ranking highly in Google - despite: "No Google ad may be placed on pages published specifically for the purpose of showing ads, whether or not the page content is relevant."
https://www.google.com/adsense/policies
If you think about it for a minute from a commercial perspective - despite the policy - it's clearly in Google's best financial interest for these 'SERP scraper' sites to rank - after all - these sites are Google's advertising partners - and Google gets revenue based on their click throughs.
Have the other SE's considered how to address this new growing 'scraped SERP spam' issue?
<mod hat on>
Chris Boggs
02-15-2005, 09:14 AM
Have the other SE's considered how to address this new growing 'scraped SERP spam' issue?
<mod hat on>
Seems like many SEO's/SEM's have: build a directory site relevant to each large client's industry. :D
Seriously though, the idea of having a system allowing us to find out if a particular URL is in any penalty status would be great. Although it would be nice to have a detailed response like that which carpediem received from Yahoo!, even a simple status checking tool would be awesome with just a few levels:
"No penalty" (Hey it's not the SE's fault they aren't ranking)
"Minor penalty" (Possible duplicate content)
"Major Penalty" (Difficult to find without brand name in search)
"Game misconduct" (Site banned)
(can you tell I miss the NHL?) :p
KeywordMonkey
02-16-2005, 07:09 PM
I'd really like to know more about how they detect duplicate content and what is and is not duplicate content in the opinions of the engines - so we can help clients stop producing it by mistake.
(See this thread (http://forums.searchenginewatch.com/showthread.php?p=34657) for a new discussion around this).
projectphp
02-21-2005, 04:44 AM
*bump* This is fast approaching, and I have a few last ideas.
I would like to see Search Engines ruled compliant or otherwise with certain levels of standards, similar to the W3C Web Content Accessibility Guidelines 1.0 (http://www.w3.org/TR/WAI-WEBCONTENT/full-checklist.html), which has three levels of Accessibility, and the various SQL standards.
That would mean that rather than the hodge podge implementations we are likely to get with non-standard ideas like the nofollow attribute (do search engines follow the link, or just not parse link pop / PageRank??), we would have an official standard to compare a crawler to.
To do this, the W3C (or some similar body) could create a site that tests these standards, via the implementation of various robots.txt files and pages designed to test a crawler's observance (or otherwise) of specific features.
Actually, I have a couple of ideas :)
- A few different robots.txt protocols, that are of course backwards compatible, with a few more generic robots.txt "user-agent"s. This would allow a crawler to parse the file based upon their own compliance. This would work similar to how different JavaScript languages are parsed by browsers.
So, a robots.txt 1.0 compliant robot would parse the commands sent to it, and a robots.txt 1.1 compliant robot would parse different commands. Preferably, this would be done in the comment tags, e.g.
User-agent: * # protocol1.0
Disallow: /stuff/
User-agent: protocol1.1
# Disallow: /stuff/*.pdf$
# Disallow: /stuff/*.doc$
# Disallow: /stuff/*.xls$
- For all hacked on vendor specific commands that are user-agent specific to be implemented in the comment area e.g.
User-agent: MSNBot
### Crawl-Delay: 20
That would stop any accidental problems.
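For what it's worth, a crawler that did understand such comment-embedded commands would not need much code to pick them out. The following Python sketch is only one guess at how it could work, assuming the layout shown above: a `User-agent:` line opens a group, and any comment line of the form `# Name: value` inside that group is treated as an extension. The function name `vendor_directives` is invented, and a real scheme would need a way to tell these lines apart from ordinary comments that happen to contain a colon.

```python
import re

# One or more '#', then "Name: value"; an assumed syntax, not a ratified one.
VENDOR_LINE = re.compile(r"^#+\s*([A-Za-z-]+)\s*:\s*(.+?)\s*$")


def vendor_directives(robots_txt: str, agent: str) -> dict:
    """Collect comment-embedded directives from the group addressed to agent."""
    found = {}
    in_group = False
    for raw in robots_txt.splitlines():
        line = raw.strip()
        if line.lower().startswith("user-agent:"):
            # Drop any trailing comment such as "# protocol1.0".
            name = line.split(":", 1)[1].split("#", 1)[0].strip()
            in_group = name.lower() == agent.lower()
            continue
        match = VENDOR_LINE.match(line)
        if in_group and match:
            found.setdefault(match.group(1).lower(), []).append(match.group(2))
    return found


EXAMPLE = """\
User-agent: * # protocol1.0
Disallow: /stuff/
User-agent: protocol1.1
# Disallow: /stuff/*.pdf$
User-agent: MSNBot
### Crawl-Delay: 20
"""

print(vendor_directives(EXAMPLE, "MSNBot"))
print(vendor_directives(EXAMPLE, "protocol1.1"))
```

A robot that knows nothing about the extensions sees only comments, which is the whole point of the proposal.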
Interesting to see how this all develops.
projectphp
03-20-2005, 07:05 AM
*bump*
Given the news that Agence France Presse Sues Google over News Content (http://www.afp.com/english/home/), it is a bit of a pity that the idea of more control for webmasters didn't come to pass.
IMHO, webmasters need more control, and Search Engines should be providing more ways to opt out of the newer features they offer (image search, product search, news etc). Anytime a new feature is launched, I should have a means of opting out.
Having the ability to opt out of certain types of niche search engine indexes is overdue, and not terribly hard to implement. As part of that, I really believe we need meta-useragents that are generic for all search engines, for example newsbot, shoppingbot and imagebot should be the generic user-agents that a webmaster can issue commands to, and all SE that have a news section will obey.
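Existing robots.txt syntax could already express this if the engines agreed on the names. The sketch below assumes the generic user-agents newsbot and imagebot suggested above (they are a proposal, not something any engine honours) and uses Python's standard-library `urllib.robotparser` to show how a news crawler that identified itself that way would be shut out while everything else stays open; the helper `allowed` and the path are made up.

```python
from urllib.robotparser import RobotFileParser

# Opt out of news indexing only; every other crawler may fetch everything.
ROBOTS_TXT = """\
User-agent: newsbot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(service_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Would a crawler announcing itself as service_agent be allowed to fetch path?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(service_agent, path)


print(allowed("newsbot", "/story.html"))   # news section opted out
print(allowed("imagebot", "/story.html"))  # falls through to the * group
```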
In so many ways, robots.txt is outdated for modern needs, and needs a complete rethink, and I hope we get that at some point soon.