View Full Version : .html / .htm
10-31-2004, 11:09 PM
Out of .html and .htm, which is the better file name extension for our website? We have some files in .html and some in .htm. Which is better from the point of view of search engine optimization?
11-01-2004, 12:15 AM
As far as I know, there is not really much difference.
11-01-2004, 06:04 AM
File extensions really should have little to do with anything. I haven't seen any major search engine suggest this is an issue for some time. Google has even joked in the past that .abc or .xyz or other "made up" file extensions work OK. As long as the file is read as an HTML file (which is independent of the file extension), the extension doesn't matter.
11-01-2004, 07:32 AM
As already stated in this thread - the choice between .htm and .html file extensions doesn't actually matter.
I tried a search on Google - just for fun really. Here are the results:
11,600,000 for .swf
33,800,000 for .txt
372,000,000 for .htm
381,000,000 for .php
381,000,000 for .net
710,000,000 for .html
"At Google, we are able to index most types of pages and files with very few exceptions. File types we are able to index include: pdf, asp, jsp, hdml, shtml, xml, cfm, doc, xls, ppt, rtf, wks, lwp, wri, swf."
11-02-2004, 10:52 AM
For SEO purposes it doesn't make a difference, but my server guy believes that all pages should be .html. Some mumbling about Microsoft was included in his explanation, but I tuned it out.
11-03-2004, 10:55 PM
Thanks for the reply.
11-16-2004, 08:03 AM
How about with no file extension at all? First time I've come across it - you can do it with that Drupal CMS when you give the pages "custom" URLs.
11-20-2004, 04:37 AM
Doesn't really matter. I lean toward .html because I'm a Unix-y guy. Windows-y folks tend to use .htm from back when Windows preferred three-character extensions. I would use some extension, though, Marcia. If you don't, then the spiders have to think. That violates a good rule of thumb of mine: never make a bot have to think. They might get it wrong. ;)
11-20-2004, 08:25 AM
Ummm, never thought about the URL aliasing, Marcia. Good point! How does Google handle that then, GoogleGuy? If you're aliasing with something like Drupal, does the G bot look for an index/default file, or what? As we all know, when you alias, you're generally left with nothing but a /. I wouldn't think many would alias using the index.html or default.html filename. Google obviously has plenty of pages listed from the million or so Drupal sites, and from however many of those use aliasing.
Can you explain that please, GG?
11-20-2004, 08:36 AM
>>I would use some extension though, Marcia. If you don't, then the spiders have to think.
None of my Drupal installs use extensions of any kind, and Google and other spiders seem to like them just fine. In fact, although you can "alias" Drupal paths to whatever you want, it's not necessary to specifically do that, as they come like that right out of the box.
Typically (with clean URLs enabled, as most do in the control panel) paths look like this:
Spiders just eat 'em up, as they should......?
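For anyone wondering how an extensionless "clean URL" can work at all: the server simply declares the type in the Content-Type header, so the crawler never needs an extension. A minimal sketch in Python's standard library (a hypothetical handler for illustration, not Drupal's actual code; the `/about` path and page text are made up):

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical page table: paths carry no extension at all.
PAGES = {"/about": "<html><body>About us</body></html>"}

class CleanURLHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        found = self.path in PAGES
        body = PAGES.get(self.path, "<html><body>Not found</body></html>").encode()
        self.send_response(200 if found else 404)
        # The Content-Type header alone tells the spider it got HTML.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

A crawler fetching `/about` from this handler sees `text/html` and indexes it like any `.html` page.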
11-20-2004, 05:00 PM
We just follow links, Anthony. If the link points to .html, we try to fetch that page. If the link points to .htm, we try to fetch that. We're completely agnostic. But in general I would pick one convention and just stick with that consistently. That will also make your muscle memory work for you when you're typing. :)
NickW, it's almost never a problem to not have an extension. But I can imagine that if you had a text file that began with file magic that looked almost like a PDF, a bot could get confused and try to convert it to HTML. Again, I haven't actually seen this happen, but the "keep it simple for spiders" rule is always a good one. 99.999% of the time a bot will be able to disambiguate relative URLs, for example, but why leave that little bit to chance? I'm not saying that people should always use absolute URLs--just that the simpler you make things for bots, the less likely you are to run into problems. Google/Yahoo/MSN can probably handle ill-formed HTML, for example, but why not lint-check a template to make sure every tag is closed before pushing the template live? That sort of thing, ya know? If the current state of things works for you, I wouldn't bother changing, though.
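That kind of "every tag is closed" lint check can be sketched with nothing but the standard library. This is a rough illustration of the idea only; the void-tag list and helper names are my own, not anything a search engine prescribes, and a real validator (weblint, the W3C validator) checks far more.

```python
from html.parser import HTMLParser

# Tags that are legitimately never closed in HTML.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col"}

class TagBalanceChecker(HTMLParser):
    """Flags end tags that don't match the most recent open tag."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.errors = []  # human-readable problems found

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")

def lint(html):
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack was opened but never closed.
    return checker.errors + [f"unclosed <{t}>" for t in checker.stack]
```

Run `lint()` over a template before pushing it live; an empty list means every non-void tag was closed in order.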
11-20-2004, 05:23 PM
Thanks GG, I had no intention of changing - the HTML, whilst not as clean as I would make it (it's a doctored vanilla Drupal template), is pretty good with a strict XHTML doctype and could not be confused for anything other than an HTML file :)
You might clear up another point while we're on the subject. Although I know this already, I'd love to get confirmation from an SE rep to really quash the illusion that using HTML 3.2 is best for spiders....
All those <font> tags? Do me a favour.. Nice, clean, regular (X)HTML would surely be equally as good at the very least, if not a great deal better, with all the style formatting exported out to a CSS file....?
11-20-2004, 09:44 PM
Thanks GG. As Nick said, that would be a nice one to quash.
I'm already a big advocate of W3C compliance for that exact reason. If the page complies with web standards, and the bots are built on web standards and factor pages accordingly, then a compliant page just can't go wrong.
Again, thanks GG. It is good to have a rep clear up some minor points without giving anything away. Especially the rumours that we believe are rubbish but that just keep travelling through an endless stream of online gossip mongering.
Next we'll have people insisting that not only does PageRank decide the ranking of your site, but it must be HTML 3.2 as well. Some people are just clueless, I guess. Common sense isn't a strong point with some, IMO.
11-21-2004, 04:18 AM
i'd love to get a confirm from a SE rep to really quash the illusion that using html3.2 is best for spiders
Sure, I'll quash that illusion. At least for Google, basic HTML is just fine. Clean, well-formed pages are great, but it's just not realistic to expect hand-made pages out on the web to be error-free. Let me see if I can find the canonical Eric Brewer paper from Inktomi days. Ah, here it is on scholar: http://scholar.google.com/scholar?q=investigation+of+documents
and I'll just quote the relevant part:
weblint was used to assess the syntactic correctness of a subset of the HTML documents in our data set.... Observe that over 40% of the documents in our study contain at least one error.
If someone wants to code in HTML3.2, power to them, but search engines need to be fairly generous in interpreting pages. Just because a page doesn't validate HTML-wise doesn't mean it doesn't have good information on it.
11-21-2004, 05:02 AM
If someone wants to code in HTML3.2, power to them, but search engines need to be fairly generous in interpreting pages. Just because a page doesn't validate HTML-wise doesn't mean it doesn't have good information on it
Agree entirely (of course..) - if only it were a perfect world, eh?
Thanks for the confirmation. It's a no-brainer really, but you'd be surprised how many professional search marketers believe otherwise...
11-21-2004, 08:10 AM
Thanks GG, much appreciated mate.
11-28-2004, 04:35 PM
I guess the real question is: are there errors in HTML that cause a spider to miss some of our content? Given everyone's experience browsing a site designed for IE in an early version of NS, or vice versa, it is conceivable that certain tag errors could cause your content and keywords to be missed.
A bit of a stretch of the imagination, but it *could* happen with just about any spider. That said, it could also be that spiders miss content in perfectly formed HTML.
If either of these has known cases, they are bugs and presumably would be fixed.
11-29-2004, 01:32 PM
It's always best to err on the side of caution. Validated HTML is, at worst, no better than error-filled HTML, but at best it assures that errors are not causing the spiders to flee.