View Full Version : Accent Marks in Other Languages
Jorge
06-01-2005, 09:40 AM
The fact is that if you search in Google for a word with an accent mark and without it, you will get similar but different results.
What if the word you want to optimize for has an accent mark but people never type it in when searching?
In Spanish you have the following examples:
médico
información
pornografía
mcanerin
06-01-2005, 12:09 PM
I run into the same issue here in Canada, where the proper name for Montreal is Montréal, for example.
However, look at this Wordtracker report:
Keyword     Count   Predict
montreal    1293    1107
montréal    35      30
Obviously, it would be better to go for the term that most people search for - but there is a catch.
One of my clients is the Canadian Heritage Dept - a government organization that CAN'T misspell Montréal! There are ways around this, but it takes time and planning.
We do know that G will automatically (unless quotes are used) look for a word with the accent if a search is done without it, but it's not considered a good match, so the rankings change - the results that most closely match the searcher's query are given priority.
This means, in practice, that people who spell things correctly (both searchers and webmasters alike) are given worse results than people who misspell things, but in a common way.
Although most of the time I support an SE matching the user's query in the most precise manner, I believe that the results and quality of searches would improve for these terms if the accented (and "proper") version of the spelling were given priority over the common spelling, on the basis that most of the time people don't even know where to find the accented characters on the most common keyboards (i.e. US English, Chinese, Japanese, Korean, etc.).
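To make the matching question concrete, here is a minimal sketch of accent folding. It illustrates the general technique only and is not how any search engine is known to work; the function names fold_accents and matches are made up for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove accent marks by decomposing characters (NFD) and
    dropping the combining marks that result."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, word: str, fold: bool) -> bool:
    """Compare a query with an indexed word, optionally folding accents.
    (Hypothetical helper: real engines rank, they do not just test equality.)"""
    if fold:
        query, word = fold_accents(query), fold_accents(word)
    return query.casefold() == word.casefold()


if __name__ == "__main__":
    for word in ("Montréal", "médico", "información", "niño"):
        print(word, "->", fold_accents(word))
    print(matches("montreal", "Montréal", fold=False))
    print(matches("montreal", "Montréal", fold=True))
```

An engine that folds accents before comparing treats montreal and Montréal as the same term. One that compares exactly treats them as different terms, which is the behaviour that penalises the correct spelling when most searchers leave the accent off.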
Ian
Jorge
06-01-2005, 12:36 PM
I agree with you.
And the choice of whether or not to use the accent is very tricky.
In Spanish you can't really not use the accent in a commercial text. It would look very bad. So you could find yourself having either good G ranking and causing a poor impression on your readers, or bad ranking but no misspellings.
I personally do the following (and I would like to know if anyone thinks this is a good idea): I use the word without the accent in the TITLE tag and in the description and keyword tags, but always spell it correctly in the body of the text. When someone searches for a certain word, they may notice that every result has the same misspelling in the title, and accept it more readily there than in the text itself once they enter your site.
...and I have had good luck so far with this technique.
Nacho
06-02-2005, 02:04 AM
Jorge,
I completely agree with both your comments. Good post! We follow the same strategy, especially because most of our clients are looking for U.S. Hispanic users, who most likely have U.S. keyboards.
The challenge really is when you get a scenario like trying to reach users (e.g. from Mexico) who regularly use accents, and your goal is for them to access your client's U.S.-based content in Spanish that is written without accents (or is less optimized for them).
Jorge
06-02-2005, 05:54 AM
The accent and the letter "ñ" are definitely a challenge for us. The letter ñ could have a thread of its own, but it is more or less the same problem, so I'll include it in the lot. Google actually deals similarly with words written with Ñ and with N. So if you look for niño you get many results with niño but some with nino.
Andy AtkinsKruger
06-03-2005, 05:03 AM
I just did a Powerpoint on this at SES London - if you want a copy PM me and I'll send it to you.
Note that: -
Yahoo treats accents very differently to Google
Google has changed the way it deals with accents - it used to normalise them much more than it does now
Accents are not the only linguistic issue - there are also alternate but correct spellings (e.g. German) and declensions (e.g. Russian).
Nacho
06-04-2005, 12:41 PM
Good post Andy!
Yahoo treats accents very differently to Google
Can you please expand on that.
Nacho
06-04-2005, 12:45 PM
For Google, here is an example: Doña María (http://www.google.com/search?q=Do%C3%B1a+Mar%C3%ADa) vs. Dona Maria (http://www.google.com/search?q=Dona+Maria).
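The two URLs above differ only in how the ñ and í are percent-encoded. As a small sketch (the helper name google_search_url is hypothetical), the Python standard library produces the same query strings by encoding the text as UTF-8 first:

```python
from urllib.parse import quote_plus, unquote_plus


def google_search_url(query: str) -> str:
    """Build a search URL, percent-encoding the query as UTF-8.
    The base URL is the one used in the post above."""
    return "http://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    print(google_search_url("Doña María"))
    print(google_search_url("Dona Maria"))
    # Decoding the encoded form recovers the accented text.
    print(unquote_plus("Do%C3%B1a+Mar%C3%ADa"))
```

Each accented letter becomes two percent-encoded bytes (ñ is %C3%B1, í is %C3%AD), so the engine receives two different strings and is free to treat them differently.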
Jorge
06-06-2005, 10:39 AM
Nacho, that's a perfect example of the nightmare that accents and eÑes can be. Does anyone deal with accents any differently? What about French?
Andy AtkinsKruger
06-06-2005, 10:52 AM
Nacho asked me to come back with more detail on accents, which I'll endeavour to do - just a bit pushed right now - however, the key thing to look for in French (and other languages) is accents affecting meaning. These are more often normalised - accents without any impact on meaning (frequently the grave and circumflex) more often give different results. :)
Andy
I think the main difference between Yahoo and Google on accents is that Yahoo will treat an accented word as an alternative to the non-accented word in its concept network (reference: patent application US 2005/0080795 A1). The two searches will be covered in the same superunit because their characteristic signatures are close enough matches. Google sees the two as synonyms.
Essentially, Yahoo sees the accented word in the same way it would see a common misspelling and ranks accordingly.
Here's the alarming thing though; cut'n'paste Doña María into the Yahoo Toolbar (v 6.0) and search. The search runs for Doa Mara! The toolbar can't cope. Google certainly doesn't have this problem.
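The "Doa Mara" result is what you get when non-ASCII characters are silently dropped instead of encoded. Whether the toolbar actually works this way is an assumption. The sketch below only reproduces the symptom, and drop_non_ascii is a made-up name:

```python
def drop_non_ascii(text: str) -> str:
    """Encode to ASCII and discard anything that does not fit.
    This mimics the symptom described above; it is not the toolbar's code."""
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    print(drop_non_ascii("Doña María"))
```

A tool that did this to every query would turn any accented search into a different, usually meaningless, word before the search engine ever saw it.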
mcanerin
06-06-2005, 12:40 PM
I actually started a test on these issues a few days ago (h ttp://www.mcanerin.com/articles/keyword-misspelling-test.htm) feel free to check it out and suggest additional tests, but please DON'T link to it (at least, not while the test is running!) - I'm trying to control the anchor text for testing purposes.
This link has been deliberately broken - other mods - don't fix. :)
Ian
patrickdeese
06-06-2005, 05:45 PM
What seems to work pretty well for me is to put the "misspelled" versions of words (diseno / diseño) in the meta KW tags for Yahoo, and in the sitemap as anchor text for Google.
Jorge
06-07-2005, 05:18 AM
mcanerin, I hope you share your results, at least the part pertaining to accents and Ñes!
rodrigotanco
06-09-2005, 12:29 AM
However, look at this Wordtracker report:
Keyword Count Predict
montreal 1293 1107
montréal 35 30
I think it's important to focus on the words people search for most, but to pay for ads on the correct form, because someone who knows the correct form is possibly an informed customer.
Jorge
06-09-2005, 05:47 AM
Well yes, but you cannot have misspellings in a serious text. That is why I suggest putting the word in the title (and as someone else suggested in the metatags) without the accent, and in the text with the accent.
You will have two possible types of sites listed in the SERPs: some that have misspellings and therefore no credibility (would you buy an airplane ticket from someone who advertises "cheep flites to ansterdan"?), and serious businesses that spell correctly.
We cannot forget that what you say and how you say it is as important as being well positioned. You need to establish credibility as a company if you want people to give you their credit card information.
mcanerin
06-09-2005, 04:35 PM
Agreed - getting someone to your site for a term is no good if they don't take you seriously enough to convert once they are there - and pointing out that they can't spell so you had to optimize for that probably won't win friends and influence people, either.
It's a very tight line, and probably the only time I've ever seen "cloaking" actually have some use.
Since I certainly DON'T recommend cloaking, I'm always looking for alternatives to deal with this issue of trying to give a visitor what they were looking for, rather than what they typed in as a search term.
Ian
mcanerin
06-09-2005, 05:13 PM
I started the Keyword Misspelling Research (http://www.mcanerin.com/articles/keyword-misspelling-test.htm) on June 4, 2005 and it's now June 9, 2005 - Google already has results (nice going, guys! :) ) The other 3 search engines have not even indexed a single page yet :mad:
Here is the initial data:
The word "altwrittén" was chosen because it allows me to test the two most common spelling issues on the internet: 1) spelling of non-English words using the English alphabet/keyboard (ie whether the "é" is stemmed into an "e"), and 2) actual misspellings. It will also allow me to test the ñ by changing the last letter for the next test.
The keyword misspelling test contains the following pages. All searches using Google:
Page 1: Main Page. Contains all three words in body text, title and keyword metatag, and I will link to it using the misspellings, as well. It should show up for all three versions.
altwrittén: Result
altwritten: Result
altwriten: Result
Page 2: The proper word is used in the content and title, but with no mention of misspellings anywhere, and is not linked with misspellings. This is the control page. It should not show up for any misspellings unless the search engine stems or makes a decision to include a misspelling for its own reasons.
altwrittén: Result
altwritten: No Result
altwriten: No Result
Page 3: The keyword misspellings are used in image "alt" tags only (unlinked)
altwrittén: Result
altwritten: Result
altwriten: Result
Page 4: The keyword misspellings are used in image "alt" tags only (linked)
altwrittén: Result
altwritten: Result
altwriten: Result
Page 5: The keyword misspellings are used in the keywords metatag only
altwrittén: Result
altwritten: No Result
altwriten: No Result
Page 6: The keyword misspellings are used in <noscript> only
altwrittén: Result
altwritten: Result
altwriten: Result
Page 7: The keyword misspellings are used within <object> only
altwrittén: Result
altwritten: No Result
altwriten: No Result
Page 8 : The keyword misspellings are used in incoming anchor text only (no on-page use)
altwrittén: Result
altwritten: Result
altwriten: No Result *** I'll check this later
Page 9: The keyword misspellings are used in the title only
altwrittén: Result
altwritten: Result
altwriten: Result
Page 10: The page path (i.e. domain name/directory test) contains the misspellings, but the content does not.
altwrittén: Result
altwritten: Result
altwriten: Result
Page 11: The misspellings are hidden using CSS within the body.
altwrittén: Result
altwritten: Result
altwriten: Result
Page 12: The misspellings are within comments only.
altwrittén: Result
altwritten: No Result
altwriten: No Result
Page 13: The misspellings are only within a Dublin Core tag intended for the purpose.
altwrittén: Result
altwritten: No Result
altwriten: No Result
Page 14: The misspellings are within a bookmark (i.e. "domain.com/page.htm#keyword") link on the same page, but not otherwise on the page.
altwrittén: Result
altwritten: No Result
altwriten: No Result
-------------------------------------------------------------------------
Conclusions For Google
One interesting thing is the order that Google listed these in, though I wasn't testing for it. Feel free to check the listings and draw your own conclusions.
All pages in the test are indexed and show up for the control word. At this time they are the ONLY pages that show up - which is good for this test.
Contrary to popular belief - Google apparently checks and indexes unlinked alt tags. There is another possible explanation - I deliberately used the misspellings as the image file names - I'll check this in the next round. Inconclusive.
Google does NOT index the keyword metatag or other metatags like Dublin Core.
Google will index misspellings in the filename and URL, but not bookmarks (which are technically part of the URL)
Google indexes hidden CSS
Google does NOT index comments
Google will index <noscript>, but did not index misspellings in the <object> tag. This was not the ordinary usage of alt text within an object, but a custom experiment, which has now provided me with useful (though negative) data.
There was absolutely no indication that Google will expand its search to include an "e" when a search term includes an "é". :(
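For anyone who wants to tabulate the results, here is a minimal sketch that records the Google outcomes reported above for pages 3 to 14 and splits the placements into "picked up" and "ignored". The dictionary keys and function names are labels made up for this example, and the main page and the control page are left out.

```python
# Each entry: placement -> (found for "altwritten", found for "altwriten"),
# transcribed from the Google results listed in this post.
GOOGLE_RESULTS = {
    "image alt (unlinked)": (True, True),
    "image alt (linked)": (True, True),
    "keywords metatag": (False, False),
    "noscript": (True, True),
    "object": (False, False),
    "incoming anchor text": (True, False),  # flagged "I'll check this later"
    "title": (True, True),
    "page path": (True, True),
    "hidden CSS": (True, True),
    "comments": (False, False),
    "Dublin Core": (False, False),
    "bookmark": (False, False),
}


def picked_up(results):
    """Placements where at least one misspelling produced a result."""
    return [name for name, (a, b) in results.items() if a or b]


def ignored(results):
    """Placements where neither misspelling produced a result."""
    return [name for name, (a, b) in results.items() if not (a or b)]


if __name__ == "__main__":
    print("Picked up:", ", ".join(picked_up(GOOGLE_RESULTS)))
    print("Ignored:  ", ", ".join(ignored(GOOGLE_RESULTS)))
```

Keeping the raw outcomes in one table makes it easy to add a second table per engine later and compare them placement by placement.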
Ian
Robert_Charlton
06-10-2005, 05:01 AM
Ian,
Thank you. The system won't let me give you any more rep points, but I wanted to let you know that was one of the best posts about an SEO test I've ever seen. It covers a lot of indexing questions beyond the indexing of accents.
Have you looked at results on MSN and Yahoo... and have you tried keyword misspellings in the meta description?
DoingItWell
06-10-2005, 05:27 AM
ALT text is indexed? Interesting. It seems that the Google people follow the trend that has been known in the printed media for some time.
When people look at/scan a page, they first check headlines, then pictures and picture captions. Then the rest of the text gets a go if some of the scanned info has caught the viewer's interest. Some people even check pictures first.
If webmasters know and act accordingly, then picture captions - and ALT text - should be very indicative of the page contents.
Jorge
06-10-2005, 06:14 AM
Great information Ian, thank you for sharing. I also was not allowed to rate your post positively (again).
I am going to take a good look at your results. You have indeed provided a lot of eye opening information.
mcanerin
06-12-2005, 02:05 AM
Quick update on the tests: it is now the 11th and Yahoo has now indexed 3 pages with the word - unfortunately 1 is this thread, 1 is my blog, and the final one is the actual test page. None of the others have been indexed, yet.
But that's better than MSN (only indexed this thread) and Teoma (nothing yet). When I get full info from an SE, I'll present the results.
This will be the last update until then - posting about there being nothing to post yet is a waste of space and time - I only did this one in response to an earlier question about results for the other three.
It is kind of interesting to see the crawling and indexing speeds, though ;)
Ian
kenpomachine
06-12-2005, 07:45 AM
Jorge,
I completely agree with both your comments. Good post! We follow the same strategy, especially because most of our clients are looking for U.S. Hispanic users, who most likely have U.S. keyboards.
The challenge really is when you get a scenario like trying to reach users (e.g. from Mexico) who regularly use accents, and your goal is for them to access your client's U.S.-based content in Spanish that is written without accents (or is less optimized for them).
The main problem with Spanish is not the accents, as the results don't vary that much whether you use them or not (compare
seguros automóvil (http://www.google.com/search?q=seguros%20autom%C3%B3vil) with seguros automovil (http://www.google.com/search?q=seguros%20automovil); Yahoo only varies some of the positions, but the results are mostly the same). EÑES are a different matter, though, due to the US keyboards used in North America. If you target Spain's Spanish market, use the Ñ to optimise, because nobody is going to type the word without the ñ and the ñ will show in most of the links pointing to your site. So typing the word correctly is a must.
AussieWebmaster
06-12-2005, 03:36 PM
Is the content of these variations seen as duplicate by Google since there are only small changes to the text etc.?
Or could geo IP redirects allow for variations to be found and listed separately, and have the searcher pick the preferred listing in the SERPs?
mcanerin
06-12-2005, 03:50 PM
Is the content of these variations seen as duplicate by Google since there are only small changes to the text etc.?
This is a very valid concern, and one I was careful to avoid in my tests.
When out "in the wild", be very careful in looking for this type of issue when trying to figure out behaviour - there are so many potential variables that I have serious concerns about conclusions made based on "real" working websites.
As a classic example, you can never "know" who Google knows is linking to you, and there are so many scraper sites out there that it's impossible to be certain you know all the backlinks of an established site.
Ian
Vetters
06-13-2005, 10:38 AM
Thank you, Ian, for a very informative test. This is very useful research!
amabaie
06-13-2005, 03:21 PM
Ian, this is a great test you are running. I have two suggestions.
Page 8 : The keyword misspellings are used in incoming anchor text only (no on-page use)
altwrittén: Result
altwritten: Result
altwriten: No Result *** I'll check this later
I suggest combining these two misspellings into a single link. SE spiders don't spider duplicate links, so they should pick up only the spelling in the first link.
Also, I think your example of Montreal might be unnecessarily worrying some people here. As a Montrealer myself, I know that the correct French spelling is Montréal. But I also know that the correct English spelling is Montreal (just as with the difference between Roma and Rome, Wien and Vienna, etc.).
The high proportion of non-accent searches are likely English-speaking people (otherwise there would likely be barely enough searches to show up on WordTracker's radar). This becomes less of an issue for a truly French word (not a place name that English people would search for), since French searchers are a thousand times more likely to spell with the accents. Ditto for Spanish, German, etc.
Of course, it all depends who your market is. The ring-tone market is a lot more likely to use misspellings in any language than the B2B market.
Just as a matter of preference, when I optimize client sites in French, German and Spanish, I use spellings that are as proper as possible, which means accents all around. I might leave them off in linkbuilding if, and only if, I am having problems with non-accented searches (again, depending on the market).
mcanerin
06-13-2005, 03:48 PM
I suggest combining these two misspellings into a single link. SE spiders don't spider duplicate links, so they should pick up only the spelling in the first link.
My impression was that they combine the link text for identical links for the page, but your explanation fits the evidence better, so I will. It should be very interesting to see the result.
Also, I think your example of Montreal might be unnecessarily worrying some people here. As a Montrealer myself, I know that the correct French spelling is Montréal. But I also know that the correct English spelling is Montreal (just as with the difference between Roma and Rome, Wien and Vienna, etc.).
I wish the Canadian government agreed with you regarding its websites - it would make my life a LOT easier...
Ian
mcanerin
06-15-2005, 12:15 AM
I'm starting to get initial data back from MSN (not all the pages are indexed yet, so it's incomplete). Yahoo and Teoma are still no-shows, for the most part.
There are only 8 pages indexed so far, but when I do a search for altwrittén and altwritten, all 8 pages come up for both. When I search for the other misspelling, altwriten, only 4 pages show up:
The main page (contains all words in the visible body content)
Page 9 - with the misspelling in the title (this shows up as the first result)
Page 10 - the misspelling is in the filename (testpage10-altwritten-altwriten.htm)
Page 6 - noscript tag.
This means that MSN will look for misspellings in the above areas.
So far it looks like MSN DOES STEM altwrittén as altwritten and return results for both. The é is being stemmed, because the other misspelling (altwriten) is not showing up for some pages that it would if the é was not being stemmed.
This also shows that MSN DOES NOT look for misspellings in:
The object tag
The Dublin Core Tags
Image ALT tags (with links)
I don't have data on the rest of the pages: image alt without links, keywords metatag, comments, bookmark, CSS, and Incoming anchor text. I'll post a full report when I get them.
Ian
Nacho
07-11-2005, 03:33 PM
Interesting test I came across today using Google, a search for:
Pina Colada (http://www.google.com/search?q=pina+colada)
Piña Colada (http://www.google.com/search?q=pi%C3%B1a+colada)
Judging by the allinanchor:piña colada (http://www.google.com/search?q=allinanchor%3Api%C3%B1a+colada) command, it seems the #1 spot has a bunch of IBLs with "Piña Colada" as anchor text, because other than that there is not a single "ñ" in the source code.
Jorge
07-12-2005, 10:40 AM
That IS strange Nacho .
mrzen
08-08-2005, 10:33 AM
Great thread by the way - it occurred to me while reading that if Google spidered a title, let's say in a <div> or <acronym>, then you could possibly use that to help out:
<div title="Montreal">Montréal</div>
This would pop the English spelling up in an alt-tag-style pop-up and, if spidered, would provide two spellings for the word (if the spiders read it)...
Does anyone know if this would work?
Nacho
10-27-2005, 03:34 AM
Ian, your research was very deep and had amazing results. Would you mind posting any new findings since the last time you posted?
Thanks!
Chris_D
10-27-2005, 07:22 AM
I'd also be interested in Ian's update.
We've optimised sites in 11 languages so far, in some pretty competitive arenas.
The general rule of thumb I've developed is to target the keyboard of the language - so e.g. the French can easily type acute characters - and can use them in searches - whereas an English keyboard user won't easily know that (hold alt, type 0233, release alt) gives an acute e (é) as in Nestlé.
Obviously - as Ian said - there can be a real issue on an English language page using e.g. a French / Swiss etc. brand name.
And that becomes an even bigger issue where the brand is multinational - and their brand name uses an é character, for example, in English language pages.
We recently did a site in English & French - and did some research on how people searched differently. The result was that the English version of the site didn't have acutes etc. - whereas the French one did - because the English searchers didn't search with special characters.
German is similar - characters like ß aren't easy for an English keyboard user to type. Use the German character set on the German language version - and the accepted 'English' character transliteration on the English page - with a link to the German language version.
amabaie
10-27-2005, 11:06 AM
whereas an english keyboard user won't easily know that (hold alt type 233 release alt) gives an acute e (¨¦) as in Nestl¨¦
I get this: ¦¨ For ¨¦, I hold alt, type 130, and release alt!
Aaargh! How do I get the accents to show on this board?
Chris_D
10-27-2005, 11:34 AM
Sorry - I mistyped; it's Alt+0233,
i.e.
1. hold down the alt key.
2. with the alt key held down, press 0233
3. release the alt key
é
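Both codes can give é because Windows looks them up in different code pages: a code typed with a leading zero is read from the ANSI code page, and one without it from the older OEM code page. The sketch below assumes a US English setup (Windows-1252 and code page 437), and alt_code_char is a made-up helper name.

```python
def alt_code_char(code: str) -> str:
    """Return the character an Alt code would produce, assuming
    Windows-1252 for codes with a leading zero and code page 437
    otherwise (both are assumptions about the keyboard's system locale)."""
    codepage = "cp1252" if code.startswith("0") else "cp437"
    return bytes([int(code)]).decode(codepage)


if __name__ == "__main__":
    for code in ("0233", "130", "233"):
        print("Alt+" + code, "->", alt_code_char(code))
```

Under those assumptions Alt+0233 and Alt+130 both produce é, while Alt+233 without the leading zero produces a different character.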
amabaie
10-27-2005, 02:13 PM
Ah...that works, too. But 130 is still easier. :-)
orion
10-27-2005, 02:47 PM
Great thread by the way - it occurred to me while reading that if Google spidered a title, let's say in a <div> or <acronym>, then you could possibly use that to help out:
<div title="Montreal">Montréal</div>
This would pop the English spelling up in an alt-tag-style pop-up and, if spidered, would provide two spellings for the word (if the spiders read it)...
Does anyone know if this would work?
Since this 3-month-old post was not addressed in this thread, I feel it is time to…
At this SEWF thread Title Attribute Control Test (http://forums.searchenginewatch.com/showthread.php?t=3652) this was investigated. I didn’t want to discourage the proponent of the test, but the thing is that such attributes may not help with rankings for several reasons.
There are several attributes that are used for accessibility and usability reasons: the title attribute, the ALT, long description, table summary and W3C recommended metadata attributes.
During document linearization, HTML markup tags, CSS/scripting instructions and any comments placed inside tags and their attributes are removed. The exceptions to this are search engines that care about accessibility/usability. They may retain comments in ALT attributes and in the summary attribute of tables.
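Linearization as described above can be sketched with the standard library HTML parser. This is a toy under stated assumptions, not any engine's actual pipeline: the Linearizer class and linearize function are names invented for this example, and the toy keeps no attributes at all.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Keep only visible text: tags, attributes, comments and the
    contents of script/style elements are discarded."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Attributes (title, alt, ...) are deliberately ignored here.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def linearize(html: str) -> str:
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = '<div title="Montreal">Montréal</div><!-- altwriten -->'
    print(linearize(page))
```

With the title-attribute example from the quoted post, only the accented word survives. The unaccented spelling in the title attribute and the text in the comment never reach the index in this model.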
Some metadata attributes established by the W3C are designed to help engines during the indexing process. These are like pointers or instructions for making sites appealing to search engines when written in other languages. When writing HTML code, I try to include these in the lines.
In the following example, I boldface these
First, if coding in XHTML for a Spanish site, I use the following code
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es">
<head>
<title>......</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="keywords" xml:lang="es" content="..." />
<meta name="description" content="..." />
'
'
'
The idea of coding in XHTML and not HTML is that the W3C has long positioned XHTML as the successor to HTML 4 (just as presentational markup such as the font tag was deprecated). XHTML is more structured, cross-compatible and is an XML application. About the lang attribute, additional information can be obtained from the W3C:
1. Why use the language attribute? (http://www.w3.org/International/questions/qa-lang-why)
"Search engines can group or filter results based on the user's linguistic preferences. It is also common to use meta tags to specify keywords that a search engine may use to improve the quality of search results. When several meta elements provide language-dependent information about a document, search engines may filter on the meta elements, using associated language attributes, and display search results according to the language preferences of the user."
2. More about Specifying the language of content: the lang attribute (http://www.w3.org/TR/WD-html40-970708/struct/dirlang.html#h-7.2.1)
3. Check also http://www.w3.org/TR/html401/struct/global.html#h-7.4.4 (Meta and search engines)
The W3C states,
"A common use for META is to specify keywords that a search engine may use to improve the quality of search results. When several META elements provide language-dependent information about a document, search engines may filter on the lang attribute..."
4. Additional info here http://www.w3.org/TR/WD-html40-970708/html40.txt
I’ve used the lang attribute for years with no detrimental effect to rankings or confusing search engines as some have incorrectly suggested (http://forums.searchenginewatch.com/showthread.php?t=333).
Regarding the use of international characters, I try to avoid such characters in important passages of the text (e.g., titles, description, specific snippets) unless there is no other alternative or the client insists, of course.
In such cases, I try to reinforce the context surrounding each occurrence with terms that do not require special punctuation but strengthen the intended message.
There are some projects on semantic forms and structures that would make information extraction punctuation/language/keyboard independent and all the associated problems a thing of the past (not available in the near future, but not a long-shot dream either, as some may think). Until then, we are stuck and need to make lemonade from lemons through the keyboard.
Orion
orion
10-27-2005, 03:07 PM
About the idea of arbitrarily encoding characters. This is something I tried in the past until the W3C put to rest that issue and I quote from http://www.w3.org/International/questions/qa-lang-why
"You might think information about natural language could be inferred from the character encoding. However, character encoding does not enable unambiguous identification of a natural language: there must be a 1:1 mapping between encoding and language for this inference to work... and there isn't one. For example, a single character encoding could be used for many languages, eg, Latin 1 (iso-8859-1) could encode both French and English, as well as a great many other languages. In addition, the character encoding can vary over a single language, eg, Arabic could be encoded with 'Windows-1256' or 'ISO-8859-6' or 'UTF-8' (or another Unicode encoding)."
An IR system must be told the language explicitly, since that 1:1 mapping between encoding and language does not exist.
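The W3C's point can be checked with a few lines of Python. The sample words and the fits helper are made up for this example; the encoding names are the ones mentioned in the quote.

```python
def fits(text: str, encoding: str) -> bool:
    """True if every character of text can be represented in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample words, one per language.
SAMPLES = {"fr": "Montréal", "es": "información", "ar": "عربي"}
ENCODINGS = ("iso-8859-1", "windows-1256", "iso-8859-6", "utf-8")

if __name__ == "__main__":
    for lang, word in SAMPLES.items():
        usable = [enc for enc in ENCODINGS if fits(word, enc)]
        print(lang, word, "->", ", ".join(usable))
```

French and Spanish both fit in iso-8859-1, and the Arabic sample fits in three different encodings, so knowing the encoding of a page says little about its language. That is why the lang attribute is declared separately.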
Further reading
Tutorial: Declaring Language in XHTML and HTML (http://www.w3.org/International/tutorials/language-decl/)
Tutorial: Using language information in XHTML, HTML and CSS (http://www.w3.org/International/tutorials/tutorial-lang/)
Language Identification in XML 1.0 (http://www.w3.org/TR/REC-xml/#sec-lang-tag)
Language Identification in HTML 4.01 (http://www.w3.org/TR/html401/struct/dirlang.html)
FAQ: Using HTTP and meta for language information (http://www.w3.org/International/questions/qa-http-and-lang)
Accessibility standard: clarifying natural language usage (http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign)
Identifying natural language for screen readers (http://diveintoaccessibility.org/day_7_identifying_your_language.html)
FAQ: Styling using the lang attribute (http://www.w3.org/International/questions/qa-css-lang)
FAQ: Two-letter or three-letter language codes (http://www.w3.org/International/questions/qa-lang-2or3)
Orion
For sites where accents are imperative (e.g. brand names or place names) yet you want to capture searchers typing without accents, what we do is optimize for the correct version and use PPC to target searchers using the non-accented version. Same goes for any misspelling actually. You could use a specific landing page with the misspelling and prevent search engine access via robots.txt.
That way you'll still capture the target audience who aren't necessarily good spellers but you won't have to display the unaccented version or the misspelling on the pages indexed by engines.
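As a rough sketch of the robots.txt part, the standard library can check which URLs a well-behaved crawler would be allowed to fetch. The /landing/ directory and the example.com URLs are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: PPC landing pages live under /landing/
# and are kept away from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /landing/
"""


def crawler_may_fetch(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler honouring robots.txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    print(crawler_may_fetch("http://www.example.com/landing/montreal.htm"))
    print(crawler_may_fetch("http://www.example.com/index.htm"))
```

The landing page with the unaccented spelling stays reachable from the ad, while compliant crawlers skip it, so only the correctly spelled pages are candidates for the organic index.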