Search Engine Watch
Old 06-23-2005   #1
mcanerin
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
Dash VS Underscore

Ok, this is a genuine question I'm hoping to get an answer for - qualified opinions/facts only, please.

Most SEOs know that the search engines will treat a dash "-" as a space, but will treat an underscore "_" as a hard character. I've spent a lot of time advising clients of this.

The problem is, I don't think I should have to.

The underscore is traditionally used by people in a *nix environment to signify a space. The character is specifically used to indicate a space by the person most qualified to decide it's a space.

Many of the most talented webmasters and programmers come from a *nix environment and use this convention. They are not doing so with the intention of manipulating a search engine, but rather to indicate in good faith that this is a space!

I can't imagine SE relevancy being helped by refusing to acknowledge the intentions of the content creators. I also refuse to accept "that's just the way things are" as an answer. If people did that we'd still be living in caves.

This character is being used by people who are focused on their content and not professional optimizers - isn't this the type of behavior that search engines claim to prefer?

Can *someone* please tell me why the underscore is not treated as a space? Is it a technical reason? A clueless assumption? An anti-spam measure (though I can't imagine how)? A web standards reason? Or is it treated as a space and we just think it isn't?

I'm very interested in the answer to this, since IMO it has a significant impact on relevancy as long as filenames are used for ranking and relevance - no matter how much.

Ian
__________________
International SEO
Old 06-24-2005   #2
Jill Whalen
SEO Consulting
 
Join Date: Jul 2004
Posts: 650
I don't know the reasoning behind it, but I did hear in person Google's Craig Silverstein say it is so.
Old 06-24-2005   #3
projectphp
What The World, Needs Now, Is Love, Sweet Love
 
Join Date: Jun 2004
Location: Sydney, Australia
Posts: 449
Quote:
Can *someone* please tell me why the underscore is not treated as a space? Is it a technical reason?
Lock in "C", the programming language, Eddie (or whoever does Millionaire in your country).

The reason underscore isn't a space is because of programming functions like mysql_affected_rows in PHP, and many older C functions, pre Hungarian notation (that I dislike... but that is a nuther story).

If underscore was a space, how would an SE find that function for me? I don't want three words on a page, I want mysql_affected_rows, one word.

IMHO, the underscore != space convention in SEO is a very good one, for that reason alone.
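To make that concrete, here is a minimal sketch, assuming a tokenizer built on the regex word class \w (which in Python, as in most regex flavors, includes the underscore). The tokenize helper is hypothetical and only illustrates the idea; it is not what any search engine is known to run.

```python
import re

def tokenize(text):
    # \w matches letters, digits and the underscore, so an underscore
    # never ends a token, while a hyphen or a space does.
    return re.findall(r"\w+", text.lower())

print(tokenize("mysql_affected_rows"))  # ['mysql_affected_rows']
print(tokenize("mysql-affected-rows"))  # ['mysql', 'affected', 'rows']
```

Under that assumption the function name survives as one searchable word, which is the behavior argued for above.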

And then there is intent. Hyphen / dash is always a space, or nothing in the case of line wrappers like some-
thing.

Underscore was always something different.

My $0.02.
Old 06-24-2005   #4
xan
Member
 
Join Date: Feb 2005
Posts: 238
projectphp, that is a good point, but I don't think it is in this case because the engine won't be using SQL queries to find results or mine information.

In computational linguistics, the hyphen is never changed and always kept as a hyphen because it indicates a compound noun. Underscores are used to separate words when spaces cannot be used or keywords would be gone, like when you save a document in Word, but it doesn't mean that they are nouns.

In some engines the hyphen is kept as a white space because it would indicate that there is a compound noun. This is important from a query point of view. If it isn't recognised you treat it as a string but allow partial matches.

If you type in Arab-Israeli into Google, you will get that compound back. If you type in ArabIsraeli you get all the error matches of the word (people who can't spell). When you type in Arab Israeli you will get both Arab-Israeli and Arab Israeli. Compound recognised. When you type in Arab_Israeli you get mostly the URLs, but it's still tokenized though and so you'll get matches for Arab-Israeli. Partial matches.

Type in "multi-valued". You will get results including all three ways, but it does suggest the "multivalued" spelling because it recognises that "multi" is not a word. Sometimes you just default to the case of a single token without hyphen.

Basically this is a common computational linguistics question: should you split them or not?

Hyphenated words are compounds which are always checked to be 2 nouns, and if so, both words are of interest. Compounds are very important. Underscore is just a separator.
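One possible answer to the "split or not" question is to do both: keep the compound and also emit its parts. The sketch below assumes that approach; expand_hyphenated is a hypothetical helper, and it skips the "are both parts nouns" check, which would need a real part-of-speech tagger.

```python
def expand_hyphenated(term):
    # Keep the compound itself and also emit each part, so a query for
    # either the whole compound or one of its parts can match.
    parts = [p for p in term.lower().split("-") if p]
    if len(parts) < 2:
        return parts
    return ["-".join(parts)] + parts

print(expand_hyphenated("Arab-Israeli"))  # ['arab-israeli', 'arab', 'israeli']
print(expand_hyphenated("Arab_Israeli"))  # ['arab_israeli']
```

The underscore form comes back untouched here because this sketch only treats the hyphen as a compound marker.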


N.B: A dash is a pause in thought, but a hyphen is a punctuation mark used between the nouns of a compound or the syllables.

(I have to go now, in a rush, so if this seems a bit convoluted do ask away!)

Cleaned up CL approach in blog.

Last edited by xan : 06-24-2005 at 06:28 PM.
Old 06-24-2005   #5
orion
 
 
Join Date: Jun 2004
Posts: 1,044

Ian,

Use - or _ in URL?

covers this and has links to other SEWF threads that discuss more in-depth
information regarding delimiters.

Still I want to add the following.

Two scenarios arise: when a delimiter DEL itself is
surrounded by spaces and when it is not; e.g.,

with spaces, k1 DEL k2
no space, k1DELk2


In a FINDALL and other search modes, DEL will be ignored when
surrounded by spaces.

Without spaces, DEL affects the query in different ways, depending on the type of delimiter used. For instance,

Hyphens >> In Google and other SEs, a hyphen acts as a
localized EXACT mode within the FINDALL search, so

k1-k2 and “k1 k2”

tend to return similar counts. If we search in FINDALL for k1-k2 + k3,
then the k3 can appear before or after the localized EXACT sequence,
anywhere in the document since we are submitting in FINDALL mode. This
is why it is considered a localized mode within a mode.
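A rough sketch of the "mode within a mode" idea, assuming documents are plain whitespace-separated tokens. The has_phrase and matches helpers are hypothetical names used only for illustration.

```python
def has_phrase(tokens, phrase):
    # True if the phrase terms occur consecutively, in order.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def matches(doc, phrase, others):
    tokens = doc.lower().split()
    # FINDALL: every loose term must be present somewhere in the document;
    # the hyphenated pair must additionally appear as an exact sequence.
    return has_phrase(tokens, phrase) and all(t in tokens for t in others)

docs = [
    "k3 appears before k1 k2 here",
    "here k1 k2 comes first and k3 later",
    "k1 and k2 are apart but k3 is present",
]
results = [matches(d, ["k1", "k2"], ["k3"]) for d in docs]
print(results)  # [True, True, False]
```

The third document contains all three terms but fails, since k1 and k2 are not adjacent.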

Pipes >> In Google, k1|k2 acts as an OR search, but in MSN this acts
as an EXACT mode. This was a new parsing rule added to MSN beta.

Underscores >> In Google and other search engines this appends
terms, so

k1_k2 is considered one word.

A query for k1_k2 returns documents containing k1_k2 but misses documents
containing k1, k2 or k1 followed by k2. GoogleGuy has explained this before
at http://www.webmasterworld.com/forum3/23371.htm

“Yah, I'd stick to hyphens, periods, or commas. Most people seem to prefer
hyphens. If you use an underscore '_' character, then Google will combine
the two words on either side into one word. So bla.com/kw1_kw2.html
wouldn't show up by itself for kw1 or kw2. You'd have to search for kw1_kw2
as a query term to bring up that page.

The characters you can use in domain names are pretty restricted: a-z, 0-9,
and the '-' character. For subdomains and url paths (stuff after the slash),
you've got a lot more flexibility, but I'd recommend keeping it pretty simple.
That makes it easier for search engines and users to understand.
There's actually a proposal so that you can encode all sorts of characters in
a domain (e.g. CJK--Chinese/Japanese/Korean) but that's a little outside the
scope of your question, and I'm not as familiar with the encoding. My rule of
thumb is to keep it simple where you can.”
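The effect described in the quote can be sketched with a toy inverted index over URLs. This assumes a tokenizer that keeps underscores inside a token and splits on everything else; build_index and the page labels are made up for the example, and the URLs are the ones from the quote.

```python
import re
from collections import defaultdict

def build_index(pages):
    # Map each token to the set of page labels whose URL contains it.
    # \w+ keeps underscores inside a token and splits on hyphens,
    # dots and slashes.
    index = defaultdict(set)
    for name, url in pages.items():
        for token in re.findall(r"\w+", url.lower()):
            index[token].add(name)
    return index

pages = {
    "underscore": "bla.com/kw1_kw2.html",
    "hyphen": "bla.com/kw1-kw2.html",
}
index = build_index(pages)
print(sorted(index["kw1"]))      # ['hyphen']
print(sorted(index["kw1_kw2"]))  # ['underscore']
```

In this toy model the underscore page can only be reached through the combined term kw1_kw2.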


Ian,

As to WHY, which is your original question, I believe this is to adhere
to conventions used in many programming languages and with regular
expressions, where incidentally

hyphens are used to denote ranges (thus, localized EXACT sequences can be viewed as ranges where the window separation, w (# terms separating the query terms), is of zero length)

pipes are used to denote OR operations

underscores are used to visually identify variables labeled with
multiple words as one single label for the variable.
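For reference, the three regular expression conventions listed above look like this in Python's re module. This only shows the regex side of the analogy; whether search engines actually borrowed their parsing rules from it is the belief stated above, not something the code proves.

```python
import re

# Hyphen inside a character class denotes a range: [a-c] is a, b or c.
in_range = re.findall(r"[a-c]", "abcd")

# Pipe denotes alternation (OR).
either = re.findall(r"k1|k2", "k1 k3 k2")

# Underscore counts as a word character, so a multi-word label stays whole.
labels = re.findall(r"\w+", "max_row_count = 10")

print(in_range)  # ['a', 'b', 'c']
print(either)    # ['k1', 'k2']
print(labels)    # ['max_row_count', '10']
```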


With proprietary IR systems the above rules can be overridden. For
instance, legal IR systems can have different parsing rules (hence
interpretations) for hyphens, returning different sets of results.


One more thing, watch out for experimentations and modifications made to the parsing rules.

From time to time, some engineers like to test/modify these. One such test
was performed in Google News in 2003. Back then, it was reported that
underscored queries were treated as separate terms. Research Buzz reported
in http://www.researchbuzz.org/google_n...e_syntax.shtml

“Google News has a special syntax that it doesn't share with the Google Web
search. It's called source: and it works like this:

source:[name of source]

So if you search for cricket source:bbc , the information you receive back will
be only from the BBC. What happens if you want to search for a source that
has multiple words? You separate the words with underscore. Thus
Washington Post becomes source:washington_post .”


End of the quote.


I searched Google News today and searches using washington_post,
car_insurance, search_engine returned nothing. Other searches returned
nothing, too. It appears, at least to me, they no longer support this modification but I could be wrong.



Orion

Last edited by orion : 06-24-2005 at 05:35 PM.
Old 06-24-2005   #6
projectphp
What The World, Needs Now, Is Love, Sweet Love
 
Join Date: Jun 2004
Location: Sydney, Australia
Posts: 449
Quote:
but I don't think it is in this case because the engine won't be using SQL queries to find results or mine information.
No idea what that means, but my understanding is that they create a reverse index of all words, their position and other stuff. Wouldn't that make the choice of how to enter a word, either as the singular mysql_fetch_rows or as three words mysql fetch rows, the issue?

Happy to be corrected!

Quote:
If you type in Arab-Israeli into Google,
Just did, and got very similar results. The main difference was highlighting!

But try [url=http://www.google.com/search?hl=en&lr=&c2coff=1&biw=1257&q=on-line&btnG=Search]on-line[/url] and you get a lot of results for online, no hyphen. What does that show? Maybe that Google includes misspellings now? Not sure we can draw conclusions from that, as the cause and effect is hard to isolate.

I still stand by my original comments. Programming languages, especially those that the geeks who write SEs use, have so many functions with underscores in them that separating them out as individual words would make for worse results when it matters most to them.
Old 06-25-2005   #7
xan
Member
 
Join Date: Feb 2005
Posts: 238
I didn't make myself very clear, that wouldn't surprise me!

SQL is not used to get stuff out of the index. Therefore the reason is not due to how SQL queries are syntaxed.

I explained fully elsewhere, but yes, it is the highlighting you're meant to be looking at. It tells you what's being picked.

It's a CL problem.

It's a huge misconception that it's all geeky stuff. This is hardcore computing science, not dabbling with PCs. Any IR system will rely very heavily on computational linguistics.

Try it with Right-of-way then. "of" gets dropped (it's a stop word, remember), so the hyphens have become spaces. Right_of_way will still have "of". It's not recognised as a single word.
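A small sketch of why "of" could survive in one form and not the other, assuming hyphens split tokens, underscores do not, and stop words are removed only when they surface as separate tokens. STOP_WORDS and index_terms are hypothetical; real engines' stop lists and parsing rules are not public.

```python
import re

STOP_WORDS = {"of", "the", "a"}  # tiny illustrative list, not any engine's real one

def index_terms(text):
    # Underscores stay inside tokens (\w), hyphens split them;
    # stop words are dropped only when they appear as separate tokens.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(index_terms("Right-of-way"))  # ['right', 'way']
print(index_terms("Right_of_way"))  # ['right_of_way']
```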

Just linguistic rules on the creation of compounds and topical types.

Last edited by xan : 06-25-2005 at 06:07 AM.