Help needed - measuring International Connectivity


suely
03-09-2005, 09:37 PM
I am doing some academic research on international patterns of connectivity. Earlier researchers reported using AltaVista to look for bilateral hyperlinks between different national domains (using a simple <domain:.xx AND link:.yy>), but the results I get are incredibly inconsistent. Other people used specially built robots, but these clearly reach too small a sample.

My question is:

Does anyone out there know how I can search for inlinks and outlinks from Brazilian websites? I guess I must define <domain .br> as my reference and go from there. I tried to combine that with <allinanchor:> in Google, but it does not work.

If anyone can help, or thinks they can spare some time trying to work it out with me, please feel free to contact me directly (private messages are on in SEW) or to reply to this thread.

Thank you lots

Sue

Nacho
03-10-2005, 02:01 AM
My Portuguese is not good at all, but I decided to play around a bit. Looks like you're focusing on Brasil.

For inbound links

As far as we know, Yahoo! (http://br.search.yahoo.com) has more accurate link data than Google and others, so I'll try there first.

You will want to go to br.search.yahoo.com and then go into the "Advanced Search" link. Then follow these steps:

In "Show results with all these words", enter: site:domain.com.br (or domain.br)
Where it says "only domains with", select: .br
For "Country", select: Brasil

=>Search

Here is an example for site:www.uol.com.br (http://br.search.yahoo.com/search?_adv_prop=web&x=op&ei=UTF-8&prev_vm=p&va=site%3Awww.uol.com.br&va_vt=any&vp=&vp_vt=any&vo=&vo_vt=any&ve=&ve_vt=any&vd=all&vst=.br&vs=.br&vf=all&vm=i&vc=countryBR&fl=0&n=100), which returned 291,000 inbound links.

For outbound links

That's a little bit more complicated. You can try Xenu Link Sleuth (http://home.snafu.de/tilman/xenulink.html), or you will probably need to write a spider that goes through the entire site, stores all outbound links (excluding the domain being tested) and measures:

a) Domain TLDs (.com.br, .br, .gov.br, .org.br, etc.)
b) Capture outbound link's website IP address
c) Trace DNS back to the hosting provider to find out whether the site is hosted in Brasil
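
As a rough illustration of step (a), here is a minimal Python sketch that pulls the links out of one page and counts the outbound hosts by their last domain label. The function names and the sample HTML are made up for the example, it only excludes the exact host being tested (not its subdomains), and it does not fetch anything from the network. Steps (b) and (c) are left out: an IP lookup could be done with something like socket.gethostbyname, but deciding where a host is located needs extra data that is not covered here.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_hosts(html, page_url):
    """Return hostnames linked from the page, excluding the page's own host."""
    own_host = urlparse(page_url).hostname
    parser = LinkCollector()
    parser.feed(html)
    hosts = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL before reading the host.
        host = urlparse(urljoin(page_url, href)).hostname
        if host and host != own_host:
            hosts.append(host)
    return hosts


def suffix_counts(hosts):
    """Count hosts by their last DNS label (e.g. 'br', 'com', 'ca')."""
    return Counter(host.rsplit(".", 1)[-1] for host in hosts)


# Hypothetical page content, only for demonstration.
sample_html = """
<a href="/noticias">internal</a>
<a href="http://www.example.com.br/">BR</a>
<a href="http://www.example.gov.br/x">BR gov</a>
<a href="http://www.example.ca/">CA</a>
<a href="mailto:someone@example.org">mail</a>
"""
hosts = outbound_hosts(sample_html, "http://www.uol.com.br/index.html")
print(hosts)
print(suffix_counts(hosts))
```

A real spider would have to repeat this for every page of the site and keep track of pages already visited.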

Boa sorte!

suely
03-10-2005, 02:37 PM
Thank you, Nacho!
I am not sure that your suggestion will do the trick, but it certainly made me feel less lonely with my problem.

What I need to do is to count (and identify) sites which receive links from and give links to Brazilian domains. Therefore I need to be able to search for both inlinks and outlinks. I can certainly map (or spider, whatever it is called) the sites afterwards to check where the links come from/go to, but I need to identify the relevant sites in the first place. I will try your syntax for <site:.br> only and see what I get. Perhaps wildcards can help as well (<site:.*.br>?)

I was using Google because they consider links in their ranking strategy, and that is consistent with my research problem.

Do you have any idea why the syntax my predecessors reported having used with AltaVista does not work?

Thanks again

Sue

suely
03-11-2005, 08:02 PM
I am involved in an academic research project on international patterns of connectivity and I need help from Google's experts.
What I need to do is:

1. find all the Brazilian sites (domain .br) indexed by Google;
2. among these, find those sites which receive links (inlinks) from sites whose domain belongs to another country (such as .ca, .uk, .ar).
If needed, I can restrict to a subset of international domains (predefining some, let's say <.ca, .uk, .ar>, and working solely on those)
3. back to my former and larger set of Brazilian sites, I then need to find those which send links (outlinks) to domains from other countries.
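
To make the three sets concrete, here is a minimal Python sketch, assuming the link data is already available as (source URL, target URL) pairs from whatever search engine or crawler ends up being used. The function names and the sample links are hypothetical; "Brazilian" here simply means the host name ends in .br.

```python
from urllib.parse import urlparse


def country_suffix(url):
    """Return the last DNS label of the URL's host, or None if there is no host."""
    host = urlparse(url).hostname
    return host.rsplit(".", 1)[-1] if host else None


def classify_br_sites(links, foreign=("ca", "uk", "ar")):
    """links: iterable of (source_url, target_url) pairs.

    Returns three sets of hostnames:
    all .br sites seen, .br sites with foreign inlinks, .br sites with foreign outlinks.
    """
    br_sites, inlinked, outlinking = set(), set(), set()
    for source, target in links:
        src_tld, dst_tld = country_suffix(source), country_suffix(target)
        src_host, dst_host = urlparse(source).hostname, urlparse(target).hostname
        if src_tld == "br":
            br_sites.add(src_host)
            if dst_tld in foreign:
                outlinking.add(src_host)  # step 3: .br site linking abroad
        if dst_tld == "br":
            br_sites.add(dst_host)
            if src_tld in foreign:
                inlinked.add(dst_host)  # step 2: .br site linked from abroad
    return br_sites, inlinked, outlinking


# Hypothetical link pairs, only for demonstration.
sample_links = [
    ("http://a.example.ca/page", "http://one.example.com.br/"),
    ("http://one.example.com.br/", "http://two.example.org.br/"),
    ("http://two.example.org.br/", "http://b.example.ar/"),
    ("http://c.example.com/", "http://three.example.br/"),
]
br_sites, inlinked, outlinking = classify_br_sites(sample_links)
print(len(br_sites), sorted(inlinked), sorted(outlinking))
```

The hard part, of course, is obtaining the link pairs in the first place; the sketch only shows the bookkeeping once they exist.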

With these 3 sets of sites quantified and listed, I will proceed to some qualitative sampling, mapping and so on. My final products will be academic papers and connectivity webmaps focusing on Brazil.

Step one is easy - if I use any combination of characters which will necessarily be in any html document (e.g. <p> or <br> or &nbsp; or whatever) and restrict my search to domains .br, I will get the whole set of html pages registered to Brazil and known to Google (right now, 337,000,000 pages).

Steps 2 and 3 are more difficult - running spiders on all .br sites is unfeasible, and even a sample of the top-ranked ones would have to be very large, as I have already inferred that most .br sites do not give or receive links to/from other countries.

Earlier researchers reported having done similar searches using AltaVista (their syntax was <domain:.xx AND link:.yy>). The results I get doing that are incredibly inconsistent (I guess something has changed in AltaVista's syntax). Other people used specially built robots (spiders, crawlers), but these reach too small a sample. I also want to use Google because it considers links when ranking pages, so if I work from Google I can assume that the top-ranking sites will be the most 'visible' pages on the Web.
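
For reference, this is a small Python sketch of how the query strings for both directions can be generated, taking the <domain:.xx AND link:.yy> syntax exactly as quoted above and reading it as "pages in .xx that link to .yy". It only builds the strings; whether any engine still honours this syntax is precisely what is in doubt. The function name and the list of country codes are just for illustration.

```python
def pair_queries(reference, others):
    """Build one query per direction for each (reference, other) country pair.

    Keys are (source_domain, target_domain); values are the query strings.
    """
    queries = {}
    for other in others:
        # Pages in the other country's domain that link to the reference domain.
        queries[(other, reference)] = f"domain:.{other} AND link:.{reference}"
        # Pages in the reference domain that link to the other country's domain.
        queries[(reference, other)] = f"domain:.{reference} AND link:.{other}"
    return queries


queries = pair_queries("br", ["ca", "uk", "ar"])
for (source, target), text in queries.items():
    print(source, "->", target, ":", text)
```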

I tried to combine <domain:.br> with <allinanchor:> but it did not work. In fact, <allinanchor:> appears not to be working at all.
I also tried to search for links from pages in non-.br domains (e.g. .ar) to pages with domain *.br, but the wildcard apparently does not work there (the search finds the sequence <link br> as if I had asked for it as an exact phrase).

If anyone can spare some time trying to work it out for/with me, I will be incredibly grateful.

Thanks

Sue

Michael Martinez
03-12-2005, 01:19 AM
Try this link:

http://www.google.com/search?hl=en&q=site%3A*.br

Keep in mind that I haven't tried to evaluate the 19,000,000 results they returned. If that link doesn't work because of the hexadecimal (URL) encoding of the colon, all I used for a query was "site:*.br".
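
In case the encoding is confusing, here is a short Python sketch showing how the plain query turns into the link above and back again. Only the standard urllib.parse module is used; keeping "*" unescaped is a choice made here so that the result matches the link as posted.

```python
from urllib.parse import parse_qs, quote, unquote, urlparse

query = "site:*.br"

# Percent-encode the query; ":" becomes "%3A", while "*" is kept as-is.
encoded = quote(query, safe="*")
url = "http://www.google.com/search?hl=en&q=" + encoded
print(url)

# Decoding the q parameter gives back the original query text.
decoded = parse_qs(urlparse(url).query)["q"][0]
print(decoded)
```

Either form should mean the same thing to the search engine; the encoded one is just safer to paste into a link.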

Using a similar query on one of my domains brings up almost 4,000 hits, which -- depending on the age of the indexed documents -- could be fairly accurate.