IR on the web vs other repositories


xan
04-05-2005, 07:32 AM
From a nice and short article from L'Express (http://www.lexpress.mu/display_article_sup.php?news_id=39158):

"The principles of searching a catalogue are different from that of a WWW and the WWW is devoid of cataloguing and classification as opposed to a library system.

“the WWW has created a revolution in the accessibility of information” (NISO, 2004).

Information overload can be defined as “the inability to extract needed knowledge from an immense quantity of information for one of many reasons” (Nelson, 1997).

“The volume of information on the Internet creates more problems than just trying to search an immense collection of data for a small and specific set of knowledge” (Nelson, 1997).

“finding authoritative information on the Web is a challenging problem” (Savoy in Baeza-Yates and Schauble, 2002) as opposed to a library where we would get mostly authoritative information.

“A Web page typically contains various types of materials that are not related to the topic of the Web page” (Yu et al., 2003). As such, the heterogeneous nature of the Web affects information retrieval. Most of the Web pages would consist of multiple topics and parts such as pictures, animations, logos, advertisements and other such links. “Although traditional documents also often have multiple topics, they are less diverse so that the impact on retrieval performance is smaller” (Yu et al., 2003). For instance, whilst searching in an OPAC, one won’t find any animations or pictures interfering with the search."
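
One way to picture the effect described in the excerpt is the minimal sketch below. It assumes a simple length-normalised term-frequency score (my assumption, the article does not name a scoring method) and uses made-up token lists. The same content scores lower once navigation and advert tokens are mixed in.

```python
from collections import Counter


def tf_score(query_terms, doc_tokens):
    """Length-normalised term frequency: matching tokens / total tokens."""
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms) / len(doc_tokens)


# Hypothetical catalogue record: topic terms only.
catalogue_record = "information retrieval library catalogue search".split()

# Hypothetical web page: the same content plus unrelated page furniture.
web_page = catalogue_record + "home login advert advert banner contact sitemap".split()

query = ["information", "retrieval"]

clean = tf_score(query, catalogue_record)
noisy = tf_score(query, web_page)
print(clean, noisy)
```

Real engines use far more elaborate ranking, so this only shows the direction of the effect, not its size.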

Other differences not mentioned: the size of web pages varies hugely, the web grows exponentially, the documents are unstructured, the index has to be kept fresh, and content quality varies.

In a digital library the user comes with a clear idea of what they are looking for (a document, an author, a specific subject). On the web that is not so, which makes it necessary to refine the query further.

I think we often forget that web search engines are still very new and that we are, relatively speaking, still inexperienced. All previous work has been in digital libraries and data mining. In fact the majority of IR work has been done in DLs. We have to modify tried and tested techniques and invent new ones to deal with this new format. This would suggest that previous IR work can help to understand what is going on in web work today.

strategicrankings
04-05-2005, 12:06 PM
Hi Xan,

thanks for the reference. Nice article indeed.

Do you read L'Express often or, simply put, are you from Mauritius? I am. And it's nice to see that one of my peers is referenced here on SEW.

xan
04-05-2005, 07:29 PM
I like L'Express, it's a good little read that!

I'm not from Mauritius but could do with the sunshine. I'm based in the UK but not exactly British either: I'm French.

Send us all some good weather!

claus
04-06-2005, 01:03 PM
In a digital library the user comes with a clear idea of what they are looking for (a document, an author, a specific subject). On the web that is not so, which makes it necessary to refine the query further.

Funny, I see it the other way round. A digital library (perhaps we don't think about the same thing here) is a well-organized collection of information, whereas the web is not.

In both cases the searcher might, or might not, know exactly what s/he is looking for, but mostly the searcher should have a pretty good idea, otherwise there would be no point in looking ;)

So, I see no need to refine the query, at least not initially. I see a need to refine the data set in order to be better able to extract the relevant portions from it. If this still does not solve the particular problem, the query should be refined/expanded/whatever.

xan
04-06-2005, 06:54 PM
When you go to a science repository to research SVMs and genomes, you type that into the search and get related documents. You might not know exactly what you want, but you have a clear idea of the search area, and it is narrow. Also, you are likely to get every paper to do with that subject, as they are all organised according to a specific format or markup.

On the web you have to be a lot more specific than that. If you entered the same search there, you would have to try again to get at what you actually intended.

By using a digital repository on a specific area (arts, science, ecology,...) you have already stated your intention.

claus
04-07-2005, 06:51 PM
Okay, I might not have expressed my opinion clearly. Let's say there are two scenarios:

a) the user knows exactly what s/he is looking for
b) the user has some vague idea about the general direction but little more

And two types of collections:

1) A catalogue, DL, or otherwise specialised document collection
2) The general mess of the WWW

And, again, two types of interfaces:

I) A search box, in which the user can enter terms
II) An ordered, directory type, selection of documents by topic

Given that (1) and (2) both hold the answer to the problem at hand, I would say that for scenario (a), interface (I) is best, while for scenario (b) interface (II) is best.

If, on the other hand, the user is restricted to having interface type (I) only, then I would say that:

- collection (1) is more efficient with regard to scenario (b), but
- both collections might be equally efficiently searched in scenario (a)

IOW, if we only consider search boxes as the interface, I think you are right regarding problems that are not well defined beforehand. Enabling "drill-down" searching as in a directory, OTOH, offers some of the benefits of the DL in a web context.

(This is an example of refining the data set as opposed to refining the query.)
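
To make the distinction concrete, here is a minimal sketch assuming a toy in-memory collection. The documents, topic labels and the `search` / `drill_down` helpers are all invented for illustration and do not describe any real engine.

```python
# Toy collection; titles and topic labels are made up for illustration.
DOCS = [
    {"title": "SVM classifiers for genome annotation", "topic": "science"},
    {"title": "Genome sequencing on a budget", "topic": "science"},
    {"title": "SVM brand vacuum cleaners reviewed", "topic": "shopping"},
    {"title": "Genome: a board game review", "topic": "games"},
]


def search(docs, terms):
    """Return titles of docs whose title contains every term (case-insensitive)."""
    terms = [t.lower() for t in terms]
    return [d["title"] for d in docs if all(t in d["title"].lower() for t in terms)]


def drill_down(docs, topic):
    """Refine the data set: keep only documents filed under one topic."""
    return [d for d in docs if d["topic"] == topic]


# Vague query over the whole collection: on-topic and off-topic hits are mixed.
whole = search(DOCS, ["genome"])

# Option 1, refine the query: add a term.
refined_query = search(DOCS, ["genome", "svm"])

# Option 2, refine the data set: same vague query, narrower collection.
refined_data = search(drill_down(DOCS, "science"), ["genome"])

print(whole)
print(refined_query)
print(refined_data)
```

In this toy case the drill-down keeps both science documents while the extra query term keeps only one, which is the kind of trade-off a searcher with a vague idea would care about.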

With a very well defined problem, e.g. "What is the name of the president of the USA", the two collections might be able to provide the answer equally quickly.

All IMHO, AFAIK, FWIW, of course.

claus
04-07-2005, 07:49 PM
I don't know if this adds any value to the debate or not, but in seeing things from the searcher's viewpoint I couldn't help thinking about this:

Digital libraries, information repositories, or whatchamacallits need to be found by the searcher before a search can be initiated using them.

The WWW "is everywhere" - grab any browser window and you will easily locate a search box, a directory, or both. Locating the topic specific information collections is not easy, so this adds an extra search process before the search in the DL can actually take place.

And, of course, some of them are also access-restricted, which skews any comparison even more to the benefit of WWW search.