Mary E. Brown, Ph.D., Professor
Digital libraries and their users

In this unit we ask two questions: What are some of the problems of searching in a digital library? How do digital libraries improve user environments? In answering these questions we consider retrieving text in images, retrieval in a pictorial digital library, multimedia abstractions for a digital video library, retrieving information from spoken documents, indexing of the internet, metadata for digital libraries, online shopping models, software work assistants, and identifying relevant documents for linking.
WHAT ARE SOME OF THE PROBLEMS OF SEARCHING IN A DIGITAL LIBRARY?

To give a flavor of the kinds of research interests that are implicated in digital coherence, we will briefly touch upon current research in four areas which present search and retrieval problems in digital libraries: retrieving text from within images, organizing and searching a pictorial library, creating abstracts of multimedia documents for surrogate searching, and retrieving information from spoken documents.

Retrieving text in images

Wu, Manmatha and Riseman (1997) explore the problem of finding text in an image. Text usually appears against a clean background, such as black typeface on a white background. Such images can be converted into computer-readable form by using Optical Character Recognition (OCR) technology. However, when there is not a clean background, such as "Celtics 33" embedded in a photograph as markings on a player's clothing, or text printed on a shaded or patterned background, such as information on a stock certificate, how can the text be automatically retrieved for indexing and database construction? Wu et al. have demonstrated that text can be retrieved through a process which includes segmenting textures, applying heuristics of text string characteristics, constructing density histograms, and then filtering below given frequencies. The retrieved text is then passed through an OCR system and delivered as editable text. In current research, 95% of the characters and 93% of the words in experimental materials were successfully extracted. Of the extracted material in OCR-readable fonts, 84% of the characters and 77% of the words were successfully recognized by a commercial OCR system (Wu, Manmatha & Riseman, 1997). Successful automated extraction of text from images will make possible automated indexing of visual content based on associated (or situated), and at times controlled, vocabulary.
Organization and retrieval in a pictorial digital library

Another approach to indexing, organizing and retrieving pictures was developed by a team with members from computer science, electrical engineering, and library science (Quintana, 1997). Knowledge Engineering (KE) methodology was used to categorize images based on a background knowledge base (that is, human experience). Clustering (including clustering within clusters) of prototypical features (that is, features that focus on the similarity among individual items in the group) was used to measure conceptual coherence. This type of clustering or classification can be accomplished in two ways: a group of people can be shown a series of pictures and asked to name what each picture depicts, or the group can be asked to list things they would expect to see in a type of scene, for example a beach scene. A database can then be constructed from prototypical representatives of the common elements in the photographs. Then, when a photograph is passed through a scanner, features of the picture are recognized and compared to, for example, the prototypical elements of a beach. If the match of features is within a set range, that photograph will be retrieved as a picture of a beach. Subcategories of beach can also be constructed using a set of features which distinguish among, for example, a beach at night, a beach during a hurricane, a crowded beach, etc. In studies, this type of knowledge-based retrieval yielded increased recall with simultaneous increased precision. (Usually, an increase in either recall or precision causes a simultaneous decrease in the other. The goal, perhaps utopian, is to achieve a significant increase in both recall and precision, with the ultimate goal of 100% precision with 100% recall.) The context within which the described indexing and organizing was accomplished has been misunderstood by some in the field of library science.
Specifically, some have viewed this research as simply recreating Library of Congress Subject Headings. In fact, this research enters the fuzzy and little-researched area of naming complex categories, specifically scenes. While Quintana has been accused of ignoring the work in librarianship, his fault is actually in failing to link his research to work in categorization and naming and in prototypicality studies, for example, those of Tversky and Hemenway, Rosch, Barsalou, and others, and, perhaps, in failing to distinguish it from established classification and categorization activities in librarianship. Per Quintana's conclusions, though the KE methodology is effort-intensive, for highly consulted picture knowledge bases the investment may be justified by the added retrieval performance achieved.

Multimedia abstractions for a digital video library

Christel, Winkler and Taylor (1997) have found that users of websites are not willing to scroll through pages or wait for images to load: Christel et al. believe users are "naturally lazy" (a more empirical and tactful way to state this would be in terms of cognitive economy). Whatever term is used to label this problem, given current technology, we need to abstract multimedia for users in ways which will complement their "natural laziness." That is, we need an abstraction technique for indexing visual and auditory materials that will increase satisfaction and performance in querying for and retrieving multimedia materials. For example, if a user needs to locate 30 seconds of film depicting a given concept and 5 hours of film have been identified as relevant to the concept, the user would need to view all 5 hours of film to locate the best 30 seconds of desired images, while also dealing with the mechanics of noting and comparing candidate segments during that initial viewing. Christel et al.
present a method for breaking film into segments and cataloging each segment for what it contains in terms of text, faces, and motion. Christel et al. point out that we have learned from cinema that the more important representative images in a film segment occur where the camera motion stops. Therefore, by taking a frame at this stop-action point, we represent a number of concepts contained in the preceding motion. Using this knowledge gained from cinema, Christel et al. offer five indexes or levels of abstraction to the user: a title in text only; a poster frame of a small visual sample with no temporal component; a filmstrip of visual images progressing through a range of temporal data; skims of visual, audio, and temporal data in a condensed form; and match bars, where a text query to a database identifies potentially relevant segments and presents these in skim or filmstrip forms to the user (as opposed to browsing without the aid of a qualifying search). The need for such abstraction is further underscored by the time currently required to download film from internet resources for viewing. For example, a 210 second (18 megabyte) clip transferred at a sustained rate of 28.8 Kbps would take 83 minutes to download; at peak usage times it could take hours (Christel, Winkler & Taylor, 1997). By accessing first a title, then a poster frame, then a filmstrip and finally skims, a user can receive representative samplings in much shorter time frames, permitting non-relevant clips to be rejected before excessive time is spent determining that they are not useful.

Retrieving information from spoken documents

Witbrock and Hauptmann (1997) are concerned with 1) the automated process of converting spoken broadcast news into a text database using a speech recognition program and 2) the resulting problem of reduced retrieval of material from that database.
Their particular interest is in News-on-Demand (see http://www.informedia.cs.cmu.edu/), which receives a large volume of material daily that must be added to the database for timely retrieval and use of news. The current problem arises from query terms which match the spoken (audio) document but fail to be transcribed because the spoken word is not in the lexicon (vocabulary) of the speech recognition system. Witbrock and Hauptmann (1997) have developed a system which uses fixed-length phoneme strings to search the phoneme space of the spoken document using an inverted index of phonemes. That is, the user enters a spoken search query, which the system converts to a text representation of a fixed number of sequential phonemes from the spoken query. This is then submitted to an index of text representations of strings of sounds in the spoken news. This system is an alternative to a speech recognition system that matches the spoken document against a controlled vocabulary, creating a database which is searchable through a text query. Witbrock and Hauptmann's system is equivalent to a natural language search of a text document, with the input being spoken natural language rather than written natural language. This technique has been shown to recapture some of the information lost to out-of-vocabulary words in the speech recognition transcript.
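
The idea of an inverted index over fixed-length phoneme strings can be illustrated with a minimal Python sketch. This is not Witbrock and Hauptmann's implementation: the string length of three, the scoring by count of shared strings, the function names, and the phoneme transcripts are all assumptions made for illustration.

```python
from collections import defaultdict


def phoneme_ngrams(phonemes, n=3):
    """Return every run of n consecutive phonemes as a tuple."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


def build_index(documents, n=3):
    """Map each phoneme n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, phonemes in documents.items():
        for gram in phoneme_ngrams(phonemes, n):
            index[gram].add(doc_id)
    return index


def search(index, query_phonemes, n=3):
    """Rank documents by how many distinct query n-grams they contain."""
    scores = defaultdict(int)
    for gram in set(phoneme_ngrams(query_phonemes, n)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical phoneme transcripts, invented for illustration only.
documents = {
    "story_1": ["K", "L", "IH", "N", "T", "AH", "N", "S", "P", "OW", "K"],
    "story_2": ["W", "EH", "DH", "ER", "R", "IH", "P", "AO", "R", "T"],
}
index = build_index(documents)
results = search(index, ["K", "L", "IH", "N", "T", "AH", "N"])
print(results)
```

Because matching happens on sound sequences rather than on dictionary words, a name that is missing from the recognizer's lexicon can still be found as long as its phonemes were captured.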
HOW DO DIGITAL LIBRARIES IMPROVE USER ENVIRONMENTS?

Indexing of the internet

Indexing the internet is a major challenge facing information professionals and technology. In one line of research that focuses on this problem, Thompson, Shafer and Vizine-Goetz (1997) look at the degree of class integrity in the Dewey Decimal Classification and evaluate Dewey as a knowledge base for an automatic subject assignment tool. Dewey is a widely used classification scheme. It is currently used in 135 countries and has been translated into 30 languages. It is used in 95% of U.S. public and school libraries (Thompson, Shafer & Vizine-Goetz, 1997). Dewey is continuously updated, with about 10 LCSH headings per week being incorporated into Dewey (Vizine-Goetz, presentation, ACM Digital Libraries '97, July 1997). In a study, the integrity of the Dewey Classification System was tested by feeding Dewey concept definitions into the system to see how well Dewey could classify its own concept definitions. The researchers concluded from these tests that Dewey is adequate to classify the Web, but not completely. The major drawback to this work, at least as reported at ACM Digital Libraries '97 and in the associated paper, is that the researchers fail to draw a distinction between classification systems, such as Dewey and LC, and categorization or subject heading systems, such as LCSH and Sears. We could say that, for example, Dewey and Sears are at 90° to one another: Dewey divides knowledge, along the warp, into disciplines and sub-disciplines, while Sears weaves the disciplines together, along the woof or weft, by common topic. To add subject headings to a classification scheme would appear to add depth to the divisions within each discipline while defeating, or at least camouflaging, the purpose of subject headings (see Sears, Preface). One would think this would also make the classification scheme unnecessarily complex, and perhaps awkward.
Nevertheless, the technique used in this research is interesting. Words in the search query were truncated to their stems and then matched to stems in the Dewey scheme. Stemming, notwithstanding some shortcomings, such as stemming "German" to "germ" and matching the query to biological concepts, did improve retrieval overall. It would be interesting to see longer queries, for example "the German people, country, and customs," stemmed and rated on density of match within hierarchies to see if retrieved concepts more closely matched query concepts.

Metadata for digital libraries

One problem which plagues the online world is the propagation of many varied, and sometimes drastically different, forms and formats for search and retrieval. Baldonado, Chang, Gravano and Paepcke (1997) address this problem by designing a metadata architecture for digital libraries. Metadata is data which tells us about other data. For example, we may have data about a bill we just received: it is for $50, it is from South's Garage, etc. Metadata about the class of object "bill" might include that a bill can have the following attributes: a date, an amount due, an interest rate, a customer name, a customer address, etc. The metadata permits us to build translation modules between systems. For example, I might want to copy the data from South's Garage into an automated budget and banking service program which manages my bills and payments. The metadata translator would know to write, for example, the value of the billing attribute Total_Due into the budget attribute Debt_Amount and the value of the billing attribute Minimum_Payment into the budget attribute Minimum_Due. Metadata contains the knowledge required for the cognitive task of adapting one system's available data to the data requirements of another system. Translators facilitate this adaptation by mapping the attributes of one system onto the attributes of another system.
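
The bill example can be sketched as a small attribute translator in Python. This is only an illustration of attribute mapping in general, not the architecture of Baldonado et al.; the function name, the map structure, the Vendor attribute, and the minimum payment value are assumptions, while the attribute names Total_Due, Debt_Amount, Minimum_Payment and Minimum_Due and the $50 bill come from the example above.

```python
# Attribute map from the billing system to the budget program.
BILLING_TO_BUDGET = {
    "Total_Due": "Debt_Amount",
    "Minimum_Payment": "Minimum_Due",
}


def translate(record, attribute_map):
    """Copy each mapped attribute to its target name; list the rest as unmapped."""
    translated = {}
    unmapped = []
    for name, value in record.items():
        if name in attribute_map:
            translated[attribute_map[name]] = value
        else:
            unmapped.append(name)
    return translated, unmapped


# The minimum payment of 10.00 and the Vendor attribute are invented for illustration.
bill = {"Total_Due": 50.00, "Minimum_Payment": 10.00, "Vendor": "South's Garage"}
budget_entry, unmapped = translate(bill, BILLING_TO_BUDGET)
print(budget_entry)
print(unmapped)
```

Reporting the unmapped attributes, rather than silently dropping them, shows where a person would still need to decide how one system's data fits the other.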
Currently, much of this mapping is performed through human effort. The metadata architecture designed by Baldonado et al. (1997) provides a variety of services. These services include: finding resources that are likely to satisfy a given query; formulating queries that are appropriate for multiple sources; translating queries into those used by various sources; and making sense of query results. To provide these services, Baldonado et al. (1997) designed four basic component classes for their architecture: attribute model proxies, attribute model translators, metadata facilities for search proxies, and metadata repositories. For a full discussion of this design, refer to Baldonado et al. (1997). For now, the main point to understand is that many systems purporting to be digital libraries or associated information services are not compatible with one another, which prevents the desirable sharing of information. (The attribute of compatibility among systems has been termed interoperability.)

Online shopping models

Interoperability is a key issue in digital libraries. That is, researchers are seeking different ways to bring about harmonious interaction among the various services and systems. Ketchpel, Garcia-Molina and Paepcke (1997) are specifically concerned with the interactions between customers and information providers and merchants. Their research centers on developing shopping models for information commerce, specifically online financial transactions. The models they work with include delivery of weekly issues of a publication following receipt of payment, automatic billing for online searching, pay-per-view transactions, pay-what-you-will (for example, shareware) transactions, and pre-paid vouchers for online services. A side issue, not covered by Ketchpel et al. but mentioned by Samuelson (Plenary Address, 2nd ACM International Conference on Digital Libraries, July 1997), is the issue of a deadbeat list.
With commerce being conducted online, it would be quite easy for vendors to gain access to a list of individuals who have past-due accounts with other vendors, presumably alerting them to the risk of entering into a transaction with those on the list. The problem with a deadbeat list is that those listed are not necessarily deadbeats. For example, suppose a vendor received payment for a bill incurred by a library for online searching activity, but the vendor's accounting system does not show receipt of the money. The library could be automatically placed on a deadbeat list and refused further service, resulting in other vendors also refusing service. This could go on until the library either double-paid the disputed amount or the vendor's accounting error was resolved. In addition, vendors or services could accumulate deadbeat listings, for example, to establish a financial rating for organizations. The problem is not as simple as excluding all customers with accounts in dispute, as this would open up the opportunity for some customers to falsely dispute all bills, thereby avoiding paying debts while keeping a good financial rating. We need to be sensitive to the uses of technology not only for all the benefits it can provide, but also for the possible hazards that can be constructed with it.

Software work assistants

Sanchez, Leggett and Schnase (1997) are developing software service agents for publishers, librarians and patrons, using the concept of delegation and indirect management of tasks. In the agent services architecture proposed by Sanchez et al., the patron would assign a task to a software agent, for example, to find the current value of 290 Danish Kroner (DKK) in US dollars (USD).
The patron's agent would have access to other services at the patron side, such as a query interpreter, through which the patron's information need could be processed prior to forwarding it to the library. At the library side, a library service agent would receive the query, locate the information in its resource repository, and forward it to the patron's agent.

Identifying relevant documents for linking

Shin, Nam and Kim (1997) discuss automatic creation of hypertext links between documents using statistical and semantic similarity between documents. As identification of candidate documents for linking is an intellectually intensive task, it is also an area of high interest for automation. Statistical similarity is based on a weighting scheme that considers the frequency of occurrence of keywords in one document versus the frequency of occurrence of those keywords in another document. Semantic similarity is based on the closeness of keywords in one document to some point in a thesaurus compared with the closeness of keywords in another document to that same point in the thesaurus. For example, statistical similarity could show that one document dealing with water in polymers is not statistically similar to another document dealing with moisture in polymers. However, when measured against a thesaurus, water in polymers is equivalent to moisture in polymers, and therefore the two documents are linked based on semantic similarity of keywords. Shin et al.'s research demonstrated automatic creation of hypertext links which achieved a 77.25% match to links created by human experts. Shin et al.'s work focused on documents written in Korean; however, it has implications for documents written in other languages and needs to be replicated using documents in other languages.
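
The contrast between the two measures in the water and moisture example can be sketched in Python. This is a simplified stand-in, not Shin et al.'s weighting scheme: cosine similarity over raw keyword counts is assumed for the statistical measure, and the semantic measure is approximated by first replacing each keyword with a preferred thesaurus term. The one-entry thesaurus and all names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two keyword-frequency Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical thesaurus: maps a keyword to its preferred term.
THESAURUS = {"moisture": "water"}


def normalize(keywords):
    """Count keywords after replacing each with its preferred thesaurus term."""
    return Counter(THESAURUS.get(k, k) for k in keywords)


doc_a = ["water", "polymers"]
doc_b = ["moisture", "polymers"]

# Statistical: only the shared keyword "polymers" contributes to the score.
statistical = cosine(Counter(doc_a), Counter(doc_b))
# Semantic: "moisture" is treated as "water", so the keyword sets coincide.
semantic = cosine(normalize(doc_a), normalize(doc_b))
print(statistical, semantic)
```

A linking tool built along these lines would propose a link when either score passes some threshold; the threshold itself would have to be tuned against links chosen by human experts.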
REFERENCES

Baldonado, M., Chang, C-C. K., Gravano, L., & Paepcke, A. (1997). Metadata for digital libraries: Architecture and design rationale. Proceedings of the 2nd ACM International Conference on Digital Libraries, 47-56.

Christel, M. G., Winkler, D. B., & Taylor, C. R. (1997). Multimedia abstractions for a digital video library. Proceedings of the 2nd ACM International Conference on Digital Libraries, 21-29.

Ketchpel, S. P., Garcia-Molina, H., & Paepcke, A. (1997). Shopping models: A flexible architecture for information commerce. Proceedings of the 2nd ACM International Conference on Digital Libraries, 65-74.

Quintana, Y. (1997). Organization and retrieval in a pictorial digital library. Proceedings of the 2nd ACM International Conference on Digital Libraries, 13-20.

Sanchez, J. A., Leggett, J. J., & Schnase, J. L. (1997). AGS: Introducing agents as services provided by digital libraries. Proceedings of the 2nd ACM International Conference on Digital Libraries, 75-82.

Shin, D., Nam, S., & Kim, M. (1997). Hypertext construction using statistical and semantic similarity. Proceedings of the 2nd ACM International Conference on Digital Libraries, 57-63.

Thompson, R., Shafer, K., & Vizine-Goetz, D. (1997). Evaluating Dewey concepts as a knowledge base for automatic subject assignment. Proceedings of the 2nd ACM International Conference on Digital Libraries, 37-46.

Witbrock, M. J., & Hauptmann, A. G. (1997). Using words and phonetic strings for efficient information retrieval from imperfectly transcribed spoken documents. Proceedings of the 2nd ACM International Conference on Digital Libraries, 30-35.

Wu, V., Manmatha, R., & Riseman, E. M. (1997). Finding text in images. Proceedings of the 2nd ACM International Conference on Digital Libraries, 3-12.
Last Modified
Thursday, July 7, 2005