ISO/IEC JTC1/SC34/WG3

Date: 1999-04-20

Title:  Development of a Multimedia Information Retrieval System Architecture with integrated Image Information Retrieval Technique

Source:  Ki-Joung Kang, Multimedia Technical Lab, Korea Telecom


Project:

Status:  To be discussed by JTC1 committees working with text, audio or images to determine how JTC1 can ensure the swift development of a unifying architecture for electronic information retrieval in today's multimedia information environment.

1. Abstract

Many mechanisms for describing information sources using applications of ISO/IEC 8879:1986, the Standard Generalized Markup Language (SGML), such as the W3C HyperText Markup Language (HTML) and Extensible Markup Language (XML), have been proposed over the last few years. Examples include the Dublin Core proposals for identifying text sources, the use of the W3C Resource Description Framework (RDF) to describe electronic data resources, and the development of the MPEG-7 Multimedia Content Description Interface.

Text-based document processing techniques for information retrieval over the Internet have been widely developed, but normally these solutions are proprietary and no attempt has been made to standardize them. Recently, however, work has started on an XML Query Language (XQL) that will provide a standardized mechanism for searching XML-encoded data sources.

Future information sources are, however, more likely to be of a truly multimedia nature, being composed of many data types such as text, still images, animated graphics, audio clips and moving video images. Therefore, information retrieval engines for multimedia data will be required for the next generation of information systems.

An architecture for a Multimedia Information Retrieval (MIR) system is proposed in Figure 1 of this contribution. A model for an Image Information Retrieval (IIR) system, which is a key component of the MIR architecture, is also proposed in Figure 2.

2. Multimedia Information Retrieval System

As the Internet and World-Wide Web (WWW) become more popular and widely used, the number of information hosts and users is rapidly increasing. As a result, the amount of electronically accessible information has increased enormously.

Many information sources are generated and changed continuously on the Internet. This makes permanent indexing of Internet data difficult to maintain.

There are many text-based information retrieval systems (TIRs), such as Korea Telecom's InfoCop, Shimmani and Navor in Korea, and Yahoo, Altavista, Excite, Lycos and Goo (NTT in Japan) in other nations. However, these systems do not generally allow access to non-textual resources.

Figure 1 proposes an overall architecture for a multimedia information retrieval (MIR) system.

Figure 1: Multimedia information retrieval system architecture

The proposed system model defines a technique for multimedia data retrieval and management over the Internet by which multimedia data can be retrieved based on analysis of inherent features.

A Text Information Request (TIR) is based on a standardization of existing text-based information retrieval systems (IRS), using an extension of the techniques proposed for the XML Query Language (XQL) that can be applied to any text-based data source. An Image Information Request (IIR) is used for retrieving image-based data: its architecture is described in more detail in Figure 2.

An Audio Information Request (AIR) can be used for retrieving voice and other forms of audio data, while a Video Information Request (VIR) is required to retrieve video data.

A Multimedia Information Database (MIDB) is the database that contains data resources which have previously been analyzed to identify retrievable characteristics relevant to their types of content.

The service scenario is that the various queries generated by users are passed to an appropriate type of Request Server (RS) which processes them into a form suitable for searching the Multimedia Information Database (MIDB). The results of the search are delivered to users in the form of a compound report described as part of the search request.
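The service scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the two request server functions, the REQUEST_SERVERS table and the shape of the report are all hypothetical and are not defined by this contribution.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical MIDB entry: a resource plus its pre-analyzed characteristics."""
    record_id: str
    media_type: str  # "text", "image", "audio" or "video"
    characteristics: dict = field(default_factory=dict)


# Hypothetical in-memory stand-in for the MIDB.
MIDB = [
    Record("r1", "text", {"keywords": {"ben", "hur", "novel"}}),
    Record("r2", "text", {"keywords": {"chariot", "race"}}),
    Record("r3", "image", {"dominant_color": "orange"}),
]


def text_request_server(query, midb):
    """TIR sketch: keep text records whose keywords contain every query word."""
    words = set(query.lower().split())
    return [r for r in midb
            if r.media_type == "text"
            and words <= r.characteristics.get("keywords", set())]


def image_request_server(query, midb):
    """IIR sketch: keep image records whose dominant color equals the query."""
    color = query.lower()
    return [r for r in midb
            if r.media_type == "image"
            and r.characteristics.get("dominant_color") == color]


# Assumed routing table from request type to Request Server (RS).
REQUEST_SERVERS = {"text": text_request_server, "image": image_request_server}


def handle_request(request_type, query, midb=MIDB):
    """Pass the query to the matching RS and wrap the hits in a simple report."""
    server = REQUEST_SERVERS[request_type]
    hits = server(query, midb)
    return {"request_type": request_type,
            "query": query,
            "hits": [r.record_id for r in hits]}


print(handle_request("text", "Ben Hur"))
print(handle_request("image", "orange"))
```

An audio or video Request Server would be added to the routing table in the same way; the report here is a plain dictionary, whereas the proposal envisages a compound report whose form is described in the request itself.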

It is proposed that user requests may be entered in any suitable format, and that the request will be automatically processed to form a query relevant for the type of data being searched. For example, the user could enter some keywords, such as "Ben Hur", through the TIR. The TIR would automatically convert the request to a query format that, as well as finding all occurrences of the text string, would also find all spoken references to Ben Hur as well as all films with Ben Hur as part of the title, one of the roles played, or even as a member of the cast. As a result, users can retrieve all relevant types of information, such as text, image, voice and video, that are stored in the MIDB.
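One possible reading of this automatic conversion is that a single keyword request is expanded into one sub-query per searchable field of each media type. The sketch below assumes hypothetical field names (body, transcript, title, roles, cast) and a dictionary-based record layout; neither is specified by this contribution.

```python
def expand_request(keywords):
    """Expand one keyword request into per-media sub-queries (assumed fields)."""
    phrase = keywords.strip().lower()
    return [
        {"media_type": "text", "field": "body", "phrase": phrase},
        {"media_type": "audio", "field": "transcript", "phrase": phrase},
        {"media_type": "video", "field": "title", "phrase": phrase},
        {"media_type": "video", "field": "roles", "phrase": phrase},
        {"media_type": "video", "field": "cast", "phrase": phrase},
    ]


def run_expanded(keywords, midb):
    """Run every sub-query and return matching record ids without duplicates."""
    hits = []
    for sub in expand_request(keywords):
        for record in midb:
            if record["media_type"] != sub["media_type"]:
                continue
            value = record.get(sub["field"], "")
            if sub["phrase"] in value.lower() and record["id"] not in hits:
                hits.append(record["id"])
    return hits


# Hypothetical MIDB contents, invented purely for illustration.
SAMPLE_MIDB = [
    {"id": "t1", "media_type": "text", "body": "A review of Ben Hur."},
    {"id": "a1", "media_type": "audio", "transcript": "tonight we discuss ben hur"},
    {"id": "v1", "media_type": "video", "title": "Ben Hur", "roles": "", "cast": ""},
    {"id": "v2", "media_type": "video", "title": "Another film", "roles": "", "cast": ""},
]

print(run_expanded("Ben Hur", SAMPLE_MIDB))
```

Plain substring matching stands in for whatever matching the real Request Servers would apply; in particular, finding spoken references presupposes that audio has already been analyzed into a searchable form when it was loaded into the MIDB.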

3. The architecture of an Image Retrieval System

Most text-based Information Retrieval Systems (IRS) search for an exact or near match for a given keyword. As multimedia information can be composed of text, voice, video, etc., a different type of IRS is needed for more efficient and effective searching of multimedia data.

Until recently the main image-based IRS were based on the retrieval of keywords that had been associated with the image source. This technique has many limitations and problems because it normally requires manual intervention to insert keywords, and it is difficult to add or change keywords once they have been assigned to the source. To solve these problems, a new method, content-based information retrieval, has been developed which allows information to be efficiently searched in terms of specific features of images such as color, texture, shape and pattern. With this new retrieval technique, image data can be searched faster and more exactly by searching for specific sets of features.

The application area of this technique is very wide, including digital picture libraries, advertisements, movies, pictures, trademarks and videos.

The basic architecture of an Image Information Request (IIR) system, as shown in Figure 2, consists of a Feature Extraction Module (FEM), a Simultaneous Estimation Module (SEM) and a Retrieval Module (RM).

Figure 2: Image information retrieval system architecture

The FEM is used during the loading of data into the MIDB to automatically analyze various features of the image, such as color, texture and shape. The result of this analysis is represented in the MIDB as a set of standardized properties associated with the image.
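As an illustration of the kind of feature an FEM might store, the sketch below computes a coarse, normalized color histogram. The choice of a color histogram, the function name and the number of bins are assumptions made for this example; the contribution does not prescribe a particular feature set or encoding.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Return a normalized RGB histogram for a list of (r, g, b) tuples.

    Each channel value (0-255) is quantized into bins_per_channel bins, so
    the result has bins_per_channel ** 3 entries that sum to 1.0 for a
    non-empty image.
    """
    step = 256 // bins_per_channel
    counts = [0] * (bins_per_channel ** 3)

    def bin_of(value):
        # Clamp so that values near 255 stay in the last bin when
        # bins_per_channel does not divide 256 evenly.
        return min(value // step, bins_per_channel - 1)

    for r, g, b in pixels:
        index = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        counts[index] += 1

    total = len(pixels)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


# A tiny four-pixel "image": two red pixels, one blue, one green.
image = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)]
features = color_histogram(image)
print(features)
```

Because the histogram is normalized, images of different sizes yield comparable feature vectors, which is what allows them to be stored as properties and compared at search time.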

The Simultaneous Estimation Module (SEM) is used during data searching to calculate the similarity of a particular image to the requested set of feature values. Depending on how the features are defined, dissimilarity rather than similarity may be calculated.
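The contribution does not name a specific measure, so the sketch below uses two common choices as stand-ins: histogram intersection as a similarity and the L1 (city-block) distance as a dissimilarity. Both function names are hypothetical, and both assume feature vectors of equal length.

```python
def histogram_intersection(a, b):
    """Similarity: sum of element-wise minima (1.0 for identical normalized histograms)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(min(x, y) for x, y in zip(a, b))


def l1_distance(a, b):
    """Dissimilarity: sum of absolute differences (0.0 for identical vectors)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


requested = [0.5, 0.5, 0.0, 0.0]
candidate = [0.5, 0.0, 0.5, 0.0]

print(histogram_intersection(requested, candidate))
print(l1_distance(requested, candidate))
```

For histograms that each sum to 1, the two measures carry the same information, since the intersection equals 1 minus half the L1 distance. Which form is used is therefore a matter of how the features are defined.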

The Retrieval Module (RM) processes user queries by using interactions between the user and the system to determine which features should be searched for.

Many different query types may be relevant to the retrieval of multimedia data, such as query by example, query based on a user sketch and query by color combination. Search results may need to be displayed in order of similarity as calculated by the SEM. Repetitive querying, based on user feedback, may be needed to narrow the selection down to a relevant set of information resources. This is sometimes due to the fact that computers cannot always perfectly understand user queries. In other cases it may be caused by the fact that the query was too general in nature. (For example, "find orange" is not a good way to search for an image of an orange as it describes a property that applies to many images.)
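Ordering by similarity and repetitive querying can be sketched as follows. The three-element feature vectors, the image names and the refine step are all hypothetical; the refine step is a simple positive-feedback update, loosely in the spirit of Rocchio-style relevance feedback, and is not a technique specified by this contribution.

```python
def similarity(a, b):
    """Histogram intersection, used here as a stand-in for the SEM's measure."""
    return sum(min(x, y) for x, y in zip(a, b))


def rank(query, database):
    """Return (image_id, score) pairs ordered from most to least similar."""
    scored = [(image_id, similarity(query, features))
              for image_id, features in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def refine(query, relevant_features, weight=0.5):
    """Move the query toward the mean of the results the user marked relevant."""
    if not relevant_features:
        return list(query)
    n = len(relevant_features)
    mean = [sum(column) / n for column in zip(*relevant_features)]
    return [(1 - weight) * q + weight * m for q, m in zip(query, mean)]


# Hypothetical normalized feature vectors, invented for illustration.
DATABASE = {
    "sunset": [0.6, 0.3, 0.1],
    "orange_fruit": [0.7, 0.1, 0.2],
    "forest": [0.1, 0.1, 0.8],
}

query = [0.5, 0.25, 0.25]
first_pass = rank(query, DATABASE)

# The user marks "orange_fruit" as relevant and the query is repeated.
refined_query = refine(query, [DATABASE["orange_fruit"]])
second_pass = rank(refined_query, DATABASE)

print([image_id for image_id, _ in first_pass])
print([image_id for image_id, _ in second_pass])
```

In this toy data the first pass ranks "sunset" ahead of "orange_fruit", and one round of feedback reverses that order, which illustrates how repeated querying can narrow an overly general request.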

4. Conclusion

Information retrieval technology for multimedia data, including text, image, graphic, video and audio data, will be needed by a wide range of Internet applications and businesses. The multimedia information retrieval architecture proposed in this paper combines image processing and representation techniques with text-based document processing techniques such as those used to retrieve information from SGML, HTML and XML encoded documents.

In addition, multimedia information retrieval techniques will be used for searching domestic and foreign broadcasting stations, and by related businesses such as those involving web video, motion picture and image distribution. The economic effect of a unified architecture for the distribution and satisfaction of such multimedia information requests is likely to be enormous.