Digital Content Annotation and Transcoding

Katashi NAGAO
Graduate School of Information Science, Nagoya University

1 Introduction: Survival in Information Deluge

I think it was inevitable that digital content spread all over the world so rapidly, given the information distribution technologies developed in the twentieth century. I am also confident that various kinds of technologies to utilize digital content will emerge and make steady progress in this century. Our challenge is to develop technologies to create and distribute digital content easily. We need to provide a focal point for the development of technologies to utilize digital content more intelligently and for multiple purposes.

My definition of ``content technology'' covers technologies to create, store, deliver, manipulate, transform, and reuse digital content. I mainly focus on manipulation, transformation, and reuse of digital content.

Some typical instances of manipulation and transformation are personalization and adaptation. Personalization is the technique of manipulating or transforming digital content such as Web pages to conform to user preferences. Adaptation is a kind of transformation that adjusts the size or color of content according to the constraints and features of user devices such as mobile phones or PDAs (Personal Digital Assistants).

I define ``transcoding'' as the combination of personalization and adaptation of digital content. Today, people access the Internet not only through personal computers but also via mobile phones, car navigation systems, and many other devices. In such cases, transcoding plays a major role in information access through the Net. For example, Web pages created for browsing on PCs are not suitable for browsing on mobile phones--automatic resizing of images and automatic summarization of text are required. Transcoding also considers user preferences such as native language and disabilities. Transcoding has advantages for connections over networks of narrow bandwidth and for access by people who have various disabilities. Transcoding also facilitates the reuse of digital content. Inherently, content itself is declarative and therefore versatile. A document may tell the story of someone's life, but sometimes we would like to see that information as a biographical dictionary entry. Transcoding can make this possible. A single content item can thus have several aspects, and changing its presentation makes it usable in different situations.

My group has been developing tools to annotate content with additional information. Such information contributes to the inference of content meanings. Transcoding handles such meanings or attributes of content and bridges content and a user's intentions by transforming the content into the user's preferred form.

I propose a concrete plan and method for the creation and management of digital content, which, based on our recent research results, will become a most important resource on the Net. This plan aims to evolve online information into ``knowledge'' and to develop a machine that augments human intelligence. Many researchers have tried to realize such a machine but failed. With our machine, we have a means to obtain a long-awaited human-powered ``intelligence amplifier,'' thanks to the rapid growth of information technology and of infrastructure like the Internet. I am confident that we can have a common global knowledge system based on annotated digital content if we are persistent in carrying out the plan. Before explaining the plan in detail, I will describe some of the problems of present online digital content.

1.1 Problems of Online Content

Currently, online content such as Web pages written in HTML (HyperText Markup Language) has the following problems.

  1. There have been many efforts on the visual aspects of document presentation using HTML tags and stylesheets. While these tags define parts of the syntactic structure of documents, their meanings or semantics are not defined in the current architecture.

    Tim Berners-Lee, Director of the World Wide Web Consortium, has proposed his vision of a new architecture for the Web, called the Semantic Web, that can specify the semantics of Web content. He explained that in the context of the Semantic Web, the word ``semantic'' means ``machine processable.'' He explicitly ruled out the sense of natural language semantics. In his vision, the semantics of Web content convey what a machine can do with that content. They will also enable a machine to figure out how to convert it. His vision is good, but since the semantics of documents necessarily involve the semantics of natural language or human thought, we have to consider the deeper semantics of content.

  2. The structure of current hypertext is very simple and static, so that many people can easily create it. However, due to the dynamic nature of the Web, hyperlinks are not always valid and are very hard to keep consistent. Furthermore, only the author or other authorized persons can update the description of the hyperlinks. In order to make hypertext more flexible, the authoring of content and the linking between content objects should be managed individually.

  3. In the case of conventional books, there are skilled people, usually editors or publishers, who evaluate the content of the books, negotiate changes in writing or subject matter with the authors, and mediate between the authors and the readers. By contrast, the authoring of Web content seldom has such a system.

Of course, the World Wide Web has provided us with a flexible and open platform for document and multimedia publication and distribution. However, it is very hard to automatically transform current Web content into more convenient forms. Here, I discuss an additional mechanism for the Web that makes current Web content more intelligent and accessible. Another important keyword here is ``annotation.''

1.2 Extension of Digital Content

Traditionally, Web content has been created using only ``human-friendly'' information. Of course, this makes sense--after all, humans are the intended users of the information. However, with the continuing explosion of Web content--especially multimedia content--I would like to argue that content should be created in a more ``machine-friendly'' format--one which allows machines to better ``understand'' and process documents. My group proposes a system to annotate documents externally with additional information in order to make them easier for computers to process.

Why would we want to make content more ``machine-friendly?'' Annotated documents are much easier to personalize than the non-annotated variety. For example, a computer can process textual information annotated with parts of speech and word senses much better than plain text, and can therefore produce a nice, grammatically correct, personalized summary formatted for a cellular phone, or translate a document from English to Japanese. Normally, when dealing with non-annotated text, transcoding to many different formats requires a lot of task-specific effort for each format. With document annotation, however, content providers put in some extra work early on but receive the benefit of being able to transcode easily to an endless variety of formats and personal tastes, thus reaching a much wider audience with less overall effort.

Annotation isn't new, and there have already been several attempts to add it to the Web. One of these was the controversial (and now defunct) ThirdVoice, allowing post-it note style adornment (some might say defacement) of pages. Our annotation is different from others because we mean annotation as helpful information for machines to automatically process the semantics of content. My group has been developing an easy and simple method for constructing a superstructure on the Web, based on external annotations to Web documents. The word ``external'' indicates that the annotations are not embedded in the content itself but linked with the content. Annotated documents are easier for computers to understand and process, allowing personalized content to be created with much less effort and greater quality. This permits content providers to reach a much wider audience with minimal overhead.

Here, I specify three categories of annotation. The first is linguistic annotation, which helps the transcoder understand the semantic structure of textual elements. The second is commentary annotation, which helps the transcoder manipulate both textual and non-textual elements such as images and sounds. Commentary annotations are also effective for evaluating target content, as in book reviews. The third category is multimedia annotation, which is a combination of the first two types.

My group has also developed a system for semi-automatic and interactive Web document annotation, allowing users to annotate any element of any Web document with additional information. We have also developed a proxy server that transcodes requested content using information from the annotations assigned to it. All types of annotation are described using XML (Extensible Markup Language), a standard for interoperable data exchange. The correspondence between annotations and elements of content is defined using URLs (Uniform Resource Locators) and XPointer (XML Pointer Language).
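To make this concrete, here is a minimal sketch of what such an external annotation might look like and how a transcoder could read it. The element names, URL, and XPointer expression are invented for illustration; they are not the system's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical external annotation: it lives apart from the target
# document and points into it via a URL plus an XPointer expression.
ANNOTATION_XML = """
<annotation type="commentary">
  <target url="http://example.org/page.html"
          xpointer="/html/body/p[2]"/>
  <comment>Photograph of the university campus at dusk.</comment>
</annotation>
"""

def parse_annotation(xml_text):
    """Extract the annotation type, target pointer, and comment text."""
    root = ET.fromstring(xml_text)
    target = root.find("target")
    return {
        "type": root.get("type"),
        "url": target.get("url"),
        "xpointer": target.get("xpointer"),
        "comment": root.findtext("comment").strip(),
    }

print(parse_annotation(ANNOTATION_XML)["xpointer"])  # /html/body/p[2]
```

Because the annotation is stored separately and only points into the content, the original document needs no modification at all.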

The entire process is called ``semantic transcoding'' because it provides a means for easily transcoding annotated documents using information about their deep semantic content. The current semantic transcoding process mainly handles text and video summarization, language translation, and speech synthesis of documents containing text and images.

Annotation is also useful for knowledge discovery from content. Using this idea, we are also developing a system which discovers knowledge from Web documents and generates a document that includes the discovered knowledge and summaries of multiple documents related to the same topic.

To better visualize these ideas, consider the following: the conventional Web structure can be thought of as a graph on a plane. We are proposing a method for extending such a planar graph to a three-dimensional structure consisting of multiple planar layers. Such a metalevel structure is based on external annotations on digital content on the Web. Figure 1 represents the concept of our approach.

Figure 1: Super-Structure on the Web

As shown in the figure, our idea of a Web super-structure consists of layers of content and metacontent. The bottom layer corresponds to the set of raw content. The second layer corresponds to a set of metacontent about the content of the first layer. We generally consider such metacontent as external annotations.

A popular example of external annotation is comments or notes on Web content created by people other than the author. This kind of annotation is useful for readers evaluating the content. For example, images without alternative descriptions are not understandable by visually impaired people. If there are comments on these images, these people can understand the image contents by listening to them via text-to-speech transcoding.

Another example of annotation is flexible external hyperlinks that connect content with conventional knowledge sources such as online dictionaries. External links can be defined outside of the set of link-connected content. Such external links have been discussed by the XML community in the context of XLink (XML Linking Language).

There are a large number of documents of great diversity on the Web, some of which are difficult to understand due to the viewer's lack of background knowledge. In particular, if technical terms or jargon are contained in a document, viewers who are unfamiliar with them might not understand their correct meanings.

When we encounter unknown words in a document, for example scientific terms or proper nouns, we usually look them up in dictionaries or ask experts or friends for their meanings. However, if there are many unfamiliar words in a document, or there are no experts around, looking the words up can be very time consuming. To facilitate this effort, we need (1) machine-understandable online dictionaries, (2) automated consultation of these dictionaries, and (3) effective methods to show the lookup results.

There are applications that consult online dictionaries when the user clicks on a certain word on a Web page and then show the lookup results in a popup window. In this case, the application accesses its internal or online dictionaries, and the consultation process is triggered by the viewer's mouse click. Popup windows are the display method. Other related applications operate in more or less the same way. There are three big problems with this conventional method:

  1. Due to the difficulty of word sense disambiguation, in the case of polysemic words (words with a diversity of meanings), applications to date show all possible word sense candidates, which forces the viewer to choose the correct meaning.

  2. The popup window showing the lookup results hides the area near the clicked word, so the user tends to lose the context and has to reread the original document.

  3. Since the document and the dictionary lookup results are shown in different layers (e.g., windows), other natural language processing techniques such as summarization, translation, and voice synthesis cannot easily be applied to the results.

To cope with these problems, my group proposes a systematic method to annotate words in a document with word senses, in such a way that anyone (e.g., the author) can easily add word sense information to a certain word using a user-friendly annotation tool. This operation can be considered the creation of a hyperlink between a word in the document and a node in a domain-specific ontology.
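The idea can be sketched as a link from a word occurrence to an ontology node. The ontology URI, the XPointer form, and the resolver below are all invented for illustration:

```python
# Hypothetical word-sense annotation: the word "bank" at a given
# location is linked to a node in an invented finance ontology,
# fixing its sense once and for all for downstream transcoders.
word_sense_annotation = {
    "target": {
        "url": "http://example.org/finance.html",
        "xpointer": "string-range(/html/body/p[1], 'bank')",
    },
    "sense": "http://example.org/ontology/finance#FinancialInstitution",
}

def resolve_sense(annotation):
    """Return the name of the ontology node the word is linked to."""
    return annotation["sense"].rsplit("#", 1)[-1]

print(resolve_sense(word_sense_annotation))  # FinancialInstitution
```

Since the sense is chosen at annotation time, a dictionary transcoder never has to disambiguate the word for the viewer.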

Our proposed system basically works as illustrated in Figure 2. The proxy server in the middle deals with user interactions, content and metacontent (annotation) retrieval, and the consultation and integration of resources. The proxy server also transcodes requested content and is therefore called the transcoding proxy. The details of the transcoding proxy are described in the next section.

Figure 2: Basic Configuration of the Proposed System

2 Semantic Annotation and Transcoding: Towards Semantically Sharable Digital Content

Here, I present an annotation-based approach to an emerging kind of digital content that is semantically sharable between humans and machines. In order to create such content, we have to realize ``information grounding''--grounding information on the human life world--which is an essential process for machines to share the meaning of information with humans. It is discussed below in more detail.

Recently, a project aimed at this type of grounding on the Web has been promoted by the W3C: the Semantic Web. One of the basic milestones on the road to the Semantic Web is the association of well-defined descriptions with content. The descriptions allow Web developers to extract and process properties of given content, even if the medium of the content does not directly provide the necessary means to do so.

The Semantic Web is designed not as a separate Web but as an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will lead to significant new functionality as machines become much better able to process and ``understand'' the data that they merely display at present.

The Semantic Web mainly advocates an ontology-based approach to a machine-understandable Web that requires formal descriptions of the concepts contained in Web content. This approach is top-down, because such formal descriptions should cover more than one content item and must be universal.

Another approach, the annotation-based approach, is the main topic of this section. Since an annotation concerns one particular content item and can be modified incrementally by meta-annotations, the annotation-based approach is bottom-up. In this section, annotations have a definite role: to provide a basis for machines to autonomously infer the semantics of the target content. Annotated content is easier for computers to understand and process, allowing personalized content to be created with much less effort and greater quality. This permits content providers to reach a much wider audience with minimal overhead.

The ontology-based and annotation-based approaches are not at all contradictory but rather complementary to each other for information grounding. Descriptions should be closely aligned with the original content (the annotation-based approach) if the original content is available, and the formal semantics that the Semantic Web is pursuing should be taken into more serious consideration (the ontology-based approach) if there is no original content. Of course, further study is necessary on how to integrate the two approaches.

I discuss more about the semantics of content and the ontology in the Semantic Web in the following section.

2.1 Semantics and Grounding

For the Semantic Web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning. Artificial intelligence researchers studied such systems long before the Web was developed. Knowledge representation, as this technology is called, is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world. It contains the seeds of important applications, but to realize its full potential it must be linked into a single global system.

Semantic Web researchers, in contrast, accept that inconsistencies and unanswerable questions are a price that must be paid to achieve versatility. They are making the language for the rules as expressive as needed to allow the Web to reason as widely as desired. This philosophy is similar to that of the conventional Web: early in the Web's development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right. But the expressive power of the system made vast amounts of information available, and search engines (which would have seemed quite impractical a decade ago) now produce remarkably complete indices of much of the material out there. The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data, and that allows rules from any existing knowledge representation system to be exported onto the Web.

2.1.1 Ontology

In the Semantic Web, the meaning of information is expressed in RDF (Resource Description Framework), which encodes it in sets of triples, each triple being rather like the subject, verb, and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things (people, Web resources, or whatever) have properties (such as ``is a kind of'' or ``is a part of'') with certain values (another person, another Web resource). This structure turns out to be a natural way to describe the vast majority of the data processed by machines. Subject and object are each identified by a URI (Uniform Resource Identifier), just as in a link on a Web document. The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web.
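The triple model can be sketched in a few lines. All URIs below are invented for illustration; real RDF stores index triples far more efficiently than a linear scan:

```python
# Minimal sketch of RDF's triple model: each statement is a
# (subject, predicate, object) tuple, and URIs name every role,
# so anyone can coin a new "verb" just by minting a URI for it.
triples = [
    ("http://example.org/NagoyaUniv",
     "http://example.org/terms#isAKindOf",
     "http://example.org/terms#University"),
    ("http://example.org/GradSchoolIS",
     "http://example.org/terms#isAPartOf",
     "http://example.org/NagoyaUniv"),
]

def objects_of(subject, predicate, store):
    """Return all objects asserted for a subject-predicate pair."""
    return [o for s, p, o in store if s == subject and p == predicate]

print(objects_of("http://example.org/GradSchoolIS",
                 "http://example.org/terms#isAPartOf", triples))
```

The uniform shape of the statements is the point: one query function serves any vocabulary anyone defines.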

Of course, this is not the end of the story, because two databases on the Web may use different identifiers for what is in fact the same concept. A program that wants to compare or combine information across the two databases has to know that the two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters.

A solution to this problem is provided by the third basic component of the Semantic Web: collections of information called ``ontologies.'' In philosophy, an ontology is a theory about the nature of existence, of what types of things exist; ontology as a discipline studies such theories. Artificial intelligence researchers have adopted the term to mean a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.

The taxonomy defines classes of objects and relations among them. For example, an address may be defined as a type of location, and city codes may be defined to apply only to locations, and so on. Classes, subclasses, and relations among entities are a very powerful tool for Web use. We can express a large number of relations among entities by assigning properties to classes and allowing subclasses to inherit those properties. If city codes must be of type city, and cities generally have Web sites, we can discuss the Web site associated with a city code even if no database links a city code directly to a Web site.
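The property-inheritance idea can be sketched with a made-up two-level taxonomy; the class and property names are invented for this example:

```python
# Toy taxonomy: "address" is a subclass of "location", and properties
# assigned to a class are inherited by all of its subclasses.
taxonomy = {"address": "location", "location": None}  # subclass -> superclass
class_properties = {"location": {"accepts_city_code": True}}

def inherited_properties(cls):
    """Collect properties along the chain from the root class down to cls."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = taxonomy[cls]
    props = {}
    for c in reversed(chain):  # superclass first, so a subclass may override
        props.update(class_properties.get(c, {}))
    return props

print(inherited_properties("address"))  # {'accepts_city_code': True}
```

Nothing was stated about addresses directly; the property flows down from `location`, which is exactly the economy the taxonomy buys.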

Inference rules in ontologies supply further power. An ontology may express the rule ``If a city code is associated with a country code, and an address uses that city code, then that address has the associated country code.'' A program could then readily deduce, for instance, that a Nagoya University address, being in Nagoya, must be in Japan, and therefore should be formatted according to Japanese address standards. The computer doesn't truly ``understand'' any of this information, but it can now manipulate the terms much more effectively in ways that are useful and meaningful to the human user.
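The city-code rule reduces to a table lookup in a toy setting; the code values below are invented, not real dialing or postal codes:

```python
# Invented association table: city code -> country code.
city_to_country = {"052": "JP"}

def country_of(address):
    """Apply the rule: an address inherits the country code
    associated with its city code."""
    return city_to_country.get(address["city_code"])

nagoya_address = {"street": "Furo-cho, Chikusa-ku", "city_code": "052"}
print(country_of(nagoya_address))  # JP
```

The program never "knows" what Japan is, yet it can now route the address to a Japanese formatter, which is all the user needs.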

With ontology data on the Web, solutions to terminology (and other) problems begin to emerge. The meaning of terms or XML data used in a Web document can be defined by pointers from the document to an ontology. Of course, the same problems as before arise if I point to an ontology that defines addresses as containing a zip code and you point to one that uses postal codes. This kind of confusion can be resolved if ontologies provide equivalence relations: one or both of our ontologies may contain the information that my zip code is equivalent to your postal code.
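An equivalence relation of this kind might be sketched as a symmetric lookup; the ontology and property names are invented:

```python
# Invented equivalence assertion between two hypothetical ontologies:
# ont1's "zip_code" means the same thing as ont2's "postal_code".
equivalences = {("ont1", "zip_code"): ("ont2", "postal_code")}

def same_property(a, b):
    """True if two (ontology, property) pairs denote one concept,
    either directly or via a declared equivalence."""
    return a == b or equivalences.get(a) == b or equivalences.get(b) == a

print(same_property(("ont1", "zip_code"), ("ont2", "postal_code")))  # True
print(same_property(("ont1", "zip_code"), ("ont2", "street")))       # False
```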

The problem of two databases pointing to different definitions of address is partially solved in this way. A program using distinct URIs for different concepts of address will not confuse them, and will in fact need to discover that the concepts are related at all. The program could then use a service that takes a list of postal addresses (defined in the first ontology) and converts it into a list of physical addresses (the second ontology) by recognizing and removing post office boxes and other unsuitable addresses. The structure and semantics provided by ontologies make it easier for an entrepreneur to provide such a service and can make its use completely transparent.

Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches--the search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced applications will use ontologies to relate the information on a Web page to the associated knowledge structures and inference rules.

2.1.2 Information Grounding

In order to make information technology truly useful for humans, we essentially need the ``information grounding'' of digital data on the human life world, which allows machines to share meaning with humans.

The Symbol Grounding Problem was originally posited as a problem about how digital information could have inherent meaning in the real world, rather than meaning externally assigned through interpretation by a human theoretician or programmer.

The original argument on grounding tended to regard the real world as the physical world, but in the present discussion the real world is the human life world, which encompasses not just physical aspects but also conceptual, social, and other aspects.

Then we can obtain ``intelligent content'': digital content structured for the sake of information grounding in this sense, i.e., structured so that humans and machines can share the meaning of the original content. Here the original content may be natural language documents, visual or auditory content, or any other data in any modality. Information grounding should allow retrieval, translation, summarization, presentation, etc. of such content, with very high quality and accuracy. Initiatives to promote intelligent content include MPEG-7, GDA (Global Document Annotation), the Semantic Web, and so forth.

For intelligent content, there are two major approaches to information grounding, which are complementary to each other.

  1. The ``ontology-based'' approach, mainly advocated by the Semantic Web, aims at grounding in the classical sense that machines can autonomously carry out sound inferences reflecting the real world, based on formal ontologies associated with the formal descriptions that are part of the information content. Information grounding is guaranteed here by the validity of the ontology prescribed by human developers.

  2. GDA and MPEG-7, on the other hand, put more emphasis on the ``annotation-based'' approach to information grounding. This approach aims at grounding through fine-grained association between the semantic description and the original content. In GDA, for instance, XML tags are embedded in text data, so that the formal description in terms of the tags is closely aligned with the raw text data.

Similarly, MPEG-7 can align XML-based descriptions with the original multimodal data, though the descriptions are external to the original data.

Information grounding is maintained here by the interaction between machines and humans, rather than by autonomous inferences by machines, as illustrated in Figure 3.

Figure 3: Content-Description Association and Information Grounding

The lines in the figure represent various types of interactions, where the double-lined ones encode more minute and meaningful interactions. That is, humans can readily understand and thus meaningfully interact with the original content. Machines can understand and manipulate semantic descriptions more readily than the original content. Meaningful interactions between humans and machines are hence supported by associating the semantic descriptions with the original content.

The association realizes information grounding in this sense, and a number of merits follow.

  1. First of all, the association is obviously useful and necessary for summarization, translation, retrieval, and so forth. Summarization or translation of the original content using the formal description is possible owing to the association: the description is first summarized or translated, and the resulting summary or translation is mapped back to the original content to generate the desired result.

  2. Another major merit of the association is that we can dispense with formal ontology to a substantial extent. Namely, automatic inferences using formal ontologies may be replaced by interactions with humans, because humans understand the meaning of the original content, and this understanding is communicated to machines via the association. For instance, inferences involved in information retrieval can be carried out by interactions between the human user and the machine, using a linguistic thesaurus rather than a full-fledged ontology. Information retrieval potentially involves very intricate inferences to bridge the gaps between the query and the database. For several decades to come, it will be impossible to fully automate such inferences, even with a very good formal ontology. The only feasible means of making such inferences is some form of human-machine interaction, which may not require any formal ontology.

  3. A third important merit is that the association between the original content and the description facilitates the authoring of the description, because it allows humans to produce the description while closely referring to the original content, rather than totally from scratch. Also, a fine-grained association makes it easy to keep track of which parts of the original content have already been described and which have not. Furthermore, the association allows humans and machines to collaborate effectively in the authoring process. For instance, a human annotator could first describe part of the content, and the machine could add descriptions to the other parts.
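The retrieval idea in point 2, a linguistic thesaurus standing in for a formal ontology, can be sketched as simple query expansion. The thesaurus entries and documents are invented:

```python
# Invented thesaurus: a query term maps to a set of near-synonyms.
thesaurus = {"car": {"automobile", "vehicle"}}

documents = {
    1: "a used automobile for sale",
    2: "gardening tools and seeds",
}

def retrieve(query):
    """Match documents against the query term or any of its synonyms."""
    terms = {query} | thesaurus.get(query, set())
    return [doc_id for doc_id, text in documents.items()
            if any(t in text for t in terms)]

print(retrieve("car"))  # matches document 1 via the synonym "automobile"
```

In an interactive system, the user would confirm or reject the expanded matches, supplying exactly the judgment a formal ontology would otherwise have to encode.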

Below, an annotation-based approach to intelligent content is presented. This approach is bottom-up, and our annotation system allows ordinary people to annotate their Web documents with additional, useful information using user-friendly tools.

2.2 Semantic Annotation

Since technologies for the automatic analysis of content are not perfect, human support is needed for machines to understand the semantics of content. Semantic annotations created through human-machine collaboration are helpful hints that assist machine understanding. Such annotations do not just increase the expressive power of content but also play an important role in content reuse. An example of content reuse is the transcoding of content depending on user preferences.

Content adaptation is a type of transcoding that considers a user's environment--devices, network bandwidth, profiles, etc. In addition, such adaptation sometimes requires a good understanding of the original content. If the transcoder fails to analyze the semantic structure of the content, the transcoding results may be inaccurate and may cause user misunderstanding.

Our technology assumes that semantic annotations help machines to understand content so that transcoding can achieve higher quality. I call such transcoding based on semantic annotation ``semantic transcoding.'' The overall configuration of the semantic transcoding system is shown in Figure 4.

Figure 4: Overall Configuration of Semantic Transcoding System

Our group has developed a simple method for associating semantic annotations with any element or segment of any Web document or multimedia data. The word ``semantic'' indicates that the annotations help machines to understand the semantics of the original content.

We use URIs (actually URLs), XPointers (location identifiers within a document), and document hash codes (digest values) based on the MD5 Message-Digest Algorithm to identify particular elements in HTML or XML documents. We have also developed an annotation server that maintains the relationship between content and annotations and transfers requested annotations to a transcoding proxy.
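A minimal sketch of the hash-code idea, using the standard-library MD5 implementation (the sample markup is invented):

```python
import hashlib

# An MD5 digest registered at annotation time lets the server detect
# later that the annotated content no longer matches its annotations.
def digest(document_text):
    """MD5 digest (hex) of a document's text."""
    return hashlib.md5(document_text.encode("utf-8")).hexdigest()

registered = digest("<p>Original paragraph.</p>")  # stored with the URI
current = digest("<p>Edited paragraph.</p>")       # recomputed on request

print(registered != current)  # True: the annotation may be stale
```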

2.2.1 Annotation Environment

Our annotation environment consists of a client-side editor for the creation of annotations and a server for the management of annotations.

The annotation environment is shown in Figure 5.

Figure 5: Annotation Environment

  1. The user runs the annotation editor and requests a URI as a target of annotation.

  2. The annotation server (hereafter, the server) accepts the request and sends it to the Web server.

  3. The server receives the Web document.

  4. The server calculates the document hash code (digest value) and registers the URI with the code in its database.

  5. The server returns the Web document to the editor.

  6. The user annotates the requested document and sends the result to the server, along with some encrypted personal data (name, professional areas, etc.).

  7. The server receives the annotation data and relates it to its URI in the database.

  8. The server also checks the digital signature associated with the annotation data, decrypts the personal data, and updates the annotator profiles.

These steps show the process flow for XML or HTML documents. Below, I explain the annotation editor and the server in more detail.

2.2.2 Annotation Editor

Our annotation editor, implemented as a Java application, can communicate with the annotation server explained below.

The annotation editor has the following functions:

  1. Registering targets of annotation with the annotation server by sending URIs

  2. Specifying elements in the document using a Web browser

  3. Generating and sending annotation data to the annotation server

  4. Reusing previously-created annotations when the target content is updated

  5. Associating an annotator's digital signature with his/her created annotation data

An example screen of our annotation editor is shown in Figure 6.

Figure 6: Screen of the Annotation Editor

The top left window of the editor shows the document object structure of the HTML document. The right window shows text that was selected in the Web browser; the selected area is automatically assigned an XPointer. The bottom left window shows the linguistic structure of the sentence in the selected area, as described later.

2.2.3 Annotation Server

Our annotation server receives annotation data from an annotator and classifies it according to the annotator's name. The server retrieves documents from the URIs in the annotation data and registers the document hash codes with their URIs in its annotation database. Since a document's author may modify the document after the initial retrieval, the hash code of a document's internal structure (i.e., its Document Object Model) enables the server to discover modified elements in the annotated document.

The annotation server makes a table of annotator names, URIs, XPointers, and document hash codes. When the server accepts a URI as a request from a transcoding proxy, the server returns a list of XPointers with their associated annotation data, their types (linguistic or commentary), and a hash code. If the server receives an annotator's name as a request, it responds with the set of annotations created by the specified annotator.
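The server's lookup table can be pictured roughly as follows; the records and field names are illustrative, not the actual database schema:

```python
# Hypothetical in-memory version of the annotation server's table.
# Each record: annotator name, document URI, XPointer, document hash,
# annotation type, and the annotation data itself.
records = [
    {"annotator": "alice", "uri": "http://example.org/a.html",
     "xpointer": "/html/body/p[1]", "hash": "0f343b...",
     "type": "linguistic", "data": "<su>...</su>"},
    {"annotator": "bob", "uri": "http://example.org/a.html",
     "xpointer": "/html/body/img[1]", "hash": "0f343b...",
     "type": "commentary", "data": "A photo of the campus."},
]

def lookup_by_uri(uri):
    """Request by URI: return XPointers with annotation data and types."""
    return [(r["xpointer"], r["type"], r["data"], r["hash"])
            for r in records if r["uri"] == uri]

def lookup_by_annotator(name):
    """Request by annotator name: return that annotator's annotations."""
    return [r for r in records if r["annotator"] == name]
```

Both access paths described above (by URI for the transcoding proxy, and by annotator name) reduce to simple scans over this table.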

In the case of multimedia annotation, XPointers and hash codes cannot be calculated for the target content. Time stamps and data sizes are used instead of XPointers and hash codes, respectively.

Our group is currently developing a mechanism for access control between annotation servers and ordinary content servers. If the authors of original documents do not want to allow anyone to annotate their documents, they can add a statement to that effect in the documents, and annotation servers will not retrieve such content for the annotation editors.

2.2.4 Linguistic Annotation

Linguistic annotation has been used to make digital documents machine-understandable, and to develop content-based presentation, retrieval, question-answering, summarization, and translation systems with much higher quality than is currently available.

We have employed the GDA tag set as a basic framework to describe linguistic and semantic features of documents.

GDA is a challenging project to make digital documents machine-understandable on the basis of a new tag set, and to develop content-based presentation, retrieval, question-answering, summarization, and translation systems with much higher quality than before. GDA thus proposes an integrated global platform for digital content authoring, annotation, presentation, and reuse.

The GDA tag set is based on XML, and designed to be as compatible as possible with TEI (Text Encoding Initiative), CES (Corpus Encoding Standard), and EAGLES (Expert Advisory Group on Language Engineering Standards). These projects aim at developing standards that help libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic resources for online research and teaching, using encoding schemes that are maximally expressive and minimally obsolescent. The GDA tag set specifies an encoding of modifier-modifiee relations, anaphor-referent relations, word senses, etc.

An example of a GDA-tagged sentence follows:

<su><np opr="agt" sem="time0">Time</np>
<v sem="fly1">flies</v>
<adp opr="eg"><ad sem="like0">like</ad>
<np>an <n sem="arrow0">arrow</n></np></adp></su>

The <su> element is a sentential unit. The other tags above, <n>, <np>, <v>, <ad>, and <adp>, mean noun, noun phrase, verb, adnoun or adverb (including preposition and postposition), and adnominal or adverbial phrase, respectively.

The opr attribute encodes the relationship in which the current element stands with respect to the element that it semantically depends on. Its value denotes a binary relation, which may be a thematic role such as agent (actor), patient, recipient, etc., or a rhetorical relation such as cause, concession, etc. For instance, in the above sentence, the element containing ``Time'' depends on the second element, containing ``flies.'' opr="agt" means that ``Time'' has the agent role with respect to the event denoted by ``flies.'' The sem attribute encodes a word sense.
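Because GDA is XML-based, the opr and sem attributes can be read off with any standard XML parser. The following sketch parses a well-formed version of the sentence above using Python's standard library:

```python
import xml.etree.ElementTree as ET

# A well-formed version of the GDA-tagged example sentence.
gda = ('<su><np opr="agt" sem="time0">Time</np>'
       '<v sem="fly1">flies</v>'
       '<adp opr="eg"><ad sem="like0">like</ad>'
       '<np>an <n sem="arrow0">arrow</n></np></adp></su>')

root = ET.fromstring(gda)

# Collect each element's tag with its thematic role (opr) and
# word sense (sem); missing attributes come back as None.
annotations = [(el.tag, el.get("opr"), el.get("sem")) for el in root.iter()]
```

A machine translation or retrieval system can consume such (tag, role, sense) triples directly, which is exactly what makes the annotated document easier to process than plain HTML.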

The GDA initiative aims at having many Web authors annotate their online documents with this common tag set so that machines can automatically recognize the underlying semantic and pragmatic structures of those documents much more easily than by analyzing traditional HTML documents. A huge amount of annotated data is expected to emerge, which should serve not just as tagged linguistic corpora but also as a worldwide, self-extending knowledge base, mainly consisting of examples showing how our knowledge is manifested.

The GDA project has three main steps:

  1. Propose an XML tag set which allows machines to automatically infer the underlying structure of documents.

  2. Promote the development and spread of natural language processing and artificial intelligence applications that turn tagged texts into versatile and intelligent content.

  3. Thereby motivate the authors of Web documents to annotate their documents using the proposed tags.

The tags proposed in Step 1 will also encode coreferences, rhetorical structure, the social relationship between the author and the audience, etc., in order to render the document machine-understandable.

Step 2 concerns AI applications such as machine translation, information retrieval, information filtering, data mining, consultation, expert systems, and so on. If annotation with the tags mentioned above can be assumed, it is certainly possible to drastically improve the accuracy of such applications. New types of applications for communication aids may be invented as well.

Step 3 encourages Web authors to present themselves to the widest and best possible audience through organized tagging. Web authors will be motivated to annotate their Web documents because documents annotated according to a common standard can be translated, retrieved, etc., with higher accuracy, and thus have a greater chance of reaching more targeted readers. Thus, tagging will make documents stand out much more effectively than decorating them with pictures and sounds.

2.2.5 Linguistic Annotation Editor

Using our linguistic annotation editor, the user annotates text with linguistic structure (grammatical and semantic structure, described later) and adds comments to elements in the document. The editor is capable of basic natural language processing and interactive disambiguation. The user can correct the results of the automatically analyzed sentence structure, as shown in Figure 7.

Figure 7: Annotation Editor with the Linguistic Structure Editor

In computational linguistics, word sense disambiguation has been one of the biggest issues. For example, to obtain a better translation of a document, disambiguation of certain polysemic words is essential. Even if an estimation of the word sense is achieved to some extent, incorrect interpretation of certain words can lead to irreparable misunderstanding.

To avoid this problem, we have been promoting annotation of word senses for polysemic words in documents, for example using WordNet, so that their word senses can be machine-understandable.

For this purpose, we need a dictionary of concepts, for which we use existing domain ontologies. An ontology is a set ofdescriptions of concepts--such as things, events, and relations--thatare specified in some way (such as specific natural language) in order tocreate an agreed-upon vocabulary for exchanging information.

Annotating a word sense is therefore equivalent to creating a link between a word in the document and a concept in a certain domain ontology. We have built a word sense annotation tool for this purpose, which has been integrated with the annotation editor.

As mentioned, using the editor, the user annotates text with linguistic structure (syntactic and semantic structure) and adds comments to elements in the document. The editor is also capable of word sense annotation, as shown in Figure 8. The ontology viewer appears in the right middle of the figure. The user can easily select a concept in the domain ontology and assign a concept ID to a word in the document as its word sense.

Figure 8: Annotation Editor with the Ontology Viewer

2.2.6 Commentary Annotation

Commentary annotation is mainly used to annotate non-textual elements like images and sounds with some additional information. Each comment can include not only tagged text but also other images and links. Currently, this type of annotation appears in a subwindow that is overlaid on the original document window when the user positions the mouse pointer over a comment-added element, as shown in Figure 9.

Figure 9: Comment Overlay on the Document

Users can also annotate text elements with information such as paraphrases, correctly-spelled words, and underlines. This type of annotation is used for text transcoding that combines such comments with the original texts.

Commentary annotation on hyperlinks is also available. This provides a quick preview of target documents before the links are clicked. If there are linguistic annotations on the target documents, the transcoders can generate summaries of those documents and relate them with the hyperlinks in the source document.

Previously, some research has been published concerning sharing comments over the Web. Annotea is a general meta-information architecture for annotating documents on the Web. This architecture includes a basic client-server protocol, a general meta-information description language (i.e., the Resource Description Framework), a server system, and an authoring tool and browser with interface augmentations that provide access to its extended functionality. Annotea provides a general mechanism for shared annotations, which enables people to annotate arbitrary documents in-place at any position and to share comments/pointers with other people.

These systems are often limited to particular documents or to documents shared only among a few people. Our annotation and transcoding system can also handle multiple comments on any element of any document on the Web. In addition, a community-wide access control mechanism can be added to our transcoding proxy: if a user is not a member of a particular group, then the user cannot access a transcoding proxy that is for group use only. In the future, transcoding proxies and annotation servers will communicate over a secure protocol that prevents other servers or proxies from accessing the annotation data.

Our main focus is adaptation of online content to users; sharing comments in a community is one of our additional features. We have been applying both commentary and linguistic annotations to semantic transcoding.

Commentary annotations are also used for knowledge sharing and reuse in communities of domain experts.

2.2.7 Multimedia Annotation

Multimedia content such as digital video is becoming a prevalent information source. Since the volume of such content is growing to huge numbers of hours, summarization is required to effectively browse video segments in a short time without missing significant content. Annotating multimedia content with semantic information such as scene/segment structures and metadata about visual/auditory objects is necessary for advanced multimedia content services. Since natural language text such as a voice transcript is highly manageable, speech and natural language processing techniques play an essential role in our multimedia annotation.

Our group has developed techniques for semi-automatic video annotation integrating a multilingual voice transcription method, some video analysis methods, and an interactive visual/auditory annotation method. The video analysis methods include automatic color change detection, characterization of frames, and scene recognition using similarity between frame attributes.

There are related approaches to video annotation. For example, MPEG-7 can describe indices, notes, and so on, to speed the retrieval of necessary parts of content. However, adding these descriptions manually is costly, so methods for extracting them automatically through video/audio analysis are vitally important. Our method can be integrated into tools for authoring MPEG-7 data. The linguistic description scheme, which will be part of the amendment to MPEG-7, should play a major role in this integration.

Using such annotation data, we have also developed a system for advanced multimedia processing such as video summarization and translation. Our video summary is not just a shorter version of the original video clip, but an interactive multimedia presentation that shows keyframes of important scenes and their transcripts in Web documents and allows users to interactively modify the summary. The video summarization is customizable according to the user's preferred length and keywords. When a user's client device is not capable of video playing, our system transforms the video into a document in the same HTML format as a Web document.

Multimedia annotation can make delivery of multimedia content to different devices very effective. Dissemination of multimedia content will be facilitated by annotation on the usage of the content for different purposes, client devices, and so forth. It also provides object-level description of multimedia content, which allows a higher granularity of retrieval and presentation in which individual regions, segments, objects, and events in image, audio, and video data can be differentially accessed depending on publisher and user preferences, network bandwidth, and client capabilities.

Multimedia annotation is an extension of document annotation such as GDA. Since natural language text is more tractable and meaningful than the binary data of visual (image and moving picture) and auditory (sound and voice) content, we associate text with multimedia content in several ways. Since most video clips contain spoken narrations, our system converts them into text and integrates the text into the video annotation data. The text is sometimes acquired from closed captions on television programs. The text in the multimedia annotation is linguistically annotated based on GDA.

2.2.8 Multimedia Annotation Editor

Our group has developed an authoring tool called the Multimedia Annotation Editor capable of video scene change detection, multilingual voice transcription, syntactic and semantic analysis of transcripts, and correlation of visual/auditory segments and text.

An example screen of the editor is shown in Figure 10. The editor screen consists of three windows. One window (top) shows the video content, automatically detected keyframes in the video, and an automatically generated voice transcript. The second window (bottom left) enables the user to edit the transcript and modify the automatically analyzed linguistic markup structure. The third window (bottom right) graphically shows the linguistic structure of the sentence selected in the second window.

Figure 10: Multimedia Annotation Editor

The editor is capable of basic natural language processing and interactive disambiguation. The user can modify the results of the automatically analyzed multimedia and linguistic (syntactic and semantic) structures.

2.3 Semantic Transcoding

Semantic transcoding is a transcoding technique based on annotation, used for content adaptation according to user preferences. The transcoders here are implemented as an extension to an HTTP proxy server. Such an HTTP proxy is called a transcoding proxy.

Figure 11 shows the environment of semantic transcoding.

Figure 11: Transcoding Environment

The information flow in transcoding is as follows:

  1. The transcoding proxy receives a request URI with a client ID.

  2. The proxy sends the request of the URI to the Web server.

  3. The proxy receives the document and calculates its hash code.

  4. The proxy also asks the annotation server for annotation data related to the URI.

  5. If the server finds the annotation data of the URI in its database, it returns the data to the proxy.

  6. The proxy accepts the data and compares the document hash code with that of the already retrieved document.

  7. The proxy also searches for the user preferences associated with the client ID. If there is no preference data, the proxy uses a default setting until the user gives a preference.

  8. If the hash codes match, the proxy attempts to transcode the document based on the annotation data by activating the appropriate transcoders.

  9. The proxy returns the transcoded document to the client Web browser.
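The flow above can be condensed into a sketch of the proxy's decision logic; fetch, annotations, preferences, and transcode here are hypothetical stand-ins for the real components:

```python
import hashlib

def handle_request(uri, client_id, fetch, annotations, preferences, transcode):
    """Sketch of the transcoding proxy's decision flow. The callables
    and dictionaries are illustrative stand-ins, not the WBI APIs."""
    document = fetch(uri)                                     # steps 2-3
    doc_hash = hashlib.md5(document.encode("utf-8")).hexdigest()
    annotation = annotations.get(uri)                         # steps 4-5
    prefs = preferences.get(client_id, {"summarize": False})  # step 7: default
    if annotation and annotation["hash"] == doc_hash:         # steps 6 and 8
        return transcode(document, annotation, prefs)
    return document                                           # pass through unchanged
```

The hash comparison in the middle is what protects the client from transcoding based on stale annotations: if the author has changed the document since annotation, the original is returned untouched.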

The transcoding proxy and the various kinds of transcoding are explained in more detail below.

2.3.1 Transcoding Proxy

We employed IBM's WBI (Web Intermediaries) as a development platform to implement the semantic transcoding system.

WBI is a customizable and extendable HTTP proxy server. WBI provides APIs (Application Programming Interfaces) for user-level access control and easy manipulation of the proxy's input/output data.

The transcoding proxy based on WBI has the following functionality:

  1. Management of personal preferences

  2. Gathering and management of annotation data

  3. Activation of transcoders and integration of their outputs

For the management of personal preferences, we use a Web browser's cookie to identify the user. The cookie holds a user ID assigned by the transcoding proxy on the first access, and the ID is used to identify the user and to select previously defined user preferences. The ID stored as a cookie value allows the user, for example, to change access points using DHCP (Dynamic Host Configuration Protocol) while keeping the same preference setting. There is one technical problem: generally, cookies can be accessed only by the HTTP servers that set their values, and ordinary proxies do not use cookies for user identification. Instead, conventional proxies identify the client by hostname and IP address. Thus, when the user accesses our proxy and sets/updates the preferences, the proxy server acts as an HTTP server to access the browser's cookie data and associates the user ID (cookie value) with the hostname/IP address. When the transcoding proxy works as a conventional proxy, it receives the client's hostname and IP address, retrieves the user ID, and then obtains the preference data. If the user changes access point and hostname/IP address, our proxy performs as a server again and reassociates the user ID with the new client identifiers.
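The association between cookie-stored user IDs and hostname/IP pairs can be sketched as a small registry; the class and method names below are illustrative only:

```python
class ClientRegistry:
    """Hypothetical sketch of the proxy's user identification scheme:
    a cookie-stored user ID is reassociated with whatever hostname/IP
    the client currently uses, so preferences survive address changes."""

    def __init__(self):
        self.by_address = {}   # (hostname, ip) -> user ID
        self.preferences = {}  # user ID -> preference dict

    def associate(self, user_id, hostname, ip):
        # Called when the proxy acts as an HTTP server and reads the cookie.
        self.by_address[(hostname, ip)] = user_id

    def preferences_for(self, hostname, ip):
        # Called when the proxy acts as a conventional proxy and only
        # sees the client's hostname and IP address.
        user_id = self.by_address.get((hostname, ip))
        return self.preferences.get(user_id, {})
```

Re-running associate after an address change is all that is needed to keep the same preference set following the user across access points.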

The transcoding proxy communicates with annotation servers that hold the annotation database. The second step of semantic transcoding is to collect annotations distributed among several servers.

The transcoding proxy creates a multi-server annotation catalog by crawling distributed annotation servers and gathering their annotation indices. The annotation catalog consists of server names (e.g., hostname and IP address) and their annotation indices (sets of annotator names and identifiers of the original documents and their annotation data). The proxy uses the catalog to decide which annotation server should be accessed for annotation data when it receives a user's request.

The final stage of semantic transcoding is to transcode the requested content depending on user preferences and then return it to the user's browser. This stage involves activation of the appropriate transcoders and integration of their results.

As mentioned previously, there are several types of transcoding. In this section we describe four types: text, image, voice, and video transcoding.

2.3.2 Text Transcoding

Text transcoding is the transformation of text content based on linguistic annotations. We implemented several types of text transcoding, such as text summarization, language translation, and dictionary-based text paraphrasing.

As an example of a basic application of linguistic annotation, we have developed an automatic text summarization system as part of the text transcoder. Summarization generally requires deep semantic processing and a lot of background knowledge. However, most previous work relies on superficial clues and heuristics about specific styles or configurations of documents.

Our text summarization method employs a spreading activation technique to calculate the importance values of elements in the text. Since the method does not employ any heuristics dependent on the domain and style of documents, it is applicable to any linguistically-annotated document. The method can also trim sentences in the summary because importance scores are assigned to elements smaller than sentences.

A linguistically-annotated document naturally defines an intra-document network in which nodes correspond to elements and links represent the syntactic/semantic relations. This network consists of sentence trees (syntactic head-daughter hierarchies of subsentential elements such as words or phrases), coreference/anaphora links, document/subdivision/paragraph nodes, and rhetorical relation links.

Figure 12 shows a graphical representation of the intra-document network.

Figure 12: Intra-Document Network

A spreading activation method is used to evaluate the degree of importance of each node in the intra-document network. The method is an iterative process that calculates the activation value of each node from the activation values of its adjacent nodes connected by normal and/or reference links.

Once the activation values of all nodes are obtained, an ordering of importance is decided such that a node with a higher activation value is more important.
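A toy version of such an iteration, with an illustrative decay constant and update rule rather than the exact ones used in our system, might look like this:

```python
def spread_activation(nodes, links, interest, rounds=20, decay=0.8):
    """Toy spreading activation: each node's activation is its own
    interest value plus a decayed average of its neighbors' activation,
    iterated until roughly stable. The constants are illustrative."""
    activation = {n: interest.get(n, 0.0) for n in nodes}
    neighbors = {n: [] for n in nodes}
    for a, b in links:  # undirected syntactic/semantic relation links
        neighbors[a].append(b)
        neighbors[b].append(a)
    for _ in range(rounds):
        activation = {
            n: interest.get(n, 0.0)
               + decay * sum(activation[m] for m in neighbors[n])
                       / max(len(neighbors[n]), 1)
            for n in nodes
        }
    return activation

# A node closer to the user's word of interest ends up scored higher.
scores = spread_activation(
    ["s1", "s2", "s3"], [("s1", "s2"), ("s2", "s3")], {"s1": 1.0})
```

Activation leaks along the links, so elements that are strongly connected to interesting elements inherit part of their importance, which is the intuition behind the scoring.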

The summarization algorithm works as follows:

  1. Spreading activation is performed in such a way that two elements (nodes) have the same activation value if they are coreferent or one of them is the syntactic head of the other.

  2. The unmarked element with the highest activation value is marked for inclusion in the summary.

  3. When an element is marked, related elements are recursively marked as well, until no more elements are found.

  4. All marked elements in the intra-document network are generated, preserving the order of their positions in the original document.

  5. If the size of the summary reaches the user-specified value, then terminate; otherwise, go back to Step 2.
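Ignoring the recursive marking of dependent elements (Step 3), the selection loop can be sketched as follows; the element records are hypothetical:

```python
def summarize(elements, activation, max_size):
    """Greedy sketch of Steps 2-5: repeatedly mark the most activated
    unmarked element until the summary reaches the requested size, then
    emit the marked elements in original document order. The recursive
    marking of dependent elements (Step 3) is omitted for brevity."""
    marked = set()
    size = 0
    for el in sorted(elements, key=lambda e: activation[e["id"]], reverse=True):
        if size >= max_size:
            break
        marked.add(el["id"])
        size += len(el["text"])
    # Step 4: preserve original document order when emitting.
    return " ".join(el["text"] for el in elements if el["id"] in marked)

elements = [
    {"id": 1, "text": "Semantic transcoding adapts content."},
    {"id": 2, "text": "The weather was fine."},
    {"id": 3, "text": "Annotations drive the transcoders."},
]
summary = summarize(elements, {1: 0.9, 2: 0.1, 3: 0.7}, max_size=60)
```

Because selection is driven purely by activation values, raising max_size simply admits more of the lower-ranked elements, which is how the user-controlled summary size works.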

The size of the summary can be changed by simple user interaction, so the user can see the summary at a preferred size using an ordinary Web browser without any additional software. The user can also input any words of interest. The corresponding words in the document are assigned numeric values that reflect degrees of interest; these values are used during spreading activation to calculate importance scores.

Figure 13 shows the summarization result in a normal Web browser. This is the summarized version of the document shown earlier.

Figure 13: Summarized Documents

The second type of text transcoding is language translation. We can predict that translation based on linguistic annotations will produce a much better result than many existing systems, because the major difficulties of present machine translation come from syntactic and word sense ambiguities in natural languages, which can be easily clarified by annotation. An example of the result of English-to-Japanese translation of the previous document is shown in Figure 14.

Figure 14: Translated Documents

2.3.3 Image Transcoding

Image transcoding converts images into versions of different size, color (full color or grayscale), and resolution (e.g., compression ratio) depending on the user's device and communication capability. Links to these converted images are made from the original images. Therefore, users will notice that the images they are looking at are not the originals if there are links to similar images.
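The parameter choice for such a conversion can be sketched as below; the scaling rule and grayscale threshold are illustrative assumptions, and applying the parameters would be left to an imaging library:

```python
def image_transcode_params(width, height, screen_w, screen_h, color_depth):
    """Sketch of how an image transcoder might pick target parameters
    from device constraints: scale to fit the client screen while
    preserving aspect ratio, and fall back to grayscale on devices
    with a shallow color depth. Only the parameter choice is shown."""
    scale = min(screen_w / width, screen_h / height, 1.0)  # never upscale
    target = (int(width * scale), int(height * scale))
    grayscale = color_depth <= 8  # illustrative threshold, not the system's
    return {"size": target, "grayscale": grayscale}

# A 640x480 image for a 160x120 monochrome device is scaled to a quarter.
params = image_transcode_params(640, 480, 160, 120, color_depth=2)
```

Taking the minimum of the two scale factors keeps the aspect ratio intact, so the transcoded image fits the screen in both dimensions.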

Figure 15 shows a document summarized to one-third of its original size with its images reduced by half. In this figure, the preference setting subwindow is shown on the right-hand side. The window appears when the user double-clicks the icon in the lower right corner (the transcoding proxy automatically inserts the icon). Using this window, the user can easily modify the parameters for transcoding.

Figure 15: Image Transcoding and Preference Setting Window

By combining image and text transcoding, the system can, for example, convert content to just fit the client's screen size.

2.3.4 Voice Transcoding

Voice synthesis also works better if the content has linguistic annotation; SSML (Speech Synthesis Markup Language), for example, has been discussed for this purpose. A typical example is the processing of proper nouns and technical terms. Word-level annotations on proper nouns allow the transcoders to recognize not only their meanings but also their readings.

Voice transcoding generates a spoken-language version of documents. There are two types of voice transcoding. In the first, the transcoder synthesizes sound data in an audio format such as MP3 (MPEG-1 Audio Layer 3); this is useful for devices without voice synthesis capability, such as cellular phones and PDAs. In the second, the transcoder converts documents into a style more appropriate for voice synthesis; this requires that a voice synthesis program be installed on the client side. Of course, the client-side synthesizer uses the result of voice transcoding. Therefore, the mechanism of document conversion is common to both types of voice transcoding.

Documents annotated for voice include some text in commentary annotations for non-textual elements and some word information in linguistic annotations for the readings of proper nouns and words not in the dictionary. The document also contains phrase and sentence boundary information so that pauses appear in appropriate positions.

Figure 16 shows an example of a voice-transcoded document in which an audio player is embedded in the top area of the document. When the user clicks the play button, the embedded MP3 player software is invoked and starts playing the synthesized voice data.

Figure 16: Voice Transcoding

2.3.5 Multimedia Transcoding

Based on multimedia annotation, we developed a module for multimedia transcoding, especially video summarization and translation. One of the main functions of the system is to generate an interactive HTML document from multimedia content with annotation data for interactive multimedia presentation, consisting of an embedded video player, hyperlinked keyframe images, and linguistically-annotated transcripts. Our summarization and translation techniques are applied to the generated document, called a multimodal document.

There is some previous work on multimedia summarization, such as Informedia and CueVideo. These systems create a video summary based on automatically extracted features in the video, such as scene changes, speech, text and human faces in frames, and closed captions. They can process video data without annotations. However, the accuracy of their summarization is currently not good enough for practical use because of failures of automatic video analysis. Our approach to multimedia summarization attains sufficient quality for use if the data has enough semantic information. As mentioned earlier, we have developed a tool to help annotators create multimedia annotation data. Since our annotation data is declarative, hence task-independent and versatile, the annotations are worth creating if the multimedia content will be frequently used in different applications such as automatic editing and information extraction.

Video transformation is the initial process of multimedia summarization and translation. The transformation module retrieves the annotation data accumulated in an annotation repository (an XML database) and extracts the information necessary to generate a multimodal document. The multimodal document consists of an embedded video window, keyframes of scenes, and transcripts aligned with the scenes, as shown in Figure 17. The resulting document can be summarized and translated by the modules explained later.

Figure 17: Multimodal Document

This operation also benefits people whose devices lack video playing capability. In this case, the system creates a simplified version of the multimodal document containing only keyframe images of important scenes and summarized transcripts related to the selected scenes.

The proposed video summarization is performed as a by-product of text summarization. The text summarization is an application of linguistic annotation: the method is cohesion-based and employs spreading activation to calculate the importance values of words and phrases in the document.

Thus, video summarization works by summarizing a transcript from the multimedia annotation data and extracting the video scenes related to the summary. Since a summarized transcript contains important words and phrases, the corresponding video sequences produce a collection of significant scenes from the video. The summarization results in a revised version of the multimodal document that contains keyframe images and summarized transcripts of the selected important scenes. Keyframes of less important scenes are shown in a smaller size.
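The scene extraction step can be pictured as a simple filter over annotated scene records; the data layout below is hypothetical:

```python
def select_scenes(scenes, summary_text):
    """Sketch of video summarization as a by-product of text
    summarization: keep the scenes whose transcript survived in the
    text summary. The scene records with start/end times are assumed
    to come from the multimedia annotation data."""
    return [s for s in scenes if s["transcript"] in summary_text]

scenes = [
    {"start": 0.0, "end": 12.5, "transcript": "Welcome to the program."},
    {"start": 12.5, "end": 40.0, "transcript": "Today we discuss annotation."},
    {"start": 40.0, "end": 55.0, "transcript": "See you next week."},
]
kept = select_scenes(scenes, "Today we discuss annotation.")
```

Because each transcript segment carries its scene's time range, selecting the surviving transcript segments directly yields the list of video intervals to present.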

An example screen of a summarized multimodal document is shown in Figure 18.

Figure 18: Summarized Multimodal Document

The vertical time bar in the middle of the screen of the multimodal document represents scene segments, whose color indicates whether each segment is included in the summary. The keyframe images are linked with their corresponding scenes so that the user can see a scene by just clicking its related image. The user can also access information about objects such as people in a keyframe by dragging a rectangular region enclosing them. This information appears in external windows. In the case of auditory objects, the user can select them by clicking any point in the time bar.

One type of video translation is achieved through the following procedure. First, transcripts in the annotation data are translated into the user's chosen language using the text transcoder, and then the results are shown as subtitles synchronized with the video. The other type of translation is performed by synchronizing video playback with speech synthesis of the translation results; this makes an other-language version of the original video clip. If comments, notes, or keywords are included in the annotation data on visual/auditory objects, they are also translated and shown in a popup window.

In the case of bilingual broadcasting, since our annotation system generates transcripts for every audio channel, multimodal documents can be generated from both channels. The user can easily select a favorite multimodal document created from one of the channels. We have also developed a mechanism to choose the playback language depending on the user profile that describes the user's native language.

3 Concluding Remarks

I have discussed a full architecture for creating and utilizing annotation, including an authoring tool to create linguistic, commentary, and multimedia annotations, and a mechanism to apply such data to semantic transcoding, which automatically customizes Web content depending on user preferences, with functions such as text and multimedia summarization and translation.

Linguistic processing is an essential task in these applications, so natural language technologies are very important for intelligent content. The linguistic annotation tool compensates for the limitations of automatic natural language processing by allowing disambiguation of syntactic, semantic, and pragmatic structures of text.

The main component of the multimedia annotation tool is a multilingual voice transcriptor that generates transcripts from multilingual speech in video clips. The tool also extracts scene and object information semi-automatically, describes the data in XML format, and associates the data with the content.

I also presented some advanced applications for multimedia content based on annotation. Our group has implemented video-to-document transformation that generates interactive multimodal documents, video summarization using a text summarization technique, and video translation.

This technology also contributes to commentary information sharing, as in Annotea, and device-dependent transformation for any device. One of our future goals is to make Web content intelligent enough to answer questions asked in natural language. I imagine that in the near future we will not use search engines but will instead use knowledge discovery engines that give us a personalized summary of multiple documents instead of hyperlinks. The work in this document is one step toward a better solution for dealing with the coming information deluge.

While our current prototype system runs locally, I am also planning to evaluate the semantic transcoding system in open experiments jointly with Nagoya University and the Cyber Assist Research Center in Japan. In addition, I will distribute our annotation editor, with its natural language and multimedia processing capabilities.