Video Scene Retrieval Based on Online Video Annotation

Dept. of Information Engineering, Nagoya University
Graduate School of Information Science, Nagoya University
Shigeki OHIRA
EcoTopia Science Institute, Nagoya University
Katashi NAGAO
Center for Information Media Studies, Nagoya University

1 Introduction

2 Tagging Video Scenes (Creation of Scene Tags)

Creating scene tags means, in other words, relating keywords to arbitrary time codes in a video. Scene tags contain nouns, verbs, and adjectives, but not particles or auxiliary verbs. Unknown words are treated as nouns.

We created scene tags using three methods for 27 videos registered in Synvie. The videos used are about 349 seconds long on average; the longest is 768 seconds and the shortest 76 seconds. We used various kinds of videos, e.g., education, story, and entertainment.

In the next chapter, we compare the usefulness of the tags created by each method.


One annotator added scene tags using a tool that lets a user attach tags to arbitrary times in a video. The annotator, who was not the creator of the videos and had no special knowledge about them, added objective information acquired from the images and sounds to the video scenes as scene tags, in detail and exhaustively. This method is a kind of conventional client-side video annotation.

We define the cost of creating scene tags as the time the annotator spent adding them. It was 1480 seconds on average; the longest time was 3692 seconds and the shortest 582 seconds.


Synvie is a video sharing system in which users can comment on video scenes and quote them in weblogs. We opened a public experimental service on July 1, 2006, and used data accumulated from July 1, 2006 to October 30, 2006. We gathered 97 registered users and 94 videos.

From the accumulated annotation data, we can acquire text data related to time data. Through some processing, we created scene tags automatically from the annotation data. The process for creating tags is shown below.

  1. Morphological analysis (using the Japanese morphological analyzer "CaboCha").
  2. Removing stop words.
  3. Extracting nouns, verbs, adjectives, and unknown words.
  4. Relating words to time data and saving them to the database.
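
The steps above can be sketched as follows. A real implementation would call a Japanese morphological analyser (the paper uses CaboCha); here a tiny stub lexicon stands in for it, so the stop-word list, lexicon, and example comment below are illustrative assumptions only.

```python
STOP_WORDS = {"movie", "video"}          # hypothetical stop-word list
KEEP_POS = {"noun", "verb", "adjective", "unknown"}

# Stub morphological analyser: returns (surface form, part of speech) pairs.
def analyse(text, lexicon):
    return [(w, lexicon.get(w, "unknown")) for w in text.split()]

def create_scene_tags(comments, lexicon):
    """comments: list of (time_code_seconds, text) from annotation data."""
    tag_db = []                          # stands in for the database table
    for time_code, text in comments:
        for word, pos in analyse(text, lexicon):     # 1. morphological analysis
            if word in STOP_WORDS:                   # 2. remove stop words
                continue
            if pos not in KEEP_POS:                  # 3. keep nouns/verbs/adjectives/unknowns
                continue
            tag_db.append({"tag": word, "time": time_code})  # 4. relate to time data
    return tag_db

lexicon = {"cat": "noun", "runs": "verb", "the": "particle", "cute": "adjective"}
tags = create_scene_tags([(42, "the cute cat runs")], lexicon)
```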

These processes can be performed automatically, and annotation data accumulates through natural communication among people on the web, so the cost of creating scene tags can be said to be extremely small.

With this method, 153 scene tags were created on average for the 27 videos.

2.3 Tag Selection System

We can easily predict that the annotation data may include useless data that has no relation to the video: in comments or weblogs, users do not necessarily refer to the contents of the video. Therefore, the scene tags created automatically from online video annotation in the previous section are highly likely to include useless tags. In fact, we found scene tags that were obviously unsuitable for their scenes at a glance. These include tags that are not meaningless in themselves but are unsuitable for the scene, and tags that lost their meaning through the morphological analysis process. For these reasons, the quality of tags created automatically from online video annotation will not be high, so we must select appropriate scene tags from them before using them in practical applications. If this selection succeeds, higher-quality tags will result.

Because it would be ideal for this selection to be done automatically, we tried several methods. First, we used the well-known TF-IDF (Term Frequency-Inverse Document Frequency) technique, but it needs a large number of documents to find appropriate words. Second, using the Google Web API, we tried to score words by their co-occurrence with the tags that had been added when the video was registered; with this method, however, words that appear in general documents scored higher than words strongly related to the scene. These results show that it is very difficult to select appropriate scene tags automatically, and that manual processing by humans is needed to make the selection successful.

We have therefore developed a tag selection system with which online users can select appropriate scene tags from the data created automatically from online video annotation. We can expect the quality of tags selected by humans to be high, but the more human cost we put into this process, the more the advantage of using online video annotation is lost, so mechanisms that let users select tags efficiently are needed to keep human cost down. The system is used by one or more users: users watch a video, and when a time code to which tags have been added arrives, the video pauses temporarily and the users select the tags appropriate to the scene.
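
As a rough sketch, the TF-IDF scoring we tried can be written as follows, treating each video's pooled annotation text as one "document"; the per-video word lists here are illustrative assumptions, not the experimental data.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """docs: {video_id: list of candidate tag words for that video}."""
    n = len(docs)
    df = Counter()                       # document frequency per word
    for words in docs.values():
        df.update(set(words))
    scores = {}
    for vid, words in docs.items():
        tf = Counter(words)              # term frequency within this video
        scores[vid] = {w: (tf[w] / len(words)) * math.log(n / df[w])
                       for w in tf}
    return scores

docs = {"v1": ["cat", "cat", "snow"], "v2": ["snow", "board"]}
s = tf_idf_scores(docs)
```

With only two "documents", a word appearing in both (here "snow") scores zero, which illustrates why this technique needs a large corpus to work well.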

We performed subject experiments on tag selection using this system. The number of subjects for each video was two or three.

We defined the cost of creating tags with this system as the time each user spent on selection, which was measured automatically. The average time was 314 seconds, about 1/5 of the time spent creating tags with the annotation tool in Section 2.1.

In this experiment, 55 scene tags were created on average for the 27 videos. This result shows that 36.2 percent of the tags created automatically from online video annotation were judged appropriate to their scenes. A comparison of the three methods by tag creation cost is shown in Table 1.


3 Video Scene Retrieval

We have developed a video scene retrieval system based on scene tags with a new concept, and performed subject experiments on video scene retrieval using the scene tags created in the previous chapter.

3.1 Tag-Based Video Scene Retrieval System

We have developed a video scene retrieval system that makes the most of the characteristics of scene tags. Scene tags created from online video annotation have an essential problem: their coverage of a video is usually small. Although the degree of this problem depends on the amount of annotation, it is very difficult to create tags for all scenes of a video, and time ranges to which no tags have been added cannot be returned as search results. On the other hand, tags have the strong point that a large number of them can be displayed in a small space in the browser, which helps efficient retrieval. We developed this system with these characteristics in mind. The process of retrieving video scenes is shown below.

  1. Select an arbitrary number of tags and submit them as a query.
  2. A list of videos corresponding to the query is returned. With each video, a timeline seek bar highlighting the time ranges where the query tags were added, and a list of all scene tags added to the video, are displayed.
  3. Select tags from the list to reflect them on the timeline seek bar.
  4. Move the seek bar to view thumbnail images for arbitrary time codes.
  5. Play the video from an arbitrary time code.
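
Steps 1 and 2 above can be sketched as a query over the flat tag records produced in Section 2.2; the record layout and example data are our assumptions for illustration.

```python
def search_scenes(tag_db, query_tags):
    """tag_db: list of {"video": id, "tag": word, "time": seconds} records.
    Returns (video, highlighted time codes) pairs, most matches first."""
    results = {}
    for rec in tag_db:
        if rec["tag"] in query_tags:
            results.setdefault(rec["video"], []).append(rec["time"])
    # Rank videos by how many query-tag occurrences they contain;
    # the time lists drive the highlights on each timeline seek bar.
    return sorted(((vid, sorted(times)) for vid, times in results.items()),
                  key=lambda item: len(item[1]), reverse=True)

tag_db = [
    {"video": "v1", "tag": "cat", "time": 12},
    {"video": "v1", "tag": "cat", "time": 40},
    {"video": "v2", "tag": "cat", "time": 5},
    {"video": "v2", "tag": "snow", "time": 9},
]
hits = search_scenes(tag_db, {"cat"})
```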


The top page of this video scene retrieval system is shown in Figure 2. Scene tags and the tags added when a video was registered are displayed. Tags are classified into nouns (including unknown words), verbs, and adjectives, sorted in alphabetical and a-i-u-e-o order. When a tag is clicked, its word is added to the search text field, so typing on the keyboard is not necessary. Users can also use incremental search for tags, i.e., a search that progressively finds matches for the search string as each character is typed. In this system, when a letter is typed, only the tags that start with that letter are displayed and the others become invisible. These functions help users find query tags among a large number of tags.
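
The incremental search behavior is essentially prefix filtering over the tag list; the example tags below are illustrative.

```python
def incremental_filter(tags, typed):
    """Keep only tags starting with the string typed so far;
    all other tags become invisible."""
    return [t for t in tags if t.startswith(typed)]

tags = ["snow", "snowboard", "scene", "cat"]
visible = incremental_filter(tags, "sn")   # after typing "s", then "n"
```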

As the output of a search, a list of videos with timeline seek bars is returned (Figure 3). The timeline seek bar helps users view video scenes in a web browser without accessing the video itself. Not only the seek bar but also thumbnail images corresponding to its time codes and all the scene tags are displayed. If a user moves the seek bar to an arbitrary time code, the thumbnail image changes to match the time code; this helps users view images at time codes to which no tags have been added. Because the time ranges where tags have been added are highlighted on the seek bar, users can easily see thumbnail images of the video scenes most likely to be what they want. Moreover, users can click tags in the lists to reflect them on the seek bar. These actions are repeated until the video scenes users want to see are found. An example image of video scene retrieval being performed is shown in Figure 4.

We are continuing development to make retrieval more efficient, and we opened a public experimental service on February 27, 2007.



3.2 Experiment on Video Scene Retrieval

We set nine scenes as retrieval targets. The questions were sentences such as "A scene in which a parent and child of a certain animal appear" and "The scene just before a snowboarder crashes into the edge of the course"; they did not necessarily include the words and phrases given as scene tags, but subjects could infer the scene using the scene tags as hints. Moreover, to make clear that the answer to each question was a single time range, a thumbnail image was given along with each sentence. Subjects retrieved the answer scene for each question, and the time spent on each answer was measured automatically. The number of subjects was nine. Subjects retrieved scenes using the tags created by the three methods of the previous chapter: each subject retrieved three scenes with the tags of each method, nine scenes in total, so the tag creation methods can be compared impartially. We prepared the experimental top page so as not to inform subjects which method had created the tags used for each retrieval.

The data we acquired in this experiment are listed below.

  • The scene decided on as the answer.
  • The time spent on retrieval.
  • The queries used for retrieval.
  • The scenes viewed.

Because all subjects were able to find the correct scene for every question in this experiment, we could not compare the tag creation methods by accuracy. We therefore compared them by the time spent on retrieval. The results are shown in Table 3. Because the number of queries submitted also increases with the average retrieval time, it can be thought that the differences in retrieval time were caused by the differences in tags rather than by other influences. Retrieval took longest when the tags extracted automatically from online video annotation were used, and shortest when the tags created with the annotation tool were used. This result shows that retrieval time was shortened by the two methods that put human cost into creating scene tags. Although automatic tag creation from online video annotation is the best method in terms of tag creation cost, the retrieval cost of this method is very large. Since shortening retrieval time is an indispensable problem for the great number of video contents expected in the future, a non-negligible amount of human cost appears necessary for creating scene tags.

Therefore, we compared the tags created using the annotation tool with those created using the tag selection system according to their cost performance (cost-effectiveness), that is, the ratio of retrieval cost reduction to creation cost. The cost performance C of tags was calculated by the following expression.
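
The expression itself does not appear in this version of the text. A plausible reconstruction from the surrounding description, where T_auto is the average retrieval time using the automatically created tags, T is the average retrieval time using the method being evaluated, and S is the tag creation cost in seconds (these symbol names are our assumptions), is:

  C = ((T_auto - T) / S) × 100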

From this expression, we can calculate how much retrieval time is shortened by spending 100 seconds on creating tags. The result is shown in Table 4.

From this result, comparing the methods from the viewpoint of tag cost performance, it can be said that creating tags with the tag selection system was the best method in this experiment.

4 Conclusion and Next Challenge

4.1 Conclusion

4.2 Next Challenge