
Tuesday, June 4, 2013

Image processing technique for Video extraction

ABSTRACT

"A picture is worth a thousand words", the message we are getting from an image. Visual information has been playing an important role in our everyday life.
The significant challenge in large multimedia databases is the provision of efficient means for semantics indexing and retrieval of visual information. The video has low resolution and the often has poor contrast with a changing background. Problems in segmenting text from video are similar to those faced detection and localization phases. The main motivation for extracting the content of information is the accessibility problem. A problem that is even more relevant for dynamic multimedia data, which also have to be searched and retrieved. While content extraction techniques are reasonably developed for text, video data still is essentially opaque. Its richness and complexity suggests that there is a long way to go in extracting video features, and the implementation of more suitable and effective processing procedures is an important goal to be achieved.
This report gives a brief introduction to video and image processing, common image processing techniques, the basics of video processing, current research on video indexing and retrieval, basic requirements, image processing techniques for video content extraction, and some applications such as videoCEL and the COBRA model.
The video text extraction problem is divided into three main tasks: detection, localization, and segmentation. The present development of multimedia technology and information highways has put content processing of visual media at the core of key application domains: digital and interactive video, large distributed digital libraries, and multimedia publishing.

1. Introduction

1.1 Basis of Video and Image Processing:

This chapter introduces the basis of video and image processing. In a computer, an image or video is stored only as a set of pixels with RGB values; the computer knows nothing about the meaning of these pixel values. The content of an image is quite clear to a person, but it is not so easy for a computer. For example, it is a piece of cake for you to recognize yourself in an image or video, even in a crowd, but this is extremely difficult for a computer. Preprocessing helps the computer understand the content of an image or video. What is this so-called content? Here, content means features of the image or video or of their objects, such as color, texture, resolution, and motion. An object can be viewed as a meaningful component in an image or video picture; a moving car, a flying bird, and a person are all objects. There are many techniques for image and video processing. This chapter starts with an introduction to general image processing techniques and then discusses video processing techniques. The reason for introducing image processing first is that image processing techniques can be applied to video if each picture of a video is treated as a still image.

2. Background

 A few years ago, the problems of representation and retrieval of visual media were confined to specialized image databases (geographical, medical, pilot experiments in computerized slide libraries), in the professional applications of the audiovisual industries (production, broadcasting and archives), and in computerized training or education. The present development of multimedia technology and information highways has put content processing of visual media at the core of key application domains: digital and interactive  video, large distributed digital libraries, multimedia publishing. Though the most important investments have been targeted at the information infrastructure (networks, servers, coding and compression, delivery models, multimedia systems architecture), a growing number of researchers have realized that content processing will be a key asset in putting together successful applications. The need for content processing techniques has been made evident from a variety of angles, ranging from achieving better quality in compression, allowing user choice of programs in video-on-demand, achieving better productivity in video production, providing access to large still image databases or integrating still images and video in multimedia publishing and cooperative work.

3. Common Image Processing techniques
3.1 Dithering:
Dithering is a process of using a pattern of solid dots to simulate shades of gray. Different shapes and patterns of dots have been employed in this process, but the effect is the same. When viewed from a great enough distance that the dots are not discernible, the pattern appears as a solid shade of gray.
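As a minimal sketch of this idea, the following Python/numpy code performs ordered dithering with a 4x4 Bayer threshold matrix; the matrix values and the synthetic gradient used for the demonstration are illustrative choices, not part of the original text.

    import numpy as np

    # 4x4 Bayer threshold matrix, normalized to [0, 1)
    BAYER_4 = np.array([[ 0,  8,  2, 10],
                        [12,  4, 14,  6],
                        [ 3, 11,  1,  9],
                        [15,  7, 13,  5]]) / 16.0

    def ordered_dither(gray):
        """Convert a grayscale image (uint8, 0-255) into a black/white dot pattern.

        Each pixel is compared against a tiled threshold matrix; viewed from a
        distance, the dot density approximates the original shade of gray.
        """
        h, w = gray.shape
        reps = (h // 4 + 1, w // 4 + 1)
        thresholds = np.tile(BAYER_4, reps)[:h, :w]
        return (gray / 255.0 > thresholds).astype(np.uint8) * 255

    if __name__ == "__main__":
        gradient = np.tile(np.linspace(0, 255, 256, dtype=np.uint8), (64, 1))
        halftone = ordered_dither(gradient)
        print(halftone.shape, np.unique(halftone))   # only 0 and 255 remain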

3.2 Erosion
Erosion is the process of eliminating all the boundary points from an object, leaving the object smaller in area by one pixel all around its perimeter. If it narrows to less than three pixels thick at any point, it will become disconnected (into two objects) at that point. It is useful for removing from a segmented image objects that are too small to be of interest.
Shrinking is a special kind of erosion in which single-pixel objects are left intact. This is useful when the total object count must be preserved.
Thinning is another special kind of erosion. It is implemented as a two-step process: the first step marks all candidate pixels for removal; the second step actually removes those candidates that can be removed without destroying object connectivity.

 3.3 Dilation:

Dilation is the process of incorporating into the object all the background pixels that touch it, leaving it larger in area by that amount. If two objects are separated by less than three pixels at any point, they will become connected (merged into one object) at that point. It is useful for filling small holes in segmented objects.
Thickening is a special kind of dilation. It is implemented as a two-step process: the first step marks all candidate pixels for addition; the second step adds those candidates that can be added without merging objects.

3.4 Opening:

The process of erosion followed by dilation is called opening. It has the effect of eliminating small and thin objects, breaking objects at thin points, and generally smoothing the boundaries of larger objects without significantly changing their area.

3.5 Closing:

The process of dilation followed by erosion is called closing. It has the effect of filling small and thin holes in objects, connecting nearby objects, and generally smoothing the boundaries of objects without significantly changing their area.
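A short sketch of these morphological operations using OpenCV is shown below; the test image, the 3x3 structuring element, and the function names used here are assumptions for illustration, not taken from the report.

    import numpy as np
    import cv2

    # Binary test image: a blob with a small 2x2 hole and an isolated speck
    img = np.zeros((100, 100), np.uint8)
    cv2.rectangle(img, (20, 20), (70, 70), 255, -1)   # main object
    img[41:43, 41:43] = 0                             # small hole inside it
    img[5, 5] = 255                                   # single-pixel noise object

    kernel = np.ones((3, 3), np.uint8)                # 3x3 structuring element

    eroded  = cv2.erode(img, kernel)                  # peel one pixel off boundaries
    dilated = cv2.dilate(img, kernel)                 # grow boundaries by one pixel
    opened  = cv2.morphologyEx(img, cv2.MORPH_OPEN,  kernel)   # erosion then dilation
    closed  = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)   # dilation then erosion

    print("speck removed by opening:", opened[5, 5] == 0)
    print("hole filled by closing:",   closed[41, 41] == 255)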

3.6 Filtering:

Image filtering can be used for noise reduction, image sharpening, and image smoothing. By applying a low-pass or high-pass filter to the image, the image can be smoothed or sharpened respectively. A low-pass filter reduces the amplitude of high-frequency components. A simple low-pass filter applies local averaging: the gray level at each pixel is replaced with the average of the gray levels in a square or rectangular neighborhood. The Gaussian low-pass filter attenuates high frequencies smoothly and can be applied via the Fourier transform of the image. A high-pass filter increases the relative amplitude of high-frequency components and is useful for detecting edges and fine detail.

3.7 Image Segmentation:

Image segmentation breaks an image into regions, where a region is a set of pixels that are all adjacent or touching. Within each region, there are some common features among the pixels, such as color, intensity, or texture. When a human observer views a scene, his or her visual system automatically segments the scene; the process is so fast and efficient that one sees not a complex scene but rather a collection of objects. A computer, however, must laboriously isolate the objects in an image by breaking the image into sets of pixels, each of which is the image of one object.
Image segmentation can be approached in three ways. The first is the region approach, in which each pixel is assigned to a particular object or region. In the boundary approach, only the boundaries that exist between regions are located. The third is the edge approach, in which edge pixels are identified and then linked together to form the required boundaries.
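A minimal sketch of the region approach is given below, combining amplitude thresholding with connected-component labeling in OpenCV; the use of Otsu's method to pick the threshold and the synthetic test frame are assumptions made for the example.

    import numpy as np
    import cv2

    def segment_regions(gray):
        """Region-approach sketch: threshold, then group adjacent pixels into
        connected regions so each foreground pixel is assigned to one region."""
        # Amplitude thresholding (Otsu picks the threshold automatically)
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Each connected set of foreground pixels becomes one labeled region
        n_labels, labels = cv2.connectedComponents(binary)
        return n_labels - 1, labels          # label 0 is the background

    if __name__ == "__main__":
        frame = np.zeros((80, 80), np.uint8)
        cv2.circle(frame, (20, 20), 10, 200, -1)
        cv2.circle(frame, (60, 60), 8, 180, -1)
        count, label_map = segment_regions(frame)
        print("regions found:", count)       # expected: 2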


3.8 Object Recognition:
The most difficult part of image processing is object recognition. Although there are many image segmentation algorithms that can segment an image into regions sharing some continuous feature, it is still very difficult to recognize objects from these regions. There are several reasons for this. First, image segmentation is an ill-posed task and there is always some degree of uncertainty in the segmentation result. Second, an object may contain several regions, and how to connect different regions is another problem. At present, no algorithm can segment general images into objects automatically with high accuracy. When there is some a priori knowledge about the foreground objects or background scene, the accuracy of object recognition can be quite good. Usually the image is first segmented into regions according to patterns of color or texture, and separate regions are then grouped to form objects. The grouping process is important for the success of object recognition. Fully automatic grouping is possible only when a priori knowledge about the foreground objects or background scene exists; in other cases, human interaction may be required to achieve good accuracy.

4. Basis of Video Processing
4.1 Content of Digital Video

Generally speaking, there is much similarity between digital video and images. Each picture of a video can be treated as a still image, and all the techniques applicable to images can also be applied to video pictures. However, there are still differences. The most significant is that video carries temporal information and uses motion estimation for compression. Video is a meaningful group of pictures that tells a story or something else. Video pictures can be grouped into shots: a video shot is a set of pictures taken in one camera break. Within each shot, there can be one or more key pictures. A key picture is a representative of the content of a video shot; for a long shot, there may be multiple key pictures. Usually video processing segments video into separate shots, selects key pictures from these shots, and then generates features for these key pictures. The features (color, texture, objects) of key pictures are what is searched in a video query.
Video processing includes shot detection, key picture selection, feature generation, and object extraction.

4.1.1 Shot Detection:
Shot detection is the process of detecting camera shots. A camera shot consists of one or more pictures taken in one camera break. The general approach to shot detection has been to define a difference metric: if the difference between two pictures is above a threshold, there is a shot boundary between them. One proposed algorithm uses binary search to detect shots, which makes it very fast while achieving good performance as well. Recently, some algorithms detect shots directly on MPEG compressed data.
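A simple pairwise detector in this spirit is sketched below; it uses a normalized grayscale histogram difference between consecutive frames, not the binary-search algorithm mentioned above, and the threshold value, histogram size, and file name are assumptions.

    import numpy as np
    import cv2

    def detect_cuts(video_path, threshold=0.4):
        """Declare a shot boundary wherever the histogram difference between
        consecutive frames exceeds a fixed threshold."""
        cap = cv2.VideoCapture(video_path)
        cuts, prev_hist, index = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
            hist = (hist / hist.sum()).flatten()          # normalize to sum 1
            if prev_hist is not None:
                diff = np.abs(hist - prev_hist).sum() / 2.0   # 0 = identical, 1 = disjoint
                if diff > threshold:
                    cuts.append(index)
            prev_hist, index = hist, index + 1
        cap.release()
        return cuts

    # cuts = detect_cuts("news_report.mp4")   # hypothetical file name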

4.1.2 Key Picture Selection:
After shot detection, each shot is represented by at least one key picture. The choice of key picture can be as simple as a particular picture in the shot: the first, the last, or the middle. However, in situations such as a long shot, no single picture can represent the content of the entire shot. QBIC uses a synthesized key picture created by seamlessly mosaicking all the pictures in a given shot using the computed motion transformation of the dominant background; this picture is an authentic depiction of all the background captured in the whole shot. In the CBIRD system, key picture selection is a simple process that usually chooses the first and last pictures of a shot as the key pictures.

4.1.3 Feature Generation:
After key picture selection, features of key pictures such as color, texture, and intensity are stored as indexes of the video shot. Users can perform traditional keyword queries as well as content-based queries by specifying a color, intensity, or texture pattern. Only the generated features are searched, so retrieval can be done in real time.

4.1.4 Object Extraction:
During shot detection and key picture selection, the objects in the video are also extracted, using image segmentation techniques or motion information. Segmentation-based techniques are mainly based on image segmentation, with objects recognized and tracked by segmentation projection. Motion-based techniques make use of motion vectors to distinguish objects from the background and keep track of their motion. Object extraction is a very difficult problem. The new MPEG-4 standard addresses how to obtain objects in video and encode them separately into different layers. Hopefully this process will not be manual, although it is also unrealistic to expect it to be fully automatic.

5. Current Research about Video Indexing and Retrieval

Video indexing and retrieval is a very active research area. In the field of digital video, computer-assisted content-based indexing is a critical technology and currently a bottleneck in the productive use of video resources. Only an indexed video can effectively support retrieval and distribution in video editing, production, video-on-demand and multimedia information systems. To achieve this, we need algorithms and systems that provide the ability to store and retrieve video in a way that allows flexible and efficient search based on content. In this chapter, we discuss some important aspects of the state-of-the-art progress in video indexing and retrieval. It is organized as follows:
· Video Parsing
· Video Indexing and Retrieval
· Object Recognition and Motion Tracking

5.1 Video Parsing:
The first step of video processing is video parsing. Video parsing is a process to segment video stream into generic shots. These shots are the elementary index unit in a video database, just like a word in a text database. Then each of these shots will be represented by one or more key pictures. Only these key pictures are stored into the video database. There are several tasks in video parsing, including shot detection and key picture selection.

5.1.1 Shot Detection in video parsing:
The first step of video parsing is shot detection. Shot detection algorithms usually belong to two classes: (1) those based on global representations such as color/intensity histograms, without any local information, and (2) those based on measuring local differences such as intensity changes. The former are relatively insensitive to motion but can miss shot boundaries when scenes look quite different yet have similar distributions; the latter are sensitive to moving objects and camera motion. Some systems combine the advantages of the two classes by using a mixed method; QBIC is one of these systems.

5.1.2 Key Picture Selection in video parsing
The next step after shot detection is key picture selection. Each shot has at least one key picture, which best represents the visual content of the shot. The number of key pictures for each shot can be constant or adaptive to the shot content. The first picture is selected as a key picture, and subsequent pictures are compared against this candidate. A two-threshold technique, similar to the one described above, is applied to identify a picture significantly different from the candidate; this new picture is considered another key picture and the subsequent pictures are compared against this new candidate. Users can control the density of key pictures by adjusting the two threshold values.
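A sketch of this adaptive selection, simplified to a single difference threshold instead of the two-threshold scheme, is shown below; the histogram-based difference metric and the threshold value are assumptions.

    import numpy as np

    def frame_difference(hist_a, hist_b):
        """Histogram-based difference in [0, 1] (both histograms sum to 1)."""
        return np.abs(hist_a - hist_b).sum() / 2.0

    def select_key_frames(shot_histograms, t_diff=0.15):
        """Within one shot, keep the first frame as a key frame and add a new
        key frame whenever a frame differs enough from the current candidate."""
        key_indices = [0]
        candidate = shot_histograms[0]
        for i, hist in enumerate(shot_histograms[1:], start=1):
            if frame_difference(candidate, hist) > t_diff:
                key_indices.append(i)       # significantly different: new key frame
                candidate = hist            # later frames are compared to this one
        return key_indices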

5.1.3 Feature Generation in video parsing
After key picture selection, features of key pictures such as color, texture, and intensity are stored as indexes of the video shot. Users can perform traditional keyword queries as well as content-based queries by specifying a color, intensity, or texture pattern. Only the generated features are searched, so retrieval can be done in real time.

5.2 Video Indexing and Retrieval
After each object in a video shot has been segmented and tracked, its features such as color, texture, and motion can be obtained and stored in a feature database. The resulting database is a set of simple feature/value pairs, and the actual query is performed on this feature database. For each feature, there is a function to calculate the distance between the query object and the tracked objects in the video database. The total distance is a weighted sum of these distances; if the total distance is below a certain threshold, the object is returned as a possible match.
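The sketch below illustrates this weighted-distance matching; the Euclidean per-feature distance, the example feature vectors, and the threshold are all assumptions made for the example.

    import numpy as np

    def euclidean(a, b):
        return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

    # One distance function per feature; Euclidean is an illustrative choice
    FEATURE_DISTANCES = {"color": euclidean, "texture": euclidean, "motion": euclidean}

    def match_objects(query, database, weights, max_distance=1.0):
        """Return objects whose weighted total feature distance to the query
        falls below the threshold; each entry is a dict of feature vectors."""
        matches = []
        for obj_id, features in database.items():
            total = sum(weights[name] * FEATURE_DISTANCES[name](query[name], features[name])
                        for name in weights)
            if total < max_distance:
                matches.append((obj_id, total))
        return sorted(matches, key=lambda m: m[1])

    # Hypothetical feature database of tracked objects
    db = {"obj1": {"color": [0.8, 0.1, 0.1], "texture": [0.3], "motion": [1.0, 0.0]},
          "obj2": {"color": [0.1, 0.1, 0.9], "texture": [0.7], "motion": [0.0, 0.5]}}
    q  = {"color": [0.7, 0.2, 0.1], "texture": [0.35], "motion": [0.9, 0.1]}
    print(match_objects(q, db, weights={"color": 0.5, "texture": 0.3, "motion": 0.2}))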
There are also image retrieval systems such as the Yahoo Image Surfer Category List, WebSeer, WebSeek, VisualSeek, UCB's "query all images", Lycos, and MIT's Photobook. Some of them are mainly based on keyword searching: the images are first assigned one or more keywords manually and categorized into groups such as photos, arts, people, animals, and plants. Users can then browse through whichever category interests them.

5.2.1 Examples of some image processing systems:
The Yahoo Image Surfer Category List (YISCL) and Lycos are examples of keyword-based systems; the YISCL system also provides a visual search function based on color distribution matching. UCB's "query all images" presents several interesting ideas such as "blobworld" and "body plans". A blob is a region: while blobworld does not exist completely in the "thing" domain, it recognizes the nature of images as combinations of objects, and querying and learning in blobworld are more meaningful than they are with simple "stuff" representations. The Expectation-Maximization (EM) algorithm is used to perform automatic segmentation based on image features; after segmentation, each region is shown as an elliptic blob. A body plan is an algorithm for image segmentation: it is a sophisticated model of the way a horse, for example, is put together, and as a result the program is capable of recognizing horses in different aspects. MIT's Photobook allows users to perform texture modeling, face recognition, shape matching, brain matching, and interactive segmentation and annotation. WebSeek allows users to draw a query that depicts the spatial relations between objects.

5.3 Object Recognition and Motion Tracking
Object recognition and motion tracking is an important topic. In video, object recognition can be more reliable than in still images because more information is available, the most valuable being the motion vectors. The motion vectors of a moving object have intrinsic patterns that conform to a motion model, and several papers discuss object recognition using an affine motion model.

6. Basic Requirements:
 

6.1 Video Data Modeling
          In a conventional database management system (DBMS), access to data is based on distinct attributes of well-defined data developed for a specific application. For unstructured data such as audio, video, or graphics, similar attributes can be defined. A means for extracting information contained in the unstructured data is required. Next, this information must be appropriately modeled in order to support both user queries for content and data models for storage.
 

Fig 1:  First Stage in Video Data Adaptation: Data Modeling

From a structural perspective, a motion picture can be modeled as data consisting of a finite length of synchronized audio and still images. This model is a simple instance of the more general models for heterogeneous multimedia data objects. Davenport et al. describe the fundamental film component as the shot: a contiguously recorded audio/image sequence. To this basic component, attributes such as content, perspective, and context can be assigned, and later used to formulate specific queries on a collection of shots. Such a model is appropriate for providing multiple views on the final data schema and has been suggested by Lippman and Bender.

Smith and Davenport use a technique called stratification for aggregating collections of shots by contextual descriptions called strata. These strata provide access to frames over a temporal span rather than to individual frames or shot endpoints. The technique can be used primarily for editing and creating movies from source shots, and it also provides quick query access and a view of desired blocks of video. Because of the linearity of the medium we cannot get a coherent description of an item, but as a result of the stratification method the related information is lumped together. The linear integrity of the raw footage is erased, resulting in contextual information which relates the shot with its environment.

Rowe et al. have developed a video-on-demand system for video data browsing. In this system the data are modeled based on a survey of what users would query for. Three types of indices were identified to satisfy user queries. The first is a textual bibliographic index, which includes information about the video and the individuals involved in its making. The second is a textual structural index of the hierarchy of the movie, i.e., segments, scenes, and shots. The third is a content index, which includes keyword indices for the audio track, object indices for significant objects, and key images in the video which represent important events.
The above model does not utilize the semantics associated with video data. Different video data types have different semantics associated with them. We must take advantage of this fact and model video data based on the semantics associated with each data type.

6.2 Video indexing
Video annotation or indexing is the process of attaching content based labels to video. Video indexing is the process of extracting from the video data the temporal location of a feature and its value.

6.2.1 Need of video indexing
Indexing video data is essential for providing content-based access. Indexing has typically been viewed either from a manual annotation perspective or from an image sequence processing perspective. The indexing effort is directly proportional to the granularity of video access; as applications demand finer-grained access to video, automation of the indexing process becomes essential. Given the current state of the art in computer vision, pattern recognition, and image processing, reliable and efficient automation is possible for low-level video indices, like cuts and image motion properties.
Existing work on content-based video access and video indexing can be grouped into three main categories:

6.2.1.1 High level indexing
The work by Davis is an excellent instance of high-level indexing. This approach uses a set of predefined index terms for annotating video. The index terms are organized based on high-level ontological categories like action, time, and space.
The high level indexing techniques are primarily designed from the perspective of manual indexing or annotation. This approach is suitable for dealing with small quantities of new video and for accessing previously annotated databases. 

6.2.1.2 Low level indexing
These techniques provide access to video based on properties like color and texture, and can be classified under the label of low-level indexing.
The driving force behind this group of techniques is to extract data features from the video, organize the features based on some distance metric, and use similarity-based matching to retrieve the video. Their primary limitation is the lack of semantics attached to the features.

6.2.1.3 Domain specific indexing
These techniques use the high level structure of video to constrain the low level video feature extraction and processing. These techniques are effective in their intended domain of application. The primary limitation of these techniques is their narrow range of applicability. 

 6.3 Video data Management
Here we want to know how to extract content from segmented video shots and then index it effectively so that users can retrieve and browse large video collections. Management of sequential video streams includes three steps: parsing, content extraction and indexing, and retrieval and browsing.
Video parsing is the process of detecting scene changes, or the boundaries between camera shots, in a video stream. The video stream is segmented into generic clips, which are the elemental index units in a video database, just like words in a text database. Each of these clips is then represented visually by its key frames. To reduce the requirement for a massive amount of storage, only these key frames are stored in the database, together with indices for their location. There are two types of transition: abrupt transitions (camera breaks) and gradual transitions, e.g., fade-in, fade-out, dissolve, and wipe.
Indexing tags video clips when the system inserts them into the database. The tag includes information based on a knowledge model that guides the classification according to the semantic primitives of the images. Indexing is thus driven by the image itself and any semantic descriptors provided by the model. Two types of indices, text-based and image-based, are needed. The text-based index is typed in by a human operator based on the key frames using a content logger; the image-based index is automatically constructed from the image features extracted from the key frames.
Retrieval and browsing let users access the database through queries based on text and/or visual examples, or browse it through interaction with displays of meaningful icons; users can also browse the results of a retrieval query. It is important that both retrieval and browsing appeal to the user's visual intuition. In a visual query, users want to find video shots that look similar to a given example; in a concept query, users want to find video shots by the presence of specific objects or events. Visual queries can be answered by directly comparing low-level visual features like color, texture, shape, and temporal variance of video shots or of their representative (key) frames. Concept queries, on the other hand, depend on object detection, tracking, and recognition. Since fully automatic object extraction is still impossible, some degree of user interaction is necessary in this process, but manual indexing labor can be greatly reduced with the help of video analysis techniques.

7. Image Processing Techniques For Video Content Extraction
 


The increase in the diversity and availability of electronic information has led to additional processing requirements in order to retrieve relevant and useful data: the accessibility problem. This problem is even more relevant for audiovisual information, where huge amounts of data have to be searched, indexed and processed. Most of the solutions for this type of problem point towards a common need: to extract relevant information features for a given content domain, a process which involves two difficult tasks: deciding what is relevant and extracting it. In fact, while content extraction techniques are reasonably developed for text, video data is still essentially opaque. Despite its obvious advantages as a communication medium, the lack of suitable processing and communication platforms has delayed its introduction in a generalized way. This situation is changing and new video-based applications are being developed.

7.1 Toolkit overview
videoCEL is basically a library for video content extraction. Its components extract relevant features of video data and can be reused by different applications. The object model includes components for video data modelling and tools for processing and extracting video content, but currently the video processing is restricted to images.
At the data modelling level, the more significant concepts are the following:
· Images, which represent the frame data: a numerical matrix whose values can be colors, color map entries, etc.;
· ColorMaps, which map entries into a color space, allowing an additional indexation level;
· ImageDisplayConverters and ImageIOHandlers, which convert images into the specific formats of the platforms and vice-versa.
The object model of videoCEL is a subset of a more complete model, which also includes concepts such as shots, shot sequences and views, concepts which are modelled in a distinct toolkit that provides functionalities for indexing, browsing and playing annotated video segments.
A shot object is a discrete sequence of images with a set of temporal attributes, such as frame rate and duration, and represents a video segment. A shot sequence object groups several shots using some semantic criterion. Views are used to visualize and browse shots and shot sequences.
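A minimal sketch of how these shot and shot-sequence concepts could be modelled is given below; the class and attribute names are illustrative, not the actual videoCEL API.

    from dataclasses import dataclass, field
    from typing import List
    import numpy as np

    @dataclass
    class Shot:
        """A discrete sequence of images with temporal attributes."""
        frames: List[np.ndarray]          # image data for the video segment
        frame_rate: float = 25.0

        @property
        def duration(self) -> float:      # seconds
            return len(self.frames) / self.frame_rate

    @dataclass
    class ShotSequence:
        """Groups several shots under one semantic criterion (e.g. a news item)."""
        label: str
        shots: List[Shot] = field(default_factory=list)

        def total_duration(self) -> float:
            return sum(shot.duration for shot in self.shots)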

7.2 Temporal segmentation tools
One of the most important tasks in video analysis is to specify a set of units in which the video temporal sequence may be organized. The different video transitions are important for video content identification and for defining the semantics of the video language, making their detection one of the primary goals to be achieved. The basic assumption of transition detection procedures is that video segments are spatially and temporally continuous, and thus the boundary images must undergo significant content changes, changes which depend on the transition type and can be measured. The original problem is thereby reduced to the search for suitable difference quantification metrics, whose maxima identify, with high probability, the temporal locations of transitions.

7.3 Cut detection
The process of detecting cuts is quite simple, mainly because the changes in content are very visible and always occur instantaneously between consecutive frames. The implemented algorithm simply uses one of the quantification metrics, and a cut is declared when the difference is above a certain threshold; its success is therefore greatly dependent on the suitability of the metric. The results obtained by applying this procedure with several of the metrics depend heavily on the threshold selection, which was made empirically while trying to maximize the success of the detection (minimizing both false and missed detections). The captured video segment belongs to an outdoor news report, so its transitions are not very "artistic" (mainly cuts). There are several well-known strategies that usually improve this detection. For instance, the use of adaptive thresholds increases the flexibility of the thresholding, allowing the algorithm to adapt to diverse video content. An approach used with some success in previous work, to reduce the shortcomings of any single metric's specific behavior, is simply to produce a weighted average of the differences obtained with two or more metrics. Pre-processing images with noise filters or lower-resolution operators is also quite usual, offering a means of reducing image noise and processing complexity. The distinctive treatment of image regions, in order to eliminate some of the more extreme values, remarkably increases the detection accuracy, especially when there are only a few objects moving in the captured scene.

7.4 Gradual transition detection
Gradual transitions, such as fades, dissolves and wipes, cause more gradual changes which evolve over several images. Although the differences obtained are less distinct from the average values, and can be similar to those caused by camera operations, there are several successful procedures, which were adapted and are currently supported by the toolkit.

7.4.1 Twin-Comparison algorithm
This algorithm was developed after verifying that, in spite of the fact that the first and last transition frames are quite different, consecutive images remain very similar. Thus, as in cut detection, this procedure uses one of the difference metrics but, instead of one threshold, it has two: a higher one for cuts and a lower one for gradual transitions. While this algorithm just detects gradual transitions and distinguishes them from cuts, there are other approaches which also classify fades, dissolves and wipes.
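A simplified sketch of this two-threshold idea is shown below; it accumulates consecutive-frame differences rather than comparing each frame against the first frame of the candidate transition, and the threshold values are illustrative.

    def twin_comparison(diffs, t_cut=0.5, t_grad=0.15):
        """Classify transitions from a list of consecutive-frame differences.

        A difference above t_cut is a cut. A difference above t_grad starts a
        potential gradual transition; it is confirmed when the accumulated
        difference since its first frame also exceeds t_cut.
        """
        transitions, start, accumulated = [], None, 0.0
        for i, d in enumerate(diffs):
            if d > t_cut:
                transitions.append(("cut", i))
                start, accumulated = None, 0.0
            elif d > t_grad:
                if start is None:
                    start, accumulated = i, 0.0
                accumulated += d
            else:
                if start is not None and accumulated > t_cut:
                    transitions.append(("gradual", start, i))
                start, accumulated = None, 0.0
        return transitions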

7.4.2 Edge-Comparison algorithm
This algorithm analyses both edge change fractions, exiting and entering. Distinct gradual transitions generate characteristic variations of these values: for instance, a fade-in always generates an increase in the entering edge fraction; conversely, a fade-out causes an increase in the exiting edge fraction; a dissolve has the same effect as a fade-out followed by a fade-in.
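A possible sketch of computing the entering and exiting edge fractions between two consecutive frames with OpenCV follows; the Canny thresholds and the dilation radius are assumed values, not ones prescribed by the original algorithm.

    import cv2
    import numpy as np

    def edge_change_fractions(prev_gray, curr_gray, dilate_radius=2):
        """Return (entering, exiting) edge fractions between two frames.

        Entering edges are current-frame edge pixels far from any previous edge;
        exiting edges are previous-frame edge pixels far from any current edge.
        """
        e_prev = cv2.Canny(prev_gray, 100, 200) > 0
        e_curr = cv2.Canny(curr_gray, 100, 200) > 0
        kernel = np.ones((2 * dilate_radius + 1,) * 2, np.uint8)
        prev_dil = cv2.dilate(e_prev.astype(np.uint8), kernel) > 0
        curr_dil = cv2.dilate(e_curr.astype(np.uint8), kernel) > 0
        entering = np.logical_and(e_curr, ~prev_dil).sum() / max(e_curr.sum(), 1)
        exiting  = np.logical_and(e_prev, ~curr_dil).sum() / max(e_prev.sum(), 1)
        return entering, exiting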

7.5 Camera operation detection
As distinct transitions give different meanings to adjacent video segments, the possible camera operations are also relevant for content identification. For example, that information can be used to build salient stills and to select key frames or segments for video representation. All the methods which detect and classify camera operations start from the following observation: each operation generates characteristic global changes in the captured objects and background. For example, when a pan happens they move horizontally in the direction opposite to the camera motion; the behavior of tilts is similar but on the vertical axis; zooms generate convergent or divergent motion.

7.5.1 X-ray based method
This approach basically produces fingerprints of the global motion flow. After the edges are extracted, each image is reduced to its horizontal and vertical projections, a column and a row, that roughly represent the horizontal and vertical global motion; these projections are usually referred to as the x-ray images.
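A minimal sketch of computing such x-ray projections from an edge image is given below; the use of Canny as the edge detector and its threshold values are assumptions.

    import cv2
    import numpy as np

    def xray_projections(gray):
        """Reduce an edge image to its horizontal and vertical projections
        (one row and one column), the 'x-ray' fingerprints of global motion."""
        edges = cv2.Canny(gray, 100, 200)
        row_xray = edges.sum(axis=0)     # horizontal projection (one value per column)
        col_xray = edges.sum(axis=1)     # vertical projection (one value per row)
        return row_xray, col_xray

    # Comparing the shift between the x-rays of consecutive frames (e.g. by
    # cross-correlation) gives a rough estimate of pan/tilt direction.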

7.6 Lighting conditions characterization
Light effects are usually mentioned in the grammar of cinema language, as their contribution is essential to the overall meaning of video content. The lighting conditions can be easily extracted by observing the distribution of the light intensity histogram: its mode, mean and spread are valuable in characterising the distribution type. These features also allow the lighting variations to be quantified, once the similarity of the images is determined.
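A short sketch of extracting these lighting features from a grayscale frame follows; the particular statistics returned (mean, mode, standard deviation) are an illustrative choice.

    import numpy as np

    def lighting_features(gray):
        """Summarize the light intensity histogram of a frame: mean, mode and
        spread, usable to characterize and compare lighting conditions."""
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        return {
            "mean":   float(gray.mean()),
            "mode":   int(hist.argmax()),
            "spread": float(gray.std()),
        }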

7.7 Scene segmentation
Scene segmentation refers to the decomposition of the image into its main components: objects, background, captions, etc. It is a first step in identifying and classifying the main features of the scene and tracking them throughout the sequence. The simplest implemented segmentation method is amplitude thresholding, which is quite successful when the different regions have distinct amplitudes; it is a particularly useful procedure for binarizing captions. Other methods are described below.

7.7.1 Region-based segmentation
Region-based segmentation procedures find regions in an image which have similar features. One such algorithm is the split-and-merge algorithm, which first divides the image into atomic homogeneous regions and then merges similar adjacent regions until they are sufficiently different. Two distinct metrics are needed: one for measuring the homogeneity of the initial regions (the variance, or any other difference measure), and another for quantifying the similarity of adjacent regions (the average, median, mode, etc.).

7.7.2 Motion-based segmentation
The main idea in motion-based segmentation techniques is to identify image regions with similar motion behavior. These properties are determined by analysing the temporal evolution of the pixels, a process carried out over the whole image sequence. When the most constant pixels are selected, for example, the resulting image is the background, with the motion removed. Once the background is extracted, the same principle can be used to extract and track motion or objects.
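One way to realize this idea is sketched below, using the per-pixel temporal median as the "most constant" value; the median choice and the difference threshold are assumptions, not the toolkit's actual procedure.

    import numpy as np

    def estimate_background(frames):
        """Background extraction sketch: the per-pixel temporal median keeps the
        most constant values, removing moving objects from the image sequence."""
        stack = np.stack(frames, axis=0)          # shape: (n_frames, h, w)
        return np.median(stack, axis=0).astype(stack.dtype)

    def moving_regions(frame, background, threshold=25):
        """Pixels that differ enough from the background belong to moving objects."""
        diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        return (diff > threshold).astype(np.uint8) * 255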

7.7.3 Scene and object detection
The process of detecting scenes or scene regions (objects) is, in a certain way, the opposite of transition detection: we want to find image regions whose differences are below a certain threshold. Consequently this procedure also uses difference quantification metrics. These functions can be computed for the whole image, or a hierarchical calculation of growing resolution can be performed to accelerate the process. Another tested algorithm, also hierarchical, is based on the Hausdorff distance; it retrieves all the possible transformations (translation, rotation, etc.) between the edges of two images. Another way of extracting objects is by representing their contours. The toolkit uses a polygonal line approach to represent contours as a set of connected segments; the end of a segment is detected when the relation between the current segment's polygonal area and its length exceeds a certain threshold.
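For illustration, the sketch below represents object contours as connected line segments using OpenCV's Douglas-Peucker approximation; this perimeter-based simplification criterion stands in for the toolkit's area/length criterion, and the epsilon ratio is an assumed parameter (OpenCV 4.x return signature assumed).

    import cv2

    def polygonal_contours(binary, epsilon_ratio=0.01):
        """Represent object contours in a binary image as connected line segments.

        epsilon_ratio controls how aggressively each contour is simplified,
        relative to its perimeter length.
        """
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        polygons = []
        for contour in contours:
            epsilon = epsilon_ratio * cv2.arcLength(contour, closed=True)
            polygons.append(cv2.approxPolyDP(contour, epsilon, closed=True))
        return polygons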

7.7.4 Caption extraction
Based on an existing caption extraction method, a new and more effective procedure was implemented. As captions are usually artificially added to images, the first step of this procedure is to extract high-contrast regions. This task is performed by segmenting the edge image, whose contours have previously been dilated by a certain radius. These regions are then subjected to caption-characteristic size constraints, based on the properties of the x-rays (projections of edge images); just the horizontal clusters remain. The resulting image is segmented and two different images are produced: one with a black background for lighter text, and another with a white background for darker text. The process is completed by binarizing both images and applying further dimensional region constraints.
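A rough sketch of the first steps of such a pipeline (edge extraction, dilation, and size constraints that keep wide horizontal clusters) is shown below; the Canny thresholds, dilation settings, and size limits are assumptions, and the binarization and text-polarity steps are omitted.

    import cv2
    import numpy as np

    def caption_candidates(gray, min_width=40, max_height=40):
        """Rough caption localization: find high-contrast regions by dilating the
        edge image, then keep only wide, short (horizontal) clusters."""
        edges = cv2.Canny(gray, 100, 200)
        dilated = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=2)
        n, _, stats, _ = cv2.connectedComponentsWithStats(dilated)
        boxes = []
        for i in range(1, n):                       # label 0 is the background
            x, y, w, h, _ = stats[i]
            if w >= min_width and h <= max_height and w > 2 * h:
                boxes.append((x, y, w, h))          # caption-like: wide and short
        return boxes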

7.8 Edge detection
Two distinct procedures for edge detection were implemented: (1) gradient module thresholding, where the image gradient vectors are obtained using the Sobel operator; and (2) the Canny filter, considered the optimal detector, which analyses the representativity of gradient module maxima and thus produces thinner contours. As differential operators amplify high-frequency zones, it is common practice to pre-process the images using noise filters, a functionality also supported by the toolkit in the form of several smoothing operators: the median filter, the average filter, and a Gaussian filter.
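Both procedures can be sketched with OpenCV as follows; the blur kernel sizes and thresholds are illustrative values, not the toolkit's settings.

    import cv2
    import numpy as np

    def sobel_edges(gray, threshold=80):
        """Edge detection by thresholding the gradient magnitude (Sobel operator)."""
        blurred = cv2.GaussianBlur(gray, (5, 5), 1.0)   # pre-filter to suppress noise
        gx = cv2.Sobel(blurred, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(blurred, cv2.CV_32F, 0, 1, ksize=3)
        magnitude = np.sqrt(gx ** 2 + gy ** 2)
        return (magnitude > threshold).astype(np.uint8) * 255

    def canny_edges(gray, low=50, high=150):
        """Canny detector: non-maximum suppression yields thinner contours."""
        blurred = cv2.medianBlur(gray, 5)               # median filter against noise
        return cv2.Canny(blurred, low, high)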

8. Applications
 
8.1 videoCEL applications:
Video browser:
This application is used to visualize video streams. The browser can load a stream and split it into its shot segments using the cut detection algorithms. Each shot is then represented in the browser's main window by an icon, a reduced form of its first frame, and the shots can be played using several view objects.

WeatherDigest :
The WeatherDigest application generates HTML documents from TV weather forecasts. The temporal sequence of maps, presented on the TV, is mapped to a sequence of images in the HTML page. This application illustrates the importance of information models.

News analysis:
A set of applications was developed to be used by social scientists in content analysis of TV news. The analysis consisted of filling in forms with news item durations, subjects, etc., a task which these applications attempt to automate. The system generates HTML pages with the images, and CSV (Comma Separated Values) tables suitable for use in spreadsheets such as Excel. Additionally, these HTML pages can also be used for news browsing, and there is also a Java-based tool for accessing this information.

8.2 COBRA Model            
In order to explore video content and provide a framework for automatic extraction of semantic content from raw video data, the COntent-Based RetrievAl (COBRA) video data model has been proposed. The model is independent of feature/semantic extractors, providing flexibility by allowing different video processing and pattern recognition techniques to be used for that purpose. A feature grammar is exploited to describe the low-level persistent metadata; the grammar also describes the dependencies between the extractors.
At the same time it is in line with the latest development in MPEG-7, distinguishing four distinct layers within video content: the raw data, the feature, the object, and the event layer. The object and event layers are concept layers consisting of entities characterized by prominent spatial and temporal dimensions respectively.
To provide automatic extraction of concepts (objects and events) from visual features (which are extracted using existing video/image processing techniques), the COBRA video model is extended with object and event grammars. These grammars are aimed at formalizing the descriptions of these high-level concepts, as well as facilitating their extraction based on features and spatio-temporal reasoning.
This rule-based approach results in the automatic mapping from features to high-level concepts. However, we still have the problem of creating object and event rules manually. This might be very difficult, especially in the case of certain object rules which require extensive user familiarity with features and object extraction techniques.
As the model also provides a framework for stochastic modeling of events, we have chosen to exploit the learning capability of Hidden Markov Models (HMMs) to recognize events in video data automatically.

                        

                                                      Figure 2 - Video sequences
                               
                        

Figure 3 - Video shots
                                                       

Figure 4 - Principal color detection
                                                                 

Figure 5 - Detected player
SELECT
vi.frame_seq
FROM player pl, video vi
WHERE
Event: vi.frame.e = ((Appr_net_BSL:
((e1: Player_near_the_net, e2: Backhand_slice), (), (), (),
(before(e2, e1, n), n < 75, pl.name = 'Sampras', e1.o1.name = e2.o1.name = pl.name)))
Figure 6 - Video query

The WHERE clause of the query shown above constrains player profiles to only those documents that contain videos with the event 'Approaching the net with a backhand slice stroke'. This new event description, defined inside the query, demonstrates how complex events can be defined dynamically based on previously extracted events and spatio-temporal relations. The first of the two predefined events, Player_near_the_net, is defined using spatio-temporal rules, whereas the second, Backhand_slice, is defined using the HMM approach. The temporal relation requires e1 to start at least 75 frames before event e2. The event descriptions are evaluated by the query processor, which rewrites the event from its conceptual definition into a standard object algebra extended by the COBRA video model and spatio-temporal operations. Therefore, a user is able to explore video content by specifying very detailed, complex queries that include a combination of features, objects, and events, as well as spatio-temporal relations among them.

Conclusion


Visual information has always been an important source of knowledge. With the advances in information computing and communication technology, this information, in the form of digital images and digital video, is now widely available through the computer. To cope with this explosion of visual information, an organization of the material which allows fast search and retrieval is required. This calls for systems which can, in some way, provide content-based handling of visual information. In this seminar I have tried to present the basic image processing techniques, the status of content-based access to image and video databases, and some applications related to video content extraction.
An image retrieval system is necessary for users with large collections of images, such as a digital library. During the last few years, some content-based techniques for image retrieval have become commercially available. These systems offer retrieval by color, texture or shape, and smart combinations of these help users find the image they are looking for. A video retrieval system is useful for video archiving, video editing, production, etc.

BIBLIOGRAPHY


Basis of Video and Image Processing, Jian Wang.

A Survey on Content Based Access to Image & Video Databases, Kjersti Aas and Line Eikvil.

Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review, Philippe Aigrain, HongJiang Zhang and Dragutin Petkovic.

Advanced Content-Based Semantic Scene Analysis and Information Retrieval: The SCHEMA Project, E. Izquierdo, J. R. Casas, R. Leonardi, P. Migliorati, Noel E. O'Connor, I. Kompatsiaris and M. G. Strintzis.
