ABSTRACT
"A picture is worth a thousand words", the message we are getting from an image. Visual information has been playing an important role in our everyday life.
The significant challenge in large multimedia databases is the provision of efficient means for semantics indexing and retrieval of visual information. The video has low resolution and the often has poor contrast with a changing background. Problems in segmenting text from video are similar to those faced detection and localization phases. The main motivation for extracting the content of information is the accessibility problem. A problem that is even more relevant for dynamic multimedia data, which also have to be searched and retrieved. While content extraction techniques are reasonably developed for text, video data still is essentially opaque. Its richness and complexity suggests that there is a long way to go in extracting video features, and the implementation of more suitable and effective processing procedures is an important goal to be achieved.
This report describes brief introduction of Video and Image processing, common Image Processing Techniques, Basics of Video Processing, current researches about Video Indexing and Retrieval, Basic Requirements, Image processing techniques for Video content extraction and some applications like Videocel, COBRA Model.
.The video text extraction problem is divided into three main tasks- 1. Detection, 2. Localization, 3. Segmentation. The present development of multimedia technology and information highways has put content processing of visual media at the core of key application domains: digital and interactive video, large distributed digital libraries, multimedia publishing.
1. Introduction
1.1 Basis of Video and Image Processing:
This chapter introduces the basics of video and image processing. An image or video is stored in a computer only as a set of pixels with RGB values; the computer knows nothing about the meaning of these pixel values. The content of an image is quite clear to a person, but it is not so easy for a computer. For example, it is a piece of cake for you to recognize yourself in an image or video, even in a crowd, but this is extremely difficult for a computer. Preprocessing helps the computer understand the content of an image or video. What is this so-called content? Here, content means features of the image or video or of their objects, such as color, texture, resolution, and motion. An object can be viewed as a meaningful component in an image or video picture; for example, a moving car, a flying bird, and a person are all objects. There are many techniques for image and video processing. This chapter starts with an introduction to general image processing techniques and then discusses video processing techniques. The reason for introducing image processing first is that image processing techniques can be applied to video if we treat each picture of a video as a still image.
2. Background
3. Common Image Processing Techniques
3.1 Dithering:
Dithering is a process of using a pattern of solid dots to simulate shades
of gray. Different shapes and patterns of dots have been employed in this
process, but the effect is the same. When viewed from a great enough distance
that the dots are not discernible, the pattern appears as a solid shade of
gray.
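To make the idea concrete, here is a minimal sketch of ordered (Bayer) dithering in Python with NumPy. The 4x4 threshold matrix, the function name, and the normalization are illustrative choices and are not taken from this report.

```python
import numpy as np

# A 4x4 Bayer threshold matrix, normalized to [0, 1): each entry is the local
# threshold against which the corresponding pixel is compared.
BAYER_4X4 = (1.0 / 16.0) * np.array([[ 0,  8,  2, 10],
                                     [12,  4, 14,  6],
                                     [ 3, 11,  1,  9],
                                     [15,  7, 13,  5]])

def ordered_dither(gray):
    """Turn an 8-bit grayscale image into a binary dot pattern (0 or 255)."""
    h, w = gray.shape
    # Tile the threshold matrix so it covers the whole image.
    thresholds = np.tile(BAYER_4X4, (h // 4 + 1, w // 4 + 1))[:h, :w]
    return (gray.astype(np.float32) / 255.0 > thresholds).astype(np.uint8) * 255
```

Viewed from far enough away, the resulting dot pattern averages out to the original shade of gray.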
3.2 Erosion
Erosion is the process of eliminating
all the boundary points from an object, leaving the object smaller in area by
one pixel all around its perimeter. If it narrows to less than three pixels
thick at any point, it will become disconnected (into two objects) at that
point. It is useful for removing from a segmented image objects that are too
small to be of interest.
Shrinking is a special kind of erosion in which
single-pixel objects are left intact. This is useful when the total object
count must be preserved.
Thinning is another special kind of erosion. It is
implemented as a two-step process. The first step marks all candidate
pixels for removal. The second step actually removes those candidates that can
be removed without destroying object connectivity.
3.3 Dilation:
Dilation is the process of incorporating
into the object all the background pixels that touch it, leaving it larger in
area by that amount. If two objects are separated by less than three pixels at
any point, they will become connected (merged into one object) at that point.
It is useful for filling small holes in segmented objects.
Thickening is a special kind of dilation. It is
implemented as a two-step process. The first step marks all the candidate
pixels for addition. The second step adds those candidates that can be added
without merging objects.
3.4 Opening:
The process of erosion followed by
dilation is called opening. It has the effect of eliminating small and
thin objects, breaking objects at thin points, and generally smoothing the
boundaries of larger objects without significantly changing their area.
3.5 Closing:
The process of dilation followed by
erosion is called closing. It has the effect of filling small and thin
holes in objects, connecting nearby objects, and generally smoothing the
boundaries of objects without significantly changing their area.
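The four morphological operations above can be sketched with OpenCV as follows. The 3x3 structuring element and the input file name are assumptions made purely for illustration.

```python
import cv2
import numpy as np

# Load a binary (segmented) image; "segmented.png" is a placeholder name.
binary = cv2.imread("segmented.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((3, 3), np.uint8)        # 3x3 structuring element

eroded  = cv2.erode(binary, kernel)       # strips one pixel layer from object boundaries
dilated = cv2.dilate(binary, kernel)      # grows objects into the touching background
opened  = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # erosion followed by dilation
closed  = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # dilation followed by erosion
```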
3.6 Filtering:
Image filtering can be used for noise
reduction, image sharpening, and image smoothing. By applying a low-pass or
high-pass filter to the image, the image can be smoothed or sharpened
respectively. A lowpass filter reduces the amplitude of high-frequency
components. Simple lowpass filters apply local averaging: the gray level at
each pixel is replaced with the average of the gray levels in a square or
rectangular neighborhood. The Gaussian lowpass filter applies a Fourier transform
to the image and attenuates the high frequencies. A highpass filter increases the amplitude of
high-frequency components and is useful for detecting edges and fine detail.
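A minimal sketch of these filters, assuming OpenCV and NumPy are available; the kernel sizes, the particular high-pass kernel, and the file name are illustrative assumptions.

```python
import cv2
import numpy as np

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)    # placeholder file name

# Low-pass filtering: local averaging and a Gaussian-weighted average.
smoothed_box      = cv2.blur(gray, (5, 5))              # each pixel replaced by the neighborhood mean
smoothed_gaussian = cv2.GaussianBlur(gray, (5, 5), 0)   # Gaussian low-pass filter

# High-pass filtering: a Laplacian-like kernel that responds to rapid intensity changes.
highpass_kernel = np.array([[ 0, -1,  0],
                            [-1,  4, -1],
                            [ 0, -1,  0]], dtype=np.float32)
high_freq = cv2.filter2D(gray, cv2.CV_32F, highpass_kernel)
```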
3.7 Segmentation:
Image segmentation is the process of dividing an image into regions, where a region is a set in which all the pixels are adjacent or touching. Within each region, there are some common
features among the pixels, such as color, intensity, or texture. When a human
observer views a scene, his visual system will automatically segment the scene
for him or her. The process is so fast and efficient that one sees not a
complex scene, but rather a collection of objects. However, computer must
laboriously isolate the objects in an image by breaking the image into sets of
pixels, each of which is the image of one object.
Image segmentation can be approached in three
ways. The first approach is the region approach, in which each pixel
is assigned to a particular object or region. In the boundary approach,
only the boundaries that exist between the regions are located. The third is
called edge approach, where people try to identify edge pixels and then
link them together to form the required boundaries.
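A minimal sketch of the region approach, assuming OpenCV: the image is first thresholded (Otsu's method is used here only as an example) and each connected set of foreground pixels is then labelled as one region.

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)    # placeholder file name

# Separate foreground from background; Otsu picks the threshold automatically.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Every connected set of adjacent foreground pixels becomes one labelled region.
num_labels, labels = cv2.connectedComponents(binary)
print(f"found {num_labels - 1} regions")                 # label 0 is the background
```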
3.8 Object Recognition:
The most difficult part of image
processing is object recognition. Although there are many image segmentation
algorithms that can segment an image into regions with some continuous feature, it
is still very difficult to recognize objects from these regions. There are
several reasons for this. First, image segmentation is an ill-posed task and
there is always some degree of uncertainty in the segmentation result. Second,
an object may contain several regions and how to connect different regions is
another problem. At present, no algorithm can segment general images into
objects automatically with high accuracy. When there is some a priori
knowledge about the foreground objects or background scene, the accuracy
of object recognition can be quite good. Usually the image is first
segmented into regions according to the pattern of color or texture. Then
separate regions are grouped to form objects. The grouping process is
important for the success of object recognition. Fully automatic grouping is only
possible when a priori knowledge about the foreground objects or background
scene exists. In other cases, human interaction may be required to achieve
good accuracy in object recognition.
4. Basis
of Video Processing
4.1 Content of Digital Video
Generally speaking, there is much
similarity between digital video and image. Each picture of video can be
treated as a still image. All the techniques applicable to images can also be
applied to video pictures. However, there are still differences. The most
significant difference is that video has temporal information and uses motion
estimation for compression. Video is a meaningful group of pictures that tells
a story or something else. Video pictures can be grouped as a shot. A video
shot is a set of pictures taken in one camera break. Within each shot, there
can be one or more key pictures. A key picture is a representative of the content
of a video shot; for a long video shot, there may be multiple key pictures.
Usually video processing segments the video into separate shots, selects key
pictures from these shots, and then generates features of these key pictures.
The features (color, texture, object) of key pictures are searched in video
query.
Video processing includes shot detection, key
picture selection, feature generation, and object extraction.
4.1.1 Shot Detection:
Shot detection
is a process to detect camera shots. A camera shot consists of one or more
pictures taken in one camera break. The general approach to shot detection has
been to define a difference metric: if the difference between two
pictures is above a threshold, then there is a shot boundary between them. One proposed
algorithm uses binary search to detect shots, which
makes it very fast while still achieving good performance. Recently, some
algorithms detect shots directly on MPEG compressed data.
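As a rough sketch of the difference-metric idea (not of the binary-search algorithm mentioned above), the following compares color histograms of consecutive frames and declares a cut when their distance exceeds a threshold. The Bhattacharyya distance, the 8x8x8 histogram, and the threshold value are illustrative assumptions.

```python
import cv2

def detect_cuts(video_path, threshold=0.5):
    """Return frame indices where the histogram difference suggests a shot boundary."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8 color histogram of the frame, normalized so frames are comparable.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if diff > threshold:        # large change between consecutive frames: a cut
                cuts.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    return cuts
```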
4.1.2 Key Picture Selection:
After shot
detection, each shot is represented by at least one key picture. The choice of
key picture could be as simple as a particular picture in the shot: the first,
the last, or the middle. However, in situations such as a long shot, no single
picture can represent the content of the entire shot. QBIC uses a synthesized
key picture created by seamlessly mosaicking all the pictures in a given shot
using the computed motion transformation of the dominant background. This
picture is an authentic depiction of all background captured in the whole shot.
In the CBIRD system, key picture selection is a simple process that usually chooses
the first and last pictures of a shot as key pictures.
4.1.3 Feature Generation:
After key picture selection, features of key pictures such as color,
texture, intensity are stored as indexes of the video shot. Users can perform traditional
search by using keyword querying and content-based query by specifying a color,
intensity, or texture pattern. Only the generated features will be searched
against and the retrieval can be in real time.
4.1.4 Object Extraction:
During the process
of shot detection and key picture selection, the objects in the video are also
extracted using image segmentation techniques or motion information.
Segmentation-based techniques are mainly based on image segmentation, and
objects are recognized and tracked by segmentation projection. Motion-based
techniques make use of motion vectors to distinguish objects from the background
and keep track of their motion. Object extraction is a very difficult problem, and the
MPEG-4 standard addresses how to extract objects in the video and encode them
separately into different layers. Ideally this process would not be manual, but it
is also unrealistic to expect it to be fully automatic.
5.
Current Research about Video Indexing and Retrieval
Video indexing
and retrieval is a very active research area. In the field of digital video,
computer-assisted content-based indexing is a critical technology and currently
a bottleneck in the productive use of video resources. Only an indexed video
can effectively support retrieval and distribution in video editing,
production, video-on-demand and multimedia information systems. To achieve
this, we need algorithms and systems that provide the ability to store and
retrieve video in a way that allows flexible and efficient search based on
content. In this chapter, we will talk about some important aspects about the
state of art progress in video indexing and retrieval. It is organized as
follows:
· Video Parsing
· Video Indexing and Retrieval
· Object Recognition and Motion Tracking
5.1 Video Parsing:
The first step of video processing is video parsing. Video parsing is a
process to segment video stream into generic shots. These shots are the
elementary index unit in a video database, just like a word in a text database.
Then each of these shots will be represented by one or more key pictures. Only
these key pictures are stored into the video database. There are several tasks
in video parsing, including shot detection and key picture selection.
5.1.1 Shot Detection in video parsing:
The first step of
video parsing is shot detection. Shot detection algorithms usually belong to
two classes: (1) those based on global representations like color/intensity
histograms without any local information, and (2) those based on measuring local
differences like intensity change. The former are relatively insensitive to
motion but can miss shot boundaries when scenes look quite different yet have similar
distributions. The latter are sensitive to moving objects and camera motion.
systems combine the advantages of the two classes of detection by using a mixed
method. QBIC is one of these systems.
5.1.2 Key Picture Selection in video parsing
The next step after shot detection is key picture selection. Each shot has
at least one key picture, which best represents the visual content of the
shot. The number of key pictures for each shot can be constant or adaptive to
the shot content. The first picture is selected as a key picture, and subsequent
pictures are compared against this candidate. A two-threshold technique,
similar to the one described above, is applied to identify a picture
significantly different from the candidate. This new picture is considered
another key picture and the subsequent pictures are compared against this new
candidate. Users can control the density of key pictures by adjusting the two
threshold values.
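The selection loop described above can be sketched as follows. For brevity this sketch uses a single threshold instead of the two-threshold technique, and frame_difference() stands in for whichever difference metric is chosen; both are assumptions made for illustration.

```python
def select_key_frames(shot_frames, frame_difference, threshold=0.4):
    """Return indices of key frames within one shot."""
    if len(shot_frames) == 0:
        return []
    key_frames = [0]                    # the first frame is always a key frame
    candidate = shot_frames[0]
    for i in range(1, len(shot_frames)):
        if frame_difference(candidate, shot_frames[i]) > threshold:
            key_frames.append(i)        # significantly different: a new key frame
            candidate = shot_frames[i]  # later frames are compared against it
    return key_frames
```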
5.1.3 Feature Generation in video parsing
Feature generation after key picture selection proceeds as described in Section 4.1.3: features of the key pictures such as color, texture, and intensity are stored as indexes of the video shot, only these generated features are searched against, and retrieval can be in real time.
5.2 Video Indexing and Retrieval
After
each object in the video shot has been segmented and tracked, their features
such as color, texture, motion can be obtained and stored in a feature
database. The resulting database is a simple set of (feature, value) pairs, and the actual
query is performed on this feature database. For each feature, there is a
function to calculate the distance between the query object and the tracked objects in
the video database. The total distance is a weighted sum of these distances. If
the total distance is below a certain threshold, then the object is returned as a
possible match.
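A minimal sketch of this weighted-distance matching; the feature names, distance functions, weights, and threshold are all illustrative assumptions.

```python
def total_distance(query, candidate, distance_fns, weights):
    """Weighted sum of per-feature distances between a query object and a tracked object."""
    return sum(weights[f] * distance_fns[f](query[f], candidate[f])
               for f in distance_fns)

def find_matches(query, feature_db, distance_fns, weights, threshold=1.0):
    """Return tracked objects whose total distance to the query falls below the threshold."""
    return [obj for obj in feature_db
            if total_distance(query, obj, distance_fns, weights) < threshold]
```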
There
are also some image retrieval systems such as the Yahoo Image Surfer Category List,
WebSeer, WebSeek, VisualSeek, UCB's "query all images", Lycos, and MIT's Photobook.
Some of them are mainly based on keyword searching. First the images are
assigned one or more keywords manually and categorized into different groups such
as photos, arts, people, animals, and plants. Users can then browse through the
separate category that may be of interest to them.
5.2.1 Examples of some image processing systems:
Two keyword-based examples are the "Yahoo Image Surfer Category List (YISCL)"
and "Lycos". The YISCL system also provides a visual search function based
on color distribution matching. UCB's "query all images" presents several
interesting ideas such as "blobworld" and "body plan". A blob is
a region. While blobworld does not operate completely in the "thing"
domain, it recognizes the nature of images as combinations of objects, and
querying and learning in blobworld are more meaningful than they are with
simple "stuff" representations. The Expectation-Maximization (EM)
algorithm is used to perform automatic segmentation based on image features.
After segmentation, each region is shown as an elliptic blob. Body plan
is an algorithm for image segmentation: a body plan is a sophisticated model of
the way a horse is put together; as a result, the program is capable of
recognizing horses in different aspects.
MIT's Photobook allows users to perform texture modeling, face
recognition, shape matching, brain matching, and interactive segmentation and
annotation. WebSeek allows users to draw a query that depicts the spatial
relations between objects.
5.3 Object Recognition and Motion Tracking
Object recognition and motion tracking are important topics. In video, the
reliability of object recognition is higher than in still images because
there is more information available. The most valuable information is the motion
vectors: the motion vectors of a moving object have intrinsic patterns that
conform to a motion model. Several papers discuss object
recognition using affine motion models.
6. Basic Requirements:
6.1 Video Data Modeling
In a conventional database management system (DBMS), access to data is
based on distinct attributes of well-defined data developed for a specific
application. For unstructured data such as audio, video, or graphics, similar
attributes can be defined. A means for extracting information contained in the
unstructured data is required. Next, this information must be appropriately
modeled in order to support both user queries for content and data models for
storage.
Fig 1: First Stage in Video Data Adaptation: Data Modeling
From a structural perspective,
a motion picture can be modeled as data consisting of a finite-length sequence of
synchronized audio and still images. This model is a simple instance of the
more general models for heterogeneous multimedia data objects. Davenport et al
describe the fundamental film component as the shot: a contiguously
recorded audio/image sequence. To this basic component, attributes such as
content, perspective, and context can be assigned, and later used to formulate
specific queries on a collection of shots. Such a model is appropriate for
providing multiple views on the final data schema and has been suggested by
Lippman and Bender. Smith and Davenport use a technique called stratification
for aggregating collections of shots by contextual descriptions called strata.
These strata provide access to frames over a temporal span rather than to
individual frames or shot endpoints. This technique can then be used primarily
for editing and creating movies from source shots. It also provides quick
query access and a view of desired blocks of video. Because of the linearity of
the medium, we cannot get a coherent description of an item, but as a result of
the stratification method the related information is lumped together. The
linear integrity of the raw footage is erased, resulting in contextual
information which relates the shot to its environment. Rowe et al. have
developed a video-on-demand system for video data browsing. In this system the
data are modeled based on a survey of what users would query for. Three types
of indices were identified to satisfy the user queries. The first is a textual
bibliographic index which includes information about the video and the
individuals involved in the making of the video. The second is a textual
structural index of the movie hierarchy, i.e., segments, scenes, and shots.
The third is a content index which includes keyword indices for the audio
track, object indices for significant objects, and key images in the video which
represent important events.
The
above model does not utilize the semantics associated with video data.
Different video data types have different semantics associated with them. We
must take advantage of this fact and model video data based on the semantics
associated with each data type.
6.2 Video indexing
Video
annotation or indexing is the process of attaching content based labels to
video. Video indexing is the process of extracting from the video data
the temporal location of a feature and its value.
6.2.1 Need of video indexing
Indexing video data is essential for
providing content based access. Indexing has typically been viewed either from
a manual annotation perspective or from an image sequence processing
perspective. The indexing effort is directly proportional to the granularity of
video access. As applications demand finer grain access to video, automation of
the indexing process becomes essential. Given the current state of the art in
computer vision, pattern recognition, and image processing, reliable and
efficient automation is possible for low-level video indices, such as cuts and image motion properties.
Existing work on content based video
access and video indexing can be grouped into three main categories:
6.2.1.1 High
level indexing
The work by Davis is an excellent instance of high level
indexing. This approach uses a set of predefined index terms for annotating
video. The index terms are organized based on high-level ontological
categories like action, time, and space.
The high level indexing techniques
are primarily designed from the perspective of manual indexing or annotation.
This approach is suitable for dealing with small quantities of new video and
for accessing previously annotated databases.
6.2.1.2 Low level indexing
These techniques provide access to
video based on properties like color, texture etc. These techniques can be
classified under the label of low level indexing.
The driving force behind this group
of techniques is to extract features from the video data, organize the
features based on some distance metric and to use similarity based matching to
retrieve the video. Their primary limitation is the lack of semantics attached
to the features.
6.2.1.3 Domain specific indexing
These techniques use the
high level structure of video to constrain the low level video feature extraction
and processing. These techniques are effective in their intended domain of
application. The primary limitation of these techniques is their narrow range
of applicability.
6.3 Video data Management
Here we want to know how to extract
content from segmented video shots and then index it effectively so that
users can retrieve and browse a large amount of video
collections. Management of sequential video streams includes three steps,
i.e. parsing, content extraction & indexing and retrieval & browsing.
Video parsing is the process of detecting scene changes or the
boundaries between camera shots in a video stream. The video stream is segmented
into generic clips. These clips are the elemental index units in a video
database, just like a word in a text database. Then, each of these clips will
be represented visually by their key frames. To reduce the requirement for a
massive amount of storage, only these key frames are stored in the database, together with a set
of indices for their location. There are two types of transitions: abrupt
transitions (camera breaks) and gradual transitions, e.g., fade-in, fade-out,
dissolve, and wipe.
Indexing tags video clips when
the system inserts them into the database. The tag includes information based
on a knowledge model that guides the classification according to the semantic
primitives of the images. Indexing is thus driven by the image itself and any
semantic descriptors provided by the model. Two types of indices, text-based
and image-based, are needed. The text-based index is typed in by a human operator
based on the key frames using a content logger. The image-based index is
automatically constructed based on the image features extracted from the key
frames.
In retrieval and browsing, users
can access the database through queries based on text and/or visual examples or
browse it through interaction with displays of meaningful icons. Users can also
browse the results of a retrieval query. It is important that both retrieval
and browsing appeal to the user's visual intuition. By visual query, users want
to find video shots that look similar to a given example. In concept query,
users want to find video shots by the presence of specific objects or events.
Visual query can be realized by directly comparing low level visual features
like color, texture, shape and temporal variance of video shots or their
representative frames (i.e. key frames). On the other hand, the concept query
depends on object detection, tracking and recognition. Since fully automatic
object extraction is still impossible, some degree of user interaction is
necessary in this process. Manual indexing labor can be greatly reduced with the
help of video analysis techniques.
7. Image Processing Techniques For
Video Content Extraction
The
increase in the diversity and availability of electronic information led to
additional processing requirements, in order to retrieve relevant and useful
data: the accessibility problem. This problem is even more relevant for
audiovisual information, where huge amounts of data have to be searched,
indexed and processed. Most of the solutions for this type of problem point
towards a common need: to extract relevant information features for a given
content domain, a process which involves two difficult tasks: deciding what is
relevant and extracting it. In fact, while content extraction techniques are
reasonably developed for text, video data is still essentially opaque.
Despite video's obvious advantages as a communication medium, the lack of suitable
processing and communication platforms has delayed its introduction
in a generalized way. This situation is changing, and new video-based
applications are being developed.
7.1 Toolkit overview
videoCEL
is basically a library for video content extraction. Its components extract
relevant features of video data and can be reused by different applications.
The object model includes components for video data modelling and tools for
processing and extracting video content, but currently the video processing is
restricted to images.
At the data modelling level, the
more significant concepts are the following:
· Images, for representing the frame
data, a numerical matrix whose values can be colors, color map entries, etc.;
· ColorMaps, which map entries into a
color space, allowing an additional indexation level;
· ImageDisplayConverters and ImageIOHandlers,
which convert images to and from the specific formats of the platforms.
The object model of videoCEL is a subset of a more complete model,
which also includes concepts such as shots, shot sequences, and views. These concepts
are modelled in a distinct toolkit that provides functionalities for
indexing, browsing, and playing annotated video segments.
A shot object is a discrete sequence
of images with a set of temporal attributes such as frame rate and duration and
represents a video segment. A shot sequence object groups several shots using
some semantic criteria. Views are used
to visualize and browse shots and shot sequences.
7.2 Temporal segmentation tools
One of the most important tasks for
video analysis is to specify a unit set, in which the video temporal sequence
may be organized. The different video transitions are important for video
content identification and for the definition of the semantics of the video
language, making their detection one of the primary goals to be achieved. The
basic assumption of the transition detection procedures is that the video
segments are spatially and temporally continuous, and thus the boundary images
must suffer significant content changes, changes which depend on the
transition type and can be measured. The original problem is reduced to the
search for suitable difference quantification metrics, whose maxima identify,
with high probability, the temporal locations of the transitions.
7.3 Cut detection
The process of detecting cuts is
quite simple, mainly because the changes in content are very visible and they
always occur instantaneously between consecutive frames. The implemented
algorithm simply uses one of the quantification metrics, and a cut is declared
when the differences are above a certain threshold. Thus, its success is
greatly dependent on the metric suitability. The results obtained by applying
this procedure to some of our metrics are presented next. The thresholds
selection was made empirically, while trying to maximize the success of the
detection (minimizing simultaneously the false and missed detections). The
captured video segment belongs to an outdoors news report, so its transitions
are not very “artistic” (mainly cuts). There are several well known strategies
that usually improve this detection. For instance, the use of adaptive
thresholds increases the flexibility of the thresholding, allowing the
algorithm to adapt to diverse video content. An approach used
with some success in previous work, to reduce some of the shortcomings of each metric's specific
behavior, was simply to produce a weighted average of the differences obtained
with two or more metrics. Pre-processing images using noise filters or lower
resolution operators is also quite usual, offering a means of reducing both
image noise and processing complexity. The distinctive treatment
of image regions, in order to eliminate some of the more extreme values,
remarkably increases the detection accuracy, especially when there are only a
few objects moving in the captured scene.
7.4 Gradual transition detection
Gradual
transitions, such as fades, dissolves and wipes, cause
more gradual changes which evolve during several images. Although the
obtained differences are less distinct
from the average values, and can have similar values to the ones caused by
camera operations, there are several successful procedures, which were adapted
and are currently supported by the toolkit.
7.4.1
Twin-Comparison algorithm
This
algorithm was developed after verifying that, in spite of the fact that the
first and last transition frames are quite different, consecutive images remain
very similar. Thus, as in the cuts detection, this procedure uses one of the
difference metrics, but, instead of one, it has two thresholds: one higher for
cuts, and a lower one for gradual transitions. While this algorithm just
detects gradual transitions and distinguishes them from cuts, there are other
approaches which also classify fades, dissolves, and wipes.
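A simplified sketch of the twin-comparison idea, assuming a precomputed list of consecutive-frame differences: values above the high threshold mark cuts, values above the low threshold open a candidate gradual transition, and the candidate is accepted when its summed difference also reaches the high threshold. The threshold values and the use of a simple sum are illustrative assumptions.

```python
def twin_comparison(differences, t_low=0.1, t_high=0.5):
    """differences[i] is the metric value between frame i and frame i+1."""
    cuts, gradual = [], []
    start, accumulated = None, 0.0
    for i, d in enumerate(differences):
        if d >= t_high:                       # abrupt change: declare a cut
            cuts.append(i)
            start, accumulated = None, 0.0
        elif d >= t_low:                      # moderate change: possible gradual transition
            if start is None:
                start = i
            accumulated += d
        else:                                 # change died out: accept or discard the candidate
            if start is not None and accumulated >= t_high:
                gradual.append((start, i))
            start, accumulated = None, 0.0
    return cuts, gradual
```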
7.4.2 Edge-Comparison
algorithm
This
algorithm analyses both edge change
fractions, exiting and entering. Distinct gradual transitions generate
characteristic variations of these values. For instance, a fade in always
generates an increase in the entering edge fraction; conversely, a fade out
causes an increase in the exiting edge fraction; a dissolve has the same effect
as a fade out followed by a fade in.
7.5 Camera operation detection
As distinct
transitions give different meanings to adjacent video segments, the possible
camera operations are also relevant for content identification. For example,
that information can be used to build salient stills and select key frames or
segments for video representation. All the methods which detect and classify
camera operations start from the
following observation: each one generates global characteristic changes in the
captured objects and background. For example, when a pan happens, they move
horizontally in the opposite direction of the camera motion; the behavior of
tilts is similar but along the vertical axis; zooms generate convergent or
divergent motion.
7.5.1 X-ray
based method
This
approach basically produces fingerprints
of the global motion flow. After extracting the edges, each image is reduced to
its horizontal and vertical projections, a column and a row, that roughly
represent the horizontal and vertical global motions, which are usually
referred to as the x-ray images.
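A small sketch of how such x-ray projections could be computed with OpenCV; the Canny edge detector and its thresholds are assumptions made for illustration.

```python
import cv2

def xray_projections(frame):
    """Reduce a frame to a row and a column summarizing its edge content."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    row_xray = edges.sum(axis=0)   # projection onto the horizontal axis (one row)
    col_xray = edges.sum(axis=1)   # projection onto the vertical axis (one column)
    return row_xray, col_xray
```

Comparing how these projections shift from frame to frame gives a rough indication of pans (horizontal shift), tilts (vertical shift), and zooms (stretching of both projections).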
7.6 Lighting conditions characterization
Light effects are usually mentioned
in the cinema language grammar, as their contribution is essential to the
overall meaning of the video content. The lighting conditions can be easily extracted
by observing the distribution of the light intensity histogram: its mode, mean,
and variance are valuable in characterizing its distribution type and spread.
These features also allow the quantification of lighting variations, once
the similarity of the images is determined.
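A minimal sketch of such a characterization, assuming OpenCV and an 8-bit grayscale frame; the file name is a placeholder.

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)           # placeholder file name
hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).flatten()

mode   = int(hist.argmax())       # most frequent intensity level
mean   = float(gray.mean())       # overall brightness
spread = float(gray.std())        # how widely the intensities are distributed
print(f"mode={mode}, mean={mean:.1f}, spread={spread:.1f}")
```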
7.7 Scene segmentation
Scene segmentation refers to the
image decomposition in its main components: objects, background, captions, etc.
It is a first step towards the identification and classification of the scene's main
features and their tracking throughout the sequence. The simplest implemented
segmentation method is amplitude thresholding, which is quite successful
when the different regions have distinct amplitudes. It is a particularly useful
procedure for binarizing captions. Other methods are described below.
7.7.1
Region-based segmentation
Region-based
segmentation procedures find out various regions in an image which have similar
features. One such algorithm is the split-and-merge algorithm, which first
divides the image into atomic homogeneous regions, and then merges similar adjacent
regions until the remaining regions are sufficiently different. Two distinct metrics are needed:
one for measuring the initial regions' homogeneity (the variance, or any other
difference measure), and another for quantifying the adjacent regions'
similarity (the average, median, mode, etc.).
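The split step can be sketched as a recursive quadrant subdivision driven by the variance homogeneity measure; the merge step is omitted for brevity. The variance threshold and minimum block size are illustrative assumptions.

```python
import numpy as np

def split_regions(image, y=0, x=0, h=None, w=None, var_thresh=100.0, min_size=8):
    """Recursively split a grayscale image into homogeneous rectangular regions."""
    h = image.shape[0] if h is None else h
    w = image.shape[1] if w is None else w
    block = image[y:y + h, x:x + w]
    if block.var() <= var_thresh or h <= min_size or w <= min_size:
        return [(y, x, h, w)]                      # atomic homogeneous region
    h2, w2 = h // 2, w // 2
    regions = []
    for dy, dx, hh, ww in [(0, 0, h2, w2), (0, w2, h2, w - w2),
                           (h2, 0, h - h2, w2), (h2, w2, h - h2, w - w2)]:
        regions += split_regions(image, y + dy, x + dx, hh, ww, var_thresh, min_size)
    return regions
```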
7.7.2
Motion-based segmentation
The
main idea in motion-based segmentation techniques is to identify image regions
with similar motion behaviors. These properties are determined by analysing the
temporal evolution of the pixels. This process is carried out on a frequency
image produced over the whole image sequence. When the most constant pixels are
selected, for example, the resulting image is the background, which effectively removes
the motion. Once the background is extracted, the same principle can be used to
extract and track motion or objects.
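One way to realize the "keep the most constant pixels" idea is a per-pixel temporal median over the frame sequence, sketched below with OpenCV and NumPy; the file name and the difference threshold are illustrative assumptions.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.avi")                 # placeholder file name
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
cap.release()

stack = np.stack(frames, axis=0)
# The per-pixel temporal median keeps the values that change least: the background.
background = np.median(stack, axis=0).astype(np.uint8)
# Pixels far from the background in the last frame belong to moving objects.
moving_mask = (cv2.absdiff(stack[-1], background) > 30).astype(np.uint8) * 255
```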
7.7.3 Scene
and object detection
The
process of detecting scenes or scene regions (objects) is, in a certain way, the
opposite of transition detection: we want to find image regions whose
differences are below a certain threshold. As a consequence, this procedure also uses
difference quantification metrics. These functions can be determined for the whole
image, or a hierarchical calculation at growing resolutions can be performed to
accelerate the process. Another tested algorithm, also hierarchical, is based
on the Hausdorff distance. It retrieves all the possible transformations
(translation, rotation, etc.) between the edges of two images. Another way of
extracting objects is by representing their contours. The toolkit uses a polygonal
line approach to represent contours as a set of
connected segments. The ending of a segment is detected when the relation
between the current segment polygonal area and its length is beyond a certain
threshold.
7.7.4 Caption
extraction
Based
on an existing caption extraction method, a new and more effective procedure was
implemented. As captions are usually artificially added to images, the
first step of this procedure is extracting high-contrast regions. This task is
performed by segmenting the edge image, whose contours have been previously
dilated by a certain radius. These regions are then subjected to
caption-characteristic size constraints, based on the properties of the x-rays (projections of
edge images); just the horizontal clusters remain. The resulting
image is segmented and two different images are produced: one with a black
background for lighter text, and another with a white background for darker text.
The process is complete after binarizing both images and applying further
dimensional region constraints.
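A rough sketch of these steps with OpenCV: dilated edges provide the high-contrast regions, connected components that are wide and short are kept as caption candidates, and the candidates are binarized for lighter and darker text. The thresholds, the dilation radius, and the aspect-ratio test are illustrative assumptions, not the exact constraints used by the toolkit.

```python
import cv2
import numpy as np

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)             # placeholder file name
edges = cv2.Canny(gray, 100, 200)
dilated = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=2)

# Keep connected components whose shape looks caption-like (wide, short clusters).
num, labels, stats, _ = cv2.connectedComponentsWithStats(dilated)
caption_mask = np.zeros_like(gray)
for i in range(1, num):                       # component 0 is the background
    x, y, w, h, area = stats[i]
    if w > 3 * h and h > 5:
        caption_mask[y:y + h, x:x + w] = 255

# Binarize the candidate regions twice: once for lighter text, once for darker text.
otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
text_light = cv2.bitwise_and(otsu, caption_mask)
text_dark  = cv2.bitwise_and(cv2.bitwise_not(otsu), caption_mask)
```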
7.8 Edge detection
Two
distinct procedures for edge detection were implemented: (1) gradient-magnitude
thresholding, where the image gradient vectors are obtained using the Sobel
operator; (2) the Canny filter, often considered the optimal detector, which
analyses the representativeness of gradient-magnitude maxima, thus producing
thinner contours. As differential operators amplify high-frequency zones,
it is common practice to pre-process the images using noise filters, a
functionality also supported by the toolkit in the form of several smoothing
operators: the median filter, the average filter, and a Gaussian filter.
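Both detectors can be sketched with OpenCV as follows; the Gaussian pre-smoothing, kernel sizes, and thresholds are illustrative assumptions.

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # placeholder file name
smooth = cv2.GaussianBlur(gray, (5, 5), 0)               # noise filtering before differentiation

# (1) Gradient-magnitude thresholding with the Sobel operator.
gx = cv2.Sobel(smooth, cv2.CV_32F, 1, 0)
gy = cv2.Sobel(smooth, cv2.CV_32F, 0, 1)
magnitude = cv2.magnitude(gx, gy)
sobel_edges = (magnitude > 100).astype("uint8") * 255

# (2) The Canny detector, which keeps only gradient-magnitude maxima (thinner contours).
canny_edges = cv2.Canny(smooth, 100, 200)
```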
8.
Applications
8.1 videoCEL applications:
Video browser:
This application is used to visualize
video streams. The browser can load a stream
and split it into its shot segments using cut detection algorithms. Each
shot is then represented in the browser's main window by an icon that is a
reduced form of its first frame, and the shots can be played using several view
objects.
WeatherDigest :
The WeatherDigest application
generates HTML documents from TV weather forecasts. The temporal sequence of
maps, presented on the TV, is mapped to a sequence of images in the HTML page.
This application illustrates the importance of information models.
News analysis :
A set of applications was developed to be used by social
scientists in the content analysis of TV news. The analysis consists of filling in forms
with news item durations, subjects, etc., which these applications attempt to automate.
The system generates HTML pages with the images and CSV (Comma Separated
Values) tables suitable for use in spreadsheets such as Excel. Additionally,
these HTML pages can also be used for news browsing, and there is also a Java-based
tool for accessing this information.
8.2 COBRA Model
In order to explore video content and
provide a framework for automatic extraction of semantic content from raw video
data, we propose the COntent-Based RetrievAl (COBRA) video data model. The model is independent of
feature/semantic extractors, providing flexibility by allowing different video
processing and pattern recognition techniques to be used for that purpose. The feature
grammar is exploited to describe the low-level persistent meta-data. The
grammar also describes the dependencies between the extractors.
At the same time it is in line with
the latest development in MPEG-7, distinguishing four distinct layers within
video content: the raw data, the feature, the object, and the event layer. The
object and event layers are concept layers consisting of entities characterized
by prominent spatial and temporal dimensions respectively.
To provide automatic extraction of concepts
(objects and events) from visual features (which are extracted using
existing video/image processing techniques), the COBRA video model is extended
with object and event grammars. These grammars are aimed at formalizing the
descriptions of these high-level concepts, as well as facilitating their
extraction based on features and spatio-temporal reasoning.
This rule-based approach results in
the automatic mapping from features to high-level concepts. However, we still
have the problem of creating object and event rules manually. This might be
very difficult, especially in the case of certain object rules which require
extensive user familiarity with features and object extraction techniques.
As the model also provides a
framework for stochastic modeling of events, we have chosen to exploit the
learning capability of Hidden Markov Models (HMMs) to recognize events
in video data automatically.
Figure 2 - Video sequences
Figure 3 - Video shots
Figure 4 - Principal color detection
Figure 5 - Detected player
SELECT vi.frame_seq
FROM player pl, video vi
WHERE Event: vi.frame.e = (Appr_net_BSL:
    ((e1: Player_near_the_net, e2: Backhand_slice), (), (), (),
     (before(e2, e1, n), n < 75, e1.o1.name = e2.o1.name, pl.name = 'Sampras')))

Fig 6 - Video query
The WHERE
clause of the query shown above constrains player profiles to
only documents that contain videos with the event 'Approaching the net with
backhand slice stroke'. This new event description, defined inside the query,
demonstrates how complex events can be defined dynamically based on previously
extracted events and spatio-temporal relations. The first of the two predefined
events is the Player_near_the_net
event, which is defined using spatio-temporal rules, whereas the second one,
the Backhand_slice
event, is defined using the HMM approach. The temporal relation requires e1 to start at least 75
frames before event e2.
The event descriptions are evaluated by the query processor. It rewrites the
event from its conceptual definition into a standard object algebra extended by
the COBRA video model and spatio-temporal operations. Therefore, a user is able
to explore video content specifying very detailed complex queries that include
a combination of features, objects, and events, as well as spatio-temporal
relations among them.
Conclusion
Visual
information has always been an important source of knowledge. With the
advances in computing and communication technology, this
information, in the form of digital images and digital video, is now highly
available through the computer. To be able to cope with the explosion of
visual information, an organization of the material which allows fast search
and retrieval is required. This calls for systems which can, in some way,
provide content-based handling of visual information. In this seminar I have
tried to present the basic image processing techniques, the status of content-based
access to image and video databases, and some applications regarding video
content extraction.
An image
retrieval system is necessary for users that have a large collection of images,
such as a digital library. During the last few years, some content-based techniques
for image retrieval have become commercially available. These systems offer
retrieval by color, texture, or shape, and smart combinations of these
help users find the image they are looking for. A video retrieval system is
useful for video archiving, video editing, production, etc.
BIBLIOGRAPHY
Jian Wang, "Basis of Video and Image Processing".
Kjersti Aas and Line Eikvil, "A Survey on Content-Based Access to Image and Video Databases".
Philippe Aigrain, Hongjiang Zhang, and Dragutin Petkovic, "Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review".
E. Izquierdo, J. R. Casas, R. Leonardi, P. Migliorati, Noel E. O'Connor, I. Kompatsiaris, and M. G. Strintzis, "Advanced Content-Based Semantic Scene Analysis and Information Retrieval: The SCHEMA Project".