DYNAMIC VIDEO SYNOPSIS
30-10-2010, 09:37 AM
Computer Science and Engineering
College Of Engineering, Trivandrum
Video abstraction is a nascent technology concerned with generating a summary of a video. It is particularly relevant to archiving and retrieving huge volumes of video files. Various techniques are evolving to create abstracts that reflect the nature and content of the original. Video abstraction is one of the most important research topics in this area because it enables quick browsing of a large collection of video data and supports efficient content access and representation.
A new technique is dynamic video synopsis, where most of the activity in the video is condensed by simultaneously showing several actions, even when they originally occurred at different times. For example, we can create a “stroboscopic movie”, where multiple dynamic instances of a moving object are played simultaneously. This is an extension of the still stroboscopic picture. Previous approaches for video abstraction addressed mostly the temporal redundancy by selecting representative key-frames or time intervals. In dynamic video synopsis the activity is shifted into a significantly shorter period, in which the activity is much denser.
Digital video is an emerging force in today’s computer and telecommunication industries. The rapid growth of the Internet, in terms of both bandwidth and the number of users, has pushed all multimedia technology forward including video streaming. Continuous hardware developments have reached the point where personal computers are powerful enough to handle the high storage and computational demands of digital video applications. DVD, which delivers high quality digital video to consumers, is rapidly penetrating the market. Moreover, the advances in digital cameras and camcorders have made it quite easy to capture a video and then load it into a computer in digital form. Many companies, universities and even ordinary families already have large repositories of videos both in analog and digital formats, such as the broadcast news, training and education videos, advertising and commercials, monitoring, surveying and home videos. All of these trends are indicating a promising future for the world of digital video.
The fast evolution of digital video has brought many new applications. Consequently, there is a great need for research and development of new technologies that lower the costs of video archiving, cataloging and indexing, and that improve the efficiency, usability and accessibility of stored videos. Among all possible research areas, one important topic is how to enable quick browsing of a large collection of video data and how to achieve efficient content access and representation. To address these issues, video abstraction techniques have emerged and have been attracting more research interest in recent years.
Video abstraction, as the name implies, is a short summary of the content of a longer video document. Specifically, a video abstract is a sequence of still or moving images representing the content of a video in such a way that the target party is rapidly provided with concise information about the content while the essential message of the original is well preserved.
Theoretically a video abstract can be generated both manually and automatically, but due to the huge volumes of video data and limited manpower, it’s getting more and more important to develop fully automated video analysis and processing tools so as to reduce the human involvement in the video abstraction process.
There are two fundamentally different kinds of abstracts: still- and moving-image abstracts. The still-image abstract, also known as a static storyboard, is a small collection of salient images extracted or generated from the underlying video source. This type of abstract is called a video summary. The moving-image abstract, also known as moving storyboard, or multimedia summary, consists of a collection of image sequences, as well as the corresponding audio abstract extracted from the original sequence and is thus itself a video clip but of considerably shorter length. This type of abstract is called a video skimming.
There are some significant differences between video summary and video skimming. A video summary can be built much faster, since generally only visual information is utilized and no handling of audio and textual information is needed. Once composed, it is also displayed more easily since there are no timing or synchronization issues. Besides, the extracted representative frames can be laid out spatially in their temporal order so that users are able to grasp the video content more quickly. Finally, all extracted stills could be printed out very easily when needed.
There are also advantages using video skimming. Compared to a still-image abstract, it makes much more sense to use the original audio information since sometimes the audio track contains important information such as those in education and training videos. Besides, the possibly higher computational effort during the abstracting process pays off during the playback time: it’s usually more natural and more interesting for users to watch a trailer than watching a slide show, and in many cases, the motion is also information-bearing. The technical aspects and merits and demerits of video summary and video skimming are examined in this report.
Characteristics of a Video
Video is the technology of electronically capturing, recording, processing, storing, transmitting, and reconstructing a sequence of still images representing scenes in motion. All videos are characterized by the following.
Number of frames per second
Also called frame rate, the number of still pictures per unit of time of video ranges from six or eight frames per second (frame/s) for old mechanical cameras to 120 or more frames per second for new professional cameras. PAL (Europe, Asia, Australia, etc.) and SECAM (France, Russia, parts of Africa etc.) standards specify 25 frame/s, while NTSC (USA, Canada, Japan, etc.) specifies 29.97 frame/s. Film is shot at the slower frame rate of 24 frame/s, which slightly complicates the process of transferring a cinematic motion picture to video. The minimum frame rate to achieve the illusion of a moving image is about fifteen frames per second.
Video can be interlaced or progressive. Interlacing was invented as a way to achieve good visual quality within the limitations of a narrow bandwidth. The horizontal scan lines of each interlaced frame are numbered consecutively and partitioned into two fields: the odd field (upper field) consisting of the odd-numbered lines and the even field (lower field) consisting of the even-numbered lines. NTSC, PAL and SECAM are interlaced formats. Abbreviated video resolution specifications often include an “i” to indicate interlacing. For example, PAL video format is often specified as 576i50, where 576 stands for the vertical line resolution, “i” indicates interlacing, and 50 stands for 50 fields (half-frames) per second.
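The shorthand described above can be unpacked mechanically. The following is a minimal sketch, assuming the convention used in this report that the trailing number counts fields for interlaced modes and full frames for progressive modes; the function name parse_video_mode is introduced here only for illustration.

```python
import re

def parse_video_mode(spec):
    """Split a shorthand such as '576i50' into lines, scan type and frame rate."""
    match = re.fullmatch(r"(\d+)([ip])(\d+(?:\.\d+)?)", spec)
    if match is None:
        raise ValueError(f"unrecognized video mode: {spec!r}")
    lines = int(match.group(1))
    interlaced = match.group(2) == "i"
    rate = float(match.group(3))
    # Assumption: an interlaced rate counts fields, and two fields make one frame.
    frames_per_second = rate / 2 if interlaced else rate
    return {
        "lines": lines,
        "interlaced": interlaced,
        "frames_per_second": frames_per_second,
    }

print(parse_video_mode("576i50"))
print(parse_video_mode("1080p60"))
```

Other sources sometimes write the frame rate rather than the field rate after the "i", so the halving step would need to be dropped for such data.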
In progressive scan systems, each refresh period updates all of the scan lines. The result is a higher perceived resolution and a lack of various artifacts that can make parts of a stationary picture appear to be moving or flashing.
A procedure known as de-interlacing can be used for converting an interlaced stream, such as analog, DVD, or satellite, to be processed by progressive scan devices, such as TFT TV-sets, projectors, and plasma panels. De-interlacing cannot, however, produce a video quality that is equivalent to true progressive scan source material.
The size of a video image is measured in pixels for digital video, or horizontal scan lines and vertical lines of resolution for analog video. In the digital domain (e.g. DVD) standard-definition television (SDTV) is specified as 720/704/640×480i60 for NTSC and 768/720×576i50 for PAL or SECAM resolution. However, in the analog domain, the number of visible scanlines remains constant (486 NTSC/576 PAL) while the horizontal measurement varies with the quality of the signal: approximately 320 pixels per scanline for VCR quality, 400 pixels for TV broadcasts, and 720 pixels for DVD sources. Aspect ratio is preserved because of non-square "pixels".
New high-definition televisions (HDTV) are capable of resolutions up to 1920×1080p60, i.e. 1920 pixels per scan line by 1080 scan lines, progressive, at 60 frames per second. Video resolution for 3D-video is measured in voxels (volume picture element, representing a value in the three dimensional space). For example 512×512×512 voxels resolution, now used for simple 3D-video, can be displayed even on some PDAs.
Aspect ratio describes the dimensions of video screens and video picture elements. All popular video formats are rectilinear, and so can be described by a ratio between width and height. The screen aspect ratio of a traditional television screen is 4:3, or about 1.33:1. High definition televisions use an aspect ratio of 16:9, or about 1.78:1. The aspect ratio of a full 35 mm film frame with soundtrack (also known as the Academy ratio) is 1.375:1.
Pixels on computer monitors are usually square, but pixels used in digital video often have non-square aspect ratios, such as those used in the PAL and NTSC variants of the CCIR 601 digital video standard, and the corresponding anamorphic widescreen formats. Therefore, an NTSC DV image which is 720 pixels by 480 pixels is displayed with the aspect ratio of 4:3 (which is the traditional television standard) if the pixels are thin and displayed with the aspect ratio of 16:9 (which is the anamorphic widescreen format) if the pixels are fat.
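The "thin" and "fat" pixels mentioned above can be quantified with a short calculation. This is a simplified geometric sketch that assumes the full 720-pixel width is mapped onto the screen; broadcast standards define slightly different values based on the active picture width. The helper name pixel_aspect_ratio is hypothetical.

```python
from fractions import Fraction

def pixel_aspect_ratio(width_px, height_px, display_aspect):
    """Width/height ratio each pixel needs so the raster fills the display shape."""
    storage_aspect = Fraction(width_px, height_px)
    return display_aspect / storage_aspect

thin = pixel_aspect_ratio(720, 480, Fraction(4, 3))    # narrower than tall
fat = pixel_aspect_ratio(720, 480, Fraction(16, 9))    # wider than tall
print(thin, fat)  # 8/9 32/27
```

A ratio below 1 corresponds to thin pixels (4:3 display) and a ratio above 1 to fat pixels (16:9 anamorphic display).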
Color space and bits per pixel
Color model name describes the video color representation. The number of distinct colours that can be represented by a pixel depends on the number of bits per pixel (bpp). A common way to reduce the number of bits per pixel in digital video is by chroma subsampling (e.g. 4:4:4, 4:2:2, 4:2:0, 4:1:1).
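To see how much chroma subsampling saves, the average number of bits per pixel can be computed from the J:a:b notation. The sketch below assumes the usual reading of that notation (a reference block J pixels wide and 2 rows high, with a chroma samples in the first row and b additional ones in the second) and 8 bits per sample; the function name is an illustration only.

```python
def bits_per_pixel(j, a, b, bit_depth=8):
    """Average bits per pixel for J:a:b subsampling with one luma and two chroma channels."""
    luma_samples = 2 * j            # every pixel in the J x 2 block keeps its luma sample
    chroma_samples = 2 * (a + b)    # two chroma channels, a + b samples each
    return bit_depth * (luma_samples + chroma_samples) / (2 * j)

for scheme in [(4, 4, 4), (4, 2, 2), (4, 2, 0), (4, 1, 1)]:
    print(scheme, bits_per_pixel(*scheme))
```

Under these assumptions 4:4:4 needs 24 bits per pixel, 4:2:2 needs 16, and both 4:2:0 and 4:1:1 need 12.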
As mentioned in the Introduction, video summary is a set of salient images called keyframes which are selected or reconstructed from an original video sequence. Therefore, selecting salient images (key frames) from all the frames of an original video is very important to get a video summary. The different methods used for making video summaries are discussed in the following subsections.
Video Summary techniques can be broadly classified into Shot-based Video Summary techniques and Segment-based Video Summary techniques, which are the two major methods of extracting the keyframes which constitute a video summary.
Shot-based Keyframe Extraction
The general structure of a video is as shown in the diagram below.
Schematic structure of a video
Since a shot is defined as a video segment within a continuous capture period, a natural and straightforward way of keyframe extraction is to use the first frame of each shot as its keyframe. However, while sufficient for stationary shots, one keyframe per shot does not provide an acceptable representation of dynamic visual content, so multiple keyframes need to be extracted by adapting to the underlying semantic content. Because semantic understanding through computer vision remains a very difficult research challenge, most existing work interprets the content using low-level visual features such as color and motion instead. In this report, based on the features that these works employ, we categorize them into the following three classes: color-based approaches, motion-based approaches and others.
The keyframes are extracted in a sequential fashion for each shot. Particularly, the first frame within the shot is always chosen as the first keyframe, and then the color-histogram difference between the subsequent frames and the latest keyframe is computed. Once the difference exceeds a certain threshold, a new keyframe will be declared.
An image and its corresponding color histogram
Using color histograms to detect keyframes
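The sequential, threshold-based extraction described above can be sketched as follows. This is only an illustration under simplifying assumptions: each frame is a flat list of 8-bit intensity values (a single channel instead of a full color histogram), the histogram difference is the L1 distance, and the threshold value is arbitrary.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame given as a flat list of pixel values."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    return [c / len(frame) for c in counts]

def hist_diff(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def sequential_keyframes(shot, threshold=0.5):
    """Return indices of keyframes within one shot."""
    if not shot:
        return []
    keyframes = [0]                      # the first frame is always a keyframe
    reference = histogram(shot[0])
    for index in range(1, len(shot)):
        current = histogram(shot[index])
        if hist_diff(current, reference) > threshold:
            keyframes.append(index)      # declare a new keyframe
            reference = current          # compare later frames against it
    return keyframes

dark, bright = [10] * 16, [200] * 16
print(sequential_keyframes([dark, dark, dark, bright, bright, dark]))
```

Note that every later frame is compared with the most recent keyframe, not with its immediate predecessor, so slow drifts eventually trigger a new keyframe as well.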
One possible problem with the above extraction method is that the first frame may be part of a transition effect at the shot boundary, which strongly reduces its representative quality. As an alternative, keyframes can be extracted using an unsupervised clustering scheme. Basically, all video frames within a shot are first grouped into a certain number of clusters based on color-histogram similarity, where a predefined threshold controls the density of each cluster. Next, all the clusters that are big enough are considered key clusters, and the frame closest to the cluster centroid is extracted from each of them as its representative. Because the color histogram is invariant to image orientation and robust to background noise, color-based keyframe extraction algorithms have been widely used. However, most of these works are heavily threshold-dependent and cannot capture the underlying dynamics well when there is a lot of camera or object motion.
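A clustering scheme of this kind might look like the sketch below. It assumes that each frame has already been reduced to a feature vector (for example a normalized histogram), uses a simple one-pass "nearest centroid within threshold" rule as the clustering step, and treats clusters with at least min_size members as key clusters. These specific choices are assumptions made for illustration, not the method of any particular paper.

```python
def l1(u, v):
    """L1 distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cluster_keyframes(features, threshold=0.5, min_size=2):
    """Return one representative frame index per sufficiently large cluster."""
    clusters = []  # each cluster: {"members": [frame indices], "centroid": [floats]}
    for index, feature in enumerate(features):
        best, best_distance = None, None
        for cluster in clusters:
            distance = l1(feature, cluster["centroid"])
            if best_distance is None or distance < best_distance:
                best, best_distance = cluster, distance
        if best is not None and best_distance <= threshold:
            best["members"].append(index)
            n = len(best["members"])
            # Running mean keeps the centroid equal to the average of its members.
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], feature)]
        else:
            clusters.append({"members": [index], "centroid": list(feature)})
    keyframes = []
    for cluster in clusters:
        if len(cluster["members"]) >= min_size:   # only "key clusters" contribute
            representative = min(cluster["members"],
                                 key=lambda i: l1(features[i], cluster["centroid"]))
            keyframes.append(representative)
    return sorted(keyframes)

frames = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(cluster_keyframes(frames))
```

The threshold plays the role of the density control mentioned above: a smaller value produces more, tighter clusters.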
Motion-based approaches are relatively better suited for controlling the number of frames based on temporal dynamics in the scene. In general, pixel-based image differences or optical flow computation are commonly used in this approach. In one approach, the optical flow for each frame is first computed, and then a simple motion metric is computed. Finally by analyzing the metric as a function of time, the frames at the local minima of motion are selected as the keyframes.
Using optical flow motion metrics to identify keyframes
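The idea of picking keyframes at local minima of motion can be illustrated with a small sketch. Since optical flow needs an image-processing library, the example swaps in the simpler pixel-based frame difference mentioned above as the motion metric; frames are assumed to be flat lists of intensity values, and all function names are illustrative.

```python
def motion_metric(previous, current):
    """Mean absolute pixel difference between two consecutive frames."""
    return sum(abs(a - b) for a, b in zip(previous, current)) / len(current)

def local_minima(values):
    """Indices where the value drops below its left neighbor and does not exceed its right."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]

def motion_keyframes(frames):
    """Return indices of frames at which the motion metric reaches a local minimum."""
    metric = [motion_metric(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    # metric[k] measures the motion between frame k and frame k + 1.
    return [k + 1 for k in local_minima(metric)]

frames = [[0], [10], [12], [30], [60], [61], [90]]
print(motion_keyframes(frames))
```

The intuition is that a moment of relative stillness between two bursts of motion tends to show a stable, representative pose.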
A domain specific keyframe selection method is proposed where a summary is generated for video-taped presentations. Sophisticated global motion and gesture analysis algorithms are developed. There are works which employ 3 different operation levels based on the available machine resources: at the lowest level, pixel-based frame differences are computed to generate the “temporal activity curve” since it requires minimal resources; at level 2, color histogram-based frame differences are computed to extract “color activity segments”, and at level 3 sophisticated camera motion analysis is carried out to estimate the camera parameters and detect the “motion activity segments”. Keyframes are then selected from each segment and necessary elimination is applied to obtain the final result.
Some other work integrates certain mathematical methodologies into the summarization process based on low-level features. In one such approach, descriptors are first extracted from each video frame by applying a segmentation algorithm to both color and motion domains, which forms the feature vector. Then all frames’ feature vectors in the shot are gathered to form a curve in a high-dimensional feature space. Finally, keyframes are extracted by estimating appropriate curve points that characterize the feature trajectory, where the curvature is measured based on the magnitude of the second derivative of the feature vectors with respect to time. Stefanidis et al. from the University of Maine (Department of Spatial Information Engineering) present an approach to summarize video datasets by analyzing the trajectories of contained objects. Basically, critical points on the trajectory that best describe the object behavior during that segment are identified and subsequently used to extract the keyframes. The Self-Organizing Map (SOM) technique is used to identify the trajectory nodes.
Segment-based Keyframe Extraction
One major drawback of using one or more keyframes for each shot is that it does not scale up for long videos, since scrolling through hundreds of images is still time-consuming, tedious and ineffective. Therefore, more and more researchers have recently begun to work on a higher-level video unit called a video segment. A video segment could be a scene, an event, or even the entire sequence. In this context, the segment-based keyframe set is expected to be more concise than the shot-based keyframe set.
In one approach, all the video frames are first clustered into a predefined number of clusters, and then the entire video is segmented by determining to which cluster the frames of a contiguous segment belong. Next, an importance measure is computed for each segment based on its length and rarity, and all segments whose importance is lower than a certain threshold are discarded. The frame that is closest to the center of each qualified segment is then extracted as the representative keyframe, with the image size proportional to its importance index. Finally, a frame-packing algorithm is proposed to efficiently pack the extracted frames into a pictorial summary. Figure 2 shows one of their example summaries.
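The text does not give the exact importance formula, so the sketch below uses one plausible form as an assumption: a segment scores higher when it is long and when its cluster covers only a small fraction of the video (rarity), combined as length × log(1 / cluster weight). The function names and the threshold are illustrative.

```python
import math

def segment_importance(segments):
    """segments: list of (cluster_id, length_in_frames). Returns one score per segment."""
    total = sum(length for _, length in segments)
    cluster_length = {}
    for cluster_id, length in segments:
        cluster_length[cluster_id] = cluster_length.get(cluster_id, 0) + length
    scores = []
    for cluster_id, length in segments:
        weight = cluster_length[cluster_id] / total   # share of the video in this cluster
        scores.append(length * math.log(1.0 / weight))  # long and rare scores high
    return scores

def select_segments(segments, threshold):
    """Keep the indices of segments whose importance reaches the threshold."""
    scores = segment_importance(segments)
    return [i for i, score in enumerate(scores) if score >= threshold]

segments = [(0, 60), (1, 20), (0, 20)]
print(segment_importance(segments))
print(select_segments(segments, threshold=10.0))
```

With this choice, a short segment from a rare cluster can outrank a long segment from a cluster that dominates the video.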
In another approach, N frames that are mostly dissimilar from each other in terms of visual content are first selected from all video frames, and then they are classified into M clusters using a hierarchical clustering algorithm. Finally, one representative frame is chosen from each cluster, where temporal constraints are employed to help obtain a semi-uniform keyframe distribution.
In another approach, based on the detected shot structure, all shots are first classified into a group of clusters using proposed “time-constrained clustering” algorithm. Then meaningful story units or scenes are extracted, from which 3 categories of temporal events are detected including dialogue, action and others. Next, for each story unit, a representative image (R-image) is selected to represent each of its component shot clusters, and a corresponding dominance value will be computed based on either the frequency count of those shots with similar visual content or the duration of the shots in the segment. All of the extracted R-images with respect to a certain story unit are then resized and organized into a single regular-sized image according to a predefined visual layout, which is called a video poster in their work. The size of each R-image is set such that the larger the image dominance, the larger its size. As a result, the video summary consists of a series of video posters with each summarizing one story unit and each containing a layout of sub-images that abstract the underlying shot clusters. Some interesting results are reported. Two major drawbacks of this approach are:
1. Since the visual layouts are pre-designed and can’t be adjusted to accommodate for the variable complexity of different story units, the shot clusters with low priority may not be assigned any sub-rectangles in the poster layout, thus losing their respective R-frames in the final summary.
2. The number of video posters is determined by the number of detected story units, thus the inaccuracy introduced in the scene detection algorithm will certainly affect the final summarization result. Also, there is no way to obtain a scalable video summary, which may be desirable in certain cases.
There are techniques where no shot detection is needed. Instead, the entire video sequence is first uniformly segmented into L-frame long units, and then a unit change value is computed for each unit, which equals the distance between the first and last frame of the unit. Next, all the changes are sorted and classified into two clusters, the small-change cluster and the large-change cluster, based on a predefined ratio r. Then for the units within the small-change cluster, the first and last frames are extracted as the R-frames, while for those in the large-change cluster, all frames are kept as the R-frames. Finally, if the desired number of keyframes has been obtained, the algorithm stops; otherwise, the retained R-frames are regrouped as a new video, and a new round of keyframe selection is initiated. This work showed some interesting ideas, yet the uniform segmentation and subsequent two-class clustering may be too coarse. A simple color histogram-based distance between the first and last frame of a segment cannot truthfully reflect the variability of the underlying content, and if these two frames happen to have similar color composition, even if this segment is quite complex, it will still be classified into the small-change cluster. Therefore, the final summarization result may miss significant parts of the video information while at the same time retaining all the redundancies of other video parts.
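The iterative procedure above can be sketched as follows. The example assumes that each frame is represented by a feature vector, that the distance is the L1 distance, and that the ratio r is the fraction of units assigned to the small-change cluster; the guard against rounds that drop nothing is an addition made here so that the loop always terminates.

```python
def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def select_r_frames(features, unit_len=4, ratio=0.5, target=4):
    """Return the indices of the frames retained as R-frames."""
    if unit_len < 3:
        raise ValueError("unit_len must be at least 3, otherwise no frame is ever dropped")
    kept = list(range(len(features)))
    while len(kept) > target:
        # Uniform segmentation of the current frame list into units of unit_len frames.
        units = [kept[i:i + unit_len] for i in range(0, len(kept), unit_len)]
        changes = [l1_distance(features[u[0]], features[u[-1]]) for u in units]
        order = sorted(range(len(units)), key=lambda k: changes[k])
        n_small = max(1, int(ratio * len(units)))
        small = set(order[:n_small])          # units with the smallest change
        reduced = []
        for k, unit in enumerate(units):
            if k in small and len(unit) > 2:
                reduced.extend([unit[0], unit[-1]])   # keep only first and last frame
            else:
                reduced.extend(unit)                   # keep every frame
        if len(reduced) == len(kept):
            break                                      # nothing more can be removed
        kept = reduced
    return kept

features = [[0], [0], [0], [0], [0], [5], [9], [20]]
print(select_r_frames(features, unit_len=4, ratio=0.5, target=6))
```

The weakness discussed above is visible in the code: only the first and last frame of a unit enter the change value, so anything that happens in between is ignored.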
Generating a hierarchical video summarization is favourable since a multilevel video summarization will facilitate quick discovery of the video content and enable browsing interesting segments at various levels of details. Specifically, given a quota of total number of desired keyframes, each shot is first assigned a budget of allowable keyframes based on the total cumulative actions in that shot, which forms their finest-level summary. To achieve coarser-level summary, a pair-wise K-means algorithm is applied to cluster temporally adjacent keyframes based on a predetermined compaction ratio r, where the number of iterations is controlled by certain stopping criterion, for instance, the amount of decrease in distortion, or a predetermined number of iteration steps. While this algorithm does produce a hierarchical summary, the temporal-constrained K-means clustering will not be able to merge two frames when they are visually similar but temporally apart. In some cases, a standard K-means will work better when preserving original temporal order is not required.
Fuzzy schemes can also be used for summarization work. Specifically, for each video frame, a recursive shortest spanning tree (RSST) algorithm is first applied to perform the color and motion segmentation, then a fuzzy classification is carried out to cluster all extracted color and motion features to predetermined classes, which then forms a fixed-dimensional feature vector. Finally keyframes are extracted from the video sequence by minimizing a cross-correlation criterion using a genetic algorithm (GA). The major drawback of this work is the high computational complexity required for extracting the fuzzy feature vector.
The video summarization task can also be treated in a more mathematical way, where a video sequence is represented as a curve in a high-dimensional feature space. First, a 13-dimensional feature space is formed by the time coordinate and three coordinates of the largest “blobs” (image regions) using four intervals (bins) for each luminance and chrominance channel. The curve is then simplified by using the multidimensional curve splitting algorithm, which results in a linearized curve, characterized by “perceptually significant” points that are connected by straight lines. The keyframe sequence is finally obtained by collecting frames found at those significant points. With a splitting condition that checks the dimensionality of the curve segment being split, the curve can also be recursively simplified at different levels of detail, where the final level is determined by a pre-specified threshold that evaluates the distance between the curve and its linear approximation. A major potential problem of this approach is the difficulty in evaluating the applicability of obtained keyframes, since there is no comprehensive user study to prove that the frames lying at “perceptually significant” points do capture all important instances of the video.
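The recursive simplification can be illustrated with a generic curve-splitting routine. The sketch below is a Douglas-Peucker-style split in N dimensions and is offered as an assumption about how such a step might work, not as the exact algorithm of the cited work; the tolerance corresponds to the threshold on the distance between the curve and its linear approximation.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b in N dimensions."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, closest)))

def split_curve(points, tol, lo=0, hi=None):
    """Return the sorted indices of the points kept after simplification."""
    if not points:
        return []
    if hi is None:
        hi = len(points) - 1
    if hi <= lo + 1:
        return sorted({lo, hi})
    farthest, farthest_distance = lo, -1.0
    for i in range(lo + 1, hi):
        d = point_segment_distance(points[i], points[lo], points[hi])
        if d > farthest_distance:
            farthest, farthest_distance = i, d
    if farthest_distance <= tol:
        return [lo, hi]                       # the straight line is a good enough fit
    left = split_curve(points, tol, lo, farthest)
    right = split_curve(points, tol, farthest, hi)
    return left[:-1] + right                  # the split point appears in both halves

# Each point is (time, feature value); a real feature vector would have more dimensions.
curve = [(0, 0), (1, 0), (2, 0), (3, 5), (4, 0), (5, 0)]
print(split_curve(curve, tol=2.0))
```

Lowering the tolerance keeps more points, which gives the multi-level behavior described above.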
Video skimming consists of a collection of image sequences along with the related audio from an original video. It conveys more of the semantic meaning of the original video than the video summary does. We discuss video skimming in the following two subsections according to its classification: highlight and summary sequence.
A highlight has the most interesting parts of a video. It is similar to a trailer of a movie, showing the most attractive scenes without revealing the ending of a film. Thus, highlight is used in a film domain frequently. A general method to produce highlights is discussed here. The basic idea of producing a highlight is to extract the most interesting and exciting scenes that contain important people, sounds, and actions, then concatenate them. Pfeiffer et al. (1996) used visual features to produce a highlight of a feature film and stated that a good cinema trailer must have the following five features: (1) important objects/people, (2) action, (3) mood, (4) dialog, and (5) a disguised ending. These features mean that a highlight should include important objects and people appearing in an original film, many actions to attract viewers, the basic mood of a movie, and dialogs containing important information. Finally, the highlight needs to hide the ending of a movie.
In the VAbstract system a scene is considered as the basic entity for a highlight. Therefore, the scene boundary detection is performed first using existing techniques. Then, it finds the high-contrast scenes to fulfill the trailer Feature 1, the high-motion scenes to fulfill Feature 2, the scenes with basic color composition similar to the average color composition of the whole movie to fulfill Feature 3, the scenes with dialog of various speeches to fulfill Feature 4, and deletes any scene from the last part of an original video to fulfill Feature 5. Finally, all the selected scenes are concatenated together in temporal order to form a movie trailer. The figure below shows the VAbstract system algorithm.
We will now discuss the main steps in VAbstract system, which are scene boundary detection, extraction of dialog scene, extraction of high-motion scene, and extraction of average color.
1. Scene Boundary Detection: Scene change can be determined by the combination of video- and audio-cut detections. Video-cut detection finds sharp transition, namely cut between frames. The results of this video-cut detection are shots. To group the relevant shots into a scene, audio-cut detection is used. A video cut can be detected by using color histogram. If the color histogram difference between two consecutive frames exceeds a threshold, then a cut is determined.
2. Extraction of Dialog Scene: A heuristic method is used to detect dialog scenes. It is based on the finding that a dialog is characterized by the existence of two “a”s with significantly different fundamental frequencies, which indicates that those two “a”s are spoken by two different people. Therefore, the audio track is first transformed to a short-term frequency spectrum and then normalized to compare with the spectrum of a spoken “a.” Because “a” is spoken as a long sound and occurs frequently in most conversations, this heuristic method is easy to implement and effective in practice.
3. Extraction of High-Motion Scene: Motion in a scene often includes camera motion, object motion, or both. A scene with a high degree of motion will be included in the highlight.
4. Extraction of Average Color Scene: A video’s mood is embodied by the colors of each frame. The scenes in the highlight should have the color compositions similar to the entire video. Here, the color composition has physical color properties such as luminance, hue, and saturation. It computes the average color composition of the entire video and finds scenes whose color compositions are similar to the average.
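Step 4 above can be illustrated with a small sketch. It assumes that each frame has already been reduced to a (luminance, hue, saturation) triple and that each scene is a list of such triples; hue is treated as an ordinary number here, which ignores its circular nature. The function names and the parameter k are illustrative and not part of the VAbstract system.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equally long vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def scenes_near_average(scenes, k=1):
    """Return the indices of the k scenes whose mean color is closest to the video mean."""
    all_frames = [frame for scene in scenes for frame in scene]
    video_mean = mean_vector(all_frames)

    def distance(scene):
        scene_mean = mean_vector(scene)
        return sum((a - b) ** 2 for a, b in zip(scene_mean, video_mean)) ** 0.5

    ranked = sorted(range(len(scenes)), key=lambda i: distance(scenes[i]))
    return sorted(ranked[:k])

scenes = [
    [(0, 0, 0), (0, 0, 0)],        # very dark scene
    [(5, 5, 5), (5, 5, 5)],        # close to the overall average
    [(10, 10, 10), (10, 10, 10)],  # very bright scene
]
print(scenes_near_average(scenes, k=1))
```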
Video synopsis (or abstraction) is a temporally compact representation that aims to enable video browsing and retrieval. We present an approach to video synopsis which optimally reduces the spatio-temporal redundancy in video. As an example, consider the schematic video clip represented as a space-time volume in Fig. 1. The video begins with a person walking on the ground, and after a period of inactivity a bird is flying in the sky. The inactive frames are omitted in most video abstraction methods. Video synopsis is substantially more compact, by playing the person and the bird simultaneously. This makes an optimal use of image regions by shifting events from their original time interval to another time interval when no other activity takes place at this spatial location. Such manipulations relax the chronological consistency of events, an idea first presented in earlier work.
Figure 1. The input video shows a walking person, and after a period of inactivity displays a flying bird. A compact video synopsis can be produced by playing the bird and the person simultaneously.
There are two main approaches for video synopsis (or video abstraction). In one approach, a set of salient images (key frames) is selected from the original video sequence. The key frames that are selected are the ones that best represent the video. In another approach a collection of short video sequences is selected. The second approach is less compact, but gives a better impression of the scene dynamics. Those approaches (and others) are described in comprehensive surveys on video abstraction. In both approaches above, entire frames are used as the fundamental building blocks. A different methodology uses mosaic images together with some meta-data for video indexing. In this case the static synopsis image includes objects from different times.
This work assumes that every input pixel has been labeled with its level of “activity”. Evaluation of the activity level is out of the scope of our work, and can be done using one of various methods for detecting irregularities [4, 17], moving object detection, and object tracking. We have selected for our experiments a simple and commonly used activity indicator, where an input pixel I(x, y, t) is labeled as “active” if its color difference from the temporal median at location (x, y) is larger than a given threshold.
Active pixels are defined by the characteristic function
\[ \chi(p) = \begin{cases} 1 & \text{if } p \text{ is active} \\ 0 & \text{otherwise} \end{cases} \]
To clean the activity indicator from noise, a median filter is applied to χ before continuing with the synopsis process.
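The activity indicator and its cleaning step can be sketched as follows. The example makes several simplifying assumptions: the video is a nested list video[t][y][x] of grayscale values (so the "color difference" becomes an absolute intensity difference), and the median filter is a 3×3 spatial window applied to each frame, clipped at the image border. The window size and the function names are illustrative choices.

```python
from statistics import median

def activity_mask(video, threshold):
    """Label each pixel 1 if it differs from the temporal median by more than threshold."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    background = [[median(video[t][y][x] for t in range(frames))
                   for x in range(width)] for y in range(height)]
    return [[[1 if abs(video[t][y][x] - background[y][x]) > threshold else 0
              for x in range(width)] for y in range(height)] for t in range(frames)]

def median_filter(mask):
    """3x3 spatial median of a binary mask (majority vote inside the window)."""
    cleaned = []
    for frame in mask:
        height, width = len(frame), len(frame[0])
        out = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                window = [frame[j][i]
                          for j in range(max(0, y - 1), min(height, y + 2))
                          for i in range(max(0, x - 1), min(width, x + 2))]
                out[y][x] = 1 if 2 * sum(window) > len(window) else 0
        cleaned.append(out)
    return cleaned

# Five 3x3 frames: a static background, one frame fully covered by an object,
# and one isolated noisy pixel in the first frame.
video = [[[10] * 3 for _ in range(3)] for _ in range(5)]
video[2] = [[200] * 3 for _ in range(3)]
video[0][1][1] = 200
chi = median_filter(activity_mask(video, threshold=30))
print(chi[0][1][1], chi[2][1][1])
```

The isolated pixel is removed by the filter, while the frame that is covered by a large object stays active.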
Video Synopsis by Energy Minimization
Let N frames of an input video sequence be represented in a 3D space-time volume I(x, y, t), where (x, y) are the spatial coordinates of this pixel, and 1 ≤ t ≤ N is the frame number. We would like to generate a synopsis video S(x, y, t) having the following properties:
• The video synopsis S should be substantially shorter than the original video I.
• Maximum “activity” from the original video should appear in the synopsis video.
• The motion of objects in the video synopsis should be similar to their motion in the original video.
• The video synopsis should look good, and visible seams or fragmented objects should be avoided.
The synopsis video S having the above properties is generated with a mapping M, assigning to every coordinate (x, y, t) in the synopsis S the coordinates of a source pixel from I. We focus in this paper on time shift of pixels, keeping the spatial locations fixed. Thus, any synopsis pixel S(x, y, t) can come from an input pixel I(x, y, M(x, y, t)). The time shift M is obtained by solving an energy minimization problem, where the cost function is given by
\[ E(M) = E_a(M) + \alpha E_d(M), \tag{1} \]
where E_a(M) indicates the loss in activity, and E_d(M) indicates the discontinuity across seams. The loss of activity is the number of active pixels in the input video I that do not appear in the synopsis video S.
The discontinuity cost E_d is defined as the sum of color differences across seams between spatiotemporal neighbors in the synopsis video and the corresponding neighbors in the input video.
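To make the two cost terms concrete, the sketch below evaluates them for a given time-shift mapping on a tiny grayscale volume. It is an illustration under stated assumptions rather than the authors' implementation: the activity loss counts active input pixels that are never used as a source, the discontinuity cost sums squared intensity differences over the six space-time neighbors, and neighbors that fall outside either volume are simply skipped. All names are hypothetical.

```python
def synopsis_energy(video, chi, mapping, alpha=1.0):
    """video[t][y][x]: input; chi[t][y][x]: 0/1 activity; mapping[t][y][x]: source frame
    index for each synopsis pixel. Returns (activity_loss, discontinuity, total)."""
    t_in, height, width = len(video), len(video[0]), len(video[0][0])
    t_out = len(mapping)
    used = {(x, y, mapping[t][y][x])
            for t in range(t_out) for y in range(height) for x in range(width)}
    activity_loss = sum(1 for t in range(t_in) for y in range(height) for x in range(width)
                        if chi[t][y][x] and (x, y, t) not in used)

    def synopsis(x, y, t):
        return video[mapping[t][y][x]][y][x]

    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    discontinuity = 0.0
    for t in range(t_out):
        for y in range(height):
            for x in range(width):
                source_t = mapping[t][y][x]
                for dx, dy, dt in neighbors:
                    nx, ny, nt, st = x + dx, y + dy, t + dt, source_t + dt
                    inside = (0 <= nx < width and 0 <= ny < height
                              and 0 <= nt < t_out and 0 <= st < t_in)
                    if inside:
                        # Compare the synopsis neighbor with the neighbor of the source pixel.
                        discontinuity += (synopsis(nx, ny, nt) - video[st][ny][nx]) ** 2
    return activity_loss, discontinuity, activity_loss + alpha * discontinuity

# Four frames of a 1x2 image: activity at x=0 in frame 1 and at x=1 in frame 3.
video = [[[0, 0]], [[100, 0]], [[0, 0]], [[0, 100]]]
chi = [[[0, 0]], [[1, 0]], [[0, 0]], [[0, 1]]]
truncate = [[[0, 0]], [[1, 1]]]   # keep the first two frames unchanged
shifted = [[[0, 2]], [[1, 3]]]    # shift the second column two frames earlier
print(synopsis_energy(video, chi, truncate))
print(synopsis_energy(video, chi, shifted))
```

The toy example shows the trade-off that the weight α controls: plain truncation has no seams but loses one active pixel, while the per-pixel shift keeps all activity at the price of a visible seam between the two columns.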
A demonstration of the space-time operations that create a short video synopsis by minimizing the cost function (1) is shown in Fig. 2.
Figure 2. In this space-time representation of video, moving objects created the “activity strips”. The upper part represents the original video, while the lower part represents the video synopsis.
(a) The shorter video synopsis S is generated from the input video I by including most active pixels. To assure smoothness, when pixel A in S corresponds to pixel B in I, their “cross border” neighbors should be similar.
(b) Consecutive pixels in the synopsis video are restricted to come from consecutive input pixels.
The low-level approach for dynamic video synopsis as described earlier is limited to satisfying local properties such as avoiding visible seams. Higher level object-based properties can be incorporated when objects can be detected. For example, avoiding the stroboscopic effect requires the detection and tracking of each object in the volume.
This section describes an implementation of an object-based approach for dynamic video synopsis. Several object-based video summary methods exist in the literature, and they all use the detected objects for the selection of significant frames. Unlike these methods, we shift objects in time and create new synopsis frames that never appeared in the input sequence, in order to make better use of space and time.
Moving objects are detected by comparing each pixel to the temporal median and thresholding this difference. This is followed by noise cleaning using a spatial median filter, and by grouping together spatiotemporal
connected components. This process results in a set of objects, where each object b is represented by its characteristic function
χb(x, y, t) = 1 if (x, y, t) belongs to b, and 0 otherwise. (5)
From each object, segments are created by selecting subsets of frames in which the object appears. Such segments can represent different time intervals, optionally taken at different sampling rates.
The video synopsis S will be constructed from the input video I using the following steps:
1. Objects b1 . . . br are extracted from the input video I.
2. A set of non-overlapping segments B is selected from the original objects.
3. A temporal shift M is applied to each selected segment, creating a shorter video synopsis while avoiding occlusions between objects and enabling seamless stitching. This is explained in Fig. 1 and Fig. 4. An example is shown in Fig. 5.
Steps 2 and 3 above are inter-related, as we would like to select the segments and shift them in time to obtain a short and seamless video synopsis.
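The object extraction of step 1 (temporal median, thresholding, spatial median filtering, and grouping of connected components) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code. It assumes a grayscale video stored as a nested list I[t][y][x], a 3x3 window for the spatial median filter, and 6-connectivity in space-time; the function name detect_objects and the threshold parameter are hypothetical.

```python
from collections import deque
from statistics import median

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def detect_objects(I, threshold):
    """Return a list of objects, each a set of (t, y, x) voxels.

    The set is the support of the characteristic function of the object.
    """
    N, H, W = len(I), len(I[0]), len(I[0][0])

    # 1. The temporal median of every pixel serves as the background.
    background = [[median(I[t][y][x] for t in range(N)) for x in range(W)]
                  for y in range(H)]

    # 2. A pixel is active if it differs enough from the background.
    mask = [[[abs(I[t][y][x] - background[y][x]) > threshold
              for x in range(W)] for y in range(H)] for t in range(N)]

    # 3. Noise cleaning: the median of a binary 3x3 window is a majority vote
    #    (the window is truncated at the image border).
    def cleaned(t, y, x):
        votes = [mask[t][yy][xx]
                 for yy in range(max(0, y - 1), min(H, y + 2))
                 for xx in range(max(0, x - 1), min(W, x + 2))]
        return sum(votes) * 2 > len(votes)

    clean = [[[cleaned(t, y, x) for x in range(W)] for y in range(H)]
             for t in range(N)]

    # 4. Group active voxels into spatiotemporal connected components.
    seen, objects = set(), []
    for t in range(N):
        for y in range(H):
            for x in range(W):
                if not clean[t][y][x] or (t, y, x) in seen:
                    continue
                component, queue = set(), deque([(t, y, x)])
                seen.add((t, y, x))
                while queue:
                    ct, cy, cx = queue.popleft()
                    component.add((ct, cy, cx))
                    for dt, dy, dx in STEPS:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < N and 0 <= ny < H and 0 <= nx < W
                                and clean[nt][ny][nx]
                                and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                objects.append(component)
    return objects
```

Isolated noisy pixels are removed by the majority vote in the cleaning step, so they do not become objects of their own. Note that the same vote also rounds off the corners of small blobs.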
Figure 4. A few examples of schematic temporal rearrangement
of objects. Moving or active objects create the “activity strips”.
The upper parts represent the original video, and the lower parts
represent the video synopsis.
(a) Two objects recorded at different times are shifted to the same
time interval in the video synopsis.
(b) A single object moving during a long period is broken into
segments having shorter time intervals, and those are played simultaneously,
creating a dynamic stroboscopic effect.
(c) Intersection of objects does not disturb the synopsis when object
volumes are broken into segments.
Video-Synopsis with a Pre-determined Length
In this section we describe the case where a short synopsis
video of a predetermined length K is constructed from a longer video. In this scheme, each object is partitioned into overlapping and consecutive segments of length K. All the segments are time-shifted to begin at time t = 1, and we are left with deciding which segments to include in the synopsis
video. With this scheme, some objects may not appear in the synopsis video.
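The partition of an object into segments of length K, and the shift of every segment to the start of the synopsis, can be sketched as below. The sketch assumes an object is a set of (t, y, x) voxels and that frames are numbered from 0, so the text's t = 1 becomes frame 0. The stride parameter, which controls how much consecutive segments overlap, is an assumption of this example, as are the function names.

```python
def partition_into_segments(obj, K, stride=None):
    """Split an object into segments that each span at most K frames.

    stride < K yields overlapping segments, stride == K consecutive ones.
    """
    if stride is None:
        stride = max(1, K // 2)  # assumed default: half-overlapping segments
    t_first = min(t for t, _, _ in obj)
    t_last = max(t for t, _, _ in obj)
    segments = []
    start = t_first
    while True:
        voxels = {(t, y, x) for (t, y, x) in obj if start <= t < start + K}
        if voxels:
            segments.append({"start": start, "voxels": voxels})
        if start + K > t_last:  # the last frame of the object is covered
            break
        start += stride
    return segments


def shift_to_origin(segment):
    """Time-shift a segment so that it begins at synopsis frame 0."""
    start = segment["start"]
    return {(t - start, y, x) for (t, y, x) in segment["voxels"]}
```

For an object that lives for 10 frames, K = 4 and stride = 2 give four segments starting at frames 0, 2, 4 and 6. After the shift they all occupy synopsis frames 0 to 3, and what remains is the choice of which of them to show.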
We first define an occlusion cost between all pairs of segments.
Let bi and bj be two segments with appearance times
ti and tj , and let the support of each segment be represented
by its characteristic function χ (as in Eq.5).
The cost v(bi, bj) between these two segments is defined to be the
sum of color differences between the two segments, after
being shifted to time t = 1.
To avoid showing the same spatio-temporal pixel twice (which is admissible but wasteful) we set v(bi, bj) = ∞ for segments bi and bj that intersect in the original movie. In addition, if the stroboscopic effect is undesirable, it can be
avoided by setting v(bi, bj) = ∞ for all bi and bj that were
sampled from the same object.
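A possible reading of this pairwise cost is sketched below. It is an illustration under assumptions: the video is grayscale and stored as I[t][y][x], the "color difference" is taken as a squared difference summed over the pixels that both shifted segments occupy, and each segment is a dictionary with the hypothetical keys "object", "start" and "voxels" (original coordinates).

```python
import math


def occlusion_cost(I, seg_i, seg_j, forbid_stroboscopic=False):
    """Occlusion cost v(bi, bj) between two segments shifted to a common start."""
    # Segments that share a space-time pixel in the original movie
    # must not be shown together.
    if seg_i["voxels"] & seg_j["voxels"]:
        return math.inf
    # Optionally forbid several copies of the same object (stroboscopic effect).
    if forbid_stroboscopic and seg_i["object"] == seg_j["object"]:
        return math.inf

    ti, tj = seg_i["start"], seg_j["start"]
    shifted_j = {(t - tj, y, x) for (t, y, x) in seg_j["voxels"]}
    cost = 0.0
    for (t, y, x) in seg_i["voxels"]:
        rel = t - ti  # time relative to the start of the segment
        if (rel, y, x) in shifted_j:
            # Both segments cover this synopsis pixel: compare their colors.
            cost += (I[t][y][x] - I[rel + tj][y][x]) ** 2
    return cost
```

Two segments that never cover the same synopsis pixel have zero cost and can be shown together freely; an infinite cost excludes the pair from any synopsis.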
One frame from a video synopsis with the dynamic
Lossless Video Synopsis
For some applications, such as video surveillance, we may prefer a longer synopsis video, but in which all activities are guaranteed to appear. In this case, the objective is not to select a set of object segments as was done in the
previous section, but rather to find a compact temporal rearrangement
of the object segments.
Again, we use simulated annealing to minimize the energy. In this case, a state corresponds to a set of time shifts for all segments, and two states are defined as neighbors if their time shifts differ for only a single segment. Two issues should be noted in this case:
• Object segments that appear in the first or last frames should remain so in the synopsis video (otherwise they may suddenly appear or disappear). We ensure that each state satisfies this constraint by fixing the temporal shifts of all these objects accordingly.
• The temporal arrangement of the input video is commonly a local minimum of the energy function, and is therefore not a preferable choice for initializing the annealing process. We initialized our simulated annealing
with a shorter video, where all objects overlap.
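The search described above can be sketched with a generic simulated annealing loop. The energy used here is an assumption made for the example (synopsis length plus α times the number of colliding voxels), since the exact energy is not spelled out in this section. Segments are sets of (t, y, x) voxels already shifted to start at frame 0, and the names arrangement_energy and anneal, the linear cooling schedule, and the parameter defaults are all hypothetical.

```python
import math
import random


def arrangement_energy(segments, shifts, alpha):
    """Assumed energy: synopsis length + alpha * number of colliding voxels."""
    placed = [{(t + s, y, x) for (t, y, x) in seg}
              for seg, s in zip(segments, shifts)]
    length = max(t for voxels in placed for (t, _, _) in voxels) + 1
    collisions = sum(len(placed[i] & placed[j])
                     for i in range(len(placed))
                     for j in range(i + 1, len(placed)))
    return length + alpha * collisions


def anneal(segments, max_shift, fixed=None, alpha=10.0,
           steps=5000, t0=5.0, seed=0):
    """Search for time shifts; `fixed` maps segment index -> pinned shift."""
    rng = random.Random(seed)
    fixed = fixed or {}
    # Start from a short video in which all free segments overlap at frame 0.
    shifts = [fixed.get(i, 0) for i in range(len(segments))]
    free = [i for i in range(len(segments)) if i not in fixed]
    current = arrangement_energy(segments, shifts, alpha)
    best, best_e = list(shifts), current
    if not free:
        return best, best_e
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9  # linear cooling
        # A neighboring state differs in the shift of a single segment.
        i = rng.choice(free)
        old = shifts[i]
        shifts[i] = rng.randint(0, max_shift)
        candidate = arrangement_energy(segments, shifts, alpha)
        accept = (candidate <= current or
                  rng.random() < math.exp((current - candidate) / temperature))
        if accept:
            current = candidate
            if candidate < best_e:
                best, best_e = list(shifts), candidate
        else:
            shifts[i] = old  # reject the move
    return best, best_e
```

Segments that touch the first or last input frame would be passed in `fixed`, so that every visited state satisfies the first constraint in the list above.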
Panoramic Video Synopsis
When a video camera is scanning a scene, much redundancy can be eliminated by using a panoramic mosaic. Yet, existing methods construct a single panoramic image, in which the scene dynamics are lost. Limited dynamics can be represented by a stroboscopic image where moving objects are displayed at several locations along their paths.
A panoramic synopsis video can be created by simultaneously displaying actions that took place at different times in different regions of the scene. A substantial condensation may be obtained, since the duration of activity for each object is limited to the time it is being viewed by the camera.
A special case is when the camera tracks an object (such as the running lioness shown in Fig. 7). In this case, a short video synopsis can be obtained only by allowing the stroboscopic effect. Constructing the panoramic video synopsis is done in a similar manner to the regular video synopsis, with a preliminary stage of aligning all the frames to some reference frame.
Figure 7. When a camera tracks the running lioness, the synopsis
video is a panoramic mosaic of the background, and the foreground
includes several dynamic copies of the running lioness.
Video Indexing Through Video Synopsis
Video synopsis can be used for video indexing, providing the user with efficient and intuitive links for accessing actions in videos. This can be done by associating with every synopsis pixel a pointer to the appearance of the corresponding object in the original video. In video synopsis, the information of the video is projected into the “space of activities”, in which only activities matter, regardless of their temporal context (although we still preserve the spatial context). As activities are concentrated in a short period, specific activities in the video can be accessed with ease.
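The pointer from synopsis pixels back to the original video can be kept in a simple lookup table. The sketch below is an illustration only: it assumes each segment is a dictionary with the hypothetical keys "object", "start" and "voxels" (original (t, y, x) coordinates), and that shifts[i] is the synopsis frame at which segment i begins.

```python
def build_index(segments, shifts):
    """Map every synopsis voxel to (object id, original frame number)."""
    index = {}
    for segment, shift in zip(segments, shifts):
        for (t, y, x) in segment["voxels"]:
            synopsis_t = t - segment["start"] + shift
            index[(synopsis_t, y, x)] = (segment["object"], t)
    return index


def lookup(index, t, y, x):
    """Return (object id, original frame) for a synopsis pixel.

    Background pixels are not in the index and give None.
    """
    return index.get((t, y, x))
```

A viewer that shows the synopsis could call lookup on a clicked pixel and jump to the returned frame of the original video.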
Two video synopsis approaches were presented. One approach uses low-level graph optimization, where each pixel in the synopsis video is a node in the graph. This approach has the benefit of obtaining the synopsis video directly from the input video, but the complexity of the solution may be very high. An alternative approach is to first detect moving objects, and perform the optimization on the detected objects. While a preliminary step of motion segmentation is needed in the second approach, it is much faster, and object-based constraints are possible.
The activity in the resulting video synopsis is much more condensed than the activity in any ordinary video, and viewing such a synopsis may seem awkward to the inexperienced viewer. Special attention should be given to the possibility of obtaining dynamic stroboscopy. While allowing a further reduction in the length of the video synopsis, dynamic stroboscopy may need further adaptation from the user. It does take some training to realize that multiple spatial occurrences of a single object indicate a longer activity time. While we have detailed a specific implementation for dynamic video synopsis, many extensions are straightforward.
Video synopsis has been proposed as an approach for condensing the activity in a video into a very short time period. This condensed representation can enable efficient access to activities in video sequences.
1 A. Rav-Acha, Y. Pritch, and S. Peleg. Making a long video short: Dynamic video synopsis. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2006.
2 R. Hammoud and R. Mohr. A probabilistic framework of selecting effective key frames from video browsing and indexing. In Proceedings of International Workshop on Real-Time Image Sequence Analysis, pages 79–88, Oulu, Finland, Aug. 2000.
3 D. M. Russell. A design pattern-based video summarization technique: Moving from low-level signals to high-level structure. In Proc. of the 33rd Hawaii International Conference on System Sciences, vol. 1, Jan. 2000.
4 J. Assa, Y. Caspi, and D. Cohen-Or. Action synopsis: Pose selection and illustration. In SIGGRAPH, pages 667–676, 2005.
5 O. Boiman and M. Irani. Detecting irregularities in images and in video. In ICCV, pages I: 462–469, Beijing, 2005.
6 A. M. Ferman and A. M. Tekalp. Multiscale content extraction and representation for video indexing. Proc. of SPIE, 3229:23–31, 1997.
7 M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu. Efficient representations of video sequences and their applications. Signal Processing: Image Communication, 8(4):327–351, 1996.