Pivot Vector Space Approach in Audio-Video Mixing full report
Audio-video mixing is an important aspect of cinematography. Most videos, such as movies and sitcoms, have several segments devoid of any speech. Adding carefully chosen music to such segments conveys emotions such as joy, tension, or melancholy. In a typical professional video production, skilled audio-mixing artists aesthetically add appropriate audio to the given video shot. This process is tedious, time-consuming, and expensive.
The PIVOT VECTOR SPACE APPROACH to audio mixing is a novel technique that automatically picks the best audio clip (from the available database) to mix with a given video shot. The technique uses a pivot vector space mixing framework to incorporate the artistic heuristics for mixing audio with video. It eliminates the need for professional audio-mixing artists and hence is inexpensive; it also saves time and is very convenient.

The PIVOT VECTOR SPACE APPROACH is a novel technique of audio-video mixing which automatically selects the best audio clip from the available database to be mixed with a given video shot. Until the development of this technique, audio-video mixing was a process that could be done only by professional audio-mixing artists. However, employing these artists is very expensive and is not feasible for home video mixing. Besides, the process is time-consuming and tedious.
In today's era, significant advances are happening constantly in the field of information technology. The development in IT-related fields such as multimedia is extremely vast, as is evident from the release of a variety of multimedia products such as mobile handsets, portable MP3 players, digital video camcorders, and handycams. Hence, activities such as the production of home videos have become easy thanks to such products. Such a scenario did not exist a decade ago, since no such products were available in the market; as a result, the production of home videos was reserved completely for professional video artists.
So, in today's world, a large number of home videos are being made, and the number of amateur and home video enthusiasts is very large. A home video artist can hardly match the aesthetic capabilities of a professional audio-mixing artist, yet employing a professional mixing artist to develop a home video is not feasible, as it is expensive, tedious, and time-consuming.
The PIVOT VECTOR SPACE APPROACH is a technique that all amateur and home video enthusiasts can use to create video footage with a professional look and feel. The technique saves cost and is fast. Since it is fully automatic, the user need not worry about his aesthetic capabilities. It uses a pivot vector space mixing framework to incorporate the artistic heuristics for mixing audio with video. These artistic heuristics use high-level perceptual descriptors of audio and video characteristics; low-level signal processing techniques compute these descriptors.

Movies comprise images (still or moving); graphic traces (texts and signs); recorded speech, music, and noises; and sound effects. The different roles of music in movies can be categorized as:
• Setting the scene (creating an atmosphere of time and place),
• Adding emotional meaning,
• Serving as a background filler,
• Creating continuity across shots or scenes, and
• Emphasizing climaxes (alerting the viewer to climaxes and emotional points of scenes).
The links between music and moving images are extremely important, and the juxtaposition of such elements must be carried out according to certain aesthetic rules. Zettl explicitly defined such rules in the form of a table, presenting the features of moving images that match the features of music. Zettl based these proposed mixing rules on the following aspects:
• Tonal matching (related to the emotional meaning defined by Copland),
• Structural matching (related to emotional meaning and emphasizing climaxes as defined by Copland),
• Thematic matching (related to setting the scene as defined by Copland), and
• Historical-geographical matching (related to setting the scene as defined by Copland).

In the following table, we summarize the work of Zettl by presenting aesthetic features that correspond in video and music. The table also indicates which features are extractable, because many of the video and audio features defined by Zettl are high-level perceptual features and can't be extracted by the state of the art in computational media aesthetics.

Video feature                  Extractable   Music feature             Extractable
Light type                     No/no         Rhythm                    Yes/no
Light mode                     Yes/no        Key                       No/no
Light falloff                  Yes/yes       Dynamics                  Yes/yes
Color energy                   Yes/yes       Dynamics                  Yes/yes
Color hue                      Yes/yes       Pitch                     Yes/yes
Color saturation               Yes/yes       Timbre                    No/no
Color brightness               Yes/yes       Dynamics                  Yes/yes
Space screen size              No/no         Dynamics                  Yes/yes
Space graphic weight           No/no         Chords and beat           No/no
Space general shape            No/no         Sound shape               No/no
Object in frame                No/no         Chord tension             No/no
Space texture                  Yes/no        Chords                    No/no
Space field density/frame      No/no         Harmonic density          No/no
Space field density/period     No/no         Melodic density           No/no
Space field complexity/frames  No/no         Melodic density           No/no
Space graphic vectors          No/no         Melodic line              No/no
Space index vectors            No/no         Melodic progression       No/no
Space principal vector         Yes/no        Sound vector orientation  No/no
Motion vectors                 Yes/yes       Tempo                     Yes/yes
Zooms                          Yes/no        Dynamics                  Yes/yes
Vector continuity              Yes/no        Melodic progression       No/no
Transitions                    Yes/no        Modulation change         No/no
Rhythm                         No/no         Sound rhythm              No/no
Energy vector magnitude        No/no         Dynamics                  Yes/yes
Vector field energy            Yes/no        Sound vector energy       No/no

The table shows, from the cinematic point of view, a set of attributed features (such as color and motion) required to describe videos. The computations for extracting aesthetic attributed features from low-level video features occur at the video-shot granularity. Because some attributed features are based on still images (such as high light falloff), we compute them on the key frame of a video shot. We try to optimize the trade-off in accuracy and computational efficiency among the competing extraction methods. Also, even though we assume that the videos considered come in the MPEG format (widely used by several home video camcorders), the features exist independently of a particular representation format.
The important video aesthetic features are as follows.
Light falloff refers to the brightness contrast between the light and shadow sides of an object and the rate of change from light to shadow. If the brightness contrast between the lighted side of an object and the attached shadow is high, the frame has fast falloff: the illuminated side is relatively bright and the attached shadow is quite dense and dark. If the contrast is low, the resulting falloff is considered slow. No falloff (or extremely low falloff) means that the object is lighted equally on all sides.
To compute light falloff, we need a coarse background and foreground classification and extraction of the object edges. We adapt a simplified version of the algorithm in Wang et al. that detects the focused objects in a frame using multiresolution wavelet frequency analysis and statistical methods. In a frame, the focused objects have more detail within them than the out-of-focus background. As a result, the focused object regions have a larger fraction of high-valued wavelet coefficients in the high-frequency bands of the transform. We partition a reference frame of a shot into blocks and classify each block as background or foreground. The boundary of the background-foreground blocks provides the first approximation of the object boundary.
The second step involves refining this boundary through a multiscale approach. We perform successive refinements at every scale to obtain the pixel-level boundary. After removing the small isolated regions and smoothing the edge, we calculate the contrast along the edge and linearly quantize the values. The falloff edge often has the highest contrast along the edge, so we select the average value in the highest bin as the value of light falloff in this frame.
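The following sketch illustrates this idea in a simplified, block-level form (without the multiscale refinement). It assumes a grayscale key frame as a NumPy array with values in [0,1] and uses PyWavelets for the wavelet analysis; the block size, thresholds, and bin count are illustrative choices, not the paper's parameters.

```python
# Sketch of light-falloff estimation on a key frame. Blocks with many
# large high-frequency wavelet coefficients are treated as foreground,
# after the focused-object idea of Wang et al.
import numpy as np
import pywt

def light_falloff(frame, block=16, coeff_thresh=0.1, bins=10):
    # One-level 2D DWT; the detail bands carry the high-frequency content.
    _, (cH, cV, cD) = pywt.dwt2(frame, 'db2')
    detail = np.abs(cH) + np.abs(cV) + np.abs(cD)          # half-resolution map
    detail = np.kron(detail, np.ones((2, 2)))[:frame.shape[0], :frame.shape[1]]

    h, w = frame.shape
    fg = np.zeros((h // block, w // block), dtype=bool)
    for i in range(fg.shape[0]):
        for j in range(fg.shape[1]):
            patch = detail[i*block:(i+1)*block, j*block:(j+1)*block]
            # Foreground block: large fraction of high-valued coefficients.
            fg[i, j] = (patch > coeff_thresh).mean() > 0.2

    # Contrast across the background/foreground block boundary approximates
    # the light/shadow edge contrast; quantize and average the top bin.
    contrasts = []
    for i in range(fg.shape[0]):
        for j in range(fg.shape[1]):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < fg.shape[0] and nj < fg.shape[1] and fg[i, j] != fg[ni, nj]:
                    a = frame[i*block:(i+1)*block, j*block:(j+1)*block].mean()
                    b = frame[ni*block:(ni+1)*block, nj*block:(nj+1)*block].mean()
                    contrasts.append(abs(a - b))
    if not contrasts:
        return 0.0                                          # no object edge found
    hist, edges = np.histogram(contrasts, bins=bins, range=(0, 1))
    top = np.nonzero(hist)[0].max()                         # highest occupied bin
    vals = [c for c in contrasts if edges[top] <= c <= edges[top + 1]]
    return float(np.mean(vals))                             # falloff in [0,1]
```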
The color features extracted from a video shot consist of four features: dominant hue, dominant saturation, dominant brightness, and color energy (the attributed color features listed in the table above).

The computation process is similar for the first three:
1. Compute the color histogram features on the frames (the set of intra-coded frames). Using the hue, saturation, and intensity (HSI) color space, the three histograms hist(H), hist(S), and hist(B) are based respectively on the H, S, and I components of the colors. We then obtain the dominant hue, saturation, and brightness in a shot.
2. Choose the feature values V(H), V(S), V(B) that correspond respectively to the dominant bin of each of hist(H), hist(S), and hist(B). All these values are normalized to [0,1].
A short sketch of these two steps follows.
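This is a minimal sketch, assuming the shot's frames are available as arrays with three channels in [0,1] (HSV standing in for HSI here); the bin count is an illustrative choice.

```python
# Sketch of the dominant-color features over a shot's frames.
import numpy as np

def dominant_color_features(frames, bins=16):
    # frames: list of H x W x 3 arrays, channels (hue, saturation, value) in [0,1].
    feats = {}
    for idx, name in ((0, 'V_H'), (1, 'V_S'), (2, 'V_B')):
        values = np.concatenate([f[..., idx].ravel() for f in frames])
        hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
        dom = hist.argmax()                                # dominant bin of the shot
        feats[name] = (edges[dom] + edges[dom + 1]) / 2.0  # bin center, in [0,1]
    return feats
```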
To measure a video segment's motion intensity, we use a set of automatically extractable descriptors of motion activity, which are computed from the MPEG motion vectors and can capture the intensity of a video shot's motion activity. Here we use the max2 descriptor, which discards 10 percent of the motion vectors to filter out spurious vectors or very small objects. We selected this descriptor for two reasons. First, the extraction of motion vectors from MPEG-1 and -2 compressed video streams is fast and efficient. Second, home videos normally have moderate motion intensity and are shot by amateur users who introduce high tilt up and down, so the camera motion is not stable; with the raw maximum descriptor the camera motion's influence would be high, while with the mean descriptor the value would come close to zero and fail to capture the object's movement. Interestingly, max2 is also the best-performing descriptor.
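A minimal sketch of the max2 idea, assuming the shot's motion vectors have already been parsed from the MPEG stream into an N x 2 array; the 10 percent cut follows the description above.

```python
# Sketch of the max2 motion-intensity descriptor: drop the top 10% of
# motion-vector magnitudes (spurious vectors, tiny objects), then take
# the maximum of what remains.
import numpy as np

def max2_motion_intensity(motion_vectors):
    # motion_vectors: N x 2 array of (dx, dy) components.
    mags = np.hypot(motion_vectors[:, 0], motion_vectors[:, 1])
    if len(mags) == 0:
        return 0.0
    mags.sort()
    keep = mags[:max(1, int(len(mags) * 0.9))]   # discard the largest 10%
    return float(keep.max())
```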

The descriptions discussed previously focus on feature extraction, not on the attributed feature definitions. However, we can determine such attributed features. We collected a set of 30 video shots from two different sources, movies and home videos, and used this data set as the training set. A professional video expert manually annotated each shot from this training set, ascribing the label high, medium, or low for each of the aesthetic features from the earlier table. Next, we obtained the mean and standard deviation of the assumed Gaussian probability distribution for the feature value of each label. We subsequently used these values, listed in the next table, for estimating the confidence level of the attributed feature for any test shot.
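A minimal sketch of this training step, assuming per-feature value lists with expert labels; the label names and the use of the Gaussian density as the confidence level are illustrative assumptions.

```python
# Sketch of fitting per-label Gaussian models from the annotated training
# shots, and of the confidence level for a new feature value.
import numpy as np
from scipy.stats import norm

def fit_label_models(values, labels):
    # values: feature values of training shots; labels: 'low'/'medium'/'high'.
    models = {}
    for lab in ('low', 'medium', 'high'):
        v = np.array([x for x, l in zip(values, labels) if l == lab])
        models[lab] = (v.mean(), v.std(ddof=1))   # (mean, standard deviation)
    return models

def confidence(models, x):
    # Gaussian density of each label at x, used as the confidence level.
    return {lab: norm.pdf(x, mu, sd) for lab, (mu, sd) in models.items()}
```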

Music perception is an extremely complex psycho-acoustical phenomenon that is not well understood. So instead of directly extracting the music's perceptual features, we can use the low-level signal features of audio clips, which provide clues on how to estimate the numerous perceptual features.

We describe here the basic features that are extracted from an audio excerpt.
Spectral centroid
The spectral centroid is commonly associated with the measure of a sound's brightness. We obtain this measure by evaluating the center of gravity using the frequency and magnitude information of the Fourier transform. The individual centroid C(n) of a spectral frame is the average frequency weighted by the amplitudes, divided by the sum of the amplitudes: C(n) = Σ_k f(k) |X_n(k)| / Σ_k |X_n(k)|, where X_n(k) is the magnitude spectrum of frame n.
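A minimal sketch of the per-frame computation, assuming a mono frame as a NumPy array and a hypothetical sample rate of 22,050 Hz.

```python
# Sketch of the per-frame spectral centroid: the amplitude-weighted
# average frequency of the magnitude spectrum.
import numpy as np

def spectral_centroid(frame, sample_rate=22050):
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    if mag.sum() == 0:
        return 0.0
    return float((freqs * mag).sum() / mag.sum())   # C(n), in Hz
```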
Zero-crossing rate
In the context of discrete-time signals, a zero crossing is said to occur if two successive samples have opposite signs. The rate at which zero crossings occur is a simple measure of the frequency content of a signal. This is particularly true of narrowband signals. Because audio signals might include both narrowband and broadband components, the interpretation of the average zero-crossing rate is less precise. However, we can still obtain rough estimates of the spectral properties using the short-time average zero-crossing rate.
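A minimal sketch of the short-time zero-crossing rate on one frame:

```python
# Sketch of the short-time average zero-crossing rate.
import numpy as np

def zero_crossing_rate(frame):
    signs = np.sign(frame)
    signs[signs == 0] = 1                               # treat zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))      # crossings per sample
```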
Volume features
The volume distribution of an audio clip reveals the temporal variation of the signal's magnitude. It approximates a subjective measure, which depends on the human listener's frequency response. Normally, volume is approximated by the root mean square (RMS) value of the signal magnitude within each frame. To measure the temporal variation of the audio clip's volume, we define two measures based on the volume distribution. The first is the volume standard deviation over a clip, normalized by the maximum volume in the clip. The second is the volume dynamic range, given by VDR = (max(v) - min(v)) / max(v), where v denotes the frame volumes within the clip.
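A minimal sketch of the two clip-level volume measures, assuming the clip has already been split into frames:

```python
# Sketch of the clip-level volume features. v(n) is the RMS volume of
# frame n; VSTD and VDR follow the definitions in the text.
import numpy as np

def volume_features(frames):
    v = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])  # RMS per frame
    vstd = v.std() / v.max()                 # std normalized by maximum volume
    vdr = (v.max() - v.min()) / v.max()      # volume dynamic range
    return vstd, vdr
```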
We can relate the low-level audio features described previously to Table 1's perceptual labels required for our matching framework.
Dynamics refers to the volume of musical sound related to the music's loudness or softness, which is always a relative indication dependent on the context. Using only the audio signal's volume features is not sufficient to capture a music clip's dynamics, because an audio signal could have high volume but low dynamics. Thus we incorporate the spectral centroid, zero crossings, and volume of each frame to evaluate the audio signal's dynamics. We use a preset threshold for each feature to decide whether the audio clip's dynamics is high, medium, or low.

Tempo features
One of the most important features that makes the flow of music unique and differentiates it from other types of audio signals is its temporal organization (beat rate). Humans perceive musical temporal flow as rhythm. One aspect of rhythm is the beat rate, which refers to a perceived pulse marking off equal duration units. This pulse is felt more strongly in some music pieces than in others, but it is almost always present. When we listen to music, we feel the regular repetition of these beats and try to synchronize to what we hear by tapping our feet or hands. Certain kinds of instruments, such as bass drums and bass guitars, anchor this rhythmic flow.
Extracting rhythmic information from raw sound samples is difficult, because there is no ground truth for the rhythm in the simple measurement of an acoustic signal; the only basis is what human listeners perceive as the rhythmical aspects of the musical content of that signal. Several studies have focused on extracting rhythmic information from digital music representations such as the musical instrument digital interface (MIDI), or with reference to a music score. Neither of these approaches is suited for analyzing raw audio data. For our analysis, we adopted the algorithm proposed by Tzanetakis. This technique decomposes the audio input signal into five bands (11 to 5.5, 5.5 to 2.25, 2.25 to 1.25, 1.25 to 0.562, and 0.562 to 0.281 kHz) using the discrete wavelet transform (DWT), with each band representing a one-octave range. Following this decomposition, the time-domain envelope of each band is extracted separately by applying full-wave rectification, low-pass filtering, and downsampling to each band. The envelopes of the bands are then summed together and an autocorrelation function is computed. The peaks of the autocorrelation function correspond to the various periodicities of the signal envelope. The output of this algorithm lets us extract several interesting features from a music sample. We use the DWT together with the envelope extraction technique and autocorrelation to construct a beat histogram. The set of features based on the beat histogram (which represents the tempo of musical clips) includes:
• relative amplitudes of the first and second peaks and their corresponding periods,
• the ratio of the amplitude of the second peak to the amplitude of the first peak, and
• the overall sum of the histogram (providing an indication of overall beat strength).
We can use the amplitude and periodicity of the most prominent peaks as a music tempo feature. The periodicity of the highest peak, representing the number of beats per minute, is a measure of the audio clip's tempo. We normalize the tempo to the range [0,1]. From the manual preclassification of all audio clips in the database and extensive experiments, we arrived at a set of empirical thresholds to classify whether an audio clip has a high, medium, or low tempo.
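The following sketch illustrates the pipeline described above (DWT band decomposition, envelope extraction, autocorrelation, peak picking), reduced to extracting a single tempo estimate; the envelope rate, filter, wavelet, and tempo range are illustrative choices, not the paper's parameters.

```python
# Sketch of beat-histogram-style tempo extraction: decompose into octave
# bands with the DWT, extract each band's envelope, sum the envelopes at a
# common rate, autocorrelate, and pick the most prominent periodicity.
import numpy as np
import pywt
from scipy.signal import lfilter, find_peaks

def beat_tempo(signal, sr=22050, levels=5, env_sr=250):
    coeffs = pywt.wavedec(signal, 'db4', level=levels)   # octave decomposition
    duration = len(signal) / sr
    n_env = int(duration * env_sr)
    t = np.linspace(0.0, duration, n_env)
    summed = np.zeros(n_env)
    for band in coeffs[1:]:                              # detail bands
        env = np.abs(band)                               # full-wave rectification
        env = lfilter([0.01], [1, -0.99], env)           # one-pole low-pass filter
        tb = np.linspace(0.0, duration, len(env))
        summed += np.interp(t, tb, env) - env.mean()     # resample to common rate
    ac = np.correlate(summed, summed, mode='full')[n_env - 1:]  # autocorrelation
    # Look for periodicities in a plausible tempo range (40-200 BPM).
    lo, hi = int(env_sr * 60 / 200), int(env_sr * 60 / 40)
    peaks, _ = find_peaks(ac[lo:hi])
    if len(peaks) == 0:
        return 0.0
    best = peaks[np.argmax(ac[lo:hi][peaks])] + lo       # most prominent lag
    return 60.0 * env_sr / best                          # beats per minute
```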
Perceptual pitch feature
Pitch perception plays an important role in human hearing, and the auditory system apparently assigns a pitch to anything that comes to its attention. The seemingly easy concept of pitch is in practice fairly complex, because pitch exists as an acoustic property (repetition rate), as a psychological percept (perceived pitch), and also as an abstract symbolic entity related to intervals and keys.
The existing computational multipitch algorithms are clearly inferior to the human auditory system in accuracy and flexibility. Researchers have proposed many approaches to simulate human perception; these generally follow one of two paradigms: place (frequency) theory or timing periodicity theory. In our approach, we do not look for accurate pitch measurement; instead, we only want to approximate whether the level of the polyphonic music's multipitch is high, medium, or low. This feature's measurement has a highly subjective interpretation, and there is no standard scale to define the highness of pitch. For this purpose, we follow the simplified model for multipitch analysis proposed by Tolonen and Karjalainen. In this approach, prefiltering preprocesses the audio signal to simulate the equal-loudness sensitivity of the human ear, and warping simulates the adaptation in the hair-cell models. The preprocessed audio signal is decomposed into two frequency channels. The low-frequency channel (below 1 kHz) is analyzed directly by autocorrelation, while the high-frequency channel (above 1 kHz) is first half-wave rectified and then passed through a low-pass filter. Next, we compute the autocorrelation of each channel in the frequency domain by using the discrete Fourier transform, as corr(τ) = IDFT{|DFT{s(t)}|²}.
The sum of the two channels' autocorrelation functions represents the summary autocorrelation function (SACF). The peaks of the SACF denote the potential pitch periods in the analyzed signal. However, the SACF includes redundant and spurious information, making it difficult to estimate the true pitch peaks. A peak-enhancement technique can add more selectivity by pruning the redundant and spurious peaks. At this stage, the peak locations and their intensities estimate all possible periodicities in the autocorrelation function. However, to obtain a more robust pitch estimation, we should combine evidence at all subharmonics of each pitch. We achieve this by accumulating the most prominent peaks and their corresponding periodicities in a folded histogram, over a 2,048-sample window at a 22,050 Hz sampling rate with a hop size of 512 samples. In the folded histogram, all notes are transposed into a single octave (an array of size 12) and mapped to a circle of fifths, so that adjacent bins are spaced a fifth apart rather than in semitones. Once the pitch histogram for an audio file is extracted, it is transformed into a single feature vector consisting of the following values:
• the bin number of the maximum peak of the histogram, corresponding to the main pitch class of the musical piece;
• the amplitude of the maximum peak; and
• the interval between the two highest peaks.
The amplitude of the most prominent peak and its periodicity can roughly indicate whether the pitch is high, medium, or low. To extract the pitch level, we use a set of empirical threshold values and the same procedure as for the tempo feature extraction.
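A minimal sketch of the two-channel SACF front end described above; the filter orders and the FFT-based autocorrelation are illustrative choices.

```python
# Sketch of the simplified two-channel multipitch front end (after Tolonen
# and Karjalainen): split at 1 kHz, half-wave rectify and low-pass the high
# channel, autocorrelate both channels via the DFT, and sum into the SACF.
import numpy as np
from scipy.signal import butter, sosfilt

def sacf(frame, sr=22050):
    sos_lo = butter(4, 1000, 'low', fs=sr, output='sos')
    sos_hi = butter(4, 1000, 'high', fs=sr, output='sos')
    low = sosfilt(sos_lo, frame)
    high = sosfilt(sos_hi, frame)
    high = np.maximum(high, 0.0)              # half-wave rectification
    high = sosfilt(sos_lo, high)              # low-pass the rectified channel

    def autocorr(x):
        # Autocorrelation via the DFT: corr = IDFT{|DFT{x}|^2}, zero-padded.
        spec = np.fft.rfft(x, n=2 * len(x))
        return np.fft.irfft(np.abs(spec) ** 2)[:len(x)]

    return autocorr(low) + autocorr(high)     # summary autocorrelation function

# The lag of the most prominent SACF peak gives a candidate pitch period;
# its amplitude and periodicity feed the high/medium/low thresholds.
```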
We collected a set of music clips from different music CDs, covering several music styles. Our data set contained 50 samples (each music excerpt is 30 seconds long). In our experiments, we sampled the audio signal at 22 kHz and divided it into frames containing 512 samples each. We computed the clip-level features from the frame-level features. To compute each audio clip's perceptual features (dynamics, tempo, and pitch, as mentioned previously), a music expert analyzed the audio database and defined the high, medium, and low attributes for each feature, in a similar manner as with the video shots. We eventually obtain the mean and variance of the Gaussian probability distribution to estimate the confidence level of the ternary labels for any given music clip. Table 2 lists the mean and variance parameters for each feature.

We aim to define a vector space P that serves as a pivot between the video and audio aesthetics representations. This space is independent of any medium, and its dimensions represent aesthetic characteristics. The pivot space P is a space on IR^p, defined by the set of p aesthetic features onto which the music and videos are mapped. The initial spaces V (for video) and M (for music) are respectively spaces on IR^v and IR^m, with v being the number of tuples (video feature, description) extracted for the video, and m the number of tuples (audio feature, description) extracted for the music excerpts.
The PIVOT REPRESENTATION consists of:

• the media representation in the base vector spaces V and M, and
• the pivot space mapping.
Once we define the aesthetic features, we consider how to represent video and audio clips in their aesthetic spaces V or M. In the two spaces, a dimension corresponds to an attributed feature. Instances of such attributed features for video data include brightness_high, brightness_low, and so on. One video shot is associated with one vector in the V space. Obtaining the values for each dimension resembles handling fuzzy linguistic variables, with the aesthetic feature playing the role of a linguistic variable and the attribute descriptor acting as a linguistic value, which can be represented in a diagram. In such a diagram, we represent sharp boundaries between the fuzzy membership functions in a manner that removes correlation between them; the x-axis refers to the actual computed feature value, and the y-axis simultaneously indicates the aesthetic label and the confidence value.
Using a training collection for each linguistic value, we obtain the membership function. As described previously, we assume that each attributed feature follows a Gaussian probability distribution; that is, we compute the mean and standard deviation on the samples so that the probability density function becomes available for each aesthetic feature i and attribute j in {low, medium, high}. We next translate each Gaussian representation into a fuzzy membership function, which can compute labels for the video or musical parts. Because we consider three kinds of attributes for each feature (namely high, medium, and low), one membership function represents each attribute.
The following steps define the membership function M(i,j) for each aesthetic feature i and attribute j in {low, medium, high}:

1. Threshold the Gaussian distributions to ensure that the membership function fits into the interval [0,1].
2. Force the membership function for the low attribute to remain constant, equal to the minimum of 1 and the maximum of the Gaussian function, for values smaller than the mean.
3. Force the membership function for the high attribute to remain constant, equal to the minimum of 1 and the maximum of the Gaussian function, for values greater than the mean.
4. Remove cross-correlation by defining a strict separation between the fuzzy membership functions of the different attributes of the same feature.

Formally, the membership function M(i)_low defined for the linguistic value corresponding to the low attribute is M(i)_low(x) = min(1, f(i)_low(x)) on the interval where the low attribute dominates, held constant at min(1, f(i)_low(μ(i)_low)) for x below the mean μ(i)_low. For each feature i (in the video or music space), we compute the M(i)_low, M(i)_medium, and M(i)_high fuzzy membership functions in this way. We then express the values obtained after a video shot or music excerpt analysis in the aesthetic space using the linguistic membership function values. So the dimension v (respectively m) of V (respectively M) equals three times the number of video (respectively music) aesthetic features.
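A minimal sketch of these steps, reusing the per-label Gaussian parameters fitted earlier; implementing the strict separation of step 4 by keeping only the dominant label's value is one plausible reading, not necessarily the paper's exact construction.

```python
# Sketch of building fuzzy membership functions from per-label Gaussians:
# normalize each Gaussian to peak 1, saturate the low (high) function below
# (above) its mean, clip to [0,1], and enforce strict label separation.
from scipy.stats import norm

def raw_membership(x, mu, sd, label):
    g = norm.pdf(x, mu, sd) / norm.pdf(mu, mu, sd)   # Gaussian, peak = 1
    if label == 'low' and x < mu:
        g = 1.0                                      # constant below the mean
    if label == 'high' and x > mu:
        g = 1.0                                      # constant above the mean
    return min(1.0, g)                               # threshold into [0,1]

def memberships(x, models):
    # models: {'low': (mu, sd), 'medium': (mu, sd), 'high': (mu, sd)}
    m = {lab: raw_membership(x, mu, sd, lab) for lab, (mu, sd) in models.items()}
    # Strict separation (step 4): keep only the dominant label's value.
    winner = max(m, key=m.get)
    return {lab: (v if lab == winner else 0.0) for lab, v in m.items()}
```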
We represent one video shot as a point S in the space V. Each of the coordinates S(i,j) of S is in [0,1], with values close to 1 indicating that the corresponding aesthetic attributed feature describes the shot well. Similar remarks hold for the music excerpts, represented as points E(k) in the space M.
The mapping from the IR^v (respectively IR^m) space to the IR^p space is provided by a p×v (respectively p×m) matrix T(v) (respectively T(m)) that expresses a rotation or projection. Rotation allows the mapping of features from one space to another. Several features of the IR^v space might be projected onto one feature of IR^p; it is also possible that one feature of IR^v might be projected onto several features of IR^p. The mapping is singly stochastic, implying that the sum of each column of T(v) and T(m) equals 1. This ensures that the coordinate values in the pivot space still fall in the interval [0,1] and can be considered fuzzy values. An advantage of the mapping described here is that it is incremental: if we extract new video features, the modification of the transformation matrix T(v) preserves all the existing mappings of music parts. We directly extrapolated the transformation matrices T(v) and T(m) from Table 1 and defined links between video- and music-attributed aesthetic features and pivot-attributed aesthetic features.
The fundamental role of the pivot space is to allow the comparison between video- and music-attributed aesthetic features. We compute the compatibility between a video shot described by V1 and a music excerpt described by M1 as the reciprocal of the Euclidean distance between their pivot-space images V'1 and M'1. The cosine measure (as used in vector space textual information retrieval), for instance, is not adequate, because we do not seek similar profiles in terms of vector direction but in terms of distance between vectors. The use of Euclidean distance is meaningful: when the membership values for one video shot and one music excerpt are close for the same attributed feature, the two media parts are similar on this dimension, and when the values are dissimilar, the attributed feature differs between the media. Euclidean distance also holds here because we assume independence between the dimensions.
We assign one dimension of P to each of the three attributes (high, medium, and low) related to the following:
• Dynamics (related to light falloff, color energy, and color brightness for video, and to dynamics for music). Without any additional knowledge, we simply assume that each of the attributed features of dynamics in the pivot space is based equally on the corresponding attributed video features; for instance, the high-dynamics dimension in P comprises one-third each of high light falloff, high color energy, and high color brightness.
• Motion (related to the motion vectors of video and the tempo of music).
• Pitch (related to the color hue of video and the pitch of music).
In the following, we represent a music excerpt or a video shot in the pivot space using a 9-dimensional vector corresponding respectively to the following attributed features: low_dynamics, medium_dynamics, high_dynamics, low_motion, medium_motion, high_motion, low_pitch, medium_pitch, and high_pitch. A code sketch of this mapping follows.
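A minimal sketch of the pivot mapping and the compatibility score; the matrix is only checked here for the singly stochastic property, and any concrete T(v) or T(m) entries would have to come from Table 1 as described above.

```python
# Sketch of mapping an aesthetic vector into the 9-dimensional pivot space
# and scoring video/music compatibility there.
import numpy as np

PIVOT_DIMS = ['low_dynamics', 'medium_dynamics', 'high_dynamics',
              'low_motion', 'medium_motion', 'high_motion',
              'low_pitch', 'medium_pitch', 'high_pitch']

def to_pivot(x, T):
    # T is p x n and singly stochastic: each column sums to 1, so pivot
    # coordinates stay in [0,1] and remain interpretable as fuzzy values.
    # For example, the high_dynamics row of T(v) would carry weight 1/3 on
    # each of high light falloff, high color energy, and high color brightness.
    assert np.allclose(T.sum(axis=0), 1.0)
    return T @ np.asarray(x)

def compatibility(video_pivot, music_pivot):
    # Reciprocal of the Euclidean distance between the two pivot images.
    d = np.linalg.norm(np.asarray(video_pivot) - np.asarray(music_pivot))
    return 1.0 / d if d > 0 else float('inf')
```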

From the retrieval point of view, the approach presented in the previous section provides a ranking of each music excerpt for each video shot and vice versa, by computing the Euclidean distances between their representatives in the pivot space.
However, our aim is to find the optimal set of music excerpts for a given video where the compatibility of the video and music determines optimality.
One simple solution would be to select the best music excerpt for each shot and play these music excerpts with the video. However, this approach is not sufficient, because music excerpt durations differ from shot durations. So we might use several music excerpts for one shot, or have several shots fitting the duration of one music excerpt. We chose to first find the overall best match value between the video shots and the music excerpts. If we obtain several best matches, we take the longest shot, assuming that for longer shots we will have more accurate feature extraction (longer shots are less prone to errors caused by small perturbations). Then we use media-continuity heuristics to ensure long sequences of music excerpts belonging to the same music piece.
Suppose that we obtained the best match for shot i and the music excerpt l from music piece k, namely M(k,l). Then we assign the music excerpts M(k,m) (with m < l) to the part of the video before shot i, and the music excerpts M(k,n) (with n > l) to the parts of the video after shot i; thus we achieve musical continuity. If the video is longer than the music piece, we apply the same process on the remaining parts of the video, giving priority to the remaining parts that are contiguous to the already-mixed video parts, ensuring some continuity. This description does not cover all the specific cases that could occur during the mixing (for example, several music parts might have the same matching value for one shot), but it gives a precise enough idea of the actual process.
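A minimal sketch of this matching-with-continuity process; the shot and music-piece data structures (duration fields, nested excerpt lists) are assumptions for illustration.

```python
# Sketch of the mixing step: find the overall best shot/excerpt match
# (ties broken by shot duration), then extend excerpts of the same music
# piece backward and forward over the neighboring shots for continuity.

def mix(video_shots, music_pieces, compat):
    # video_shots: list of dicts with a 'duration' field (assumed).
    # music_pieces: list of pieces, each a list of excerpts (assumed).
    # compat(shot, excerpt): compatibility score (reciprocal distance).
    best = None
    for i, shot in enumerate(video_shots):
        for k, piece in enumerate(music_pieces):
            for l, excerpt in enumerate(piece):
                key = (compat(shot, excerpt), shot['duration'])
                if best is None or key > best[0]:
                    best = (key, i, k, l)
    _, i, k, l = best
    assignment = {i: (k, l)}            # shot index -> (piece, excerpt)
    piece = music_pieces[k]
    # Excerpts m < l cover the shots before shot i, n > l the shots after.
    for step, j in enumerate(range(i - 1, -1, -1), start=1):
        if l - step < 0:
            break                       # piece exhausted: re-run on the rest
        assignment[j] = (k, l - step)
    for step, j in enumerate(range(i + 1, len(video_shots)), start=1):
        if l + step >= len(piece):
            break                       # piece exhausted: re-run on the rest
        assignment[j] = (k, l + step)
    return assignment
```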


Before the development of the PIVOT VECTOR SPACE APPROACH, the audio-video mixing process could be carried out only by professional mixing artists.
In today's era, the development in the field of multimedia technology is vast, as can be seen from the number of multimedia products released in the market. Products such as digital video camcorders and handycams have greatly helped even ordinary home users produce their own videos. However, employing professional audio-mixing artists is not feasible, since it is expensive, time-consuming, and tedious.
The pivot vector space approach enables home video users and amateur video enthusiasts to give a professional look and feel to their videos. The technique eliminates the need for professional mixing artists and hence saves cost; besides, it is not time-consuming.
Since this approach is fully automatic, automatically selecting the best audio clip (from the given database) to be mixed with the given video shot, the user need not worry about his aesthetic capabilities in selecting the audio clip.

In today's information technology era, the advances in IT fields such as multimedia and networking are very fast; newer and better technologies arise with each passing day. This is evident from the release of a number of technology-packed products such as portable MP3 players, digital cameras, digital video camcorders, handycams, and mobile handsets.
Before the advent of such technologies, activities such as the production of videos could be done only by professional video artists. In today's era, with the release of products such as handycams and digital video camcorders, the production of videos is easy for home video users and amateur video enthusiasts. As a result, a large amount of home video footage is now being produced.
The PIVOT VECTOR SPACE APPROACH is a novel technique for these users, since it is able to provide a professional look and feel to their videos. It eliminates the need for professional mixing artists and hence cuts down the cost, time, and labour involved. The demand for such a technique will only increase in the coming years, and it will have a great impact on the IT market.

The PIVOT VECTOR SPACE APPROACH is a new dimension in the field of audio-video mixing. Before the advent of this technology, audio-video mixing was a process carried out only by professional mixing artists; that process is expensive, tedious, and time-consuming. The scenario changed with the emergence of the pivot vector space approach. Since the technique is fully automatic, it enables a home video user to give a professional look and feel to his video, eliminates the need for professional mixing artists, and thereby significantly reduces the cost, time, and labour involved. In today's era, a large amount of home video footage is being produced thanks to products such as digital video camcorders and handycams; hence, this technique will be of great use to all amateur video enthusiasts and home video users.

IEEE Multimedia journal (Computational Media Aesthetics)
CHIP magazine
DIGIT magazine


I express my sincere gratitude to Dr. Agnisarman Namboodiri, Head of the Department of Information Technology and Computer Science, for his guidance and support in shaping this paper in a systematic way.
I am also greatly indebted to Mr. Saheer H. and Ms. S.S. Deepa, Department of IT, for their valuable suggestions in the preparation of the paper.
In addition, I would like to thank all staff members of the IT department and all my friends of S7 IT for their suggestions and constructive criticism.
Audio mixing is an important aspect of cinematography. Most videos, such as movies and sitcoms, have several segments devoid of any speech. Adding carefully chosen music to such segments conveys emotions such as joy, tension, or melancholy. It also acts as a mechanism to bridge scenes, and can add to the heightened sense of excitement in a car chase or reflect the somber mood of a tragic situation. In a typical professional video production, skilled audio-mixing artists aesthetically add appropriate audio to the given video shots. This process is tedious, time-consuming, and expensive.

With the rapid proliferation in the use of digital video camcorders, amateur video enthusiasts are producing a huge amount of home video footage. Many home video users would like to make their videos appear like professional productions before sharing them with family and friends. To meet this demand, companies such as Muvee Technologies produce software tools to give home videos a professional look. Our work is motivated by similar goals. The software tool available from Muvee lets a user choose a video segment, an audio clip, and a mixing style (for example, music video or slow romantic). The Muvee software automatically sets the chosen video to the given audio clip, incorporating special effects like gradual transitions, the type of which depends on the chosen style. If a user chooses an appropriate audio and style for the video, the result is indeed impressive. However, a typical home video user lacks the high skill level of a professional audio mixer needed to choose the right audio clip for a given video. It's quite possible to choose an inappropriate audio clip (say, one with a fast tempo) for a video clip (one that's slow with hardly any motion); the result in such a case would certainly be less than desirable.

Our aim is to approximately simulate the decision-making process of a professional audio mixer by employing the implicit aesthetic rules that professionals use. We have developed a novel technique that automatically picks the best audio clip (from the available database) to mix with a given video shot. Our technique uses a pivot vector space mixing framework to incorporate the artistic heuristics for mixing audio with video. These artistic heuristics use high-level perceptual descriptors of audio and video characteristics; low-level signal processing techniques compute these descriptors. Our technique's experimental results appear highly promising, despite the fact that we have currently developed computational procedures for only a subset of the entire suite of perceptual features available for mixing. Many open issues in the area of audio and video mixing exist, and some problems in computational media aesthetics need future work.