Novel True-Motion Estimation Algorithm and Its Application to Motion-Compensated Temporal Frame Interpolation
In this paper, a new low-complexity true-motion estimation (TME) algorithm is proposed for video processing applications, such as motion-compensated temporal frame interpolation (MCTFI) or motion-compensated frame rate up-conversion (MCFRUC). Regular motion estimation, which is often used in video coding, aims to find the motion vectors (MVs) that reduce the temporal redundancy, whereas TME aims to track the projected object motion as closely as possible. TME is obtained by imposing implicit and/or explicit smoothness constraints on the block-matching algorithm. To produce better-quality interpolated frames, the dense motion field at the interpolation instant is obtained for both forward and backward MVs; bidirectional motion compensation is then applied by adaptively mixing the two. Finally, the performance of the proposed algorithm for MCTFI is demonstrated against recently proposed methods and the smoothness-constrained optical flow employed by a professional video production suite. Experimental results show that the quality of the frames interpolated with the proposed method is better than that of the compared MCFRUC techniques.
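Although the paper's TME algorithm itself is not reproduced here, the central idea, adding an explicit smoothness constraint to block matching, can be sketched generically. In the sketch below (all names and parameters are illustrative, not taken from the paper), each block's full-search SAD cost is biased toward the median of the already-estimated neighboring vectors, which favors a smooth, object-consistent motion field:

```python
import numpy as np

def block_match(prev, curr, block=8, radius=4, lam=2.0):
    """Full-search block matching with an explicit smoothness bias.
    The SAD cost is penalized by the L1 distance to a spatially
    predicted vector (median of already-estimated neighbors)."""
    H, W = curr.shape
    mvs = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            blk = curr[y:y + block, x:x + block].astype(float)
            # spatial predictor: median of left/top neighbor vectors
            preds = []
            if bx > 0:
                preds.append(mvs[by, bx - 1])
            if by > 0:
                preds.append(mvs[by - 1, bx])
            pred = np.median(preds, axis=0) if preds else np.zeros(2)
            best, best_cost = (0, 0), np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue
                    ref = prev[yy:yy + block, xx:xx + block].astype(float)
                    cost = np.abs(blk - ref).sum() \
                        + lam * np.abs(np.array([dy, dx]) - pred).sum()
                    if cost < best_cost:
                        best_cost, best = cost, (dy, dx)
            mvs[by, bx] = best
    return mvs
```

A TME scheme in the spirit of the abstract would run this in both temporal directions and combine the resulting forward and backward fields at the interpolation instant.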
The spatiotemporal spectra of a video that contains a moving object form a plane in the 3D frequency domain. This plane, described as the theoretical motion plane, reflects the velocity of the moving object, which can be calculated from its slope. However, if the resolution of the frequency analysis method is not high enough to obtain the actual spectra of the object signal, the spatiotemporal spectra disperse away from the theoretical motion plane. In this paper, we propose a high-resolution frequency analysis method, described as 3D nonharmonic analysis (NHA), which is only weakly influenced by the analysis window. In addition, we estimate the motion vectors of objects in a video using the plane-clustering method, in conjunction with the least-squares method, on the 3D NHA spatiotemporal spectra. We experimentally verify the accuracy of 3D NHA and its usefulness for sequences containing complex motions, such as cross-over motion, through comparison with the 3D fast Fourier transform. The experimental results show that increasing the frequency resolution contributes to high-accuracy estimation of the motion plane.
We introduce a new edge-directed interpolator based on locally defined, straight-line approximations of image isophotes. Spatial derivatives of image intensity are used to describe the principal behavior of pixel-intersecting isophotes in terms of their slopes. The slopes are determined by inverting a tridiagonal matrix and are forced to vary linearly from pixel to pixel within segments. Image resizing is performed by interpolating along the approximated isophotes. The proposed method can accommodate arbitrary scaling factors, provides state-of-the-art results in terms of PSNR as well as other quantitative visual quality metrics, and has the advantage of reduced computational complexity, which is directly proportional to the number of pixels.
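The tridiagonal inversion step mentioned above can be illustrated with the standard Thomas algorithm, which solves such systems in linear time; this is a generic sketch of that building block, not the paper's implementation:

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system T x = d in O(n) with the Thomas
    algorithm. a, b, c are the sub-, main- and super-diagonals of T
    (a[0] and c[-1] are unused)."""
    n = len(d)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # forward elimination
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # back substitution
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The linear cost of this solver is what makes the overall complexity of such an interpolator proportional to the number of pixels.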
We present a framework for image inpainting that utilizes the diffusion framework approach to spectral dimensionality reduction. We show that by formulating the inpainting problem in the embedding domain, the domain to be inpainted is generally smoother, particularly for textured images. Thus, textured images can be inpainted by simple exemplar-based and variational methods. We discuss the properties of the induced smoothness and relate it to the underlying assumptions used in contemporary inpainting schemes. As the diffusion embedding is nonlinear and noninvertible, we propose a novel computational approach to approximate the inverse mapping from the inpainted embedding space to the image domain. We formulate the mapping as a discrete optimization problem, solved through spectral relaxation. The effectiveness of the presented method is exemplified by inpainting real images, where it is shown to compare favorably with contemporary state-of-the-art schemes.
The restoration of images by digital inpainting is an active field of research, and such algorithms are now widely used. Conventional methods generally apply textures that are most similar to the areas around the missing region or use a large image database. However, this produces discontinuous textures and thus unsatisfactory results. Here, we propose a new technique that overcomes this limitation by using signal prediction based on the nonharmonic analysis (NHA) technique proposed by the authors. NHA can extract accurate spectra irrespective of the window function, and its frequency resolution is finer than that of the discrete Fourier transform. The proposed method sequentially generates new textures on the basis of the spectrum obtained by NHA. Missing regions are repaired from the spectrum using an improved cost function for 2D NHA. The proposed method is evaluated using the standard images Lena, Barbara, Airplane, Pepper, and Mandrill. The results show an MSE improvement of about 10–20 over the exemplar-based method, together with good subjective quality.
Linear discriminant analysis (LDA) is a well-known dimensionality reduction technique that is widely used for many purposes. However, conventional LDA is sensitive to outliers because its objective function is based on an L2-norm distance criterion. This paper proposes a simple but effective robust LDA variant based on L1-norm maximization, which learns a set of locally optimal projection vectors by maximizing the ratio of the L1-norm-based between-class dispersion to the L1-norm-based within-class dispersion. The proposed method is theoretically proved to be feasible and robust to outliers while overcoming the singularity problem of the within-class scatter matrix in conventional LDA. Experiments on artificial datasets, standard classification datasets, and three popular image databases demonstrate the efficacy of the proposed method.
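A common building block of L1-norm dispersion maximizers is a greedy sign-flipping fixed-point iteration, as in PCA-L1. Below is a minimal sketch of that single-direction step, assuming centered data; it is illustrative only and not the paper's full between/within-class ratio algorithm:

```python
import numpy as np

def l1_dispersion_direction(X, iters=100):
    """Greedy fixed-point iteration that locally maximizes the L1
    dispersion sum_i |w^T x_i| over unit vectors w.
    X: (n_samples, n_features), assumed centered."""
    w = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    for _ in range(iters):
        s = np.sign(X @ w)
        s[s == 0] = 1.0              # avoid a zero contribution on ties
        w_new = X.T @ s
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):    # sign pattern has stabilized
            break
        w = w_new
    return w
```

Each update provably does not decrease the L1 dispersion, which is why such iterations converge to a locally optimal projection; outliers enter only through their sign, not their squared magnitude, giving the robustness the abstract refers to.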
A key problem in visual tracking is how to effectively combine spatio-temporal visual information from throughout a video to accurately estimate the state of an object. We address this problem by incorporating Dempster–Shafer (DS) information fusion into the tracking approach. To implement this fusion task, the entire image sequence is partitioned into spatially and temporally adjacent subsequences. A support vector machine (SVM) classifier is trained for object/nonobject classification on each of these subsequences, the outputs of which act as separate data sources. To combine the discriminative information from these classifiers, we further present a spatio-temporal weighted DS (STWDS) scheme. In addition, temporally adjacent sources are likely to share discriminative information on object/nonobject classification. To use such information, an adaptive SVM learning scheme is designed to transfer discriminative information across sources. Finally, the corresponding DS belief function of the STWDS scheme is embedded into a Bayesian tracking model. Experimental results on challenging videos demonstrate the effectiveness and robustness of the proposed tracking approach.
Registration of two high-dimensional data sets often involves dimensionality reduction to yield a single-band image from each data set, followed by pairwise image registration. We develop a new application-specific algorithm for dimensionality reduction of high-dimensional data sets such that the weighted harmonic mean of the Cramér-Rao lower bounds for the estimation of the transformation parameters for registration is minimized. The performance of the proposed dimensionality reduction algorithm is evaluated using three remote sensing data sets. The experimental results, obtained using a mutual information-based pairwise registration technique, demonstrate that our proposed dimensionality reduction algorithm combines the original data sets into an image pair with more texture, resulting in improved image registration.
We propose to use the similarity between the sample instance and a number of exemplars as features in visual object detection. Concepts from multiple-kernel learning and multiple-instance learning are incorporated into our scheme at the feature level by properly calculating the similarity. The similarity between two instances can be measured by various metrics and by using the information from various sources, which mimics the use of multiple kernels for kernel machines. Pooling of the similarity values from multiple instances of an object part is introduced to cope with alignment inaccuracy between object instances. To deal with the high dimensionality of the multiple-kernel multiple-instance similarity feature, we propose a forward feature-selection technique and a coarse-to-fine learning scheme to find a set of good exemplars, so that we can produce an efficient classifier while maintaining good performance. Both the feature and the learning technique have interesting properties. We demonstrate the performance of our method using both synthetic data and real-world visual object detection data sets.
We present an efficient and noise-robust template matching method based on asymmetric correlation (ASC). The ASC similarity function is invariant to affine illumination changes and robust to extreme noise. It correlates the given non-normalized template with a normalized version of each image window in the frequency domain. We show that this asymmetric normalization is more robust to noise than other cross-correlation variants, such as the correlation coefficient. Direct computation of ASC is very slow, as a DFT needs to be calculated for each image window independently. To make the template matching efficient, we develop a much faster algorithm, which carries out a prediction step in linear time and then computes DFTs for only a few promising candidate windows. We extend the proposed template matching scheme to deal with partial occlusion and spatially varying light change. Experimental results demonstrate the robustness of the proposed ASC similarity measure compared to state-of-the-art template matching methods.
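The efficiency argument rests on a standard fact: correlating a template with every image window can be done with a handful of global DFTs via the convolution theorem, instead of one DFT per window. A generic sketch of that trick (plain, non-normalized correlation, not the ASC measure itself):

```python
import numpy as np

def fft_cross_correlate(image, template):
    """Correlate a template with every image window at once via the
    convolution theorem: correlation equals convolution with the
    flipped template, computed with global FFTs."""
    H, W = image.shape
    h, w = template.shape
    F = np.fft.rfft2(image)
    # zero-pad the flipped template to the image size
    T = np.fft.rfft2(template[::-1, ::-1], s=(H, W))
    full = np.fft.irfft2(F * T, s=(H, W))
    # keep the "valid" part, free of circular wrap-around:
    # scores[y, x] = sum(image[y:y+h, x:x+w] * template)
    return full[h - 1:H, w - 1:W]
```

With this, a prediction step over all windows costs a few FFTs of the full image; an expensive normalized measure then only needs to be evaluated at the few surviving candidates.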
The alternating direction method of multipliers (ADMM) has recently sparked interest as a flexible and efficient optimization tool for inverse problems, namely, image deconvolution and reconstruction under non-smooth convex regularization. ADMM achieves state-of-the-art speed by adopting a divide-and-conquer strategy, wherein a hard problem is split into simpler, efficiently solvable sub-problems (e.g., using fast Fourier or wavelet transforms, or simple proximity operators). In deconvolution, one of these sub-problems involves a matrix inversion (i.e., solving a linear system), which can be done efficiently (in the discrete Fourier domain) if the observation operator is circulant, i.e., under periodic boundary conditions. This paper extends ADMM-based image deconvolution to the more realistic scenario of unknown boundary, where the observation operator is modeled as the composition of a convolution (with arbitrary boundary conditions) with a spatial mask that keeps only pixels that do not depend on the unknown boundary. The proposed approach also handles, at no extra cost, problems that combine the recovery of missing pixels (i.e., inpainting) with deconvolution. We show that the resulting algorithms inherit the convergence guarantees of ADMM and illustrate their performance on non-periodic deblurring (with and without inpainting of interior pixels) under total-variation and frame-based regularization.
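The circulant sub-problem mentioned above, solving (H^T H + rho I) x = b, diagonalizes in the DFT domain under periodic boundary conditions. A minimal sketch of that O(N log N) inversion (illustrative of the standard sub-problem, not of the paper's masked-boundary extension):

```python
import numpy as np

def circulant_solve(psf_freq, rho, b):
    """Solve (H^T H + rho I) x = b in the DFT domain, where H is the
    circulant (periodic-boundary) convolution with frequency response
    psf_freq. In Fourier, the system is diagonal:
    (|psf_freq|^2 + rho) * X = B."""
    B = np.fft.fft2(b)
    X = B / (np.abs(psf_freq) ** 2 + rho)
    return np.real(np.fft.ifft2(X))
```

This single diagonal division replaces a dense N-by-N solve, which is why ADMM deconvolution is fast whenever the observation operator is (or can be reduced to) a circulant convolution.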
Integration of Gibbs Markov Random Field and Hopfield-Type Neural Networks for Unsupervised Change Detection in Remotely Sensed Multitemporal Images
In this paper, a spatiocontextual unsupervised change detection technique for multitemporal, multispectral remote sensing images is proposed. The technique uses a Gibbs Markov random field (GMRF) to model the spatial regularity between the neighboring pixels of the multitemporal difference image. The difference image is generated by change vector analysis applied to images acquired over the same geographical area at different times. The change detection problem is solved using the maximum a posteriori probability (MAP) estimation principle. Since the MAP estimator of the GMRF used to model the difference image is exponential in nature, a modified Hopfield-type neural network (HTNN) is exploited for estimating the MAP. In the considered Hopfield-type network, a single neuron is assigned to each pixel of the difference image and is assumed to be connected only to its neighbors. Initial values of the neurons are set by histogram thresholding. An expectation–maximization algorithm is used to estimate the GMRF model parameters. Experiments are carried out on three multispectral and multitemporal remote sensing images. Results of the proposed change detection scheme are compared with those of a manual trial-and-error technique, an automatic change detection scheme based on the GMRF model and the iterated conditional mode algorithm, a context-sensitive change detection scheme based on HTNN, and a scheme combining the GMRF model with a graph-cut algorithm. The comparison shows that the proposed method provides more accurate change detection maps than the other methods.
Manifold learning is widely used in machine learning and pattern recognition. However, manifold learning considers only the similarity of samples belonging to the same class and ignores the within-class variation of the data, which impairs the generalization and stability of the resulting algorithms. To address this, we construct an adjacency graph to model the intraclass variation that characterizes its most important properties, such as the diversity of patterns, and then incorporate this diversity into the discriminant objective function for linear dimensionality reduction. Finally, we introduce an orthogonality constraint on the basis vectors and propose an orthogonal algorithm called stable orthogonal local discriminant embedding. Experimental results on several standard image databases demonstrate the effectiveness of the proposed dimensionality reduction approach.
For images, gradient domain composition methods like Poisson blending offer practical solutions for uncertain object boundaries and differences in illumination conditions. However, adapting Poisson image blending to video presents new challenges due to the added temporal dimension. In video, the human eye is sensitive to small changes in blending boundaries across frames and slight differences in motions of the source patch and target video. We present a novel video blending approach that tackles these problems by merging the gradient of source and target videos and optimizing a consistent blending boundary based on a user-provided blending trimap for the source video. Our approach extends mean-value coordinates interpolation to support hybrid blending with a dynamic boundary while maintaining interactive performance. We also provide a user interface and source object positioning method that can efficiently deal with complex video sequences beyond the capabilities of alpha blending.
We develop new metrics for texture similarity that account for human visual perception and the stochastic nature of textures. The metrics rely entirely on local image statistics and allow substantial point-by-point deviations between textures that, according to human judgment, are essentially identical. The proposed metrics extend the ideas of structural similarity and are guided by research in texture analysis-synthesis. They are implemented using a steerable filter decomposition and incorporate a concise set of subband statistics, computed globally or in sliding windows. We conduct systematic tests to investigate metric performance in the context of "known-item search," the retrieval of textures that are "identical" to the query texture. This eliminates the need for cumbersome subjective tests, thus enabling comparisons with human performance on a large database. Our experimental results indicate that the proposed metrics outperform the peak signal-to-noise ratio (PSNR), the structural similarity metric (SSIM) and its variations, as well as state-of-the-art texture classification metrics, using standard statistical measures.
The tracking and recognition of facial activities from images or videos have attracted great attention in the computer vision field. Facial activities are characterized at three levels. First, at the bottom level, facial feature points around each facial component, e.g., the eyebrows or mouth, capture the detailed face shape information. Second, at the middle level, facial action units, defined in the facial action coding system, represent the contraction of a specific set of facial muscles, e.g., the lid tightener or brow raiser. Finally, at the top level, six prototypical facial expressions represent the global facial muscle movement and are commonly used to describe human emotional states. In contrast to mainstream approaches, which usually focus on only one or two levels of facial activities and track (or recognize) them separately, this paper introduces a unified probabilistic framework based on a dynamic Bayesian network to simultaneously and coherently represent the evolution of facial activity at the different levels, along with their interactions and observations. Advanced machine learning methods are introduced to learn the model from both training data and subjective prior knowledge. Given the model and the measurements of facial motions, all three levels of facial activities are recognized simultaneously through probabilistic inference. Extensive experiments are performed to illustrate the feasibility and effectiveness of the proposed model on all three levels of facial activities.
A Generalized Random Walk With Restart and its Application in Depth Up-Sampling and Interactive Segmentation
In this paper, the origin of random walk with restart (RWR) and its generalization are described. It is well known that the random walk (RW) and the anisotropic diffusion models share the same energy functional, i.e., the former provides a steady-state solution and the latter gives a flow solution. In contrast, the theoretical background of the RWR scheme is different from that of the diffusion-reaction equation, although the restarting term of the RWR plays a role similar to the reaction term of the diffusion-reaction equation. The behaviors of the two approaches with respect to outliers reveal that they possess different attributes in terms of data propagation. This observation leads to the derivation of a new energy functional, where both volumetric heat capacity and thermal conductivity are considered together, and provides a common framework that unifies both the RW and the RWR approaches, in addition to other regularization methods. The proposed framework allows the RWR to be generalized (GRWR) in semilocal and nonlocal forms. The experimental results demonstrate the superiority of GRWR over existing regularization approaches in terms of depth map up-sampling and interactive image segmentation.
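The RWR steady state referred to above satisfies r = (1 - c) P r + c e, where P is the column-stochastic transition matrix of the affinity graph, e is the restart (seed) distribution, and c is the restart probability. A minimal power-iteration sketch of plain RWR (not the proposed GRWR generalization):

```python
import numpy as np

def rwr(W, seed_idx, restart=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart on an affinity matrix W:
    iterate r <- (1 - c) P r + c e until convergence, where P is the
    column-stochastic transition matrix and e the seed distribution."""
    P = W / W.sum(axis=0, keepdims=True)   # column-stochastic
    n = W.shape[0]
    e = np.zeros(n)
    e[seed_idx] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        r_new = (1.0 - restart) * (P @ r) + restart * e
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```

The iteration converges geometrically (contraction factor 1 - c) to the closed-form solution r = c (I - (1 - c) P)^{-1} e; the restart term is what plays the role analogous to the reaction term of the diffusion-reaction equation discussed above.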
Variational optical flow techniques allow the estimation of flow fields from spatio-temporal derivatives. They are based on minimizing a functional that contains a data term and a regularization term. Recently, numerous approaches have been presented for improving the accuracy of the estimated flow fields. Among them, tensor voting has been shown to be particularly effective in the preservation of flow discontinuities. This paper presents an adaptation of the data term by using anisotropic stick tensor voting in order to gain robustness against noise and outliers with significantly lower computational cost than (full) tensor voting. In addition, an anisotropic complementary smoothness term depending on directional information estimated through stick tensor voting is utilized in order to preserve discontinuity capabilities of the estimated flow fields. Finally, a weighted non-local term that depends on both the estimated directional information and the occlusion state of pixels is integrated during the optimization process in order to denoise the final flow field. The proposed approach yields state-of-the-art results on the Middlebury benchmark.
This paper presents a saliency-based video object extraction (VOE) framework. The proposed framework aims to automatically extract foreground objects of interest without any user interaction or the use of any training data (i.e., not limited to any particular type of object). To separate foreground and background regions within and across video frames, the proposed method utilizes visual and motion saliency information extracted from the input video. A conditional random field is applied to effectively combine the saliency-induced features, which allows us to deal with unknown pose and scale variations of the foreground object (and its articulated parts). Based on the ability to preserve both spatial continuity and temporal consistency in the proposed VOE framework, experiments on a variety of videos verify that our method is able to produce quantitatively and qualitatively satisfactory VOE results.
We propose a compressive sensing algorithm that exploits geometric properties of images to recover images of high quality from few measurements. The image reconstruction is done by iterating the following two steps: 1) estimation of the normal vectors of the image level curves, and 2) reconstruction of an image fitting the normal vectors, the compressed sensing measurements, and the sparsity constraint. The proposed technique can naturally extend to nonlocal operators and graphs to exploit the repetitive nature of textured images and recover fine detail structures. In both cases, the problem is reduced to a series of convex minimization problems that can be efficiently solved with a combination of variable splitting and augmented Lagrangian methods, leading to fast and easy-to-code algorithms. Extensive experiments show a clear improvement over related state-of-the-art algorithms in the quality of the reconstructed images and in the robustness of the proposed method to noise, different kinds of images, and reduced measurements.
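Step 1 above, estimating the normals of image level curves, amounts to normalizing the image gradient, since the gradient is everywhere orthogonal to the level curve through that point. A minimal sketch using finite differences (illustrative; the paper's discretization may differ):

```python
import numpy as np

def level_curve_normals(img):
    """Unit normals of image level curves: n = grad(u) / |grad(u)|.
    Returns the (nx, ny) components; flat regions (zero gradient)
    are left with a zero normal."""
    gy, gx = np.gradient(img.astype(float))  # axis 0 = y, axis 1 = x
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag[mag == 0] = 1.0  # avoid division by zero in flat regions
    return gx / mag, gy / mag
```

In the reconstruction loop, the image from step 2 is differentiated this way to refresh the normal field before the next fitting step.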