1 Introduction
Performance capture methods enable the reconstruction of the motion, the dynamic surface geometry, and the appearance of real-world scenes from multiple video recordings, for example the deforming geometry of the body and apparel of an actor, or his facial expressions [5, 7, 2, 18]. Many methods that capture spacetime-coherent surfaces first reconstruct a coarse-to-medium scale 4D model of the scene, e.g. by deforming a mesh or a rigged template such that it aligns with the images [5, 18]. Finer-scale shape detail is then added in a second refinement step. In this second step, some methods align the surface to a combination of silhouette constraints and sparse image features [7]. Such approaches, however, recover only medium-scale detail and may suffer from erroneous feature correspondences between images and shape. Photoconsistency constraints can also be used to compute smaller-scale deformations via stereo-based refinement [5, 14]. However, existing approaches that follow this path often resort to discrete sampling of local displacements, since phrasing dense stereo-based refinement as a continuous optimization problem has proven more challenging [9]. Some recent methods resort to shading-based techniques, such as shape-from-shading or photometric stereo, to capture small-scale displacements [22, 21, 18]. However, these methods either require controlled and calibrated lighting, or a complex inverse estimation of lighting and appearance when they are applied under uncontrolled recording conditions.
In this paper, we contribute a new, effective solution to the refinement step using multiview photoconsistency constraints. As input, our method expects synchronized and calibrated multiview video of a scene and a reconstructed coarse mesh animation, as can be obtained with previous methods from the literature. Background subtraction or image silhouettes are not required for refinement.
Our first contribution is a new shape representation that models the mesh surface with a dense collection of 3D Gaussian functions centered at each vertex and each having an associated color. A similar decomposition into 2D Gaussian functions is applied to each input video frame.
This scene representation enables our second contribution, namely the formulation of dense photoconsistency-based surface refinement as a global optimization problem in the position of each vertex on the surface. Unlike previous performance capture methods, we are able to phrase the model-to-image photoconsistency energy that guides the deformation as a closed-form expression, and we can compute its analytic derivatives. Our problem formulation has the additional advantage that it enables implicit handling of occlusions, as well as spatial and temporal coherence constraints, while preserving a smooth consistency energy function. We can effectively optimize this function in terms of dense local surface displacements with standard gradient-based solvers. In addition, unlike many previous methods, our framework requires neither a potentially error-prone sparse set of feature correspondences nor discrete sampling and testing of surface displacements, and thus provides a new way of continuously optimizing the dense surface deformation.
We used our approach to reconstruct full-body performances of human actors wearing loose clothing and performing different motions. Initial coarse reconstructions of the scene were obtained with the approaches by Gall et al. [7] and Starck and Hilton [14]. Our results (Fig. 1 and Sect. 6) show that our approach is able to reconstruct more of the fine-scale detail that is present in the input video sequences than the baseline methods, for instance the wrinkles in a skirt. We also demonstrate these improvements quantitatively.
2 Related Work
Markerless performance capture methods are able to reconstruct the dense dynamic surface geometry of moving subjects from multiview video, for instance of people in loose clothing, possibly along with the pose parameters of an underlying kinematic skeleton [16]. Most of them use data recorded with dense multi-camera systems in controlled studio environments. Some methods employ variants of shape-from-silhouette or active or passive stereo [23, 11, 14, 20, 17], which usually results in temporally incoherent reconstructions. Spacetime coherency is easier to achieve with model-based approaches that deform a static shape template (obtained by a laser scan or image-based reconstruction) such that it matches the subject, e.g. a person [4, 5, 18, 1, 7] or a person’s apparel [2]. Some of them jointly track a skeleton and the non-rigidly deforming surface [18, 1, 6]; multi-person reconstruction has also been demonstrated [10]. Other approaches use a generally deformable template without an embedded skeleton to capture 4D models, e.g. an elastically deformable surface or volume [5, 13], or a patch-based surface representation [3]. Most of the approaches mentioned so far either reconstruct only coarse dynamic surface models that lack fine-scale detail, or perform coarse reconstruction as a first stage, with fine-scale detail added to the coarse result in a second refinement step.
Some methods use a combination of silhouette constraints and sparse feature correspondences to estimate, at best, medium-scale non-rigid 4D surface detail [7]. Other approaches use stereo-based photoconsistency constraints in addition to silhouettes to achieve denser estimates of finer-scale deformations [14, 5]. Phrasing dense stereo-based surface refinement as a continuous optimization problem, as is done in variational approaches [9], is an involved problem. Thus, stereo-based refinement in performance capture often resorts to discrete sampling of surface displacements, which is less efficient and with which globally smooth and coherent solutions are harder to achieve.
In this paper, we propose a new formulation of stereo-based surface refinement as a continuous optimization problem, which is based on a new surface representation with Gaussian functions. In addition, our refinement method also succeeds if silhouettes are not available, making the approach more generally applicable.
An alternative way to recover fine-scale deforming surface detail is to use shading-based methods, e.g. shape-from-shading or photometric stereo [21]. Many of these approaches require controlled and calibrated lighting [8, 19], which reduces their applicability. More recently, shading-based refinement of dynamic scenes captured under more general lighting was shown [22], but these approaches are computationally challenging, as they require solving an inverse rendering problem to estimate illumination, appearance and shape at the same time.
The method we propose has some similarity to the work of Sand et al. [12], who capture skin deformation as a displacement field on a template mesh; however, they require marker-based skeleton capture, and only fit the surface to match the silhouettes in multiview video. Our problem formulation is inspired by the work of Stoll et al. [15], who used a collection of Gaussian functions in 3D and 2D for markerless skeletal pose estimation. Estimation of surface detail was not the goal of that work. Our paper extends their basic concept to the different problem of dense stereo-based surface estimation using continuous optimization of a smooth energy that can be formulated in closed form, and that has analytic derivatives.
3 Overview
An overview of our approach is shown in Fig. 2. The input to our algorithm is a calibrated and synchronized multiview video sequence showing images of the human subject. In addition, we assume as input a spatio-temporally coherent coarse animated mesh sequence, reconstructed from multiview video with related approaches [7, 14].
Our method refines the initial coarse animation such that fine dynamic surface details are incorporated into the meshes. First, we create an implicit representation of the input mesh using a dense collection of 3D Gaussian functions on the surface with associated colors. The input images are also represented as a set of 2D Gaussians associated with image patches in each camera view. Thereafter, continuous optimization is performed to maximize the color consistency between the collection of 3D surface Gaussians and the set of 2D image Gaussians. The optimization displaces the 3D Gaussians along the associated vertex normals of the coarse mesh, which yields the necessary vertex displacements.
Our optimization scheme has a smooth energy function that, thanks to our Gaussian-based model, can be expressed in closed form. It further allows us to compute derivatives analytically, which makes it possible to use efficient gradient-based solvers.
4 Implicit Model
Our framework converts the input coarse animation and input multiview images into implicit representations using a collection of Gaussians: 3D surface Gaussians on the mesh surface with associated colors and 2D image Gaussians, with associated colors, assigned to image patches in each camera view.
4.1 3D Surface Gaussian
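Our implicit model for the input mesh is obtained by placing a 3D Gaussian at each mesh vertex $v_s$, $s = 1, \dots, n_s$, with $n_s$ being the number of vertices. A 3D unnormalized isotropic Gaussian function $\hat{G}_s$ on the surface is defined simply by a mean $\hat{\mu}_s$, which coincides with the location of vertex $v_s$, and a standard deviation $\hat{\sigma}_s$ (set to the same value for all 3D Gaussians on the surface) as follows:

$$\hat{G}_s(\hat{x}) = \exp\left(-\frac{\lVert \hat{x} - \hat{\mu}_s \rVert^2}{2\hat{\sigma}_s^2}\right) \tag{1}$$

with $\hat{x} \in \mathbb{R}^3$. Note that although $\hat{G}_s$ has infinite support, for visualization purposes we represent its projection as a square centered (i.e. with its diagonals intersecting) at the projected mean, with a side length determined by the projected standard deviation (see Fig. 3).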
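The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `surface_gaussian` and `build_surface_gaussians` are hypothetical, and the value of the shared standard deviation is left to the caller.

```python
import numpy as np

def surface_gaussian(x, mu, sigma):
    """Unnormalized isotropic Gaussian exp(-||x - mu||^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sq_dist = np.sum((x - mu) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def build_surface_gaussians(vertices, sigma):
    """One Gaussian per vertex: means are the vertex positions, sigma is shared."""
    means = np.asarray(vertices, dtype=float).copy()
    sigmas = np.full(len(means), float(sigma))
    return means, sigmas
```

Because the function is unnormalized, it evaluates to 1 at the vertex itself regardless of the chosen standard deviation.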
We further assign an HSV color value to each surface Gaussian. To derive the colors, we choose a reference frame in which the initial coarse reconstruction is as close as possible to the real shape; this is typically the first frame of each sequence. For each vertex of the input mesh, we first choose the camera view that sees the vertex best, i.e. the view in which the vertex normal and the camera viewing direction align best. The 3D Gaussian associated with the vertex is then projected into the image of this best camera view, and the average color of the underlying pixels is assigned as its color attribute.
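The choice of the best camera view can be illustrated as follows. This is a hedged sketch: it assumes that "align best" means the largest cosine between the vertex normal and the direction from the vertex to the camera center, and it ignores occlusion. The function name `best_view` and the use of camera centers are assumptions made for illustration.

```python
import numpy as np

def best_view(vertex, normal, camera_centers):
    """Index of the camera whose direction from the vertex best matches the normal."""
    vertex = np.asarray(vertex, dtype=float)
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    # Unit vectors pointing from the vertex towards each camera center.
    to_cam = np.asarray(camera_centers, dtype=float) - vertex
    to_cam /= np.linalg.norm(to_cam, axis=1, keepdims=True)
    # Largest cosine = camera looking most directly at the surface element.
    return int(np.argmax(to_cam @ normal))
```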
4.2 2D Image Gaussian
Our implicit model for the input images of all $n_c$ cameras, $n_c$ being the number of cameras, is obtained by assigning a 2D Gaussian function $G_i$ to each image patch $i$ of every camera view. Similar to Stoll et al. [15], we decompose each input frame into square regions of coherent color by means of a quadtree decomposition (with the maximal depth set to 8). A 2D Gaussian is assigned to each patch (Fig. 4), such that its mean corresponds to the patch center and its standard deviation to half of the side length of the square patch. The underlying average HSV color is also assigned to each 2D Gaussian as an additional attribute.
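The quadtree decomposition can be sketched as below. The coherence criterion is not specified in the text, so this sketch assumes a simple threshold on the per-channel standard deviation of the patch; it also assumes a square image whose side is a power of two. The name `quadtree_gaussians` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def quadtree_gaussians(image, threshold=0.05, max_depth=8):
    """Split a square image into patches of coherent color.

    Returns a list of (mean_xy, sigma, mean_color), one 2D Gaussian per patch.
    """
    image = np.asarray(image, dtype=float)
    gaussians = []

    def split(x0, y0, size, depth):
        patch = image[y0:y0 + size, x0:x0 + size].reshape(-1, image.shape[2])
        # Assumed coherence test: small color spread inside the patch.
        coherent = patch.std(axis=0).max() <= threshold
        if coherent or depth == max_depth or size == 1:
            mean = np.array([x0 + size / 2.0, y0 + size / 2.0])  # patch center
            gaussians.append((mean, size / 2.0, patch.mean(axis=0)))
            return
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                split(x0 + dx, y0 + dy, half, depth + 1)

    split(0, 0, image.shape[0], 0)
    return gaussians
```

Uniform regions collapse into a few large Gaussians, while textured regions produce many small ones, so the density of image Gaussians follows the image content.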
4.3 Projection of 3D Surface Gaussians
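In order to evaluate the similarity between the 3D surface Gaussians $\hat{G}_s$ and the 2D image Gaussians $G_i$, we project each $\hat{G}_s$ into 2D image space. The 3D surface Gaussian mean $\hat{\mu}_s$ is projected using the camera projection matrix $P$, like any other 3D point in space, as follows:

$$\mu_s = \begin{pmatrix} [\mu_s^p]_x / [\mu_s^p]_z \\ [\mu_s^p]_y / [\mu_s^p]_z \end{pmatrix}, \qquad \mu_s^p = P\,\hat{\mu}_s^h \tag{2}$$

with $[\mu_s^p]_{x,y,z}$ being the respective coordinates of the projected mean, and $\hat{\mu}_s^h$ the 3D mean in homogeneous coordinates (i.e. the 4th dimension is set to 1). The 3D standard deviation $\hat{\sigma}_s$ is projected using the following formula:

$$\sigma_s = \frac{f\,\hat{\sigma}_s}{[\mu_s^p]_z} \tag{3}$$

where $f$ is the camera focal length.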
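The projection step can be illustrated with a short sketch. It assumes a 3x4 pinhole projection matrix scaled so that the third homogeneous coordinate of a projected point equals its depth, and a focal length given in pixels; the function name `project_gaussian` is hypothetical.

```python
import numpy as np

def project_gaussian(mu3d, sigma3d, P, f):
    """Project a 3D Gaussian (mean, std) into the image of a camera with 3x4 matrix P."""
    mu_h = np.append(np.asarray(mu3d, dtype=float), 1.0)  # homogeneous, 4th coordinate = 1
    p = P @ mu_h                                           # (x * z, y * z, z)
    mu2d = p[:2] / p[2]                                    # perspective division
    sigma2d = sigma3d * f / p[2]                           # std shrinks with depth
    return mu2d, sigma2d

# Example camera: focal length 500 px, principal point (320, 240), no rotation.
f = 500.0
K = np.array([[f, 0.0, 320.0], [0.0, f, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
mu2d, sigma2d = project_gaussian([0.1, -0.2, 2.0], 0.01, P, f)
```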
5 Surface Refinement
We employ an analysis-by-synthesis approach to refine the input coarse mesh animation, at every frame, by optimizing the following energy with respect to the collection of 3D surface Gaussian means $\hat{M} = \{\hat{\mu}_s\}$:

$$E(\hat{M}) = E_{sim} - w_{reg}\,E_{reg} \tag{4}$$

The term $E_{sim}$ measures the color similarity of the projected collection of 3D surface Gaussians with the 2D image Gaussians obtained from each camera view. The additional term $E_{reg}$ is used to keep the distribution of the 3D surface Gaussians geometrically smooth, whereas $w_{reg}$ is a user-defined smoothness weight, typically set to 1. We constrain the 3D Gaussians to move along the corresponding (normalized) vertex normal direction $\hat{N}_s$:

$$\hat{\mu}_s = \hat{\mu}_s^{init} + k_s\,\hat{N}_s \tag{5}$$

aiming at maintaining a regular distribution of 3D Gaussians on the surface. We therefore only need to optimize for the single scalar values $k_s$.
5.1 Similarity Term
We exploit the implicit Gaussian representation of both the input images and the surface to derive a closed-form analytical formulation of our similarity term. In principle, a pair of image Gaussian $G_i$ and projected surface Gaussian $G_s$ should have a high similarity measure when the two have similar colors and their spatial locations are sufficiently close. This measure can be formulated as the integral of the product of the projected surface Gaussian $G_s$ and the image Gaussian $G_i$, weighted by their color similarity $T_\Delta(\delta_{i,s})$, as follows:

$$E_{i,s} = T_\Delta(\delta_{i,s}) \int_{\Omega} G_i(x)\,G_s(x)\,dx \tag{6}$$

In the above equation, $\delta_{i,s}$ measures the Euclidean distance between the colors, while $T_\Delta$ is the Wendland radial basis function modeled by:
(7) 
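where the cutoff parameter $\Delta$ is experimentally set to the same value for all test sequences. The main advantage of using a Gaussian representation is that the integral in Eq. 6 has a closed-form solution, namely another Gaussian with combined properties:

$$E_{i,s} = T_\Delta(\delta_{i,s})\; 2\pi\,\frac{\sigma_s^2\,\sigma_i^2}{\sigma_s^2 + \sigma_i^2}\, \exp\left(-\frac{\lVert \mu_i - \mu_s \rVert^2}{2\,(\sigma_s^2 + \sigma_i^2)}\right) \tag{8}$$

where $\mu_i, \sigma_i$ and $\mu_s, \sigma_s$ denote the mean and standard deviation of the image Gaussian and of the projected surface Gaussian, respectively.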
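The closed-form overlap can be checked numerically. The sketch below assumes unnormalized isotropic 2D Gaussians of the form exp(-||x - mu||^2 / (2 sigma^2)), for which the product integrates to the expression used in `overlap`. The specific Wendland function in `wendland` is an assumption (the compactly supported form (1 - r)^4 (4r + 1) used in related sums-of-Gaussians work), and all function names are hypothetical.

```python
import numpy as np

def wendland(delta, cutoff):
    """Compactly supported Wendland function (1 - r)^4 (4r + 1), r = delta / cutoff."""
    r = np.asarray(delta, dtype=float) / cutoff
    return np.where(r < 1.0, (1.0 - r) ** 4 * (4.0 * r + 1.0), 0.0)

def overlap(mu_a, sigma_a, mu_b, sigma_b):
    """Integral over R^2 of the product of two unnormalized isotropic 2D Gaussians."""
    s2 = sigma_a ** 2 + sigma_b ** 2
    d2 = np.sum((np.asarray(mu_a, dtype=float) - np.asarray(mu_b, dtype=float)) ** 2)
    return 2.0 * np.pi * sigma_a ** 2 * sigma_b ** 2 / s2 * np.exp(-d2 / (2.0 * s2))

def similarity(mu_i, sigma_i, color_i, mu_s, sigma_s, color_s, cutoff):
    """Spatial overlap weighted by color similarity (Euclidean color distance)."""
    delta = np.linalg.norm(np.asarray(color_i, dtype=float) - np.asarray(color_s, dtype=float))
    return float(wendland(delta, cutoff)) * overlap(mu_i, sigma_i, mu_s, sigma_s)

# Numerical check of the closed form against a Riemann sum on a fine grid.
step = 0.05
xs = np.arange(-20.0, 20.0, step)
X, Y = np.meshgrid(xs, xs)
ga = np.exp(-((X - 0.0) ** 2 + (Y - 0.0) ** 2) / (2 * 1.5 ** 2))
gb = np.exp(-((X - 1.0) ** 2 + (Y - 0.5) ** 2) / (2 * 2.0 ** 2))
numeric = np.sum(ga * gb) * step ** 2
closed = overlap((0.0, 0.0), 1.5, (1.0, 0.5), 2.0)
```

The overlap of a Gaussian with itself reduces to pi times its variance, which is the largest value the spatial term can take for that Gaussian.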
We first calculate the similarity for all components of the two models for each camera view. Then, we normalize the result by the maximum obtainable overlap, i.e. that of an image Gaussian with itself, and by the number of cameras, as follows:
(9) 
In this equation, the inner minimization implicitly handles occlusions on the surface, as it prevents occluded Gaussians that project onto the same image location from contributing multiple times to the energy. This is an elegant way of handling occlusion while preserving the smoothness of the energy; exact occlusion detection and handling algorithms, in contrast, are non-smooth or hard to express in closed form.
In order to improve computational efficiency, we evaluate the similarity only for the surface Gaussians that are visible from each camera view. The Gaussian overlap is then computed between the visible projected Gaussians and the 2D image Gaussians in a local neighborhood.
5.2 Regularization Term
Our regularization term constrains each 3D surface Gaussian with respect to the Gaussians in its local neighborhood, such that the final reconstructed surface is sufficiently smooth. This is accomplished by minimizing the following equation:
(10) 
Here, the neighborhood of a surface Gaussian is the set of indices of its neighboring surface Gaussians, the distance between two Gaussians is the geodesic surface distance measured in number of edges, and the weighting function is the Wendland function defined in Eq. 7.
5.3 Optimization
Our formulation allows us to compute analytic derivatives of the energy (Eq. 4), for which we provide the complete derivation in an additional document. The derivative of the similarity term with respect to each scalar displacement along the vertex normal is:
(11) 
The derivative of the overlap is defined as:
(12) 
Here, the quantities involved are the projection matrix of the respective camera, the vertex normal associated with the model Gaussian (expressed in homogeneous coordinates), the z-component of the projected mean, and
(13) 
The derivative of the regularization term is given by:
(14) 
We efficiently optimize our energy function using a conditioned gradient ascent approach. The general gradient ascent method is a first-order optimization procedure that aims at finding local maxima by taking steps proportional to the energy gradient. The conditioner is a scalar factor associated with the analytical derivatives that increases (resp. decreases) step by step when the gradient sign is constant (resp. fluctuating). The use of the conditioner brings three main advantages: it allows for faster convergence to the final solution, it prevents the typical zigzagging while approaching local maxima, and at the same time it constrains the size of the analytical derivatives.
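One possible reading of the conditioned gradient ascent is sketched below. The text does not give the update rule in detail, so the growth and shrink factors, the per-variable conditioner, and the clipping of the step are all assumptions, as is the function name `conditioned_gradient_ascent`. The toy energy is a concave quadratic standing in for the real energy over the normal displacements.

```python
import numpy as np

def conditioned_gradient_ascent(grad, k0, step=0.1, grow=1.2, shrink=0.5,
                                max_step=1.0, iterations=200):
    """Gradient ascent with a per-variable conditioner adapted from gradient sign changes."""
    k = np.asarray(k0, dtype=float).copy()
    cond = np.full_like(k, step)   # one conditioner per optimized scalar
    prev = np.zeros_like(k)        # gradient of the previous iteration
    for _ in range(iterations):
        g = grad(k)
        cond[g * prev > 0] *= grow     # constant sign: take larger steps
        cond[g * prev < 0] *= shrink   # fluctuating sign: damp the step
        # Clipping bounds the effective size of each update (assumed mechanism).
        k += np.clip(cond * g, -max_step, max_step)
        prev = g
    return k

# Toy energy E(k) = -sum((k - target)^2), maximized at k = target.
target = np.array([0.3, -0.7, 1.5])
k_opt = conditioned_gradient_ascent(lambda k: -2.0 * (k - target), np.zeros(3))
```

In this sketch the conditioner grows while the iterate approaches the maximum from one side and is halved as soon as an update overshoots, which is what suppresses zigzagging.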
6 Results
We tested our approach on three different datasets. The input multiview video sequences, as well as the camera settings and the initial coarse mesh reconstructions, were provided by Gall et al. [7] and Starck and Hilton [14]. All sequences were recorded with 8 synchronized and calibrated cameras, with the number of frames ranging between 250 and 721 (see Table 1). The provided coarse input meshes were obtained using low-quality refinement techniques based on sparse feature matching, shape-from-silhouette and multiview 3D reconstruction, and therefore lack surface details.
Sequence  Frames  3D Gaussians  Iter/s  Frame/min
-         721     3053          2.01    0.8
-         573     3430          1.90    0.76
-         250     3880          1.67    0.66
In order to refine the input mesh sequences, we first subdivide the input coarse topology by inserting additional triangles and vertices, aiming at increasing the level of detail. Then we generate a collection of Gaussians on the surface as explained in Sect. 4.1. Since most of the fine-scale deformations in the input sequences happen on the clothing, we decided to focus on the refinement of those areas, generating surface Gaussians only for the corresponding vertices. Table 1 shows the number of 3D surface Gaussians created for each sequence.
When rendering the final resulting mesh sequences, we added an extra epsilon to the computed vertex displacements equal to the standard deviation of the surface Gaussians used. This is needed in order to compensate for the small surface bias (shrink along the normal during optimization) that is due to the spatial extent of the Gaussians.
Evaluation. Our results (Fig. 1, Fig. 5 and the accompanying video) show that our approach is able to plausibly reconstruct more fine-scale details, e.g. the wrinkles and folds in the skirt, and produces closer model alignment to the images than the baseline methods [7, 14].
In order to verify the quantitative performance of our approach, we textured the model by assigning the surface Gaussian colors to the corresponding mesh vertices. Then, we used optical flow to generate displacement flow vectors between the input images and the reprojected textured mesh models (original and refined) for all time steps. Fig. 6 plots the difference in average optical flow displacement error between the input and the resulting animation sequences over time for a single camera view. As shown in the graphs, our method decreases the average flow displacement error, leading to quantitatively more accurate results.
We created an additional experiment to verify the performance of our refinement framework. For this experiment, we first spatially smooth the input mesh sequence, aiming at eliminating most of the baked-in surface details, if any. The smooth mesh animation is then used as input to our system. As we show in Fig. 7 and in the accompanying video, our approach is able to plausibly refine the input smooth mesh animation, reconstructing fine-scale details in the skirt, t-shirt and shorts. Quantitative evaluation for the smooth input sequence is provided in an additional document.
We evaluated the performance of our system on an Intel Xeon Processor E5-1620, quad-core with Hyper-Threading, and 16 GB of RAM. Table 1 summarizes the performance we obtained for the three tested sequences. We believe we can further reduce the computation time by parallelizing independent steps and implementing our method on the GPU.
Limitations. Our approach is subject to a few limitations. We assume the input mesh sequence to be sufficiently accurate, such that smaller details can be easily and correctly captured by simply displacing vertices along their corresponding vertex normals. In cases where the input reconstructed meshes are misaligned with respect to the images, or where stronger deformations have to be reconstructed, our method is unable to perform adequately. To handle such cases, our refinement would have to be reformulated to allow more complex displacements, e.g. without any normal constraint. However, such a weaker prior on the vertex motion requires a more complex regularization formulation in order to maintain a smooth surface and to handle unwanted self-intersections and collapsing vertices. On top of that, the increased number of parameters to optimize for (i.e. 3 times more, when optimizing for all 3 vertex dimensions x, y and z) would spoil computational efficiency and raise the probability of getting stuck in local maxima. The risk of returning local maxima is still high when employing local solvers (e.g. gradient ascent) on non-convex problems, as in our case. A possible solution is to use more advanced solvers, e.g. global solvers, when computational efficiency is not a requirement.
Another limitation of our approach is its inability to densely refine plainly colored surfaces with little texture. A solution here is to employ a more complex color model that takes into account e.g. illumination and shading effects, at the cost of increased computational expense. We would like to investigate these limitations in future work.
7 Conclusions
We presented a new, effective framework for performance capture of deforming meshes with fine-scale time-varying surface detail from multiview video recordings. Our approach captures the fine-scale deformation of the mesh vertices by maximizing photoconsistency over all vertex positions. This can be done efficiently by densely optimizing a new model-to-image consistency energy function that uses our proposed implicit representation of the deformable mesh, consisting of a collection of 3D Gaussians for the surface and a set of 2D Gaussians for the input images. Our proposed formulation enables a smooth closed-form energy with implicit occlusion handling and analytic derivatives. We qualitatively and quantitatively evaluated our refinement strategy on 3 input sequences, showing that we are able to capture and model finer-scale details.
References
 [1] L. Ballan and G. M. Cortelazzo. Markerless motion capture of skinned models in a four camera setup using optical flow and silhouettes. In 3DPVT, June 2008.
 [2] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):1–9, 2008.
 [3] C. Cagniart, E. Boyer, and S. Ilic. Free-form mesh tracking: a patch-based approach. In Proc. IEEE CVPR, 2010.
 [4] J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-viewpoint video of human actors. In ACM TOG (Proc. SIGGRAPH ’03), page 22, 2003.
 [5] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun. Performance capture from sparse multiview video. ACM Trans. Graph., 27(3), 2008.
 [6] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In Proc. IEEE CVPR, pages 1746–1753, 2009.

 [7] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [8] C. Hernandez, G. Vogiatzis, G. J. Brostow, B. Stenger, and R. Cipolla. Non-rigid photometric stereo with colored lights. In Proc. ICCV, pages 1–8, 2007.
 [9] K. Kolev, M. Klodt, T. Brox, and D. Cremers. Continuous global optimization in multiview 3d reconstruction. International Journal of Computer Vision, 84(1):80–96, 2009.
 [10] Y. Liu, J. Gall, C. Stoll, Q. Dai, H.-P. Seidel, and C. Theobalt. Markerless motion capture of multiple characters using multiview image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2720–2735, 2013.
 [11] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan. Image-based visual hulls. In SIGGRAPH, pages 369–374, 2000.
 [12] P. Sand, L. McMillan, and J. Popović. Continuous capture of skin deformation. ACM TOG, 22(3):578–586, July 2003.
 [13] Y. Savoye. Iterative cage-based registration from multiview silhouettes. In Proceedings of the 10th European Conference on Visual Media Production, CVMP ’13, pages 8:1–8:10. ACM, 2013.
 [14] J. Starck and A. Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007.
 [15] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt. Fast articulated motion tracking using a sums of gaussians body model. In D. N. Metaxas, L. Quan, A. Sanfeliu, and L. J. V. Gool, editors, ICCV, pages 951–958. IEEE, 2011.
 [16] C. Theobalt, E. de Aguiar, C. Stoll, H.-P. Seidel, and S. Thrun. Performance capture from multiview video. In R. Ronfard and G. Taubin, editors, Image and Geometry Processing for 3-D Cinematography, page 127ff. Springer, 2010.
 [17] T. Tung, S. Nobuhara, and T. Matsuyama. Complete multiview reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In Proc. IEEE ICCV, pages 1709–1716, 2009.
 [18] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from multiview silhouettes. ACM TOG (Proc. SIGGRAPH), 2008.
 [19] D. Vlasic, P. Peers, I. Baran, P. Debevec, J. Popović, S. Rusinkiewicz, and W. Matusik. Dynamic shape capture using multiview photometric stereo. In ACM TOG (Proc. SIGGRAPH Asia ’09), 2009.
 [20] M. Waschbüsch, S. Würmlin, D. Cotting, F. Sadlo, and M. Gross. Scalable 3D video of dynamic scenes. In Proc. Pacific Graphics, pages 629–638, 2005.
 [21] C. Wu, Y. Liu, Q. Dai, and B. Wilburn. Fusing multiview and photometric stereo for 3d reconstruction under uncalibrated illumination. IEEE TVCG, 17(8):1082–1095, 2011.
 [22] C. Wu, K. Varanasi, Y. Liu, H.-P. Seidel, and C. Theobalt. Shading-based dynamic shape refinement from multiview video under general illumination. In Proc. ICCV, ICCV ’11, pages 1108–1115. IEEE, 2011.

 [23] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. ACM Trans. Graph., 23(3):600–608, Aug. 2004.