Publications

  • Learning Deep Parsimonious Representations

    Renjie Liao, Alexander Schwing, Richard Zemel, Raquel Urtasun
    In Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016

    @inproceedings{LiaoNIPS16,
    title = {Learning Deep Parsimonious Representations},
    author = {Renjie Liao and Alexander Schwing and Richard Zemel and Raquel Urtasun},
    booktitle = {Neural Information Processing Systems},
    year = {2016},
    month = {December}
    }

    In this paper we aim to facilitate generalization for deep networks while supporting interpretability of the learned representations. Towards this goal, we propose a clustering-based regularization that encourages parsimonious representations. Our k-means style objective is easy to optimize and flexible, supporting various forms of clustering such as sample clustering, spatial clustering, and co-clustering. We demonstrate the effectiveness of our approach on the tasks of unsupervised learning, classification, fine-grained categorization, and zero-shot learning.
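
    The sketch below (Python/NumPy, entirely illustrative and not the authors' code) shows the kind of k-means style regularizer the abstract describes, for the sample-clustering case: each hidden representation is penalized by its squared distance to the nearest of a set of cluster centres, which are periodically refreshed from the assignments.

    import numpy as np

    def clustering_regularizer(features: np.ndarray, centres: np.ndarray):
        """features: (N, D) hidden representations; centres: (K, D) cluster centres."""
        # squared distance from every sample to every centre
        d2 = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(-1)   # (N, K)
        assign = d2.argmin(axis=1)                                          # hard assignments
        loss = d2[np.arange(len(features)), assign].mean()                  # k-means style penalty
        # centres would be refreshed periodically from the assignments, e.g.:
        new_centres = np.stack([
            features[assign == k].mean(0) if np.any(assign == k) else centres[k]
            for k in range(len(centres))
        ])
        return loss, assign, new_centres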

  • Proximal Deep Structured Models

    Shenlong Wang, Sanja Fidler, Raquel Urtasun
    In Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016

    @inproceedings{ShenlongNIPS16,
    title = {Proximal Deep Structured Models},
    author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
    booktitle = {Neural Information Processing Systems},
    year = {2016},
    month = {December}
    }

    Many problems in real-world applications involve predicting continuous-valued random variables that are statistically related. In this paper, we propose a powerful deep structured model that is able to learn complex non-linear functions which encode the dependencies between continuous output variables. We show that inference in our model using proximal methods can be efficiently solved as a feed-forward pass of a special type of deep recurrent neural network. We demonstrate the effectiveness of our approach in the tasks of image denoising, depth refinement and optical flow estimation.
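
    As a rough illustration of the idea that proximal inference unrolls into a feed-forward pass, the following toy sketch (the objective and all names are my assumptions, not the paper's model) runs a fixed number of proximal-gradient iterations for a simple l1-regularized least-squares problem; each iteration plays the role of one recurrent layer.

    import numpy as np

    def soft_threshold(v, t):
        # proximal operator of t * ||.||_1
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def unrolled_proximal_inference(x, A, lam=0.1, step=None, num_layers=20):
        """Minimize 0.5*||A y - x||^2 + lam*||y||_1 by unrolled proximal gradient."""
        if step is None:
            step = 1.0 / np.linalg.norm(A, 2) ** 2        # safe step size (1 / Lipschitz)
        y = np.zeros(A.shape[1])
        for _ in range(num_layers):                        # each pass = one "layer"
            grad = A.T @ (A @ y - x)                       # gradient of the smooth term
            y = soft_threshold(y - step * grad, step * lam)  # proximal step
        return y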

  • Understanding the Effective Receptive Field in Deep Convolutional Neural Networks

    Wenjie Luo, Yujia Li, Raquel Urtasun, Richard Zemel
    In Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016

    @inproceedings{LuoNIPS16,
    title = {Understanding the Effective Receptive Field in Deep Convolutional Neural Networks},
    author = {Wenjie Luo and Yujia Li and Raquel Urtasun and Richard Zemel},
    booktitle = {Neural Information Processing Systems},
    year = {2016},
    month = {December}
    }

    We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. We analyze the effective receptive field in several architecture designs, and the effect of nonlinear activations, dropout, sub-sampling and skip connections on it. This leads to suggestions for ways to address its tendency to be too small.
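
    A minimal 1-D illustration of the effective receptive field idea (my own sketch, not the paper's experiments): stacking uniform convolutions makes the gradient footprint at the output approximately Gaussian, so most of its mass falls well inside the theoretical receptive field.

    import numpy as np

    def effective_receptive_field_1d(num_layers: int, kernel_size: int) -> np.ndarray:
        erf = np.array([1.0])                        # gradient of the output w.r.t. itself
        box = np.ones(kernel_size) / kernel_size     # uniform convolution weights
        for _ in range(num_layers):
            erf = np.convolve(erf, box)              # backprop through one conv layer
        return erf / erf.sum()

    erf = effective_receptive_field_1d(num_layers=10, kernel_size=3)
    theoretical = len(erf)                           # 10*(3-1)+1 = 21 pixels
    centre = slice(theoretical // 4, theoretical - theoretical // 4)
    print(theoretical, erf[centre].sum())            # most of the mass sits in the central half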

  • Exploiting Semantic Information and Deep Matching for Optical Flow

    Min Bai, Wenjie Luo, Kaustav Kundu, Raquel Urtasun
    In European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, October 2016

    @inproceedings{BaiECCV16,
    title = {Exploiting Semantic Information and Deep Matching for Optical Flow},
    author = {Min Bai and Wenjie Luo and Kaustav Kundu and Raquel Urtasun},
    booktitle = {European Conference on Computer Vision},
    year = {2016},
    month = {October}
    }

    We tackle the problem of estimating optical flow from a monocular camera in the context of autonomous driving. We build on the observation that the scene is typically composed of a static background, as well as a relatively small number of traffic participants which move rigidly in 3D. We propose to estimate the traffic participants using instance-level segmentation. For each traffic participant, we use the epipolar constraints that govern each independent motion for faster and more accurate estimation. Our second contribution is a new convolutional net that learns to perform flow matching, and is able to estimate the uncertainty of its matches. This is a core element of our flow estimation pipeline. We demonstrate the effectiveness of our approach in the challenging KITTI 2015 flow benchmark, and show that our approach outperforms published approaches by a large margin.

  • HouseCraft: Building Houses from Rental Ads and Street Views

    Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler
    In European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, October 2016

    @inproceedings{ChuECCV16,
    title = {HouseCraft: Building Houses from Rental Ads and Street Views},
    author = {Hang Chu and Shenlong Wang and Raquel Urtasun and Sanja Fidler},
    booktitle = {European Conference on Computer Vision},
    year = {2016},
    month = {October}
    }

    In this paper, we utilize rental ads to create realistic textured 3D models of building exteriors. In particular, we exploit the address of the property and its floorplan, which are typically available in the ad. The address allows us to extract Google StreetView images around the building, while the building’s floorplan allows for an efficient parametrization of the building in 3D via a small set of random variables. We propose an energy minimization framework which jointly reasons about the height of each floor, the vertical positions of windows and doors, as well as the precise location of the building in the world’s map, by exploiting several geometric and semantic cues from the StreetView imagery. To demonstrate the effectiveness of our approach, we collected a new dataset with 174 houses by crawling a popular rental website. Our experiments show that our approach is able to precisely estimate the geometry and location of the property, and can create realistic 3D building models.

  • Efficient Deep Learning for Stereo Matching

    Wenjie Luo, Alexander Schwing, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016
    Spotlight presentation

    @inproceedings{LuoCVPR16,
    title = {Efficient Deep Learning for Stereo Matching},
    author = {Wenjie Luo and Alexander Schwing and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    In the past year, convolutional neural networks have been shown to perform extremely well for stereo estimation. However, current architectures rely on Siamese networks which exploit concatenation followed by further processing layers, requiring a minute of GPU computation per image pair. In contrast, in this paper we propose a matching network which is able to produce very accurate results in less than a second of GPU computation. Towards this goal, we exploit a product layer which simply computes the inner product between the two representations of a Siamese architecture. We train our network by treating the problem as multi-class classification, where the classes are all possible disparities. This allows us to get calibrated scores, which result in much better matching performance when compared to existing approaches.
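
    The following schematic sketch (shapes and normalization are my assumptions) illustrates the product layer described above: per-pixel features from the two Siamese towers are compared with an inner product at every candidate disparity, and a softmax over disparities yields calibrated matching scores for the multi-class formulation.

    import numpy as np

    def disparity_scores(left_feat, right_feat, max_disp):
        """left_feat, right_feat: (H, W, C) per-pixel features; returns (H, W, max_disp) scores."""
        W = left_feat.shape[1]
        scores = np.full(left_feat.shape[:2] + (max_disp,), -np.inf)
        for d in range(max_disp):
            # inner product between left pixel x and right pixel x - d
            scores[:, d:, d] = np.einsum('hwc,hwc->hw', left_feat[:, d:], right_feat[:, :W - d])
        return scores

    def softmax_over_disparities(scores):
        s = scores - scores.max(-1, keepdims=True)
        e = np.exp(s)
        return e / e.sum(-1, keepdims=True)          # calibrated matching probabilities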

  • HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images

    Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016

    @inproceedings{MattyusCVPR16,
    title = {HD Maps: Fine-grained Road Segmentation by Parsing Ground and Aerial Images},
    author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    In this paper we present an approach to enhance existing maps with fine-grained segmentation categories such as parking spots and sidewalks, as well as the number and location of road lanes. Towards this goal, we propose an efficient approach that is able to estimate these fine-grained categories by performing joint inference over both monocular aerial imagery and ground images taken from a stereo camera pair mounted on top of a car. Important to this is reasoning about the alignment between the two types of imagery, as even when the measurements are taken with sophisticated GPS+IMU systems, this alignment is not sufficiently accurate. We demonstrate the effectiveness of our approach on a new dataset which enhances KITTI with aerial images taken from a camera mounted on an airplane flying around the city of Karlsruhe, Germany.

  • Instance-Level Segmentation with Deep Densely Connected MRFs

    Ziyu Zhang, Sanja Fidler, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016

    @inproceedings{ZhangCVPR16,
    title = {Instance-Level Segmentation with Deep Densely Connected MRFs},
    author = {Ziyu Zhang and Sanja Fidler and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    Our aim is to provide a pixel-level object instance labeling of a monocular image. We build on recent work [Zhang et al., ICCV15] that trained a convolutional neural net to predict instance labeling in local image patches, extracted exhaustively in a stride from an image. A simple Markov random field model using several heuristics was then proposed in [Zhang et al., ICCV15] to derive a globally consistent instance labeling of the image. In this paper, we formulate the global labeling problem with a novel densely connected Markov random field and show how to encode various intuitive potentials in a way that is amenable to efficient mean field inference [Krähenbühl et al., NIPS11]. Our potentials encode the compatibility between the global labeling and the patch-level predictions, contrast-sensitive smoothness as well as the fact that separate regions form different instances. Our experiments on the challenging KITTI benchmark [Geiger et al., CVPR12] demonstrate that our method achieves a significant performance boost over the baseline [Zhang et al., ICCV15].

  • Learning a Global Patch Collider for Correspondences Estimation

    Shenlong Wang, Sean Fanello, Christoph Rhemann, Shahram Izadi, Pushmeet Kohli
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016
    Oral presentation

    @inproceedings{WangCVPR16,
    title = {Learning a Global Patch Collider for Correspondences Estimation},
    author = {Shenlong Wang and Sean Fanello and Christoph Rhemann and Shahram Izadi and Pushmeet Kohli},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    Coming Soon

  • Learning Aligned Cross-Modal Representations from Weakly Aligned Data

    Lluís Castrejón, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016

    @inproceedings{CastrejonCVPR16,
    title = {Learning Aligned Cross-Modal Representations from Weakly Aligned Data},
    author = {Lluís Castrejón and Yusuf Aytar and Carl Vondrick and Hamed Pirsiavash and Antonio Torralba},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize cross-modal scenes well, they also learn an intermediate representation not aligned across modalities, which is undesirable for cross-modal transfer applications. We present methods to regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality. Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval. Moreover, our visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.

  • Monocular 3D Object Detection for Autonomous Driving

    Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016

    @inproceedings{ChenCVPR16,
    title = {Monocular 3D Object Detection for Autonomous Driving},
    author = {Xiaozhi Chen and Kaustav Kundu and Ziyu Zhang and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
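
    As a toy illustration of placing candidates on the ground plane (the camera height and axis conventions below are my assumptions, loosely following KITTI, and this is not the paper's code), the sketch projects the corners of a 3D box resting on the ground into the image to obtain the 2D box that would then be scored.

    import numpy as np

    def project_ground_plane_box(K, center_xz, size, ground_y=1.65):
        """K: (3, 3) intrinsics; center_xz: (x, z) on the ground; size: (w, h, l) in metres."""
        cx, cz = center_xz
        w, h, l = size
        xs = np.array([-w / 2, w / 2])
        ys = np.array([ground_y - h, ground_y])      # box rests on the ground plane (y points down)
        zs = np.array([cz - l / 2, cz + l / 2])
        corners = np.array([[cx + x, y, z] for x in xs for y in ys for z in zs])  # (8, 3)
        uvw = (K @ corners.T).T                       # pinhole projection in the camera frame
        uv = uvw[:, :2] / uvw[:, 2:3]
        return uv.min(0), uv.max(0)                   # 2D box: top-left and bottom-right corners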

  • MovieQA: Understanding Stories in Movies through Question-Answering

    Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Raquel Urtasun, Sanja Fidler
    In Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016
    Spotlight presentation

    @inproceedings{TapaswiCVPR16,
    title = {MovieQA: Understanding Stories in Movies through Question-Answering},
    author = {Makarand Tapaswi and Yukun Zhu and Rainer Stiefelhagen and Raquel Urtasun and Sanja Fidler},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2016},
    month = {June}
    }

    We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 7702 questions about 294 movies with high semantic diversity. The questions range from simpler “Who” did “What” to “Whom”, to “Why” and “How” certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information — full-length movies, plots, subtitles, scripts and for a subset DVS. We analyze our data through various statistics and intelligent baselines. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We plan to create a benchmark with an active leader board, to encourage inspiring work in this challenging domain.

  • Training Deep Neural Networks via Direct Loss Minimization

    Yang Song, Alexander Schwing, Richard Zemel, Raquel Urtasun
    In International Conference on Machine Learning (ICML), New York City, USA, June 2016
    Oral presentation

    @inproceedings{SongICML2016,
    author = {Yang Song and Alexander Schwing and Richard Zemel and Raquel Urtasun},
    title = {Training Deep Neural Networks via Direct Loss Minimization},
    booktitle = {International Conference on Machine Learning},
    year = {2016},
    month = {June}
    }

    Supervised training of deep neural nets typically relies on minimizing cross-entropy. However, in many domains, we are interested in performing well on metrics specific to the application. In this paper we propose a direct loss minimization approach to train deep neural networks, which provably minimizes the application-specific loss function. This is often non-trivial, since these functions are neither smooth nor decomposable and thus are not amenable to optimization with standard gradient-based methods. We demonstrate the effectiveness of our approach in the context of maximizing average precision for ranking problems. Towards this goal, we develop a novel dynamic programming algorithm that can efficiently compute the weight updates. Our approach proves superior to a variety of baselines in the context of action classification and object detection, especially in the presence of label noise.
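
    A compact sketch of the direct loss update for a linear scoring function over a small label set (the general finite-difference form; everything concrete here is my own illustration rather than the paper's dynamic program for average precision): the gradient estimate is the difference between the features of the loss-augmented prediction and those of the ordinary prediction, scaled by 1/eps.

    import numpy as np

    def direct_loss_gradient(w, phi, task_loss, labels, x, y_true, eps=0.1):
        """w: weight vector; phi(x, y): feature map; task_loss(y_true, y): application loss."""
        scores = {y: w @ phi(x, y) for y in labels}
        y_pred = max(labels, key=lambda y: scores[y])                         # standard inference
        y_aug = max(labels, key=lambda y: scores[y] + eps * task_loss(y_true, y))  # loss-augmented
        return (phi(x, y_aug) - phi(x, y_pred)) / eps                         # descend with w -= lr * grad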

  • Adversarial Manipulation of Deep Representations

    Sara Sabour, Yanshuai Cao, Fartash Faghri, David Fleet
    In International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016

    @inproceedings{SabourICLR16,
    title = {Adversarial Manipulation of Deep Representations},
    author = {Sara Sabour and Yanshuai Cao and Fartash Faghri and David Fleet},
    booktitle = {International Conference on Learning Representations},
    year = {2016},
    month = {May}
    }

    We show that the representation of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels, while we concentrate on the internal layers of DNN representations. In this way our new class of adversarial images differs qualitatively from others. While the adversary is perceptually similar to one image, its internal representation appears remarkably similar to a different image, one from a different class, bearing little if any apparent similarity to the input; they appear generic and consistent with the space of natural images. This phenomenon raises questions about DNN representations, as well as the properties of natural images themselves.
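
    The toy sketch below (with a stand-in linear feature map; in the paper the features come from an internal layer of a trained DNN, and all names here are hypothetical) illustrates the optimization described above: perturb a source image so that its representation approaches that of a guide image while the perturbation stays small.

    import numpy as np

    rng = np.random.default_rng(0)
    phi_W = rng.standard_normal((64, 32 * 32))            # hypothetical feature map for 32x32 images
    phi = lambda img: phi_W @ img.ravel()

    def adversarial_representation(source, guide, step=1e-3, eps=0.05, iters=200):
        """source, guide: (32, 32) float arrays; returns a perturbed copy of source."""
        delta = np.zeros_like(source)
        target = phi(guide)
        for _ in range(iters):
            residual = phi(source + delta) - target         # feature-space mismatch
            grad = (phi_W.T @ residual).reshape(source.shape)  # gradient of 0.5*||residual||^2
            delta -= step * grad
            delta = np.clip(delta, -eps, eps)               # keep the change small / imperceptible
        return source + delta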

  • Order-Embeddings of Images and Language

    Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun
    In International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016
    Oral presentation

    @inproceedings{VendrovICLR16,
    title = {Order-Embeddings of Images and Language},
    author = {Ivan Vendrov and Ryan Kiros and Sanja Fidler and Raquel Urtasun},
    booktitle = {International Conference on Learning Representations},
    year = {2016},
    month = {May}
    }

    Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
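
    A small sketch of the order-violation penalty behind such ordered representations (the exact sign convention and margin value are my assumptions): embeddings are compared coordinatewise, only violations of the partial order are penalized, and a margin-based loss separates positive from negative pairs.

    import numpy as np

    def order_violation(u, v):
        # zero when u <= v holds coordinatewise, positive otherwise
        return np.square(np.maximum(0.0, u - v)).sum()

    def margin_loss(pos_pairs, neg_pairs, alpha=1.0):
        loss = sum(order_violation(u, v) for u, v in pos_pairs)              # ordered pairs: no violation
        loss += sum(max(0.0, alpha - order_violation(u, v)) for u, v in neg_pairs)  # push negatives apart
        return loss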

  • Sequential Inference for Deep Gaussian Process

    Yali Wang, Marcus Brubaker, Raquel Urtasun
    In International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, May 2016

    @inproceedings{WangAISTATS16,
    title = {Sequential Inference for Deep Gaussian Process},
    author = {Yali Wang and Marcus Brubaker and Raquel Urtasun},
    booktitle = {International Conference on Artificial Intelligence and Statistics},
    year = {2016},
    month = {May}
    }

    Coming soon

  • Blending Learning and Inference in Structured Prediction

    Tamir Hazan, Alexander Schwing, Raquel Urtasun
    In Journal of Machine Learning Research (JMLR), 2016

    @article{HazanJMLR16,
    title = {Blending Learning and Inference in Structured Prediction},
    author = {Tamir Hazan and Alexander Schwing and Raquel Urtasun},
    journal = {Journal of Machine Learning Research},
    year = {2016}
    }

    In this paper we derive an efficient algorithm to learn the parameters of structured predictors in general graphical models. This algorithm blends the learning and inference tasks, which results in a significant speedup over traditional approaches, such as conditional random fields and structured support vector machines. For this purpose we utilize the structures of the predictors to describe a low dimensional structured prediction task which encourages local consistencies within the different structures while learning the parameters of the model. Convexity of the learning task provides the means to enforce the consistencies between the different parts. The inference-learning blending algorithm that we propose is guaranteed to converge to the optimum of the low dimensional primal and dual programs. Unlike many of the existing approaches, the inference-learning blending allows us to efficiently learn high-order graphical models over regions of any size and with a very large number of parameters. We demonstrate the effectiveness of our approach, while presenting state-of-the-art results in stereo estimation, semantic segmentation, shape reconstruction, and indoor scene understanding.

  • Disentangling the Roles of Junctions and Spatial Relations Between Contours for Scene Categorization

    John D. Wilder, Sven Dickinson, Allan Jepson, Dirk B. Walther
    2016

    @article{wilder2016disentangling,
    title={Disentangling the Roles of Junctions and Spatial Relations Between Contours for Scene Categorization},
    author={John D. Wilder and Sven Dickinson and Allan Jepson and Dirk B. Walther},
    year={2016}
    }

    Humans can rapidly and accurately determine the class of a natural scene. With severely limited exposure time (~50ms stimulus duration) it is unlikely that the visual system recognizes the individual objects. Instead, efficiently obtained summary statistics might be used to classify the scene. Walther et al. (2011) found that observers can rapidly classify line-drawings of natural scenes. Walther and Shen (2014) showed that line-drawing scenes with randomly translated contours result in different confusions than with intact line drawings, suggesting that relationships between lines (e.g. junctions) are important for scene classification. In the current study, we showed subjects intact line-drawings of natural scenes or manipulated line-drawings from the same database. Manipulated scenes either contained a portion of the line segments at junctions, or they contained only line segments between (not including) the junctions. The total amount of line content was equal in both manipulations. Subjects made similar confusions in all conditions (intact, junction, non-junction). This suggests that junctions (and possibly their relationships) can be used to perform scene classification, in the absence of their connecting lines. Subject performance (percent correct) was better in the non-junction images than in the junction images, suggesting that non-junction relationships between lines (e.g. parallelism) are powerful cues to scene category, and this information can be rapidly extracted for use in scene classification. It is unclear if the observers extrapolate line segments in order to infer junctions (in non-junction images), or interpolate a line between junctions (in junction images), which should be controlled in future experiments.

  • Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015
    Oral presentation

    @inproceedings{ZhuICCV15,
    title = {Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books},
    author = {Yukun Zhu and Ryan Kiros and Richard Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    Books are a rich source of both fine-grained information, what a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

  • Lost Shopping! Monocular Localization in Large Indoor Spaces

    Shenlong Wang, Sanja Fidler, Raquel Urtasun
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015
    Oral presentation

    @inproceedings{WangICCV15,
    title = {Lost Shopping! Monocular Localization in Large Indoor Spaces},
    author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    In this paper we propose a novel approach to localization in very large indoor spaces (i.e., 200+ store shopping malls) that takes a single image and a floor plan of the environment as input. We formulate the localization problem as inference in a Markov random field, which jointly reasons about text detection (localizing shop’s names in the image with precise bounding boxes), shop facade segmentation, as well as camera’s rotation and translation within the entire shopping mall. The power of our approach is that it does not use any prior information about appearance and instead exploits text detections corresponding to the shop names. This makes our method applicable to a variety of domains and robust to store appearance variation across countries, seasons, and illumination conditions. We demonstrate the performance of our approach in a new dataset we collected of two very large shopping malls, and show the power of holistic reasoning.

  • A Learning Framework for Generating Region Proposals with Mid-level Cues

    Tom Lee, Sanja Fidler, Sven Dickinson
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015

    @inproceedings{TLeeICCV15,
    title = {A Learning Framework for Generating Region Proposals with Mid-level Cues},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    The object categorization community’s migration from object detection to large-scale object categorization has seen a shift from sliding window approaches to bottom-up region segmentation, with the resulting region proposals offering discriminating shape and appearance features through an attempt to explicitly segment the objects in a scene from their background. One powerful class of region proposal techniques is based on parametric energy minimization (PEM) via parametric maxflow. In this paper, we incorporate PEM into a novel structured learning framework that learns how to combine a set of mid-level grouping cues to yield a small set of region proposals with high recall. Second, we diversify our region proposals and rank them with region-based convolutional neural network features. Our novel approach, called parametric min-loss, casts perceptual grouping and cue combination in a learning framework which yields encouraging results on VOC’2012.

  • Enhancing World Maps by Parsing Aerial Images

    Gellert Mattyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015

    @inproceedings{MatthyusICCV15,
    title = {Enhancing World Maps by Parsing Aerial Images},
    author = {Gellert Mattyus and Shenlong Wang and Sanja Fidler and Raquel Urtasun},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    In recent years, contextual models that exploit maps have been shown to be very effective for many recognition and localization tasks. In this paper, we propose to exploit aerial images in order to enhance freely available world maps. Towards this goal, we make use of OpenStreetMap and formulate the problem as the one of inference in a Markov random field parameterized in terms of the location of the road-segment centerlines as well as their width. This parameterization enables very efficient inference and returns only topologically correct roads. In particular, we can segment all OSM roads in the world in a single day using a small cluster of 10 computers. Importantly, our approach generalizes very well; it can be trained using a single aerial image and produces very accurate results in any location across the globe. We demonstrate the effectiveness of our approach over the previous state-of-the-art on two new benchmarks that we collect. We additionally show how our enhanced maps can be exploited for semantic segmentation of ground images.

  • FollowMe: Efficient Online Min-Cost Flow Tracking with Bounded Memory and Computation

    Philip Lenz, Andreas Geiger, Raquel Urtasun
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015

    @inproceedings{LenzICCV15,
    title = {FollowMe: Efficient Online Min-Cost Flow Tracking with Bounded Memory and Computation},
    author = {Philip Lenz and Andreas Geiger and Raquel Urtasun},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    One of the most popular approaches to multi-target tracking is tracking-by-detection. Current min-cost flow algorithms which solve the data association problem optimally have three main drawbacks: they are computationally expensive, they assume that the whole video is given as a batch, and they scale badly in memory and computation with the length of the video sequence. In this paper, we address each of these issues, resulting in a computationally and memory-bounded solution. First, we introduce a dynamic version of the successive shortest-path algorithm which solves the data association problem optimally while reusing computation, resulting in significantly faster inference than standard solvers. Second, we address the optimal solution to the data association problem when dealing with an incoming stream of data (i.e., online setting). Finally, we present our main contribution which is an approximate online solution with bounded memory and computation which is capable of handling videos of arbitrary length while performing tracking in real time. We demonstrate the effectiveness of our algorithms on the KITTI and PETS2009 benchmarks and show state-of-the-art performance, while being significantly faster than existing solvers.

  • Monocular Object Instance Segmentation and Depth Ordering with CNNs

    Ziyu Zhang, Alex Schwing, Sanja Fidler, Raquel Urtasun
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015

    @inproceedings{ZhangICCV15,
    title = {Monocular Object Instance Segmentation and Depth Ordering with CNNs},
    author = {Ziyu Zhang and Alex Schwing and Sanja Fidler and Raquel Urtasun},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    In this paper we tackle the problem of instance level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance level segmentations where the instance ID encodes depth ordering from large image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional nets applied at overlapping patches of different resolutions as well as the output of a connected component algorithm and predicts very accurate instance level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show very good performance on both tasks.

  • Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions

    Jimmy Ba, Kevin Swersky, Sanja Fidler, Ruslan Salakhutdinov
    In International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015

    @inproceedings{BaICCV15,
    title = {Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions},
    author = {Jimmy Ba and Kevin Swersky and Sanja Fidler and Ruslan Salakhutdinov},
    booktitle = {International Conference on Computer Vision},
    year = {2015},
    month = {December}
    }

    One of the main challenges in Zero-Shot Learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.

  • 3D Object Proposals for Accurate Object Class Detection

    Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun
    In Neural Information Processing Systems (NIPS), Montreal, Canada, December 2015

    @inproceedings{XiaozhiNIPS15,
    title = {3D Object Proposals for Accurate Object Class Detection},
    author = {Xiaozhi Chen and Kaustav Kundu and Yukun Zhu and Andrew Berneshawi and Huimin Ma and Sanja Fidler and Raquel Urtasun},
    booktitle = {Neural Information Processing Systems},
    year = {2015},
    month = {December}
    }

    The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on Car and Cyclist, and is competitive for the Pedestrian class.

  • Efficient Non-greedy Optimization of Decision Trees and Forests

    Mohammad Norouzi, Maxwell Collins, Matthew Johnson, David Fleet, Pushmeet Kohli
    In Neural Information Processing Systems (NIPS), Montreal, Canada, December 2015

    @inproceedings{NorouziNIPS15,
    title = {Efficient Non-greedy Optimization of Decision Trees and Forests},
    author = {Mohammad Norouzi and Maxwell Collins and Matthew Johnson and David Fleet and Pushmeet Kohli},
    booktitle = {Neural Information Processing Systems},
    year = {2015},
    month = {December}
    }

    Coming soon

  • Skip-Thought Vectors

    Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
    In Neural Information Processing Systems (NIPS), Montreal, Canada, December 2015

    @inproceedings{KirosNIPS15,
    title = {Skip-Thought Vectors},
    author = {Ryan Kiros and Yukun Zhu and Ruslan Salakhutdinov and Richard Zemel and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
    booktitle = {Neural Information Processing Systems},
    year = {2015},
    month = {December}
    }

    We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
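
    The vocabulary-expansion step described above can be sketched as fitting a linear map from a large pretrained word-embedding space into the encoder's word-embedding space over the shared vocabulary (the matrix names and shapes below are my assumptions, not the released code):

    import numpy as np

    def fit_expansion_map(X_w2v_shared, X_rnn_shared):
        """X_w2v_shared: (N, D_w2v) and X_rnn_shared: (N, D_rnn) for the N shared words."""
        # least-squares linear map from the word2vec space to the encoder embedding space
        W, *_ = np.linalg.lstsq(X_w2v_shared, X_rnn_shared, rcond=None)   # (D_w2v, D_rnn)
        return W

    def embed_unseen_word(w2v_vector, W):
        # map a word the encoder never saw into its embedding space
        return w2v_vector @ W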

  • A Framework for Symmetric Part Detection in Cluttered Scenes

    Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, Sven Dickinson
    In Symmetry, Vol. 7, Num. 3, pp. 1333-1351, September 2015

    @article{LeeSymmetry2015,
    title = {A Framework for Symmetric Part Detection in Cluttered Scenes},
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Cristian Sminchisescu and Sven Dickinson},
    journal = {Symmetry},
    volume = {7},
    number = {3},
    pages = {1333-1351},
    year = {2015},
    month = {September}
    }

    The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favour as shape gave way to appearance and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into medial parts. However, today’s recognition systems are faced with cluttered scenes and the assumption that a closed contour exists, i.e., that figure-ground segmentation has been solved, rendering much of the medial axis community’s work inapplicable. In this article, we review a computational framework, previously reported in [Lee et al., ICCV’13, Levinshtein et al., ICCV’09, Levinshtein et al., IJCV’13], that bridges the representation power of the medial axis and the need to recover and group an object’s parts in a cluttered scene. Our framework is rooted in the idea that a maximally-inscribed disc, the building block of a medial axis, can be modelled as a compact superpixel in the image. We evaluate the method on images of cluttered scenes.

  • Generating Multi-Sentence Lingual Descriptions of Indoor Scenes

    Dahua Lin, Chen Kong, Sanja Fidler, Raquel Urtasun
    In British Machine Vision Conference (BMVC), Swansea, UK, September 2015
    Oral presentation

    @inproceedings{LinBMVC15,
    title = {Generating Multi-Sentence Lingual Descriptions of Indoor Scenes},
    author = {Dahua Lin and Chen Kong and Sanja Fidler and Raquel Urtasun},
    booktitle = {British Machine Vision Conference},
    year = {2015},
    month = {September}
    }

    This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches have focused primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account the coherence among sentences. Experiments on the augmented NYU-v2 dataset show that our framework can generate natural descriptions with substantially higher ROUGE scores compared to those produced by the baseline.

  • Homogeneous Codes for Energy-Efficient Illumination and Imaging

    Matthew O’Toole, Supreeth Achar, Srinivasa G. Narasimhan, Kiriakos N. Kutulakos
    In ACM Transactions on Graphics (SIGGRAPH), Vol. 34, Num. 4, August 2015

    @article{TooleSIGGRAPH15,
    title = {Homogeneous Codes for Energy-Efficient Illumination and Imaging},
    author = {Matthew O’Toole and Supreeth Achar and Srinivasa G. Narasimhan and Kiriakos N. Kutulakos},
    journal = {ACM Transactions on Graphics},
    year = {2015},
    volume = {34},
    number = {4},
    month = {August}
    }

    Programmable coding of light between a source and a sensor has led to several important results in computational illumination, imaging and display. Little is known, however, about how to utilize energy most effectively, especially for applications in live imaging. In this paper, we derive a novel framework to maximize energy efficiency by “homogeneous matrix factorization” that respects the physical constraints of many coding mechanisms (DMDs/LCDs, lasers, etc.). We demonstrate energy-efficient imaging using two prototypes based on DMD and laser illumination. For our DMD-based prototype, we use fast local optimization to derive codes that yield brighter images with fewer artifacts in many transport probing tasks. Our second prototype uses a novel combination of a low-power laser projector and a rolling shutter camera. We use this prototype to demonstrate never-seen-before capabilities such as (1) capturing live structured-light video of very bright scenes—even a light bulb that has been turned on; (2) capturing epipolar-only and indirect-only live video with optimal energy efficiency; (3) using a low-power projector to reconstruct 3D objects in challenging conditions such as strong indirect light, strong ambient light, and smoke; and (4) recording live video from a projector’s—rather than the camera’s—point of view.

  • Learning Deep Structured Models

    L.C. Chen, A. Schwing, A. Yuille, R. Urtasun
    In International Conference on Machine Learning (ICML), Lille, France, July 2015
    Oral presentation

    @inproceedings{ChenICML15,
    title = {Learning Deep Structured Models},
    author = {L.C. Chen and A. Schwing and A. Yuille and R. Urtasun},
    booktitle = {International Conference on Machine Learning},
    year = {2015},
    month = {July}
    }

    Many problems in real-world applications involve predicting several random variables that are statistically related. Markov random fields (MRFs) are a great mathematical tool to encode such dependencies. The goal of this paper is to combine MRFs with deep learning to estimate complex representations while taking into account the dependencies between the output random variables. Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials. Our approach is efficient as it blends learning and inference and makes use of GPU acceleration. We demonstrate the effectiveness of our algorithm in the tasks of predicting words from noisy images, as well as tagging of Flickr photographs. We show that joint learning of the deep features and the MRF parameters results in significant performance gains.

  • Building proteins in a day: Efficient 3D molecular reconstruction

    M.A. Brubaker, A. Punjani, D.J. Fleet
    In Computer Vision and Pattern Recognition (CVPR), Boston, USA, June 2015

    @inproceedings{BrubakerCVPR15,
    title = {Building proteins in a day: Efficient 3D molecular reconstruction},
    author = {M.A. Brubaker and A. Punjani and D.J. Fleet},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June}
    }

    Discovering the 3D atomic structure of molecules such as proteins and viruses is a fundamental research problem in biology and medicine. Electron Cryomicroscopy (Cryo-EM) is a promising vision-based technique for structure estimation which attempts to reconstruct 3D structures from 2D images. This paper addresses the challenging problem of 3D reconstruction from 2D Cryo-EM images. A new framework for estimation is introduced which relies on modern stochastic optimization techniques to scale to large datasets. We also introduce a novel technique which reduces the cost of evaluating the objective function during optimization by over five orders of magnitude. The net result is an approach capable of estimating 3D molecular structure from large scale datasets in about a day on a single workstation.

  • Holistic 3D Scene Understanding from a Single Geo-tagged Image

    Shenlong Wang, Sanja Fidler, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Boston, June 2015
    Oral presentation

    @inproceedings{WangCVPR15,
    title = {Holistic 3D Scene Understanding from a Single Geo-tagged Image},
    author = {Shenlong Wang and Sanja Fidler and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June}
    }

    In this paper we are interested in exploiting geographic priors to help outdoor scene understanding. Towards this goal we propose a holistic approach that reasons jointly about 3D object detection, pose estimation, semantic segmentation as well as depth reconstruction from a single image. Our approach takes advantage of large-scale crowdsourced maps to generate dense geographic, geometric and semantic priors by rendering the 3D world. We demonstrate the effectiveness of our holistic model on the challenging KITTI dataset, and show significant improvements over the baselines in all metrics and tasks.

  • Learning to Segment Under Various Weak Supervisions

    J. Xu, A. Schwing, R. Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Boston, June 2015

    @inproceedings{XuCVPR15,
    title = {Learning to Segment Under Various Weak Supervisions},
    author = {J. Xu and A. Schwing and R. Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June}
    }

    Despite the promising performance of conventional fully supervised algorithms, semantic segmentation has remained an important, yet challenging task. Due to the limited availability of complete annotations, it is of great interest to design solutions for semantic segmentation that take into account weakly labeled data, which is readily available at a much larger scale. Contrasting the common theme to develop a different algorithm for each type of weak annotation, in this work, we propose a unified approach that incorporates various forms of weak supervision – image level tags, bounding boxes, and partial labels – to produce a pixel-wise labeling. We conduct a rigorous evaluation on the challenging Siftflow dataset for various weakly labeled settings, and show that our approach outperforms the state-of-the-art by 12% on per-class accuracy, while maintaining comparable per-pixel accuracy.

  • Neuroaesthetics in Fashion: Modeling the Perception of Beauty

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Boston, June 2015
    Featured in over 60 news articles including New Scientist, Quartz, and Cosmopolitan [read more]

    @inproceedings{SimoCVPR15,
    title = {Neuroaesthetics in Fashion: Modeling the Perception of Beauty},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June}
    }

    In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable
    a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a
    Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments
    the user is wearing, the type of the user, the photograph’s setting (e.g., the scenery behind the user), and the fashionability score.
    Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change
    in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines.
    We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information
    which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability
    scores across the globe and across a span of 6 years.

  • Real-Time Coarse-to-fine Topologically Preserving Segmentation

    Jian Yao, Marko Boben, Sanja Fidler, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Boston, June 2015

    @inproceedings{YaoCVPR15,
    title = {Real-Time Coarse-to-fine Topologically Preserving Segmentation},
    author = {Jian Yao and Marko Boben and Sanja Fidler and Raquel Urtasun},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June}
    }

    In this paper, we tackle the problem of unsupervised segmentation
    in the form of superpixels. Our main emphasis is
    on speed and accuracy. We build on [Yamaguchi et al., ECCV’14] to define the problem
    as a boundary and topology preserving Markov random
    field. We propose a coarse to fine optimization technique
    that speeds up inference in terms of the number of updates
    by an order of magnitude. Our approach is shown to outperform
    [Yamaguchi et al., ECCV’14] while employing a single iteration. We evaluate
    and compare our approach to state-of-the-art superpixel algorithms
    on the BSD and KITTI benchmarks. Our approach
    significantly outperforms the baselines in the segmentation
    metrics and achieves the lowest error on the stereo task.

  • Rent3D: Floor-Plan Priors for Monocular Layout Estimation

    Chenxi Liu, Alex Schwing, Kaustav Kundu, Raquel Urtasun, Sanja Fidler
    In Computer Vision and Pattern Recognition (CVPR), Boston, June 2015

    @inproceedings{ApartmentsCVPR15,
    title = {Rent3D: Floor-Plan Priors for Monocular Layout Estimation},
    author = {Chenxi Liu and Alex Schwing and Kaustav Kundu and Raquel Urtasun and Sanja Fidler},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June},
    note = {Oral}
    }

    The goal of this paper is to enable a 3D “virtual-tour” of an apartment given a small set of monocular images of different rooms,
    as well as a 2D floor plan. We frame the problem as the one of inference in a Markov random field which reasons about the layout
    of each room and its relative pose (3D rotation and translation) within the full apartment. This gives us information, for example,
    about in which room the picture was taken.
    What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge.
    In particular, we exploit the floor plan to impose aspect ratio constraints across the layouts of different rooms, as well as to
    extract semantic information, e.g., the location of windows which are labeled in floor plans. We show that this information
    can significantly help in resolving the challenging room-apartment alignment problem.
    We also derive an efficient exact inference algorithm which takes only a few ms per apartment. This is due to the fact that we
    exploit integral geometry as well as our new bounds on the aspect ratio of rooms which allow us to carve the space, reducing
    significantly the number of physically possible configurations.
    We demonstrate the effectiveness of our approach in a new dataset which contains over 200 apartments.

  • segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

    Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov, Sanja Fidler
    In Computer Vision and Pattern Recognition (CVPR), Boston, June 2015

    @inproceedings{ZhuSegDeepM15,
    title = {segDeepM: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection},
    author = {Yukun Zhu and Raquel Urtasun and Ruslan Salakhutdinov and Sanja Fidler},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2015},
    month = {June}
    }

    In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection.
    We frame the problem as inference in a Markov Random Field, in which each detection hypothesis scores object appearance as well as
    contextual information using Convolutional Neural Networks, and allows the hypothesis to choose and score a segment out of a large
    pool of accurate object segmentation proposals. This enables the detector to incorporate additional evidence when it is available
    and thus results in more accurate detections. Our experiments show an improvement of 4.1%.

  • Estimating Drivable Collision-Free Space from Monocular Video

    J. Yao, S. Ramalingam, Y. Taguchi, Y. Miki, R. Urtasun
    In Winter Conference on Applications of Computer Vision (WACV), Hawaii, USA, January 2015

    @inproceedings{YaoWACV15,
    title = {Estimating Drivable Collision-Free Space from Monocular Video},
    author = {J. Yao and S. Ramalingam and Y. Taguchi and Y. Miki and R. Urtasun},
    booktitle = {Winter Conference on Applications of Computer Vision},
    year = {2015},
    month = {January}
    }

    In this paper we propose a novel algorithm for estimating
    the drivable collision-free space for autonomous navigation
    of on-road and on-water vehicles. In contrast to
    previous approaches that use stereo cameras or LIDAR, we
    show a method to solve this problem using a single camera.
    Inspired by the success of many vision algorithms that
    employ dynamic programming for efficient inference, we reduce
    the free space estimation task to an inference problem
    on a 1D graph, where each node represents a column in the
    image and its label denotes a position that separates the free
    space from the obstacles. Our algorithm exploits several
    image and geometric features based on edges, color, and
    homography to define potential functions on the 1D graph,
    whose parameters are learned through structured SVM. We
    show promising results on the challenging KITTI dataset as
    well as video collected from boats.
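
    Since inference reduces to a chain, the boundary can be found with a standard column-wise dynamic program. Below is a minimal sketch of that step only, with random unary costs and a simple jump penalty standing in for the paper's learned edge, color and homography potentials (all names and costs here are illustrative):

        import numpy as np

        def free_space_boundary(unary, smooth_weight=1.0):
            # Viterbi-style DP on a 1D chain with one node per image column.
            # unary[c, r]: cost of placing the free-space boundary at row r in column c
            # (illustrative; the paper learns such potentials with a structured SVM).
            W, H = unary.shape
            rows = np.arange(H)
            pair = smooth_weight * np.abs(rows[:, None] - rows[None, :])  # jump penalty

            cost = unary[0].copy()
            backptr = np.zeros((W, H), dtype=int)
            for c in range(1, W):
                total = cost[:, None] + pair              # (prev_row, cur_row)
                backptr[c] = np.argmin(total, axis=0)
                cost = total[backptr[c], rows] + unary[c]

            labels = np.zeros(W, dtype=int)
            labels[-1] = int(np.argmin(cost))
            for c in range(W - 1, 0, -1):                 # backtrack the optimal boundary
                labels[c - 1] = backptr[c, labels[c]]
            return labels

        boundary = free_space_boundary(np.random.rand(6, 5))  # 6 columns, 5 candidate rows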

  • Distributed Algorithms for Large Scale Learning and Inference in Graphical Models

    A. Schwing, T. Hazan, M. Pollefeys, R. Urtasun
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015

    @article{SchwingPAMI15,
    title = {Distributed Algorithms for Large Scale Learning and Inference in Graphical Models},
    author = {A. Schwing and T. Hazan and M. Pollefeys and R. Urtasun},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2015}
    }

    Coming soon

  • Efficient optimization for sparse Gaussian process regression

    Y. Cao, M. Brubaker, D.J. Fleet, A. Hertzmann
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015

    @article{CaoPAMI15,
    title = {Efficient optimization for sparse Gaussian process regression},
    author = {Y. Cao and M. Brubaker and D.J. Fleet and A. Hertzmann},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2015}
    }

    Coming soon

  • Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding

    Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, Devi Parikh
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015

    @article{MottaghiPAMI15,
    title = {Human-Machine CRFs for Identifying Bottlenecks in Scene Understanding},
    author = {Roozbeh Mottaghi and Sanja Fidler and Alan Yuille and Raquel Urtasun and Devi Parikh},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2015}
    }

    Recent trends in image understanding have pushed for scene understanding models that jointly reason about various
    tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers.
    In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular
    semantic segmentation, object detection and scene recognition. Towards this goal, we “plug-in” human subjects for each of the
    various components in a conditional random field model. Comparisons among various hybrid human-machine CRFs give us
    indications of how much “head room” there is to improve scene understanding by focusing research efforts on various individual
    tasks.

  • Map-Based Probabilistic Visual Self-Localization

    M. Brubaker, A. Geiger, R. Urtasun
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015

    @article{BrubakerPAMI15,
    title = {Map-Based Probabilistic Visual Self-Localization},
    author = {M. Brubaker and A. Geiger and R. Urtasun},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2015}
    }

    Accurate and efficient self-localization is a critical problem for autonomous systems. This paper describes an affordable
    solution to vehicle self-localization which uses odometry computed from two video cameras and road maps as the sole inputs. The
    core of the method is a probabilistic model for which an efficient approximate inference algorithm is derived. The inference algorithm is
    able to utilize distributed computation in order to meet the real-time requirements of autonomous systems in some instances. Because
    of the probabilistic nature of the model the method is capable of coping with various sources of uncertainty including noise in the visual
    odometry and inherent ambiguities in the map (e.g., in a Manhattan world). By exploiting freely available, community developed maps
    and visual odometry measurements, the proposed method is able to localize a vehicle to 4m on average after 52 seconds of driving on
    maps which contain more than 2,150km of drivable roads.

  • Characterizing the location of spinal and vertebral levels in the human cervical spinal cord

    D.W. Cadotte, A. Cadotte, J. Cohen-Adad, D.J. Fleet, M. Livne, J.R. Wilson, D. Mikulis, N. Nugaeva, M.G. Fehlings
    In American Journal of Neuroradiology, Vol. 36, Num. 4, December 2014

    @article{CadotteAJN14,
    title = {Characterizing the location of spinal and vertebral levels in the human cervical spinal cord},
    author = {D.W. Cadotte and A. Cadotte and J. Cohen-Adad and D.J. Fleet and M. Livne and J.R. Wilson and D. Mikulis and N. Nugaeva and M.G. Fehlings},
    journal = {American Journal of Neuroradiology},
    year = {2014},
    volume = {36},
    number = {4},
    month = {December}
    }

    Advanced MR imaging techniques are critical to understanding the pathophysiology of conditions involving
    the spinal cord. We provide a novel, quantitative solution to map vertebral and spinal cord levels accounting for anatomic variability within the
    human spinal cord. For the first time, we report a population distribution of the segmental anatomy of the cervical spinal cord that has direct
    implications for the interpretation of advanced imaging studies most often conducted across groups of subjects.

  • Efficient Inference of Continuous Markov Random Fields with Polynomial Potentials

    S. Wang, A. Schwing, R. Urtasun
    In Neural Information Processing Systems (NIPS), Montreal, Canada, December 2014

    @inproceedings{WangNIP14,
    title = {Efficient Inference of Continuous Markov Random Fields with Polynomial Potentials},
    author = {S. Wang and A. Schwing and R. Urtasun},
    booktitle = {Neural Information Processing Systems},
    year = {2014},
    month = {December}
    }

    In this paper, we prove that every multivariate polynomial with even degree can
    be decomposed into a sum of convex and concave polynomials. Motivated by
    this property, we exploit the concave-convex procedure to perform inference on
    continuous Markov random fields with polynomial potentials. In particular, we
    show that the concave-convex decomposition of polynomials can be expressed as
    a sum-of-squares optimization, which can be efficiently solved via semidefinite
    programming. We demonstrate the effectiveness of our approach in the context
    of 3D reconstruction, shape from shading and image denoising, and show that
    our method significantly outperforms existing techniques in terms of efficiency as
    well as quality of the retrieved solution.
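
    As a hand-picked one-dimensional illustration of such a decomposition (the paper constructs it automatically via sum-of-squares optimization), one can write

        x^4 - 3x^2 = \underbrace{(x^4 + x^2)}_{\text{convex}} + \underbrace{(-4x^2)}_{\text{concave}} .

    The concave-convex procedure then linearizes the concave part at the current iterate x_t and solves the convex subproblem

        x_{t+1} = \arg\min_x \; (x^4 + x^2) - 8\,x_t\,x ,

    which is guaranteed not to increase the original objective.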

  • Message Passing Inference for Large Scale Graphical Models with High Order Potentials

    J. Zhang, A. Schwing, R. Urtasun
    In Neural Information Processing Systems (NIPS), Montreal, Canada, December 2014

    @inproceedings{ZhangNIP14,
    title = {Message Passing Inference for Large Scale Graphical Models with High Order Potentials},
    author = {J. Zhang and A. Schwing and R. Urtasun},
    booktitle = {Neural Information Processing Systems},
    year = {2014},
    month = {December}
    }

    To keep up with the Big Data challenge, parallelized algorithms based on dual decomposition
    have been proposed to perform inference in Markov random fields.
    Despite this parallelization, current algorithms struggle when the energy has high
    order terms and the graph is densely connected. In this paper we propose a partitioning
    strategy followed by a message passing algorithm which is able to exploit
    pre-computations. It only updates the high-order factors when passing messages
    across machines. We demonstrate the effectiveness of our approach on the task of
    joint layout and semantic segmentation estimation from single images, and show
    that our approach is orders of magnitude faster than current methods.

  • A High Performance CRF Model for Clothes Parsing

    Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun
    In Asian Conference on Computer Vision (ACCV), Singapore, November 2014

    @inproceedings{SimoACCV14,
    title = {A High Performance CRF Model for Clothes Parsing},
    author = {Edgar Simo-Serra and Sanja Fidler and Francesc Moreno-Noguer and Raquel Urtasun},
    booktitle = {Asian Conference on Computer Vision},
    year = {2014},
    month = {November}
    }

    In this paper we tackle the problem of clothing parsing: Our goal is to
    segment and classify different garments a person is wearing. We frame the problem
    as the one of inference in a pose-aware Conditional Random Field (CRF)
    which exploits appearance, figure/ground segmentation, shape and location priors
    for each garment as well as similarities between segments, and symmetries
    between different human body parts. We demonstrate the effectiveness of our approach
    on the Fashionista dataset [Yamaguchi et al., CVPR’12] and show that we can obtain a significant
    improvement over the state-of-the-art.

  • Multi-cue mid-level grouping

    Tom Lee, Sanja Fidler, Sven Dickinson
    In Asian Conference on Computer Vision (ACCV), Singapore, November 2014

    @inproceedings{LeeACCV14,
    title = {Multi-cue mid-level grouping},
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    booktitle = {Asian Conference on Computer Vision},
    year = {2014},
    month = {November}
    }

    Region proposal methods provide richer object hypotheses
    than sliding windows with dramatically fewer proposals, yet they still
    number in the thousands. This large quantity of proposals typically results
    from a diversification step that propagates bottom-up ambiguity in
    the form of proposals to the next processing stage. In this paper, we take
    a complementary approach in which mid-level knowledge is used to resolve
    bottom-up ambiguity at an earlier stage to allow a further reduction
    in the number of proposals. We present a method for generating regions
    using the mid-level grouping cues of closure and symmetry. In doing so,
    we combine mid-level cues that are typically used only in isolation, and
    leverage them to produce fewer but higher quality proposals. We emphasize
    that our model is mid-level by learning it on a limited number of
    objects while applying it to different objects, thus demonstrating that
    it is transferable to other objects. In our quantitative evaluation, we 1)
    establish the usefulness of each grouping cue by demonstrating incremental
    improvement, and 2) demonstrate improvement on two leading
    region proposal methods with a limited budget of proposals.

  • Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation

    K. Yamaguchi, D. McAllester, R. Urtasun
    In European Conference on Computer Vision (ECCV), Zurich, Switzerland, September 2014

    @inproceedings{YamaguchiECCV14,
    title = {Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation},
    author = {K. Yamaguchi and D. McAllester and R. Urtasun},
    booktitle = {European Conference on Computer Vision},
    year = {2014},
    month = {September}
    }

    In this paper we propose a slanted plane model for jointly
    recovering an image segmentation, a dense depth estimate as well as
    boundary labels (such as occlusion boundaries) from a static scene given
    two frames of a stereo pair captured from a moving vehicle. Towards
    this goal we propose a new optimization algorithm for our SLIC-like
    objective which preserves connectedness of image segments and exploits
    shape regularization in the form of boundary length. We demonstrate the
    performance of our approach in the challenging stereo and flow KITTI
    benchmarks and show superior results to the state-of-the-art. Importantly,
    these results can be achieved an order of magnitude faster than
    competing approaches.

  • Bayesian Filtering with Online Gaussian Process Latent Variable Models

    Y. Wang, M. Brubaker, B. Chaibdraa, R. Urtasun
    In Conference on Uncertainty in Artificial Intelligence (UAI), Quebec City, Canada, July 2014

    @inproceedings{WangUAI14,
    title = {Bayesian Filtering with Online Gaussian Process Latent Variable Models},
    author = {Y. Wang and M. Brubaker and B. Chaibdraa and R. Urtasun},
    booktitle = {Conference on Uncertainty in Artificial Intelligence},
    year = {2014},
    month = {July}
    }

    In this paper we present a novel non-parametric
    approach to Bayesian filtering, where the prediction
    and observation models are learned in an
    online fashion. Our approach is able to handle
    multimodal distributions over both models by
    employing a mixture model representation with
    Gaussian Processes (GP) based components. To
    cope with the increasing complexity of the estimation
    process, we explore two computationally
    efficient GP variants, sparse online GP and local
    GP, which help to manage computation requirements
    for each mixture component. Our experiments
    demonstrate that our approach can track
    human motion much more accurately than existing
    approaches that learn the prediction and observation
    models offline and do not update these
    models with the incoming data stream.

  • Temporal Frequency Probing for 5D Transient Analysis of Light Transport

    Matthew P. O’Toole, Felix Heide, Lei Xiao, Matthias Hullin, Wolfgang Heidrich, Kiriakos N. Kutulakos
    In ACM Transactions on Graphics (SIGGRAPH), Vol. 33, Num. 4, July 2014

    @article{TooleSIGGRAPH14,
    title = {Temporal Frequency Probing for 5D Transient Analysis of Light Transport},
    author = {Matthew P. O’Toole and Felix Heide and Lei Xiao and Matthias Hullin and Wolfgang Heidrich and Kiriakos N. Kutulakos},
    journal = {ACM Transactions on Graphics},
    year = {2014},
    volume = {33},
    number = {4},
    month = {July}
    }

    We analyze light propagation in an unknown scene using projectors and cameras that operate at transient timescales. In this new photography regime, the projector emits a spatio-temporal 3D signal and the camera receives a transformed version of it, determined by the set of all light transport paths through the scene and the time delays they induce. The underlying 3D-to-3D transformation encodes scene geometry and global transport in great detail, but individual transport components (e.g., direct reflections, inter-reflections, caustics, etc.) are coupled nontrivially in both space and time.

    To overcome this complexity, we observe that transient light transport is always separable in the temporal frequency domain. This makes it possible to analyze transient transport one temporal frequency at a time by trivially adapting techniques from conventional projector-to-camera transport. We use this idea in a prototype that offers three never-seen-before abilities: (1) acquiring time-of-flight depth images that are robust to general indirect transport, such as interreflections and caustics; (2) distinguishing between direct views of objects and their mirror reflection; and (3) using a photonic mixer device to capture sharp, evolving wavefronts of “light-in-flight”.

  • 3D Shape and Indirect Appearance by Structured Light Transport

    Matthew P. O’Toole, John Mather, Kiriakos N. Kutulakos
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014
    Best Paper Honorable Mention

    @inproceedings{TooleCVPR14,
    title = {3D Shape and Indirect Appearance by Structured Light Transport},
    author = {Matthew P. O’Toole and John Mather and Kiriakos N. Kutulakos},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    We consider the problem of deliberately manipulating the direct and indirect light flowing through a time-varying, fully-general scene in order to simplify its visual analysis. Our approach rests on a crucial link between stereo geometry and light transport: while direct light always obeys the epipolar geometry of a projector-camera pair, indirect light overwhelmingly does not. We show that it is possible to turn this observation into an imaging method that analyzes light transport in real time in the optical domain, prior to acquisition. This yields three key abilities that we demonstrate in an experimental camera prototype: (1) producing a live indirect-only video stream for any scene, regardless of geometric or photometric complexity; (2) capturing images that make existing structured-light shape recovery algorithms robust to indirect transport; and (3) turning them into one-shot methods for dynamic 3D shape capture.

  • Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision

    Liang-Chieh Chen, Sanja Fidler, Alan Yuille, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{ChenCVPR14,
    author = {Liang-Chieh Chen and Sanja Fidler and Alan Yuille and Raquel Urtasun},
    title = {Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    Labeling large-scale datasets with very accurate object
    segmentations is an elaborate task that requires a high degree
    of quality control and a budget of tens or hundreds of
    thousands of dollars. Thus, developing solutions that can
    automatically perform the labeling given only weak supervision
    is key to reduce this cost. In this paper, we show how
    to exploit 3D information to automatically generate very accurate
    object segmentations given annotated 3D bounding
    boxes. We formulate the problem as the one of inference in
    a binary Markov random field which exploits appearance
    models, stereo and/or noisy point clouds, a repository of 3D
    CAD models as well as topological constraints. We demonstrate
    the effectiveness of our approach in the context of autonomous
    driving, and show that we can segment cars with
    the accuracy of 86% intersection-over-union, performing as
    well as highly recommended MTurkers!

  • Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Nam-Gyu Cho, Sanja Fidler, Raquel Urtasun, Alan Yuille
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{PartsCVPR14,
    author = {Xianjie Chen and Roozbeh Mottaghi and Xiaobai Liu and Nam-Gyu Cho and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    Detecting objects becomes difficult when we need to deal
    with large shape deformation, occlusion and low resolution.
    We propose a novel approach to i) handle large deformations
    and partial occlusions in animals (as examples
    of highly deformable objects), ii) describe them in terms of
    body parts, and iii) detect them when their body parts are
    hard to detect (e.g., animals depicted at low resolution). We
    represent the holistic object and body parts separately and
    use a fully connected model to arrange templates for the
    holistic object and body parts. Our model automatically
    decouples the holistic object or body parts from the model
    when they are hard to detect. This enables us to represent a
    large number of holistic object and body part combinations
    to better deal with different “detectability” patterns caused
    by deformations, occlusion and/or low resolution.

    We apply our method to the six animal categories in the
    PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides
    a richer representation for objects. During training we use
    annotations for body parts (e.g., head, torso, etc), making
    use of a new dataset of fully annotated object parts for PASCAL
    VOC 2010, which provides a mask for each part.

  • Fast exact search in Hamming space with multi-index hashing

    M. Norouzi, A. Punjani, D. J. Fleet
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 36, Num. 6, pp. 1107-1119, June 2014

    @article{NorouziPAMI14,
    title = {Fast exact search in Hamming space with multi-index hashing},
    author = {M. Norouzi and A. Punjani and D. J. Fleet},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2014},
    volume = {36},
    number = {6},
    pages = {1107–1119},
    month = {June}
    }

    There is growing interest in representing image data and feature descriptors using compact binary codes for fast near
    neighbor search. Although binary codes are motivated by their use as direct indices (addresses) into a hash table, codes longer than
    32 bits are not being used as such, as it was thought to be ineffective. We introduce a rigorous way to build multiple hash tables on
    binary code substrings that enables exact k-nearest neighbor search in Hamming space. The approach is storage efficient and straightforward
    to implement. Theoretical analysis shows that the algorithm exhibits sub-linear run-time behavior for uniformly distributed codes.
    Empirical results show dramatic speedups over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits.
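
    The indexing idea admits a compact sketch: if two b-bit codes differ in at most r bits and are split into m > r disjoint substrings, the pigeonhole principle guarantees that at least one substring matches exactly, so probing m small hash tables gives a candidate set that is then verified with the full Hamming distance. The toy code below covers only this r < m case; the paper's full algorithm also enumerates substring neighborhoods to support larger radii (identifiers and the 8-bit example are illustrative):

        from collections import defaultdict

        def substrings(code, m, s):
            # split an (m*s)-bit integer code into m substrings of s bits each
            mask = (1 << s) - 1
            return [(code >> (j * s)) & mask for j in range(m)]

        def build_index(codes, b, m):
            s = b // m                                    # assume m divides b
            tables = [defaultdict(list) for _ in range(m)]
            for i, c in enumerate(codes):
                for j, sub in enumerate(substrings(c, m, s)):
                    tables[j][sub].append(i)
            return tables, s

        def search(query, codes, tables, s, m, r):
            assert r < m                                  # pigeonhole: some substring matches exactly
            cand = set()
            for j, sub in enumerate(substrings(query, m, s)):
                cand.update(tables[j].get(sub, []))
            return [i for i in cand if bin(codes[i] ^ query).count("1") <= r]

        codes = [0b10110100, 0b10110111, 0b01001011]      # toy 8-bit database
        tables, s = build_index(codes, b=8, m=4)
        print(search(0b10110110, codes, tables, s, m=4, r=2))   # -> indices 0 and 1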

  • Globally Convergent Parallel MAP LP Relaxation Solver using the Frank-Wolfe Algorithm

    A. Schwing, T. Hazan, M. Pollefeys, R. Urtasun
    In International Conference on Machine Learning (ICML), Beijing, China, June 2014

    @inproceedings{SchwingICML14,
    title = {Globally Convergent Parallel MAP LP Relaxation Solver using the Frank-Wolfe Algorithm},
    author = {A. Schwing and T. Hazan and M. Pollefeys and R. Urtasun},
    booktitle = {International Conference on Machine Learning},
    year = {2014},
    month = {June}
    }

    Estimating the most likely configuration (MAP)
    is one of the fundamental tasks in probabilistic
    models. While MAP inference is typically
    intractable for many real-world applications,
    linear programming relaxations have been
    proven very effective. Dual block-coordinate
    descent methods are among the most efficient
    solvers, however, they are prone to get stuck
    in sub-optimal points. Although subgradient
    approaches achieve global convergence, they
    are typically slower in practice. To improve
    convergence speed, algorithms which compute
    the steepest ε-descent direction by solving a
    quadratic program have been proposed. In this
    paper we suggest to decouple the quadratic program
    based on the Frank-Wolfe approach. This
    allows us to obtain an efficient and easy to parallelize
    algorithm while retaining the global convergence
    properties. Our method proves superior
    when compared to existing algorithms on a set of
    spin-glass models and protein design tasks.

  • Posebits for monocular human pose estimation

    G. Pons-Moll, D.J. Fleet, B. Rosenhahn
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{PonsCVPR14,
    title = {Posebits for monocular human pose estimation},
    author = {G. Pons-Moll and D.J. Fleet and B. Rosenhahn},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    We advocate the inference of qualitative information
    about 3D human pose, called posebits, from images.
    Posebits represent boolean geometric relationships between
    body parts (e.g. left-leg in front of right-leg or hands close
    to each other). The advantages of posebits as a mid-level
    representation are 1) for many tasks of interest, such qualitative
    pose information may be sufficient (e.g. semantic
    image retrieval), 2) it is relatively easy to annotate large
    image corpora with posebits, as it simply requires answers
    to yes/no questions; and 3) they help resolve challenging
    pose ambiguities and therefore facilitate the difficult task of
    image-based 3D pose estimation. We introduce posebits, a
    posebit database, a method for selecting useful posebits for
    pose estimation and a structural SVM model for posebit inference.
    Experiments show the use of posebits for semantic
    image retrieval and for improving 3D pose estimation.
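
    For illustration, posebits of the kind listed above can be read directly off 3D joint positions; the joint names, coordinate convention and thresholds in this sketch are made up for the example:

        import numpy as np

        def example_posebits(joints):
            # joints: dict mapping joint name -> np.array([x, y, z]);
            # here y is up and z points away from the camera (illustrative convention)
            return {
                "left_leg_in_front_of_right": joints["l_ankle"][2] < joints["r_ankle"][2],
                "hands_close_to_each_other": np.linalg.norm(joints["l_wrist"] - joints["r_wrist"]) < 0.2,
                "left_hand_above_head": joints["l_wrist"][1] > joints["head"][1],
            }

        joints = {name: np.random.randn(3) for name in ["l_ankle", "r_ankle", "l_wrist", "r_wrist", "head"]}
        print(example_posebits(joints))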

  • Tell Me What You See and I will Show You Where It Is

    J. Xu, A. Schwing, R. Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{XuCVPR14,
    author = {J. Xu and A. Schwing and R. Urtasun},
    title = {Tell Me What You See and I will Show You Where It Is},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    We tackle the problem of weakly labeled semantic segmentation,
    where the only source of annotation are image
    tags encoding which classes are present in the scene. This
    is an extremely difficult problem as no pixel-wise labelings
    are available, not even at training time. In this paper, we
    show that this problem can be formalized as an instance of
    learning in a latent structured prediction framework, where
    the graphical model encodes the presence and absence of a
    class as well as the assignments of semantic labels to superpixels.
    As a consequence, we are able to leverage standard
    algorithms with good theoretical properties. We demonstrate
    the effectiveness of our approach using the challenging
    SIFT-flow dataset and show average per-class accuracy
    improvements of 7% over the state-of-the-art.

  • The Role of Context for Object Detection and Semantic Segmentation in the Wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, Alan Yuille
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{MottaghiCVPR14,
    author = {Roozbeh Mottaghi and Xianjie Chen and Xiaobai Liu and Sanja Fidler and Raquel Urtasun and Alan Yuille},
    title = {The Role of Context for Object Detection and Semantic Segmentation in the Wild},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    In this paper we study the role of context in existing state-of-the-art
    detection and segmentation approaches. Towards
    this goal, we label every pixel of PASCAL VOC 2010 detection
    challenge with a semantic category. We believe this
    data will provide plenty of challenges to the community, as
    it contains 520 additional classes for semantic segmentation
    and object detection. Our analysis shows that nearest
    neighbor based approaches perform poorly on semantic
    segmentation of contextual classes, showing the variability
    of PASCAL imagery. Furthermore, the improvement of existing
    contextual models for detection is rather modest. In
    order to push forward the performance in this difficult scenario,
    we propose a novel deformable part-based model,
    which exploits both local context around each candidate detection
    as well as global context at the level of the scene.
    We show that this contextual reasoning significantly helps
    in detecting objects at all scales.

  • Visual Semantic Search: Retrieving Videos via Complex Textual Queries

    Dahua Lin, Sanja Fidler, Chen Kong, Raquel Urtasun
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{LinCVPR14,
    author = {Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
    title = {Visual Semantic Search: Retrieving Videos via Complex Textual Queries},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    In this paper, we tackle the problem of retrieving videos
    using complex natural language queries. Towards this goal,
    we first parse the sentential descriptions into a semantic
    graph, which is then matched to visual concepts using a
    generalized bipartite matching algorithm. Our approach
    exploits object appearance, motion and spatial relations,
    and learns the importance of each term using structure prediction.
    We demonstrate the effectiveness of our approach
    on a new dataset designed for semantic search in the context
    of autonomous driving, which exhibits complex and highly
    dynamic scenes with many objects. We show that our approach
    is able to locate a major portion of the objects described
    in the query with high accuracy, and improve the
    relevance in video retrieval.
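
    The matching step can be pictured as assigning nodes of the parsed query graph to detected visual concepts under a similarity score. The toy sketch below uses a plain Hungarian-algorithm assignment from SciPy, whereas the paper solves a generalized bipartite matching whose weights are learned with structured prediction (the similarity values are made up):

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # toy similarity between query-graph nodes (rows) and detected concepts (columns)
        similarity = np.array([[0.9, 0.1, 0.3],
                               [0.2, 0.8, 0.4],
                               [0.1, 0.3, 0.7]])

        rows, cols = linear_sum_assignment(-similarity)   # maximize total similarity
        print(list(zip(rows.tolist(), cols.tolist())))    # e.g. [(0, 0), (1, 1), (2, 2)]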

  • What are you talking about? Text-to-Image Coreference

    Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler
    In Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014

    @inproceedings{KongCVPR14,
    title = {What are you talking about? Text-to-Image Coreference},
    author = {Chen Kong and Dahua Lin and Mohit Bansal and Raquel Urtasun and Sanja Fidler},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2014},
    month = {June}
    }

    In this paper we exploit natural sentential descriptions
    of RGB-D scenes in order to improve 3D semantic parsing.
    Importantly, in doing so, we reason about which particular
    object each noun/pronoun is referring to in the image. This
    allows us to utilize visual information in order to disambiguate
    the so-called coreference resolution problem that
    arises in text. Towards this goal, we propose a structure
    prediction model that exploits potentials computed from text
    and RGB-D imagery to reason about the class of the 3D objects,
    the scene type, as well as to align the nouns/pronouns
    with the referred visual objects. We demonstrate the effectiveness
    of our approach on the challenging NYU-RGBD v2
    dataset, which we enrich with natural lingual descriptions.
    We show that our approach significantly improves 3D detection
    and scene classification accuracy, and is able to reliably
    estimate the text-to-image alignment. Furthermore,
    by using textual and visual information, we are also able to
    successfully deal with coreference in text, improving upon
    the state-of-the-art Stanford coreference system.

  • Transductive Gaussian Processes for Image Denoising

    S. Wang, L. Zhang, R. Urtasun
    In International Conference on Computational Photography (ICCP), Santa Clara, California, May 2014
    Oral presentation

    @inproceedings{WangICCP14,
    author = {S. Wang and L. Zhang and R. Urtasun},
    title = {Transductive Gaussian Processes for Image Denoising},
    booktitle = {International Conference on Computational Photography},
    year = {2014},
    month = {May}
    }

    In this paper we are interested in exploiting self-similarity
    information for discriminative image denoising.
    Towards this goal, we propose a simple yet powerful denoising
    method based on transductive Gaussian processes,
    which introduces self-similarity in the prediction stage. Our
    approach allows us to build a rich similarity measure by learning
    hyper-parameters defining multi-kernel combinations.
    We introduce perceptual-driven kernels to capture pixelwise,
    gradient-based and local-structure similarities. In addition,
    our algorithm can integrate several initial estimates
    as input features to boost performance even further. We
    demonstrate the effectiveness of our approach on several
    benchmarks. The experiments show that our proposed denoising
    algorithm has better performance than competing
    discriminative denoising methods, and achieves competitive
    result with respect to the state-of-the-art.

  • Zero-Shot Learning by Convex Combination of Semantic Embeddings

    Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, Jeffrey Dean
    In International Conference on Learning Representations (ICLR), Banff, Canada, April 2014

    @inproceedings{NorouziICLR14,
    title = {Zero-Shot Learning by Convex Combination of Semantic Embeddings},
    author = {Mohammad Norouzi and Tomas Mikolov and Samy Bengio and Yoram Singer and Jonathon Shlens and Andrea Frome and Greg S. Corrado and Jeffrey Dean},
    booktitle = {International Conference on Learning Representations},
    year = {2014},
    month = {April}
    }

    Several recent publications have proposed methods for mapping images into continuous
    semantic embedding spaces. In some cases the embedding space is trained
    jointly with the image transformation. In other cases the semantic embedding
    space is established by an independent natural language processing task, and then
    the image transformation into that space is learned in a second stage. Proponents
    of these image embedding systems have stressed their advantages over the traditional
    n-way classification framing of image understanding, particularly in terms
    of the promise for zero-shot learning: the ability to correctly annotate images of
    previously unseen object categories. In this paper, we propose a simple method
    for constructing an image embedding system from any existing n-way image classifier
    and a semantic word embedding model, which contains the n class labels in
    its vocabulary. Our method maps images into the semantic embedding space via
    convex combination of the class label embedding vectors, and requires no additional
    training. We show that this simple and direct method confers many of the
    advantages associated with more complex image embedding schemes, and indeed
    outperforms state of the art methods on the ImageNet zero-shot learning task.
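
    The construction is simple enough to sketch directly: take the classifier's top-T class probabilities, renormalize them, and use them as convex-combination weights over the corresponding label embeddings; unseen classes are then ranked by similarity in the embedding space. The array names and cosine scoring below are illustrative, not the authors' code:

        import numpy as np

        def conse_embedding(probs, label_emb, T=10):
            # probs: (n_classes,) softmax outputs of an existing n-way classifier
            # label_emb: (n_classes, d) word embeddings of the class labels
            top = np.argsort(probs)[::-1][:T]
            w = probs[top] / probs[top].sum()             # convex-combination weights
            return w @ label_emb[top]                     # (d,) image embedding

        def zero_shot_predict(probs, label_emb, unseen_emb, T=10):
            f = conse_embedding(probs, label_emb, T)
            sims = unseen_emb @ f / (np.linalg.norm(unseen_emb, axis=1) * np.linalg.norm(f) + 1e-12)
            return int(np.argmax(sims))                   # index of the best unseen class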

  • Detecting Curved Symmetric Parts using a Deformable Disc Model

    Tom Lee, Sanja Fidler, Sven Dickinson
    In International Conference on Computer Vision (ICCV), Sydney, Australia, December 2013

    @inproceedings{LeeICCV13,
    author = {Tom Lee and Sanja Fidler and Sven Dickinson},
    title = {Detecting Curved Symmetric Parts using a Deformable Disc Model},
    booktitle = {International Conference on Computer Vision},
    year = {2013},
    month = {December}
    }

    Symmetry is a powerful shape regularity that’s been exploited by perceptual grouping researchers in both human and computer vision to recover part structure from an image without a priori knowledge of scene content. Drawing on the concept of a medial axis, defined as the locus of centers of maximal inscribed discs that sweep out a symmetric part, we model part recovery as the search for a sequence of deformable maximal inscribed disc hypotheses generated from a multiscale superpixel segmentation, a framework proposed by [Levinshtein et al., ICCV’09]. However, we learn affinities between adjacent superpixels in a space that’s invariant to bending and tapering along the symmetry axis, enabling us to capture a wider class of symmetric parts. Moreover, we introduce a global cost that perceptually integrates the hypothesis space by combining a pairwise and a higher-level smoothing term, which we minimize globally using dynamic programming. The new framework is demonstrated on two datasets, and is shown to significantly outperform the baseline [Levinshtein et al., ICCV’09].

  • Efficient optimization for sparse Gaussian process regression

    Y. Cao, M. Brubaker, D.J. Fleet, A. Hertzmann
    In Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2013

    @inproceedings{CaoNIPS13,
    title = {Efficient optimization for sparse Gaussian process regression},
    author = {Y. Cao and M. Brubaker and D.J. Fleet and A. Hertzmann},
    booktitle = {Neural Information Processing Systems},
    year = {2013},
    month = {December}
    }

    We propose an efficient optimization algorithm for selecting a subset of training
    data to induce sparsity for Gaussian process regression. The algorithm estimates
    an inducing set and the hyper-parameters using a single objective, either the
    marginal likelihood or a variational free energy. The space and time complexity
    are linear in training set size, and the algorithm can be applied to large regression
    problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art
    performance in discrete cases and competitive results in the continuous case.

  • Multiscale Symmetric Part Detection and Grouping

    A. Levinshtein, C. Sminchisescu, S. Dickinson
    In International Journal of Computer Vision (IJCV), Vol. 104, Num. 2, pp. 117-134, September 2013

    @article{svenIJCV13,
    title = {Multiscale Symmetric Part Detection and Grouping},
    author = {A. Levinshtein and C. Sminchisescu and S. Dickinson},
    journal = {International Journal of Computer Vision},
    volume ={104},
    number = {2},
    pages = {117–134},
    year = {2013},
    month = {September}
    }

    Skeletonization algorithms typically decompose
    an object’s silhouette into a set of symmetric parts, offering
    a powerful representation for shape categorization. However,
    having access to an object’s silhouette assumes correct
    figure-ground segmentation, leading to a disconnect with
    the mainstream categorization community, which attempts
    to recognize objects from cluttered images. In this paper,
    we present a novel approach to recovering and grouping
    the symmetric parts of an object from a cluttered scene. We
    begin by using a multiresolution superpixel segmentation to
    generate medial point hypotheses, and use a learned affinity
    function to perceptually group nearby medial points likely
    to belong to the same medial branch. In the next stage, we
    learn higher granularity affinity functions to group the resulting
    medial branches likely to belong to the same object. The
    resulting framework yields a skeletal approximation that is
    free of many of the instabilities that occur with traditional
    skeletons. More importantly, it does not require a closed contour,
    enabling the application of skeleton-based categorization
    systems to more realistic imagery.

  • Server-customer interaction tracker (SCIT): a computer vision-based system to estimate dirt loading cycles

    E. Rezazadeh Azar, S. Dickinson, B. McCabe
    In Journal of Construction Engineering and Management, Vol. 139, Num. 7, pp. 785-794, July 2013

    @article{svenJCEM13,
    title = {Server-customer interaction tracker (SCIT): a computer vision-based system to estimate dirt loading cycles},
    author = {E. Rezazadeh Azar and S. Dickinson and B. McCabe},
    journal = {Journal of Construction Engineering and Management},
    volume = {139},
    number = {7},
    month = {July},
    year = {2013},
    pages = {785–794}
    }

    Real-time monitoring of the heavy equipment can help practitioners improve machine intensive
    and cyclic earthmoving operations. It can also provide reliable data for future
    planning. Surface earthmoving job sites are among the best candidates for vision-based
    systems due to relatively clear sightlines and recognizable equipment. Several cutting
    edge computer vision algorithms are integrated with spatiotemporal information, and
    background knowledge to develop a framework, called server-customer interaction
    tracker (SCIT), which recognizes and measures the dirt loading cycles. The SCIT system
    detects dirt loading plants, including excavator and dump trucks, tracks them, and then
    uses captured spatiotemporal data to recognize loading cycles. A novel hybrid tracking
    algorithm is developed for the SCIT system to track dump trucks under visually noisy
    conditions of loading zones. The developed framework was evaluated using videos taken
    under various conditions. The SCIT system with novel hybrid tracking engine
    demonstrated reliable performance as the comparison of the machine-generated and
    ground truth data showed high accuracy.

  • Cartesian k-means

    M. Norouzi, D.J. Fleet
    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    @inproceedings{NorouziCVPR13,
    title = {Cartesian k-means},
    author = {M. Norouzi and D.J. Fleet},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2013},
    month = {June}
    }

    A fundamental limitation of quantization techniques like
    the k-means clustering algorithm is the storage and runtime
    cost associated with the large numbers of clusters required
    to keep quantization errors small and model fidelity
    high. We develop new models with a compositional parameterization
    of cluster centers, so representational capacity
    increases super-linearly in the number of parameters. This
    allows one to effectively quantize data using billions or trillions
    of centers. We formulate two such models, Orthogonal
    k-means and Cartesian k-means. They are closely related to
    one another, to k-means, to methods for binary hash function
    optimization like ITQ, and to Product Quantization
    for vector quantization. The models are tested on large-scale
    ANN retrieval tasks (1M GIST, 1B SIFT features), and
    on codebook learning for object recognition (CIFAR-10).
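
    The compositional idea can be illustrated in a product-quantization style: split each vector into m subvectors, learn a small codebook per subspace, and represent a point by one index per subspace, which yields k**m implicit full-dimensional centers. The sketch below uses plain per-subspace k-means and is not the paper's exact model (which, among other things, also optimizes the subspace decomposition):

        import numpy as np

        def kmeans(X, k, iters=20, seed=0):
            rng = np.random.default_rng(seed)
            C = X[rng.choice(len(X), k, replace=False)].copy()
            for _ in range(iters):
                assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
                for j in range(k):
                    if np.any(assign == j):
                        C[j] = X[assign == j].mean(axis=0)
            return C

        def compositional_quantize(X, m, k):
            d = X.shape[1] // m
            codebooks, codes = [], []
            for j in range(m):
                Xj = X[:, j * d:(j + 1) * d]
                Cj = kmeans(Xj, k)
                codebooks.append(Cj)
                codes.append(np.argmin(((Xj[:, None, :] - Cj[None]) ** 2).sum(-1), axis=1))
            return codebooks, np.stack(codes, axis=1)     # (n, m): one index per subspace

        X = np.random.rand(500, 8)
        codebooks, codes = compositional_quantize(X, m=4, k=16)   # 16**4 = 65,536 implicit centers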

  • Fast rigid motion segmentation via incrementally-complex local models

    F. Flores-Mangas, A.D. Jepson
    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    @inproceedings{FloresCVPR13,
    title = {Fast rigid motion segmentation via incrementally-complex local models},
    author = {F. Flores-Mangas and A.D. Jepson},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2013},
    month = {June}
    }

    The problem of rigid motion segmentation of trajectory
    data under orthography has been long solved for nondegenerate
    motions in the absence of noise. But because
    real trajectory data often incorporates noise, outliers, motion
    degeneracies and motion dependencies, recently proposed
    motion segmentation methods resort to non-trivial
    representations to achieve state of the art segmentation accuracies,
    at the expense of a large computational cost. This
    paper proposes a method that dramatically reduces this cost
    (by two or three orders of magnitude) with minimal accuracy
    loss (from 98.8% achieved by the state of the art, to
    96.2% achieved by our method on the standard Hopkins 155
    dataset). Computational efficiency comes from the use of a
    simple but powerful representation of motion that explicitly
    incorporates mechanisms to deal with noise, outliers and
    motion degeneracies. Subsets of motion models with the
    best balance between prediction accuracy and model complexity
    are chosen from a pool of candidates, which are then
    used for segmentation.

  • Recognizing Human Activities from Partially Observed Videos

    Y. Cao, D. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. Dickinson, J. Siskind, S. Wang
    In Computer Vision and Pattern Recognition (CVPR), Portland, USA, June 2013

    @inproceedings{svenCVPR13,
    title = {Recognizing Human Activities from Partially Observed Videos},
    author = {Y. Cao and D. Barrett and A. Barbu and S. Narayanaswamy and H. Yu and A. Michaux and Y. Lin and S. Dickinson and J. Siskind and S. Wang},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2013},
    month = {June}
    }

    Recognizing human activities in partially observed
    videos is a challenging problem and has many practical applications.
    When the unobserved subsequence is at the end
    of the video, the problem is reduced to activity prediction
    from unfinished activity streaming, which has been studied
    by many researchers. However, in the general case, an unobserved
    subsequence may occur at any time by yielding a
    temporal gap in the video. In this paper, we propose a new
    method that can recognize human activities from partially
    observed videos in the general case. Specifically, we formulate
    the problem into a probabilistic framework: 1) dividing
    each activity into multiple ordered temporal segments,
    2) using spatiotemporal features of the training video samples
    in each segment as bases and applying sparse coding
    (SC) to derive the activity likelihood of the test video sample
    at each segment, and 3) finally combining the likelihood
    at each segment to achieve a global posterior for the activities.
    We further extend the proposed method to include
    more bases that correspond to a mixture of segments with
    different temporal lengths (MSSC), which can better represent
    the activities with large intra-class variations. We
    evaluate the proposed methods (SC and MSSC) on various
    real videos. We also evaluate the proposed methods on two
    special cases: 1) activity prediction where the unobserved
    subsequence is at the end of the video, and 2) human activity
    recognition on fully observed videos. Experimental
    results show that the proposed methods outperform existing
    state-of-the-art comparison methods.

  • Shape-Based Registration of Kidneys Across Differently Contrasted CT Scans

    F. Flores-Mangas, A.D. Jepson, M. Haider
    In Canadian Conference on Computer and Robot Vision (CRV), Toronto, Canada, pp. 244-251, May 2013

    @inproceedings{FloresCRV12,
    title = {Shape-Based Registration of Kidneys Across Differently Contrasted CT Scans},
    author = {F. Flores-Mangas and A.D. Jepson and M. Haider},
    booktitle = {Canadian Conference on Computer and Robot Vision},
    year = {2013},
    pages = {244–251},
    month = {May}
    }

    We present a method to register kidneys from Computed Tomography (CT) scans with and without contrast enhancement. The method builds a patient-specific kidney shape model from the contrast enhanced image, and then matches it against automatically segmented candidate surfaces extracted from the pre-contrast image to find the alignment. Only the object of interest is used to drive the alignment, providing results that are robust to near-rigid relative motions of the kidney with respect to the surrounding tissues. Shape-based features are used, as opposed to intensity-based ones, and consequently the resulting registration is invariant to the inherent contrast variations. The contributions of this work are: a surface grouping and segmentation algorithm driven by smooth curvature constraints, and a framework to register image volumes under contrast variation, relative motion and local deformation with minimal user intervention. Encouraging experimental results with real patient images, all with various kinds and sizes of kidney lesions, validate the approach.

  • What Does an Aberrated Photo Tell Us about the Lens and the Scene?

    Huixuan Tang, Kiriakos N. Kutulakos
    In International Conference on Computational Photography (ICCP), Boston, USA, April 2013
    Oral presentation

    @inproceedings{TangICCP13,
    title = {What Does an Aberrated Photo Tell Us about the Lens and the Scene?},
    author = {Huixuan Tang and Kiriakos N. Kutulakos},
    booktitle = {International Conference on Computational Photography},
    year = {2013},
    month = {April}
    }


  • Shape Perception in Human and Computer Vision: An Interdisciplinary Perspective


    Eds. S. Dickinson, Z. Pizlo, Springer Verlag, 2013

    @book{svenSpringer13,
    title = {Shape Perception in Human and Computer Vision: An Interdisciplinary Perspective},
    editor = {S. Dickinson and Z. Pizlo},
    publisher = {Springer Verlag},
    year = {2013}
    }

    This comprehensive and authoritative text/reference presents a unique, multidisciplinary perspective on Shape Perception in Human and Computer Vision. Rather than focusing purely on the state of the art, the book provides viewpoints from world-class researchers reflecting broadly on the issues that have shaped the field. Drawing upon many years of experience, each contributor discusses the trends followed and the progress made, in addition to identifying the major challenges that still lie ahead. Topics and features: examines each topic from a range of viewpoints, rather than promoting a specific paradigm; discusses topics on contours, shape hierarchies, shape grammars, shape priors, and 3D shape inference; reviews issues relating to surfaces, invariants, parts, multiple views, learning, simplicity, shape constancy and shape illusions; addresses concepts from the historically separate disciplines of computer vision and human vision using the same “language” and methods.

  • 3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

    Sanja Fidler, Sven Dickinson, Raquel Urtasun
    In Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012
    Spotlight presentation

    @inproceedings{sanjaNIPS12,
    title = {3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model},
    author = {Sanja Fidler and Sven Dickinson and Raquel Urtasun},
    booktitle = {Neural Information Processing Systems},
    year = {2012},
    month = {December}
    }

    This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D and 3D object detection.

  • Hamming distance metric learning

    M. Norouzi, D.J. Fleet, R. Salakhutdinov
    In Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012

    @inproceedings{NorouziNIPS12,
    title = {Hamming distance metric learning},
    author = {M. Norouzi and D.J. Fleet and R. Salakhutdinov},
    booktitle = {Neural Information Processing Systems},
    year = {2012},
    month = {December}
    }

    Motivated by large-scale multimedia applications we propose to learn mappings
    from high-dimensional data to binary codes that preserve semantic similarity.
    Binary codes are well suited to large-scale applications as they are storage efficient
    and permit exact sub-linear kNN search. The framework is applicable
    to broad families of mappings, and uses a flexible form of triplet ranking loss.
    We overcome discontinuous optimization of the discrete mappings by minimizing
    a piecewise-smooth upper bound on empirical loss, inspired by latent structural
    SVMs. We develop a new loss-augmented inference algorithm that is quadratic in
    the code length. We show strong retrieval performance on CIFAR-10 and MNIST,
    with promising classification results using no more than kNN on the binary codes.
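
    As a rough, purely illustrative sketch of the triplet ranking idea above (not the authors' implementation; the linear hash function and all names below are assumptions), the loss can be written directly on Hamming distances between sign-thresholded projections:

    import numpy as np

    def hash_codes(W, X):
        # Linear projection followed by thresholding gives binary codes in {0, 1}.
        return (X @ W.T > 0).astype(np.uint8)

    def hamming(a, b):
        # Number of differing bits between two binary codes.
        return int(np.count_nonzero(a != b))

    def triplet_ranking_loss(W, anchor, positive, negative, margin=1):
        # Hinge-style triplet loss: the anchor's code should be at least `margin`
        # bits closer to the positive code than to the negative one.
        ba = hash_codes(W, anchor[None, :])[0]
        bp = hash_codes(W, positive[None, :])[0]
        bn = hash_codes(W, negative[None, :])[0]
        return max(0, hamming(ba, bp) - hamming(ba, bn) + margin)

    The paper's contribution lies in how W is optimized despite the discrete sign function, via a piecewise-smooth upper bound and loss-augmented inference; the snippet only shows the quantity being bounded.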

  • Utilizing Optical Aberrations for Extended-Depth-of-Field Panoramas

    Huixuan Tang, Kiriakos N. Kutulakos
    In Asian Conference on Computer Vision (ACCV), Daejeon, Korea, November 2012
    Oral presentation

    @inproceedings{TangACCV12,
    title = {Utilizing Optical Aberrations for Extended-Depth-of-Field Panoramas},
    author = {Huixuan Tang and Kiriakos N. Kutulakos},
    booktitle = {Asian Conference on Computer Vision},
    year = {2012},
    month = {November}
    }

    Optical aberrations in off-the-shelf photographic lenses are commonly
    treated as unwanted artifacts that degrade image quality. In this paper we argue
    that such aberrations can be useful, as they often produce point-spread functions
    (PSFs) that have greater frequency-preserving abilities in the presence of defocus
    compared to those of an ideal thin lens. Specifically, aberrated and defocused
    PSFs often contain sharp, edge-like structures that vary with depth and image
    position, and that become increasingly anisotropic away from the image center.
    In such cases, defocus blur varies spatially and preserves high spatial frequencies
    in some directions but not others. Here we take advantage of this fact to create
    extended-depth-of-field panoramas from a set of overlapping photos taken with
    off-the-shelf lenses and a wide aperture. We achieve this by first measuring the
    lens PSF through a one-time calibration procedure and then using multi-image
    deconvolution to restore anisotropic blur in areas of image overlap. Our results
    suggest that common wide-aperture lenses may preserve frequencies well enough
    to allow extended-depth-of-field panoramic photography with large apertures, resulting
    in potentially much shorter exposures.

  • Frequency Analysis of Transient Light Transport with Applications in Bare Sensor Imaging

    Di Wu, Gordon Wetzstein, Christopher Barsi, Thomas Willwacher, Matthew P. O’Toole, Nikhil Naik, Qionghai Dai, Kiriakos N. Kutulakos, Ramesh Raskar
    In European Conference on Computer Vision (ECCV), Florence, Italy, October 2012

    @inproceedings{WuECCV12,
    title = {Frequency Analysis of Transient Light Transport with Applications in Bare Sensor Imaging},
    author = {Di Wu and Gordon Wetzstein and Christopher Barsi and Thomas Willwacher and Matthew P. O’Toole and Nikhil Naik and Qionghai Dai and Kiriakos N. Kutulakos and Ramesh Raskar},
    booktitle = {European Conference on Computer Vision},
    year = {2012},
    month = {October}
    }

    Light transport has been analyzed extensively, in both the
    primal domain and the frequency domain; the latter provides intuition
    of effects introduced by free space propagation and by optical elements,
    and allows for optimal designs of computational cameras for tailored,
    efficient information capture. Here, we relax the common assumption
    that the speed of light is infinite and analyze free space propagation in
    the frequency domain considering spatial, temporal, and angular light
    variation. Using this analysis, we derive analytic expressions for cross-dimensional
    information transfer and show how this can be exploited for
    designing a new, time-resolved bare sensor imaging system.

  • Optimal Image and Video Closure by Superpixel Grouping

    A. Levinshtein, C. Sminchisescu, S. Dickinson
    In International Journal of Computer Vision (IJCV), Vol. 100, Num. 1, pp. 99-119, October 2012

    @article{svenIJCV12,
    title = {Optimal Image and Video Closure by Superpixel Grouping},
    author = {A. Levinshtein and C. Sminchisescu and S. Dickinson},
    journal = {International Journal of Computer Vision},
    volume = {100},
    number = {1},
    pages = {99–119},
    year = {2012},
    month = {October}
    }

    Detecting independent objects in images and
    videos is an important perceptual grouping problem.
    One common perceptual grouping cue that can facilitate
    this objective is the cue of contour closure, reflecting
    the spatial coherence of objects in the world and
    their projections as closed boundaries separating figure
    from background. Detecting contour closure in images
    consists of finding a cycle of disconnected contour fragments
    that separates an object from its background.
    Searching the entire space of possible groupings is intractable,
    and previous approaches have adopted powerful
    perceptual grouping heuristics, such as proximity
    and co-curvilinearity, to constrain the search. We introduce
    a new formulation of the problem, by transforming
    the problem of finding cycles of contour fragments to
    finding subsets of superpixels whose collective boundary
    has strong edge support (few gaps) in the image.
    Our cost function, a ratio of a boundary gap measure
    to area, promotes spatially coherent sets of superpixels.
    Moreover, its properties support a global optimization
    procedure based on parametric maxflow. Extending closure
    detection to videos, we introduce the concept of
    spatiotemporal closure. Analogous to image closure, we
    formulate our spatiotemporal closure cost over a graph
    of spatiotemporal superpixels. Our cost function is a
    ratio of motion and appearance discontinuity measures
    on the boundary of the selection to an internal homogeneity
    measure of the selected spatiotemporal volume.
    The resulting approach automatically recovers coherent
    components in images and videos, corresponding to
    objects, object parts, and objects with surrounding context, providing a good set of multiscale hypotheses for
    high-level scene analysis. We evaluate both our image
    and video closure frameworks by comparing them to
    other closure detection approaches, and find that they
    yield improved performance.
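
    The ratio cost above can be sketched in a few lines (an illustration only; the gap matrix, the neglect of image-border terms, and all names are assumptions, and the actual search over subsets is carried out with parametric maxflow rather than by direct evaluation):

    import numpy as np

    def closure_cost(selected, area, gap):
        # selected: boolean mask over superpixels; area: per-superpixel areas;
        # gap[i, j]: length of the shared boundary between superpixels i and j
        # that lacks edge support in the image (0 if the two are not adjacent).
        s = np.asarray(selected, dtype=bool)
        inside, outside = np.where(s)[0], np.where(~s)[0]
        boundary_gap = gap[np.ix_(inside, outside)].sum()  # gaps along the selection's outer boundary
        return boundary_gap / area[s].sum()                # boundary gap per unit enclosed area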

  • Primal-Dual Coding to Probe Light Transport

    Matthew P. O’Toole, Ramesh Raskar, Kiriakos N. Kutulakos
    In ACM Transactions on Graphics (SIGGRAPH), Vol. 31, Num. 4, August 2012

    @article{TooleSIGGRAPH12,
    title = {Primal-Dual Coding to Probe Light Transport},
    author = {Matthew P. O’Toole and Ramesh Raskar and Kiriakos N. Kutulakos},
    journal = {ACM Transactions on Graphics},
    year = {2012},
    volume = {31},
    number = {4},
    month = {August}
    }

    We present primal-dual coding, a photography technique that enables
    direct fine-grain control over which light paths contribute to
    a photo. We achieve this by projecting a sequence of patterns onto
    the scene while the sensor is exposed to light. At the same time,
    a second sequence of patterns, derived from the first and applied
    in lockstep, modulates the light received at individual sensor pixels.
    We show that photography in this regime is equivalent to a
    matrix probing operation in which the elements of the scene's transport
    matrix are individually re-scaled and then mapped to the photo.
    This makes it possible to directly acquire photos in which specific
    light transport paths have been blocked, attenuated or enhanced.
    We show captured photos for several scenes with challenging light
    transport effects, including specular inter-reflections, caustics, diffuse
    inter-reflections and volumetric scattering. A key feature of
    primal-dual coding is that it operates almost exclusively in the optical
    domain: our results consist of directly-acquired, unprocessed
    RAW photos or differences between them.
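
    The probing interpretation described above can be sketched numerically as follows (an illustration of the stated equivalence, not the authors' code; the array names are assumptions):

    import numpy as np

    def probed_photo(T, patterns, masks):
        # T: (n_sensor, n_proj) light transport matrix.
        # patterns: (K, n_proj) projector patterns shown during the exposure.
        # masks: (K, n_sensor) per-pixel sensor modulation applied in lockstep.
        Pi = masks.T @ patterns          # probing matrix: one weight per transport element
        return (Pi * T).sum(axis=1)      # rescale each element of T, then integrate per pixel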

  • Video In Sentences Out

    Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang
    In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina, USA, August 2012

    @inproceedings{BarbuUAI12,
    author = {Andrei Barbu and Alexander Bridge and Zachary Burchill and Dan Coroian and Sven Dickinson and Sanja Fidler and Aaron Michaux and Sam Mussman and Siddharth Narayanaswamy and Dhaval Salvi and Lara Schmidt and Jiangnan Shangguan and Jeffrey Mark Siskind and Jarrell Waggoner and Song Wang and Jinlian Wei and Yifan Yin and Zhiqi Zhang},
    title = {Video In Sentences Out},
    booktitle = {Conference on Uncertainty in Artificial Intelligence},
    year = {2012},
    month = {August}
    }

    We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

  • Discovering Hierarchical Object Models from Captioned Images

    M. Jamieson, Y. Eskin, A. Fazly, S. Stevenson, S. Dickinson
    In Computer Vision and Image Understanding (CVIU), Vol. 116, Num. 7, pp. 842-853, July 2012

    @article{svenCVIU12,
    title = {Discovering Hierarchical Object Models from Captioned Images},
    author = {M. Jamieson and Y. Eskin and A. Fazly and S. Stevenson and S. Dickinson},
    journal = {Computer Vision and Image Understanding},
    volume = {116},
    number = {7},
    pages = {842–853},
    year = {2012},
    month = {July}
    }

    We address the problem of automatically learning the recurring associations
    between the visual structures in images and the words in their associated captions,
    yielding a set of named object models that can be used for subsequent
    image annotation. In previous work, we used language to drive the perceptual
    grouping of local features into configurations that capture small parts
    (patches) of an object. However, model scope was poor, leading to poor object
    localization during detection (annotation), and ambiguity was high when
    part detections were weak. We extend and significantly revise our previous
    framework by using language to drive the perceptual grouping of parts, each a
    configuration in the previous framework, into hierarchical configurations that
    offer greater spatial extent and flexibility. The resulting hierarchical multipart
    models remain scale, translation and rotation invariant, but are more
    reliable detectors and provide better localization. Moreover, unlike typical
    frameworks for learning object models, our approach requires no bounding
    boxes around the objects to be learned, can handle heavily cluttered training
    scenes, and is robust in the face of noisy captions, i.e., where objects in an
    image may not be named in the caption, and objects named in the caption
    may not appear in the image. We demonstrate improved precision and recall
    in annotation over the non-hierarchical technique and also show extended
    spatial coverage of detected objects.

  • Fast search in Hamming space with multi-index hashing

    M. Norouzi, A. Punjani, D.J. Fleet
    In Computer Vision and Pattern Recognition (CVPR), Providence, USA, June 2012

    @inproceedings{NorouziCVPR12,
    title = {Fast search in Hamming space with multi-index hashing},
    author = {M. Norouzi and A. Punjani and D.J. Fleet},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2012},
    month = {June}
    }

    There has been growing interest in mapping image data
    onto compact binary codes for fast near neighbor search in
    vision applications. Although binary codes are motivated
    by their use as direct indices (addresses) into a hash table,
    codes longer than 32 bits are not being used in this
    way, as it was thought to be ineffective. We introduce a
    rigorous way to build multiple hash tables on binary code
    substrings that enables exact K-nearest neighbor search in
    Hamming space. The algorithm is straightforward to implement,
    storage efficient, and it has sub-linear run-time
    behavior for uniformly distributed codes. Empirical results
    show dramatic speed-ups over a linear scan baseline and
    for datasets with up to one billion items, 64- or 128-bit
    codes, and search radii up to almost 25 bits.
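
    A minimal sketch of the substring-indexing idea above (illustrative only, with all names assumed: the paper's method additionally enumerates substrings within radius floor(r/m) of each query substring and verifies candidates by full Hamming distance; the sketch probes only for exact substring matches):

    from collections import defaultdict
    import numpy as np

    def build_tables(codes, m):
        # codes: (n, nbits) array of 0/1 values; split each code into m disjoint
        # substrings and index each substring in its own hash table.
        chunks = np.array_split(codes, m, axis=1)
        tables = []
        for chunk in chunks:
            table = defaultdict(list)
            for i, row in enumerate(chunk):
                table[row.tobytes()].append(i)
            tables.append(table)
        return tables

    def candidates(query, tables, m):
        # A code within Hamming radius r of the query must agree with it on at
        # least one substring to within floor(r/m) bits; probing each table for
        # the query's substrings and taking the union yields the candidate set.
        hits = set()
        for q, table in zip(np.array_split(query, m), tables):
            hits.update(table.get(q.tobytes(), []))
        return hits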

  • Super-edge grouping for object localization by combining appearance and shape information

    Zhiqi Zhang, Sanja Fidler, Jarrell W. Waggoner, Yu Cao, Jeff M. Siskind, Sven Dickinson, Song Wang
    In Computer Vision and Pattern Recognition (CVPR), Providence, June 2012

    @inproceedings{ZhangCVPR12,
    author = {Zhiqi Zhang and Sanja Fidler and Jarrell W. Waggoner and Yu Cao and Jeff M. Siskind and Sven Dickinson and Song Wang},
    title = {Super-edge grouping for object localization by combining appearance and shape information},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2012},
    month = {June}
    }

    Both appearance and shape play important roles in object localization and object detection. In this paper, we propose a new superedge grouping method for object localization by incorporating both boundary shape and appearance information of objects. Compared with the previous edge grouping methods, the proposed method does not subdivide detected edges into short edgels before grouping. Such long, unsubdivided superedges not only facilitate the incorporation of object shape information into localization, but also increase the robustness against image noise and reduce computation. We identify and address several important problems in achieving the proposed superedge grouping, including gap filling for connecting superedges, accurate encoding of region-based information into individual edges, and the incorporation of object-shape information into object localization. In this paper, we use the bag of visual words technique to quantify the region-based appearance features of the object of interest. We find that the proposed method, by integrating both boundary and region information, can produce better localization performance than previous subwindow search and edge grouping methods on most of the 20 object categories from the VOC 2007 database. Experiments also show that the proposed method is roughly 50 times faster than the previous edge grouping method.

  • Unsupervised Disambiguation of Image Captions

    Wesley May, Sanja Fidler, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson
    In First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, Canada, June 2012

    @inproceedings{MaySEM12,
    author = {Wesley May and Sanja Fidler and Afsaneh Fazly and Suzanne Stevenson and Sven Dickinson},
    title = {Unsupervised Disambiguation of Image Captions},
    booktitle = {First Joint Conference on Lexical and Computational Semantics},
    year = {2012},
    month = {June}
    }

    Given a set of images with related captions, our goal is to show how visual features can improve the accuracy of unsupervised word sense disambiguation when the textual context is very small, as this sort of data is common in news and social media. We extend previous work in unsupervised text-only disambiguation with methods that integrate text and images. We construct a corpus by using Amazon Mechanical Turk to caption sense-tagged images gathered from ImageNet. Using a Yarowsky-inspired algorithm, we show that gains can be made over text-only disambiguation, as well as multimodal approaches such as Latent Dirichlet Allocation.

  • Detecting Reduplication in Videos of American Sign Language

    Z. Gavrilov, S. Sclaroff, C. Neidle, S. Dickinson
    In Proc. Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, May 2012

    @inproceedings{GavrilovLREC12,
    author = {Z. Gavrilov and S. Sclaroff and C. Neidle and S. Dickinson},
    title = {Detecting Reduplication in Videos of American Sign Language},
    booktitle = {Proc. Eighth International Conference on Language Resources and Evaluation},
    year = {2012},
    month = {May}
    }

    A framework is proposed for the detection of reduplication in digital videos of American Sign Language (ASL). In ASL, reduplication is
    used for a variety of linguistic purposes, including overt marking of plurality on nouns, aspectual inflection on verbs, and nominalization
    of verbal forms. Reduplication involves the repetition, often partial, of the articulation of a sign. In this paper, the apriori algorithm for
    mining frequent patterns in data streams is adapted for finding reduplication in videos of ASL. The proposed algorithm can account for
    varying weights on items in the apriori algorithm's input sequence. In addition, the apriori algorithm is extended to allow for inexact
    matching of similar hand motion subsequences and to provide robustness to noise. The formulation is evaluated on 105 lexical signs
    produced by two native signers. To demonstrate the formulation, overall hand motion direction and magnitude are considered; however,
    the formulation should be amenable to combining these features with others, such as hand shape, orientation, and place of articulation.
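
    As a toy illustration of the frequent-pattern mining step (this is not the paper's algorithm, which additionally handles item weights and inexact matching of similar motion subsequences; every name below is an assumption), contiguous motion-symbol subsequences can be grown apriori-style, extending only patterns that were themselves frequent:

    from collections import Counter

    def frequent_subsequences(sequence, min_support=2, max_len=6):
        # sequence: list of quantized motion symbols (e.g. 'up', 'down', ...).
        frequent = {}
        candidates = {(s,) for s in sequence}
        length = 1
        while candidates and length <= max_len:
            counts = Counter(tuple(sequence[i:i + length])
                             for i in range(len(sequence) - length + 1))
            kept = {p: c for p, c in counts.items()
                    if p in candidates and c >= min_support}
            frequent.update(kept)
            # Apriori pruning: only extend patterns that were themselves frequent.
            candidates = {p + (s,) for p in kept for s in sequence}
            length += 1
        return frequent

    Reduplicated signs then surface as frequent patterns that repeat a short motion unit.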

  • Human attributes from 3D pose tracking

    M. Livne, L. Sigal, N. Troje, D.J. Fleet
    In Computer Vision and Image Understanding (CVIU), Vol. 116, Num. 5, pp. 648-660, May 2012

    @article{LivneCVIU12,
    title = {Human attributes from 3D pose tracking},
    author = {M. Livne and L. Sigal and N. Troje and D.J. Fleet},
    journal = {Computer Vision and Image Understanding},
    year = {2012},
    volume = {116},
    number = {5},
    pages = {648–660},
    month = {May}
    }

    It is well known that biological motion conveys a wealth of socially meaningful information. From even a brief exposure, biological motion cues enable the recognition of familiar people, and the inference of attributes such as gender, age, mental state, actions and intentions. In this paper we show that from the output of a video-based 3D human tracking algorithm we can infer physical attributes (e.g., gender and weight) and aspects of mental state (e.g., happiness or sadness). In particular, with 3D articulated tracking we avoid the need for view-based models, specific camera viewpoints, and constrained domains. The task is useful for man-machine communication, and it provides a natural benchmark for evaluating the performance of 3D pose tracking methods (vs. conventional Euclidean joint error metrics). We show results on a large corpus of motion capture data and on the output of a simple 3D pose tracker applied to videos of people walking.

  • Learning Categorical Shape from Captioned Images

    Tom Lee, Sanja Fidler, Alex Levinshtein, Sven Dickinson
    In Canadian Conference on Computer and Robot Vision (CRV), Toronto, Canada, May 2012

    @inproceedings{LeeCRV12,
    author = {Tom Lee and Sanja Fidler and Alex Levinshtein and Sven Dickinson},
    title = {Learning Categorical Shape from Captioned Images},
    booktitle = {Canadian Conference on Computer and Robot Vision},
    year = {2012},
    month = {May}
    }

    Given a set of captioned images of cluttered scenes containing various objects in different positions and scales, we learn named contour models of object categories without relying on bounding box annotation. We extend a recent language-vision integration framework that finds spatial configurations of image features that co-occur with words in image captions. By substituting appearance features with local contour features, object categories are recognized by a contour model that grows along the object’s boundary. Experiments on ETHZ are presented to show that 1) the extended framework is better able to learn named visual categories whose within class variation is better captured by a shape model than an appearance model; and 2) typical object recognition methods fail when manually annotated bounding boxes are unavailable.

  • Shared kernel information embedding for discriminative inference

    R. Memisevic, L. Sigal, D.J. Fleet
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 34, Num. 4, pp. 778-790, April 2012

    @article{MemisevicPAMI12,
    title = {Shared kernel information embedding for discriminative inference},
    author = {R. Memisevic and L. Sigal and D.J. Fleet},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2012},
    volume = {34},
    number = {4},
    pages = {778–790},
    month = {April}
    }

    Latent variable models, such as the GPLVM and related methods, help mitigate overfitting when learning from small or
    moderately sized training sets. Nevertheless, existing methods suffer from several problems: 1) complexity, 2) the lack of explicit
    mappings to and from the latent space, 3) an inability to cope with multimodality, and 4) the lack of a well-defined density over the latent
    space. We propose an LVM called the Kernel Information Embedding (KIE) that defines a coherent joint density over the input and a
    learned latent space. Learning is quadratic, and it works well on small data sets. We also introduce a generalization, the shared KIE
    (sKIE), that allows us to model multiple input spaces (e.g., image features and poses) using a single, shared latent representation. KIE
    and sKIE permit missing data during inference and partially labeled data during learning. We show that with data sets too large to learn
    a coherent global model, one can use the sKIE to learn local online models. We use sKIE for human pose inference.

  • Light-Efficient Photography

    Samuel W. Hasinoff, Kiriakos N. Kutulakos
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 33, Num. 11, pp. 2203-2214, November 2011

    @article{HasinoffPAMI11,
    title = {Light-Efficient Photography},
    author = {Samuel W. Hasinoff and Kiriakos N. Kutulakos},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2011},
    volume = {33},
    number = {11},
    pages = {2203–2214},
    month = {November}
    }

    In this paper, we consider the problem of imaging a scene with a given depth of field at a given exposure level in the
    shortest amount of time possible. We show that by 1) collecting a sequence of photos and 2) controlling the aperture, focus, and
    exposure time of each photo individually, we can span the given depth of field in less total time than it takes to expose a single
    narrower-aperture photo. Using this as a starting point, we obtain two key results. First, for lenses with continuously variable apertures,
    we derive a closed-form solution for the globally optimal capture sequence, i.e., that collects light from the specified depth of field in the
    most efficient way possible. Second, for lenses with discrete apertures, we derive an integer programming problem whose solution is
    the optimal sequence. Our results are applicable to off-the-shelf cameras and typical photography conditions, and advocate the use of
    dense, wide-aperture photo sequences as a light-efficient alternative to single-shot, narrow-aperture photography.
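
    A back-of-envelope version of the time argument above, under textbook thin-lens approximations that are assumptions of this sketch rather than the paper's exact model (per-photo exposure time scaling as 1/D^2 for a fixed exposure level, and per-photo depth of field scaling as 1/D for aperture diameter D):

    import math

    def total_capture_time(dof_span, exposure_level, D, base_D=1.0, base_dof=1.0):
        # Number of photos needed to tile the desired depth of field at diameter D,
        # times the exposure time each photo needs to reach the target exposure level.
        per_photo_dof = base_dof * base_D / D          # DOF shrinks as the aperture widens
        n_photos = math.ceil(dof_span / per_photo_dof)
        per_photo_time = exposure_level / D ** 2       # wider aperture, shorter exposure
        return n_photos * per_photo_time

    Under these approximations the total time falls roughly as 1/D, which is the sense in which a dense wide-aperture sequence spanning the same depth of field can be captured faster than a single narrow-aperture photo.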

  • Simultaneous tracking and activity recognition

    C. Manfredotti, D.J. Fleet, H.J. Hamilton, S. Zilles
    In International Conference on Tools with Artificial Intelligence (ICTAI), Boca Raton, USA, pp. 189-196, November 2011

    @inproceedings{ManfredottiICTAI11,
    title = {Simultaneous tracking and activity recognition},
    author = {C. Manfredotti and D.J. Fleet and H.J. Hamilton and S. Zilles},
    booktitle = {International Conference on Tools with Artificial Intelligence},
    year = {2011},
    pages = {189–196},
    month = {November}
    }

    Many tracking problems involve several distinct objects interacting with each other. We develop a framework that takes into account interactions between objects, allowing the recognition of complex activities. In contrast to classic approaches that consider distinct phases of tracking and activity recognition, our framework performs these two tasks simultaneously. In particular, we adopt a Bayesian standpoint where the system maintains a joint distribution of the positions, the interactions and the possible activities. This turns out to be advantageous, as information about the ongoing activities can be used to improve the prediction step of the tracking, while, at the same time, tracking information can be used for online activity recognition. Experimental results in two different settings show that our approach 1) decreases the error rate and improves the identity maintenance of the positional tracking and 2) identifies the correct activity with higher accuracy than standard approaches.
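
    One way to picture the joint filtering idea (a generic particle-filter sketch, not the paper's model; the callables and activity labels are assumptions) is to carry a discrete activity label in each particle and let it select the motion model used for prediction:

    import random

    def filter_step(particles, observation, dynamics, likelihood,
                    activities=("walking", "meeting"), switch_prob=0.05):
        # particles: list of (position, activity) hypotheses.
        predicted = []
        for pos, act in particles:
            if random.random() < switch_prob:            # occasional activity switch
                act = random.choice(activities)
            predicted.append((dynamics[act](pos), act))   # activity-conditioned prediction
        weights = [likelihood(observation, pos) for pos, _ in predicted]
        total = sum(weights) or 1.0
        # Importance resampling: keep hypotheses in proportion to how well they
        # explain the observation; the activity marginal gives online recognition.
        return random.choices(predicted, weights=[w / total for w in weights],
                              k=len(particles))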

  • Dynamic Refraction Stereo

    Nigel J. Morris, Kiriakos N. Kutulakos
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 33, Num. 8, pp. 1518-1531, August 2011

    @article{MorrisPAMI11,
    title = {Dynamic Refraction Stereo},
    author = {Nigel J. Morris and Kiriakos N. Kutulakos},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2011},
    volume = {33},
    number = {8},
    pages = {1518–1531},
    month = {August}
    }

    In this paper we consider the problem of reconstructing the 3D position and surface normal of points on an unknown,
    arbitrarily-shaped refractive surface. We show that two viewpoints are sufficient to solve this problem in the general case, even if the
    refractive index is unknown. The key requirements are 1) knowledge of a function that maps each point on the two image planes to a
    known 3D point that refracts to it, and 2) light is refracted only once. We apply this result to the problem of reconstructing the time-varying
    surface of a liquid from patterns placed below it. To do this, we introduce a novel “stereo matching” criterion called refractive
    disparity, appropriate for refractive scenes, and develop an optimization-based algorithm for individually reconstructing the position
    and normal of each point projecting to a pixel in the input views. Results on reconstructing a variety of complex, deforming liquid
    surfaces suggest that our technique can yield detailed reconstructions that capture the dynamic behavior of free-flowing liquids.

  • Object Categorization using Bone Graphs

    D. Macrini, S. Dickinson, D. Fleet, K. Siddiqi
    In Computer Vision and Image Understanding (CVIU), Vol. 115, Num. 8, pp. 1187-1206, August 2011

    @article{svenCVIU11c,
    title = {Object Categorization using Bone Graphs},
    author = {D. Macrini and S. Dickinson and D. Fleet and K. Siddiqi},
    journal = {Computer Vision and Image Understanding},
    volume = {115},
    number = {8},
    pages = {1187–1206},
    year = {2011},
    month = {August}
    }

    The bone graph is a graph-based medial shape abstraction that
    offers improved stability over shock graphs and other skeleton-based descriptions
    that retain unstable ligature structure. Unlike the shock graph, the
    bone graph’s edges are attributed, allowing a richer specification of relational
    information, including how and where two medial parts meet. In this
    paper, we propose a novel shape matching algorithm that exploits this relational
    information. Formulating the problem as an inexact directed acyclic
    graph matching problem, we extend a leading bipartite graph-based algorithm
    for matching shock graphs. In addition to accommodating the
    relational information, our new algorithm is better able to enforce hierarchical
    and sibling constraints between nodes, resulting in a more general and
    more powerful matching algorithm. We evaluate our algorithm with respect
    to a competing shock graph-based matching algorithm, and show that for
    the task of view-based object categorization, our algorithm applied to bone
    graphs outperforms the competing algorithm. Moreover, our algorithm applied
    to shock graphs also outperforms the competing shock graph matching
    algorithm, demonstrating the generality and improved performance of our
    matching algorithm.

  • Minimal loss hashing for compact binary codes

    M. Norouzi, D.J. Fleet
    In International Conference on Machine Learning (ICML), Bellevue, USA, June 2011

    @inproceedings{NorouziICML11,
    title = {Minimal loss hashing for compact binary codes},
    author = {M. Norouzi and D.J. Fleet},
    booktitle = {International Conference on Machine Learning},
    year = {2011},
    month = {June}
    }

    We propose a method for learning similarity-preserving
    hash functions that map high-dimensional
    data onto binary codes. The
    formulation is based on structured prediction
    with latent variables and a hinge-like
    loss function. It is efficient to train for large
    datasets, scales well to large code lengths,
    and outperforms state-of-the-art methods.

  • Model-based 3D hand pose estimation from monocular video

    M. de La Gorce, D.J. Fleet, N. Paragios
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 33, Num. 9, pp. 1793-1805, February 2011

    @article{GorcePAMI11,
    title = {Model-based 3D hand pose estimation from monocular video},
    author = {M. de La Gorce and D.J. Fleet and N. Paragios},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2011},
    volume = {33},
    number = {9},
    pages = {1793–1805},
    month = {February}
    }

    A novel model-based approach to 3D hand tracking from
    monocular video is presented. The 3D hand pose, the hand texture
    and the illuminant are dynamically estimated through minimization of
    an objective function. Derived from an inverse problem formulation, the
    objective function enables explicit use of temporal texture continuity
    and shading information, while handling important self-occlusions and
    time-varying illumination. The minimization is done efficiently using a
    quasi-Newton method, for which we provide a rigorous derivation of
    the objective function gradient. Particular attention is given to terms
    related to the change of visibility near self-occlusion boundaries that are
    neglected in existing formulations. To this end we introduce new occlusion
    forces and show that using all gradient terms greatly improves the
    performance of the method. Qualitative and quantitative experimental
    results demonstrate the potential of the approach.

  • Bone Graphs: Medial Shape Parsing and Abstraction

    D. Macrini, S. Dickinson, D. Fleet, K. Siddiqi
    In Computer Vision and Image Understanding (CVIU), Vol. 115, Num. 7, pp. 1044-1061, 2011

    @article{svenCVIU11a,
    title = {Bone Graphs: Medial Shape Parsing and Abstraction},
    author = {D. Macrini and S. Dickinson and D. Fleet and K. Siddiqi},
    journal = {Computer Vision and Image Understanding},
    volume = {115},
    number = {7},
    pages = {1044–1061},
    year = {2011}
    }

    The recognition of 3-D objects from their silhouettes demands a shape representation which is stable
    with respect to minor changes in viewpoint and articulation. This can be achieved by parsing a silhouette
    into parts and relationships that do not change across similar object views. Medial descriptions, such as
    skeletons and shock graphs, provide part-based decompositions but suffer from instabilities. As a result,
    similar shapes may be represented by dissimilar part sets. We propose a novel shape parsing approach
    which is based on identifying and regularizing the ligature structure of a medial axis, leading to a bone
    graph, a medial abstraction which captures a more stable notion of an object's parts. Our experiments
    show that it offers improved recognition and pose estimation performance in the presence of within-class
    deformation over the shock graph.

  • Efficient Many-to-Many Feature Matching under the l1 Norm

    M. Demirci, Y. Osmanlioglu, A. Shokoufandeh, S. Dickinson
    In Computer Vision and Image Understanding (CVIU), Vol. 115, Num. 7, pp. 976-983, 2011

    @article{svenCVIU11b,
    title = {Efficient Many-to-Many Feature Matching under the l1 Norm},
    author = {M. Demirci and Y. Osmanlioglu and A. Shokoufandeh and S. Dickinson},
    journal = {Computer Vision and Image Understanding},
    volume = {115},
    number = {7},
    pages = {976–983},
    year = {2011}
    }

    Matching configurations of image features, represented as attributed graphs, to configurations of model
    features is an important component in many object recognition algorithms. Noisy segmentation of
    images and imprecise feature detection may lead to graphs that represent visually similar configurations
    that do not admit an injective matching. In previous work, we presented a framework which computed
    an explicit many-to-many vertex correspondence between attributed graphs of features configurations.
    The framework utilized a low distortion embedding function to map the nodes of the graphs into point
    sets in a vector space. The Earth Movers Distance (EMD) algorithm was then used to match the resulting
    points, with the computed flows specifying the many-to-many vertex correspondences between the
    input graphs. In this paper, we will present a distortion-free embedding, which represents input graphs
    as metric trees and then embeds them isometrically in the geometric space under the l1 norm. This not
    only improves the representational power of graphs in the geometric space, it also reduces the complexity
    of the previous work using recent developments in computing EMD under l1. Empirical evaluation of
    the algorithm on a set of recognition trials, including a comparison with previous approaches, demonstrates
    the effectiveness and robustness of the proposed framework.

  • Motion models for people tracking

    D.J. Fleet
    In Guide to Visual Analysis of Humans: Looking at People, Eds. T. Moeslund, A. Hilton, V. Krueger, L. Sigal, Springer, pp. 171-198, 2011

    @inbook{FleetChapter11,
    title = {Motion models for people tracking},
    author = {D.J. Fleet},
    booktitle = {Guide to Visual Analysis of Humans: Looking at People},
    year = {2011},
    editor = {T. Moeslund and A. Hilton and V. Krueger and L. Sigal},
    publisher = {Springer},
    pages = {171–198}
    }

    This chapter provides an introduction to models of human pose and motion
    for use in 3D human pose tracking. We concentrate on probabilistic latent variable
    models of kinematics, most of which are learned from motion capture data,
    and on recent physics-based models. We briefly discuss important open problems
    and future research challenges.

  • Optical Computing for Fast Light Transport Analysis

    Matthew P. O’Toole, Kiriakos N. Kutulakos
    In ACM Transactions on Graphics (SIGGRAPH Asia), Vol. 29, Num. 6, December 2010

    @article{TooleSIGGRAPHAsia10,
    title = {Optical Computing for Fast Light Transport Analysis},
    author = {Matthew P. O’Toole and Kiriakos N. Kutulakos},
    journal = {ACM Transactions on Graphics},
    year = {2010},
    volume = {29},
    number = {6},
    month = {December}
    }

    We present a general framework for analyzing the transport matrix of a real-world scene at full resolution, without capturing many photos. The key idea is to use projectors and cameras to directly acquire eigenvectors and the Krylov subspace of the unknown transport matrix. To do this, we implement Krylov subspace methods partially in optics, by treating the scene as a black box subroutine that enables optical computation of arbitrary matrix-vector products. We describe two methods: optical Arnoldi to acquire a low-rank approximation of the transport matrix for relighting; and optical GMRES to invert light transport. Our experiments suggest that good-quality relighting and transport inversion are possible from a few dozen low-dynamic range photos, even for scenes with complex shadows, caustics, and other challenging lighting effects.
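
    The optical Arnoldi idea can be mimicked in conventional code by treating the transport operator as a black-box matrix-vector product (in the system above, one "matvec" corresponds to projecting a pattern and capturing a photo; this standard Arnoldi sketch and its names are otherwise assumptions):

    import numpy as np

    def arnoldi(matvec, b, k):
        # Build a k-step Krylov basis for the operator behind `matvec`.
        # Q has orthonormal columns; H is the (k+1) x k Hessenberg matrix.
        n = b.size
        Q = np.zeros((n, k + 1))
        H = np.zeros((k + 1, k))
        Q[:, 0] = b / np.linalg.norm(b)
        for j in range(k):
            v = np.asarray(matvec(Q[:, j]), dtype=float)   # "project a pattern, capture a photo"
            for i in range(j + 1):                          # Gram-Schmidt against earlier vectors
                H[i, j] = Q[:, i] @ v
                v = v - H[i, j] * Q[:, i]
            H[j + 1, j] = np.linalg.norm(v)
            if H[j + 1, j] < 1e-12:                         # breakdown: Krylov space exhausted
                return Q[:, :j + 1], H[:j + 1, :j + 1]
            Q[:, j + 1] = v / H[j + 1, j]
        return Q, H

    A low-rank approximation of the operator for relighting can then be read off from Q and H, and a GMRES-style solve restricted to this small basis approximately inverts it.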

  • Spatiotemporal Closure

    A. Levinshtein, C. Sminchisescu, S. Dickinson
    In Asian Conference on Computer Vision (ACCV), Queenstown, New Zealand, November 2010

    @inproceedings{LevinshteinACCV10,
    author = {A. Levinshtein and C. Sminchisescu and S. Dickinson},
    title = {Spatiotemporal Closure},
    booktitle = {Asian Conference on Computer Vision},
    year = {2010},
    month = {November}
    }

    Spatiotemporal segmentation is an essential task for video
    analysis. The strong interconnection between finding an object's spatial
    support and finding its motion characteristics makes the problem
    particularly challenging. Motivated by closure detection techniques in
    2D images, this paper introduces the concept of spatiotemporal closure.
    Treating the spatiotemporal volume as a single entity, we extract contiguous
    “tubes” whose overall surface is supported by strong appearance
    and motion discontinuities. Formulating our closure cost over a graph of
    spatiotemporal superpixels, we show how it can be globally minimized
    using the parametric maxflow framework in an efficient manner. The
    resulting approach automatically recovers coherent spatiotemporal components,
    corresponding to objects, object parts, and object unions, providing
    a good set of multiscale spatiotemporal hypotheses for high-level
    video analysis.

  • Spatiotemporal Contour Grouping using Abstract Part Models

    P. Sala, D. Macrini, S. Dickinson
    In Asian Conference on Computer Vision (ACCV), Queenstown, New Zealand, November 2010

    @inproceedings{SalaACCV10,
    author = {P. Sala and D. Macrini and S. Dickinson},
    title = {Spatiotemporal Contour Grouping using Abstract Part Models},
    booktitle = {Asian Conference on Computer Vision},
    year = {2010},
    month = {November}
    }

    In recent work [Sala & Dickinson, 2010], we introduced a framework for model-based
    perceptual grouping and shape abstraction using a vocabulary of
    simple part shapes. Given a user-defined vocabulary of simple abstract
    parts, the framework grouped image contours whose abstract shape was
    consistent with one of the part models. While the results showed promise,
    the representational gap between the actual image contours that make
    up an exemplar shape and the contours that make up an abstract part
    model is significant, and an abstraction of a group of image contours
    may be consistent with more than one part model; therefore, while recall
    of ground-truth parts was good, precision was poor. In this paper, we
    address the precision problem by moving the camera and exploiting spatiotemporal
    constraints in the grouping process. We introduce a novel
    probabilistic, graph-theoretic formulation of the problem, in which the
    spatiotemporal consistency of a perceptual group under camera motion
    is learned from a set of training sequences. In a set of comprehensive
    experiments, we demonstrate (not surprisingly) how a spatiotemporal
    framework for part-based perceptual grouping significantly outperforms
    a static image version.

  • Contour Grouping and Abstraction Using Simple Part Models

    P. Sala, S. Dickinson
    In European Conference on Computer Vision (ECCV), Crete, Greece, September 2010

    @inproceedings{SalaECCV10,
    author = {P. Sala and S. Dickinson},
    title = {Contour Grouping and Abstraction Using Simple Part Models},
    booktitle = {European Conference on Computer Vision},
    year = {2010},
    month = {September}
    }

    We address the problem of contour-based perceptual grouping using a
    user-defined vocabulary of simple part models. We train a family of classifiers on
    the vocabulary, and apply them to a region oversegmentation of the input image
    to detect closed contours that are consistent with some shape in the vocabulary.
    Given such a set of consistent cycles, they are both abstracted and categorized
    through a novel application of an active shape model also trained on the vocabulary.
    From an image of a real object, our framework recovers the projections of
    the abstract surfaces that comprise an idealized model of the object. We evaluate
    our framework on a newly constructed dataset annotated with a set of ground
    truth abstract surfaces.

  • Discovering Multipart Appearance Models from Captioned Images

    M. Jamieson, Y. Eskin, A. Fazly, S. Stevenson, S. Dickinson
    In European Conference on Computer Vision (ECCV), Crete, Greece, September 2010

    @inproceedings{JamiesonECCV10,
    author = {M. Jamieson and Y. Eskin and A. Fazly and S. Stevenson and S. Dickinson},
    title = {Discovering Multipart Appearance Models from Captioned Images},
    booktitle = {European Conference on Computer Vision},
    year = {2010},
    month = {September}
    }

    Even a relatively unstructured captioned image set depicting
    a variety of objects in cluttered scenes contains strong correlations
    between caption words and repeated visual structures. We exploit these
    correlations to discover named objects and learn hierarchical models of
    their appearance. Revising and extending a previous technique for finding
    small, distinctive configurations of local features, our method assembles
    these co-occurring parts into graphs with greater spatial extent and flexibility.
    The resulting multipart appearance models remain scale, translation
    and rotation invariant, but are more reliable detectors and provide
    better localization. We demonstrate improved annotation precision and
    recall on datasets to which the non-hierarchical technique was previously
    applied and show extended spatial coverage of detected objects.

  • Human attributes from 3D pose tracking

    L. Sigal, D.J. Fleet, N. Troje, M. Livne
    In European Conference on Computer Vision (ECCV), Heraklion, Greece, September 2010

    @inproceedings{SigalECCV10,
    title = {Human attributes from 3D pose tracking},
    author = {L. Sigal and D.J. Fleet and N. Troje and M. Livne},
    booktitle = {European Conference on Computer Vision},
    year = {2010},
    month = {September}
    }

    We show that, from the output of a simple 3D human pose tracker one
    can infer physical attributes (e.g., gender and weight) and aspects of mental state
    (e.g., happiness or sadness). This task is useful for man-machine communication,
    and it provides a natural benchmark for evaluating the performance of 3D pose
    tracking methods (vs. conventional Euclidean joint error metrics). Based on an extensive
    corpus of motion capture data, with physical and perceptual ground truth,
    we analyze the inference of subtle biologically-inspired attributes from cyclic
    gait data. It is shown that inference is also possible with partial observations of
    the body, and with motions as short as a single gait cycle. Learning models from
    small amounts of noisy video pose data is, however, prone to over-fitting. To mitigate
    this we formulate learning in terms of domain adaptation, for which mocap
    data is used to regularize models for inference from video-based data.

  • Optimal Contour Closure by Superpixel Grouping

    A. Levinshtein, C. Sminchisescu, S. Dickinson
    In European Conference on Computer Vision (ECCV), Crete, Greece, September 2010

    @inproceedings{LevinshteinECCV10,
    author = {A. Levinshtein and C. Sminchisescu and S. Dickinson},
    title = {Optimal Contour Closure by Superpixel Grouping},
    booktitle = {European Conference on Computer Vision},
    year = {2010},
    month = {September}
    }

    Detecting contour closure, i.e., finding a cycle of disconnected
    contour fragments that separates an object from its background,
    is an important problem in perceptual grouping. Searching the entire
    space of possible groupings is intractable, and previous approaches have
    adopted powerful perceptual grouping heuristics, such as proximity and
    co-curvilinearity, to manage the search. We introduce a new formulation
    of the problem, by transforming the problem of finding cycles of contour
    fragments to finding subsets of superpixels whose collective boundary
    has strong edge support in the image. Our cost function, a ratio of a
    novel learned boundary gap measure to area, promotes spatially coherent
    sets of superpixels. Moreover, its properties support a global optimization
    procedure using parametric maxflow. We evaluate our framework by
    comparing it to two leading contour closure approaches, and find that it
    yields improved performance.

  • Optimizing walking controllers with uncertain user inputs and environments

    J. Wang, D.J. Fleet, A. Hertzmann
    In ACM Transactions on Graphics (SIGGRAPH), Vol. 29, Num. 4, July 2010

    @article{WangSIGGRAPH10,
    title = {Optimizing walking controllers with uncertain user inputs and environments},
    author = {J. Wang and D.J. Fleet and A. Hertzmann},
    journal = {ACM Transactions on Graphics},
    year = {2010},
    volume = {29},
    number = {4},
    month = {July}
    }

    We introduce methods for optimizing physics-based walking controllers
    for robustness to uncertainty. Many unknown factors, such
    as external forces, control torques, and user control inputs, cannot
    be known in advance and must be treated as uncertain. These
    variables are represented with probability distributions, and a return
    function scores the desirability of a single motion. Controller
    optimization entails maximizing the expected value of the return,
    which is computed by Monte Carlo methods. We demonstrate examples
    with different sources of uncertainty and task constraints.
    Optimizing control strategies under uncertainty increases robustness
    and produces natural variations in style.
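
    The Monte Carlo estimate of the expected return described above reduces to a short loop (a generic sketch; `simulate`, `sample_uncertainty`, and the controller parameterization are assumptions, not the paper's code):

    def expected_return(simulate, controller, sample_uncertainty, n_samples=100):
        # Draw the uncertain quantities (external pushes, user inputs, ...) from
        # their distributions, simulate one motion each, and average the returns.
        total = 0.0
        for _ in range(n_samples):
            scenario = sample_uncertainty()              # e.g. random push force, target speed
            total += simulate(controller, scenario)      # return value of the resulting motion
        return total / n_samples

    Controller optimization then amounts to maximizing this estimate over the controller's parameters, for instance with a derivative-free search.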

  • Dynamical binary latent variable models for 3D human pose tracking

    G.W. Taylor, L. Sigal, D.J. Fleet, G. Hinton
    In Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, June 2010

    @inproceedings{TaylorCVPR10,
    title = {Dynamical binary latent variable models for 3D human pose tracking},
    author = {G.W. Taylor and L. Sigal and D.J. Fleet and G. Hinton},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2010},
    month = {June}
    }

    We introduce a new class of probabilistic latent variable
    model called the Implicit Mixture of Conditional Restricted
    Boltzmann Machines (imCRBM) for use in human
    pose tracking. Key properties of the imCRBM are as follows:
    (1) learning is linear in the number of training exemplars
    so it can be learned from large datasets; (2) it learns
    coherent models of multiple activities; (3) it automatically
    discovers atomic “movemes”; and (4) it can infer transitions
    between activities, even when such transitions are not
    present in the training set. We describe the model and how
    it is learned and we demonstrate its use in the context of
    Bayesian filtering for multi-view and monocular pose tracking.
    The model handles difficult scenarios including multiple
    activities and transitions among activities. We report
    state-of-the-art results on the HumanEva dataset.

  • Non-rigid structure from locally-rigid motion

    J. Taylor, A.D. Jepson, K.N. Kutulakos
    In Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, June 2010
    Oral presentation

    @inproceedings{TaylorCVPR10b,
    title = {Non-rigid structure from locally-rigid motion},
    author = {J. Taylor and A.D. Jepson and K.N. Kutulakos},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2010},
    month = {June}
    }

    We introduce locally-rigid motion, a general framework for
    solving the M-point, N-view structure-from-motion problem
    for unknown bodies deforming under orthography. The
    key idea is to first solve many local 3-point, N-view rigid
    problems independently, providing a “soup” of specific,
    plausibly rigid, 3D triangles. The main advantage here is
    that the extraction of 3D triangles requires only very weak
    assumptions: (1) deformations can be locally approximated
    by near-rigid motion of three points (i.e., stretching not
    dominant) and (2) local motions involve some generic rotation
    in depth. Triangles from this soup are then grouped
    into bodies, and their depth flips and instantaneous relative
    depths are determined. Results on several sequences,
    both our own and from related work, suggest these conditions
    apply in diverse settings — including very challenging
    ones (e.g., multiple deforming bodies). Our starting point
    is a novel linear solution to 3-point structure from motion,
    a problem for which no general algorithms currently exist.

  • Polynomial shape from shading

    A. Ecker, A.D. Jepson
    In Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 145-152, June 2010

    @inproceedings{EckerCVPR10,
    title = {Polynomial shape from shading},
    author = {A. Ecker and A.D. Jepson},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2010},
    pages = {145–152},
    month = {June}
    }

    We examine the shape from shading problem without boundary conditions as a polynomial system. This view allows, in generic cases, a complete solution for ideal polyhedral objects. For the general case we propose a semidefinite programming relaxation procedure, and an exact line search iterative procedure with a new smoothness term that favors folds at edges. We use this numerical technique to inspect shading ambiguities.

  • Physics-based person tracking using the Anthropomorphic Walker

    M. Brubaker, D.J. Fleet, A. Hertzmann
    In International Journal of Computer Vision (IJCV), Vol. 87, Num. 1, March 2010

    @article{BrubakerIJCV10,
    title = {Physics-based person tracking using the Anthropomorphic Walker},
    author = {M. Brubaker and D.J. Fleet and A. Hertzmann},
    journal = {International Journal of Computer Vision},
    year = {2010},
    volume = {87},
    number = {1},
    month = {March}
    }

    We introduce a physics-based model for 3D person
    tracking. Based on a bio-mechanical characterization of
    lower-body dynamics, the model captures important physical
    properties of bipedal locomotion such as balance and
    ground contact. The model generalizes naturally to variations
    in style due to changes in speed, step-length, and mass,
    and avoids common problems (such as foot-skate) that arise
    with existing trackers. The dynamics comprise a two degree-of-freedom
    representation of human locomotion with inelastic
    ground contact. A stochastic controller generates impulsive
    forces during the toe-off stage of walking, and springlike
    forces between the legs. A higher-dimensional kinematic
    body model is conditioned on the underlying dynamics.
    The combined model is used to track walking people in
    video, including examples with turning, occlusion, and varying
    gait. We also report quantitative monocular and binocular
    tracking results with the HumanEva dataset.

  • Linear sequence-to-sequence alignment

    Rodrigo L. Carceroni, Flavio L. Padua, G. A. Santos, Kiriakos N. Kutulakos
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 32, Num. 2, pp. 304-320, February 2010

    @article{CarceroniPAMI10,
    title = {Linear sequence-to-sequence alignment},
    author = {Rodrigo L. Carceroni and Flavio L. Padua and G. A. Santos and Kiriakos N. Kutulakos},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2010},
    volume = {32},
    number = {2},
    pages = {304–320},
    month = {February}
    }

    In this paper, we consider the problem of estimating the spatiotemporal alignment between N unsynchronized video
    sequences of the same dynamic 3D scene, captured from distinct viewpoints. Unlike most existing methods, which work for N = 2 and
    rely on a computationally intensive search in the space of temporal alignments, we present a novel approach that reduces the problem
    for general N to the robust estimation of a single line in R^N. This line captures all temporal relations between the sequences and can
    be computed without any prior knowledge of these relations. Considering that the spatial alignment is captured by the parameters of
    fundamental matrices, an iterative algorithm is used to refine simultaneously the parameters representing the temporal and spatial
    relations between the sequences. Experimental results with real-world and synthetic sequences show that our method can accurately
    align the videos even when they have large misalignments (e.g., hundreds of frames), when the problem is seemingly ambiguous (e.g.,
    scenes with roughly periodic motion), and when accurate manual alignment is difficult (e.g., due to slow-moving objects).
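
    The reduction above can be sketched as fitting a single line in R^N to candidate frame correspondences (a plain least-squares version; the paper's estimator is robust to outliers, and the function names here are assumptions):

    import numpy as np

    def fit_temporal_line(points):
        # points: (M, N) array; row m holds candidate corresponding frame indices
        # across the N videos. Returns a point on the line and a unit direction.
        p0 = points.mean(axis=0)
        _, _, vt = np.linalg.svd(points - p0)
        d = vt[0]                          # principal direction = best-fit line direction
        return p0, d

    def map_frame(p0, d, ref_frame, ref_index=0, target_index=1):
        # Read off the aligned frame in another video by moving along the line.
        t = (ref_frame - p0[ref_index]) / d[ref_index]
        return p0[target_index] + t * d[target_index]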

  • Discovering Multipart Appearance Models from Captioned Images

    M. Jamieson, Y. Eskin, A. Fazly, S. Stevenson, S. Dickinson
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 32, Num. 1, pp. 148-164, January 2010

    @article{JamiesonPAMI10,
    author = {M. Jamieson and Y. Eskin and A. Fazly and S. Stevenson and S. Dickinson},
    title = {Discovering Multipart Appearance Models from Captioned Images},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2010},
    volume = {32},
    number = {1},
    pages = {148–164},
    month = {January}
    }

    Given an unstructured collection of captioned images
    of cluttered scenes featuring a variety of objects, our goal
    is to simultaneously learn the names and appearances of the
    objects. Only a small fraction of local features within any given
    image are associated with a particular caption word, and captions
    may contain irrelevant words not associated with any image
    object. We propose a novel algorithm that uses the repetition of
    feature neighborhoods across training images and a measure of
    correspondence with caption words to learn meaningful feature
    configurations (representing named objects). We also introduce
    a graph-based appearance model that captures some of the
    structure of an object by encoding the spatial relationships
    among the local visual features. In an iterative procedure we
    use language (the words) to drive a perceptual grouping process
    that assembles an appearance model for a named object. Results
    of applying our method to three data sets in a variety of
    conditions demonstrate that from complex, cluttered, real-world
    scenes with noisy captions, we can learn both the names and
    appearances of objects, resulting in a set of models invariant
    to translation, scale, orientation, occlusion, and minor changes
    in viewpoint or articulation. These named models, in turn, are
    used to automatically annotate new, uncaptioned images, thereby
    facilitating keyword-based image retrieval.

  • Transparent and Specular Object Reconstruction

    Ivo Ihrke, Kiriakos N. Kutulakos, Hendrik P. Lensch, Marcus A. Magnor, Wolfgang Heidrich
    In Computer Graphics Forum, Vol. 29, Num. 8, pp. 2400-2426, 2010

    @article{IhrkeCGF10,
    title = {Transparent and Specular Object Reconstruction},
    author = {Ivo Ihrke and Kiriakos N. Kutulakos and Hendrik P. Lensch and Marcus A. Magnor and Wolfgang Heidrich},
    journal = {Computer Graphics Forum},
    year = {2010},
    volume = {29},
    number = {8},
    pages = {2400–2426}
    }

    This state of the art report covers reconstruction methods for transparent and specular objects or phenomena.
    While the 3D acquisition of opaque surfaces with Lambertian reflectance is a well-studied problem, transparent,
    refractive, specular and potentially dynamic scenes pose challenging problems for acquisition systems. This report
    reviews and categorizes the literature in this field.

    Despite tremendous interest in object digitization, the acquisition of digital models of transparent or specular
    objects is far from being a solved problem. On the other hand, real-world data is in high demand for applications
    such as object modelling, preservation of historic artefacts and as input to data-driven modelling techniques. With
    this report we aim at providing a reference for and an introduction to the field of transparent and specular object
    reconstruction.

    We describe acquisition approaches for different classes of objects. Transparent objects/phenomena that do not
    change the straight ray geometry can be found foremost in natural phenomena. Refraction effects are usually
    small and can be considered negligible for these objects. Phenomena as diverse as fire, smoke, and interstellar
    nebulae can be modelled using a straight ray model of image formation. Refractive and specular surfaces on the
    other hand change the straight rays into usually piecewise linear ray paths, adding additional complexity to the
    reconstruction problem. Translucent objects exhibit significant sub-surface scattering effects rendering traditional
    acquisition approaches unstable. Different classes of techniques have been developed to deal with these problems
    and good reconstruction results can be achieved with current state-of-the-art techniques. However, the approaches
    are still specialized and targeted at very specific object classes. We classify the existing literature and hope to
    provide an entry point to this exciting field.

  • Optimizing walking controllers

    J. Wang, D.J. Fleet, A. Hertzmann
    In SIGGRAPH Asia, Vol. 28, Num. 5, December 2009

    @article{WangSIGGRAPH09,
    title = {Optimizing walking controllers},
    author = {J. Wang and D.J. Fleet and A. Hertzmann},
    journal = {SIGGRAPH Asia},
    year = {2009},
    volume = {28},
    number = {5},
    month = {December}
    }

    This paper describes a method for optimizing the parameters of a physics-based controller for full-body, 3D walking. A modified version of the SIMBICON controller [Yin et al. 2007] is optimized for characters of varying body shape, walking speed and step length. The objective function includes terms for power minimization, angular momentum minimization, and minimal head motion, among others. Together these terms produce a number of important features of natural walking, including active toe-off, near-passive knee swing, and leg extension during swing. We explain the specific form of our objective criteria, and show the importance of each term to walking style. We demonstrate optimized controllers for walking with different speeds, variation in body shape, and in ground slope.
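
    A schematic of the weighted multi-term objective described above, with placeholder term definitions and weights; simulate() is a hypothetical stand-in for the physics simulation of the controller, and the optimizer call is only one possible choice.

        import numpy as np
        from scipy.optimize import minimize

        def walking_objective(params, simulate, w_power=1.0, w_ang_mom=0.1, w_head=0.5):
            """Weighted sum of penalty terms evaluated on a simulated walk.
            simulate(params) is assumed to return per-frame joint torques,
            joint velocities, whole-body angular momentum, and head acceleration."""
            torques, joint_vel, ang_mom, head_acc = simulate(params)
            power = np.mean(np.abs(torques * joint_vel))        # mechanical power term
            momentum = np.mean(np.sum(ang_mom ** 2, axis=-1))   # angular momentum term
            head = np.mean(np.sum(head_acc ** 2, axis=-1))      # head motion term
            return w_power * power + w_ang_mom * momentum + w_head * head

        # derivative-free optimization of the controller parameters (x0 is an initial
        # parameter guess; any black-box optimizer could be substituted here)
        # result = minimize(walking_objective, x0, args=(simulate,), method="Nelder-Mead")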

  • Relations to improve multi-target tracking in an activity recognition system

    C. Manfredotti, E. Messina, D.J. Fleet
    In International Conference on Imaging for Crime Detection and Prevention (ICDP), London, UK, December 2009

    @inproceedings{ManfredottiICICDP09,
    title = {Relations to improve multi-target tracking in an activity recognition system},
    author = {C. Manfredotti and E. Messina and D.J. Fleet},
    booktitle = {International Conference on Imaging for Crime Detection and Prevention},
    year = {2009},
    month = {December}
    }

    The explicit recognition of the relationships between interacting objects can improve the understanding of their dynamics. In this work, we investigate the use of relational dynamic Bayesian networks to represent the interactions between moving objects in a surveillance system. We use a transition model that incorporates first-order logic relations and a two-phase particle filter algorithm in order to directly track relations between targets. We present results on activity recognition for monitoring coastal borders.
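
    For reference, a generic bootstrap particle filter of the kind such trackers build on; the paper's two-phase, relational version additionally maintains first-order relations between targets, which this sketch (with hypothetical init/transition/likelihood callbacks) does not model.

        import numpy as np

        def particle_filter(observations, n_particles, init, transition, likelihood, seed=None):
            """Minimal bootstrap particle filter: propagate, weight, resample."""
            rng = np.random.default_rng(seed)
            particles = init(n_particles, rng)              # (n_particles, state_dim)
            estimates = []
            for z in observations:
                particles = transition(particles, rng)      # sample the dynamics model
                weights = likelihood(z, particles)          # evaluate the observation model
                weights = weights / weights.sum()
                estimates.append(weights @ particles)       # posterior mean estimate
                idx = rng.choice(n_particles, size=n_particles, p=weights)
                particles = particles[idx]                  # resample
            return np.array(estimates)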

  • TurboPixels: Fast Superpixels using Geometric Flows

    A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, K. Siddiqi
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 31, Num. 12, pp. 2290-2297, December 2009

    @article{LevinshteinPAMI09,
    author = {A. Levinshtein and A. Stere and K. Kutulakos and D. Fleet and S. Dickinson and K. Siddiqi},
    title = {TurboPixels: Fast Superpixels using Geometric Flows},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2009},
    volume = {31},
    number = {12},
    pages = {2290–2297},
    month = {December}
    }

    We describe a geometric-flow-based algorithm for computing a dense
    oversegmentation of an image, often referred to as superpixels. It produces
    segments that, on one hand, respect local image boundaries, while, on the other
    hand, limiting undersegmentation through a compactness constraint. It is very fast,
    with complexity that is approximately linear in image size, and can be applied to
    megapixel sized images with high superpixel densities in a matter of minutes. We
    show qualitative demonstrations of high-quality results on several complex
    images. The Berkeley database is used to quantitatively compare its performance
    to a number of oversegmentation algorithms, showing that it yields less
    undersegmentation than algorithms that lack a compactness constraint while
    offering a significant speedup over N-cuts, which does enforce compactness.
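
    TurboPixels itself is a level-set geometric flow; as a rough, hedged stand-in for experimenting with compact superpixel oversegmentations, scikit-image's SLIC (a different algorithm) exposes a comparable compactness trade-off:

        import numpy as np
        from skimage import data, segmentation

        img = data.astronaut()                  # any RGB image
        # larger `compactness` favours regular, compact superpixels over strict
        # boundary adherence, limiting undersegmentation
        labels = segmentation.slic(img, n_segments=400, compactness=10.0, start_label=0)
        overlay = segmentation.mark_boundaries(img, labels)
        print("superpixels:", len(np.unique(labels)))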

  • Benchmarking image segmentation algorithms

    F.J. Estrada, A.D. Jepson
    In International Journal of Computer Vision (IJCV), Vol. 85, Num. 2, pp. 167-181, November 2009

    @article{EstradaIJCV09,
    title = {Benchmarking image segmentation algorithms},
    author = {F.J. Estrada and A.D. Jepson},
    journal = {International Journal of Computer Vision},
    volume = {85},
    number = {2},
    year = {2009},
    pages = {167–181},
    month = {November}
    }

    We present a thorough quantitative evaluation of
    four image segmentation algorithms on images from the
    Berkeley Segmentation Database. The algorithms are evaluated
    using an efficient algorithm for computing precision
    and recall with regard to human ground-truth boundaries.
    We test each segmentation method over a representative set
    of input parameters, and present tuning curves that fully
    characterize algorithm performance over the complete image
    database. We complement the evaluation on the BSD
    with segmentation results on synthetic images. The results
    reported here provide a useful benchmark for current and
    future research efforts in image segmentation.
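
    A simplified version of boundary precision and recall with a pixel distance tolerance, in the spirit of the evaluation described above (the benchmark uses a more careful correspondence procedure and the BSD machinery); the tolerance and names are our choices.

        import numpy as np
        from scipy.ndimage import distance_transform_edt

        def boundary_precision_recall(pred_edges, gt_edges, tol=2.0):
            """pred_edges, gt_edges: boolean boundary maps of equal shape.
            A predicted boundary pixel counts as correct if some ground-truth
            boundary pixel lies within `tol` pixels, and vice versa for recall."""
            dist_to_gt = distance_transform_edt(~gt_edges)      # distance to nearest GT edge
            dist_to_pred = distance_transform_edt(~pred_edges)  # distance to nearest prediction
            precision = np.logical_and(pred_edges, dist_to_gt <= tol).sum() / max(pred_edges.sum(), 1)
            recall = np.logical_and(gt_edges, dist_to_pred <= tol).sum() / max(gt_edges.sum(), 1)
            return precision, recall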

  • The Representation and Matching of Images using Top Points

    F. Demirci, B. Platel, A. Shokoufandeh, L. Florack, S. Dickinson
    In Journal of Mathematical Imaging and Vision (JMIV), Vol. 35, Num. 2, pp. 103-116, October 2009

    @article{DemirciJMIV09,
    author = {F. Demirci and B. Platel and A. Shokoufandeh and L. Florack and S. Dickinson},
    title = {The Representation and Matching of Images using Top Points},
    journal = {Journal of Mathematical Imaging and Vision},
    year = {2009},
    volume = {35},
    number = {2},
    pages = {103–116},
    month = {October}
    }

    In previous work, singular points (or top points)
    in the scale space representation of generic images have
    proven valuable for image matching. In this paper, we propose
    a novel construction that encodes the scale space description
    of top points in the form of a directed acyclic
    graph. This representation allows us to utilize coarse-to-
    fine graph matching algorithms for comparing images represented
    in terms of top point configurations instead of using
    solely the top points and their features in a point matching
    algorithm, as was done previously. The nodes of the graph
    represent the critical paths together with their top points.
    The edge set captures the neighborhood distribution of vertices
    in scale space, and is constructed through a hierarchical
    tessellation of scale space using a Delaunay triangulation
    of the top points. We present a coarse-to-fine many-to-many
    matching algorithm for comparing such graph-based
    representations. The algorithm is based on a metric-tree representation
    of labeled graphs and their low-distortion embeddings
    into normed vector spaces via spherical encoding.
    This is a two-step transformation that reduces the matching
    problem to that of computing a distribution-based distance measure between two such embeddings. To evaluate
    the quality of our representation, four sets of experiments
    are performed. First, the stability of this representation under
    Gaussian noise of increasing magnitude is examined.
    Second, a series of recognition experiments is run on a face
    database. Third, a set of clutter and occlusion experiments
    is performed to measure the robustness of the algorithm.
    Fourth, the algorithm is compared to a leading interest point-based
    framework in an object recognition experiment.
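
    The basic triangulation step underlying the edge-set construction can be sketched with scipy; the hierarchical tessellation, node attributes, and the many-to-many matcher described above are not reproduced here.

        import numpy as np
        from scipy.spatial import Delaunay

        def delaunay_edges(top_points):
            """top_points: (M, D) coordinates of top points, e.g. (x, y, log scale).
            Returns the undirected edge set of their Delaunay triangulation."""
            tri = Delaunay(np.asarray(top_points, dtype=float))
            edges = set()
            for simplex in tri.simplices:
                for i in range(len(simplex)):
                    for j in range(i + 1, len(simplex)):
                        edges.add(tuple(sorted((int(simplex[i]), int(simplex[j])))))
            return edges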

  • Backing Off: Hierarchical decomposition of activity for 3D novel pose recovery

    J. Darby, B. Li, N. Costens, D.J. Fleet, N. Lawrence
    In British Machine Vision Conference (BMVC), London, UK, September 2009

    @inproceedings{DarbyBMVC09,
    title = {Backing Off: Hierarchical decomposition of activity for 3D novel pose recovery},
    author = {J. Darby and B. Li and N. Costens and D.J. Fleet and N. Lawrence},
    booktitle = {British Machine Vision Conference},
    year = {2009},
    month = {September}
    }

    For model-based 3D human pose estimation, even simple models of the human body
    lead to high-dimensional state spaces. Where the class of activity is known a priori, low-dimensional
    activity models learned from training data make possible a thorough and
    efficient search for the best pose. Conversely, searching for solutions in the full state
    space places no restriction on the class of motion to be recovered, but is both difficult
    and expensive. This paper explores a potential middle ground between these approaches,
    using the hierarchical Gaussian process latent variable model to learn activity at different
    hierarchical scales within the human skeleton. We show that by training on full-body
    activity data then descending through the hierarchy in stages and exploring subtrees independently
    of one another, novel poses may be recovered. Experimental results on motion
    capture data and monocular video sequences demonstrate the utility of the approach, and
    comparisons are drawn with existing low-dimensional activity models.

  • Confocal Stereo

    Samuel W. Hasinoff, Kiriakos N. Kutulakos
    In International Journal of Computer Vision (IJCV), Vol. 81, Num. 1, pp. 82-104, September 2009
    Special Issue on ECCV 2006 Best Papers

    @article{HasinoffIJCV09,
    title = {Confocal Stereo},
    author = {Samuel W. Hasinoff and Kiriakos N. Kutulakos},
    journal = {International Journal of Computer Vision},
    year = {2009},
    volume = {81},
    number = {1},
    pages = {82–104},
    month = {September}
    }

    We present confocal stereo, a new method for
    computing 3D shape by controlling the focus and aperture of
    a lens. The method is specifically designed for reconstructing
    scenes with high geometric complexity or fine-scale texture.
    To achieve this, we introduce the confocal constancy
    property, which states that as the lens aperture varies, the
    pixel intensity of a visible in-focus scene point will vary in
    a scene-independent way, that can be predicted by prior radiometric
    lens calibration. The only requirement is that incoming
    radiance within the cone subtended by the largest
    aperture is nearly constant. First, we develop a detailed lens
    model that factors out the distortions in high resolution SLR
    cameras (12MP or more) with large-aperture lenses (e.g.,
    f1.2). This allows us to assemble an A × F aperture-focus
    image (AFI) for each pixel, that collects the undistorted
    measurements over all A apertures and F focus settings. In
    the AFI representation, confocal constancy reduces to color
    comparisons within regions of the AFI, and leads to focus
    metrics that can be evaluated separately for each pixel. We
    propose two such metrics and present initial reconstruction
    results for complex scenes, as well as for a scene with known
    ground-truth shape.
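
    A toy focus metric in the spirit of confocal constancy: given one pixel's A × F aperture-focus image (already normalized by radiometric calibration), score each focus setting by how little intensity varies across apertures. The metrics in the paper operate on AFI regions and are more involved.

        import numpy as np

        def afi_best_focus(afi):
            """afi: (A, F) aperture-focus image for a single pixel.
            Under confocal constancy, the in-focus setting is the column that
            varies least across apertures."""
            variance_per_focus = afi.var(axis=0)        # variance over apertures, per focus
            return int(np.argmin(variance_per_focus))   # index of the best focus setting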

  • Estimating contact dynamics

    M. Brubaker, L. Sigal, D.J. Fleet
    In International Conference on Computer Vision (ICCV), Kyoto, Japan, September 2009

    @inproceedings{BrubakerICCV09,
    title = {Estimating contact dynamics},
    author = {M. Brubaker and L. Sigal and D.J. Fleet},
    booktitle = {International Conference on Computer Vision},
    year = {2009},
    month = {September}
    }

    Motion and interaction with the environment are fundamentally
    intertwined. Few people-tracking algorithms exploit
    such interactions, and those that do assume that surface
    geometry and dynamics are given. This paper concerns
    the converse problem, i.e., the inference of contact and environment
    properties from motion. For 3D human motion,
    with a 12-segment articulated body model, we show how
    one can estimate the forces acting on the body in terms
    of internal forces (joint torques), gravity, and the parameters
    of a contact model (e.g., the geometry and dynamics
    of a spring-based model). This is tested on motion capture
    data and video-based tracking data, with walking, jogging,
    cartwheels, and jumping.
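
    A minimal penalty-based (spring-damper) ground contact model of the kind mentioned above; the stiffness and damping values are placeholders, not estimates from the paper.

        def ground_contact_force(height, velocity, k=5000.0, d=50.0):
            """height: signed height of the contact point above the ground plane;
            velocity: its vertical velocity. Returns the upward contact force."""
            penetration = max(0.0, -height)
            if penetration == 0.0:
                return 0.0                        # not in contact
            force = k * penetration - d * velocity
            return max(0.0, force)                # the ground can only push, never pull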

  • Multiscale Symmetric Part Detection and Grouping

    A. Levinshtein, C. Sminchisescu, S. Dickinson
    In International Conference on Computer Vision (ICCV), Kyoto, Japan, September 2009

    @inproceedings{LevinshteinICCV09,
    author = {A. Levinshtein and C. Sminchisescu and S. Dickinson},
    title = {Multiscale Symmetric Part Detection and Grouping},
    booktitle = {International Conference on Computer Vision},
    year = {2009},
    month = {September}
    }

    Skeletonization algorithms typically decompose an object’s
    silhouette into a set of symmetric parts, offering a
    powerful representation for shape categorization. However,
    having access to an object’s silhouette assumes correct
    figure-ground segmentation, leading to a disconnect with
    the mainstream categorization community, which attempts
    to recognize objects from cluttered images. In this paper,
    we present a novel approach to recovering and grouping
    the symmetric parts of an object from a cluttered scene. We
    begin by using a multiresolution superpixel segmentation to
    generate medial point hypotheses, and use a learned affinity
    function to perceptually group nearby medial points likely
    to belong to the same medial branch. In the next stage,
    we learn higher granularity affinity functions to group the
    resulting medial branches likely to belong to the same object.
    The resulting framework yields a skeletal approximation
    that’s free of many of the instabilities plaguing traditional
    skeletons. More importantly, it doesn’t require a
    closed contour, enabling the application of skeleton-based
    categorization systems to more realistic imagery.

  • Stochastic Image Denoising

    F.J. Estrada, D.J. Fleet, A.D. Jepson
    In British Machine Vision Conference (BMVC), London, UK, pp. 145-152, September 2009
    Best Paper Award

    @inproceedings{EstradaBMVC09,
    title = {Stochastic Image Denoising},
    author = {F.J. Estrada and D.J. Fleet and A.D. Jepson},
    booktitle = {British Machine Vision Conference},
    year = {2009},
    pages = {145–152},
    month = {September}
    }

    We present a novel, probabilistic algorithm for image noise removal. We show that
    suitably constrained random walks over small image neighbourhoods provide a good
    estimate of the appearance of a pixel, and that a stable estimate can be obtained with
    a small number of samples. We provide a thorough evaluation and comparison of the
    proposed algorithm over a large standardized data set. Results show that our method
    consistently outperforms competing approaches for image denoising.
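
    A toy (and deliberately slow) version of the idea: estimate each pixel as the average intensity seen along short random walks whose steps prefer similar-looking neighbours. The walk length, step weighting, and parameters are our choices, not the paper's.

        import numpy as np

        def random_walk_denoise(img, n_walks=20, walk_len=10, sigma=0.1, seed=None):
            """Grayscale image with values in [0, 1]; returns a denoised estimate."""
            rng = np.random.default_rng(seed)
            H, W = img.shape
            out = np.zeros((H, W), dtype=float)
            steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
            for y in range(H):
                for x in range(W):
                    acc = 0.0
                    for _ in range(n_walks):
                        cy, cx = y, x
                        values = [img[cy, cx]]
                        for _ in range(walk_len):
                            nbrs = [(cy + dy, cx + dx) for dy, dx in steps
                                    if 0 <= cy + dy < H and 0 <= cx + dx < W]
                            w = np.array([np.exp(-(img[cy, cx] - img[ny, nx]) ** 2 / (2 * sigma ** 2))
                                          for ny, nx in nbrs])
                            cy, cx = nbrs[rng.choice(len(nbrs), p=w / w.sum())]
                            values.append(img[cy, cx])
                        acc += np.mean(values)
                    out[y, x] = acc / n_walks
            return out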

  • Time-Constrained Photography

    Samuel W. Hasinoff, Kiriakos N. Kutulakos, Fredo Durand, William T. Freeman
    In International Conference on Computer Vision (ICCV), Kyoto, Japan, pp. 333-340, September 2009
    Oral presentation

    @inproceedings{HasinoffICCV09,
    title = {Time-Constrained Photography},
    author = {Samuel W. Hasinoff and Kiriakos N. Kutulakos and Fredo Durand and William T. Freeman},
    booktitle = {International Conference on Computer Vision},
    year = {2009},
    pages = {333–340},
    month = {September}
    }

    Capturing multiple photos at different focus settings is a
    powerful approach for reducing optical blur, but how many
    photos should we capture within a fixed time budget? We
    develop a framework to analyze optimal capture strategies
    balancing the tradeoff between defocus and sensor
    noise, incorporating uncertainty in resolving scene depth.
    We derive analytic formulas for restoration error and use
    Monte Carlo integration over depth to derive optimal capture
    strategies for different camera designs, under a wide
    range of photographic scenarios. We also derive a new upper
    bound on how well spatial frequencies can be preserved
    over the depth of field. Our results show that by capturing
    the optimal number of photos, a standard camera can
    achieve performance at the level of more complex computational
    cameras, in all but the most demanding of cases.
    We also show that computational cameras, although specifically
    designed to improve one-shot performance, generally
    benefit from capturing multiple photos as well.

  • The quantitative characterization of the distinctiveness and robustness of local image descriptors

    G. Carneiro, A.D. Jepson
    In Image and Vision Computing (IVC), Vol. 27, Num. 8, pp. 1143-1156, July 2009

    @article{CarneiroIVC09,
    title = {The quantitative characterization of the distinctiveness and robustness of local image descriptors},
    author = {G. Carneiro and A.D. Jepson},
    journal = {Image and Vision Computing},
    volume = {27},
    number = {8},
    year = {2009},
    pages = {1143–1156},
    month = {July}
    }

    We introduce a new method that characterizes quantitatively local image descriptors in
    terms of their distinctiveness and robustness to geometric transformations and brightness
    deformations. The quantitative characterization of these properties is important for recognition
    systems based on local descriptors because it allows for the implementation of a
    classifier that selects descriptors based on their distinctiveness and robustness properties.
    This classification results in: a) recognition time reduction due to a smaller number of descriptors present in the test image and in the database of model descriptors; b) improvement of the recognition accuracy since only the most reliable descriptors for the recognition task
    are kept in the model and test images; and c) better scalability given the smaller number
    of descriptors per model. Moreover, the quantitative characterization of distinctiveness and
    robustness of local descriptors provides a more accurate formulation of the recognition process,
    which has the potential to improve the recognition accuracy. We show how to train a
    multi-layer perceptron that quickly classifies robust and distinctive local image descriptors.
    A regressor is also trained to provide quantitative models for each descriptor. Experimental
    results show that the use of these trained models not only improves the performance
    of our recognition system, but it also reduces significantly the computation time for the
    recognition process.
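
    A small sketch of the classification step with scikit-learn standing in for the paper's multi-layer perceptron; the descriptors and robustness/distinctiveness labels below are synthetic placeholders.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 128))                 # stand-in local descriptors
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # stand-in robust/distinctive labels

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
        clf.fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))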

  • Shared kernel information embedding for discriminative inference

    L. Sigal, R. Memisevic, D.J. Fleet
    In Computer Vision and Pattern Recognition (CVPR), Miami, USA, June 2009

    @inproceedings{SigalCVPR09,
    title = {Shared kernel information embedding for discriminative inference},
    author = {L. Sigal and R. Memisevic and D.J. Fleet},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2009},
    month = {June}
    }

    Latent Variable Models (LVM), like the Shared-GPLVM
    and the Spectral Latent Variable Model, help mitigate over-
    fitting when learning discriminative methods from small or
    moderately sized training sets. Nevertheless, existing methods
    suffer from several problems: 1) complexity; 2) the
    lack of explicit mappings to and from the latent space; 3)
    an inability to cope with multi-modality; and 4) the lack
    of a well-defined density over the latent space. We propose
    a LVM called the Shared Kernel Information Embedding
    (sKIE). It defines a coherent density over a latent
    space and multiple input/output spaces (e.g., image features
    and poses), and it is easy to condition on a latent state,
    or on combinations of the input/output states. Learning
    is quadratic, and it works well on small datasets. With
    datasets too large to learn a coherent global model, one
    can use sKIE to learn local online models. sKIE permits
    missing data during inference, and partially labelled data
    during learning. We use sKIE for human pose inference.

  • Focal Stack Photography: High-Performance Photography with Conventional Cameras

    Kiriakos N. Kutulakos, Samuel W. Hasinoff
    In Int. Conf. on Machine Vision and Applications (MVA), Yokohama, Japan, pp. 332-337, May 2009
    Invited paper

    @inproceedings{KutulakosMVA09,
    title = {Focal Stack Photography: High-Performance Photography with Conventional Cameras},
    author = {Kiriakos N. Kutulakos and Samuel W. Hasinoff},
    booktitle = {Int. Conf. on Machine Vision and Applications},
    year = {2009},
    pages = {332–337},
    month = {May}
    }

    We look at how a seemingly small change in the photographic
    process — capturing a focal stack at the press
    of a button, instead of a single photo — can boost significantly
    the optical performance of a conventional camera.
    By generalizing the familiar photographic concepts
    of “depth of field” and “exposure time” to the case of
    focal stacks, we show that focal stack photography has
    two performance advantages: (1) it allows us to capture
    a given depth of field much faster than one-shot photography,
    and (2) it leads to higher signal-to-noise ratios
    when capturing wide depths of field with a restricted
    exposure time. We consider these advantages in detail
    and discuss their implications for photography.

  • Skeletal Shape Abstraction from Examples

    F. Demirci, A. Shokoufandeh, S. Dickinson
    In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 31, Num. 5, pp. 944-952, May 2009

    @article{DemirciPAMI09,
    author = {F. Demirci and A. Shokoufandeh and S. Dickinson},
    title = {Skeletal Shape Abstraction from Examples},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year = {2009},
    volume = {31},
    number = {5},
    pages = {944–952},
    month = {May}
    }

    Learning a class prototype from a set of exemplars is an important
    challenge facing researchers in object categorization. Although the problem is
    receiving growing interest, most approaches assume a one-to-one
    correspondence among local features, restricting their ability to learn true
    abstractions of a shape. In this paper, we present a new technique for learning an
    abstract shape prototype from a set of exemplars whose features are in many-to-many
    correspondence. Focusing on the domain of 2D shape, we represent a
    silhouette as a medial axis graph whose nodes correspond to “parts” defined by
    medial branches and whose edges connect adjacent parts. Given a pair of medial
    axis graphs, we establish a many-to-many correspondence between their nodes to
    find correspondences among articulating parts. Based on these correspondences,
    we recover the abstracted medial axis graph along with the positional and radial
    attributes associated with its nodes. We evaluate the abstracted prototypes in the
    context of a recognition task.

  • Object Categorization: Computer and Human Vision Perspectives


    Eds. S. Dickinson, A. Leonardis, B. Schiele, M. Tarr, Cambridge University Press, 2009

    @book{svenBook09,
    title = {Object Categorization: Computer and Human Vision Perspectives},
    editor = {S. Dickinson and A. Leonardis and B. Schiele and M. Tarr},
    publisher = {Cambridge University Press},
    year = {2009}
    }

    This edited volume presents a unique multidisciplinary perspective on the problem of visual object categorization. The result of a series of four highly successful workshops on the topic, the book gathers many of the most distinguished researchers from both computer and human vision to reflect on their experience, identify open problems, and foster a cross-disciplinary discussion with the idea that parallel problems and solutions have arisen in both domains. Twenty-seven of these workshop speakers have contributed chapters, including fourteen from computer vision and thirteen from human vision. Their contributions range from broad perspectives on the problem to more specific approaches, collectively providing important historical context, identifying the major challenges, and presenting recent research results. This multidisciplinary collection is the first of its kind on the topic of object categorization, providing an outstanding context for graduate students and researchers in both computer and human vision.

  • The Evolution of Object Categorization and the Challenge of Image Abstraction

    S. Dickinson
    In Object Categorization: Computer and Human Vision Perspectives, Eds. S. Dickinson, A. Leonardis, B. Schiele, M. Tarr, Cambridge University Press, pp. 1-37, 2009

    @incollection{svenChapter09,
    author = {S. Dickinson},
    title = {The Evolution of Object Categorization and the Challenge of Image Abstraction},
    editor = {S. Dickinson and A. Leonardis and B. Schiele and M. Tarr},
    booktitle = {Object Categorization: Computer and Human Vision Perspectives},
    publisher = {Cambridge University Press},
    year = {2009},
    pages = {1–37}
    }

    The Evolution of Object Categorization and the Challenge of Image Abstraction

  • Video-Based People Tracking

    M. Brubaker, L. Sigal, D.J. Fleet
    In Handbook of Ambient Intelligence and Smart Environments, Eds. H. Nakashima, H. Aghajan, J.C. Augusto, Springer, pp. 57-88, 2009

    @inbook{BrubakerChapter09,
    title = {Video-Based People Tracking},
    author = {M. Brubaker and L. Sigal and D.J. Fleet},
    booktitle = {Handbook of Ambient Intelligence and Smart Environments},
    year = {2009},
    editor = {H. Nakashima and H. Aghajan and J.C. Augusto},
    publisher = {Springer},
    pages = {57–88}
    }

    Video-Based People Tracking.

  • A Generalized Family of Fixed-Radius Distribution-Based Distance Measures for Content-Based fMRI Image Retrieval

    J. Novatnack, N. Cornea, A. Shokoufandeh, D. Silver, S. Dickinson, P. Kantor, B. Bai
    In Pattern Recognition Letters (PRL), Vol. 29, Num. 12, pp. 1726-1732, September 2008

    @article{NovatnackPRL08,
    author = {J. Novatnack and N. Cornea and A. Shokoufandeh and D. Silver and S. Dickinson and P. Kantor and B. Bai},
    title = {A Generalized Family of Fixed-Radius Distribution-Based Distance Measures for Content-Based fMRI Image Retrieval},
    journal = {Pattern Recognition Letters},
    year = {2008},
    volume = {29},
    number = {12},
    pages = {1726–1732},
    month = {September}
    }

    We present a family of distance measures for comparing activation patterns captured in fMRI images. We
    model an fMRI image as a spatial object with varying density, and measure the distance between two
    fMRI images using a novel fixed-radius, distribution-based Earth Mover’s Distance that is computable
    in polynomial time. We also present two simplified formulations for the distance computation whose
    complexity is better than linear programming. The algorithms are robust in the presence of noise, and
    by varying the radius of the distance measures, can tolerate different degrees of within-class deformation.
    Empirical evaluation of the algorithms on a dataset of 430 fMRI images in a content-based image
    retrieval application demonstrates the power and robustness of the distance measures.
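
    For orientation, a plain Earth Mover's Distance between two weighted point sets, solved as a transportation linear program with scipy; the fixed-radius, polynomial-time variants described above modify the ground distances and are not reproduced here.

        import numpy as np
        from scipy.optimize import linprog
        from scipy.spatial.distance import cdist

        def earth_movers_distance(xs, ws, ys, vs):
            """xs: (m, d), ys: (n, d) positions; ws, vs: weights, each summing to 1."""
            C = cdist(xs, ys)                        # ground distances between points
            m, n = C.shape
            A_eq, b_eq = [], []
            for i in range(m):                       # each source ships all of its mass
                row = np.zeros(m * n)
                row[i * n:(i + 1) * n] = 1.0
                A_eq.append(row)
                b_eq.append(ws[i])
            for j in range(n):                       # each sink receives all of its mass
                row = np.zeros(m * n)
                row[j::n] = 1.0
                A_eq.append(row)
                b_eq.append(vs[j])
            res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                          bounds=(0, None), method="highs")
            return res.fun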

  • Retrieving Articulated 3-D Models Using Medial Surfaces

    K. Siddiqi, J. Zhang, D. Macrini, A. Shokoufandeh, S. Bouix, S. Dickinson
    In Machine Vision and Applications (MVA), Vol. 19, Num. 4, pp. 261-275, July 2008

    @article{SiddiqiMVA08,
    author = {K. Siddiqi and J. Zhang and D. Macrini and A. Shokoufandeh and S. Bouix and S. Dickinson},
    title = {Retrieving Articulated 3-D Models Using Medial Surfaces},
    journal = {Machine Vision and Applications},
    year = {2008},
    volume = {19},
    number = {4},
    pages = {261–275},
    month = {July}
    }

    We consider the use of medial surfaces to
    represent symmetries of 3-D objects. This allows for a
    qualitative abstraction based on a directed acyclic graph of
    components and also a degree of invariance to a variety of
    transformations including the articulation of parts. We demonstrate
    the use of this representation for 3-D object model
    retrieval. Our formulation uses the geometric information associated with each node along with an eigenvalue labeling
    of the adjacency matrix of the subgraph rooted at that node.
    We present comparative retrieval results against the techniques
    of shape distributions [Osada et al.] and harmonic
    spheres [Kazhdan et al.] on 425 models from the McGill
    Shape Benchmark, representing 19 object classes. For objects
    with articulating parts, the precision vs recall curves using
    our method are consistently above and to the right of those
    of the other two techniques, demonstrating superior retrieval
    performance. For objects that are rigid, our method gives
    results that compare favorably with these methods.
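
    A simplified node signature in the spirit of the eigenvalue labeling mentioned above: for each node of the DAG, the sum of absolute eigenvalues of the adjacency matrix of the subgraph rooted at that node (a sketch only; the matching and retrieval machinery is not reproduced).

        import numpy as np
        import networkx as nx

        def eigen_labels(dag):
            """dag: a networkx DiGraph (assumed acyclic). Returns {node: label}."""
            labels = {}
            for node in dag.nodes:
                rooted = dag.subgraph({node} | nx.descendants(dag, node))
                A = nx.to_numpy_array(rooted)                        # adjacency of rooted subgraph
                labels[node] = float(np.abs(np.linalg.eigvals(A)).sum())
            return labels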

  • From Skeletons to Bone Graphs: Medial Abstraction for Object Recognition

    D. Macrini, K. Siddiqi, S. Dickinson
    In Computer Vision and Pattern Recognition (CVPR), Anchorage, USA, June 2008

    @inproceedings{MacriniCVPR08,
    author = {D. Macrini and K. Siddiqi and S. Dickinson},
    title = {From Skeletons to Bone Graphs: Medial Abstraction for Object Recognition},
    booktitle = {Computer Vision and Pattern Recognition},
    year = {2008},
    month = {June}
    }

    Medial descriptions, such as shock graphs, have gained
    significant momentum in the shape-based object recognition
    community due to their invariance to translation, rotation,
    scale and articulation and their ability to cope with
    moderate amounts of within-class deformation. While they
    attempt to decompose a shape into a set of parts, this decomposition
    can suffer from ligature-induced instability. In
    particular, the addition of even a small part can have a
    dramatic impact on the representation in the vicinity of its
    attachment. We present an algorithm for identifying and
    representing the ligature structure, and restoring the non-ligature
    structures that remain. This leads to a bone graph,
    a new medial shape abstraction that captures a more intuitive
    notion of an object's parts than a skeleton or a shock
    graph, and offers improved stability and within-class deformation
    invariance. We demonstrate these advantages by
    comparing the use of bone graphs to shock graphs in a set
    of view-based object recognition and pose estimation trials.

  • Learning the Abstract Motion Semantics of Verbs from Captioned Videos

    S. Mathe, A. Fazly, S. Dickinson, S. Stevenson
    In 3rd International Workshop on Semantic Learning and Applications in Multimedia (SLAM), Anchorage, USA, June 2008

    @inproceedings{MatheSLAM08,
    author = {S. Mathe and A. Fazly and S. Dickinson and S. Stevenson},
    title = {Learning the Abstract Motion Semantics of Verbs from Captioned Videos},
    booktitle = {3rd International Workshop on Semantic Learning and Applications in Multimedia},
    year = {2008},
    month = {June}
    }

    We propose an algorithm for learning the semantics of a
    (motion) verb from videos depicting the action expressed by
    the verb, paired with sentences describing the action participants
    and their roles. Acknowledging that commonalities
    among example videos may not exist at the level of the input
    features, our approximation algorithm efficiently searches
    the space of more abstract features for a common solution.
    We test our algorithm by using it to learn the semantics of
    a sample set of verbs; results demonstrate the usefulness
    of the proposed framework, while identifying directions for
    further improvement.

  • Learning Visual Compound Models from Parallel Image-Text Datasets

    J. Moringen, S. Wachsmuth, S. Dickinson, S. Stevenson
    In German Association for Pattern Recognition (DAGM), Munich, Germany, June 2008

    @inproceedings{MoringenDAGM08,
    author = {J. Moringen and S. Wachsmuth and S. Dickinson and S. Stevenson},
    title = {Learning Visual Compound Models from Parallel Image-Text Datasets},
    booktitle = {German Association for Pattern Recognition},
    year = {2008},
    month = {June}
    }

    In this paper, we propose a new approach to learn structured
    visual compound models from shape-based feature descriptions.
    We use captioned text in order to drive the process of grouping boundary
    fragments detected in an image. In the learning framework, we transfer
    several techniques from computational linguistics to the visual domain
    and build on previous work in image annotation. A statistical translation
    model is used in order to establish links between caption words and
    image elements. Then, compounds are iteratively built up by using a
    mutual information measure. Relations between compound elements are
    automatically extracted and increase the discriminability of the visual
    models. We show results on different synthetic and realistic datasets in
    order to validate our approach.
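
    A small sketch of scoring word/visual-element association from co-occurrence over the captioned training set (pointwise mutual information); the statistical translation model and the iterative compound-building procedure described above go further than this.

        import numpy as np

        def pointwise_mutual_information(word_present, element_present):
            """Boolean arrays over the image set: caption contains the word /
            image contains the visual element. Returns PMI in bits."""
            p_w = word_present.mean()
            p_e = element_present.mean()
            p_we = np.logical_and(word_present, element_present).mean()
            if p_we == 0.0 or p_w == 0.0 or p_e == 0.0:
                return float("-inf")
            return float(np.log2(p_we / (p_w * p_e)))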

  • Model-Based Perceptual Grouping and Shape Abstraction

    P. Sala, S. Dickinson
    In 6th IEEE Computer Society Workshop on Perceptual Organization in Computer Vision (POCV), Anchorage, USA, June 2008

    @inproceedings{SalaPOCV08,
    author = {P. Sala and S. Dickinson},
    title = {Model-Based Perceptual Grouping and Shape Abstraction},
    booktitle = {6th IEEE Computer Society Workshop on Perceptual Organization in Computer Vision},
    year = {2008},
    month = {June}
    }

    Contour features are re-emerging in the categorization
    community as it moves from appearance back to shape.
    However, the classical assumption of one-to-one correspondence
    between an extracted image contour and a model
    contour constrains category models to be highly brittle, offering
    little abstraction between image and model. Moreover,
    today’s contour-based models are category-specific,
    offering no mechanism for contour grouping and abstraction
    in the absence of an object prior. We present a novel
    framework for recovering a set of abstract parts from a
    multi-scale contour image. Given a user-specified part vocabulary
    and an image to be analyzed, the system covers
    the image with abstract part models drawn from the vocabulary.
    More importantly, correspondence between image
    contours and part contours is many-to-one, yielding a
    powerful shape abstraction mechanism. We illustrate the
    strengths and weaknesses of this work in progress on a set
    of anecdotal scenes.

  • 3-D Model Retrieval Using Medial Surfaces

    K. Siddiqi, J. Zhang, D. Macrini, S. Dickinson, A. Shokoufandeh
    In Medial Representations: Mathematics, Algorithms and Applications, Eds. K. Siddiqi, S. Pizer, Kluwer, Boston, pp. 309-326, 2008

    @incollection{svenChapter08,
    author = {K. Siddiqi and J. Zhang and D. Macrini and S. Dickinson and A. Shokoufandeh},
    title = {3-D Model Retrieval Using Medial Surfaces},
    editor = {K. Siddiqi and S. Pizer},
    booktitle = {Medial Representations: Mathematics, Algorithms and Applications},
    publisher = {Kluwer, Boston},
    year = {2008},
    pages = {309–326}
    }

    Graphs derived from medial representations have been used for 2D object
    matching and retrieval with considerable success (Pelillo et al., 1999; Siddiqi et al.,
    1999b; Sebastian et al., 2001). In this chapter we consider the use of graphs
    derived from medial surfaces for 3D object matching and retrieval. The medial
    representation allows for a qualitative abstraction based on a directed acyclic graph of
    components and also a degree of invariance to a variety of transformations including
    the articulation of parts. The formulation discussed in this chapter uses the geometric
    information associated with each node along with an eigenvalue labeling of
    the adjacency matrix of the subgraph rooted at that node. Comparative retrieval
    results are presented against the techniques of shape distributions (Osada et al.,
    2002) and harmonic spheres (Kazhdan et al., 2003b) on 425 models representing 19
    object classes. These results demonstrate that medial surface based graph matching
    outperforms these techniques for objects with articulating parts.

Posted on September 27, 2015