Publications

last updated: November 30, 2020

Journal Articles

Sharma_J2_TBIOM

Video Face Clustering with Self-Supervised Representation Learning

Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz and Rainer Stiefelhagen
IEEE Transactions on Biometrics, Behavior, and Identity Science (T-BIOM vol. 2 (2), pp. 145-157), 2020.

PDF DOI

@article{Sharma_J2_TBIOM,
author = {Vivek Sharma and Makarand Tapaswi and M. Saquib Sarfraz and Rainer Stiefelhagen},
title = {{Video Face Clustering with Self-Supervised Representation Learning}},
year = {2020},
journal = {IEEE Transactions on Biometrics, Behavior, and Identity Science (T-BIOM)},
volume = {2},
pages = {145-157},
doi = {10.1109/TBIOM.2019.2947264}
}
Characters are a key component of understanding the story conveyed in TV series and movies. With the rise of advanced deep face models, identifying face images may seem like a solved problem. However, as face detectors get better, clustering and identification need to be revisited to address increasing diversity in facial appearance. In this paper, we propose unsupervised methods for feature refinement with application to video face clustering. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and can thus also be applied to image collections. We evaluate our methods on three video face clustering datasets. Thorough experiments including generalization studies show that our methods outperform current state-of-the-art methods on all datasets. The datasets and code are available at https://github.com/vivoutlaw/SSIAM.


Tapaswi_J1_PlotRetrieval

Aligning Plot Synopses to Videos for Story-based Retrieval

Makarand Tapaswi, Martin Bäuml and Rainer Stiefelhagen
International Journal of Multimedia Information Retrieval (IJMIR vol. 4 (1), pp. 3-16), 2015.

PDF Project DOI

@article{Tapaswi_J1_PlotRetrieval,
author = {Makarand Tapaswi and Martin Bäuml and Rainer Stiefelhagen},
title = {{Aligning Plot Synopses to Videos for Story-based Retrieval}},
year = {2015},
journal = {International Journal of Multimedia Information Retrieval (IJMIR)},
volume = {4},
pages = {3-16},
doi = {10.1007/s13735-014-0065-9}
}
We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human written, crowdsourced descriptions -- plot synopses -- of the story conveyed in the video. We obtain such synopses from websites such as Wikipedia and propose various methods to align each sentence of the plot to shots in the video. Thus, the semantic story-based video retrieval problem is transformed into a much simpler text-based search. Finally, we return the set of shots aligned to the sentences as the video snippet corresponding to the query. The alignment is performed by first computing a similarity score between every shot and sentence through cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
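
For intuition, the monotonic alignment described above can be sketched with a few lines of dynamic programming, assuming a precomputed sentence-shot similarity matrix; the scoring cues and the exact optimization in the paper are richer than this toy version.

import numpy as np

def align_shots_to_sentences(sim):
    """Monotonically assign each shot (column) to a sentence (row).

    sim : (num_sentences, num_shots) array of shot-sentence similarity
          scores (e.g. from character identities and keyword matches).
    Returns the sentence index chosen for every shot, constrained to be
    non-decreasing over time.
    """
    S, T = sim.shape
    dp = np.full((S, T), -np.inf)        # best score with shot t assigned to sentence s
    back = np.zeros((S, T), dtype=int)   # sentence chosen for the previous shot

    dp[:, 0] = sim[:, 0]                 # the first shot may start at any sentence
    for t in range(1, T):
        best_prev, best_idx = -np.inf, 0
        for s in range(S):               # running max over the previous column keeps this O(S*T)
            if dp[s, t - 1] > best_prev:
                best_prev, best_idx = dp[s, t - 1], s
            dp[s, t] = sim[s, t] + best_prev
            back[s, t] = best_idx

    # Recover the alignment by backtracking from the best final state.
    assignment = [0] * T
    assignment[-1] = int(np.argmax(dp[:, -1]))
    for t in range(T - 1, 0, -1):
        assignment[t - 1] = int(back[assignment[t], t])
    return assignment

# Example: 3 plot sentences, 5 shots with random similarities.
rng = np.random.default_rng(0)
print(align_shots_to_sentences(rng.random((3, 5))))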


Conference Proceedings

Petrik2020_Real2Sim

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

Vladimir Petrik*, Makarand Tapaswi*, Ivan Laptev and Josef Sivic
Conference on Robot Learning (CoRL Poster, Acceptance rate=34%), Virtual, Nov 2020.

PDF Project arXiv code

@inproceedings{Petrik2020_Real2Sim,
author = {Vladimir Petrik* and Makarand Tapaswi* and Ivan Laptev and Josef Sivic},
title = {{Learning Object Manipulation Skills via Approximate State Estimation from Real Videos}},
year = {2020},
booktitle = {Conference on Robot Learning (CoRL)},
month = {Nov},
doi = {}
}
Humans are adept at learning new tasks by watching a few instructional videos. On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain. In this paper, we explore a method that facilitates learning object manipulation skills directly from videos. Leveraging recent advances in 2D visual recognition and differentiable rendering, we develop an optimization based method to estimate a coarse 3D state representation for the hand and the manipulated object(s) without requiring any supervision. We use these trajectories as dense rewards for an agent that learns to mimic them through reinforcement learning. We evaluate our method on simple single- and two-object actions from the Something-Something dataset. Our approach allows an agent to learn actions from single videos, while watching multiple demonstrations makes the policy more robust. We show that policies learned in a simulated environment can be easily transferred to a real robot.
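
A rough, hypothetical illustration of trajectory-derived dense rewards (the paper's state representation and reward shaping are more involved): reward the agent for keeping its simulated hand and object poses close to the coarse 3D trajectory estimated from the video at each time step.

import numpy as np

def dense_reward(sim_state, video_state, sigma=0.05):
    """Toy dense reward: closeness of the simulated state to the state
    estimated from the video at the same time step. Flattened 3D positions
    of the hand and object are a simplifying assumption, not the paper's
    exact state representation.
    """
    dist = np.linalg.norm(sim_state - video_state)
    return float(np.exp(-dist ** 2 / (2 * sigma ** 2)))   # in (0, 1], peaked at a perfect match

# Per-step rewards along a trajectory are then accumulated by the RL agent.
traj_sim = np.random.rand(50, 6)                      # 50 steps, hand (3D) + object (3D)
traj_video = traj_sim + 0.02 * np.random.randn(50, 6)
episode_return = sum(dense_reward(s, v) for s, v in zip(traj_sim, traj_video))
print(round(episode_return, 2))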


Kukleva2020_MGIntRel

Learning Interactions and Relationships between Movie Characters

Anna Kukleva, Makarand Tapaswi and Ivan Laptev
Conference on Computer Vision and Pattern Recognition (CVPR Oral, Acceptance rate=5.0%), Virtual, Jun 2020.

PDF Project DOI arXiv code

@inproceedings{Kukleva2020_MGIntRel,
author = {Anna Kukleva and Makarand Tapaswi and Ivan Laptev},
title = {{Learning Interactions and Relationships between Movie Characters}},
year = {2020},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun},
doi = {10.1109/CVPR42600.2020.00987}
}
Interactions between people are often governed by their relationships. On the flip side, social relationships are built upon several interactions. Two strangers are more likely to greet and introduce themselves while becoming friends over time. We are fascinated by this interplay between interactions and relationships, and believe that it is an important aspect of understanding social situations. In this work, we propose neural models to learn and jointly predict interactions, relationships, and the pair of characters that are involved. We note that interactions are informed by a mixture of visual and dialog cues, and present a multimodal architecture to extract meaningful information from them. Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels. We evaluate our models on the MovieGraphs dataset and show the impact of modalities, use of longer temporal context for predicting relationships, and achieve encouraging performance using weak labels as compared with ground-truth labels.


Sharma2020_CCL

Clustering based Contrastive Learning for Improving Face Representations

Vivek Sharma, Makarand Tapaswi, Saquib Sarfraz and Rainer Stiefelhagen
IEEE International Conference on Automatic Face and Gesture Recognition (FG Poster, Acceptance rate=44.0%), Buenos Aires, Argentina, May 2020.

PDF

@inproceedings{Sharma2020_CCL,
author = {Vivek Sharma and Makarand Tapaswi and Saquib Sarfraz and Rainer Stiefelhagen},
title = {{Clustering based Contrastive Learning for Improving Face Representations}},
year = {2020},
booktitle = {IEEE International Conference on Automatic Face and Gesture Recognition (FG)},
month = {May},
doi = {}
}
A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features. We demonstrate our method on the challenging task of learning representations for video face clustering. Through several ablation studies, we analyze the impact of creating pair-wise positive and negative labels from different sources. Experiments on three challenging video face clustering datasets: BBT-0101, BF-0502, and ACCIO show that CCL achieves a new state-of-the-art on all datasets.
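
As a loose sketch (not the paper's exact pipeline), positive and negative pairs can be harvested from cluster labels together with the video constraint that all frames of one face track show the same person:

from itertools import combinations

def contrastive_pairs(cluster_ids, track_ids):
    """Toy pair generation in the spirit of clustering-based contrastive
    learning: samples in the same cluster (or the same face track) form
    positive pairs, samples from different clusters form negative pairs.
    The actual CCL sampling and loss are more elaborate.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_track = track_ids[i] == track_ids[j]      # video constraint: one track, one person
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_track or same_cluster:
            positives.append((i, j))
        else:
            negatives.append((i, j))
    return positives, negatives

# Toy example: 6 face descriptors grouped into 3 clusters, two of which share a track.
pos, neg = contrastive_pairs(cluster_ids=[0, 0, 1, 1, 2, 2],
                             track_ids=[10, 10, 11, 12, 13, 14])
print(len(pos), "positive pairs,", len(neg), "negative pairs")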


Tapaswi2019_BallClustering

Video Face Clustering with Unknown Number of Clusters

Makarand Tapaswi, Marc T. Law and Sanja Fidler
International Conference on Computer Vision (ICCV Poster, Acceptance rate=25.0%), Seoul, Korea, Oct 2019.

PDF DOI arXiv code

@inproceedings{Tapaswi2019_BallClustering,
author = {Makarand Tapaswi and Marc T. Law and Sanja Fidler},
title = {{Video Face Clustering with Unknown Number of Clusters}},
year = {2019},
booktitle = {International Conference on Computer Vision (ICCV)},
month = {Oct},
doi = {10.1109/ICCV.2019.00513}
}
Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing. We address the challenging problem of clustering face tracks based on their identity. Different from previous work in this area, we choose to operate in a realistic and difficult setting where: (i) the number of characters is not known a priori; and (ii) face tracks belonging to minor or background characters are not discarded. To this end, we propose Ball Cluster Learning (BCL), a supervised approach to carve the embedding space into balls of equal size, one for each cluster. The learned ball radius is easily translated to a stopping criterion for iterative merging algorithms. This gives BCL the ability to estimate the number of clusters as well as their assignment, achieving promising results on commonly used datasets. We also present a thorough discussion of how existing metric learning literature can be adapted for this task.
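
To illustrate how a learned ball radius can act as a stopping criterion, here is a hypothetical sketch using off-the-shelf agglomerative clustering from SciPy; the linkage choice and distance threshold are illustrative assumptions, not the paper's exact recipe.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_with_learned_radius(embeddings, radius):
    """Sketch: hierarchical agglomerative clustering that stops merging once
    cluster distances exceed a threshold derived from a learned ball radius,
    so the number of clusters need not be known in advance.
    """
    Z = linkage(embeddings, method="average", metric="euclidean")
    # Points of two different identities should lie in balls roughly 2*radius apart.
    return fcluster(Z, t=2.0 * radius, criterion="distance")

# Toy data: three well-separated identities, number of clusters not given a priori.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 8)) for c in (0.0, 2.0, 4.0)])
print(np.unique(cluster_with_learned_radius(X, radius=0.4)))   # recovers three clusters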


Miech2019_HowTo100M

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev and Josef Sivic
International Conference on Computer Vision (ICCV Poster, Acceptance rate=25.0%), Seoul, Korea, Oct 2019.

PDF Project DOI arXiv code

@inproceedings{Miech2019_HowTo100M,
author = {Antoine Miech and Dimitri Zhukov and Jean-Baptiste Alayrac and Makarand Tapaswi and Ivan Laptev and Josef Sivic},
title = {{HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips}},
year = {2019},
booktitle = {International Conference on Computer Vision (ICCV)},
month = {Oct},
doi = {10.1109/ICCV.2019.00272}
}
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models are publicly available.
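
A minimal sketch of a joint text-video embedding trained with a max-margin ranking loss over clip-narration pairs; the dimensions, projection heads, and loss details below are simplifying assumptions rather than the published HowTo100M model.

import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Two projection heads map pre-extracted video and text features into a
    shared space where matching clip/narration pairs should score higher than
    mismatched ones."""
    def __init__(self, video_dim=2048, text_dim=300, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feats, text_feats):
        v = nn.functional.normalize(self.video_proj(video_feats), dim=-1)
        t = nn.functional.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.T                            # (batch, batch) similarity matrix

def ranking_loss(scores, margin=0.2):
    """Max-margin ranking loss: the diagonal (matching pairs) should beat every
    other entry in its row and column by at least `margin`."""
    pos = scores.diag().unsqueeze(1)
    cost_rows = (margin + scores - pos).clamp(min=0)
    cost_cols = (margin + scores - pos.T).clamp(min=0)
    eye = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_rows.masked_fill(eye, 0).mean() + cost_cols.masked_fill(eye, 0).mean()

model = JointEmbedding()
scores = model(torch.randn(8, 2048), torch.randn(8, 300))
print(ranking_loss(scores).item())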


Sharma2019_FaceCluster

Self-Supervised Learning of Face Representations for Video Face Clustering

Vivek Sharma, Makarand Tapaswi, Saquib Sarfraz and Rainer Stiefelhagen
IEEE International Conference on Automatic Face and Gesture Recognition (FG Oral, Acceptance rate=20.1%, Best Paper Award), Lille, France, May 2019.

PDF DOI arXiv code

@inproceedings{Sharma2019_FaceCluster,
author = {Vivek Sharma and Makarand Tapaswi and Saquib Sarfraz and Rainer Stiefelhagen},
title = {{Self-Supervised Learning of Face Representations for Video Face Clustering}},
year = {2019},
booktitle = {IEEE International Conference on Automatic Face and Gesture Recognition (FG)},
month = {May},
doi = {10.1109/FG.2019.8756609}
}
Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering is lacking a common benchmark as current works are often evaluated with different metrics and/or different sets of face tracks. Our datasets and code will be made available for enabling fair comparisons in the future.


Kim2019_PMN

Visual Reasoning by Progressive Module Networks

Seung Wook Kim, Makarand Tapaswi and Sanja Fidler
International Conference on Learning Representations (ICLR Poster, Acceptance rate=32.9%), New Orleans, LA, USA, May 2019.

PDF Project arXiv demo

@inproceedings{Kim2019_PMN,
author = {Seung Wook Kim and Makarand Tapaswi and Sanja Fidler},
title = {{Visual Reasoning by Progressive Module Networks}},
year = {2019},
booktitle = {International Conference on Learning Representations (ICLR)},
month = {May},
doi = {}
}
Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn – most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performances in all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline.
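
A toy sketch of the calling pattern, assuming hypothetical sub-modules: the parent module only sees its sub-modules through learned queries and their outputs, and composes those outputs to solve its own task. The real PMN modules, queries, and tasks are far richer.

import torch
import torch.nn as nn

class ParentModule(nn.Module):
    """Parent task solver that treats two lower-level solvers as black boxes,
    communicating with them only via a query and an output."""
    def __init__(self, object_module, attribute_module, dim=64):
        super().__init__()
        self.object_module = object_module        # solver for a simpler task (black box)
        self.attribute_module = attribute_module  # another lower-level solver
        self.query_obj = nn.Linear(dim, dim)      # learned query transformations
        self.query_att = nn.Linear(dim, dim)
        self.combine = nn.Linear(2 * dim, dim)    # composes the sub-module outputs

    def forward(self, x):
        out_obj = self.object_module(self.query_obj(x))
        out_att = self.attribute_module(self.query_att(x))
        return self.combine(torch.cat([out_obj, out_att], dim=-1))

dim = 64
parent = ParentModule(nn.Linear(dim, dim), nn.Linear(dim, dim), dim)
print(parent(torch.randn(2, dim)).shape)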


Vicol2018_MovieGraphs

MovieGraphs: Towards Understanding Human-Centric Situations from Videos

Paul Vicol, Makarand Tapaswi, Lluis Castrejon and Sanja Fidler
Conference on Computer Vision and Pattern Recognition (CVPR Spotlight, Acceptance rate=8.7%), Salt Lake City, UT, USA, Jun. 2018.

PDF Project DOI arXiv spotlight suppl. material

@inproceedings{Vicol2018_MovieGraphs,
author = {Paul Vicol and Makarand Tapaswi and Lluis Castrejon and Sanja Fidler},
title = {{MovieGraphs: Towards Understanding Human-Centric Situations from Videos}},
year = {2018},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2018.00895}
}
There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to read people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (i.e., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with time stamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We propose a method for querying videos and text with graphs, and show that: 1) our graphs contain rich and sufficient information to summarize and localize each scene; and 2) subgraphs allow us to describe situations at an abstract level and retrieve multiple semantically relevant situations. We also propose methods for interaction understanding via ordering, and reason understanding. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.


Zhou2017_Movie4D

Now You Shake Me: Towards Automatic 4D Cinema

Yuhao Zhou, Makarand Tapaswi and Sanja Fidler
Conference on Computer Vision and Pattern Recognition (CVPR Spotlight, Acceptance rate=8.7%), Salt Lake City, UT, USA, Jun. 2018.

PDF Project DOI spotlight suppl. material

@inproceedings{Zhou2017_Movie4D,
author = {Yuhao Zhou and Makarand Tapaswi and Sanja Fidler},
title = {{Now You Shake Me: Towards Automatic 4D Cinema}},
year = {2018},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2018.00775}
}
We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies. These include effects such as physical interactions, water splashing, light, and shaking, and are grounded to either a character in the scene or the camera. We collect a new dataset referred to as the Movie4D dataset which annotates over 9K effects in 63 movies. We propose a Conditional Random Field model atop a neural network that brings together visual and audio information, as well as semantics in the form of person tracks. Our model further exploits correlations of effects between different characters in the clip as well as across movie threads. We propose effect detection and classification as two tasks, and present results along with ablation studies on our dataset, paving the way towards 4D cinema in everyone's homes.


RuiyuLi2017_SituGGNN

Situation Recognition with Graph Neural Networks

Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun and Sanja Fidler
International Conference on Computer Vision (ICCV Poster, Acceptance rate=28.9%), Venice, Italy, Oct. 2017.

PDF DOI arXiv

@inproceedings{RuiyuLi2017_SituGGNN,
author = {Ruiyu Li and Makarand Tapaswi and Renjie Liao and Jiaya Jia and Raquel Urtasun and Sanja Fidler},
title = {{Situation Recognition with Graph Neural Networks}},
year = {2017},
booktitle = {International Conference on Computer Vision (ICCV)},
month = {Oct.},
doi = {10.1109/ICCV.2017.448}
}
We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action), and fill its semantic roles such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g. attacking has weapon), and each role can take on many possible values (nouns). We propose a model based on Graph Neural Networks that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph. Experiments with different graph connectivities show that our approach that propagates information between roles significantly outperforms existing work, as well as multiple baselines. We obtain roughly 3-5% improvement over previous work in predicting the full situation. We also provide a thorough qualitative analysis of our model and influence of different roles in the verbs.
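
A hypothetical sketch of one propagation step over a fully connected graph of role nodes; the gating, edge structure, and output classifiers of the published model differ in detail.

import torch
import torch.nn as nn

class RolePropagation(nn.Module):
    """One message-passing step over a fully connected graph of nodes (e.g. a
    verb node plus its semantic roles): each node is updated from the averaged
    messages of all other nodes via a GRU cell."""
    def __init__(self, dim=128):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_states):
        # node_states: (num_nodes, dim); every node receives messages from all others.
        n = node_states.size(0)
        msgs = self.message(node_states)                        # (n, dim)
        agg = (msgs.sum(dim=0, keepdim=True) - msgs) / (n - 1)  # mean over the other nodes
        return self.update(agg, node_states)

layer = RolePropagation(dim=128)
roles = torch.randn(6, 128)      # e.g. one verb node + five role nodes
for _ in range(3):               # a few propagation steps
    roles = layer(roles)
print(roles.shape)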


Tapaswi2016_MovieQA

MovieQA: Understanding Stories in Movies through Question-Answering

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun and Sanja Fidler
Conference on Computer Vision and Pattern Recognition (CVPR Spotlight, Acceptance rate=9.7%), Las Vegas, NV, USA, Jun. 2016.

PDF Project DOI arXiv code spotlight presentation poster-1 poster-2

@inproceedings{Tapaswi2016_MovieQA,
author = {Makarand Tapaswi and Yukun Zhu and Rainer Stiefelhagen and Antonio Torralba and Raquel Urtasun and Sanja Fidler},
title = {{MovieQA: Understanding Stories in Movies through Question-Answering}},
year = {2016},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2016.501}
}
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain.


AlHalah2016_AssociationPrediction

Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning

Ziad Al-Halah, Makarand Tapaswi and Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR Poster, Acceptance rate=29.9%), Las Vegas, NV, USA, Jun. 2016.

PDF DOI arXiv data

@inproceedings{AlHalah2016_AssociationPrediction,
author = {Ziad Al-Halah and Makarand Tapaswi and Rainer Stiefelhagen},
title = {{Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning}},
year = {2016},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2016.643}
}
Collecting training images for all visual categories is not only expensive but also impractical. Zero-shot learning (ZSL), especially using attributes, offers a pragmatic solution to this problem. However, at test time most attribute-based methods require a full description of attribute associations for each unseen class. Providing these associations is time consuming and often requires domain specific knowledge. In this work, we aim to carry out attribute-based zero-shot classification in an unsupervised manner. We propose an approach to learn relations that couples class embeddings with their corresponding attributes. Given only the name of an unseen class, the learned relationship model is used to automatically predict the class-attribute associations. Furthermore, our model facilitates transferring attributes across data sets without additional effort. Integrating knowledge from multiple sources results in a significant additional improvement in performance. We evaluate on two public data sets: Animals with Attributes and aPascal/aYahoo. Our approach outperforms state-of-the-art methods in both predicting class-attribute associations and unsupervised ZSL by a large margin.


Haurilet2016_SubttOnly

Naming TV Characters by Watching and Analyzing Dialogs

Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah and Rainer Stiefelhagen
Winter Conference on Applications of Computer Vision (WACV Acceptance rate=42.3%), Lake Placid, NY, USA, Mar. 2016.

PDF DOI presentation poster

@inproceedings{Haurilet2016_SubttOnly,
author = {Monica-Laura Haurilet and Makarand Tapaswi and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{Naming TV Characters by Watching and Analyzing Dialogs}},
year = {2016},
booktitle = {Winter Conference on Applications of Computer Vision (WACV)},
month = {Mar.},
doi = {10.1109/WACV.2016.7477560}
}
Person identification in TV series has been a popular research topic over the last decade. In this area, most approaches either use manually annotated data or extract character supervision from a combination of subtitles and transcripts. However, both approaches have key drawbacks that hinder application of these methods at a large scale -- manual annotation is expensive and transcripts are often hard to obtain. We investigate the topic of automatically labeling all character appearances in TV series using information obtained solely from subtitles. This task is extremely difficult as the dialogs between characters provide very sparse and weakly supervised data. We address these challenges by exploiting recent advances in face descriptors and Multiple Instance Learning methods. We propose methods to create MIL bags and evaluate and discuss several MIL techniques. The best combination achieves an average precision over 80% on three diverse TV series. We demonstrate that only using subtitles provides good results on identifying characters in TV series and wish to encourage the community towards this problem.
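
A toy sketch of MIL bag creation from subtitles alone (the names, timings, and matching rule below are illustrative): face tracks that temporally overlap a subtitle mentioning a character name are collected into a bag for that name. The bag-creation strategies evaluated in the paper are more careful about how and when a mentioned name relates to the visible faces.

from collections import defaultdict

def build_mil_bags(face_tracks, subtitles, character_names):
    """Collect weakly labeled bags: every face track that overlaps a subtitle
    mentioning a character name is added to that character's bag.

    face_tracks : list of (track_id, start, end)
    subtitles   : list of (start, end, text)
    """
    bags = defaultdict(set)
    for sub_start, sub_end, text in subtitles:
        mentioned = [n for n in character_names if n.lower() in text.lower()]
        for track_id, t_start, t_end in face_tracks:
            overlaps = t_start < sub_end and sub_start < t_end
            if overlaps:
                for name in mentioned:
                    bags[name].add(track_id)
    return bags

tracks = [(1, 0.0, 4.0), (2, 3.0, 8.0)]
subs = [(1.0, 2.5, "Hey Sheldon, are you coming?"), (5.0, 6.0, "Leonard, wait!")]
print(build_mil_bags(tracks, subs, ["Sheldon", "Leonard"]))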


Ghaleb2015_Accio

Accio: A Data Set for Face Track Retrieval in Movies Across Age

Esam Ghaleb, Makarand Tapaswi, Ziad Al-Halah, Hazım Kemal Ekenel and Rainer Stiefelhagen
International Conference on Multimedia Retrieval (ICMR Short paper, Poster, Acceptance rate=40.5%), Shanghai, China, Jun. 2015.

PDF DOI poster face tracks features (9GB+)

@inproceedings{Ghaleb2015_Accio,
author = {Esam Ghaleb and Makarand Tapaswi and Ziad Al-Halah and Hazım Kemal Ekenel and Rainer Stiefelhagen},
title = {{Accio: A Data Set for Face Track Retrieval in Movies Across Age}},
year = {2015},
booktitle = {International Conference on Multimedia Retrieval (ICMR)},
month = {Jun.},
doi = {10.1145/2671188.2749296}
}
Video face recognition is a very popular task and has come a long way. The primary challenges such as illumination, resolution and pose are well studied through multiple data sets. However, there are no video-based data sets dedicated to studying the effects of aging on facial appearance. We present a challenging face track data set, Harry Potter Movies Aging Data set (Accio), to study and develop age invariant face recognition methods for videos. Our data set not only has strong challenges of pose, illumination and distractors, but also spans a period of ten years providing substantial variation in facial appearance. We propose two primary tasks: within and across movie face track retrieval; and two protocols which differ in their freedom to use external data. We present baseline results for the retrieval performance using a state-of-the-art face track descriptor. Our experiments show clear trends of reduction in performance as the age gap between the query and database increases. We will make the data set publicly available for further exploration in age-invariant video face recognition.


Tapaswi2015_Book2Movie

Book2Movie: Aligning Video scenes with Book chapters

Makarand Tapaswi, Martin Bäuml and Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR Poster, Acceptance rate=28.4%), Boston, MA, USA, Jun. 2015.

PDF Project DOI data set ext. abstract poster-1 poster-2 suppl. material

@inproceedings{Tapaswi2015_Book2Movie,
author = {Makarand Tapaswi and Martin Bäuml and Rainer Stiefelhagen},
title = {{Book2Movie: Aligning Video scenes with Book chapters}},
year = {2015},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2015.7298792}
}
Film adaptations of novels often visually display in a few shots what is described in many pages of the source novel. In this paper we present a new problem: to align book chapters with video scenes. Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips. We propose an efficient method to compute an alignment between book chapters and video scenes using matching dialogs and character identities as cues. A major consideration is to allow the alignment to be non-sequential. Our suggested shortest path based approach deals with the non-sequential alignments and can be used to determine whether a video scene was part of the original book. We create a new data set involving two popular novel-to-film adaptations with widely varying properties and compare our method against other text-to-video alignment baselines. Using the alignment, we present a qualitative analysis of describing the video through rich narratives obtained from the novel.


Tapaswi2015_SpeakingFace

Improved Weak Labels using Contextual Cues for Person Identification in Videos

Makarand Tapaswi, Martin Bäuml and Rainer Stiefelhagen
International Conference on Automatic Face and Gesture Recognition (FG Poster, Acceptance rate=38.0%), Ljubljana, Slovenia, May 2015.

PDF DOI poster code

@inproceedings{Tapaswi2015_SpeakingFace,
author = {Makarand Tapaswi and Martin Bäuml and Rainer Stiefelhagen},
title = {{Improved Weak Labels using Contextual Cues for Person Identification in Videos}},
year = {2015},
booktitle = {International Conference on Automatic Face and Gesture Recognition (FG)},
month = {May},
doi = {10.1109/FG.2015.7163083}
}
Fully automatic person identification in TV series has been achieved by obtaining weak labels from subtitles and transcripts [Everingham 2011]. In this paper, we revisit the problem of matching subtitles with face tracks to obtain more assignments and more accurate weak labels. We perform a detailed analysis of the state-of-the-art showing the types of errors during the assignment and providing insights into their cause. We then propose to model the problem of assigning names to face tracks as a joint optimization problem. Using negative constraints between co-occurring pairs of tracks and positive constraints from track threads, we are able to significantly improve the speaker assignment performance. This directly influences the identification performance on all face tracks. We also propose a new feature to determine whether a tracked face is speaking and show further improvements in performance while being computationally more efficient.


Tapaswi2014_FaceTrackCluster

Total Cluster: A person agnostic clustering method for broadcast videos

Makarand Tapaswi, Omkar M. Parkhi, Esa Rahtu, Eric Sommerlade, Rainer Stiefelhagen and Andrew Zisserman
Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP Oral, Acceptance rate=9.4%), Bangalore, India, Dec. 2014.

PDF DOI presentation

@inproceedings{Tapaswi2014_FaceTrackCluster,
author = {Makarand Tapaswi and Omkar M. Parkhi and Esa Rahtu and Eric Sommerlade and Rainer Stiefelhagen and Andrew Zisserman},
title = {{Total Cluster: A person agnostic clustering method for broadcast videos}},
year = {2014},
booktitle = {Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP)},
month = {Dec.},
doi = {10.1145/2683483.2683490}
}
The goal of this paper is unsupervised face clustering in edited video material -- where face tracks arising from different people are assigned to separate clusters, with one cluster for each person. In particular we explore the extent to which faces can be clustered automatically without making an error. This is a very challenging problem given the variation in pose, lighting and expressions that can occur, and the similarities between different people. The novelty we bring is threefold: first, we show that a form of weak supervision is available from the editing structure of the material -- the shots, threads and scenes that are standard in edited video; second, we show that by first clustering within scenes the number of face tracks can be significantly reduced with almost no errors; third, we propose an extension of the clustering method to entire episodes using exemplar SVMs based on the negative training data automatically harvested from the editing structure. The method is demonstrated on multiple episodes from two very different TV series, Scrubs and Buffy. For both series it is shown that we move towards our goal, and also outperform a number of baselines from previous works.


Tapaswi2014_FalsePositiveTracks

Cleaning up after a Face Tracker: False Positive Removal

Makarand Tapaswi, Cemal Çağrı Çörez, Martin Bäuml, Hazım Kemal Ekenel and Rainer Stiefelhagen
International Conference on Image Processing (ICIP Poster, Acceptance rate=43.2%), Paris, France, Oct. 2014.

PDF DOI poster

@inproceedings{Tapaswi2014_FalsePositiveTracks,
author = {Makarand Tapaswi and Cemal Çağrı Çörez and Martin Bäuml and Hazım Kemal Ekenel and Rainer Stiefelhagen},
title = {{Cleaning up after a Face Tracker: False Positive Removal}},
year = {2014},
booktitle = {International Conference on Image Processing (ICIP)},
month = {Oct.},
doi = {10.1109/ICIP.2014.7025050}
}
Automatic person identification in TV series has gained popularity over the years. While most of the works rely on using face-based recognition, errors during tracking such as false positive face tracks are typically ignored. We propose a variety of methods to remove false positive face tracks and categorize the methods into confidence- and context-based. We evaluate our methods on a large TV series data set and show that up to 75% of the false positive face tracks are removed at the cost of 3.6% true positive tracks. We further show that the proposed method is general and applicable to other detectors or trackers.


Baeuml2014_TrackKernel

A Time Pooled Track Kernel for Person Identification

Martin Bäuml, Makarand Tapaswi and Rainer Stiefelhagen
Conference on Advanced Video and Signal-based Surveillance (AVSS Oral, Acceptance rate=22.5%), Seoul, Korea, Aug. 2014.

PDF Project DOI data

@inproceedings{Baeuml2014_TrackKernel,
author = {Martin Bäuml and Makarand Tapaswi and Rainer Stiefelhagen},
title = {{A Time Pooled Track Kernel for Person Identification}},
year = {2014},
booktitle = {Conference on Advanced Video and Signal-based Surveillance (AVSS)},
month = {Aug.},
doi = {10.1109/AVSS.2014.6918636}
}
We present a novel method for comparing tracks by means of a time pooled track kernel. In contrast to spatial or feature-space pooling, the track kernel pools base kernel results within tracks over time. It includes as special cases frame-wise classification on the one hand and the normalized sum kernel on the other hand. We also investigate non-Mercer instantiations of the track kernel and obtain good results despite its Gram matrices not being positive semidefinite. Second, the track kernel matrices in general require less memory than single frame kernels, allowing us to process larger datasets without resorting to subsampling. Finally, the track kernel formulation allows for very fast testing compared to frame-wise classification which is important in settings where user feedback is obtained and quick iterations of re-training and re-testing are required. We apply our approach to the task of video-based person identification in large scale settings and obtain state-of-the-art results.
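
The time pooling itself can be sketched in a few lines, assuming an RBF base kernel over frame descriptors; the normalized-sum pooling shown here is only one of the special cases mentioned above.

import numpy as np

def rbf(x, y, gamma=0.5):
    """Base kernel between two frame descriptors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def track_kernel(track_a, track_b, base_kernel=rbf):
    """Time-pooled track kernel: pool the base kernel over all frame pairs of
    two tracks (here a normalized sum; other pooling choices, including
    non-Mercer ones, are studied in the paper).

    track_a : (n_frames_a, d) array, track_b : (n_frames_b, d) array.
    """
    values = [base_kernel(a, b) for a in track_a for b in track_b]
    return float(np.mean(values))

# Two short face tracks with 5 and 7 frames of 16-D descriptors.
rng = np.random.default_rng(0)
print(track_kernel(rng.random((5, 16)), rng.random((7, 16))))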


Tapaswi2014_StoryGraphs

StoryGraphs: Visualizing Character Interactions as a Timeline

Makarand Tapaswi, Martin Bäuml and Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR Poster, Acceptance rate=29.9%), Columbus, OH, USA, Jun. 2014.

PDF Project DOI code poster-1 poster-2 suppl. material

@inproceedings{Tapaswi2014_StoryGraphs,
author = {Makarand Tapaswi and Martin Bäuml and Rainer Stiefelhagen},
title = {{StoryGraphs: Visualizing Character Interactions as a Timeline}},
year = {2014},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2014.111}
}
We present a novel way to automatically summarize and represent the storyline of a TV episode by visualizing character interactions as a chart. We also propose a scene detection method that lends itself well to generating over-segmented scenes, which are used to partition the video. The positioning of character lines in the chart is formulated as an optimization problem which trades between the aesthetics and functionality of the chart. Using automatic person identification, we present StoryGraphs for 3 diverse TV series encompassing a total of 22 episodes. We define quantitative criteria to evaluate StoryGraphs and also compare them against episode summaries to evaluate their ability to provide an overview of the episode.


Tapaswi2014_PlotRetrieval

Story-based Video Retrieval in TV series using Plot Synopses

Makarand Tapaswi, Martin Bäuml and Rainer Stiefelhagen
International Conference on Multimedia Retrieval (ICMR Oral Full paper, Acceptance rate=19.1%), Glasgow, Scotland, Apr. 2014.

PDF Project DOI presentation

@inproceedings{Tapaswi2014_PlotRetrieval,
author = {Makarand Tapaswi and Martin Bäuml and Rainer Stiefelhagen},
title = {{Story-based Video Retrieval in TV series using Plot Synopses}},
year = {2014},
booktitle = {International Conference on Multimedia Retrieval (ICMR)},
month = {Apr.},
doi = {10.1145/2578726.2578727}
}
We present a novel approach to search for plots in the storyline of structured videos such as TV series. To this end, we propose to align natural language descriptions of the videos, such as plot synopses, with the corresponding shots in the video. Guided by subtitles and person identities, the alignment problem is formulated as an optimization task over all possible assignments and solved efficiently using dynamic programming. We evaluate our approach on a novel dataset comprising the complete season 5 of Buffy the Vampire Slayer, and show good alignment performance and the ability to retrieve plots in the storyline.
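
As a toy illustration of the per-shot scores that feed such an alignment (the weights and normalization below are invented for the example), character-identity matches and keyword overlap with subtitles can be combined as follows:

def shot_sentence_score(sentence_words, sentence_names,
                        subtitle_words, shot_names, w_id=1.0, w_kw=0.5):
    """Toy similarity between one plot-synopsis sentence and one video shot,
    combining character-identity matches with keyword overlap between the
    sentence and the shot's subtitles. Weights are purely illustrative."""
    id_score = len(set(sentence_names) & set(shot_names))
    kw_score = len(set(sentence_words) & set(subtitle_words))
    return w_id * id_score + w_kw * kw_score

score = shot_sentence_score(
    sentence_words={"buffy", "fights", "vampire"},
    sentence_names={"Buffy"},
    subtitle_words={"watch", "out", "vampire"},
    shot_names={"Buffy", "Willow"},
)
print(score)   # 1.0 * one identity match + 0.5 * one keyword match = 1.5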


Baeuml2013_SemiPersonID

Semi-supervised Learning with Constraints for Person Identification in Multimedia Data

Martin Bäuml, Makarand Tapaswi and Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR Poster, Acceptance rate=25.2%), Portland, OR, USA, Jun. 2013.

PDF Project DOI poster-1 poster-2

@inproceedings{Baeuml2013_SemiPersonID,
author = {Martin Bäuml and Makarand Tapaswi and Rainer Stiefelhagen},
title = {{Semi-supervised Learning with Constraints for Person Identification in Multimedia Data}},
year = {2013},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2013.462}
}
We address the problem of person identification in TV series. We propose a unified learning framework for multi-class classification which incorporates labeled and unlabeled data, and constraints between pairs of features in the training. We apply the framework to train multinomial logistic regression classifiers for multi-class face recognition. The method is completely automatic, as the labeled data is obtained by tagging speaking faces using subtitles and fan transcripts of the videos. We demonstrate our approach on six episodes each of two diverse TV series and achieve state-of-the-art performance.


Baeuml2012_CamNetwork

Contextual Constraints for Person Retrieval in Camera Networks

Martin Bäuml, Makarand Tapaswi, Arne Schumann and Rainer Stiefelhagen
Conference on Advanced Video and Signal-based Surveillance (AVSS Oral, Acceptance rate=17.8%), Beijing, China, Sep. 2012.

PDF Project DOI

@inproceedings{Baeuml2012_CamNetwork,
author = {Martin Bäuml and Makarand Tapaswi and Arne Schumann and Rainer Stiefelhagen},
title = {{Contextual Constraints for Person Retrieval in Camera Networks}},
year = {2012},
booktitle = {Conference on Advanced Video and Signal-based Surveillance (AVSS)},
month = {Sep.},
doi = {10.1109/AVSS.2012.28}
}
We use contextual constraints for person retrieval in camera networks. We start by formulating a set of general positive and negative constraints on the identities of person tracks in camera networks, such as a person cannot appear twice in the same frame. We then show how these constraints can be used to improve person retrieval. First, we use the constraints to obtain training data in an unsupervised way to learn a general metric that is better suited to discriminate between different people than the Euclidean distance. Second, starting from an initial query track, we enhance the query-set using the constraints to obtain additional positive and negative samples for the query. Third, we formulate the person retrieval task as an energy minimization problem, integrate track scores and constraints in a common framework and jointly optimize the retrieval over all interconnected tracks. We evaluate our approach on the CAVIAR dataset and achieve 22% relative performance improvement in terms of mean average precision over standard retrieval where each track is treated independently.
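
A hypothetical sketch of how such constraints yield training data without any labels: samples within one track are positives, while two tracks that co-occur in the same frame must show different people and therefore give negatives. The remaining constraints in the paper are handled analogously.

from itertools import combinations

def harvest_constraint_pairs(tracks):
    """Harvest metric-learning pairs from contextual constraints.

    tracks : list of dicts with keys 'id', 'frames' (set of frame numbers)
             and 'samples' (list of feature indices).
    """
    positives, negatives = [], []
    for tr in tracks:
        positives.extend(combinations(tr["samples"], 2))          # same track, same person
    for tr_a, tr_b in combinations(tracks, 2):
        if tr_a["frames"] & tr_b["frames"]:                        # co-occur in some frame
            negatives.extend((i, j) for i in tr_a["samples"] for j in tr_b["samples"])
    return positives, negatives

tracks = [
    {"id": 0, "frames": {1, 2, 3}, "samples": [0, 1]},
    {"id": 1, "frames": {2, 3, 4}, "samples": [2, 3]},   # overlaps track 0
    {"id": 2, "frames": {9, 10},   "samples": [4]},
]
pos, neg = harvest_constraint_pairs(tracks)
print(len(pos), "positive pairs,", len(neg), "negative pairs")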


Tapaswi2012_PersonID

"Knock! Knock! Who is it?" Probabilistic Person Identification in TV series

Makarand Tapaswi, Martin Bäuml and Rainer Stiefelhagen
Conference on Computer Vision and Pattern Recognition (CVPR Poster, Acceptance rate=24.0%), Providence, RI, USA, Jun. 2012.

PDF Project DOI poster-1 poster-2 suppl. material

@inproceedings{Tapaswi2012_PersonID,
author = {Makarand Tapaswi and Martin Bäuml and Rainer Stiefelhagen},
title = {{``Knock! Knock! Who is it?'' Probabilistic Person Identification in TV series}},
year = {2012},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {Jun.},
doi = {10.1109/CVPR.2012.6247986}
}
We describe a probabilistic method for identifying characters in TV series or movies. We aim at labeling every character appearance, and not only those where a face can be detected. Consequently, our basic unit of appearance is a person track (as opposed to a face track). We model each TV series episode as a Markov Random Field, integrating face recognition, clothing appearance, speaker recognition and contextual constraints in a probabilistic manner. The identification task is then formulated as an energy minimization problem. In order to identify tracks without faces, we learn clothing models by adapting available face recognition results. Within a scene, as indicated by prior analysis of the temporal structure of the TV series, clothing features are combined by agglomerative clustering. We evaluate our approach on the first 6 episodes of The Big Bang Theory and achieve an absolute improvement of 20% for person identification and 12% for face recognition.
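
A toy energy function in the spirit of this formulation (the potentials, cues, and inference in the paper are richer): unary terms score candidate names for each track and pairwise terms penalize assignments that violate a constraint.

import itertools

def track_energy(assignment, unary, cannot_link, penalty=10.0):
    """Toy MRF-style energy over person tracks.

    assignment  : dict track_id -> name
    unary       : dict (track_id, name) -> cost from face/clothing/speaker cues
    cannot_link : set of frozenset({track_a, track_b}) pairs that must differ
    """
    energy = sum(unary[(t, name)] for t, name in assignment.items())
    for a, b in itertools.combinations(assignment, 2):
        if frozenset({a, b}) in cannot_link and assignment[a] == assignment[b]:
            energy += penalty
    return energy

unary = {(0, "Sheldon"): 0.2, (0, "Leonard"): 1.5,
         (1, "Sheldon"): 0.4, (1, "Leonard"): 0.6}
cannot_link = {frozenset({0, 1})}                                  # tracks 0 and 1 co-occur
print(track_energy({0: "Sheldon", 1: "Leonard"}, unary, cannot_link))  # consistent: low energy
print(track_energy({0: "Sheldon", 1: "Sheldon"}, unary, cannot_link))  # violation: penalized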


Das2010_SpeakerID

Direct modeling of spoken passwords for text-dependent speaker recognition by compressed time-feature representations

Amitava Das and Makarand Tapaswi
International Conference on Acoustics, Speech, and Signal Processing (ICASSP Poster), Dallas, TX, USA, Mar. 2010.

PDF

@inproceedings{Das2010_SpeakerID,
author = {Amitava Das and Makarand Tapaswi},
title = {{Direct modeling of spoken passwords for text-dependent speaker recognition by compressed time-feature representations}},
year = {2010},
booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
month = {Mar.},
doi = {}
}
Traditional Text-Dependent Speaker Recognition (TDSR) systems model the user-specific spoken passwords with frame-based features such as MFCC and use DTW or HMM type classifiers to handle the variable length of the feature vector sequence. In this paper, we explore a direct modeling of the entire spoken password by a fixed-dimension vector called Compressed Feature Dynamics or CFD. Instead of the usual frame-by-frame feature extraction, the entire password utterance is first modeled by a 2-D Featurogram or FGRAM, which efficiently captures speaker-identity-specific speech dynamics. CFDs are compressed and approximated versions of the FGRAMs and their fixed dimension allows the use of simpler classifiers. Overall, the proposed FGRAM-CFD framework provides an efficient and direct model to capture the speaker-identity information well for a TDSR system. As demonstrated in trials on a 344-speaker database, compared to traditional MFCC-based TDSR systems, the FGRAM-CFD framework shows quite encouraging performance at significantly lower complexity.


Workshops

Martinez2016_ICMLCompBioW

A Closed-form Gradient for the 1D Earth Mover's Distance for Spectral Deep Learning on Biological Data

Manuel Martinez, Makarand Tapaswi and Rainer Stiefelhagen
ICML 2016 Workshop on Computational Biology (CompBio@ICML16), New York, NY, USA, Jun. 2016.

PDF

@inproceedings{Martinez2016_ICMLCompBioW,
author = {Manuel Martinez and Makarand Tapaswi and Rainer Stiefelhagen},
title = {{A Closed-form Gradient for the 1D Earth Mover's Distance for Spectral Deep Learning on Biological Data}},
year = {2016},
booktitle = {ICML 2016 Workshop on Computational Biology (CompBio@ICML16)},
month = {Jun.},
doi = {}
}


Vlastelica2015_MEDIAEVAL

KIT at MediaEval 2015 -- Evaluating Visual Cues for Affective Impact of Movies Task

Marin Vlastelica Pogančić, Sergey Hayrapetyan, Makarand Tapaswi and Rainer Stiefelhagen
Proceedings of the MediaEval2015 Multimedia Benchmark Workshop (MediaEval2015), Wurzen, Germany, Sep. 2015.

PDF

@inproceedings{Vlastelica2015_MEDIAEVAL,
author = {Marin Vlastelica Pogančić and Sergey Hayrapetyan and Makarand Tapaswi and Rainer Stiefelhagen},
title = {{KIT at MediaEval 2015 -- Evaluating Visual Cues for Affective Impact of Movies Task}},
year = {2015},
booktitle = {Proceedings of the MediaEval2015 Multimedia Benchmark Workshop (MediaEval2015)},
month = {Sep.},
doi = {}
}
We present the approach and results of our system on the MediaEval Affective Impact of Movies Task. The challenge involves two primary tasks: affect classification and violence detection. We test the performance of multiple visual features followed by linear SVM classifiers. Inspired by successes in different vision fields, we use (i) GIST features used in scene modeling, (ii) features extracted from a deep convolutional neural network trained on object recognition, and (iii) improved dense trajectory features encoded using Fisher vectors commonly used in action recognition.


Bredin2013_REPERE

QCompere @ Repere 2013

Hervé Bredin, Johann Poignant, Guillaume Fortier, Makarand Tapaswi, Viet Bac Le, Anindya Roy, Claude Barras, Sophie Rosset, Achintya Sarkar, Hua Gao, Alexis Mignon, Jakob Verbeek, Laurent Besacier, Georges Quénot, Hazım Kemal Ekenel and Rainer Stiefelhagen
Workshop on Speech, Language and Audio in Multimedia (SLAM Oral), Marseille, France, Aug. 2013.

PDF

@inproceedings{Bredin2013_REPERE,
author = {Hervé Bredin and Johann Poignant and Guillaume Fortier and Makarand Tapaswi and Viet Bac Le and Anindya Roy and Claude Barras and Sophie Rosset and Achintya Sarkar and Hua Gao and Alexis Mignon and Jakob Verbeek and Laurent Besacier and Georges Quénot and Hazım Kemal Ekenel and Rainer Stiefelhagen},
title = {{QCompere @ Repere 2013}},
year = {2013},
booktitle = {Workshop on Speech, Language and Audio in Multimedia (SLAM)},
month = {Aug.},
doi = {}
}
We describe QCompere consortium submissions to the REPERE 2013 evaluation campaign. The REPERE challenge aims at gathering four communities (face recognition, speaker identification, optical character recognition and named entity detection) towards the same goal: multimodal person recognition in TV broadcast. First, four mono-modal components are introduced (one for each foregoing community) constituting the elementary building blocks of our various submissions. Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced: they can be summarized as propagation-, classifier-, rule- or graph-based approaches. Finally, their performance is evaluated on REPERE 2013 test set and their advantages and limitations are discussed.


Bredin2012_REPERE

Fusion of Speech, Faces and Text for Person Identification in TV Broadcast

Hervé Bredin, Johann Poignant, Makarand Tapaswi, Guillaume Fortier, Viet Bac Le, Thibault Napoleon, Hua Gao, Claude Barras, Sophie Rosset, Laurent Besacier, Jakob Verbeek, Georges Quénot, Frédéric Jurie and Hazım Kemal Ekenel
Workshop on Information Fusion in Computer Vision for Concept Recognition (held with ECCV 2012) (IFCVCR Poster), Florence, Italy, Oct. 2012.

PDF DOI poster

@inproceedings{Bredin2012_REPERE,
author = {Hervé Bredin and Johann Poignant and Makarand Tapaswi and Guillaume Fortier and Viet Bac Le and Thibault Napoleon and Hua Gao and Claude Barras and Sophie Rosset and Laurent Besacier and Jakob Verbeek and Georges Quénot and Frédéric Jurie and Hazım Kemal Ekenel},
title = {{Fusion of Speech, Faces and Text for Person Identification in TV Broadcast}},
year = {2012},
booktitle = {Workshop on Information Fusion in Computer Vision for Concept Recognition (held with ECCV 2012) (IFCVCR)},
month = {Oct.},
doi = {10.1007/978-3-642-33885-4_39}
}
The Repere challenge is a project aiming at the evaluation of systems for supervised and unsupervised multimodal recognition of people in TV broadcast. In this paper, we describe, evaluate and discuss QCompere consortium submissions to the 2012 Repere evaluation campaign dry-run. Speaker identification (and face recognition) can be greatly improved when combined with name detection through video optical character recognition. Moreover, we show that unsupervised multimodal person recognition systems can achieve performance nearly as good as supervised monomodal ones (with several hundreds of identity models).


Semela2012_MediaEval

KIT at MediaEval2012 - Content-based Genre Classification with Visual Cues

Tomas Semela, Makarand Tapaswi, Hazım Kemal Ekenel and Rainer Stiefelhagen
Proceedings of the MediaEval2012 Multimedia Benchmark Workshop (MediaEval2012), Pisa, Italy, Oct. 2012.

PDF

@inproceedings{Semela2012_MediaEval,
author = {Tomas Semela and Makarand Tapaswi and Hazım Kemal Ekenel and Rainer Stiefelhagen},
title = {{KIT at MediaEval2012 - Content-based Genre Classification with Visual Cues}},
year = {2012},
booktitle = {Proceedings of the MediaEval2012 Multimedia Benchmark Workshop (MediaEval2012)},
month = {Oct.},
doi = {}
}
This paper presents the results of our content-based video genre classification system on the 2012 MediaEval Tagging Task. Our system utilizes several low-level visual cues to achieve this task. The purpose of this evaluation is to assess our content-based system's performance on the large number of blip.tv web videos and the high number of genres. The task and corpus are described in detail in [Schmeideke 2012].


Das2008_Multilingual

Multilingual spoken-password based user authentication in emerging economies using cellular phone networks

Amitava Das, Ohil K. Manyam, Makarand Tapaswi and Veeresh Taranalli
Workshop on Spoken Language Technology (SLT Oral), Goa, India, Dec. 2008.

PDF DOI

@inproceedings{Das2008_Multilingual,
author = {Amitava Das and Ohil K. Manyam and Makarand Tapaswi and Veeresh Taranalli},
title = {{Multilingual spoken-password based user authentication in emerging economies using cellular phone networks}},
year = {2008},
booktitle = {Workshop on Spoken Language Technology (SLT)},
month = {Dec.},
doi = {10.1109/SLT.2008.4777826}
}
Mobile phones are playing an important role in changing the socio-economic landscapes of emerging economies like India. A proper voice-based user authentication will help in many new mobile based applications including mobile-commerce and banking. We present our exploration and evaluation of an experimental set-up for user authentication in remote Indian villages using mobile phones and user-selected multilingual spoken passwords. We also present an effective speaker recognition method using a set of novel features called Compressed Feature Dynamics (CFD) which capture the speaker-identity effectively from the speech dynamics contained in the spoken passwords. Early trials demonstrate the effectiveness of the proposed method in handling noisy cell-phone speech. Compared to conventional text-dependent speaker recognition methods, the proposed CFD method delivers competitive performance while significantly reducing storage and computational complexity -- an advantage highly beneficial for cell-phone based deployment of such user authentication systems.


Disclaimer

This publication material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.