Satya Krishna Gorti

MSc. in Applied Computing

satyag [at] cs [dot] toronto [dot] edu

Brief Bio

I graduated with an MSc. in Applied Computing from the University of Toronto in 2018. My main interests lie in the areas of Machine Learning, Deep Learning and Computer Vision. I am currently a Senior ML Research Scientist at Layer6 AI, where I work on Representation Learning and Multi-Modal Understanding. I also lead the ML Frameworks team, where we build frameworks for training, deploying and monitoring ML models in production on the cloud. Prior to this, I was a Research Intern at Uber ATG, working on multi-object tracking using LiDAR and radar sensors for self-driving vehicles.

Research

  • Data-Efficient Multimodal Fusion on a Single GPU
    We propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance, and in certain cases outperform state-of-the-art methods, in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with ∼600× fewer GPU days and ∼80× fewer image-text pairs. A toy sketch of the latent mixing step appears below this list.
    CVPR 2024 - Seattle, WA
    [Paper][Code]
  • TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation
    We propose TR0N, a highly general framework that turns pre-trained unconditional generative models, such as GANs and VAEs, into conditional models. The conditioning can be highly arbitrary and requires only a pre-trained auxiliary model. We show how to turn unconditional models into class-conditional ones with the help of a classifier, and into text-to-image models by leveraging CLIP. TR0N learns a lightweight stochastic mapping that translates between the space of conditions and the latent space of the generative model. The translated latent samples are then further refined through Langevin dynamics, yielding higher-quality data samples (a toy sketch of this refinement step appears below this list). TR0N requires neither training data nor fine-tuning, yet achieves a zero-shot FID of 10.9 on MS-COCO, outperforming competing alternatives not only on this metric but also in sampling speed.
    ICML 2023 - Honolulu, HI
    [Paper][Code]
  • XPool: Cross-Modal Language-Video Attention for Text-Video Retrieval
    We propose a cross-modal attention model called XPool that reasons between a text and the frames of a video. Our core mechanism is scaled dot-product attention that lets a text attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text’s attention weights over the frames (a minimal sketch of this text-conditioned pooling appears below this list). We evaluate our method on three benchmark datasets, MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results with up to 8% relative improvement in Recall@1.
    CVPR 2022 - New Orleans, LA
    [Paper][Code]
  • Weakly Supervised Action Selection Learning in Video
    We propose Action Selection Learning (ASL), an approach for temporally localizing actions in untrimmed videos using video-level class labels as weak supervision. Empirically, we show that ASL outperforms leading baselines on two popular benchmarks, THUMOS-14 and ActivityNet-1.2, with 12.3% and 5.7% relative improvement, respectively.
    CVPR 2021 - Nashville, TN
    [Paper]
  • Cross-Class Relevance Learning for Information Fusion in Temporal Concept Localization
    We present a framework for temporal concept localization that achieves state-of-the-art results on the YouTube-8M dataset.
    ICCV 2019 - The 3rd Workshop on YouTube-8M Large-Scale Video Understanding - Seoul, South Korea
    [Paper][Workshop]
  • Guided Similarity Separation for Image Retrieval
    We propose a graph convolutional network to directly encode neighbour information into image descriptors for image retrieval. We further leverage ideas from clustering and manifold learning, and introduce an unsupervised loss based on pairwise separation of image similarities.
    NeurIPS 2019 - Vancouver, BC
    [Paper]
  • Semi-Supervised Traversal in Image Retrieval
    A novel semi-supervised extension to Explore-Exploit Graph Traversal (EGT) for image retrieval.
    CVPR 2019 - Landmark Recognition Workshop - Long Beach, CA
    [Paper][Workshop]
  • Online algorithm for adaptive learning rate
    An online algorithm for learning the learning rate in stochastic gradient descent using first- and second-order approximation methods, with a study of its effects on convex and non-convex machine learning problems. A generic sketch of this style of update appears below this list.
    [arXiv][GitHub]
  • Text-to-Image-to-Text translation using cycle consistent adversarial networks
    Improving text-to-image synthesis using cycle consistency.
    [arXiv][GitHub]
    Example results (generated images omitted):
      Ground truth caption: "the flower has long yellow petals that are thin and a yellow stamen"
      Generated caption:    "this flower has petals that are yellow and very thin"
      Ground truth caption: "there are many long and narrow floppy pink petals surrounding many red stamen and a green stigma on this flower"
      Generated caption:    "this flower has petals that are red with pointed tips"
  • ReGAN: RE[LAX|BAR|INFORCE] based Sequence Generation using GANs
    A comparative study of gradient estimators for sequence generation using GANs. A toy REINFORCE sketch appears below this list.
    [arXiv][GitHub]
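
Code sketches

The snippets below are minimal, illustrative sketches of some of the ideas above. They are not the released implementations, and any function, parameter or shape not stated in the papers is an assumption.

A sketch of a mixup-style augmentation on frozen encoder latents, in the spirit of FuseMix. The Beta mixing parameter and the index-wise pairing convention are assumptions; the mixed latents would then feed lightweight fusion adapters trained contrastively.

    import torch

    def fusemix_style_augment(img_latents, txt_latents, alpha=1.0):
        """Interpolate paired (image, text) latents with a shared mixing ratio.

        img_latents, txt_latents: (batch, dim) outputs of frozen unimodal
        encoders, paired index-wise (illustrative shapes).
        """
        lam = torch.distributions.Beta(alpha, alpha).sample()
        perm = torch.randperm(img_latents.size(0))
        # The same ratio and permutation are applied to both modalities so
        # that the mixed image/text latents stay semantically paired.
        mixed_img = lam * img_latents + (1 - lam) * img_latents[perm]
        mixed_txt = lam * txt_latents + (1 - lam) * txt_latents[perm]
        return mixed_img, mixed_txt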
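
A sketch of the Langevin-style refinement step described for TR0N, assuming a generic energy function (for example, a negative CLIP similarity or classifier log-probability evaluated on the decoded sample); the step size and number of iterations are illustrative.

    import torch

    def langevin_refine(z, energy, steps=20, step_size=0.01):
        """Refine latent codes z by noisy gradient descent on energy(z),
        a per-sample score of how well the generated sample matches the condition."""
        z = z.clone().requires_grad_(True)
        for _ in range(steps):
            e = energy(z).sum()
            grad, = torch.autograd.grad(e, z)
            noise = torch.randn_like(z)
            # Gradient step toward lower energy plus scaled Gaussian noise.
            z = (z - step_size * grad + (2 * step_size) ** 0.5 * noise)
            z = z.detach().requires_grad_(True)
        return z.detach()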
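
A sketch of the text-conditioned frame pooling at the core of XPool: the text queries the frames with scaled dot-product attention, and the frames are aggregated under those attention weights. The projection names and shapes are illustrative, not the paper's code.

    import torch
    import torch.nn.functional as F

    def text_conditioned_pool(text_emb, frame_embs, q_proj, k_proj, v_proj):
        """text_emb: (batch, dim); frame_embs: (batch, frames, dim);
        q_proj/k_proj/v_proj: torch.nn.Linear projections (hypothetical names)."""
        q = q_proj(text_emb).unsqueeze(1)    # (batch, 1, dim) text query
        k = k_proj(frame_embs)               # (batch, frames, dim) frame keys
        v = v_proj(frame_embs)               # (batch, frames, dim) frame values
        # Scaled dot-product attention of the text over the frames.
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        # Aggregated video representation conditioned on the text.
        return (attn @ v).squeeze(1)         # (batch, dim)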
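
A generic first-order sketch of adapting the learning rate online during SGD, in the style studied in the adaptive learning rate project (the exact update rules in the report may differ): the step size grows when successive stochastic gradients agree and shrinks when they oppose each other.

    import numpy as np

    def sgd_with_adaptive_lr(grad_fn, w, lr=0.01, meta_lr=1e-4, steps=1000):
        """grad_fn(w) returns a stochastic gradient of the loss at w,
        where w is a 1-D parameter vector."""
        prev_grad = np.zeros_like(w)
        for _ in range(steps):
            g = grad_fn(w)
            # Hypergradient-style update: increase lr when g and prev_grad
            # point the same way, decrease it when they conflict.
            lr += meta_lr * float(np.dot(g, prev_grad))
            w = w - lr * g
            prev_grad = g
        return w, lr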
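
A toy REINFORCE (score-function) estimator of the kind compared in the ReGAN study, using the discriminator score as the return and a mean baseline for variance reduction; the tensor shapes are assumptions.

    import torch

    def reinforce_loss(logits, actions, rewards):
        """logits: (batch, seq_len, vocab) generator outputs;
        actions: (batch, seq_len) sampled token ids;
        rewards: (batch,) discriminator scores, treated as constants."""
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        advantage = (rewards - rewards.mean()).detach().unsqueeze(1)  # (batch, 1)
        # Negative sign: minimizing this loss maximizes the expected reward.
        return -(advantage * log_probs).sum(dim=1).mean()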

Presentations

    • TMLS 2019, Toronto, Ontario - Temporal Concept Localization on YouTube-8M [Video]
    • ICCV 2019, Seoul, South Korea - YouTube-8M 1st place challenge presentation
    • CVPR 2019, Long Beach, CA - Semi-supervised EGT for landmark retrieval [Slides]
    • Review of GANs for Sequences of Discrete Elements with Gumbel-Softmax Distribution [Slides]

Resume

You can find my full resume here.