Noël Vouitsis

About

I received both my Master of Science (MSc) in Applied Computing and my Bachelor of Applied Science (BASc) in Computer Engineering from the University of Toronto, with a focus on machine learning. I am currently a Machine Learning Research Scientist at Layer 6 AI, where my research centers on multimodal representation learning and generative modeling. My broader interests include deep learning, representation learning, multimodal learning, computer vision, and generative modeling.

Publications

Refereed

(*) denotes equal contribution

  • Conformal Prediction Sets Improve Human Decision Making
    In ICLR 2024 Bridging the Gap Between Practice and Theory in Deep Learning Workshop
    Jesse C. Cresswell, Yi Sui, Bhargava Kumar, Noël Vouitsis

    We study the usefulness of conformal prediction sets as an aid for human decision making by conducting a pre-registered randomized controlled trial in which human subjects are provided with conformal prediction sets. With statistical significance, we find that when humans are given conformal prediction sets, their accuracy on tasks improves compared to fixed-size prediction sets with the same coverage guarantee. The results show that quantifying model uncertainty with conformal prediction is helpful for human-in-the-loop decision making and human-AI teams.
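
    The aid being evaluated is built with split conformal prediction. As a rough sketch of that general recipe (using the common "1 - softmax of the true class" score; not necessarily the exact setup from the study), prediction sets with a coverage guarantee can be computed from a held-out calibration set:

        import numpy as np

        def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            # Nonconformity score on the calibration set: 1 - softmax(true class).
            n = len(cal_labels)
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]
            # Finite-sample-corrected quantile of the calibration scores.
            q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
            qhat = np.quantile(scores, q_level, method="higher")
            # Include every class whose score clears the threshold; the sets
            # contain the true label with probability at least 1 - alpha.
            return test_probs >= 1.0 - qhat  # (num_test, num_classes) boolean mask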

  • Data-Efficient Multimodal Fusion on a Single GPU
    In CVPR 2024 (Highlight, top 3%)
    Noël Vouitsis*, Zhaoyan Liu*, Satya Krishna Gorti*, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs

    We propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance – and in certain cases outperform state-of-the-art methods – in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with ∼600× fewer GPU days and ∼80× fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones.
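
    The core augmentation is mixup applied to pre-computed latents rather than to raw inputs. A minimal PyTorch sketch (the Beta hyperparameter and tensor shapes are assumptions; the learned fusion adapters and contrastive objective are omitted):

        import torch

        def fusemix(z_a, z_b, alpha=1.0):
            # z_a, z_b: aligned batches of latents from two frozen unimodal
            # encoders (e.g. image and text), each of shape (batch, dim).
            lam = torch.distributions.Beta(alpha, alpha).sample()
            perm = torch.randperm(z_a.size(0))
            # A shared mixing coefficient and permutation keep the mixed
            # pairs semantically aligned across modalities.
            z_a_mix = lam * z_a + (1 - lam) * z_a[perm]
            z_b_mix = lam * z_b + (1 - lam) * z_b[perm]
            return z_a_mix, z_b_mix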

  • TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation
    In ICML 2023
    Zhaoyan Liu*, Noël Vouitsis*, Satya Krishna Gorti, Jimmy Ba, Gabriel Loaiza-Ganem

    TR0N is a highly general framework for adding any type of conditioning (e.g. classes, free-form text, images) to pre-trained unconditional generative models (e.g. GANs, VAEs). TR0N is simple, efficient, and zero-shot: it requires no provided dataset to train. We show impressive quantitative and qualitative results across tasks, and TR0N is highly competitive with DALL·E 2 in terms of FID on MS-COCO.
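
    In essence, the translator learns to map condition embeddings back to generator latents using only synthetic pairs. A simplified sketch of that zero-shot training loop (in the full method the translator outputs a distribution over latents and sampling is refined at inference time; here it regresses a point estimate):

        import torch

        def train_translator(G, E, T, optimizer, steps, batch_size, z_dim, device):
            # G: frozen pre-trained generator; E: frozen condition encoder
            # (e.g. CLIP); T: trainable translator from conditions to latents.
            for _ in range(steps):
                z = torch.randn(batch_size, z_dim, device=device)
                with torch.no_grad():
                    c = E(G(z))  # condition of a generated sample: a free (c, z) pair
                loss = (T(c) - z).pow(2).mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()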

  • X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
    In CVPR 2022
    Satya Krishna Gorti*, Noël Vouitsis*, Junwei Ma*, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu

    X-Pool is a cross-modal attention model that reasons between a text and the frames of a video. Our core mechanism is a scaled dot-product attention that allows a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text’s attention weights over the frames. We evaluate our method on three benchmark datasets, MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results with up to a 12% relative improvement in Recall@1.
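
    A minimal sketch of that attention step in PyTorch (learned projections, temperature, and the surrounding model are omitted; shapes are assumptions):

        import torch

        def text_conditioned_pool(text_emb, frame_embs):
            # text_emb: (batch, dim) text embeddings;
            # frame_embs: (batch, frames, dim) per-frame video embeddings.
            q = text_emb.unsqueeze(1)                # the text acts as the query
            scores = q @ frame_embs.transpose(1, 2)  # (batch, 1, frames)
            weights = (scores / frame_embs.size(-1) ** 0.5).softmax(dim=-1)
            # Aggregate frames weighted by their semantic relevance to the text.
            return (weights @ frame_embs).squeeze(1)  # (batch, dim)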