NATURAL

End-To-End Causal Effect Estimation from Unstructured Natural Language Data

¹University of Toronto

²Vector Institute

³Meta AI

Does semaglutide protect kidney health? Do later school start times promote the well-being of children? Questions like these about cause and effect drive decisions across medicine, policy, and business. Randomized controlled trials (RCTs) are the most trusted way to answer causal questions by estimating treatment effects. Unfortunately, clinical trials can take several years and millions of dollars with no guarantee that a drug is approved, and RCTs are infeasible for many critical policy questions. Meanwhile, a sudden outbreak or pandemic leaves us mere days and a handful of potential facts to make vital decisions. Our aim is to broaden the sources of causal information available to us and accelerate the extraction of useful insights from them.

Observational studies offer pre-trial insights but demand structured data. Meanwhile, a wealth of information lies untapped in online forums. For instance, thousands of people with diabetes, migraines, or Long Covid share their treatment experiences on dedicated subreddits. This data is rich, diverse, and accessible – but unstructured. In this work, we introduce a pipeline that turns unstructured text data like this into causal insights.

NATURAL: From Forums to Findings

RCTs take years and millions of dollars to estimate treatment effects. NATURAL converts
unstructured text data from online forums to cheap ATE estimates in hours.

NATURAL is a pipeline based on large language models (LLMs) that turns unstructured text data into meaningful treatment effects. We used social media data to test its performance against real-world RCTs comparing several diabetes and migraine drugs. Across four clinical trials, NATURAL predicted average treatment effects (ATEs) that fell within three percentage points of their ground truth counterparts. This suggests that unstructured text data is indeed a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.

NATURAL estimates fell within 3 percentage points of their corresponding ground truth clinical trial ATEs. Possible ATE values lie between -100 and 100.

Comparison (trial ID)                               Treatment effect in real-world RCT    NATURAL using social media data
Semaglutide vs. Tirzepatide (NCT03987919)            10.11                                  8.83
Semaglutide vs. Liraglutide (NCT03191396)           -14.70                                -15.90
Erenumab vs. Topiramate (NCT03828539)                28.30                                 27.90
OnabotulinumtoxinA vs. Topiramate (NCT02191579)      41.00                                 42.60
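
As a quick arithmetic check on the claim above, the following sketch computes the absolute gap between each trial ATE and the corresponding NATURAL estimate. The numbers are copied from the table; the dictionary layout and the function name `absolute_gaps` are illustrative choices, not part of the NATURAL codebase.

```python
# ATEs in percentage points, copied from the table above:
# trial ID -> (real-world RCT ATE, NATURAL estimate).
trials = {
    "NCT03987919": (10.11, 8.83),     # Semaglutide vs. Tirzepatide
    "NCT03191396": (-14.70, -15.90),  # Semaglutide vs. Liraglutide
    "NCT03828539": (28.30, 27.90),    # Erenumab vs. Topiramate
    "NCT02191579": (41.00, 42.60),    # OnabotulinumtoxinA vs. Topiramate
}


def absolute_gaps(results):
    """Return |RCT ATE - NATURAL estimate| per trial, rounded to 2 decimals."""
    return {trial: round(abs(rct - est), 2) for trial, (rct, est) in results.items()}


gaps = absolute_gaps(trials)
for trial, gap in gaps.items():
    print(f"{trial}: gap of {gap:.2f} percentage points")

largest_gap = max(gaps.values())
print(f"Largest gap: {largest_gap:.2f} percentage points")
```

Every gap is below 3 percentage points, on a scale where ATEs can range from -100 to 100.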

How It Works

NATURAL is data-driven, but also incorporates domain expertise. At a high level, the key steps to compute NATURAL estimators are:
  1. Domain expertise is used to design an observational study, ensuring it meets critical causal assumptions.
  2. A multi-step filtering process identifies reports that are likely to conform to the experimental design.
  3. Large language models (LLMs) are used to extract the conditional distribution of structured variables of interest (outcome, treatment, covariates) given the report.
  4. We adapt classical techniques like inverse propensity score weighting to compute treatment effect estimates from these conditionals.
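
To make step 4 more concrete, here is a minimal sketch of a plug-in inverse propensity weighting (IPW) estimate computed from per-report conditional distributions. It is a simplified illustration, not the authors' exact estimator: it assumes a single binary covariate X, binary treatment T, and binary outcome Y, and it assumes each report has already been mapped (for example by an LLM) to a distribution over (x, t, y). The function names and the toy data are hypothetical.

```python
from itertools import product


def average_joint(report_probs):
    """Average per-report distributions over (x, t, y) into one joint estimate.

    Each element of report_probs is a dict mapping (x, t, y) to a probability;
    missing keys are treated as probability 0.
    """
    n = len(report_probs)
    keys = list(product((0, 1), repeat=3))
    return {k: sum(r.get(k, 0.0) for r in report_probs) / n for k in keys}


def ipw_ate(joint):
    """Plug-in IPW: E[T*Y / e(X)] - E[(1-T)*Y / (1 - e(X))] for binary X, T, Y.

    Assumes positivity, i.e. 0 < e(x) < 1 for every x with P(X=x) > 0.
    """
    ate = 0.0
    for x in (0, 1):
        p_x = sum(joint[(x, t, y)] for t in (0, 1) for y in (0, 1))
        e_x = (joint[(x, 1, 0)] + joint[(x, 1, 1)]) / p_x  # propensity P(T=1 | X=x)
        ate += joint[(x, 1, 1)] / e_x - joint[(x, 0, 1)] / (1.0 - e_x)
    return ate


# Toy data (hypothetical): 100 reports, each with a fully confident extraction.
# In practice the per-report probabilities may be fractional.
counts = {
    (0, 1, 1): 6, (0, 1, 0): 4, (0, 0, 1): 16, (0, 0, 0): 24,
    (1, 1, 1): 32, (1, 1, 0): 8, (1, 0, 1): 5, (1, 0, 0): 5,
}
reports = [{key: 1.0} for key, c in counts.items() for _ in range(c)]

joint = average_joint(reports)
ate = ipw_ate(joint)
print(f"IPW ATE estimate: {100 * ate:.1f} percentage points")
```

In this toy population, treatment is more common when X = 1, so a naive comparison of treated and untreated outcomes would be confounded; reweighting by the propensity corrects for that under the stated assumptions.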
NATURAL leverages LLMs to curate data that can be plugged into natural language conditioned estimators for average treatment effects.

Evaluation

We developed six observational datasets to evaluate different parts of NATURAL: two synthetic datasets constructed using marketing data, and four clinical datasets curated from public (pre-December 2022) migraine and diabetes subreddits from the Pushshift collection. For each dataset, we treated the average treatment effect (ATE) from a corresponding real-world completely randomized experiment as ground truth. The results obtained with our NATURAL estimators on the clinical datasets are shown in the table above. In synthetic settings, we can conduct a more fine-grained evaluation than simply comparing the final ATEs. The visualization below suggests that LLMs estimate observational distributions increasingly well as the amount of self-reported data grows.
For the Hillstrom (left) and Retail Hero (right) datasets, the KL divergence between the estimated joint and propensity distributions and their true counterparts decreases as the number of posts increases (top), as does the RMSE between the NATURAL estimate and the true ATE (bottom).
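
For readers unfamiliar with the two metrics in the figure, the sketch below shows one way to compute them. The direction of the KL divergence (true distribution first, estimate second) is an assumption here, since the caption does not specify it, and the function names and example numbers are illustrative only.

```python
import math


def kl_divergence(p_true, p_est, eps=1e-12):
    """KL(p_true || p_est) for two discrete distributions given as equal-length lists.

    Terms where p_true is 0 contribute nothing; p_est is floored at eps
    to avoid taking the log of zero.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_true, p_est)
        if p > 0
    )


def rmse(estimates, truth):
    """Root mean squared error of repeated estimates against a single true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# Hypothetical numbers, for illustration only.
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_off = kl_divergence([0.5, 0.5], [0.25, 0.75])
err = rmse([1.0, 3.0], 2.0)
print(kl_same, kl_off, err)
```

A KL divergence of zero means the estimated distribution matches the true one exactly, and both metrics shrink as the estimates improve.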

What's Next

NATURAL is a first step towards automated effect estimation from natural language data, with further potential in data-driven decision-making. We are excited to expand its data sources as well as applications:
  1. Using online forum conversations to better understand policy interventions.
  2. Estimating individualized treatment effects on-demand from electronic health records.
  3. Prioritizing clinical trial investment for neglected diseases, based on real lived experiences.
  4. Repurposing drugs and uncovering hidden potential in existing medications.
  5. Detecting rare adverse effects of drugs via safety monitoring in large, diverse populations.
Please get in touch with us about any questions or interest in collaborations!
To cite this work, please use:
@article{dhawan2024end,
  title={End-To-End Causal Effect Estimation from Unstructured Natural Language Data},
  author={Dhawan, Nikita and Cotta, Leonardo and Ullrich, Karen and Krishnan, Rahul G and Maddison, Chris J},
  journal={arXiv preprint arXiv:2407.07018},
  year={2024}
}