The Computer-Assisted Organic Synthesis Planning (CAOSP) problem arises several times in the drug discovery pipeline. The CAOSP problem boils down to identifying a sequence of chemical reactions capable of producing a desired target molecule. CAOSP algorithms typically utilize a database of known chemical reactions and a second database of known starting materials (i.e., typically commercially available molecules). Desirable synthetic plans are inexpensive, have a high yield, and avoid the use of hazardous reactions and intermediates. Typically cast as a planning problem, significant progress has been made in CAOSP.

For the Non-Biologist: Large-scale pharmaceutical manufacturing requires efficient chemical assembly lines. We are developing tools for use by synthetic chemists to optimize these assembly lines. The Computer-Assisted Organic Synthesis Planning (CAOSP) problem reduces to identifying a series of chemical reactions capable of producing the target molecule from starting materials.

Students: Abraham Heifets, Izhar Wallach
Collaborators: SimBioSys Inc.


Molecular procurement is a fundamental challenge for chemists working with novel molecules. One laborious technique for obtaining molecular samples is to extract them from a natural source (i.e., from a plant or animal source). Alternatively, a chemist can synthesize the desired molecule from commercially available starting reactants. To facilitate this process, a number of electronic chemical reaction databases are now available. These databases allow the user to query for single-step molecular transformations. A typical synthetic plan requires a sequence of approximately ten single step reactions to be chained together.

The CAOSP problem can be cast into the framework of a traditional AI planning problem, for example robot motion planning. These path planning algorithms utilize a search tree with internal nodes representing configurations or states and edges representing actions. A set of nodes are designated goal nodes; a path from the root node to a goal node represents a feasible path. In CAOSP, nodes represent molecules and edges represent chemical reactions capable of converting one molecule into another.

Instead of performing a forward search, approaches to CAOSP typically gain computational leverage by employing a retrosynthetic search. A retrosynthetic plan starts with the target molecule and applies chemical reactions in reverse, generating a set of reactants from a product. Working backwards in this way, a search algorithm can stop when all required reactants are commercially available starting materials.

Typical chemical synthesis plans contain between 6 and 15 steps. Fortunately, this limits the depth of our search tree (i.e., we can limit the tree depth to ~15). Unfortunately, our branching factor is enormous. For comparison, the game of chess is typically considered to have a branching factor of 35 - CAOSP has a branching factor between 500 and 10,000! On the bright side, CAOSP does not play against an adversary! Beyond the exponential nature of the search space, a second challenge relates to what we term reaction execution. A chemical reaction database is less a database of explicit chemical reactions and more a list of generalized reaction schemas. Reaction execution refers to predicting reaction products given a reaction schema and set of chemical reactants. The overabundance of special cases make reaction execution extremely difficult.

We are currently collaborating with Simulated Biomolecular Systems author of ARChem and a leader in CAOSP. Check back to learn more.
  • Modeling Stereochemistry: Many chemical reactions introduce or remove chirality; this must be properly modeled in a CAOSP algorithm.
  • Robustness to Execution Ambiguity: Design of CAOSP algorithms that are robust to the uncertainty of reaction execution.
  • Domain Specific Search: Many modern advances in search have not found their way into CAOSP. We are interested in exploring how techniques such as A*, Tabu Search, and Gossip Protocols can be adapted to CAOSP using domain specific knowledge.
  • Molecular Synthetic Complexity: Efficient pruning of the search tree requires tight bounds on the synthetic complexity of molecular intermediates. Provable measures of synthetic complexity would allow more aggressive branch-and-bound search.
  • Structure-Based Drug Design: We are working to combine our research in CAOSP with our work in de novo Structure-Based Drug Design.


Current Projects

Learn more about each project by clicking through to their project pages.