CSC2233 Topics in Storage Systems:

Vector Databases in Modern AI Applications

Winter 2025

Cutting-edge AI applications rely on custom storage systems that store billions of vectors, search through them quickly, and retrieve relevant results within milliseconds. Examples include LLM-based AI code assistants, image and video search, generative AI using retrieval-augmented generation (RAG), computational chemistry, protein folding, and recommender systems (e.g., TikTok, YouTube, Spotify). These applications embed data -- images, user preferences, or molecular structures -- as dense vectors, and then rely on fast search to find related vectors. The recent explosion of such AI applications has given rise to a new class of specialized storage systems: the vector database, or VDBMS, which combines vector storage with semantic search over the stored vectors.
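At its core, this embed-then-search pattern reduces to nearest-neighbour search over dense vectors. A minimal illustration with toy data and exact brute-force search (real VDBMSs use approximate indexes instead; all names here are ours, not any system's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embeddings": 10,000 items represented as 128-dimensional dense vectors.
db = rng.standard_normal((10_000, 128)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize for cosine similarity

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar vectors (exact, brute-force)."""
    query = query / np.linalg.norm(query)
    scores = db @ query            # cosine similarity via dot products
    return np.argsort(-scores)[:k]

# A query near item 42 should retrieve item 42 among the top results.
query = db[42] + 0.01 * rng.standard_normal(128).astype(np.float32)
print(search(query))
```

Brute force scans every vector per query; the whole point of a VDBMS is avoiding that scan at billion-vector scale.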

This seminar-style course will introduce the key ideas and techniques used in this emerging class of storage systems, such as standard and hybrid querying, approximate nearest neighbour search, cluster and graph-based indexes, liveness layers, disk-resident indexing, and quantization. We will also discuss recent papers on updatable indexes, GPU-based indexing, multi-tenancy, VDBMS architecture, and more.
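To make the cluster-index idea concrete, here is a rough sketch in the spirit of inverted-file (IVF) indexes: vectors are partitioned around centroids at build time, and a query probes only the few closest clusters. This is an illustrative toy under those assumptions, not any particular system's implementation; names like `nprobe` and `ivf_search` are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((5_000, 32)).astype(np.float32)

# --- Build: assign every vector to its nearest of n_clusters centroids ---
n_clusters = 50
centroids = data[rng.choice(len(data), n_clusters, replace=False)]
for _ in range(5):  # a few crude k-means iterations
    assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_clusters):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
inverted_lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# --- Search: probe only the nprobe closest clusters (hence "approximate") ---
def ivf_search(query, k=5, nprobe=4):
    near = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([inverted_lists[c] for c in near])
    dists = ((data[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]
```

With `nprobe` small the search is fast but may miss true neighbours; raising `nprobe` trades speed for recall, a tradeoff that recurs throughout the course.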

Audience and Prerequisites

The course is intended both for those considering research on related topics and for those merely curious about this core component of modern AI applications.

In terms of background, the course should be suitable for CS grad students who have taken a third-year systems course or two during their undergrad studies (e.g., databases, operating systems, etc.). In other words, you need some understanding of how systems are built, some hands-on coding experience, and general knowledge of algorithms and data structures. There is no need for extensive prior background in machine learning, databases, distributed systems, or storage systems -- although such knowledge will help. Having said that, this is a graduate course: you will need to read and discuss scientific publications, and to conduct a research project. Thus you are expected to be able to conduct background research and to write code for your project.


News

Mar 30, 2025: Added project evaluation checklist.

Mar 23, 2025: Explained the project showcase on last day.

Older News (click to open/close)

Feb 12, 2025: Updated lecture slides with more material, and added student slides.

Feb 1, 2025: Added slides and a short video about segmenting -- a popular technique used in many VDBMS.
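As a rough intuition for the technique (the slides have the real details): new vectors are appended to a small mutable segment, segments are sealed once full, and a query searches all segments and merges results. A toy sketch under those assumptions, with all names ours:

```python
import numpy as np

class SegmentedIndex:
    """Toy sketch of segmenting: writes go to a growing in-memory segment;
    sealed segments are immutable (a real VDBMS would build an index per segment)."""
    def __init__(self, dim, seal_at=1000):
        self.dim, self.seal_at = dim, seal_at
        self.active, self.sealed = [], []

    def add(self, vec):
        self.active.append(vec)
        if len(self.active) >= self.seal_at:         # seal the full segment
            self.sealed.append(np.vstack(self.active))
            self.active = []

    def search(self, query, k=5):
        segments = self.sealed + ([np.vstack(self.active)] if self.active else [])
        vecs = np.vstack(segments)                    # search every segment, merge
        dists = ((vecs - query) ** 2).sum(axis=1)
        return np.argsort(dists)[:k]

rng = np.random.default_rng(2)
vectors = rng.standard_normal((25, 8)).astype(np.float32)
index = SegmentedIndex(dim=8, seal_at=10)
for v in vectors:
    index.add(v)    # leaves two sealed segments plus 5 vectors in the active one
```

Sealing segments keeps writes cheap (only the small active segment is mutable) at the cost of merging per-segment search results at query time.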

Jan 22, 2025: Week 3 slides, plus guidelines and tips for presentations, are now available.

Jan 20, 2025: Bidding begins!

Bidding via HotCRP is now open and ends on Jan 24. See details on Piazza. Note that we made a few small changes to the paper list; it should now be final.

Jan 8, 2025: Welcome to CSC2233!

The course is officially open!

  • You can find lecture slides in the Schedule table. Today's slides are already there.
  • For those on the waiting list: you may yet have your chance if people drop (no guarantees!). Make sure to attend all classes in the meanwhile, and sign up to Piazza.
  • You are welcome to audit -- the classroom is large enough.

Jan 6, 2025: Piazza mixup - double check your enrollment.

Please double check you are enrolled in Piazza, especially if you added yourself manually. Due to a mixup in configuration today, a handful of students were mistakenly removed from Piazza.

Jan 1, 2025: Added schedule, clarified paper bidding and audience background.
Dec 30, 2024: Full paper list with links now available.
Dec 20, 2024: Information on waiting list and auditing.

While the course is currently full, you are welcome to audit -- the classroom is large enough.

For those on the waiting list: you may yet have your chance if people drop (no guarantees!). But for this to work you must make sure to attend all classes, including the first one, and to sign up for Piazza.

Dec 19, 2024: added Piazza link
Dec 15, 2024: Under construction.

The course website will be updated with more details once the term starts. Please check in regularly.


Information

Format

The course will follow a standard seminar format: lectures to introduce key ideas, student-led presentations on academic papers, and a final research project. In addition, students will be expected to read and report in writing on several academic papers, and to take part in class discussions. Because this is a seminar-style course, class attendance is mandatory. Students are also expected to regularly check the course website.

Reports

Students will be required to read several papers and write four brief reports on them. The goal is to have students get used to reading papers critically (but not necessarily negatively!) and to promote class discussions.

Students will bid on which papers to present and report via HotCRP. Reports will also be submitted through the same system.

Paper reports are due at 8pm the day before the presentation of the paper (8pm Tuesday the week the paper is presented).

Presentations and Participation

Presentations are a core component of the course. Each student will present one paper in person, and will also play a role in class discussions of four other papers. Students are encouraged to participate in discussions.

See Piazza for guidelines, and you can find some tips for effective presentations here.

Projects

Students will pair up to propose and pursue a small hands-on research project. It could mean trying out a new idea, implementing or benchmarking a paper we covered in class, contributing a theoretical analysis, or making some other constructive contribution. Project proposals will be due roughly half-way through the course (the instructor can provide some suggestions), and the final report is due at the end of the term.

See project guidelines below.

Marking Scheme

The final grade will be based on the project and final report, presentation, class participation, and attendance. The exact breakdown will be announced at the beginning of the term.

Accessibility

The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.


Schedule and Slides

Click on links for the lecture slides.

Week | Date   | Content                                                                                                       | Due
1    | Jan 8  | intro to VDBMS                                                                                                |
2    | Jan 15 | querying; basic indexing                                                                                      |
3    | Jan 22 | advanced indexing + segmenting (video); architectures                                                         | Paper and report bids due Jan 24
4    | Jan 29 | Student presentations: Incremental and disk-resident indexing [Subramanya, NeurIPS'19] [Singh, arXiv'21]      |
5    | Feb 5  | Student presentations: Multitenancy and disaggregation [Xu, SOSP'23] [Su, SIGMOD'24]                          |
6    | Feb 12 | Student presentations: Filtered/hybrid queries [Gollapudi, WWW'23] [Zhang, OSDI'23] [Cai, SIGMOD'24] (slides source) | Project proposals due Feb 15
     |        | Reading week                                                                                                  |
7    | Feb 26 | Student presentations: GPUs in VDBMS [Johnson, Trans. Big Data '19] [Zhu, SIGMOD'24]                          |
8    | Mar 5  | Student presentations: Learning to index [KDD'22] [Li, KDD'23] [Jääsaari, NeurIPS'24]                         |
9    | Mar 12 | Student presentations: SSDs in VDBMS [Huang, ICDE'24] [Wang, SIGMOD'24]                                       | Mid-project report due Mar 14
10   | Mar 19 | Student presentations: Quantization and compression [Ge et al., CVPR'13] [Aguerrebere, VLDB'23] [Gao, SIGMOD'24] |
11   | Mar 26 | Student presentations: VDBMS architectures [Zhang, SIGMOD'24] [Wei, VLDB'20] [Chen, VLDB'24]                  |
12   | Apr 2  | Project showcase!                                                                                             | Project report due Apr 4

Paper List

Note that some papers focus on algorithms, some are more math-heavy, while others lean towards computer systems and architecture. A small number of papers might require deeper knowledge in systems or a stronger understanding of machine learning. In the end, it falls to you to choose which paper to present and to fill in any missing knowledge as you do so. It is a great opportunity to learn! However, regardless of how your paper leans, it will be your job as a presenter to make it accessible to the class.

Make sure to list your paper preferences in HotCRP by Jan 24.

Incremental disk-resident indexing:

Multitenancy and disaggregation:

Filtered/hybrid queries:

GPUs in VDBMS:

Learning to index:

SSDs in VDBMS:

Quantization and compression:

VDBMS architectures:


Projects

You will pair up to propose and pursue a small hands-on research project on vector databases. The goal is for you to gain experience doing such hands-on research, as well as synthesizing your results in written form. (Depending on circumstances, we may also hold a small "workshop" during the final class where each group presents their work in a short five-minute presentation.)

Our TA, Pritish Mishra, will be available on Fridays at 11am-12pm on Zoom to provide some help on the project. He will also be reviewing the projects, so you should absolutely listen to him!

You are welcome to come up with your own project, or grab a project idea from the list.

Scope

The main requirement is that your project must involve some modest but (ideally) meaningful contribution to the field of vector databases or ANNS indexing. It could mean trying out a new idea, extending a paper in the area, implementing or benchmarking and comparing existing work, offering a fresh analysis of common datasets, contributing theoretical insights and analysis, or making some other constructive contribution. The project should focus on quantitative methods, such as algorithm design, various techniques, empirical evaluation of systems and algorithms, or mathematical modeling; something you can measure or prove.

The scope should be roughly about the size of a small systems workshop paper. You are welcome to tailor your project to align with your own research interests or background, or to make it more relevant and impactful. But it should be connected to vector databases.

Deliverables

To ensure steady progress on your project, we will ask you to submit three PDF deliverables through the term. The PDFs are submitted through Quercus assignments, and any feedback on them will come back via Quercus as well.

Format: the PDF must follow the LaTeX template for USENIX papers. As is common in conferences, you should not change things like font sizes, line spacing, width or height of the text area, etc. However, for the proposal and progress report deliverables, you can omit the abstract. A common mistake is microscopic figures! Make sure your figures are readable without needing to zoom in too much. (Hint: if figure text is smaller than footnote text, it is probably too small.)

Note that using LaTeX is mandatory! Do not use Word, not even the Word template from USENIX. Using LaTeX is an essential skill for CS grad students and is a learning goal. Instead of using LaTeX locally you can use Overleaf, which can make collaboration easier and requires no setup.
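As a starting point, here is a minimal skeleton of what such a document might look like. Note the style-file name below is an assumption; use whatever file ships with the USENIX template version you download:

```latex
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix-2020-09}  % style file from the USENIX template (name may differ)

\begin{document}

\title{\bf My Project Title}
\author{Student One \and Student Two}
\maketitle

%\begin{abstract}  % may be omitted in the proposal and progress report
%...
%\end{abstract}

\section{Introduction}
...

\bibliographystyle{plain}
\bibliography{refs}  % references do not count toward the page limit

\end{document}
```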

I. Proposal (due mid-Feb, exact date is in Quercus)

A concise document outlining your idea and execution plan (up to one page). Explain in reasonable detail your idea, how you will pursue it, and how you will evaluate it. The proposed plan is not set in stone; it is perfectly OK (and even normal) that as you go with the project, you end up doing things differently than what you planned for. But you do need a plan! Do not forget to include a project title and the names of the authors (i.e., your names). The PDF should be up to a single page (with two columns of text) plus as many additional pages at the end as you need for references (bibliography). In other words, the references do not count in the page limit.

Guidance: (click to open)
  • When outlining your overall project idea, explain what you are planning to investigate, how it relates to previous work, whether it is a novel idea or an extension of existing work, and what benefit you expect from it. Make it clear whether you are reproducing previous work or trying your own ideas. An important part is conducting a small literature review to make sure your idea is not identical to something that was already done. What are the most relevant existing works in the area?
  • For your plan, you should explain the steps you plan to take to implement and test your idea for the project. Be reasonably specific; "(1) design, (2) implementation, (3) evaluation" is not a plan. What planning needs to be done? What exactly will you implement, in what language, and how long do you think it will take? (Hint: you only have a few weeks and need to reserve time for experimentation and writing.) On what hardware will you run your experiments, and do you have access to the necessary hardware? What datasets will you use? Are you sure you will finish on time on the hardware you have? (Hint: if you plan to index SIFT1B on your laptop, it will not go well. Try SIFT1M instead.)
  • For evaluation, briefly detail what kind of experiments you want to run, what kind of variables you want to control, the metrics you will want to measure, and the outcomes you expect or hope to see. What are the research questions you might want to ask?

    The best empirical work can often tell the entire story with a single winning figure. Think what would that figure look like for your project: suppose you have done all the work and everything is great, what is the figure you would like to show that will tell the whole story? What is its X axis? What is the Y axis? What kind of figure it is and what is drawn on it? What story does it tell a reader? You can even include dummy figures in your proposal.
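For ANNS projects, the workhorse quality metric is recall@k: the fraction of the true k nearest neighbours that the approximate search actually returns (often plotted against throughput in exactly the kind of single winning figure described above). A small illustrative helper, with names of our choosing:

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of the true top-k neighbours found by the approximate search.
    Both arguments are (n_queries, k) arrays of result indices."""
    hits = sum(len(np.intersect1d(a, e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

# Illustrative: 2 queries, k=3; the approximate search missed one neighbour.
exact  = np.array([[0, 1, 2], [5, 6, 7]])
approx = np.array([[0, 2, 9], [5, 6, 7]])
print(recall_at_k(approx, exact))  # 5 of 6 true neighbours found -> ~0.83
```

The exact top-k (the "ground truth") is typically computed once with brute force and cached, then reused to score every index configuration you test.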

II. Progress Report (due mid March)

A 2-page report (two-column pages) plus unlimited references, meant to help you stay on track. Before writing it, think about your progress towards the goal. What have you accomplished so far? How far into the plan are you? What have you tried that worked, and what did not work? Has your plan changed, and if so, how? What more needs to be done?

We suggest the report follow a similar format to the project plan in the proposal, but now including much more detail. If some parts are already done, you can mix in content (paragraphs, tables, figures, etc.) and later re-use this content for your final report. For some projects it might even make sense to include one or two things that did not work, and how you adapted your plan.

III. Final Report (due end of term):

A comprehensive write-up, akin to a journal or conference submission, of 4-6 two-column pages.

As guidance, the format should (in general) be roughly similar to an academic paper: a short abstract, an introduction section, a related-work section (you can also weave the related work into the introduction instead of giving it its own section), a "methods" section (it can take different forms depending on the project), an evaluation section (detailing experiments, results, and analysis), and a closing "discussion" or conclusion paragraph summarizing the results at a high level and describing how you might continue the work if you could. To be clear, these are not hard rules -- for some projects it makes sense to tweak these sections. Feel free to include diagrams, figures, tables, and any other type of content that makes sense. By the end of the project you will have read at least 5 papers, so you know what they look like.

Project Showcase (the final lecture):

The last class will be a project showcase, where every project group will present a short 5-minute talk about their project.

Each group will prepare and deliver (together, or just one member) a 5-minute presentation. Present what the project is about, what you are looking into, and what you have found. If the project is not quite finished yet, show what you have found so far and what you hope or expect to find. Note this is only 5 minutes and the time limit is very strict. This means there is not a lot of time to dig into details, so focus on the main idea and the main results. Leave the fine details for the final report.

Other Notes

For any questions or further guidance, post on Piazza.


Bibliography

Bibliography used in course slides. Click to open/close each group. These are not required readings, but will prove helpful if you intend to conduct research in the area.

Surveys and documents
  • [Pan, VLDBJ'24] James Jie Pan et al. "Survey of vector database management systems." The VLDB Journal (2024).
  • [Jing, arXiv'24] Zhi Jing et al. "When Large Language Models Meet Vector Databases: A Survey." arXiv preprint arXiv:2402.01763 (2024).
  • [Simhadri, PMLR'22] Harsha Vardhan Simhadri et al. "Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search." PMLR 176:177-189, 2022.
  • [Matsui, MTA'18] Yusuke Matsui et al. "A survey of product quantization." ITE Trans. Med. Tech. App. 6.1 (2018): 2-10.
  • [Briggs, 2024] James Briggs. "Faiss: The Missing Manual". Pinecone Documentation
  • [Qdrant, 2024] Qdrant. "Qdrant documentation". Qdrant documentation.
  • [FAISS, 2023] Facebook AI Research. "FAISS Indexes". FAISS documentation.
VDBMS
  • [Guo, VLDB'22] Rentong Guo et al. "Manu: a cloud native vector database management system." Proc. VLDB Endow. 15, 12 (August 2022). (Manu, Milvus 2.0)
  • [Li, Middleware'18] Jie Li et al. "The design and implementation of a real time visual search system on JD E-commerce platform." Middleware 2018. (Vearch)
  • [Wang, SIGMOD'21] Jianguo Wang et al. "Milvus: A Purpose-Built Vector Data Management System". SIGMOD'21. (Milvus v1.0)
  • [Sage '23] Audrey Sage. "Great Algorithms Are Not Enough". Pinecone blog, 2023. (Pinecone PGA index)
Indexing
  • [Malkov, TPAMI'20] Yu A. Malkov et al. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs". IEEE Trans. Pattern Anal. Mach. Intell. 42, 4 (April 2020). (HNSW)
  • [Subramanya, NeurIPS'19] Suhas Jayaram Subramanya et al. "DiskANN : Fast Accurate Billion-point Nearest Neighbor Search on a Single Node". NeurIPS'19. (DiskANN)
  • [Singh, arXiv'21] Aditi Singh et al. "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search". arXiv preprint arXiv:2105.09613 (2021). (FreshDiskANN)
  • [Andoni, NeurIPS'15] Alexandr Andoni et al. "Practical and optimal LSH for angular distance". NeurIPS'15. (IndexLSH)
  • [Jegou, TPAMI'10] Hervé Jégou et al. "Product Quantization for Nearest Neighbor Search." IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (Mar 2010) (PQ, IVFADC)
  • [Jégou et al., ICASSP'11] Hervé Jégou et al. "Searching in one billion vectors: Re-rank with source coding." ICASSP'11 (IVFADC-R)
  • [Xu, SOSP'23] Yuming Xu et al. "SPFresh: Incremental In-Place Update for Billion-Scale Vector Search". SOSP'23 (SPFresh)
  • [Baranchuk, ECCV'18] Dmitry Baranchuk et al. "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors". ECCV'18. (IVFOADC+G+P)
  • [Bernhardsson '20] Erik Bernhardsson. "Approximate Nearest Neighbors Oh Yeah". GitHub link. (ANNOY)
Others
  • [Babenko & Lempitsky, CVPR'14] Artem Babenko and Victor Lempitsky. "Additive Quantization for Extreme Vector Compression." CVPR'14 (Additive Quantizers)
  • [Liu et al, arXiv'15] Shicong Liu et al. "Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search." arXiv preprint arXiv:1509.05195 (2015) (RVQ)
  • [Bachrach, RecSys'14] Yoram Bachrach et al. "Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces". RecSys '14.
  • [Morozov and Babenko, NeurIPS'18] Stanislav Morozov and Artem Babenko. "Non-metric Similarity Graphs for Maximum Inner Product Search". NeurIPS '18.
Recent works