Mar 30, 2025: Added project evaluation checklist.
CSC2233 Topics in Storage Systems:
Vector Databases in Modern AI Applications
Winter 2025
Cutting-edge AI applications rely on custom storage systems that store billions of vectors, search through them quickly, and retrieve relevant results within milliseconds. Examples include LLM-based AI code assistants, image and video search, generative AI using retrieval-augmented generation (RAG), computational chemistry, protein folding, and recommender systems (e.g., TikTok, YouTube, Spotify). These applications embed data -- images, user preferences, or molecular structures -- as dense vectors, and then rely on fast search to find related vectors. The recent explosion of such AI applications has given rise to a new class of specialized storage systems: the vector database, or VDBMS, which combines vector storage with semantic search over stored vectors.
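To make the idea concrete, here is a minimal Python sketch of the core operation (the vectors and ids below are toy values, not from any real embedding model): items are stored as dense vectors, and a query retrieves the k most similar ones. This version scans every stored vector; the whole point of a VDBMS is to replace this linear scan with the indexing techniques covered in the course.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn_search(query, vectors, k=2):
    """Exact k-nearest-neighbor search by scanning every stored vector.
    Returns the ids of the k most similar vectors."""
    scored = sorted(vectors.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [vec_id for vec_id, _ in scored[:k]]

# Toy "embeddings": in a real system these come from a neural encoder
# and have hundreds or thousands of dimensions.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}
print(knn_search([0.85, 0.15, 0.05], vectors, k=2))  # → ['cat', 'dog']
```

At billions of vectors this O(n) scan is far too slow, which motivates the approximate indexes discussed below.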
This seminar-style course will introduce the key ideas and techniques used in this emerging class of storage systems, such as standard and hybrid querying, approximate nearest neighbour search, cluster and graph-based indexes, liveness layers, disk-resident indexing, and quantization. We will also discuss recent papers on updatable indexes, GPU-based indexing, multi-tenancy, VDBMS architecture, and more.
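As a taste of one of the techniques above, here is a toy sketch of cluster-based (IVF-style) indexing (the class name, centroids, and data are made up for illustration): vectors are bucketed under their nearest centroid, and a query scans only the few closest buckets, trading a little accuracy for much less work.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class IVFIndex:
    """Toy inverted-file (IVF) index: each vector is stored in the bucket
    of its nearest centroid; queries scan only the nprobe closest buckets."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def add(self, vec_id, vec):
        # Assign the vector to the bucket of its nearest centroid.
        nearest = min(range(len(self.centroids)),
                      key=lambda i: l2(vec, self.centroids[i]))
        self.buckets[nearest].append((vec_id, vec))

    def search(self, query, k=1, nprobe=1):
        # Approximate: only the nprobe buckets closest to the query
        # are scanned, so true neighbors in other buckets can be missed.
        probe = sorted(range(len(self.centroids)),
                       key=lambda i: l2(query, self.centroids[i]))[:nprobe]
        candidates = [item for i in probe for item in self.buckets[i]]
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]

index = IVFIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [0.5, 0.5])
index.add("b", [9.0, 9.5])
index.add("c", [0.2, 0.1])
print(index.search([0.4, 0.4], k=1, nprobe=1))  # → ['a']
```

Raising nprobe scans more buckets, improving recall at the cost of more work -- the classic accuracy/speed trade-off that recurs throughout the course.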
Audience and Prerequisites
The course is intended both for those considering research on related topics and for those merely curious about this core component of modern AI applications.
In terms of background, the course should be suitable for CS grad students who have taken a third-year systems course or two during their undergrad studies (e.g., databases, operating systems, etc.). In other words, you need some understanding of how systems are built, some hands-on coding experience, and general knowledge of algorithms and data structures. There is no need for extensive prior background in machine learning, databases, distributed systems, or storage systems -- although such knowledge will help. Having said that, this is a graduate course: you will need to read and discuss scientific publications, and to conduct a research project. Thus you are expected to be able to conduct background research and to write code for your project.
News
Mar 23, 2025: Explained the project showcase on last day.
Older News
Feb 28, 2025: Added tips for presenting complex figures to the tips page.
Feb 12, 2025: Updated lecture slides with more material, and added student slides.
Feb 1, 2025: Added slides and a short video about segmenting -- a popular technique used in many VDBMS.
Jan 28, 2025: Project guidelines are now available.
Jan 26, 2025: Paper assignments and presentation schedule now available. Check your assignments ASAP!
Jan 22, 2025: Week 3 slides, plus guidelines and tips for presentations, are now available.
Jan 20, 2025: Bidding begins!
Bidding via HotCRP is now underway and ends on Jan 24. See details in Piazza. Note that we made a few small changes to the paper list; it should now be final.
Jan 14, 2025: Make sure to list your email address in preparation for bidding (details).
Jan 8, 2025: Welcome to CSC2233!
The course is officially open!
- You can find lecture slides in the Schedule table. Today's slides are already there.
- For those on the waiting list: you may yet have your chance if people drop (no guarantees!). Make sure to attend all classes in the meanwhile, and sign up to Piazza.
- You are welcome to audit -- the classroom is large enough.
Jan 6, 2025: Piazza mixup - double check your enrollment.
Please double check you are enrolled in Piazza, especially if you added yourself manually. Due to a mixup in configuration today, a handful of students were mistakenly removed from Piazza.
Dec 20, 2024: information on wait list and auditing
While the course is currently full, you are welcome to audit -- the classroom is large enough.
For those on the waiting list: you may yet have your chance if people drop (no guarantees!). But for this to work you must make sure to attend all classes, including the first one, and to sign up to Piazza.
Dec 15, 2024: under construction
The course website will be updated with more details once the term starts. Please check in regularly.
Information
- Instructor: Moshe Gabel
- TA: Pritish Mishra
- Class meets Wednesdays 11am-1pm at BA 1240. This is a seminar-style course. In-person attendance is mandatory (of course, with reasonable exceptions for emergencies, etc.) and participation is expected.
- Discussions and announcements in Piazza. Make sure to sign up and read it regularly.
- Project submissions on Quercus.
- We use HotCRP to bid on papers and submit paper reports.
Format
The course will follow a standard seminar format: lectures to introduce key ideas, student-led presentations on academic papers, and a final research project. In addition, students will be expected to read and report in writing on several academic papers, and to take part in class discussions. Because this is a seminar-style course, class attendance is expected and mandatory. Students are also expected to regularly check the course website.
Reports
Students will be required to read several papers and write four brief reports on them. The goal is to get students used to reading papers critically (but not necessarily negatively!) and to promote class discussion.
Students will bid on which papers to present and report via HotCRP. Reports will also be submitted through the same system.
Paper reports are due at 8pm the day before the presentation of the paper (8pm Tuesday the week the paper is presented).
Presentations and Participation
Presentations are a core component of the course. Each student will present one paper in person, and will also play a role in class discussions of four other papers. Students are encouraged to participate in discussions.
See Piazza for guidelines, and you can find some tips for effective presentations here.
Projects
Students will pair up to propose and pursue a small hands-on research project. It could be trying out a new idea, implementing or benchmarking a paper we covered in class, contributing a theoretical analysis, or providing some other constructive contribution. Project proposals are due roughly half-way through the course (the instructor can provide some suggestions), and the final report is due at the end of the term.
See project guidelines below.
Marking Scheme
The final grade will be based on the project and final report, the presentation, class participation, and attendance. The exact breakdown will be announced at the beginning of the term.
Accessibility
The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.
Schedule and Slides
Click on links for the lecture slides.
Week | Date | Content | Due |
---|---|---|---|
1 | Jan 8 | Intro to VDBMS | |
2 | Jan 15 | Querying, basic indexing | |
3 | Jan 22 | Advanced indexing + segmenting (video), architectures | Paper and report bids due Jan 24 |
4 | Jan 29 | Student presentations: Incremental and disk-resident indexing [Subramanya, NeurIPS'19] [Singh, arXiv'21] | |
5 | Feb 5 | Student presentations: Multitenancy and disaggregation [Xu, SOSP'23] [Su, SIGMOD'24] | |
6 | Feb 12 | Student presentations: Filtered/hybrid queries [Gollapudi, WWW'23] [Zhang, OSDI'23] [Cai, SIGMOD'24] (slides source) | Project proposals due Feb 15 |
-- | Feb 19 | Reading week (no class) | |
7 | Feb 26 | Student presentations: GPUs in VDBMS [Johnson, Trans. Big Data '19] [Zhu, SIGMOD'24] | |
8 | Mar 5 | Student presentations: Learning to index [Gupta, KDD'22] [Li, KDD'23] [Jääsaari, NeurIPS'24] | |
9 | Mar 12 | Student presentations: SSDs in VDBMS [Huang, ICDE'24] [Wang, SIGMOD'24] | Mid-project report due Mar 14 |
10 | Mar 19 | Student presentations: Quantization and compression [Ge et al., CVPR'13] [Aguerrebere, VLDB'23] [Gao, SIGMOD'24] | |
11 | Mar 26 | Student presentations: VDBMS architectures [Zhang, SIGMOD'24] [Wei, VLDB'20] [Chen, VLDB'24] | |
12 | Apr 2 | Project showcase! | Project report due Apr 4 |
Paper List
Note that some papers focus on algorithms, some are more math-heavy, and others lean towards computer systems and architecture. A small number of papers might require deeper knowledge of systems or a stronger understanding of machine learning. In the end, it falls to you to choose which paper to present and to fill in any missing knowledge as you do so. It is a great opportunity to learn! However, regardless of how your paper leans, it is your job as presenter to make it accessible to the class.
Make sure to list your paper preferences in HotCRP by Jan 24.
Incremental disk-resident indexing:
- [Subramanya, NeurIPS'19] Suhas Jayaram Subramanya et al. DiskANN : Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS'19. (DiskANN)
- [Singh, arXiv'21] Aditi Singh et al. FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. arXiv preprint arXiv:2105.09613 (2021). (FreshDiskANN)
- [Xu, SOSP'23] Yuming Xu et al. SPFresh: Incremental In-Place Update for Billion-Scale Vector Search. SOSP'23. (SPFresh)
Multitenancy and disaggregation:
- [Jin, arXiv'24] Yicheng Jin et al. Curator: Efficient Indexing for Multi-Tenant Vector Databases. arXiv preprint arXiv:2401.07119 (2024). (Curator)
- [Jang, ATC'23] Junhyeok Jang et al. CXL-ANNS: Software-Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search. USENIX ATC'23. (CXL-ANNS)
- [Su, SIGMOD'24] Yongye Su et al. Vexless: A Serverless Vector Data Management System Using Cloud Functions. Proc. ACM Manag. Data 2, 3, Article 187 (June 2024). (Vexless)
Filtered/hybrid queries:
- [Gollapudi, WWW'23] Siddharth Gollapudi et al. Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters. WWW'23. (Filtered-DiskANN)
- [Zhang, OSDI'23] Qianxi Zhang et al. VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity. OSDI'23. (VBase)
- [Cai, SIGMOD'24] Yuzheng Cai et al. Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search. Proc. ACM Manag. Data 2, 6, Article 246. (UNG)
GPUs in VDBMS:
- [Johnson, Trans. Big Data '19] Jeff Johnson et al. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data 7.3 (2019): 535-547.
- [Gaihre, SC'21] Anil Gaihre et al. Dr. Top-k: delegate-centric Top-k on GPUs. SC'21. (Dr. Top-k)
- [Zhu, SIGMOD'24] Yifan Zhu et al. GTS: GPU-based Tree Index for Fast Similarity Search. Proc. ACM Manag. Data 2, 3, Article 142. (GTS)
Learning to index:
- [Gupta, KDD'22] Gaurav Gupta et al. BLISS: A Billion scale Index using Iterative Re-partitioning. KDD'22. (BLISS)
- [Li, KDD'23] Wuchao Li et al. Learning Balanced Tree Indexes for Large-Scale Vector Retrieval. KDD'23. (BATL)
- [Jääsaari, NeurIPS'24] Elias Jääsaari et al. LoRANN: Low-Rank Matrix Factorization for Approximate Nearest Neighbor Search. NeurIPS'24 (LoRANN)
SSDs in VDBMS:
- [Huang, ICDE'24] Yuchen Huang et al. Neos: A NVMe-GPUs Direct Vector Service Buffer in User Space. ICDE'24. (Neos)
- [Wang, SIGMOD'24] Mengzhao Wang et al. Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment. Proc. ACM Manag. Data 2, 1, Article 14. (Starling)
- [Tian, ATC'24] Bing Tian et al. Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs. USENIX ATC'24 (SmartANNS)
Quantization and compression:
- [Ge et al., CVPR'13] Tiezheng Ge et al. Optimized Product Quantization for Approximate Nearest Neighbor Search. CVPR'13. (OPQ)
- [Aguerrebere, VLDB'23] Cecilia Aguerrebere et al. Similarity Search in the Blink of an Eye with Compressed Indices. Proc. VLDB Endow. 16, 11. (LVQ)
- [Gao, SIGMOD'24] Jianyang Gao et al. RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search. Proc. ACM Manag. Data 2, 3, Article 142. (RaBitQ)
VDBMS architectures:
- [Zhang, SIGMOD'24] Xinyi Zhang et al. FedKNN: Secure Federated k-Nearest Neighbor Search. Proc. ACM Manag. Data 2, 1, Article 11. (FedKNN, DANN)
- [Wei, VLDB'20] Chuangxian Wei et al. AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data. Proc. VLDB Endow. 13, 12. (AnalyticDB-V, ADBV)
- [Chen, VLDB'24] Cheng Chen et al. SingleStore-V: An Integrated Vector Database System in SingleStore. Proc. VLDB Endow. 17, 12. (SingleStore-V)
Projects
You will pair up to propose and pursue a small hands-on research project on vector databases. The goal is for you to gain experience doing such hands-on research, as well as to practice synthesizing your results in written form. (We may also hold a small "workshop" during the final class, where each group presents their work in a short 5-minute presentation.)
Our TA, Pritish Mishra, will be available on Fridays at 11am-12pm on Zoom to provide some help on the project. He will also be reviewing the projects, so you should absolutely listen to him!
You are welcome to come up with your own project, or grab a project idea from the list.
Scope
The main requirement is that your project must make some modest but (ideally) meaningful contribution to the field of vector databases or ANNS indexing. It could be trying out a new idea, extending a paper in the area, implementing or benchmarking and comparing existing work, a fresh analysis of common datasets, a theoretical insight or analysis, or some other constructive contribution. The project should focus on quantitative methods, such as algorithm design, empirical evaluation of systems and algorithms, or mathematical modeling; something you can measure or prove.
The scope should be roughly that of a small systems workshop paper. You are welcome to tailor the project to align with your own research interests or background, or to make it more relevant and impactful -- but it should be connected to vector databases.
Deliverables
To ensure steady progress on your project, we will ask you to submit three PDF deliverables through the term. The PDFs are submitted through Quercus assignments, and any feedback on them will come back via Quercus as well.
Format: the PDF must follow the LaTeX template for USENIX papers. As is common in conferences, you should not change things like font sizes, line spacing, or the width and height of the text area. However, for the proposal and progress report deliverables you can omit the abstract. A common mistake is microscopic figures! Make sure your figures are readable without needing to zoom in too much. (Hint: if figure text is smaller than footnote text, it is probably too small.)
Note that using LaTeX is mandatory! Do not use Word, not even the Word template from USENIX. Using LaTeX is an essential skill for CS grad students and is a learning goal. Instead of using LaTeX locally you can use Overleaf, which can make collaboration easier and requires no setup.
I. Proposal (due mid-Feb, exact date is in Quercus)
A concise document outlining your idea and execution plan (up to one page). Explain in reasonable detail your idea, how you will pursue it, and how you will evaluate it. The proposed plan is not set in stone; it is perfectly OK (and even normal) that as the project progresses, you end up doing things differently than you planned. But you do need a plan! Do not forget to include a project title and the names of the authors (i.e., your names). The PDF should be up to a single page (with two columns of text) plus as many additional pages at the end as you need for references (bibliography). In other words, the references do not count towards the page limit.
Guidance:
- When outlining your overall project idea, explain what you plan to investigate, how it relates to previous work, whether it is a novel idea or an extension of existing work, and what benefit you expect it to bring. Make it clear whether you are reproducing previous work or trying your own ideas. An important part is conducting a small literature review to make sure your idea is not identical to something that has already been done. What are the most relevant existing works in the area?
- For your plan, explain the steps you will take to implement and test your idea. Be reasonably specific; "(1) design, (2) implementation, (3) evaluation" is not a plan. What planning needs to be done? What exactly will you implement, in what language, and how long do you think it will take? (Hint: you only have a few weeks and need to reserve time for experimentation and writing.) On what hardware will you run your experiments, and do you have access to it? What datasets will you use? Are you sure you will finish on time on the hardware you have? (Hint: if you plan to index SIFT1B on your laptop, this will not go well. Try SIFT1M.)
- For evaluation, briefly detail what kind of experiments you want to run, what variables you want to control, the metrics you will measure, and the outcomes you expect or hope to see. What research questions might you want to ask?
The best empirical work can often tell the entire story with a single winning figure. Think about what that figure would look like for your project: suppose you have done all the work and everything is great -- what is the figure you would like to show that tells the whole story? What is its X axis? What is its Y axis? What kind of figure is it, and what is drawn on it? What story does it tell a reader? You can even include dummy figures in your proposal.
II. Progress Report (due mid March)
A 2-page report (two-column pages) plus unlimited references, meant to help you keep on track. Before writing it, think about your progress towards the goal. What have you accomplished so far? How far into the plan are you? What have you tried that worked, and what did not work? Has your plan changed, and if so how? What more needs to be done?
We suggest the report follow a similar format to the project plan in the proposal, but now including much more detail. If some parts are already done, you can mix in content (paragraphs, tables, figures, etc.) and later re-use this content for your final report. For some projects it might even make sense to include one or two things that did not work, and how you adapted your plan.
III. Final Report (due end of term):
A comprehensive write-up akin to a journal or conference submission, 4-6 two-column pages.
As guidance, the format should (in general) be roughly similar to an academic paper: a short abstract; an introduction section; a related work section (you can also weave related work into the introduction instead of giving it its own section); a "methods" section (which can take different forms depending on the project); an evaluation section detailing experiments, results, and analysis; and a closing "discussion" or conclusion paragraph that summarizes the results at a high level and discusses how you might continue the work if you could. To be clear, these are not hard rules -- for some projects it makes sense to tweak these sections. Feel free to include diagrams, figures, tables, and any other content that makes sense. By the end of the project you will have read at least 5 papers, so you know what they look like.
Project Showcase (the final lecture):
The last class will be a project showcase, where every project group will present a short 5-minute talk about their project.
Each group will prepare and deliver (together, or just one member) a 5-minute presentation. You should present what the project is about, what you are looking into, and what you have found. If the project is not quite finished, show what you have found so far and what you hope or expect to find. Note this is only 5 minutes and the time limit is very strict: there is not a lot of time to dig into details, so focus on the main idea and the main results. Leave the fine details for the final report.
Other Notes
- The project is an opportunity to deepen your understanding of vector databases as well as practice applied research. If you treat it seriously, you will learn a lot, it can be a nice line on your CV, and you might even decide to extend it into a publication. If you treat it as an afterthought, you will not get much out of it.
- You are encouraged and welcome to publish your projects in various ways: as a submission to a workshop or conference, as an open-source project, or even as a blog post. However, this is not required at all -- it is entirely up to you.
- We evaluate projects on two dimensions: ambition and execution.
By "ambition" we mean for example how original the contribution is (novelty), how much effort it takes to do, how risky it is.
Note that this is not a conference: a project can be plenty ambitious without being "novel" from a publication standpoint.
By "execution" we mean rigorous evaluation, in-depth analysis, clarity of writing, overall quality of execution, and so on.
If a project is not very ambitious but has excellent execution, it will do well. If a project is ambitious but its execution is not perfect -- that is also fine. The problem starts for projects that are not ambitious ("benchmark 3 open source tools") whose execution is also poor (e.g., simplistic evaluation with a single experiment, and without analysis); these will not do as well.
- We use this checklist to help us assess projects, reports, and showcase presentations. You are not expected to meet all items!
For any questions or further guidance, post on Piazza.
Bibliography
Bibliography used in course slides. These are not required readings, but they will prove helpful if you intend to conduct research in the area.
Surveys and documents
- [Pan, VLDBJ'24] James Jie Pan et al. "Survey of vector database management systems." The VLDB Journal (2024).
- [Jing, arXiv'24] Zhi Jing et al. "When Large Language Models Meet Vector Databases: A Survey." arXiv preprint arXiv:2402.01763 (2024).
- [Simhadri, PMLR'22] Harsha Vardhan Simhadri et al. "Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search." PMLR 176:177-189, 2022.
- [Matsui, MTA'18] Yusuke Matsui et al. "A survey of product quantization." ITE Transactions on Media Technology and Applications 6.1 (2018): 2-10.
- [Briggs, 2024] James Briggs. "Faiss: The Missing Manual". Pinecone Documentation
- [Qdrant, 2024] Qdrant. "Qdrant documentation". Qdrant documentation.
- [FAISS, 2023] Facebook AI Research. "FAISS Indexes". FAISS documentation.
VDBMS
- [Guo, VLDB'22] Rentong Guo et al. "Manu: a cloud native vector database management system." Proc. VLDB Endow. 15, 12 (August 2022). (Manu, Milvus 2.0)
- [Li, Middleware'18] Jie Li et al. "The design and implementation of a real time visual search system on JD E-commerce platform." Middleware 2018. (Vearch)
- [Wang, SIGMOD'21] Jianguo Wang et al. "Milvus: A Purpose-Built Vector Data Management System". SIGMOD'21. (Milvus v1.0)
- [Sage '23] Audrey Sage. "Great Algorithms Are Not Enough". Pinecone blog, 2023. (Pinecone PGA index)
Indexing
- [Malkov, TPAMI'20] Yu A. Malkov et al. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs". IEEE Trans. Pattern Anal. Mach. Intell. 42, 4 (April 2020). (HNSW)
- [Subramanya, NeurIPS'19] Suhas Jayaram Subramanya et al. "DiskANN : Fast Accurate Billion-point Nearest Neighbor Search on a Single Node". NeurIPS'19. (DiskANN)
- [Singh, arXiv'21] Aditi Singh et al. "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search". arXiv preprint arXiv:2105.09613 (2021). (FreshDiskANN)
- [Andoni, NeurIPS'15] Alexandr Andoni et al. "Practical and optimal LSH for angular distance". NeurIPS'15. (IndexLSH)
- [Jegou, TPAMI'10] Hervé Jégou et al. "Product Quantization for Nearest Neighbor Search." IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (Mar 2010) (PQ, IVFADC)
- [Jégou et al., ICASSP'11] Hervé Jégou et al. "Searching in one billion vectors: Re-rank with source coding." ICASSP'11 (IVFADC-R)
- [Xu, SOSP'23] Yuming Xu et al. "SPFresh: Incremental In-Place Update for Billion-Scale Vector Search". SOSP'23 (SPFresh)
- [Baranchuk, ECCV'18] Dmitry Baranchuk et al. "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors". ECCV'18. (IVFOADC+G+P)
- [Bernhardsson '20] Erik Bernhardsson. "Approximate Nearest Neighbors Oh Yeah". GitHub link. (ANNOY)
Others
- [Babenko & Lempitsky, CVPR'14] Artem Babenko and Victor Lempitsky. "Additive Quantization for Extreme Vector Compression." CVPR'14 (Additive Quantizers)
- [Liu et al, arXiv'15] Shicong Liu et al. "Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search." arXiv preprint arXiv:1509.05195 (2015) (RVQ)
- [Bachrach, RecSys'14] Yoram Bachrach et al. "Speeding Up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces". RecSys '14.
- [Morozov and Babenko, NeurIPS'18] Stanislav Morozov and Artem Babenko. "Non-metric Similarity Graphs for Maximum Inner Product Search". NeurIPS '18.
Recent works
Improving indexing:
- [Chen, WWW'23] Patrick Chen et al. FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search. WWW'23. (FINGER)
- [Li, SIGMOD'20] Conglong Li et al. Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination. SIGMOD'20.
- [Wang, ICDE'24] Mengzhao Wang et al. MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality. ICDE'24. (MUST, multi-modal/multi-vector queries)
Filtered/hybrid queries:
- [Mohoney, SIGMOD'23] Jason Mohoney et al. High-Throughput Vector Similarity Search in Knowledge Graphs. SIGMOD'23. (HQI)
- [Wang, NeurIPS'23] Mengzhao Wang et al. An Efficient and Robust Framework for Approximate Nearest Neighbor Search with Attribute Constraint. NeurIPS'23. (NHQ)
- [Patel, SIGMOD'24] Liana Patel et al. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proc. ACM Manag. Data 2, 3, Article 120. (ACORN)