This is a tentative course schedule. Handouts and slides will be added to this page as the course progresses. Make sure to refresh the page to see the latest handouts.
| # | Date | Topic | Reading | Handouts | Assignments |
|---|---|---|---|---|---|
| 1 | Jan 6 | Course Logistics and Introduction | - | H01 - Info sheet [pdf] H02 - Lecture 1 [pdf][pptx] | - |
| 2 | Jan 13 | Data Center Networks, Network Programming | - | H03 - Lecture 2 [pdf][pptx] | - |
| 3 | Jan 20 | Networks and ML | - | H04 - Lecture 3 [pdf][pptx] | - |
| 4 | Jan 27 | High-Performance Transport | [1], [2], [3] | - | - |
| 5 | Feb 3 | Job and Flow Scheduling | [4], [5], [6] | - | - |
| 6 | Feb 10 | Reconfigurable Data Center Networks | [7], [8], [9] | - | Project proposal due on Feb 13 |
| 7 | Feb 17 | - | - | - | - |
| 8 | Feb 24 | Scaling the solutions | - | - | - |
| 9 | Mar 3 | - | - | - | - |
| 10 | Mar 10 | - | - | - | Intermediate report due on Mar 13 |
| 11 | Mar 17 | - | - | - | - |
| 12 | Mar 24 | Final project presentations | - | - | - |
| 13 | Mar 31 | Final project presentations | - | - | Final report due on Apr 3 |
This MSc/PhD-level course delves into the core challenges of interconnection networks, emphasizing the use of machine learning to address these issues. The rapid growth of computing demands, driven by machine learning applications, has introduced significant challenges in areas such as bandwidth, latency, and packet loss. Meeting these demands requires innovative techniques and a fresh approach to traditional networking solutions across various layers, including the link, transport, and application layers.
The course begins with a review of key concepts in computer networking, such as packet-switching systems, data center networks, and software-defined networking. It then explores advanced research challenges and cutting-edge solutions in the field.
A previous course on computer networks (CSC2209H or equivalent) is highly recommended. Basic undergraduate courses in algorithms and probability theory are recommended.
The course is based on recent research material, and we do not have a textbook.
You have one free late submission of two days. You can use it on any one of the deliverables (proposal, intermediate report, or final report). You must notify the TA before using your free late submission.
In addition to the free late submission, each assignment may be submitted up to two days late. Each late day beyond the free late submission incurs a 10% deduction (up to 20%). Assignments and project reports will not be accepted more than two days late.
Please use our class bulletin board (on Piazza) to ask questions or discuss any course-related topics. You can sign up to the bulletin board here:
https://piazza.com/utoronto.ca/winter2026/csc2229
By using the bulletin board, everyone in class can read the replies, and the overall number of repeat questions is reduced. Please check the bulletin board before posting any new questions. If you have any questions that cannot be posted on the bulletin board (e.g. questions about your grades), you can e-mail the TA or the course instructor directly.
Please make sure to check the announcements folder regularly for updates regarding lectures, assignments, etc. or enable notifications on Piazza.
In addition to our bulletin board, we have a mailing list that will be used exclusively for sharing important information. We will use the email address you have used on ACORN to create this list (please make sure that is a valid email address). Please do not use this email to ask questions.
Students will present papers from the “Reading List” throughout the term. Each presentation is expected to be 20 minutes followed by 10 minutes of Q&A and discussions.
The list of papers we will read in this course will be added here.
[1] Z. Wang et al., “SRNIC: A Scalable Architecture for RDMA NICs,” presented at the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 1–14. Available: https://www.usenix.org/conference/nsdi23/presentation/wang-zilong.
[2] Y. Lu et al., “Multi-Path Transport for RDMA in Datacenters,” presented at the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 357–371. Available: https://www.usenix.org/conference/nsdi18/presentation/lu.
[3] K. Prasopoulos et al., “SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol,” presented at the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025, pp. 451–471. Available: https://www.usenix.org/conference/nsdi25/presentation/prasopoulos.
[4] W. Li et al., “Flow Scheduling with Imprecise Knowledge,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 95–111. Available: https://www.usenix.org/conference/nsdi24/presentation/li-wenxin.
[5] S. Rajasekaran, M. Ghobadi, and A. Akella, “CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1403–1420. Available: https://www.usenix.org/conference/nsdi24/presentation/rajasekaran.
[6] F. Zandi, M. Ghobadi, and Y. Ganjali, “FORESIGHT: Joint Time and Space Scheduling for Efficient Distributed ML Training,” in IEEE/IFIP Networking, Apr. 2025, pp. 1–9. Available: https://networking.ifip.org/2025/images/Net25_papers/1571125810.pdf.
[7] X. Liao et al., “MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training,” in Proceedings of the ACM SIGCOMM 2025 Conference, in SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, Aug. 2025, pp. 554–574. doi: 10.1145/3718958.3750465.
[8] F. De Marchi, J. Li, Y. Zhang, W. Bai, and Y. Xia, “Unlocking Superior Performance in Reconfigurable Data Center Networks with Credit-Based Transport,” in Proceedings of the ACM SIGCOMM 2025 Conference, in SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, Aug. 2025, pp. 842–860. doi: 10.1145/3718958.3754342.
[9] L. Poutievski et al., “Jupiter evolving: transforming google’s datacenter network via optical circuit switches and software-defined networking,” in Proceedings of the ACM SIGCOMM 2022 Conference, in SIGCOMM ’22. New York, NY, USA: Association for Computing Machinery, Aug. 2022, pp. 66–85. doi: 10.1145/3544216.3544265.
[10] Q. Meng et al., “Astral: A Datacenter Infrastructure for Large Language Model Training at Scale,” in Proceedings of the ACM SIGCOMM 2025 Conference, in SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, Aug. 2025, pp. 609–625. doi: 10.1145/3718958.3750521.
[11] Y. Zu et al., “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 761–774. Available: https://www.usenix.org/conference/nsdi24/presentation/zu
[12] Q. Hu et al., “Characterization of Large Language Model Development in the Datacenter,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 709–729. Available: https://www.usenix.org/conference/nsdi24/presentation/hu
Final projects are completed by groups of 2-3 students. For each deliverable step, a single file (PDF only) should be submitted electronically to the instructor’s e-mail address. If the project involves any simulation, data analysis, etc., all the necessary files must be uploaded to a web page, and the link should be sent in the same e-mail as the PDF file. In all deliverables, please make sure that all pages (and especially the figures) display properly in Acrobat Reader and that the document prints properly.
You can choose any project related to computer networks for machine learning. You can use the list below as a starting point in your explorations, but you are certainly not required to limit yourself to this list. The project can aim to replicate and improve an existing solution, or to provide a survey of existing solutions (creating new insights by looking at the problem and its solutions from different perspectives). You are also welcome to choose a problem that applies your own background and ideas to the topics covered in this course. Regardless of how you choose your topic, you are strongly encouraged to discuss it with the instructor before submitting the proposal.
1) Optical Interconnects and Topology Co-Design
Traditional copper-based electrical links are hitting a “networking wall” due to power density and reach limitations. This project explores the shift toward reconfigurable optical fabrics.
Topic Ideas: Design a photonic collective communication library (PCCL) that reconfigures the network topology in real-time to match specific ML collective patterns (e.g., AllReduce, All-to-All).
Key Challenge: How can we minimize the “reconfiguration latency” of optical switches so it doesn’t stall the GPU computation?
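As a starting point for reasoning about this trade-off, the sketch below (all function names and parameter values are illustrative, not from any real PCCL) derives the circuit configurations an optical switch would need for two common collectives, and estimates what fraction of time is lost to reconfiguration if GPUs idle while circuits are being set up:

```python
# Hypothetical sketch: circuit configurations for ML collectives on an
# optical circuit switch, plus a back-of-envelope stall estimate.

def ring_allreduce_circuits(n_gpus):
    """A ring AllReduce only ever uses the links i -> (i+1) % n, so one
    optical configuration serves all 2*(n-1) steps: zero reconfigurations."""
    return {i: (i + 1) % n_gpus for i in range(n_gpus)}

def all_to_all_circuits(n_gpus):
    """All-to-All needs n-1 distinct permutations (one per round);
    round r connects i -> (i + r) % n, forcing a reconfiguration per round."""
    return [{i: (i + r) % n_gpus for i in range(n_gpus)}
            for r in range(1, n_gpus)]

def stall_fraction(n_reconfigs, reconfig_us, step_compute_us, n_steps):
    """Fraction of total time lost if GPUs idle during every reconfiguration."""
    stall = n_reconfigs * reconfig_us
    return stall / (stall + n_steps * step_compute_us)
```

For example, with 3 reconfigurations of 10 µs each against 3 rounds of 100 µs of useful work, `stall_fraction(3, 10.0, 100.0, 3)` comes out to about 9%, which is why ring-friendly topologies (one static configuration) are so attractive for optical fabrics.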
2) Networking for Large Language Model (LLM) Inference
Unlike training, LLM inference is often memory-bandwidth bound and involves two distinct phases: Prefill (processing the prompt) and Decode (generating tokens one-by-one).
Topic Ideas: Investigate Network-Aware Request Scheduling for “Split-wise” inference, where prefill and decode tasks are handled by different clusters.
Key Challenge: Design a transport protocol that optimizes for the tiny, frequent packets of the Decode phase while maintaining high throughput for the Prefill phase.
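One very simple baseline for the scheduling side of this project (all names and data shapes below are made up for illustration) is phase-aware, least-loaded routing: prefill requests go to a throughput-optimized pool and decode requests to a latency-optimized pool, with the least-backlogged server chosen within each pool:

```python
# Hypothetical sketch of network-aware request routing for split-wise
# inference. A real scheduler would also consider KV-cache transfer cost
# between the prefill and decode clusters.

def route_request(request, prefill_pool, decode_pool):
    """request: {"phase": "prefill" | "decode", "tokens": int}
    each pool: list of {"name": str, "queue_tokens": int} server records."""
    pool = prefill_pool if request["phase"] == "prefill" else decode_pool
    target = min(pool, key=lambda s: s["queue_tokens"])  # least-loaded server
    target["queue_tokens"] += request["tokens"]          # book the new work
    return target["name"]
```

A project could start from this baseline and add the network-aware part: weighting the choice by the cost of shipping the prompt's KV cache from the prefill cluster to the chosen decode server.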
3) CXL and Memory Disaggregation for ML
The “memory wall” is a major bottleneck for models that don’t fit on a single GPU’s HBM. Compute Express Link (CXL) allows for memory pooling across the network.
Topic Ideas: Evaluate the performance impact of CXL-based memory expansion on “Mixture of Experts” (MoE) models, where only a fraction of weights are active at any time.
Key Challenge: Develop a tiering strategy to decide which model weights stay in local GPU memory and which are fetched over the CXL fabric.
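A first-cut tiering policy (purely illustrative; expert names, sizes, and routing counts are invented) is a greedy knapsack on routing frequency: the most frequently routed-to experts stay in HBM until the budget is spent, and the rest are served over the CXL fabric:

```python
# Hypothetical greedy tiering sketch for MoE expert weights.

def tier_experts(access_counts, size_gb, hbm_budget_gb):
    """access_counts: {expert: times routed to}; size_gb: {expert: weight size}.
    Returns (hbm_resident, cxl_resident) expert lists."""
    hot_first = sorted(access_counts, key=access_counts.get, reverse=True)
    hbm, cxl, used = [], [], 0.0
    for expert in hot_first:
        if used + size_gb[expert] <= hbm_budget_gb:
            hbm.append(expert)        # pinned in local GPU HBM
            used += size_gb[expert]
        else:
            cxl.append(expert)        # fetched over the CXL fabric on demand
    return hbm, cxl
```

The interesting research questions start where this sketch stops: routing distributions drift during training, so a real policy needs online re-tiering and a model of CXL fetch latency versus recomputation.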
4) Multi-Tenant Isolation and “Noisy Neighbor” Mitigation
AI superclusters are increasingly shared across different teams or customers. A single “elephant flow” from one training job can devastate the latency of a real-time inference job.
Topic Ideas: Implement Fabric-Scheduled Ethernet or advanced congestion control (like a refined HPCC or Swift) specifically for multi-tenant ML workloads.
Key Challenge: How do we achieve strict performance isolation without underutilizing the expensive network fabric?
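A classic mechanism worth knowing as a baseline here is deficit round-robin (DRR): each tenant earns a byte quantum per round, so an elephant flow cannot starve a latency-sensitive neighbor, yet idle tenants do not waste fabric capacity. The sketch below (tenant names, packet sizes, and quanta are illustrative) models one port:

```python
# Deficit round-robin sketch for per-tenant isolation on a shared port.
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """queues: {tenant: deque of packet sizes in bytes};
    quanta: {tenant: bytes of credit earned per round}.
    Returns the transmission order as (tenant, size) pairs."""
    deficit = {t: 0 for t in queues}
    sent = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0   # no backlog: credit is not banked
                continue
            deficit[tenant] += quanta[tenant]
            while q and q[0] <= deficit[tenant]:
                size = q.popleft()
                deficit[tenant] -= size
                sent.append((tenant, size))
    return sent
```

A project could compare this kind of strict scheduling against end-host congestion control (HPCC, Swift) on realistic mixed training/inference traces, and quantify the isolation-versus-utilization trade-off in the key challenge above.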
5) In-Network Aggregation and Collective Optimization
This builds on the In-Network Computing (INC) topic by focusing on the Ultra Ethernet Consortium (UEC) standards and modern programmable switches.
Topic Ideas: Implement a “Floating Point” aware in-network aggregator that can perform gradient summation directly on the switch hardware (e.g., using P4) to reduce data traffic by 50%.
Key Challenge: Handling packet loss in a way that doesn’t require a full re-transmission of the aggregated gradient.
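One standard idea for the loss-handling challenge (in the spirit of SwitchML-style designs; this toy model is not any particular system's implementation) is to make aggregation idempotent: the switch keeps a per-slot bitmap of which workers have contributed, so a retransmitted packet is silently dropped rather than summed twice, and only the worker whose packet was actually lost needs to retransmit:

```python
# Toy model of a loss-tolerant in-network aggregation slot.

class SwitchSlot:
    def __init__(self, n_workers):
        self.total = 0.0
        self.seen = set()          # bitmap of workers that contributed
        self.n_workers = n_workers

    def receive(self, worker_id, value):
        """Returns the aggregate once every worker has contributed, else None."""
        if worker_id in self.seen:
            return None            # duplicate (retransmit): idempotent drop
        self.seen.add(worker_id)
        self.total += value
        if len(self.seen) == self.n_workers:
            return self.total
        return None
```

On real switch ASICs the floating-point part of the challenge appears too: many pipelines only support integer arithmetic, so gradients are typically scaled and quantized before summation.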
6) Microburst Detection and Buffer Management
Machine learning traffic is famously “bursty” due to the Barrier-Synchronous Parallel (BSP) nature of training.
Topic Ideas: Use ML-on-the-Switch to predict upcoming microbursts based on the communication phase of the training job.
Key Challenge: Creating a detection model small enough to run at line rate (nanosecond scale) on a network ASIC.
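Before reaching for an ML model, it helps to have a baseline that trivially fits a line-rate budget; a single exponentially weighted moving average (EWMA) of queue depth with a threshold is about as cheap as detection gets. The sketch below is illustrative (alpha and threshold would be tuned per port and workload):

```python
# EWMA baseline for microburst detection on queue-depth samples.

def ewma_alarms(depth_samples, alpha=0.5, threshold=70.0):
    """Flag samples where the smoothed queue depth crosses the threshold."""
    avg, alarms = 0.0, []
    for depth in depth_samples:
        avg = alpha * depth + (1.0 - alpha) * avg
        alarms.append(avg > threshold)
    return alarms
```

A project in this space would then ask what an ML-on-the-switch model can add over this baseline, e.g., using the BSP phase of the training job to raise the alarm *before* the queue fills rather than after.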