This is a tentative course schedule. Handouts and slides will be added to this page as the course progresses. Make sure to refresh the page to see the latest handouts.
| # | Date | Topic | Reading | Handouts | Assignments |
|---|---|---|---|---|---|
| 1 | Jan 6 | Course Logistics and Introduction | - | H01 - Info sheet [pdf] H02 - Lecture 1 [pdf][pptx] | - |
| 2 | Jan 13 | Data Center Networks, Network Programming | - | H03 - Lecture 2 [pdf][pptx] | - |
| 3 | Jan 20 | Networks and ML | - | H04 - Lecture 3 [pdf][pptx] | - |
| 4 | Jan 27 | High-Performance Transport | [1], [2], [3] | - | - |
| 5 | Feb 3 | Job and Flow Scheduling | [4], [5], [6] | - | - |
| 6 | Feb 10 | Reconfigurable Data Center Networks | [7], [8], [9] | - | Project proposal due on Feb 13 |
| 7 | Feb 17 | - | - | - | - |
| 8 | Feb 24 | Scaling the solutions | - | - | - |
| 9 | Mar 3 | - | - | - | - |
| 10 | Mar 10 | - | - | - | Intermediate report due on Mar 13 |
| 11 | Mar 17 | - | - | - | - |
| 12 | Mar 24 | Final project presentations | - | - | - |
| 13 | Mar 31 | Final project presentations | - | - | Final report due on Apr 3 |
This MSc/PhD-level course delves into the core challenges of interconnection networks, emphasizing the use of machine learning to address these issues. The rapid growth of computing demands, driven by machine learning applications, has introduced significant challenges in areas such as bandwidth, latency, and packet loss. Meeting these demands requires innovative techniques and a fresh approach to traditional networking solutions across various layers, including the link, transport, and application layers.
The course begins with a review of key concepts in computer networking, such as packet-switching systems, data center networks, and software-defined networking. It then explores advanced research challenges and cutting-edge solutions in the field.
A previous course on computer networks (CSC2209H or equivalent) is highly recommended. Basic undergraduate courses in algorithms and probability theory are recommended.
The course is based on recent research material, and we do not have a textbook.
You have one free late submission of two days. You can use it on any one of the deliverables (proposal, intermediate report, or final report). You must notify the TA before using your free late submission.
In addition to the free late submission, each assignment may be submitted up to two days late. Each late day beyond the free late submission incurs a 10% deduction (up to 20%). Assignments and project reports will not be accepted more than two days late.
Please use our class bulletin board (on Piazza) to ask questions or discuss any course-related topics. You can sign up to the bulletin board here:
https://piazza.com/utoronto.ca/winter2026/csc2229
By using the bulletin board, everyone in class can read the replies, and the overall number of repeat questions is reduced. Please check the bulletin board before posting any new questions. If you have any questions that cannot be posted on the bulletin board (e.g. questions about your grades), you can e-mail the TA or the course instructor directly.
Please make sure to check the announcements folder regularly for updates regarding lectures, assignments, etc. or enable notifications on Piazza.
In addition to our bulletin board, we have a mailing list that will be used exclusively for sharing important information. We will use the email address you have used on ACORN to create this list (please make sure that is a valid email address). Please do not use this email to ask questions.
Students will present papers from the “Reading List” throughout the term. Each presentation is expected to be 20 minutes followed by 10 minutes of Q&A and discussions.
The list of papers we will read in this course will be added here.
[1] Z. Wang et al., “SRNIC: A Scalable Architecture for RDMA NICs,” presented at the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 1–14. Available: https://www.usenix.org/conference/nsdi23/presentation/wang-zilong.
[2] Y. Lu et al., “Multi-Path Transport for RDMA in Datacenters,” presented at the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 357–371. Available: https://www.usenix.org/conference/nsdi18/presentation/lu.
[3] K. Prasopoulos et al., “SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol,” presented at the 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025, pp. 451–471. Available: https://www.usenix.org/conference/nsdi25/presentation/prasopoulos.
[4] W. Li et al., “Flow Scheduling with Imprecise Knowledge,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 95–111. Available: https://www.usenix.org/conference/nsdi24/presentation/li-wenxin.
[5] S. Rajasekaran, M. Ghobadi, and A. Akella, “CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1403–1420. Available: https://www.usenix.org/conference/nsdi24/presentation/rajasekaran.
[6] F. Zandi, M. Ghobadi, and Y. Ganjali, “FORESIGHT: Joint Time and Space Scheduling for Efficient Distributed ML Training,” in IEEE/IFIP Networking, Apr. 2025, pp. 1–9. Available: https://networking.ifip.org/2025/images/Net25_papers/1571125810.pdf.
[7] X. Liao et al., “MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training,” in Proceedings of the ACM SIGCOMM 2025 Conference, in SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, Aug. 2025, pp. 554–574. doi: 10.1145/3718958.3750465.
[8] F. De Marchi, J. Li, Y. Zhang, W. Bai, and Y. Xia, “Unlocking Superior Performance in Reconfigurable Data Center Networks with Credit-Based Transport,” in Proceedings of the ACM SIGCOMM 2025 Conference, in SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, Aug. 2025, pp. 842–860. doi: 10.1145/3718958.3754342.
[9] L. Poutievski et al., “Jupiter evolving: transforming google’s datacenter network via optical circuit switches and software-defined networking,” in Proceedings of the ACM SIGCOMM 2022 Conference, in SIGCOMM ’22. New York, NY, USA: Association for Computing Machinery, Aug. 2022, pp. 66–85. doi: 10.1145/3544216.3544265.
[10] Q. Meng et al., “Astral: A Datacenter Infrastructure for Large Language Model Training at Scale,” in Proceedings of the ACM SIGCOMM 2025 Conference, in SIGCOMM ’25. New York, NY, USA: Association for Computing Machinery, Aug. 2025, pp. 609–625. doi: 10.1145/3718958.3750521.
[11] Y. Zu et al., “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 761–774. Available: https://www.usenix.org/conference/nsdi24/presentation/zu
[12] Q. Hu et al., “Characterization of Large Language Model Development in the Datacenter,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 709–729. Available: https://www.usenix.org/conference/nsdi24/presentation/hu
Final projects are completed by groups of 2-3 students. For each deliverable step, a single file (PDF only) should be submitted electronically to the instructor’s e-mail address. If the project involves any simulation, data analysis, etc., all the necessary files must be uploaded to a web page, and the link should be sent in the same e-mail as the PDF file. In all deliverables, please make sure that all pages (and especially the figures) display properly in Acrobat Reader and that the document prints properly.
You can choose any project related to computer networks for machine learning. You can use the list below as a starting point in your explorations, but you are certainly not required to limit yourself to this list. The project can aim to replicate and improve an existing solution, or to provide a survey of existing solutions (creating new insights by looking at the problem and its solutions from different perspectives). You are also welcome to choose a problem that applies your own background and ideas to the topics covered in this course. Regardless of how you choose your topic, you are strongly encouraged to discuss it with the instructor before submitting the proposal.
1) Optical Interconnects and Topology Co-Design
Traditional copper-based electrical links are hitting a “networking wall” due to power density and reach limitations. This project explores the shift toward reconfigurable optical fabrics.
Topic Ideas: Design a photonic collective communication library (PCCL) that reconfigures the network topology in real-time to match specific ML collective patterns (e.g., AllReduce, All-to-All).
Key Challenge: How can we minimize the “reconfiguration latency” of optical switches so it doesn’t stall the GPU computation?
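As a starting point for reasoning about this trade-off, the sketch below (all function names and parameter values are illustrative, not from any real PCCL) derives the circuit configurations an optical switch would need for two common collectives, and estimates what fraction of time is lost to reconfiguration if GPUs idle while circuits are being set up:

```python
# Hypothetical sketch: circuit configurations for ML collectives on an
# optical circuit switch, plus a back-of-envelope stall estimate.

def ring_allreduce_circuits(n_gpus):
    """A ring AllReduce only ever uses the links i -> (i+1) % n, so one
    optical configuration serves all 2*(n-1) steps: zero reconfigurations."""
    return {i: (i + 1) % n_gpus for i in range(n_gpus)}

def all_to_all_circuits(n_gpus):
    """All-to-All needs n-1 distinct permutations (one per round);
    round r connects i -> (i + r) % n, forcing a reconfiguration per round."""
    return [{i: (i + r) % n_gpus for i in range(n_gpus)}
            for r in range(1, n_gpus)]

def stall_fraction(n_reconfigs, reconfig_us, step_compute_us, n_steps):
    """Fraction of total time lost if GPUs idle during every reconfiguration."""
    stall = n_reconfigs * reconfig_us
    return stall / (stall + n_steps * step_compute_us)
```

For example, with 3 reconfigurations of 10 µs each against 3 rounds of 100 µs of useful work, `stall_fraction(3, 10.0, 100.0, 3)` comes out to about 9%, which is why ring-friendly topologies (one static configuration) are so attractive for optical fabrics.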
2) Networking for Large Language Model (LLM) Inference
Unlike training, LLM inference is often memory-bandwidth bound and involves two distinct phases: Prefill (processing the prompt) and Decode (generating tokens one-by-one).
Topic Ideas: Investigate Network-Aware Request Scheduling for “Split-wise” inference, where prefill and decode tasks are handled by different clusters.
Key Challenge: Design a transport protocol that optimizes for the tiny, frequent packets of the Decode phase while maintaining high throughput for the Prefill phase.
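One very simple baseline for the scheduling side of this project (all names and data shapes below are made up for illustration) is phase-aware, least-loaded routing: prefill requests go to a throughput-optimized pool and decode requests to a latency-optimized pool, with the least-backlogged server chosen within each pool:

```python
# Hypothetical sketch of network-aware request routing for split-wise
# inference. A real scheduler would also consider KV-cache transfer cost
# between the prefill and decode clusters.

def route_request(request, prefill_pool, decode_pool):
    """request: {"phase": "prefill" | "decode", "tokens": int}
    each pool: list of {"name": str, "queue_tokens": int} server records."""
    pool = prefill_pool if request["phase"] == "prefill" else decode_pool
    target = min(pool, key=lambda s: s["queue_tokens"])  # least-loaded server
    target["queue_tokens"] += request["tokens"]          # book the new work
    return target["name"]
```

A project could start from this baseline and add the network-aware part: weighting the choice by the cost of shipping the prompt's KV cache from the prefill cluster to the chosen decode server.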
3) CXL and Memory Disaggregation for ML
The “memory wall” is a major bottleneck for models that don’t fit on a single GPU’s HBM. Compute Express Link (CXL) allows for memory pooling across the network.
Topic Ideas: Evaluate the performance impact of CXL-based memory expansion on “Mixture of Experts” (MoE) models, where only a fraction of weights are active at any time.
Key Challenge: Develop a tiering strategy to decide which model weights stay in local GPU memory and which are fetched over the CXL fabric.
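A first-cut tiering policy (purely illustrative; expert names, sizes, and routing counts are invented) is a greedy knapsack on routing frequency: the most frequently routed-to experts stay in HBM until the budget is spent, and the rest are served over the CXL fabric:

```python
# Hypothetical greedy tiering sketch for MoE expert weights.

def tier_experts(access_counts, size_gb, hbm_budget_gb):
    """access_counts: {expert: times routed to}; size_gb: {expert: weight size}.
    Returns (hbm_resident, cxl_resident) expert lists."""
    hot_first = sorted(access_counts, key=access_counts.get, reverse=True)
    hbm, cxl, used = [], [], 0.0
    for expert in hot_first:
        if used + size_gb[expert] <= hbm_budget_gb:
            hbm.append(expert)        # pinned in local GPU HBM
            used += size_gb[expert]
        else:
            cxl.append(expert)        # fetched over the CXL fabric on demand
    return hbm, cxl
```

The interesting research questions start where this sketch stops: routing distributions drift during training, so a real policy needs online re-tiering and a model of CXL fetch latency versus recomputation.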
4) Multi-Tenant Isolation and “Noisy Neighbor” Mitigation
AI superclusters are increasingly shared across different teams or customers. A single “elephant flow” from one training job can devastate the latency of a real-time inference job.
Topic Ideas: Implement Fabric-Scheduled Ethernet or advanced congestion control (like a refined HPCC or Swift) specifically for multi-tenant ML workloads.
Key Challenge: How do we achieve strict performance isolation without underutilizing the expensive network fabric?
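A classic mechanism worth knowing as a baseline here is deficit round-robin (DRR): each tenant earns a byte quantum per round, so an elephant flow cannot starve a latency-sensitive neighbor, yet idle tenants do not waste fabric capacity. The sketch below (tenant names, packet sizes, and quanta are illustrative) models one port:

```python
# Deficit round-robin sketch for per-tenant isolation on a shared port.
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """queues: {tenant: deque of packet sizes in bytes};
    quanta: {tenant: bytes of credit earned per round}.
    Returns the transmission order as (tenant, size) pairs."""
    deficit = {t: 0 for t in queues}
    sent = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0   # no backlog: credit is not banked
                continue
            deficit[tenant] += quanta[tenant]
            while q and q[0] <= deficit[tenant]:
                size = q.popleft()
                deficit[tenant] -= size
                sent.append((tenant, size))
    return sent
```

A project could compare this kind of strict scheduling against end-host congestion control (HPCC, Swift) on realistic mixed training/inference traces, and quantify the isolation-versus-utilization trade-off in the key challenge above.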
5) In-Network Aggregation and Collective Optimization
This builds on the In-Network Computing (INC) topic by focusing on the Ultra Ethernet Consortium (UEC) standards and modern programmable switches.
Topic Ideas: Implement a “Floating Point” aware in-network aggregator that can perform gradient summation directly on the switch hardware (e.g., using P4) to reduce data traffic by 50%.
Key Challenge: Handling packet loss in a way that doesn’t require a full re-transmission of the aggregated gradient.
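One standard idea for the loss-handling challenge (in the spirit of SwitchML-style designs; this toy model is not any particular system's implementation) is to make aggregation idempotent: the switch keeps a per-slot bitmap of which workers have contributed, so a retransmitted packet is silently dropped rather than summed twice, and only the worker whose packet was actually lost needs to retransmit:

```python
# Toy model of a loss-tolerant in-network aggregation slot.

class SwitchSlot:
    def __init__(self, n_workers):
        self.total = 0.0
        self.seen = set()          # bitmap of workers that contributed
        self.n_workers = n_workers

    def receive(self, worker_id, value):
        """Returns the aggregate once every worker has contributed, else None."""
        if worker_id in self.seen:
            return None            # duplicate (retransmit): idempotent drop
        self.seen.add(worker_id)
        self.total += value
        if len(self.seen) == self.n_workers:
            return self.total
        return None
```

On real switch ASICs the floating-point part of the challenge appears too: many pipelines only support integer arithmetic, so gradients are typically scaled and quantized before summation.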
6) Microburst Detection and Buffer Management
Machine learning traffic is famously “bursty” due to the Barrier-Synchronous Parallel (BSP) nature of training.
Topic Ideas: Use ML-on-the-Switch to predict upcoming microbursts based on the communication phase of the training job.
Key Challenge: Creating a detection model small enough to run at line rate (nanosecond scale) on a network ASIC.
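Before reaching for an ML model, it helps to have a baseline that trivially fits a line-rate budget; a single exponentially weighted moving average (EWMA) of queue depth with a threshold is about as cheap as detection gets. The sketch below is illustrative (alpha and threshold would be tuned per port and workload):

```python
# EWMA baseline for microburst detection on queue-depth samples.

def ewma_alarms(depth_samples, alpha=0.5, threshold=70.0):
    """Flag samples where the smoothed queue depth crosses the threshold."""
    avg, alarms = 0.0, []
    for depth in depth_samples:
        avg = alpha * depth + (1.0 - alpha) * avg
        alarms.append(avg > threshold)
    return alarms
```

A project in this space would then ask what an ML-on-the-switch model can add over this baseline, e.g., using the BSP phase of the training job to raise the alarm *before* the queue fills rather than after.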