CSC2229 – Computer Networks for Machine Learning – Winter 2025

Schedule

This is a tentative course schedule. Handouts and slides will be added to this page as the course progresses. Make sure to refresh the page to see the latest handouts.

#  | Date   | Topic                                      | Reading          | Handouts                                            | Assignments
---|--------|--------------------------------------------|------------------|-----------------------------------------------------|----------------------------------
1  | Jan 10 | Course Logistics and Introduction          | -                | H01 - Info sheet [pdf]; H02 - Lecture 1 [pdf][pptx] | -
2  | Jan 17 | Data Center Networks, Network Programming  | -                | H03 - Lecture 2 [pdf][pptx]                         | -
3  | Jan 24 | Networks and ML                            | -                | H04 - Lecture 3 [pdf][pptx]                         | -
4  | Jan 31 | Network Transport and Multi-Path           | [1], [2], [3]    | -                                                   | -
5  | Feb 7  | High-Performance Transport                 | [4], [5], [6]    | -                                                   | -
6  | Feb 14 | Job and Flow Scheduling                    | [7], [8], [9]    | -                                                   | Project proposal due on Feb 14
7  | Feb 21 | Reading Week (no lecture)                  | -                | -                                                   | -
8  | Feb 28 | Reconfigurable Data Center Networks        | [10], [11], [12] | -                                                   | -
9  | Mar 7  | Scaling the Solutions                      | [13], [14]       | -                                                   | -
10 | Mar 14 | TBD                                        | [15], [16], [17] | -                                                   | Intermediate report due on Mar 14
11 | Mar 21 | TBD                                        | -                | -                                                   | -
12 | Mar 28 | Final project presentations                | -                | -                                                   | -
13 | Apr 4  | Final project presentations                | -                | -                                                   | Final report due on Apr 8

Course Description

This MSc/PhD-level course delves into the core challenges of interconnection networks, emphasizing the use of machine learning to address these issues. The rapid growth of computing demands, driven by machine learning applications, has introduced significant challenges in areas such as bandwidth, latency, and packet loss. Meeting these demands requires innovative techniques and a fresh approach to traditional networking solutions across various layers, including the link, transport, and application layers.

The course begins with a review of key concepts in computer networking, such as packet-switching systems, data center networks, and software-defined networking. It then explores advanced research challenges and cutting-edge solutions in the field. Topics include:

  • Hyperscale data center networking
  • Switch and controller design
  • Reliability, monitoring, and fault tolerance
  • Network optimization techniques
  • Network-application integration and reconfigurable data center networks
  • High-performance transport in data centers, focusing on congestion control, flow control, scheduling, and prioritization

Prerequisites

A previous course on computer networks (CSC2209H or equivalent) is highly recommended. Basic undergraduate background in algorithms and probability theory is also recommended.

Grading

  • Paper presentation: 20%
  • Final project: 70%
    • Proposal: 5%
    • Intermediate report: 10%
    • Presentation: 20%
    • Final report: 35%
  • Active participation in class and discussions: 10%

Late Submission Policy

You have one free late submission of two days. You can use these two days on any one of the deliverables (proposal, intermediate report, or final report). You need to notify the TA before using your free late submission.

In addition to the free late submission, each deliverable may be submitted up to two days late. Each late day beyond the free late submission costs 10% of the mark (up to 20%). Assignments and project reports will not be accepted more than two days late.
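
For concreteness, the deduction can be sketched as a small function (a minimal illustration of the policy above; the function name and interface are ours, not part of the course materials):

    def late_penalty(days_late: int, covered_by_free_late: bool) -> float:
        """Return the fractional mark deduction for a late deliverable."""
        if days_late <= 0:
            return 0.0  # on time: no deduction
        if days_late > 2:
            # Neither the free late submission nor the penalized late days
            # extend beyond two days.
            raise ValueError("submissions more than two days late are not accepted")
        if covered_by_free_late:
            return 0.0  # the one-time free late submission covers up to two days
        return 0.10 * days_late  # 10% per late day, so at most 20%

For example, late_penalty(2, False) returns 0.20, i.e. a 20% deduction.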

Teaching Assistant

  • Farid Zandi Shafagh (faridzandi@cs.toronto.edu)

Bulletin Board

Please use our class bulletin board (on Piazza) to ask questions or discuss any course-related topics. You can sign up for the bulletin board here:

https://piazza.com/utoronto.ca/winter2025/csc2229

By using the bulletin board, everyone in the class can read the replies, which reduces the number of repeat questions. Please check the bulletin board before posting new questions. If you have a question that cannot be posted on the bulletin board (e.g., a question about your grades), you can e-mail the TA or the course instructor directly.

Please make sure to check the announcements folder regularly for updates regarding lectures, assignments, etc., or enable notifications on Piazza.

In addition to our bulletin board, we have a mailing list that will be used exclusively for sharing important information. We will use the e-mail address you have registered on ACORN to create this list (please make sure it is a valid address). Please do not use this mailing list to ask questions.

In-Class Presentations

Students will present papers from the “Reading List” throughout the term. Each presentation is expected to be 20 minutes, followed by 10 minutes of Q&A and discussion.

Reading List

The list of papers we will read in this course will be added here.

Week 4

[1] Z. Wang et al., “SRNIC: A Scalable Architecture for RDMA NICs,” presented at the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 1–14. Available: https://www.usenix.org/conference/nsdi23/presentation/wang-zilong.

[2] Y. Lu et al., “Multi-Path Transport for RDMA in Datacenters,” presented at the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 357–371. Available: https://www.usenix.org/conference/nsdi18/presentation/lu.

[3] A. Gangidi et al., “RDMA over Ethernet for Distributed Training at Meta Scale,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 57–70. doi: 10.1145/3651890.3672233. Available: https://dl.acm.org/doi/10.1145/3651890.3672233.

Week 5

[4] S. Agarwal, Q. Cai, R. Agarwal, D. Shmoys, and A. Vahdat, “Harmony: a congestion-free datacenter architecture,” in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), Santa Clara, CA: USENIX Association, Apr. 2024, pp. 329–343. Available: https://www.usenix.org/conference/nsdi24/presentation/agarwal-saksham.

[5] S. Agarwal, A. Krishnamurthy, and R. Agarwal, “Host Congestion Control,” in Proceedings of the ACM SIGCOMM 2023 Conference, in ACM SIGCOMM ’23. New York, NY, USA: Association for Computing Machinery, Sep. 2023, pp. 275–287. doi: 10.1145/3603269.3604878. Available: https://dl.acm.org/doi/10.1145/3603269.3604878.

[6] H. Wang et al., “Towards Domain-Specific Network Transport for Distributed DNN Training,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1421–1443. Available: https://www.usenix.org/conference/nsdi24/presentation/wang-hao.

Week 6

[7] S. Rajasekaran, M. Ghobadi, and A. Akella, “CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1403–1420. Available: https://www.usenix.org/conference/nsdi24/presentation/rajasekaran.

[8] J. Cao et al., “Crux: GPU-Efficient Communication Scheduling for Deep Learning Training,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 1–15. doi: 10.1145/3651890.3672239. Available: https://dl.acm.org/doi/10.1145/3651890.3672239.

[9] X. Liu et al., “Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 16–37. doi: 10.1145/3651890.3672249. Available: https://dl.acm.org/doi/10.1145/3651890.3672249.

Week 7

  • Reading Week

Week 8

[10] W. M. Mellette, A. Forencich, R. Athapathu, A. C. Snoeren, G. Papen, and G. Porter, “Realizing RotorNet: Toward Practical Microsecond Scale Optical Networking,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 392–414. doi: 10.1145/3651890.3672273. Available: https://dl.acm.org/doi/10.1145/3651890.3672273.

[11] C. Liang et al., “NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 415–432. doi: 10.1145/3651890.3672222. Available: https://dl.acm.org/doi/10.1145/3651890.3672222.

[12] D. Amir, N. Saran, T. Wilson, R. Kleinberg, V. Shrivastav, and H. Weatherspoon, “Shale: A Practical, Scalable Oblivious Reconfigurable Network,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 449–464. doi: 10.1145/3651890.3672248. Available: https://dl.acm.org/doi/10.1145/3651890.3672248.

Week 9

[13] Z. Jiang et al., “MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 745–760. Available: https://www.usenix.org/conference/nsdi24/presentation/jiang-ziheng.

[14] Q. Yang et al., “DeepQueueNet: towards scalable and generalized network performance estimation with packet-level visibility,” in Proceedings of the ACM SIGCOMM 2022 Conference, in SIGCOMM ’22. New York, NY, USA: Association for Computing Machinery, Aug. 2022, pp. 441–457. doi: 10.1145/3544216.3544248. Available: https://dl.acm.org/doi/10.1145/3544216.3544248.

Week 10

[15] H. Lim, J. Ye, S. Abdu Jyothi, and D. Han, “Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 707–720. doi: 10.1145/3651890.3672228. Available: https://dl.acm.org/doi/10.1145/3651890.3672228.

[16] Q. Hu et al., “Characterization of Large Language Model Development in the Datacenter,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 709–729. Available: https://www.usenix.org/conference/nsdi24/presentation/hu.

[17] M. A. Qureshi et al., “PLB: congestion signals are simple and effective for network load balancing,” in Proceedings of the ACM SIGCOMM 2022 Conference, in SIGCOMM ’22. New York, NY, USA: Association for Computing Machinery, Aug. 2022, pp. 207–218. doi: 10.1145/3544216.3544226. Available: https://doi.org/10.1145/3544216.3544226.

Final Project

Final projects are completed by groups of 2–3 students. For each deliverable, a single PDF file should be submitted electronically to the instructor’s e-mail address. If the project involves any simulation, data analysis, etc., all the necessary files must be uploaded to a web page, and the link should be sent in the same e-mail as the PDF file. In all deliverables, please make sure that all pages (and especially the figures) display properly in Acrobat Reader and that the document prints properly.

  • Project proposal (2 pages)
    • Due: 5pm Fri. Feb. 14.
    • Check out the list of suggested topics below.
    • For each topic, I have included a brief description. This is not meant to be a complete description of the project, and is just a starting point for you to explore. Feel free to stop by my office hours and I can give you more specific problems related to the area.
    • You are also welcome to come up with your own project topic.
    • You are highly encouraged to meet with the instructor and discuss your ideas before submitting your proposal.
    • Should include the names of the students, indicate who will work on which part, identify the project deliverables, and give a schedule/timetable for the work.
  • Intermediate report (5 pages max)
    • Due: 5pm Fri. Mar. 14.
    • The following should be the same as in the final report:
      • Table of contents
      • Introduction/background/references
      • Placeholders for the final results and conclusions
    • Should include: the table of contents of the final report; background and motivation; a problem statement; who has done what so far; and the remaining work for the rest of the semester.
  • Final presentation
    • All presentations take place in class during the last sessions of the course.
    • A 25-minute presentation of project results.
  • Final report (8 pages max)
    • Due: Noon Wed. Apr. 8.
    • Extended to Noon Wed. Apr. 15.

Final Presentation Instructions

  • Although each talk is allotted 25 minutes, the presentation itself should last only 20 minutes; the remaining 5 minutes are for questions. The 25-minute limit will be strictly enforced, without exception.
  • You can bring your own laptop for the presentation, or send me the PowerPoint file before 9am on the day of the presentation.

Presentation Guidelines (from Stanford’s EE384xy)

Giving an effective technical presentation requires a lot of preparation and practice. For example, in preparation for a 15-minute conference talk, it is quite common for a researcher to spend several days preparing and to give several practice talks along the way. Here are some guidelines to follow while preparing your presentation.

  1. In the beginning, forget the slides.

The most common mistake in giving a talk is to focus too much attention on the preparation of your slides. Remember that the talk is what comes out of the speaker’s mouth, not the slides. Resist the temptation to spend all your preparation time working on pretty PowerPoint slides. Instead, prepare the outline and script of your talk first, and only then think about how slides might help to illustrate some of the key points you are trying to make. A good rule of thumb is to spend 75% of the preparation time on your outline and script, and 25% of the time preparing slides. Remember that the world’s best orators give great talks without slides!

  2. Write a 1-paragraph abstract that summarizes your talk.

Before you start preparing the outline and details, write down a brief abstract. It’s worth spending some time on it. (For what it’s worth, even though I’ve given hundreds of talks and lectures, I still do this before every talk and lecture I prepare).

Try to write, in the most concise and clear way, a brief summary of the talk. What problem are you solving, and why is it interesting? What is the context in which you are doing the work? What is the essence of your work, and your results? What is the “ah hah!” factor: what do you want your audience to take away from the talk? If you can’t write a 1-paragraph summary, it tells you that you don’t have a clear idea of what the talk is about. Once you have a good paragraph written, you can be sure that the talk will be a lot easier to prepare.

  3. Prepare a bulleted outline of the whole talk.

Prepare an outline, perhaps in the form of 30-40 bullets, that shows the flow of the whole talk. This will help identify missing or redundant sections, and will help balance the amount of information you include in each part of the talk. It is often tempting to spend too much time on one unimportant detail, leaving too little time for the important material.

  4. Script the whole talk, and learn it.

Yes - I mean it. Write down, word for word, your whole talk. You’d be amazed at how many people do this — even skilled speakers who give talks often. If the President can give speeches from a teleprompter, it tells you something about what makes a good talk. In many fields of the humanities, researchers read all their talks from a script.

The trick is to script the whole talk, read it aloud many times and then learn it. Then the day before your talk, throw away your script. You’ll remember the key sentences and points, and by not reading it you’ll make it sound more natural. Having a script will make your talk get off on the right foot, and help overcome nervousness. Perhaps most importantly, a script will help you avoid missing out some important points, and will help you make best use of the small amount of time you have.

  5. Pick/design some slides to illustrate key ideas.

Think about what slides you need to illustrate your script. Your slides do not have to paint a complete picture on their own. It is a common misconception that the slides should be readable and meaningful without the speaker. This is baloney. If this were true, we wouldn’t need the speaker! Think of them as illustrative tools, to help explain key ideas. It is not a good idea to use them to jog your memory; that is what your own notes are for.

  6. Practice the talk with the script several times.

  7. Throw away the script.

Once you have practiced the scripted talk a few times, throw your script away. You will remember the key phrases, and in the moment you will link them together more naturally than you would by reading.

  8. Give the presentation.

If all this seems like a lot of work, it is. Giving a good talk involves many hours of preparation.