This is a tentative course schedule. Handouts and slides will be added to this page as the course progresses. Make sure to refresh the page to see the latest handouts.
# | Date | Topic | Reading | Handouts | Assignments |
---|---|---|---|---|---|
1 | Jan 10 | Course Logistics and Introduction | - | H01 - Info sheet [pdf] H02 - Lecture 1 [pdf][pptx] | - |
2 | Jan 17 | Data Center Networks, Network Programming | - | H03 - Lecture 2 [pdf][pptx] | - |
3 | Jan 24 | Networks and ML | - | H04 - Lecture 3 [pdf][pptx] | - |
4 | Jan 31 | Network Transport and Multi-Path | [1], [2], [3] | - | - |
5 | Feb 7 | High-Performance Transport | [4], [5], [6] | - | - |
6 | Feb 14 | Job and Flow Scheduling | [7], [8], [9] | - | Project proposal due on Feb 14 |
7 | Feb 21 | - | - | - | - |
8 | Feb 28 | Reconfigurable Data Center Networks | [10], [11], [12] | - | - |
9 | Mar 7 | Scaling the Solutions | [13], [14] | - | - |
10 | Mar 14 | - | [15], [16], [17] | - | Intermediate report due on Mar 14 |
11 | Mar 21 | - | - | - | - |
12 | Mar 28 | Final project presentations | - | - | - |
13 | Apr 4 | Final project presentations | - | - | Final report due on Apr 8 |
This MSc/PhD-level course delves into the core challenges of interconnection networks, emphasizing the use of machine learning to address these issues. The rapid growth of computing demands, driven by machine learning applications, has introduced significant challenges in areas such as bandwidth, latency, and packet loss. Meeting these demands requires innovative techniques and a fresh approach to traditional networking solutions across various layers, including the link, transport, and application layers.
The course begins with a review of key concepts in computer networking, such as packet-switching systems, data center networks, and software-defined networking. It then explores advanced research challenges and cutting-edge solutions in the field; the specific topics are listed in the course schedule and reading list.
A previous course on computer networks (CSC2209H or equivalent) is highly recommended. Basic undergraduate courses in algorithms and probability theory are recommended.
You have a free late submission of two days. You can use these two days on any one of the deliverables (proposal, intermediate report, or final report). You need to notify the TA before using your free late submission.
In addition to the free late submission, you can submit each assignment late by up to two days. For each late day beyond the free late submission, 10% of the mark will be deducted (up to 20%). Assignments and project reports will not be accepted more than two days late.
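To make the arithmetic concrete, here is a minimal sketch (purely illustrative, not an official grading tool) of how the deduction above could be computed. The function name and parameters are hypothetical, and the way the free late days interact with the penalty is an assumption based on the policy as worded:

```python
# Illustrative sketch only: interprets the late policy as a 10% deduction per
# late day beyond any free late days used, capped at two penalized days.

def late_penalty_multiplier(days_late: int, free_days_used: int = 0) -> float:
    """Return the fraction of the mark retained for a late submission.

    days_late      -- total days past the deadline
    free_days_used -- free late days applied to this deliverable (0, 1, or 2)
    """
    penalized_days = max(0, days_late - free_days_used)
    if penalized_days > 2:
        raise ValueError("Not accepted: more than two penalized late days")
    return 1.0 - 0.10 * penalized_days

# Example: 3 days late with the 2-day free extension applied
# -> 1 penalized day, so 90% of the mark is retained.
print(late_penalty_multiplier(days_late=3, free_days_used=2))  # 0.9
```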
Please use our class bulletin board (on Piazza) to ask questions or discuss any course-related topics. You can sign up to the bulletin board here:
https://piazza.com/utoronto.ca/winter2025/csc2229
By using the bulletin board, everyone in class can read the replies, and the overall number of repeat questions is reduced. Please check the bulletin board before posting any new questions. If you have any questions that cannot be posted on the bulletin board (e.g. questions about your grades), you can e-mail the TA or the course instructor directly.
Please make sure to check the announcements folder regularly for updates regarding lectures, assignments, etc., or enable notifications on Piazza.
In addition to our bulletin board, we have a mailing list that will be used exclusively for sharing important information. We will use the email address you have used on ACORN to create this list (please make sure that is a valid email address). Please do not use this email to ask questions.
Students will present papers from the “Reading List” throughout the term. Each presentation is expected to be 20 minutes followed by 10 minutes of Q&A and discussions.
The list of papers we will read in this course will be added here.
[1] Z. Wang et al., “SRNIC: A Scalable Architecture for RDMA NICs,” presented at the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 1–14. Available: https://www.usenix.org/conference/nsdi23/presentation/wang-zilong.
[2] Y. Lu et al., “Multi-Path Transport for RDMA in Datacenters,” presented at the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 357–371. Available: https://www.usenix.org/conference/nsdi18/presentation/lu.
[3] A. Gangidi et al., “RDMA over Ethernet for Distributed Training at Meta Scale,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 57–70. doi: 10.1145/3651890.3672233. Available: https://dl.acm.org/doi/10.1145/3651890.3672233.
[4] S. Agarwal, Q. Cai, R. Agarwal, D. Shmoys, and A. Vahdat, “Harmony: a congestion-free datacenter architecture,” in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), Santa Clara, CA: USENIX Association, Apr. 2024, pp. 329–343. Available: https://www.usenix.org/conference/nsdi24/presentation/agarwal-saksham.
[5] S. Agarwal, A. Krishnamurthy, and R. Agarwal, “Host Congestion Control,” in Proceedings of the ACM SIGCOMM 2023 Conference, in ACM SIGCOMM ’23. New York, NY, USA: Association for Computing Machinery, Sep. 2023, pp. 275–287. doi: 10.1145/3603269.3604878. Available: https://dl.acm.org/doi/10.1145/3603269.3604878.
[6] H. Wang et al., “Towards Domain-Specific Network Transport for Distributed DNN Training,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1421–1443. Available: https://www.usenix.org/conference/nsdi24/presentation/wang-hao.
[7] S. Rajasekaran, M. Ghobadi, and A. Akella, “CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 1403–1420. Available: https://www.usenix.org/conference/nsdi24/presentation/rajasekaran.
[8] J. Cao et al., “Crux: GPU-Efficient Communication Scheduling for Deep Learning Training,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 1–15. doi: 10.1145/3651890.3672239. Available: https://dl.acm.org/doi/10.1145/3651890.3672239.
[9] X. Liu et al., “Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 16–37. doi: 10.1145/3651890.3672249. Available: https://dl.acm.org/doi/10.1145/3651890.3672249.
[10] W. M. Mellette, A. Forencich, R. Athapathu, A. C. Snoeren, G. Papen, and G. Porter, “Realizing RotorNet: Toward Practical Microsecond Scale Optical Networking,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 392–414. doi: 10.1145/3651890.3672273. Available: https://dl.acm.org/doi/10.1145/3651890.3672273.
[11] C. Liang et al., “NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 415–432. doi: 10.1145/3651890.3672222. Available: https://dl.acm.org/doi/10.1145/3651890.3672222.
[12] D. Amir, N. Saran, T. Wilson, R. Kleinberg, V. Shrivastav, and H. Weatherspoon, “Shale: A Practical, Scalable Oblivious Reconfigurable Network,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 449–464. doi: 10.1145/3651890.3672248. Available: https://dl.acm.org/doi/10.1145/3651890.3672248.
[13] Z. Jiang et al., “MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 745–760. Available: https://www.usenix.org/conference/nsdi24/presentation/jiang-ziheng.
[14] Q. Yang et al., “DeepQueueNet: towards scalable and generalized network performance estimation with packet-level visibility,” in Proceedings of the ACM SIGCOMM 2022 Conference, in SIGCOMM ’22. New York, NY, USA: Association for Computing Machinery, Aug. 2022, pp. 441–457. doi: 10.1145/3544216.3544248. Available: https://dl.acm.org/doi/10.1145/3544216.3544248.
[15] H. Lim, J. Ye, S. Abdu Jyothi, and D. Han, “Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs,” in Proceedings of the ACM SIGCOMM 2024 Conference, in ACM SIGCOMM ’24. New York, NY, USA: Association for Computing Machinery, Aug. 2024, pp. 707–720. doi: 10.1145/3651890.3672228. Available: https://dl.acm.org/doi/10.1145/3651890.3672228.
[16] Q. Hu et al., “Characterization of Large Language Model Development in the Datacenter,” presented at the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 709–729. Available: https://www.usenix.org/conference/nsdi24/presentation/hu.
[17] M. A. Qureshi et al., “PLB: congestion signals are simple and effective for network load balancing,” in Proceedings of the ACM SIGCOMM 2022 Conference, in SIGCOMM ’22. New York, NY, USA: Association for Computing Machinery, Aug. 2022, pp. 207–218. doi: 10.1145/3544216.3544226. Available: https://doi.org/10.1145/3544216.3544226.
Final projects are completed by groups of 2-3 students. For each deliverable step, a single file (pdf only) should be submitted electronically to the instructor’s e-mail address. If the project involves any simulation, data analysis, …, all the necessary files must be uploaded to a web page, and the link should be sent in the same e-mail as the pdf file. In all deliverables, please make sure that all the pages (and especially the figures) can be viewed properly in Acrobat Reader and that the document prints properly.
Giving an effective technical presentation requires a lot of preparation and practice. For example, in preparation for a 15-minute conference talk, it’s quite common for a researcher to spend several days preparing and to give several practice talks along the way. Here are some guidelines to follow while preparing your presentation.
The most common mistake in giving a talk is to focus too much attention on the preparation of your slides. Remember that the talk is what comes out of the speaker’s mouth, not the slides. Resist the temptation to spend all your preparation time working on pretty PowerPoint slides. Instead, prepare the outline and script of your talk first, and only then think about how slides might help to illustrate some of the key points you are trying to make. A good rule of thumb is to spend 75% of the preparation time on your outline and script, and 25% of the time preparing slides. Remember that the world’s best orators give great talks without slides!
Before you start preparing the outline and details, write down a brief abstract. It’s worth spending some time on it. (For what it’s worth, even though I’ve given hundreds of talks and lectures, I still do this before every talk and lecture I prepare).
Try to write, in the most concise and clear way, a brief summary of the talk. What problem are you solving, and why is it interesting? What is the context in which you are doing the work? What is the essence of your work, and your results? What is the “ah hah!” factor: what do you want your audience to take away from the talk? If you can’t write a 1-paragraph summary, it tells you that you don’t have a clear idea of what the talk is about. Once you have a good paragraph written, you can be sure that the talk will be a lot easier to prepare.
Prepare an outline - perhaps in the form of 30-40 bullets - that shows the flow of the whole talk. This will help identify missing or redundant sections, and will help balance the amount of information you include in each part of the talk. It is often tempting to spend too much time on one unimportant detail, leaving too little time for the important stuff.
Yes - I mean it. Write down, word for word, your whole talk. You’d be amazed at how many people do this — even skilled speakers who give talks often. If the President can give speeches from a teleprompter, it tells you something about what makes a good talk. In many fields of the humanities, researchers read all their talks from a script.
The trick is to script the whole talk, read it aloud many times and then learn it. Then the day before your talk, throw away your script. You’ll remember the key sentences and points, and by not reading it you’ll make it sound more natural. Having a script will make your talk get off on the right foot, and help overcome nervousness. Perhaps most importantly, a script will help you avoid missing out some important points, and will help you make best use of the small amount of time you have.
Think about what slides you need to illustrate your script. Your slides do not have to paint a complete picture on their own. It is a common misconception that the slides should be readable and meaningful without the speaker. This is baloney. If this were true, we wouldn’t need the speaker! Think of them as illustrative tools, to help explain key ideas. It is not a good idea to use them to jog your memory; that is what your own notes are for.
Practice the talk with the script several times.
Throw away the script.
Once you have practiced the scripted talk a few times, throw your script away. You will remember the key phrases; and in the moment, you will link it together more naturally than you would by reading it.
If all this seems like a lot of work, it is. Giving a good talk involves many hours of preparation.