Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization

Jan 1, 2025 · Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Others
Type
Publication
2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)