Research on Hybrid Distributed Training of the LongT5 Model Based on Data Parallelism
Journal: Advances in Computer and Autonomous Intelligence Research
DOI: 10.12238/acair.v3i2.13514
Abstract
To improve the training efficiency and resource utilization of the LongT5 model on very long text tasks, this work studies how system architecture, parallelization strategy, and communication mechanisms affect training performance. The results show that the proposed scheme outperforms conventional parallel approaches in throughput, GPU memory footprint, and scalability, and offers good engineering adaptability and potential for further extension.
Keywords
LongT5 model; hybrid distributed training; data parallelism
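As an illustration of the data-parallel dimension of the hybrid scheme described in the abstract, the sketch below wraps LongT5 in PyTorch DistributedDataParallel so that each GPU holds a model replica, processes a disjoint data shard, and all-reduces gradients every step. It assumes PyTorch and the Hugging Face transformers implementation of LongT5; the checkpoint name, batch size, learning rate, and dataset interface are illustrative assumptions rather than the configuration used in the paper.

# Minimal data-parallel training sketch for LongT5 (illustrative assumptions only).
# One process per GPU, launched e.g. with: torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from transformers import LongT5ForConditionalGeneration

def train(dataset, local_rank, epochs=1):
    # NCCL backend for GPU-to-GPU gradient all-reduce.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Checkpoint name is an assumed example, not the paper's model configuration.
    model = LongT5ForConditionalGeneration.from_pretrained(
        "google/long-t5-tglobal-base"
    ).cuda(local_rank)
    # DDP replicates the model on every rank and synchronizes gradients each step:
    # this is the data-parallel component of the hybrid scheme.
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler shards the dataset so each rank sees a disjoint slice.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for batch in loader:
            # Assumes the dataset yields dicts of tensors
            # (input_ids, attention_mask, labels).
            batch = {k: v.cuda(local_rank) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()   # DDP all-reduces gradients here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

In a full hybrid setup, this data-parallel layer would be combined with memory- and model-partitioning techniques such as those surveyed in the references below (e.g. ZeRO-style optimizer sharding or tensor/pipeline parallelism), which is beyond the scope of this sketch.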
References
[1] Guo M, Ainslie J, Uthus D, et al. LongT5: Efficient Text-to-Text Transformer for Long Sequences[J]. arXiv preprint arXiv:2112.07916, 2021.
[2] Shoeybi M, Patwary M, Puri R, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism[J]. arXiv preprint arXiv:1909.08053, 2019.
[3] Rajbhandari S, Rasley J, Ruwase O, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models[J]. arXiv preprint arXiv:1910.02054, 2019.
[4] Narayanan D, Harlap A, Phanishayee A, et al. PipeDream: Generalized Pipeline Parallelism for DNN Training[C]. SOSP, 2019.
[5] Wang B, Xu Q, Bian Z, et al. Tesseract: Parallelize the Tensor Parallelism Efficiently[J]. arXiv preprint arXiv:2105.14500, 2021.
[6] Tang Z, Shi S, Wang W, et al. Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey[J]. arXiv preprint arXiv:2003.06307, 2023.
Copyright © 2025 谢文戈

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License