Training machine learning foundation models, some with billions of parameters, demands serious computing power. For example, the largest version of GPT-3, OpenAI’s famous large language model, has 175 billion parameters and needs truly powerful hardware. The model was trained on an AI supercomputer that Microsoft developed specifically for OpenAI, with more than 285,000 CPU cores, 10,000 GPUs, and 400 Gb/s InfiniBand networking.
These bespoke high-performance computing (HPC) systems are expensive and often out of reach for those outside a datacenter or research facility. Researchers at IBM and PyTorch are looking to change that.
IBM announced it has been collaborating with a distributed team within PyTorch, the open-source ML platform run by the Linux Foundation, to enable training large AI models on affordable networking hardware such as Ethernet. Additionally, the company has built an open-source operator for optimizing PyTorch deployments on Red Hat OpenShift on IBM Cloud.
Using PyTorch’s Fully Sharded Data Parallel (FSDP), an API for data-parallel training that shards model state across GPUs, the team successfully trained models with 11 billion parameters across a multi-node, multi-GPU cluster using standard Ethernet networking on IBM Cloud. IBM says this method of training models with 12 billion or fewer parameters is 90% more efficient than pricey HPC networking systems.
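To make the appeal of sharding concrete, here is a rough back-of-the-envelope sketch of how much memory the training state of an 11-billion-parameter model would need with and without sharding. The 16-bytes-per-parameter figure is an assumption (a common rule of thumb for mixed-precision training with the Adam optimizer), not a number from IBM, and the estimate ignores activations and communication buffers. The function and constant names are purely illustrative.

```python
# Rough memory estimate for sharded data-parallel training.
# Assumption (not from IBM): mixed-precision Adam needs about 16 bytes
# per parameter: 2 (fp16 weights) + 2 (fp16 gradients)
# + 12 (fp32 master weights, momentum, and variance).
BYTES_PER_PARAM = 2 + 2 + 12


def training_state_gb(num_params: float, num_gpus: int = 1) -> float:
    """Approximate per-GPU memory in GB for weights, gradients, and
    optimizer state when that state is split evenly across num_gpus.
    Activations and temporary buffers are deliberately ignored."""
    total_bytes = num_params * BYTES_PER_PARAM
    return total_bytes / num_gpus / 1e9


PARAMS = 11e9          # 11 billion parameters, as in the article
GPU_MEMORY_GB = 80     # Nvidia A100 80GB

unsharded = training_state_gb(PARAMS)        # every GPU holds everything
sharded_node = training_state_gb(PARAMS, 8)  # split across one 8-GPU node

print(f"Unsharded state per GPU: {unsharded:.0f} GB")
print(f"Sharded across 8 GPUs:   {sharded_node:.0f} GB per GPU")
print(f"Fits on one GPU without sharding? {unsharded <= GPU_MEMORY_GB}")
```

Under these assumptions, the full training state does not fit on a single 80GB card, while sharding it across the GPUs of even one node brings the per-GPU share well under that limit. The price is extra communication to gather and scatter the shards, which is why network bandwidth matters so much for this style of training.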
“Our approach achieves on-par efficiency training models of this size as HPC networking systems, making HPC networking infrastructure virtually obsolete for small and medium-scale AI models,” said Mike Murphy, a research writer for IBM, in a company blog post.
Murphy describes the infrastructure used for this work as “essentially off-the-shelf hardware” that runs on the IBM Cloud and consists of 200 nodes, each with eight Nvidia A100 80GB GPUs, 96 vCPUs, and 1.2TB of CPU RAM. The GPUs within a single node are connected via NVLink with a card-to-card bandwidth of 600 GB/s. Nodes are connected by two 100 Gb/s Ethernet links with an SR-IOV-based TCP/IP stack, which Murphy says provides a usable bandwidth of 120 Gb/s (though he notes that for the 11B model, researchers observed peak network bandwidth utilization of 32 Gb/s).
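The figures Murphy cites can be put side by side with a little arithmetic. The sketch below uses only the numbers quoted above. It reads the NVLink figure as gigabytes per second and the Ethernet figures as gigabits per second, which is an assumption based on how those interconnects are usually specified. The variable names are illustrative.

```python
# Cluster figures as quoted in the article.
NODES = 200
GPUS_PER_NODE = 8
LINKS_PER_NODE = 2
LINK_GBITS = 100            # Gb/s per Ethernet link
USABLE_GBITS = 120          # Gb/s usable per node, per Murphy
PEAK_OBSERVED_GBITS = 32    # Gb/s peak observed for the 11B model
NVLINK_GBYTES = 600         # GB/s card to card (assumed gigabytes)

total_gpus = NODES * GPUS_PER_NODE
raw_gbits = LINKS_PER_NODE * LINK_GBITS

# Share of the raw Ethernet capacity that is actually usable.
usable_fraction = USABLE_GBITS / raw_gbits

# Share of the usable capacity the 11B training run reached at peak.
peak_fraction = PEAK_OBSERVED_GBITS / USABLE_GBITS

# Convert NVLink to gigabits (8 bits per byte) to compare with Ethernet.
nvlink_gbits = NVLINK_GBYTES * 8
nvlink_vs_ethernet = nvlink_gbits / USABLE_GBITS

print(f"Total GPUs in the cluster: {total_gpus}")
print(f"Raw Ethernet per node: {raw_gbits} Gb/s")
print(f"Usable share of raw Ethernet: {usable_fraction:.0%}")
print(f"Peak use of usable bandwidth: {peak_fraction:.0%}")
print(f"NVLink vs. usable Ethernet: {nvlink_vs_ethernet:.0f}x")
```

Read this way, the 11B run peaked at roughly a quarter of the usable inter-node bandwidth, which is consistent with IBM's argument that Ethernet is not the bottleneck for models of this size.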
This GPU system, configured with OpenShift, has been running since May. Currently, the research team is building a production-ready software stack for end-to-end training, tuning, and inference of large AI models.
Though this research was conducted with an 11 billion parameter model instead of a model of GPT-3’s size, IBM hopes to scale this technology for larger models.
“We believe this approach is the first in the industry to achieve scaling efficiencies for models with up to 11 billion parameters that use Kubernetes and PyTorch’s FSDP APIs with standard Ethernet,” said Murphy. “This will allow researchers and organizations to train massive models in any cloud in a far more cost-efficient and sustainable way. In 2023, the goal of the joint team is to continue scaling this technology to handle even larger models.”