IBM Collaboration Looks to Bring Massive AI Models to Any Cloud

November 28, 2022


Training machine learning foundation models, which sometimes have billions of parameters, demands serious computing power. For example, the largest version of GPT-3, the famous large language model behind OpenAI’s DALL-E 2, has 175 billion parameters and needs truly powerful hardware. The model was trained on an AI supercomputer developed by Microsoft specifically for OpenAI that contains over 285,000 CPU cores, 10,000 GPUs, and 400Gb/s InfiniBand networking.
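A back-of-the-envelope calculation shows why models at this scale outgrow any single machine: holding 175 billion parameters in 16-bit precision already takes roughly 350GB, and common mixed-precision training recipes add optimizer state on top of that. The sketch below is illustrative only; the 16-bytes-per-parameter figure is a commonly cited approximation for Adam-style mixed-precision training, not a number from the article.

```python
# Back-of-the-envelope memory footprint for a GPT-3-scale model.
# Illustrative only; not figures from IBM, Microsoft, or OpenAI.
params = 175e9                 # 175 billion parameters

weights_fp16 = 2 * params      # 2 bytes/param for fp16 weights
# Mixed-precision Adam is commonly estimated at ~16 bytes/param:
# fp16 weights and gradients (4) plus fp32 master weights,
# momentum, and variance (12).
training_state = 16 * params

print(f"fp16 weights:   {weights_fp16 / 1e9:,.0f} GB")
print(f"training state: {training_state / 1e12:.1f} TB (before activations)")
```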

These bespoke high-performance computing (HPC) systems are expensive and often out of reach for anyone outside a major datacenter or research facility. Researchers at IBM and PyTorch are looking to change that.

IBM announced it has been collaborating with a distributed team within PyTorch, the open-source ML platform run by the Linux Foundation, to enable training large AI models on affordable networking hardware such as Ethernet. Additionally, the company has built an open-source operator for optimizing PyTorch deployments on Red Hat OpenShift on IBM Cloud.

Using PyTorch’s Fully Sharded Data Parallel (FSDP) API for data-parallel training, the team successfully trained models with 11 billion parameters across a multi-node, multi-GPU cluster using standard Ethernet networking on IBM Cloud. IBM says that for models with 12 billion or fewer parameters, this method is 90% more efficient than using pricey HPC networking systems.
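For a sense of how FSDP is used, here is a minimal sketch of wrapping a model so that parameters, gradients, and optimizer state are sharded across GPUs. The toy model, dimensions, and hyperparameters are placeholders, not the IBM team’s setup; a real run would wrap an 11-billion-parameter transformer.

```python
# Minimal FSDP training sketch (PyTorch >= 1.12). Launch with torchrun,
# which sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL also runs over plain TCP/IP
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a large transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across all
    # ranks, so per-GPU memory stays roughly flat as the model grows.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        batch = torch.randn(8, 1024, device="cuda")
        loss = model(batch).pow(2).mean()  # dummy objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A multi-node run would be launched with something like `torchrun --nnodes=25 --nproc_per_node=8 train.py` on each node (the node count here is illustrative, not IBM’s configuration).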


“Our approach achieves on-par efficiency training models of this size as HPC networking systems, making HPC networking infrastructure virtually obsolete for small and medium-scale AI models,” said Mike Murphy, a research writer at IBM, in a company blog post.

Murphy describes the infrastructure used for this work as “essentially off-the-shelf hardware” that runs on IBM Cloud and consists of 200 nodes, each with eight Nvidia A100 80GB GPUs, 96 vCPUs, and 1.2TB of CPU RAM. The GPUs within a single node are connected via NVLink with a card-to-card bandwidth of 600GB/s, and nodes are connected by two 100Gb/s Ethernet links with an SR-IOV-based TCP/IP stack, which Murphy says provides a usable bandwidth of 120Gb/s (though he notes that for the 11B model, researchers observed peak network bandwidth utilization of 32Gb/s).
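On a cluster like this, with no InfiniBand, NCCL’s collectives run over the TCP/IP sockets of the Ethernet NICs. Below is a hedged sketch of the kind of environment settings involved; the interface name `eth0` is hypothetical, and the article does not describe the exact tuning IBM used.

```python
# Steering NCCL onto a plain Ethernet interface (illustrative settings,
# set before torch.distributed initializes the NCCL process group).
import os

os.environ["NCCL_IB_DISABLE"] = "1"        # skip InfiniBand transports
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical Ethernet NIC name
os.environ["NCCL_SOCKET_NTHREADS"] = "4"   # extra socket threads can help
                                           # saturate 100Gb/s links
```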

This GPU system, configured with OpenShift, has been running since May. Currently, the research team is building a production-ready software stack for end-to-end training, tuning, and inference of large AI models.

Though this research was conducted with an 11-billion-parameter model rather than one of GPT-3’s size, IBM hopes to scale the technology to larger models.

“We believe this approach is the first in the industry to achieve scaling efficiencies for models with up to 11 billion parameters that use Kubernetes and PyTorch’s FSDP APIs with standard Ethernet,” said Murphy. “This will allow researchers and organizations to train massive models in any cloud in a far more cost-efficient and sustainable way. In 2023, the goal of the joint team is to continue scaling this technology to handle even larger models.”

Related Items:

One Model to Rule Them All: Transformer Networks Usher in AI 2.0, Forrester Says

IBM Research Open-Sources Deep Search Tools

Meta Releases AI Model That Translates Over 200 Languages


