
Meta unveils cutting-edge AI hardware at OCP


Meta showcased its latest open AI hardware designs at the Open Compute Project (OCP) Global Summit 2024. These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. The goal is to foster collaboration and drive innovation in AI infrastructure.

AI has been integral to the experiences Meta offers to people and businesses. As Meta develops and releases advanced AI models, it continuously enhances its infrastructure to support these new and emerging AI workloads. Llama 3.1 405B, Meta’s largest model, is a notable example.

This dense transformer has 405 billion parameters and can handle a context window of up to 128,000 tokens. Training this model required substantial optimizations across Meta’s entire training stack, utilizing more than 16,000 NVIDIA H100 GPUs. Throughout 2023, Meta rapidly scaled its training clusters from 1,000 to 16,000 GPUs to support AI workloads.
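The scale of these figures is easier to appreciate with a rough back-of-the-envelope calculation. As a sketch (assuming bf16 weights at 2 bytes per parameter and the 80 GB of HBM on a standard H100, both public specs rather than details from this article), the model's weights alone exceed any single accelerator's memory:

```python
# Back-of-the-envelope memory estimate for a 405B-parameter model.
# Assumes bf16 (2 bytes per parameter) for the weights; optimizer state
# and activations multiply this further during training.

params = 405e9          # Llama 3.1 405B parameter count
bytes_per_param = 2     # bf16
weight_gb = params * bytes_per_param / 1e9
print(f"Weights alone: {weight_gb:.0f} GB")  # 810 GB

h100_hbm_gb = 80        # HBM capacity of one NVIDIA H100
min_gpus_for_weights = weight_gb / h100_hbm_gb
print(f"H100s just to hold the weights: {min_gpus_for_weights:.1f}")
```

Even before accounting for gradients, optimizer state, and activations, the weights must be sharded across many GPUs, which is part of why training used clusters in the tens of thousands of GPUs.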

Currently, Meta trains models on two 24,000-GPU clusters and anticipates a continued increase in compute needs for AI training. Building efficient AI clusters requires more than just GPUs; networking and bandwidth are crucial to performance.
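To illustrate why bandwidth matters at this scale, here is a hypothetical sketch of the per-GPU traffic for one gradient synchronization using a ring all-reduce, a common collective pattern. The bf16 gradient size and the 2·(N−1)/N traffic factor are textbook assumptions, not figures from the article; real training shards the model and overlaps communication with compute:

```python
# Rough sketch: per-GPU traffic for a single ring all-reduce over the
# gradients of a 405B-parameter model in bf16, across 16,000 GPUs.
# Illustrative only; actual systems use sharding and overlap.

params = 405e9
bytes_per_param = 2                     # bf16 gradients
n_gpus = 16_000
grad_gb = params * bytes_per_param / 1e9

# A ring all-reduce moves ~2 * (N - 1) / N times the buffer size per GPU.
per_gpu_traffic_gb = 2 * (n_gpus - 1) / n_gpus * grad_gb
print(f"~{per_gpu_traffic_gb:.0f} GB moved per GPU per synchronization")
```

Moving on the order of a terabyte per GPU per synchronization step is why the network fabric, not just the accelerators, bounds cluster performance.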

Meta unveils AI hardware advancements

Meta has introduced Catalina, a high-powered rack designed for AI workloads with a focus on modularity and flexibility. Catalina supports the latest NVIDIA GB200 Grace Blackwell Superchip and can deliver up to 140 kW of power. Its liquid-cooled, modular design lets operators customize the rack for specific AI workloads while adhering to industry standards.

Meta has also expanded the Grand Teton platform to support AMD Instinct MI300X accelerators. This platform supports a range of accelerator designs and offers significant compute capacity, memory, and network bandwidth, enabling efficient scaling of training clusters. Meta is developing open, vendor-agnostic networking backends to enhance AI cluster performance.


The new Disaggregated Scheduled Fabric (DSF) offers several advantages over existing switches, overcoming their limitations in scale, component supply options, and power density. Meta's collaboration with Microsoft has been pivotal in advancing open innovation. Their joint projects, such as the Switch Abstraction Interface (SAI) and the Open Accelerator Module (OAM) standard, have contributed significantly to the OCP community.

Meta is committed to open-source AI, believing that it will democratize the benefits and opportunities of AI. Open software frameworks and standardized models are essential for driving innovation, ensuring portability, and promoting transparency in AI development. Open AI hardware systems are crucial for delivering high-performance, cost-effective, and adaptable AI infrastructure.

Cameron is a highly regarded contributor in the rapidly evolving fields of artificial intelligence (AI) and machine learning. His articles delve into the theoretical underpinnings of AI, the practical applications of machine learning across industries, ethical considerations of autonomous systems, and the societal impacts of these disruptive technologies.

About Our Editorial Process

At DevX, we’re dedicated to tech entrepreneurship. Our team closely follows industry shifts, new products, AI breakthroughs, technology trends, and funding announcements. Articles undergo thorough editing to ensure accuracy and clarity, reflecting DevX’s style and supporting entrepreneurs in the tech sphere.
