The field of large language models (LLMs) has seen significant advancements in recent years, with the introduction of Mixture-of-Experts (MoE) architectures playing a crucial role in scaling these models. However, current MoE techniques are limited in their ability to accommodate a large number of experts. In a new paper, Google DeepMind introduces Parameter Efficient Expert Retrieval (PEER), a novel architecture aimed at addressing these limitations and enabling the scaling of MoE models to millions of experts.
The Challenge of Scaling Language Models
Scaling LLMs by increasing their parameter count has been shown to improve performance and unlock new capabilities. However, as models grow in size, they run into computational and memory bottlenecks. The classic transformer architecture used in LLMs relies on attention layers and feedforward (FFW) layers. The FFW layers account for a large share of the model’s parameters and are one of the main bottlenecks in scaling transformers: their computational footprint grows in direct proportion to their size, so increasing model capacity through dense FFW layers means incurring proportionally higher compute costs.
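To make that bottleneck concrete, the short calculation below estimates the parameter count and per-token compute of a standard dense FFW block. The dimensions and the 2-FLOPs-per-parameter rule of thumb are illustrative assumptions, not figures taken from any particular model.

```python
# Back-of-the-envelope cost of a dense transformer FFW layer.
# The dimensions below are illustrative, not taken from any specific model.
d_model = 4096          # hidden size of the residual stream (assumed)
d_ff = 4 * d_model      # common 4x expansion in the FFW block (assumed)

# Two weight matrices: d_model -> d_ff and d_ff -> d_model.
params = 2 * d_model * d_ff
# A forward pass costs roughly 2 FLOPs per parameter per token.
flops_per_token = 2 * params

print(f"FFW parameters:      {params / 1e6:.0f}M")
print(f"FFW FLOPs per token: {flops_per_token / 1e9:.1f} GFLOPs")
```

Doubling the FFW width doubles both numbers, which is exactly the coupling between capacity and compute that sparse architectures try to break.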
MoE architectures address the challenges posed by FFW layers by replacing them with sparsely activated expert modules. Each expert contains a fraction of the parameters of the full dense layer and specializes in certain areas. By routing each input only to a few specialized experts, MoE models can increase their capacity without significantly increasing computational costs. The number of experts is a crucial design choice: recent studies show that high-granularity MoE models (many small experts) outperform dense models when the expert count is scaled in balance with other factors such as model size and training data.
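As a rough illustration of sparse activation, the following sketch implements a generic top-k-routed MoE feedforward layer. It is a minimal didactic example, not the routing scheme of any specific model; the class, parameter names, and the softmax-over-top-k gating are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFW(nn.Module):
    """Minimal sparsely activated MoE feedforward layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_expert: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # A simple linear router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is a small two-layer MLP, i.e. a slice of a dense FFW.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                              # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the best k experts
        gates = F.softmax(top_vals, dim=-1)                  # renormalise over the top-k
        out = torch.zeros_like(x)
        # Only the selected experts run, so per-token compute stays roughly
        # constant even as num_experts (and total parameters) grows.
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out
```

The key property is that total parameters scale with the number of experts while per-token compute scales only with the few experts that are activated.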
Google DeepMind’s PEER architecture revolutionizes MoE models by enabling scalability to millions of experts. PEER replaces fixed routers with a learned index that efficiently routes input data to a vast pool of experts. The architecture uses tiny experts with a single neuron in the hidden layer, allowing for shared hidden neurons among experts. This design enhances knowledge transfer and parameter efficiency while maintaining high performance. Moreover, PEER adopts a multi-head retrieval approach, similar to the multi-head attention mechanism in transformer models, to further optimize expert selection and activation.
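The sketch below illustrates the main ingredients described above: retrieval over a large expert pool via product keys, single-neuron experts, and multi-head retrieval. It is a simplified approximation of the idea, not DeepMind's implementation; the hyperparameters, initialization, activation choice, and names are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Simplified sketch of parameter-efficient expert retrieval:
    product-key lookup, single-neuron experts, multi-head retrieval.
    All hyperparameters here are illustrative assumptions."""

    def __init__(self, d_model=256, n_sub=128, d_key=64, top_k=16, n_heads=4):
        super().__init__()
        self.n_sub, self.top_k, self.n_heads = n_sub, top_k, n_heads
        num_experts = n_sub * n_sub                      # e.g. 128^2 = 16,384 experts here
        # Each expert is a single hidden neuron: one "down" and one "up" vector.
        self.w_down = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.w_up = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        # Product keys: two small codebooks whose Cartesian product indexes every expert.
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub, d_key // 2) * 0.02)
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub, d_key // 2) * 0.02)
        # One query projection per retrieval head.
        self.queries = nn.ModuleList([nn.Linear(d_model, d_key) for _ in range(n_heads)])

    def retrieve(self, q):
        # q: (tokens, d_key). Score each half of the query against its own codebook.
        q1, q2 = q.chunk(2, dim=-1)
        s1, i1 = (q1 @ self.sub_keys1.T).topk(self.top_k, dim=-1)   # (tokens, k)
        s2, i2 = (q2 @ self.sub_keys2.T).topk(self.top_k, dim=-1)
        # Combine the k x k candidate pairs and keep the overall top-k.
        cand = (s1[:, :, None] + s2[:, None, :]).flatten(1)         # (tokens, k*k)
        top_s, top_c = cand.topk(self.top_k, dim=-1)
        idx = i1.gather(1, top_c // self.top_k) * self.n_sub + i2.gather(1, top_c % self.top_k)
        return top_s, idx                                           # scores and expert ids

    def forward(self, x):
        # x: (tokens, d_model)
        out = torch.zeros_like(x)
        for query_proj in self.queries:
            scores, idx = self.retrieve(query_proj(x))              # (tokens, k)
            gates = F.softmax(scores, dim=-1)
            # Gather the selected experts' vectors and apply each single-neuron expert:
            # activation(w_down . x), scaled by its gate, then expanded through w_up.
            down = self.w_down[idx]                                 # (tokens, k, d_model)
            up = self.w_up[idx]                                     # (tokens, k, d_model)
            h = F.gelu(torch.einsum('td,tkd->tk', x, down)) * gates # (tokens, k)
            out = out + torch.einsum('tk,tkd->td', h, up)
        return out
```

The product-key factorization is what keeps retrieval cheap: instead of scoring all N experts, each head scores two codebooks of size sqrt(N) and combines the top candidates, so growing n_sub to around a thousand yields on the order of a million experts while the lookup cost stays modest.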
Researchers evaluated PEER’s performance on various benchmarks, comparing it against transformer models with dense feedforward layers and other MoE architectures. The experiments demonstrated that PEER models achieve a superior performance-compute tradeoff, reaching lower perplexity scores than the baselines at the same computational budget. Increasing the number of experts in a PEER model led to further reductions in perplexity, highlighting the scalability and efficiency of the architecture. These findings challenge the notion that MoE models reach peak efficiency with a limited number of experts, and they showcase the potential of PEER to scale MoE to millions of experts while reducing the cost and complexity of training and serving large language models.
The introduction of the PEER architecture by Google DeepMind represents a significant advance in the field of large language models. By enabling MoE models to scale to millions of experts, PEER offers a promising way to increase model capacity while keeping computational costs low. The experimental results underscore PEER’s superior performance and efficiency compared to existing architectures, paving the way for further progress in the development of very large language models.