Multi-layer Perceptrons (MLPs) are the most fundamental type of neural network: they play an important role in many machine learning systems and are the most theoretically studied neural architecture. A new paper from researchers at ETH Zurich pushes the limits of pure MLPs, showing that scaling them up yields far better performance than previously expected. These findings may have important implications for the study of inductive biases, the theory of deep learning, and neural scaling laws. Our friends over at The Gradient provided this analysis.
Many neural network architectures have been developed for different tasks, but the simplest form is the MLP, which consists of dense linear layers composed with elementwise nonlinearities. MLPs matter for several reasons: they are used directly in settings such as implicit neural representations and tabular data processing; they serve as subcomponents of state-of-the-art models such as convolutional neural networks, graph neural networks, and Transformers; and they are widely studied in theoretical work that aims to understand deep learning more generally.
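To make "dense linear layers composed with elementwise nonlinearities" concrete, here is a minimal sketch of an MLP forward pass in plain Python. All weights, dimensions, and helper names here are illustrative assumptions, not details from the paper.

```python
import random

def dense(x, W, b):
    """One dense (fully connected) layer: y = x @ W + b.
    W has one row per input coordinate and one column per output."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def relu(x):
    """Elementwise nonlinearity applied between layers."""
    return [max(0.0, v) for v in x]

def mlp_forward(x, layers):
    """Compose dense layers with ReLU between them (none after the last)."""
    for i, (W, b) in enumerate(layers):
        x = dense(x, W, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x

def rand_layer(d_in, d_out):
    """Small random weights, zero biases (purely for illustration)."""
    W = [[random.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    return W, [0.0] * d_out

random.seed(0)
layers = [rand_layer(4, 8), rand_layer(8, 3)]   # 4 -> 8 -> 3
out = mlp_forward([1.0, 2.0, 3.0, 4.0], layers)
print(len(out))  # 3 outputs, one per class in this toy setup
```

Everything a pure MLP does reduces to repeating this pattern; the architectural variants discussed below only change how the layers are sized and connected.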
MLP-Mixer (left) versus pure MLPs for images (right). MLP-Mixer still encodes visual inductive biases, whereas the pure MLP approach simply treats images as arrays of numbers.
This current work scales MLPs for widely studied image classification tasks. The pure MLPs considered in this work significantly differ from MLP-based models for vision such as MLP-Mixer and gMLP. The latter two works use MLPs in a specific way that encodes visual inductive biases by decomposing linear maps into channel mixing maps and patch mixing maps. In contrast, pure MLPs flatten entire images into numerical vectors, which are then processed by general dense linear layers.
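The "flatten entire images into numerical vectors" step can be shown in a few lines. The CIFAR-10-style shape below is an assumption for illustration; the point is that all spatial and channel structure is discarded before the dense layers see the input.

```python
# A pure MLP treats an H x W x C image as a flat array of numbers.
height, width, channels = 32, 32, 3   # illustrative CIFAR-10-sized input

# A dummy all-zero image, represented as nested lists.
image = [[[0.0 for _ in range(channels)] for _ in range(width)]
         for _ in range(height)]

# Flatten: every pixel value becomes one coordinate of the input vector,
# with no record of which pixels were neighbors.
flat = [v for row in image for pixel in row for v in pixel]
print(len(flat))  # 32 * 32 * 3 = 3072
```

A dense layer then maps this 3072-dimensional vector directly, which is why no convolutional or patch-based inductive bias survives the flattening.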
The authors consider isotropic MLPs in which every hidden layer has the same dimension and layernorm is added after each layer of activations. They also experiment with inverted bottleneck MLPs, which expand and contract the dimension of each layer and include residual connections. The inverted bottleneck MLPs generally perform much better than the isotropic MLPs.
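One inverted bottleneck block, as described above, can be sketched as layernorm, an expanding dense layer, a nonlinearity, a contracting dense layer, and a residual connection. The expansion factor of 4, the ReLU nonlinearity, and all names below are assumptions for illustration, not the paper's exact configuration.

```python
import math
import random

def layernorm(x, eps=1e-5):
    """Normalize a vector of activations to zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dense(x, W, b):
    """Dense layer y = x @ W + b (W: one row per input coordinate)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def inverted_bottleneck_block(x, W_up, b_up, W_down, b_down):
    """layernorm -> expand -> nonlinearity -> contract -> residual."""
    h = layernorm(x)
    h = dense(h, W_up, b_up)                   # expand: d -> k*d
    h = [max(0.0, v) for v in h]               # elementwise ReLU
    h = dense(h, W_down, b_down)               # contract: k*d -> d
    return [xi + hi for xi, hi in zip(x, h)]   # residual connection

random.seed(0)
d, k = 4, 4  # hidden width and expansion factor (k = 4 is an assumption)
W_up = [[random.uniform(-0.1, 0.1) for _ in range(k * d)] for _ in range(d)]
W_down = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k * d)]
out = inverted_bottleneck_block([1.0, 2.0, 3.0, 4.0],
                                W_up, [0.0] * (k * d),
                                W_down, [0.0] * d)
print(len(out))  # residual connection keeps the dimension at d = 4
```

Because the residual path preserves the dimension, blocks like this can be stacked deeply, which is one plausible reason this variant trains better than the plain isotropic stack.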
Finetuned performance of inverted bottleneck MLPs pretrained on ImageNet21k.
Experiments on standard image classification datasets show that MLPs can perform quite well despite their lack of inductive biases. In particular, MLPs transfer well: when pretrained on ImageNet21k, large inverted bottleneck MLPs can match or exceed the performance of ResNet18s (except on ImageNet itself). Moreover, as with other modern deep learning models, the performance of inverted bottleneck MLPs scales predictably with model size and dataset size. Interestingly, these scaling laws show that MLP performance is limited more by dataset size than by model size, perhaps because MLPs have weaker inductive biases and therefore need more data to learn well.
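The shape of such a scaling law can be illustrated with a toy power-law model in model size N and dataset size D. Every constant and exponent below is invented purely for illustration; none of them are the paper's fitted values.

```python
# Toy scaling law: loss falls as a power law in model size N and
# dataset size D, plus an irreducible floor c. All constants are
# made up to illustrate a data-bottlenecked regime, where the data
# term dominates the model term.
def power_law_loss(N, D, a=0.05, alpha=0.3, b=2.0, beta=0.5, c=1.5):
    return a * N ** -alpha + b * D ** -beta + c

baseline     = power_law_loss(N=1e7, D=1e6)   # small model, small data
more_data    = power_law_loss(N=1e7, D=1e8)   # 100x more data
bigger_model = power_law_loss(N=1e9, D=1e6)   # 100x bigger model

# Under these assumed constants, adding data helps more than adding
# parameters, mirroring the data-limited behavior described above.
print(more_data < bigger_model < baseline)  # True
```

The qualitative point is that when the data term of the power law dominates, growing the dataset moves the loss more than growing the model, which is the regime the authors report for MLPs.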
Why it’s important
Scaling laws and the gains from scaling model and dataset sizes are important to study, as larger versions of today’s models may be powerful enough for many useful tasks. This work shows that MLP performance also follows scaling laws, though MLPs are more data-hungry than other deep learning models. Importantly, MLPs are extremely efficient to train: their forward and backward passes are fast and, as shown in this work, they benefit from very large batch sizes. MLPs can therefore be used to study pretraining and large-dataset training cheaply.
The authors’ observation that MLPs perform well with very large batch sizes is particularly interesting, since convolutional neural networks generally perform better with smaller batch sizes. Using MLPs as a proxy for CNNs (for instance, in theoretical work) may therefore be misleading in this respect, as the implicit biases and other properties of the optimization process may differ significantly between the two architectures.
That large-scale MLPs can do well is further evidence that inductive biases may matter far less than model and data scale in many settings. This aligns with the finding that, at sufficient scale, Vision Transformers outperform CNNs on many tasks, even though CNNs have more visual inductive biases built in.