Behind the big model frenzy: The “aging” and retrofitting of AI infrastructure


Machine learning models have evolved into what some call “behemoths”. The world’s top tech companies are in an arms race to train the largest models (MUM, OPT, GPT-3, Megatron), while production-focused companies are scaling up their existing models and seeing good results.
It is all going well, but what is rarely mentioned are the very real challenges that massive models pose to existing AI infrastructure and development workflows.
The weights of a large model can exceed 100 GB, yet today’s development tools have not caught up: they are hard to use, and deployment often means waiting minutes or even hours. This has become a real pain point for AI engineers: it wastes their time, hurts productivity, and slows down iteration.
According to Modular, a team building AI infrastructure tools, developer productivity is one of the biggest costs of training and deploying models. The toolchain therefore needs continuous optimization to improve the experience of early adopters and make life easier for developers. This article explores the technical challenges of managing huge amounts of data during compilation, and the improvements Modular is making to its infrastructure (and to the MLIR compiler framework) to address them. Compiled by the OneFlow community (ID: OneFlowTechnology).

1
AI model tooling falls short on usability
Graph transformations, optimizations, and compilers rewrite AI models to improve their performance and portability so that they can be deployed on target hardware.
Among these tools are high-level “compilers” such as the TensorFlow Lite Converter, which converts a TensorFlow SavedModel into a highly optimized format (a FlatBuffer) so that the model can run on edge devices. There are also domain-specific compilers such as XLA and the TorchScript JIT compiler, which build an intermediate representation (often a “graph”) of the AI model and then compile it into another form, such as machine code or a domain-specific runtime representation.
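As a concrete illustration of the first kind of tool, here is a minimal sketch using TensorFlow’s public API (the SavedModel path is a placeholder) that lowers a SavedModel to a FlatBuffer:

import tensorflow as tf

# Convert a SavedModel into a TFLite FlatBuffer for edge deployment.
# "my_saved_model" is a placeholder directory containing a SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training optimizations
tflite_flatbuffer = converter.convert()               # bytes of the optimized program

with open("model.tflite", "wb") as f:
    f.write(tflite_flatbuffer)
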
Compiling an AI graph is very different from traditional compilation. An AI graph consists of two parts: the graph topology (how the layers are connected) and the model weights (the parameters of particular layers). In terms of size, the topology is measured in KB, while the weights are measured in MB or even GB. For example, the Open Pre-trained Transformers models released by Meta range from 30 billion and 66 billion up to 175 billion parameters, which corresponds to 100+ GB of weights. The Gopher and Megatron models are even larger.
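As a rough back-of-the-envelope check (assuming 2 bytes per parameter, i.e., fp16 storage), the parameter counts translate to weight sizes like this:

# Approximate weight sizes for the OPT model family, assuming fp16 (2 bytes per parameter).
BYTES_PER_PARAM = 2
for billions in (30, 66, 175):
    gigabytes = billions * 1e9 * BYTES_PER_PARAM / 1e9
    print(f"{billions}B parameters -> ~{gigabytes:.0f} GB of weights")
# 30B -> ~60 GB, 66B -> ~132 GB, 175B -> ~350 GB: well past the "100+ GB" mark.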

Existing tools in the AI ecosystem do not yet handle large models well. For example, Protobuf limits serialized messages to 2 GB, so models that rely on the Protobuf serialization format are constrained. The documentation for the latest version of TensorRT states that “for Transformer based neural network models such as BERT and GPT, TensorRT can consume up to 10 times the CPU memory of the model size at compile time”, which suggests TensorRT is not well suited to large models. And if you store a large model in the ONNX file format, you must split the model weights across multiple separate files.
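For instance, with the ONNX Python package one common workaround looks roughly like this (a sketch; the file names are placeholders):

import onnx

# Rewrite a model so that its large tensors live in a sidecar file, keeping the
# .onnx protobuf itself under Protobuf's 2 GB limit.
model = onnx.load("big_model.onnx")
onnx.save_model(
    model,
    "big_model_external.onnx",
    save_as_external_data=True,     # move tensor contents out of the protobuf
    all_tensors_to_one_file=True,   # collect them into a single weights file
    location="big_model_weights.bin",
    size_threshold=1024,            # only externalize tensors larger than 1 KB
)
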
All of this adds unnecessary complexity to the AI development workflow, deprives models of a “single source of truth” (SSOT), and makes them harder to distribute.
Workarounds for oversized model weights exist, but they tend to make the overall AI development workflow even more complex. For example, because some compilation stages take two minutes or more, Modular built a mechanism for caching intermediate files.

Modular adopted this caching mechanism because it cares deeply about developer productivity, but like the other workarounds it only treats the symptoms: it is neither fully reliable nor of any help on a cache miss.
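A cache of this kind can be sketched in a few lines (illustrative only, not Modular’s actual implementation; all names here are made up):

import hashlib
import os
import pickle

CACHE_DIR = ".compile_cache"  # hypothetical location for cached intermediate artifacts

def cached_compile(ir_bytes: bytes, compile_fn):
    """Reuse a previously compiled artifact when the input IR is unchanged."""
    key = hashlib.sha256(ir_bytes).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):              # cache hit: skip the slow compiler phase
        with open(path, "rb") as f:
            return pickle.load(f)
    artifact = compile_fn(ir_bytes)       # cache miss: pay the multi-minute cost
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact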

2
MLIR in Modular’s compilation stack

In Modular’s technology stack, the MLIR compiler infrastructure is responsible for representing and transforming AI models, including AI operator graphs (for multiple frameworks), intermediate runtime primitives, and low-level machine code generation.
MLIR is a sub-project of the LLVM Compiler Infrastructure Project, which aims to provide a modern toolkit for building domain-specific compilers. MLIR provides a set of core components for modeling, analysis, and transformation in a variety of computing domains, including hardware design, quantum computing, and artificial intelligence.
MLIR helps us build a single, full-stack system that is more powerful, modular, extensible, and easier to maintain than a conventional technology stack. Using a unified infrastructure lets us easily migrate improvements across our tool stack, making our development workflows more modular and composable.
Besides Modular, TensorFlow, XLA, PyTorch, ONNX, and others use MLIR for model representation and transformation. MLIR has clear strengths, but as its user ecosystem keeps expanding, it becomes all the more important to keep improving and refining it.

3
MLIR’s approach to managing weights leaves much to be desired
One of MLIR’s fundamental components is its Attribute mechanism, which you can think of as uniqued (i.e., memoized or interned) constant data. Attributes are user-extensible: different attribute types can be used for different use cases. Many kinds of values can be held as attributes, such as constant expression values (like “5” or “10.0”), string literals, enumerated values (like “less than”, “greater than”, “equal to”), data sets, and so on. Most MLIR-based AI tools use attributes to hold AI model weights.
However, a problem arises: model weights can be enormous, yet MLIR stores a 2 GB weight the same way it stores a 4 B one, as a uniqued attribute holding the elements. Uniquing gigabytes of data makes no sense.
The difficulty with this approach is that when something is uniqued in MLIR, it is allocated, hashed, and stored in the MLIRContext. Uniqued objects share the MLIRContext’s lifetime and are only destroyed when the context is destroyed. For small values this mechanism has many benefits: values can be passed around cheaply, uniqued values can be compared by pointer, attribute storage can be shared (which is very common), and so on.
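Conceptually, uniquing works like string interning: equal values are stored once in the context and shared by pointer. A toy Python sketch of the idea (not MLIR’s actual implementation):

# A toy "context" that uniques attribute values, similar in spirit to MLIRContext.
class Context:
    def __init__(self):
        self._pool = {}  # value -> the single shared instance

    def get_attr(self, value):
        # Hash the value and hand back the one shared copy; it lives as long
        # as the context itself does.
        return self._pool.setdefault(value, value)

ctx = Context()
a = ctx.get_attr((1.0, 2.0, 3.0))
b = ctx.get_attr((1.0, 2.0, 3.0))
assert a is b  # uniqued values can be compared by identity (pointer comparison)
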
But for large weights these advantages become disadvantages: we don’t want weights to be reallocated, copied, or uniqued; we only want them to exist for a while, and to be freed once no computation still references them. For example, when a model quantization tool runs, it transforms the operator graph and generates new weights, and the data may end up duplicated several times, with all of those weight copies consuming memory until compilation finishes.
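The copying problem is easy to reproduce in miniature (an illustrative sketch with a toy int8 quantizer, not a real quantization tool):

import numpy as np

def quantize_int8(weights: np.ndarray) -> np.ndarray:
    # A toy symmetric int8 quantizer: it returns a *new* buffer, so the fp32
    # original and the int8 copy are both live at the same time.
    scale = np.abs(weights).max() / 127.0
    return np.round(weights / scale).astype(np.int8)

w = np.random.rand(1_000_000).astype(np.float32)  # stand-in for a large weight tensor
q = quantize_int8(w)  # peak memory now holds both w and q
# With uniqued attributes, the old copy cannot be freed until the MLIRContext dies.
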
Another problem is how MLIR is serialized to the file system. Initially, MLIR had only a textual format and no binary serialization format. This is problematic for large weights, because every byte of binary data is printed as hexadecimal, doubling its size. Not only did the conversion itself take quite a while (around 20 seconds for a multi-gigabyte model), but the resulting intermediate file was also twice as large as it needed to be.
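The overhead of a text-only format is easy to see (a small Python illustration; 64 MB stands in for multi-gigabyte weights):

import os
import time

weights = os.urandom(64 * 1024 * 1024)  # 64 MB of pretend model weights

start = time.time()
text = weights.hex()                    # hexadecimal form, as in a textual IR dump
elapsed = time.time() - start

print(len(text) / len(weights))         # 2.0: the text form is twice the size
print(f"hex conversion took {elapsed:.2f}s; now scale that to 100+ GB of weights")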

4
Memory footprint: an even more serious problem than slower development
This design mechanism is well-intentioned, but it can hurt compilation efficiency even in the best of compilers. The most obvious problem is that compiling, monitoring, and transforming a model all take longer. If you have ever used “my code is still compiling” as an excuse to step away from your desk, you know how painful waiting for the compiler can be. Adopting this mechanism means the processor has to constantly allocate, copy, and hash gigabytes of data.
An even bigger problem than compile time is memory footprint, which gets in the way of other architectural features of the Modular technology stack. For example, because our compiler and stack are themselves highly parallel and use advanced features such as online search, a large memory footprint directly prevents some work from running in parallel and keeps us from achieving the highest-quality results.
The core value of Modular is building tools that users love. If an advanced feature doesn’t work well, is inefficient, or comes with caveats (e.g., “this feature doesn’t work in certain situations”), users simply won’t use it. So Modular set out to solve the fundamental problems caused by large weights and to simplify both user and developer workflows.

5
Core improvements to MLIR
The Modular team is an important contributor to the MLIR project. A key part of Modular’s culture is “getting the product right”, and that applies to every project Modular participates in. While driving MLIR forward, Modular works to make sure each step of the project is done right, and collaborates closely with the MLIR community to gain acceptance for the approach it has taken.
The Modular team lists the characteristics that tooling for large models should have:
Don’t allocate memory unnecessarily: for large amounts of data such as weights, memory-mapping it from disk is far more efficient than copying it into freshly allocated buffers (see the sketch after this list).
No hashing or uniquing required: we don’t want the burden of checking 2 GB blobs for equality; weights should be identified by name, not by whether their contents are unique.
Allow inline mutation: if the data is only used in one place, it should be possible to quantize, transform, and otherwise manipulate it in place rather than copying it first.
Allow deallocation: because the data in a large model is so big, it should be possible to free it once all references to it are gone.
Fast serialization: whether for just-in-time compilation, searching over optimization parameters, or local iteration, IR needs to be cached, so this step must be fast.
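To illustrate the first point above, memory-mapping lets the operating system page weight data in lazily and share clean pages across processes, instead of copying everything into newly allocated buffers (a sketch using numpy; the weight file name and dtype are assumptions):

import numpy as np

# Map the weight file into the address space instead of reading (copying) it into RAM.
# Pages are brought in from disk on demand and can be dropped under memory pressure.
weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

# Only the slices we actually touch get paged in.
chunk = np.array(weights[:1024])  # materialize a small piece for inspection
print(chunk[:4])
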
These ideas are not new, but traditional compilers, such as those for typical CPU programming languages, do not meet these requirements.

6
Adjusting the weight attribute
The first four requirements get at a fundamental question of how we should use MLIR: weights are constant data, but they should be managed differently from other MLIR attributes. Until now, our weight management had been a poor fit, like forcing a square peg into a round hole: it wasted space, slowed down development, and increased costs for users.
So Modular decided to take a different approach to managing weight data, which led to MLIR’s first fundamental extension mechanism: the Resource mechanism, which separates the data from references to it in the computation.
With the Resource mechanism, each blob of serialized MLIR may contain additional segments of information called resources. Resources are either dialect resources (a dialect is a namespace-like abstraction used to extend MLIR) or “external” resources belonging to a specific toolchain. The data in a resource is represented as simple key-value pairs, forming a JSON-like structure as shown in the example below.

/// Here we have some MLIR operations.
module {
  func.func @foo() {
    // Cool stuff here ...
  }
}

/// Here we have an `external_resources` section. The resource section's syntax
/// is designed to be unique so that it does not conflict with other MLIR syntax
/// (which is user extensible!).
{-#
  external_resources: {
    mlir_reproducer: {
      pipeline: "func.func(cse,canonicalize),inline",
      disable_threading: true
    }
  }
#-}

The example above shows how MLIR reproducers use resources. An MLIR reproducer is essentially a configuration that carries execution information, such as the transformation pipeline, needed to reproduce a particular failure or crash. Before resources existed, we represented this execution information with a comment at the top of the MLIR file; with resources, that execution information becomes first-class.
Large weight data, which used to be uniqued and therefore kept in memory for a long time, can now be stored with the Resource mechanism. In the IR, the attribute holds only a lightweight reference to the resource instead of embedding the underlying data.

Other advantages:
Debugging with the IR is less error-prone, which makes for a better development experience: resources live in a dedicated section, so we no longer have to worry about accidentally dumping a full 4 GB of data while debugging.
We can reasonably work with the IR without the data: because the IR holds only references to the data rather than the data itself, the underlying resource data can be omitted when it isn’t needed. Among other benefits, this greatly simplifies generating reproducers, which don’t need the large weights in the first place (imagine sending a colleague a 1.2 GB reproducer file; now it’s 20 MB).
By introducing
