How to benchmark resource requirements of deep learning models

Why do I need to benchmark?

Computational cost of training and deploying deep learning (DL) models has become an increasingly difficult problem, especially with the success of transformer-based models that span at least a billion parameters.

Terms

FLOP(S): Floating Point Operations (Per Second)
MAC: Multiply-Accumulate Operations
- 1 MAC ≈ 2 FLOP

What should I benchmark?

The idea behind benchmarking (also known as profiling in some context) is to understand the resource requirements for your planned DL pipeline. By getting the memory and computation requirements of your model, you can get an idea of how much resource you should expect to dedicate to your particular pipeline. In specific, this post will detail the process to benchmark the needs of an arbitrary pytorch model.

Installation

It's best if you know how to use some form of virtual environments (you can refer to this post if you need a refresher). It's not necessary, but it is strongly recommended.

Since this post focuses on pytorch, I will assume you already have some version of pytorch installed.

The benchmarking is done by the ptflops library. There are other libraries that achieve similar goals, but I decided to use ptflops for the following reasons.

fvcore: doesn't support attention operators
torchinfo: doesn't calculate MAC.

You can start by installing ptflops using pip:

pip install ptflops

Usage

Once ptflops is installed, you can write a simple script that loads your model and calculates the FLOPS and MACs using it.

Simple Case

A simple case where the input to the __forward__ call is just a single argument is as follows:

from torch import nn
from ptflops import get_model_complexity_info

"""
A simple linear classifier
"""
class SimpleNet(nn.Module):
    def __init__(self, d_in, d_out):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x):
        return self.fc(x)

model = SimpleNet(10, 2)
macs, params = get_model_complexity_info(
    model,
    input_res=(1, 10),  ## batch_size, input_dim
    as_strings=True,
    print_per_layer_stat=True,
    verbose=True,
)

print("Compute:", macs)
print("Memory (## params):", params)

The second last line will print the amount of MACs required to run the model given the batch size (in this case set to 1). Using this value, you can now calculate how many epochs/batches you estimate to train your model and multiply this value by that amount to get an idea of how much total you need. The last line will tell you how many parameters are present in your model. This is useful to understand how much memory you need to store the model.

More Complex Case

The above example is sufficient if your model's input is a single parameter. But what it takes multiple tensors? For example, consider this model:

from torch import nn

from ptflops import get_model_complexity_info

"""
A simple linear classifier
"""
class SimpleNet(nn.Module):
    def __init__(self, d_in, d_out):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x1, x2):
        return self.fc(x1) + self.fc(x2)

You will find that modifying input_res=(1, 10) to input_res=((1, 10), (1, 10)) will not work.

In this case, in addition to passing input_res, you can also include an argument for input_constructor, which expects a function that takes input_res as an input and returns some dummy tensors as a tuple (positional arguments) or dictionary (keyword arguments). This means that you can modify the previous code to this:

import torch
from torch import nn
from ptflops import get_model_complexity_info

"""
A simple linear classifier
"""
class SimpleNet(nn.Module):
    def __init__(self, d_in, d_out):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x1, x2):
        return self.fc(x1) + self.fc(x2)

def input_constructor(input_res):
    return {
        "x1": torch.rand(input_res[0], 10),
        "x2": torch.rand(input_res[0], 10),
    }

model = SimpleNet(10, 2)
macs, params = get_model_complexity_info(
    model,
    input_res=(1,),  ## This argument should always be a tuple
    input_constructor=input_constructor,
    as_strings=True,
    print_per_layer_stat=True,
    verbose=True,
)

print(macs)

Note that the input_res argument should always be a tuple, even if it only has one element. As of ptflops==0.7.3, this is the case.

And there it is! You now successfully benchmarked the compute requirements of your model.

Inwon Kang

How to benchmark resource requirements of deep learning models

Tags

← Previous Article

Next Article →

Table of contents