- Published on
How to benchmark resource requirements of deep learning models
Why do I need to benchmark?
Computational cost of training and deploying deep learning (DL) models has become an increasingly difficult problem, especially with the success of transformer-based models that span at least a billion parameters.
Terms
- FLOP(S): Floating Point Operations (Per Second)
- MAC: Multiply-Accumulate Operations
- 1 MAC ≈ 2 FLOP
What should I benchmark?
The idea behind benchmarking (also known as profiling in some context) is to understand the resource requirements for your planned DL pipeline. By getting the memory and computation requirements of your model, you can get an idea of how much resource you should expect to dedicate to your particular pipeline. In specific, this post will detail the process to benchmark the needs of an arbitrary pytorch model.
Installation
It's best if you know how to use some form of virtual environments (you can refer to this post if you need a refersher). It's not necessary, but it is stronly recommended.
Since this post focuses on pytorch, I will assume you already have some version of pytorch installed.
The benchmarking is done by the ptflops
library. There are other libraries that achieve similar goals, but I decided to use ptflops
for the following reasons.
fvcore
: doesn't support attention operatorstorchinfo
: doesn't calculate MAC.
You can start by installing ptflops
using pip:
pip install ptflops
Usage
Once ptflops
is installed, you can write a simple script that loads your model and calculates the FLOPS and MACs using it.
Simple Case
A simple case where the input to the __forward__
call is just a single argument is as follows:
from torch import nn
from ptflops import get_model_complexity_info
"""
A simple linear classifier
"""
class SimpleNet(nn.Module):
def __init__(self, d_in, d_out):
super(SimpleNet, self).__init__()
self.fc = nn.Linear(d_in, d_out)
def forward(self, x):
return self.fc(x)
model = SimpleNet(10, 2)
macs, params = get_model_complexity_info(
model,
input_res=(1, 10), # batch_size, input_dim
as_strings=True,
print_per_layer_stat=True,
verbose=True,
)
print("Compute:", macs)
print("Memory (# params):", params)
The second last line will print the amount of MACs required to run the model given the batch size (in this case set to 1). Using this value, you can now calculate how many epochs/batches you estimate to train your model and multiply this value by that amount to get an idea of how much total you need. The last line will tell you how many parameters are present in your model. This is useful to understand how much memory you need to store the model.
More Complex Case
The above example is sufficient if your model's input is a single parameter. But what it takes multiple tensors? For example, consider this model:
from torch import nn
from ptflops import get_model_complexity_info
"""
A simple linear classifier
"""
class SimpleNet(nn.Module):
def __init__(self, d_in, d_out):
super(SimpleNet, self).__init__()
self.fc = nn.Linear(d_in, d_out)
def forward(self, x1, x2):
return self.fc(x1) + self.fc(x2)
You will find that modifying input_res=(1, 10)
to input_res=((1, 10), (1, 10))
will not work.
In this case, in addition to passing input_res
, you can also include an argument for input_constructor
, which expects a function that takes input_res
as an input and returns some dummy tensors as a tuple (positional arguments) or dictionary (keyword arguments). This means that you can modify the previous code to this:
import torch
from torch import nn
from ptflops import get_model_complexity_info
"""
A simple linear classifier
"""
class SimpleNet(nn.Module):
def __init__(self, d_in, d_out):
super(SimpleNet, self).__init__()
self.fc = nn.Linear(d_in, d_out)
def forward(self, x1, x2):
return self.fc(x1) + self.fc(x2)
def input_constructor(input_res):
return {
"x1": torch.rand(input_res[0], 10),
"x2": torch.rand(input_res[0], 10),
}
model = SimpleNet(10, 2)
macs, params = get_model_complexity_info(
model,
input_res=(1,), # This argument should always be a tuple
input_constructor=input_constructor,
as_strings=True,
print_per_layer_stat=True,
verbose=True,
)
print(macs)
Note that the input_res
argument should always be a tuple, even if it only has one element. As of ptflops==0.7.3
, this is the case.
And there it is! You now successfully benchmarked the compute requirements of your model.