GPU Memory Usage During Model Training and Inference

Problem Background

I have two GPT-2 models. Model 1 has only about 100 million parameters and is stored as 16-bit floats, so its checkpoint is roughly 250 MB; Model 2 has 3.5 billion parameters, also stored as 16-bit floats, so it is about 7 GB.
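
As a quick sanity check (a minimal sketch; the real checkpoints are slightly larger because of embeddings and buffers), fp16 weights take about 2 bytes per parameter:

# Rough fp16 weight size: 2 bytes per parameter
def fp16_size_gb(num_params: int) -> float:
    return num_params * 2 / 1024**3

print(f"{fp16_size_gb(110_000_000):.2f} GB")    # ~0.20 GB for the ~110M-parameter model
print(f"{fp16_size_gb(3_500_000_000):.2f} GB")  # ~6.5 GB for the 3.5B-parameter model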

I expected that, once loaded into GPU memory for inference, a model would occupy roughly as much space as its checkpoint. But the 100-million-parameter model took up 957 MB after being loaded into TorchServe, and I couldn't work out where the extra 700+ MB came from.

$ nvidia-smi
Sun Mar 19 13:54:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |    959MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   3267643      C   /home/venv/bin/python             957MiB |
+-----------------------------------------------------------------------------+

Explanation

The extra memory is in fact the overhead of the CUDA context, which is created as soon as the first CUDA-related operation is executed.

To find out how much GPU memory the CUDA context takes on your own card, simply create a trivial tensor, move it to the GPU, and check the memory usage.

$ python
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> data = [1]
>>> x_data = torch.tensor(data)
>>> x_data.cuda()
tensor([1], device='cuda:0')
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   33C    P8     9W / 250W |    437MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |    959MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3307828      C   python                            435MiB |
|    1   N/A  N/A   3267643      C   /home/venv/bin/python             957MiB |
+-----------------------------------------------------------------------------+

As you can see, even though the tensor we created contains only a single element, the process still occupies 435 MB of GPU memory.

The GPU memory taken by the CUDA context is unrelated to model size; it depends mainly on the GPU model (and driver/CUDA version). In other words, a large model does not incur a noticeably larger CUDA context than a small one.
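
To separate the two contributions, you can compare PyTorch's own allocator statistics with what nvidia-smi reports. A minimal sketch (the exact numbers depend on your GPU and driver):

import torch

x = torch.tensor([1]).cuda()          # the first CUDA operation creates the context

# Memory held by tensors according to PyTorch's allocator: a few hundred bytes at most
print(torch.cuda.memory_allocated())  # e.g. 512 (bytes)

# nvidia-smi, by contrast, reports the whole process footprint
# (tensors + CUDA context), which is why it shows hundreds of MiB.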

Experiment Code

Experiment Model

# Save the 100M-parameter model as fp16
from transformers import GPT2Tokenizer, GPT2LMHeadModel

hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
tokenizer = GPT2Tokenizer.from_pretrained(hf_model_path)
model = GPT2LMHeadModel.from_pretrained(hf_model_path)
model.half()                                 # convert weights to 16-bit floats
model.save_pretrained("Wenzhong-GPT2-110M")  # writes pytorch_model.bin + config.json
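
To confirm the ~250 MB figure, you can check the size of the saved checkpoint (a small sketch; the file name assumes the default PyTorch checkpoint written by save_pretrained):

import os

size_mb = os.path.getsize("Wenzhong-GPT2-110M/pytorch_model.bin") / 1024**2
print(f"{size_mb:.0f} MB")  # roughly 250 MB for the fp16 checkpoint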

TorchServe

Model handler script: Transformer_handler_generalized.py

import torch as th
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from ts.torch_handler.base_handler import BaseHandler


class TransformersGpt2Handler(BaseHandler):
    def __init__(self):
        super(TransformersGpt2Handler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = th.device(
            "cuda:" + str(properties.get("gpu_id"))
            if th.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )

        # Load the fp16 weights packaged in the .mar archive
        self.model = GPT2LMHeadModel.from_pretrained(model_dir, torch_dtype=th.float16)
        self.model.to(self.device)

        # The tokenizer is pulled from the Hugging Face Hub
        hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
        self.tokenizer = GPT2TokenizerFast.from_pretrained(hf_model_path)
        self.end_token_id = self.tokenizer.add_special_tokens({"pad_token": "<|endoftext|>"})

        self.model.eval()
        self.initialized = True

    def preprocess(self, requests):
        inputs = None
        for idx, data in enumerate(requests):
            input_text = data.get("body").get("prompt")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")
            inputs = self.tokenizer(input_text, return_tensors="pt")
        return inputs

    def inference(self, data, *args, **kwargs):
        generation_output = self.model.generate(
            **data.to(self.device),
            return_dict_in_generate=True,
            top_k=4,
            penalty_alpha=0.6,
            output_scores=True,
            do_sample=True,
            eos_token_id=91,
        )
        return generation_output

    def postprocess(self, inference_output):
        inferences = []
        for idx, sentence in enumerate(inference_output.sequences):
            output = self.tokenizer.decode(sentence)
            inferences.append(output)
        return [inferences]

Model Packaging

torch-model-archiver --model-name Wenzhong-GPT2-110M --force --version 1.0 --serialized-file Wenzhong-GPT2-110M/pytorch_model.bin --handler Transformer_handler_generalized.py --export-path model_store/ --extra-files "Wenzhong-GPT2-110M/config.json"

Pull the TorchServe Image

docker pull pytorch/torchserve:latest-gpu

config.properties configuration file

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
cors_allowed_origin=*
cors_allowed_methods=*
install_py_dep_per_model=true
default_response_timeout=600

Start TorchServe

docker run --rm -it -d --name Wenzhong --gpus all -p 18080:8080 -p 18081:8081 -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve:latest-gpu
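
Note that this command does not mount the config.properties shown earlier. One way to make those settings take effect (assuming the image reads its configuration from the default path /home/model-server/config.properties) would be to add an extra volume mount, for example:

docker run --rm -it -d --name Wenzhong --gpus all -p 18080:8080 -p 18081:8081 -v $(pwd)/model_store:/home/model-server/model-store -v $(pwd)/config.properties:/home/model-server/config.properties pytorch/torchserve:latest-gpu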

Install transformers

docker exec Wenzhong pip install -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com transformers

Register the Model and Send a Test Request

curl -X POST "http://localhost:18081/models?url=Wenzhong-GPT2-110M.mar"
curl -X PUT  "http://localhost:18081/models/Wenzhong-GPT2-110M?min_worker=1"
curl -X POST 'http://localhost:18080/predictions/Wenzhong-GPT2-110M' --data '{"prompt": "你是谁?"}'
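
Once the worker is up, you can check its status through the management API and then look at GPU memory again; on the card used above, nvidia-smi reports the same ~957 MiB for the worker process (exact figures vary with GPU model and driver):

curl http://localhost:18081/models/Wenzhong-GPT2-110M
nvidia-smi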

References

GitHub Issue: The memory occupied by the model becomes larger after it is loaded into the GPU
