Deploying the Tongyi Qianwen Multimodal Models Qwen2.5-VL-3B-Instruct / 7B / 72B: A Hands-On Guide

Published: 2025-02-23 05:14:38 | Category: AI Large Models

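Preparation:

Download and install Miniconda3:

https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe

Installation note: Miniconda is a package and environment manager for Python. The dependencies it downloads can take up tens of GB, so if the C: drive is short on space, install it somewhere else. The install path must not contain spaces or other special characters.

After installation, find "Anaconda PowerShell Prompt" in the Start menu and open it to get a command line.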

Method 1

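#Switch to the D: drive
d:

#Clone the project into D:/Qwen2.5-VL
git clone https://github.com/QwenLM/Qwen2.5-VL

#Change into the project root
cd .\Qwen2.5-VL\

#Create a runtime environment for this project
conda create -n qwen_env python=3.10

#Activate the environment
conda activate qwen_env

#Use pip to install the dependencies listed in this file; -i points pip at the Aliyun mirror, because downloads from the default index are very slow from within China
pip install -r .\requirements_web_demo.txt -i https://mirrors.aliyun.com/pypi/simple/

#Uninstall these three packages with pip (the requirements file above pulls in version 2.4 of them, which in my testing failed with the error below)
#OSError: [WinError 126] The specified module could not be found. Error loading "xxxx\fbgemm.dll" or one of its dependencies.
#Alternatively, edit the versions in requirements_web_demo.txt beforehand so that no uninstall and reinstall is needed
pip uninstall torch torchvision torchaudio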

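####Without an NVIDIA discrete GPU######
#Reinstall the three packages (by default this pulls the latest version, 2.6 at the time of writing)
pip install torch torchvision torchaudio 
#Or pin specific versions (recommended)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

#----With an NVIDIA discrete GPU------
#Install CUDA: download a suitable version from the official site, for example this 12.4 build, and install it
#https://developer.download.nvidia.cn/compute/cuda/12.4.1/local_installers/cuda_12.4.1_551.78_windows.exe


#Reinstall the three packages (by default this pulls the latest version, 2.6 at the time of writing); the download is about 2.5 GB and, without a proxy, may run at only around 100 KB/s
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
#Or pin specific versions (recommended)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
#Same as above, except that the --index-url parameter selects builds with CUDA support; cu124 matches the CUDA 12.4 version downloaded from the official site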

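#python web_demo_mm.py --checkpoint-path "Qwen/Qwen2.5-VL-3B-Instruct"
#Launching this way downloads the model from Hugging Face automatically and runs it, but that requires a proxy from within China

#Alternatively, use Git to download the model from the ModelScope mirror in China. Run the clone from D:\ (not from inside the project folder) so that it lands in D:/Qwen2.5-VL-3B-Instruct, then cd back into the project; Git LFS is needed to fetch the weight files
git clone https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git

#Edit the file \Qwen2.5-VL\web_demo_mm.py
#and point the model path configured in the project at the download location (forward slashes avoid backslash-escape problems in a Python string): DEFAULT_CKPT_PATH = 'D:/Qwen2.5-VL-3B-Instruct'

#Launch
python web_demo_mm.py

#If there are no errors, it prints: Running on local URL: http://127.0.0.1:7860
#Open that web UI address in a browser to try it out.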
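
Before launching the demo against a locally cloned model, it can help to confirm that the folder really contains weights. The sketch below is only an illustration: the function name `check_model_dir` is hypothetical, and it assumes that a usable checkpoint has a `config.json` plus `*.safetensors` files. It also flags files that are still Git LFS pointer stubs, which is what a clone without Git LFS tends to leave behind.

```python
from pathlib import Path

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def check_model_dir(model_dir):
    """Return a list of human-readable problems found in a local model folder."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    # Assumption: a usable checkpoint has config.json and *.safetensors weights.
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")

    weights = sorted(root.glob("*.safetensors"))
    if not weights:
        problems.append("no *.safetensors weight files found")

    for weight in weights:
        # Only the first few bytes are needed to recognize a pointer stub.
        with weight.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            problems.append(f"{weight.name} is a Git LFS pointer, not real weights")
    return problems


if __name__ == "__main__":
    # Forward slashes avoid backslash-escape surprises in Python string literals.
    for problem in check_model_dir("D:/Qwen2.5-VL-3B-Instruct"):
        print("WARNING:", problem)
```

An empty result means none of these particular problems were found; it does not guarantee that the model will load.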

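A side note: VL multimodal models need far more VRAM and RAM than text-only language models. In my tests the 3B model needs 10 GB+ of VRAM and the 7B model 20 GB+. If your VRAM is too small, you can simply use the install path for machines without a discrete GPU described above. It will be very slow, but once the GPU's VRAM fills up, the work spills over to system RAM and the CPU anyway, and in my comparison the two were about equally slow; the GPU only pays off when there is enough VRAM. System RAM should ideally be 32 GB or more: with 16 GB my tests ran out completely and used another 10 GB+ of virtual memory, which made everything very slow. More importantly, when trying this locally, upload lower-resolution images, otherwise inference keeps failing for lack of resources.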
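
As a rough sanity check on these numbers, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes nominal sizes of 3e9, 7e9 and 72e9 parameters (the real counts differ somewhat) and ignores activations, image tokens and the KV cache, so it gives a lower bound rather than the actual requirement.

```python
# Bytes used per parameter for common floating-point dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Memory needed for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Assumption: nominal parameter counts; the real models are not exactly this size.
    for label, params in [("3B", 3e9), ("7B", 7e9), ("72B", 72e9)]:
        for dtype in ("float32", "bfloat16"):
            print(f"{label} {dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Under these assumptions the 3B weights come to roughly 5.6 GiB in bfloat16 and 11.2 GiB in float32, which is in line with the observation that a 6 GB card cannot hold the model once the working memory for images is added.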

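If it reports a failure, try closing it and relaunching a few times. A better-specced machine is more likely to succeed.

On my own laptop, an RTX 3060 with 6 GB of VRAM, it was very slow and failed often.

The above summarizes what finally worked after two days and nights of trial and error.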

Method 2: Deploy with vLLM

conda create -n qwen_env python=3.10 -y
conda activate qwen_env
pip install vllm -i https://mirrors.aliyun.com/pypi/simple/
pip install git+https://github.com/huggingface/transformers -i https://mirrors.aliyun.com/pypi/simple/
pip install torch -i https://mirrors.aliyun.com/pypi/simple/
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --limit-mm-per-prompt image=4
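#This ends with the error:
    import uvloop
ModuleNotFoundError: No module named 'uvloop'
#After N hours of failed attempts I finally learned that uvloop does not currently support Windows, so for now this approach only runs on Linux.
#A painful lesson: none of the documentation or blog posts I found mentioned this.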
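
For readers who do run the server on Linux, `vllm serve` exposes an OpenAI-compatible HTTP API. The sketch below shows one way a client request could be put together using only the standard library. It was not part of the original experiment: the helper names are hypothetical, and it assumes the server is reachable on vLLM's default port 8000 on the same machine.

```python
import json
import urllib.request

# Assumption: the server runs locally on vLLM's default port 8000.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(image_url, prompt, model="Qwen/Qwen2.5-VL-3B-Instruct"):
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 128,
    }


def send_chat_request(payload, url=VLLM_URL):
    """POST the payload as JSON and return the text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `send_chat_request(build_chat_request(image_url, "Describe this image."))` would return the model's answer as a string.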

Method 3

conda create -n qwen_env python=3.10 -y
conda activate qwen_env
pip install git+https://github.com/huggingface/transformers accelerate  -i https://mirrors.aliyun.com/pypi/simple/

pip install qwen-vl-utils  -i https://mirrors.aliyun.com/pypi/simple/
pip uninstall torch torchvision torchaudio
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install modelscope -i https://mirrors.aliyun.com/pypi/simple/

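#Install CUDA: download a suitable version from the official site, for example the 12.4 build linked under Method 1, and install it

#It is best to verify first that CUDA is usable; see http://www.canquick.com/article/ARTICLE_BB5E097E46C0CA60904B81FA.html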
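
A quick check of this kind can also be done directly from Python inside the qwen_env environment. The following is a minimal sketch, not the procedure from the linked article; the function name `cuda_report` is made up for illustration, and it falls back gracefully when PyTorch is not installed.

```python
def cuda_report():
    """Summarize whether PyTorch is installed and whether it can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            devices.append(
                {"name": props.name, "total_gib": round(props.total_memory / 2**30, 2)}
            )
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        "cuda_build": torch.version.cuda,
        "cuda_available": available,
        "devices": devices,
    }


if __name__ == "__main__":
    print(cuda_report())
```

If `cuda_available` comes back False on a machine that has an NVIDIA GPU, a likely cause is that the CPU-only build of torch was installed instead of the cu124 build.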

#Create a new Python file named start.py, set its interpreter to the qwen_env environment created above, and run it:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download

# default: Load the model on the available device(s)
model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype="auto",
#     device_map="auto"
# )

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained(model_dir)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
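
The trimming step near the end of the script is easy to misread. `generate` returns each prompt followed by the newly generated tokens, so the script slices off the first `len(in_ids)` entries of every row before decoding. The same slicing is shown below on plain Python lists with made-up token ids; `trim_prompt_tokens` is a hypothetical helper name, not something the script defines.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence, row by row."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


if __name__ == "__main__":
    # Made-up token ids: each output row starts with a copy of its prompt row.
    prompts = [[101, 7, 8], [101, 9, 10]]
    outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]
    print(trim_prompt_tokens(prompts, outputs))
```

Without this step, the decoded text would begin with the chat template and the question rather than with the answer.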

Running the script downloads the model automatically.

Once the download finishes, the script starts running.

In my case, however, it failed with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.20 GiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free. Of the allocated memory 5.61 GiB is allocated by PyTorch, and 182.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

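Out of VRAM. My laptop GPU is an RTX 3060 with 6 GB, but a single allocation here already asked for 12.20 GiB. As written, this method therefore only suits machines with more than 16 GB of VRAM, and actual inference may need even more.

Having come this far, though, giving up would waste the effort, so we still need a way to get the project running. Let's change the code: since GPU VRAM is insufficient, let the CPU and system RAM make up the difference.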

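#1. Change:
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
#to:
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
#This loads the weights in bfloat16, a 16-bit format that needs half the memory of fp32 and is usually faster on GPUs that support it, at some cost in precision. Strictly speaking it is reduced precision rather than quantization. Note that torch_dtype="auto" follows the dtype stored in the checkpoint, so the saving only applies if the weights would otherwise have loaded as fp32.


#2. Change:
processor = AutoProcessor.from_pretrained(model_dir)
#to:
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels, use_fast=True)
#Capping the image resolution reduces memory consumption and improves performance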
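
The `256*28*28` and `1280*28*28` expressions are easier to read as token budgets: each visual token covers an area of 28 x 28 pixels, so the two bounds correspond to roughly 256 and 1280 visual tokens per image. The sketch below converts between the two and gives a rough estimate for a given image size. It is a simplification: the real processor also rounds each side to a multiple of 28 while keeping the aspect ratio, so actual counts can differ slightly, and the function names are made up for illustration.

```python
PATCH = 28  # side length in pixels of the area covered by one visual token


def pixel_budget(num_tokens):
    """Convert a visual-token count into a pixel count (tokens * 28 * 28)."""
    return num_tokens * PATCH * PATCH


def estimate_visual_tokens(width, height, min_tokens=256, max_tokens=1280):
    """Rough token estimate for one image, clamped to the configured range."""
    raw = (width * height) / (PATCH * PATCH)
    return int(min(max(raw, min_tokens), max_tokens))


if __name__ == "__main__":
    print("min_pixels =", pixel_budget(256))
    print("max_pixels =", pixel_budget(1280))
    # Example image sizes chosen for illustration only.
    for size in [(640, 480), (1920, 1080), (4000, 3000)]:
        print(size, "->", estimate_visual_tokens(*size), "tokens (approx.)")
```

With these limits a 1920 x 1080 photo is capped at about 1280 tokens instead of the roughly 2600 it would otherwise produce, which is where the memory saving comes from.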

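With GPU acceleration, once VRAM runs out, RAM and CPU usage shoot up. My CPU utilization was not actually high, and 16 GB of RAM looked insufficient, so about 10 GB of virtual memory was used as well. The CPU was probably sitting idle because memory was the bottleneck, which kept the speed down.

16 GB of RAM is not really enough, so a lot of virtual memory gets used, which ultimately slows the computation.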
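
One knob that may help here is the `max_memory` argument that `from_pretrained` accepts together with `device_map="auto"`. It caps how much each device may hold, so the split between GPU and system RAM is explicit instead of being discovered through an out-of-memory error. This was not tried in the experiment above. The helper below is a hypothetical sketch, and its headroom values are guesses rather than measured numbers.

```python
def build_max_memory(gpu_total_gib, cpu_total_gib, gpu_headroom_gib=1.0, cpu_headroom_gib=4.0):
    """Build a max_memory mapping that leaves headroom on GPU 0 and in system RAM."""
    gpu_limit = max(int(gpu_total_gib - gpu_headroom_gib), 1)
    cpu_limit = max(int(cpu_total_gib - cpu_headroom_gib), 1)
    # Keys follow the convention used with device_map="auto":
    # an integer index for each GPU and the string "cpu" for system RAM.
    return {0: f"{gpu_limit}GiB", "cpu": f"{cpu_limit}GiB"}


if __name__ == "__main__":
    # A 6 GB laptop GPU with 16 GB of RAM, as in the machine described above.
    print(build_max_memory(6, 16))

# Hypothetical usage with the loading call from start.py:
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
#     max_memory=build_max_memory(6, 16),
# )
```

Capping the CPU share below the physical RAM size is meant to keep the operating system from falling back on virtual memory, although layers that fit nowhere would then be offloaded to disk or fail to load.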

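Even this 3B multimodal model runs poorly on a laptop RTX 3060 with 6 GB; the VRAM requirement is simply too high. It looks like at least 16 GB or 24 GB is needed, so an RTX 3090 or 4090 is recommended.

After launching, there is no feedback for a long time. Just wait and don't second-guess it; it is very slow. Over repeated tests, my setup needed 2 minutes 31 seconds to produce a result.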
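
To measure this on your own machine, the generation call can be wrapped in a small timer. The sketch below uses only the standard library; `stopwatch` and `format_duration` are hypothetical helper names, and the commented usage assumes the variables from start.py.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Record the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def format_duration(seconds):
    """Format a number of seconds as 'M min S s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} min {secs} s"


if __name__ == "__main__":
    timings = {}
    with stopwatch("demo", timings):
        time.sleep(0.1)  # stand-in for the slow model call
    print("demo took", format_duration(timings["demo"]))

# Hypothetical usage inside start.py:
# timings = {}
# with stopwatch("generate", timings):
#     generated_ids = model.generate(**inputs, max_new_tokens=128)
# print(format_duration(timings["generate"]))
```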

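Test image:

Analysis result (translated from the model's Chinese output):

['This picture shows a woman interacting with her dog on a beach. The woman is sitting on the sand, wearing a plaid shirt and smiling, and appears to be playing with her dog. The dog is wearing a collar, looks friendly, and is reaching out a paw to shake hands with her. The background is the sea and the sky, and the bright sunshine creates a warm and pleasant atmosphere. The whole scene feels peaceful and harmonious.']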
