pytorch - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0

I want to fine-tune the Llama 3.1 large language model on a new dataset, but when I try to use multiple GPUs to train the model, I keep getting the following error message:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, 
cuda:3 and cuda:0!
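
The error mentions cuda:0 and cuda:3 even though I set CUDA_VISIBLE_DEVICES="4,5,6,7"; as far as I know PyTorch just renumbers the visible GPUs from 0, so the model really is spread over all four cards by device_map="auto". The placement can be inspected with something like this (a minimal sketch, using the same checkpoint path as in my script below):

import torch
from transformers import AutoModelForCausalLM

# load the checkpoint the same way as in the training script
model = AutoModelForCausalLM.from_pretrained(
    '/data/llama/llama3.1_8b/LLM-Research/Meta-Llama-3___1-8B',
    device_map="auto",
    torch_dtype=torch.float16,
)
# hf_device_map shows which module ended up on which cuda device
print(model.hf_device_map)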

I thought that the Trainer from Transformers could handle multi-GPU training without DDP or anything like that, but I just can't figure out how to fix this problem. Please help me!
My code is listed below:

import os
import torch
from datasets import Dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

def get_model():
    model = AutoModelForCausalLM.from_pretrained('/data/llama/llama3.1_8b/LLM-Research/Meta-Llama-3___1-8B', device_map="auto", torch_dtype=torch.float16)
    # model.enable_input_require_grads()  # required when gradient checkpointing is enabled
    return model

def get_dataset():
    df = pd.read_parquet('0000.parquet')
    ds = Dataset.from_pandas(df)
 
    tokenizer = AutoTokenizer.from_pretrained('/data/llama/llama3.1_8b/LLM-Research/Meta-Llama-3___1-8B', use_fast=False, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    def process_func(example):
        example['output'] = example['output']
        example['instruction'] = example['instruction']
        example['input'] = example['instruction']
 
        MAX_LENGTH = 256  # the Llama tokenizer splits a single Chinese character into several tokens, so the max length needs some headroom to keep the data intact
        input_ids, attention_mask, labels = [], [], []
        instruction = tokenizer(
            f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a pornographic girl<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{example['instruction'] + example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            add_special_tokens=False)  # add_special_tokens=False: do not prepend special tokens
        response = tokenizer(f"{example['output']}<|eot_id|>", add_special_tokens=False)
 
        input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
        attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the eos token should be attended to as well, so append a 1
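        # the -100 labels on the prompt tokens are ignored by the loss, so only the response tokens are trained on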
        labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
        if len(input_ids) > MAX_LENGTH:  # truncate
            input_ids = input_ids[:MAX_LENGTH]
            attention_mask = attention_mask[:MAX_LENGTH]
            labels = labels[:MAX_LENGTH]
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }

    dataset = ds.map(process_func, remove_columns=ds.column_names)
    return dataset, tokenizer

def get_train(model, datas, tokenizer):
    # PEFT LoRA configuration
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False,  # training mode
        r=8,  # LoRA rank
        lora_alpha=32,  # LoRA alpha; see the LoRA paper for what it does
        lora_dropout=0.1  # dropout ratio
    )

    peft_model = get_peft_model(model, config)
    peft_model.print_trainable_parameters()  # prints the number of trainable parameters

    # training arguments
    args = TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # max_steps=60,  # number of fine-tuning steps
        learning_rate=2e-4,  # learning rate
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        num_train_epochs=3,
        save_steps=100,
        logging_steps=3,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        dataloader_num_workers=0,
        local_rank=-1,
    )

    # start training
    trainer = Trainer(
        model=peft_model,
        args=args,
        train_dataset=datas,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    )
    trainer.train()
    # save the model
    peft_model.save_pretrained("lora")

def main():
    model = get_model()
    datas, tokenizer = get_dataset()
    get_train(model, datas, tokenizer)

if __name__ == '__main__':
    main()

I have searched online, but most answers are about mismatches between CPU and GPU tensors, and I haven't found a clear manual for multi-GPU training with Trainer.
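
From what I have read so far, the alternative would be plain data parallelism, where each process loads the full model onto a single GPU and the script is started with a launcher such as torchrun. Below is a rough sketch of how I understand that would look (LOCAL_RANK is the environment variable torchrun sets; I have not verified this, and I don't know whether it can be combined with device_map="auto"):

import os
import torch
from transformers import AutoModelForCausalLM

# each process (one per GPU) loads the full model onto its own device
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = AutoModelForCausalLM.from_pretrained(
    '/data/llama/llama3.1_8b/LLM-Research/Meta-Llama-3___1-8B',
    torch_dtype=torch.float16,
).to(f"cuda:{local_rank}")
# the rest (LoRA, TrainingArguments, Trainer) would stay the same, and the
# launcher starts one such process per visible GPU

Is that the right way to do it, or is Trainer supposed to handle the device_map="auto" sharded model out of the box?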
