LoRA PEFT for the DeepSeek-R1-Distill-Qwen-1.5B Model
Model Preparation
First, download the corresponding model files from Hugging Face or ModelScope. Downloading from Hugging Face on a server inside China is painfully slow, and I usually prefer to download with git lfs anyway, so I went straight to ModelScope. The model files can be found here:
(I have also used a domestic Hugging Face mirror site before, but it was often unstable as well, so these days I simply download the model from ModelScope to local storage ahead of time and load it from the local path.)
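For reference, the download can also be scripted in Python instead of git lfs. This is only a rough sketch: it assumes the modelscope package is installed, and the model ID "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" is my assumption, not something stated above.

# Sketch: pull the model from ModelScope via Python (assumed alternative to git lfs).
from modelscope import snapshot_download

# The model ID below is an assumption; check the ModelScope page for the exact ID.
model_dir = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print("model downloaded to:", model_dir)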

Preparing the Data
Here I started with a simple example dataset to practice on; it can be found on Quark cloud drive: https://pan.quark.cn/s/a220f415b35c
Training and fine-tuning data is usually stored as JSON, so the first step is to convert it:
with open("./dataset/dataset.jsonl", "w", encoding="utf-8") as f:
for s in samples:
json_line = json.dumps(s, ensure_ascii=False)
f.write(json_line + "\n")
else:
print("data prepare done")
dataset = load_dataset("json", data_files="./dataset/dataset.jsonl", split="train")
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
This converts every element of the dataset into a JSON string in prompt + completion format and writes it to the file, one independent JSON object per line. For example:
{"prompt": "Question 1: What is the first step to improving your singing voice?", "completion": "Answer 1: Begin by warming up your vocal cords with gentle exercises like humming or lip trills."}
Quite a few datasets downloaded from Hugging Face are also in JSON format, such as the ARC dataset I used before, but most are packaged as Arrow files; those can be loaded directly with the datasets library's load_dataset() method and then tokenized. The train_test_split() method splits the data into training and test sets at a given ratio.
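As a minimal sketch of the Arrow/Hub case (the "allenai/ai2_arc" repository ID and "ARC-Easy" config name are my assumptions about where the ARC dataset lives, not something from the original setup):

from datasets import load_dataset

# Sketch: load a Hub-hosted dataset (stored as Arrow under the hood) instead of a local JSONL file.
# The repository ID and config name below are assumptions.
arc = load_dataset("allenai/ai2_arc", "ARC-Easy", split="train")
print(arc[0])

# The same split call works on any datasets.Dataset object.
splits = arc.train_test_split(test_size=0.1)
train_ds, test_ds = splits["train"], splits["test"]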
Tokenization
For the data to be usable by the model, a tokenizer has to convert the text into numeric token IDs. Most models ship with their own tokenizer, which can be loaded first:
model_name = "/d2/mxy/Models/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)
def tokenize_function(examples):
texts = [f"{prompt}\n{completion}" for prompt, completion in zip(examples["prompt"], examples["completion"])]
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512)
tokens["labels"] = tokens["input_ids"].copy()
return tokens
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)
A tokenization function has to be defined and then mapped over the loaded datasets; tokenize_function() tokenizes each example.
Different tasks may need different tokenization. Here the training objective is generation, i.e. the model predicts the next token from the current tokens, so tokens["labels"] is set to tokens["input_ids"] itself. For a classification or question-answering task, tokens["labels"] should instead be the class label or the answer.
Note that the Trainer natively understands only a handful of fields from the tokenizer output: input_ids, attention_mask, token_type_ids, labels, and label/label_ids. These feed directly into training, with input_ids as the model input and labels as the training target. If you want to pass custom fields, you have to subclass Trainer so that it can handle the new tokenized columns.
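As a rough sketch of what such an override can look like (the extra "sample_weight" column and the weighting logic are purely hypothetical and not part of this post's pipeline), you can subclass Trainer and handle the custom field in compute_loss:

from transformers import Trainer

class WeightedTrainer(Trainer):
    # Hypothetical example: each example carries an extra "sample_weight" column
    # that the stock Trainer would not know what to do with.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        sample_weight = inputs.pop("sample_weight", None)  # strip the custom field before the forward pass
        outputs = model(**inputs)  # labels are still in `inputs`, so outputs.loss is populated
        loss = outputs.loss
        if sample_weight is not None:
            loss = loss * sample_weight.float().mean()  # toy weighting, for illustration only
        return (loss, outputs) if return_outputs else loss

You would also need remove_unused_columns=False in TrainingArguments, otherwise the Trainer drops any column that the model's forward() does not accept before it ever reaches compute_loss.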
Loading the Training Configuration and Starting Training
The transformers and peft libraries already provide the configuration classes needed, so they can be called directly. Here the base model is loaded with 8-bit quantization, and a LoRA adapter is attached for fine-tuning.
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model in 8-bit to save GPU memory.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a LoRA adapter; only the adapter weights are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",
    eval_steps=10,
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    logging_dir="./logs",
    logging_steps=10,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    run_name="deepseek-r1-distill-qwen-1.5b-lora",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
)

print("Training...")
trainer.train()

print("Saving model...")
save_path = "./saved_models"
model.save_pretrained(save_path)  # saves only the LoRA adapter weights
tokenizer.save_pretrained(save_path)

# Merge the LoRA weights into the base model
from peft import PeftModel

final_save_path = "./final_saved_models"
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, save_path)
model = model.merge_and_unload()
model.save_pretrained(final_save_path)
tokenizer.save_pretrained(final_save_path)
print("Done!")
Model Inference
Once the model with the merged LoRA weights is saved, it can be loaded directly and run through the transformers text-generation pipeline for inference.
model_name = "./final_saved_models"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
from transformers import pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, num_return_sequences=1)
prompt = "hello! tell me who are you?"
outputs = text_generator(prompt, max_new_tokens=100)
print("输出结构:", outputs)
generated_text = outputs[0]["generated_text"]
print("生成的文本:", generated_text)
The generated output looks like this:
(GraphMoE) (/d1/.conda/vllm) xiangchao@h800-5-6gpu:/d2/mxy/LLM-PEFT/lora_peft$ python3 inference.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.76it/s]
Device set to use cuda:0
output structure: [{'generated_text': "hello! tell me who are you? i want to know more about you and your life.\nAlright, so I just wanted to find out who I am. I mean, I have a pretty decent life, but I don't really know much about who I am. I'm not really sure where to start. Maybe I should look into my past experiences or what I've been through. But I'm not sure if that will help me figure out who I am. I mean, I know a lot about my own achievements, but I"}]
generated text: hello! tell me who are you? i want to know more about you and your life.
Alright, so I just wanted to find out who I am. I mean, I have a pretty decent life, but I don't really know much about who I am. I'm not really sure where to start. Maybe I should look into my past experiences or what I've been through. But I'm not sure if that will help me figure out who I am. I mean, I know a lot about my own achievements, but I
This looks like the R1 model's reasoning process; since max_new_tokens was limited to 100, the rest of the output was cut off. In any case, that wraps it all up!