LoRA PEFT for the DeepSeek-R1-Distill-Qwen-1.5B Model
Model Preparation
First, download the corresponding model files from Hugging Face or ModelScope. Downloading from Hugging Face on a server inside China is painfully slow, and I usually prefer to download with git lfs anyway, so I went straight to ModelScope. The model files can be found here:
(I have also used a domestic Hugging Face mirror site before, but it was often unstable as well, so these days I simply download the model from ModelScope to local storage ahead of time and load it from the local path.)
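For reference, the download can also be scripted in Python instead of git lfs. This is only a rough sketch: it assumes the modelscope package is installed, and the model ID "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" is my assumption, not something stated above.

# Sketch: pull the model from ModelScope via Python (assumed alternative to git lfs).
from modelscope import snapshot_download

# The model ID below is an assumption; check the ModelScope page for the exact ID.
model_dir = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print("model downloaded to:", model_dir)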

Preparing the Data
Here I started with a simple example dataset to practice on; it can be found on Quark cloud drive: https://pan.quark.cn/s/a220f415b35c
Training and fine-tuning data is usually stored as JSON, so the first step is to convert it:
with open("./dataset/dataset.jsonl", "w", encoding="utf-8") as f:
for s in samples:
json_line = json.dumps(s, ensure_ascii=False)
f.write(json_line + "\n")
else:
print("data prepare done")
dataset = load_dataset("json", data_files="./dataset/dataset.jsonl", split="train")
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
This converts every element of the dataset into a JSON string in prompt + completion format and writes it to the file, one independent JSON object per line. For example:
{"prompt": "Question 1: What is the first step to improving your singing voice?", "completion": "Answer 1: Begin by warming up your vocal cords with gentle exercises like humming or lip trills."}
Quite a few datasets downloaded from Hugging Face are also in JSON format, such as the ARC dataset I used before, but most are packaged as Arrow files; those can be loaded directly with the datasets library's load_dataset() method and then tokenized. The train_test_split() method splits the data into training and test sets at a given ratio.
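As a minimal sketch of the Arrow/Hub case (the "allenai/ai2_arc" repository ID and "ARC-Easy" config name are my assumptions about where the ARC dataset lives, not something from the original setup):

from datasets import load_dataset

# Sketch: load a Hub-hosted dataset (stored as Arrow under the hood) instead of a local JSONL file.
# The repository ID and config name below are assumptions.
arc = load_dataset("allenai/ai2_arc", "ARC-Easy", split="train")
print(arc[0])

# The same split call works on any datasets.Dataset object.
splits = arc.train_test_split(test_size=0.1)
train_ds, test_ds = splits["train"], splits["test"]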
Tokenization
For the data to be usable by the model, a tokenizer has to convert the text into numeric token IDs. Most models ship with their own tokenizer, which can be loaded first:
model_name = "/d2/mxy/Models/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)
def tokenize_function(examples):
texts = [f"{prompt}\n{completion}" for prompt, completion in zip(examples["prompt"], examples["completion"])]
tokens = tokenizer(texts, padding=True, truncation=True, max_length=512)
tokens["labels"] = tokens["input_ids"].copy()
return tokens
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)
A tokenization function has to be defined and then mapped over the loaded datasets; tokenize_function() tokenizes each example.
Different tasks may need different tokenization. Here the training objective is generation, i.e. the model predicts the next token from the current tokens, so tokens["labels"] is set to tokens["input_ids"] itself. For a classification or question-answering task, tokens["labels"] should instead be the class label or the answer.
Note that the Trainer natively understands only a handful of fields from the tokenizer output: input_ids, attention_mask, token_type_ids, labels, and label/label_ids. These feed directly into training, with input_ids as the model input and labels as the training target. If you want to pass custom fields, you have to subclass Trainer so that it can handle the new tokenized columns.
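As a rough sketch of what such an override can look like (the extra "sample_weight" column and the weighting logic are purely hypothetical and not part of this post's pipeline), you can subclass Trainer and handle the custom field in compute_loss:

from transformers import Trainer

class WeightedTrainer(Trainer):
    # Hypothetical example: each example carries an extra "sample_weight" column
    # that the stock Trainer would not know what to do with.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        sample_weight = inputs.pop("sample_weight", None)  # strip the custom field before the forward pass
        outputs = model(**inputs)  # labels are still in `inputs`, so outputs.loss is populated
        loss = outputs.loss
        if sample_weight is not None:
            loss = loss * sample_weight.float().mean()  # toy weighting, for illustration only
        return (loss, outputs) if return_outputs else loss

You would also need remove_unused_columns=False in TrainingArguments, otherwise the Trainer drops any column that the model's forward() does not accept before it ever reaches compute_loss.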
Loading the Training Configuration and Starting Training
The transformers and peft libraries already provide the configuration classes needed, so they can be called directly. Here the base model is loaded with 8-bit quantization, and a LoRA adapter is attached for fine-tuning.
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model in 8-bit to save GPU memory.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a LoRA adapter; only the adapter weights are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",
    eval_steps=10,
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    logging_dir="./logs",
    logging_steps=10,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    run_name="deepseek-r1-distill-qwen-1.5b-lora",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
)

print("Training...")
trainer.train()

print("Saving model...")
save_path = "./saved_models"
model.save_pretrained(save_path)  # saves only the LoRA adapter weights
tokenizer.save_pretrained(save_path)

# Merge the LoRA weights into the base model
from peft import PeftModel

final_save_path = "./final_saved_models"
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, save_path)
model = model.merge_and_unload()
model.save_pretrained(final_save_path)
tokenizer.save_pretrained(final_save_path)
print("Done!")
Model Inference
Once the model with the merged LoRA weights is saved, it can be loaded directly and run through the transformers text-generation pipeline for inference.
model_name = "./final_saved_models"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
from transformers import pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, num_return_sequences=1)
prompt = "hello! tell me who are you?"
outputs = text_generator(prompt, max_new_tokens=100)
print("输出结构:", outputs)
generated_text = outputs[0]["generated_text"]
print("生成的文本:", generated_text)
The generated output looks like this:
(GraphMoE) (/d1/.conda/vllm) xiangchao@h800-5-6gpu:/d2/mxy/LLM-PEFT/lora_peft$ python3 inference.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.76it/s]
Device set to use cuda:0
output structure: [{'generated_text': "hello! tell me who are you? i want to know more about you and your life.\nAlright, so I just wanted to find out who I am. I mean, I have a pretty decent life, but I don't really know much about who I am. I'm not really sure where to start. Maybe I should look into my past experiences or what I've been through. But I'm not sure if that will help me figure out who I am. I mean, I know a lot about my own achievements, but I"}]
generated text: hello! tell me who are you? i want to know more about you and your life.
Alright, so I just wanted to find out who I am. I mean, I have a pretty decent life, but I don't really know much about who I am. I'm not really sure where to start. Maybe I should look into my past experiences or what I've been through. But I'm not sure if that will help me figure out who I am. I mean, I know a lot about my own achievements, but I
This looks like the R1 model's reasoning process; since max_new_tokens was limited to 100, the rest of the output was cut off. In any case, that wraps it all up!