A step-by-step guide to recreating Galenos using Phi-3-mini, QLoRA, and TRL.
Choose the foundation for your fine-tuning pipeline:
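The guide pins down the model family but not the exact checkpoint; a minimal sketch, assuming the instruct-tuned Phi-3-mini on the Hugging Face Hub (swap in the 128k-context variant if you need longer inputs):

```python
# Base model for the whole pipeline; referenced as model_id in later steps.
model_id = "microsoft/Phi-3-mini-4k-instruct"
```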
Create a JSONL dataset with chat-formatted medical dialogues:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are Galenos, a careful medical assistant..."
    },
    {
      "role": "user",
      "content": "Patient has fever and cough for 3 days..."
    },
    {
      "role": "assistant",
      "content": "**Summary:** ... **Possible Causes:** ..."
    }
  ]
}
```

⚠️ Avoid prompt leakage: keep system prompts consistent and don't include hidden labels.
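To make the format concrete, here is a minimal sketch that writes one such record to a JSONL file and validates the turn order; the file name and `validate` helper are illustrative, not from the guide:

```python
import json

RECORD = {
    "messages": [
        {"role": "system", "content": "You are Galenos, a careful medical assistant..."},
        {"role": "user", "content": "Patient has fever and cough for 3 days..."},
        {"role": "assistant", "content": "**Summary:** ... **Possible Causes:** ..."},
    ]
}

def validate(record):
    """Check that a record follows the system -> user -> assistant turn order."""
    roles = [m["role"] for m in record["messages"]]
    return roles == ["system", "user", "assistant"]

# JSONL means exactly one JSON object per line.
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(RECORD, ensure_ascii=False) + "\n")

assert validate(RECORD)
```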
Use the model's native chat template to avoid format mismatches:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
# tokenize=False returns the formatted string rather than token ids,
# which is useful for inspecting what the model will actually see.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
```
Load the model in 4-bit quantization to reduce memory usage:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Configure low-rank adapters for efficient fine-tuning:
```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Phi-3 fuses q/k/v into a single qkv_proj, so target that rather than
    # the separate q_proj/k_proj/v_proj names used by many other architectures.
    target_modules=["qkv_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

Use TRL's SFTTrainer with stability-focused hyperparameters:
```python
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=config,
    max_seq_length=1024,  # TRL < 0.13 API; newer versions take this via SFTConfig
    args=TrainingArguments(
        output_dir="galenos-sft",
        learning_rate=2e-4,
        gradient_accumulation_steps=4,
        max_grad_norm=1.0,
    ),
)
trainer.train()
```

Test format adherence, safety, and quality on a curated eval set.
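A lightweight starting point is a format-adherence check over the eval set. Everything below is illustrative: the `generate` stub stands in for real model inference, and the required section headers mirror the dataset example above.

```python
# Section headers every Galenos reply is expected to contain.
REQUIRED_SECTIONS = ["**Summary:**", "**Possible Causes:**"]

def generate(prompt: str) -> str:
    """Stub standing in for tokenizer + model.generate inference."""
    return "**Summary:** likely viral illness. **Possible Causes:** influenza, COVID-19."

def format_adherence(prompts):
    """Fraction of replies containing every required section header."""
    hits = 0
    for p in prompts:
        reply = generate(p)
        if all(section in reply for section in REQUIRED_SECTIONS):
            hits += 1
    return hits / len(prompts)

score = format_adherence(["Patient has fever and cough for 3 days..."])
```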
Export LoRA adapters or merge with base model for deployment.
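For the merge path, a sketch using PEFT's `merge_and_unload`; the adapter and output paths are placeholders, and the imports sit inside the function so the file loads without the heavyweight dependencies:

```python
def merge_adapters(base_id: str, adapter_dir: str, out_dir: str):
    """Merge trained LoRA adapters into the base weights and save a standalone model."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    # Ship the tokenizer alongside the merged weights for deployment.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)

# Example (placeholder paths):
# merge_adapters("microsoft/Phi-3-mini-4k-instruct", "galenos-adapters", "galenos-merged")
```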
Create a Gradio interface or API endpoint for inference.
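A minimal Gradio chat sketch, under stated assumptions: `respond` is a stub to be replaced with real inference, and the `gradio` import is kept inside the builder so the module loads even where the UI dependency isn't installed:

```python
def respond(message: str, history):
    """Stub: replace with tokenizer.apply_chat_template + model.generate."""
    return "**Summary:** ... **Possible Causes:** ..."

def build_demo():
    import gradio as gr  # assumed dependency for the UI step

    return gr.ChatInterface(respond, title="Galenos")

# build_demo().launch() would serve the chat UI locally.
```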