# Fine-Tuning Process

- _Model alignment_ - fine-tuning the base model so that it behaves as the user intends.
- _HHH_ - helpful, honest, harmless fine-tuning.
- _Supervised fine-tuning (SFT)_ model - fine-tuned on top of the base model using sample conversations between a person and an HHH assistant.
- The SFT model generates completions and human experts score/rank them (all completions come from the model itself, so they stay within its internal knowledge).
- _Reward model_ - tuned with [[reinforcement-learning]] techniques: take the SFT model and train it on the rankings of SFT-generated completions so that it returns a numerical value representing the reward. Only the ranking is learned, hence the LLM learns to stay consistent with its own internal knowledge. (sketch below)
- _RLHF model_ - starting from the SFT model, it generates completions that are evaluated by the reward model. Proximal policy optimization (PPO) is used, with a penalty that keeps the policy close to the SFT model, so the RLHF model can't produce text significantly different from the SFT model ("cheating") just to get a high reward score.
- Annotated `ChatML` trains chat models (instead of instruct models): turns are delimited by `<|im_start|>` and `<|im_end|>`, with the roles `system`, `user`, `assistant`, and `function`. (sketch below)

## Techniques

- _Full fine-tuning_ (_continued pre-training_) - simply continues the training with new documents; all parameters are updated, computationally intensive.
- _LoRA_ (_low-rank adaptation_) - trains a low-rank "diff" on key parameter matrices. Suitable for teaching the model a new distribution, how to interpret the prompt, what completions are expected, etc. (sketch below)
- With continued pre-training and LoRA, most of the static part of the prompt can be eliminated, including the few-shot examples.
- _Soft prompting_ - combining [[prompt|prompting]] and [[ml|machine learning]] to find the "state of mind" that is most likely to produce the desired outcome -- also considered a type of fine-tuning. (sketch below)
- _PEFT_ (_parameter-efficient fine-tuning_) - umbrella term for techniques that train only a small set of extra parameters, such as LoRA and soft prompting; also the name of the Hugging Face library implementing them.

After fine-tuning, make sure new prompts follow the "new way", otherwise the model will tend to "forget" the fine-tuning.
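A minimal PyTorch sketch of the pairwise ranking objective commonly used to train the reward model from human rankings. `RewardHead`, `hidden_size`, and the random tensors standing in for SFT hidden states are illustrative assumptions, not anything prescribed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Hypothetical scalar reward head sitting on top of the SFT transformer."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Summarize the sequence with the hidden state of its final token.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)


def ranking_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Only the *ordering* matters: push the preferred completion's reward above
    # the rejected one's; absolute reward values are never supervised directly.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy usage: random tensors stand in for SFT hidden states of two completions
# of the same prompt, one ranked higher by a human labeler than the other.
head = RewardHead(hidden_size=768)
r_chosen = head(torch.randn(4, 16, 768))
r_rejected = head(torch.randn(4, 16, 768))
ranking_loss(r_chosen, r_rejected).backward()
```

During the RLHF step this scalar output becomes the reward signal for PPO, usually combined with the penalty mentioned above that keeps the policy close to the SFT model.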
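For reference, a conversation serialized in the `ChatML` format mentioned above looks roughly like this (the content itself is made up):

```
<|im_start|>system
You are a helpful, honest and harmless assistant.<|im_end|>
<|im_start|>user
Summarize what LoRA does.<|im_end|>
<|im_start|>assistant
LoRA trains a small low-rank update next to the frozen base weights ...<|im_end|>
```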
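A from-scratch sketch of the LoRA idea: the pre-trained weight matrix is frozen and a low-rank "diff" `B @ A` is trained next to it. The layer sizes, rank `r`, and scaling `alpha` are arbitrary example values; in practice a library such as Hugging Face `peft` wires this into the attention projections for you.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay untouched
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# Only lora_A and lora_B receive gradients -- the "diff" on the key parameter matrix.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 10, 768))
```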
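Soft prompting boils down to something like the sketch below: a handful of learnable "virtual token" embeddings are prepended to the real token embeddings and trained while the LLM itself stays frozen. The number of virtual tokens and the embedding size are made-up values.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Trainable virtual-token embeddings prepended to the frozen model's input embeddings."""

    def __init__(self, n_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)


# The learned prefix plays the role of the static part of a hand-written prompt.
soft = SoftPrompt()
token_embeds = torch.randn(2, 12, 768)   # stand-in for the frozen LLM's token embeddings
extended = soft(token_embeds)            # shape: (2, 20 + 12, 768)
```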