Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math (2:15:13)
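For quick reference, the two formulas this title points at, written in the standard notation of the DPO paper (the video's own notation may differ slightly): the Bradley-Terry model turns a reward difference into a preference probability, and DPO rewrites that reward in terms of policy log probabilities.

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here $y_w$ is the preferred completion, $y_l$ the dispreferred one, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ controls how far the fine-tuned policy may drift from the reference.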
Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code (1:19:37)
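The optimization problem RLHF solves, in its standard KL-regularized form (given here as a reference point, not necessarily the exact notation used in the video):

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$

where $r_\phi$ is the learned reward model and the KL term keeps the fine-tuned policy $\pi_\theta$ close to the reference (SFT) policy $\pi_{\mathrm{ref}}$.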
Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (1:10:55)
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU (58:07)
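Of the components listed in this title, RMS Norm is the most compact to state; as used in LLaMA it rescales each activation by the root mean square of the vector, with a learned gain $g_i$ ($\epsilon$ is a small constant added for numerical stability in implementations):

$$\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d} \sum_{j=1}^{d} x_j^2 + \epsilon}}\; g_i$$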
Aligning LLMs with Direct Preference Optimization (50:36)
NLP Lecture 1: Regular Expressions (21:15)
Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning (54:52)
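A minimal sketch of what "fine-tune LLMs directly without reinforcement learning" amounts to in code, assuming per-sequence log probabilities under the policy and the frozen reference model have already been computed; the function and variable names are illustrative, not taken from the video:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from per-sequence log probabilities (one value per example in the batch)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for the chosen and rejected completions.
    chosen_logratio = pi_logp_chosen - ref_logp_chosen
    rejected_logratio = pi_logp_rejected - ref_logp_rejected
    # Bradley-Terry preference likelihood turned into a logistic loss.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Example with dummy log probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```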
BERT explained: Training, Inference, BERT vs GPT/LLaMA, Fine-tuning, [CLS] token (1:09:00)