Mixtral of Experts (Paper Explained) - 1:02:17
RWKV: Reinventing RNNs for the Transformer Era (Paper Explained) - 37:17
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention - 50:03
V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained) - 48:53
Safety Alignment Should be Made More Than Just a Few Tokens Deep (Paper Explained) - 22:43
How might LLMs store facts | DL7 - 12:33
Mistral 8x7B Part 1- So What is a Mixture of Experts Model? - 36:15
Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained) - 1:04:32