LLM inference optimization: Architecture, KV cache and Flash attention (39:42)
Mixture of Experts: Mixtral 8x7B (36:12)
Deep Dive: Optimizing LLM inference (32:07)
Fast LLM Serving with vLLM and PagedAttention (48:25)
Parameter-efficient Fine-tuning of LLMs with LoRA (1:16:12)
Fine-Tuning Large Language Models (LLMs) (49:53)
How a Transformer works at inference vs training time (57:45)
Visualizing transformers and attention | Talk for TNG Big Tech Day '24 (11:54)