TIL

11 things learned

May 2026

May 25 Claude Code

Typing ultrathink anywhere in a Claude Code prompt requests deeper reasoning for that single turn without changing your session effort level. Unlike older builds, ‘think’, ‘think hard’, and ‘think more’ are no longer special keywords; they’re passed through as ordinary prompt text.

May 25 Claude Code

/model opusplan runs Opus during plan mode then auto-switches to Sonnet for execution (Opus reasoning, Sonnet efficiency). Subtlety: the automatic 1M-token context upgrade applies only to plain opus; opusplan’s plan phase stays on the standard 200K window.

May 25 LLMs

Speculative decoding uses a small ‘draft’ model to propose several tokens that the large model verifies in one parallel forward pass. With the standard rejection-sampling acceptance rule it’s mathematically lossless: the output distribution is identical to running the big model alone. The speedup comes only from accepted tokens, so it depends on how often the draft model agrees with the large one.
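
A minimal sketch of the verification rule for a single drafted token, assuming the usual rejection-sampling scheme: accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The distributions `p` and `q` and all function names are made up for illustration. `output_distribution` computes the emitted-token distribution exactly, so you can check that it matches the target model's.

```python
import random

def residual(p, q):
    """Distribution to resample from after a rejection: normalize max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q everywhere nothing is ever rejected; fall back to p.
    return [d / total for d in diff] if total > 0 else list(p)

def verify_token(p, q, token, rng):
    """Accept the drafted token with probability min(1, p/q), else resample."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    weights = residual(p, q)
    return rng.choices(range(len(p)), weights=weights)[0], False

def output_distribution(p, q):
    """Exact distribution of the emitted token when the draft samples from q."""
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    r = residual(p, q)
    return [a + reject_mass * ri for a, ri in zip(accept, r)]

p = [0.6, 0.3, 0.1]  # toy target-model distribution (assumed)
q = [0.3, 0.5, 0.2]  # toy draft-model distribution (assumed)
out = output_distribution(p, q)
print(out)  # matches p up to floating-point rounding
```

The closer `q` is to `p`, the more drafts are accepted. A poor draft model costs speed, not correctness.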

May 25 LLMs

Even at temperature=0, LLM outputs aren’t bitwise-deterministic on GPU. Floating-point addition isn’t associative, so parallel reductions and changing batch sizes can flip the argmax whenever two top logits are near-tied.
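
A tiny CPU-only illustration of the same effect, using plain Python floats rather than a real GPU reduction. The ‘rival’ logit is a hypothetical near-tied competitor chosen to show the flip.

```python
# Same three numbers, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

def argmax(values):
    # Ties resolve to the lowest index.
    return max(range(len(values)), key=lambda i: values[i])

rival_logit = 0.6  # hypothetical competing token, nearly tied
pick_a = argmax([rival_logit, left_to_right])
pick_b = argmax([rival_logit, right_to_left])
print(pick_a, pick_b)  # 1 0
```

A one-ulp difference in the sum is enough to change which token greedy decoding picks.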

May 25 LLMs

Transformers dump a surprising amount of attention onto the very first token (an ‘attention sink’ that acts as a no-op dumping ground). StreamingLLM exploits this by always keeping the first few tokens in the KV cache, which stabilizes generation for effectively infinite-length streaming.
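
A toy sketch of that eviction policy, tracking token positions instead of real key/value tensors. The class name and parameters are hypothetical, and it leaves out everything else a real implementation needs (the tensors themselves, positional handling).

```python
from collections import deque

class SinkKVCache:
    """Keep the first n_sink entries forever, plus a sliding window of recent ones."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []
        self.recent = deque(maxlen=window)

    def append(self, entry):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(entry)
        else:
            # A full deque silently drops its oldest entry.
            self.recent.append(entry)

    def contents(self):
        return self.sinks + list(self.recent)

cache = SinkKVCache(n_sink=2, window=3)
for pos in range(10):
    cache.append(pos)
kept = cache.contents()
print(kept)  # [0, 1, 7, 8, 9]
```

Cache size stays bounded at n_sink + window no matter how long the stream runs.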

May 25 MoE

In a Mixture-of-Experts model a router fires only a few experts per token, so total vs. active parameter counts diverge hugely: a model can advertise hundreds of billions of params while only a small fraction actually compute on any given token. That’s why MoE models punch above their inference cost.
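
A rough sketch of top-k routing and the total vs. active arithmetic. The parameter counts are invented for illustration, and renormalizing a softmax over only the selected experts is one common choice, not the only one.

```python
import math

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def param_counts(n_experts, k, expert_params, shared_params):
    """Total parameters stored vs. parameters that compute for one token."""
    total = shared_params + n_experts * expert_params
    active = shared_params + k * expert_params
    return total, active

routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)  # experts 1 and 3 fire
total, active = param_counts(n_experts=64, k=2,
                             expert_params=3_000_000_000,
                             shared_params=8_000_000_000)
print(routes)
print(total, active, f"{active / total:.0%}")  # 200000000000 14000000000 7%
```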

May 25 LLMs

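KV cache size scales linearly with sequence length under both full multi-head attention and GQA (grouped-query attention); it’s attention compute, not the cache, that scales quadratically. GQA shrinks the cache by a constant factor (the ratio of query heads to KV heads) by sharing each key/value head across a group of query heads, which is why Llama 3 uses GQA.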
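
A back-of-the-envelope sketch of the cache size: bytes = 2 (keys and values) × layers × KV heads × head dim × sequence length × bytes per element. The shape below (32 layers, 32 query heads, 8 KV heads, head dim 128, fp16) is an illustrative 8B-class configuration, not a figure from any particular model card.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 2 ** 30
mha = kv_cache_bytes(8192, n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(8192, n_layers=32, n_kv_heads=8, head_dim=128)
print(mha / GIB, gqa / GIB)  # 4.0 1.0

# Doubling the sequence length doubles the cache in both cases.
print(kv_cache_bytes(16384, 32, 8, 128) / gqa)  # 2.0
```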

May 24 Python

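dict.get(key, default) does one hash lookup where checking ‘key in dict’ first and then indexing does two. Whether that makes it faster in practice depends on the interpreter: in CPython the method-call overhead of .get() can cancel out the saved lookup, so measure before relying on it.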
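
A quick way to check this on your own interpreter, timing both patterns for a hit and a miss. The dict contents and helper names are made up, and absolute timings will vary by machine and Python version.

```python
import timeit

d = {f"k{i}": i for i in range(1000)}

def with_get(key):
    # One lookup, via a method call.
    return d.get(key, -1)

def with_in(key):
    # Membership test plus subscript: two lookups on a hit.
    return d[key] if key in d else -1

for fn in (with_get, with_in):
    for key in ("k500", "missing"):
        t = timeit.timeit(lambda: fn(key), number=100_000)
        print(f"{fn.__name__:8s} {key:8s} {t:.4f}s")
```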

May 22 CUDA

torch.compile() with mode="reduce-overhead" uses CUDA graphs to replay kernels without Python overhead. Biggest win on small models with repeated forward passes.

May 20 RAG

Hybrid search (BM25 + dense retrieval) consistently outperforms either alone in production. The BM25 leg catches exact keyword matches that embeddings miss.
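
One common way to merge the two result lists is reciprocal rank fusion (RRF). The entry above doesn’t say which fusion method was used, so treat this as an assumed stand-in. The document ids are invented, and k=60 is just the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_hits = ["doc_err_E1234", "doc_a", "doc_b"]   # exact keyword match ranks first
dense_hits = ["doc_a", "doc_c", "doc_err_E1234"]  # embedding search ranks it last
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['doc_a', 'doc_err_E1234', 'doc_c', 'doc_b']
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.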

May 18 ML

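Gradient checkpointing trades ~30% more compute for a ~60-70% reduction in activation memory by recomputing activations during the backward pass instead of storing them. Usually worth it when activation memory, rather than compute, is what caps your batch size.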
