TIL
11 things learnedTyping ultrathink anywhere in a Claude Code prompt requests deeper reasoning for that single turn without changing your session effort level. Unlike older builds, ‘think’, ‘think hard’, and ‘think more’ are no longer special keywords; they’re passed through as ordinary prompt text.
/model opusplan runs Opus during plan mode then auto-switches to Sonnet for execution (Opus reasoning, Sonnet efficiency). Subtlety: the automatic 1M-token context upgrade applies only to plain opus; opusplan’s plan phase stays on the standard 200K window.
Speculative decoding uses a small ‘draft’ model to propose several tokens that the large model verifies in one parallel forward pass. It’s mathematically lossless: the output distribution is identical to running the big model alone, and you just get the speedup for free on accepted tokens.
Even at temperature=0, LLM outputs aren’t bitwise-deterministic on GPU. Floating-point addition isn’t associative, so parallel reductions and changing batch sizes can flip the argmax whenever two top logits are near-tied.
Transformers dump a surprising amount of attention onto the very first token (an ‘attention sink’ that acts as a no-op dumping ground). StreamingLLM exploits this by always keeping the first few tokens in the KV cache, which stabilizes generation for effectively infinite-length streaming.
In a Mixture-of-Experts model a router fires only a few experts per token, so total vs. active parameter counts diverge hugely: a model can advertise hundreds of billions of params while only a small fraction actually compute on any given token. That’s why MoE models punch above their inference cost.
KV cache size scales quadratically with sequence length under full attention, but only linearly with GQA (grouped-query attention), which is why Llama 3 uses GQA by default.
dict.get(key, default) is faster than checking ‘key in dict’ first and then accessing it, because it avoids the double hash lookup.
torch.compile() with mode=’reduce-overhead’ uses CUDA graphs to replay kernels without Python overhead. Biggest win on small models with repeated forward passes.
Hybrid search (BM25 + dense retrieval) consistently outperforms either alone in production. The BM25 leg catches exact keyword matches that embeddings miss.
Gradient checkpointing trades ~30% more compute for a ~60-70% reduction in activation memory. Usually worth it when batch size is the bottleneck.
No entries match.
