PR #1602
[Non-record] Experimentation Summary: Autopsy of 100+ Experiments — What Worked, What Didn’t, Mind Map for LLM Agents, etc.
Status: open
by SPTholeView on GitHub
val_bpb
1.0744
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Architecture
depth recurrence
Recurs a 3-layer encoder block (2 extra passes) to create more virtual layers from fewer physical layers.
parameters: {"layers":3,"extra_passes":2}
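A minimal sketch of the depth-recurrence idea, assuming the listed parameters mean a 3-layer physical stack reapplied for 2 extra passes (the exact wiring in the PR may differ; the toy `layers` here are stand-ins, not the real blocks):

```python
import numpy as np

def recurrent_encoder(x, layers, extra_passes=2):
    """Apply a small stack of physical layers multiple times to
    emulate a deeper network ("virtual layers")."""
    for _ in range(1 + extra_passes):
        for layer in layers:
            x = layer(x)
    return x

# Toy "layers": affine maps with a nonlinearity (stand-ins for transformer blocks).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
layers = [lambda x, W=W: np.tanh(x @ W) for W in Ws]
out = recurrent_encoder(rng.standard_normal((4, 8)), layers)
# 3 physical layers x 3 passes = 9 virtual layer applications.
```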
BigramHash
Bigram-based hash embeddings / co-occurrence-driven attention initialization and related bigram features.
parameters: null
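One plausible reading of "bigram-based hash embeddings", sketched with a hypothetical `bigram_hash_embed` helper (bucket count and table size are illustrative, not from the PR): each (prev, cur) token pair is hashed into a fixed table and the looked-up vector is added as an extra context feature.

```python
import numpy as np

def bigram_hash_embed(tokens, table, num_buckets):
    """Look up an extra embedding keyed by a hash of each (prev, cur) token pair."""
    out = np.zeros((len(tokens), table.shape[1]))
    for i in range(1, len(tokens)):  # position 0 has no preceding token
        h = hash((tokens[i - 1], tokens[i])) % num_buckets
        out[i] = table[h]
    return out

table = np.random.default_rng(5).standard_normal((1024, 16))
emb = bigram_hash_embed([3, 7, 7, 2], table, 1024)
```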
TrigramHash
Trigram hash table / trigram embedding variants used for token context features.
parameters: null
XSA
Cross-sequence attention added to improve modeling capacity and quantization behavior.
parameters: null
Partial RoPE
Uses RoPE on only part of the head dimension to free capacity for semantic matching.
parameters: {"rope_dims":16,"head_dims":64}
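A sketch of partial RoPE under the listed parameters (rotate only 16 of 64 head dims; the split-halves rotation style and base frequency here are common conventions, not confirmed from the PR). The remaining 48 channels carry no positional signal and stay free for content matching:

```python
import numpy as np

def partial_rope(q, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` channels of each head;
    leave the rest position-free for semantic matching."""
    T = q.shape[0]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=-1)

q = np.random.default_rng(1).standard_normal((5, 64))
out = partial_rope(q)
```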
KV head count
Adjusted number of KV heads / multi-head attention width.
parameters: {"kv_heads":8}
Value Residual
Added value residual / cascading value highway to improve deep-layer information flow.
parameters: null
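The value-residual idea can be sketched as a simple blend (the mixing weight `lam` and the fixed-to-layer-1 choice are assumptions; variants make the weight learned or per-layer):

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix the current layer's value vectors with the first layer's values,
    giving deep layers a direct path to early token information."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(7)
v1 = rng.standard_normal((4, 8))   # values computed at the first layer
v5 = rng.standard_normal((4, 8))   # values at a deeper layer
mixed = value_residual(v5, v1)
```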
weight tying
Shared embeddings / tied weights used in some variants and ablations.
parameters: null
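Weight tying in its standard form, as a minimal sketch (vocab and width here are toy values): the unembedding projection reuses the input embedding matrix transposed, which halves embedding parameters in the artifact.

```python
import numpy as np

# One shared matrix serves both the input embedding and the output projection.
vocab, d = 100, 16
E = np.random.default_rng(6).standard_normal((vocab, d))

def embed(ids):
    return E[ids]          # (n, d)

def logits(h):
    return h @ E.T         # same matrix, transposed: (n, vocab)

h = embed([1, 2, 3])
out = logits(h)
```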
Quantization
mixed int5/int6
bits: null
scope: MLP proj and attention
GPTQ
bits: null
scope: embeddings
QAT
bits: null
scope: all
STE QAT
bits: null
scope: all
int6
bits: 6
scope: MLP
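The QAT entries above all rely on some form of fake quantization; a minimal symmetric per-tensor sketch for the int6 case follows (per-layer clipping, scale search, and the actual STE backward pass from the experiments are not shown; in training, gradients would flow through `fake_quant` as if it were the identity):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric fake quantization: round to integer levels, then dequantize.
    The straight-through estimator treats this op as identity in backward."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant(w, bits=6)
```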
Compression
zstd
level: null
lzma
level: null
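Artifact compression amounts to serializing the quantized weights and running a general-purpose compressor over the bytes. A sketch with stdlib LZMA (zstd is analogous via the third-party `zstandard` package; the tensor size and int6 value range here are illustrative):

```python
import lzma

import numpy as np

# Quantized weights in the int6 range, stored one value per byte.
rng = np.random.default_rng(2)
weights = rng.integers(-32, 32, size=100_000, dtype=np.int8)
raw = weights.tobytes()

packed = lzma.compress(raw, preset=9)
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)
```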
Weight Averaging
SWA
parameters: {"every":100}
EMA
parameters: null
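A sketch of SWA with the listed `every: 100` cadence (snapshot selection and the final BN/eval handling from the actual runs are omitted): keep a running mean of weight snapshots taken every 100 steps.

```python
import numpy as np

class SWA:
    """Running average of weight snapshots taken every `every` steps."""
    def __init__(self, every=100):
        self.every, self.n, self.avg = every, 0, None

    def update(self, step, w):
        if step % self.every != 0:
            return
        self.n += 1
        # Incremental mean over all snapshots seen so far.
        self.avg = w.copy() if self.avg is None else self.avg + (w - self.avg) / self.n

swa = SWA(every=100)
for step in range(0, 301, 100):
    swa.update(step, np.full(3, float(step)))
# Mean of snapshots at steps 0, 100, 200, 300.
```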
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"parallel":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":400,"start_step":900}
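Given the listed parameters, the warmdown schedule can be read as: hold the base LR until `start_step`, then decay linearly to zero over `warmdown_steps`. A sketch (linear shape assumed; the PR may use a different decay curve):

```python
def warmdown_lr(step, base_lr=1.0, start_step=900, warmdown_steps=400):
    """Constant LR until start_step, then linear decay to zero."""
    if step < start_step:
        return base_lr
    frac = min((step - start_step) / warmdown_steps, 1.0)
    return base_lr * (1.0 - frac)
```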
Regularization
label smoothing
parameters: null
magnitude pruning
parameters: {"sparsity":0.05}
layerwise LN scale
parameters: null
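Magnitude pruning at the listed 5% sparsity is a one-liner in spirit: zero the smallest-magnitude fraction of weights. A global (per-tensor) sketch; the experiments may prune per-layer instead:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.05):
    """Zero the smallest-magnitude `sparsity` fraction of weights."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    # k-th smallest absolute value is the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

w = np.random.default_rng(3).standard_normal(1000)
pruned = magnitude_prune(w, 0.05)
```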
Initialization
OrthoInit
Orthogonal constraint / initialization behavior referenced in Muon-related experiments.
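Standard orthogonal initialization, as one concrete reading of OrthoInit (the sign-corrected QR construction is the usual recipe; how the constraint interacts with Muon in the experiments is not shown here):

```python
import numpy as np

def ortho_init(shape, gain=1.0, seed=0):
    """Orthogonal initialization via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal(shape))
    # Sign correction makes the result uniformly distributed over orthogonal matrices.
    q *= np.sign(np.diag(r))
    return gain * q

W = ortho_init((64, 64))
```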
Test-Time Training
LoRA TTT
parameters: {"rank":8}
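LoRA-style test-time training with the listed rank 8, sketched in its usual form (dimensions and the zero-init convention for `B` are standard LoRA choices, not confirmed details of the PR): the base weight stays frozen while a low-rank adapter `B @ A` is updated at test time.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen base weight W plus a low-rank adapter B @ A (trained at test time)."""
    return x @ (W + alpha * (B @ A)).T

d, r = 32, 8
rng = np.random.default_rng(4)
W = rng.standard_normal((d, d))          # frozen
A = rng.standard_normal((r, d)) * 0.01   # adapter factor, updated at test time
B = np.zeros((d, r))                     # zero-init: adapter starts as a no-op
x = rng.standard_normal((2, d))
y = lora_forward(x, W, A, B)
```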
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Novel Contributions
- Large-scale autopsy of 100+ experiments with a structured mind map and experiment lineage analysis
- Demonstration that depth recurrence, partial RoPE, and parallel residual-style routing materially improve parameter efficiency
- Identification that warmdown scheduling was a major hidden issue and that fixing it produced a large bpb gain
- Evidence that auxiliary losses generally hurt in the compute-starved regime
- Discovery that simpler architectures and fewer competing objectives often outperform more complex variants
- Integration and ablation of meta-TTT, showing its gains are architecture-limited
- Community-derived improvements such as SP8192 tokenizer, improved parallel residuals, and QK gain initialization
- Quantization-focused improvements including GPTQ, mixed int5/int6, per-layer clipping, and artifact compression strategies