PR #1602
[Non-record] Experimentation Summary: Autopsy of 100+ Experiments — What Worked, What Didn’t, Mind Map for LLM Agents, etc.
Status: open
by SPTholeView on GitHub
val_bpb
1.0744
Architecture
Transformer
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Architecture
depth recurrence
Recurs a 3-layer encoder block (2 extra passes) to create more virtual layers from fewer physical layers.
parameters: {"layers":3,"extra_passes":2}
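A minimal sketch of the depth-recurrence idea, assuming the listed parameters mean a 3-layer physical stack reapplied for 2 extra passes (the exact wiring in the PR may differ; the toy `layers` here are stand-ins, not the real blocks):

```python
import numpy as np

def recurrent_encoder(x, layers, extra_passes=2):
    """Apply a small stack of physical layers multiple times to
    emulate a deeper network ("virtual layers")."""
    for _ in range(1 + extra_passes):
        for layer in layers:
            x = layer(x)
    return x

# Toy "layers": affine maps with a nonlinearity (stand-ins for transformer blocks).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
layers = [lambda x, W=W: np.tanh(x @ W) for W in Ws]
out = recurrent_encoder(rng.standard_normal((4, 8)), layers)
# 3 physical layers x 3 passes = 9 virtual layer applications.
```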
BigramHash
Bigram-based hash embeddings / co-occurrence-driven attention initialization and related bigram features.
parameters: null
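One plausible reading of "bigram-based hash embeddings", sketched with a hypothetical `bigram_hash_embed` helper (bucket count and table size are illustrative, not from the PR): each (prev, cur) token pair is hashed into a fixed table and the looked-up vector is added as an extra context feature.

```python
import numpy as np

def bigram_hash_embed(tokens, table, num_buckets):
    """Look up an extra embedding keyed by a hash of each (prev, cur) token pair."""
    out = np.zeros((len(tokens), table.shape[1]))
    for i in range(1, len(tokens)):  # position 0 has no preceding token
        h = hash((tokens[i - 1], tokens[i])) % num_buckets
        out[i] = table[h]
    return out

table = np.random.default_rng(5).standard_normal((1024, 16))
emb = bigram_hash_embed([3, 7, 7, 2], table, 1024)
```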
TrigramHash
Trigram hash table / trigram embedding variants used for token context features.
parameters: null
XSA
Cross-sequence attention added to improve modeling capacity and quantization behavior.
parameters: null
Partial RoPE
Uses RoPE on only part of the head dimension to free capacity for semantic matching.
parameters: {"rope_dims":16,"head_dims":64}
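A sketch of partial RoPE under the listed parameters (rotate only 16 of 64 head dims; the split-halves rotation style and base frequency here are common conventions, not confirmed from the PR). The remaining 48 channels carry no positional signal and stay free for content matching:

```python
import numpy as np

def partial_rope(q, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` channels of each head;
    leave the rest position-free for semantic matching."""
    T = q.shape[0]
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rope_dims:]], axis=-1)

q = np.random.default_rng(1).standard_normal((5, 64))
out = partial_rope(q)
```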
KV head count
Adjusted number of KV heads / multi-head attention width.
parameters: {"kv_heads":8}
Value Residual
Added value residual / cascading value highway to improve deep-layer information flow.
parameters: null
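The value-residual idea can be sketched as a simple blend (the mixing weight `lam` and the fixed-to-layer-1 choice are assumptions; variants make the weight learned or per-layer):

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix the current layer's value vectors with the first layer's values,
    giving deep layers a direct path to early token information."""
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(7)
v1 = rng.standard_normal((4, 8))   # values computed at the first layer
v5 = rng.standard_normal((4, 8))   # values at a deeper layer
mixed = value_residual(v5, v1)
```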
weight tying
Shared embeddings / tied weights used in some variants and ablations.
parameters: null
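Weight tying in its standard form, as a minimal sketch (vocab and width here are toy values): the unembedding projection reuses the input embedding matrix transposed, which halves embedding parameters in the artifact.

```python
import numpy as np

# One shared matrix serves both the input embedding and the output projection.
vocab, d = 100, 16
E = np.random.default_rng(6).standard_normal((vocab, d))

def embed(ids):
    return E[ids]          # (n, d)

def logits(h):
    return h @ E.T         # same matrix, transposed: (n, vocab)

h = embed([1, 2, 3])
out = logits(h)
```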
Quantization
mixed int5/int6
bits: null
scope: MLP proj and attention
GPTQ
bits: null
scope: embeddings
QAT
bits: null
scope: all
STE QAT
bits: null
scope: all
int6
bits: 6
scope: MLP
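The QAT entries above all rely on some form of fake quantization; a minimal symmetric per-tensor sketch for the int6 case follows (per-layer clipping, scale search, and the actual STE backward pass from the experiments are not shown; in training, gradients would flow through `fake_quant` as if it were the identity):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric fake quantization: round to integer levels, then dequantize.
    The straight-through estimator treats this op as identity in backward."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant(w, bits=6)
```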
Compression
zstd
level: null
lzma
level: null
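Artifact compression amounts to serializing the quantized weights and running a general-purpose compressor over the bytes. A sketch with stdlib LZMA (zstd is analogous via the third-party `zstandard` package; the tensor size and int6 value range here are illustrative):

```python
import lzma

import numpy as np

# Quantized weights in the int6 range, stored one value per byte.
rng = np.random.default_rng(2)
weights = rng.integers(-32, 32, size=100_000, dtype=np.int8)
raw = weights.tobytes()

packed = lzma.compress(raw, preset=9)
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)
```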
Weight Averaging
SWA
parameters: {"every":100}
EMA
parameters: null
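A sketch of SWA with the listed `every: 100` cadence (snapshot selection and the final BN/eval handling from the actual runs are omitted): keep a running mean of weight snapshots taken every 100 steps.

```python
import numpy as np

class SWA:
    """Running average of weight snapshots taken every `every` steps."""
    def __init__(self, every=100):
        self.every, self.n, self.avg = every, 0, None

    def update(self, step, w):
        if step % self.every != 0:
            return
        self.n += 1
        # Incremental mean over all snapshots seen so far.
        self.avg = w.copy() if self.avg is None else self.avg + (w - self.avg) / self.n

swa = SWA(every=100)
for step in range(0, 301, 100):
    swa.update(step, np.full(3, float(step)))
# Mean of snapshots at steps 0, 100, 200, 300.
```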
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"parallel":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":400,"start_step":900}
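Given the listed parameters, the warmdown schedule can be read as: hold the base LR until `start_step`, then decay linearly to zero over `warmdown_steps`. A sketch (linear shape assumed; the PR may use a different decay curve):

```python
def warmdown_lr(step, base_lr=1.0, start_step=900, warmdown_steps=400):
    """Constant LR until start_step, then linear decay to zero."""
    if step < start_step:
        return base_lr
    frac = min((step - start_step) / warmdown_steps, 1.0)
    return base_lr * (1.0 - frac)
```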
Regularization
label smoothing
parameters: null
magnitude pruning
parameters: {"sparsity":0.05}
layerwise LN scale
parameters: null
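Magnitude pruning at the listed 5% sparsity is a one-liner in spirit: zero the smallest-magnitude fraction of weights. A global (per-tensor) sketch; the experiments may prune per-layer instead:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.05):
    """Zero the smallest-magnitude `sparsity` fraction of weights."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    # k-th smallest absolute value is the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

w = np.random.default_rng(3).standard_normal(1000)
pruned = magnitude_prune(w, 0.05)
```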
Initialization
OrthoInit
Orthogonal constraint / initialization behavior referenced in Muon-related experiments.
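Standard orthogonal initialization, as one concrete reading of OrthoInit (the sign-corrected QR construction is the usual recipe; how the constraint interacts with Muon in the experiments is not shown here):

```python
import numpy as np

def ortho_init(shape, gain=1.0, seed=0):
    """Orthogonal initialization via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal(shape))
    # Sign correction makes the result uniformly distributed over orthogonal matrices.
    q *= np.sign(np.diag(r))
    return gain * q

W = ortho_init((64, 64))
```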
Test-Time Training
LoRA TTT
parameters: {"rank":8}
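LoRA-style test-time training with the listed rank 8, sketched in its usual form (dimensions and the zero-init convention for `B` are standard LoRA choices, not confirmed details of the PR): the base weight stays frozen while a low-rank adapter `B @ A` is updated at test time.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen base weight W plus a low-rank adapter B @ A (trained at test time)."""
    return x @ (W + alpha * (B @ A)).T

d, r = 32, 8
rng = np.random.default_rng(4)
W = rng.standard_normal((d, d))          # frozen
A = rng.standard_normal((r, d)) * 0.01   # adapter factor, updated at test time
B = np.zeros((d, r))                     # zero-init: adapter starts as a no-op
x = rng.standard_normal((2, d))
y = lora_forward(x, W, A, B)
```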
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Novel Contributions
- Large-scale autopsy of 100+ experiments with a structured mind map and experiment lineage analysis
- Demonstration that depth recurrence, partial RoPE, and parallel residual-style routing materially improve parameter efficiency
- Identification that warmdown scheduling was a major hidden issue and that fixing it produced a large bpb gain
- Evidence that auxiliary losses generally hurt in the compute-starved regime
- Discovery that simpler architectures and fewer competing objectives often outperform more complex variants
- Integration and ablation of meta-TTT, showing its gains are architecture-limited
- Community-derived improvements such as SP8192 tokenizer, improved parallel residuals, and QK gain initialization
- Quantization-focused improvements including GPTQ, mixed int5/int6, per-layer clipping, and artifact compression strategies