val_bpb: 1.1290
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15751324 bytes (~15.0 MiB)
Training Techniques
Architecture
U-Net skip connections
Learnable skip connections in the model backbone.
parameters: null
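Since the table gives only the technique name, here is a minimal sketch of what a learnable U-Net skip might look like: the first half of the layer stack stores activations, and the second half mixes them back in through per-layer scalar weights. Function names and the scalar-gate form are assumptions, not the submission's actual code.

```python
def unet_backbone(x, enc_layers, dec_layers, skip_weights):
    # First half stores activations; second half adds them back,
    # each through its own learnable scalar skip weight (assumed form).
    skips = []
    for f in enc_layers:
        x = f(x)
        skips.append(x)
    for f, w in zip(dec_layers, skip_weights):
        x = f(x + w * skips.pop())  # skip from the matching encoder-side layer
    return x
```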
ReLU²
Uses squared ReLU (relu(x)²) as the MLP activation.
parameters: null
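The activation itself is one line; a NumPy sketch for reference:

```python
import numpy as np

def relu2(x):
    # squared ReLU: relu(x) ** 2
    return np.square(np.maximum(x, 0.0))
```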
GQA
Grouped Query Attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
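With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A NumPy sketch of causal GQA under those parameters (head dim and masking details are assumptions; the submission uses SDPA rather than explicit softmax):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    T, D = q.shape                      # q: (T, n_heads * head_dim)
    hd = D // n_heads
    g = n_heads // n_kv_heads           # query heads sharing each KV head
    qh = q.reshape(T, n_heads, hd)
    kh = np.repeat(k.reshape(T, n_kv_heads, hd), g, axis=1)  # share KV heads
    vh = np.repeat(v.reshape(T, n_kv_heads, hd), g, axis=1)
    scores = np.einsum('qhd,khd->hqk', qh, kh) / np.sqrt(hd)
    scores += np.triu(np.full((T, T), -np.inf), k=1)         # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, vh).reshape(T, D)
```

Note that K/V projections only need half the parameters of Q, which is the point of GQA.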
SmearGate
Sigmoid gate interpolates current and previous token representations.
parameters: null
BigramHash
XOR-hash bigram embedding with 2048 buckets.
parameters: {"buckets":2048,"embed_dim":128}
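A sketch of a hashed bigram embedding with 2048 buckets: each (previous, current) token pair is mixed into a bucket index via a multiply-then-XOR hash, and that index looks up a learned row. The multiplier constant and the BOS convention (previous token = 0 at position 0) are assumptions.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    # Bucket each (prev, cur) pair: multiply-then-XOR mix, modulo table size.
    # The odd multiplier is an arbitrary mixing constant (assumption).
    prev = np.concatenate([[0], tokens[:-1]])
    h = (prev * 2654435761) ^ tokens
    return table[h % len(table)]              # (T, embed_dim) rows
```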
XSA
XSA applied to the last 4 layers, with the attention layout adapted for SDPA.
parameters: {"layers":4}
Partial RoPE
Rotates only the first 16 dimensions while passing the rest through.
parameters: {"rotated_dims":16,"total_dims":64}
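A sketch of partial RoPE for a 64-dim head where only the first 16 dims rotate: the rotated half-pairs get standard rotary frequencies, the remaining 48 dims pass through unchanged. The pairing scheme and base 10000 are assumptions; the table only fixes rotated_dims and total_dims.

```python
import numpy as np

def partial_rope(x, pos, rotated_dims=16, base=10000.0):
    # Rotate the first `rotated_dims` dims of one head vector; pass the rest through.
    half = rotated_dims // 2
    theta = pos * base ** (-np.arange(half) / half)   # assumed rotary frequencies
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2, rest = x[:half], x[half:rotated_dims], x[rotated_dims:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos, rest])
```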
Sequence Length
train_length: 2048
eval_length: 2048
Weight Averaging
EMA
parameters: {"decay":0.997}
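EMA weight averaging with decay 0.997 is a one-line update per parameter; a minimal sketch (the class and names are illustrative, not the submission's code):

```python
class WeightEMA:
    """Exponential moving average of model weights (decay 0.997 per the table)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)          # copy of the current weights
    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1.0 - self.decay) * value
```

The averaged (`shadow`) weights are what gets evaluated and exported, not the raw training weights.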
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"decoupled_update":true}
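For reference, Muon orthogonalizes each 2-D momentum buffer with a quintic Newton-Schulz iteration before applying it, and "decoupled" weight decay multiplies the weights directly rather than being folded into the gradient. This is a sketch; the iteration coefficients come from the public Muon implementation and the step signature is an assumption.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic iteration driving the singular values of G toward 1
    # (coefficients from the public Muon implementation; an assumption here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, buf, grad, lr=0.02, momentum=0.95, weight_decay=0.04):
    buf = momentum * buf + grad
    w = w * (1.0 - lr * weight_decay)        # decoupled weight decay
    return w - lr * newton_schulz(buf), buf
```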
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"tok/scalar/head optimizers"}
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention weights in int6; embeddings and remaining tensors in int8
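A sketch of the per-tensor quantization step (symmetric scaling assumed; packing 6-bit values into bytes and the zstd stage are omitted):

```python
import numpy as np

def quantize(w, bits):
    # Symmetric per-tensor quantization; bit-packing to 6-bit words omitted.
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

At bits=6 the reconstruction error is bounded by half a quantization step, which is the trade accepted to fit under the artifact limit.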
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
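With a stride-64 sliding window, each evaluation window scores only its last 64 positions (the first window scores everything), so every token is scored exactly once with near-full context. A sketch of the span bookkeeping (the exact implementation is an assumption):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Returns (context_start, first_scored, one_past_last_scored) per window.
    spans, done, start = [], 0, 0
    while done < n_tokens:
        end = min(start + window, n_tokens)
        lo = 0 if start == 0 else max(done, end - stride)
        spans.append((start, lo, end))
        done = end
        start += stride
    return spans
```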
Initialization
OrthoInit
Orthogonal initialization with 1/sqrt(2*num_layers) scaling for projection layers.
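A sketch of the initializer: draw a Gaussian matrix, orthogonalize via QR, then apply the 1/sqrt(2*num_layers) scale. The sign fix and the fan_out >= fan_in assumption are illustrative details, not confirmed by the table.

```python
import numpy as np

def ortho_init(fan_out, fan_in, num_layers, rng=None):
    # Orthogonal columns via QR, scaled by 1 / sqrt(2 * num_layers).
    # Assumes fan_out >= fan_in; the sign fix makes the factorization unique.
    if rng is None:
        rng = np.random.default_rng(0)
    q, r = np.linalg.qr(rng.standard_normal((fan_out, fan_in)))
    q *= np.sign(np.diag(r))
    return q / np.sqrt(2 * num_layers)
```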
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx + 1)"}
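The scale rule damps LayerNorm outputs by 1/sqrt(layer_idx + 1), so deeper layers contribute progressively less to the residual stream. A minimal sketch (the placement of the scale after normalization is an assumption):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-6):
    # Standard LayerNorm followed by a fixed 1 / sqrt(layer_idx + 1) scale.
    y = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)
    return y / np.sqrt(layer_idx + 1)
```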
Novel Contributions
- Pre-TTT anchor submission adapted from the repo-root train_gpt.py skeleton
- SDPA-based adaptation of donor features originally designed for flash_attn_3
- Selective transplant of donor techniques including SmearGate, BigramHash, XSA, and Partial RoPE
- Mixed int6/int8 export with zstd compression to fit the 16MB artifact limit
- Stride-64 sliding evaluation for validation
- EMA-based weight averaging and Muon optimization with decoupled weight decay