val_bpb: 1.3220
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 are looped twice, reusing the same weights across both passes.
parameters: {"layers":[3,4,5],"loops":2}
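A minimal sketch of the depth-recurrence forward pass, with blocks as plain callables; the function signature and block representation are illustrative, not the submission's actual code:

```python
# Depth recurrence: blocks at the listed indices are applied `loops` times
# in sequence, reusing the same block (and hence the same weights) per pass.
def forward(blocks, x, looped=(3, 4, 5), loops=2):
    for i, block in enumerate(blocks):
        passes = loops if i in looped else 1
        for _ in range(passes):
            x = block(x)
    return x
```

With six increment blocks, indices 3-5 run twice, so the input is incremented nine times in total.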
XSA
Cross-segment attention applied across layers.
parameters: null
Partial RoPE
Rotary position embeddings applied to only part of the head dimension with NTK scaling.
parameters: {"dimensions":16}
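A sketch of partial RoPE: only the first `rot_dims` of each head dimension are rotated and the rest pass through unchanged. The 16-dim figure is from the report; the NTK base-scaling formula is a common choice and is assumed, not confirmed.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0, ntk_factor=1.0):
    seq, dim = x.shape                       # x: (seq_len, head_dim)
    half = rot_dims // 2
    # NTK-aware scaling of the rotary base (assumed formula).
    scaled_base = base * ntk_factor ** (rot_dims / max(rot_dims - 2, 1))
    inv_freq = scaled_base ** (-2.0 * np.arange(half) / rot_dims)
    angles = np.outer(np.arange(seq), inv_freq)       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dimensions beyond rot_dims are left unrotated.
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```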
U-Net skip connections
Learned skip connections between encoder and decoder halves.
parameters: null
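One way to realize learned skips across the stack, saving outputs of the first (encoder) half and mixing them into the mirrored second (decoder) half through learned scalar gates; the scalar-gate parameterization is an assumption, not the submission's confirmed design:

```python
# U-Net-style skips: block i in the second half receives the saved output
# of its mirror block (n - 1 - i) scaled by a learned gate.
def forward_with_skips(blocks, x, gates):
    n, half = len(blocks), len(blocks) // 2
    saved = []
    for i, block in enumerate(blocks):
        if i >= half:
            x = x + gates[n - 1 - i] * saved[n - 1 - i]
        x = block(x)
        if i < half:
            saved.append(x)
    return x
```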
Squared LeakyReLU
Uses LeakyReLU with negative slope 0.5, squared, as the MLP activation.
parameters: null
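The activation as described, in scalar form for illustration:

```python
# Squared LeakyReLU: apply LeakyReLU with negative slope 0.5, then square.
def act(x, negative_slope=0.5):
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Note that squaring makes the output non-negative on both sides, unlike plain LeakyReLU.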
GQA
Uses fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
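A minimal GQA sketch with the report's 8 query heads sharing 4 KV heads: each KV head is repeated across its group before standard scaled-dot-product attention.

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (heads, seq, d); k, v: (kv_heads, seq, d); heads % kv_heads == 0
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With 8 heads and 4 KV heads the KV cache (and KV projection parameters) shrink by 2x relative to full multi-head attention.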
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"row_normalization":true,"momentum_warmup":true}
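Muon's core step orthogonalizes the momentum-averaged gradient with a quintic Newton-Schulz iteration; a sketch for a square matrix, with coefficients from the commonly published Muon implementation. The report's momentum warmup, row normalization, and weight decay (0.095) wrap around this step and are omitted here.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic iteration that drives singular values of X toward 1,
    # approximately orthogonalizing the update direction.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # pre-normalize before iterating
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```

For non-square weights, implementations typically transpose so the short side comes first; this sketch assumes a square matrix.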
Weight Averaging
EMA
parameters: {"decay":0.9965}
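The EMA update with the report's decay of 0.9965, with weights as plain lists of floats for illustration:

```python
# Exponential moving average of weights: each step nudges the running
# average toward the current weights by (1 - decay).
def ema_update(avg, new, decay=0.9965):
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]
```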
Quantization
GPTQ
bits: 6
scope: adapter weights
int8
bits: 8
scope: embeddings
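A plausible scheme for the int8 embedding pass is symmetric per-row quantization, sketched below; the 6-bit GPTQ pass used for adapter weights is calibration-based and more involved, so it is not sketched here.

```python
import numpy as np

def quantize_int8(w):
    # One scale per row, chosen so the row's max magnitude maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```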
Compression
brotli
level: 11
Evaluation
sliding window eval
parameters: {"stride":256}
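A sketch of sliding-window evaluation with the report's stride of 256: each window advances by `stride` tokens and only the not-yet-scored tokens at its end are counted, so every token is scored exactly once with long left context. The window size (2048, matching the train length) is an assumption, since eval_length is unspecified.

```python
def sliding_windows(n_tokens, window=2048, stride=256):
    # Yields (window_start, window_end, n_newly_scored_tokens).
    scored, start = 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end, end - scored
        scored = end
        start += stride
```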
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
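With warmdown_frac = 0.72 from the report, the learning rate is held constant and then decays linearly to zero over the final 72% of steps; the constant-then-linear shape is the usual "warmdown" schedule and is assumed here.

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.72):
    # Flat phase, then linear decay to zero over the warmdown phase.
    warmdown_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```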
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Other
other
Frozen random backbone reconstructed deterministically from seed=42; only LoRA adapters, embeddings, and control tensors are serialized.
parameters: {"seed":42,"lora_rank":304}
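The load-time half of this trick can be sketched as follows: backbone weights are regenerated deterministically from seed 42, so they cost zero bytes in the serialized artifact. The shapes and init distribution below are illustrative, not the submission's.

```python
import random

def materialize_backbone(seed=42, shapes=((64, 64), (64, 256))):
    # Same seed -> bit-identical backbone on every load; only the LoRA
    # adapters, embeddings, and control tensors need to be stored.
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]
            for rows, cols in shapes]
```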
Novel Contributions
- Frozen random backbone reconstructed from a deterministic seed so backbone weights take 0 bytes in the artifact
- Rank-304 LoRA adapters on every linear layer with only adapter weights serialized
- Depth recurrence combined with adapters, reusing adapter weights across repeated layers
- Disabling EMA for random adapters to avoid regression from averaging adapter_B toward zero initialization
- GPTQ int6 plus brotli serialization pipeline that fits within the 16MB limit