val_bpb
1.1454
Architecture
Transformer
Optimizer
Muon + Adam
Artifact Size
15.88 MB
Training Techniques
Architecture
depth recurrence
Replaces unique transformer blocks with a small pool of shared blocks repeated across depth, yielding effectively deeper computation with fewer unique parameters.
parameters: {"blocks":3,"repeats":4,"effective_layers":12}
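The parameter sharing above can be sketched in a few lines; this is a minimal illustration (the stand-in blocks and their arithmetic are hypothetical, not the entry's actual block code), showing how 3 unique blocks looped 4 times produce 12 effective layers:

```python
# Depth recurrence sketch: a small pool of unique blocks is reused across
# repeats, so 3 blocks x 4 repeats = 12 effective layers while only
# 3 blocks' worth of parameters are ever stored.

BLOCKS = 3     # unique transformer blocks (parameters stored once)
REPEATS = 4    # how many times the whole stack is looped

def make_block(i):
    # Stand-in for a transformer block; here just a tagged shift.
    def block(x):
        return x + 0.1 * (i + 1)
    return block

blocks = [make_block(i) for i in range(BLOCKS)]

def forward(x):
    trace = []                       # which unique block ran at each depth
    for _ in range(REPEATS):
        for i, block in enumerate(blocks):
            x = block(x)
            trace.append(i)
    return x, trace

out, trace = forward(0.0)
```

The trace makes the weight tying visible: the same 3 block indices recur 4 times, giving 12 layer applications.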
cross-repeat skip
Adds a weighted residual from the previous repeat to make the recurrent depth stateful.
parameters: null
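One plausible reading of the cross-repeat skip is sketched below; the mixing form and the learned scalar `ALPHA` are assumptions for illustration, since the entry reports no parameters for this technique:

```python
# Cross-repeat skip sketch (assumed form): the input to repeat r is mixed
# with the output of repeat r-1 via a learned scalar, so information
# carries statefully across repeats instead of each repeat starting fresh.

ALPHA = 0.5  # hypothetical learned skip weight

def block_stack(x):
    return x * 0.9 + 1.0   # stand-in for one pass through the shared blocks

def forward(x, repeats=4):
    prev = None
    for _ in range(repeats):
        if prev is not None:
            x = x + ALPHA * prev   # weighted residual from previous repeat
        prev = x
        x = block_stack(x)
    return x
```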
value embeddings
Adds two extra embedding tables mixed into the residual stream at each effective layer with learned scales.
parameters: {"tables":2}
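A minimal sketch of the two value-embedding tables, assuming each table is looked up by token id and mixed into the residual stream with its own learned scalar (the dimensions, init, and scale values here are illustrative, not the entry's):

```python
# Value-embedding sketch: two extra embedding tables whose lookups are
# added into the residual stream at each effective layer, each weighted
# by its own learned scale.
import random

VOCAB, DIM, TABLES = 16, 4, 2
random.seed(0)
value_tables = [
    [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]
    for _ in range(TABLES)
]
scales = [0.5, 0.25]   # one learned scalar per table (hypothetical values)

def mix_value_embeddings(resid, token_id):
    # resid: residual-stream vector for one position
    for table, s in zip(value_tables, scales):
        emb = table[token_id]
        resid = [r + s * e for r, e in zip(resid, emb)]
    return resid
```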
loop embedding
Learns a per-layer vector added before each block as depth-wise positional encoding.
parameters: null
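The loop embedding can be sketched as one learned vector per effective layer, added to the hidden state before the block runs (the dimension and init below are hypothetical):

```python
# Loop-embedding sketch: a per-depth learned vector added before each
# block acts as depth-wise positional encoding, letting shared blocks
# tell which effective layer they are currently computing.

DIM, EFFECTIVE_LAYERS = 4, 12
# Hypothetical init: a distinct constant vector per effective layer.
loop_emb = [[0.01 * (d + 1)] * DIM for d in range(EFFECTIVE_LAYERS)]

def apply_block(x, depth):
    x = [xi + e for xi, e in zip(x, loop_emb[depth])]  # add depth vector
    return x  # ...the shared block's computation would follow here
```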
KV head count
Uses grouped-query attention: 8 query heads share 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
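The head grouping implied by these counts is simple to state: each KV head serves a contiguous group of query heads. A minimal sketch of that mapping:

```python
# Grouped-query attention head mapping: with 8 query heads and 4 KV
# heads, each KV head is shared by 8 // 4 = 2 query heads, halving the
# KV cache relative to full multi-head attention.

HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS   # query heads per KV head

def kv_head_for(query_head):
    return query_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```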
Evaluation
sliding window eval
parameters: {"stride":256,"window":1024}
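Sliding-window evaluation with these settings scores a long sequence in chunks: the window holds up to 1024 tokens of context, advances by 256 tokens, and only the newly uncovered 256 tokens are scored each step, so no token is counted twice. A sketch of the span bookkeeping (the function name is illustrative):

```python
# Sliding-window eval sketch: each span records the context window start
# and the [score_start, score_end) range of freshly scored tokens.

WINDOW, STRIDE = 1024, 256

def eval_spans(n_tokens):
    spans = []          # (window_start, score_start, score_end)
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + STRIDE - WINDOW)   # keep window <= 1024 tokens
        end = min(pos + STRIDE, n_tokens)
        spans.append((start, pos, end))
        pos = end
    return spans
```

Early spans fall back to shorter context until enough tokens exist to fill the window; from then on each scored chunk sees 768 tokens of prior context.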
Other
other
Hedge Mixer: an online ensemble applied at evaluation time that mixes neural, unigram, bigram, trigram, and entropy experts with the Hedge algorithm, updating expert weights using only tokens that have already been scored.
parameters: {"experts":5}
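The Hedge algorithm itself is a multiplicative-weights update; a minimal sketch of how five experts could be mixed and reweighted per token (the learning rate `ETA` and the example probabilities/losses are hypothetical):

```python
# Hedge update sketch: experts are mixed by normalized weight, and after
# each scored token every expert's weight is multiplied by
# exp(-eta * loss), so persistently worse experts fade out online.
import math

ETA = 0.1       # hypothetical Hedge learning rate
N_EXPERTS = 5   # neural, unigram, bigram, trigram, entropy

def mix(probs, weights):
    # Weighted average of the experts' next-token probabilities.
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total

def hedge_update(weights, losses):
    # Multiplicative-weights update from per-expert losses on the
    # token that was just scored.
    return [w * math.exp(-ETA * l) for w, l in zip(weights, losses)]

weights = [1.0] * N_EXPERTS
p = mix([0.4, 0.1, 0.1, 0.1, 0.1], weights)        # prediction first...
weights = hedge_update(weights, [0.9, 2.3, 2.3, 2.3, 2.3])  # ...update after
```

Predicting before updating is what keeps the scheme causal: each token is mixed using only weights learned from tokens already scored.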
LR Schedule
warmdown
parameters: {"warmdown_iters":2000}
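A warmdown schedule holds the learning rate at its peak and then decays it linearly to zero over the final 2000 iterations. A sketch, where the peak LR and total iteration count are hypothetical (the entry specifies only `warmdown_iters`):

```python
# Warmdown schedule sketch: constant LR, then linear decay to zero over
# the last WARMDOWN_ITERS iterations of training.

PEAK_LR = 3e-3          # hypothetical peak learning rate
TOTAL_ITERS = 10000     # hypothetical total training iterations
WARMDOWN_ITERS = 2000

def lr_at(it):
    if it < TOTAL_ITERS - WARMDOWN_ITERS:
        return PEAK_LR
    remaining = TOTAL_ITERS - it
    return PEAK_LR * remaining / WARMDOWN_ITERS
```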
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
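Clipping at a global norm of 0.3 rescales all gradients together whenever their combined norm exceeds the threshold, leaving the update direction unchanged. A minimal sketch on a flat gradient vector:

```python
# Global-norm gradient clipping sketch: if the combined norm of all
# gradients exceeds CLIP, scale every gradient so the global norm
# equals CLIP; otherwise leave gradients untouched.
import math

CLIP = 0.3

def clip_grads(grads):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > CLIP:
        scale = CLIP / norm
        grads = [g * scale for g in grads]
    return grads
```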
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Compression
zlib
level: null
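The zlib step presumably deflates the serialized weights before the artifact size is measured. A minimal round-trip sketch; the payload is a stand-in and the compression level shown is hypothetical, since the entry leaves it unspecified:

```python
# zlib artifact-compression sketch: deflate a byte buffer (stand-in for
# serialized model weights) and verify it restores losslessly.
import zlib

payload = bytes(range(256)) * 64          # stand-in for serialized weights
compressed = zlib.compress(payload, level=6)   # level is a hypothetical choice
restored = zlib.decompress(compressed)
assert restored == payload                # lossless round trip
```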
Novel Contributions
- Progressive depth / depth recurrence with shared transformer blocks
- Cross-Repeat Skip for stateful recurrent depth
- Value embeddings mixed into the residual stream
- Loop embedding as depth-wise positional encoding
- Hedge Mixer online ensemble at evaluation time
- Sliding-window evaluation with stride 256
- Learning-rate and warmdown tuning