- val_bpb: 1.1609
- Architecture: Transformer
- Optimizer: Muon
- Artifact size: 13,977,633 bytes
Training Techniques

Quantization
- int6: 6-bit quantization applied to all weights, per-row
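A minimal sketch of per-row int6 quantization, assuming a symmetric grid with one float scale per row; the artifact's actual rounding scheme and bit-packing are not specified here, so those details are illustrative:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Quantize a 2-D weight matrix to 6-bit integers, one scale per row."""
    # Signed 6 bits cover [-32, 31]; a symmetric [-31, 31] grid lets the
    # scale map each row's max |w| exactly onto the largest level.
    scales = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scales[scales == 0] = 1.0            # guard all-zero rows
    q = np.clip(np.round(w / scales), -31, 31).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)                 # per-element error is at most scale/2
```

Note that the int8 codes here would still need packing at 6 bits per weight before compression to realize the artifact size reported above.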
Architecture
- MLP3x: 3x MLP with 1536 hidden size (hidden_size: 1536)
- GQA: grouped-query attention with 8 query heads and 4 KV heads (query_heads: 8, kv_heads: 4)
- Tied embeddings: input and output embeddings are tied
- SmearGate: custom gating mechanism used in the model
- BigramHash: bigram hash feature module (size: 2048x128)
- RoPE: rotary positional embeddings with NTK scaling (sequence_length: 2048)
- Partial RoPE: applies RoPE to only part of the dimensions (dimensions: 16/64)
- XSA: XSA applied on the last 4 layers (layers: 4)
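The GQA entry above can be illustrated with a small sketch (causal, single layer, NumPy). The 8/4 head configuration and the resulting 2-to-1 query-to-KV-head mapping come from the table; the tiny dimensions and everything else are illustrative:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, q_heads=8, kv_heads=4):
    """Causal grouped-query attention: q_heads query heads share kv_heads
    K/V heads, so each K/V head serves q_heads // kv_heads query heads."""
    T, d = x.shape
    hd = d // q_heads                    # per-head dimension
    group = q_heads // kv_heads          # query heads per K/V head (2 here)
    q = (x @ wq).reshape(T, q_heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)
    v = (x @ wv).reshape(T, kv_heads, hd)
    mask = np.triu(np.full((T, T), -1e9), k=1)   # causal mask
    out = np.empty_like(q)
    for h in range(q_heads):
        kh = h // group                  # shared K/V head for this query head
        scores = q[:, h] @ k[:, kh].T / np.sqrt(hd) + mask
        p = np.exp(scores - scores.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)
        out[:, h] = p @ v[:, kh]
    return out.reshape(T, d)

rng = np.random.default_rng(1)
T, d = 5, 16                             # tiny dims just for shape-checking
x = rng.standard_normal((T, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, (d // 8) * 4))   # kv_heads * head_dim columns
wv = rng.standard_normal((d, (d // 8) * 4))
y = gqa_attention(x, wq, wk, wv)
```

The point of sharing K/V heads is that the KV cache and the K/V projection weights shrink by the group factor while the query side keeps its full head count.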
Initialization
- OrthoInit: orthogonal initialization combined with muP
Optimizer
- Muon (weight_decay: 0.04)
Weight Averaging
- EMA (decay: 0.997)
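EMA weight averaging keeps a shadow copy of the weights, updated after each optimizer step as ema ← decay · ema + (1 − decay) · w, and evaluates with the shadow copy. A minimal sketch using the decay reported above:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step per tensor: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Toy loop: the shadow weights converge smoothly onto the live weights.
live, shadow = [1.0], [0.0]
for _ in range(2000):
    shadow = ema_update(shadow, live)
```

With decay 0.997 the averaging window is roughly 1 / (1 − 0.997) ≈ 333 steps, which smooths out late-training noise in the weights.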
Regularization
- LN Scale
- Weight decay (value: 0.04)
Evaluation
- Sliding window eval (stride: 64)
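Sliding-window evaluation re-scores the sequence in overlapping windows, counting only the last `stride` tokens of each window toward the loss so every scored token sees near-maximal left context. A sketch of the bookkeeping, assuming this standard scheme; the scorer here is a stand-in and the tiny window/stride are for illustration only:

```python
def sliding_window_nll(nll_fn, tokens, window=2048, stride=64):
    """Total NLL over tokens[1:], scoring at most `stride` new tokens per
    window so each prediction sees up to window-1 tokens of left context."""
    total = 0.0
    pos = 1                              # token 0 is pure context
    while pos < len(tokens):
        new = min(stride, len(tokens) - pos)
        ctx_start = max(0, pos + new - window)
        # nll_fn(seq) returns per-token NLLs for seq[1:] given the prefix
        nlls = nll_fn(tokens[ctx_start:pos + new])
        total += sum(nlls[-new:])        # count only the newly scored tokens
        pos += new
    return total

const_nll = lambda seq: [1.0] * (len(seq) - 1)   # stand-in scorer: 1 nat each
total = sliding_window_nll(const_nll, list(range(10)), window=6, stride=2)
```

A small stride like 64 trades more forward passes for better context per scored token, which lowers the measured bits per byte.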
Test-Time Training
- Online logit bias (learning_rate: 0.1, enabled: false)
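The online logit bias (reported here with enabled: false) maintains a per-token bias vector added to the model's logits and, after each observed token, takes a step along the exact cross-entropy gradient with respect to that bias. A minimal sketch of one step, with frozen logits standing in for the model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def olb_step(bias, logits, target, lr=0.1):
    """Observe one token, then take an exact cross-entropy gradient step
    on the bias vector: d(CE)/d(bias) = softmax(logits + bias) - one_hot."""
    p = softmax(logits + bias)
    nll = -np.log(p[target])             # loss actually paid at this position
    grad = p.copy()
    grad[target] -= 1.0
    return bias - lr * grad, nll

# Frozen logits, repeated target: the bias adapts and the loss falls.
bias, logits = np.zeros(5), np.zeros(5)
losses = []
for _ in range(50):
    bias, nll = olb_step(bias, logits, target=2)
    losses.append(nll)
```

Because the gradient with respect to an additive logit bias is exactly softmax minus one-hot, no backpropagation through the model is needed at test time.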
Sequence Length
- train_length: 2048
Compression
- zstd
Novel Contributions
- Online logit bias (OLB) evaluation technique that updates a per-token bias vector during sliding-window evaluation using the exact cross-entropy gradient
- Int6 per-row quantized model with zstd compression
- Sliding-window evaluation with stride 64
- Custom 11-layer architecture with SmearGate, BigramHash, XSA, partial RoPE, and tied embeddings