val_bpb: 1.1266
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.99 MB
Training Techniques
Architecture
- weight tying: tied input and output embeddings using a single shared embedding matrix.
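The tying above can be sketched in plain NumPy (sizes and names are illustrative, not the submission's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 4096, 256                   # illustrative sizes
E = rng.normal(0.0, 0.02, (vocab, d_model))  # the single shared matrix

def embed(token_ids):
    # input side: row lookup into E
    return E[token_ids]

def unembed(hidden):
    # output side: logits through the transpose of the same E
    return hidden @ E.T

h = embed(np.array([1, 2, 3]))   # (3, d_model)
logits = unembed(h)              # (3, vocab)
```

Tying removes the separate vocab × d_model output head, a meaningful saving under a 16 MB artifact cap.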
- GQA: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4).
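A minimal NumPy sketch of grouped-query attention with the reported 8 query heads sharing 4 KV heads (causal mask omitted for brevity; shapes are illustrative):

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.
    q: (H, T, dh); k, v: (H_kv, T, dh) with H a multiple of H_kv."""
    groups = q.shape[0] // k.shape[0]
    k = np.repeat(k, groups, axis=0)   # broadcast each KV head to its group
    v = np.repeat(v, groups, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)   # softmax; causal mask omitted
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16, 32))       # 8 query heads
k = rng.normal(size=(4, 16, 32))       # 4 KV heads, as reported
v = rng.normal(size=(4, 16, 32))
out = gqa(q, k, v)                     # (8, 16, 32)
```

Halving the KV heads halves both the KV projection parameters and the KV cache.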
- MLP3x: expanded MLP width beyond the baseline (multiplier: 3.5, hidden_dim: 1792).
- U-Net skip connections: added encoder-decoder style skip connections between matching layers.
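The wiring can be sketched abstractly; `blocks` stands in for transformer layers, and pairing layer i with layer n-1-i is an assumption about which layers "match":

```python
def unet_forward(x, blocks):
    """Run blocks with U-Net pairing: the output of encoder layer i is
    added to the input of decoder layer n - 1 - i."""
    n = len(blocks)
    saved = []
    for blk in blocks[: n // 2]:       # encoder half: save each output
        x = blk(x)
        saved.append(x)
    for blk in blocks[n // 2 :]:       # decoder half: add the matching skip
        x = blk(x + saved.pop())
    return x
```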
- LeakyReLU: used a squared LeakyReLU activation in the MLP (slope: 0.5).
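The exact composition of "LeakyReLU squared" is not spelled out above; one plausible reading, sketched here as an assumption, squares the output of the leaky ramp:

```python
def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU (assumed form: square of the leaky ramp).
    Note that squaring discards the sign of the negative branch."""
    y = x if x >= 0.0 else slope * x
    return y * y
```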
- XSA: applied cross-sequence attention in the last 4 layers.
Quantization
- GPTQ: 5-bit, applied to attention and MLP weights.
- QAT: 5-bit quantization-aware training of the MLP layers.
- int8: 8-bit, applied to the tied embeddings.
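The int8 path for the tied embeddings can be illustrated with generic symmetric per-tensor quantization (a sketch only; the GPTQ and QAT 5-bit paths involve calibration and training and are not shown):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, (4096, 64)).astype(np.float32)  # illustrative shape
q, scale = quantize_int8(w)
err = np.abs(w - dequantize_int8(q, scale)).max()  # bounded by scale / 2
```

A 4x size reduction versus float32, with reconstruction error bounded by half a quantization step.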
Compression
- brotli: level 11.
Other
- byte-shuffle: applied a byte-shuffle pre-filter before brotli to improve compression of the quantized weights.
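A sketch of the byte-shuffle pre-filter; zlib stands in for brotli here because brotli is not in the Python standard library, but the effect is the same: grouping same-position bytes puts the low-entropy high-order bytes into long compressible runs.

```python
import zlib
import numpy as np

def byte_shuffle(arr):
    """Regroup bytes so byte 0 of every element comes first, then byte 1, ..."""
    b = arr.view(np.uint8).reshape(-1, arr.dtype.itemsize)
    return np.ascontiguousarray(b.T).tobytes()

def byte_unshuffle(data, dtype, count):
    """Inverse transform, needed when loading the artifact."""
    b = np.frombuffer(data, dtype=np.uint8).reshape(np.dtype(dtype).itemsize, count)
    return np.ascontiguousarray(b.T).view(dtype).reshape(count)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 20.0, 100_000).astype(np.int16)  # stand-in quantized weights
plain = zlib.compress(w.tobytes(), 9)                # zlib stands in for brotli -q 11
shuffled = zlib.compress(byte_shuffle(w), 9)         # shuffle first, then compress
```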
Evaluation
- sliding window eval (stride: 64).
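The index bookkeeping for sliding-window evaluation can be sketched as follows, under the assumption that each step scores only the tokens not yet scored, with the rest of the window serving as context (window size taken from the 2048 eval length):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """(ctx_start, end, n_scored) triples: feed tokens [ctx_start:end] to
    the model, score only the last n_scored of them."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token past the first window is scored with near-full left context, at the cost of one forward pass per stride.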
Test-Time Training
- score-first TTT (learning_rate: 0.003, epochs_per_chunk: 20, chunks: 348).
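The control flow of score-first TTT can be sketched with a toy stand-in model (a single mean parameter fit by gradient steps), since the real model and loss are not shown here; the key point is that each chunk is scored before being trained on:

```python
def ttt_stream(chunks, score, adapt, epochs_per_chunk=20):
    """Score-first TTT: score each chunk with the current weights, *then*
    adapt on that same chunk, so no score uses weights trained on it."""
    scores = []
    for chunk in chunks:
        scores.append(score(chunk))
        for _ in range(epochs_per_chunk):
            adapt(chunk)
    return scores

# toy stand-in "model": one mean parameter, squared-error loss
state = {"mu": 0.0}

def score(chunk):
    return sum((x - state["mu"]) ** 2 for x in chunk)

def adapt(chunk, lr=0.1):
    grad = sum(2.0 * (state["mu"] - x) for x in chunk) / len(chunk)
    state["mu"] -= lr * grad

scores = ttt_stream([[1.0, 1.0], [1.0, 1.0]], score, adapt)
```

Later chunks benefit from adaptation on earlier ones, so scores improve over the stream.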
Optimizer
- Muon (weight_decay, momentum, and other parameters not recorded)
- Adam (weight_decay, momentum, and other parameters not recorded)
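The core Muon idea, orthogonalizing the momentum update for 2D weights, can be sketched as follows; exact SVD stands in for the Newton-Schulz iteration the real optimizer uses, and the hyperparameters are illustrative, not the submission's:

```python
import numpy as np

def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    """One Muon-style update (sketch): momentum buffer, then replace the
    update with its nearest orthogonal matrix. Real Muon approximates this
    orthogonalization with a Newton-Schulz iteration; exact SVD is used here."""
    buf = beta * buf + grad
    u, _, vt = np.linalg.svd(buf, full_matrices=False)
    return w - lr * (u @ vt), buf

rng = np.random.default_rng(0)
w = np.zeros((4, 3))
g = rng.normal(size=(4, 3))
w_new, buf = muon_step(w, g, np.zeros((4, 3)))
upd = (w - w_new) / 0.02   # recover the orthogonalized update direction
```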
Weight Averaging
- EMA + SWA (ema_decay: 0.997).
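Both averaging rules are one-liners; a flat parameter list stands in for model weights here, and the EMA decay is the reported 0.997:

```python
def ema_update(avg, weights, decay=0.997):
    """EMA of weights: avg <- decay * avg + (1 - decay) * weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_seen):
    """SWA: running uniform average over the n_seen + 1 checkpoints so far."""
    return [(a * n_seen + w) / (n_seen + 1) for a, w in zip(avg, weights)]
```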
LR Schedule
- warmdown (fraction: 0.35)
- cosine decay: used for TTT (lr: 0.003)
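Both schedules can be sketched directly; the warmdown is assumed to be constant-then-linear-to-zero over the final 0.35 of training, and step counts are illustrative:

```python
import math

def warmdown_lr(step, total_steps, base_lr, frac=0.35):
    """Constant LR, then linear decay to zero over the final `frac` of training."""
    start = int(total_steps * (1.0 - frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)

def cosine_lr(step, total_steps, base_lr=0.003):
    """Cosine decay from base_lr to zero, as used for the TTT phase."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```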
Regularization
- logit softcap (cap: 30).
Sequence Length
- train_length: 2048
- eval_length: 2048
Novel Contributions
- Custom sp4096 SentencePiece tokenizer hosted on HuggingFace
- Mixed int5/int8 quantization scheme with int8 tied embeddings
- Byte-shuffle plus brotli compression to fit under the 16MB cap
- GPTQ calibration using self-generated autoregressive sequences
- Score-first test-time training with full-block SGD adaptation