val_bpb: 1.2350
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,330,708 bytes
Training Techniques
Quantization
STE QAT (quantization-aware training with straight-through-estimator gradients)
bits: null
scope: weights
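A minimal sketch of the weight path under STE QAT, assuming a BitNet-style absmean ternary quantizer (the exact quantizer and bit width are not specified above): the forward pass sees quantized weights, while the straight-through estimator routes gradients to the full-precision master weights.

```python
import torch
import torch.nn.functional as F

def ste_ternary(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize to {-1, 0, +1} * scale; gradients pass straight through to w."""
    scale = w.abs().mean().clamp(min=1e-8)        # absmean scale (assumed)
    q = (w / scale).round().clamp(-1, 1) * scale  # ternary levels
    return w + (q - w).detach()                   # forward: q, backward: identity

class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized during training."""
    def forward(self, x):
        return F.linear(x, ste_ternary(self.weight), self.bias)
```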
Architecture
weight tying
Tied input and output embeddings.
parameters: null
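Tying is a one-line sharing of the embedding matrix with the output projection; a minimal sketch with hypothetical module names:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    # Hypothetical module names; only the tying line matters.
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # input and output share one tensor
```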
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"heads":16,"kv_heads":4}
RoPE
Rotary position embeddings with YaRN-style context extension.
parameters: {"context_length":4096}
depth recurrence
Training and evaluation use depth recurrence; a recurrence of 1 applies the recurrent block once.
parameters: {"training_depth_recurrence":1,"evaluation_depth_recurrence":1}
Optimizer
Muon
Matrix parameters are updated with Muon; scalar and embedding parameters use Adam.
weight_decay: null
momentum: null
other_params: {"matrix_parameters":true,"scalar_embedding_parameters":"Adam"}
Sequence Length
train_length: 4096
eval_length: 4096
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
warmup
parameters: {"warmup_steps":1}
Other
Runtime-row ternary scaling aligned to Model Stack's packed BitNet runtime export format.
parameters: {"scale_layout":"runtime_row","group_size":64}
Novel Contributions
- Non-record-setting 65M-parameter BitNet-style ternary transformer submission
- Runtime-row ternary scaling matched to Model Stack packed BitNet inference layout
- Exact packed BitNet runtime export with zero skipped tensors and zero packed-weight reconstruction error
- Near-zero gap between pre-roundtrip val_bpb and final val_bpb after the packed export roundtrip
- Training a larger ternary model within the 16MB Parameter Golf artifact budget