PR #2109

open

Track 2: PR #1855 + MP3 marker-pair fusion + alias smear boundary (val_bpb 1.05917838, 3-seed avg)

val_bpb
1.0592
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,907,150 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
mixed int7/int8
bits: null
scope: embeddings and attention gate
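To illustrate what a 6-bit weight code looks like, here is a minimal sketch of symmetric per-row round-to-nearest quantization. This is not GPTQ itself: GPTQ additionally compensates rounding error using second-order information, which this sketch omits. The per-row scaling and the function names are assumptions for illustration, not taken from the PR.

```python
def quantize_row_symmetric(row, bits=6):
    """Round-to-nearest symmetric quantization of one non-empty weight row.

    Simplified stand-in for illustration only; GPTQ's error compensation
    is NOT implemented here. Per-row scaling is an assumption.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed integer range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error per weight is at most half of the row's scale.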
Architecture
SmearGate
Disables previous-position smear immediately after alias tokens via an alias boundary mask.
parameters: {"alias_prev_smear_scale":0}
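The alias boundary mask can be sketched as follows. The additive smear form `out[t] = x[t] + gate[t] * x[t-1]`, the function name, and the argument layout are assumptions for illustration; only the `alias_prev_smear_scale` value of 0 comes from the entry above.

```python
def smear_with_alias_boundary(embeddings, token_ids, gates, alias_ids,
                              alias_prev_smear_scale=0.0):
    """Mix the previous position's embedding into each position.

    Assumed form: out[t] = x[t] + scale[t] * gate[t] * x[t-1], where
    scale[t] is alias_prev_smear_scale when token t-1 is an alias token
    and 1.0 otherwise. With a scale of 0 the smear is switched off for
    the position immediately after an alias token.
    """
    out = [list(embeddings[0])]  # position 0 has no previous position
    for t in range(1, len(embeddings)):
        prev_is_alias = token_ids[t - 1] in alias_ids
        scale = alias_prev_smear_scale if prev_is_alias else 1.0
        mix = scale * gates[t]
        out.append([x + mix * p
                    for x, p in zip(embeddings[t], embeddings[t - 1])])
    return out
```

Positions that do not follow an alias token are smeared as usual, so the mask only changes behaviour at alias boundaries.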
Gated Attention
SparseAttnGate / attention gating used in the inherited stack.
parameters: {"scale":0.5,"window":12}
XSA
Inherited XSA-based architecture from PR #1855.
parameters: {"layers":11}
weight tying
Not explicitly stated as used in this PR.
Test-Time Training
full TTT
parameters: {"phased":true,"chunk_size":48,"lora_rank":80}
Regularization
logit softcap
parameters: {"enabled":true}
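A logit softcap is commonly written as `cap * tanh(logit / cap)`. The sketch below uses that form; the cap value of 30.0 is a placeholder assumption, since the entry above only records that the softcap is enabled.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly bound logits to [-cap, cap] via cap * tanh(logit / cap).

    The default cap is a hypothetical value, not taken from the PR.
    """
    return [cap * math.tanh(z / cap) for z in logits]
```

Small logits pass through almost unchanged, while large ones saturate near the cap instead of growing without bound.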
Compression
lrzip+brotli
level: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"warmup_steps":20}
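One way to read these two parameters is as a learning-rate multiplier with a short linear warmup, a flat plateau, and a linear decay over the final `warmdown_frac` of training. The linear shapes, the decay to zero, and the function name are assumptions; the PR entry only gives the two parameter values.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_frac=0.85):
    """Learning-rate multiplier in [0, 1] for 0 <= step < total_steps.

    Assumed shape: linear warmup, constant plateau, then linear warmdown
    to zero over the last warmdown_frac of the run.
    """
    warmdown_start = int(round(total_steps * (1.0 - warmdown_frac)))
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step < warmdown_start:
        return 1.0
    # Linear decay from 1.0 at warmdown_start toward 0.0 at total_steps.
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

Under these assumptions, a 1000-step run would begin decaying at step 150, so most of training is spent in the warmdown phase.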
Optimizer
Muon
weight_decay: 0.5
momentum: 0.9
other_params: {"beta2":0.99}
Initialization
resid mix
Alias rows are warm-initialized as a weighted composite of constituent marker embeddings.
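The warm initialization described above amounts to a weighted sum of existing embedding rows. A minimal sketch follows; the function name, the list-of-lists table layout, and the example weights are assumptions, since the PR entry does not state how the weights are chosen.

```python
def warm_init_alias_row(embedding_table, constituent_ids, weights):
    """Build an alias row as a weighted composite of constituent rows.

    embedding_table: list of rows (each a list of floats), indexed by token id.
    constituent_ids: ids of the marker tokens the alias replaces.
    weights: one weight per constituent (hypothetical values).
    """
    if len(constituent_ids) != len(weights):
        raise ValueError("need exactly one weight per constituent token")
    dim = len(embedding_table[constituent_ids[0]])
    row = [0.0] * dim
    for tok, w in zip(constituent_ids, weights):
        for d in range(dim):
            row[d] += w * embedding_table[tok][d]
    return row
```

For example, an alias for a two-marker pair could start from `warm_init_alias_row(table, [id_a, id_b], [0.5, 0.5])`, so the new row begins near the embeddings it replaces rather than at a random point.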
Other
other
MP3 marker-pair fusion: fuses [▁, TITLE], [▁, ALLCAPS], and [▁, CAPNEXT] into alias donor tokens to reduce token count.
parameters: {"token_saving_percent":8.47}
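The fusion step can be sketched as a single left-to-right pass that replaces each listed marker bigram with one alias token. The alias names (`▁TITLE` and so on), the string-token representation, and the helper names are hypothetical; the PR entry names only the three marker pairs.

```python
# Marker bigrams from the PR description mapped to hypothetical alias names.
FUSION_TABLE = {
    ("▁", "TITLE"): "▁TITLE",
    ("▁", "ALLCAPS"): "▁ALLCAPS",
    ("▁", "CAPNEXT"): "▁CAPNEXT",
}


def fuse_marker_pairs(tokens, fusion_table=FUSION_TABLE):
    """Replace each listed adjacent marker pair with a single alias token."""
    fused = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in fusion_table:
            fused.append(fusion_table[pair])
            i += 2  # consume both tokens of the pair
        else:
            fused.append(tokens[i])
            i += 1
    return fused


def token_saving_percent(count_before, count_after):
    """Percentage reduction in token count after fusion."""
    return 100.0 * (count_before - count_after) / count_before
```

On a toy stream of 8 tokens containing two fusable pairs, the pass yields 6 tokens, a 25% saving. The 8.47% figure reported above is the PR's own measurement and is not reproduced by this sketch.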

Novel Contributions

  • MP3 marker-pair fusion that merges three frequent marker bigrams into alias donor tokens
  • Alias smear boundary that turns off SmearGate's previous-position contribution immediately after alias tokens
  • Warm-initialization of alias rows from constituent marker embeddings
  • Token-stream reduction of 8.47% while preserving the downstream word-boundary signal
  • Extension of the PR #1855 stack with phased TTT and alias-aware evaluation/training handling