val_bpb: 1.0354
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,995,750 bytes
Training Techniques
Test-Time Training
- Full TTT; parameters: {"rank":8,"epochs":21,"federated_averaging":true,"lr_schedule":"epoch-level cosine"}
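The TTT entry pairs rank-8 adapters with federated averaging across parallel workers. As a minimal sketch of the averaging step only, assuming each rank holds its own adapter matrix (the function name and the 2x2 shapes are illustrative, not from the submission):

```python
def federated_average(adapters):
    """Element-wise mean of per-worker adapter matrices.

    adapters: list of worker states, each a list-of-rows matrix.
    Returns one averaged matrix of the same shape.
    """
    n = len(adapters)
    rows, cols = len(adapters[0]), len(adapters[0][0])
    return [[sum(a[r][c] for a in adapters) / n for c in range(cols)]
            for r in range(rows)]

# Two workers with tiny 2x2 "adapters" for brevity (rank 8 in the real run).
w0 = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[3.0, 4.0], [5.0, 6.0]]
avg = federated_average([w0, w1])
```

Each worker adapts on its own shard, then the averaged adapter is what survives into the exported artifact.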
Architecture
- XSA: applied to all layers
- Depth recurrence: 3-layer depth recurrence over layers 3-5; parameters: {"layers":[3,4,5]}
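Depth recurrence reuses a block of layers instead of adding new ones, buying effective depth without extra parameters. A sketch that loops the weight-shared block over layers 3-5, assuming a loop count of 2 (the submission does not state the count):

```python
def forward(x, layers, recurrent=(3, 4, 5), loops=2):
    """Apply a layer list; the layers named in `recurrent` are run as
    one weight-shared block `loops` times (depth recurrence)."""
    lo, hi = min(recurrent), max(recurrent)
    for layer in layers[:lo]:          # layers before the recurrent block
        x = layer(x)
    for _ in range(loops):             # reuse the same weights each pass
        for i in range(lo, hi + 1):
            x = layers[i](x)
    for layer in layers[hi + 1:]:      # layers after the recurrent block
        x = layer(x)
    return x

# Toy "layers": layer i just adds i, so the path is easy to trace.
layers = [lambda x, i=i: x + i for i in range(8)]
y = forward(0, layers)
```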
- U-Net skip connections: parallel residual path from layer 7 onward; parameters: {"start_layer":7}
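A U-Net-style parallel residual keeps a copy of the activation entering the start layer and merges it back downstream. A sketch under the assumption that the branch taken at layer 7 is added back at the final output:

```python
def forward_with_parallel_residual(x, layers, start_layer=7):
    """Run layers sequentially; from `start_layer` onward a saved copy of
    the activation rides a parallel path and is added to the output."""
    skip = None
    for i, layer in enumerate(layers):
        if i == start_layer:
            skip = x               # branch point for the parallel path
        x = layer(x)
    return x + skip if skip is not None else x

# Toy layers that each add 1; the skip carries the value entering layer 7.
layers = [lambda x: x + 1 for _ in range(10)]
y = forward_with_parallel_residual(0, layers)
```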
- LeakyReLU: LeakyReLU(0.5)^2 MLP activation; parameters: {"slope":0.5}
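The listed activation squares a LeakyReLU with slope 0.5; note that squaring makes both branches non-negative. A direct scalar sketch:

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring:
    act(x) = LeakyReLU(x) ** 2.
    For x < 0 this yields (slope * x) ** 2, which is positive."""
    y = x if x >= 0 else slope * x
    return y * y
```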
- KV head count: 8-head attention with 4 KV heads; parameters: {"heads":8,"kv_heads":4}
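With 8 query heads over 4 KV heads (grouped-query attention), each KV head is shared by two query heads, halving the KV projection and cache. The index mapping is simply:

```python
def kv_head_for(q_head, heads=8, kv_heads=4):
    """Grouped-query attention: query heads are split into
    heads // kv_heads groups, each group reading one KV head."""
    group_size = heads // kv_heads   # 2 query heads per KV head here
    return q_head // group_size
```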
- RoPE: partial RoPE lineage referenced in the stack
Weight Averaging
- EMA
- SWA
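EMA and SWA both average weight snapshots, but differently: EMA keeps an exponentially decayed running average, SWA an equal-weight running mean over checkpoints. Minimal per-parameter sketches (the decay value is illustrative; neither is specified above):

```python
def ema_update(avg, w, decay=0.99):
    """Exponential moving average: recent weights dominate."""
    return [decay * a + (1 - decay) * v for a, v in zip(avg, w)]

def swa_update(avg, w, n):
    """Stochastic weight averaging: equal-weight mean after n
    prior snapshots have been folded in."""
    return [(a * n + v) / (n + 1) for a, v in zip(avg, w)]
```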
Optimizer
- AdamW (weight decay and momentum unspecified); other params: {"parallel_ranks":8,"epochs":21}
Quantization
- GPTQ: 6-bit, applied to model matrices
- int8: 8-bit, applied to embeddings
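GPTQ itself compensates rounding error with second-order calibration statistics, which is beyond a short sketch; the int8 embedding path, though, is close to plain symmetric round-to-nearest. An illustrative per-row version (not the submission's exact code):

```python
def quantize_int8(row):
    """Symmetric per-row int8 quantization: scale so the largest
    magnitude maps to 127, then round each value."""
    scale = max(abs(v) for v in row) / 127 or 1.0  # guard all-zero rows
    q = [round(v / scale) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int8 codes."""
    return [v * scale for v in q]

q, scale = quantize_int8([1.0, -0.5])
```

Per-row scales keep the worst-case rounding error bounded by half a quantization step for that row.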
Compression
- Brotli (level unspecified)
- LZMA (level unspecified)
Evaluation
- Sliding window eval; parameters: {"stride":64}
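Sliding-window evaluation advances a fixed context window by the stride and scores only the newly exposed tokens, so every token is scored exactly once with ample left context. A sketch of the window schedule, assuming a 512-token window (the eval length is not given above):

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Return (start, end, score_from) spans: each window covers
    [start, end) but only tokens in [score_from, end) count toward
    the loss; the first window scores everything it covers."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(640, window=512, stride=64)
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.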
Sequence Length
- Train and eval lengths unspecified
Regularization
- Weight decay; parameters: {"high_wd":true}
LR Schedule
- Cosine decay; parameters: {"epoch_level":true}
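Epoch-level cosine decay emits one learning-rate value per epoch rather than per step. A sketch over the 21 TTT epochs, with an illustrative peak LR (the actual value is not stated above):

```python
import math

def cosine_lr(epoch, total_epochs=21, lr_max=1e-3, lr_min=0.0):
    """One LR per epoch, decaying on a half-cosine from lr_max at
    epoch 0 to lr_min at the final epoch."""
    t = epoch / (total_epochs - 1)          # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```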
Novel Contributions
- Combines CaseOps reversible capitalization tokenization with pre-quant TTT
- Adds byte-sidecar validation accounting for transformed CaseOps tokens
- Threads original-byte sidecars through validation and sliding evaluation
- Uses pre-quant AdamW TTT before GPTQ export to improve the fixed artifact
- Achieves a new record mean val_bpb of 1.03540487 on track_10min_16mb