PR #1055
openSOTA Record: Novel Test-Time Method TARA Val BPB=0.97 under 4min (training-free unlike TTT)
by sanyalsunny111
val_bpb
0.9693
Architecture
Transformer
Optimizer
Muon
Artifact Size
~12 MB
Training Techniques
Architecture
GQA
Grouped query attention in the GPT-like architecture.
parameters: {"layers":9,"dimensions":512,"heads":8,"kv_heads":4}
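A minimal sketch of the listed attention shape (dim 512, 8 query heads, 4 KV heads), assuming a standard PyTorch grouped-query attention layer; class and variable names are illustrative, not from the PR:

```python
import torch
import torch.nn.functional as F
from torch import nn

class GQAttention(nn.Module):
    """Grouped-query attention: 8 query heads share 4 KV heads (2 per group)."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).chunk(2, dim=-1)
        k = k.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head across its group of query heads (8 / 4 = 2).
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))
```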
RoPE
Rotary positional embeddings used in the model.
parameters: {"base":50000}
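A sketch of rotary embeddings with the listed base of 50000, assuming the usual pairwise-rotation formulation; function names are illustrative:

```python
import torch

def rope_cache(seq_len, head_dim, base=50000.0, device=None):
    # Per-position rotation frequencies; base=50000 per the listed parameters.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device) / head_dim))
    t = torch.arange(seq_len, device=device)
    freqs = torch.outer(t, inv_freq)  # (seq_len, head_dim / 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (..., seq_len, head_dim); rotate each consecutive channel pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```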
weight tying
Tied input and output embeddings.
parameters: null
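In PyTorch, weight tying amounts to sharing one matrix between the token embedding and the output projection; the vocabulary size below is an assumption, not from the record:

```python
from torch import nn

vocab_size, dim = 50257, 512  # vocab size is illustrative
wte = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = wte.weight  # one shared (vocab, dim) parameter for input and output
```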
ReLU²
Squared-ReLU activation in the MLP blocks.
parameters: null
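A sketch of an MLP block with the squared-ReLU activation; the 4x hidden expansion is an assumption:

```python
import torch.nn.functional as F
from torch import nn

class MLP(nn.Module):
    def __init__(self, dim=512, hidden=2048):  # 4x expansion is an assumption
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # ReLU² activation: relu(x) squared, as listed above.
        return self.down(F.relu(self.up(x)).square())
```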
Quantization
int6
bits: 6
scope: per-row weights
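One plausible per-row int6 scheme, assuming symmetric round-to-nearest quantization; the record states only the bit width and per-row scope, so rounding mode and bit packing are assumptions (values sit in int8 containers here for simplicity):

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    # Symmetric per-row quantization to 6 bits: integer levels in [-31, 31].
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)  # int8 container
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(scale.dtype) * scale
```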
Compression
zstd
level: 22
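Level 22 is zstd's maximum compression level. With the zstandard Python bindings the round trip looks like this; function names are illustrative:

```python
import zstandard as zstd

def compress_artifact(raw: bytes) -> bytes:
    # Level 22 maximizes ratio: slow to compress, cheap to decompress.
    return zstd.ZstdCompressor(level=22).compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return zstd.ZstdDecompressor().decompress(blob)
```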
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_scalars_optimizer":"Adam"}
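Muon applies momentum SGD to 2D weight matrices and orthogonalizes each update with a Newton-Schulz iteration, while embeddings and scalars fall to Adam (per other_params above). A minimal sketch following the public reference implementation; the record leaves weight decay and momentum unreported, so the constants below are illustrative defaults:

```python
import torch

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G via the quintic Newton-Schulz iteration
    # used by Muon (coefficients from the reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(param, momentum_buf, lr=0.02, beta=0.95):
    # One Muon update for a 2D weight: momentum, then orthogonalize.
    momentum_buf.mul_(beta).add_(param.grad)
    param.add_(newton_schulz(momentum_buf), alpha=-lr)
```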
Evaluation
sliding window eval
parameters: {"stride":64}
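With stride=64, each evaluation window advances 64 tokens and only the newly exposed tokens are scored, so nearly every token is predicted with close-to-full context. A sketch of the standard overlapping-window recipe; the window length and byte accounting are assumptions, since the record specifies only the stride:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, total_bytes, window=1024, stride=64):
    # tokens: 1-D LongTensor; total_bytes: byte length of the underlying text.
    # window=1024 is an assumption; the record specifies only stride=64.
    total_nll, prev_end = 0.0, 0
    T = tokens.size(0)
    for begin in range(0, T, stride):
        end = min(begin + window, T)
        ids = tokens[max(0, end - window):end].unsqueeze(0)
        logits = model(ids)                              # (1, L, vocab)
        logp = F.log_softmax(logits[0, :-1].float(), dim=-1)
        targets = ids[0, 1:]
        trg_len = min(end - prev_end, targets.size(0))   # score only new tokens
        total_nll -= logp[-trg_len:].gather(1, targets[-trg_len:, None]).sum().item()
        prev_end = end
        if end == T:
            break
    return total_nll / (math.log(2) * total_bytes)       # bits per byte
```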
Other
TARA
Test-Time Activation Re-Alignment (TARA), a training-free inference-time method that re-aligns final-layer activations against earlier hidden-state activations to improve predictions.
parameters: {"alpha":0.1,"beta":0.2,"candidate_layers":[0,1,2,3]}
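The record does not include code for TARA, so the following is a hedged reconstruction from the description and the contributions list: pick a premature layer from candidate_layers by cosine distance to the final activation, re-align the final activation with beta, then apply a contrastive logit adjustment with alpha. Function and tensor names are illustrative, and the exact update rule is an assumption:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tara_logits(hidden_states, lm_head, alpha=0.1, beta=0.2,
                candidate_layers=(0, 1, 2, 3)):
    # hidden_states: per-layer activations for the current token, each (dim,);
    # hidden_states[-1] is the final layer. One plausible reading of TARA;
    # the actual update rule is not specified in the record.
    final = hidden_states[-1]
    # Cosine-distance-based selection of the premature activation.
    dists = torch.stack([1 - F.cosine_similarity(hidden_states[l], final, dim=0)
                         for l in candidate_layers])
    prem = hidden_states[candidate_layers[int(dists.argmax())]]
    # Re-align the final activation against the premature one (beta), then
    # sharpen predictions with a contrastive adjustment in logit space (alpha).
    realigned = final - beta * prem
    return (1 + alpha) * lm_head(realigned) - alpha * lm_head(prem)
```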
Novel Contributions
- Introduces TARA, a training-free test-time activation realignment method.
- Improves validation BPB to 0.9693 without gradient steps or weight updates.
- Uses cosine-distance-based selection of premature activations from earlier layers.
- Applies a contrastive adjustment at inference time to sharpen predictions.
- Combines a compact GPT-like architecture with int6 quantization and zstd compression.