PR #1785
openRecord: SP4096 + byte-level PPM adaptive-λ mixture — val_bpb 1.01925 (3-seed)
by OE-GOD
val_bpb
1.01925
Architecture
Transformer
Optimizer
—
Artifact Size
15.96–15.98 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: model weights
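As a rough illustration of the `bits: 6` setting above, here is a per-channel round-to-nearest 6-bit quantizer. This is a stand-in, not GPTQ itself: GPTQ additionally compensates rounding error column-by-column using second-order (Hessian) statistics from calibration data, which this sketch omits.

```python
import torch

def quantize_6bit_rtn(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-output-channel symmetric 6-bit round-to-nearest quantization.

    Shows only the 6-bit grid implied by the record's `bits: 6`;
    GPTQ's error-compensated solver is not reproduced here.
    """
    qmax = 2 ** (6 - 1) - 1                                   # signed 6-bit grid: [-32, 31]
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                            # dequantize as q * scale
```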
Compression
lzma
level: null
Weight Averaging
EMA
parameters: null
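The EMA decay is unspecified (`parameters: null`); a minimal sketch of the weight-averaging update, with 0.999 as a placeholder decay rather than the submission's value:

```python
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.999) -> None:
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*online."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)
```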
Architecture
depth recurrence
Depth-recurrent Transformer stack inherited from the base submission.
parameters: null
LeakyReLU
Leaky ReLU activation used in the MLP.
parameters: null
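A minimal sketch of the two architecture entries above: one weight-tied Transformer block applied repeatedly over depth, with a LeakyReLU MLP. All hyperparameters (`d_model`, `n_heads`, `n_loops`, `negative_slope`) are illustrative assumptions; the record does not state the submission's values.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """One weight-tied Transformer block, applied `n_loops` times over depth."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_loops: int = 4):
        super().__init__()
        self.n_loops = n_loops
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                 # LeakyReLU MLP per the record
            nn.Linear(d_model, 4 * d_model),
            nn.LeakyReLU(negative_slope=0.01),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor,
                attn_mask: torch.Tensor | None = None) -> torch.Tensor:
        for _ in range(self.n_loops):             # depth recurrence: reuse the same weights
            h = self.ln1(x)
            x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
            x = x + self.mlp(self.ln2(x))
        return x
```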
Evaluation
sliding window eval
parameters: {"context_length":4096}
Test-Time Training
score-first TTT
parameters: {"method":"byte-level PPM mixture on already-scored val tokens"}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Other
other
Adaptive-λ mixture of a byte-level PPM-D predictor with the NN's per-token logprobs, combined in byte-probability space and gated by PPM confidence.
parameters: {"order":5,"lambda_high":0.9,"lambda_low":0.05,"threshold":0.9}
Novel Contributions
- Adds a byte-level order-5 PPM-D predictor at evaluation time.
- Mixes NN and PPM probabilities in byte-probability space.
- Uses an adaptive-λ gate based on PPM confidence to route rare-repeat bytes to PPM.
- Collects per-token logprobs across ranks and applies the mixture on the first 5M tokens.
- Compresses the training script with an lzma+base85 exec-stub to fit under the 16 MB artifact cap (a sketch follows after this list).
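A sketch of the exec-stub packing: the script is lzma-compressed, base85-encoded, and re-executed via `exec()` at load time. File names here are illustrative, not the submission's actual layout.

```python
import base64
import lzma
from pathlib import Path

# Pack a training script into a self-extracting stub.
source = Path("train.py").read_bytes()
payload = base64.b85encode(lzma.compress(source, preset=9)).decode("ascii")

stub = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({payload!r})).decode('utf-8'))\n"
)
Path("train_stub.py").write_text(stub)
```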