val_bpb: 1.1400
Architecture: Transformer
Optimizer: —
Artifact Size: 15,598,112 B
Training Techniques

Sequence Length
- sequence_length: {"train_length": 4096, "eval_length": null}
Architecture
- MLP3x: MLP widened to 3.25x (parameters: {"multiplier": 3.25})
- LeakyReLU: squared LeakyReLU activation (parameters: {"power": 2})
- weight tying: tied input and output embeddings (parameters: null)
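The squared LeakyReLU entry above can be read as a LeakyReLU followed by an elementwise power, with the listed power of 2. A minimal sketch of one plausible reading, assuming the activation is simply `leaky_relu(x)` raised to that power; the negative slope value is not given in the card, so the default below is hypothetical:

```python
def squared_leaky_relu(x: float, slope: float = 0.01, power: int = 2) -> float:
    """LeakyReLU followed by an elementwise power (power=2 per the card).

    Note: `slope` is an assumed default, and squaring makes the negative
    branch non-negative; the submission may handle the sign differently.
    """
    y = x if x > 0 else slope * x
    return y ** power
```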
Quantization
- late QAT: int8 quantization-aware training (bits: 8, scope: attn/KV)
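Late QAT means quantization-aware training is enabled late in the run: the forward pass simulates int8 rounding so the attention/KV weights adapt before export. A minimal fake-quantization sketch for a single value, assuming symmetric per-tensor int8 with a given scale (the card does not specify the scale or calibration scheme):

```python
def fake_quant_int8(x: float, scale: float) -> float:
    """Simulate int8 quantization: snap x to the nearest representable
    level q * scale with q clamped to [-128, 127], then dequantize."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale
```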
Evaluation
- sliding window eval (parameters: {"stride": 64})
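Sliding-window evaluation scores a long sequence with overlapping context windows: each step advances by the stride and scores only the tokens not yet covered, so every token is scored once with as much left context as the window allows. A minimal span-generation sketch; the card fixes only stride=64 (and eval_length is null above), so the window size here is a hypothetical default:

```python
def sliding_window_spans(n_tokens: int, window: int = 4096, stride: int = 64):
    """Return (begin, end, first_scored) triples: the model sees tokens
    [begin, end) as context, and only tokens [first_scored, end) are
    scored, so the scored ranges tile the sequence exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```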
Novel Contributions
- 3-seed locked submission with reported mean score
- Single recipe combining sp4096 training, widened MLP, squared LeakyReLU, late int8 QAT for attention/KV, and tied embeddings
- Submitted artifact is the seed 1339 run with byte audit under the 16 MB cap
- Uses sliding window evaluation with stride 64