val_bpb: 1.3081
Architecture: Transformer
Optimizer: —
Artifact Size: 12889311 bytes

Training Techniques

Quantization: GPTQ-lite (bits: 6, scope: all)
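The run records GPTQ-lite quantization at 6 bits over all weights. The error-compensating details of GPTQ-lite are not spelled out in this card, so the sketch below only illustrates the basic 6-bit round trip (symmetric round-to-nearest with a per-row scale) that any 6-bit scheme builds on; it is not the run's actual algorithm.

```python
# Illustrative 6-bit symmetric round-to-nearest quantization.
# GPTQ-lite (bits=6, scope=all) adds error-compensating updates on top of
# this; here we only show the basic 6-bit round trip on one weight row.

def quantize_6bit(weights):
    """Map floats to integers in [-31, 31] with a per-row scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0          # 6 bits signed -> levels -31..31
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.12, -0.95, 0.33, 0.0, 0.61]
q, scale = quantize_6bit(row)
recovered = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(row, recovered))
```

At 6 bits, each weight stores one of 63 levels, which is where the small artifact size comes from.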
Architecture

LeakyReLU: MLP activation changed from ReLU^2 to LeakyReLU(0.5)^2 (parameters: {"slope": 0.5})
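The activation change replaces ReLU(x)^2 with LeakyReLU(x)^2 at slope 0.5, so negative pre-activations contribute 0.25·x^2 instead of being zeroed. A minimal sketch of both, on plain Python floats:

```python
# ReLU^2 vs LeakyReLU(0.5)^2, the MLP activation change recorded above.

def relu_sq(x):
    """Original activation: squared ReLU."""
    return max(x, 0.0) ** 2

def leaky_relu_sq(x, slope=0.5):
    """New activation: squared LeakyReLU with negative slope 0.5."""
    return (x if x >= 0.0 else slope * x) ** 2

# Positive inputs are unchanged; negative ones now contribute slope^2 * x^2.
assert relu_sq(2.0) == leaky_relu_sq(2.0) == 4.0
assert relu_sq(-2.0) == 0.0
assert leaky_relu_sq(-2.0) == 1.0   # (0.5 * -2)^2
```

Note that squaring makes the negative branch non-monotonic: large-magnitude negative inputs produce large positive outputs, scaled down by slope^2 = 0.25.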
Partial RoPE: used in the attention stack (parameters: {"ratio": "16/64"})
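With ratio 16/64, a plausible reading is that only 16 of each head's 64 dimensions receive rotary position embeddings while the rest pass through unrotated; the exact split convention in the run's code is an assumption here.

```python
import math

# Partial RoPE sketch: rotate only the first 16 of a head's 64 dimensions
# (the 16/64 ratio recorded above); the remaining 48 are left untouched.
# Which dimensions get rotated, and the base, are assumptions.

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """x: one head's vector (len 64). Rotates dims [0, rot_dims) in pairs."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

vec = [1.0] * 64
rotated = partial_rope(vec, pos=3)
# Dims beyond the rotary slice are untouched; rotation preserves pair norms.
assert rotated[16:] == vec[16:]
assert abs(math.hypot(rotated[0], rotated[1]) - math.sqrt(2.0)) < 1e-9
```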
XSA: enabled on the last 4 layers (parameters: {"layers": 4})
Weight tying: tied input/output embedding weights (parameters: null)
MLP3x: uses a 3x-MLP stack (parameters: null)
VE128: value residual enabled on layers 9 and 10 (parameters: {"layers": [9, 10], "dimensions": 128})
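The card does not spell out the VE128 formulation or the role of dimensions=128. One common "value residual" formulation (ResFormer-style) blends an early layer's attention values into selected later layers; the sketch below uses that reading with a fixed mixing weight, purely as an illustration.

```python
# Illustrative value-residual mixing on layers 9 and 10. The run's exact
# rule, mixing weight, and use of the 128 dimensions are assumptions.

VALUE_RESIDUAL_LAYERS = {9, 10}

def mix_values(v_layer, v_first, layer_idx, lam=0.5):
    """Blend current-layer values with first-layer values on selected layers."""
    if layer_idx not in VALUE_RESIDUAL_LAYERS:
        return v_layer
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]

v1 = [0.0] * 128          # first-layer values (the recorded 128 dimensions)
v9 = [2.0] * 128          # layer-9 values before mixing
assert mix_values(v9, v1, layer_idx=9) == [1.0] * 128   # blended
assert mix_values(v9, v1, layer_idx=5) == v9            # untouched layer
```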
Weight Averaging

EMA (parameters: null)
SWA (parameters: null)
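The run records both EMA and SWA weight averaging with no parameters. Minimal sketches of the two update rules on flat weight lists; the decay value is an illustrative assumption, not taken from the run.

```python
# EMA vs SWA weight averaging, the two techniques recorded above.

def ema_update(avg, weights, decay=0.999):
    """Exponential moving average: recent weights dominate (decay assumed)."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_averaged):
    """Stochastic weight averaging: running equal-weight mean of snapshots."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]

avg = [0.0, 0.0]
avg = swa_update(avg, [2.0, 4.0], n_averaged=0)   # mean of one snapshot
avg = swa_update(avg, [4.0, 8.0], n_averaged=1)   # mean of two snapshots
assert avg == [3.0, 6.0]
```

EMA tracks the trailing trajectory continuously, while SWA averages discrete snapshots with equal weight; either can be used to produce the evaluated checkpoint.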
LR Schedule

warmdown (parameters: {"warmdown_steps": 3500})
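A "warmdown" schedule with warmdown_steps=3500 typically means holding the learning rate constant and then decaying it linearly to zero over the final 3500 steps. The base LR and total step count below are illustrative assumptions; only the 3500-step warmdown length comes from the card.

```python
# Constant LR followed by a linear warmdown over the last 3500 steps.

WARMDOWN_STEPS = 3500   # from the recorded parameters

def lr_at(step, total_steps, base_lr=1.0):
    """Constant LR, then linear decay to 0 over the last WARMDOWN_STEPS."""
    remaining = total_steps - step
    if remaining >= WARMDOWN_STEPS:
        return base_lr
    return base_lr * remaining / WARMDOWN_STEPS

assert lr_at(0, total_steps=10000) == 1.0              # flat phase
assert lr_at(10000 - 1750, total_steps=10000) == 0.5   # halfway down
assert lr_at(10000, total_steps=10000) == 0.0          # end of training
```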
Evaluation

Sliding window eval (parameters: null)
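Sliding-window evaluation re-reads overlapping context so that every scored token sees up to eval_length=1024 tokens of history, rather than resetting context at fixed 1024-token boundaries. The stride and the exact scheme the script prints are assumptions; the sketch only shows how windows can tile a sequence so each token is scored exactly once.

```python
# Sliding-window span generation: each window scores only the tokens past
# its overlap with the previous window. Window=1024 matches eval_length;
# the stride of 512 is an illustrative assumption.

def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (start, end, score_from): score only tokens past the overlap."""
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + (window - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(2048)
# Every token is scored exactly once across windows.
scored = [t for (s, e, f) in spans for t in range(f, e)]
assert scored == list(range(2048))
```

This is why the script prints an extra evaluation line: the sliding-window number is generally lower (better) than the fixed-window one, since scored tokens get more context.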
Compression

zstd (level: null)

Sequence Length

train_length: 1024
eval_length: 1024
Novel Contributions
- Non-record 16MB submission folder for the v1 1xH100 screening run
- LeakyReLU(0.5)^2 MLP activation change ported from a March 23 record
- Explicit conservative roundtrip metric recorded in submission.json instead of the legacy trailing label
- Included the exact train log, submission metadata, and code snapshot used for the run
- Documented the additional sliding-window evaluation line printed by the script