val_bpb: 0.9417
Architecture: Transformer
Optimizer: —
Artifact Size: 15,868,157 bytes
Training Techniques
Architecture
XSA
XSA is active on all layers in the Scylla attention/eval path.
parameters: {"layers":11}
weight tying
The token embedding and output (LM head) projection share one weight matrix.
parameters: null
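A minimal sketch of what weight tying looks like in PyTorch; the module and dimension names are illustrative, not the submission's actual code.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy LM showing tied embeddings (illustrative only)."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the embedding matrix,
        # so only one (vocab_size x d_model) tensor is serialized.
        self.lm_head.weight = self.embed.weight
```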
depth recurrence
Layers 3-5 are reused as virtual layers after 35% of training, adding effective depth without increasing the serialized parameter count.
parameters: {"layers":[3,4,5],"enable_after_training_frac":0.35}
BigramHash
The bigram embedding dimension is reduced to 40 to create artifact-size headroom while retaining the quality gains.
parameters: {"dimensions":40}
Quantization
GPTQ
bits: 6
scope: all
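The submission applies GPTQ at 6 bits across all layers. The snippet below is only a simplified per-group round-to-nearest 6-bit quantizer to illustrate the storage format a 6-bit scheme targets; the actual GPTQ algorithm additionally compensates quantization error column by column using second-order (Hessian) information.

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor, group_size: int = 128):
    """Simplified round-to-nearest 6-bit quantization (not GPTQ itself).

    Assumes w has shape (out_features, in_features) with in_features divisible
    by group_size; returns integer codes in [0, 63] plus per-group scale/zero.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 63.0  # 2**6 - 1 quantization levels
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 63).to(torch.uint8)
    return q, scale, zero
```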
Compression
lzma
level: null
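Final artifact compression uses LZMA; a minimal sketch with Python's standard-library `lzma` module. The file names are placeholders, and `level: null` is taken to mean the library's default preset.

```python
import lzma

# Compress the serialized model artifact with the default LZMA preset.
with open("model_artifact.bin", "rb") as f:
    raw = f.read()
compressed = lzma.compress(raw)  # preset left at the library default
with open("model_artifact.bin.xz", "wb") as f:
    f.write(compressed)
print(f"{len(raw):,} -> {len(compressed):,} bytes")
```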
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
The QK gain is initialized to 5.25.
parameters: {"qk_gain_init":5.25}
other
Training runs on 8x H100 SXM GPUs and stops on a wallclock limit of about 10 minutes (600 seconds).
parameters: {"gpus":8,"hardware":"H100 SXM","time_limit_seconds":600}
Novel Contributions
- Scylla tokenizer and data path with correct HF tokenizer metadata
- QK-Gain 5.25 configuration
- 3-layer depth recurrence over layers 3-5
- Reduced bigram dimension to 40 to fit within the artifact cap
- Submission artifact fully quantized with GPTQ int6 and compressed with LZMA
- No test-time training (TTT) or eval-time adaptation