PR #1015 (open)
Parameter Golf submission: Vocab768_LinearPhaseInit_GatedXSA_EMA_…
by shram86
val_bpb: 1.2115
Architecture: Transformer
Optimizer: —
Artifact Size: 15,082,805 bytes
Training Techniques
Architecture: XSA
Gated XSA applied to the last 2 layers of the Transformer.
parameters: {"layers":2,"mode":"gated"}
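The PR does not define XSA, but the "gated" mode suggests the common pattern of a learned sigmoid gate that blends the attention sublayer's output with its input instead of a plain additive residual. A minimal sketch of that gating pattern (all names and shapes here are illustrative, not taken from the submission):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_output(x, attn_out, gate_w, gate_b):
    """Blend an attention sublayer's output with its input via a learned gate.

    g = sigmoid(x @ gate_w + gate_b) is computed per position and channel;
    the sublayer returns g * attn_out + (1 - g) * x in place of the usual
    x + attn_out residual, letting the model modulate how much attention
    output enters the stream.
    """
    g = sigmoid(x @ gate_w + gate_b)
    return g * attn_out + (1.0 - g) * x

# Toy shapes: 4 positions, model dim 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
attn_out = rng.standard_normal((4, 8))
gate_w = rng.standard_normal((8, 8)) * 0.02
gate_b = np.zeros(8)
y = gated_attention_output(x, attn_out, gate_w, gate_b)
print(y.shape)  # (4, 8)
```

Because the gate is in (0, 1), each output element is a convex combination of the input and the attention output.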
Value Residual
Value bias/residual applied to the last 2 layers.
parameters: {"layers":2,"dimension":128}
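Value residual typically means mixing a later layer's value vectors with those of an earlier (often the first) layer, so late layers keep direct access to early-layer values. A sketch under that assumption; the fixed mixing weight and the use of the first layer's values are illustrative, not confirmed by the PR:

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix the current layer's value vectors with the first layer's.

    v' = lam * v_layer + (1 - lam) * v_first. In practice lam may be a
    learned scalar per layer; a fixed 0.5 is used here for illustration.
    """
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(1)
v_first = rng.standard_normal((4, 128))  # dimension 128 as in the PR params
v_layer = rng.standard_normal((4, 128))
v_mixed = value_residual(v_layer, v_first)
print(v_mixed.shape)  # (4, 128)
```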
Quantization: late QAT (quantization-aware training)
bits: null
scope: matrix_only
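QAT usually means running the forward pass through fake-quantized weights so training absorbs the quantization error, and "matrix_only" scope suggests quantizing only 2-D weight matrices while leaving biases and norm gains in full precision. A sketch of that scheme; the PR leaves bits null, so the 8-bit width below is an assumed placeholder:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Symmetric per-tensor fake quantization: round to a signed integer
    grid, then dequantize, so the forward pass sees quantization error
    (gradients would flow via a straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    return np.round(w / scale) * scale

def qat_forward_params(params, bits=8):
    """matrix_only scope: fake-quantize 2-D weight matrices only; vectors
    (biases, norm gains) stay in full precision."""
    return {name: fake_quantize(w, bits) if w.ndim == 2 else w
            for name, w in params.items()}

rng = np.random.default_rng(2)
params = {"w_out": rng.standard_normal((8, 8)), "bias": rng.standard_normal(8)}
q = qat_forward_params(params)
print(np.allclose(q["bias"], params["bias"]))  # True: vectors untouched
```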
Weight Averaging: EMA (exponential moving average)
parameters: {"start_step":9094}
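An EMA of the weights maintains a shadow copy updated as a decayed average of the training weights; starting it late (here at step 9094) avoids averaging in early, noisy checkpoints. A minimal sketch; the decay value and loop structure are illustrative:

```python
def ema_update(shadow, params, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    The shadow copy, not the raw weights, is used for evaluation."""
    return {k: decay * shadow[k] + (1.0 - decay) * params[k]
            for k in params}

START_STEP = 9094  # start_step from the PR parameters
shadow = None
params = {"w": 1.0}
for step in range(9090, 9100):
    params["w"] += 0.01            # stand-in for an optimizer update
    if step == START_STEP:
        shadow = dict(params)      # initialize shadow from current weights
    elif step > START_STEP:
        shadow = ema_update(shadow, params)
print(shadow is not None)  # True
```

After the start step the shadow weights trail the raw weights, smoothing out step-to-step noise.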
Initialization: phase-mix init
Linear phase-mix initialization.
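The PR does not document what linear phase-mix initialization does. One purely speculative reading, sketched here only to make the term concrete, is an embedding init where each row is built from sinusoids whose phase varies linearly with the token id, plus a small random component. Every detail below is an assumption, not the submission's actual initializer:

```python
import numpy as np

def phase_mix_init(vocab_size, dim, seed=0):
    """Speculative sketch of a 'linear phase-mix' embedding init:
    sinusoids with linearly spaced phases per token id, lightly mixed
    with Gaussian noise. Not taken from the PR's code."""
    rng = np.random.default_rng(seed)
    phases = np.linspace(0.0, 2 * np.pi, vocab_size, endpoint=False)
    freqs = np.arange(1, dim + 1)
    base = np.sin(np.outer(phases, freqs))           # (vocab, dim)
    noise = rng.standard_normal((vocab_size, dim)) * 0.02
    return base / np.sqrt(dim) + noise

emb = phase_mix_init(768, 128)  # vocab size 768 as in the PR title
print(emb.shape)  # (768, 128)
```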
Sequence Length
train_length: 1024
eval_length: null
Other
FlashAttention 3 backend used for attention computation.
parameters: {"backend":"flash_attn_3"}
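FlashAttention 3 is a fused kernel that computes exact scaled dot-product attention without materializing the full score matrix. For reference, the mathematical result it produces (this numpy version is only the spec, not the kernel):

```python
import numpy as np

def attention_reference(q, k, v, causal=True):
    """Reference scaled dot-product attention for one head.
    FlashAttention 3 computes the same result in a tiled, fused kernel
    that never builds the (T, T) score matrix in memory."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if causal:
        mask = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(3)
q, k, v = (rng.standard_normal((5, 16)) for _ in range(3))
out = attention_reference(q, k, v)
print(out.shape)  # (5, 16)
```

With causal masking, position 0 attends only to itself, so the first output row equals the first value row exactly.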
Novel Contributions
- Custom sp768 tokenizer export with vocab size 768
- Linear phase-mix initialization
- Gated XSA on the last 2 layers
- EMA during late training
- Late matrix-only QAT
- FlashAttention 3 backend
- Tokenizer and dataset export published to Hugging Face and loaded via patched manifest-driven loader