PR #1015 (open)

Add Parameter Golf submission: Vocab768_LinearPhaseInit_GatedXSA_EMA_…

by shram86
val_bpb: 1.2115
Architecture: Transformer
Optimizer:
Artifact Size: 15,082,805 bytes

Training Techniques

Architecture: XSA
Gated XSA applied to the last 2 layers of the Transformer.
parameters: {"layers":2,"mode":"gated"}
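The PR doesn't spell out the gating form, but one plausible reading of "gated" is a learned sigmoid gate scaling the extra attention branch before it enters the residual stream. A minimal sketch under that assumption (all names hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_residual(x, branch_out, gate_logit):
    # y = x + sigmoid(g) * branch(x): the gate lets the last layers
    # blend the XSA branch into the residual stream smoothly.
    # Sketch only; the actual gating in the submission may differ.
    g = sigmoid(gate_logit)
    return [xi + g * bi for xi, bi in zip(x, branch_out)]
```

With the gate logit initialized to a large negative value, the branch starts effectively disabled and can be learned in gradually.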
Value Residual
Value bias/residual applied to the last 2 layers.
parameters: {"layers":2,"dimension":128}
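A common form of value residual mixes each layer's value vectors with those of the first layer. A minimal sketch, assuming a scalar mixing weight (in practice this weight is often learned per layer; the PR doesn't say which variant it uses):

```python
def value_residual(v_layer, v_first, lam=0.5):
    # Blend the current layer's value vectors with layer 1's values:
    # v = lam * v_layer + (1 - lam) * v_first.
    # lam=0.5 is a hypothetical fixed weight for illustration.
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]
```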
Quantization: late QAT
bits: null
scope: matrix_only
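Matrix-only late QAT means fake-quantizing just the 2-D weight matrices near the end of training, so the forward pass sees quantization error while the master weights stay in float. A sketch of symmetric per-tensor fake quantization (the bit width here is an assumption, since the PR lists bits: null):

```python
def fake_quantize(w, bits=8):
    # Quantize then immediately dequantize: the returned weights are
    # float, but snapped to the representable grid for `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in w)
    if amax == 0.0:
        return list(w)
    scale = amax / qmax
    return [round(x / scale) * scale for x in w]
```

Restricting the scope to matrices leaves norms, biases, and embeddings in full precision, which is where quantization error tends to hurt most.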
Weight Averaging: EMA
parameters: {"start_step":9094}
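Per the parameters, the EMA of the weights is only accumulated from step 9094 onward. A minimal sketch of that step-gated update (the decay value is a hypothetical default, not from the PR):

```python
def ema_step(avg, params, step, start_step=9094, decay=0.999):
    # Before start_step the "average" just tracks the live weights;
    # from start_step onward it becomes an exponential moving average.
    if step < start_step:
        return list(params)
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

Starting the EMA late keeps the average from being dragged toward early, still-changing weights.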
Initialization: phase-mix init
Linear phase-mix initialization.
Sequence Length
train_length: 1024
eval_length: null
Other
FlashAttention 3 backend used for attention computation.
parameters: {"backend":"flash_attn_3"}

Novel Contributions

  • Custom sp768 tokenizer export with vocab size 768
  • Linear phase-mix initialization
  • Gated XSA on the last 2 layers
  • EMA during late training
  • Late matrix-only QAT
  • FlashAttention 3 backend
  • Tokenizer and dataset export published to Hugging Face and loaded via a patched manifest-driven loader
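The last item suggests the loader resolves the published artifacts through a manifest file. A minimal sketch of such a loader, assuming a simple JSON schema (the field names and the validation are illustrative; the PR's real manifest format isn't shown):

```python
import json

def load_manifest(path):
    # Read a manifest describing the published tokenizer and data shards.
    # "tokenizer" / "vocab_size" / "shards" are assumed field names.
    with open(path) as f:
        m = json.load(f)
    if m["tokenizer"]["vocab_size"] != 768:
        raise ValueError("expected the sp768 tokenizer (vocab size 768)")
    return m["shards"]
```

Driving the loader from a manifest keeps the training code unchanged when the tokenizer or dataset export is republished.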