PR #636
Add non-record 10min submission: 11L XSA4 + EMA + GPTQ + FA3 (1.12336724)
by NewyorkDevView on GitHub
val_bpb: 1.1234
Architecture: Transformer
Optimizer: Muon + Adam-style groups
Artifact Size: 15,853,809 bytes
Training Techniques
Quantization
- GPTQ: bits 6, scope all
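GPTQ itself adds Hessian-aware error compensation while quantizing each weight column; the int6 grid the export lands on can still be sketched on its own. A minimal, hypothetical round-to-nearest symmetric 6-bit quantizer for one weight group (function name and group handling are assumptions, not the PR's code):

```python
def quantize_int6(weights, eps=1e-12):
    """Symmetric round-to-nearest 6-bit quantization of one weight group.

    GPTQ proper layers Hessian-based error compensation on top of this;
    shown here is only the int6 grid itself (signed levels in [-31, 31]).
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 for 6-bit signed
    scale = max(abs(w) for w in weights) / qmax or eps
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dq = [qi * scale for qi in q]                # dequantized values
    return q, scale, dq
```

At 6 bits the worst-case rounding error per weight is half a quantization step (scale / 2), which is what makes a late QAT phase before export attractive.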
Architecture
- XSA: cross-layer self-attention on the last 4 layers; parameters: {"layers":4}
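The PR does not spell out what XSA computes. One plausible reading, sketched below purely as an assumption: in the last 4 layers, queries come from the current layer's hidden states while keys and values are reused from an earlier layer's states. Pure-Python single-head attention keeps the sketch self-contained:

```python
import math

def attention(q, keys, values):
    """One query attending over keys/values via scaled dot-product (pure Python)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # stable softmax
    z = sum(exps)
    w = [e / z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def xsa_layer(h_current, h_earlier):
    """Hypothetical cross-layer self-attention: queries from the current
    layer's hidden states, keys/values from an earlier layer's states."""
    return [attention(q, h_earlier, h_earlier) for q in h_current]
```

Because the outputs are convex combinations of the earlier layer's states, this reading lets late layers re-read earlier representations without new KV projections per layer.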
- SmearGate: token mixing, combined with BigramHash and tied embeddings (no parameters recorded)
- BigramHash: token mixing via a hashed bigram embedding table; parameters: {"vocab_size":2048,"dim":128}
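A hashed bigram table lets a small model condition on local token pairs without a full vocab-squared table: each (previous, current) pair is hashed into one of 2048 buckets, each holding a 128-dim embedding. Table size and dim come from the PR's parameters; the hash constants and BOS handling below are hypothetical:

```python
def bigram_hash_id(prev_token, cur_token, table_size=2048):
    """Map a (prev, cur) token bigram to a bucket in a small embedding
    table. The multiply-xor mixing constants here are hypothetical."""
    return ((prev_token * 1000003) ^ (cur_token * 998244353)) % table_size

def bigram_ids(tokens, table_size=2048, bos=0):
    """Bucket index for every position; position 0 pairs with a BOS id."""
    prev = [bos] + list(tokens[:-1])
    return [bigram_hash_id(p, c, table_size) for p, c in zip(prev, tokens)]
```

Hash collisions are accepted by design: with only 2048 buckets the table costs 2048 × 128 floats, a rounding error next to the main embedding matrix.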
- Tied embeddings: input and output embedding weights are shared (no parameters recorded)
- VE: late-layer vector embeddings enabled on layers 9 and 10; parameters: {"layers":[9,10],"dim":128}
- MLP: 3x expansion; parameters: {"expansion":3}
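A 3x expansion means the MLP hidden width is 3 × d_model. A worked parameter count, assuming the 512-dim model mentioned under Novel Contributions and a plain two-matrix MLP without biases (both assumptions):

```python
def mlp_params(d_model, expansion=3, bias=False):
    """Parameter count of a two-matrix MLP block:
    up-projection d -> e*d, down-projection e*d -> d."""
    hidden = expansion * d_model
    params = d_model * hidden + hidden * d_model
    if bias:
        params += hidden + d_model
    return params
```

At d_model = 512 this gives 2 × 512 × 1536 = 1,572,864 parameters per MLP block, which dominates the per-layer budget in a model this size.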
Optimizer
- Muon: weight_decay 0.04, momentum 0.99; momentum warmup from 0.92 over 1500 steps ({"momentum_warmup_start":0.92,"momentum_warmup_steps":1500})
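The recorded settings give the warmup endpoints (0.92 to 0.99) and duration (1500 steps) but not the schedule shape; a linear ramp, assumed here, is the simplest consistent reading:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup matching the PR's recorded Muon settings:
    0.92 at step 0, ramping to 0.99 over 1500 steps.
    The linear shape is an assumption; the PR only records endpoints."""
    t = min(max(step, 0), warmup_steps) / warmup_steps
    return start + (end - start) * t
```

Starting momentum low and ramping it up is a common way to keep early orthogonalized updates from overshooting while gradients are still noisy.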
- Adam-style groups: weight_decay 0.04 (momentum and other parameters not recorded)
Weight Averaging
- EMA (no parameters recorded)
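The PR records no EMA decay, so the 0.999 below is a hypothetical value; the mechanism itself is standard, with a shadow copy of the weights updated each step and used for export:

```python
class EMA:
    """Exponential moving average of model weights (flat list of floats).
    The decay of 0.999 is hypothetical; the PR records no EMA parameters."""

    def __init__(self, weights, decay=0.999):
        self.decay = decay
        self.shadow = list(weights)   # averaged copy, used at export time

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w for s, w in zip(self.shadow, weights)]
        return self.shadow
```

Exporting the EMA weights rather than the last raw step typically smooths out late-training noise, which matters when the very next step is a lossy int6 export.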
Compression
- zstd (level not recorded)
Evaluation
- Sliding-window eval; parameters: {"exact":true}
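Reading {"exact": true} as stride-1 sliding-window evaluation (an assumption): every token past the first window is scored with a full window of left context, rather than the cheaper strided variant where tokens early in each chunk see a truncated context. The context range per position:

```python
def exact_eval_windows(n_tokens, window=2048):
    """Left-context range used to score each position under exact
    (stride-1) sliding-window evaluation: token t is scored given
    tokens [max(0, t - window), t)."""
    return [(max(0, t - window), t) for t in range(n_tokens)]
```

Exact evaluation is slower (one full-window forward pass per position in the naive form) but removes the context-truncation bias from the reported val_bpb.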
Sequence Length
- train_length 2048, eval_length 2048
Other
- Late QAT trigger before the full GPTQ int6 export; parameters: {"late_qat_threshold":0.15}
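The PR does not define what the 0.15 threshold gates. One plausible reading, sketched as an assumption: quantization-aware training switches on once the remaining fraction of the run drops below the threshold, so the final ~15% of steps train against the int6 grid before export:

```python
def qat_active(step, total_steps, late_qat_threshold=0.15):
    """Hypothetical reading of the late-QAT trigger: enable
    quantization-aware training once the remaining fraction of
    training drops below the threshold (0.15 in the PR)."""
    remaining = (total_steps - step) / total_steps
    return remaining < late_qat_threshold
```

Deferring QAT this way keeps most of the run in full precision, paying the fake-quantization overhead only where it helps the exported int6 weights.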
- FlashAttention 3 kernel on Hopper hardware, with PyTorch SDPA fallback
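The run presumably guards the FlashAttention 3 import and the GPU check at startup; the decision logic can be isolated as plain code (function and flag names are hypothetical, not the PR's):

```python
def select_attention_backend(fa3_available, is_hopper):
    """Backend selection sketch: use the FlashAttention 3 kernel only
    when the package imports AND the GPU is Hopper-class; otherwise
    fall back to PyTorch's scaled_dot_product_attention (SDPA)."""
    if fa3_available and is_hopper:
        return "flash_attention_3"
    return "torch_sdpa"
```

Keeping SDPA as the fallback means the run still works (more slowly) on non-Hopper hardware or when the FA3 kernel is missing, which matters for reproducing the log.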
Novel Contributions
- Combination of an 11-layer, 512-dim GQA model with 2048-token training and tied embeddings
- Use of BigramHash + SmearGate token mixing
- Cross-layer self-attention (XSA) on the last 4 layers
- Late-layer vector embedding (VE) enabled on layers 9 and 10
- EMA applied before export
- Late QAT trigger followed by full GPTQ int6 quantization
- Use of FlashAttention 3 kernel on Hopper hardware with fallback to PyTorch SDPA
- Submitted as a fully preserved single-run official log, with no multi-seed statistical-significance claim