PR #636
Add non-record 10min submission: 11L XSA4 + EMA + GPTQ + FA3 (1.12336724)
by NewyorkDevView on GitHub
val_bpb: 1.1234
Architecture: Transformer
Optimizer: Muon + Adam-style groups
Artifact Size: 15,853,809 bytes
Training Techniques
Quantization
- GPTQ: bits 6, scope all
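GPTQ itself adds Hessian-aware error compensation while quantizing each weight column; the int6 grid the export lands on can still be sketched on its own. A minimal, hypothetical round-to-nearest symmetric 6-bit quantizer for one weight group (function name and group handling are assumptions, not the PR's code):

```python
def quantize_int6(weights, eps=1e-12):
    """Symmetric round-to-nearest 6-bit quantization of one weight group.

    GPTQ proper layers Hessian-based error compensation on top of this;
    shown here is only the int6 grid itself (signed levels in [-31, 31]).
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 for 6-bit signed
    scale = max(abs(w) for w in weights) / qmax or eps
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dq = [qi * scale for qi in q]                # dequantized values
    return q, scale, dq
```

At 6 bits the worst-case rounding error per weight is half a quantization step (scale / 2), which is what makes a late QAT phase before export attractive.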
Architecture
- XSA: cross-layer self-attention on the last 4 layers; parameters: {"layers":4}
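The PR does not spell out what XSA computes. One plausible reading, sketched below purely as an assumption: in the last 4 layers, queries come from the current layer's hidden states while keys and values are reused from an earlier layer's states. Pure-Python single-head attention keeps the sketch self-contained:

```python
import math

def attention(q, keys, values):
    """One query attending over keys/values via scaled dot-product (pure Python)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # stable softmax
    z = sum(exps)
    w = [e / z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

def xsa_layer(h_current, h_earlier):
    """Hypothetical cross-layer self-attention: queries from the current
    layer's hidden states, keys/values from an earlier layer's states."""
    return [attention(q, h_earlier, h_earlier) for q in h_current]
```

Because the outputs are convex combinations of the earlier layer's states, this reading lets late layers re-read earlier representations without new KV projections per layer.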
- SmearGate: token mixing, combined with BigramHash and tied embeddings (no parameters recorded)
- BigramHash: token mixing via a hashed bigram embedding table; parameters: {"vocab_size":2048,"dim":128}
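A hashed bigram table lets a small model condition on local token pairs without a full vocab-squared table: each (previous, current) pair is hashed into one of 2048 buckets, each holding a 128-dim embedding. Table size and dim come from the PR's parameters; the hash constants and BOS handling below are hypothetical:

```python
def bigram_hash_id(prev_token, cur_token, table_size=2048):
    """Map a (prev, cur) token bigram to a bucket in a small embedding
    table. The multiply-xor mixing constants here are hypothetical."""
    return ((prev_token * 1000003) ^ (cur_token * 998244353)) % table_size

def bigram_ids(tokens, table_size=2048, bos=0):
    """Bucket index for every position; position 0 pairs with a BOS id."""
    prev = [bos] + list(tokens[:-1])
    return [bigram_hash_id(p, c, table_size) for p, c in zip(prev, tokens)]
```

Hash collisions are accepted by design: with only 2048 buckets the table costs 2048 × 128 floats, a rounding error next to the main embedding matrix.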
- Tied embeddings: input and output embedding weights are shared (no parameters recorded)
- VE: late-layer vector embeddings enabled on layers 9 and 10; parameters: {"layers":[9,10],"dim":128}
- MLP: 3x expansion; parameters: {"expansion":3}
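A 3x expansion means the MLP hidden width is 3 × d_model. A worked parameter count, assuming the 512-dim model mentioned under Novel Contributions and a plain two-matrix MLP without biases (both assumptions):

```python
def mlp_params(d_model, expansion=3, bias=False):
    """Parameter count of a two-matrix MLP block:
    up-projection d -> e*d, down-projection e*d -> d."""
    hidden = expansion * d_model
    params = d_model * hidden + hidden * d_model
    if bias:
        params += hidden + d_model
    return params
```

At d_model = 512 this gives 2 × 512 × 1536 = 1,572,864 parameters per MLP block, which dominates the per-layer budget in a model this size.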
Optimizer
- Muon: weight_decay 0.04, momentum 0.99; momentum warmup from 0.92 over 1500 steps ({"momentum_warmup_start":0.92,"momentum_warmup_steps":1500})
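The recorded settings give the warmup endpoints (0.92 to 0.99) and duration (1500 steps) but not the schedule shape; a linear ramp, assumed here, is the simplest consistent reading:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup matching the PR's recorded Muon settings:
    0.92 at step 0, ramping to 0.99 over 1500 steps.
    The linear shape is an assumption; the PR only records endpoints."""
    t = min(max(step, 0), warmup_steps) / warmup_steps
    return start + (end - start) * t
```

Starting momentum low and ramping it up is a common way to keep early orthogonalized updates from overshooting while gradients are still noisy.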
- Adam-style groups: weight_decay 0.04 (momentum and other parameters not recorded)
Weight Averaging
- EMA (no parameters recorded)
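The PR records no EMA decay, so the 0.999 below is a hypothetical value; the mechanism itself is standard, with a shadow copy of the weights updated each step and used for export:

```python
class EMA:
    """Exponential moving average of model weights (flat list of floats).
    The decay of 0.999 is hypothetical; the PR records no EMA parameters."""

    def __init__(self, weights, decay=0.999):
        self.decay = decay
        self.shadow = list(weights)   # averaged copy, used at export time

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w for s, w in zip(self.shadow, weights)]
        return self.shadow
```

Exporting the EMA weights rather than the last raw step typically smooths out late-training noise, which matters when the very next step is a lossy int6 export.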
Compression
- zstd (level not recorded)
Evaluation
- Sliding-window eval; parameters: {"exact":true}
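Reading {"exact": true} as stride-1 sliding-window evaluation (an assumption): every token past the first window is scored with a full window of left context, rather than the cheaper strided variant where tokens early in each chunk see a truncated context. The context range per position:

```python
def exact_eval_windows(n_tokens, window=2048):
    """Left-context range used to score each position under exact
    (stride-1) sliding-window evaluation: token t is scored given
    tokens [max(0, t - window), t)."""
    return [(max(0, t - window), t) for t in range(n_tokens)]
```

Exact evaluation is slower (one full-window forward pass per position in the naive form) but removes the context-truncation bias from the reported val_bpb.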
Sequence Length
- train_length 2048, eval_length 2048
Other
- Late QAT trigger before the full GPTQ int6 export; parameters: {"late_qat_threshold":0.15}
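The PR does not define what the 0.15 threshold gates. One plausible reading, sketched as an assumption: quantization-aware training switches on once the remaining fraction of the run drops below the threshold, so the final ~15% of steps train against the int6 grid before export:

```python
def qat_active(step, total_steps, late_qat_threshold=0.15):
    """Hypothetical reading of the late-QAT trigger: enable
    quantization-aware training once the remaining fraction of
    training drops below the threshold (0.15 in the PR)."""
    remaining = (total_steps - step) / total_steps
    return remaining < late_qat_threshold
```

Deferring QAT this way keeps most of the run in full precision, paying the fake-quantization overhead only where it helps the exported int6 weights.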
- FlashAttention 3 kernel on Hopper hardware, with PyTorch SDPA fallback
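The run presumably guards the FlashAttention 3 import and the GPU check at startup; the decision logic can be isolated as plain code (function and flag names are hypothetical, not the PR's):

```python
def select_attention_backend(fa3_available, is_hopper):
    """Backend selection sketch: use the FlashAttention 3 kernel only
    when the package imports AND the GPU is Hopper-class; otherwise
    fall back to PyTorch's scaled_dot_product_attention (SDPA)."""
    if fa3_available and is_hopper:
        return "flash_attention_3"
    return "torch_sdpa"
```

Keeping SDPA as the fallback means the run still works (more slowly) on non-Hopper hardware or when the FA3 kernel is missing, which matters for reproducing the log.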
Novel Contributions
- Combination of an 11-layer, 512-dim GQA model with 2048-token training and tied embeddings
- Use of BigramHash + SmearGate token mixing
- Cross-layer self-attention (XSA) on the last 4 layers
- Late-layer vector embedding (VE) enabled on layers 9 and 10
- EMA applied before export
- Late QAT trigger followed by full GPTQ int6 quantization
- Use of FlashAttention 3 kernel on Hopper hardware with fallback to PyTorch SDPA
- Submitted as a fully preserved single-run official log, with no multi-seed statistical-significance claim