PR #1361
Status: open
1.1220 bpb: GPTQ + EMA + XSA-all + BigramHash3072 (11L 512dim)
by jorge-asenjo
val_bpb: 1.1220
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.1 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
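The weight-tying entry can be sketched as one embedding matrix serving both the input lookup and the output head (a minimal NumPy sketch with illustrative sizes; the real model uses 512-dim embeddings inside a trained Transformer):

```python
import numpy as np

# One shared table used for both the input embedding and the output head.
rng = np.random.default_rng(0)
vocab, dim = 1024, 512
W_emb = rng.normal(size=(vocab, dim)).astype(np.float32)

def embed(tokens):
    return W_emb[tokens]          # input side: row lookup

def unembed(h):
    return h @ W_emb.T            # output side: reuse of the same matrix

logits = unembed(embed(np.array([1, 2, 3])))
```

Tying halves the embedding parameter count, which matters directly for the 15.1 MB artifact budget.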
BigramHash
Token-pair hash embeddings for richer input representation.
parameters: {"buckets":3072,"dims":112}
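A hedged sketch of the BigramHash idea: hash each (previous, current) token pair into one of 3072 buckets and look up a 112-dim embedding. The hash mixing constants and how the feature is combined with the token embedding are assumptions, not the PR's actual scheme:

```python
import numpy as np

BUCKETS, DIMS = 3072, 112   # from the PR's parameters
rng = np.random.default_rng(0)
bigram_table = rng.normal(size=(BUCKETS, DIMS)).astype(np.float32)

def bigram_features(tokens):
    # Hash each (previous, current) pair into a bucket; constants are illustrative.
    prev = np.concatenate(([0], tokens[:-1]))       # pad position 0
    buckets = (prev * 1000003 + tokens) % BUCKETS
    return bigram_table[buckets]                    # (seq_len, DIMS)

feats = bigram_features(np.array([10, 11, 12]))
```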
SmearGate
Learned token-level blending with previous position.
parameters: null
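SmearGate's "blend with the previous position" can be sketched as a sigmoid-gated mix of each position with its left neighbor; whether the learned gate is a scalar or per-channel is an assumption here:

```python
import numpy as np

def smear_gate(x, gate_logit):
    # Blend each position with the previous one via a learned scalar gate
    # (scalar gating is an assumption; it could be per-channel).
    g = 1.0 / (1.0 + np.exp(-gate_logit))     # sigmoid
    prev = np.concatenate([x[:1], x[:-1]])    # shift right; first row unchanged
    return (1 - g) * x + g * prev

x = np.arange(6, dtype=np.float32).reshape(3, 2)
y = smear_gate(x, gate_logit=0.0)             # g = 0.5: even mix
```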
U-Net skip connections
Encoder-decoder style skip connections across layers.
parameters: {"encoder_layers":5,"decoder_layers":6,"skip_weights":5}
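With 5 encoder layers, 6 decoder layers, and 5 skip weights, a plausible reading is that encoder outputs are stacked and the last five decoder layers each consume one, weighted and in reverse order. A sketch under that assumption, with a trivial stand-in for the transformer block:

```python
import numpy as np

def layer(x, i):
    return x + 0.01 * (i + 1)         # stand-in for a transformer block

def unet_forward(x, skip_w):
    # 5 encoder layers push outputs onto a stack; the last 5 of the 6
    # decoder layers pop one each (weighted, reverse order) -- hence
    # exactly 5 learned skip weights, matching the PR's parameters.
    stack = []
    for i in range(5):                # encoder half
        x = layer(x, i)
        stack.append(x)
    for j in range(6):                # decoder half
        if j >= 1:
            x = x + skip_w[j - 1] * stack.pop()
        x = layer(x, 5 + j)
    return x

out = unet_forward(np.zeros(4), np.full(5, 0.5))
```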
Value Embedding
Shared value embedding table injected into attention values at later layers.
parameters: {"layers":[9,10],"table_shape":"1024x128"}
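One way to read the Value Embedding entry: a shared 1024x128 table, indexed by token id, added into the attention value path only at layers 9 and 10. Indexing modulo the table size and adding into the first 128 value channels are both assumptions:

```python
import numpy as np

TABLE = np.random.default_rng(0).normal(size=(1024, 128)).astype(np.float32)
INJECT_LAYERS = {9, 10}               # later layers only, per the PR

def inject_value_embedding(v, tokens, layer_idx):
    # Add a shared per-token table into the first 128 value channels.
    # The mod-1024 indexing and slice placement are assumptions.
    if layer_idx not in INJECT_LAYERS:
        return v
    v = v.copy()
    v[:, :128] += TABLE[tokens % 1024]
    return v

v = np.zeros((3, 256), dtype=np.float32)
out = inject_value_embedding(v, np.array([5, 6, 7]), layer_idx=9)
```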
XSA
Exclusive Self-Attention applied to all 11 layers.
parameters: {"layers":11}
Partial RoPE
Rotary embeddings applied to a subset of head dimensions.
parameters: {"rotary_dims":16,"head_dims":64}
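Partial RoPE rotates only the first 16 of each 64-dim head and passes the rest through. A minimal sketch (which half-pairing convention the PR uses is an assumption):

```python
import numpy as np

def partial_rope(x, rotary_dims=16):
    # Rotate only the first `rotary_dims` channels of a (seq, 64) head;
    # the remaining 48 dims are untouched.
    seq = x.shape[0]
    half = rotary_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.arange(seq)[:, None] * freqs          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    out = x.copy()
    out[:, :half] = x1 * cos - x2 * sin
    out[:, half:rotary_dims] = x1 * sin + x2 * cos
    return out

q = np.ones((4, 64), dtype=np.float32)
q_rot = partial_rope(q)
```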
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
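The GQA entry (8 query heads, 4 KV heads) amounts to each KV head serving two query heads, halving the KV cache. Sketch:

```python
import numpy as np

def gqa_attention(q, k, v):
    # 8 query heads share 4 KV heads: repeat each KV head for 2 query heads.
    rep = q.shape[0] // k.shape[0]               # 8 // 4 = 2
    k = np.repeat(k, rep, axis=0)
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)             # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 64))
k = rng.normal(size=(4, 5, 64))
v = rng.normal(size=(4, 5, 64))
out = gqa_attention(q, k, v)
```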
MLP3x
Three-times expanded MLP with LeakyReLU² activation.
parameters: {"activation":"LeakyReLU²"}
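A sketch of MLP3x, reading LeakyReLU² as the elementwise square of LeakyReLU (that interpretation is an assumption; a sign-preserving square is also possible):

```python
import numpy as np

def mlp3x(x, W1, W2, slope=0.01):
    # 3x hidden expansion; LeakyReLU(x)**2 activation (interpretation assumed).
    h = x @ W1                                   # dim -> 3*dim
    h = np.where(h > 0, h, slope * h) ** 2
    return h @ W2                                # 3*dim -> dim

dim = 512
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(dim, 3 * dim))
W2 = rng.normal(scale=0.02, size=(3 * dim, dim))
y = mlp3x(np.ones((2, dim)), W1, W2)
```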
Weight Averaging
EMA
parameters: {"decay":0.997}
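The EMA entry keeps an exponential moving average of the weights with decay 0.997 and evaluates the averaged copy:

```python
def ema_update(ema_params, params, decay=0.997):
    # Standard EMA step, applied per parameter tensor after each update.
    return {k: decay * ema_params[k] + (1 - decay) * params[k] for k in params}

ema = {"w": 0.0}
for step in range(3):                 # three steps toward a fixed target of 1.0
    ema = ema_update(ema, {"w": 1.0})
```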
Quantization
GPTQ
bits: 6
scope: MLP + attention weights
late QAT
bits: 6
scope: final warmdown phase
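For orientation, here is only the round-to-nearest 6-bit baseline that GPTQ improves on; real GPTQ quantizes columns sequentially and uses second-order (Hessian) information to compensate each column's rounding error, and the late-QAT phase here additionally trains through the quantizer:

```python
import numpy as np

def quantize_6bit(W):
    # Per-row symmetric 6-bit round-to-nearest quantization (the baseline
    # GPTQ improves on; no Hessian-based error compensation here).
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed 6-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

W = np.random.default_rng(0).normal(size=(8, 16)).astype(np.float32)
q, s = quantize_6bit(W)
W_hat = q * s                                    # dequantized reconstruction
```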
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32,"seq_len":2048}
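Sliding-window eval scores each token with up to seq_len-1 tokens of left context, advancing the window stride tokens at a time and counting only the newly covered positions. A sketch with a hypothetical `nll_fn` standing in for the model (small seq_len/stride for illustration; the PR uses 2048/32):

```python
def sliding_window_nll(nll_fn, tokens, seq_len=2048, stride=32):
    # `nll_fn` is a hypothetical model call returning per-token NLLs
    # for a window; each token is counted exactly once.
    losses = []
    pos = 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        lo = max(0, end - seq_len)
        per_tok = nll_fn(tokens[lo:end])      # score the whole window
        losses.extend(per_tok[pos - lo:])     # keep only the new tokens
        pos = end
    return sum(losses) / len(losses)

# Dummy "model" whose NLL is 1.0 everywhere.
mean_nll = sliding_window_nll(lambda w: [1.0] * len(w), list(range(100)),
                              seq_len=16, stride=4)
```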
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ns_steps":5,"lr":0.025}
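Muon's core step orthogonalizes the momentum-accumulated gradient with a quintic Newton–Schulz iteration; ns_steps=5 above is the iteration count. A sketch of just that iteration, with coefficients from the public Muon reference implementation:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration: drives the singular values of the
    # normalized matrix toward 1, approximately orthogonalizing G.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)           # normalize spectral scale
    for _ in range(steps):                       # ns_steps = 5 in this PR
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.random.default_rng(0).normal(size=(32, 64))
O = newton_schulz(G)                             # singular values near 1
```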
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":3200}
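The schedule above is trapezoidal: 20 steps of linear warmup, a flat middle at the base LR, then a linear warmdown over the final 3200 iterations. A sketch (the total step count is hypothetical, not stated in the PR):

```python
def lr_at(step, total_steps, base_lr=0.025, warmup_steps=20, warmdown_iters=3200):
    # Trapezoid: linear warmup -> flat -> linear warmdown to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr

lrs = [lr_at(s, total_steps=5000) for s in range(5000)]  # total is illustrative
```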
Regularization
LN scale
parameters: {"rule":"1/sqrt(layer_idx+1)"}
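The LN scale rule sets each layer's norm gain to 1/sqrt(layer_idx+1), damping deeper layers' residual contribution; exactly which norms receive the scale (pre-attention vs pre-MLP) is an assumption:

```python
import math

def ln_scale_init(layer_idx):
    # Depth-dependent norm gain per the PR's rule: 1/sqrt(layer_idx + 1).
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale_init(i) for i in range(11)]   # one per layer (11L model)
```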
logit softcap
parameters: {"value":30}
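Logit softcapping with value 30 smoothly bounds the logits to (-30, 30) via a scaled tanh, behaving identically near zero but saturating for extreme values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bound logits to (-cap, cap); ~identity for |logits| << cap.
    return cap * np.tanh(logits / cap)

x = np.array([-1000.0, 0.0, 15.0, 1000.0])
y = softcap(x)
```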
Novel Contributions
- XSA applied to all 11 layers
- BigramHash token-pair embeddings with 3072 buckets
- GPTQ-based Hessian quantization with late QAT
- EMA-weighted final model
- Value Embedding injected into later attention layers
- U-Net skip connections combined with partial RoPE and SmearGate