PR #592 (open)

Submission: 12L Int5-MLP BigramHash10K EMA (1.1476 BPB)

by Skytuhua
val_bpb
1.1476
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,497,769 bytes

Training Techniques

Quantization
mixed Int5/Int6 QAT
bits: null
scope: MLP weights Int5, Attention weights Int6
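The mixed-precision scheme amounts to quantize-dequantize at two different bit widths. A minimal sketch, assuming symmetric round-to-nearest fake quantization (the submission does not spell out the exact scheme, and in real QAT gradients would flow through via a straight-through estimator):

```python
def fake_quantize(w, bits):
    """Quantize-dequantize a weight list at the given bit width
    (symmetric round-to-nearest; assumed scheme, not confirmed)."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for Int5, 31 for Int6
    scale = (max(abs(x) for x in w) / qmax) or 1.0  # guard the all-zero case

    def qdq(x):
        q = max(-qmax - 1, min(qmax, round(x / scale)))  # clamp to the int range
        return q * scale                                 # dequantize back to float

    return [qdq(x) for x in w]

# Toy weights; per the submission, MLP weights go to Int5, attention to Int6.
mlp_w  = [0.8, -1.3, 0.05, 2.0]
attn_w = [0.8, -1.3, 0.05, 2.0]
mlp_q  = fake_quantize(mlp_w, bits=5)
attn_q = fake_quantize(attn_w, bits=6)
```

Each extra bit doubles the number of representable levels, so the Int6 attention weights keep roughly half the rounding error of the Int5 MLP weights at one extra bit per weight.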
Architecture
BigramHash
Expanded from 2048 to 10240 buckets; consecutive token pairs are XOR-hashed into learned 128-dim embeddings, reducing collisions and improving the bigram-level signal
parameters: {"buckets":10240,"embedding_dim":128}
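The bucket lookup can be sketched as follows; the multiplicative mixing constant is an illustrative assumption, since the submission only states an XOR hash of consecutive token pairs into 10240 buckets of 128-dim embeddings:

```python
N_BUCKETS, EMB_DIM = 10240, 128   # per the submission's parameters

def bigram_bucket(prev_tok, cur_tok, n_buckets=N_BUCKETS):
    """Map a consecutive token pair to a bucket index.
    The Knuth-style multiplier is an assumed mixer that makes the
    XOR order-sensitive; the submission does not specify one."""
    return ((prev_tok * 0x9E3779B1) ^ cur_tok) % n_buckets

def bigram_embeddings(tokens, table):
    """One learned embedding row per adjacent token pair."""
    return [table[bigram_bucket(a, b)] for a, b in zip(tokens, tokens[1:])]

# Toy usage with a zero-initialized (in practice learned) table.
table = [[0.0] * EMB_DIM for _ in range(N_BUCKETS)]
embs = bigram_embeddings([5, 17, 17, 42], table)
```

With 10240 buckets instead of 2048, the expected collision rate over a fixed set of frequent bigrams drops roughly five-fold.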
XSA
Cross-layer self-attention applied to the last 4 layers (layers 8-11)
parameters: {"layers":4,"layer_indices":[8,9,10,11]}
SmearGate
Applied SmearGate gating mechanism
parameters: null
Partial RoPE
Rotary positional embeddings applied to only 16 dimensions
parameters: {"dimensions":16}
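Partial RoPE rotates only a prefix of each feature vector and leaves the rest position-agnostic. A sketch for a single position and vector, assuming the 16 rotated dimensions are the leading ones and a standard frequency base of 10000 (neither detail is stated in the submission):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims features of x in pairs by
    position-dependent angles; remaining dims pass through unchanged."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)   # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i], out[i + 1] = a * c - b * s, a * s + b * c
    return out
```

At position 0 the rotation is the identity, and the rotation preserves the norm of each rotated pair, so the untouched dimensions carry purely content-based information.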
OrthoInit
Orthogonal initialization of weights
parameters: null
Value Embed
Value embeddings added on layers 10 and 11 with dimension 128
parameters: {"layers":[10,11],"dim":128}
MLP expansion
MLP expansion factor 3x (hidden=1536)
parameters: {"expansion_factor":3,"hidden_dim":1536}
Embedding
Tied FP16 embeddings
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_steps":1500}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start":"last 20% of warmdown","frequency_steps":50}
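Both weight averages are cheap to maintain alongside training. A sketch with parameters flattened to plain lists; the EMA decay of 0.997 and the 50-step SWA snapshot interval match the reported settings, but the bookkeeping details are assumptions:

```python
def ema_update(avg, params, decay=0.997):
    """In-place exponential moving average: avg <- decay*avg + (1-decay)*params."""
    for i, p in enumerate(params):
        avg[i] = decay * avg[i] + (1.0 - decay) * p

class SWA:
    """Equal-weight running average of snapshots taken every `every` steps."""
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, params):
        if step % self.every:          # only snapshot on multiples of `every`
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            for i, p in enumerate(params):
                self.avg[i] += (p - self.avg[i]) / self.n  # running mean
```

The EMA tracks a smoothed recent trajectory, while SWA over the tail of the warmdown averages equally over the final, low-LR region of training.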
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
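Sliding-window evaluation scores each token with as much left context as the model's window allows, advancing by the stride between windows. A sketch of the window plan; window=512 is an illustrative context length, since the submission only reports stride=64:

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Plan (ctx_start, ctx_end, score_from) spans: each window covers up to
    `window` tokens of context but only its final `stride` tokens are scored,
    so every scored token sees near-full left context."""
    spans = []
    score_from = 0
    while score_from < n_tokens:
        ctx_start = max(0, score_from + stride - window)
        ctx_end = min(score_from + stride, n_tokens)
        spans.append((ctx_start, ctx_end, score_from))
        score_from += stride
    return spans
```

Every token is scored exactly once; a smaller stride gives each token closer-to-full context at proportionally higher evaluation cost.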
LR Schedule
warmup and warmdown
parameters: {"warmup_steps":1500,"late_QAT_start_scale":0.15}
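The schedule is trapezoidal: linear warmup over 1500 steps, a constant plateau, then a linear warmdown. A sketch, where peak_lr and the warmdown fraction are illustrative choices (the submission additionally scales the LR by 0.15 once late QAT begins, which is omitted here):

```python
def lr_at(step, total_steps, peak_lr=1.0, warmup_steps=1500, warmdown_frac=0.2):
    """Warmup-hold-warmdown (trapezoidal) learning-rate schedule."""
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:                        # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:       # linear warmdown to ~0
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr                                 # constant plateau
```

The SWA snapshots above start in the last 20% of the warmdown, i.e. deep into the final linear ramp where the LR is already small.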

Novel Contributions

  • Mixed Int5/Int6 quantization, with MLP weights at Int5 and attention weights at Int6, to reduce artifact size
  • Addition of a 12th transformer layer, paid for by the size savings from Int5 MLP compression
  • Expansion of BigramHash embedding buckets from 2048 to 10240 to reduce hash collisions and improve the bigram-level signal
  • Use of the SmearGate gating mechanism and OrthoInit initialization
  • Application of cross-layer self-attention (XSA) to the last 4 layers
  • Partial RoPE applied to 16 dimensions
  • Late Quantization-Aware Training (QAT) combined with GPTQ-lite clip search
  • Use of EMA and SWA weight averaging techniques