PR #505

Status: open

Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean)

by JoeProAI on GitHub
val_bpb: 1.1181
Architecture: Transformer

Training Techniques

Quantization
  • int6 + GPTQ-lite + QAT (bits: 6, scope: null)
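The quantization stack combines int6 weights with GPTQ-lite and QAT. As a minimal illustration of the int6 part alone, here is a plain round-to-nearest int6 quantizer; the GPTQ-lite error compensation and the QAT loop are omitted, and symmetric per-tensor scaling is an assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor round-to-nearest int6 quantization.

    Signed int6 covers [-32, 31]; clipping to +/-31 keeps the grid
    symmetric around zero.
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)  # reconstruction error <= scale / 2
```

GPTQ-style methods would additionally reorder and compensate rounding error column by column; this sketch only shows the grid the weights land on.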
Architecture
  • SwiGLU FFN: feed-forward network with SwiGLU activation and Star-ReLU (hidden: 1792)
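The standard SwiGLU feed-forward block can be sketched as follows. The hidden width 1792 matches the record's parameters; the Star-ReLU variant mentioned in the description is not reproduced here, and the initialization is a placeholder:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class SwiGLUFFN:
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model, d_hidden=1792, seed=0):
        rng = np.random.default_rng(seed)
        s = d_model ** -0.5
        self.w_gate = rng.normal(0, s, (d_model, d_hidden))
        self.w_up = rng.normal(0, s, (d_model, d_hidden))
        self.w_down = rng.normal(0, s, (d_hidden, d_model))

    def __call__(self, x):
        # Gated activation: the silu branch modulates the linear branch.
        return (silu(x @ self.w_gate) * (x @ self.w_up)) @ self.w_down

ffn = SwiGLUFFN(d_model=64)
y = ffn(np.random.default_rng(1).normal(size=(2, 16, 64)))
```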
  • U-Net Skip Gates: 5 encoder and 6 decoder layers with learned gating
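The U-Net skip-gate idea, applied to a transformer stack, saves activations from the encoder half and mixes them back into the decoder half through a learned gate. The sigmoid-scalar gate and the last-in-first-out layer pairing below are assumptions; only the 5-encoder / 6-decoder split comes from the record:

```python
import numpy as np

def gated_skip(decoder_x, encoder_x, gate_logit):
    """Mix a saved encoder activation into the decoder stream through a
    learned scalar gate (sigmoid keeps it in (0, 1))."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return decoder_x + gate * encoder_x

def block(x):
    return x + 1.0  # stand-in for a transformer block

x = np.zeros(4)
skips = []
for _ in range(5):      # encoder half: save activations
    x = block(x)
    skips.append(x)
for _ in range(6):      # decoder half: consume them via gates
    x = block(x)
    if skips:           # 6 decoder vs 5 encoder layers, so one
        x = gated_skip(x, skips.pop(), gate_logit=0.0)  # layer has no skip
```

With `gate_logit=0.0` each gate starts at 0.5, so training can move each connection toward pass-through or suppression independently.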
  • XSA4: Extended Self-Attention in the last 4 layers
  • Value Embeddings (VE128): 128-dimensional shared embedding with per-layer scales on layers 9-10
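A value-embedding table of this kind is a single token-indexed table shared across layers, with each participating layer owning only a learned scale. The record fixes the dimension (128) and the layers (9 and 10); mixing the result into the attention value path, and the scale initialization, are assumptions:

```python
import numpy as np

class ValueEmbedding:
    """Shared 128-dim token embedding with per-layer learned scales."""
    def __init__(self, vocab_size, dim=128, layers=(9, 10), seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0, 0.02, (vocab_size, dim))
        self.scales = {layer: 1.0 for layer in layers}  # learned per layer

    def __call__(self, token_ids, layer):
        # Layers outside the configured set contribute nothing.
        if layer not in self.scales:
            return np.zeros((*token_ids.shape, self.table.shape[1]))
        return self.scales[layer] * self.table[token_ids]

ve = ValueEmbedding(vocab_size=256)
tokens = np.array([[1, 2, 3]])
out9 = ve(tokens, layer=9)   # scaled shared embedding
out0 = ve(tokens, layer=0)   # zeros: layer 0 does not participate
```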
  • BigramHash: 8192 buckets with 128-dimensional embeddings
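BigramHash gives the model a cheap feature for consecutive token pairs: each (previous, current) pair is hashed into one of 8192 buckets, and the bucket indexes a 128-dimensional embedding table. The bucket and dimension counts come from the record; the multiplicative hash below is a stand-in, since the record does not specify the hash function:

```python
import numpy as np

def bigram_hash_ids(token_ids, buckets=8192):
    """Map each (prev, cur) token pair to one of `buckets` hash buckets."""
    prev = np.concatenate([[0], token_ids[:-1]])  # pad before position 0
    mixed = (prev.astype(np.uint64) * np.uint64(1000003)
             + token_ids.astype(np.uint64))       # simple multiplicative mix
    return (mixed % np.uint64(buckets)).astype(np.int64)

table = np.random.default_rng(0).normal(0, 0.02, (8192, 128))
bucket_ids = bigram_hash_ids(np.array([5, 17, 5, 17]))
emb = table[bucket_ids]   # (4, 128) bigram features for the sequence
```

Identical bigrams hash to the same bucket, so repeated pairs share an embedding; unrelated pairs may also collide, which the 8192-bucket budget trades off against table size.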
  • Partial RoPE: rotary positional embeddings applied to 16 dimensions
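Partial RoPE rotates only a subset of each head's dimensions and passes the rest through unchanged, leaving some channels position-independent. The record fixes the subset size at 16; rotating the *first* 16 dimensions and using base 10000 are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims dims of x.

    x: (seq, head_dim). The remaining head_dim - rot_dims dims pass
    through unrotated.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)          # per-pair frequency
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).normal(size=(6, 64))
y = partial_rope(x)
```

The rotation is norm-preserving on the rotated slice, and position 0 is left unchanged (all angles are zero there).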
  • LN Scale: layer-dependent normalization scaling (parameters: null)
Weight Averaging
  • EMA (decay: 0.997)
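EMA weight averaging keeps a shadow copy of the weights that moves a fraction (1 - decay) toward the live weights at every step; evaluation uses the shadow copy. The decay 0.997 comes from the record:

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model weights."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.astype(np.float64).copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # shadow <- d * shadow + (1 - d) * live
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

params = {"w": np.array([0.0])}
ema = EMAWeights(params)
params["w"] = np.array([1.0])
ema.update(params)   # shadow moves 0.3% of the way toward the new value
```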
LR Schedule
  • warmdown (warmdown_steps: 3500)
Sequence Length
  • train_length: 2048, eval_length: null
Compression
  • zstd (level: 22)

Novel Contributions

  • Demonstrated SwiGLU FFN viability without test-time training when paired with a proper training configuration
  • Introduced U-Net Skip Gates with learned gating in transformer architecture
  • Applied Extended Self-Attention (XSA4) in the last 4 layers
  • Incorporated 128-dimensional Value Embeddings with per-layer scaling on layers 9-10
  • Used BigramHash embeddings with 8192 buckets and 128 dimensions
  • Utilized Partial RoPE with 16 dimensions
  • Enabled Late Quantization-Aware Training (QAT) at learning rate scale < 0.15
  • Achieved improved val_bpb by increasing sequence length from 1024 to 2048
  • Combined int6 quantization with GPTQ-lite compression and zstd-22 for artifact size reduction
  • Used no test-time training (NoTTT)