PR #505

Status: open

Record: SwiGLU+VE128+NoTTT val_bpb=1.1181 (3-seed mean)

by JoeProAI on GitHub
val_bpb: 1.1181
Architecture: Transformer

Training Techniques

Quantization
  • int6 + GPTQ-lite + QAT (bits: 6, scope: null)
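The quantization stack combines int6 weights with GPTQ-lite and QAT. As a minimal illustration of the int6 part alone, here is a plain round-to-nearest int6 quantizer; the GPTQ-lite error compensation and the QAT loop are omitted, and symmetric per-tensor scaling is an assumption:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor round-to-nearest int6 quantization.

    Signed int6 covers [-32, 31]; clipping to +/-31 keeps the grid
    symmetric around zero.
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)  # reconstruction error <= scale / 2
```

GPTQ-style methods would additionally reorder and compensate rounding error column by column; this sketch only shows the grid the weights land on.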
Architecture
  • SwiGLU FFN: feed-forward network with SwiGLU activation and Star-ReLU (hidden: 1792)
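The standard SwiGLU feed-forward block can be sketched as follows. The hidden width 1792 matches the record's parameters; the Star-ReLU variant mentioned in the description is not reproduced here, and the initialization is a placeholder:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class SwiGLUFFN:
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model, d_hidden=1792, seed=0):
        rng = np.random.default_rng(seed)
        s = d_model ** -0.5
        self.w_gate = rng.normal(0, s, (d_model, d_hidden))
        self.w_up = rng.normal(0, s, (d_model, d_hidden))
        self.w_down = rng.normal(0, s, (d_hidden, d_model))

    def __call__(self, x):
        # Gated activation: the silu branch modulates the linear branch.
        return (silu(x @ self.w_gate) * (x @ self.w_up)) @ self.w_down

ffn = SwiGLUFFN(d_model=64)
y = ffn(np.random.default_rng(1).normal(size=(2, 16, 64)))
```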
  • U-Net Skip Gates: 5 encoder and 6 decoder layers with learned gating
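The U-Net skip-gate idea, applied to a transformer stack, saves activations from the encoder half and mixes them back into the decoder half through a learned gate. The sigmoid-scalar gate and the last-in-first-out layer pairing below are assumptions; only the 5-encoder / 6-decoder split comes from the record:

```python
import numpy as np

def gated_skip(decoder_x, encoder_x, gate_logit):
    """Mix a saved encoder activation into the decoder stream through a
    learned scalar gate (sigmoid keeps it in (0, 1))."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return decoder_x + gate * encoder_x

def block(x):
    return x + 1.0  # stand-in for a transformer block

x = np.zeros(4)
skips = []
for _ in range(5):      # encoder half: save activations
    x = block(x)
    skips.append(x)
for _ in range(6):      # decoder half: consume them via gates
    x = block(x)
    if skips:           # 6 decoder vs 5 encoder layers, so one
        x = gated_skip(x, skips.pop(), gate_logit=0.0)  # layer has no skip
```

With `gate_logit=0.0` each gate starts at 0.5, so training can move each connection toward pass-through or suppression independently.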
  • XSA4: Extended Self-Attention in the last 4 layers
  • Value Embeddings (VE128): 128-dimensional shared embedding with per-layer scales on layers 9-10
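A value-embedding table of this kind is a single token-indexed table shared across layers, with each participating layer owning only a learned scale. The record fixes the dimension (128) and the layers (9 and 10); mixing the result into the attention value path, and the scale initialization, are assumptions:

```python
import numpy as np

class ValueEmbedding:
    """Shared 128-dim token embedding with per-layer learned scales."""
    def __init__(self, vocab_size, dim=128, layers=(9, 10), seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0, 0.02, (vocab_size, dim))
        self.scales = {layer: 1.0 for layer in layers}  # learned per layer

    def __call__(self, token_ids, layer):
        # Layers outside the configured set contribute nothing.
        if layer not in self.scales:
            return np.zeros((*token_ids.shape, self.table.shape[1]))
        return self.scales[layer] * self.table[token_ids]

ve = ValueEmbedding(vocab_size=256)
tokens = np.array([[1, 2, 3]])
out9 = ve(tokens, layer=9)   # scaled shared embedding
out0 = ve(tokens, layer=0)   # zeros: layer 0 does not participate
```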
  • BigramHash: 8192 buckets with 128-dimensional embeddings
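BigramHash gives the model a cheap feature for consecutive token pairs: each (previous, current) pair is hashed into one of 8192 buckets, and the bucket indexes a 128-dimensional embedding table. The bucket and dimension counts come from the record; the multiplicative hash below is a stand-in, since the record does not specify the hash function:

```python
import numpy as np

def bigram_hash_ids(token_ids, buckets=8192):
    """Map each (prev, cur) token pair to one of `buckets` hash buckets."""
    prev = np.concatenate([[0], token_ids[:-1]])  # pad before position 0
    mixed = (prev.astype(np.uint64) * np.uint64(1000003)
             + token_ids.astype(np.uint64))       # simple multiplicative mix
    return (mixed % np.uint64(buckets)).astype(np.int64)

table = np.random.default_rng(0).normal(0, 0.02, (8192, 128))
bucket_ids = bigram_hash_ids(np.array([5, 17, 5, 17]))
emb = table[bucket_ids]   # (4, 128) bigram features for the sequence
```

Identical bigrams hash to the same bucket, so repeated pairs share an embedding; unrelated pairs may also collide, which the 8192-bucket budget trades off against table size.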
  • Partial RoPE: rotary positional embeddings applied to 16 dimensions
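Partial RoPE rotates only a subset of each head's dimensions and passes the rest through unchanged, leaving some channels position-independent. The record fixes the subset size at 16; rotating the *first* 16 dimensions and using base 10000 are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first rot_dims dims of x.

    x: (seq, head_dim). The remaining head_dim - rot_dims dims pass
    through unrotated.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)          # per-pair frequency
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).normal(size=(6, 64))
y = partial_rope(x)
```

The rotation is norm-preserving on the rotated slice, and position 0 is left unchanged (all angles are zero there).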
  • LN Scale: layer-dependent normalization scaling (parameters: null)
Weight Averaging
  • EMA (decay: 0.997)
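EMA weight averaging keeps a shadow copy of the weights that moves a fraction (1 - decay) toward the live weights at every step; evaluation uses the shadow copy. The decay 0.997 comes from the record:

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model weights."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.astype(np.float64).copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # shadow <- d * shadow + (1 - d) * live
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

params = {"w": np.array([0.0])}
ema = EMAWeights(params)
params["w"] = np.array([1.0])
ema.update(params)   # shadow moves 0.3% of the way toward the new value
```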
LR Schedule
  • warmdown (warmdown_steps: 3500)
Sequence Length
  • train_length: 2048, eval_length: null
Compression
  • zstd (level: 22)

Novel Contributions

  • Demonstrated SwiGLU FFN viability without test-time training when paired with a proper training configuration
  • Introduced U-Net Skip Gates with learned gating in transformer architecture
  • Applied Extended Self-Attention (XSA4) in the last 4 layers
  • Incorporated 128-dimensional Value Embeddings with per-layer scaling on layers 9-10
  • Used BigramHash embeddings with 8192 buckets and 128 dimensions
  • Utilized Partial RoPE with 16 dimensions
  • Enabled Late Quantization-Aware Training (QAT) at learning rate scale < 0.15
  • Achieved improved val_bpb by increasing sequence length from 1024 to 2048
  • Combined int6 quantization with GPTQ-lite compression and zstd-22 for artifact size reduction
  • Used no test-time training (NoTTT)