PR #215 (open)

Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) 14.7MB in decimal

by JayCheng113
val_bpb: 1.1548
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7MB

Training Techniques

Architecture
low-rank Q
Factorized the Q projection into down/up matrices with rank 192 to reduce parameters and improve compressibility.
parameters: {"rank":192}
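A minimal numpy sketch of the factorization; rank=192 is from this PR, while d_model=768 is an assumed model width for illustration:

```python
import numpy as np

d_model, rank = 768, 192      # d_model assumed; rank=192 from this PR

# Full Q projection: one d_model x d_model matrix.
full_params = d_model * d_model
# Factorized Q: down-project to rank, then up-project back.
lowrank_params = d_model * rank + rank * d_model

rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, rank)) / np.sqrt(d_model)
W_up = rng.normal(size=(rank, d_model)) / np.sqrt(rank)

x = rng.normal(size=(4, d_model))     # four token vectors
q = (x @ W_down) @ W_up               # Q output has rank at most 192
assert q.shape == (4, d_model)
assert lowrank_params * 2 == full_params   # exactly half the parameters here
```

At these sizes the factorization halves the Q parameter count (294,912 vs 589,824), which also shrinks the compressed artifact.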
depth
Used 11 transformer layers with encoder-decoder skip connections (5 encoder + 6 decoder).
parameters: {"layers":11,"encoder_layers":5,"decoder_layers":6}
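The 5+6 skip layout can be sketched as a U-Net-style stack; the exact encoder-to-decoder pairing is an assumption, and identity blocks stand in for full transformer layers:

```python
import numpy as np

def layer(x):
    # Stand-in for a full transformer block; identity keeps the sketch short.
    return x

x = np.random.randn(2, 16)
skips = []
for _ in range(5):                # encoder half: store each layer's output
    x = layer(x)
    skips.append(x)
for _ in range(6):                # decoder half: add stored activations back
    if skips:
        x = x + skips.pop()       # last-in-first-out pairing (an assumption)
    x = layer(x)
assert x.shape == (2, 16)
assert not skips                  # all five skip activations consumed
```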
tied embeddings
Used tied input/output embeddings.
parameters: null
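Tying means the output head reuses the input embedding matrix, saving one vocab x d_model matrix from the artifact; a toy-sized sketch (sizes are illustrative, not from the PR):

```python
import numpy as np

vocab, d_model = 1000, 64         # toy sizes for illustration only
embed = np.random.randn(vocab, d_model) * 0.02

def tok_to_vec(token_ids):
    return embed[token_ids]       # input embedding lookup

def vec_to_logits(hidden):
    return hidden @ embed.T       # output head reuses the same matrix

logits = vec_to_logits(tok_to_vec(np.array([1, 2, 3])))
assert logits.shape == (3, vocab)
```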
KV head count
Used grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4}
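In grouped-query attention several query heads share one KV head; only kv_heads=4 is from the PR, the other dimensions below are assumptions:

```python
import numpy as np

n_heads, n_kv_heads, head_dim, seq = 8, 4, 64, 16  # only kv_heads=4 is from the PR
group = n_heads // n_kv_heads     # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))
v = rng.normal(size=(n_kv_heads, seq, head_dim))

# Broadcast each KV head across its group of query heads.
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
assert scores.shape == (n_heads, seq, seq)
assert k.size == q.size // group  # K/V parameters and cache shrink by `group`
```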
MLP3x
Used a 3x MLP width with relu-squared activation.
parameters: {"mlp_mult":3}
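A sketch of the MLP block with the relu-squared activation; mlp_mult=3 is from the PR, d_model=768 is assumed:

```python
import numpy as np

d_model, mlp_mult = 768, 3        # d_model assumed; mlp_mult=3 from the PR
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.02, (d_model, mlp_mult * d_model))
W_out = rng.normal(0, 0.02, (mlp_mult * d_model, d_model))

def mlp(x):
    h = np.maximum(x @ W_in, 0.0) ** 2    # relu-squared activation
    return h @ W_out

y = mlp(rng.normal(size=(2, d_model)))
assert y.shape == (2, d_model)
```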
RoPE
Applied rotary positional embeddings.
parameters: {"base":10000}
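A minimal RoPE sketch with base 10000 as listed; the half-split pairing of dimensions below is one common convention and an assumption about this implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to (seq, head_dim) activations."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation rates
    angles = np.outer(np.arange(seq), freqs)    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(16, 64)
out = rope(x)
assert out.shape == (16, 64)
assert np.allclose(out[0], x[0])   # position 0 rotates by angle 0: unchanged
```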
Quantization
int6
bits: 6
scope: MLP and attention weights
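The novel-contributions list below describes the scheme as int6 per-row; a sketch assuming symmetric per-row scaling (the symmetric range is an assumption):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row quantization to the 6-bit range [-31, 31].
    Symmetric scaling is an assumption about the PR's exact scheme."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int6_per_row(w)
w_hat = q * scale                          # dequantize
assert q.min() >= -31 and q.max() <= 31
assert np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6   # rounding error bound
```

The int6 codes have low entropy per byte, which is what the zstd-22 pass below exploits.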
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
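One way to realize stride-64 sliding-window scoring: each forward pass sees a full window of context but only the not-yet-scored suffix contributes to the loss. A sketch of the chunk plan (the exact chunking is an assumption; window=1024 matches the training length above):

```python
def sliding_window_chunks(n_tokens, window=1024, stride=64):
    """Plan eval passes: each pass covers tokens [start, end) as context and
    scores only [pos, end), so after the first pass every token is scored
    with up to window - stride tokens of left context."""
    chunks, pos, start = [], 0, 0
    while pos < n_tokens:
        end = min(start + window, n_tokens)
        chunks.append((start, pos, end))
        pos = end
        start += stride
    return chunks

plan = sliding_window_chunks(200, window=64, stride=16)
covered = [i for _, p, e in plan for i in range(p, e)]
assert covered == list(range(200))   # every token scored exactly once
```

A small stride buys long context for every scored token at the cost of many more forward passes.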
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"warmdown_iters":3000}
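Muon's core step orthogonalizes each matrix gradient with a Newton-Schulz iteration before the momentum update. A sketch of that step, using the quintic coefficients from the public Muon reference implementation (not necessarily this submission's exact code):

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Approximately orthogonalize a gradient matrix (the core of Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)    # normalize so the iteration converges
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if tall else x

rng = np.random.default_rng(0)
o = newton_schulz(rng.normal(size=(8, 16)))
s = np.linalg.svd(o, compute_uv=False)
assert np.all(s > 0.1) and np.all(s < 2.0)   # singular values pushed toward 1
```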
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
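The warmdown schedule holds the learning rate flat, then decays it linearly to zero over the final 3000 steps; total_steps and the base rate below are illustrative (matrix_lr=0.02 is from the optimizer settings above):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Constant LR, then linear decay ("warmdown") to 0 over the last steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

assert lr_at(0, 10000, 0.02) == 0.02                  # flat phase
assert abs(lr_at(8500, 10000, 0.02) - 0.01) < 1e-12   # halfway through warmdown
assert lr_at(10000, 10000, 0.02) == 0.0               # decayed to zero
```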
Regularization
weight decay
parameters: {"weight_decay":0.038}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Initialization
resid mix
Explored Legendre-based initialization for resid_mix parameters, though it did not improve results.
Other
other
Cleared the compile cache before each run to ensure reproducibility and consistent compilation behavior.
parameters: null

Novel Contributions

  • Low-rank Q factorization with rank 192 to exploit the apparent low-rank structure of Q projections.
  • 11-layer encoder-decoder skip-connected Transformer within the 16MB budget.
  • Int6 per-row quantization combined with zstd-22 compression for model weights.
  • Sliding-window evaluation with stride 64 for final scoring.
  • Analysis-driven exploration of alternative ideas such as Legendre resid_mix initialization, content-dependent pre-rotation, and depth-attention residuals.