PR #103

open

Non-record: Looped Transformer + LoRA + Skip Connections + NorMuon + SWA + Int6 + Sliding Window

by MatthewHRockwell
val_bpb: 1.5000
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14.9 MB

Training Techniques

Architecture
depth recurrence
5 unique transformer blocks are looped to create 30 virtual layers, increasing effective depth without storing parameters for every layer.
parameters: {"unique_layers":5,"virtual_depth":30}
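A minimal sketch of the depth recurrence, assuming the 5 stored blocks are simply cycled to produce the 30 virtual layers (the block internals here are stand-ins, not the actual transformer blocks):

```python
import numpy as np

UNIQUE_LAYERS = 5
VIRTUAL_DEPTH = 30

rng = np.random.default_rng(0)
d_model = 8

# Stand-in for 5 unique transformer blocks: here, just 5 weight matrices.
blocks = [rng.standard_normal((d_model, d_model)) * 0.01
          for _ in range(UNIQUE_LAYERS)]

def forward(x):
    # Reuse the same stored parameters every UNIQUE_LAYERS steps,
    # so 5 blocks yield 30 virtual layers of compute.
    for v in range(VIRTUAL_DEPTH):
        w = blocks[v % UNIQUE_LAYERS]
        x = x + np.tanh(x @ w)   # residual block stand-in
    return x

x = rng.standard_normal((4, d_model))
y = forward(x)
```

Only 5 blocks' worth of parameters are stored while 30 layers of compute are applied, which is what keeps the artifact small.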
skip connections
Encoder-decoder-style skip connections save hidden states from the first half of the virtual layers and consume them in reverse order in the decoder half, blended in via learned skip weights.
parameters: {"encoder_layers":15,"decoder_layers":15}
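A hedged sketch of the U-Net-style skip pattern across virtual layers: the encoder half pushes activations onto a stack, the decoder half pops them in reverse and blends with a learned per-layer weight (the sigmoid gating is an assumption; the entry only says "learned skip weights"):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
ENC, DEC = 15, 15  # encoder_layers / decoder_layers from above

blocks = [rng.standard_normal((d_model, d_model)) * 0.01 for _ in range(5)]
skip_logits = np.zeros(DEC)  # learned skip weights, one per decoder layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    stack = []
    for v in range(ENC):
        x = x + np.tanh(x @ blocks[v % 5])
        stack.append(x)              # save encoder activation
    for d in range(DEC):
        skip = stack.pop()           # consume in reverse order
        g = sigmoid(skip_logits[d])
        x = g * x + (1.0 - g) * skip # learned blend of state and skip
        x = x + np.tanh(x @ blocks[(ENC + d) % 5])
    return x

x = rng.standard_normal((4, d_model))
y = forward(x)
```

Because ENC == DEC, every saved tensor is consumed exactly once and the stack is empty at the end of the forward pass.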
LoRA
Per-virtual-layer LoRA adapters on Q and V projections differentiate each virtual layer with low parameter overhead.
parameters: {"rank":4}
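An illustrative sketch of per-virtual-layer LoRA on a shared projection: one base weight plus a rank-4 adapter per virtual layer, so each repeat of a block behaves differently. The zero-initialized B factor (standard LoRA init) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, RANK, VIRTUAL_DEPTH = 8, 4, 30

W_q = rng.standard_normal((d_model, d_model)) * 0.1   # shared base Q weight
# Per-virtual-layer low-rank factors: A random, B zero, so every adapter
# starts as a zero delta and specializes during training.
lora_A = [rng.standard_normal((d_model, RANK)) * 0.1
          for _ in range(VIRTUAL_DEPTH)]
lora_B = [np.zeros((RANK, d_model)) for _ in range(VIRTUAL_DEPTH)]

def q_proj(x, v):
    """Q projection at virtual layer v: shared weight + that layer's LoRA delta."""
    return x @ W_q + (x @ lora_A[v]) @ lora_B[v]

x = rng.standard_normal((4, d_model))
```

The per-layer overhead is only 2 * d_model * rank parameters per adapted projection, which is why 30 adapters stay cheap.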
residual mixing
A learned blend of the hidden state with the original token embedding at each virtual layer.
parameters: null
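A sketch of the residual mixing step, assuming a sigmoid-gated scalar per layer (the actual parameterization of the learned blend is not specified in the entry):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mix(h, x0, logit):
    """Blend the current hidden state h with the original embedding x0
    via a learned scalar gate (logit is the trainable parameter)."""
    a = sigmoid(logit)
    return a * h + (1.0 - a) * x0
```

Re-injecting the original embedding at every layer gives the looped stack a direct path back to the input, which can stabilize very deep recurrence.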
tied embeddings
Input/output embeddings are tied.
parameters: null
Optimizer
NorMuon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"warmup_start":0.92,"warmup_steps":1500}
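A heavily hedged sketch of a Muon-style matrix update, which is NorMuon's core: momentum accumulation, Newton-Schulz orthogonalization of the update, then a per-row (per-neuron) normalization. The quintic coefficients follow the public Muon recipe; NorMuon's exact normalization and the warmup handling may differ from this sketch:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # scale so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def normuon_step(W, grad, buf, lr=0.02, momentum=0.99):
    buf[:] = momentum * buf + grad       # momentum accumulation (in place)
    U = newton_schulz(buf)               # orthogonalized matrix update
    # Per-neuron (row-wise) normalization of the update direction.
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-7)
    return W - lr * U
```

Scalar and embedding parameters get their own learning rates (scalar_lr, tied_embed_lr above) and are typically handled by a plain Adam-style path rather than this matrix update.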
Weight Averaging
SWA
parameters: {"checkpoints":7}
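A minimal sketch of the weight averaging step, assuming the 7 checkpoints are equally weighted (the entry does not specify a weighting scheme):

```python
import numpy as np

def swa_average(checkpoints):
    """Average a list of state dicts (name -> array) elementwise."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

# Toy checkpoints with constant weights 0..6; the average is 3.0 everywhere.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
```

Averaging late-training checkpoints tends to land in a flatter region of the loss surface than any single checkpoint.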
Quantization
int6
bits: 6
scope: block weights with fp16 embedding and fp16 LoRA passthrough
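A hedged sketch of symmetric int6 quantization for the block weights, with embedding and LoRA tensors kept in fp16 ("passthrough"). Per-tensor scaling is an assumption; the entry does not state the quantization granularity:

```python
import numpy as np

QMAX = 31  # int6 range is [-32, 31]; a symmetric scheme uses +/-31

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit integers."""
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
```

Round-to-nearest bounds the reconstruction error by half a quantization step, and the small LoRA/embedding tensors staying in fp16 costs little artifact size.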
Compression
zlib
level: null
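A sketch of the final compression stage, assuming the quantized weight bytes are simply run through zlib at its default level (the entry leaves the level unspecified):

```python
import zlib
import numpy as np

# Quantized int6 values stored one per int8 byte, then zlib-compressed.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.standard_normal(1024) * 10), -31, 31).astype(np.int8)

blob = zlib.compress(q.tobytes())                       # lossless
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
```

Quantization narrows the value distribution, which is exactly what makes the entropy coding in zlib effective on the artifact.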
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
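A hedged sketch of the sliding-window evaluation: the model sees up to a full 4096-token context, but only the final `stride` tokens of each window are scored, so every scored token gets long left context while each token is counted exactly once:

```python
def sliding_windows(n_tokens, context=4096, stride=64):
    """Yield (start, end, score_from) triples over a token stream.
    Tokens in [score_from, end) are scored; [start, score_from) is context."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - context)
        end = min(pos + stride, n_tokens)
        yield start, end, pos
        pos = end

# Small example: 200 tokens, context 128, stride 64.
spans = list(sliding_windows(200, context=128, stride=64))
```

Smaller strides give more context per scored token at the cost of more forward passes; stride 64 against a 4096 context is near the expensive end of that trade-off.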
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"wallclock_aware":true}
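A sketch of the warmdown schedule: constant learning rate, then a linear decay to zero over the final 3000 iterations. "Wallclock-aware" presumably means the warmdown start is adjusted from elapsed time rather than a fixed step count; that adjustment is not shown here:

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Multiplier on the base LR: 1.0 until the warmdown window,
    then linear decay to 0 at total_steps."""
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```

Usage: with 10000 total steps, the scale is 1.0 through step 6999 and reaches 0.5 halfway through the warmdown.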
Regularization
gradient clipping
parameters: {"norm":1}
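A minimal sketch of global-norm gradient clipping at norm 1, as listed above (computed over all parameter gradients jointly):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by a common factor so their joint L2 norm
    does not exceed max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

g = [np.array([3.0, 4.0])]        # global norm 5
clipped = clip_by_global_norm(g)  # rescaled to norm ~1
```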

Novel Contributions

  • Looped transformer depth recurrence with 5 stored blocks expanded to 30 virtual layers
  • Encoder-decoder skip connections across virtual layers with learned skip weights
  • Per-virtual-layer LoRA adapters to specialize each repeated layer
  • Residual mixing with the original embedding at each layer
  • NorMuon optimization with wallclock-aware warmdown
  • Stochastic Weight Averaging over 7 checkpoints
  • Int6 quantization with fp16 embedding and LoRA passthrough
  • Sliding-window evaluation with stride 64