PR #103

open

Non-record: Looped Transformer + LoRA + Skip Connections + NorMuon + SWA + Int6 + Sliding Window

by MatthewHRockwell
val_bpb: 1.5000
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14.9 MB

Training Techniques

Architecture
depth recurrence
5 unique transformer blocks are looped to create 30 virtual layers, increasing effective depth without storing parameters for every layer.
parameters: {"unique_layers":5,"virtual_depth":30}
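A minimal sketch of the depth recurrence, assuming the 5 stored blocks are simply cycled to produce the 30 virtual layers (the block internals here are stand-ins, not the actual transformer blocks):

```python
import numpy as np

UNIQUE_LAYERS = 5
VIRTUAL_DEPTH = 30

rng = np.random.default_rng(0)
d_model = 8

# Stand-in for 5 unique transformer blocks: here, just 5 weight matrices.
blocks = [rng.standard_normal((d_model, d_model)) * 0.01
          for _ in range(UNIQUE_LAYERS)]

def forward(x):
    # Reuse the same stored parameters every UNIQUE_LAYERS steps,
    # so 5 blocks yield 30 virtual layers of compute.
    for v in range(VIRTUAL_DEPTH):
        w = blocks[v % UNIQUE_LAYERS]
        x = x + np.tanh(x @ w)   # residual block stand-in
    return x

x = rng.standard_normal((4, d_model))
y = forward(x)
```

Only 5 blocks' worth of parameters are stored while 30 layers of compute are applied, which is what keeps the artifact small.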
skip connections
Encoder-decoder-style skip connections save hidden states from the first half of the virtual layers and consume them in reverse order in the decoder half, blended in via learned skip weights.
parameters: {"encoder_layers":15,"decoder_layers":15}
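A hedged sketch of the U-Net-style skip pattern across virtual layers: the encoder half pushes activations onto a stack, the decoder half pops them in reverse and blends with a learned per-layer weight (the sigmoid gating is an assumption; the entry only says "learned skip weights"):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
ENC, DEC = 15, 15  # encoder_layers / decoder_layers from above

blocks = [rng.standard_normal((d_model, d_model)) * 0.01 for _ in range(5)]
skip_logits = np.zeros(DEC)  # learned skip weights, one per decoder layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    stack = []
    for v in range(ENC):
        x = x + np.tanh(x @ blocks[v % 5])
        stack.append(x)              # save encoder activation
    for d in range(DEC):
        skip = stack.pop()           # consume in reverse order
        g = sigmoid(skip_logits[d])
        x = g * x + (1.0 - g) * skip # learned blend of state and skip
        x = x + np.tanh(x @ blocks[(ENC + d) % 5])
    return x

x = rng.standard_normal((4, d_model))
y = forward(x)
```

Because ENC == DEC, every saved tensor is consumed exactly once and the stack is empty at the end of the forward pass.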
LoRA
Per-virtual-layer LoRA adapters on Q and V projections differentiate each virtual layer with low parameter overhead.
parameters: {"rank":4}
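An illustrative sketch of per-virtual-layer LoRA on a shared projection: one base weight plus a rank-4 adapter per virtual layer, so each repeat of a block behaves differently. The zero-initialized B factor (standard LoRA init) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, RANK, VIRTUAL_DEPTH = 8, 4, 30

W_q = rng.standard_normal((d_model, d_model)) * 0.1   # shared base Q weight
# Per-virtual-layer low-rank factors: A random, B zero, so every adapter
# starts as a zero delta and specializes during training.
lora_A = [rng.standard_normal((d_model, RANK)) * 0.1
          for _ in range(VIRTUAL_DEPTH)]
lora_B = [np.zeros((RANK, d_model)) for _ in range(VIRTUAL_DEPTH)]

def q_proj(x, v):
    """Q projection at virtual layer v: shared weight + that layer's LoRA delta."""
    return x @ W_q + (x @ lora_A[v]) @ lora_B[v]

x = rng.standard_normal((4, d_model))
```

The per-layer overhead is only 2 * d_model * rank parameters per adapted projection, which is why 30 adapters stay cheap.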
residual mixing
A learned blend of the hidden state with the original token embedding at each virtual layer.
parameters: null
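A sketch of the residual mixing step, assuming a sigmoid-gated scalar per layer (the actual parameterization of the learned blend is not specified in the entry):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mix(h, x0, logit):
    """Blend the current hidden state h with the original embedding x0
    via a learned scalar gate (logit is the trainable parameter)."""
    a = sigmoid(logit)
    return a * h + (1.0 - a) * x0
```

Re-injecting the original embedding at every layer gives the looped stack a direct path back to the input, which can stabilize very deep recurrence.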
tied embeddings
Input/output embeddings are tied.
parameters: null
Optimizer
NorMuon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"warmup_start":0.92,"warmup_steps":1500}
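A heavily hedged sketch of a Muon-style matrix update, which is NorMuon's core: momentum accumulation, Newton-Schulz orthogonalization of the update, then a per-row (per-neuron) normalization. The quintic coefficients follow the public Muon recipe; NorMuon's exact normalization and the warmup handling may differ from this sketch:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # scale so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def normuon_step(W, grad, buf, lr=0.02, momentum=0.99):
    buf[:] = momentum * buf + grad       # momentum accumulation (in place)
    U = newton_schulz(buf)               # orthogonalized matrix update
    # Per-neuron (row-wise) normalization of the update direction.
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-7)
    return W - lr * U
```

Scalar and embedding parameters get their own learning rates (scalar_lr, tied_embed_lr above) and are typically handled by a plain Adam-style path rather than this matrix update.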
Weight Averaging
SWA
parameters: {"checkpoints":7}
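A minimal sketch of the weight averaging step, assuming the 7 checkpoints are equally weighted (the entry does not specify a weighting scheme):

```python
import numpy as np

def swa_average(checkpoints):
    """Average a list of state dicts (name -> array) elementwise."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

# Toy checkpoints with constant weights 0..6; the average is 3.0 everywhere.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = swa_average(ckpts)
```

Averaging late-training checkpoints tends to land in a flatter region of the loss surface than any single checkpoint.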
Quantization
int6
bits: 6
scope: block weights with fp16 embedding and fp16 LoRA passthrough
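A hedged sketch of symmetric int6 quantization for the block weights, with embedding and LoRA tensors kept in fp16 ("passthrough"). Per-tensor scaling is an assumption; the entry does not state the quantization granularity:

```python
import numpy as np

QMAX = 31  # int6 range is [-32, 31]; a symmetric scheme uses +/-31

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit integers."""
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
```

Round-to-nearest bounds the reconstruction error by half a quantization step, and the small LoRA/embedding tensors staying in fp16 costs little artifact size.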
Compression
zlib
level: null
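A sketch of the final compression stage, assuming the quantized weight bytes are simply run through zlib at its default level (the entry leaves the level unspecified):

```python
import zlib
import numpy as np

# Quantized int6 values stored one per int8 byte, then zlib-compressed.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.standard_normal(1024) * 10), -31, 31).astype(np.int8)

blob = zlib.compress(q.tobytes())                       # lossless
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
```

Quantization narrows the value distribution, which is exactly what makes the entropy coding in zlib effective on the artifact.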
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
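A hedged sketch of the sliding-window evaluation: the model sees up to a full 4096-token context, but only the final `stride` tokens of each window are scored, so every scored token gets long left context while each token is counted exactly once:

```python
def sliding_windows(n_tokens, context=4096, stride=64):
    """Yield (start, end, score_from) triples over a token stream.
    Tokens in [score_from, end) are scored; [start, score_from) is context."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - context)
        end = min(pos + stride, n_tokens)
        yield start, end, pos
        pos = end

# Small example: 200 tokens, context 128, stride 64.
spans = list(sliding_windows(200, context=128, stride=64))
```

Smaller strides give more context per scored token at the cost of more forward passes; stride 64 against a 4096 context is near the expensive end of that trade-off.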
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"wallclock_aware":true}
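A sketch of the warmdown schedule: constant learning rate, then a linear decay to zero over the final 3000 iterations. "Wallclock-aware" presumably means the warmdown start is adjusted from elapsed time rather than a fixed step count; that adjustment is not shown here:

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Multiplier on the base LR: 1.0 until the warmdown window,
    then linear decay to 0 at total_steps."""
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```

Usage: with 10000 total steps, the scale is 1.0 through step 6999 and reaches 0.5 halfway through the warmdown.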
Regularization
gradient clipping
parameters: {"norm":1}
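A minimal sketch of global-norm gradient clipping at norm 1, as listed above (computed over all parameter gradients jointly):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by a common factor so their joint L2 norm
    does not exceed max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

g = [np.array([3.0, 4.0])]        # global norm 5
clipped = clip_by_global_norm(g)  # rescaled to norm ~1
```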

Novel Contributions

  • Looped transformer depth recurrence with 5 stored blocks expanded to 30 virtual layers
  • Encoder-decoder skip connections across virtual layers with learned skip weights
  • Per-virtual-layer LoRA adapters to specialize each repeated layer
  • Residual mixing with the original embedding at each layer
  • NorMuon optimization with wallclock-aware warmdown
  • Stochastic Weight Averaging over 7 checkpoints
  • Int6 quantization with fp16 embedding and LoRA passthrough
  • Sliding-window evaluation with stride 64