PR #1620

open

Submit Lim Shiaw Yong: 1.66 BPB 12MB Squeeze Architecture

by shiawyonglim
val_bpb
1.6644
Architecture
Transformer
Optimizer
Artifact Size
11.74 MB

Training Techniques

Architecture
depth recurrence
Physically instantiates 6 unique Transformer blocks and routes data through them in a palindrome loop to simulate 12 logical layers.
parameters: {"unique_blocks":6,"logical_layers":12}
parallel residuals
Computes attention and MLP branches simultaneously and injects them into the residual stream together to help gradient flow.
parameters: null
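
One common way to realise this parallel-residual layout (GPT-J/PaLM-style blocks): attention and MLP read the same normalized input and their outputs are summed into the residual stream in a single update. The module choices and dimensions below are illustrative, not taken from the submission.

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP branches computed from the same input, added together."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Both branches are injected into the residual stream in one step,
        # so gradients reach the input through two parallel shortcut paths.
        return x + attn_out + self.mlp(h)
```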
Test-Time Training
full TTT
parameters: {"micro_batching":true}
Quantization
QAT
bits: 6
scope: all
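
A minimal fake-quantization sketch of 6-bit QAT applied to a weight tensor; the per-tensor symmetric scheme and the straight-through estimator are assumptions, since the entry only specifies bits=6 and scope=all.

```python
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                          # 31 for 6-bit signed
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through: forward uses quantized values, backward sees identity.
    return w + (q - w).detach()

class QATLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to 6 bits during training."""
    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight, bits=6), self.bias)
```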
Compression
zlib
level: null
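
Since the compression level is not reported, the sketch below only shows the packaging step: serializing the (quantized) checkpoint and running it through zlib at the library default. File names are placeholders.

```python
import io
import zlib
import torch

def compress_checkpoint(model, path="artifact.bin.zlib"):
    """Serialize the state dict and zlib-compress it (compression level unspecified)."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    compressed = zlib.compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(compressed)
    return len(compressed) / 1e6                        # artifact size in MB

def load_checkpoint(model, path="artifact.bin.zlib"):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    model.load_state_dict(torch.load(io.BytesIO(raw)))
    return model
```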
Sequence Length
sequence_length
train_length: 65536
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}

Novel Contributions

  • Symmetrical modulo routing with palindrome depth recurrence
  • Parallel residual computation for improved gradient flow
  • TTT micro-batching during evaluation
  • 6-bit quantization-aware training for artifact size reduction