PR #1919 (open)
Add SP8192 + ParResid + DR + LoRA TTT + Mixed int4/int6/int8 + AWQ su…
by dev-pratap-singh
val_bpb
1.0587
Architecture
Transformer
Optimizer
Muon
Artifact Size
≤16 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
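A minimal sketch of weight tying in pure Python, with illustrative names and sizes (not from the PR): the output head reuses the embedding matrix, so logits are dot products of the hidden state against the embedding rows.

```python
import random

# Tiny tied-embedding model: one matrix serves as both input
# embedding and output projection. All values are made up.
vocab, dim = 8, 4
random.seed(0)
emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]

def embed(token_id):
    return emb[token_id]

def logits(hidden):
    # Tied head: project the hidden state onto every embedding row.
    return [sum(h * w for h, w in zip(hidden, row)) for row in emb]

h = embed(3)
out = logits(h)
```

Tying halves the parameter count of the embedding/head pair, which matters under a ≤16 MB artifact budget.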
ReLU²
Uses relu squared activation in the MLP.
parameters: null
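The ReLU² activation is simply the ReLU output squared, applied elementwise in the MLP:

```python
# ReLU² (squared ReLU): max(x, 0) ** 2, applied elementwise.
def relu2(x):
    return max(x, 0.0) ** 2

hidden = [-2.0, -0.5, 0.0, 0.5, 2.0]
activated = [relu2(v) for v in hidden]
```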
parallel residuals
Attention and MLP both read the same residual input and their outputs are added together in a fused residual update.
parameters: {"blocks":"every block"}
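A sketch of the parallel-residual block structure described above, with scalar stand-ins for the sublayers (the norm/attn/mlp functions here are placeholders, not the PR's code): both sublayers read the same input and their outputs are summed into one residual update.

```python
def norm(x):   # placeholder for the block's normalization
    return x

def attn(x):   # placeholder attention sublayer
    return 0.5 * x

def mlp(x):    # placeholder MLP sublayer
    return 0.25 * x

def parallel_block(x):
    # Attention and MLP both read the SAME residual input h,
    # and their outputs are added in one fused update.
    h = norm(x)
    return x + attn(h) + mlp(h)

y = parallel_block(2.0)
```

Compare the usual sequential form `x + mlp(norm(x + attn(norm(x))))`: the parallel form lets both sublayers run from one read of the residual stream.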
depth recurrence
Recurrent execution over a middle band of layers.
parameters: {"layers":[3,7],"repetitions":3}
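A sketch of the layer schedule these parameters imply, assuming the `[3,7]` band is inclusive (an assumption) and using trivial stand-in blocks: layers 0–2 run once, layers 3–7 run three times, then the remaining layers run once.

```python
NUM_LAYERS = 10              # total depth: made-up value for the demo
REC_START, REC_END = 3, 7    # recurrent band from the parameters
REPS = 3                     # repetitions from the parameters

trace = []

def layer(i, x):
    # Stand-in transformer block: just record which layer ran.
    trace.append(i)
    return x

def forward(x):
    for i in range(REC_START):                 # pre-recurrent zone
        x = layer(i, x)
    for _ in range(REPS):                      # recurrent middle band
        for i in range(REC_START, REC_END + 1):
            x = layer(i, x)
    for i in range(REC_END + 1, NUM_LAYERS):   # post-recurrent zone
        x = layer(i, x)
    return x

forward(0.0)
```

Weight reuse in the middle band buys effective depth without growing the artifact.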
U-Net skip connections
Skip connections across pre- and post-recurrent zones.
parameters: {"num_skip_weights":3}
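A sketch of the U-Net pairing with `num_skip_weights=3`: activations cached in the pre-recurrent zone are added back, scaled by learned scalars, in reverse order after the recurrent zone. The layer functions and weight values are illustrative.

```python
skip_w = [0.1, 0.2, 0.3]  # learned scalar skip weights (values made up)

def unet_forward(x, pre_layers, post_layers):
    cache = []
    for f in pre_layers:
        x = f(x)
        cache.append(x)               # push pre-zone activation
    for f, w in zip(post_layers, skip_w):
        x = f(x) + w * cache.pop()    # pop in reverse: U-Net pairing
    return x

pre = [lambda x: x + 1.0] * 3   # stand-in pre-recurrent layers
post = [lambda x: x * 1.0] * 3  # stand-in post-recurrent layers
y = unet_forward(0.0, pre, post)
```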
Quantization
mixed int4/int6/int8
bits: null
scope: embeddings and block weights
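To make the int4/int6/int8 buckets concrete, here is a generic symmetric per-tensor quantizer at a chosen bit width; this is a sketch of the idea, not the PR's packing code.

```python
def quantize(weights, bits):
    # Symmetric quantization: map to integers in [-(2^(b-1)), 2^(b-1)-1].
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.9]
for bits in (4, 6, 8):
    q, s = quantize(w, bits)
    back = dequantize(q, s)
    err = max(abs(a - b) for a, b in zip(w, back))  # shrinks as bits grow
```

Mixed-precision schemes spend the higher bit widths on the tensors most sensitive to rounding and int4 on the rest.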
AWQ
bits: null
scope: int4-bound linear layers
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":16,"alpha":16,"steps_per_chunk":4,"learning_rate":0.001}
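A pure-Python sketch of the LoRA forward pass these parameters describe: the frozen weight W gets a low-rank delta scaled by alpha/rank, which is 1.0 here since rank=16 and alpha=16. The demo uses rank 2 and tiny made-up matrices; note that real LoRA zero-initializes B so the delta starts at zero.

```python
RANK, ALPHA = 2, 2        # PR uses rank=16, alpha=16; small for the demo
scaling = ALPHA / RANK    # = 1.0, same ratio as the PR's settings

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]     # frozen 2x3 base weight
A = [[0.1, 0.0, 0.0],
     [0.0, 0.1, 0.0]]     # rank x in_dim, trainable
B = [[1.0, 0.0],
     [0.0, 1.0]]          # out_dim x rank, trainable
                          # (zero-init in real LoRA; nonzero for the demo)

def lora_forward(x):
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))          # low-rank update path
    return [b + scaling * d for b, d in zip(base, delta)]

y = lora_forward([1.0, 2.0, 3.0])
```

At test time only A and B are updated per chunk, so the adapted state stays tiny relative to the frozen model.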
score-first TTT
parameters: {"chunk_tokens":16384}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"newton_schulz_steps":5,"warmup_momentum_start":0.85}
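At the heart of Muon is a Newton–Schulz iteration that replaces the momentum matrix with an approximately orthogonal one. Muon's reference form uses a tuned quintic polynomial; this sketch uses the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X for the stated 5 steps, in pure Python on a 2×2 example.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz(G, steps=5):
    # Normalize by the Frobenius norm so the spectral norm is <= 1.
    fro = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / fro for v in row] for row in G]
    for _ in range(steps):
        # Cubic iteration; each step pushes singular values toward 1.
        P = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * p for x, p in zip(xr, pr)]
             for xr, pr in zip(X, P)]
    return X

Q = newton_schulz([[2.0, 0.0], [1.0, 1.0]], steps=5)
```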
Adam
weight_decay: null
momentum: null
other_params: {"used_for":["tok_emb","scalars","skip_weights"]}
LR Schedule
linear warmup
parameters: {"warmup_chunks":100}
warmdown
parameters: {"warmdown_iters":1800}
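Together these parameters imply a trapezoidal schedule: linear warmup, a flat plateau, then linear warmdown. A sketch, with a made-up total step count and treating warmup chunks and warmdown iters as the same step unit (an assumption):

```python
WARMUP, WARMDOWN, TOTAL = 100, 1800, 5000   # TOTAL is illustrative

def lr_scale(step):
    # Multiplier on the base learning rate at a given step.
    if step < WARMUP:
        return (step + 1) / WARMUP          # linear warmup
    if step >= TOTAL - WARMDOWN:
        return (TOTAL - step) / WARMDOWN    # linear warmdown to 0
    return 1.0                              # flat plateau
```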
Regularization
logit softcap
parameters: {"value":15}
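Logit softcapping with value=15 squashes logits through cap·tanh(logit/cap), bounding them smoothly to (−15, 15) while staying near-identity for small logits:

```python
import math

CAP = 15.0  # the "value": 15 parameter

def softcap(logit):
    # Smooth bound: near-identity for |logit| << CAP, saturates at +/-CAP.
    return CAP * math.tanh(logit / CAP)
```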
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"causal":true}
Initialization
resid mix
A per-block resid_mix coefficient re-injects the original embedding into the hidden state of each recurrent block.
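A sketch of one plausible resid_mix form, mixing the hidden state back toward the original embedding x0 before each recurrent block; the convex-mix formula and coefficient values are illustrative assumptions, not the PR's exact code.

```python
resid_mix = [0.3, 0.2, 0.1]   # one scalar per recurrent pass (made up)

def block(h):
    return h + 1.0            # stand-in recurrent block

def recurrent_forward(x0):
    h = x0
    for m in resid_mix:
        h = (1.0 - m) * h + m * x0   # re-inject the original embedding
        h = block(h)
    return h

y = recurrent_forward(0.0)
```

Re-injection keeps the repeated band anchored to the token identity, which otherwise washes out as the same layers are applied multiple times.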
Novel Contributions
- SP8192 tokenizer with int8 embeddings
- Parallel residuals in every block
- Depth recurrence over the middle layer band
- LoRA-only score-first test-time training
- Mixed int4/int6/int8 quantization with AWQ
- zstd-compressed artifact export