PR #1919 (open)
Add SP8192 + ParResid + DR + LoRA TTT + Mixed int4/int6/int8 + AWQ su…
by dev-pratap-singh
val_bpb
1.0587
Architecture
Transformer
Optimizer
Muon
Artifact Size
≤16 MB
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
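A minimal sketch of weight tying in pure Python, with illustrative names and sizes (not from the PR): the output head reuses the embedding matrix, so logits are dot products of the hidden state against the embedding rows.

```python
import random

# Tiny tied-embedding model: one matrix serves as both input
# embedding and output projection. All values are made up.
vocab, dim = 8, 4
random.seed(0)
emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]

def embed(token_id):
    return emb[token_id]

def logits(hidden):
    # Tied head: project the hidden state onto every embedding row.
    return [sum(h * w for h, w in zip(hidden, row)) for row in emb]

h = embed(3)
out = logits(h)
```

Tying halves the parameter count of the embedding/head pair, which matters under a ≤16 MB artifact budget.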
ReLU²
Uses relu squared activation in the MLP.
parameters: null
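The ReLU² activation is simply the ReLU output squared, applied elementwise in the MLP:

```python
# ReLU² (squared ReLU): max(x, 0) ** 2, applied elementwise.
def relu2(x):
    return max(x, 0.0) ** 2

hidden = [-2.0, -0.5, 0.0, 0.5, 2.0]
activated = [relu2(v) for v in hidden]
```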
parallel residuals
Attention and MLP both read the same residual input and their outputs are added together in a fused residual update.
parameters: {"blocks":"every block"}
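A sketch of the parallel-residual block structure described above, with scalar stand-ins for the sublayers (the norm/attn/mlp functions here are placeholders, not the PR's code): both sublayers read the same input and their outputs are summed into one residual update.

```python
def norm(x):   # placeholder for the block's normalization
    return x

def attn(x):   # placeholder attention sublayer
    return 0.5 * x

def mlp(x):    # placeholder MLP sublayer
    return 0.25 * x

def parallel_block(x):
    # Attention and MLP both read the SAME residual input h,
    # and their outputs are added in one fused update.
    h = norm(x)
    return x + attn(h) + mlp(h)

y = parallel_block(2.0)
```

Compare the usual sequential form `x + mlp(norm(x + attn(norm(x))))`: the parallel form lets both sublayers run from one read of the residual stream.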
depth recurrence
Recurrent execution over a middle band of layers.
parameters: {"layers":[3,7],"repetitions":3}
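A sketch of the layer schedule these parameters imply, assuming the `[3,7]` band is inclusive (an assumption) and using trivial stand-in blocks: layers 0–2 run once, layers 3–7 run three times, then the remaining layers run once.

```python
NUM_LAYERS = 10              # total depth: made-up value for the demo
REC_START, REC_END = 3, 7    # recurrent band from the parameters
REPS = 3                     # repetitions from the parameters

trace = []

def layer(i, x):
    # Stand-in transformer block: just record which layer ran.
    trace.append(i)
    return x

def forward(x):
    for i in range(REC_START):                 # pre-recurrent zone
        x = layer(i, x)
    for _ in range(REPS):                      # recurrent middle band
        for i in range(REC_START, REC_END + 1):
            x = layer(i, x)
    for i in range(REC_END + 1, NUM_LAYERS):   # post-recurrent zone
        x = layer(i, x)
    return x

forward(0.0)
```

Weight reuse in the middle band buys effective depth without growing the artifact.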
U-Net skip connections
Skip connections across pre- and post-recurrent zones.
parameters: {"num_skip_weights":3}
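A sketch of the U-Net pairing with `num_skip_weights=3`: activations cached in the pre-recurrent zone are added back, scaled by learned scalars, in reverse order after the recurrent zone. The layer functions and weight values are illustrative.

```python
skip_w = [0.1, 0.2, 0.3]  # learned scalar skip weights (values made up)

def unet_forward(x, pre_layers, post_layers):
    cache = []
    for f in pre_layers:
        x = f(x)
        cache.append(x)               # push pre-zone activation
    for f, w in zip(post_layers, skip_w):
        x = f(x) + w * cache.pop()    # pop in reverse: U-Net pairing
    return x

pre = [lambda x: x + 1.0] * 3   # stand-in pre-recurrent layers
post = [lambda x: x * 1.0] * 3  # stand-in post-recurrent layers
y = unet_forward(0.0, pre, post)
```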
Quantization
mixed int4/int6/int8
bits: null
scope: embeddings and block weights
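To make the int4/int6/int8 buckets concrete, here is a generic symmetric per-tensor quantizer at a chosen bit width; this is a sketch of the idea, not the PR's packing code.

```python
def quantize(weights, bits):
    # Symmetric quantization: map to integers in [-(2^(b-1)), 2^(b-1)-1].
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.9]
for bits in (4, 6, 8):
    q, s = quantize(w, bits)
    back = dequantize(q, s)
    err = max(abs(a - b) for a, b in zip(w, back))  # shrinks as bits grow
```

Mixed-precision schemes spend the higher bit widths on the tensors most sensitive to rounding and int4 on the rest.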
AWQ
bits: null
scope: int4-bound linear layers
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":16,"alpha":16,"steps_per_chunk":4,"learning_rate":0.001}
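A pure-Python sketch of the LoRA forward pass these parameters describe: the frozen weight W gets a low-rank delta scaled by alpha/rank, which is 1.0 here since rank=16 and alpha=16. The demo uses rank 2 and tiny made-up matrices; note that real LoRA zero-initializes B so the delta starts at zero.

```python
RANK, ALPHA = 2, 2        # PR uses rank=16, alpha=16; small for the demo
scaling = ALPHA / RANK    # = 1.0, same ratio as the PR's settings

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]     # frozen 2x3 base weight
A = [[0.1, 0.0, 0.0],
     [0.0, 0.1, 0.0]]     # rank x in_dim, trainable
B = [[1.0, 0.0],
     [0.0, 1.0]]          # out_dim x rank, trainable
                          # (zero-init in real LoRA; nonzero for the demo)

def lora_forward(x):
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))          # low-rank update path
    return [b + scaling * d for b, d in zip(base, delta)]

y = lora_forward([1.0, 2.0, 3.0])
```

At test time only A and B are updated per chunk, so the adapted state stays tiny relative to the frozen model.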
score-first TTT
parameters: {"chunk_tokens":16384}
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"newton_schulz_steps":5,"warmup_momentum_start":0.85}
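At the heart of Muon is a Newton–Schulz iteration that replaces the momentum matrix with an approximately orthogonal one. Muon's reference form uses a tuned quintic polynomial; this sketch uses the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X for the stated 5 steps, in pure Python on a 2×2 example.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz(G, steps=5):
    # Normalize by the Frobenius norm so the spectral norm is <= 1.
    fro = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / fro for v in row] for row in G]
    for _ in range(steps):
        # Cubic iteration; each step pushes singular values toward 1.
        P = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * p for x, p in zip(xr, pr)]
             for xr, pr in zip(X, P)]
    return X

Q = newton_schulz([[2.0, 0.0], [1.0, 1.0]], steps=5)
```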
Adam
weight_decay: null
momentum: null
other_params: {"used_for":["tok_emb","scalars","skip_weights"]}
LR Schedule
linear warmup
parameters: {"warmup_chunks":100}
warmdown
parameters: {"warmdown_iters":1800}
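Together these parameters imply a trapezoidal schedule: linear warmup, a flat plateau, then linear warmdown. A sketch, with a made-up total step count and treating warmup chunks and warmdown iters as the same step unit (an assumption):

```python
WARMUP, WARMDOWN, TOTAL = 100, 1800, 5000   # TOTAL is illustrative

def lr_scale(step):
    # Multiplier on the base learning rate at a given step.
    if step < WARMUP:
        return (step + 1) / WARMUP          # linear warmup
    if step >= TOTAL - WARMDOWN:
        return (TOTAL - step) / WARMDOWN    # linear warmdown to 0
    return 1.0                              # flat plateau
```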
Regularization
logit softcap
parameters: {"value":15}
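Logit softcapping with value=15 squashes logits through cap·tanh(logit/cap), bounding them smoothly to (−15, 15) while staying near-identity for small logits:

```python
import math

CAP = 15.0  # the "value": 15 parameter

def softcap(logit):
    # Smooth bound: near-identity for |logit| << CAP, saturates at +/-CAP.
    return CAP * math.tanh(logit / CAP)
```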
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"causal":true}
Initialization
resid mix
A per-block resid_mix coefficient re-injects the original embedding into the hidden state of each recurrent block.
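A sketch of one plausible resid_mix form, mixing the hidden state back toward the original embedding x0 before each recurrent block; the convex-mix formula and coefficient values are illustrative assumptions, not the PR's exact code.

```python
resid_mix = [0.3, 0.2, 0.1]   # one scalar per recurrent pass (made up)

def block(h):
    return h + 1.0            # stand-in recurrent block

def recurrent_forward(x0):
    h = x0
    for m in resid_mix:
        h = (1.0 - m) * h + m * x0   # re-inject the original embedding
        h = block(h)
    return h

y = recurrent_forward(0.0)
```

Re-injection keeps the repeated band anchored to the token identity, which otherwise washes out as the same layers are applied multiple times.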
Novel Contributions
- SP8192 tokenizer with int8 embeddings
- Parallel residuals in every block
- Depth recurrence over the middle layer band
- LoRA-only score-first test-time training
- Mixed int4/int6/int8 quantization with AWQ
- zstd-compressed artifact export