## Summary

| val_bpb | Architecture | Optimizer | Artifact Size |
|---|---|---|---|
| 1.4942 | TRN hybrid | Muon | 15.28 MB |
## Training Techniques

### Quantization

- **int5 QAT**
  - bits: 5
  - scope: all matrix weights; embeddings remain fp16
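As a rough illustration of what int5 quantization does to a weight tensor, here is a dependency-free sketch; the symmetric per-tensor scheme and the 31-level range [-15, 15] are assumptions, and the actual QAT would run this as a fake-quantize step with a straight-through estimator inside the PyTorch training loop.

```python
def fake_quant_int5(weights):
    # Symmetric per-tensor int5 fake quantization (assumed scheme):
    # snap each weight to one of 31 integer levels in [-15, 15],
    # then dequantize back to float.
    qmax = 2 ** (5 - 1) - 1  # 15
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]
```

Round-tripped weights differ from the originals by at most half a quantization step, which is the error the QAT teaches the model to tolerate.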
### Architecture

- **TRN hybrid**: 10-layer interleaved hybrid model combining 7 TRN layers with 3 causal attention layers, pairing pattern compression with exact retrieval.
  - parameters: `{"layers": 10, "trn_layers": 7, "attention_layers": 3}`
- **BigramHash**: Token-pair hash table added to the embedding stack to improve representational capacity.
  - parameters: `{"vocab_size": 10240, "dim": 128}`
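A minimal sketch of how a token-pair hash embedding can work; only the table shape (10240 buckets, dim 128) comes from the parameters above, while the hash mixing constant and the additive-lookup detail are assumptions.

```python
TABLE_SIZE, DIM = 10240, 128  # from the parameters above

def bigram_bucket(prev_tok, tok):
    # Hash the (previous, current) token pair into one of TABLE_SIZE
    # buckets; the multiplier is a hypothetical mixing constant.
    return (prev_tok * 1_000_003 + tok) % TABLE_SIZE
```

At embedding time, the bucket's 128-dim vector would be added to the regular token embedding, giving the model a cheap exact memory of frequent bigrams.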
- **Tied embeddings**: Input and output embeddings share weights.
- **GQA**: Grouped-query attention used in the attention layers.
  - parameters: `{"heads": 8, "kv_heads": 4}`
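With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, halving the KV cache relative to standard multi-head attention. The grouping rule can be sketched as:

```python
HEADS, KV_HEADS = 8, 4  # from the parameters above

def kv_head_for(q_head):
    # Consecutive query heads are grouped; each group of
    # HEADS // KV_HEADS query heads attends with one shared KV head.
    return q_head // (HEADS // KV_HEADS)

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, etc.
```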
### Optimizer

- **Muon**: lr 0.04, momentum 0.95, weight decay 0.04; applied to matrices only.
- **Adam**: lr 0.05, beta1 0.9, beta2 0.95; applied to embeddings.
### Weight Averaging

- **EMA**: decay 0.997, started at 50% of training.
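The EMA tracks a slowly moving copy of the weights; with decay 0.997 each step keeps 99.7% of the running average. A minimal elementwise sketch:

```python
def ema_update(ema, weights, decay=0.997):
    # Exponential moving average of the model weights, elementwise.
    # Per the entry above, collection starts at the 50% mark of training.
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]
```

Evaluation would then use the averaged weights rather than the raw final checkpoint.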
### Compression

- **zstd**: level 22

### Sequence Length

- train_length: 1024
- eval_length: 1024
### LR Schedule

- **warmdown**: `warmdown_iters: 1200`
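Only `warmdown_iters = 1200` is given; a common form of warmdown, holding the LR constant and then decaying linearly to zero over the final iterations, can be sketched as (the linear shape is an assumption):

```python
def warmdown_lr(step, total_steps, base_lr=0.04, warmdown_iters=1200):
    # Constant LR until the final warmdown_iters steps, then an
    # (assumed) linear decay to zero.
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```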
### Regularization

- **Weight decay**: 0.04, applied to Muon-updated matrices only.
### Other

- **Parallel scan**: Kogge-Stone parallel prefix scan over complex-valued oscillators, implemented in pure PyTorch, for the TRN recurrence.
  - parameters: `{"scan_type": "Kogge-Stone", "implementation": "pure PyTorch"}`
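The TRN recurrence h_t = a_t * h_{t-1} + x_t (with complex-valued decay a_t) is associative, so it can be computed in O(log n) doubling steps instead of a sequential loop. A pure-Python sketch of the Kogge-Stone formulation; the real implementation would vectorize each doubling step as batched PyTorch tensor ops.

```python
def kogge_stone_scan(a, x):
    # Inclusive scan of h_t = a[t] * h_{t-1} + x[t] (with h_{-1} = 0),
    # using the associative composition
    #   (a2, x2) o (a1, x1) = (a2 * a1, a2 * x1 + x2).
    # Each doubling step combines position i with position i - d.
    n = len(a)
    A, X = list(a), list(x)
    d = 1
    while d < n:
        A_new, X_new = A[:], X[:]
        for i in range(d, n):
            A_new[i] = A[i] * A[i - d]
            X_new[i] = A[i] * X[i - d] + X[i]
        A, X = A_new, X_new
        d *= 2
    return X  # X[t] now holds h_t

# Complex decays model damped oscillators: |a| < 1 shrinks the state,
# while the phase of a rotates it each step.
a = [0.9 * complex(0.6, 0.8) for _ in range(8)]
x = [complex(t, -t) for t in range(8)]
parallel = kogge_stone_scan(a, x)

# Cross-check against the sequential recurrence.
h, sequential = 0j, []
for at, xt in zip(a, x):
    h = at * h + xt
    sequential.append(h)
assert all(abs(p - s) < 1e-9 for p, s in zip(parallel, sequential))
```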
- **Token shift**: RWKV-6-style token shift enabled in the pre-resonance mixing.
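Token shift interpolates each position's input with the previous position's, per channel. A minimal scalar-per-channel sketch; the mixing coefficient `mu` here is hypothetical, and RWKV-6 additionally makes the mix data-dependent.

```python
def token_shift(xs, mu=0.7):
    # Mix each timestep with its predecessor:
    # mu * x_t + (1 - mu) * x_{t-1}, with x_{-1} = 0
    # (equivalent to a zero-padded shift of the sequence).
    prev = 0.0
    out = []
    for x in xs:
        out.append(mu * x + (1 - mu) * prev)
        prev = x
    return out
```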
- **Activation**: squared LeakyReLU with negative slope 0.5 (`LeakyReLU(0.5)(x)^2`), with a PCG-lambda regularization-like setting of 0.5.
  - parameters: `{"activation": "LeakyReLU(0.5)^2", "pcg_lambda": 0.5}`
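Squaring a LeakyReLU keeps the output smooth and non-negative while still passing gradient through negative inputs; a scalar sketch of the stated activation:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared.
    h = x if x > 0 else slope * x
    return h * h
```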
## Test-Time Training

- **LoRA TTT**: rank 8, learning rate 0.01, chunk size 256 tokens.
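The idea of LoRA test-time training is to freeze the base weights and take small gradient steps on low-rank adapter factors as each chunk of the test sequence is consumed. A deliberately tiny scalar sketch to show the update mechanics only: scalars stand in for the rank-8 factor matrices, and squared error stands in for the real next-token loss; the run above uses rank 8, lr 0.01, and 256-token chunks.

```python
def lora_ttt_step(w, a, b, x, y, lr=0.01):
    # Effective weight = frozen base w plus low-rank update b * a
    # (scalars here stand in for the rank-8 factor matrices).
    pred = (w + b * a) * x
    err = pred - y
    # Gradients of 0.5 * err**2 w.r.t. the adapter factors only;
    # the base weight w is never updated.
    grad_a = err * x * b
    grad_b = err * x * a
    return a - lr * grad_a, b - lr * grad_b
```

In the full method, one such step would run per 256-token chunk before the model predicts the next chunk.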
## Novel Contributions
- Hybrid architecture combining TRN recurrence with periodic causal attention layers
- Complex-valued oscillator recurrence with learned frequency, phase, amplitude, and decay
- Kogge-Stone parallel prefix scan implementation in pure PyTorch without Triton or custom CUDA
- Int5 QAT under a 16 MB artifact constraint
- BigramHash token-pair embedding augmentation
- Detailed analysis of int5 quantization collapse in oscillatory recurrence parameters
- Interleaved TRN/attention layer layout for balancing compression and exact retrieval