## Summary

| val_bpb | Architecture | Optimizer | Artifact Size |
|---|---|---|---|
| 1.4942 | TRN hybrid | Muon | 15.28 MB |
## Training Techniques

### Quantization

- **int5 QAT**
  - bits: 5
  - scope: all matrix weights; embeddings remain fp16
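As a rough illustration of what int5 quantization does to a weight tensor, here is a dependency-free sketch; the symmetric per-tensor scheme and the 31-level range [-15, 15] are assumptions, and the actual QAT would run this as a fake-quantize step with a straight-through estimator inside the PyTorch training loop.

```python
def fake_quant_int5(weights):
    # Symmetric per-tensor int5 fake quantization (assumed scheme):
    # snap each weight to one of 31 integer levels in [-15, 15],
    # then dequantize back to float.
    qmax = 2 ** (5 - 1) - 1  # 15
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]
```

Round-tripped weights differ from the originals by at most half a quantization step, which is the error the QAT teaches the model to tolerate.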
### Architecture

- **TRN hybrid**: 10-layer interleaved hybrid model combining 7 TRN layers with 3 causal attention layers, pairing pattern compression with exact retrieval.
  - parameters: `{"layers": 10, "trn_layers": 7, "attention_layers": 3}`
- **BigramHash**: Token-pair hash table added to the embedding stack to improve representational capacity.
  - parameters: `{"vocab_size": 10240, "dim": 128}`
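A minimal sketch of how a token-pair hash embedding can work; only the table shape (10240 buckets, dim 128) comes from the parameters above, while the hash mixing constant and the additive-lookup detail are assumptions.

```python
TABLE_SIZE, DIM = 10240, 128  # from the parameters above

def bigram_bucket(prev_tok, tok):
    # Hash the (previous, current) token pair into one of TABLE_SIZE
    # buckets; the multiplier is a hypothetical mixing constant.
    return (prev_tok * 1_000_003 + tok) % TABLE_SIZE
```

At embedding time, the bucket's 128-dim vector would be added to the regular token embedding, giving the model a cheap exact memory of frequent bigrams.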
- **Tied embeddings**: Input and output embeddings share weights.
- **GQA**: Grouped-query attention used in the attention layers.
  - parameters: `{"heads": 8, "kv_heads": 4}`
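With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, halving the KV cache relative to standard multi-head attention. The grouping rule can be sketched as:

```python
HEADS, KV_HEADS = 8, 4  # from the parameters above

def kv_head_for(q_head):
    # Consecutive query heads are grouped; each group of
    # HEADS // KV_HEADS query heads attends with one shared KV head.
    return q_head // (HEADS // KV_HEADS)

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, etc.
```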
### Optimizer

- **Muon**: lr 0.04, momentum 0.95, weight decay 0.04; applied to matrices only.
- **Adam**: lr 0.05, beta1 0.9, beta2 0.95; applied to embeddings.
### Weight Averaging

- **EMA**: decay 0.997, started at 50% of training.
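The EMA tracks a slowly moving copy of the weights; with decay 0.997 each step keeps 99.7% of the running average. A minimal elementwise sketch:

```python
def ema_update(ema, weights, decay=0.997):
    # Exponential moving average of the model weights, elementwise.
    # Per the entry above, collection starts at the 50% mark of training.
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]
```

Evaluation would then use the averaged weights rather than the raw final checkpoint.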
### Compression

- **zstd**: level 22

### Sequence Length

- train_length: 1024
- eval_length: 1024
### LR Schedule

- **warmdown**: `warmdown_iters: 1200`
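Only `warmdown_iters = 1200` is given; a common form of warmdown, holding the LR constant and then decaying linearly to zero over the final iterations, can be sketched as (the linear shape is an assumption):

```python
def warmdown_lr(step, total_steps, base_lr=0.04, warmdown_iters=1200):
    # Constant LR until the final warmdown_iters steps, then an
    # (assumed) linear decay to zero.
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```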
### Regularization

- **Weight decay**: 0.04, applied to Muon-updated matrices only.
### Other

- **Parallel scan**: Kogge-Stone parallel prefix scan over complex-valued oscillators, implemented in pure PyTorch, for the TRN recurrence.
  - parameters: `{"scan_type": "Kogge-Stone", "implementation": "pure PyTorch"}`
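The TRN recurrence h_t = a_t * h_{t-1} + x_t (with complex-valued decay a_t) is associative, so it can be computed in O(log n) doubling steps instead of a sequential loop. A pure-Python sketch of the Kogge-Stone formulation; the real implementation would vectorize each doubling step as batched PyTorch tensor ops.

```python
def kogge_stone_scan(a, x):
    # Inclusive scan of h_t = a[t] * h_{t-1} + x[t] (with h_{-1} = 0),
    # using the associative composition
    #   (a2, x2) o (a1, x1) = (a2 * a1, a2 * x1 + x2).
    # Each doubling step combines position i with position i - d.
    n = len(a)
    A, X = list(a), list(x)
    d = 1
    while d < n:
        A_new, X_new = A[:], X[:]
        for i in range(d, n):
            A_new[i] = A[i] * A[i - d]
            X_new[i] = A[i] * X[i - d] + X[i]
        A, X = A_new, X_new
        d *= 2
    return X  # X[t] now holds h_t

# Complex decays model damped oscillators: |a| < 1 shrinks the state,
# while the phase of a rotates it each step.
a = [0.9 * complex(0.6, 0.8) for _ in range(8)]
x = [complex(t, -t) for t in range(8)]
parallel = kogge_stone_scan(a, x)

# Cross-check against the sequential recurrence.
h, sequential = 0j, []
for at, xt in zip(a, x):
    h = at * h + xt
    sequential.append(h)
assert all(abs(p - s) < 1e-9 for p, s in zip(parallel, sequential))
```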
- **Token shift**: RWKV-6-style token shift enabled in the pre-resonance mixing.
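Token shift interpolates each position's input with the previous position's, per channel. A minimal scalar-per-channel sketch; the mixing coefficient `mu` here is hypothetical, and RWKV-6 additionally makes the mix data-dependent.

```python
def token_shift(xs, mu=0.7):
    # Mix each timestep with its predecessor:
    # mu * x_t + (1 - mu) * x_{t-1}, with x_{-1} = 0
    # (equivalent to a zero-padded shift of the sequence).
    prev = 0.0
    out = []
    for x in xs:
        out.append(mu * x + (1 - mu) * prev)
        prev = x
    return out
```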
- **Activation**: squared LeakyReLU with negative slope 0.5 (`LeakyReLU(0.5)(x)^2`), with a PCG-lambda regularization-like setting of 0.5.
  - parameters: `{"activation": "LeakyReLU(0.5)^2", "pcg_lambda": 0.5}`
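Squaring a LeakyReLU keeps the output smooth and non-negative while still passing gradient through negative inputs; a scalar sketch of the stated activation:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared.
    h = x if x > 0 else slope * x
    return h * h
```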
## Test-Time Training

- **LoRA TTT**: rank 8, learning rate 0.01, chunk size 256 tokens.
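The idea of LoRA test-time training is to freeze the base weights and take small gradient steps on low-rank adapter factors as each chunk of the test sequence is consumed. A deliberately tiny scalar sketch to show the update mechanics only: scalars stand in for the rank-8 factor matrices, and squared error stands in for the real next-token loss; the run above uses rank 8, lr 0.01, and 256-token chunks.

```python
def lora_ttt_step(w, a, b, x, y, lr=0.01):
    # Effective weight = frozen base w plus low-rank update b * a
    # (scalars here stand in for the rank-8 factor matrices).
    pred = (w + b * a) * x
    err = pred - y
    # Gradients of 0.5 * err**2 w.r.t. the adapter factors only;
    # the base weight w is never updated.
    grad_a = err * x * b
    grad_b = err * x * a
    return a - lr * grad_a, b - lr * grad_b
```

In the full method, one such step would run per 256-token chunk before the model predicts the next chunk.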
## Novel Contributions
- Hybrid architecture combining TRN recurrence with periodic causal attention layers
- Complex-valued oscillator recurrence with learned frequency, phase, amplitude, and decay
- Kogge-Stone parallel prefix scan implementation in pure PyTorch without Triton or custom CUDA
- Int5 QAT under a 16 MB artifact constraint
- BigramHash token-pair embedding augmentation
- Detailed analysis of int5 quantization collapse in oscillatory recurrence parameters
- Interleaved TRN/attention layer layout for balancing compression and exact retrieval