PR #1095
openRecord: Seed-Regenerated Random Model + Incremental N-gram Cache — val_bpb 0.0905
by vimeto
val_bpb: 0.0905
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.09 MB
Training Techniques
Initialization
OrthoInit
Frozen orthogonal random projections regenerated from deterministic 8-byte seeds at load time.
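The idea of storing only an 8-byte seed and regenerating the frozen projection at load time can be sketched as follows. This is a minimal illustration, not the PR's code: the seeded-Gaussian-plus-QR construction and the sign fix are assumptions about how a deterministic orthogonal matrix might be rebuilt.

```python
import numpy as np

def regen_orthogonal(seed: bytes, shape: tuple) -> np.ndarray:
    """Rebuild a frozen random orthogonal projection from an 8-byte seed.

    Only the seed is stored in the artifact; QR of a seeded Gaussian
    draw deterministically reproduces the same orthonormal matrix.
    """
    rng = np.random.default_rng(int.from_bytes(seed, "little"))
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    # sign-fix each column so the factorization is unique/deterministic
    q *= np.sign(np.diag(r))
    return q

W = regen_orthogonal(b"\x01\x02\x03\x04\x05\x06\x07\x08", (64, 64))
# W @ W.T is (numerically) the identity, yet only 8 bytes are stored
```

Because the matrix is frozen, regeneration at load time is exact: the same seed always yields the same weights.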
Quantization
int8
bits: 8
scope: LoRA adapters
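A minimal sketch of int8 quantization as it might apply to the LoRA adapters. The PR's exact scheme is not specified here; this assumes symmetric per-tensor quantization (per-channel scales would be a natural variant).

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map the max-abs value
    to 127 and round. Reconstruction error is bounded by scale / 2."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
err = float(np.abs(w - w_hat).max())  # <= s / 2
```

Storing the adapters as int8 quarters their size versus float32, consistent with the small reported BPB loss.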
Compression
lzma
level: 9
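Artifact packing with LZMA at the listed level can be shown with the standard library. The payload here is stand-in data; how the PR serializes its tensors is an assumption.

```python
import lzma
import numpy as np

# Stand-in for serialized int8 adapter bytes; low-entropy weight data
# compresses well under LZMA preset 9 (the "level: 9" above).
payload = np.zeros(1 << 16, dtype=np.int8).tobytes()
packed = lzma.compress(payload, preset=9)
restored = lzma.decompress(packed)
# round-trips exactly; packed is far smaller than payload
```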
Architecture
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5 in the MLP.
parameters: {"slope":0.5,"squared":true}
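The activation with the parameters above ({"slope":0.5,"squared":true}) is straightforward to state as code; a scalar sketch:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with negative slope 0.5, then squared. Squaring keeps
    the output non-negative while the leak preserves gradient signal
    for negative pre-activations."""
    y = x if x >= 0 else slope * x
    return y * y

# leaky_relu_squared(2.0)  -> 4.0
# leaky_relu_squared(-2.0) -> 1.0   (0.5 * -2 = -1.0, squared)
```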
weight tying
Ties the token embedding and output projection (unembedding) weights.
parameters: null
GQA
Uses grouped-query attention (GQA) with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
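With 8 query heads sharing 4 KV heads, each KV head serves 2 query heads, halving KV storage. A minimal sketch of the head-sharing step (shapes and the repeat-based expansion are a common GQA implementation, assumed here rather than taken from the PR):

```python
import numpy as np

def expand_kv(kv: np.ndarray, n_heads: int = 8, n_kv_heads: int = 4):
    """Expand grouped KV heads so each query head sees its shared KV head.

    kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim),
    repeating each KV head n_heads // n_kv_heads times.
    """
    group = n_heads // n_kv_heads  # query heads per KV head (here 2)
    return np.repeat(kv, group, axis=0)

kv = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
k_full = expand_kv(kv)  # shape (8, 3, 2); adjacent query heads share a KV head
```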
MLP3x
Uses an MLP hidden-dimension multiplier of 3.0.
parameters: {"multiplier":3}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
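The EMA update with decay 0.997 can be sketched in a few lines. This is a scalar-list illustration of the standard EMA rule; the PR's code would operate on parameter tensors.

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    With decay 0.997 the average has an effective horizon of
    roughly 1 / (1 - 0.997) ≈ 333 steps."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0]
for _ in range(10):
    avg = ema_update(avg, [1.0])
# after n steps toward 1.0 from 0.0, avg = 1 - 0.997**n
```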
Sequence Length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
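Sliding-window evaluation with stride 64 scores each token with near-full left context: every forward pass scores only its last `stride` tokens, so each token is evaluated exactly once. A sketch of the window bookkeeping (the generator and tuple layout are assumptions, not the PR's code):

```python
def eval_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (ctx_start, ctx_end, score_start) triples. Each window
    scores only the tokens in [score_start, ctx_end), i.e. at most
    `stride` of them, while conditioning on up to `window` tokens."""
    for score_start in range(0, n_tokens, stride):
        ctx_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, ctx_end - window)
        yield (ctx_start, ctx_end, score_start)
```

This trades roughly `window / stride` times more forward passes for a lower (more honest) BPB than chunked evaluation.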
Other
Incrementally builds an n-gram cache during training, freezes it for evaluation, and blends neural and n-gram probabilities with an entropy-adaptive alpha.
parameters: {"ngram_orders":"2-7","cache_type":"INT16","multi_gpu_sync":"all_reduce"}
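The count-then-freeze-then-blend idea can be sketched for a single n-gram order. This is a minimal illustration under stated assumptions: the class name, the normalized-entropy formula for alpha, and the dict-based probability blend are all hypothetical; the PR uses orders 2–7, an INT16 cache, and all_reduce sync across GPUs, none of which this sketch reproduces.

```python
import math
from collections import defaultdict

class NgramBlender:
    """Count n-grams during training, freeze, then mix n-gram and
    neural next-token distributions. Alpha grows with the neural
    model's entropy: trust counts more when the model is uncertain."""

    def __init__(self, order: int = 2):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))
        self.frozen = False

    def observe(self, tokens):
        if self.frozen:
            return
        for i in range(len(tokens) - self.order + 1):
            ctx = tuple(tokens[i:i + self.order - 1])
            self.counts[ctx][tokens[i + self.order - 1]] += 1

    def freeze(self):
        self.frozen = True  # evaluation sees a fixed cache

    def blend(self, ctx, neural_probs):
        ctx = tuple(ctx[-(self.order - 1):])
        total = sum(self.counts[ctx].values())
        if total == 0:
            return neural_probs  # no counts: fall back to the model
        ngram = {t: c / total for t, c in self.counts[ctx].items()}
        # entropy of the neural distribution, normalized to [0, 1]
        h = -sum(p * math.log(p) for p in neural_probs.values() if p > 0)
        alpha = h / math.log(len(neural_probs)) if len(neural_probs) > 1 else 0.0
        return {t: (1 - alpha) * neural_probs.get(t, 0.0)
                   + alpha * ngram.get(t, 0.0)
                for t in set(neural_probs) | set(ngram)}
```

Because counting is a dictionary increment per token, the training-time overhead is small, matching the "negligible overhead" claim below.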
Novel Contributions
- Seed-regenerated frozen orthogonal random base weights stored as 8-byte seeds instead of full matrices
- Incremental n-gram cache built during training with negligible overhead
- Entropy-adaptive blending of neural and n-gram probabilities
- INT8 quantization of LoRA adapters with small BPB loss
- Orthogonal initialization enabling stable training of deeper random-weight models