PR #1983
Add submission: Int5/Int6 + BigramHash + SmearGate + SWA + LLMAdvisor…
by harborglowvintage-oss

val_bpb: 1.1586
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.72 MB

Training Techniques

Quantization: mixed int5/int6
- bits: null
- scope: MLP weights and attention weights
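
The PR leaves the bits field unfilled, so the exact int5/int6 assignment is not documented. Below is a minimal sketch of symmetric round-to-nearest quantization at a given bit width, with a hypothetical policy of int6 for attention weights and int5 for MLP weights; the per-tensor scale granularity is also an assumption.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale (assumption)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                # int5/int6 codes fit in int8 storage

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Hypothetical mixed policy: int6 for attention weights, int5 for MLP weights.
q_attn, s_attn = quantize_symmetric(torch.randn(512, 512), bits=6)
q_mlp, s_mlp = quantize_symmetric(torch.randn(2048, 512), bits=5)
```
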
Architecture: BigramHash
- Bigram-hash embeddings used in place of standard embeddings.
- parameters: null
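
The PR does not include the embedding code. One common reading of "bigram-hash embeddings" is a lookup keyed by a hash of each (previous token, current token) pair into a fixed-size table; the hash function, table size, and handling of the first position below are all assumptions.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Embed each position via a hash of its (prev, cur) token bigram (sketch)."""

    def __init__(self, table_size: int, dim: int):
        super().__init__()
        self.table_size = table_size
        self.emb = nn.Embedding(table_size, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq). Pair each token with its predecessor.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position (assumption)
        # Cheap multiplicative hash of the bigram into the table.
        h = (prev * 1000003 + tokens) % self.table_size
        return self.emb(h)

emb = BigramHashEmbedding(table_size=1 << 18, dim=256)
x = emb(torch.randint(0, 50257, (2, 128)))  # -> (2, 128, 256)
```
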
Architecture: SmearGate
- Gate mechanism added to the model.
- parameters: null
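
"Gate mechanism added to the model" is all the PR says. One plausible reading of the name is a learned gate that smears a fraction of the previous position's representation into the current one; the per-channel sigmoid form below is purely an assumption.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """y_t = x_t + sigmoid(g) * x_{t-1}: a learned per-channel gate that
    'smears' the previous position into the current one (speculative reading)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0) = 0.5 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0  # nothing to smear into the first position
        return x + torch.sigmoid(self.gate) * prev
```
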
Architecture: weight tying
- Input and output embeddings are tied.
- parameters: null
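
Weight tying is the standard GPT-2-style sharing of the input embedding and output projection matrices. A minimal sketch follows; the module names are hypothetical, and how tying interacts with the BigramHash table is not specified in the PR.

```python
import torch.nn as nn

vocab_size, dim = 50257, 768           # placeholder dimensions
wte = nn.Embedding(vocab_size, dim)    # input embedding
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = wte.weight            # tie: both layers share one (vocab, dim) matrix
```
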
Weight Averaging: SWA
- parameters: null
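
SWA presumably refers to stochastic weight averaging, for which PyTorch ships torch.optim.swa_utils. When averaging starts and how often parameters are folded in are not given in the PR; the schedule below is an assumption.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(10, 1)                     # stand-in for the transformer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)             # keeps a running average of weights

for step in range(100):
    loss = model(torch.randn(8, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= 50:                           # start averaging partway through (assumption)
        swa_model.update_parameters(model)

# Evaluate / export the averaged weights rather than the last iterate.
```
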
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: {"adamw_used_for": "scalars/embeddings"}
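
Muon orthogonalizes momentum updates for 2-D weight matrices and is conventionally paired with AdamW for parameters it does not handle well, which matches other_params here. Muon is not part of torch; the import below stands in for an external implementation (e.g. the reference KellerJordan/Muon repo), its constructor signature may differ, and all hyperparameters are placeholders since the PR lists weight_decay and momentum as null.

```python
import torch
import torch.nn as nn
from muon import Muon  # assumed external implementation; signature may differ

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(50257, 768)
        self.proj = nn.Linear(768, 768)

model = TinyModel()

# Route 2-D weight matrices to Muon; send embeddings, biases, and other
# scalars/vectors to AdamW (per other_params in this submission).
muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and "embed" not in name:
        muon_params.append(p)
    else:
        adamw_params.append(p)

optimizers = [
    Muon(muon_params, lr=0.02, momentum=0.95),                   # placeholder values
    torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.0),  # placeholder values
]
# In the training loop, step both: for opt in optimizers: opt.step()
```
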
Compression: zstd
- level: 22
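
Level 22 is zstd's maximum compression level (the CLI equivalent needs the --ultra flag for levels above 19). A sketch using the python-zstandard bindings; the file names are hypothetical.

```python
import zstandard as zstd  # pip install zstandard

def compress_artifact(src: str, dst: str) -> None:
    """Compress a checkpoint file at zstd's maximum level (22)."""
    cctx = zstd.ZstdCompressor(level=22)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        cctx.copy_stream(fin, fout)

compress_artifact("model.bin", "model.bin.zst")  # hypothetical file names
```
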
LR Schedule: warmdown
- parameters: {"warmdown_iters": 3000}
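
"Warmdown" is read here as holding the learning rate constant and then decaying it linearly to zero over the final warmdown_iters steps, a common speedrun-style schedule. The total iteration count and the linear shape are assumptions; the PR only gives warmdown_iters: 3000.

```python
def warmdown_lr(step: int, total_iters: int, warmdown_iters: int = 3000,
                base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

# e.g. with torch: LambdaLR(optimizer, lambda s: warmdown_lr(s, total_iters=10000))
```
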
Sequence Length
- train_length: 2048
- eval_length: null

Novel Contributions
- Mixed Int5/Int6 quantization
- BigramHash embeddings
- SmearGate
- SWA
- Muon optimizer with AdamW for scalars/embeddings
- zstd-22 artifact compression
- 3-seed ensemble-style reporting
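
The last bullet suggests the headline number is reported across three training seeds. A sketch of that reporting style follows; the per-seed values are placeholders, not numbers from the PR, which only reports the single figure 1.1586.

```python
import statistics

# Placeholder per-seed validation bits-per-byte (not from the PR).
val_bpb_by_seed = [1.1590, 1.1583, 1.1585]

mean = statistics.mean(val_bpb_by_seed)
std = statistics.stdev(val_bpb_by_seed)
print(f"val_bpb over {len(val_bpb_by_seed)} seeds: {mean:.4f} +/- {std:.4f}")
```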