PR #759 (open)
Submission Record Series: BatchOpt+MLP4+RoPE100k and 8L EMA Int6 Bigram65k on Single 20GB GPU (val_bpb 1.7810 → 1.3092)
by markste-in
val_bpb
1.3092
Architecture
Transformer
Optimizer
—
Artifact Size
15.93MB
Training Techniques
Architecture
MLP4
Increased MLP multiplier to 4.
parameters: null
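A minimal sketch of what the multiplier controls: the feed-forward hidden width becomes 4× the model width (the `d_model = 512` below is an illustrative value, not the submission's actual size).

```python
# Sketch of the MLP-width multiplier; d_model = 512 is illustrative only.
def mlp_dims(d_model, multiplier=4):
    """Feed-forward block widths: project up to multiplier*d_model, then back down."""
    return d_model, multiplier * d_model, d_model

d_in, d_hidden, d_out = mlp_dims(512)
```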
RoPE
Used RoPE with a larger base for longer-range positional encoding.
parameters: {"base":100000}
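A sketch of how the base enters RoPE: it sets the geometric decay of the rotary frequencies, so a larger base slows the rotation of the low-frequency dimensions and extends the usable positional range. The `base=100000` matches the submission's parameter; `head_dim = 64` is an assumed example size.

```python
# RoPE inverse frequencies; base=100000 is the submission's parameter,
# head_dim=64 is an assumed example value.
def rope_inv_freq(head_dim, base=100000.0):
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

freqs = rope_inv_freq(64)
# Raising the base above the common 10000 lowers the tail frequencies,
# i.e. slower rotation and longer-range positional encoding.
```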
BigramHash
Scaled the BigramHash vocabulary to 65k entries.
parameters: {"vocab_size":65000}
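The record does not spell out the hash function itself; as a hedged sketch, any deterministic mixing of the two token ids modulo the 65k table size illustrates the idea (the multiplier below is an arbitrary constant, not the submission's actual hash).

```python
# Illustrative bigram hash into a 65000-slot embedding table; the constant
# 1000003 is an arbitrary odd multiplier, not the submission's actual hash.
def bigram_hash(prev_id, cur_id, vocab_size=65000):
    return (prev_id * 1000003 + cur_id) % vocab_size
```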
8-layer architecture
Used an 8-layer model.
parameters: {"layers":8}
Weight Averaging
EMA
parameters: null
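A minimal sketch of EMA weight averaging; the decay value of 0.999 is an assumption, since the record lists no parameters.

```python
# Exponential moving average of model weights; decay=0.999 is an assumed value.
def ema_update(ema_weights, weights, decay=0.999):
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```

Unlike SWA's uniform average over checkpoints, EMA weights recent steps geometrically more heavily.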
Quantization
int6
bits: 6
scope: MLP
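A hedged sketch of symmetric int6 quantization of the MLP weights (codes in [-32, 31]); whether the submission scales per tensor or per channel is not stated, so a single per-tensor scale is assumed here.

```python
# Symmetric int6 quantization: codes in [-32, 31], per-tensor scale (assumed).
def quantize_int6(values):
    amax = max(abs(v) for v in values)
    scale = amax / 31.0 if amax > 0 else 1.0
    codes = [max(-32, min(31, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```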
Evaluation
sliding window eval
parameters: {"stride":64}
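A sketch of stride-64 sliding-window evaluation: each window scores only its last 64 tokens, so every token is evaluated exactly once with as much left context as the window allows. The window size of 256 and the sequence length being a multiple of the stride are assumptions of this sketch; only the stride comes from the record.

```python
# Sliding-window eval spans; stride=64 from the submission, window=256 assumed.
# Assumes n_tokens is a multiple of stride.
def eval_spans(n_tokens, window=256, stride=64):
    """(context_start, score_start, score_end) for each evaluation window."""
    return [(max(0, end - window), end - stride, end)
            for end in range(stride, n_tokens + 1, stride)]
```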
Regularization
magnitude pruning
parameters: {"sparsity":"1%"}
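A sketch of 1% magnitude pruning: zero the weights with the smallest absolute values. The submission's granularity (global vs. per-layer) and tie handling are not stated; ties at the threshold may push the count slightly past 1% in this sketch.

```python
# Zero the smallest `sparsity` fraction of weights by absolute magnitude.
def magnitude_prune(weights, sparsity=0.01):
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```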
LR Schedule
warmdown
parameters: {"warmdown_steps":600}
linear warmdown
parameters: {"warmdown_steps":3000}
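A sketch of the linear warmdown: hold the base learning rate, then decay linearly to zero over the final `warmdown_steps` (3,000 here, per the record; the total step count and base LR are illustrative assumptions).

```python
# Linear warmdown; warmdown_steps=3000 from the submission, the rest assumed.
def lr_at(step, total_steps=10000, base_lr=0.01, warmdown_steps=3000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```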
Other
other
Reduced batch size to increase update count during training.
parameters: {"tokens_per_batch":{"before":196000,"after":98000}}
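The arithmetic behind the trade-off: at a fixed token budget, halving tokens per batch (196,000 → 98,000, per the record) doubles the number of optimizer updates. The token budget below is illustrative only.

```python
# Halving the batch doubles the update count at a fixed token budget.
def num_updates(total_tokens, tokens_per_batch):
    return total_tokens // tokens_per_batch

budget = 1_960_000_000  # illustrative token budget, not the submission's
smaller = num_updates(budget, 98000)
larger = num_updates(budget, 196000)
```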
Novel Contributions
- Batch reduction to increase update count
- MLP multiplier increased to 4
- RoPE base increased to 100k
- 8-layer model with BigramHash vocabulary scaled to 65k
- EMA replacing SWA
- Int6 MLP quantization
- Stride-64 sliding evaluation
- 1% magnitude pruning
- Single 20GB GPU training within 600s wall-clock constraint