PR #759

open

Submission Record Series: BatchOpt+MLP4+RoPE100k and 8L EMA Int6 Bigram65k on Single 20GB GPU (val_bpb 1.7810 → 1.3092)

by markste-in
val_bpb
1.3092
Architecture
Transformer
Optimizer
Artifact Size
15.93MB

Training Techniques

Architecture
MLP4
Increased the MLP hidden-width multiplier to 4.
parameters: null
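The MLP4 change widens the feed-forward hidden layer to 4× the model width. A minimal sketch (the `d_model` value is illustrative; only the multiplier of 4 comes from this submission):

```python
def mlp_dims(d_model, mult=4):
    # Returns (input width, FFN hidden width); the multiplier of 4 is
    # the submission's setting, d_model here is just an example value.
    return d_model, mult * d_model
```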
RoPE
Used RoPE with a larger base for longer-range positional encoding.
parameters: {"base":100000}
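Raising the RoPE base from the usual 10,000 to 100,000 slows the rotation of the higher-dimension frequency pairs, stretching positional wavelengths for longer-range encoding. A sketch of standard RoPE with the submission's base (layout of even/odd pairs is one common convention, not necessarily the submission's exact code):

```python
import numpy as np

def rope_angles(positions, dim, base=100000.0):
    # Inverse frequencies; a larger base -> slower-rotating pairs,
    # i.e. longer positional wavelengths.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    return np.outer(positions, inv_freq)                      # (T, dim/2)

def apply_rope(x, base=100000.0):
    # Rotate each (even, odd) feature pair by a position-dependent angle.
    T, dim = x.shape[-2], x.shape[-1]
    ang = rope_angles(np.arange(T), dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Position 0 is left unchanged and the rotation preserves vector norms, as expected of a pure rotary encoding.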
BigramHash
Scaled up the BigramHash vocabulary size.
parameters: {"vocab_size":65000}
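A BigramHash layer hashes each (previous, current) token pair into a fixed-size embedding table; the submission grows that table to 65,000 slots. A sketch, assuming a simple multiplicative hash (the mixing constant is illustrative, not the submission's):

```python
import numpy as np

BIGRAM_VOCAB = 65000  # hash-table size from this submission

def bigram_ids(tokens, table_size=BIGRAM_VOCAB):
    # Hash each (previous, current) token pair into a table slot.
    # 1000003 is an illustrative mixing prime, not necessarily the
    # submission's hash function.
    t = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], t[:-1]))  # pad position 0 with token 0
    return (prev * 1000003 + t) % table_size
```

Each position then looks up an extra embedding from a (65000, d) table and adds it to the token embedding, giving the model cheap access to local bigram statistics.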
8-layer architecture
Used an 8-layer model.
parameters: {"layers":8}
Weight Averaging
EMA
parameters: null
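EMA here replaces SWA: instead of a uniform average of checkpoints, evaluation uses an exponentially decayed running average of the weights. A minimal sketch with scalar parameters (the decay value is an assumption; the submission does not report it):

```python
class EMA:
    # Exponential moving average of model parameters, updated after each
    # optimizer step; evaluation reads from `shadow` instead of the live
    # weights. decay=0.999 is an assumed value.
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```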
Quantization
int6
bits: 6
scope: MLP
Evaluation
sliding window eval
parameters: {"stride":64}
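Sliding-window evaluation advances the context window by 64 tokens at a time and scores only the newly revealed tokens, so almost every token is predicted with near-full left context. A sketch of the span bookkeeping (the window length of 512 is an assumption; only stride=64 comes from the submission):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    # Each tuple is (context start, end, tokens scored). The first chunk
    # scores everything it sees; later chunks score only the `stride`
    # new tokens at their right edge.
    first = min(window, n_tokens)
    spans = [(0, first, first)]
    scored = first
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        spans.append((end - min(window, end), end, end - scored))
        scored = end
    return spans
```

The cost is roughly window/stride forward passes per token's worth of text, traded for a tighter bpb estimate.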
Regularization
magnitude pruning
parameters: {"sparsity":"1%"}
LR Schedule
warmdown
parameters: {"warmdown_steps":600}
linear warmdown
parameters: {"warmdown_steps":3000}
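A linear warmdown holds the learning rate constant and then decays it linearly to zero over the final steps. A sketch using the 3000-step setting (the constant phase and total step count are assumptions; only `warmdown_steps` comes from the submission):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    # Constant LR, then a linear "warmdown" to zero over the last
    # `warmdown_steps` training steps.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```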
Other
other
Reduced batch size to increase update count during training.
parameters: {"tokens_per_batch":{"before":196000,"after":98000}}
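Under a fixed token budget, halving tokens per batch doubles the number of optimizer updates, which is the point of this change. The arithmetic (the budget itself is illustrative; only the before/after batch sizes come from the submission):

```python
# Hypothetical fixed token budget; only the tokens-per-batch values
# (196000 -> 98000) are from the submission.
TOKEN_BUDGET = 196000 * 1000
before, after = 196000, 98000
updates_before = TOKEN_BUDGET // before
updates_after = TOKEN_BUDGET // after  # twice as many updates
```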

Novel Contributions

  • Batch reduction to increase update count
  • MLP multiplier increased to 4
  • RoPE base increased to 100k
  • 8-layer model with BigramHash vocabulary scaled to 65k
  • EMA replacing SWA
  • Int6 MLP quantization
  • Stride-64 sliding evaluation
  • 1% magnitude pruning
  • Single 20GB GPU training within 600s wall-clock constraint