PR #759

open

Submission Record Series: BatchOpt+MLP4+RoPE100k and 8L EMA Int6 Bigram65k on Single 20GB GPU (val_bpb 1.7810 → 1.3092)

by markste-in
val_bpb
1.3092
Architecture
Transformer
Optimizer
Artifact Size
15.93MB

Training Techniques

Architecture
MLP4
Increased the MLP hidden-width multiplier to 4.
parameters: null
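The MLP4 change widens the feed-forward hidden layer to 4× the model width. A minimal sketch (the `d_model` value is illustrative; only the multiplier of 4 comes from this submission):

```python
def mlp_dims(d_model, mult=4):
    # Returns (input width, FFN hidden width); the multiplier of 4 is
    # the submission's setting, d_model here is just an example value.
    return d_model, mult * d_model
```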
RoPE
Used RoPE with a larger base for longer-range positional encoding.
parameters: {"base":100000}
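Raising the RoPE base from the usual 10,000 to 100,000 slows the rotation of the higher-dimension frequency pairs, stretching positional wavelengths for longer-range encoding. A sketch of standard RoPE with the submission's base (layout of even/odd pairs is one common convention, not necessarily the submission's exact code):

```python
import numpy as np

def rope_angles(positions, dim, base=100000.0):
    # Inverse frequencies; a larger base -> slower-rotating pairs,
    # i.e. longer positional wavelengths.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    return np.outer(positions, inv_freq)                      # (T, dim/2)

def apply_rope(x, base=100000.0):
    # Rotate each (even, odd) feature pair by a position-dependent angle.
    T, dim = x.shape[-2], x.shape[-1]
    ang = rope_angles(np.arange(T), dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Position 0 is left unchanged and the rotation preserves vector norms, as expected of a pure rotary encoding.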
BigramHash
Scaled up the BigramHash vocabulary size.
parameters: {"vocab_size":65000}
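A BigramHash layer hashes each (previous, current) token pair into a fixed-size embedding table; the submission grows that table to 65,000 slots. A sketch, assuming a simple multiplicative hash (the mixing constant is illustrative, not the submission's):

```python
import numpy as np

BIGRAM_VOCAB = 65000  # hash-table size from this submission

def bigram_ids(tokens, table_size=BIGRAM_VOCAB):
    # Hash each (previous, current) token pair into a table slot.
    # 1000003 is an illustrative mixing prime, not necessarily the
    # submission's hash function.
    t = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], t[:-1]))  # pad position 0 with token 0
    return (prev * 1000003 + t) % table_size
```

Each position then looks up an extra embedding from a (65000, d) table and adds it to the token embedding, giving the model cheap access to local bigram statistics.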
8-layer architecture
Used an 8-layer model.
parameters: {"layers":8}
Weight Averaging
EMA
parameters: null
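EMA here replaces SWA: instead of a uniform average of checkpoints, evaluation uses an exponentially decayed running average of the weights. A minimal sketch with scalar parameters (the decay value is an assumption; the submission does not report it):

```python
class EMA:
    # Exponential moving average of model parameters, updated after each
    # optimizer step; evaluation reads from `shadow` instead of the live
    # weights. decay=0.999 is an assumed value.
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```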
Quantization
int6
bits: 6
scope: MLP
Evaluation
sliding window eval
parameters: {"stride":64}
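Sliding-window evaluation advances the context window by 64 tokens at a time and scores only the newly revealed tokens, so almost every token is predicted with near-full left context. A sketch of the span bookkeeping (the window length of 512 is an assumption; only stride=64 comes from the submission):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    # Each tuple is (context start, end, tokens scored). The first chunk
    # scores everything it sees; later chunks score only the `stride`
    # new tokens at their right edge.
    first = min(window, n_tokens)
    spans = [(0, first, first)]
    scored = first
    while scored < n_tokens:
        end = min(scored + stride, n_tokens)
        spans.append((end - min(window, end), end, end - scored))
        scored = end
    return spans
```

The cost is roughly window/stride forward passes per token's worth of text, traded for a tighter bpb estimate.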
Regularization
magnitude pruning
parameters: {"sparsity":"1%"}
LR Schedule
warmdown
parameters: {"warmdown_steps":600}
linear warmdown
parameters: {"warmdown_steps":3000}
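A linear warmdown holds the learning rate constant and then decays it linearly to zero over the final steps. A sketch using the 3000-step setting (the constant phase and total step count are assumptions; only `warmdown_steps` comes from the submission):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    # Constant LR, then a linear "warmdown" to zero over the last
    # `warmdown_steps` training steps.
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```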
Other
other
Reduced batch size to increase update count during training.
parameters: {"tokens_per_batch":{"before":196000,"after":98000}}
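Under a fixed token budget, halving tokens per batch doubles the number of optimizer updates, which is the point of this change. The arithmetic (the budget itself is illustrative; only the before/after batch sizes come from the submission):

```python
# Hypothetical fixed token budget; only the tokens-per-batch values
# (196000 -> 98000) are from the submission.
TOKEN_BUDGET = 196000 * 1000
before, after = 196000, 98000
updates_before = TOKEN_BUDGET // before
updates_after = TOKEN_BUDGET // after  # twice as many updates
```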

Novel Contributions

  • Batch reduction to increase update count
  • MLP multiplier increased to 4
  • RoPE base increased to 100k
  • 8-layer model with BigramHash vocabulary scaled to 65k
  • EMA replacing SWA
  • Int6 MLP quantization
  • Stride-64 sliding evaluation
  • 1% magnitude pruning
  • Single 20GB GPU training within 600s wall-clock constraint