PR #355 (open)
Add non-record BigramHash4096 + MLP992 + LR0.08 + Slide64 submission
by josusanmartin
val_bpb
1.1929
Architecture
Transformer
Optimizer
—
Artifact Size
16,179,102 bytes
Training Techniques
Architecture
BigramHash
Adds a hashed bigram embedding side channel to the model.
parameters: {"buckets":4096,"dim":64}
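The hashed bigram side channel can be sketched as follows: each (previous token, current token) pair is hashed into one of 4096 buckets, and a 64-dim embedding for that bucket is added alongside the transformer's hidden stream. The hash function and the sentinel for position 0 are assumptions; only buckets=4096 and dim=64 come from the PR.

```python
import numpy as np

BUCKETS, DIM = 4096, 64  # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash of the (prev, cur) pair; the PR's
    # actual hash function is not shown, so this is an assumption.
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def bigram_features(tokens: list[int]) -> np.ndarray:
    # Position 0 has no previous token; a sentinel of 0 is assumed.
    feats = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else 0
        feats[i] = bigram_emb[bigram_bucket(prev, tok)]
    return feats

# Per-position 64-dim features usable as a side channel.
feats = bigram_features([5, 17, 17, 942])
print(feats.shape)  # (4, 64)
```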
weight tying
Uses tied input/output embeddings.
parameters: null
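With tied embeddings, the output head reuses the token-embedding matrix, so no separate lm_head weights are stored in the artifact. A minimal sketch, with vocab and d_model chosen for illustration (neither is stated in the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50257, 64  # illustrative values, not from the PR
tok_emb = rng.standard_normal((vocab, d_model)).astype(np.float32)

def logits_tied(hidden: np.ndarray) -> np.ndarray:
    # The output projection is the transpose of the input embedding,
    # halving the embedding parameters that must be shipped.
    return hidden @ tok_emb.T

h = rng.standard_normal((8, d_model)).astype(np.float32)
print(logits_tied(h).shape)  # (8, 50257)
```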
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
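With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads (grouped-query attention). A sketch of the KV expansion step, assuming keys/values are laid out as (num_kv_heads, seq, head_dim):

```python
import numpy as np

NUM_HEADS, NUM_KV_HEADS = 8, 4     # from the PR's parameters
GROUP = NUM_HEADS // NUM_KV_HEADS  # each KV head serves 2 query heads

def expand_kv(kv: np.ndarray) -> np.ndarray:
    # (num_kv_heads, seq, head_dim) -> (num_heads, seq, head_dim)
    # by repeating each KV head for its group of query heads.
    return np.repeat(kv, GROUP, axis=0)

k = np.zeros((NUM_KV_HEADS, 16, 32), dtype=np.float32)
print(expand_kv(k).shape)  # (8, 16, 32)
```

Halving the KV heads halves the K/V projection weights stored in the artifact without changing the number of query heads.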
MLP width reduction
Uses a narrower feed-forward network than the naive baseline.
parameters: {"mlp_hidden":992}
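The parameter savings from the narrower FFN can be worked through as follows. d_model is an assumption (the PR does not state it); mlp_hidden=992 is from the PR, and 4×d_model is the conventional baseline width.

```python
D_MODEL = 768                   # assumption; not stated in the PR
MLP_HIDDEN = 992                # from the PR's parameters
BASELINE_HIDDEN = 4 * D_MODEL   # conventional 4x width

def ffn_param_count(d_model: int, hidden: int) -> int:
    # Up-projection (d_model x hidden) plus down-projection
    # (hidden x d_model); biases ignored.
    return 2 * d_model * hidden

saved = ffn_param_count(D_MODEL, BASELINE_HIDDEN) - ffn_param_count(D_MODEL, MLP_HIDDEN)
print(saved)  # parameters removed per block under these assumptions
```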
Evaluation
sliding-window eval
Evaluates with overlapping context windows that advance by a fixed stride, so each token is scored with extra left context.
parameters: {"stride":64}
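The stride-64 evaluation can be sketched as follows: the model sees a full context window, but only the last `stride` positions of each window contribute to val_bpb, so every token is scored exactly once. The window size here is an assumption; only stride=64 comes from the PR.

```python
def sliding_window_positions(n_tokens: int, window: int, stride: int):
    # Yields (window_start, score_start, score_end): the model reads
    # tokens[window_start:score_end], but only positions in
    # [score_start, score_end) are scored, giving each token up to
    # `window` tokens of left context.
    pos = 0
    while pos < n_tokens:
        score_end = min(pos + stride, n_tokens)
        window_start = max(0, score_end - window)
        yield window_start, pos, score_end
        pos = score_end

# window=128 is illustrative; the PR only specifies stride=64.
spans = list(sliding_window_positions(200, window=128, stride=64))
```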
Quantization
int8
bits: 8
scope: model weights
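A minimal sketch of int8 weight quantization: each weight tensor is stored as int8 values plus a floating-point scale. Symmetric per-tensor scaling is an assumption; the PR only states bits=8 over model weights.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor int8: one fp scale, values in [-127, 127].
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)  # round-trip error is at most scale / 2
```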
fp16
bits: 16
scope: tok_emb.weight
Compression
zlib
level: null
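The final artifact is zlib-compressed. Since the PR leaves the level unspecified (`level: null`), the sketch below assumes Python's default compression level; the payload is a stand-in for the serialized quantized weights.

```python
import zlib

# Stand-in for the serialized artifact bytes (assumption).
payload = bytes(range(256)) * 64

packed = zlib.compress(payload)      # default compression level
restored = zlib.decompress(packed)
assert restored == payload           # lossless round trip
```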
Novel Contributions
- CUDA variant of the baseline trainer for an 8xH100 run
- BigramHash(4096,64) side channel
- MLP_HIDDEN=992 narrower FFN
- MATRIX_LR=0.08 higher matrix learning rate
- Sliding-window evaluation with stride 64
- fp16 tied-embedding export
- Non-record submission targeting track_non_record_16mb because the artifact size is over the cap