PR #355 (open)
Add non-record BigramHash4096 + MLP992 + LR0.08 + Slide64 submission
by josusanmartin
val_bpb
1.1929
Architecture
Transformer
Optimizer
—
Artifact Size
16,179,102 bytes
Training Techniques
Architecture
BigramHash
Adds a hashed bigram embedding side channel to the model.
parameters: {"buckets":4096,"dim":64}
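The hashed bigram side channel can be sketched as follows: each (previous token, current token) pair is hashed into one of 4096 buckets, and a 64-dim embedding for that bucket is added alongside the transformer's hidden stream. The hash function and the sentinel for position 0 are assumptions; only buckets=4096 and dim=64 come from the PR.

```python
import numpy as np

BUCKETS, DIM = 4096, 64  # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash of the (prev, cur) pair; the PR's
    # actual hash function is not shown, so this is an assumption.
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def bigram_features(tokens: list[int]) -> np.ndarray:
    # Position 0 has no previous token; a sentinel of 0 is assumed.
    feats = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else 0
        feats[i] = bigram_emb[bigram_bucket(prev, tok)]
    return feats

# Per-position 64-dim features usable as a side channel.
feats = bigram_features([5, 17, 17, 942])
print(feats.shape)  # (4, 64)
```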
weight tying
Uses tied input/output embeddings.
parameters: null
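With tied embeddings, the output head reuses the token-embedding matrix, so no separate lm_head weights are stored in the artifact. A minimal sketch, with vocab and d_model chosen for illustration (neither is stated in the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50257, 64  # illustrative values, not from the PR
tok_emb = rng.standard_normal((vocab, d_model)).astype(np.float32)

def logits_tied(hidden: np.ndarray) -> np.ndarray:
    # The output projection is the transpose of the input embedding,
    # halving the embedding parameters that must be shipped.
    return hidden @ tok_emb.T

h = rng.standard_normal((8, d_model)).astype(np.float32)
print(logits_tied(h).shape)  # (8, 50257)
```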
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
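With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads (grouped-query attention). A sketch of the KV expansion step, assuming keys/values are laid out as (num_kv_heads, seq, head_dim):

```python
import numpy as np

NUM_HEADS, NUM_KV_HEADS = 8, 4     # from the PR's parameters
GROUP = NUM_HEADS // NUM_KV_HEADS  # each KV head serves 2 query heads

def expand_kv(kv: np.ndarray) -> np.ndarray:
    # (num_kv_heads, seq, head_dim) -> (num_heads, seq, head_dim)
    # by repeating each KV head for its group of query heads.
    return np.repeat(kv, GROUP, axis=0)

k = np.zeros((NUM_KV_HEADS, 16, 32), dtype=np.float32)
print(expand_kv(k).shape)  # (8, 16, 32)
```

Halving the KV heads halves the K/V projection weights stored in the artifact without changing the number of query heads.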
MLP width reduction
Uses a narrower feed-forward network than the naive baseline.
parameters: {"mlp_hidden":992}
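The parameter savings from the narrower FFN can be worked through as follows. d_model is an assumption (the PR does not state it); mlp_hidden=992 is from the PR, and 4×d_model is the conventional baseline width.

```python
D_MODEL = 768                   # assumption; not stated in the PR
MLP_HIDDEN = 992                # from the PR's parameters
BASELINE_HIDDEN = 4 * D_MODEL   # conventional 4x width

def ffn_param_count(d_model: int, hidden: int) -> int:
    # Up-projection (d_model x hidden) plus down-projection
    # (hidden x d_model); biases ignored.
    return 2 * d_model * hidden

saved = ffn_param_count(D_MODEL, BASELINE_HIDDEN) - ffn_param_count(D_MODEL, MLP_HIDDEN)
print(saved)  # parameters removed per block under these assumptions
```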
Evaluation
sliding-window eval
Evaluates with overlapping context windows that advance by a fixed stride, so each token is scored with extra left context.
parameters: {"stride":64}
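The stride-64 evaluation can be sketched as follows: the model sees a full context window, but only the last `stride` positions of each window contribute to val_bpb, so every token is scored exactly once. The window size here is an assumption; only stride=64 comes from the PR.

```python
def sliding_window_positions(n_tokens: int, window: int, stride: int):
    # Yields (window_start, score_start, score_end): the model reads
    # tokens[window_start:score_end], but only positions in
    # [score_start, score_end) are scored, giving each token up to
    # `window` tokens of left context.
    pos = 0
    while pos < n_tokens:
        score_end = min(pos + stride, n_tokens)
        window_start = max(0, score_end - window)
        yield window_start, pos, score_end
        pos = score_end

# window=128 is illustrative; the PR only specifies stride=64.
spans = list(sliding_window_positions(200, window=128, stride=64))
```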
Quantization
int8
bits: 8
scope: model weights
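A minimal sketch of int8 weight quantization: each weight tensor is stored as int8 values plus a floating-point scale. Symmetric per-tensor scaling is an assumption; the PR only states bits=8 over model weights.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor int8: one fp scale, values in [-127, 127].
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)  # round-trip error is at most scale / 2
```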
fp16
bits: 16
scope: tok_emb.weight
Compression
zlib
level: null
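The final artifact is zlib-compressed. Since the PR leaves the level unspecified (`level: null`), the sketch below assumes Python's default compression level; the payload is a stand-in for the serialized quantized weights.

```python
import zlib

# Stand-in for the serialized artifact bytes (assumption).
payload = bytes(range(256)) * 64

packed = zlib.compress(payload)      # default compression level
restored = zlib.decompress(packed)
assert restored == payload           # lossless round trip
```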
Novel Contributions
- CUDA variant of the baseline trainer for an 8xH100 run
- BigramHash(4096,64) side channel
- MLP_HIDDEN=992 narrower FFN
- MATRIX_LR=0.08 higher matrix learning rate
- Sliding-window evaluation with stride 64
- fp16 tied-embedding export
- Non-record submission targeting track_non_record_16mb because the artifact size is over the cap