PR #1762
Non-record: Mac mini M4 16GB, no H100s, still golfing (val_bpb=1.5200)
by frido22
val_bpb: 1.5200
Architecture: Transformer
Optimizer: —
Artifact Size: 15,749,267 bytes
Training Techniques
Architecture
weight tying: Tied embeddings with learned logit bias and logit gain.
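A minimal sketch of the tied-head idea, assuming the MLX nn API; the class and attribute names (TiedHead, logit_gain, logit_bias) are illustrative, not the PR's actual code:

```python
import mlx.core as mx
import mlx.nn as nn

class TiedHead(nn.Module):
    """Output head that reuses the embedding matrix, plus a learned
    affine correction (scalar gain, per-token bias) on the logits."""

    def __init__(self, vocab_size: int, dims: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dims)
        self.logit_gain = mx.ones((1,))            # learned scalar gain
        self.logit_bias = mx.zeros((vocab_size,))  # learned per-token bias

    def __call__(self, x: mx.array) -> mx.array:
        # Weight tying: the embedding table doubles as the output projection.
        logits = self.embed.as_linear(x)
        return self.logit_gain * logits + self.logit_bias
```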
BigramHash (rank: 64): Rank-64 previous-token bigram adapter in the output path.
depth recurrence (blocks: 2): Two recurrent decoder-tail blocks with learned residual gates.
Weight Averaging
EMA (scope: tail tensors)
Quantization
int8 (bits: 8, scope: model weights)
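A sketch of symmetric per-row int8 quantization in numpy (one scale per output row, so small-magnitude rows keep precision); the helper names are illustrative:

```python
import numpy as np

def quantize_int8_per_row(w: np.ndarray):
    # w: (rows, cols). Each row's max |value| maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The axis choice here is exactly what the transpose-aware item under Other is about: if mlp.fc.weight is stored transposed relative to the matmul, the per-row scales must be taken along the other axis.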
Compression
zlib (level: not specified)
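The packaging step is then a plain zlib roundtrip over the serialized tensors; since the record leaves the level unspecified, the sketch below uses zlib's default:

```python
import io
import zlib
import numpy as np

def pack_artifact(tensors: dict) -> bytes:
    buf = io.BytesIO()
    np.savez(buf, **tensors)              # uncompressed .npz container
    return zlib.compress(buf.getvalue())  # default compression level

def unpack_artifact(blob: bytes) -> dict:
    data = np.load(io.BytesIO(zlib.decompress(blob)))
    return {name: data[name] for name in data.files}
```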
Sequence Length
train_length: 1024, eval_length: not specified
Other
Quant-aware endgame with periodic roundtrip blending near wallclock stop.
Transpose-aware handling for int8 per-row quantization of mlp.fc.weight.
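A hedged sketch of how both items could combine: near the wallclock stop, periodically round-trip the weights through int8 and blend the float masters toward the dequantized values, choosing the quantization axis to match the stored layout of mlp.fc.weight (the transpose-aware part). `alpha` and the axis convention are assumptions:

```python
import numpy as np

def roundtrip_blend(w: np.ndarray, alpha: float = 0.5, axis: int = 1) -> np.ndarray:
    # Transpose-aware: pick `axis` so the scales run along the matmul's
    # output rows even if the tensor is stored transposed.
    scale = np.abs(w).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    w_rt = np.clip(np.round(w / scale), -127, 127) * scale  # int8 roundtrip
    # alpha=1.0 snaps fully onto the int8 grid; smaller values blend gently.
    return (1.0 - alpha) * w + alpha * w_rt
```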
Novel Contributions
- Improved the previous Mac mini non-record result from 1.56720003 to 1.51996743 BPB.
- Added packaging-safe optional mlx imports so the records-folder script passes the CPU smoke-import check without mlx installed (see the import-guard sketch after this list).
- Used a compact SP1024 9x512 KV4 model family for Apple Silicon / MLX, with recurrent tail blocks and a rank-64 previous-token bigram adapter.
- Reallocated the floating-point parameter budget toward recurrent-tail attention geometry instead of a broader fp16 tail rescue.
- Applied quant-aware endgame blending and EMA to sensitive tail tensors.
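The packaging-safe optional import mentioned above can be as simple as the sketch below; `HAVE_MLX` and `require_mlx` are illustrative names:

```python
# Import mlx only if available, so the module still imports cleanly in a
# CPU-only environment (e.g. a smoke-import check in CI).
try:
    import mlx.core as mx
    import mlx.nn as nn
    HAVE_MLX = True
except ImportError:
    mx = nn = None
    HAVE_MLX = False

def require_mlx() -> None:
    # Fail only when an MLX-dependent code path is actually exercised.
    if not HAVE_MLX:
        raise RuntimeError("This code path requires mlx; install it to proceed.")
```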