PR #1762

open

Non-record: Mac mini M4 16GB, no H100s, still golfing (val_bpb=1.5200)

by frido22
val_bpb: 1.5200
Architecture: Transformer
Optimizer:
Artifact Size: 15,749,267 bytes

Training Techniques

Architecture
weight tying: Tied embeddings with learned logit bias and logit gain.
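The tied head can be sketched as below; the class name, attribute names, sizes, and initialization are illustrative assumptions, not taken from the PR:

```python
import numpy as np

class TiedHead:
    """Minimal sketch: one matrix serves as both input embedding and
    output projection, plus a learned per-token logit bias and a
    learned scalar logit gain."""
    def __init__(self, vocab_size, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.embedding = rng.normal(0.0, 0.02, (vocab_size, d_model))
        self.logit_bias = np.zeros(vocab_size)  # learned, one per vocab entry
        self.logit_gain = 1.0                   # learned scalar

    def embed(self, token_ids):
        return self.embedding[token_ids]

    def logits(self, hidden):
        # Tied projection: hidden @ E^T, scaled and shifted by learned terms.
        return self.logit_gain * (hidden @ self.embedding.T) + self.logit_bias

head = TiedHead(vocab_size=256, d_model=32)
h = head.embed(np.array([1, 2, 3]))
out = head.logits(h)  # shape (3, 256)
```

Tying removes a whole vocab-by-d_model output matrix from the artifact, which matters directly under a byte-counted size budget; the bias and gain restore some of the flexibility lost by sharing the matrix.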
BigramHash: Rank-64 previous-token bigram adapter in the output path. (rank: 64)
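A sketch of the low-rank bigram adapter: a full vocab-by-vocab bigram logit table factored into rank-64 pieces, added to the model's logits conditioned on the previous token. The vocab size and function name are illustrative, and the hashing implied by the name "BigramHash" is not modeled here:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, rank = 256, 64   # rank 64 as in the PR; vocab size is illustrative

# Low-rank factorization of a (vocab x vocab) bigram logit table:
# U[prev] picks a rank-64 code for the previous token,
# V maps that code to a correction over next-token logits.
U = rng.normal(0.0, 0.02, (vocab, rank))
V = rng.normal(0.0, 0.02, (rank, vocab))

def bigram_adjust(logits, prev_tokens):
    # Add a previous-token-conditioned correction to the model's logits.
    return logits + U[prev_tokens] @ V

logits = np.zeros((4, vocab))
prev = np.array([5, 9, 9, 200])
adjusted = bigram_adjust(logits, prev)  # shape (4, 256)
```

The adapter costs 2 * vocab * rank parameters instead of vocab**2, so it captures cheap bigram statistics without blowing the artifact-size budget.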
depth recurrence: Two recurrent decoder-tail blocks with learned residual gates. (blocks: 2)
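Depth recurrence can be sketched as reapplying tail blocks with a gated residual; here a single shared stand-in block, zero-initialized gates, and the loop structure are all assumptions for illustration:

```python
import numpy as np

def tail_block(x, W):
    # Stand-in for a decoder block; a real one would be attention + MLP.
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
d = 16
W = rng.normal(0.0, 0.1, (d, d))
gates = np.zeros(2)          # one learned residual gate per recurrent pass
x = rng.normal(0.0, 1.0, (3, d))
x0 = x.copy()

# Depth recurrence: reuse tail weights for extra passes, each gated into
# the residual stream. With gates initialized to 0 the extra passes start
# as the identity and are learned into usefulness.
for step in range(2):        # two recurrent passes, as in the PR
    x = x + gates[step] * tail_block(x, W)
```

Reusing tail weights buys effective depth without new parameters, again trading compute for artifact size.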
Weight Averaging
EMA (scope: tail tensors)
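The EMA over tail tensors amounts to keeping a shadow copy updated each step; the decay value and parameter names below are illustrative, not from the PR:

```python
import numpy as np

def ema_update(shadow, params, decay=0.999):
    # Exponential moving average over a subset of tensors
    # (the PR scopes this to "tail tensors"; decay is illustrative).
    for name, p in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * p
    return shadow

params = {"tail.w": np.ones(4)}
shadow = {"tail.w": np.zeros(4)}
for _ in range(3):
    shadow = ema_update(shadow, params, decay=0.5)
# after 3 steps with decay 0.5: 0.5 -> 0.75 -> 0.875
```

At evaluation time the shadow tensors replace the live ones, smoothing out late-training noise on exactly the tensors the run identifies as sensitive.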
Quantization
int8 (bits: 8, scope: model weights)
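A minimal sketch of symmetric per-row int8 quantization, the scheme the later "Other" entries reference; function names and the epsilon guard are assumptions:

```python
import numpy as np

def quantize_per_row_int8(W):
    # Symmetric per-row int8: each row gets its own scale so that the
    # row's max magnitude maps to 127.
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)          # guard all-zero rows
    q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

W = np.random.default_rng(0).normal(0.0, 0.1, (8, 16)).astype(np.float32)
q, s = quantize_per_row_int8(W)
W_hat = dequantize(q, s)
```

Per-row scales keep the roundtrip error bounded by half a quantization step per row, which is much tighter than a single per-tensor scale when row magnitudes vary.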
Compression
zlib (level: null)
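The artifact size is counted after zlib compression, so weights that quantize to repetitive int8 patterns shrink further on disk. A minimal roundtrip, with an illustrative all-zeros payload:

```python
import zlib
import numpy as np

weights = np.zeros(10000, dtype=np.int8)       # illustrative payload
raw = weights.tobytes()
packed = zlib.compress(raw)                    # level left at library default
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```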
Sequence Length
sequence_length (train_length: 1024, eval_length: null)
Other
Quant-aware endgame with periodic roundtrip blending near wallclock stop.
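One way to read "roundtrip blending": periodically pull the float weights toward their own int8 quantize-dequantize roundtrip, so the final quantization step loses less. The blend factor, schedule, and per-tensor scaling below are assumptions, not the PR's actual settings:

```python
import numpy as np

def int8_roundtrip(W):
    # Per-tensor symmetric int8 quantize -> dequantize.
    scale = max(np.abs(W).max() / 127.0, 1e-12)
    return np.clip(np.round(W / scale), -127, 127) * scale

def blend_toward_quantized(W, alpha=0.5):
    # Move weights partway toward their quantized image; applied
    # periodically near the wallclock stop. `alpha` is illustrative.
    return (1.0 - alpha) * W + alpha * int8_roundtrip(W)

W = np.random.default_rng(0).normal(0.0, 0.1, (4, 8))
W_blend = blend_toward_quantized(W, alpha=0.5)
```

Each blend halves the distance to the quantized point while leaving the weights free to keep training, which is the quant-aware part of the endgame.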
Transpose-aware handling for int8 per-row quantization of mlp.fc.weight.
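The transpose issue: per-row scales must follow the output-channel axis, so if a weight like mlp.fc.weight is stored transposed (in_features x out_features), the reduction axis has to flip with it. A sketch, with illustrative shapes and function name:

```python
import numpy as np

def quantize_rows(W, axis=1):
    # Symmetric int8 with one scale per output channel. For a tensor
    # stored transposed, reduce along axis=0 instead so the scales
    # still follow output channels.
    scales = np.maximum(np.abs(W).max(axis=axis, keepdims=True) / 127.0, 1e-12)
    q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (32, 8))        # (out_features, in_features)
q_a, s_a = quantize_rows(W, axis=1)      # one scale per output row
q_b, s_b = quantize_rows(W.T, axis=0)    # same weight stored transposed
```

Both orientations produce the same quantized values, which is the property the "transpose-aware" handling preserves.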

Novel Contributions

  • Improved the previous Mac mini non-record result from 1.56720003 to 1.51996743 BPB.
  • Added packaging-safe optional mlx imports so the records-folder script passes CPU smoke import without mlx installed.
  • Used a compact SP1024 9x512 KV4 configuration from the Apple Silicon / MLX family, with recurrent tail blocks and a rank-64 previous-token bigram adapter.
  • Reallocated float budget toward recurrent-tail attention geometry instead of a broader fp16 tail rescue.
  • Applied quant-aware endgame blending and EMA on sensitive tail tensors.
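The packaging-safe optional mlx import mentioned above is presumably a guarded-import pattern along these lines (names like HAVE_MLX and require_mlx are illustrative, not from the PR):

```python
# Optional dependency guard: the module still imports on CPU-only
# machines without mlx installed, so smoke-import checks pass.
try:
    import mlx.core as mx  # Apple-Silicon-only dependency
    HAVE_MLX = True
except ImportError:
    mx = None
    HAVE_MLX = False

def require_mlx():
    # Fail loudly only when an mlx-backed code path is actually used.
    if not HAVE_MLX:
        raise RuntimeError("mlx is required for this code path")
    return mx
```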