PR #1762
Non-record: Mac mini M4 16GB, no H100s, still golfing (val_bpb=1.5200)
by frido22
val_bpb: 1.5200
Architecture: Transformer
Optimizer: —
Artifact Size: 15,749,267 bytes
Training Techniques
Architecture
weight tying: Tied embeddings with learned logit bias and logit gain.
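A minimal sketch of the tied-head idea, assuming the MLX nn API; the class and attribute names (TiedHead, logit_gain, logit_bias) are illustrative, not the PR's actual code:

```python
import mlx.core as mx
import mlx.nn as nn

class TiedHead(nn.Module):
    """Output head that reuses the embedding matrix, plus a learned
    affine correction (scalar gain, per-token bias) on the logits."""

    def __init__(self, vocab_size: int, dims: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dims)
        self.logit_gain = mx.ones((1,))            # learned scalar gain
        self.logit_bias = mx.zeros((vocab_size,))  # learned per-token bias

    def __call__(self, x: mx.array) -> mx.array:
        # Weight tying: the embedding table doubles as the output projection.
        logits = self.embed.as_linear(x)
        return self.logit_gain * logits + self.logit_bias
```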
BigramHash (rank: 64): Rank-64 previous-token bigram adapter in the output path.
depth recurrence (blocks: 2): Two recurrent decoder-tail blocks with learned residual gates.
Weight Averaging
EMA (scope: tail tensors)
Quantization
int8 (bits: 8, scope: model weights)
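A sketch of symmetric per-row int8 quantization in numpy (one scale per output row, so small-magnitude rows keep precision); the helper names are illustrative:

```python
import numpy as np

def quantize_int8_per_row(w: np.ndarray):
    # w: (rows, cols). Each row's max |value| maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The axis choice here is exactly what the transpose-aware item under Other is about: if mlp.fc.weight is stored transposed relative to the matmul, the per-row scales must be taken along the other axis.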
Compression
zlib (level: not specified)
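The packaging step is then a plain zlib roundtrip over the serialized tensors; since the record leaves the level unspecified, the sketch below uses zlib's default:

```python
import io
import zlib
import numpy as np

def pack_artifact(tensors: dict) -> bytes:
    buf = io.BytesIO()
    np.savez(buf, **tensors)              # uncompressed .npz container
    return zlib.compress(buf.getvalue())  # default compression level

def unpack_artifact(blob: bytes) -> dict:
    data = np.load(io.BytesIO(zlib.decompress(blob)))
    return {name: data[name] for name in data.files}
```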
Sequence Length
train_length: 1024, eval_length: not specified
Other
Quant-aware endgame with periodic roundtrip blending near wallclock stop.
Transpose-aware handling for int8 per-row quantization of mlp.fc.weight.
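A hedged sketch of how both items could combine: near the wallclock stop, periodically round-trip the weights through int8 and blend the float masters toward the dequantized values, choosing the quantization axis to match the stored layout of mlp.fc.weight (the transpose-aware part). `alpha` and the axis convention are assumptions:

```python
import numpy as np

def roundtrip_blend(w: np.ndarray, alpha: float = 0.5, axis: int = 1) -> np.ndarray:
    # Transpose-aware: pick `axis` so the scales run along the matmul's
    # output rows even if the tensor is stored transposed.
    scale = np.abs(w).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    w_rt = np.clip(np.round(w / scale), -127, 127) * scale  # int8 roundtrip
    # alpha=1.0 snaps fully onto the int8 grid; smaller values blend gently.
    return (1.0 - alpha) * w + alpha * w_rt
```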
Novel Contributions
- Improved the previous Mac mini non-record result from 1.56720003 to 1.51996743 BPB.
- Added packaging-safe optional mlx imports so the records-folder script passes the CPU smoke-import check without mlx installed (see the import-guard sketch after this list).
- Used a compact SP1024 9x512 KV4 model family for Apple Silicon / MLX, with recurrent tail blocks and a rank-64 previous-token bigram adapter.
- Reallocated the floating-point parameter budget toward recurrent-tail attention geometry instead of a broader fp16 tail rescue.
- Applied quant-aware endgame blending and EMA to sensitive tail tensors.
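The packaging-safe optional import mentioned above can be as simple as the sketch below; `HAVE_MLX` and `require_mlx` are illustrative names:

```python
# Import mlx only if available, so the module still imports cleanly in a
# CPU-only environment (e.g. a smoke-import check in CI).
try:
    import mlx.core as mx
    import mlx.nn as nn
    HAVE_MLX = True
except ImportError:
    mx = nn = None
    HAVE_MLX = False

def require_mlx() -> None:
    # Fail only when an MLX-dependent code path is actually exercised.
    if not HAVE_MLX:
        raise RuntimeError("This code path requires mlx; install it to proceed.")
```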