PR #2044

open

Non-record 10min/16MB: Future-State Encoder Planner (1xH100, val_bpb 1.39784)

by FF-GardenFnView on GitHub
val_bpb
1.3978
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,003,692 bytes

Training Techniques

Architecture
GQA
Grouped query attention in a recurrent transformer schedule.
parameters: null
depth recurrence
Depth-recurrent schedule: the transformer blocks are applied repeatedly over segments.
parameters: null
SmearGate
Smear gate used as part of the attention/activation routing.
parameters: null
BigramHash
Bigram-prior signals used to guide the model.
parameters: null
TrigramHash
Trigram CP-prior builder used for auxiliary prior signals.
parameters: null
weight tying
Not explicitly stated in the submission text.
parameters: null
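The GQA entry above can be sketched minimally: several query heads share one key/value head, shrinking the KV projection. The head counts, weight shapes, and causal masking here are illustrative assumptions; the submission does not state its configuration or how GQA interacts with the recurrent schedule.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    # x: (T, D). Hypothetical single-sequence GQA: n_q_heads query heads
    # share n_kv_heads key/value heads (n_q_heads % n_kv_heads == 0).
    T, D = x.shape
    hd = D // n_q_heads
    q = (x @ wq).reshape(T, n_q_heads, hd).transpose(1, 0, 2)   # (Hq, T, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)  # (Hkv, T, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # each KV head serves a group of Q heads
    v = np.repeat(v, group, axis=0)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), 1)  # causal mask
    att = softmax(np.where(mask, -1e30, att), axis=-1)
    out = att @ v                                    # (Hq, T, hd)
    return out.transpose(1, 0, 2).reshape(T, D)
```

The KV projections `wk`/`wv` are a factor `n_q_heads / n_kv_heads` smaller than `wq`, which is the parameter saving GQA buys under a tight artifact budget.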
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"geometry":"turbo4_aol","parallel":"banked parallel","manual_all_reduce_non_bank_params":true}
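For context on the Muon entry: Muon's core step orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration before applying it. The sketch below uses the coefficients from the public Muon implementation; the submission's "turbo4_aol" geometry, banked-parallel layout, and manual all-reduce are not reproduced here and are beyond this sketch.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Quintic Newton-Schulz iteration (coefficients from the public Muon
    # implementation) that pushes the singular values of g toward 1,
    # approximating the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm >= spectral norm
    transpose = x.shape[0] > x.shape[1]
    if transpose:                        # iterate on the short side
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x
```

In Muon the orthogonalized matrix, not the raw gradient, is fed into the momentum update; a banked-parallel variant would shard these per-matrix iterations across workers.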
Quantization
int4
bits: 4
scope: exported model
Compression
zlib
level: null
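The int4 + zlib export pipeline can be sketched with a symmetric per-tensor scheme: round weights to 4-bit integers, pack two nibbles per byte, then zlib-compress. The scale choice, packing order, and helper names below are assumptions for illustration; the submission does not specify its quantization or serialization format (nor the zlib level).

```python
import zlib

def export_int4_zlib(weights, level=9):
    # Hypothetical export: symmetric int4 quantization (range [-8, 7]),
    # two 4-bit values packed per byte, then zlib compression.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:
        q.append(0)  # pad to an even number of nibbles
    packed = bytes(((a & 0xF) << 4) | (b & 0xF) for a, b in zip(q[::2], q[1::2]))
    return scale, zlib.compress(packed, level)

def import_int4_zlib(scale, blob, n):
    # Inverse: decompress, split nibbles, sign-extend 4-bit values, rescale.
    q = []
    for byte in zlib.decompress(blob):
        for nib in ((byte >> 4) & 0xF, byte & 0xF):
            q.append(nib - 16 if nib >= 8 else nib)
    return [v * scale for v in q[:n]]
```

Packing alone gives 8x over fp32; zlib then exploits the skewed nibble distribution, which is how a ~15 MB artifact can carry more effective parameters than its byte count suggests.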
Other
other
Explicit future-state encoder target: the model is trained to encode a forecast of the next-segment hidden state into a dedicated channel.
parameters: null
other
Latent planner conditioning.
parameters: null
other
Prefix-hybrid macro routing.
parameters: null
other
Future-embed MTP head.
parameters: null
other
Meta-preconditioner on local transforms.
parameters: null
other
Int4 quantization plus zlib compression so the exported model fits under the 16MB artifact cap.
parameters: null
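The headline technique, the explicit future-state encoder target, amounts to an auxiliary regression: a dedicated output channel of the current segment is trained to forecast the next segment's hidden state. The summary statistic (mean over positions), the squared-error loss, and the tensor layout below are assumptions; the submission only states that the forecast is encoded into a dedicated channel.

```python
import numpy as np

def future_state_loss(hidden, plan_channel):
    # hidden:       (n_segments, T, D) per-segment hidden states
    # plan_channel: (n_segments, D)    forecast written by the model
    # Hypothetical target: the (detached) mean hidden state of the
    # *next* segment; the last segment has no target and is dropped.
    target = hidden[1:].mean(axis=1)   # next-segment summary (stop-gradient)
    pred = plan_channel[:-1]           # current segment's forecast
    return float(((pred - target) ** 2).mean())
```

In training this term would be added to the language-modeling loss with some weight, giving the recurrent schedule a planning signal that ordinary next-token prediction does not provide.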

Novel Contributions

  • Explicit future-state encoder target
  • Latent planner conditioning
  • Prefix-hybrid macro routing
  • Future-embed MTP head
  • Recurrent GQA transformer schedule
  • Banked parallel Muon optimizer with turbo4_aol geometry
  • Int4+zlib export under the 16MB cap