PR #2044

open

Non-record 10min/16MB: Future-State Encoder Planner (1xH100, val_bpb 1.39784)

by FF-GardenFnView on GitHub
val_bpb
1.3978
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,003,692 bytes

Training Techniques

Architecture
GQA
Grouped query attention in a recurrent transformer schedule.
parameters: null
depth recurrence
Depth-recurrent schedule: the transformer blocks are applied repeatedly over segments.
parameters: null
SmearGate
Smear gate used as part of the attention/activation routing.
parameters: null
BigramHash
Bigram-prior signals used to guide the model.
parameters: null
TrigramHash
Trigram CP-prior builder used for auxiliary prior signals.
parameters: null
weight tying
Not explicitly stated in the submission text.
parameters: null
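The GQA entry above can be sketched minimally: several query heads share one key/value head, shrinking the KV projection. The head counts, weight shapes, and causal masking here are illustrative assumptions; the submission does not state its configuration or how GQA interacts with the recurrent schedule.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    # x: (T, D). Hypothetical single-sequence GQA: n_q_heads query heads
    # share n_kv_heads key/value heads (n_q_heads % n_kv_heads == 0).
    T, D = x.shape
    hd = D // n_q_heads
    q = (x @ wq).reshape(T, n_q_heads, hd).transpose(1, 0, 2)   # (Hq, T, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)  # (Hkv, T, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd).transpose(1, 0, 2)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)  # each KV head serves a group of Q heads
    v = np.repeat(v, group, axis=0)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), 1)  # causal mask
    att = softmax(np.where(mask, -1e30, att), axis=-1)
    out = att @ v                                    # (Hq, T, hd)
    return out.transpose(1, 0, 2).reshape(T, D)
```

The KV projections `wk`/`wv` are a factor `n_q_heads / n_kv_heads` smaller than `wq`, which is the parameter saving GQA buys under a tight artifact budget.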
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"geometry":"turbo4_aol","parallel":"banked parallel","manual_all_reduce_non_bank_params":true}
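For context on the Muon entry: Muon's core step orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration before applying it. The sketch below uses the coefficients from the public Muon implementation; the submission's "turbo4_aol" geometry, banked-parallel layout, and manual all-reduce are not reproduced here and are beyond this sketch.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Quintic Newton-Schulz iteration (coefficients from the public Muon
    # implementation) that pushes the singular values of g toward 1,
    # approximating the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm >= spectral norm
    transpose = x.shape[0] > x.shape[1]
    if transpose:                        # iterate on the short side
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x
```

In Muon the orthogonalized matrix, not the raw gradient, is fed into the momentum update; a banked-parallel variant would shard these per-matrix iterations across workers.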
Quantization
int4
bits: 4
scope: exported model
Compression
zlib
level: null
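The int4 + zlib export pipeline can be sketched with a symmetric per-tensor scheme: round weights to 4-bit integers, pack two nibbles per byte, then zlib-compress. The scale choice, packing order, and helper names below are assumptions for illustration; the submission does not specify its quantization or serialization format (nor the zlib level).

```python
import zlib

def export_int4_zlib(weights, level=9):
    # Hypothetical export: symmetric int4 quantization (range [-8, 7]),
    # two 4-bit values packed per byte, then zlib compression.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:
        q.append(0)  # pad to an even number of nibbles
    packed = bytes(((a & 0xF) << 4) | (b & 0xF) for a, b in zip(q[::2], q[1::2]))
    return scale, zlib.compress(packed, level)

def import_int4_zlib(scale, blob, n):
    # Inverse: decompress, split nibbles, sign-extend 4-bit values, rescale.
    q = []
    for byte in zlib.decompress(blob):
        for nib in ((byte >> 4) & 0xF, byte & 0xF):
            q.append(nib - 16 if nib >= 8 else nib)
    return [v * scale for v in q[:n]]
```

Packing alone gives 8x over fp32; zlib then exploits the skewed nibble distribution, which is how a ~15 MB artifact can carry more effective parameters than its byte count suggests.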
Other
other
Explicit future-state encoder target: the model is trained to encode a forecast of the next-segment hidden state into a dedicated channel.
parameters: null
other
Latent planner conditioning.
parameters: null
other
Prefix-hybrid macro routing.
parameters: null
other
Future-embed MTP head.
parameters: null
other
Meta-preconditioner on local transforms.
parameters: null
other
Int4 quantization plus zlib compression so the exported model fits under the 16MB artifact cap.
parameters: null
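The headline technique, the explicit future-state encoder target, amounts to an auxiliary regression: a dedicated output channel of the current segment is trained to forecast the next segment's hidden state. The summary statistic (mean over positions), the squared-error loss, and the tensor layout below are assumptions; the submission only states that the forecast is encoded into a dedicated channel.

```python
import numpy as np

def future_state_loss(hidden, plan_channel):
    # hidden:       (n_segments, T, D) per-segment hidden states
    # plan_channel: (n_segments, D)    forecast written by the model
    # Hypothetical target: the (detached) mean hidden state of the
    # *next* segment; the last segment has no target and is dropped.
    target = hidden[1:].mean(axis=1)   # next-segment summary (stop-gradient)
    pred = plan_channel[:-1]           # current segment's forecast
    return float(((pred - target) ** 2).mean())
```

In training this term would be added to the language-modeling loss with some weight, giving the recurrent schedule a planning signal that ordinary next-token prediction does not provide.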

Novel Contributions

  • Explicit future-state encoder target
  • Latent planner conditioning
  • Prefix-hybrid macro routing
  • Future-embed MTP head
  • Recurrent GQA transformer schedule
  • Banked parallel Muon optimizer with turbo4_aol geometry
  • Int4+zlib export under the 16MB cap