PR #2044 (open) · View on GitHub
Non-record 10min/16MB: Future-State Encoder Planner (1xH100, val_bpb 1.39784)
by FF-GardenFn
val_bpb
1.3978
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,003,692 bytes
Training Techniques
Architecture
GQA
Grouped query attention in a recurrent transformer schedule.
parameters: null
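A minimal PyTorch sketch of the grouped-query attention component, assuming the standard formulation in which several query heads share each key/value head; module and parameter names are illustrative, not taken from the submission.

```python
import torch.nn as nn
import torch.nn.functional as F

class GQA(nn.Module):
    # Illustrative grouped-query attention: n_q query heads share a smaller
    # set of n_kv key/value heads (standard formulation, not the PR's code).
    def __init__(self, dim, n_q_heads, n_kv_heads):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0 and dim % n_q_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.hd = dim // n_q_heads
        self.wq = nn.Linear(dim, n_q_heads * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(n_q_heads * self.hd, dim, bias=False)

    def forward(self, x):                                  # x: (B, T, dim)
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        # Broadcast each KV head across its group of query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, self.n_q * self.hd))
```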
depth recurrence
Recurrent transformer schedule with repeated processing over segments.
parameters: null
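A hedged sketch of what the depth-recurrent schedule could look like: a shared block stack is re-applied several times so effective depth grows without additional parameters. The loop count and structure are assumptions; the PR metadata does not spell them out.

```python
import torch.nn as nn

class DepthRecurrentCore(nn.Module):
    # Assumed depth recurrence: the same small stack of transformer blocks is
    # applied n_loops times per forward pass (weights shared across loops).
    def __init__(self, blocks, n_loops):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):      # repeated processing of the segment
            for block in self.blocks:
                x = block(x)
        return x
```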
SmearGate
Smear gate used as part of the attention/activation routing.
parameters: null
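The PR does not define the smear gate; the sketch below assumes a common reading in which each position mixes in a learned, gated fraction of the previous position's activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    # Assumed smear gate: activations are "smeared" one step forward along the
    # sequence through a per-token sigmoid gate. Details are not from the PR.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):                              # x: (B, T, C)
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]          # x shifted right by one token
        g = torch.sigmoid(self.gate(x))                # (B, T, 1)
        return x + g * prev
```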
BigramHash
Bigram-prior signals used to guide the model.
parameters: null
TrigramHash
Trigram CP prior builder used for auxiliary prior signals.
parameters: null
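Both BigramHash and TrigramHash read as hashed n-gram prior tables; a single hedged sketch covers the idea, with the simple positional mixing below standing in for whatever hash function the submission actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashedNgramPrior(nn.Module):
    # Hedged sketch for BigramHash (n=2) / TrigramHash (n=3): the previous
    # n-1 token ids are hashed into a fixed table and the looked-up vector is
    # used as an auxiliary prior signal. Hash and table sizes are assumptions.
    def __init__(self, n, n_buckets, dim, vocab_size):
        super().__init__()
        self.n, self.n_buckets, self.vocab = n, n_buckets, vocab_size
        self.table = nn.Embedding(n_buckets, dim)

    def forward(self, idx):                            # idx: (B, T) token ids
        h = torch.zeros_like(idx)
        for k in range(1, self.n):                     # fold in the previous n-1 tokens
            shifted = F.pad(idx, (k, 0))[:, :-k]       # token at position t-k (0 before start)
            h = h * self.vocab + shifted
        return self.table(h % self.n_buckets)          # (B, T, dim) prior features
```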
weight tying
Not explicitly stated in the submission text.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"geometry":"turbo4_aol","parallel":"banked parallel","manual_all_reduce_non_bank_params":true}
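The core Muon update (momentum followed by Newton-Schulz orthogonalization of the 2-D momentum matrix) is public; a condensed sketch is below. The submission-specific "turbo4_aol" geometry, banked-parallel execution, and manual all-reduce of non-bank parameters are distributed-training details not reproduced here.

```python
import torch

@torch.no_grad()
def newton_schulz5(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration used by Muon to approximately
    # orthogonalize a 2-D matrix (coefficients from the public Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + eps)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    # Simplified single-matrix Muon step: momentum accumulation, then an
    # orthogonalized update direction. Shape-dependent scaling and the PR's
    # banked-parallel all-reduce scheme are omitted.
    momentum_buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz5(momentum_buf), alpha=-lr)
```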
Quantization
int4
bits: 4
scope: exported model
Compression
zlib
level: null
Other
other
Explicit future-state encoder target: the model is trained to encode a forecast of the next-segment hidden state into a dedicated channel.
parameters: null
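One hedged reading of this target, as an auxiliary loss: a dedicated channel of each segment's final hidden state is regressed toward a detached summary of the next segment's hidden states. Segment pooling, channel choice, and the loss type are assumptions.

```python
import torch.nn.functional as F

def future_state_encoder_loss(hidden, seg_len, channel_slice):
    # Hedged sketch: hidden is (B, T, C) with T a multiple of seg_len.
    # The forecast channel of each segment's last position is trained to
    # predict the (detached) mean hidden state of the following segment.
    B, T, C = hidden.shape
    segs = hidden.view(B, T // seg_len, seg_len, C)
    pred = segs[:, :-1, -1, channel_slice]                        # dedicated forecast channel
    target = segs[:, 1:].mean(dim=2)[..., channel_slice].detach() # next-segment summary
    return F.mse_loss(pred, target)
```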
other
Latent planner conditioning.
parameters: null
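The submission gives only the name; a speculative sketch of one plausible form is below, where a pooled plan vector from the current segment biases the next segment's inputs.

```python
import torch.nn as nn

class LatentPlanner(nn.Module):
    # Speculative sketch: a low-dimensional "plan" is pooled from the current
    # segment and added to the next segment's inputs. The actual mechanism is
    # not described in the PR metadata.
    def __init__(self, dim, plan_dim):
        super().__init__()
        self.to_plan = nn.Linear(dim, plan_dim, bias=False)
        self.from_plan = nn.Linear(plan_dim, dim, bias=False)

    def forward(self, seg_hidden, next_seg_inputs):    # (B, S, C), (B, S', C)
        plan = self.to_plan(seg_hidden.mean(dim=1))    # (B, plan_dim)
        return next_seg_inputs + self.from_plan(plan).unsqueeze(1)
```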
other
Prefix-hybrid macro routing.
parameters: null
other
Future-embed MTP head.
parameters: null
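A hedged sketch of a multi-token-prediction head that, alongside the usual next-token loss, regresses a projection of the hidden state onto the embedding of the token two steps ahead; the auxiliary weight and target choice are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FutureEmbedMTPHead(nn.Module):
    # Hedged MTP sketch: standard next-token logits plus an auxiliary
    # "future-embed" regression against the embedding of the token at t+2.
    def __init__(self, dim, vocab_size, aux_weight=0.1):
        super().__init__()
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.future_proj = nn.Linear(dim, dim, bias=False)
        self.aux_weight = aux_weight                   # illustrative value

    def forward(self, hidden, token_emb, targets):
        # hidden: (B, T, C); token_emb: nn.Embedding; targets: (B, T) next tokens
        logits = self.lm_head(hidden)
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        future_target = token_emb(targets[:, 1:]).detach()   # embedding of token t+2
        future_pred = self.future_proj(hidden[:, :-1])
        return ce + self.aux_weight * F.mse_loss(future_pred, future_target)
```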
other
Meta-preconditioner on local transforms.
parameters: null
other
Int4+zlib export for fitting under the 16MB artifact cap.
parameters: null
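The mechanics of an int4 + zlib export are conventional; a sketch is below, assuming symmetric per-group 4-bit quantization with two values packed per byte and a zlib-compressed pickle container. Group size, packing layout, and container format are guesses, not the PR's actual exporter.

```python
import pickle
import zlib
import torch

def export_int4_zlib(state_dict, path, group_size=128):
    # Hedged sketch of an int4 + zlib export used to fit under the 16 MB cap.
    payload = {}
    for name, w in state_dict.items():
        flat = w.detach().float().cpu().flatten()
        pad = (-flat.numel()) % group_size
        flat = torch.cat([flat, flat.new_zeros(pad)]).view(-1, group_size)
        scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        q = (torch.clamp(torch.round(flat / scale), -8, 7) + 8).to(torch.int32)  # 0..15
        packed = (q[:, 0::2] | (q[:, 1::2] << 4)).to(torch.uint8)                # 2 nibbles/byte
        payload[name] = (packed.numpy().tobytes(),
                         scale.half().numpy().tobytes(),
                         tuple(w.shape))
    blob = zlib.compress(pickle.dumps(payload), level=9)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)   # compressed artifact size in bytes
```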
Novel Contributions
- Explicit future-state encoder target
- Latent planner conditioning
- Prefix-hybrid macro routing
- Future-embed MTP head
- Recurrent GQA transformer schedule
- Banked parallel Muon optimizer with turbo4_aol geometry
- Int4+zlib export under the 16MB cap