val_bpb
1.3569
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,658,145 bytes
Training Techniques
Architecture
depth recurrence
Mirrored hourglass recurrent circuit with reused middle blocks and mirrored entry/exit tails (block route: 0 1 2 | 3 4 5 6 7 | 3 4 5 6 7 | 2 1 0).
parameters: {"unique_blocks":8,"effective_depth":16,"route_repeats":2}
weight tying
Factored token embeddings with a tied input/output interface, saving artifact bytes while preserving lexical capacity.
parameters: {"model_dim":704,"factored_embed_dim":832}
attention modification
Only the first recurrent-core block keeps attention enabled; the remaining repeated middle blocks are MLP-only.
parameters: {"core_attention_block":3,"mlp_only_blocks":[4,5,6,7]}
other
LexLoRE (VocabMoE): token-conditioned low-rank residual lexical adapters placed at the input embedding and at the first loop entry.
parameters: {"layers":["input","loop_first"],"experts":16,"rank":2}
Quantization
int8 QAT
bits: 8
scope: all
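A minimal sketch of the int8 fake-quantization applied during training, assuming per-tensor symmetric scales and a straight-through estimator:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize to int8 in the forward pass (scope: all weights)."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127) * scale
    # Straight-through estimator: forward sees quantized weights, backward
    # passes gradients through as if quantization were the identity.
    return w + (q - w).detach()
```

Per the contributions list below, this quantized forward would run from step 0 and cover the embeddings as well, so the exported int8 artifact matches what was trained.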
Compression
lzma
level: null
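A minimal sketch of the export step, assuming the serialized int8 state dict is passed in as bytes; `level: null` is read here as the library's default preset:

```python
import lzma

def export_artifact(raw_bytes: bytes, path: str) -> int:
    """Compress the serialized weights and return the artifact size in bytes."""
    blob = lzma.compress(raw_bytes)  # default preset, per "level: null"
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)  # must stay under the 16,000,000-byte cap
```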
Optimizer
Muon
weight_decay: 0
momentum: null
other_params: null
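For reference, Muon's core update is momentum SGD whose 2D update matrix is orthogonalized by a quintic Newton-Schulz iteration. This sketch follows the public reference implementation; the momentum value is an assumption since none is recorded above, and weight decay is omitted per weight_decay=0:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz iteration pushing singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float, momentum: float = 0.95) -> None:
    buf.mul_(momentum).add_(grad)   # momentum accumulation
    update = newton_schulz5(buf)    # orthogonalize the 2D update direction
    param.add_(update, alpha=-lr)   # weight_decay=0, so no decay term
```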
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":2200}
Regularization
weight decay
parameters: {"weight_decay":0}
Novel Contributions
- MirrorLoop HRC hourglass recurrent circuit with mirrored input/output tails
- LexLoRE token-conditioned low-rank lexical adapters implemented via VocabMoE flags
- Train-time quantized forward from step 0, including embeddings
- Factored tied embeddings to fit within the 16MB artifact budget
- Single attention-capable core-entry block with MLP-only recurrent blocks
- LQER low-rank export repair for quantization error (see the sketch after this list)
- Explicit artifact-size-aware design targeting the decimal 16,000,000 byte cap
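A hedged sketch of the LQER-style repair named above: fit a truncated SVD to each weight matrix's quantization error and ship the low-rank factors alongside the int8 tensor; the rank here is illustrative:

```python
import torch

def lqer_repair(w: torch.Tensor, q: torch.Tensor, rank: int = 2):
    """w: original fp weights; q: dequantized int8 weights (same shape)."""
    err = w - q                      # quantization error to be repaired
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    A = U[:, :rank] * S[:rank]       # (out, rank)
    B = Vh[:rank]                    # (rank, in)
    return A, B                      # deployed weight: q + A @ B
```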