val_bpb: 1.0668
Architecture: Transformer
Optimizer: —
Artifact Size: 16,415,938 bytes
Training Techniques
Architecture
- GQA: grouped-query attention used in the transformer stack (no parameters reported; see the sketch below).
- Depth recurrence: included in the model design (no parameters reported).
- SmearGate: smear gate used as part of the attention/activation design (no parameters reported).
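The architecture entries above are reported without parameters. As an illustration only, here is a minimal grouped-query attention block in PyTorch; the head counts, model width, and causal masking are assumptions rather than values from the submission, and depth recurrence and SmearGate are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    # Illustrative sizes; the submission does not report its own head counts.
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.hd = d_model // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.hd, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        # Each key/value head is shared by a group of query heads.
        rep = self.n_q // self.n_kv
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))

x = torch.randn(2, 16, 512)
print(GroupedQueryAttention()(x).shape)  # torch.Size([2, 16, 512])
```

Sharing each key/value head across a group of query heads shrinks the KV projections (and the KV cache) relative to standard multi-head attention, which is the usual motivation for GQA.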
Quantization
- GPTQ: bits not reported; scope: mixed.
- int5: bits: 5; scope: export-only fallback (see the sketch below).
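The record lists GPTQ at an unspecified bit width plus an int5 export-only fallback. The sketch below only makes the int5 range concrete with a simple symmetric per-channel round-trip; GPTQ's error-compensated rounding and the actual export packing format are not reproduced and the per-channel choice is an assumption.

```python
import torch

def quantize_int5(w: torch.Tensor):
    """Symmetric per-output-channel 5-bit quantization (illustrative only)."""
    qmax = 15  # int5 range is [-16, 15]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-16, 15).to(torch.int8)
    return q, scale

def dequantize_int5(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(256, 512)
q, s = quantize_int5(w)
print((dequantize_int5(q, s) - w).abs().mean())  # mean reconstruction error
```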
Weight Averaging
- EMA: no parameters reported (see the sketch below).
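EMA weight averaging is listed without parameters. A minimal sketch follows; the decay constant of 0.999 and the update-after-each-optimizer-step usage are assumptions.

```python
import copy
import torch

class EMAWeights:
    """Exponential moving average of model parameters (decay is an assumed value)."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Typical use: call ema.update(model) after each optimizer step,
# then evaluate or export ema.shadow instead of the live weights.
```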
Test-Time Training
- score-first TTT: parameters {"chunk_size": 48, "lora_rank": 80, "phases": 3} (see the sketch below).
LR Schedule
- warmdown: parameters {"warmdown_frac": 0.85} (see the sketch below).
Sequence Length
- train_length: 8192; eval_length: not reported.
Other
- Sparse attention gating used during training and evaluation: parameters {"gate_window": 12}.
- LQER asymmetric correction enabled with low-rank factorization: parameters {"rank": 4, "factor_bits": 4} (see the sketch below).
Novel Contributions
- Non-record evidence submission showing a strong BPB result despite the artifact exceeding the 16,000,000-byte cap.
- SP8192 apex stack run with grouped-query attention, depth recurrence, sparse attention gating, SmearGate, and LQER asymmetric correction.
- Documentation of lossless packaging attempts that failed to bring the artifact below the cap.
- Under-cap int5 fallback export demonstrating the architecture can be packaged within the limit, albeit with worse quality.