val_bpb: 1.1532
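For context, bits per byte (bpb) converts a model's mean cross-entropy loss (in nats per token) into bits per byte of raw text, making scores comparable across tokenizers. A minimal sketch of the conversion (the 0.8/1000/1000 numbers below are illustrative, not this submission's):

```python
import math

def bits_per_byte(nats_per_token: float, tokens: int, text_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = nats_per_token * tokens / math.log(2)  # nats -> bits
    return total_bits / text_bytes

# With a 1:1 token-to-byte ratio, bpb is just loss / ln(2):
# bits_per_byte(0.8, 1000, 1000) ≈ 1.154
```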
Architecture: Transformer
Optimizer: —
Artifact Size: 15923834 bytes (~15.2 MiB)
Training Techniques
Architecture
XSA
Applied XSA to the last two layers of the shared-weight Frugendorff-derived host.
parameters: {"last_n_layers":2}
weight tying
Used tied embeddings / shared-weight layout in the Frugendorff host family.
parameters: {"tie_embeddings":1}
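Weight tying means one shared matrix serves as both the input embedding and the output (unembedding) projection, roughly halving embedding parameters. A minimal numpy sketch (the 256/64 sizes are illustrative, not the host's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 64

# Single shared matrix: rows embed tokens, its transpose produces logits.
E = rng.standard_normal((vocab, d_model)).astype(np.float32)

def embed(token_ids):
    return E[token_ids]          # (seq, d_model)

def logits(hidden):
    return hidden @ E.T          # (seq, vocab) -- tied unembedding

h = embed(np.array([1, 2, 3]))
out = logits(h)
```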
KV head count
Used fewer KV heads than query heads (10 query heads sharing 5 KV heads).
parameters: {"num_heads":10,"num_kv_heads":5}
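The card doesn't name the mechanism, but fewer KV heads than query heads is the grouped-query attention layout: each KV head is shared by a group of query heads. A sketch with the card's 10/5 head counts (head_dim and seq length are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 10, 5, 16, 8
group = num_heads // num_kv_heads  # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads reads the same K/V pair.
k_full = np.repeat(k, group, axis=0)   # (10, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full                  # (10, seq, head_dim)
```

The KV cache shrinks by the group factor (here 2x) while the query side keeps its full head count.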
RoPE
Used RoPE with the rotary transform restricted to 16 dimensions per head; the remaining dimensions pass through unrotated.
parameters: {"rope_dims":16}
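A sketch of partial RoPE under the half-split pairing convention (the pairing convention and head_dim are assumptions; only rope_dims=16 comes from the card). Dimensions past the first 16 are untouched, and rotation preserves the norm of the rotated slice:

```python
import numpy as np

def rope_partial(x, rope_dims=16, base=10000.0):
    """Apply RoPE to the first `rope_dims` of each vector; pass the rest through."""
    seq, head_dim = x.shape
    half = rope_dims // 2
    pos = np.arange(seq)[:, None]               # (seq, 1)
    freqs = base ** (-np.arange(half) / half)   # (half,)
    ang = pos * freqs                           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```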
VE
Enabled VE (dim 128) in layers 2 and 3.
parameters: {"enabled":1,"dim":128,"layers":[2,3]}
Quantization
int6
bits: 6
scope: model weights
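A minimal sketch of symmetric per-tensor 6-bit quantization of weights (the per-tensor scaling choice is an assumption; the card only states int6 over model weights). Round-trip error is bounded by half a quantization step:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit integers in [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                    # 31
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
```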
Compression
zstd
level: null
Weight Averaging
EMA
parameters: null
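EMA weight averaging maintains a shadow copy of the weights updated as a decaying average after each step; evaluation and export then use the averaged weights. A sketch (the 0.5 decay is illustrative; the card gives no parameters):

```python
import numpy as np

def ema_update(ema, params, decay=0.999):
    """One exponential-moving-average step over a list of weight tensors."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

params = [np.ones(4)]
ema = [np.zeros(4)]
for _ in range(3):
    ema = ema_update(ema, params, decay=0.5)
# three half-decay steps toward 1.0: 0.5, 0.75, 0.875
```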
Evaluation
sliding window eval
parameters: null
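Sliding-window evaluation scores a long sequence in overlapping windows so that every token is scored exactly once while keeping substantial left context. A sketch of the window-planning logic (the 2048/512 window and stride are assumptions; the card gives no parameters):

```python
def sliding_windows(n_tokens, window=2048, stride=512):
    """Plan spans so each token is scored once, with up to `window` tokens
    of left context. Returns (begin, end, score_from) triples: the model
    conditions on [begin, end) but loss is taken only on [score_from, end)."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```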
Test-Time Training
TTT disabled
parameters: {"enabled":0}
Other
other
Late QAT enabled, with audit-safety switches; TTT-burst and replay/distillation extras disabled for a cleaner audited path.
parameters: {"late_qat":true,"ttt_burst_enabled":0,"distill_enabled":0}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Non-record unlimited-compute 16MB submission
- Shared-weight Frugendorff-derived host
- XSA on the last 2 layers
- VE enabled with late QAT
- int6 + zstd export
- Replay/distillation extras disabled for a cleaner audited path
- Hard requirement on zstandard to avoid fallback compression
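The last point, failing fast on a missing compressor rather than silently falling back to a weaker codec, can be sketched as a generic import guard (the helper name `require` is hypothetical, not from the submission's code):

```python
import importlib
import importlib.util

def require(module_name: str):
    """Fail fast if a required module is missing, instead of silently
    falling back to a weaker codec (e.g. zstandard vs. zlib)."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        raise RuntimeError(f"hard requirement not met: install '{module_name}'")
    return importlib.import_module(module_name)

# require("zstandard") at startup guarantees the zstd path is taken.
```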