PR #773

open

Add non-record shared-weight Frugendorff submission

by siddhantparadox
val_bpb: 1.1532
Architecture: Transformer
Optimizer:
Artifact Size: 15923834 bytes

Training Techniques

Architecture
XSA
Applied XSA to the last two layers of the shared-weight Frugendorff-derived host.
parameters: {"last_n_layers":2}
weight tying
Used tied embeddings / shared-weight layout in the Frugendorff host family.
parameters: {"tie_embeddings":1}
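Tying reuses one vocab x dim matrix for both the input embedding and the output projection, which matters in a size-capped artifact. A minimal parameter-accounting sketch; the vocab/dim values are illustrative GPT-2-ish sizes, not this submission's:

```python
# Parameter accounting for tied vs. untied embeddings.
# vocab/dim are illustrative, not this submission's actual sizes.
vocab, dim = 50304, 768
untied = 2 * vocab * dim  # separate embedding + lm_head matrices
tied = vocab * dim        # tie_embeddings=1: one shared matrix
saved = untied - tied     # parameters removed by tying
print(saved)  # 38633472
```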
KV head count
Used fewer KV heads than attention heads (grouped-query attention).
parameters: {"num_heads":10,"num_kv_heads":5}
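With 10 attention heads over 5 KV heads, each KV head is shared by a fixed group of query heads. A sketch of the head-to-KV index mapping only (not an attention implementation):

```python
# Grouped-query attention mapping: query head q reads KV head q // group.
num_heads, num_kv_heads = 10, 5    # from the submission's parameters
group = num_heads // num_kv_heads  # 2 query heads per KV head
kv_for = [q // group for q in range(num_heads)]
print(kv_for)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```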
RoPE
Used RoPE with reduced rotary dimensions.
parameters: {"rope_dims":16}
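rope_dims: 16 suggests only the first 16 dimensions of each head vector are rotated while the remainder pass through unchanged. A pure-Python sketch under that assumption; the host's actual implementation may differ:

```python
import math

def rope_partial(x, pos, rope_dims=16, base=10000.0):
    # Rotate consecutive pairs within the first rope_dims entries;
    # dimensions beyond rope_dims are passed through untouched.
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

v = [1.0] * 32
assert rope_partial(v, pos=0) == v            # zero rotation at position 0
assert rope_partial(v, pos=7)[16:] == v[16:]  # tail dims never rotated
```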
VE
Enabled VE in layers 2 and 3.
parameters: {"enabled":1,"dim":128,"layers":[2,3]}
Quantization
int6
bits: 6
scope: model weights
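Signed 6-bit covers the range [-32, 31]. A minimal symmetric per-tensor quantize/dequantize sketch; the scaling scheme is an assumption, and the submission's exporter may use per-channel scales:

```python
def quantize_int6(ws):
    # Symmetric per-tensor scale mapping max |w| to the int6 limit 31.
    scale = max(abs(w) for w in ws) / 31.0 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.93]
q, s = quantize_int6(w)
assert all(-32 <= v <= 31 for v in q)            # values fit in 6 bits
assert max(abs(a - b) for a, b in zip(dequantize(q, s), w)) < s
```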
Compression
zstd
level: null
Weight Averaging
EMA
parameters: null
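EMA keeps a shadow copy of the weights updated as ema = decay * ema + (1 - decay) * w each step. A minimal sketch over flat weight lists; the decay value is illustrative, since the parameters are null:

```python
def ema_update(ema, weights, decay=0.999):
    # One EMA step; decay is illustrative, not the submission's setting.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

shadow = [0.0]
for _ in range(3):  # constant weights, decay 0.5 for a clear closed form
    shadow = ema_update(shadow, [1.0], decay=0.5)
assert abs(shadow[0] - 0.875) < 1e-12  # 1 - 0.5**3
```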
Evaluation
sliding window eval
parameters: null
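Sliding-window eval scores a long token stream with overlapping contexts, so each token is conditioned on up to a full window but counted exactly once. A sketch of the window/scoring index logic; window and stride are assumptions, since the parameters are null:

```python
def sliding_windows(n_tokens, window=2048, stride=1024):
    # Yield (start, score_from, end): the model sees tokens [start, end)
    # but only tokens [score_from, end) contribute to the loss, so every
    # token is scored exactly once across all windows.
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + (window - stride)
        yield start, score_from, end
        if end == n_tokens:
            return
        start += stride

scored = [t for _, s, e in sliding_windows(5000) for t in range(s, e)]
assert scored == list(range(5000))  # full coverage, no double counting
```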
Test-Time Training
TTT disabled
parameters: {"enabled":0}
Other
other
Late QAT with audit-safety switches and replay/distillation extras disabled for a cleaner audited path.
parameters: {"late_qat":true,"ttt_burst_enabled":0,"distill_enabled":0}
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • Non-record unlimited-compute 16MB submission
  • Shared-weight Frugendorff-derived host
  • XSA on the last 2 layers
  • VE enabled with late QAT
  • int6 + zstd export
  • Replay/distillation extras disabled for a cleaner audited path
  • Hard requirement on zstandard to avoid fallback compression