PR #773

open

Add non-record shared-weight Frugendorff submission

by siddhantparadox
val_bpb: 1.1532
Architecture: Transformer
Optimizer:
Artifact Size: 15923834 bytes

Training Techniques

Architecture
XSA
Applied XSA to the last two layers of the shared-weight Frugendorff-derived host.
parameters: {"last_n_layers":2}
weight tying
Used tied embeddings / shared-weight layout in the Frugendorff host family.
parameters: {"tie_embeddings":1}
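Tying reuses one vocab x dim matrix for both the input embedding and the output projection, which matters in a size-capped artifact. A minimal parameter-accounting sketch; the vocab/dim values are illustrative GPT-2-ish sizes, not this submission's:

```python
# Parameter accounting for tied vs. untied embeddings.
# vocab/dim are illustrative, not this submission's actual sizes.
vocab, dim = 50304, 768
untied = 2 * vocab * dim  # separate embedding + lm_head matrices
tied = vocab * dim        # tie_embeddings=1: one shared matrix
saved = untied - tied     # parameters removed by tying
print(saved)  # 38633472
```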
KV head count
Used fewer KV heads than attention heads (grouped-query attention).
parameters: {"num_heads":10,"num_kv_heads":5}
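With 10 attention heads over 5 KV heads, each KV head is shared by a fixed group of query heads. A sketch of the head-to-KV index mapping only (not an attention implementation):

```python
# Grouped-query attention mapping: query head q reads KV head q // group.
num_heads, num_kv_heads = 10, 5    # from the submission's parameters
group = num_heads // num_kv_heads  # 2 query heads per KV head
kv_for = [q // group for q in range(num_heads)]
print(kv_for)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```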
RoPE
Used RoPE with reduced rotary dimensions.
parameters: {"rope_dims":16}
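rope_dims: 16 suggests only the first 16 dimensions of each head vector are rotated while the remainder pass through unchanged. A pure-Python sketch under that assumption; the host's actual implementation may differ:

```python
import math

def rope_partial(x, pos, rope_dims=16, base=10000.0):
    # Rotate consecutive pairs within the first rope_dims entries;
    # dimensions beyond rope_dims are passed through untouched.
    out = list(x)
    for i in range(0, rope_dims, 2):
        theta = pos / (base ** (i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

v = [1.0] * 32
assert rope_partial(v, pos=0) == v            # zero rotation at position 0
assert rope_partial(v, pos=7)[16:] == v[16:]  # tail dims never rotated
```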
VE
Enabled VE in layers 2 and 3.
parameters: {"enabled":1,"dim":128,"layers":[2,3]}
Quantization
int6
bits: 6
scope: model weights
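Signed 6-bit covers the range [-32, 31]. A minimal symmetric per-tensor quantize/dequantize sketch; the scaling scheme is an assumption, and the submission's exporter may use per-channel scales:

```python
def quantize_int6(ws):
    # Symmetric per-tensor scale mapping max |w| to the int6 limit 31.
    scale = max(abs(w) for w in ws) / 31.0 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.93]
q, s = quantize_int6(w)
assert all(-32 <= v <= 31 for v in q)            # values fit in 6 bits
assert max(abs(a - b) for a, b in zip(dequantize(q, s), w)) < s
```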
Compression
zstd
level: null
Weight Averaging
EMA
parameters: null
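EMA keeps a shadow copy of the weights updated as ema = decay * ema + (1 - decay) * w each step. A minimal sketch over flat weight lists; the decay value is illustrative, since the parameters are null:

```python
def ema_update(ema, weights, decay=0.999):
    # One EMA step; decay is illustrative, not the submission's setting.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

shadow = [0.0]
for _ in range(3):  # constant weights, decay 0.5 for a clear closed form
    shadow = ema_update(shadow, [1.0], decay=0.5)
assert abs(shadow[0] - 0.875) < 1e-12  # 1 - 0.5**3
```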
Evaluation
sliding window eval
parameters: null
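Sliding-window eval scores a long token stream with overlapping contexts, so each token is conditioned on up to a full window but counted exactly once. A sketch of the window/scoring index logic; window and stride are assumptions, since the parameters are null:

```python
def sliding_windows(n_tokens, window=2048, stride=1024):
    # Yield (start, score_from, end): the model sees tokens [start, end)
    # but only tokens [score_from, end) contribute to the loss, so every
    # token is scored exactly once across all windows.
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + (window - stride)
        yield start, score_from, end
        if end == n_tokens:
            return
        start += stride

scored = [t for _, s, e in sliding_windows(5000) for t in range(s, e)]
assert scored == list(range(5000))  # full coverage, no double counting
```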
Test-Time Training
TTT disabled
parameters: {"enabled":0}
Other
other
Late QAT with audit-safety switches and replay/distillation extras disabled for a cleaner audited path.
parameters: {"late_qat":true,"ttt_burst_enabled":0,"distill_enabled":0}
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • Non-record unlimited-compute 16MB submission
  • Shared-weight Frugendorff-derived host
  • XSA on the last 2 layers
  • VE enabled with late QAT
  • int6 + zstd export
  • Replay/distillation extras disabled for a cleaner audited path
  • Hard requirement on zstandard to avoid fallback compression