PR #1549
Non-record: Frozen Random Backbone + Rank-304 LoRA Adapters (val_bpb 1.3220)
by dljr-github
val_bpb: 1.3220
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB
Training Techniques
Architecture
depth recurrence
Looped layers 3-5 reuse the same adapter weights across 3 passes to increase gradient signal.
parameters: {"layers":[3,4,5],"passes":3}
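A minimal sketch of the looping, assuming layers are plain callables (the function and argument names are hypothetical, not the PR's actual code):

```python
def forward_with_recurrence(x, layers, looped=(3, 4, 5), passes=3):
    """Run a stack of layers, re-applying the looped block `passes` times.

    `layers` is a list of callables; layers 3-5 share weights across the
    passes, so gradients through that block accumulate `passes` times.
    """
    i = 0
    while i < len(layers):
        if i == looped[0]:
            block = layers[looped[0]: looped[-1] + 1]
            for _ in range(passes):          # same weights, multiple passes
                for layer in block:
                    x = layer(x)
            i = looped[-1] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With 8 identity-plus-one layers, the looped block of 3 layers runs 3 times, so the input is incremented 5 + 9 = 14 times instead of 8.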
XSA
XSA applied across all layers.
parameters: null
Partial RoPE
RoPE is applied to a 16-dimensional slice of each head rather than across the full head dimension.
parameters: {"dimensions":16}
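A sketch of partial RoPE, assuming (as is common, though not stated in the entry) that the rotated slice is the first 16 dims of each head:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` dims of each head vector;
    the remaining dims pass through unchanged.

    x: (..., head_dim) array; pos: integer token position."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]   # paired coordinates
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and dims beyond 16 are never touched at any position.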
U-Net skip connections
U-Net style skip connections added to the model.
parameters: null
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
MLP uses LeakyReLU(0.5)^2 activation.
parameters: {"negative_slope":0.5}
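A sketch of the activation; the entry does not say whether the square preserves sign, so this version squares plainly:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)**2 with slope 0.5 on the negative side, analogous to
    the squared-ReLU MLP activation but leaky."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```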
Quantization
GPTQ
bits: 6
scope: adapter matrices
Compression
brotli
level: 11
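A sketch of the 6-bit round-trip for the adapter matrices. GPTQ proper also applies error-compensating column updates during rounding; this shows only the symmetric quantize/dequantize step (all names are illustrative):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization to the range [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The quantized bytes would then be compressed with brotli at quality 11 (e.g. `brotli.compress(q.tobytes(), quality=11)`) before serialization; round-trip error is at most half a quantization step.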
Evaluation
sliding window eval
parameters: {"stride":256}
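One way to realize stride-256 sliding-window evaluation (a sketch, not the entry's code): advance the window by the stride and score each token exactly once, so every scored token after the first window sees near-full context.

```python
def sliding_window_spans(n_tokens, window=2048, stride=256):
    """Yield (lo, hi, scored) triples: run the model on tokens[lo:hi],
    but count loss only for positions in [scored, hi)."""
    lo, scored = 0, 0
    while scored < n_tokens:
        hi = min(lo + window, n_tokens)
        yield lo, hi, scored
        scored = hi
        lo += stride
```

For 12 tokens with window 8 and stride 2, the scored segments are [0, 8), [8, 10), [10, 12): full coverage with no double counting.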
Weight Averaging
EMA
parameters: {"decay":0.9965}
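The EMA update with decay 0.9965 is the standard one (note the entry states below that it is disabled for the adapters themselves):

```python
def ema_update(ema, current, decay=0.9965):
    """Exponential moving average of weights: ema <- decay*ema + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]
```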
Optimizer
AdamW
weight_decay: 0.095
momentum: null
other_params: {"embeddings_and_scalars":true}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
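A sketch of the schedule implied by warmdown_frac 0.72, assuming the common shape (constant LR, then linear decay to zero over the final 72% of steps):

```python
def lr_scale(step, total_steps, warmdown_frac=0.72):
    """LR multiplier: 1.0 until warmdown starts, then linear decay to 0."""
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

For 1000 total steps, warmdown begins at step 280 and the multiplier reaches 0.5 at step 640.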
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
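The stated scale is a direct formula, damping deeper layers' LayerNorm outputs:

```python
import math

def ln_scale(layer_index):
    """Per-layer LayerNorm output scale: 1 / sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer_index + 1)
```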
Novel Contributions
- Frozen random backbone reconstructed from a deterministic seed at load time, requiring no serialized backbone weights
- Rank-304 LoRA adapters applied to all linear layers
- Depth recurrence on layers 3-5 with shared adapter weights across multiple passes
- GPTQ int6 quantization with brotli compression for adapter-only artifact serialization
- EMA disabled for adapters because it regresses performance by averaging adapter_B toward zero initialization
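The frozen-random-backbone idea can be sketched as follows (a minimal numpy illustration with hypothetical names; the actual model presumably rebuilds its weights this way in PyTorch): each backbone matrix is regenerated from a deterministic seed at load time, so only the low-rank adapters are serialized.

```python
import numpy as np

def load_backbone_weight(shape, seed, layer_id):
    """Rebuild a frozen backbone matrix deterministically from a seed;
    nothing is serialized for the backbone itself."""
    rng = np.random.default_rng((seed, layer_id))
    return rng.standard_normal(shape).astype(np.float32) / np.sqrt(shape[0])

def lora_forward(x, w_frozen, a, b, scale=1.0):
    """y = x @ (W + scale * A @ B): frozen random W plus a trainable
    rank-r update (A: in x r, B: r x out, with r = 304 in this entry);
    only A and B are stored in the artifact."""
    return x @ w_frozen + scale * (x @ a) @ b
```

Regenerating with the same (seed, layer_id) pair yields an identical matrix, and with B at its zero initialization the LoRA path contributes nothing.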