PR #622 (closed)

Submission: 1.0941 BPB by David Weyh
Submitted by UpsallaView on GitHub

val_bpb: 1.0941
Architecture: Transformer
Optimizer: Adam
Artifact Size: 14.99 MB

Training Techniques

Quantization
  • int8 (bits: 8, scope: all)
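
The submission only records int8 weights (8 bits, applied to all tensors); the exact scheme is not described. A minimal sketch, assuming PyTorch and symmetric per-tensor quantization with a single fp32 scale per tensor:

```python
import torch

def quantize_int8(t: torch.Tensor):
    """Symmetric per-tensor int8 quantization: store int8 values plus one fp32 scale."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

def quantize_state_dict(state_dict):
    """Quantize every (float) tensor in the checkpoint, matching "scope: all"."""
    return {name: quantize_int8(w) for name, w in state_dict.items()}
```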

Architecture
  • SmearGate: custom gating mechanism used in the model architecture.
  • BigramHash: custom bigram-based hashing component used in the model architecture.
  • OrthoInit: orthogonal initialization used for model weights.
  • MLP3x: MLP hidden dimension expanded to 3x the model dimension (hidden: 1536).
  • Tied embeddings: input and output embeddings are tied.
  • KV head count: grouped-query attention with 8 attention heads and 4 KV heads.

Optimizer
  • Adam (learning_rate: 0.01; weight_decay and momentum not specified)
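
Only the learning rate (0.01) is recorded for the optimizer. A minimal PyTorch sketch assuming the default betas and zero weight decay, with model standing in for the transformer above:

```python
import torch

# Adam as listed: lr = 0.01; weight_decay is not given, so PyTorch's default of 0.0
# is assumed here, along with the default betas (0.9, 0.999).
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
```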

Compression
  • zlib (level: not specified)
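
The artifact is zlib-compressed, with no compression level recorded. A minimal sketch, assuming PyTorch serialization and an illustrative level of 9:

```python
import io
import zlib
import torch

def save_compressed(state_dict, path, level=9):
    """Serialize the (int8-quantized) checkpoint and zlib-compress it to shrink the artifact."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue(), level))

def load_compressed(path):
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    return torch.load(io.BytesIO(raw))
```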

Test-Time Training
  • LoRA TTT (rank: 8, learning_rate: 0.01, epochs: 2, layers: c_proj, mlp_proj)

Initialization
  • OrthoInit: orthogonal initialization.
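
Orthogonal initialization is listed with no parameters; a one-function sketch assuming PyTorch Linear layers and a gain of 1.0:

```python
import torch.nn as nn

def ortho_init(model, gain=1.0):
    """Apply orthogonal initialization to every Linear weight matrix."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=gain)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    return model
```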

Sequence Length
  • train_length: not specified; eval_length: 512

Other
  • Document-level sequential evaluation and adaptation: validation documents are processed chronologically; shorter documents are evaluated zero-shot, and longer documents are chunked for per-document adaptation. (document_level: true, sequential_processing: true)

Novel Contributions

  • Document-level LoRA test-time training on validation documents
  • Chronological chunk-wise adaptation within each document
  • LoRA injected into c_proj and mlp_proj layers
  • INT8 quantization plus zlib compression to fit the artifact under 16 MB
  • 10-layer, 512-dim transformer with SmearGate, BigramHash, and tied embeddings