PR #2013 (open)

Competition Submission - jj6 Eval Trick Stack - 1.0543 BPB

by Wilbatronic
val_bpb: 1.0543
Architecture: Transformer
Optimizer: (not listed)
Artifact Size: ~28M params

Training Techniques

Architecture: attention (11-layer attention model)
  parameters: {"layers": 11}
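
Only the layer count is stated. A hypothetical config sketch follows; every field other than `n_layer` and the eval length is an illustrative guess sized to land near the ~28M-parameter artifact (a byte-level vocab is assumed from the BPB metric), not the author's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # n_layer = 11 is the only value stated in the submission;
    # everything else here is an illustrative guess.
    n_layer: int = 11
    d_model: int = 448        # guess: 12 * 448^2 * 11 ~= 26.5M block params
    n_head: int = 7           # guess: 64-dim heads
    vocab_size: int = 256     # guess: byte-level, consistent with a BPB metric
    eval_seq_len: int = 2048  # stated eval length
```
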
Sequence Length: sequence_length
  train_length: null, eval_length: 2048
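
Evaluation runs at a 2048-token context (the training length isn't listed). A minimal sketch of a bits-per-byte loop over non-overlapping 2048-long windows, assuming a byte-level model; `model` and the 1-D `data` tensor are placeholders, not names from the PR:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def val_bpb(model, data: torch.Tensor, eval_len: int = 2048) -> float:
    """Bits per byte over non-overlapping eval_len windows.

    Assumes a byte-level LM: `data` is a 1-D tensor of byte ids and
    model(x) returns next-byte logits of shape (B, T, vocab).
    """
    model.eval()
    total_nll, total_bytes = 0.0, 0
    n_windows = (data.numel() - 1) // eval_len
    for i in range(n_windows):
        x = data[i * eval_len : (i + 1) * eval_len].long().unsqueeze(0)
        y = data[i * eval_len + 1 : (i + 1) * eval_len + 1].long().unsqueeze(0)
        logits = model(x)
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_bytes += y.numel()
    return total_nll / total_bytes / math.log(2)  # nats per byte -> bits
```
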
Quantization: int8
  bits: 8, scope: all
Compression: lzma
  level: null
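
The two entries above describe the artifact pipeline: quantize every tensor to int8 ("scope: all"), then lzma-compress the bundle. The PR doesn't specify the quantization scheme, so this sketch assumes symmetric per-tensor scaling, and reads "level: null" as the lzma default preset:

```python
import io
import lzma
import numpy as np
import torch

def pack_checkpoint(state_dict: dict, path: str) -> None:
    """int8-quantize every tensor, then lzma the bundle.

    Symmetric per-tensor scaling is an assumption; per-channel scales
    would likely be more accurate, but the PR doesn't say which was used.
    """
    arrays = {}
    for name, w in state_dict.items():
        w = w.detach().float().cpu().numpy()
        scale = max(np.abs(w).max() / 127.0, 1e-8)
        arrays[name + ".q"] = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        arrays[name + ".s"] = np.float32(scale)
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    with lzma.open(path, "wb") as f:  # "level: null" -> library default preset
        f.write(buf.getvalue())
```

Int8 bytes also compress far better under lzma than raw float bytes, which is presumably the point of stacking the two techniques.
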
Test-Time Training: TTT
  parameters: {"rank": 8}
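
"TTT" with rank 8 suggests low-rank (LoRA-style) adapters updated on the evaluation stream itself; the PR doesn't say which layers are adapted or what optimizer and learning rate are used, so those are assumptions in this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable rank-8 adapter."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # B starts at zero, so the adapter is a no-op until TTT updates it.
        return self.base(x) + x @ self.A.t() @ self.B.t()

def ttt_step(model, opt, x, y):
    """One gradient step on an eval window; only adapters require grad."""
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

A typical loop would wrap the attention/MLP projections in `LoRALinear`, build `opt = torch.optim.SGD(adapter_params, lr=...)` with an assumed learning rate, and score each window before updating on it so the model never peeks at bytes it hasn't paid for.
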
Other: Residual signs (parameters: null)
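
The PR doesn't define "Residual signs". One plausible reading, given the int8 + lzma stack, is storing one extra bit per weight (the sign of the quantization residual) plus a per-tensor mean magnitude, and adding that correction back at load time. A speculative sketch under that assumption:

```python
import numpy as np

def quantize_with_residual_signs(w: np.ndarray):
    """Speculative: int8 weights plus 1 bit/weight for the residual's sign."""
    scale = max(np.abs(w).max() / 127.0, 1e-8)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    residual = w - q.astype(np.float32) * scale
    signs = np.signbit(residual)               # True where residual is negative
    mag = np.float32(np.abs(residual).mean())  # one scalar correction per tensor
    return q, np.float32(scale), np.packbits(signs.ravel()), mag

def dequantize(q, scale, packed_signs, mag):
    signs = np.unpackbits(packed_signs, count=q.size).reshape(q.shape)
    corr = np.where(signs, -mag, mag)          # add back sign * mean|residual|
    return q.astype(np.float32) * scale + corr
```
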
Other: Outlier filtering (parameters: null)
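
"Outlier filtering" is likewise unspecified. A common reading in a quantization stack is pulling the largest-magnitude weights out of the int8 grid (storing them separately in higher precision) so they don't inflate the quantization scale; `frac` below is an assumed knob, not a value from the PR:

```python
import numpy as np

def quantize_filtering_outliers(w: np.ndarray, frac: float = 0.001):
    """Speculative: store top-|w| outliers in fp16, int8-quantize the rest."""
    flat = w.ravel().astype(np.float32)
    k = max(1, int(frac * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]     # outlier positions
    outliers = flat[idx].astype(np.float16)
    inliers = flat.copy()
    inliers[idx] = 0.0                               # remove before scaling
    scale = max(np.abs(inliers).max() / 127.0, 1e-8)
    q = np.clip(np.round(inliers / scale), -127, 127).astype(np.int8)
    return q.reshape(w.shape), np.float32(scale), idx.astype(np.int64), outliers
```
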
Other: Stochastic eval (parameters: null)
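
"Stochastic eval" is also unspecified; it could mean stochastic rounding during quantization or averaging over stochastic forward passes. The sketch below takes the second reading: average probabilities over a few passes with stochastic layers left active. `n_samples` is an assumed knob:

```python
import torch

@torch.no_grad()
def stochastic_logprobs(model, x, n_samples: int = 4):
    """Speculative: average probabilities over stochastic forward passes."""
    model.train()          # leave dropout (if any) on for the stochastic passes
    probs = None
    for _ in range(n_samples):
        p = torch.softmax(model(x), dim=-1)
        probs = p if probs is None else probs + p
    model.eval()
    return torch.log(probs / n_samples)
```
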
Regularization: logit softcap
  parameters: {"temperature": 1.02}
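
The only listed parameter is the temperature of 1.02; the name "logit softcap" also suggests a tanh-style cap (as in Gemma 2), but no cap value is given, so the cap is optional in this sketch and its scale would be a guess:

```python
from typing import Optional

import torch

def shape_logits(logits: torch.Tensor, temperature: float = 1.02,
                 softcap: Optional[float] = None) -> torch.Tensor:
    """Temperature 1.02 comes from the PR; the tanh cap value does not."""
    if softcap is not None:
        logits = softcap * torch.tanh(logits / softcap)
    return logits / temperature
```

Dividing logits by a temperature slightly above 1 flattens mildly overconfident predictions, which can shave a little off validation BPB.
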

Novel Contributions

  • 11-layer attention architecture
  • Evaluation at a 2048-token sequence length
  • int8 quantization with lzma compression
  • Residual signs eval trick
  • Test-time training (TTT) with rank 8
  • Outlier filtering
  • Stochastic evaluation
  • Logit temperature scaling at 1.02