PR #432 (open)
Add non-record 1x5090 autoresearch submission with two-campaign analysis
by jadechip
val_bpb
1.5295
Architecture
GPT
Optimizer
—
Artifact Size
9,190,936 bytes
Training Techniques
Quantization
int6
bits: 6
scope: MLP-only export; model weights with targeted fp16 exceptions
QAT
bits: null
scope: attention
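A minimal sketch of the int6 fake-quantization step implied by the QAT entry above (symmetric per-tensor scaling and all names here are assumptions, not the submission's code):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    # Symmetric per-tensor fake quantization: snap weights to a 6-bit grid
    # but keep them in floating point, as done during quantization-aware
    # training so gradients still flow through near-true weight values.
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(w.dtype)

w = np.random.randn(64, 64).astype(np.float32)
wq = fake_quant_int6(w)                     # at most 2**6 distinct levels
```

At export time the same grid would be stored as true int6 codes plus one scale, which is where the artifact-size savings come from.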
Architecture
depth recurrence
Reduced repeated/shared compute and shifted capacity into unique late tail blocks; explored compact carrier plus deeper unique tail.
parameters: null
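The compact-carrier-plus-unique-tail layout could be sketched as follows (block names and counts are hypothetical illustrations, not the run's actual configuration):

```python
def layer_plan(carrier_reps=8, tail_depth=4):
    # Hypothetical depth-recurrence plan: one shared "carrier" block applied
    # carrier_reps times (its weights are reused, so it counts once toward
    # the artifact size), followed by tail_depth unique late tail blocks.
    plan = ["carrier"] * carrier_reps + [f"tail_{i}" for i in range(tail_depth)]
    # Parameter-bearing blocks stored in the artifact: 1 carrier + the tails.
    unique_blocks = 1 + tail_depth
    return plan, unique_blocks

plan, unique = layer_plan()
```

This is the trade the description points at: effective depth stays high while stored capacity concentrates in the unique tail.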
q_proj
Used low-rank q_proj on most blocks, with full-rank q_proj restored only on the final tail block.
parameters: null
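A sketch of the low-rank q_proj parameter trade (sizes are illustrative, not the run's):

```python
import numpy as np

d_model, rank = 512, 64                  # illustrative sizes
# Full-rank q_proj stores a d_model x d_model weight matrix.
full_params = d_model * d_model
# Low-rank q_proj factors it as x @ A @ B, storing 2 * d_model * rank weights.
A = np.random.randn(d_model, rank).astype(np.float32) / np.sqrt(d_model)
B = np.random.randn(rank, d_model).astype(np.float32) / np.sqrt(rank)
low_rank_params = A.size + B.size

x = np.random.randn(2, d_model).astype(np.float32)
q = x @ A @ B                            # queries, same shape as full-rank
```

The parameters saved on most blocks are what "buys" the full-rank q_proj restored on the final tail block.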
tied embeddings
Kept tied embeddings in fp16 export on the final linear.
parameters: null
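Weight tying as used here can be sketched like this (a toy illustration under assumed sizes, showing why the table is stored once and can be held out of the int6 export in fp16):

```python
import numpy as np

vocab, d_model = 1000, 64                # illustrative sizes
E = np.random.randn(vocab, d_model).astype(np.float16)  # fp16 embedding table

def embed(token_ids):
    # Input embedding: a row lookup into the shared table.
    return E[token_ids]

def logits(h):
    # Tied unembedding: the final linear reuses E transposed, so only one
    # copy of the table lives in the artifact.
    return h @ E.T

h = embed(np.array([1, 2, 3]))
out = logits(h)
```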
Sequence Length
sequence_length
train_length: 960
eval_length: 1024
LR Schedule
short-to-full context warmup
parameters: null
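One plausible shape for the short-to-full context warmup, assuming a linear ramp up to the train length of 960 (the warmup length and step count are invented for illustration):

```python
def train_context_length(step, warmup_steps=1000, short_len=256, full_len=960):
    # Hypothetical schedule: ramp the training sequence length linearly
    # from short_len to full_len over warmup_steps, then hold at full_len.
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    return short_len + int(frac * (full_len - short_len))
```

Short early contexts make each step cheaper, so more optimizer updates fit in the same wall-clock budget before full-length training begins.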
Other
other
Shrank the global update shape from 4 x 30720 to 3 x 30720 and then 2 x 30720.
parameters: {"from":"4 x 30720","to":"2 x 30720"}
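Reading the global update shape as rows x tokens-per-row, the tokens consumed per optimizer update shrink as follows:

```python
tokens_per_row = 30720
# Tokens per optimizer update under each global shape from the log.
before = 4 * tokens_per_row   # 122,880 tokens/update
middle = 3 * tokens_per_row   #  92,160 tokens/update
after  = 2 * tokens_per_row   #  61,440 tokens/update
```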
other
Disabled attention fake quant during training.
parameters: null
other
Delayed MLP fake quant until the full-context boundary.
parameters: null
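The two staged fake-quant changes above amount to a simple gate; a hedged sketch (the function, threshold, and module names are assumptions for illustration):

```python
def fake_quant_active(module, step, full_context_step=1000):
    # Hypothetical gating: attention fake quant stays off for all of
    # training, while MLP fake quant only switches on once the schedule
    # reaches the full-context boundary.
    if module == "attention":
        return False
    if module == "mlp":
        return step >= full_context_step
    raise ValueError(f"unknown module: {module}")
```

Keeping attention in clean fp math is consistent with the export scope above, where only the MLPs are quantized to int6.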
Novel Contributions
- Git-native autoresearch loop that commits wins and reverts losers as search memory
- Two-campaign analysis showing a trajectory from 1.733958794 to 1.535119154 with a best numeric run of 1.528664372
- Use of low-rank q_proj on most blocks to buy compute for better-performing components
- Targeted precision spending, including full-rank q_proj only on the final tail block and fp16 tied embeddings
- Int6 MLP-only export to reclaim artifact budget for stronger tail capacity
- Short-to-full context warmup and staged fake-quant changes to improve training efficiency