PR #432 (open)
Add non-record 1x5090 autoresearch submission with two-campaign analysis
by jadechip
val_bpb
1.5295
Architecture
GPT
Optimizer
—
Artifact Size
9,190,936 bytes
Training Techniques
Quantization
int6
bits: 6
scope: MLP-only export; model weights with targeted fp16 exceptions
QAT
bits: null
scope: attention
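A minimal sketch of the int6 fake-quantization step implied by the QAT entry above (symmetric per-tensor scaling and all names here are assumptions, not the submission's code):

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    # Symmetric per-tensor fake quantization: snap weights to a 6-bit grid
    # but keep them in floating point, as done during quantization-aware
    # training so gradients still flow through near-true weight values.
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(w.dtype)

w = np.random.randn(64, 64).astype(np.float32)
wq = fake_quant_int6(w)                     # at most 2**6 distinct levels
```

At export time the same grid would be stored as true int6 codes plus one scale, which is where the artifact-size savings come from.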
Architecture
depth recurrence
Reduced repeated/shared compute and shifted capacity into unique late tail blocks; explored compact carrier plus deeper unique tail.
parameters: null
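The compact-carrier-plus-unique-tail layout could be sketched as follows (block names and counts are hypothetical illustrations, not the run's actual configuration):

```python
def layer_plan(carrier_reps=8, tail_depth=4):
    # Hypothetical depth-recurrence plan: one shared "carrier" block applied
    # carrier_reps times (its weights are reused, so it counts once toward
    # the artifact size), followed by tail_depth unique late tail blocks.
    plan = ["carrier"] * carrier_reps + [f"tail_{i}" for i in range(tail_depth)]
    # Parameter-bearing blocks stored in the artifact: 1 carrier + the tails.
    unique_blocks = 1 + tail_depth
    return plan, unique_blocks

plan, unique = layer_plan()
```

This is the trade the description points at: effective depth stays high while stored capacity concentrates in the unique tail.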
q_proj
Used low-rank q_proj on most blocks, with full-rank q_proj restored only on the final tail block.
parameters: null
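A sketch of the low-rank q_proj parameter trade (sizes are illustrative, not the run's):

```python
import numpy as np

d_model, rank = 512, 64                  # illustrative sizes
# Full-rank q_proj stores a d_model x d_model weight matrix.
full_params = d_model * d_model
# Low-rank q_proj factors it as x @ A @ B, storing 2 * d_model * rank weights.
A = np.random.randn(d_model, rank).astype(np.float32) / np.sqrt(d_model)
B = np.random.randn(rank, d_model).astype(np.float32) / np.sqrt(rank)
low_rank_params = A.size + B.size

x = np.random.randn(2, d_model).astype(np.float32)
q = x @ A @ B                            # queries, same shape as full-rank
```

The parameters saved on most blocks are what "buys" the full-rank q_proj restored on the final tail block.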
tied embeddings
Kept tied embeddings in fp16 export on the final linear.
parameters: null
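Weight tying as used here can be sketched like this (a toy illustration under assumed sizes, showing why the table is stored once and can be held out of the int6 export in fp16):

```python
import numpy as np

vocab, d_model = 1000, 64                # illustrative sizes
E = np.random.randn(vocab, d_model).astype(np.float16)  # fp16 embedding table

def embed(token_ids):
    # Input embedding: a row lookup into the shared table.
    return E[token_ids]

def logits(h):
    # Tied unembedding: the final linear reuses E transposed, so only one
    # copy of the table lives in the artifact.
    return h @ E.T

h = embed(np.array([1, 2, 3]))
out = logits(h)
```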
Sequence Length
sequence_length
train_length: 960
eval_length: 1024
LR Schedule
short-to-full context warmup
parameters: null
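One plausible shape for the short-to-full context warmup, assuming a linear ramp up to the train length of 960 (the warmup length and step count are invented for illustration):

```python
def train_context_length(step, warmup_steps=1000, short_len=256, full_len=960):
    # Hypothetical schedule: ramp the training sequence length linearly
    # from short_len to full_len over warmup_steps, then hold at full_len.
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    return short_len + int(frac * (full_len - short_len))
```

Short early contexts make each step cheaper, so more optimizer updates fit in the same wall-clock budget before full-length training begins.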
Other
other
Shrank the global update shape from 4 x 30720 to 3 x 30720 and then 2 x 30720.
parameters: {"from":"4 x 30720","to":"2 x 30720"}
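Reading the global update shape as rows x tokens-per-row, the tokens consumed per optimizer update shrink as follows:

```python
tokens_per_row = 30720
# Tokens per optimizer update under each global shape from the log.
before = 4 * tokens_per_row   # 122,880 tokens/update
middle = 3 * tokens_per_row   #  92,160 tokens/update
after  = 2 * tokens_per_row   #  61,440 tokens/update
```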
other
Disabled attention fake quant during training.
parameters: null
other
Delayed MLP fake quant until the full-context boundary.
parameters: null
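The two staged fake-quant changes above amount to a simple gate; a hedged sketch (the function, threshold, and module names are assumptions for illustration):

```python
def fake_quant_active(module, step, full_context_step=1000):
    # Hypothetical gating: attention fake quant stays off for all of
    # training, while MLP fake quant only switches on once the schedule
    # reaches the full-context boundary.
    if module == "attention":
        return False
    if module == "mlp":
        return step >= full_context_step
    raise ValueError(f"unknown module: {module}")
```

Keeping attention in clean fp math is consistent with the export scope above, where only the MLPs are quantized to int6.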
Novel Contributions
- Git-native autoresearch loop that commits wins and reverts losers as search memory
- Two-campaign analysis showing a trajectory from 1.733958794 to 1.535119154 with a best numeric run of 1.528664372
- Use of low-rank q_proj on most blocks to buy compute for better-performing components
- Targeted precision spending, including full-rank q_proj only on the final tail block and fp16 tied embeddings
- Int6 MLP-only export to reclaim artifact budget for stronger tail capacity
- Short-to-full context warmup and staged fake-quant changes to improve training efficiency