val_bpb
1.1761
Architecture
Transformer
Optimizer
AdamW
Artifact Size
10MB
Training Techniques
Architecture
crawler bottleneck
Adds a fifth flat transformer layer on each side of the crawler bottleneck, changing the stack from 4F+1C+4F to 5F+1C+5F.
parameters: {"layers_per_side":5,"previous_layers_per_side":4,"bottleneck_layers":1}
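As an illustration of the stack change above, here is a minimal sketch that builds the layer layout from the listed parameters. The names `build_stack` and `describe` are hypothetical, as are the labels "F" (flat transformer layer) and "C" (crawler bottleneck layer), which only mirror the 4F+1C+4F notation. This is not the actual model code.

```python
from itertools import groupby
from typing import List


def build_stack(layers_per_side: int, bottleneck_layers: int = 1) -> List[str]:
    """Return one label per layer: 'F' for flat, 'C' for crawler bottleneck.

    Hypothetical helper: it models only the ordering of layers,
    not their weights or internals.
    """
    flat_side = ["F"] * layers_per_side
    return flat_side + ["C"] * bottleneck_layers + flat_side


def describe(stack: List[str]) -> str:
    """Compress runs of identical labels into the 'NF+NC+NF' notation."""
    return "+".join(f"{len(list(group))}{kind}" for kind, group in groupby(stack))


previous = build_stack(layers_per_side=4, bottleneck_layers=1)
current = build_stack(layers_per_side=5, bottleneck_layers=1)

print(describe(previous))  # 4F+1C+4F
print(describe(current))   # 5F+1C+5F
print(len(current) - len(previous))  # 2 extra layers in total
```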
shared TAP encoder connections
Shares the same TAP encoder connections across every crawler loop.
parameters: null
Evaluation
sliding window eval
parameters: null
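The record gives no parameters for the sliding-window evaluation, so the following is only a sketch of how such an evaluation is commonly set up. It assumes overlapping windows in which each token is scored exactly once, using the longest context available. The window size, the stride, and the `nll_fn` callable (which returns one negative log-likelihood in nats per token of a window) are all hypothetical stand-ins, not details from this run.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, score_from) spans covering n_tokens.

    Tokens in [score_from, end) are scored in that window; earlier tokens
    in the window serve only as context. Requires 0 < stride <= window so
    that no token is skipped.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


def bits_per_byte(
    tokens: Sequence[int],
    n_bytes: int,
    nll_fn: Callable[[Sequence[int]], List[float]],
    window: int,
    stride: int,
) -> float:
    """Sum per-token NLL (nats) over newly scored tokens, convert to bits per byte."""
    total_nats = 0.0
    for start, end, score_from in sliding_windows(len(tokens), window, stride):
        nll = nll_fn(tokens[start:end])  # assumed: one value per window token
        total_nats += sum(nll[score_from - start:])
    return total_nats / (math.log(2) * n_bytes)


def uniform_byte_model(window_tokens: Sequence[int]) -> List[float]:
    """Toy stand-in model: every byte is equally likely, so NLL = ln(256)."""
    return [math.log(256)] * len(window_tokens)


data = list(b"sliding window evaluation")
bpb = bits_per_byte(data, n_bytes=len(data), nll_fn=uniform_byte_model, window=8, stride=4)
print(round(bpb, 6))
```

With the toy uniform model each byte costs 8 bits, so the result should come out at roughly 8 bits per byte. A trained model would be substituted for `uniform_byte_model`.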
Novel Contributions
- Adds a fifth flat transformer layer on each side of the crawler bottleneck (4F+1C+4F to 5F+1C+5F)
- Shares TAP encoder connections across crawler loops
- Reports sliding-window validation BPB results