4 votes

Big Bird: Transformers for Longer Sequences

Posted August 3, 2020 by skybrian

Tags: google, machine learning

https://arxiv.org/abs/2007.14062

Link information

This data is scraped automatically and may be incorrect.

Published: Jul 3 2020

1 comment

skybrian (OP)
August 3, 2020
From the abstract:

We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
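
To get a feel for why a constant number of global tokens keeps the cost down, here is a rough sketch of a sparse attention mask. It is not the paper's implementation: the function names, the single global token at position 0, and the fixed local window of 3 are all assumptions made for illustration. It only shows that the number of attended pairs grows roughly linearly with sequence length instead of quadratically.

```python
def sparse_attention_mask(seq_len, num_global=1, window=3):
    """Return mask[i][j] == True when token i may attend to token j.

    Hypothetical illustration only: `num_global` and `window` are
    assumed parameters, not values taken from the paper.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Global tokens (e.g. a CLS token at position 0) attend to
            # every position and are attended to by every position.
            is_global = i < num_global or j < num_global
            # Assumed local pattern: neighbours within `window` positions.
            is_local = abs(i - j) <= window
            mask[i][j] = is_global or is_local
    return mask


def count_attended(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for n in (64, 128, 256):
        sparse = count_attended(sparse_attention_mask(n))
        # Full attention would allow n * n pairs.
        print(n, sparse, n * n)
```

Doubling the sequence length roughly doubles the sparse count, while the full-attention count quadruples.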