Are Sixteen Heads Really Better than One? | Paul Michel · Omer Levy · Graham Neubig |
Compositional De-Attention Networks | Yi Tay · Anh Tuan Luu · Aston Zhang · Shuohang Wang · Siu Cheung Hui |
Geometry-Aware Neural Rendering | Joshua Tobin · Wojciech Zaremba · Pieter Abbeel |
Image Captioning: Transforming Objects into Words | Simao Herdade · Armin Kappeler · Kofi Boakye · Joao Soares |
Learning by Abstraction: The Neural State Machine | Drew Hudson · Christopher Manning |
Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time | Karlis Freivalds · Emīls Ozoliņš · Agris Šostaks |
Novel positional encodings to enable tree-based transformers | Vighnesh Shiv · Chris Quirk |
Self-attention with Functional Time Representation Learning | Da Xu · Chuanwei Ruan · Evren Korpeoglu · Sushant Kumar · Kannan Achan |
Understanding Attention and Generalization in Graph Neural Networks | Boris Knyazev · Graham W Taylor · Mohamed Amer |