If they become dictionary lookups, attention becomes some sort of shuffle or resampling.
What then, is the essence of attention?
If they become dictionary lookups, attention becomes some sort of shuffle or resampling.
What then, is the essence of attention?
For me, it's a 1D gaussian that is being shifted left and right
For me, it's a 1D gaussian that is being shifted left and right
for end-to-end tests: entangle as many things as possible
for end-to-end tests: entangle as many things as possible