Looking back, what is the most beautiful or surprising idea in deep learning, or AI in general, that you've come across? You've seen this field explode and grow in interesting ways. What cool ideas made you sit back and go, hmm, big or small?

The one that I've been thinking about recently the most is probably the Transformer architecture. Basically, neural networks have had a lot of architectures that were trendy and have come and gone for different sensory modalities, like for vision, audio, text: you would process them with different-looking neural nets. And recently we've seen this convergence towards one architecture, the Transformer. You can feed it video, or you can feed it images or speech or text, and it just gobbles it up. It's kind of like a bit of a general-purpose computer that is also trainable and very efficient to run on our hardware. This paper came out in 2016, I want to say: "Attention Is All You Need."

"Attention Is All You Need." You criticized the paper title in retrospect, that it didn't foresee the bigness of the impact it was going to have.

Yeah, I'm not sure if the authors were aware of the impact that that paper would go on to have. Probably they weren't. But I think they were aware of some of the motivations and design decisions behind the Transformer, and they chose not to expand on them in that way in the paper. So I think they had an idea that there was more than just the surface of "oh, we're just doing translation, and here's a better architecture." You're not just doing translation; this is a really cool, differentiable, optimizable, efficient computer that you've proposed. Maybe they didn't have all of that foresight, but I think it's really interesting.

Isn't it funny, sorry to interrupt, that the title is memeable? They went for such a profound idea, and they went with... I don't think anyone used that kind of title before, right? "Attention is all you need."

Yeah, it's like a meme or something, basically.

Isn't that funny? Like, maybe if it was a more serious title, it wouldn't have had the impact.

Honestly, yeah, there is an element of me that agrees with you and prefers it this way.

Yes, if it was too grand, it would overpromise and then underdeliver, potentially. So you want to just meme your way to greatness.

That should be a T-shirt.

So you tweeted, "The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously expressive (in the forward pass), optimizable (via backpropagation and gradient descent), and efficient (high-parallelism compute graph)." Can you discuss some of those details, expressive, optimizable, efficient, from memory or in general, whatever comes to your heart?

You want to have a general-purpose computer that you can train on arbitrary problems, like, say, the task of next-word prediction, or detecting if there's a cat in an image, or something like that. And you want to train this computer, so you want to set its weights. I think there's a number of design criteria that sort of overlap in the Transformer simultaneously that made it very successful, and I think the authors were deliberately trying to make this really powerful architecture.

Basically, it's very powerful in the forward pass because it's able to express very general computation as something that looks like message passing. You have nodes, and they all store vectors, and these nodes get to basically look at each other's vectors and communicate. Nodes get to broadcast, "hey, I'm looking for certain things," and then other nodes get to broadcast, "hey, these are the things I have": those are the keys and the values.
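As a rough illustration of the query/key/value exchange being described here, a minimal single-head self-attention sketch in NumPy; the names, shapes, and toy usage are illustrative and this is not the full multi-head formulation from the paper:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (T, C) sequence of T node vectors; Wq/Wk/Wv: (C, C) projection matrices."""
    q = x @ Wq  # what each node is looking for (queries)
    k = x @ Wk  # what each node advertises (keys)
    v = x @ Wv  # what each node hands over if attended to (values)
    scores = q @ k.T / np.sqrt(k.shape[-1])  # how interesting node j looks to node i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over nodes
    return weights @ v  # each node updates itself with a weighted mix of values

# Toy usage: 4 nodes, each storing an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)  # (4, 8): every node has "listened" to the others
```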
So it's not just the attention.

Yeah, exactly. The Transformer is much more than just the attention component. It's got many architectural pieces that went into it: the residual connections, the way it's arranged, there's a multi-layer perceptron in there, the way it's stacked, and so on. But basically there's a message passing scheme where nodes get to look at each other, decide what's interesting, and then update each other. So when you get into the details of it, I think it's a very expressive function, so it can express lots of different types of algorithms in the forward pass.

Not only that, but the way it's designed, with the residual connections, layer normalizations, the softmax attention and everything, it's also optimizable. This is a really big deal, because there are lots of computers that are powerful but that you can't optimize, or they're not easy to optimize, using the techniques that we have, which is backpropagation and gradient descent. These are first-order methods, very simple optimizers, really. So you also need it to be optimizable.

And then lastly, you want it to run efficiently on our hardware. Our hardware is a massive throughput machine. GPUs prefer lots of parallelism, so you don't want to do lots of sequential operations; you want to do a lot of operations in parallel, and the Transformer is designed with that in mind as well. So it's designed for our hardware, and it's designed to be both very expressive in the forward pass and also very optimizable in the backward pass.

And you said that the residual connections support a kind of ability to learn short algorithms fast first, and then gradually extend them longer during training. What's the idea of learning short algorithms?

Right, think of it this way. Basically, a Transformer is a series of blocks, and these blocks have attention and a little multi-layer perceptron. So you go off into a block and you come back to this residual pathway, and then you go off and you come back, and then you have a number of layers arranged sequentially. The way to look at it, I think, is that because of the residual pathway, in the backward pass the gradients sort of flow along it uninterrupted, because addition distributes the gradient equally to all of its branches. So the gradient from the supervision at the top just flows directly to the first layer, and all these residual connections are arranged so that in the beginning, during initialization, they contribute nothing to the residual pathway.
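A rough PyTorch sketch of one such block, assuming the now-common pre-norm arrangement, just to make the "go off into a branch and come back to the residual pathway" picture concrete; the module choices and sizes (nn.MultiheadAttention, a 4x MLP) are illustrative, not any particular published implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block: an attention branch and an MLP branch,
    each added back onto the residual pathway."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)       # go off into the attention branch...
        x = x + a                       # ...and come back to the residual pathway
        x = x + self.mlp(self.ln2(x))   # same pattern for the MLP branch
        return x

x = torch.randn(1, 10, 512)   # (batch, tokens, channels)
y = Block()(x)                # same shape out: the block "edits" the residual stream
```

Because the output is x plus whatever the branches compute, backpropagation sends the top-level gradient through the additive skip connection unchanged, which is the uninterrupted flow being described. (Initializing the branches so they contribute almost nothing at the start, for example by scaling down each branch's final projection, is a common trick, not something shown in this sketch.)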
So what it kind of looks like is this: imagine the Transformer is kind of like a Python function, like a def, and you get to do various lines of code. You have a hundred-layer-deep Transformer; typically they would be much shorter, say 20. So you have 20 lines of code, and you can do something in them. Think of it during the optimization: basically, what it looks like is that first you optimize the first line of code, and then the second line of code can kick in, and then the third line of code can. I kind of feel like, because of the residual pathway and the dynamics of the optimization, you can sort of learn a very short algorithm that gets the approximate answer, and then the other layers can kick in and start to create a contribution. At the end of it, you're optimizing over an algorithm that is 20 lines of code, except these lines of code are very complex, because each one is an entire block of a Transformer: you can do a lot in there.

What's really interesting is that this Transformer architecture has actually been remarkably resilient. Basically, the Transformer that came out in 2016 is the Transformer you would use today, except you reshuffle some of the layer norms; the layer normalizations have been reshuffled to a pre-norm formulation. So it's been remarkably stable, but there's a lot of bells and whistles that people have attached to it and tried to improve it with. I do think that it's a big step in simultaneously optimizing for lots of properties of a desirable neural network architecture, and I think people have been trying to change it, but it's proven remarkably resilient. I do think that there should be even better architectures, potentially.

But you admire the resilience here. There's something profound about this architecture, that at least... maybe everything could be turned into a problem that Transformers can solve.

Currently, it definitely looks like the Transformer is taking over AI, and you can feed basically arbitrary problems into it. It's a general differentiable computer, and it's extremely powerful. And this convergence in AI has been really interesting to watch, for me personally.

What else do you think could be discovered here about Transformers? Like a surprising thing, or are we in a stable place? Is there something interesting we might discover about Transformers, like aha moments? Maybe it has to do with memory, maybe knowledge representation, that kind of stuff.

Definitely, the zeitgeist today is just pushing. Basically, right now the zeitgeist is: do not touch the Transformer, touch everything else. So people are scaling up the datasets, making them much, much bigger; they're working on the evaluation, making the evaluation much, much bigger; and they're basically keeping the architecture unchanged. And that's the last five years of progress in AI, kind of.
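For reference, the reshuffled layer norms mentioned above amount to moving layer normalization from after each residual addition (the post-norm arrangement in the original paper) to before each sublayer (pre-norm, as in the block sketch earlier). A schematic comparison of the two orderings, with placeholder callables standing in for the real sublayers; this shows only the ordering, not a full implementation:

```python
def post_norm_block(x, attn, mlp, ln1, ln2):
    # Original 2017 arrangement: normalize after each residual addition,
    # so the layer norms sit on the residual pathway itself.
    x = ln1(x + attn(x))
    x = ln2(x + mlp(x))
    return x

def pre_norm_block(x, attn, mlp, ln1, ln2):
    # The "reshuffled" pre-norm arrangement most Transformers use today:
    # normalize inside each branch, keep the residual pathway clean.
    x = x + attn(ln1(x))
    x = x + mlp(ln2(x))
    return x
```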