00:00
Here we tackle backpropagation, the core algorithm behind how neural networks learn. After a quick recap of where we are, the first thing I'll do is an intuitive walkthrough of what the algorithm is actually doing, without any reference to the formulas. Then, for those of you who do want to dive into the math, the next video goes into the calculus underlying all of this. If you've watched the last two videos, or if you're just jumping in with the appropriate background, you know what a neural network is and how it feeds forward information. Here we're doing the classic example of recognizing handwritten digits, whose pixel values get fed into the first layer of the network with 784 neurons, and I've been showing a network with two hidden layers having just 16 neurons each, and an output layer of 10 neurons indicating which digit the network is choosing as its answer. I'm also expecting you to understand gradient descent, as described in the last video, and how what we mean by "learning" is that we want to find which weights and biases minimize a certain cost function. As a quick reminder, for the cost of a single training example, you take the output that the network gives, along with the output that you wanted it to give, and add up the squares of the differences between each component. Doing this
01:14
for all of your tens of thousands of training examples and averaging the results gives you the total cost of the network. And as if that's not enough to think about, as described in the last video, the thing we're looking for is the negative gradient of this cost function, which tells you how you need to change all of the weights and biases, all of these connections, so as to most efficiently decrease the cost. Backpropagation, the topic of this video, is an algorithm for computing that crazy complicated gradient. The one idea from the last video that I really want you to hold firmly in your mind is this: because thinking of the gradient vector as a direction in 13,000 dimensions is, to put it lightly, beyond the scope of our imaginations, there's another way you can think about it. The magnitude of each component here is telling you how sensitive the cost function is to each weight and bias.
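The cost computation just described can be sketched in a few lines of Python. This is a minimal sketch using plain lists of activations; the function names `example_cost` and `total_cost` are mine, not from the video:

```python
def example_cost(output, target):
    """Cost of one training example: the sum of squared differences
    between the network's output and the output we wanted."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

def total_cost(outputs, targets):
    """Total cost of the network: the per-example cost averaged
    over every training example."""
    costs = [example_cost(o, t) for o, t in zip(outputs, targets)]
    return sum(costs) / len(costs)
```

For instance, a poorly trained network outputting 0.5, 0.8, 0.2 when the one-hot target for a "2" is 0, 0, 1 incurs a per-example cost of 0.25 + 0.64 + 0.64 = 1.53.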
02:09
For example, let's say you go through the process I'm about to describe and you compute the negative gradient, and the component associated with the weight on this edge here comes out to be 3.2, while the component associated with this other edge comes out as 0.1. The way you would interpret that is that the cost function is 32 times more sensitive to changes in that first weight. So if you were to wiggle that value just a little bit, it's going to cause some change to the cost, and that change is 32 times greater than what the same wiggle to that second weight would give. Personally, when I was first learning about backpropagation, I think the most confusing aspect was just the notation and the index-chasing of it all. But once you unwrap what each part of this algorithm is really doing, each individual effect it's having is actually pretty intuitive; it's just that there are a lot of little adjustments getting layered on top of each other. So I'm going to start things off here with a complete disregard for the notation, and just step
03:10
through the effects that each training example has on the weights and biases. Because the cost function involves averaging a certain cost per example over all the tens of thousands of training examples, the way we adjust the weights and biases for a single gradient descent step also depends on every single example. Or rather, in principle it should, but for computational efficiency we're going to do a little trick later to keep you from needing to hit every single example for every single step. In any case, right now, all we're going to do is focus our attention on one single example: this image of a 2. What effect should this one training example have on how the weights and biases get adjusted? Let's say we're at a point where the network is not well trained yet, so the activations in the output are going to look pretty random, maybe something like 0.5, 0.8, 0.2, and so on. We can't directly change those activations; we only have influence on the weights and biases. But it is helpful to keep track of which adjustments we wish would take place in that output layer.
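Those wished-for output adjustments can be written down directly. Here is a hypothetical helper, assuming the target is a one-hot vector; the name `desired_output_nudges` is mine, not from the video:

```python
def desired_output_nudges(output, target):
    """For each output neuron, the nudge we wish for: positive where
    the activation should rise, negative where it should fall, with
    size proportional to how far the activation is from its target."""
    return [t - o for o, t in zip(output, target)]
```

For the outputs 0.5, 0.8, 0.2 against a one-hot "2" target, the nudges come out to -0.5, -0.8, +0.8: the biggest pushes go to the activations furthest from where they should be.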
04:10
And since we want it to classify this image as a 2, we want that third value to get nudged up while all of the others get nudged down. Moreover, the sizes of these nudges should be proportional to how far away each current value is from its target value. For example, the increase to that number-2 neuron's activation is, in a sense, more important than the decrease to the number-8 neuron, which is already pretty close to where it should be. So zooming in further, let's focus on just this one neuron, the one whose activation we wish to increase. Remember, that activation is defined as a certain weighted sum of all the activations in the previous layer, plus a bias, which is all then plugged into something like the sigmoid squishification function or a ReLU. So there are three different avenues that can team up together to help increase that activation: you can increase the bias, you can increase the weights, and you can change the activations
05:10
from the previous layer. Focusing just on how the weights should be adjusted, notice how the weights actually have differing levels of influence: the connections with the brightest neurons from the preceding layer have the biggest effect, since those weights are multiplied by larger activation values. So if you were to increase one of those weights, it has a stronger influence on the ultimate cost function than increasing the weights of connections with dimmer neurons, at least as far as this one training example is concerned. Remember, when we talk about gradient descent, we don't just care about whether each component should get nudged up or down; we care about which ones give you the most bang for your buck. This, by the way, is at least somewhat reminiscent of a theory in neuroscience for how biological networks of neurons learn, Hebbian theory, often summed up in the phrase "neurons that fire together wire together". Here, the biggest increases to weights, the biggest strengthening of connections, happens
06:10
between neurons that are the most active and the ones we wish would become more active. In a sense, the neurons firing while seeing a 2 get more strongly linked to those firing when thinking about a 2. To be clear, I really am not in a position to make statements one way or another about whether artificial networks of neurons behave anything like biological brains, and this "fires together wire together" idea comes with a couple of meaningful asterisks, but taken as a very loose analogy, I do find it interesting to note. Anyway, the third way we can help increase this neuron's activation is by changing all the activations in the previous layer: namely, if everything connected to that digit-2 neuron with a positive weight got brighter, and if everything connected with a negative weight got dimmer, then that digit-2 neuron would become more active. And, similar to the weight changes, you're going to get the most bang for your buck by seeking changes that are proportional to the size of the corresponding weights. Now, of course,
07:10
we cannot directly influence those activations; we only have control over the weights and biases. But just as with the last layer, it's helpful to keep a note of what those desired changes are. Keep in mind, zooming out one step here, this is only what that digit-2 output neuron wants. Remember, we also want all of the other neurons in the last layer to become less active, and each of those other output neurons has its own thoughts about what should happen to that second-to-last layer. So the desire of this digit-2 neuron is added together with the desires of all the other output neurons for what should happen to this second-to-last layer, again in proportion to the corresponding weights, and in proportion to how much each of those neurons needs to change. This, right here, is where the idea of propagating backwards comes in: by adding together all of these desired effects, you basically get a list of nudges that you want to happen to the second-to-last
08:10
layer. And once you have those, you can recursively apply the same process to the relevant weights and biases that determine those values, repeating the process I just walked through and moving backwards through the network. Zooming out a bit further, remember that this is all just how a single training example wishes to nudge each one of those weights and biases. If we only listened to what that 2 wanted, the network would ultimately be incentivized just to classify all images as a 2. So what you do is go through this same backprop routine for every other training example, recording how each of them would like to change the weights and biases, and you average together those desired changes. This collection of the averaged nudges to each weight and bias is, loosely speaking, the negative gradient of the cost function referenced in the last video, or at least something proportional
09:10
to it. I say "loosely speaking" only because I have yet to get quantitatively precise about those nudges. But if you understood every change I just referenced, why some are proportionally bigger than others, and how they all need to be added together, then you understand the mechanics of what backpropagation is actually doing. By the way, in practice it takes computers an extremely long time to add up the influence of every single training example on every single gradient descent step. So here's what's commonly done instead: you randomly shuffle your training data and then divide it into a whole bunch of mini-batches, let's say each one having 100 training examples. Then you compute a step according to the mini-batch. It's not going to be the actual gradient of the cost function, which depends on all of the training data, not this tiny subset, so it's not the most efficient step downhill; but each mini-batch does give you a pretty good approximation, and more importantly, it gives you a significant computational speedup. If you were to
10:10
plot the trajectory of your network over the relevant cost surface, it would be a little more like a drunk man stumbling aimlessly down a hill but taking quick steps, rather than a carefully calculating man determining the exact downhill direction of each step before taking a very slow and careful step in that direction. This technique is referred to as stochastic gradient descent. There's kind of a lot going on here, so let's just sum it up for ourselves, shall we? Backpropagation is the algorithm for determining how a single training example would like to nudge the weights and biases, not just in terms of whether they should go up or down, but in terms of what relative proportions to those changes cause the most rapid decrease to the cost. A true gradient descent step would involve doing this for all your tens of thousands of training examples and averaging the desired changes you get, but that's computationally slow, so instead you randomly subdivide the data into these mini-batches and compute each step with respect to a mini-batch.
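The shuffle-and-mini-batch procedure can be sketched like this. It is a minimal sketch, treating the parameters as a flat list of numbers; the names `sgd_epoch`, `grad_fn`, `batch_size`, and `lr` are assumptions for illustration, not from the video:

```python
import random

def sgd_epoch(training_data, grad_fn, params, batch_size=100, lr=0.1):
    """One epoch of stochastic gradient descent: shuffle the data,
    split it into mini-batches, and take one gradient step per batch.
    grad_fn(batch, params) should return one gradient entry per
    parameter, estimated from that batch alone."""
    random.shuffle(training_data)
    for i in range(0, len(training_data), batch_size):
        batch = training_data[i:i + batch_size]
        grad = grad_fn(batch, params)
        # Step downhill: move each parameter against its gradient component.
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Each step follows the approximate gradient from just one mini-batch, which is exactly why the trajectory looks like quick, slightly drunken steps downhill rather than one carefully computed descent.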
11:10
Repeatedly going through all of the mini-batches and making these adjustments, you will converge towards a local minimum of the cost function, which is to say your network is going to end up doing a really good job on the training examples. So with all of that said, every line of code that would go into implementing backprop actually corresponds with something you have now seen, at least in informal terms. But sometimes knowing what the math does is only half the battle, and just representing the damn thing is where it gets all muddled and confusing. So for those of you who do want to go deeper, the next video goes through the same ideas that were just presented here, but in terms of the underlying calculus, which should hopefully make it a little more familiar as you see the topic in other resources. Before that, one thing worth emphasizing is that for this algorithm to work, and this goes for all sorts of machine learning beyond just neural networks, you need a lot of training data. In our case, one thing that makes handwritten digits such a nice example is that there exists the MNIST database, with so many examples
12:10
that have been labeled by humans. So a common challenge that those of you working in machine learning will be familiar with is just getting the labeled training data you actually need, whether that's having people label tens of thousands of images, or whatever other data type you might be dealing with. And this actually transitions really nicely into today's extremely relevant sponsor, CrowdFlower, which is a software platform where data scientists and machine learning teams can create training data. They allow you to upload text, audio, or image data and have it annotated by real people. You may have heard of the "human in the loop" approach before, and this is essentially what we're talking about here: leveraging human intelligence to train machine intelligence. They employ a whole bunch of pretty smart quality-control mechanisms to keep the data clean and accurate, and they've helped to train, test, and tune thousands of data and AI projects. And what's most fun, there's actually a free t-shirt in it for you guys: if you go to 3b1b.co/crowdflower, or follow the link on screen and in the description, you can create a free account and run a project, and they'll send you a free shirt once you've done the job. The shirt's actually pretty cool, I quite like it. So thanks to CrowdFlower for supporting this video, and thank you also to everyone on Patreon helping support these videos.
13:32