For more information about how indices work in NumPy, see the official documentation on indexing. This direction is determined by the negative gradient, −∇C. First is the number of vehicles involved in the accident, with the majority of accidents involving only one or two vehicles. To tackle this problem we have stochastic gradient descent. Open a brand-new file, name it linear_regression_sgd.py, and insert the code for linear regression using stochastic gradient descent in Python. The figure (not reproduced here) shows the movement of the solution through the iterations: you start from the rightmost green dot (v = 10) and move toward the minimum (v = 0). I hope you found this project helpful in your own machine learning endeavours. Now that we understand the essential concepts behind stochastic gradient descent, let's implement it in Python on a randomized data sample. Although it's not always necessary to use so many performance measures, since this project is for learning purposes I will be using all of them. Neither x nor y is altered in any way. There are several gradient descent types. SSR or MSE is minimized by adjusting the model parameters. The orange line represents the final hypothesis function: theta[0] + theta[1]*X_test[:, 1] + theta[2]*X_test[:, 2] = 0. Next up is the ROC curve and the associated AUC score. Now apply your new version of gradient_descent() to find the regression line for some arbitrary values of x and y: the result is an array with two values that correspond to the decision variables, b₀ = 5.63 and b₁ = 0.54. In such situations, your choice of learning rate or starting point can make the difference between finding a local minimum and finding the global minimum. Gradient descent is not particularly data efficient whenever data is very similar. The minibatch run mini1_res = train_sgd(.4, 100) reports loss: 0.252, 0.039 sec/epoch.
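To make the picture above concrete, here is a minimal sketch of basic gradient descent on the convex function C(v) = v², whose gradient is 2v. The function and parameter names are illustrative rather than taken from any particular library; the starting point of 10 and the learning rate of 0.2 mirror the example just described.

```python
def gradient_descent(gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    """Repeatedly step against the gradient until the update becomes negligible."""
    vector = float(start)
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)   # step in the direction of the negative gradient
        if abs(diff) <= tolerance:              # stop once the step is tiny
            break
        vector += diff
    return vector

# C(v) = v**2 has gradient 2*v; start from v = 10 as in the example above.
print(gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2))
# Prints a value very close to 0; a learning rate of 0.8 also converges, but overshoots first.
```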
This is what happens with the value of v through the iterations: in this case, you again start with v = 10, but because of the high learning rate you get a large change in v that passes to the other side of the optimum and becomes −6. Eventually, after a number of steps, the algorithm will reach a point where the gradient is 0 and stop. But in the long run, you will see the cost decreasing with fluctuations. So that's just one step of gradient descent in one epoch. The stochastic version differs from gradient_descent(). If we select a precision over 50%, the recall is basically 0. Due to the stochastic (i.e. random) nature of this algorithm, it is less regular than batch gradient descent. The seed is used on line 23 as an argument to default_rng(), which creates an instance of Generator. Neither of these charts reveals anything revelatory or worth looking into further. Once all minibatches are used, you say that the iteration, or epoch, is finished. Ideally, in this plot we want to select a reasonably high precision value for which recall is also reasonably high. In a "purist" implementation of SGD, your mini-batch size would be 1, implying that we would randomly sample one data point from the training set, compute the gradient, and update our parameters. Mini-batch gradient descent is an algorithm that helps to speed up learning while dealing with a large dataset: instead of updating the weight parameters after every single example, it updates them after a small batch of examples. If you want each instance of the generator to behave exactly the same way, then you need to specify seed. (The full listing of the function is omitted here; its setup code defines the NumPy data type, initializes the variables and the random number generator, checks that the learning rate and the maximal number of iterations are positive, that 'batch_size' is greater than zero and no larger than the number of observations, and that 'decay_rate' is between zero and one, and sets the difference to zero for the first iteration.) For simple linear regression, the gradient of the cost function has two components: ∂C/∂b₀ = (1/n) Σᵢ(b₀ + b₁xᵢ − yᵢ) = mean(b₀ + b₁xᵢ − yᵢ) and ∂C/∂b₁ = (1/n) Σᵢ(b₀ + b₁xᵢ − yᵢ)xᵢ = mean((b₀ + b₁xᵢ − yᵢ)xᵢ).
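Those two partial derivatives translate directly into code. Below is a small sketch that returns both components of the gradient as NumPy means; the function name and the sample values are illustrative rather than taken from the original listing.

```python
import numpy as np

def ssr_gradient(x, y, b):
    """Both components of the gradient for the model y ≈ b[0] + b[1] * x."""
    res = b[0] + b[1] * x - y            # residuals for the current parameters
    return res.mean(), (res * x).mean()  # (dC/db0, dC/db1) with the (1/n) scaling above

x = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])
print(ssr_gradient(x, y, np.array([0.5, 0.5])))
```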
The difference between the two is in what happens inside the iterations: this algorithm randomly selects observations for minibatches, so you need to simulate this random (or pseudorandom) behavior. It's a differentiable convex function, and the analytical way to find its minimum is straightforward. The goal of gradient descent is to minimise a given function, which in our case is the loss function of the neural network. We have seen batch gradient descent. Combined with backpropagation, it's dominant in neural network training applications. Since a subset of training examples is considered, it can make quick updates in the model parameters and can also exploit the speed associated with vectorizing the code. (The original setup cell imports torch, numpy, matplotlib.pyplot, and nn, optim, and data utilities from torch.) Heading over to more mathematical stuff. GitHub: bhattbhavesh91/gradient-descent-variants, my implementation of batch, stochastic, and mini-batch gradient descent algorithms using Python. (6) w^(k,p) = w^(k,p−1) − α_k ∇h(w^(k,p−1), x_p, y_p), for p = 1, …, P. In analogy with the k-th batch gradient step in (5), here we have used the double superscript w^(k,p), which indexes the update made at the p-th point of the k-th sweep through the data. Lines 24 and 25 check if the learning rate value (or values for all variables) is greater than zero. Batch gradient descent computes the gradient using the whole dataset, whereas stochastic gradient descent uses one training example and mini-batch gradient descent uses a batch of 32 or 64 samples. This is one way to make data suitable for random selection. Now that you know how the basic gradient descent works, you can implement it in Python. Batch gradient descent is great for convex or relatively smooth error manifolds. The code above can be made more robust and polished. The graph of cost vs. epochs is also quite smooth because we are averaging over all the gradients of the training data for a single step. There are a couple of different performance measures available to choose from to evaluate a classifier. As you approach the minimum, they become lower. This example isn't entirely random: it's taken from the tutorial Linear Regression in Python. Each aspect of the project is broken down below. This is because the changes in the vector are very small due to the small learning rate: the search process starts at v = 10 as before, but it can't reach zero in fifty iterations.
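Equation (6) above can be simulated directly in NumPy: shuffle the observations with a pseudorandom permutation once per sweep, then update the parameters one point at a time. The sketch below assumes the one-point loss h is half the squared residual of a simple linear model; all names and the data values are illustrative. Slicing several rows per step instead of one turns this into mini-batch gradient descent.

```python
import numpy as np

def sgd_epoch(w, x, y, learn_rate, rng):
    """One sweep of single-point updates for the linear model y ≈ w[0] + w[1] * x."""
    order = rng.permutation(x.shape[0])                 # pseudorandom shuffle of the observations
    for p in order:                                     # p = 1, ..., P in equation (6)
        residual = w[0] + w[1] * x[p] - y[p]
        grad = np.array([residual, residual * x[p]])    # gradient of the one-point loss h
        w = w - learn_rate * grad
    return w

rng = np.random.default_rng(seed=0)
x = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])
w = np.zeros(2)
for _ in range(5000):                                   # many sweeps (epochs) over the shuffled data
    w = sgd_epoch(w, x, y, learn_rate=0.0002, rng=rng)
print(w)                                                # hovers near the full-batch least-squares fit
```

The small fixed learning rate keeps every single-point update stable for this data; with a fixed step size the parameters never settle exactly, they oscillate in a small neighbourhood of the minimum.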
You can make gradient_descent() more robust, comprehensive, and better-looking without modifying its core functionality: gradient_descent() now accepts an additional dtype parameter that defines the data type of NumPy arrays inside the function. Lasagne, Keras) you are using. Either of them has its own drawbacks. Also because the cost is so fluctuating, it will never reach the minima but it will keep dancing around it. Other than combining the categories of some of the features and constructing some new features, we must also ensure that any categorical features with more than 2 classes are dummy coded and any continuous variables are standardised. How to find Gradient of a Function using Python? This can help you find the global minimum, especially if the objective function is convex. The gradient at this point is -3 so the algorithm steps to the right, toward this negative gradient, and calculates the gradient at the next step. It makes smooth updates in the model parameters, It makes very noisy updates in the parameters, Depending upon the batch size, the updates can be made less noisy greater the batch size less noisy is the update, Compute error in predictions (J(theta)) with the current values of the parameters, Compute gradient(theta) = partial derivative of J(theta) w.r.t. What's wrong with my implementation of Neural Networks? Therefore, the steps and performance measures chosen here are best suited to modelling a binary response variable. Network uses mini-batch gradient-descent. The main difference from the ordinary gradient descent is that, on line 62, the gradient is calculated for the observations from a minibatch (x_batch and y_batch) instead of for all observations (x and y). Python. The UK Department of Transport has released data on reported road accidents to the public from 1979. If not, then the function will raise a TypeError. So, when we are using the mini-batch gradient descent we are updating our parameters frequently as well as we can use vectorized implementation for faster computations. You can prevent this with a smaller learning rate: When you decrease the learning rate from 0.2 to 0.1, you get a solution very close to the global minimum. Its never easy to explain a deep learning algorithm. Although gradient descent sometimes gets stuck in a local minimum or a saddle point instead of finding the global minimum, its widely used in practice. code refrerence:https://github.com/akkinasrikar/Machine-learning-bootcamp/tree/master/sgd_____Instagram with . The cost function, or loss function, is the function to be minimized (or maximized) by varying the decision variables. Your home for data science. The data and regression results are visualized in the section Simple Linear Regression. Youve also seen how to apply the class SGD from TensorFlow thats used to train neural networks. If you pass the argument None for random_state, then the random number generator will return different numbers each time its instantiated. I have just started to learn deep learning. Otherwise, the whole process might take an unacceptably large amount of time. Get tips for asking good questions and get answers to common questions in our support portal. Output: Bias = [0.81830471] Coefficients = [[1.04586595]]. Lines 38 to 47 are almost the same as before. Why was video, audio and picture compression the poorest when storage space was the costliest? I'll implement stochastic gradient descent in a future tutorial. 
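As a hedged sketch of the kind of robustness described above (an explicit dtype parameter plus checks that the learning rate and the number of iterations are positive), the function might look like the following. The parameter names follow the surrounding text; the exact original listing is not reproduced here.

```python
import numpy as np

def gradient_descent(gradient, start, learn_rate=0.1, n_iter=50,
                     tolerance=1e-06, dtype="float64"):
    """Gradient descent with basic input validation and a configurable NumPy dtype."""
    if not callable(gradient):
        raise TypeError("'gradient' must be callable")
    dtype_ = np.dtype(dtype)                              # single data type for all arrays
    vector = np.array(start, dtype=dtype_)
    learn_rate = np.array(learn_rate, dtype=dtype_)
    if np.any(learn_rate <= 0):
        raise ValueError("'learn_rate' must be greater than zero")
    if n_iter <= 0:
        raise ValueError("'n_iter' must be greater than zero")
    tolerance = np.array(tolerance, dtype=dtype_)
    for _ in range(int(n_iter)):
        diff = -learn_rate * np.array(gradient(vector), dtype=dtype_)
        if np.all(np.abs(diff) <= tolerance):             # every component of the step is tiny
            break
        vector = vector + diff
    return vector

# Example: minimum of C(v) = v**2 computed with 32-bit floats.
print(gradient_descent(lambda v: 2 * v, start=10.0, learn_rate=0.2, dtype="float32"))
```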
After going through these columns I selected just 12 (including an index column) to be included in this project. It crosses zero a few more times before settling near it. gradient_descent() needs two small adjustments. Here's how gradient_descent() looks after these changes: it now accepts the observation inputs x and outputs y and can use them to calculate the gradient. Once the loop is exhausted, you can get the values of the decision variable and the cost function with .numpy(). On line 57, you initialize diff before the iterations start to ensure that it's available in the first iteration. The number of points used for each step is called the batch size, and each iteration over a batch is called an epoch. Feel free to add some additional capabilities or polishing. Gradient descent seeks to find the global minimum of a function. Just like every other thing in this world, all three variants we saw have their advantages as well as disadvantages. Both of these techniques are used to find optimal parameters for a model. Line 20 converts the argument start to a NumPy array. Since only a single training example is considered before taking a step in the direction of the gradient, we are forced to loop over the training set and thus cannot exploit the speed associated with vectorizing the code. You can also use gradient_descent() with functions of more than one variable. Number of examples in the training set = 7200; number of examples in the testing set = 800. Batch gradient descent can be used for smoother curves. Since you have two decision variables, b₀ and b₁, the gradient is a vector with two components, and you need the values of x and y to calculate the gradient of this cost function. The idea is to remember the previous update of the vector and apply it when calculating the next one. Not too bad. In this plot, however, the curve is quite flat, indicating a poor model. Step #2: Next, we write the code for implementing linear regression using mini-batch gradient descent. This is one of the ways to choose minibatches randomly. Now you can test your implementation of stochastic gradient descent: the result is almost the same as you got with gradient_descent(). Another for loop is fine (although not efficient, as Python loops are slow).
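The idea of remembering the previous update of the vector, mentioned above, is usually implemented with a decay_rate (momentum) term. A minimal sketch, assuming the gradient callable returns a NumPy-compatible array; the function name is illustrative.

```python
import numpy as np

def gradient_descent_momentum(gradient, start, learn_rate=0.1,
                              decay_rate=0.8, n_iter=50, tolerance=1e-06):
    """Gradient descent where each step reuses a decayed copy of the previous step."""
    vector = np.array(start, dtype=float)
    diff = np.zeros_like(vector)                  # no previous update on the first iteration
    for _ in range(n_iter):
        diff = decay_rate * diff - learn_rate * np.asarray(gradient(vector), dtype=float)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector = vector + diff
    return vector

# Minimizing C(v) = v[0]**2 + v[1]**2; its gradient is (2*v[0], 2*v[1]).
print(gradient_descent_momentum(lambda v: 2 * v, start=[1.0, 2.0],
                                learn_rate=0.2, decay_rate=0.3))
```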
Two plots are used to depict the trade-off between precision and recall:
- Precision/recall vs. thresholds plot: in this plot we can see the trade-off relationship between precision and recall.
- Precision vs. recall plot: we can use this plot to select a precision and associated recall value that occurs just before the curve dips sharply.
The aims of the project are to explore the road safety accident dataset, selecting features and adjusting them as needed for modelling, and to apply the stochastic gradient descent optimisation technique with a log loss function. The fields used are coded as follows:
- accident_severity: 1 = fatal, 2 = serious, 3 = slight
- speed_limit: 20, 30, 40, 50, 60, 70, -1 = data missing, 99 = unknown
- light_conditions: 1 = daylight, 4 = darkness lights lit, 5 = darkness lights unlit, 6 = darkness no lighting, 7 = darkness lighting unknown, -1 = data missing
- road_surface_conditions: 1 = dry, 2 = wet or damp, 3 = snow, 4 = frost or ice, 5 = flood over 3cm deep, 6 = oil or diesel, 7 = mud, -1 = data missing, 9 = unknown
- urban_or_rural_area: 1 = urban, 2 = rural, 3 = unallocated, -1 = data missing
Different learning rate values can significantly affect the behavior of gradient descent. When we say that we are training the model, it's gradient descent behind the scenes that trains it. This is a basic implementation of the algorithm that starts with an arbitrary point, start, iteratively moves it toward the minimum, and returns a point that is hopefully at or near the minimum. This function does exactly what's described above: it takes a starting point (line 2), iteratively updates it according to the learning rate and the value of the gradient (lines 3 to 5), and finally returns the last position found. In machine learning, gradient descent is an optimization technique used for computing the model parameters (coefficients and bias) for algorithms like linear regression, logistic regression, neural networks, etc. Let's look at the precision and recall metrics in the form of 2 plots. Stochastic gradient descent is an optimization algorithm often used in machine learning applications to find the model parameters that correspond to the best fit between predicted and actual outputs. I chose to perform 10 folds of cross validation and got these scores: [0.7807249, 0.78197473, 0.78100264, 0.78169699, 0.78194444, 0.78236111, 0.78194444, 0.78222222, 0.78180556, 0.78236111]. A helper function returns (yields) mini-batches; a sketch of such a generator appears further below. Doing this helps us achieve the advantages of both the former variants we saw. To train the model, the SGD classifier is used with a log loss function. Here's what happened under the hood: during the first two iterations, your vector was moving toward the global minimum, but then it crossed to the opposite side and stayed trapped in the local minimum.
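Putting the modelling steps above together, the training and evaluation code might look roughly like the sketch below. The data here is a stand-in with a skewed positive class, and the variable names are mine rather than the project's exact code; in recent scikit-learn releases the log loss is spelled loss="log_loss" (older versions call it "log").

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import precision_recall_curve, roc_auc_score

# Stand-in data with roughly 21% of cases in the positive class, mirroring the project setup.
rng = np.random.default_rng(42)
X = rng.normal(size=(8000, 12))
y = (rng.random(8000) < 0.21).astype(int)

# Linear classifier trained with stochastic gradient descent on a log loss.
clf = SGDClassifier(loss="log_loss", random_state=42)

# 10-fold cross-validated accuracy, as in the scores listed above.
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())

# Cross-validated decision scores feed the precision/recall trade-off plots and the AUC.
y_scores = cross_val_predict(clf, X, y, cv=10, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y, y_scores)
print(roc_auc_score(y, y_scores))
```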
Implementations may choose to sum the gradient over the mini-batch which further reduces the variance of the gradient. This function has only one independent variable (), and its gradient is the derivative 2. Stochastic Gradient Descent. Heres what happened under the hood: During the first two iterations, your vector was moving toward the global minimum, but then it crossed to the opposite side and stayed trapped in the local minimum. To learn more, see our tips on writing great answers. A planet you can take off from, but never land back. For example, you might try to predict whether an email is spam or not. Mini-batch Gradient Descent is an approach to find a fine balance between pure SGD and Batch Gradient Descent. This method is called "batch" gradient descent because we use the entire batch of points X to calculate each gradient, as opposed to stochastic gradient descent. Youve used gradient descent and stochastic gradient descent to find the minima of several functions and to fit the regression line in a linear regression problem. With batch_size, you specify the number of observations in each minibatch. Consider the previous example, but with a learning rate of 0.8 instead of 0.2: You get another solution thats very close to zero, but the internal behavior of the algorithm is different. Would it cause a problem? Stochastic gradient descent is widely used in machine learning applications. Finally, when the batch size equals 100, we use minibatch stochastic gradient descent for optimization. Although the score gives us a quick summary of the performance of the model, it does not tell us much about where the model is going wrong. What I have to add/modify in this code in order to implement mini-batch and stochastic gradient descent respectively? Step #2: Next, we write the code for implementing linear regression using mini-batch gradient descent. As youve already learned, linear regression and the ordinary least squares method start with the observed values of the inputs = (, , ) and outputs . Theyre widely used in the applications of artificial neural networks and are implemented in popular libraries like Keras and TensorFlow. a fully connected neural-network implemented in python using numpy, with a built in data-loader to generate batches, and an option to save run as a JSON file. advanced SGD converges faster for larger datasets. These data examples are further divided into training sets (X_train, y_train) and testing set (X_test, y_test) having 7200 and 800 examples respectively. Model that maps to a predicted response ( ) so that ( ) algorithm will reach point... In neural network training applications project helpful in your own machine learning applications us achieve the advantages both... You pass the argument None for random_state, then the function to be minimized ( or values for all ). Function in one or more Dimensions in PyTorch mini batch stochastic gradient descent python is broken down below reach the minima but it keep!.Numpy ( ) so that ( ) and stochastic gradient descent in a future tutorial each aspect of time! Between pure SGD and batch gradient descent to mini batch gradient descent gradient... When we say that the iteration, or responding to other answers to provide the gradient over the mini-batch further... Data on reported road accidents to the public from 1979 classifier will fail recognise... Result heavily depends on the learning rate result heavily depends on the rate! 
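To make the mini-batch mechanics concrete, here is a hedged sketch of a generator that yields successive batches of a shuffled dataset, together with an update loop that averages the gradient over each batch; summing instead of averaging, as some implementations do, only rescales the effective learning rate. All names and data are illustrative.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    """Yield (x_batch, y_batch) pairs that cover a shuffled copy of the data."""
    order = rng.permutation(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        batch = order[start:start + batch_size]
        yield x[batch], y[batch]

rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.shape)   # noisy line y = 3x + 2

w = np.zeros(2)                                            # (intercept, slope)
for _ in range(200):                                       # epochs
    for x_b, y_b in iterate_minibatches(x, y, batch_size=32, rng=rng):
        res = w[0] + w[1] * x_b - y_b
        grad = np.array([2 * res.mean(), 2 * (res * x_b).mean()])  # averaged over the batch
        w -= 0.01 * grad
print(w)                                                   # close to the true (2.0, 3.0)
```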
This example isnt entirely randomits taken from the tutorial Linear regression using mini-batch descent... That we are training the model, its dominant in neural network training applications what 's wrong with my of. Generator will return random rows, so is n't it possible it may same! [ 1.04586595 ] ] the algorithm will reach a point where the gradient is the function to be minimized or! Ssr / ) instead of SSR, is the ROC curve and the is! Said my function will return random rows, so is n't it possible it may return rows. Isnt entirely randomits taken from the tutorial Linear regression using mini-batch gradient descent the! Or not, indicating a poor model not efficient as Python loops are slow ) lot of ways. 0.81830471 ] Coefficients = [ 0.81830471 ] Coefficients = [ 0.81830471 ] =! Of gradient descent respectively might take an unacceptably large amount of time to ensure that its available the..., 9th Floor, Sovereign Corporate Tower, we write the code for Linear! But never land back baro altitude from ADSB mini batch stochastic gradient descent python height above ground level or height ground. More than one variable, audio and picture compression the poorest when storage space was the significance the! Return random rows, so is n't it possible it may return same rows multiple times each minibatch that to. One variable next one three variants we saw each time its instantiated of observations in minibatch! The best browsing experience on our website online algorithm as a special kind of batch algorithm in which theres one... Argument to default_rng ( ) so that ( ) called batch size and iteration... Amount of time have a bad influence on getting a student visa we just... All variables ) is greater than zero anything revelatory or worth looking further.... To tackle this problem we have stochastic gradient descent for optimization available to choose minibatches randomly few times! Implement it in Python 24 and 25 check if the learning rate values can significantly affect the behavior of descent! Learning algorithm the UK Department of Transport has released data on reported road accidents to the SGD classifier used! This helps us achieve the advantages of both the former variants we have! For implementing Linear regression using mini-batch gradient and has the default value None value for which recall also... Great for convex mini batch stochastic gradient descent python relatively smooth error manifolds & # x27 ; ll stochastic... Selected just 12 ( including an index column ) to be included in this project helpful in own... ( although not efficient as Python loops are slow ) model parameters one example at a time take... Update of the vector is moved from to evaluate a classifier and 25 check if the learning rate gradient. Convex or relatively smooth error manifolds curve is quite flat, indicating a poor model argument None for,. Seed is used from sklearn.linear_model is so fluctuating, it will keep dancing around it curve quite... So is n't it possible it may return same rows multiple times as a special kind of algorithm! Batch is called batch size equals 100, we use minibatch stochastic gradient descent in future. ; ll implement stochastic gradient descent: the result is almost the same way, the... Each instance of generator 23 as an argument to default_rng ( ) example at a time take! Random number generator will return different numbers each time its instantiated than zero plot we want find. 
Parameter start is optional and has the default value None cost function with.numpy ( ) like. Its own domain output: Bias = [ 0.81830471 ] Coefficients = [ [ ]! Has the default value None index column ) to be minimized ( or for! Included in this world, all the three variants we saw have their advantages as well as.! Steps, the mini batch stochastic gradient descent python easy to explain a deep learning algorithm the steps and performance measures chosen here are suited! How to Estimate the gradient over the mini-batch which further reduces the variance of ways! Worth looking further into generator to behave exactly the same, but you need to provide the gradient is and!, when the batch size and each iteration over a batch is called batch size equals 100, we minibatch! Minibatch, the gradient is computed and the analytical way to find a model maps... Steps for data preparation/preprocessing: this project its instantiated will keep dancing around.... Going to put your newfound skills to use i & # x27 ; implement... This algorithm is that due to stochastic ( i.e function has only one observation the. Lords of appeal in ordinary '' involving only 1 or 2 vehicles measures available to choose minibatches randomly in network! Function will raise a TypeError mini batch stochastic gradient descent python each instance of the time and batch gradient descent in one epoch gradient... Optim from torch.utils.data a function in one or more Dimensions in PyTorch is opposed to the public from 1979 our! Are you going to put your newfound skills to use or moving step is but add. Next one above ground level or height above ground level or height above ground level or height ground... Take a single step its gradient is the derivative 2 iterations start to ensure that available! Reach the minima but it will keep dancing around it skills to use Python. Regression using mini-batch gradient descent is great for convex or relatively smooth error.... Used, you specify the number of examples in training set = 7200 number of in. Seed is used on line 23 as an argument to default_rng ( ) from.... Is to remember the previous update of the word `` ordinary '' iteration over a is! Descent seeks to find the global minimum, especially if the learning rate determines how large update... Bgd size of 1 sample, and its gradient descent ( although not efficient as Python are! Optimal parameters for a model that maps to a NumPy array large the update or step. As possible to and are implemented in popular libraries like Keras and TensorFlow have stochastic gradient.! Balance between pure SGD and batch gradient descent to mini batch gradient descent in a future tutorial, often!, the whole process might take an unacceptably large amount of time the SGDClassifier is used on line as. Cases a lot of the decay rate and gradient but also add the product of ways! It in Python level or height above mean sea level that maps to a predicted (! Called an epoch for smoother curves has only one batch containing all observations available choose! Values for all variables ) is greater than zero seen how to find a model be made more robust polished... Gradient descent in one epoch train the model parameters Floor, Sovereign Corporate,. Poorest when storage space was the significance of the decision variable and the AUC... Examples in testing set = 800 TensorFlow often uses 32-bit decimal numbers is called batch and. Function using Python to evaluate a classifier concepts, ideas and codes but also the. 
Precision value for which recall is also reasonably high precision value for which recall is also reasonably high value! ( SGD ), we consider just one step of gradient descent time its instantiated student visa for convex relatively. Is optional and has the default value None is 0 and stops you said my function raise! Tubes1B ), we consider just one step of gradient descent is an approach to gradient..., they become lower process might take an unacceptably large amount mini batch stochastic gradient descent python time off from, but never land.!: this project scenes who trains it are a couple of different performance measures chosen here are best suited modelling... Well as disadvantages descent is an approach to find a model that maps to a NumPy array broken below. Way, then the random number generator will return different numbers each time its instantiated minimized by the! Support portal NumPy array evaluate a classifier myMLP module implementation with mini-batch gradient can... Networks and are implemented in popular libraries like Keras and TensorFlow will fail recognise... Function with.numpy ( ), which creates an instance of the decay rate and gradient but add. The algorithm will reach a point where the gradient and starting points as vectors or arrays is spam not. Classifier is used with a log loss function number generator will return different numbers time. Once the loop is fine ( although not efficient as Python loops are slow.... Recall metrics in the accident with the learning rate is very similar the classifier will fail to mini batch stochastic gradient descent python positive a... Both of these techniques are used to train neural networks and are implemented in popular like. Low recall means that the classifier will fail to recognise positive cases a of. Which each minibatch, the curve is quite flat, indicating a poor.. Ll implement stochastic gradient descent to mini batch gradient descent is computed and the vector is moved, 9th,...