Monday, April 20, 2020

Overfitting And Underfitting:

        Overfitting and underfitting are fundamental problems in building a model. Before we get to overfitting and underfitting, we need to know about:

  • Training Error (error on training data)
  • Validation Error (error on test data)
  • Decision Surface
  • Accuracy
  • Error

    
Accuracy: the number of correct predictions made by your model divided by the total number of predictions.

                    Accuracy = number of correct predictions / total number of predictions

Error: the number of false predictions made by our model divided by the total number of predictions.

                    Error = number of false predictions / total number of predictions
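The two formulas above can be checked with a tiny sketch (the labels here are made up purely for illustration):

```python
# Toy sketch: accuracy and error from a list of predictions (labels are illustrative).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count how many predictions match the true labels.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
error = 1 - accuracy  # equivalently: false predictions / total predictions

print(accuracy)  # 0.75  (6 of 8 correct)
print(error)     # 0.25
```

Note that accuracy and error always sum to 1, since every prediction is either correct or false.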

         As we know, whenever we build a machine learning model we divide the dataset into two parts: one is the training data and the other is the testing data.

Training Error: take the training data and train the model on it, then test the model using that same training data and find the accuracy. Once we know the accuracy, we can easily find the error; that error is called the training error.

Validation Error: take the training data and train the model on it, then test the model using the test data and find the accuracy. Once we know the accuracy, we can easily find the error; that error is called the validation error.
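The two errors above can be sketched with a deliberately simple model. This is a minimal illustration, assuming a hypothetical "majority class" model (it just predicts the most common training label), so that the train/test split and the two error measurements stand out:

```python
# Hypothetical training and test sets: (sample id, label) pairs.
train = [("a", 1), ("b", 1), ("c", 0), ("d", 1)]
test  = [("e", 1), ("f", 0)]

# "Train": learn the majority class from the training labels.
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)

def evaluate(data):
    # Error = false predictions / total predictions.
    wrong = sum(1 for _, y in data if majority != y)
    return wrong / len(data)

training_error = evaluate(train)    # tested on the same data the model was fit on
validation_error = evaluate(test)   # tested on unseen data

print(training_error)    # 0.25
print(validation_error)  # 0.5
```

The same pattern applies to any model: fit on the training split, then measure error once on the training split (training error) and once on the test split (validation error).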

Decision Surface or Curve:
For example:
I have a dataset with two class labels, one negative and the other positive. We need to separate those class labels using a decision surface or curve.

For a better understanding of underfitting and overfitting, we can use k-nearest neighbors.

k-nearest neighbors:

Consider a simple 2D dataset. Imagine we have two types of points: one positive (+ve) and the other negative (-ve). Given a query point (xq), we need to find the k points nearest to the query point.

For example, consider the k value to be 5. If k = 5, we need to find the five points nearest to the query point. Those five points form the nearest neighborhood. Now a doubt arises: how do we choose the points which make up the neighborhood for the query point?

And the solution to this question is:
We can find those points by considering the distance between each neighborhood point and the query point. There are four common types of distances:

  • Euclidean
  • Manhattan
  • Minkowski
  • Hamming
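The four distances above can be sketched in a few lines of Python (the example points are hypothetical):

```python
import math

def euclidean(p, q):
    # Straight-line distance: sqrt of summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute differences along each axis.
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=3):
    # Generalizes Euclidean (r = 2) and Manhattan (r = 1).
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def hamming(p, q):
    # Number of positions at which the values differ (for categorical/binary data).
    return sum(a != b for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```

For k-NN on numeric 2D data like our example, Euclidean distance is the usual choice; Hamming distance is used when the features are categorical.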

As we know, our main aim is to identify the class label for the query point. So, after finding the nearest neighbors of the query point, we need to find the class label assigned to the majority of those points, and that class label is then assigned to the query point.

Consider the above example. Suppose the five nearest points are {-ve, -ve, +ve, +ve, +ve}. The class label assigned to the majority of points is +ve, and therefore the class label for the query point is also +ve.
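The whole procedure (distance, k nearest, majority vote) fits in a short sketch. The coordinates below are hypothetical, chosen so the five labels come out as {-ve, -ve, +ve, +ve, +ve} like the example:

```python
import math
from collections import Counter

def knn_predict(train_points, query, k=5):
    """train_points: list of ((x, y), label); returns the majority label of the k nearest."""
    # Sort all training points by Euclidean distance to the query point.
    by_dist = sorted(train_points, key=lambda p: math.dist(p[0], query))
    # Take the labels of the k nearest and vote.
    k_labels = [label for _, label in by_dist[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical 2D points: three +ve clustered near the query, two -ve farther away.
data = [((1, 1), "+ve"), ((1, 2), "+ve"), ((2, 1), "+ve"),
        ((5, 5), "-ve"), ((6, 5), "-ve")]

print(knn_predict(data, (1.5, 1.5), k=5))  # +ve (majority of the 5 neighbors)
```

Querying near the -ve cluster instead, e.g. `knn_predict(data, (5.5, 5), k=3)`, flips the vote to -ve.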


              K-NN  Figure 1


Observe Figure 1 above.
         If k = 3, there are four points: the query point, shown in red, and three neighboring points. The three nearest neighbors are (class A, class B, class B); the majority is class B, so the class label assigned to the query point is class B.
         If k = 6, there are seven points: the query point, shown in red, and six neighboring points. The six nearest neighbors are (class A, class A, class A, class A, class B, class B); the majority is class A, so the class label assigned to the query point is class A.

Now we will see what overfitting and underfitting are.

Figure 2

Overfitting refers to a model that was trained too much on the particulars of the training data (when the model learns the noise in the dataset). A model that is overfit will not perform well on new, unseen data. Overfitting is arguably the most common problem in applied machine learning and is especially troublesome because a model that appears to be highly accurate will actually perform poorly in the wild. 




Figure 3
Underfitting typically refers to a model that has not been trained sufficiently. This could be due to insufficient training time or a model that was simply not trained properly. A model that is underfit will perform poorly on the training data as well as new, unseen data alike.







How to determine overfitting and underfitting


Figure 4

Now observe Figure 4 carefully:


  • As k increases, the error on the train data increases.
  • As k increases, the error on the test data decreases up to a point, and then it starts increasing.
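The first bullet above can be demonstrated with a tiny 1-D k-NN sketch (the dataset is hypothetical, with one deliberately noisy label at x = 2): at k = 1 every training point is its own nearest neighbor, so the training error is 0; as k grows, the vote smooths over the noisy point and the training error rises.

```python
from collections import Counter

def knn_predict(train_points, query, k):
    # train_points: list of (x, label); 1-D nearest-neighbor vote.
    by_dist = sorted(train_points, key=lambda p: abs(p[0] - query))
    return Counter(lbl for _, lbl in by_dist[:k]).most_common(1)[0][0]

def error(model_data, eval_data, k):
    # Error = false predictions / total predictions.
    wrong = sum(knn_predict(model_data, x, k) != y for x, y in eval_data)
    return wrong / len(eval_data)

# Hypothetical 1-D dataset: "+" region on the left, "-" on the right,
# with one noisy "-" label at x = 2 inside the "+" region.
train = [(0, "+"), (1, "+"), (2, "-"), (3, "+"), (4, "-"), (5, "-"), (6, "-")]

for k in (1, 3, 5):
    print(k, error(train, train, k))
# k = 1 gives training error 0.0 (the model memorizes every point, noise included);
# for k > 1 the training error becomes nonzero.
```

This is exactly the left edge of the train-error curve in Figure 4: very small k memorizes the training data (overfitting), while larger k trades a little training error for smoother predictions.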


Overfitting: 

If the error on the train data is small and the error on the test data is high, it is called overfitting. Because the train error is small, the model has learned the noise as well, so the error on the test data is automatically high: the model does not predict future (unseen) data well.

Underfitting:

If the error on the train data is high and the error on the test data is also high, it is called underfitting. Because the train error is high, the model has not learned the underlying pattern properly, so it is called underfitting.
