Overfitting And Underfitting:
Overfitting and underfitting are fundamental problems in building a model. Before discussing them, we need to understand a few terms:
- Training Error (error on train data)
- Validation Error (error on test data)
- Decision Surface
- Accuracy
- Error
Accuracy: the number of correct predictions made by your model divided by the total number of predictions.
Accuracy = number of correct predictions / total number of predictions
Error: the number of false predictions made by your model divided by the total number of predictions.
Error = number of false predictions / total number of predictions
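The two formulas above can be checked with a few lines of Python. The labels here are made up purely for illustration:

```python
# Hypothetical true labels and model predictions, just to illustrate the formulas.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]  # actual class labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]  # labels predicted by the model

correct = sum(t == p for t, p in zip(y_true, y_pred))
total = len(y_true)

accuracy = correct / total          # 6 correct out of 8 -> 0.75
error = (total - correct) / total   # 2 wrong out of 8  -> 0.25

print(accuracy, error)
```

Note that accuracy and error always sum to 1, so knowing one immediately gives the other.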
As we know, whenever we build a machine learning model we divide the dataset into two parts: training data and testing data.
Training Error: train the model on the training data, then test it on that same training data and compute the accuracy. From the accuracy we can easily derive the error; this error is called the training error.
Validation Error: train the model on the training data, then test it on the test data and compute the accuracy. Again, from the accuracy we can derive the error; this error is called the validation error.
Decision Surface or Curve:
For example:
Suppose we have a dataset with two class labels, one negative and the other positive. We need to separate those class labels using a decision surface (or curve).
For a better understanding of underfitting and overfitting we can use k-nearest neighbors.
k-nearest neighbors:
Consider a simple 2D dataset with two types of points: positive (+ve) and negative (-ve). Given a query point (xq), we need to find the K points nearest to it.
For example, take k = 5. If k = 5 we need to find the five points nearest to the query point; those five points form the neighborhood. The question now is how to choose the points that make up the neighborhood of the query point.
And the solution to this question is:
We find those points by computing the distance between each candidate point and the query point. Four distance measures are commonly used:
- Euclidean
- Manhattan
- Minkowski
- Hamming
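The four distance measures listed above can be sketched in a few lines each. These are minimal illustrative implementations, not a library API:

```python
import math

def euclidean(a, b):
    # straight-line distance: sqrt of summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute differences along each axis
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # generalization: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # number of positions where the values differ (useful for categorical data)
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```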
Our main aim is to identify the class label of the query point. So, after finding the nearest neighbors of the query point, we look at which class label the majority of those neighbors carry, and assign that label to the query point.
Consider the above example. Suppose the five nearest points are {-ve, -ve, +ve, +ve, +ve}; the majority label is +ve, and therefore the class label of the query point is also +ve.
Observe figure 1 above:
If k = 3, the figure shows four points: the query point (in red) and its three nearest neighbors (class A, class B, class B, shown in yellow and red). The majority is class B, so the class label assigned to the query point is class B.
If k = 6, the figure shows seven points: the query point (in red) and its six nearest neighbors (class A, class A, class A, class A, class B, class B, shown in yellow and red). The majority is class A, so the class label assigned to the query point is class A.
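The neighbor search and majority vote described above can be combined into a small sketch. The coordinates below are made up for illustration; only the class A / class B labels mirror the figure:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((x, y), label) pairs; returns the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda pt: math.dist(pt[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Toy training points (hypothetical coordinates): a class A cluster and a class B cluster.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((6, 6), "B")]

print(knn_predict(train, (5.5, 5.5), 3))  # the 3 nearest points are all class B
print(knn_predict(train, (1.5, 1.5), 6))  # the majority of the 6 nearest points are class A
```

Notice how the predicted label depends on k: the same query point can get a different label when the neighborhood is enlarged, which is exactly what drives the overfitting/underfitting behavior discussed next.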
Now we will see what overfitting and underfitting are.
Figure 2
Overfitting refers to a model that was trained too much on the particulars of the training data (when the model learns the noise in the dataset). A model that is overfit will not perform well on new, unseen data. Overfitting is arguably the most common problem in applied machine learning and is especially troublesome because a model that appears to be highly accurate will actually perform poorly in the wild.
Figure 3
How to determine overfitting and underfitting:
Figure 4
- As k increases, the error on the train data increases.
- As k increases, the error on the test data decreases at first, then at some point starts to increase.
Overfitting:
If the error on the train data is small and the error on the test data is high, the model is overfitting. The very small training error means the model has learned the noise in the training data as well as the signal, so it cannot generalize and the error on unseen test data is high.
Underfitting:
If the error on the train data is high and the error on the test data is also high, the model is underfitting: it has failed to learn the underlying pattern of the training data in the first place, so it performs poorly on both sets.
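The two rules above can be written down directly. This is a hypothetical helper, and the `gap` and `high` thresholds are illustrative values, not standard ones:

```python
def diagnose(train_error, test_error, gap=0.1, high=0.3):
    # gap/high thresholds are made up for illustration; tune them per problem
    if train_error < high and test_error - train_error > gap:
        return "overfitting"   # low train error, much higher test error
    if train_error >= high and test_error >= high:
        return "underfitting"  # high error on both train and test data
    return "good fit"

print(diagnose(0.02, 0.35))  # overfitting
print(diagnose(0.40, 0.42))  # underfitting
print(diagnose(0.08, 0.11))  # good fit
```

In the kNN setting, this corresponds to sweeping k and picking the value where the validation error bottoms out: very small k tends toward the overfitting branch, very large k toward the underfitting branch.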