Cross-validation is commonly used in statistics and machine learning to assess model performance, build a more robust model, and draw reliable inferences from the data.
- To perform cross-validation:
- Divide the dataset into two parts: a training set and a held-out validation (or test) set.
- Choose the number of folds (k) for cross-validation and split the training set into k roughly equal subsets (folds).
- Select a performance metric to evaluate the model.
- Repeat the following k times.
- Train the model using k-1 folds.
- Validate the model on the remaining fold and record the chosen performance metric.
- Calculate the average of the k metric values to assess the model’s overall performance.
- To illustrate this, I created random data, performed cross-validation, and used box plots to compare the evaluation metric before and after applying cross-validation.
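- Getting a lower median after applying cross-validation is a positive sign, assuming the metric is an error measure where lower is better. It indicates that, on average, the estimated error is lower than the estimate from a single split. Note that cross-validation does not change the model itself; it gives a performance estimate that depends less on one particular split of the data.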
- However, a higher interquartile range (IQR) implies greater variability in the model’s performance across different subsets of the data. This might indicate that the model’s performance is less consistent, possibly due to outliers or differences in data quality between folds.
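The cross-validation steps listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code behind the box plots: the synthetic data, the single-feature least-squares model, the choice of mean squared error as the metric, and the helper names (`k_fold_indices`, `fit_line`, `mse`, `cross_validate`) are all assumptions made for the example. It also assumes the data passed in is already the training portion, with any final test set held out separately.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (hypothetical stand-in model)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, validate on the remaining fold, repeat k times."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_line([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(mse(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return scores

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise with unit variance.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

scores = cross_validate(xs, ys, k=5)
mean_score = sum(scores) / len(scores)
print("per-fold MSE:", [round(s, 3) for s in scores])
print("mean MSE:", round(mean_score, 3))
```

The per-fold values are what a box plot would summarize: their median shows the typical error, and their spread (IQR) shows how much the estimate changes from fold to fold.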
Bootstrap is a resampling technique used for estimating the sampling distribution of a statistic by repeatedly resampling from the observed data. It is particularly useful when we have a limited dataset and want to make inferences about population parameters or assess the uncertainty of our estimates.
- To perform bootstrap resampling:
- Randomly pick ‘n’ data points (with replacement) from the dataset, where ‘n’ is usually the size of the original dataset. Repeat this process many times (usually thousands).
- Compute the statistic of interest (mean, median, etc.) for each of these new samples. This creates a distribution of the statistic.
- Use the distribution of the statistic to draw conclusions about the population parameter of interest or to understand the uncertainty associated with the estimate.
- ‘X’ indicates data points that are generated by bootstrapping.
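The bootstrap steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the plot: the synthetic sample, the choice of the mean as the statistic, the 2,000 resamples, and the helper names (`bootstrap_statistic`, `percentile_interval`) are all hypothetical. The interval shown is the simple percentile interval, which is one of several ways to summarize a bootstrap distribution.

```python
import random
import statistics

def bootstrap_statistic(data, stat=statistics.mean, n_resamples=2000, seed=0):
    """Resample len(data) points with replacement and compute stat each time."""
    rng = random.Random(seed)
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, alpha=0.05):
    """Simple percentile interval from the bootstrap distribution."""
    ordered = sorted(values)
    lo = ordered[int((alpha / 2) * len(ordered))]
    hi = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lo, hi

# Small synthetic sample (assumption): 30 draws from a normal distribution.
rng = random.Random(1)
sample = [rng.gauss(50, 10) for _ in range(30)]

boot_means = bootstrap_statistic(sample)
low, high = percentile_interval(boot_means)
print("sample mean:", round(statistics.mean(sample), 2))
print("approx. 95% interval for the mean:", round(low, 2), "to", round(high, 2))
```

Passing `stat=statistics.median` instead would produce the bootstrap distribution of the median with no other changes.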