Using Grid Search to Find Optimal Hyperparameters for Random Forests
Most of machine learning can be broken down into two major categories: classification and prediction. Classification is where you try to decide which category something belongs to. A couple of examples include “Is the picture a cat or a dog?” or “Given these factors, does a person make over $50,000 per year?” Models like logistic regression, random forests, and XGBoost are among the most commonly used for classification. Prediction is where you try to predict a value at a certain point in the future. Some examples are “What will be the cost of a stock in 3 months?” or “If we make these changes, how much will our sales increase?” Linear regression and general linear models (GLM) are a couple of the commonly used models. Today I am going to focus on classification with random forests. Check out this article for more information on the details of how random forests work.
One of the challenges of building any model is choosing the best hyperparameters. Hyperparameters are values you set before training, whereas parameters are values that are learned during training. For example, you can set the number of decision trees to use in a random forest. It can be tricky, though. Too few trees and your model may perform poorly. Too many trees and training time increases significantly, often with little or no gain in accuracy. How do you find this balance without having to manually change each combination of hyperparameters and re-run training? Talk about labor intensive and, honestly, a huge waste of your valuable time (see Time blog). Enter “Grid Search.”
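To make the hyperparameter versus parameter distinction concrete, here is a minimal sketch. It assumes scikit-learn is installed and uses a small synthetic dataset from make_classification as a stand-in for real data; the value of 25 trees is arbitrary and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameter: the number of trees is chosen by you, before training.
model = RandomForestClassifier(n_estimators=25, random_state=0)
model.fit(X, y)

# Parameters: the fitted trees (and the splits inside them) are learned from the data.
n_trees = len(model.estimators_)
print("Trees requested:", model.get_params()["n_estimators"])
print("Trees learned:", n_trees)
```

The number of trees never changes during fitting; only the contents of each tree do. That is why a value like n_estimators has to be tuned from the outside.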
Grid search is a method in which you create lists of the hyperparameter values you want to try. You then let your script try all the combinations (or a random subset) of those values and find the best performing model according to some metric you use to measure the quality of your model, such as accuracy. So rather than manually trying each value, you start the script and let it run in the background. When it is finished, it provides a dictionary of the best hyperparameter values it found. You take those values, use them to train your model, and see how much the model improves. If you don’t see much improvement, then it may be time to do some more feature engineering!
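The idea can be sketched with nothing but the Python standard library. In the sketch below, the grid values, the helper names expand_grid and grid_search, and the toy_score function are all hypothetical; toy_score is a made-up formula standing in for "train a model and measure its accuracy," not a real metric.

```python
from itertools import product

# Hypothetical grid: each key is a hyperparameter, each list holds the values to try.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

def expand_grid(grid):
    """Return every combination of values in the grid as a list of dictionaries."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def grid_search(grid, score_fn):
    """Score every combination and return (best_params, best_score)."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Made-up stand-in for training a model and measuring accuracy."""
    depth = params["max_depth"] or 20  # treat None (unlimited depth) as 20
    return (0.80 + 0.0002 * params["n_estimators"]
            - 0.001 * abs(depth - 10)
            - 0.002 * params["min_samples_leaf"])

combinations = expand_grid(param_grid)
best_params, best_score = grid_search(param_grid, toy_score)
print("Combinations tried:", len(combinations))
print("Best hyperparameters:", best_params)
```

Three lists of 3, 3, and 2 values give 18 combinations, which shows how quickly the grid grows as you add hyperparameters and why trying only a random subset can be attractive.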
In this post, I show you how to use scikit-learn’s GridSearchCV class along with a RandomForestRegressor to perform an exhaustive grid search to find the optimal hyperparameters for a simple random forest model. My purpose is not to do an exhaustive analysis of the dataset in order to get the absolute best classification results, but rather to demonstrate the process and provide a template script you can use with your own data. Feel free to take this dataset, manipulate it, and see if you get better results. If you do, please be sure to share them.
The script, in the form of a Jupyter Notebook found on my GitHub, takes census data and tries to predict whether the person in each row makes less than or more than $50,000 per year. The data I used came from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Census+Income. The script does the following:
- Import train/test and validation census data files
- Use random forest with default hyperparameters to predict income for each row, then run the model against the validation dataset
- Perform grid search over multiple hyperparameters to determine which values best fit the data (produce the highest accuracy)
- Use random forest with the optimal hyperparameters determined from grid search to predict income for each row
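The steps above can be sketched roughly as follows. This is not the notebook itself: it assumes scikit-learn is installed, uses synthetic data from make_classification in place of the census files, uses RandomForestClassifier because the task is classification, and picks a deliberately tiny grid so that it runs quickly. The grid values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic data stands in for the census train/test and validation files.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

# Baseline: random forest with default hyperparameters, scored on the validation split.
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_valid, baseline.predict(X_valid))

# Grid search: try every combination with 3-fold cross-validation on the training split.
param_grid = {"n_estimators": [20, 50], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)
best_params = search.best_params_  # dictionary of the best values found

# Final model: retrain with the best hyperparameters and score on the validation split.
tuned = RandomForestClassifier(random_state=42, **best_params)
tuned.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_valid, tuned.predict(X_valid))

print("Best hyperparameters:", best_params)
print("Baseline accuracy:", round(baseline_accuracy, 3))
print("Tuned accuracy:", round(tuned_accuracy, 3))
```

To adapt this to your own data, replace the synthetic X and y with your feature matrix and labels and widen param_grid. Keep in mind that every added value multiplies the number of models that have to be trained.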
The script is straightforward and will hopefully allow you to be more productive in your work. Instead of a lot of manual labor, you can focus on the things you love about data science and making your business more efficient and profitable.
If you have any questions, comments, or improvements, let me know. Be sure to connect with me on LinkedIn and Twitter (@pacejohn)!