- The goal is to experiment with different hyperparameters and variables to find the best-performing (lowest RMSE) K-Nearest Neighbors model.
Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, re
- The columns were not labeled. Column names were extracted from the Attribute Information using regex and assigned as the column headers.
- Replaced "?" values with np.nan
- Replaced the car door values with numerical values
- Dropped all rows where target column ('price') is null.
- To put every data point on the same scale, so that each feature carries equal weight, all numerical columns were normalized with min-max normalization: (x - min) / (max - min)
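The cleaning and normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows, the `num-of-doors` column name, the `{"two": 2, "four": 4}` mapping, and the `min_max_normalize` helper are all assumptions made for the example, and only the feature columns are scaled here so that the target stays in its original units.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample (assumption); the real dataset has more rows and columns.
df = pd.DataFrame({
    "num-of-doors": ["two", "four", "?", "four"],
    "curb-weight": [2548, 2823, 2337, 3053],
    "price": ["13495", "?", "13950", "17450"],
})

# Replace "?" placeholders with NaN.
df = df.replace("?", np.nan)

# Map door counts written as words to numbers (hypothetical mapping).
df["num-of-doors"] = df["num-of-doors"].map({"two": 2, "four": 4})

# Drop rows where the target column is missing, then make it numeric.
df = df.dropna(subset=["price"]).copy()
df["price"] = df["price"].astype(float)


def min_max_normalize(frame, columns):
    """Return a copy with each listed column rescaled to [0, 1]."""
    out = frame.copy()
    for col in columns:
        col_min, col_max = out[col].min(), out[col].max()
        out[col] = (out[col] - col_min) / (col_max - col_min)
    return out


features = ["num-of-doors", "curb-weight"]
normalized = min_max_normalize(df, features)
print(normalized)
```

Missing feature values (such as the unknown door count) remain NaN after this step and would still need to be handled before fitting a model.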
- With an initial k of 5 neighbors, a K-Nearest Neighbors model was tried on each column. Here is their RMSE performance (y-axis: RMSE, x-axis: k):
- In this first round of training, the feature 'curb-weight' reaches a low RMSE earliest, at a k-value of 5.
- Next, we try combinations of different numbers of features taken from the top 5 performing features (curb-weight, highway-mpg, city-mpg, length, width).
- Result:
- Lowest RMSE: the top 4 variables combined, with k=5, had an RMSE of 3022
- The top 3 variables combined, with k=5, had an RMSE of 3226
- The top 5 variables combined, with k=5, had an RMSE of 3367
- The top 2 variables combined, with k=5, had an RMSE of 3460
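The feature-combination comparison above could be reproduced along these lines. This is a sketch under stated assumptions: the `knn_rmse` helper, the 75/25 train/test split, the random seed, and the synthetic `demo` data are hypothetical stand-ins, so the numbers it prints will not match the results listed above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def knn_rmse(df, features, target="price", k=5, test_size=0.25, seed=1):
    """Fit a KNN regressor on `features` and return the RMSE on a held-out split."""
    train, test = train_test_split(df, test_size=test_size, random_state=seed)
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(train[features], train[target])
    predictions = model.predict(test[features])
    return float(np.sqrt(mean_squared_error(test[target], predictions)))


# Synthetic stand-in for the normalized dataset (assumption, not the real data).
rng = np.random.RandomState(0)
ranked = ["curb-weight", "highway-mpg", "city-mpg", "length", "width"]
demo = pd.DataFrame(rng.rand(200, 5), columns=ranked)
demo["price"] = 5000 + 20000 * demo["curb-weight"] + rng.normal(0, 500, 200)

# Evaluate the top 2, 3, 4 and 5 features combined, all with k=5.
results = {n: knn_rmse(demo, ranked[:n]) for n in range(2, 6)}
for n, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"top {n} features, k=5: RMSE = {rmse:.0f}")
```

Passing a different `k` to `knn_rmse` in a loop would extend the same comparison to the hyperparameter search over k.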