We decided to try a different type of model to see if it would give us better results. To start down this new path, we binned the salaries into ranges. We thought this would help the model, since it would no longer be trying to predict an exact amount. But how many bins should we use?
We started by looking at the range of salaries and breaking it into equal chunks, which gave us nine bins. We noticed that the lower three bins were significantly larger than the last six, so we combined the last six into one so that each bin's player count was in the thousands. From there we added a column holding the bin code. For example, if you made less than $1 million, your salary code was zero.
| Salary Bin | Number of Players in Bin | Binning Code |
|---|---|---|
| Less than $1 million | 7437 | 0 |
| $1 to $5 million | 5172 | 1 |
| $5 to $10 million | 1248 | 2 |
| More than $10 million | 1156 | 3 |
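As a rough illustration of how a salary could be mapped to the codes in the table, here is a minimal sketch using only the standard library. The names `BIN_EDGES`, `salary_code`, and `sample_salaries` are hypothetical, the sample values are made up for the example, and the handling of salaries that land exactly on an edge (they go to the higher bin) is an assumption, since the write-up does not say how ties were treated.

```python
from bisect import bisect_right

# Edges (in dollars) separating the four salary bins from the table.
# Assumption: a salary exactly on an edge goes to the higher bin.
BIN_EDGES = [1_000_000, 5_000_000, 10_000_000]


def salary_code(salary):
    """Return the bin code (0-3) for a salary given in dollars."""
    # bisect_right counts how many edges are <= salary, which is the bin code.
    return bisect_right(BIN_EDGES, salary)


# Hypothetical salaries for illustration only, not rows from the real data.
sample_salaries = [550_000, 999_999, 1_000_000, 4_200_000, 7_500_000, 25_000_000]
codes = [salary_code(s) for s in sample_salaries]
print(codes)  # [0, 0, 1, 1, 2, 3]
```

With a pandas DataFrame, `pd.cut` with `right=False` and the same edges should produce equivalent codes.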
Now we have a classification problem, so our algorithms will be classifiers, not regressors! We tried KMeans (strictly a clustering algorithm rather than a classifier), KNN, RandomForest, ExtraTrees, and a Support Vector Machine (SVM). Upshot: no classifier achieved more than 50% accuracy. Again, we were disappointed with the results…
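To put the 50% figure in context, it helps to compare it with a majority-class baseline, meaning a "model" that always predicts the most common bin. The sketch below computes that baseline from the counts in the table. It assumes those counts describe the full data set and that any test split had similar proportions, neither of which is stated above. The variable names are illustrative.

```python
# Player counts per bin code, copied from the table.
bin_counts = {0: 7437, 1: 5172, 2: 1248, 3: 1156}

total_players = sum(bin_counts.values())

# The majority class is the bin code with the most players.
majority_code = max(bin_counts, key=bin_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = bin_counts[majority_code] / total_players

print(majority_code)               # 0
print(f"{baseline_accuracy:.1%}")  # 49.5%
```

Under those assumptions, always guessing "less than $1 million" would be right about half the time, so accuracy at or below 50% means the classifiers were doing little better than that naive guess.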