We found a dataset on Kaggle.com of MLB position player statistics and salary data (adjusted for inflation) for 1985-2016. The dataset comprises 15,023 observations with 29 features (4 ID features, 2 salary features, 4 fielding features, 19 offensive features). We focused on the adjusted salary as our dependent variable. After cleaning the dataset to remove some errant duplicates, we had 15,014 observations, 25 independent variables, and 1 dependent variable.
We had to assume that the creator of our Kaggle.com dataset applied the inflation adjustment correctly. We know that his original source for the data was a well-established baseball database created by Sean Lahman. The dataset had some oddities which required decisions:
We attempted to filter out every player's first 3 career years from the dataset to address this issue. When we aggregated the data in our improved linear regression, the issue became moot, since we took the mean of all career-year stats for every player. The salary range caused an issue as well: at the low end (after outlier removal), salaries started at $136,734, while at the high end they reached $39,810,209.
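The two approaches described above, dropping each player's first three seasons and averaging stats over a whole career, can be sketched as follows. This is only an illustration under stated assumptions: the rows, the column layout (player ID, season, adjusted salary, home runs), and the function names are hypothetical, and it assumes one row per player per season, which may not match the actual dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (player_id, season, adjusted_salary, home_runs).
rows = [
    ("a01", 2001, 300_000, 5),
    ("a01", 2002, 320_000, 9),
    ("a01", 2003, 350_000, 12),
    ("a01", 2004, 2_000_000, 20),
    ("a01", 2005, 4_000_000, 30),
    ("b02", 2010, 400_000, 2),
    ("b02", 2011, 420_000, 4),
]


def group_by_player(rows):
    """Collect each player's rows, sorted by season."""
    by_player = defaultdict(list)
    for row in rows:
        by_player[row[0]].append(row)
    for seasons in by_player.values():
        seasons.sort(key=lambda r: r[1])
    return by_player


def drop_early_years(rows, n_years=3):
    """Drop each player's first n_years seasons (assumes one row per season)."""
    kept = []
    for seasons in group_by_player(rows).values():
        kept.extend(seasons[n_years:])
    return kept


def career_means(rows):
    """Average salary and home runs over all of a player's seasons."""
    return {
        pid: (mean(r[2] for r in seasons), mean(r[3] for r in seasons))
        for pid, seasons in group_by_player(rows).items()
    }


filtered = drop_early_years(rows)   # only a01's 2004 and 2005 rows remain
averages = career_means(rows)       # one (salary, home runs) pair per player
```

Note that the filter removes player `b02` entirely, because that player has fewer than four seasons, whereas career averaging keeps every player as a single observation.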
As expected, when we plotted the histogram of salaries it was strongly right-skewed (positively skewed), with most values bunched at the low end and a long rightward tail. Based on a suggestion in a Moneyball-themed post on Medium.com, we transformed the adjusted salary into its natural logarithm, which made the histogram look much closer to a normal (Gaussian) distribution.
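To illustrate why the log transform helps, here is a small sketch that compares the skewness of a salary sample before and after taking the natural logarithm. The salary list is hypothetical (only its smallest and largest values are the figures quoted above), and the `skewness` helper is our own illustrative definition rather than anything taken from the original analysis.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score.

    A positive result indicates a long right tail.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)


# Hypothetical sample; only the minimum and maximum come from the text.
salaries = [136_734, 250_000, 400_000, 750_000, 1_500_000, 4_000_000, 39_810_209]
log_salaries = [math.log(s) for s in salaries]

raw_skew = skewness(salaries)       # strongly positive: long right tail
log_skew = skewness(log_salaries)   # closer to zero after the transform

print(f"skewness of raw salaries: {raw_skew:.2f}")
print(f"skewness of log salaries: {log_skew:.2f}")
```

The log compresses the few very large salaries far more than the many small ones, which is what pulls the right tail in.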
Correlation matrix analysis revealed that the offensive features were more highly correlated with our dependent variable than any of the other features in the dataset, so we focused our efforts there. GS (games started), BB (walks), RBI (runs batted in), R (runs scored), HR (home runs), and InnOuts (inning outs, a measure of game time played) were the features most highly correlated with ADJ Salary.
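Ranking features by their correlation with the target can be sketched as below. This assumes Pearson correlation, which the text does not specify, and the feature values, the target values, and the `E` column (standing in for a weakly related fielding stat) are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical feature columns and target (not taken from the dataset).
features = {
    "HR": [5, 12, 20, 30, 35],
    "RBI": [20, 45, 70, 95, 110],
    "E": [10, 3, 8, 4, 9],
}
target = [12.6, 13.1, 14.0, 14.9, 15.4]

# Sort features by absolute correlation with the target, strongest first.
ranked = sorted(
    ((name, pearson(column, target)) for name, column in features.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

Sorting on the absolute value keeps strongly negative correlations near the top as well, since they are just as informative for a regression as strongly positive ones.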


