This report explores a “tidy data set that contains 1,599 red wines with 11 variables on the chemical properties of that wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).” (Source Udacity Data Set options). There are 13 variables in the provided dataset including the observation # which is the same as the row_number.
No code was copied verbatim from digital sources (or non digital). However, extensive use of google, stack overflow, etc. was used for reference, ideas, and research purposes. There is considerable distance between these sources and the final report provided here.
Factors in Dataset
I have changed some of the default names in the dataset to be more compact. The long names make many of the plots difficult to read.
The dependent variable is ‘quality’. That is what factors within this dataset impact the quality of the wine. The independent variables are the balance of the factors.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fx.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ v.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid: num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ res.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ FSD : num 11 25 15 17 11 13 15 15 9 17 ...
## $ TSD : num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fx.acidity v.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## res.sugar chlorides FSD TSD
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Lets start by looking at histograms of all of the factors.
The histogram for fixed acidity shows a somewhat right skewed distribution with a long tail and quite a few outliers. We will need to trim those outliers when we work with this factor.
The histogram for violatile acidity shows a lot of outliers to the right. If those are trimmed the distribution is roughly normal. the peaking we show here disappears with a larger bin size. We will need to trim those outliers when we work with this factor.
This is a VERY unsual distribution. There are peaks at 0, .25, and .5. The distribution is very right skewed. There are a lot of outliers on the right also.
A relatively normal distribution with the exception of an exceptionally long tail to the right. Many outliers will need to be trimmed.
A relatively normal distribution with the exception of an exceptionally long tail to the right. Many outliers will need to be trimmed.
A right skewed distribution with a long tail and many outliers. Quite typical of this dataset.
A right skewed distribution with a long tail and many outliers. Quite typical of this dataset.
Finally a very normal looking distribution with just a few outliers on either side!
A very normal looking distribution with just a few outliers on either side!
A bit of a right tail, with a lot of outliers.
A right skewed distribution with a lot of outliers on both the left and right.
A fairly normal distribution, but obviously there are not a lot of bins. This is essentially a categorical variable.
I have reviewed all 12 of the features in the above histograms. The one feature that is intriguing is Citric Acid which is right skewed (non log scale). There are 3 peaks at 0, .25 and .5 with Citric Acid. This is interesting Quite often vintners will talk about flavours or smells that have to do with citrus fruits. We will keep this in mind as we explore this dataset.
In the multivariate section we will add two additional variables as factors based on quality. This is so that we can visualize the data with quality as an added dimension.
This is a very tidy dataset. There are some anomalies as noted above. Residual sugar, FSD, TSD, sulphates, citric acid, and alcohol are all long tailed (right skewed). Let’s put these on a log scale and see what they do.
Look for zeros in dataframe.
Except for the new categorical variables we created (as numerics) the only variable that has a lot of 0s (132) is citric.acid. This represents about 8.3% of the values for that column. Makes you wonder why but … I have seen worse. It is odd because the other columns do not have any 0 values. However, this presence of 0s is yanking down the mean and median for this variable. That may be signficant. It also may be truthful. That is there was no measurable presence of citrus acid. This could be the explanation for the reason why citric acid does not conform to the normal distribution on a log scale.
We should also ensure that there are no NAs or NaNs in the dataframe.
There are no nas (missing values) or nans (missing numbers). I am going to look for outliers now. Boxplots are typically the best way to do this.
There are a lot of outliers in this data. Especially with residual sugar and chlorides. We can’t remove the outliers for each observation because eventually this will make the dataset unusable as we accumulate nas. We do need to keep these outliers in mind every time we calculate means, modes, medians, linear models, show plots, etc. We will keep the dataset tidy for now.
Lets start with a visualization of correlations.
As you can see from the above plot, the correlated factors of interest, based on the quality column are: volatile acidity, citric acid, sulphates, and alcohol. I will also sort the correlations. This will ensure that I am not going to miss anything (e.g. multi collinearity) as far as correlations go.
There are two main issues. The first is multi collinearity. The second one is correlation. If a variable exhibits significant collinearity with another variable, we will then determine if it is correlated to quality. If it is not, it will be removed from the linear model which is one of the last sections in this project.
Let’s plot the factors that have the strongest collinearity. Those are (fx.acidity and citric.acid of .67), (v.acidity and citric.acid of .55), (TSD and FSD of .67), (alcohol and density of .5), and (citric.acid and pH of .54)
As expected very strong correlations between the factor pairs. This is of most interest when deciding on what factors to include in a model (e.g. linear). We discussed in the previous section which factors we would keep in the linear model. The only one that is up for debate is the citric acid vs volatile acidity pair. As indicated above, we are going to keep both, and see what the linear model does with one or both of them.
Factors correlated with quality
The top 4 factors most strongly correlated with quality were alcohol (.48), volatile acidity (-.39), sulphates (.25), and citric acid (.23). As we will see later, by subsetting the dataset to remove outliers, we were able to increase the correlation for sulphates considerably. Normally we only consider factors that are above .3. However, wine is subtle and sulphates (for long term wine quality protection) and citric acid are commonly sited as being critical to wine quality.
Let’s use the quality variable to group the data and compare the means and medians of these quality correlated factors in each of the 3-8 levels. We need to turn it into a true factor to start with.
Not surprising, it shows the largely expected linear relationships with each variable. Also it does show the effect of the outliers. Mean is being pushed up and pulled down by the outliers to make it clearly different in all of the quality rankings.
In order to see what effect outliers have on correlation we will do the following.
Correlation drops slightly for quality vs alcohol when you take out the outliers.
Correlation drops slightly for quality vs volatile acidity when you take out the outliers.
Correlation increases slightly for quality vs citric acid when you take out the outliers.
Wow, the correlation went up significantly (over 50%) when you took the outliers out of the sulphates.
Lets look at box plots with the most important correlated factors versus quality and utilizing the subset that reduces outliers.
If the factor is positively correlated with quality, then the upper right quadrant is populated by the higher averages for the factor. If negatively correlated (e.g. v.acidity) the higher averages are in the lower right quadrant. These plots clearly show that these factors are correlated with quality. The use of the vertical and horizontal mean lines gives the viewer a sense of where the quadrants are for quality effects for each of these factors.
In this next section of plots we want to highlight the correlated factors relationship with quality. When people choose wine, they WANT to avoid poor wines, MAY choose medium quality wines if they are price conscious, BUT … what they really want is a good quality wine. This dataset contains relatively few good quality wines (defined as levels 7(n of 199) and 8(n of 18)) out of a total of 1599 wines. In order to highlight the good quality wines and their relationship with the correlated factors, I have chosen to make the size of the scatter points larger as quality scores increase. This allows the reader to quickly see where the 7 and especially 8 scores reside.
When alcohol and citric acid are above their means, a cluster of high quality wine (quality scores of 7, 8) appear in the upper right hand quadrant of the scatter plot. This is expected since both of these factors are positively correlated with quality.
When alcohol is above its mean and volatile acidity is below its mean, a cluster of high quality wine (quality scores of 7, 8) appear in the lower right hand quadrant of the scatter plot. This is expected since volatile acidity is inversely related to quality scores.
When alcohol and sulphates are above their means, a cluster of high quality wine (quality scores of 7, 8) appear in the upper right hand quadrant of the scatter plot.
This plot is quite interesting. You can see the way the data clusters together. This makes sense since citric acid and volatile acidity are highly correlated. When volatile acidity is below its mean and citric acid is above its mean, a cluster of high quality wine (quality scores of 7, 8) appear in the upper left hand quadrant of the scatter plot. This is expected since volatile acidity is negatively correlated with quality and citric acid is positively correlated.
When volatile acidity is below its mean and sulphates is above its mean, a cluster of high quality wine (quality scores of 7, 8) appear in the upper left hand quadrant of the scatter plot. This is expected since volatile acidity is negatively correlated with quality and sulphates is positively correlated.
When citric acid and sulphates are above their means, a cluster of high quality wine (quality scores of 7, 8) appear in the upper right hand quadrant of the scatter plot. This is expected since both factors are positively correlated with quality.
These plots are quite instructive. They clearly show the effects of the correlations on quality AND each other. You can see clustering occurring when factors have collinearity. You can also see the blues and pinks clustering in the appropriate high quality quadrant consistently.
We are going to create a linear model (below). To avoid the large number of outliers in sulphates, we will use the df_sul subset that we created above. This subset resulted in sulphates correlation increasing from .25 to .39 which should be helpful to the model. None of the other subsets caused a large difference in the correlations for the other variables to quality.
To improve the model “win”, I am going to try to give it some help by creating a function that uses alcohol to provide a factor depending on the level of alcohol. This is crude obviously, but it is worth the adventure.
Now institute the above code in the linear regression model.
Results
Let’s plot this to visually show the difference between the actual results and the model’s predicted results.
As you can see it was a valiant effort but … the model is not able to pick out poor and good quality wines. It is “correct” 60% of the time. It looks like the model is like a friend of mine. “Never met a bad bottle in his life:)”
I also think this plot is the right plot. I tried geom_smooth but all that it did was make the model look good, because it averaged everything. That is the problem with this model. It is placing all of the quality scores closer to the mean of the distribution, than what the wine experts said. With wine we are mostly interested in the ends of the distribution. That is avoiding bad bottles, making do with medium bottles, and looking to enjoy excellent wines. This model would tell you that almost everything is OK, nothing is bad, and nothing is good. Almost the exact opposite of what we are looking for.
The reason why I chose the above plot was for the striking difference that this created in the correlation between sulphates and quality. Truly an excellent example of the deleterious effect outliers can have on potential conclusions. By subsetting the data to remove the outliers for sulphates, this also ultimately improved our linear model!
I chose this plot because it clearly shows the factors that are correlated with quality. This is accomplished by seeing the median of the boxplot move in the appropriate direction. The vertical and horizontal lines (means of factors) give the viewer an instant idea of what is going on with the data by creating implicit quality quadrants. I had toyed with the idea of coloring the quadrants. However, I like the clean visual look of this plot.
By plotting alcohol and citric acid on a scatter plot, colored by quality, the eye is drawn to the upper right quadrant where the blues congregate (quality scores 5-7) and the pinks are almost exclusively located (quality scores 7-8).
I learned something about wine. It seems, based on this exploration, that better quality wines, are associated with (relative to the mean), higher alcohol content, lower volatile acidity, higher citric acid, and higher sulphates. I would have guessed that some of these would have an effect, but … I would not have been very accurate about the effect. For example, I thought sulphates were just bad!
I was surprised by the dramatic effect on correlation that removing sulphate outliers had on correlation. A graphic lesson to learn about outliers. The log10 histograms were interesting the way they created normal distributions for several factors. The use of the “normal” and the log10 histograms were helpful in terms of identifying anomalies in the distributions (e.g. bimodal peaks). At one point I had created a bimodal categorical variable for Free Sulphur Dioxide (FSD). However, I was not able to use this (FSD_bmd) engineered factor in any useful way so I removed it. This says to me, just because a data anomaly is there, does not mean it is useful.
As always, there is the question about data. We simply do not know the quality of this dataset. It could be stellar, but I have no way of knowing. My major concern is that 3 expert wine tasters is FAR from the 30 that statistics would like us to use to approach the bell curve (hopefully). There are a lot of wine tasting competitions. Eventually one could build that large of a dataset (minimum of 30 expert tasters). Hopefully this would help with the linear model.
It is likely there is a strong correlation between the age of the wine and the sul group (sulphates, FSD, and TSD) that is important. Having the age of the wine added to the dataset, should be an important factor to consider when building a predictive model for wine quality.
This project is also a good example of where linear models are not sufficient for some purposes. The model I produced was truly poor at predicting quality in a useful way. Sommeliers have nothing to fear from linear models for some time! My assumption is that a method such as random forest would do a more useful prediction job. Hopefully we will have the opportunity to use some of these better models in the future for these types of datasets!