Red Wine Quality Dataset Exploration by Lindsay Moir

This report explores a “tidy data set that contains 1,599 red wines with 11 variables on the chemical properties of that wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).” (Source Udacity Data Set options). There are 13 variables in the provided dataset including the observation # which is the same as the row_number.

No code was copied verbatim from digital sources (or non digital). However, extensive use of google, stack overflow, etc. was used for reference, ideas, and research purposes. There is considerable distance between these sources and the final report provided here.

Factors in Dataset

I have changed some of the default names in the dataset to be more compact. The long names make many of the plots difficult to read.

  1. Fixed acidity (fx.acidity) - many red wine acids are stable and do not evaporate easily.
  2. Volatile acidity (v.acidity) - high levels of acetic acid (vinegar tasting) is not good.
  3. Citric acid (citric.acid) - creates hints of flavours to wine.
  4. Residual sugar (res.sugar) - remaining sugar after fermentation ends.
  5. Chlorides (chlorides) - salts in the wine.
  6. Free Sulphur Dioxide (FSD) - SO2 in free form prevents bacterial growth and oxidation of the wine.
  7. Total Sulphur Dioxide (TSD) - both free and bound versions. In high concentrations is unpleasant. 8, Density (density) - relative density to water. Depends on alcohol and sugar content.
  8. pH (pH) - standard acidic to basic scale
  9. Sulphates (sulphates) - added to wine to stop the wine from going bad. Can be counter to wine quality if too much is added.
  10. Alcohol (alcohol) - percent alcohol in the wine.
  11. Quality (quality) - from 3 wine experts in this case based on taste, smell, and look.

The dependent variable is ‘quality’. That is what factors within this dataset impact the quality of the wine. The independent variables are the balance of the factors.

Univariate Plots Section

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fx.acidity : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ v.acidity  : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid: num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ res.sugar  : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides  : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ FSD        : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ TSD        : num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density    : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH         : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates  : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol    : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality    : int  5 5 5 6 5 5 5 7 7 5 ...
##        X            fx.acidity      v.acidity       citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##    res.sugar        chlorides            FSD             TSD        
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00   Min.   :  6.00  
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00   1st Qu.: 22.00  
##  Median : 2.200   Median :0.07900   Median :14.00   Median : 38.00  
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87   Mean   : 46.47  
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00   3rd Qu.: 62.00  
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00   Max.   :289.00  
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
  1. X is the observation number and is identical to the row number. We will need X in the linear model section. X is an integer.
  2. Quality (the dependent variable) is numerical but … it is really a categorical variable that was created by expert wine tasters. The values range from 3-8 and the mean is 5.6 with the median being 6. The prediction model we will create, should fit better if quality is treated as a categorical factor. If we make this change the model will hopefully capture any non-linearity. However, it makes it more difficult to work with this variable. In order to have the best of both worlds, we will leave quality as numeric and add a categorical factor later as required that mirrors quality (e.g. quality_factor) However, for now, lets leave it as continuous and numeric.
  3. All other variables are numeric (except for X and quality) and are continuous.

What is/are the main feature(s) of interest in your datasset.

Lets start by looking at histograms of all of the factors.

The histogram for fixed acidity shows a somewhat right skewed distribution with a long tail and quite a few outliers. We will need to trim those outliers when we work with this factor.

The histogram for violatile acidity shows a lot of outliers to the right. If those are trimmed the distribution is roughly normal. the peaking we show here disappears with a larger bin size. We will need to trim those outliers when we work with this factor.

This is a VERY unsual distribution. There are peaks at 0, .25, and .5. The distribution is very right skewed. There are a lot of outliers on the right also.

A relatively normal distribution with the exception of an exceptionally long tail to the right. Many outliers will need to be trimmed.

A relatively normal distribution with the exception of an exceptionally long tail to the right. Many outliers will need to be trimmed.

A right skewed distribution with a long tail and many outliers. Quite typical of this dataset.

A right skewed distribution with a long tail and many outliers. Quite typical of this dataset.

Finally a very normal looking distribution with just a few outliers on either side!

A very normal looking distribution with just a few outliers on either side!

A bit of a right tail, with a lot of outliers.

A right skewed distribution with a lot of outliers on both the left and right.

A fairly normal distribution, but obviously there are not a lot of bins. This is essentially a categorical variable.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I have reviewed all 12 of the features in the above histograms. The one feature that is intriguing is Citric Acid which is right skewed (non log scale). There are 3 peaks at 0, .25 and .5 with Citric Acid. This is interesting Quite often vintners will talk about flavours or smells that have to do with citrus fruits. We will keep this in mind as we explore this dataset.

Did you create any new variables from existing variables in the dataset?

In the multivariate section we will add two additional variables as factors based on quality. This is so that we can visualize the data with quality as an added dimension.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

This is a very tidy dataset. There are some anomalies as noted above. Residual sugar, FSD, TSD, sulphates, citric acid, and alcohol are all long tailed (right skewed). Let’s put these on a log scale and see what they do.

  1. There is something going on with citric acid. Now, it is very skewed to the left. I do not know what it is but, .. I have done a quick correlation analysis and citric acid is one of the correlated factors. Hopefully the reason for the unusual distribution will become apparent as we proceed.
  2. Residual sugar still has a long tail to the right. This is left over sugar in the wine after fermentation. The more the sweeter.
  3. FSD has a small bimodal distribution with a peak at 8 and a lot of the data centered at 25 for the next part of the distribution. We could create a categorical variable around this. However, the drop off is not large and would disappear with a different bin size.
  4. Total Sulphur Dioxides (TSD) appear as a normal distribution on the log scale.
  5. Sulphates still has a long tail. Since vintners want some low levels of sulphates in their wine over extended periods, this is not surprising.
  6. Alcohol has a small bimodal distribution with the first being at 9.6 and the second at 10.1. However, this disappears when you adjust the bin sizes.

Look for zeros in dataframe.

Except for the new categorical variables we created (as numerics) the only variable that has a lot of 0s (132) is citric.acid. This represents about 8.3% of the values for that column. Makes you wonder why but … I have seen worse. It is odd because the other columns do not have any 0 values. However, this presence of 0s is yanking down the mean and median for this variable. That may be signficant. It also may be truthful. That is there was no measurable presence of citrus acid. This could be the explanation for the reason why citric acid does not conform to the normal distribution on a log scale.

We should also ensure that there are no NAs or NaNs in the dataframe.

There are no nas (missing values) or nans (missing numbers). I am going to look for outliers now. Boxplots are typically the best way to do this.

There are a lot of outliers in this data. Especially with residual sugar and chlorides. We can’t remove the outliers for each observation because eventually this will make the dataset unusable as we accumulate nas. We do need to keep these outliers in mind every time we calculate means, modes, medians, linear models, show plots, etc. We will keep the dataset tidy for now.

Bivariate Plots Section

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Lets start with a visualization of correlations.

As you can see from the above plot, the correlated factors of interest, based on the quality column are: volatile acidity, citric acid, sulphates, and alcohol. I will also sort the correlations. This will ensure that I am not going to miss anything (e.g. multi collinearity) as far as correlations go.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are two main issues. The first is multi collinearity. The second one is correlation. If a variable exhibits significant collinearity with another variable, we will then determine if it is correlated to quality. If it is not, it will be removed from the linear model which is one of the last sections in this project.

  1. fx.acidity - Remove from the linear model. Fixed acidity and citric acid have a correlation coefficient -.672. In this instance fixed acidity has a correlation to quality of .124 and citric acid has a correlation of .226. We will keep citric acid and drop fixed acidity.
  2. v.acidity - Citric acid and volatile acidity have a correlation coefficient of -.55. We will keep both. We have already dropped fixed acidity and next to alcohol, volatile acidity has the highest correlation to quality.
  3. citric.acid - Include in the linear model. Citric acid and volatile acidity have a correlation coefficient of -.55. We will keep both. We have already droped fixed acidity and next to alchohol volatile acidity has the highest correlation to quality.
  4. res.sugar - There is little cross correlation. However, we will not include this in the linear model because its correlation with quality is so low.
  5. chlorides - There is little cross correlation. However, we will not include this in the linear model because its correlation with quality is so low.
  6. FSD and TSD - TSD (Total Sulphur Dioxide) and FSD (Free Sulphur Dioxide) have a .67 correlation factor with each other. For FSD the correlation to quality is -.0507. For TSD it is -.185. Basic chemistry says that FSD is a a result of TSD breaking down. Not sure if this is significant. However, based on the correlations to quality, it is somewhat moot for this analysis. Since their correlations to quality are so low we will not put either one of them in the linear model.
  7. density - The cross correlation here is between density and alcohol. We will leave alcohol in the linear model and density out. The correlation is -.5 between the two of them.
  8. pH - Citric acid is again running the gauntlet vs pH. We will drop pH. Its correlation coefficient to quality is only .0577.
  9. sulphates - There is little cross correlation. It also has one of the higher correlations to quality. It will be in the linear model.
  10. alcohol - The cross correlation is with density. We will leave alcohol in the linear model and as mentioned above remove density. The correlation is only -.5 between the two of them.
  11. quality - This is the dependent variable.

Let’s plot the factors that have the strongest collinearity. Those are (fx.acidity and citric.acid of .67), (v.acidity and citric.acid of .55), (TSD and FSD of .67), (alcohol and density of .5), and (citric.acid and pH of .54)

As expected very strong correlations between the factor pairs. This is of most interest when deciding on what factors to include in a model (e.g. linear). We discussed in the previous section which factors we would keep in the linear model. The only one that is up for debate is the citric acid vs volatile acidity pair. As indicated above, we are going to keep both, and see what the linear model does with one or both of them.

What was the strongest relationship you found?

Factors correlated with quality

The top 4 factors most strongly correlated with quality were alcohol (.48), volatile acidity (-.39), sulphates (.25), and citric acid (.23). As we will see later, by subsetting the dataset to remove outliers, we were able to increase the correlation for sulphates considerably. Normally we only consider factors that are above .3. However, wine is subtle and sulphates (for long term wine quality protection) and citric acid are commonly sited as being critical to wine quality.

Let’s use the quality variable to group the data and compare the means and medians of these quality correlated factors in each of the 3-8 levels. We need to turn it into a true factor to start with.

Not surprising, it shows the largely expected linear relationships with each variable. Also it does show the effect of the outliers. Mean is being pushed up and pulled down by the outliers to make it clearly different in all of the quality rankings.

In order to see what effect outliers have on correlation we will do the following.

  1. calculate the correlation coefficient for the subsetted data (outliers removed) vs quality for the four factors that are correlated with quality.
  2. Compare it to the none subsetted correlation to see if there is a difference.
  3. Create a scatter plot for each factor using the complete dataset and the subset to graphically show the effect of the outliers complets with a regression line.

Correlation drops slightly for quality vs alcohol when you take out the outliers.

Correlation drops slightly for quality vs volatile acidity when you take out the outliers.

Correlation increases slightly for quality vs citric acid when you take out the outliers.

Wow, the correlation went up significantly (over 50%) when you took the outliers out of the sulphates.

Lets look at box plots with the most important correlated factors versus quality and utilizing the subset that reduces outliers.

If the factor is positively correlated with quality, then the upper right quadrant is populated by the higher averages for the factor. If negatively correlated (e.g. v.acidity) the higher averages are in the lower right quadrant. These plots clearly show that these factors are correlated with quality. The use of the vertical and horizontal mean lines gives the viewer a sense of where the quadrants are for quality effects for each of these factors.

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this next section of plots we want to highlight the correlated factors relationship with quality. When people choose wine, they WANT to avoid poor wines, MAY choose medium quality wines if they are price conscious, BUT … what they really want is a good quality wine. This dataset contains relatively few good quality wines (defined as levels 7(n of 199) and 8(n of 18)) out of a total of 1599 wines. In order to highlight the good quality wines and their relationship with the correlated factors, I have chosen to make the size of the scatter points larger as quality scores increase. This allows the reader to quickly see where the 7 and especially 8 scores reside.

When alcohol and citric acid are above their means, a cluster of high quality wine (quality scores of 7, 8) appear in the upper right hand quadrant of the scatter plot. This is expected since both of these factors are positively correlated with quality.

When alcohol is above its mean and volatile acidity is below its mean, a cluster of high quality wine (quality scores of 7, 8) appear in the lower right hand quadrant of the scatter plot. This is expected since volatile acidity is inversely related to quality scores.

When alcohol and sulphates are above their means, a cluster of high quality wine (quality scores of 7, 8) appear in the upper right hand quadrant of the scatter plot.

This plot is quite interesting. You can see the way the data clusters together. This makes sense since citric acid and volatile acidity are highly correlated. When volatile acidity is below its mean and citric acid is above its mean, a cluster of high quality wine (quality scores of 7, 8) appear in the upper left hand quadrant of the scatter plot. This is expected since volatile acidity is negatively correlated with quality and citric acid is positively correlated.

When volatile acidity is below its mean and sulphates is above its mean, a cluster of high quality wine (quality scores of 7, 8) appear in the upper left hand quadrant of the scatter plot. This is expected since volatile acidity is negatively correlated with quality and sulphates is positively correlated.

When citric acid and sulphates are above their means, a cluster of high quality wine (quality scores of 7, 8) appear in the upper right hand quadrant of the scatter plot. This is expected since both factors are positively correlated with quality.

Were there any interesting or surprising interactions between features?

These plots are quite instructive. They clearly show the effects of the correlations on quality AND each other. You can see clustering occurring when factors have collinearity. You can also see the blues and pinks clustering in the appropriate high quality quadrant consistently.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

We are going to create a linear model (below). To avoid the large number of outliers in sulphates, we will use the df_sul subset that we created above. This subset resulted in sulphates correlation increasing from .25 to .39 which should be helpful to the model. None of the other subsets caused a large difference in the correlations for the other variables to quality.

To improve the model “win”, I am going to try to give it some help by creating a function that uses alcohol to provide a factor depending on the level of alcohol. This is crude obviously, but it is worth the adventure.

Now institute the above code in the linear regression model.

Results

  1. This is an r^2 value of .36. That is only a R value of .6. A little better than a coin toss.
  2. The results are better with df_sul than df. It went from a .336 r^2 value to a .36 r^2 value.
  3. I ran the model WITHOUT volatile acidity. v.acidity and citric.acid have a correlation coefficient of -.55. r^2 dropped to .322. So, that was not helpful.
  4. I also experimented with leaving the sulphate outliers in and seeing if sulphates log10 was a better predictor. It was not.
  5. I ran the linear model with a number of the other low correlation factors and different subsets of the data trying to see if there was something that would improve the prediction value. None were able to improve the model over this r^2 valueof .36 (r = .6).
  6. I had created a bimodal categorical variable for Free Sulphur Dioxide (FSD). I used it to split the data. Nothing happened. It was not useful.

Let’s plot this to visually show the difference between the actual results and the model’s predicted results.

As you can see it was a valiant effort but … the model is not able to pick out poor and good quality wines. It is “correct” 60% of the time. It looks like the model is like a friend of mine. “Never met a bad bottle in his life:)”

I also think this plot is the right plot. I tried geom_smooth but all that it did was make the model look good, because it averaged everything. That is the problem with this model. It is placing all of the quality scores closer to the mean of the distribution, than what the wine experts said. With wine we are mostly interested in the ends of the distribution. That is avoiding bad bottles, making do with medium bottles, and looking to enjoy excellent wines. This model would tell you that almost everything is OK, nothing is bad, and nothing is good. Almost the exact opposite of what we are looking for.

Final Plots and Summary

Plot One - Outliers Can Cause Issues!

Description One

The reason why I chose the above plot was for the striking difference that this created in the correlation between sulphates and quality. Truly an excellent example of the deleterious effect outliers can have on potential conclusions. By subsetting the data to remove the outliers for sulphates, this also ultimately improved our linear model!

Plot Two - Correlated Factors and Effect On Quality

Description Two

I chose this plot because it clearly shows the factors that are correlated with quality. This is accomplished by seeing the median of the boxplot move in the appropriate direction. The vertical and horizontal lines (means of factors) give the viewer an instant idea of what is going on with the data by creating implicit quality quadrants. I had toyed with the idea of coloring the quadrants. However, I like the clean visual look of this plot.

Plot Three

Description Three - In the Pink

By plotting alcohol and citric acid on a scatter plot, colored by quality, the eye is drawn to the upper right quadrant where the blues congregate (quality scores 5-7) and the pinks are almost exclusively located (quality scores 7-8).

Reflection

I learned something about wine. It seems, based on this exploration, that better quality wines, are associated with (relative to the mean), higher alcohol content, lower volatile acidity, higher citric acid, and higher sulphates. I would have guessed that some of these would have an effect, but … I would not have been very accurate about the effect. For example, I thought sulphates were just bad!

I was surprised by the dramatic effect on correlation that removing sulphate outliers had on correlation. A graphic lesson to learn about outliers. The log10 histograms were interesting the way they created normal distributions for several factors. The use of the “normal” and the log10 histograms were helpful in terms of identifying anomalies in the distributions (e.g. bimodal peaks). At one point I had created a bimodal categorical variable for Free Sulphur Dioxide (FSD). However, I was not able to use this (FSD_bmd) engineered factor in any useful way so I removed it. This says to me, just because a data anomaly is there, does not mean it is useful.

As always, there is the question about data. We simply do not know the quality of this dataset. It could be stellar, but I have no way of knowing. My major concern is that 3 expert wine tasters is FAR from the 30 that statistics would like us to use to approach the bell curve (hopefully). There are a lot of wine tasting competitions. Eventually one could build that large of a dataset (minimum of 30 expert tasters). Hopefully this would help with the linear model.

It is likely there is a strong correlation between the age of the wine and the sul group (sulphates, FSD, and TSD) that is important. Having the age of the wine added to the dataset, should be an important factor to consider when building a predictive model for wine quality.

This project is also a good example of where linear models are not sufficient for some purposes. The model I produced was truly poor at predicting quality in a useful way. Sommeliers have nothing to fear from linear models for some time! My assumption is that a method such as random forest would do a more useful prediction job. Hopefully we will have the opportunity to use some of these better models in the future for these types of datasets!