Red Wine Quality Dataset Exploration by Lindsay Moir

This report explores a “tidy data set that contains 1,599 red wines with 11 variables on the chemical properties of that wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).” (Source: Udacity Data Set options). The provided dataset has 13 columns: the 11 chemical properties, the quality rating, and an observation number (X) that is the same as the row number.

No code was copied verbatim from digital (or non-digital) sources. However, I made extensive use of Google, Stack Overflow, etc. for reference, ideas, and research. There is considerable distance between these sources and the final report provided here.

Factors in Dataset

I have changed some of the default names in the dataset to be more compact. The long names make many of the plots difficult to read.

  1. Fixed acidity (fx.acidity) - many red wine acids are stable and do not evaporate easily.
  2. Volatile acidity (v.acidity) - high levels of acetic acid give an unpleasant vinegar taste.
  3. Citric acid (citric.acid) - adds hints of flavour to the wine.
  4. Residual sugar (res.sugar) - sugar remaining after fermentation ends.
  5. Chlorides (chlorides) - salts in the wine.
  6. Free Sulphur Dioxide (FSD) - SO2 in free form prevents bacterial growth and oxidation of the wine.
  7. Total Sulphur Dioxide (TSD) - both free and bound forms. Unpleasant in high concentrations.
  8. Density (density) - density relative to water. Depends on alcohol and sugar content.
  9. pH (pH) - standard acidic to basic scale.
  10. Sulphates (sulphates) - added to wine to stop it from going bad. Can be counter to wine quality if too much is added.
  11. Alcohol (alcohol) - percent alcohol in the wine.
  12. Quality (quality) - rated by at least 3 wine experts, in this case based on taste, smell, and look.
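The renaming described above can be sketched roughly as follows. The long column names on the left are my assumption about what the raw file uses (they are not shown in this report), and the three rows are simply the first observations from the structure output further down, with density as displayed there.

```r
# Assumed original column names (hypothetical: not shown in this report).
raw_names <- c("X", "fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality")

# First three observations, copied from the str() output in this report.
wine <- data.frame(
  1:3, c(7.4, 7.8, 7.8), c(0.70, 0.88, 0.76), c(0, 0, 0.04),
  c(1.9, 2.6, 2.3), c(0.076, 0.098, 0.092), c(11, 25, 15),
  c(34, 67, 54), c(0.998, 0.997, 0.997), c(3.51, 3.20, 3.26),
  c(0.56, 0.68, 0.65), c(9.4, 9.8, 9.8), c(5L, 5L, 5L)
)
names(wine) <- raw_names

# Lookup table: long name -> compact name used in the rest of the report.
short_names <- c(fixed.acidity        = "fx.acidity",
                 volatile.acidity     = "v.acidity",
                 residual.sugar       = "res.sugar",
                 free.sulfur.dioxide  = "FSD",
                 total.sulfur.dioxide = "TSD")

# Replace only the columns that appear in the lookup table.
hits <- names(wine) %in% names(short_names)
names(wine)[hits] <- unname(short_names[names(wine)[hits]])

str(wine)
```

Columns that are not in the lookup table (for example pH and alcohol) keep their original names.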

The dependent variable is ‘quality’. That is, the question is which factors within this dataset affect the quality of the wine. The independent variables are the remaining factors.

Univariate Plots Section

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fx.acidity : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ v.acidity  : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid: num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ res.sugar  : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides  : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ FSD        : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ TSD        : num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density    : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH         : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates  : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol    : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality    : int  5 5 5 6 5 5 5 7 7 5 ...
##        X            fx.acidity      v.acidity       citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##    res.sugar        chlorides            FSD             TSD        
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00   Min.   :  6.00  
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00   1st Qu.: 22.00  
##  Median : 2.200   Median :0.07900   Median :14.00   Median : 38.00  
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87   Mean   : 46.47  
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00   3rd Qu.: 62.00  
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00   Max.   :289.00  
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
  1. X is the observation number and is identical to the row number. We will need X in the linear model section. X is an integer.
  2. Quality (the dependent variable) is stored as a number, but it is really an ordered categorical variable created by expert wine tasters. The values range from 3 to 8, with a mean of 5.6 and a median of 6. The prediction model we will create should fit better if quality is treated as a categorical factor, since the model can then capture any non-linearity. However, a factor is more difficult to work with. To have the best of both worlds, we will leave quality as numeric for now and later add a categorical factor that mirrors it (e.g. quality_factor) as required.
  3. All remaining variables are continuous and numeric.
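Keeping quality numeric while adding a mirrored factor could look something like the sketch below. Making the factor ordered, and fixing its levels to the observed range 3 to 8, are my assumptions here; the ten values are the first ten quality ratings from the structure output above.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
wine <- data.frame(quality = quality)

# Mirror quality as an ordered factor; the numeric column is left untouched.
# Levels 3:8 match the min and max in the summary output (an assumption
# that no other ratings occur).
wine$quality_factor <- factor(wine$quality, levels = 3:8, ordered = TRUE)

# The numeric column still supports arithmetic...
mean(wine$quality)

# ...while the factor gives per-category counts, including empty levels.
table(wine$quality_factor)
```

With both columns present, numeric summaries and correlations can use quality, while grouping and colouring can use quality_factor.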

What is/are the main feature(s) of interest in your dataset?

Let's start by looking at histograms of all of the factors.
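A histogram for one factor can be built along these lines. This is a minimal sketch using base R graphics, which is an assumption on my part since the plotting code behind the figures is not shown here. The ten values are the first ten fx.acidity observations from the structure output above, and the bin range of 4 to 16 is chosen to cover the minimum (4.6) and maximum (15.9) in the summary.

```r
# First ten fixed acidity values, copied from the str() output above.
fx.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Compute the bins without drawing, so the counts can be inspected.
# Bin width of 1 is an illustrative choice; a different width changes
# how peaked the distribution looks.
h <- hist(fx.acidity, breaks = seq(4, 16, by = 1), plot = FALSE)
h$counts

# Draw the histogram from the precomputed bins.
plot(h, main = "Fixed acidity", xlab = "fx.acidity")
```

Repeating the same call for each column gives the full set of univariate plots.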

The histogram for fixed acidity shows a somewhat right skewed distribution with a long tail and quite a few outliers. We will need to trim those outliers when we work with this factor.

The histogram for volatile acidity shows a lot of outliers to the right. If those are trimmed, the distribution is roughly normal. The peaking we show here disappears with a larger bin size. We will need to trim those outliers when we work with this factor.
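One common way to trim outliers is Tukey's 1.5 x IQR rule. The rule actually used later in this report is not stated at this point, so the sketch below is only an illustration of the idea. The quartiles are taken from the summary output above for v.acidity, and the helper name tukey_fences is hypothetical.

```r
# Hypothetical helper: lower and upper Tukey fences from the quartiles.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles for v.acidity from the summary output above.
fences <- tukey_fences(q1 = 0.39, q3 = 0.64)
fences

# First ten v.acidity values, copied from the str() output above.
v.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Keep only the values inside the fences.
trimmed <- v.acidity[v.acidity >= fences[["lower"]] &
                     v.acidity <= fences[["upper"]]]

# The dataset maximum from the summary (1.58) would fall outside the fence.
1.58 > fences[["upper"]]
```

A fixed upper quantile cut-off (for example dropping the top 1%) is another option; which one suits a given factor depends on how long its tail is.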

Citric acid has a VERY unusual distribution. There are peaks at 0, .25, and .5. The distribution is very right skewed. There are a lot of outliers on the right also.

Residual sugar: a relatively normal distribution apart from an exceptionally long tail to the right. Many outliers will need to be trimmed.

Chlorides: a relatively normal distribution apart from an exceptionally long tail to the right. Many outliers will need to be trimmed.

Free sulphur dioxide: a right skewed distribution with a long tail and many outliers. Quite typical of this dataset.

Total sulphur dioxide: a right skewed distribution with a long tail and many outliers. Quite typical of this dataset.

Density: finally, a very normal looking distribution with just a few outliers on either side!

pH: a very normal looking distribution with just a few outliers on either side!

Sulphates: a bit of a right tail, with a lot of outliers.

Alcohol: a right skewed distribution with a lot of outliers on both the left and right.

Quality: a fairly normal distribution, but obviously there are not a lot of bins. This is essentially a categorical variable.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I have reviewed all 12 of the features in the histograms above. The one intriguing feature is citric acid, which is right skewed (on a non-log scale) with 3 peaks at 0, .25, and .5. This is interesting: vintners quite often talk about flavours or smells that have to do with citrus fruits. We will keep this in mind as we explore this dataset.

Did you create any new variables from existing variables in the dataset?

In the multivariate section we will add two additional variables as factors based on quality. This is so that we can visualize the data with quality as an added dimension.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

This is a very tidy dataset. There are some anomalies as noted above. Residual sugar, FSD, TSD, sulphates, citric acid, and alcohol are all long tailed (right skewed). Let’s put these on a log scale and see what they do.
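As a rough sketch of what a log scale does to these factors, the snippet below applies log10 to the first ten res.sugar and citric.acid values from the structure output above. Using log10 rather than the natural log is an assumption. One thing to watch for is that citric acid contains exact zeros, and log10(0) is -Inf, so those observations drop out of a log-scaled plot.

```r
# First ten values of each factor, copied from the str() output above.
res.sugar   <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# The log transform pulls in the long right tail (6.1 is much closer to
# the rest on the log scale than on the raw scale).
log_sugar <- log10(res.sugar)
round(log_sugar, 2)

# Zeros map to -Inf; count how many observations would be lost.
log_citric <- log10(citric.acid)
sum(!is.finite(log_citric))
```

If losing the zeros matters, one alternative is to keep citric acid on its original scale and only log-transform the strictly positive factors.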