This report explores a “tidy data set that contains 1,599 red wines with 11 variables on the chemical properties of that wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).” (Source: Udacity Data Set Options.) The provided dataset has 13 variables in total: the 11 chemical properties, the quality rating, and the observation number (X), which is the same as the row number.

No code was copied verbatim from digital or non-digital sources. However, Google, Stack Overflow, and similar resources were used extensively for reference, ideas, and research. There is considerable distance between these sources and the final report provided here.

Factors in Dataset

I have shortened some of the default variable names in the dataset because the long names make many of the plots difficult to read. The compact names are shown in parentheses.

- Fixed acidity (fx.acidity) - many red wine acids are stable and do not evaporate easily.
- Volatile acidity (v.acidity) - high levels of acetic acid (vinegar taste) are undesirable.
- Citric acid (citric.acid) - adds hints of flavour to the wine.
- Residual sugar (res.sugar) - remaining sugar after fermentation ends.
- Chlorides (chlorides) - salts in the wine.
- Free Sulphur Dioxide (FSD) - SO2 in free form prevents bacterial growth and oxidation of the wine.
- Total Sulphur Dioxide (TSD) - both free and bound versions. In high concentrations is unpleasant.
- Density (density) - relative density to water. Depends on alcohol and sugar content.
- pH (pH) - standard scale from acidic to basic.
- Sulphates (sulphates) - added to wine to stop it from going bad. Too much can detract from wine quality.
- Alcohol (alcohol) - percent alcohol in the wine.
- Quality (quality) - rating from at least 3 wine experts, based on taste, smell, and look.
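As a rough illustration of the renaming described above, here is a minimal sketch. The original long column names (`fixed.acidity`, `volatile.acidity`, and so on) are an assumption, since the report only shows the shortened names, and the three-row data frame is a stand-in rather than the real file.

```r
# Stand-in data frame. The long column names are ASSUMED originals;
# the report itself only shows the compact names.
wine <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  residual.sugar = c(1.9, 2.6, 2.3),
  free.sulfur.dioxide = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54)
)

# Lookup table: assumed old name -> compact name used in this report.
short_names <- c(
  fixed.acidity = "fx.acidity",
  volatile.acidity = "v.acidity",
  residual.sugar = "res.sugar",
  free.sulfur.dioxide = "FSD",
  total.sulfur.dioxide = "TSD"
)

# Rename only the columns found in the lookup; leave the others untouched.
hits <- names(wine) %in% names(short_names)
names(wine)[hits] <- unname(short_names[names(wine)[hits]])

str(wine)
```

Using a lookup vector keeps the mapping in one place, so a column that is missing from the file is simply skipped rather than causing an error.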

The dependent variable is ‘quality’: the question is which factors in this dataset affect the quality of the wine. The remaining chemical properties are the independent variables.

```
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fx.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ v.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid: num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ res.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ FSD : num 11 25 15 17 11 13 15 15 9 17 ...
## $ TSD : num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
```

```
## X fx.acidity v.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## res.sugar chlorides FSD TSD
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
```

- X is the observation number and is identical to the row number. It is an integer, and we will need it in the linear model section.
- Quality (the dependent variable) is stored as an integer, but it is really a categorical variable created by expert wine tasters. The values range from 3 to 8, with a mean of 5.6 and a median of 6. The prediction model we will create should fit better if quality is treated as a categorical factor, because that change should let the model capture any non-linearity. However, a factor is more difficult to work with. To have the best of both worlds, we will leave quality as numeric for now and later add a categorical factor that mirrors it (e.g. quality_factor) as required.
- All remaining variables are continuous and numeric.
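The idea of keeping quality numeric while adding a mirrored factor can be sketched as follows. This is only an illustration: the ten quality values are the ones shown in the `str()` output, and the choice of an ordered factor with levels 3 to 8 is an assumption about how `quality_factor` might be built.

```r
# First ten quality ratings, as shown in the str() output above.
wine <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Keep the numeric column and add a mirrored factor next to it.
# Levels 3..8 cover the observed range; ordered = TRUE is an assumption
# that preserves the ranking of the ratings.
wine$quality_factor <- factor(wine$quality, levels = 3:8, ordered = TRUE)

# The numeric column still supports arithmetic summaries ...
mean(wine$quality)

# ... while the factor gives per-category counts (including empty levels).
table(wine$quality_factor)
```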

Let's start by looking at histograms of all of the variables.
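One way such a set of histograms could be produced is sketched below with base R graphics. The helper name `plot_histograms`, the default of 30 breaks, and the three-column layout are all hypothetical choices, not taken from the report, and the data frame holds only the first ten rows of a few columns from the `str()` output.

```r
# First ten rows of a few columns, copied from the str() output above.
wine <- data.frame(
  X = 1:10,
  fx.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  v.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: one histogram per numeric column, skipping the
# row-number column X. Returns the histogram objects invisibly.
plot_histograms <- function(df, skip = "X", breaks = 30) {
  cols <- setdiff(names(df)[sapply(df, is.numeric)], skip)
  old_par <- par(mfrow = c(ceiling(length(cols) / 3), 3))
  on.exit(par(old_par))  # restore the plotting layout afterwards
  invisible(lapply(setNames(cols, cols), function(col) {
    hist(df[[col]], breaks = breaks, main = col, xlab = col)
  }))
}

plot_histograms(wine)
```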

The histogram for fixed acidity shows a somewhat right-skewed distribution with a long tail and quite a few outliers. We will need to trim those outliers when we work with this variable.

The histogram for volatile acidity shows a lot of outliers to the right. If those are trimmed, the distribution is roughly normal. The peaks visible at this bin size disappear with a larger bin size. We will need to trim those outliers when we work with this variable.
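A minimal sketch of trimming right-hand outliers and comparing bin sizes is shown below. The helper `trim_upper` and the quantile cutoff are hypothetical; the report does not say which trimming rule it uses. The sample combines the first ten volatile acidity values from the `str()` output with the maximum from the `summary()` output, so that there is one clear outlier to remove.

```r
# Hypothetical helper: drop values above a chosen upper quantile.
trim_upper <- function(x, prob = 0.99) {
  x[x <= quantile(x, prob, na.rm = TRUE)]
}

# First ten v.acidity values from str(), plus the maximum (1.58) from summary().
v.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5, 1.58)

# With so few points a 95% cutoff is used here purely for illustration.
trimmed <- trim_upper(v.acidity, prob = 0.95)

# Compare a narrow and a wide bin width on the trimmed data.
old_par <- par(mfrow = c(1, 2))
hist(trimmed, breaks = seq(0, 1, by = 0.02), main = "bin width 0.02", xlab = "v.acidity")
hist(trimmed, breaks = seq(0, 1, by = 0.10), main = "bin width 0.10", xlab = "v.acidity")
par(old_par)
```

A quantile cutoff is only one option; a fixed threshold or an interquartile-range rule would serve the same purpose.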

Citric acid has a VERY unusual distribution. There are peaks at 0, 0.25, and 0.5. The distribution is strongly right-skewed, and there are also a lot of outliers on the right.

Residual sugar: a relatively normal distribution apart from an exceptionally long tail to the right. Many outliers will need to be trimmed.

Chlorides: a relatively normal distribution apart from an exceptionally long tail to the right. Many outliers will need to be trimmed.

Free sulphur dioxide (FSD): a right-skewed distribution with a long tail and many outliers. Quite typical of this dataset.

Total sulphur dioxide (TSD): a right-skewed distribution with a long tail and many outliers. Quite typical of this dataset.

Density: finally, a very normal-looking distribution with just a few outliers on either side!

pH: a very normal-looking distribution with just a few outliers on either side!

Sulphates: a bit of a right tail, with a lot of outliers.

Alcohol: a right-skewed distribution with a lot of outliers on both the left and right.

Quality: a fairly normal distribution, but obviously there are not a lot of bins. This is essentially a categorical variable.

I have reviewed all 12 of the features in the histograms above. The most intriguing feature is citric acid, which is right-skewed (on a non-log scale) and has three peaks at 0, 0.25, and 0.5. This is interesting because vintners quite often talk about flavours or smells that have to do with citrus fruits. We will keep this in mind as we explore this dataset.

In the multivariate section we will add two additional variables as factors based on quality. This is so that we can visualize the data with quality as an added dimension.

This is a very tidy dataset, although there are some anomalies as noted above. Residual sugar, FSD, TSD, sulphates, citric acid, and alcohol are all long-tailed (right-skewed). Let’s put these on a log scale and see what they do.
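The log-scale step might look like the sketch below, which uses the first ten residual sugar and citric acid values from the `str()` output. One caveat follows directly from the `summary()` output: citric acid has a minimum of 0, so a plain `log10()` produces `-Inf` for those rows. Dropping the zeros before plotting, as done here, is just one possible workaround and is an assumption, not necessarily what the report does.

```r
# First ten values of two right-skewed variables, from the str() output.
res.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Strictly positive variables can be transformed directly.
log_sugar <- log10(res.sugar)

# Zeros become -Inf on a log scale, so count them first.
n_infinite <- sum(is.infinite(log10(citric.acid)))

# Assumed workaround: keep only the strictly positive values.
log_citric <- log10(citric.acid[citric.acid > 0])

old_par <- par(mfrow = c(1, 2))
hist(log_sugar, main = "log10(res.sugar)", xlab = "log10(res.sugar)")
hist(log_citric, main = "log10(citric.acid), zeros dropped", xlab = "log10(citric.acid)")
par(old_par)
```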