Correlation Analysis of Red Wine Quality

Introduction

This analysis was the final project for Data Analysis and Visualization (statistics, in R), the third of four courses in my graduate certificate in Data Analytics at Boston University's Metropolitan College. This is a summary of the project, and you can review the code here.

Objective

Red wines vary considerably: some taste more acidic, some sweeter, and some are higher in alcohol. To see what effect each property (acid, sugar, and alcohol) has on the wine, I analyzed data from an experiment in which reviewers tasted red variants of the Portuguese “Vinho Verde” wine and rated each one on a scale of 0-10 (10 being the best).

The question at hand is whether the properties of the wine (acid, sugar, alcohol content) impact the resulting quality score.

Data

This analysis uses the “Red Wine Quality” dataset: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.

Acid    Sugar    Alcohol    Quality
 7.4      1.9        9.4          5
 7.8      2.6        9.8          5
 7.8      2.3        9.8          5
11.2      1.9        9.8          6
 7.4      1.9        9.4          5

The dataset used in this analysis provides quantitative metrics of the following:

1. Acid – the fixed acidity of the wine: acids that do not evaporate readily and can affect flavor
2. Sugar – the amount of sugar remaining after fermentation stops
3. Alcohol – the percent alcohol content of the wine
4. Quality – the score given to the wine by the reviewers who assessed it

Results

As a first step, I ran a correlation analysis of the four metrics and built the grid display shown below.

[Figure: grid of pairwise scatterplots and correlation coefficients for Acid, Sugar, Alcohol, and Quality]

There are two ways to look at these correlations: by correlation coefficient and by scatterplot.

Let's first look at the correlation coefficient, which ranges from -1 to 1: values near 1 indicate a strong positive correlation, values near -1 a strong negative correlation, and values near 0 little or no linear relationship. The most highly correlated metrics are Alcohol and Quality, but at a coefficient of 0.48, the correlation is moderate at best. The remaining pairs come in at 0.12 or less, so none are highly correlated.
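To make the coefficient concrete, here is a minimal sketch of how a Pearson correlation coefficient can be computed by hand. The project itself was done in R; this Python version is only an illustration, and the names SAMPLE_ROWS and pearson_r are my own, not part of the project code. It uses just the five sample rows shown in the Data section.

```python
import math

# The five sample rows from the Data section: (acid, sugar, alcohol, quality).
SAMPLE_ROWS = [
    (7.4, 1.9, 9.4, 5),
    (7.8, 2.6, 9.8, 5),
    (7.8, 2.3, 9.8, 5),
    (11.2, 1.9, 9.8, 6),
    (7.4, 1.9, 9.4, 5),
]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


alcohol = [row[2] for row in SAMPLE_ROWS]
quality = [row[3] for row in SAMPLE_ROWS]
r = pearson_r(alcohol, quality)
print(f"Alcohol vs. Quality on the five sample rows: r = {r:.2f}")
```

Five rows are far too few to reproduce the 0.48 reported for the full dataset, so treat the printed value as a demonstration of the arithmetic only.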

Next, let's look at the scatterplots. Scatterplots can be a great tool for assessing correlation visually, but no trends stand out in this data.

The ultimate question we want to answer is whether the relationship is statistically significant. If we run a multiple linear regression of Quality on Alcohol, Acid, and Sugar, we see that, taken together, the three predictors do have a statistically significant impact on Quality Score (to be more specific, the model's F-statistic is higher than the critical value, so we reject the hypothesis that none of them is related to Quality).
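For readers curious about what the F-test mentioned above compares, here is a rough sketch of the overall F-statistic for a multiple linear regression, F = (SSR / p) / (SSE / (n - p - 1)), where p is the number of predictors and n the number of wines. This is not the project's code (that was written in R); it is a plain-Python illustration under my own assumptions, and the function names solve_linear_system and overall_f_statistic are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (predictors may be collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def overall_f_statistic(predictors, response):
    """Fit ordinary least squares with an intercept.

    Returns (F, numerator degrees of freedom, denominator degrees of freedom).
    """
    n = len(response)
    p = len(predictors[0])
    if n <= p + 1:
        raise ValueError("need more observations than model parameters")
    design = [[1.0] + list(row) for row in predictors]  # intercept column first
    k = p + 1
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, response)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(response) / n
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_residual = sum((y - f) ** 2 for y, f in zip(response, fitted))
    if ss_residual == 0:
        raise ValueError("perfect fit: the F-statistic is undefined")
    f_stat = (ss_regression / p) / (ss_residual / (n - p - 1))
    return f_stat, p, n - p - 1
```

To use it on the wine data, pass one [alcohol, acid, sugar] list per wine as predictors and the matching quality scores as response. The result is then compared with the critical value of the F distribution for the returned degrees of freedom, which can be looked up in a table or computed with a statistics library.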

Conclusion

Looking at the results as a whole, we can say that while the factors, taken together, do impact the Quality Score, Alcohol content is the factor that has the biggest impact overall.

If you want to choose a wine based purely on data trends, you may do well if you choose one with a higher alcohol content. But with a correlation of only 0.48 between Alcohol and Quality, you may nevertheless find yourself disappointed.