
WINE CLUSTERING
Introduce the Problem
The goal of this project is to get some experience working with clustering. I will explain what clustering is and how it works, and I will also explain the two types of clustering I will demonstrate. I will be working with a wine dataset, and the problem I would like to solve is to group wine into strong and weak groups: I am going to cluster the wine into two groups based on alcohol strength.
What is clustering?
​
Clustering is unsupervised machine learning that divides data points into a number of groups such that the data points in the same group are more similar to each other than to the data points in other groups. Simply put, the goal of clustering is to group data with similar traits into one cluster.
K-means Clustering
K-means is an iterative clustering algorithm that aims at each iteration to improve a local optimum of its objective, the within-cluster sum of squared distances. The algorithm works in these six steps (a code sketch follows the list):
Specify the desired number of clusters K: let us choose k=2 for these 5 data points in 2-D space.
Randomly assign each data point to a cluster: let's assign three points to cluster 1, shown in red, and two points to cluster 2, shown in grey.
Compute cluster centroids: the centroid of the data points in the red cluster is shown with a red cross, and that of the grey cluster with a grey cross.
Re-assign each point to the closest cluster centroid: note that the data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the grey cluster, so we re-assign it to the grey cluster.
Re-compute cluster centroids: now we re-compute the centroids for both clusters.
Repeat steps 4 and 5 until no improvements are possible: we repeat steps 4 and 5 until the assignments converge to a local optimum. When no data points switch between the two clusters for two successive repeats, the algorithm terminates.
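To make these steps concrete, here is a minimal sketch of the algorithm in Python with NumPy. The five toy points and k=2 mirror the walkthrough above and are purely illustrative; this sketch does not handle empty clusters.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly assign each data point to one of the k clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Steps 3 and 5: compute the centroid of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: re-assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when no point switches clusters between repeats
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Five 2-D points and k=2, as in the walkthrough
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])
labels, centroids = kmeans(X, k=2)
print(labels)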
Hierarchical Clustering
Hierarchical clustering, in its agglomerative (bottom-up) form, is an algorithm that builds a hierarchy of clusters, as the name suggests. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into one. The algorithm terminates when there is only a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram, as in the sketch below.
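As a quick illustration, here is a minimal sketch using SciPy to build and plot a dendrogram; the random toy data stands in for a real dataset.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data, not the wine set
Z = linkage(X, method="ward")  # repeatedly merges the two nearest clusters
dendrogram(Z)                  # draw the merge hierarchy
plt.show()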
Difference between K-means and Hierarchical clustering
Hierarchical clustering can't handle big data as well as K-means clustering can. This is because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
In K-means clustering, since we start with a random choice of clusters, the results produced by running the algorithm multiple times might differ, while the results of hierarchical clustering are reproducible.
K-means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
Introduce the data
I will be working with a wine data set I found on Kaggle. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. This dataset is adapted from the Wine Data Set from https://archive.ics.uci.edu/ml/datasets/wine by removing the information about the types of wine for unsupervised learning.
The dataset includes features such as:
Alcohol - the percentage of alcohol
Malic acid - one of the main acids found in grapes, contributing to their acidity
Ash - all the relevant substances absorbed from the soil during grape ripening
Alcalinity of ash - a measure of the ash's capacity to neutralize acid
Magnesium
Total phenols - chemical compounds that affect the taste, feel, and color of wine. Most of the phenols in wine come from the pulp, skin, seeds, and stems of grapes
Flavanoids - a group of plant metabolites thought to provide health benefits through cell-signaling pathways and antioxidant effects
Nonflavanoid phenols
Proanthocyanins - condensed tannins with the capability to bind salivary proteins; they strongly influence the perceived astringency of the wine
Color intensity
Hue
OD280/OD315 of diluted wines - an optical-density measurement used to determine the protein concentration of various wines
Proline - the most abundant amino acid present in grape juice and wine
DATA UNDERSTANDING/VISUALIZATION

Now that we have covered what clustering means and what the two types of clustering are, let's try to understand the data I will be clustering. The best thing about this data set is that it came ready to be used for this project: since we are working with clustering, the type of each wine has been removed. The wine-cluster data set contains 178 rows and 13 columns, i.e. 2,314 data elements in total.
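For reference, a minimal sketch of loading the data with pandas; the file name wine-clustering.csv is my assumption about where the Kaggle download is saved.

import pandas as pd

df = pd.read_csv("wine-clustering.csv")  # assumed file name for the Kaggle data
print(df.shape)   # expect (178, 13): 178 rows, 13 columns
print(df.head())  # peek at the first few wines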
So the next thing I do is find the best features to cluster my wine dataset with. To do so, I used a pair plot to find features related to the Alcohol feature, and chose Magnesium, Proline, Color intensity, and Ash to work with. I created scatter plots to see the relationships among the four features, using Alcohol as the hue, and I could see that Ash wasn't the best for grouping the data by alcohol strength. From the Magnesium vs. Proline scatter plot, we can see that the lower the values of the two features, the lower the alcohol strength. The same goes for the Magnesium vs. Color intensity and Proline vs. Color intensity scatter plots.
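A sketch of that pair-plot step, assuming the column names follow the Kaggle CSV headers (e.g. Color_Intensity):

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["Alcohol", "Magnesium", "Proline", "Color_Intensity", "Ash"]
sns.pairplot(df[cols], hue="Alcohol")  # color each point by its alcohol level
plt.show()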
From the above step, we can say that Magnesium, Proline, and Color intensity are good features for grouping the wine by alcohol strength. Therefore, when we model our dataset and the wine is grouped, we will know which features played an important role in the grouping.
For fun, I looked at a correlation heatmap to see how these features correlate with Alcohol, and the result was surprising: the three features I chose as best for clustering each correlate with Alcohol at a different level. From this experiment, I can say that a feature's linear correlation with Alcohol does not by itself determine how useful that feature will be for the modeling I will perform later on.
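The heatmap itself takes only a couple of lines of seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")  # pairwise correlations
plt.show()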
PRE-PROCESSING
Before starting to cluster, I want to make sure that my data set is clean. The first step in my pre-processing is to check for nulls; luckily, I have none. Next, I check the data types. All my data values are floats and ints, so we will not need to make any changes to the data types.
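A quick sketch of these two checks in pandas:

print(df.isnull().sum())  # number of missing values per column; all zeros here
print(df.dtypes)          # every column should come back as float64 or int64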
Then I wanted to make sure that what I deduced about my data in the data-understanding section is actually correct before I cluster, so I decided to add one more column to my dataset with only two values: strong in alcohol and weak in alcohol. To do this, I split my Alcohol values in two, using the mean, which was about 13, as the threshold. Now I have a new column with values 1 and 0, with 1 representing strong in alcohol and 0 representing weak in alcohol. With this additional column I redo my data visualization to prove to myself that I identified the right features, the ones most likely to affect the clustering. My visualization looks as follows:
In the above step, the benefit of using my new two-valued column is that it is easier to look at the data split into two groups than in the first visualization I did in the data-understanding section. The new scatter plots support my hypothesis that these features are good for grouping wine into two groups.
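Here is a minimal sketch of how that strong/weak column can be derived; the column name Strength is a hypothetical label of mine:

threshold = df["Alcohol"].mean()  # about 13 for this dataset
df["Strength"] = (df["Alcohol"] > threshold).astype(int)  # 1 = strong, 0 = weak
print(df["Strength"].value_counts())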
Now I can actually move to clustering my dataset.
MODELING (CLUSTERING)
I will be experimenting with K-means clustering and agglomerative clustering, and I will discuss my observations from each modeling technique. I will start with K-means clustering. In my case I will not be drawing an elbow diagram to find k, the number of clusters, since I already want my data grouped into two; I will go ahead and create my model with 2 clusters.
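A minimal sketch of the K-means model with scikit-learn; restricting it to the three assumed features and fixing random_state are my choices for illustration:

from sklearn.cluster import KMeans

features = df[["Magnesium", "Proline", "Color_Intensity"]]  # assumed feature set
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["kmeans_label"] = kmeans.fit_predict(features)  # 0/1 cluster label per wine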
My visualization looks as follows for the two determining features using K-means modeling:
Here is what my agglomerative modeling looks like:
And here is what the scatter plot looks like for the 2 features using the agglomerative modeling:
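Behind those plots, the agglomerative model can be sketched the same way, again on the assumed feature set:

from sklearn.cluster import AgglomerativeClustering

features = df[["Magnesium", "Proline", "Color_Intensity"]]  # assumed feature set
agg = AgglomerativeClustering(n_clusters=2)
df["agg_label"] = agg.fit_predict(features)  # 0/1 cluster label per wine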
Out of curiosity, I am going to try standardizing/normalizing my data and see how it affects my modeling result. After standardizing/normalizing my data, the agglomerative clusters look different, as seen above.
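A sketch of that standardizing step, assuming scikit-learn's StandardScaler (zero mean, unit variance per column) before re-clustering:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

features = df[["Magnesium", "Proline", "Color_Intensity"]]  # assumed feature set
scaled = StandardScaler().fit_transform(features)           # standardize columns
labels_scaled = AgglomerativeClustering(n_clusters=3).fit_predict(scaled)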
The model shows that clustering the wine into 3 clusters is a better fit for my data set.
The agglomerative clustering using 2 clusters
The agglomerative clustering using 3 clusters
STORYTELLING
(Clustering Analysis)
The goal of this project is to group wine, based on its features, into strong or weak alcohol groups. The consistency between the two models (before standardizing/normalizing) is a good sign that some features are valuable for clustering. Not being able to tell from the model which features are valuable makes clustering difficult to interpret, but based on the similarity of the scatter plots, I assume the valuable features that determine the alcohol strength of wine are Magnesium, Proline, and Color intensity.
Then, after standardizing/normalizing the wine data, I used agglomerative clustering again, and it showed that clustering the dataset into two isn't the best choice. Looking at visualizations using 2 and 3 clusters, however, I found that 2 clusters is also good enough.
From this project, I learned that clustering does not tell us which features were used to group the data, which makes interpretation hard, so I had to use scatter plots of the models with the features I assumed best determine wine alcohol strength. So yes, I was able to cluster the wine based on its alcohol strength. But I also learned that even though I initially wanted my data clustered into two groups, the normalized agglomerative clustering showed me that 3 clusters is a better fit for my dataset.