HIGHER EDUCATION STUDENTS PERFORMANCE EVALUATION
INTRO TO PROBLEM:
According to the Law Insider dictionary, a higher education student is any person enrolled at a higher education institution, whether at the short-cycle, bachelor, master, or doctoral level or equivalent, or any person who has recently graduated from such an institution. These students' performance can be influenced by many different factors, such as how frequently they attend class, their reading frequency, their exam preparation methods, note taking, and how closely they listen in class. In this project, I will carry out a machine learning prediction analysis: I will try to find out which factors really affect students' performance and whether those factors can be used for prediction. I will be exploring questions like:
Can we predict the students' grade based on certain features?
Which feature best classifies students as passing or failing?
DATA
For this project, I will be looking at a dataset collected in 2019 from students of the Faculty of Engineering and the Faculty of Educational Sciences. The dataset is from Kaggle. It will be used to predict student performance (a passing or failing grade) at the end of a semester. The dataset contains 32 attributes for 145 students. The features I will be exploring are as follows:
STUDENTID
STUDY_HRS: Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours)
READ_FREQ: Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)
ATTEND: Attendance to classes (1: always, 2: sometimes, 3: never)
PREP_EXAM: Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never)
NOTES: Taking notes in classes: (1: never, 2: sometimes, 3: always)
LISTENS: Listening in classes: (1: never, 2: sometimes, 3: always)
PRE-PROCESSING
The dataset I decided to use requires some cleaning. Using the pandas library, I opened the CSV file in my notebook to inspect the data. After carefully scrutinizing the dataset, I chose 8 columns (STUDENTID, STUDY_HRS, READ_FREQ, ATTEND, PREP_EXAM, NOTES, LISTENS, and GRADE) to work with and removed the 25 columns I would not need to predict whether a student passes or fails. This makes the dataset more manageable in size and keeps only relevant data. The next step was to check for null values; luckily, the dataset didn't have any.
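A minimal sketch of this cleaning step. The stand-in rows below are invented so the snippet runs on its own, and the commented-out file name is an assumption; only the column names come from the report:

```python
import pandas as pd

# In the real project the Kaggle CSV would be loaded, e.g.:
# df = pd.read_csv("student_prediction.csv")   # file name assumed
# Tiny stand-in frame with the report's columns (values follow the codings above):
df = pd.DataFrame({
    "STUDENTID": [1, 2, 3, 4],
    "STUDY_HRS": [2, 3, 1, 4],
    "READ_FREQ": [1, 2, 3, 2],
    "ATTEND":    [1, 1, 2, 1],
    "PREP_EXAM": [1, 2, 1, 3],
    "NOTES":     [2, 3, 1, 3],
    "LISTENS":   [2, 3, 1, 2],
    "GRADE":     [3, 5, 0, 7],
    "UNUSED":    [0, 0, 0, 0],   # stands in for the 25 dropped attributes
})

keep = ["STUDENTID", "STUDY_HRS", "READ_FREQ", "ATTEND",
        "PREP_EXAM", "NOTES", "LISTENS", "GRADE"]
df = df[keep]                        # keep only the 8 relevant columns

print(df.isnull().sum().sum())       # 0 -> no missing values
```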
After cleaning, I moved on to making the data ready for classification. Since I wanted to predict students' performance, I picked "GRADE" as the target value and checked its value counts. It turned out to have 7 distinct values with different counts. I figured that converting it to a binary value would make classification easier, so I added one more column named "Pass/Fail". This splits the grades in two: Pass ("1") for students with a grade of D or above, and Fail ("0") for those with an "F". I checked the Pass/Fail value counts; while the distribution is skewed, it is not too bad, so I decided to use it as the target value.
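Assuming GRADE is coded numerically from 0 (Fail) upward, as in the Kaggle data, the binary target can be derived in one line (a sketch, not the exact notebook code):

```python
import pandas as pd

# Sample of the assumed 0-7 GRADE coding (0 = Fail, 7 = AA)
grades = pd.DataFrame({"GRADE": [0, 1, 3, 4, 5, 7]})

# Initial rule: D and above (GRADE > 0) passes, F (GRADE == 0) fails
grades["Pass/Fail"] = (grades["GRADE"] > 0).astype(int)
print(grades["Pass/Fail"].value_counts())
```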
Revised pre-processing: after reaching the modeling and evaluation stage, I learned that using "0" as the threshold made the precision and recall scores very poor. Therefore, I went back and changed the threshold to 4, which means students with a grade of "CB" or above are considered to pass and those with "CC" or below are considered to fail.
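With the same assumed 0-7 grade coding, the revised rule only moves the cutoff so that a value of 4 ("CB") and above counts as a pass:

```python
import pandas as pd

grades = pd.DataFrame({"GRADE": [0, 1, 3, 4, 5, 7]})  # assumed 0-7 coding

# Revised rule: CB and above (GRADE >= 4) passes, CC and below fails
grades["Pass/Fail"] = (grades["GRADE"] >= 4).astype(int)
print(grades["Pass/Fail"].tolist())
```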
I can now move forward to data understanding and visualization.
DATA UNDERSTANDING/VISUALIZATION

To understand my dataset, I used seaborn to draw a pair plot. It helped me look for strong relationships among my set of features. Unfortunately, as the figure shows, my dataset does not contain any particularly strong relationships, but we can work with it.

To double-check my work, I called corr() on my dataset to see the relationships numerically.
It is clear that the correlations between my features are very low, as most values are close to 0. The highest correlation is 0.889, between GRADE and Pass/Fail.
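The numeric check is a single call; the frame below is synthetic, so only the GRADE-to-Pass/Fail relationship (high by construction, since one is derived from the other) mirrors the report's 0.889:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                 # synthetic stand-in data
df = pd.DataFrame({
    "NOTES":     rng.integers(1, 4, 145),
    "READ_FREQ": rng.integers(1, 4, 145),
    "GRADE":     rng.integers(0, 8, 145),
})
df["Pass/Fail"] = (df["GRADE"] >= 4).astype(int)  # derived from GRADE

corr = df.corr()                               # pairwise Pearson correlations
print(corr.round(3))
print(corr.loc["GRADE", "Pass/Fail"])          # strongest pair, as in the report
```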

To visualize the correlation table more easily, I used a heatmap from seaborn.
It again shows that GRADE and Pass/Fail have the highest correlation, which is expected because Pass/Fail was created from GRADE.
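The heatmap is drawn directly from the correlation matrix; again a synthetic frame stands in for the real data so the sketch is self-contained:

```python
import matplotlib
matplotlib.use("Agg")                  # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)         # synthetic stand-in data
df = pd.DataFrame({
    "NOTES": rng.integers(1, 4, 145),
    "GRADE": rng.integers(0, 8, 145),
})
df["Pass/Fail"] = (df["GRADE"] >= 4).astype(int)

# Annotated heatmap of the correlation matrix, fixed to the [-1, 1] scale
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.savefig("heatmap.png")
```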
This would be a problem in the modeling section: keeping GRADE in the train/test set would mean handing the model the answer. Therefore, I dropped the GRADE column.
Based on the heatmap, the best features for classifying the students are READ_FREQ and NOTES, judging by their correlation with Pass/Fail.
We can see the barplot of the relationship between NOTES and Pass/Fail, and READ_FREQ and Pass/Fail as follows:

Taking notes in classes: (1: never, 2: sometimes, 3: always)

Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)
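The two bar plots above can be sketched side by side; with seaborn's barplot, each bar shows the mean Pass/Fail value (i.e., the pass rate) per feature level. The data here is synthetic so the figure renders on its own:

```python
import matplotlib
matplotlib.use("Agg")                  # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)         # synthetic stand-in data
df = pd.DataFrame({
    "NOTES":     rng.integers(1, 4, 145),
    "READ_FREQ": rng.integers(1, 4, 145),
    "Pass/Fail": rng.integers(0, 2, 145),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.barplot(data=df, x="NOTES", y="Pass/Fail", ax=axes[0])      # pass rate per note-taking level
sns.barplot(data=df, x="READ_FREQ", y="Pass/Fail", ax=axes[1])  # pass rate per reading frequency
fig.savefig("barplots.png")
```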
MODELING
DECISION TREE
For this project, I will be using a decision tree because my dataset is discrete. I also prefer a decision tree because I want to understand exactly why the model makes the decisions it does with my data. Decision trees help me visualize the data and understand it easily.
The other major reason I prefer a decision tree is that the algorithm decides which feature and condition to use at each split. In my case this is an advantage, because my dataset's features have low correlations with the target, which would make choosing them manually difficult.
I will implement the decision tree models using the scikit-learn library. To model the data, I first split it into training and test sets, using a test size of 0.3.
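A sketch of the split-and-fit step with scikit-learn; the frame is synthetic (GRADE and STUDENTID are already dropped, per the pre-processing above), so the printed accuracy will not match the report's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)           # synthetic stand-in for the 145 students
df = pd.DataFrame({
    "STUDY_HRS": rng.integers(1, 6, 145),
    "READ_FREQ": rng.integers(1, 4, 145),
    "ATTEND":    rng.integers(1, 4, 145),
    "PREP_EXAM": rng.integers(1, 4, 145),
    "NOTES":     rng.integers(1, 4, 145),
    "LISTENS":   rng.integers(1, 4, 145),
    "Pass/Fail": rng.integers(0, 2, 145),
})

X = df.drop(columns=["Pass/Fail"])       # features only
y = df["Pass/Fail"]                      # binary target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)          # accuracy on the held-out 30%
print(f"accuracy: {acc:.2f}")
```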
Based on the decision tree, the most effective feature for classifying the data is NOTES, since it sits at the root node. The second-best feature is READ_FREQ. The tree has a depth of 11, with 12 conditions considered.
The Decision Tree looked as follows:
DECISION TREE MODEL

(See my code for the full decision tree.)
The decision tree uses the Gini impurity metric for the initial model, and it produces an accuracy score of 0.59, meaning the model correctly classifies 59.0% of the test data. I would also like to see how the model's accuracy changes at different depths. Here is the result:

GINI IMPURITY
From the result, the accuracy score across different depths stayed in the range 0.56-0.659. The lowest value occurs at depth 4.

ENTROPY
I wanted to try the entropy metric to see if it would make a difference in accuracy. The result looks like the following:
It appears that the entropy metric is slightly more accurate, as the accuracy stayed between 0.59 and 0.659. Unlike with the Gini metric, depth 4 happens to give the highest accuracy here.
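The depth sweep for both criteria amounts to refitting the tree with a varying max_depth; this sketch (on synthetic stand-in data, so the numbers will differ from the report's) follows the pattern referenced in the how-to link at the end:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)           # synthetic stand-in data
df = pd.DataFrame({
    "NOTES":     rng.integers(1, 4, 145),
    "READ_FREQ": rng.integers(1, 4, 145),
    "STUDY_HRS": rng.integers(1, 6, 145),
    "Pass/Fail": rng.integers(0, 2, 145),
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["Pass/Fail"]), df["Pass/Fail"],
    test_size=0.3, random_state=42)

# Accuracy at each depth, for each split criterion
for criterion in ("gini", "entropy"):
    scores = []
    for depth in range(1, 12):           # the full tree reached depth 11
        clf = DecisionTreeClassifier(criterion=criterion,
                                     max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        scores.append(round(clf.score(X_test, y_test), 3))
    print(criterion, scores)
```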
Next, let's look at our classification report.
EVALUATION
0 represents fail and 1 represents pass. Precision tells us how often the model is right when it predicts a given class. In our model, precision was no better for the passing students (0.50) than for the failing students (0.62). Recall tells us what fraction of each class the model identifies correctly; here, passing students were correctly predicted less often than failing students. The F1 score is the harmonic mean of precision and recall, and at 0.70 for the failing class it confirms that failing students are predicted better.
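The per-class precision, recall, and F1 numbers above come from a single scikit-learn call; a self-contained sketch on synthetic data (so the printed scores will not match the report's):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)           # synthetic stand-in data
df = pd.DataFrame({
    "NOTES":     rng.integers(1, 4, 145),
    "READ_FREQ": rng.integers(1, 4, 145),
    "STUDY_HRS": rng.integers(1, 6, 145),
    "Pass/Fail": rng.integers(0, 2, 145),
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["Pass/Fail"]), df["Pass/Fail"],
    test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Precision, recall, and F1 per class (0 = fail, 1 = pass)
report = classification_report(y_test, clf.predict(X_test),
                               target_names=["fail", "pass"])
print(report)
```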

STORY TELLING
This project helped me gain an understanding of using a dataset to make predictions. Even though my dataset wasn't the best to work with, it taught me how to prepare a complex dataset for modeling. I also gained good insight into the decision tree algorithm and how the two metrics (Gini index and entropy) produce different accuracy scores. I already suspected that my model's accuracy would not be great while I was in the data understanding/visualization stage: the heatmap made it easy to see the correlations between the features and the target, and it was clear that my dataset did not have strong correlations. Therefore, I was not able to answer my questions with confidence. I was able to tell that the best feature for classifying students as passing or failing was note taking, even with the threshold set at a grade of "CB" ("AA"-"CB" is pass and "CC"-"FF" is fail).

REFERENCE
Definition: https://www.lawinsider.com/dictionary/higher-education-student
DataSet from Kaggle: https://www.kaggle.com/csafrit2/higher-education-students-performance-evaluation
Evaluation method: https://towardsdatascience.com/how-to-best-evaluate-a-classification-model-2edb12bcc587
How to use Gini and Entropy metrics to see depth accuracy: https://github.com/calvinhathcock/classification-project/blob/main/classifcation.ipynb