Analysis of seven years of General Chemistry student data

Curriculum R

Trying to make sense of student performance and identifying possible predictors of academic success.

Xavier Prat-Resina https://pratresina.umn.edu (University of Minnesota Rochester) https://r.umn.edu
08-10-2018

Abstract

Between Fall 2010 and Spring 2018 I taught the two-semester sequence of General Chemistry (GC). The way our curriculum was structured, these two semesters were usually taken during the sophomore year by students majoring in a Bachelor of Sciences in Health Sciences.

During all this time, while the chemistry content has not changed significantly, the forms of delivery and assessment have been evolving towards, hopefully, better pedagogies of engagement and towards a clearer assessment of learning objectives. Probably the most remarkable change was flipping the class with videos in the fall of 2014.

Overview Final Course grades

Let’s look at how students have performed in the two GC semesters by comparing their final grades across semesters.

IMPORTANT: We will see statistically significant differences between years and between demographic groups when analyzing the final percent grade. However, when we analyze the letter grade, those differences are no longer significant. This matters because a disengaged student may score 60% or 5%: the means and medians may be affected, but the letter grade is the same in both cases. Also, during Fall 2011 - Spring 2012 the laboratory was still a separate course, which means that the criterion for a passing grade was not 70%, but lower.
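
The point about disengaged students can be seen in a toy example. The scores below are made up, and the cutoffs in `to_letter` are an assumption for illustration (a coarse scale with 70% as the passing line), not the actual grading scheme of the course.

```r
# Hypothetical class of five students; the last one is disengaged.
scores_a <- c(92, 85, 78, 74, 60)  # disengaged student ends with 60%
scores_b <- c(92, 85, 78, 74, 5)   # same student ends with 5%

# Assumed cutoffs, for illustration only: [70, 80) is a C, and so on.
to_letter <- function(pct) {
  cut(pct, breaks = c(-Inf, 70, 80, 90, Inf),
      labels = c("below C", "C", "B", "A"), right = FALSE)
}

# The class mean moves by 11 points...
mean_shift <- mean(scores_a) - mean(scores_b)
mean_shift

# ...but every student gets the same letter in both scenarios.
same_letters <- identical(to_letter(scores_a), to_letter(scores_b))
same_letters
```

This is why a test on percent grades can flag a semester as different while the same test on letter grades does not.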

Comparing means by semester

Show code
setwd("~/Gd/Research/StudentData/Discover")

#Load demographics for all years
allGC1 <- read.csv("./genchem1_nosummer_11_16.csv",header=TRUE)
allGC2 <- read.csv("./genchem2_11_17.csv",header=TRUE)
allGC1_ <- read.csv("./genchem1_nosummer_11_16_mergedsex.csv",header = TRUE)
allGC2_ <- read.csv("./genchem2_11_17_mergedsex.csv",header = TRUE)

library(psych)
mata<-describeBy(allGC1$TG_Total.Grade....,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1")
Table 1: GenChem1
group1 n mean sd median trimmed mad min max range
X11 Fall 2011 68 78.43 10.11 79.50 78.98 9.79 42.50 95.90 53.40
X12 Fall 2012 84 73.95 12.41 73.00 73.78 14.53 43.00 96.20 53.20
X13 Fall 2013 69 83.81 8.17 84.70 84.49 6.82 41.60 96.50 54.90
X14 Fall 2014 105 80.95 9.73 81.52 81.78 8.61 40.52 98.78 58.27
X15 Fall 2015 60 81.44 9.74 82.68 82.39 7.19 43.47 96.33 52.87
X16 Fall 2016 35 84.17 7.24 84.95 84.60 7.82 66.98 97.14 30.16
Show code
mata<-describeBy(allGC2$TG_Total.Grade....,allGC2$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2")
Table 1: GenChem2
group1 n mean sd median trimmed mad min max range
X11 Spring 2011 16 87.38 8.34 89.84 88.07 6.55 67.70 97.38 29.68
X12 Spring 2012 45 69.01 12.95 70.70 69.90 10.53 35.60 94.00 58.40
X13 Spring 2013 51 81.51 9.42 82.00 82.00 9.93 53.70 97.50 43.80
X14 Spring 2014 55 80.31 8.51 79.30 80.55 8.75 60.10 96.70 36.60
X15 Spring 2015 61 78.51 10.00 77.78 78.55 11.92 57.26 96.14 38.88
X16 Spring 2016 44 77.15 9.66 78.73 77.77 9.49 54.39 94.10 39.72
X17 Spring 2017 37 80.01 9.80 80.78 80.64 8.01 48.11 97.22 49.11

Graphically by semester

Show code
library(ggplot2)
ggplot(allGC1, aes(x=TG_Total.Grade...., fill=Semester))+geom_histogram()+ggtitle("GenChem1 by semester")
Show code
ggplot(allGC2, aes(x=TG_Total.Grade...., fill=Semester))+geom_histogram()+ggtitle("GenChem2")

Statistical analysis by semester

Show code
a<- TukeyHSD( aov(allGC1$TG_Total.Grade.... ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. GenChem1 Grade among semesters")
Table 2: Anova. GenChem1 Grade among semesters
diff lwr upr p adj
Fall 2012-Fall 2011 -4.4779412 -9.1423644 0.186482 0.0681540
Fall 2013-Fall 2011 5.3836530 0.4976747 10.269631 0.0211987
Fall 2014-Fall 2011 2.5216131 -1.9292496 6.972476 0.5843701
Fall 2015-Fall 2011 3.0089288 -2.0556712 8.073529 0.5318824
Fall 2016-Fall 2011 5.7435885 -0.2048146 11.691992 0.0653478
Fall 2013-Fall 2012 9.8615942 5.2158876 14.507301 0.0000000
Fall 2014-Fall 2012 6.9995543 2.8138662 11.185242 0.0000345
Fall 2015-Fall 2012 7.4868700 2.6536537 12.320086 0.0001708
Fall 2016-Fall 2012 10.2215296 4.4688516 15.974208 0.0000082
Fall 2014-Fall 2013 -2.8620399 -7.2932841 1.569204 0.4353187
Fall 2015-Fall 2013 -2.3747242 -7.4220918 2.672643 0.7584349
Fall 2016-Fall 2013 0.3599354 -5.5738024 6.293673 0.9999781
Fall 2015-Fall 2014 0.4873157 -4.1401366 5.114768 0.9996654
Fall 2016-Fall 2014 3.2219753 -2.3589421 8.802893 0.5638475
Fall 2016-Fall 2015 2.7346596 -3.3470041 8.816323 0.7918982
Show code
a<- TukeyHSD( aov(allGC2$TG_Total.Grade.... ~ allGC2$Semester)) 
b<-as.data.frame(a$`allGC2$Semester`)
knitr::kable(b, caption = "Anova. GenChem2 Grade among semesters")
Table 2: Anova. GenChem2 Grade among semesters
diff lwr upr p adj
Spring 2012-Spring 2011 -18.3719243 -27.016530 -9.7273185 0.0000000
Spring 2013-Spring 2011 -5.8749308 -14.385113 2.6352511 0.3860347
Spring 2014-Spring 2011 -7.0680859 -15.504043 1.3678712 0.1676882
Spring 2015-Spring 2011 -8.8701902 -17.212128 -0.5282519 0.0288746
Spring 2016-Spring 2011 -10.2332149 -18.903549 -1.5628812 0.0094423
Spring 2017-Spring 2011 -7.3733537 -16.259708 1.5130002 0.1767662
Spring 2013-Spring 2012 12.4969935 6.422770 18.5712173 0.0000001
Spring 2014-Spring 2012 11.3038384 5.334050 17.2736265 0.0000009
Spring 2015-Spring 2012 9.5017341 3.665559 15.3379087 0.0000439
Spring 2016-Spring 2012 8.1387093 1.842068 14.4353503 0.0028667
Spring 2017-Spring 2012 10.9985706 4.407646 17.5894949 0.0000250
Spring 2014-Spring 2013 -1.1931551 -6.966573 4.5802632 0.9963637
Spring 2015-Spring 2013 -2.9952594 -8.630410 2.6398912 0.6966880
Spring 2016-Spring 2013 -4.3582841 -10.469068 1.7524994 0.3452669
Spring 2017-Spring 2013 -1.4984229 -7.912023 4.9151776 0.9928975
Spring 2015-Spring 2014 -1.8021043 -7.324522 3.7203134 0.9603189
Spring 2016-Spring 2014 -3.1651291 -9.172113 2.8418544 0.7053683
Spring 2017-Spring 2014 -0.3052678 -6.620048 6.0095122 0.9999993
Spring 2016-Spring 2015 -1.3630247 -7.237241 4.5111913 0.9931556
Spring 2017-Spring 2015 1.4968365 -4.691783 7.6854560 0.9914449
Spring 2017-Spring 2016 2.8598613 -3.764772 9.4844944 0.8600771
Show code
#install.packages("ggpubr")
library(ggpubr)
ggboxplot(allGC1, x = "Semester", y = "TG_Total.Grade....",  title = "Final grade in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
ggboxplot(allGC2, x = "Semester", y = "TG_Total.Grade....",  title = "Final grade in GC2",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC2$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
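
As far as I understand the ggpubr documentation, `ref.group = ".all."` in the boxplots above tests each semester against all students pooled together, while `method = "anova"` gives one overall p-value. A rough base-R sketch of the same two comparisons, run on synthetic grades (the data frame `toy` and all its values are made up), might look like this:

```r
set.seed(1)
toy <- data.frame(
  Semester = rep(c("Fall A", "Fall B", "Fall C"), each = 30),
  Grade    = c(rnorm(30, 80, 10), rnorm(30, 65, 10), rnorm(30, 81, 10))
)

# Overall test: does the mean grade differ among semesters?
omnibus_p <- summary(aov(Grade ~ Semester, data = toy))[[1]][["Pr(>F)"]][1]

# Each semester against all students pooled (Welch t-test, the R default)
vs_all_p <- sapply(split(toy$Grade, toy$Semester),
                   function(g) t.test(g, toy$Grade)$p.value)

signif(omnibus_p, 3)
signif(vs_all_p, 3)
```

Note that the pooled group includes the semester being tested, so these p-values are not the same as the pairwise Tukey comparisons in the tables above.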

Other grades besides final grade


Letter grades

I converted the letter grades to a 4-point scale. The plot should only show 4, 3.667, 3.333, 3… but it seems to show more variability…

Show code
#need to load this other file, as it contains the letter grades
allGC1_bosco <- read.csv("~/Research/StudentData/XavierData/Clean/allGC1.csv",header = TRUE)

a<- allGC1_bosco$Final.letter
a <- gsub("A\\-", 3.667,a)
a <- gsub("A", 4.000,a)
a <- gsub("B\\+", 3.333,a)
a <- gsub("B\\-", 2.667,a)
a <- gsub("B", 3.000,a)
a <- gsub("C\\+", 2.333,a)
a <- gsub("C\\-", 1.667,a)
a <- gsub("C", 2.000,a)
a <- gsub("D\\+", 1.333,a)
a <- gsub("D", 1.000,a)
a <- gsub("F", 0.000,a)
a <- gsub("I", 0.000,a)
allGC1_bosco$Final.letter.number <- as.numeric(as.character(a))
ggboxplot(allGC1_bosco, x = "Semester", y = "Final.letter.number",  title = "Final letter grade in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1_bosco$Final.letter.number, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
#ggplot(data=allGC1_bosco,aes(x=Semester,y=Final.letter)) + geom_bar(stat="identity") + geom_bar(aes(fill = Final.letter))
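
The chain of `gsub()` calls above depends on the order of the replacements. An alternative sketch uses a named lookup vector with the same point values; the input vector `letters_in` is hypothetical, and "W" is included only to show what happens to a grade that is not in the table.

```r
# Same 4-point values as in the gsub() calls above
grade_points <- c("A" = 4.000, "A-" = 3.667,
                  "B+" = 3.333, "B" = 3.000, "B-" = 2.667,
                  "C+" = 2.333, "C" = 2.000, "C-" = 1.667,
                  "D+" = 1.333, "D" = 1.000,
                  "F" = 0.000, "I" = 0.000)

letters_in <- c("A-", "B+", "C", "F", "W")  # hypothetical input

# as.character() matters if the column was read as a factor:
# indexing by a factor would use its integer codes, not its labels.
numeric_grade <- unname(grade_points[as.character(letters_in)])
numeric_grade

# Grades missing from the table become NA instead of a silent wrong value
table(numeric_grade, useNA = "ifany")
```

If the converted column only contains these discrete values, any extra spread seen in the plot would have to come from the plotting layer (for example the jitter), not from the data.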

Semester exams and previous exams

Show code
setwd("~/Gd/Research/StudentData/Discover")
#lets write the prepost file into the discover folder
prePost <- read.csv("/Users/xavier/Gd/Research/StudentData/ExamPrePost.csv",header=TRUE,sep = "\t")
source("~/Gd/Research/R/deid.R")
prePost <- deIdThis(prePost)
write.csv(prePost,file="prePost.csv")
prePost <- read.csv("./prePost.csv", header = TRUE)
prePost$inc1<-prePost$Grade1-prePost$Mid1
prePost$inc2<-prePost$Grade2-prePost$Mid2
prePost$inc3<-prePost$Grade3-prePost$Mid3
prePost$meanInc <- rowMeans( prePost[c('inc1','inc2','inc3')])
prePost$meanExam <- rowMeans( prePost[c('Grade1','Grade2','Grade3')])

The final exam is a second opportunity for students to improve their semester exams. Let’s measure how exam scores and improvement evolved through the years.

Show code
ggboxplot(prePost, x = "Semester", y = "meanExam",  title = "Average grade in final exams",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(prePost$meanExam), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 105 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

This plots the increment.

Show code
ggboxplot(prePost, x = "Semester", y = "meanInc",  title = "Average increment from semester exams to final",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(prePost$meanInc), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 40 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

There’s something funky about some of these numbers. Fall 2014 doesn’t seem to apply the >40% rule, which I actually implemented.

So let’s check whether I obtain the same result when I plot the exam grades from the BoSCO data.

Show code
allGC1_bosco$meanExam <-  rowMeans( allGC1_bosco[c('Exam1','Exam2','Exam3')], na.rm=TRUE)

ggboxplot(allGC1_bosco, x = "Semester", y = "meanExam",  title = "Average grade in final exams (Bosco source)",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1_bosco$meanExam, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 105 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Predictors of performance in Chemistry

There are different variables that we want to look at: performance factors such as ACT scores, GPA, or high school rank, as well as demographic factors such as ethnicity and first-generation status.

Math ACT is a good predictor

Show code
mata<-describeBy(allGC1$DEM_ACT.MATH,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "ACT Math - Fall sophomore")
Table 3: ACT Math - Fall sophomore
group1 n mean sd median trimmed mad min max range
X11 Fall 2011 64 24.75 3.50 24.5 24.67 3.71 18 33 15
X12 Fall 2012 70 24.41 4.01 24.0 24.29 4.45 17 34 17
X13 Fall 2013 66 25.14 2.82 25.0 25.26 2.97 18 31 13
X14 Fall 2014 101 25.47 3.21 26.0 25.40 2.97 17 34 17
X15 Fall 2015 57 24.72 3.19 24.0 24.70 2.97 18 32 14
X16 Fall 2016 32 24.94 2.64 25.0 24.96 2.97 19 30 11
Show code
mata<-describeBy(allGC2$DEM_ACT.MATH,allGC2$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "ACT Math - Spring sophomore")
Table 3: ACT Math - Spring sophomore
group1 n mean sd median trimmed mad min max range
X11 Spring 2011 16 25.88 4.18 25.5 25.79 3.71 19 34 15
X12 Spring 2012 42 25.57 3.47 25.0 25.56 2.97 18 33 15
X13 Spring 2013 44 26.23 4.15 26.0 26.17 4.45 18 34 16
X14 Spring 2014 52 24.90 2.76 25.0 25.17 2.97 18 29 11
X15 Spring 2015 58 26.10 3.36 26.0 26.04 2.97 19 34 15
X16 Spring 2016 42 25.10 3.27 25.0 25.00 2.97 18 32 14
X17 Spring 2017 33 25.27 2.47 26.0 25.30 2.97 21 30 9

We see that the students in the second semester are a subset of those in the first semester, with a higher ACT Math score. Therefore, we can just use GenChem1 for the analysis.

Was math ACT different through the years?

As we can see below, there is no significant difference in ACT Math throughout the years.

Show code
a<- TukeyHSD( aov(allGC1$DEM_ACT.MATH ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. ACTMath among semesters")
Table 4: Anova. ACTMath among semesters
diff lwr upr p adj
Fall 2012-Fall 2011 -0.3357143 -1.9769059 1.3054774 0.9919345
Fall 2013-Fall 2011 0.3863636 -1.2784117 2.0511390 0.9856326
Fall 2014-Fall 2011 0.7153465 -0.8007861 2.2314792 0.7559117
Fall 2015-Fall 2011 -0.0307018 -1.7589701 1.6975666 1.0000000
Fall 2016-Fall 2011 0.1875000 -1.8670492 2.2420492 0.9998340
Fall 2013-Fall 2012 0.7220779 -0.9060719 2.3502278 0.8010990
Fall 2014-Fall 2012 1.0510608 -0.4247621 2.5268838 0.3217601
Fall 2015-Fall 2012 0.3050125 -1.3880045 1.9980295 0.9955408
Fall 2016-Fall 2012 0.5232143 -1.5017715 2.5482001 0.9767972
Fall 2014-Fall 2013 0.3289829 -1.1730225 1.8309883 0.9889559
Fall 2015-Fall 2013 -0.4170654 -2.1329539 1.2988231 0.9823133
Fall 2016-Fall 2013 -0.1988636 -2.2430100 1.8452827 0.9997726
Fall 2015-Fall 2014 -0.7460483 -2.3181344 0.8260379 0.7513444
Fall 2016-Fall 2014 -0.5278465 -2.4528701 1.3971770 0.9699093
Fall 2016-Fall 2015 0.2182018 -1.8779779 2.3143814 0.9996831
Show code
ggboxplot(allGC1, x = "Semester", y = "DEM_ACT.MATH",  title = "ACT Math in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$DEM_ACT.MATH, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 40) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Correlation models: ACT vs GenChem1

Show code
#par(mfrow = c(1, 2))
plot(allGC1$TG_Total.Grade....,allGC1$DEM_ACT.MATH,main="GenChem1")
a<- lm(allGC1$DEM_ACT.MATH~allGC1$TG_Total.Grade.... )
abline(a)
Show code
r2a<-summary(a)$r.squared

plot(allGC2$TG_Total.Grade....,allGC2$DEM_ACT.MATH,main="GenChem2")
a<-lm(allGC2$DEM_ACT.MATH~allGC2$TG_Total.Grade.... )
abline(a)
Show code
r2b<-summary(a)$r.squared

We obtain r-squared values of 0.2042773 and 0.1569021 for GenChem1 and GenChem2, respectively. We need to find a better predictor. Let’s look at cumulative GPA before enrolling.
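
The models above regress ACT Math on the final grade, although the question is whether ACT predicts the grade. With a single predictor this does not change r-squared, which equals the squared Pearson correlation in both directions. A quick check on synthetic data (all numbers below are made up):

```r
set.seed(42)
act   <- round(rnorm(100, mean = 25, sd = 3))   # made-up ACT Math scores
grade <- 40 + 1.6 * act + rnorm(100, sd = 9)    # made-up final grades

r2_act_on_grade <- summary(lm(act ~ grade))$r.squared
r2_grade_on_act <- summary(lm(grade ~ act))$r.squared
r2_from_cor     <- cor(act, grade)^2

# All three numbers agree
c(r2_act_on_grade, r2_grade_on_act, r2_from_cor)
```

The slope and intercept do depend on the direction, so the fitted line only makes sense for predicting a grade from ACT if the model is written as `grade ~ act`.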

Previous GPA is a better predictor

While ACT.Math historically seems to correlate well, since we’re teaching sophomores, previous GPA is an even better predictor.

Show code
mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1")
Table 5: GenChem1
group1 n mean sd median trimmed mad min max range
X11 Fall 2011 65 3.11 0.46 3.14 3.13 0.52 1.88 3.93 2.05
X12 Fall 2012 79 2.94 0.53 2.94 2.95 0.59 1.44 3.98 2.54
X13 Fall 2013 69 3.18 0.44 3.21 3.19 0.53 1.84 3.98 2.14
X14 Fall 2014 105 3.00 0.48 2.98 3.00 0.47 1.33 4.00 2.67
X15 Fall 2015 60 3.00 0.39 3.00 2.98 0.33 2.18 3.97 1.79
X16 Fall 2016 35 3.12 0.48 3.26 3.14 0.42 2.06 4.00 1.94
Show code
mata<-describeBy(allGC2$DEM_Cumulative.GPA,allGC2$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2")
Table 5: GenChem2
group1 n mean sd median trimmed mad min max range
X11 Spring 2011 16 3.46 0.43 3.54 3.47 0.46 2.73 4.00 1.27
X12 Spring 2012 43 3.18 0.39 3.19 3.19 0.40 2.33 3.95 1.62
X13 Spring 2013 50 3.20 0.48 3.24 3.24 0.39 2.05 3.97 1.92
X14 Spring 2014 53 3.25 0.44 3.27 3.26 0.50 2.18 3.98 1.80
X15 Spring 2015 61 3.18 0.46 3.19 3.18 0.49 2.19 4.00 1.81
X16 Spring 2016 44 3.07 0.41 3.05 3.06 0.36 2.13 3.96 1.83
X17 Spring 2017 36 3.23 0.40 3.26 3.25 0.34 2.13 4.00 1.87

Was Incoming GPA different through the years?

Show code
a<- TukeyHSD( aov(allGC1$DEM_Cumulative.GPA ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. Entering GPA among semesters")
Table 6: Anova. Entering GPA among semesters
diff lwr upr p adj
Fall 2012-Fall 2011 -0.1679124 -0.3930252 0.0572004 0.2709961
Fall 2013-Fall 2011 0.0728361 -0.1595233 0.3051956 0.9469705
Fall 2014-Fall 2011 -0.1069817 -0.3191411 0.1051777 0.7000993
Fall 2015-Fall 2011 -0.1160769 -0.3567413 0.1245875 0.7384580
Fall 2016-Fall 2011 0.0137802 -0.2680571 0.2956175 0.9999925
Fall 2013-Fall 2012 0.2407485 0.0192443 0.4622527 0.0242003
Fall 2014-Fall 2012 0.0609307 -0.1392812 0.2611426 0.9531219
Fall 2015-Fall 2012 0.0518354 -0.1783657 0.2820366 0.9874925
Fall 2016-Fall 2012 0.1816926 -0.0912643 0.4546495 0.3999188
Fall 2014-Fall 2013 -0.1798178 -0.3881444 0.0285087 0.1350707
Fall 2015-Fall 2013 -0.1889130 -0.4262055 0.0483794 0.2048113
Fall 2016-Fall 2013 -0.0590559 -0.3380193 0.2199075 0.9905679
Fall 2015-Fall 2014 -0.0090952 -0.2266461 0.2084557 0.9999966
Fall 2016-Fall 2014 0.1207619 -0.1416144 0.3831382 0.7750540
Fall 2016-Fall 2015 0.1298571 -0.1560608 0.4157750 0.7847500
Show code
ggboxplot(allGC1, x = "Semester", y = "DEM_Cumulative.GPA",  title = "Entering GPA in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$DEM_Cumulative.GPA, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Correlation models: Prev. GPA vs GenChem grades

Here we plot previous GPA (typically first-year GPA) against the final grade.

Show code
#par(mfrow = c(1, 2))
plot(allGC1$TG_Total.Grade....,allGC1$DEM_Cumulative.GPA,main="GenChem1")
a<-lm(allGC1$DEM_Cumulative.GPA~allGC1$TG_Total.Grade.... )
abline(a)
Show code
r2a<-summary(a)$r.squared
plot(allGC2$TG_Total.Grade....,allGC2$DEM_Cumulative.GPA,main="GenChem2")
a<-lm(allGC2$DEM_Cumulative.GPA~allGC2$TG_Total.Grade.... )
abline(a)
Show code
r2b<-summary(a)$r.squared

In this case we obtain better r-squared values: 0.656591 and 0.5840838 for GenChem1 and GenChem2, respectively.

Is Highschool performance relevant?

For large schools, high school (HS) ranking can be a better measurement than HS GPA. Also, HS GPA is currently unavailable :). The ranking is given as a percentile, so the higher the better.

Show code
mata<-describeBy(allGC1$DEM_HS.Rank,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1")
Table 7: GenChem1
group1 n mean sd median trimmed mad min max range
X11 Fall 2011 60 79.18 14.08 80.5 80.48 15.57 46 97 51
X12 Fall 2012 65 73.57 17.01 76.0 74.98 16.31 20 99 79
X13 Fall 2013 57 81.21 12.33 81.0 82.04 13.34 47 99 52
X14 Fall 2014 86 79.43 16.62 84.0 81.56 13.34 26 99 73
X15 Fall 2015 51 81.27 13.83 86.0 82.93 8.90 37 99 62
X16 Fall 2016 25 82.28 11.75 85.0 82.95 10.38 60 98 38

Was Highschool performance different through the years?

Show code
a<- TukeyHSD( aov(allGC1$DEM_HS.Rank ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. HS ranking among semesters")
Table 8: Anova. HS ranking among semesters
diff lwr upr p adj
Fall 2012-Fall 2011 -5.6141026 -13.2615244 2.033319 0.2876124
Fall 2013-Fall 2011 2.0271930 -5.8736280 9.928014 0.9774141
Fall 2014-Fall 2011 0.2468992 -6.9383845 7.432183 0.9999987
Fall 2015-Fall 2011 2.0911765 -6.0444897 10.226843 0.9772356
Fall 2016-Fall 2011 3.0966667 -7.0718166 13.265150 0.9527517
Fall 2013-Fall 2012 7.6412955 -0.1100688 15.392660 0.0558988
Fall 2014-Fall 2012 5.8610018 -1.1596093 12.881613 0.1616772
Fall 2015-Fall 2012 7.7052790 -0.2853243 15.695882 0.0659401
Fall 2016-Fall 2012 8.7107692 -1.3420278 18.763566 0.1318975
Fall 2014-Fall 2013 -1.7802938 -9.0761070 5.515519 0.9819290
Fall 2015-Fall 2013 0.0639835 -8.1694637 8.297431 1.0000000
Fall 2016-Fall 2013 1.0694737 -9.1774107 11.316358 0.9996775
Fall 2015-Fall 2014 1.8442772 -5.7052249 9.393779 0.9818377
Fall 2016-Fall 2014 2.8497674 -6.8561055 12.555640 0.9594958
Fall 2016-Fall 2015 1.0054902 -9.4235429 11.434523 0.9997815
Show code
ggboxplot(allGC1, x = "Semester", y = "DEM_HS.Rank",  title = "Highschool Rank in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$DEM_HS.Rank, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Fall 2012 seems to stand out again.

Correlation models: HS rank vs GenChem grades

Show code
#par(mfrow = c(1, 2))
plot(allGC1$TG_Total.Grade....,allGC1$DEM_HS.Rank,main="GenChem1")
a<-lm(allGC1$DEM_HS.Rank~allGC1$TG_Total.Grade.... )
abline(a)
Show code
r2a<-summary(a)$r.squared
plot(allGC2$TG_Total.Grade....,allGC2$DEM_HS.Rank,main="GenChem2")
a<-lm(allGC2$DEM_HS.Rank~allGC2$TG_Total.Grade.... )
abline(a)
Show code
r2b<-summary(a)$r.squared

The r-squared values are fairly poor: 0.1667476 and 0.0552476 for GenChem1 and GenChem2, respectively.
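
The three correlation sections repeat the same steps for each predictor. A small helper could compute all the r-squared values at once. This is only a sketch: the function `r2_by_predictor`, the data frame `toy` and its column names are hypothetical, and the synthetic data is built so that GPA matters most.

```r
# r-squared of outcome ~ predictor, one simple regression per predictor.
# lm() drops rows with missing values, so each fit uses the students
# that have data for that particular predictor.
r2_by_predictor <- function(data, outcome, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = outcome), data = data)
    summary(fit)$r.squared
  })
}

set.seed(7)
n <- 200
toy <- data.frame(gpa     = rnorm(n, 3, 0.45),
                  act     = rnorm(n, 25, 3),
                  hs_rank = runif(n, 20, 99))
toy$grade <- 20 + 18 * toy$gpa + 0.3 * toy$act + rnorm(n, sd = 6)
toy$act[1:10] <- NA  # some students lack ACT data

r2 <- r2_by_predictor(toy, "grade", c("gpa", "act", "hs_rank"))
sort(r2, decreasing = TRUE)
```

Because each predictor is fitted on a different subset of students, the values are not strictly comparable; restricting the data to complete cases first would be the stricter choice.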

Demographics

Given the good correlation found above between previous GPA and final grade, let’s analyze how students of different demographics perform in chemistry relative to their incoming GPA. In other words, instead of simply comparing how first-generation and non-first-generation students do, it is more interesting to see how they did in GenChem given their college readiness (as described by GPA).
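
One possible way to make "performance relative to incoming GPA" concrete is to fit grade against GPA for everybody and then compare the residuals between groups. This is a sketch of that idea on synthetic data, not the analysis used in the rest of this page; the data frame `toy`, the group labels and all the numbers are made up, and no group effect is built into the data.

```r
set.seed(3)
n <- 300
toy <- data.frame(
  gpa   = pmin(4, pmax(1.3, rnorm(n, 3.05, 0.47))),
  group = sample(c("Y", "N"), n, replace = TRUE, prob = c(0.3, 0.7))
)
toy$grade <- 30 + 16 * toy$gpa + rnorm(n, sd = 7)

# Grade expected from GPA alone, and how far each student is from it
fit <- lm(grade ~ gpa, data = toy)
toy$resid <- residuals(fit)

# Positive mean residual: the group did better than its GPA predicts
resid_means <- aggregate(resid ~ group, data = toy, FUN = mean)
resid_means

resid_p <- t.test(resid ~ group, data = toy)$p.value
resid_p
```

A closely related option is to fit `lm(grade ~ gpa + group)` directly and look at the coefficient of `group`.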

Gender

Let’s look at how previous GPA and GenChem grades compare between self-identified genders. There was no data besides male and female.

Show code
#there are some undefined values that mess up the graphs
onlyMF_gc1<- allGC1_[complete.cases(allGC1_$Sex),]
onlyMF_gc2<- allGC2_[complete.cases(allGC2_$Sex),]
mata<-describeBy(onlyMF_gc1$DEM_Cumulative.GPA,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA and Sex")
Table 9: 1st year GPA and Sex
group1 n mean sd median trimmed mad min max range
X11 F 286 3.08 0.45 3.06 3.08 0.48 1.88 4 2.12
X12 M 117 3.00 0.53 3.04 3.01 0.49 1.33 4 2.67
Show code
mata<-describeBy(onlyMF_gc1$DEM_ACT.MATH,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "ACT math and Sex")
Table 9: ACT math and Sex
group1 n mean sd median trimmed mad min max range
X11 F 270 24.61 3.09 25 24.61 2.97 17 34 17
X12 M 111 25.82 3.71 26 25.83 2.97 17 34 17
Show code
mata<-describeBy(onlyMF_gc1$DEM_HS.Rank,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "HS rank and Sex")
Table 9: HS rank and Sex
group1 n mean sd median trimmed mad min max range
X11 F 239 81.02 14.21 84.0 82.64 11.86 25 99 74
X12 M 96 74.40 16.42 76.5 75.46 17.05 20 99 79

From the above, we can see that females come to GenChem with a very slightly higher GPA and a remarkably better HS ranking, but with a lower ACT Math score. Also, males have a broader range of values and a higher standard deviation. This tells us that males may not behave as a single group and may require a finer classification. In any case, how will these three factors affect their performance in GenChem? The number of students may not be exactly the same in each table because not all students have ACT or HS data.

Show code
mata<-describeBy(onlyMF_gc1$TG_Total.Grade....,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1 grade and Sex")
Table 10: GenChem1 grade and Sex
group1 n mean sd median trimmed mad min max range
X11 F 289 79.74 9.84 81.0 80.49 9.49 42.50 98.78 56.28
X12 M 120 80.88 12.12 82.1 82.36 11.15 40.52 97.14 56.62
Show code
mata<-describeBy(onlyMF_gc2$TG_Total.Grade....,onlyMF_gc2$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2 grade and Sex")
Table 10: GenChem2 grade and Sex
group1 n mean sd median trimmed mad min max range
X11 F 196 78.18 10.51 79.28 78.88 9.62 38.20 97.50 59.30
X12 M 102 80.31 10.19 80.57 80.76 13.00 56.39 97.22 40.83

Comparing the two genders each year

While it may look like males do better than females, even though females came with better GPA and HS ranking, there is actually no significant difference when the two groups are compared overall.

Show code
#install.packages("ggpubr")
library(ggpubr)
p <- ggboxplot(onlyMF_gc1, x = "Sex", y = "TG_Total.Grade....", color = "Sex", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() #default is wilcox for comparing non-parametric two groups

However, when the two groups are compared each semester we notice that Fall 2011 is the only semester with a significant difference between genders.

Show code
p <- ggboxplot(onlyMF_gc1, x = "Semester.x", y = "TG_Total.Grade....", color = "Sex", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=Sex),label="p.format") #default is wilcox for comparing non-parametric two groups
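
The per-semester p-values in the plot come from one Wilcoxon test per semester. A base-R sketch of the same idea on synthetic data (the data frame `toy` and its values are made up, with a shift added to one group in one semester only):

```r
set.seed(11)
toy <- data.frame(
  Semester = rep(c("Fall A", "Fall B"), each = 40),
  Sex      = rep(c("F", "M"), times = 40),
  Grade    = rnorm(80, 80, 10)
)
# Lower one group in "Fall A" only, so that semester differs by construction
sel <- toy$Semester == "Fall A" & toy$Sex == "F"
toy$Grade[sel] <- toy$Grade[sel] - 12

# One Wilcoxon rank-sum test per semester
p_by_semester <- sapply(split(toy, toy$Semester), function(d) {
  wilcox.test(Grade ~ Sex, data = d)$p.value
})
signif(p_by_semester, 3)

# Optional: adjust for having run several tests
p_holm <- p.adjust(p_by_semester, method = "holm")
signif(p_holm, 3)
```

With six semesters tested separately, a single p-value below 0.05 is weaker evidence than it looks, which is one reason to check an adjusted value before reading too much into one semester.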

Performance by each gender through the years

Before we jump to conclusions, however, we may need to look at how the females in Fall 2011 performed compared to females in other semesters.

Show code
#selecting females and males separately
onlyF_gc1 <- onlyMF_gc1[onlyMF_gc1$Sex=="F",]
onlyM_gc1 <- onlyMF_gc1[onlyMF_gc1$Sex=="M",]

ggboxplot(onlyF_gc1, x = "Semester.x", y = "TG_Total.Grade....",  title = "Females in GC1",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyF_gc1$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
ggboxplot(onlyM_gc1, x = "Semester.x", y = "TG_Total.Grade....",  title = "Males in GC1",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyM_gc1$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
ggboxplot(onlyF_gc1, x = "Semester.x", y = "DEM_Cumulative.GPA",  title = "Incoming GPA for females",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyF_gc1$DEM_Cumulative.GPA, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
ggboxplot(onlyM_gc1, x = "Semester.x", y = "DEM_Cumulative.GPA",  title = "Incoming GPA for males",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyM_gc1$DEM_Cumulative.GPA, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
ggboxplot(onlyF_gc1, x = "Semester.x", y = "DEM_HS.Rank",  title = "HS Ranking for females",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyF_gc1$DEM_HS.Rank, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
Show code
ggboxplot(onlyM_gc1, x = "Semester.x", y = "DEM_HS.Rank",  title = "HS Ranking for males",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyM_gc1$DEM_HS.Rank, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

We saw that females performed significantly lower than males in Fall 2011, and almost significantly higher in Fall 2013. However, we see that these differences may also be explained by differences in incoming GPA, but not by HS ranking. Also, many students lack an HS ranking, so those statistics may be less reliable.

Ethnicity

Let’s compare the GPA before enrolling in GenChem by students’ self-identified ethnicity.

Show code
mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$DEM_Student.of.Color,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA and Student of Color: Y/N")
Table 11: 1st year GPA and Student of Color: Y/N
group1 n mean sd median trimmed mad min max range
X11 N 340 3.09 0.47 3.11 3.10 0.47 1.33 4.00 2.67
X12 Y 73 2.84 0.46 2.78 2.82 0.43 1.90 3.97 2.07
Show code
mata<-describeBy(allGC1$TG_Total.Grade....,allGC1$DEM_Student.of.Color,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1 grade and Student of Color: Y/N")
Table 11: GenChem1 grade and Student of Color: Y/N
group1 n mean sd median trimmed mad min max range
X11 N 348 81.06 10.20 82.34 82.13 8.65 40.52 98.78 58.27
X12 Y 73 74.67 10.48 75.30 74.58 12.04 43.47 96.02 52.56
Show code
mata<-describeBy(allGC2$TG_Total.Grade....,allGC2$DEM_Student.of.Color,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2 grade and Student of Color: Y/N")
Table 11: GenChem2 grade and Student of Color: Y/N
group1 n mean sd median trimmed mad min max range
X11 N 260 79.13 10.33 79.74 79.75 10.53 43.7 97.5 53.8
X12 Y 49 74.44 12.76 72.95 75.31 9.87 35.6 96.7 61.1
Show code
library(gridExtra)
plotA <-ggplot(allGC1, aes(x=TG_Total.Grade...., fill=DEM_Student.of.Color)) + geom_histogram() + ggtitle("GenChem1 by Ethnicity")
plotB <-ggplot(allGC1, aes(x=DEM_Cumulative.GPA, fill=DEM_Student.of.Color)) + geom_histogram() + ggtitle("Prev GPA by Ethnicity")
grid.arrange(plotA,plotB)

Statistical analysis of performance in GenChem1 by ethnicity

Show code
p <- ggboxplot(allGC1, x = "DEM_Student.of.Color", y = "TG_Total.Grade....", color = "DEM_Student.of.Color", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() #default is wilcox for comparing non-parametric two groups
Show code
p <- ggboxplot(allGC1, x = "Semester", y = "TG_Total.Grade....", color = "DEM_Student.of.Color", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_Student.of.Color),label="p.format") #default is wilcox for comparing non-parametric two groups

Different ethnicities

Show code
mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$DEM_Ethnicity,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA for different ethnicities")
Table 12: 1st year GPA for different ethnicities
group1 n mean sd median trimmed mad min max range
X11 0 NaN NA NA NaN NA Inf -Inf -Inf
X12 Am. Indian 5 2.74 0.36 2.62 2.74 0.43 2.33 3.23 0.90
X13 Asian 38 2.88 0.47 2.81 2.86 0.36 1.90 3.97 2.07
X14 Black 29 2.96 0.50 2.89 2.95 0.56 2.12 3.95 1.83
X15 Hawaiian 1 3.11 NA 3.11 3.11 0.00 3.11 3.11 0.00
X16 Hispanic 13 2.78 0.38 2.70 2.76 0.36 2.27 3.47 1.20
X17 NS 3 2.72 0.66 2.37 2.72 0.07 2.32 3.48 1.16
X18 White 324 3.10 0.47 3.11 3.11 0.47 1.33 4.00 2.67

We can also run an ANOVA among the different ethnicities, but in any case it’s hard to do statistics on such small numbers. Maybe only the Black and Asian groups are large enough to be compared with the White group.

Show code
TukeyHSD( aov(allGC1$TG_Total.Grade.... ~ allGC1$DEM_Ethnicity))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = allGC1$TG_Total.Grade.... ~ allGC1$DEM_Ethnicity)

$`allGC1$DEM_Ethnicity`
                           diff         lwr       upr     p adj
Am. Indian-          -1.2520311 -24.3127791 21.808717 0.9999998
Asian-                1.8944333 -17.0079964 20.796863 0.9999879
Black-                0.2271630 -18.9237458 19.378072 1.0000000
Hawaiian-             6.1165333 -30.3457108 42.578777 0.9996050
Hispanic-            -0.4325654 -20.6581794 19.793049 1.0000000
NS-                  -4.9680000 -30.7507001 20.814700 0.9990190
White-                5.7148568 -12.5997033 24.029417 0.9807262
Asian-Am. Indian      3.1464644 -11.8319308 18.124860 0.9982867
Black-Am. Indian      1.4791941 -13.8115804 16.769969 0.9999905
Hawaiian-Am. Indian   7.3685644 -27.2225576 41.959686 0.9981266
Hispanic-Am. Indian   0.8194657 -15.7975718 17.436503 0.9999999
NS-Am. Indian        -3.7159689 -26.7767169 19.344779 0.9996975
White-Am. Indian      6.9668879  -7.2624335 21.196209 0.8117453
Black-Asian          -1.6672703  -9.3686684  6.034128 0.9979230
Hawaiian-Asian        4.2221000 -27.7474084 36.191609 0.9999204
Hispanic-Asian       -2.3269987 -12.4081536  7.754156 0.9968822
NS-Asian             -6.8624333 -25.7648630 12.039996 0.9553407
White-Asian           3.8204235  -1.4689372  9.109784 0.3535646
Hawaiian-Black        5.8893703 -26.2276802 38.006421 0.9992900
Hispanic-Black       -0.6597284 -11.1994223  9.879965 0.9999995
NS-Black             -5.1951630 -24.3460719 13.955746 0.9915567
White-Black           5.4876938  -0.6305412 11.605929 0.1158453
Hispanic-Hawaiian    -6.5490987 -39.3183386 26.220141 0.9987569
NS-Hawaiian         -11.0845333 -47.5467775 25.377711 0.9834226
White-Hawaiian       -0.4016765 -32.0271526 31.223800 1.0000000
NS-Hispanic          -4.5354346 -24.7610486 15.690179 0.9974027
White-Hispanic        6.1474222  -2.7829165 15.077761 0.4184543
White-NS             10.6828568  -7.6317033 28.997417 0.6360504
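
Since most of the groups above are very small, one option is to restrict the post-hoc test to the groups that are large enough to compare. The following is only a sketch under stated assumptions: the helper `tukey_large_groups`, the threshold `min_n = 25` and the toy data are all hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): run the ANOVA and
# Tukey HSD test only on groups with at least `min_n` non-missing grades.
tukey_large_groups <- function(df, grade_col, group_col, min_n = 25) {
  d <- data.frame(
    grade = df[[grade_col]],
    group = as.character(df[[group_col]]),
    stringsAsFactors = FALSE
  )
  # Drop missing grades and blank group labels
  d <- d[!is.na(d$grade) & !is.na(d$group) & d$group != "", ]
  counts <- table(d$group)
  keep <- names(counts)[counts >= min_n]
  stopifnot(length(keep) >= 2)  # a comparison needs at least two groups
  d <- d[d$group %in% keep, ]
  d$group <- factor(d$group)    # keep only the levels that remain
  TukeyHSD(aov(grade ~ group, data = d))
}

# Toy data for illustration only; these are NOT the course grades.
set.seed(1)
toy_grades <- data.frame(
  DEM_Ethnicity = rep(c("White", "Asian", "Hawaiian", ""), times = c(40, 30, 1, 3)),
  TG_Total.Grade.... = c(rnorm(40, 80, 10), rnorm(30, 78, 10), 85, rnorm(3, 75, 10)),
  stringsAsFactors = FALSE
)
tukey_toy <- tukey_large_groups(toy_grades, "TG_Total.Grade....", "DEM_Ethnicity")
print(tukey_toy)
```

With the real data, the call would be along the lines of `tukey_large_groups(allGC1, "TG_Total.Grade....", "DEM_Ethnicity")`. The threshold is a judgment call, and the small groups are simply left out of the comparison rather than merged.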

First generation students

Let’s compare the GPA before enrolling in GenChem for first-generation students vs. the rest. Note how many students we have information for (a total of 421 in GenChem1 and 309 in GenChem2).

Show code
mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$DEM_First.Generation,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA and 1st generation: Y/N")
Table 13: 1st year GPA and 1st generation: Y/N
     group1    n  mean   sd median trimmed  mad  min  max range
X11           20  2.85 0.58   2.70    2.79 0.45 1.88 4.00  2.12
X12  N       248  3.05 0.48   3.09    3.07 0.50 1.33 3.98  2.65
X13  Y       145  3.07 0.44   3.02    3.06 0.43 1.90 4.00  2.10
Show code
mata<-describeBy(allGC1$TG_Total.Grade....,allGC1$DEM_First.Generation,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1 grade and 1st generation: Y/N")
Table 13: GenChem1 grade and 1st generation: Y/N
     group1    n  mean    sd median trimmed   mad   min   max range
X11           24 72.63 14.01  73.62   72.63 15.20 42.50 97.14 54.64
X12  N       252 79.81 10.51  81.07   80.73 10.29 40.52 97.07 56.55
X13  Y       145 81.41  9.36  82.18   82.20  8.03 53.00 98.78 45.78
Show code
mata<-describeBy(allGC2$TG_Total.Grade....,allGC2$DEM_First.Generation,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2 grade and 1st generation: Y/N")
Table 13: GenChem2 grade and 1st generation: Y/N
     group1    n  mean    sd median trimmed   mad   min   max range
X11           16 78.97 11.79  79.30   79.09 14.52  60.1 96.14 36.04
X12  N       187 78.26 10.96  79.10   78.94 11.12  38.2 97.50 59.30
X13  Y       106 78.52 10.64  79.42   79.24 10.49  35.6 96.70 61.10
Show code
ggplot(allGC1, aes(x=TG_Total.Grade...., fill=DEM_First.Generation )) + geom_histogram() + ggtitle("GenChem1 by First Generation")
Show code
ggplot(allGC2, aes(x=TG_Total.Grade...., fill=DEM_First.Generation))+geom_histogram()+ggtitle("GenChem2 by First Generation")

Statistical analysis of performance 1st generation in GenChem1

Show code
p <- ggboxplot(allGC1, x = "DEM_First.Generation", y = "TG_Total.Grade....", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() # default is the non-parametric Wilcoxon test for two groups (Kruskal-Wallis for more than two)
Show code
p <- ggboxplot(allGC1, x = "Semester", y = "TG_Total.Grade....", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_First.Generation),label="p.format") # compares the groups within each semester; the default test is non-parametric (Wilcoxon for two groups)
Show code
p <- ggboxplot(allGC2, x = "Semester", y = "TG_Total.Grade....", color = "DEM_First.Generation", palette = "jco", add = "jitter")  + rotate_x_text(angle = 45)
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_First.Generation),label="p.format") # compares the groups within each semester; the default test is non-parametric (Wilcoxon for two groups)

First-generation students seem to do slightly better than, or the same as, the rest. Are they coming in with equal preparation? We can look at high school (HS) rank to try to answer that.

Show code
p <- ggboxplot(allGC1, x = "DEM_First.Generation", y = "DEM_HS.Rank", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() # default is the non-parametric Wilcoxon test for two groups (Kruskal-Wallis for more than two)
Show code
p <- ggboxplot(allGC1, x = "Semester", y = "DEM_HS.Rank", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_First.Generation),label="p.format") # group by first-generation status (not by HS rank) so the groups are compared within each semester

Judging by HS rank, it seems that first-generation students already arrive better prepared than non-first-generation students.
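
The first-generation field has three levels in the tables above (blank, N and Y), so the boxplot tests include the students with no recorded status. A direct Y vs. N comparison can be sketched as follows. The helper `compare_first_gen` and the toy data are assumptions for illustration only; they are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): two-sample Wilcoxon
# test of `value_col` between first-generation ("Y") and non-first-generation
# ("N") students, ignoring blank statuses and missing values.
compare_first_gen <- function(df, value_col, status_col = "DEM_First.Generation") {
  status <- as.character(df[[status_col]])
  value <- df[[value_col]]
  ok <- !is.na(value) & status %in% c("Y", "N")
  wilcox.test(value[ok & status == "Y"], value[ok & status == "N"])
}

# Toy data for illustration only; these are NOT the course records.
toy_rank <- data.frame(
  DEM_First.Generation = c(rep("Y", 10), rep("N", 10), ""),
  DEM_HS.Rank          = c(81:90, 61:70, 50),
  stringsAsFactors     = FALSE
)
first_gen_test <- compare_first_gen(toy_rank, "DEM_HS.Rank")
print(first_gen_test)
```

With the real data, the same helper could be called as `compare_first_gen(allGC1, "DEM_HS.Rank")` or `compare_first_gen(allGC1, "TG_Total.Grade....")`. Real data are likely to contain ties, in which case `wilcox.test` falls back to a normal approximation and issues a warning.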

Citation

For attribution, please cite this work as

Prat-Resina (2018, Aug. 10). Prat-Resina's blog: Analysis of seven years of General Chemistry student data. Retrieved from https://xavierprat.github.io/Blog/posts/analysis_seven_years_genchem/

BibTeX citation

@misc{prat-resina2018analysis,
  author = {Prat-Resina, Xavier},
  title = {Prat-Resina's blog: Analysis of seven years of General Chemistry student data},
  url = {https://xavierprat.github.io/Blog/posts/analysis_seven_years_genchem/},
  year = {2018}
}