Prat-Resina's blog: Analysis of seven years of General Chemistry student data

Xavier Prat-Resina

Abstract

Between Fall 2010 and Spring 2018 I taught the two-semester sequence of General Chemistry (GC). The way our curriculum was structured, these two semesters were usually taken during the sophomore year by students pursuing a Bachelor of Science in Health Sciences.

During all this time, while the chemistry content has not changed significantly, the forms of delivery and assessment have been evolving towards, hopefully, better pedagogies of engagement and a clearer assessment of learning objectives. Probably the most remarkable change was flipping the class with videos in the fall of 2014.

Overview Final Course grades

Let’s just look at how students have performed in the two GC semesters by looking at their final grade in different semesters.

IMPORTANT: We will see statistically significant differences between years and between demographic groups when analyzing the final percent grade. However, when we analyze the letter grade, those differences are no longer significant. This matters because a disengaged student may score 60% or 5%: the means and medians may be affected, but the letter grade analysis will not. Also, during the Fall 2011 - Spring 2012 academic year the laboratory was still a separate course, which means that the criterion for a passing grade was not 70%, but lower.
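To illustrate why the percent-grade and letter-grade analyses can disagree, here is a minimal sketch with made-up scores (not the course data). The 70% cutoff comes from the text above; reducing the letter grade to pass/fail is a simplification assumed only for this example.

```r
# Hypothetical scores: the same class, except that one disengaged student
# ends with 60% in one scenario and 5% in the other.
engaged  <- c(92, 85, 78, 74, 71)
scores_a <- c(engaged, 60)
scores_b <- c(engaged, 5)

# Simplified pass/fail binning at the 70% cutoff (assumption for illustration)
to_pass <- function(x) ifelse(x >= 70, "pass", "fail")

mean(scores_a)            # percent mean with the 60% score
mean(scores_b)            # percent mean is pulled down by the 5% score
table(to_pass(scores_a))
table(to_pass(scores_b))  # same pass/fail counts in both scenarios
```

The percent mean moves with the size of the failing score, while the binned grade only records that the student failed.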

Comparing means by semester

Show code

setwd("~/Gd/Research/StudentData/Discover")

#Load demographics for all years
allGC1 <- read.csv("./genchem1_nosummer_11_16.csv",header=TRUE)
allGC2 <- read.csv("./genchem2_11_17.csv",header=TRUE)
allGC1_ <- read.csv("./genchem1_nosummer_11_16_mergedsex.csv",header = TRUE)
allGC2_ <- read.csv("./genchem2_11_17_mergedsex.csv",header = TRUE)

library(psych)
mata<-describeBy(allGC1$TG_Total.Grade....,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1")

Table 1: GenChem1
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Fall 2011	68	78.43	10.11	79.50	78.98	9.79	42.50	95.90	53.40
X12	Fall 2012	84	73.95	12.41	73.00	73.78	14.53	43.00	96.20	53.20
X13	Fall 2013	69	83.81	8.17	84.70	84.49	6.82	41.60	96.50	54.90
X14	Fall 2014	105	80.95	9.73	81.52	81.78	8.61	40.52	98.78	58.27
X15	Fall 2015	60	81.44	9.74	82.68	82.39	7.19	43.47	96.33	52.87
X16	Fall 2016	35	84.17	7.24	84.95	84.60	7.82	66.98	97.14	30.16

Show code

mata<-describeBy(allGC2$TG_Total.Grade....,allGC2$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2")

Table 1: GenChem2
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Spring 2011	16	87.38	8.34	89.84	88.07	6.55	67.70	97.38	29.68
X12	Spring 2012	45	69.01	12.95	70.70	69.90	10.53	35.60	94.00	58.40
X13	Spring 2013	51	81.51	9.42	82.00	82.00	9.93	53.70	97.50	43.80
X14	Spring 2014	55	80.31	8.51	79.30	80.55	8.75	60.10	96.70	36.60
X15	Spring 2015	61	78.51	10.00	77.78	78.55	11.92	57.26	96.14	38.88
X16	Spring 2016	44	77.15	9.66	78.73	77.77	9.49	54.39	94.10	39.72
X17	Spring 2017	37	80.01	9.80	80.78	80.64	8.01	48.11	97.22	49.11

Graphically by semester

Show code

library(ggplot2)
ggplot(allGC1, aes(x=TG_Total.Grade...., fill=Semester))+geom_histogram()+ggtitle("GenChem1 by semester")

Show code

ggplot(allGC2, aes(x=TG_Total.Grade...., fill=Semester))+geom_histogram()+ggtitle("GenChem2")

Statistical analysis by semester

Show code

a<- TukeyHSD( aov(allGC1$TG_Total.Grade.... ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. GenChem1 Grade among semesters")

Table 2: Anova. GenChem1 Grade among semesters
	diff	lwr	upr	p adj
Fall 2012-Fall 2011	-4.4779412	-9.1423644	0.186482	0.0681540
Fall 2013-Fall 2011	5.3836530	0.4976747	10.269631	0.0211987
Fall 2014-Fall 2011	2.5216131	-1.9292496	6.972476	0.5843701
Fall 2015-Fall 2011	3.0089288	-2.0556712	8.073529	0.5318824
Fall 2016-Fall 2011	5.7435885	-0.2048146	11.691992	0.0653478
Fall 2013-Fall 2012	9.8615942	5.2158876	14.507301	0.0000000
Fall 2014-Fall 2012	6.9995543	2.8138662	11.185242	0.0000345
Fall 2015-Fall 2012	7.4868700	2.6536537	12.320086	0.0001708
Fall 2016-Fall 2012	10.2215296	4.4688516	15.974208	0.0000082
Fall 2014-Fall 2013	-2.8620399	-7.2932841	1.569204	0.4353187
Fall 2015-Fall 2013	-2.3747242	-7.4220918	2.672643	0.7584349
Fall 2016-Fall 2013	0.3599354	-5.5738024	6.293673	0.9999781
Fall 2015-Fall 2014	0.4873157	-4.1401366	5.114768	0.9996654
Fall 2016-Fall 2014	3.2219753	-2.3589421	8.802893	0.5638475
Fall 2016-Fall 2015	2.7346596	-3.3470041	8.816323	0.7918982

Show code

a<- TukeyHSD( aov(allGC2$TG_Total.Grade.... ~ allGC2$Semester)) 
b<-as.data.frame(a$`allGC2$Semester`)
knitr::kable(b, caption = "Anova. GenChem2 Grade among semesters")

Table 2: Anova. GenChem2 Grade among semesters
	diff	lwr	upr	p adj
Spring 2012-Spring 2011	-18.3719243	-27.016530	-9.7273185	0.0000000
Spring 2013-Spring 2011	-5.8749308	-14.385113	2.6352511	0.3860347
Spring 2014-Spring 2011	-7.0680859	-15.504043	1.3678712	0.1676882
Spring 2015-Spring 2011	-8.8701902	-17.212128	-0.5282519	0.0288746
Spring 2016-Spring 2011	-10.2332149	-18.903549	-1.5628812	0.0094423
Spring 2017-Spring 2011	-7.3733537	-16.259708	1.5130002	0.1767662
Spring 2013-Spring 2012	12.4969935	6.422770	18.5712173	0.0000001
Spring 2014-Spring 2012	11.3038384	5.334050	17.2736265	0.0000009
Spring 2015-Spring 2012	9.5017341	3.665559	15.3379087	0.0000439
Spring 2016-Spring 2012	8.1387093	1.842068	14.4353503	0.0028667
Spring 2017-Spring 2012	10.9985706	4.407646	17.5894949	0.0000250
Spring 2014-Spring 2013	-1.1931551	-6.966573	4.5802632	0.9963637
Spring 2015-Spring 2013	-2.9952594	-8.630410	2.6398912	0.6966880
Spring 2016-Spring 2013	-4.3582841	-10.469068	1.7524994	0.3452669
Spring 2017-Spring 2013	-1.4984229	-7.912023	4.9151776	0.9928975
Spring 2015-Spring 2014	-1.8021043	-7.324522	3.7203134	0.9603189
Spring 2016-Spring 2014	-3.1651291	-9.172113	2.8418544	0.7053683
Spring 2017-Spring 2014	-0.3052678	-6.620048	6.0095122	0.9999993
Spring 2016-Spring 2015	-1.3630247	-7.237241	4.5111913	0.9931556
Spring 2017-Spring 2015	1.4968365	-4.691783	7.6854560	0.9914449
Spring 2017-Spring 2016	2.8598613	-3.764772	9.4844944	0.8600771

Show code

#install.packages("ggpubr")
library(ggpubr)
ggboxplot(allGC1, x = "Semester", y = "TG_Total.Grade....",  title = "Final grade in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Show code

ggboxplot(allGC2, x = "Semester", y = "TG_Total.Grade....",  title = "Final grade in GC2",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC2$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Other grades besides final grade

Letter grades

I converted the letter grades to the 4-point scale. The plot should only show the discrete values 4, 3.667, 3.333, 3, … but it seems to show more variability than that.

Show code

#need to load this other file, as it contains the letter grades
allGC1_bosco <- read.csv("~/Research/StudentData/XavierData/Clean/allGC1.csv",header = TRUE)

a<- allGC1_bosco$Final.letter
a <- gsub("A\\-", 3.667,a)
a <- gsub("A", 4.000,a)
a <- gsub("B\\+", 3.333,a)
a <- gsub("B\\-", 2.667,a)
a <- gsub("B", 3.000,a)
a <- gsub("C\\+", 2.333,a)
a <- gsub("C\\-", 1.667,a)
a <- gsub("C", 2.000,a)
a <- gsub("D\\+", 1.333,a)
a <- gsub("D", 1.000,a)
a <- gsub("F", 0.000,a)
a <- gsub("I", 0.000,a)
allGC1_bosco$Final.letter.number <- as.numeric(as.character(a))
ggboxplot(allGC1_bosco, x = "Semester", y = "Final.letter.number",  title = "Final letter grade in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1_bosco$Final.letter.number), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
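
The chain of gsub() calls above depends on the order of the substitutions (for example, "A-" has to be replaced before "A"). A minimal alternative sketch uses a named vector as a lookup table. The helper name letter_to_points and the example input are hypothetical; the point values are the ones used above, and any letter not listed (assumed here to include codes such as "W") becomes NA.

```r
# Named vector as a lookup table on the 4-point scale used above.
# Letters that are not listed map to NA.
grade_points <- c("A" = 4.000, "A-" = 3.667, "B+" = 3.333, "B" = 3.000,
                  "B-" = 2.667, "C+" = 2.333, "C" = 2.000, "C-" = 1.667,
                  "D+" = 1.333, "D" = 1.000, "F" = 0.000, "I" = 0.000)

# Hypothetical helper: convert a vector of letter grades to grade points
letter_to_points <- function(letters) {
  unname(grade_points[trimws(as.character(letters))])
}

# Made-up input, not the course data
example_letters <- c("A", "B+", "C-", "I", "W")
letter_to_points(example_letters)
```

Because unmatched letters become NA, a call such as mean() on the converted column would need na.rm = TRUE.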

Show code

#ggplot(data=allGC1_bosco,aes(x=Semester,y=Final.letter)) + geom_bar(stat="identity") + geom_bar(aes(fill = Final.letter))

Semester exams and previous exams

Show code

setwd("~/Gd/Research/StudentData/Discover")
#lets write the prepost file into the discover folder
prePost <- read.csv("/Users/xavier/Gd/Research/StudentData/ExamPrePost.csv",header=TRUE,sep = "\t")
source("~/Gd/Research/R/deid.R")
prePost <- deIdThis(prePost)
write.csv(prePost,file="prePost.csv")
prePost <- read.csv("./prePost.csv", header = TRUE)
prePost$inc1<-prePost$Grade1-prePost$Mid1
prePost$inc2<-prePost$Grade2-prePost$Mid2
prePost$inc3<-prePost$Grade3-prePost$Mid3
prePost$meanInc <- rowMeans( prePost[c('inc1','inc2','inc3')])
prePost$meanExam <- rowMeans( prePost[c('Grade1','Grade2','Grade3')])

The final exam is a second opportunity for students to improve their semester exam scores. Let’s measure how exam scores and the improvement evolved through the years.

Show code

ggboxplot(prePost, x = "Semester", y = "meanExam",  title = "Average grade in final exams",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(prePost$meanExam), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 105 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

This plots the increment.

Show code

ggboxplot(prePost, x = "Semester", y = "meanInc",  title = "Average increment from semester exams to final",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(prePost$meanInc), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 40 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

There’s something funky about some of these numbers. Fall 2014 doesn’t seem to apply the >40% rule, which I actually implemented.

So let’s check that I obtain the same result if I plot the exam grades from the BoSCO data.

Show code

allGC1_bosco$meanExam <-  rowMeans( allGC1_bosco[c('Exam1','Exam2','Exam3')], na.rm=TRUE)

ggboxplot(allGC1_bosco, x = "Semester", y = "meanExam",  title = "Average grade in final exams (Bosco source)",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1_bosco$meanExam), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 105 ) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Predictors of performance in Chemistry

There are different variables that we want to look at: performance factors such as ACT scores, GPA, or high school rank, as well as demographic factors such as ethnicity and first-generation status.

Math ACT is a good predictor

Show code

mata<-describeBy(allGC1$DEM_ACT.MATH,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "ACT Math - Fall sophomore")

Table 3: ACT Math - Fall sophomore
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Fall 2011	64	24.75	3.50	24.5	24.67	3.71	18	33	15
X12	Fall 2012	70	24.41	4.01	24.0	24.29	4.45	17	34	17
X13	Fall 2013	66	25.14	2.82	25.0	25.26	2.97	18	31	13
X14	Fall 2014	101	25.47	3.21	26.0	25.40	2.97	17	34	17
X15	Fall 2015	57	24.72	3.19	24.0	24.70	2.97	18	32	14
X16	Fall 2016	32	24.94	2.64	25.0	24.96	2.97	19	30	11

Show code

mata<-describeBy(allGC2$DEM_ACT.MATH,allGC2$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "ACT Math - Spring sophomore")

Table 3: ACT Math - Spring sophomore
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Spring 2011	16	25.88	4.18	25.5	25.79	3.71	19	34	15
X12	Spring 2012	42	25.57	3.47	25.0	25.56	2.97	18	33	15
X13	Spring 2013	44	26.23	4.15	26.0	26.17	4.45	18	34	16
X14	Spring 2014	52	24.90	2.76	25.0	25.17	2.97	18	29	11
X15	Spring 2015	58	26.10	3.36	26.0	26.04	2.97	19	34	15
X16	Spring 2016	42	25.10	3.27	25.0	25.00	2.97	18	32	14
X17	Spring 2017	33	25.27	2.47	26.0	25.30	2.97	21	30	9

We see that the second-semester students are a subset of the first-semester students with a higher ACT math score. Therefore, we can just use GenChem1 for the analysis.

Was math ACT different through the years?

As we can see below, there is no significant difference in ACT math scores throughout the years.

Show code

a<- TukeyHSD( aov(allGC1$DEM_ACT.MATH ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. ACTMath among semesters")

Table 4: Anova. ACTMath among semesters
	diff	lwr	upr	p adj
Fall 2012-Fall 2011	-0.3357143	-1.9769059	1.3054774	0.9919345
Fall 2013-Fall 2011	0.3863636	-1.2784117	2.0511390	0.9856326
Fall 2014-Fall 2011	0.7153465	-0.8007861	2.2314792	0.7559117
Fall 2015-Fall 2011	-0.0307018	-1.7589701	1.6975666	1.0000000
Fall 2016-Fall 2011	0.1875000	-1.8670492	2.2420492	0.9998340
Fall 2013-Fall 2012	0.7220779	-0.9060719	2.3502278	0.8010990
Fall 2014-Fall 2012	1.0510608	-0.4247621	2.5268838	0.3217601
Fall 2015-Fall 2012	0.3050125	-1.3880045	1.9980295	0.9955408
Fall 2016-Fall 2012	0.5232143	-1.5017715	2.5482001	0.9767972
Fall 2014-Fall 2013	0.3289829	-1.1730225	1.8309883	0.9889559
Fall 2015-Fall 2013	-0.4170654	-2.1329539	1.2988231	0.9823133
Fall 2016-Fall 2013	-0.1988636	-2.2430100	1.8452827	0.9997726
Fall 2015-Fall 2014	-0.7460483	-2.3181344	0.8260379	0.7513444
Fall 2016-Fall 2014	-0.5278465	-2.4528701	1.3971770	0.9699093
Fall 2016-Fall 2015	0.2182018	-1.8779779	2.3143814	0.9996831

Show code

ggboxplot(allGC1, x = "Semester", y = "DEM_ACT.MATH",  title = "ACT Math in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$DEM_ACT.MATH, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 40) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Correlation models: ACT vs GenChem1

Show code

#par(mfrow = c(1, 2))
plot(allGC1$TG_Total.Grade....,allGC1$DEM_ACT.MATH,main="GenChem1")
a<- lm(allGC1$DEM_ACT.MATH~allGC1$TG_Total.Grade.... )
abline(a)

Show code

r2a<-summary(a)$r.squared

plot(allGC2$TG_Total.Grade....,allGC2$DEM_ACT.MATH,main="GenChem2")
a<-lm(allGC2$DEM_ACT.MATH~allGC2$TG_Total.Grade.... )
abline(a)

Show code

r2b<-summary(a)$r.squared

We obtain r-squared values of 0.2042773 and 0.1569021 for GenChem1 and GenChem2, respectively. We need to find a better predictor. Let’s look at cumulative GPA before enrolling.
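
Note that the models above regress the ACT score on the final grade rather than the other way around. For a simple linear regression with a single predictor this does not change r-squared, which equals the squared Pearson correlation in either direction. The sketch below checks this on simulated data; the variable names and numbers are made up and do not come from the course files.

```r
# Simulated data only; 'grade' and 'act' mimic the idea, not the real columns.
set.seed(1)
grade <- rnorm(200, mean = 80, sd = 10)
act   <- 25 + 0.15 * (grade - 80) + rnorm(200, sd = 3)

r2_act_on_grade <- summary(lm(act ~ grade))$r.squared
r2_grade_on_act <- summary(lm(grade ~ act))$r.squared
r_squared_cor   <- cor(act, grade)^2

# All three agree up to floating-point error
all.equal(r2_act_on_grade, r2_grade_on_act)
all.equal(r2_act_on_grade, r_squared_cor)
```

With the real columns, which contain missing ACT values, lm() drops incomplete rows by default, whereas cor() would need use = "complete.obs" to do the same.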

Previous GPA is a better predictor

While ACT.Math historically seems to correlate well, since we’re teaching sophomores, previous GPA is an even better predictor.

Show code

mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1")

Table 5: GenChem1
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Fall 2011	65	3.11	0.46	3.14	3.13	0.52	1.88	3.93	2.05
X12	Fall 2012	79	2.94	0.53	2.94	2.95	0.59	1.44	3.98	2.54
X13	Fall 2013	69	3.18	0.44	3.21	3.19	0.53	1.84	3.98	2.14
X14	Fall 2014	105	3.00	0.48	2.98	3.00	0.47	1.33	4.00	2.67
X15	Fall 2015	60	3.00	0.39	3.00	2.98	0.33	2.18	3.97	1.79
X16	Fall 2016	35	3.12	0.48	3.26	3.14	0.42	2.06	4.00	1.94

Show code

mata<-describeBy(allGC2$DEM_Cumulative.GPA,allGC2$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2")

Table 5: GenChem2
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Spring 2011	16	3.46	0.43	3.54	3.47	0.46	2.73	4.00	1.27
X12	Spring 2012	43	3.18	0.39	3.19	3.19	0.40	2.33	3.95	1.62
X13	Spring 2013	50	3.20	0.48	3.24	3.24	0.39	2.05	3.97	1.92
X14	Spring 2014	53	3.25	0.44	3.27	3.26	0.50	2.18	3.98	1.80
X15	Spring 2015	61	3.18	0.46	3.19	3.18	0.49	2.19	4.00	1.81
X16	Spring 2016	44	3.07	0.41	3.05	3.06	0.36	2.13	3.96	1.83
X17	Spring 2017	36	3.23	0.40	3.26	3.25	0.34	2.13	4.00	1.87

Was Incoming GPA different through the years?

Show code

a<- TukeyHSD( aov(allGC1$DEM_Cumulative.GPA ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. Entering GPA among semesters")

Table 6: Anova. Entering GPA among semesters
	diff	lwr	upr	p adj
Fall 2012-Fall 2011	-0.1679124	-0.3930252	0.0572004	0.2709961
Fall 2013-Fall 2011	0.0728361	-0.1595233	0.3051956	0.9469705
Fall 2014-Fall 2011	-0.1069817	-0.3191411	0.1051777	0.7000993
Fall 2015-Fall 2011	-0.1160769	-0.3567413	0.1245875	0.7384580
Fall 2016-Fall 2011	0.0137802	-0.2680571	0.2956175	0.9999925
Fall 2013-Fall 2012	0.2407485	0.0192443	0.4622527	0.0242003
Fall 2014-Fall 2012	0.0609307	-0.1392812	0.2611426	0.9531219
Fall 2015-Fall 2012	0.0518354	-0.1783657	0.2820366	0.9874925
Fall 2016-Fall 2012	0.1816926	-0.0912643	0.4546495	0.3999188
Fall 2014-Fall 2013	-0.1798178	-0.3881444	0.0285087	0.1350707
Fall 2015-Fall 2013	-0.1889130	-0.4262055	0.0483794	0.2048113
Fall 2016-Fall 2013	-0.0590559	-0.3380193	0.2199075	0.9905679
Fall 2015-Fall 2014	-0.0090952	-0.2266461	0.2084557	0.9999966
Fall 2016-Fall 2014	0.1207619	-0.1416144	0.3831382	0.7750540
Fall 2016-Fall 2015	0.1298571	-0.1560608	0.4157750	0.7847500

Show code

ggboxplot(allGC1, x = "Semester", y = "DEM_Cumulative.GPA",  title = "Entering GPA in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$DEM_Cumulative.GPA, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Correlation models: Prev. GPA vs GenChem grades

Here we plot previous GPA (typically first-year GPA) against the final grade.

Show code

#par(mfrow = c(1, 2))
plot(allGC1$TG_Total.Grade....,allGC1$DEM_Cumulative.GPA,main="GenChem1")
a<-lm(allGC1$DEM_Cumulative.GPA~allGC1$TG_Total.Grade.... )
abline(a)

Show code

r2a<-summary(a)$r.squared
plot(allGC2$TG_Total.Grade....,allGC2$DEM_Cumulative.GPA,main="GenChem2")
a<-lm(allGC2$DEM_Cumulative.GPA~allGC2$TG_Total.Grade.... )
abline(a)

Show code

r2b<-summary(a)$r.squared

In this case we obtain better r-squared values: 0.656591 and 0.5840838 for GenChem1 and GenChem2, respectively.

Is Highschool performance relevant?

For large schools, high school (HS) ranking can be a better measurement than HS GPA. Also, HS GPA is currently unavailable :). The ranking is given as a percentile, so the higher the better.

Show code

mata<-describeBy(allGC1$DEM_HS.Rank,allGC1$Semester,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1")

Table 7: GenChem1
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	Fall 2011	60	79.18	14.08	80.5	80.48	15.57	46	97	51
X12	Fall 2012	65	73.57	17.01	76.0	74.98	16.31	20	99	79
X13	Fall 2013	57	81.21	12.33	81.0	82.04	13.34	47	99	52
X14	Fall 2014	86	79.43	16.62	84.0	81.56	13.34	26	99	73
X15	Fall 2015	51	81.27	13.83	86.0	82.93	8.90	37	99	62
X16	Fall 2016	25	82.28	11.75	85.0	82.95	10.38	60	98	38

Was Highschool performance different through the years?

Show code

a<- TukeyHSD( aov(allGC1$DEM_HS.Rank ~ allGC1$Semester)) 
b<-as.data.frame(a$`allGC1$Semester`)
knitr::kable(b, caption = "Anova. HS ranking among semesters")

Table 8: Anova. HS ranking among semesters
	diff	lwr	upr	p adj
Fall 2012-Fall 2011	-5.6141026	-13.2615244	2.033319	0.2876124
Fall 2013-Fall 2011	2.0271930	-5.8736280	9.928014	0.9774141
Fall 2014-Fall 2011	0.2468992	-6.9383845	7.432183	0.9999987
Fall 2015-Fall 2011	2.0911765	-6.0444897	10.226843	0.9772356
Fall 2016-Fall 2011	3.0966667	-7.0718166	13.265150	0.9527517
Fall 2013-Fall 2012	7.6412955	-0.1100688	15.392660	0.0558988
Fall 2014-Fall 2012	5.8610018	-1.1596093	12.881613	0.1616772
Fall 2015-Fall 2012	7.7052790	-0.2853243	15.695882	0.0659401
Fall 2016-Fall 2012	8.7107692	-1.3420278	18.763566	0.1318975
Fall 2014-Fall 2013	-1.7802938	-9.0761070	5.515519	0.9819290
Fall 2015-Fall 2013	0.0639835	-8.1694637	8.297431	1.0000000
Fall 2016-Fall 2013	1.0694737	-9.1774107	11.316358	0.9996775
Fall 2015-Fall 2014	1.8442772	-5.7052249	9.393779	0.9818377
Fall 2016-Fall 2014	2.8497674	-6.8561055	12.555640	0.9594958
Fall 2016-Fall 2015	1.0054902	-9.4235429	11.434523	0.9997815

Show code

ggboxplot(allGC1, x = "Semester", y = "DEM_HS.Rank",  title = "Highschool Rank in GC1",
          color = "Semester", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(allGC1$DEM_HS.Rank, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Fall 2012 seems to stand out again.

Correlation models: HS rank vs GenChem grades

Show code

#par(mfrow = c(1, 2))
plot(allGC1$TG_Total.Grade....,allGC1$DEM_HS.Rank,main="GenChem1")
a<-lm(allGC1$DEM_HS.Rank~allGC1$TG_Total.Grade.... )
abline(a)

Show code

r2a<-summary(a)$r.squared
plot(allGC2$TG_Total.Grade....,allGC2$DEM_HS.Rank,main="GenChem2")
a<-lm(allGC2$DEM_HS.Rank~allGC2$TG_Total.Grade.... )
abline(a)

Show code

r2b<-summary(a)$r.squared

The r-squared values are fairly poor: 0.1667476 and 0.0552476 for GenChem1 and GenChem2, respectively.

Demographics

Given the good correlation found above between previous GPA and final grade, let’s analyze how students of different demographics perform in chemistry relative to their incoming GPA. In other words, instead of simply comparing how first-generation and non-first-generation students do, it is more interesting to see how they did in GenChem given their college readiness (as described by GPA).
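
One possible way to make "performance relative to incoming GPA" concrete is sketched below on simulated data. This is not the analysis used in the rest of this post, just an illustration: 'group' stands in for any two-level demographic factor, and all numbers are made up.

```r
# Simulated data; 'group' is a placeholder for a two-level demographic factor.
set.seed(42)
n     <- 300
group <- factor(sample(c("N", "Y"), n, replace = TRUE))
gpa   <- pmin(4, pmax(1, rnorm(n, mean = 3.0, sd = 0.45)))
grade <- 40 + 13 * gpa + rnorm(n, sd = 6)  # no true group effect by construction

# Option 1: include GPA as a covariate and read the group coefficient
fit <- lm(grade ~ gpa + group)
summary(fit)$coefficients

# Option 2: compare the residuals of a GPA-only model between the groups
res <- resid(lm(grade ~ gpa))
t.test(res ~ group)
```

In the first option, the coefficient labelled groupY estimates the grade difference between groups for students with the same GPA. In the second, a positive mean residual for a group means its students did better than their GPA alone would predict.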

Gender

Let’s look at previous GPA and GenChem grades by self-identified gender. There was no data besides male and female.

Show code

#there are some undefined values that mess up the graphs
onlyMF_gc1<- allGC1_[complete.cases(allGC1_$Sex),]
onlyMF_gc2<- allGC2_[complete.cases(allGC2_$Sex),]
mata<-describeBy(onlyMF_gc1$DEM_Cumulative.GPA,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA and Sex")

Table 9: 1st year GPA and Sex
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	F	286	3.08	0.45	3.06	3.08	0.48	1.88	4	2.12
X12	M	117	3.00	0.53	3.04	3.01	0.49	1.33	4	2.67

Show code

mata<-describeBy(onlyMF_gc1$DEM_ACT.MATH,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "ACT math and Sex")

Table 9: ACT math and Sex
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	F	270	24.61	3.09	25	24.61	2.97	17	34	17
X12	M	111	25.82	3.71	26	25.83	2.97	17	34	17

Show code

mata<-describeBy(onlyMF_gc1$DEM_HS.Rank,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "HS rank and Sex")

Table 9: HS rank and Sex
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	F	239	81.02	14.21	84.0	82.64	11.86	25	99	74
X12	M	96	74.40	16.42	76.5	75.46	17.05	20	99	79

From the above, we can see that females come to GenChem with a very slightly higher GPA and a remarkably better HS ranking, but with a lower ACT math score. Also, males have a broader range of values and a higher standard deviation, which tells us that males may not be well described as a single group and may require a finer classification. In any case, how will these three factors affect their performance in GenChem? The number of students may not be exactly the same in each table because not all students have ACT or HS data.

Show code

mata<-describeBy(onlyMF_gc1$TG_Total.Grade....,onlyMF_gc1$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1 grade and Sex")

Table 10: GenChem1 grade and Sex
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	F	289	79.74	9.84	81.0	80.49	9.49	42.50	98.78	56.28
X12	M	120	80.88	12.12	82.1	82.36	11.15	40.52	97.14	56.62

Show code

mata<-describeBy(onlyMF_gc2$TG_Total.Grade....,onlyMF_gc2$Sex,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2 grade and Sex")

Table 10: GenChem2 grade and Sex
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	F	196	78.18	10.51	79.28	78.88	9.62	38.20	97.50	59.30
X12	M	102	80.31	10.19	80.57	80.76	13.00	56.39	97.22	40.83

Comparing the two genders each year

While it may look like males do better than females, even though females came with a better GPA and HS ranking, there is actually no significant difference when the two groups are compared overall.

Show code

#install.packages("ggpubr")
library(ggpubr)
p <- ggboxplot(onlyMF_gc1, x = "Sex", y = "TG_Total.Grade....", color = "Sex", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() #default is wilcox for comparing non-parametric two groups

However, when the two groups are compared each semester we notice that Fall 2011 is the only semester with a significant difference between genders.

Show code

p <- ggboxplot(onlyMF_gc1, x = "Semester.x", y = "TG_Total.Grade....", color = "Sex", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=Sex),label="p.format") #default is wilcox for comparing non-parametric two groups

Performance by each gender through the years

Before we jump to conclusions, however, we may need to look into how the females in Fall 2011 performed compared to the females in other semesters.

Show code

#selecting females
onlyF_gc1 <- onlyMF_gc1[onlyMF_gc1$Sex=="F",]
onlyM_gc1 <- onlyMF_gc1[onlyMF_gc1$Sex=="M",]

ggboxplot(onlyF_gc1, x = "Semester.x", y = "TG_Total.Grade....",  title = "Females in GC1",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyF_gc1$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Show code

ggboxplot(onlyM_gc1, x = "Semester.x", y = "TG_Total.Grade....",  title = "Males in GC1",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyM_gc1$TG_Total.Grade....), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Show code

ggboxplot(onlyF_gc1, x = "Semester.x", y = "DEM_Cumulative.GPA",  title = "Incoming GPA for females",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyF_gc1$DEM_Cumulative.GPA, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Show code

ggboxplot(onlyM_gc1, x = "Semester.x", y = "DEM_Cumulative.GPA",  title = "Incoming GPA for males",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyM_gc1$DEM_Cumulative.GPA, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 5) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Show code

ggboxplot(onlyF_gc1, x = "Semester.x", y = "DEM_HS.Rank",  title = "HS Ranking for females",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyF_gc1$DEM_HS.Rank, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

Show code

ggboxplot(onlyM_gc1, x = "Semester.x", y = "DEM_HS.Rank",  title = "HS Ranking for males",
          color = "Semester.x", add = "jitter", legend="none") + rotate_x_text(angle = 45) +  
  geom_hline( yintercept = mean(onlyM_gc1$DEM_HS.Rank, na.rm = TRUE), linetype = 2) + 
  stat_compare_means(method = "anova", label.y = 110) +
  stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")

We saw that females performed significantly lower than males in Fall 2011, and almost significantly higher in Fall 2013. However, these differences may also be explained by differences in incoming GPA, but not by HS ranking. Also, many students lack an HS ranking, so those statistics may be less reliable.

Ethnicity

Let’s compare the GPA before enrolling in GenChem by students’ self-identified ethnicity.

Show code

mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$DEM_Student.of.Color,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA and Student of Color: Y/N")

Table 11: 1st year GPA and Student of Color: Y/N
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	N	340	3.09	0.47	3.11	3.10	0.47	1.33	4.00	2.67
X12	Y	73	2.84	0.46	2.78	2.82	0.43	1.90	3.97	2.07

Show code

mata<-describeBy(allGC1$TG_Total.Grade....,allGC1$DEM_Student.of.Color,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1 grade and Student of Color: Y/N")

Table 11: GenChem1 grade and Student of Color: Y/N
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	N	348	81.06	10.20	82.34	82.13	8.65	40.52	98.78	58.27
X12	Y	73	74.67	10.48	75.30	74.58	12.04	43.47	96.02	52.56

Show code

mata<-describeBy(allGC2$TG_Total.Grade....,allGC2$DEM_Student.of.Color,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2 grade and Student of Color: Y/N")

Table 11: GenChem2 grade and Student of Color: Y/N
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11	N	260	79.13	10.33	79.74	79.75	10.53	43.7	97.5	53.8
X12	Y	49	74.44	12.76	72.95	75.31	9.87	35.6	96.7	61.1

Show code

require(gridExtra)
plotA <-ggplot(allGC1, aes(x=TG_Total.Grade...., fill=DEM_Student.of.Color)) + geom_histogram() + ggtitle("GenChem1 by Ethnicity")
plotB <-ggplot(allGC1, aes(x=DEM_Cumulative.GPA, fill=DEM_Student.of.Color)) + geom_histogram() + ggtitle("Prev GPA by Ethnicity")
grid.arrange(plotA,plotB)

Statistical analysis of performance in GenChem1 by ethnicity

Show code

p <- ggboxplot(allGC1, x = "DEM_Student.of.Color", y = "TG_Total.Grade....", color = "DEM_Student.of.Color", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() #default is wilcox for comparing non-parametric two groups

Show code

p <- ggboxplot(allGC1, x = "Semester", y = "TG_Total.Grade....", color = "DEM_Student.of.Color", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_Student.of.Color),label="p.format") #default is wilcox for comparing non-parametric two groups

Different ethnicities

Show code

mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$DEM_Ethnicity,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA for different ethnicities")

Table 12: 1st year GPA for different ethnicities
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11		0	NaN	NA	NA	NaN	NA	Inf	-Inf	-Inf
X12	Am. Indian	5	2.74	0.36	2.62	2.74	0.43	2.33	3.23	0.90
X13	Asian	38	2.88	0.47	2.81	2.86	0.36	1.90	3.97	2.07
X14	Black	29	2.96	0.50	2.89	2.95	0.56	2.12	3.95	1.83
X15	Hawaiian	1	3.11	NA	3.11	3.11	0.00	3.11	3.11	0.00
X16	Hispanic	13	2.78	0.38	2.70	2.76	0.36	2.27	3.47	1.20
X17	NS	3	2.72	0.66	2.37	2.72	0.07	2.32	3.48	1.16
X18	White	324	3.10	0.47	3.11	3.11	0.47	1.33	4.00	2.67

We can also run an ANOVA among the different ethnicities, but it is hard to do statistics on such small numbers. Perhaps only the Black and Asian groups are large enough to be compared with the White group.

Show code

TukeyHSD( aov(allGC1$TG_Total.Grade.... ~ allGC1$DEM_Ethnicity))

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = allGC1$TG_Total.Grade.... ~ allGC1$DEM_Ethnicity)

$`allGC1$DEM_Ethnicity`
                           diff         lwr       upr     p adj
Am. Indian-          -1.2520311 -24.3127791 21.808717 0.9999998
Asian-                1.8944333 -17.0079964 20.796863 0.9999879
Black-                0.2271630 -18.9237458 19.378072 1.0000000
Hawaiian-             6.1165333 -30.3457108 42.578777 0.9996050
Hispanic-            -0.4325654 -20.6581794 19.793049 1.0000000
NS-                  -4.9680000 -30.7507001 20.814700 0.9990190
White-                5.7148568 -12.5997033 24.029417 0.9807262
Asian-Am. Indian      3.1464644 -11.8319308 18.124860 0.9982867
Black-Am. Indian      1.4791941 -13.8115804 16.769969 0.9999905
Hawaiian-Am. Indian   7.3685644 -27.2225576 41.959686 0.9981266
Hispanic-Am. Indian   0.8194657 -15.7975718 17.436503 0.9999999
NS-Am. Indian        -3.7159689 -26.7767169 19.344779 0.9996975
White-Am. Indian      6.9668879  -7.2624335 21.196209 0.8117453
Black-Asian          -1.6672703  -9.3686684  6.034128 0.9979230
Hawaiian-Asian        4.2221000 -27.7474084 36.191609 0.9999204
Hispanic-Asian       -2.3269987 -12.4081536  7.754156 0.9968822
NS-Asian             -6.8624333 -25.7648630 12.039996 0.9553407
White-Asian           3.8204235  -1.4689372  9.109784 0.3535646
Hawaiian-Black        5.8893703 -26.2276802 38.006421 0.9992900
Hispanic-Black       -0.6597284 -11.1994223  9.879965 0.9999995
NS-Black             -5.1951630 -24.3460719 13.955746 0.9915567
White-Black           5.4876938  -0.6305412 11.605929 0.1158453
Hispanic-Hawaiian    -6.5490987 -39.3183386 26.220141 0.9987569
NS-Hawaiian         -11.0845333 -47.5467775 25.377711 0.9834226
White-Hawaiian       -0.4016765 -32.0271526 31.223800 1.0000000
NS-Hispanic          -4.5354346 -24.7610486 15.690179 0.9974027
White-Hispanic        6.1474222  -2.7829165 15.077761 0.4184543
White-NS             10.6828568  -7.6317033 28.997417 0.6360504
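
The output above contains comparisons against an unnamed group (rows such as `Asian-`), which suggests that blank ethnicity entries are being treated as a level of their own. One possible cleanup, sketched here on synthetic data, is to recode blanks as missing and keep only the groups with enough students before running the test. The data frame `demo`, its columns, the number of blank entries, and the threshold `min_n` are all assumptions made for illustration; only the group sizes are borrowed from Table 12.

```r
# Synthetic stand-in data; group sizes follow Table 12, grades are random.
set.seed(2)
sizes <- c("White" = 324, "Asian" = 38, "Black" = 29,
           "Hispanic" = 13, "Am. Indian" = 5, "NS" = 3, "Hawaiian" = 1)
demo <- data.frame(
  ethnicity = c(rep(names(sizes), times = sizes), rep("", 3)),
  stringsAsFactors = FALSE
)
demo$grade <- rnorm(nrow(demo), mean = 80, sd = 10)

# Treat blank entries as missing rather than as a group of their own
demo$ethnicity[demo$ethnicity == ""] <- NA

# Keep only groups with at least min_n students (arbitrary threshold)
min_n <- 25
counts <- table(demo$ethnicity)          # NA is excluded by default
keep <- names(counts)[counts >= min_n]
demo_large <- demo[demo$ethnicity %in% keep, ]
demo_large$ethnicity <- factor(demo_large$ethnicity)

# Pairwise comparisons among the remaining groups only
tukey <- TukeyHSD(aov(grade ~ ethnicity, data = demo_large))
print(tukey)
```

With fewer groups there are fewer pairwise comparisons to correct for, so the remaining confidence intervals are not widened by groups that are too small to say anything about.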

First generation students

Let’s compare the GPA before enrolling in GenChem for first generation students vs. the rest. Note how many students we have information for: a total of 421 in GenChem1 and 309 in GenChem2, counting the 24 and 16 students, respectively, whose first generation status is blank.

Show code

mata<-describeBy(allGC1$DEM_Cumulative.GPA,allGC1$DEM_First.Generation,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "1st year GPA and 1st generation: Y/N")

Table 13: 1st year GPA and 1st generation: Y/N
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11		20	2.85	0.58	2.70	2.79	0.45	1.88	4.00	2.12
X12	N	248	3.05	0.48	3.09	3.07	0.50	1.33	3.98	2.65
X13	Y	145	3.07	0.44	3.02	3.06	0.43	1.90	4.00	2.10

Show code

mata<-describeBy(allGC1$TG_Total.Grade....,allGC1$DEM_First.Generation,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem1 grade and 1st generation: Y/N")

Table 14: GenChem1 grade and 1st generation: Y/N
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11		24	72.63	14.01	73.62	72.63	15.20	42.50	97.14	54.64
X12	N	252	79.81	10.51	81.07	80.73	10.29	40.52	97.07	56.55
X13	Y	145	81.41	9.36	82.18	82.20	8.03	53.00	98.78	45.78

Show code

mata<-describeBy(allGC2$TG_Total.Grade....,allGC2$DEM_First.Generation,mat=TRUE,digits = 2)
knitr::kable(mata[,c(2,4,5,6,7,8,9,10,11,12)] ,  caption = "GenChem2 grade and 1st generation: Y/N")

Table 15: GenChem2 grade and 1st generation: Y/N
	group1	n	mean	sd	median	trimmed	mad	min	max	range
X11		16	78.97	11.79	79.30	79.09	14.52	60.1	96.14	36.04
X12	N	187	78.26	10.96	79.10	78.94	11.12	38.2	97.50	59.30
X13	Y	106	78.52	10.64	79.42	79.24	10.49	35.6	96.70	61.10
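
The totals quoted earlier can be recovered from the `n` column of the two grade tables. The short check below uses only those counts; the variable names are made up for illustration. It also shows the share of first generation students among those with a recorded Y/N answer, since the totals include the students whose entry is blank.

```r
# n column of the GenChem1 and GenChem2 grade tables: blank, N, Y
n_gc1 <- c(blank = 24, N = 252, Y = 145)
n_gc2 <- c(blank = 16, N = 187, Y = 106)

# Total number of students in each table
total_gc1 <- sum(n_gc1)
total_gc2 <- sum(n_gc2)

# Share of first generation students among those with a recorded Y/N answer
share_fg_gc1 <- n_gc1[["Y"]] / (n_gc1[["N"]] + n_gc1[["Y"]])
share_fg_gc2 <- n_gc2[["Y"]] / (n_gc2[["N"]] + n_gc2[["Y"]])

print(c(GenChem1 = total_gc1, GenChem2 = total_gc2))
print(round(c(GenChem1 = share_fg_gc1, GenChem2 = share_fg_gc2), 3))
```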

Show code

ggplot(allGC1, aes(x=TG_Total.Grade...., fill=DEM_First.Generation )) + geom_histogram() + ggtitle("GenChem1 by First Generation")

Show code

ggplot(allGC2, aes(x=TG_Total.Grade...., fill=DEM_First.Generation))+geom_histogram()+ggtitle("GenChem2 by First Generation")

Statistical analysis of performance 1st generation in GenChem1

Show code

p <- ggboxplot(allGC1, x = "DEM_First.Generation", y = "TG_Total.Grade....", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() #default is wilcox for comparing non-parametric two groups
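
One caveat: the first generation column has three values (blank, N and Y), and ggpubr documents the default of `stat_compare_means()` as Wilcoxon for two groups but Kruskal-Wallis for more than two. If the blank entries are plotted as a third group, the label on the plot is therefore likely a Kruskal-Wallis p-value. The sketch below, on synthetic data, contrasts the two tests. The data frame `demo` and its columns are made up for illustration; the group sizes, means and standard deviations are borrowed from the GenChem1 grade table, and the grades themselves are random.

```r
# Synthetic stand-in data shaped like the GenChem1 grade table
set.seed(3)
demo <- data.frame(
  first_gen = factor(rep(c("", "N", "Y"), times = c(24, 252, 145))),
  grade = c(rnorm(24, 72.6, 14), rnorm(252, 79.8, 10.5), rnorm(145, 81.4, 9.4))
)

# With all three groups, the overall non-parametric test is Kruskal-Wallis
res_kw <- kruskal.test(grade ~ first_gen, data = demo)

# Restricting to recorded Y/N answers gives a two-group Wilcoxon test
demo_yn <- droplevels(subset(demo, first_gen %in% c("N", "Y")))
res_w <- wilcox.test(grade ~ first_gen, data = demo_yn)

print(res_kw$p.value)
print(res_w$p.value)
```

The Y vs. N comparison is the one that answers the question about first generation students; the three-group test can come out significant only because the blank group differs from the other two.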

Show code

p <- ggboxplot(allGC1, x = "Semester", y = "TG_Total.Grade....", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_First.Generation),label="p.format") #default is wilcox for comparing non-parametric two groups

Show code

p <- ggboxplot(allGC2, x = "Semester", y = "TG_Total.Grade....", color = "DEM_First.Generation", palette = "jco", add = "jitter")  + rotate_x_text(angle = 45)
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_First.Generation),label="p.format") #default is wilcox for comparing non-parametric two groups

First generation students seem to do slightly better than, or the same as, the rest. Are they coming in with equal preparation? We can look at HS rank to try to answer that.

Show code

p <- ggboxplot(allGC1, x = "DEM_First.Generation", y = "DEM_HS.Rank", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means() #default is wilcox for comparing non-parametric two groups

Show code

p <- ggboxplot(allGC1, x = "Semester", y = "DEM_HS.Rank", color = "DEM_First.Generation", palette = "jco", add = "jitter")
#p + stat_compare_means(method = "t.test")
p + stat_compare_means(aes(group=DEM_First.Generation),label="p.format") #group by first generation status (not by HS rank) so the two groups are compared within each semester

Judging by HS rank, first generation students seem to arrive already better prepared than non-first generation students, although their incoming GPA is essentially the same (3.07 vs. 3.05).
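
One way to check whether the grade difference between the two groups persists once preparation is taken into account is to add HS rank as a covariate in a linear model. This is only a sketch on synthetic data: the data frame `demo` and its columns `first_gen`, `hs_rank` and `grade` are made up, HS rank is assumed to be a percentile where higher is better, and the synthetic grades depend on HS rank alone by construction.

```r
# Synthetic stand-in data; not the course data.
set.seed(4)
n <- 300
demo <- data.frame(
  first_gen = factor(sample(c("N", "Y"), n, replace = TRUE, prob = c(0.63, 0.37))),
  hs_rank = runif(n, min = 30, max = 99)  # assumed percentile, higher is better
)
# By construction, grades depend on HS rank only
demo$grade <- 60 + 0.25 * demo$hs_rank + rnorm(n, sd = 8)

# Unadjusted difference between first generation students and the rest
fit_raw <- lm(grade ~ first_gen, data = demo)

# Difference after adjusting for HS rank
fit_adj <- lm(grade ~ first_gen + hs_rank, data = demo)

print(coef(summary(fit_raw)))
print(coef(summary(fit_adj)))
```

In the adjusted model, the `first_genY` coefficient estimates the grade difference between students with the same HS rank, which is closer to the question of whether first generation status matters beyond incoming preparation.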

Citation