Assessment Instruments for SFSU NOVA Course

Action Research for SFSU's NASA-NOVA Course:
Planetary Climate Change
January 18, 2002

Dr. Dave Dempsey
Prof. of Meteorology
Dept. of Geosciences
College of Science
ddempsey@sundog.sfsu.edu

Dr. Kathleen O'Sullivan
Professor
Dept. of Secondary Education
College of Education
kaosul@sfsu.edu

Dr. Lisa White
Assoc. Prof. of Geology
Dept. of Geosciences
College of Science
lwhite@sfsu.edu

I. Introduction

Action research for GEOL/METR 310, "Planetary Climate Change", comprised the three types of assessment described in section II below:

Attitudes about science [draft mostly complete]
Scientific reasoning [draft mostly complete]
Concepts about climate [draft mostly complete]

These assessments were administered during the first week of classes ("pre-tests") and again in the last week or two of the semester ("post-tests"), in the following six classes:

GEOL/METR 310, "Planetary Climate Change" (our NOVA course, taught by Dempsey and White in Fall, 2000; hereafter referred to as GM310.00)
GEOL/METR 310 (taught by Dempsey and Dr. Matt LaForce in Fall, 2001; hereafter referred to as GM310.01)
GEOL/METR 103, "Introduction to Oceanography Lab" (David Morris, Fall 2000; hereafter referred to as GM103)
GEOL 302, "The Violent Planet" (Dr. Erwin Seibel, Fall 2000; hereafter referred to as G302)
METR 302, "The Violent Atmosphere and Oceans" (Dr. Erwin Seibel, Fall 2000; hereafter referred to as M302.S)
METR 302, "The Violent Atmosphere and Oceans" (Dr. John Monteverdi, Fall 2000; hereafter referred to as M302.M)

The last four of these courses are introductory, general education (GE) courses for non-majors. They served as controls for our NOVA course. The number of students who completed both pre- and post-tests in each class was as follows:

Class	GM310.00 (NOVA course)	GM310.01 (NOVA course)	GM103	G302	M302.S	M302.M
Number of Students Scored	4	4	6	19	18	5

II. Assessments

A. Attitudes about Science

Question asked: Do students develop a better attitude about science than in existing courses?

Assessment instrument:

Attitude assessment [hypertext link].

Student response sheet [hypertext link].

Description: This assessment is based on the Test of Science-Related Attitudes (TOSRA) (Fraser, 1981). It comprises a set of statements, each of which falls into one of four general categories:

Societal aspects of science
Science as inquiry ("doing science")
Nature and history of science
Nature and study of climate

The statements in the first two categories are taken directly from the TOSRA. We wrote the statements in the third category based very closely on the "Nature and History of Science" content standards from the National Research Council's National Science Education Standards (NSES). We wrote the statements in the fourth category to find out what students feel or understand about the nature of climate, who studies it, and how it is studied.
For each statement, students are asked to indicate whether they strongly agree, agree, don't know how they feel, disagree, or strongly disagree. Statements are paired to increase the reliability of the scores, with the statements in each pair saying basically the same thing but one phrased positively and the other negatively. For each of the four categories of statements, there are 10 statements in 5 pairs, for a total of 40 statements. The sequence of 40 statements is organized so that the first four statements are drawn from the four categories in the order listed above, and each set of four statements thereafter is similarly selected and ordered. There is no systematic ordering of statements based on whether they are positively or negatively phrased.
Assessment procedure: We gave each student the attitude assessment and a response sheet. Instructions for recording responses are printed on both the attitude assessment and the response sheet itself, but we also explained them orally, using transparencies and an overhead projector to illustrate. The key point to emphasize to students is that on the response sheet, the statement numbers (and five possible responses for each statement) are organized in four columns, and the numbers increase from left to right across columns rather than from top to bottom within each column. We gave students enough time to complete the assessment (roughly 15 minutes).
We administered the assessment on the first day of class (the "pre-test") and again in the last week or so of the semester (the "post-test"). We scored only assessments completed by students on both dates, and we scored both sets of responses only after the end of the semester. (To match each pair of pre- and post-test responses, we used the last four digits of each student's nine-digit student number, which each student was supposed to write on the response form. We did not score single, unmatched assessments.) To decrease the likelihood of scoring errors, two people independently scored every response, and any major differences between the two sets of scores were resolved by rescoring.
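The matching step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually used (scoring was done by hand); the function name and the dictionary layout are assumptions made for the example.

```python
def match_pre_post(pre_sheets, post_sheets):
    """Pair pre- and post-test sheets by the last four digits of the student number.

    pre_sheets, post_sheets: dicts mapping a student-number string (the full
    nine digits, or just the last four) to that student's responses.
    Sheets with no partner in the other set are dropped, since unmatched
    assessments are not scored.
    """
    pre = {number[-4:]: sheet for number, sheet in pre_sheets.items()}
    post = {number[-4:]: sheet for number, sheet in post_sheets.items()}
    # Keep only the four-digit keys that appear on both dates.
    shared = sorted(pre.keys() & post.keys())
    return {key: (pre[key], post[key]) for key in shared}


# Illustrative (made-up) student numbers and sheet labels.
pairs = match_pre_post(
    {"123456789": "pre sheet A", "111112222": "pre sheet B"},
    {"6789": "post sheet A", "3333": "post sheet C"},
)
print(pairs)
```

Note that two students in the same class could share the same last four digits; the sketch would silently keep only one of them, so such collisions would need to be checked by hand.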
Scoring rubric: Because of the way the statements are ordered on the attitude assessment (in groups of 4, one per category of statements), and the way the responses are organized into four columns on the response sheet, all responses to statements in Category 1 appear in the first column of the response sheet, responses in Category 2 appear in the second column, etc. For positively-phrased statements, five points are assigned to "strongly agree" responses, four points to "agree", etc. For negatively-phrased statements, one point is assigned to "strongly agree" responses, two points to "agree", etc. For each student's response sheet, we scored responses and summed them in each column separately. (A transparent template mimicking the student response sheet but with appropriate point values replacing "SA", "A", "N", "D", and "SD", can be created and overlaid on each student response sheet to facilitate manual scoring.) The maximum possible score for each column (that is, each category) is 50 and the minimum is 10.
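For readers who prefer to score electronically rather than with a transparent template, the rubric can be sketched as follows. This is a minimal sketch under stated assumptions: responses are assumed to be recorded as a mapping from statement number (1 to 40) to "SA", "A", "N", "D", or "SD", and the set of negatively phrased statement numbers is a hypothetical input that would have to be read off the actual instrument.

```python
# Point values for positively phrased statements.
POINTS = {"SA": 5, "A": 4, "N": 3, "D": 2, "SD": 1}


def score_sheet(responses, negative_statements):
    """Return the four category totals for one response sheet.

    responses: dict mapping statement number (1..40) to SA, A, N, D, or SD.
    negative_statements: set of statement numbers that are negatively
        phrased (hypothetical here; taken from the real instrument in practice).
    """
    totals = [0, 0, 0, 0]
    for number, answer in responses.items():
        points = POINTS[answer]
        if number in negative_statements:
            points = 6 - points  # reverse the scale: SA -> 1, A -> 2, ...
        # Statements cycle through the four categories: 1, 5, 9, ... belong
        # to Category 1; 2, 6, 10, ... to Category 2; and so on.
        totals[(number - 1) % 4] += points
    return totals


all_agree = {n: "SA" for n in range(1, 41)}
print(score_sheet(all_agree, negative_statements=set()))
print(score_sheet(all_agree, negative_statements=set(range(1, 41))))
```

The two calls reproduce the extremes stated in the rubric: 50 points per category when every response earns five points, and 10 points per category when every response earns one.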
Analysis procedure: We calculated post-test minus pre-test difference scores for each student for each category. We performed three types of statistical tests on each set of difference scores, with a 95% confidence level (a significance level of 0.05) chosen in advance as the standard for accepting or rejecting null hypotheses:
- An F-test of the null hypothesis that all six classes had the same mean pre/post-test difference scores, and a similar F-test in which the two GM310 classes were lumped together. To test the homogeneity of the control classes, we performed a second F-test of the null hypothesis that the four GE classes (that is, excluding GM310) had the same mean post/pre-test difference scores. To test the homogeneity of the two GM310 classes, we performed a t-test of the difference of their respective means.
- A t-test of the null hypothesis that there was no difference between the mean pre/post-test difference score for the GM310 classes combined and the mean difference score for the control group (i.e., all four GE courses lumped together). Similar tests were performed for the individual GM310 classes vs. the control group.
- t-tests of each of the null hypotheses that the mean pre/post-test difference score for all students lumped together, for all students in the four GE classes lumped together, for the GM310 classes lumped together, and for each of the six classes separately, was not significantly different from zero.
The first two types of tests above were also performed on the pre-test scores alone, to test hypotheses that the students in GM310 were initially no different from students in the four GE classes, that students in the four GE classes did not differ significantly from each other, and that the students in the two GM310 classes did not differ significantly from each other (at least as measured by these assessments).
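The test statistics behind these three types of tests can be sketched with Python's standard library. This is an illustration under stated assumptions rather than a record of the software we used: the two-sample test shown is the pooled-variance (Student) form, which is an assumption, and the sample numbers in the usage lines are made up. Each statistic would then be compared with the critical value of the F or t distribution at the 0.05 level for the appropriate degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def difference_scores(pre, post):
    """Post-test minus pre-test score for each matched student."""
    return [after - before for before, after in zip(pre, post)]


def f_statistic(groups):
    """One-way ANOVA F statistic for the hypothesis that all group means are equal."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def two_sample_t(a, b):
    """Pooled-variance t statistic for the hypothesis that two means are equal."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def one_sample_t(sample):
    """t statistic for the hypothesis that the mean difference score is zero."""
    return mean(sample) / sqrt(variance(sample) / len(sample))


# Made-up difference scores for a NOVA class and a control group.
nova = difference_scores(pre=[30, 28, 35, 31], post=[36, 33, 37, 38])
control = difference_scores(pre=[29, 33, 31, 30, 34], post=[30, 32, 31, 33, 33])
print(f_statistic([nova, control]), two_sample_t(nova, control), one_sample_t(nova))
```

With only two groups, the F statistic equals the square of the pooled two-sample t statistic, which provides a quick consistency check on any implementation.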

Results:

Hypothesis Tested	Type of Test	Score	Result of Test (95% significance level)
(a) Mean scores are the same among all six classes.	F-test	pre-test only	Accept: all four categories
(a) Mean scores are the same among all six classes.	F-test	post/pre test difference	Accept: Categories 1, 2, 3; Reject: Category 4
(b) Mean scores are the same among all classes, with GM310 classes lumped together.	F-test	pre-test only	Accept: all four categories
(b) Mean scores are the same among all classes, with GM310 classes lumped together.	F-test	post/pre test difference	Accept: Categories 1, 2; Reject: Categories 3, 4
(c) Mean scores are the same among the four GE classes.	F-test	pre-test only	Accept: all four categories
(c) Mean scores are the same among the four GE classes.	F-test	post/pre test difference	Accept: Categories 1, 2, 4; Reject: Category 3
(d) Mean scores for GM310 lumped together and for all four GE classes lumped together, are the same.	t-test	pre-test only	Accept: all four categories
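(d) Mean scores for GM310 lumped together and for all four GE classes lumped together, are the same.	t-test	post/pre test difference	Accept: Categories 2 and 3; Reject: Categories 1 and 4 (GM310 higher)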
(e) Mean scores for all classes lumped together are zero.	t-test	post/pre test difference	Accept: Categories 1, 2 and 3; Reject: Category 4 (positive)
(f) Mean scores for the four GE classes lumped together are zero.	t-test	post/pre test difference	Accept: all four categories
(g) Mean scores for the two GM310 classes lumped together are zero.	t-test	post/pre test difference	Accept: Categories 2, 3; Reject: Categories 1, 4
(h) Mean scores for each course separately are zero.	t-test	post/pre test difference	Reject: GM310.00, Cat. 4 (positive); GM310.01, Cat. 4 (positive); G302, Cat. 4 (positive); M302.S, Cat. 3 (negative); Accept: all others

Interpretation: [Not done yet]

B. Scientific Reasoning

Question asked: Do students learn to reason scientifically better than they do in existing courses?

Assessment instrument:

Reasoning assessment (two problems) [hypertext link]

Description: Both of these scientific reasoning problems involve the results of four experimental trials involving two or three factors, respectively. The problems are as follows:

In the first problem, eight slugs are placed in the center of a cage, and a new kind of bait is placed on one side of the cage. The underlying soil is either all sand, all moss, or equal areas of sand and moss, with the bait lying entirely on one or the other. Students are asked whether or not the slugs respond to the bait and to the two soil types, and four possible answers are provided. They are to explain their choice.

In the second problem, four groups of 100 monkeys are each given one of three different supplements, or a combination of two of the three supplements, in their food for one month. The average weight gain is recorded. Students are asked about what can be concluded about the effect that one of the supplements has on the weight gain, and are provided four possible choices. They are asked to explain their reasoning.

Both of these problems raise issues involving control of multiple factors. It should be noted that in the geosciences, such control is relatively rare, and students in all six classes were unlikely to encounter a problem requiring interpretation of experiments that involve controls or in which controls are possible.
Assessment procedure: We distributed the two problems to students and briefly described them, noting that students should not only choose one of the four possible answers provided but also explain their reasoning as best they could. We allowed students about 15 minutes to complete this assessment.
Problem solutions and scoring rubric: Each of the two problems is worth five points: one point for choosing the correct answer from among the four possibilities provided and four points for the reasoning supporting the choice. (In some cases it is possible to earn partial credit for an explanation supporting an incorrect choice.)
- First problem: Correct answer is (c), "both bait and soil condition". Cases (i) and (iv) provide controls for soil condition, since the soil is uniform in each case. In both cases, the slugs are clearly attracted to the new bait. Case (iii) shows that if the bait lies on sand, and if moss underlies the other half of the cage, then half of the slugs will either be more attracted by the moss soil condition than by the bait on sand, or be repelled by the sand toward the moss in spite of the attraction of the bait, clearly demonstrating that the slugs respond to soil condition as well as the bait. Case (ii) reinforces the conclusion that slugs find moss more attractive (or less repulsive) than sand. Note that the data do not support the conclusion that slugs respond more strongly to bait than to soil condition; in case (iii), the only one in which slugs had to choose between bait and the preferred soil condition, half the slugs responded more strongly to soil condition than to bait and half responded more strongly to bait than to soil condition.
  Many students chose (a), "bait but not soil condition". They received no credit for that choice, but if their explanation cited evidence that slugs were attracted to the bait (by referring explicitly to individual experimental cases) they received partial credit (typically 1 point). Students who made the correct choice and generally justified it logically (by referring explicitly to appropriate experimental cases) but claimed that slugs responded more strongly to bait than to soil condition received 4 out of a possible 5 points. Students who chose the correct answer but did not explain their reasoning by referring explicitly to appropriate experimental cases, could receive only partial credit. Students who chose the correct answer but gave an explanation inconsistent with that choice received only 1 point (for the correct answer).
- Second problem: The probable correct answer is (b), "Supplement B decreases weight gain under some conditions". Under some conditions, choice (d), "It is not possible to tell whether Supplement B influences weight gain from the information given", could be a justifiable answer.
  If we ignore statistical uncertainties such as physiological differences between monkeys among the four groups, differences in the amount of food each group ate, differences in the levels of activity of monkeys among the groups, etc., then Group 4 acts as a control for Group 3 because the two are identical except that Group 3 received supplement B and Group 4 did not. Both groups gained weight, but supplement B appears to have reduced the weight gain relative to what would have happened without supplement B, which is the basis for choosing (b). Group 2 has no bearing on the question, since there was no group fed with no supplements at all, which would be the appropriate control group for Group 2. Group 1 provides no useful information about the question.
  Because of the uncertainties about comparability of monkeys among groups, activity levels, total food consumption, etc., it is possible to argue that with the information given, it is in principle not possible to tell with sufficient certainty whether supplement B influences weight gain. Hence, we gave full credit to students who chose (d) and justified the choice by citing some of these other possible effects on weight gain. However, most students who chose (d) did not cite these additional uncertainties, so we did not give them any credit at all (not even 1 point for choosing (d)). For example, it was relatively common for students to claim that we can't tell about the effects of supplement B because we don't know the initial mean weights of monkeys in each group. It was also common for students simply to restate (d) without offering any reasoning. Students who selected (d) because of the absence of a control group for Group 2 (that is, a group fed with no supplements at all) received 3 points out of 5 possible.
  Most students chose (a), "Supplement B increases weight gain under some conditions", probably confusing "causes weight gain" with "increases weight gain", noting the results of Group 2 and either comparing it to results of Group 3 or failing to realize that a control group that ate food without any supplements at all would be needed to tell whether supplement B causes weight gain. Of course, because the question asks not about causing weight gain but rather causing differences in weight gain, this line of reasoning is based on a misreading of the question in the first place and these students received no credit at all. Some students chose (a) based on a comparison of Groups 3 and 4; we gave these answers 0 points.
  Students who correctly chose (b) and noted correctly that Group 4 acts as a control group for Group 3, but referred to Group 2 as part of their reasoning, were given 4 out of 5 possible points (because Group 2 contributes no information to the question).
Analysis procedure: Null hypotheses posed and tests performed were the same as in the assessment of attitudes about science.
Results: All hypotheses tested were accepted.
Interpretation: [Not done yet]

C. Climate Concepts

Question asked: Do students learn connections among geosciences and concepts about climate and climate change better than in existing courses?

Assessment instrument:

Concept map. [hypertext link]

Description: The concept map assessment comprised (1) a set of five key prompting questions about climate; (2) a reminder about the nature of hierarchical concept maps; and (3) a box with the word "climate" printed in it, with lots of white space below. Prompted by the key questions, students were asked to organize their preexisting knowledge about climate in a hierarchical concept map, starting from the labeled box. The key prompting questions were:

What is climate?
How is climate studied?
What factors determine climate?
How has climate differed in the past?
What can cause climate to change?

These questions were similar to those that we used as the basis for organizing the content and progression of our NOVA course, "Planetary Climate Change".
Assessment procedure: We provided each student with a copy of the assessment instrument, then spent five minutes or so explaining what a hierarchical concept map is, using an example (of the earth's water cycle) on an overhead transparency to illustrate the idea. The generic strategy for creating any hierarchical concept map is to (1) put the main topic in a box at the top of the map; (2) put related, more specific subordinate topics in boxes below, connected to the topmost box by lines; (3) label the lines with (mostly) verbs or prepositions to specify the nature of the relation between the connected topics; (4) iterate for increasingly more specific subordinate topics. Cross-connections between different branches of subordinate topics are possible, and we showed examples. (For a primer on concept maps and their use as a field-tested assessment technique, see the National Institute for Science Education's description of concept maps.)
We then went over the instructions provided on the assessment instrument and asked the students to construct a hierarchical concept map in which they organized their own knowledge about climate, prompted by five key questions about climate. After 15-30 minutes, we collected the students' concept maps. (Especially at the beginning of the semester, 15 minutes was plenty of time because most students knew almost nothing about the subject. At the end of the semester, if the students had learned much about the subject at all, 15-30 minutes seemed to be enough time for the students to show a noticeable improvement if any such improvement was forthcoming--the exception rather than the rule in the control groups!)
Scoring rubric: We deemed two components of the concept maps credit-worthy: (1) appropriate topics in boxes; and (2) logical, labeled lines of connection between topics. We constructed our own version of an acceptable concept map based on the five key prompting questions, identified a dozen important topics distributed among these questions, and assigned half a point to each topic for a subtotal of 6 points; and we assigned one point to each of up to four logically coherent, labeled lines of connection between topics (generally corresponding to any of the five key prompting questions about climate) for a subtotal of 4 points. The total possible was 10 points.
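The point arithmetic of this rubric can be sketched as below. The sketch assumes half a point per credited topic and one point per credited line of connection, the weights that reproduce the stated subtotals of 6 and 4 points; both weights are left as parameters. The topic names are placeholders, not the actual dozen topics, and deciding which boxes and lines deserve credit remains a human judgment.

```python
# Placeholder names standing in for the dozen credit-worthy topics.
KEY_TOPICS = {f"topic_{i}" for i in range(1, 13)}


def score_concept_map(topics_found, coherent_lines, key_topics=KEY_TOPICS,
                      topic_points=0.5, line_points=1.0, max_lines=4):
    """Return the 0-10 score for one concept map.

    topics_found: topics the scorer judged to be present on the map.
    coherent_lines: number of logically coherent, labeled lines of
        connection the scorer identified.
    """
    # Each key topic earns credit once, however often it appears.
    credited_topics = len(set(topics_found) & set(key_topics))
    # Credit is capped at max_lines lines of connection.
    credited_lines = min(coherent_lines, max_lines)
    return credited_topics * topic_points + credited_lines * line_points


print(score_concept_map(["topic_1", "topic_4", "topic_9"], coherent_lines=1))
print(score_concept_map(KEY_TOPICS, coherent_lines=6))
```

A map showing all twelve topics and at least four coherent lines of connection earns the full 10 points under these assumptions.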
A crude outline of the lines of connections associated with the key prompting questions (italics) and topics (bold-face) that we deemed important (and hence credit-worthy) looked something like this:
To increase scoring consistency, two of us--a meteorologist (Dempsey) with no previous experience scoring concept maps and some limited experience using concept maps as an instructional tool, and a secondary-science-education faculty member (O'Sullivan) with extensive experience scoring concept maps and using them for instruction but with limited knowledge about climate--scored student concept maps independently. In occasional cases where our scores differed markedly we consulted and rescored, but in general we simply averaged our two scores and performed our statistical analysis on the averaged scores.
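The averaging step can be sketched as follows. The threshold for a "marked" difference is a hypothetical value chosen for the example, since we did not fix a numerical cutoff, and the function and variable names are likewise illustrative.

```python
def combine_rater_scores(scores_a, scores_b, tolerance=2.0):
    """Average two raters' scores and list the maps that should be rescored.

    scores_a, scores_b: dicts mapping a student key to each rater's score.
    tolerance: largest difference accepted without consultation
        (hypothetical value; no numerical cutoff is stated in the text).
    """
    averaged = {}
    needs_rescoring = []
    for student in sorted(scores_a.keys() & scores_b.keys()):
        a, b = scores_a[student], scores_b[student]
        if abs(a - b) > tolerance:
            # Scores differ markedly: consult and rescore before averaging.
            needs_rescoring.append(student)
        else:
            averaged[student] = (a + b) / 2
    return averaged, needs_rescoring


# Made-up scores for three students.
averaged, flagged = combine_rater_scores(
    {"6789": 4.0, "1234": 7.5, "5555": 2.0},
    {"6789": 5.0, "1234": 3.0, "5555": 2.0},
)
print(averaged, flagged)
```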

Results:

Hypothesis Tested	Type of Test	Score	Result of Test (95% significance level)
(a) Mean scores are the same among all six classes	F-test	pre-test only	Reject
(a) Mean scores are the same among all six classes	F-test	post/pre test difference	Reject
(b) Mean scores are the same among all classes, with GM310 classes lumped together	F-test	pre-test only	Accept
(b) Mean scores are the same among all classes, with GM310 classes lumped together	F-test	post/pre test difference	Reject
(c) Mean scores are the same among the four GE classes	F-test	pre-test only	Accept
(c) Mean scores are the same among the four GE classes	F-test	post/pre test difference	Accept
(d) Mean scores for the two GM310 classes are the same	t-test	pre-test only	Reject
(d) Mean scores for the two GM310 classes are the same	t-test	post/pre test difference	Accept
(e) Mean scores for GM310 lumped together and for all four GE classes lumped together are the same	t-test	pre-test only	Accept
(e) Mean scores for GM310 lumped together and for all four GE classes lumped together are the same	t-test	post/pre test difference	Reject (GM310 scores higher)
(f) Mean score for all classes lumped together is zero.	t-test	post/pre test difference	Reject (difference positive)
(g) Mean score for the four GE classes lumped together is zero.	t-test	post/pre test difference	Accept
(h) Mean score for the two GM310 classes lumped together is zero.	t-test	post/pre test difference	Reject (difference positive)
(i) Mean score for each course separately is zero.	t-test	post/pre test difference	GM310.00: Reject (difference positive); GM310.01: Reject (difference positive); All others: Accept

Interpretation: [Not done yet]

References

Fraser, B.J., 1981, TOSRA: Test of Science-Related Attitudes. Australian Council for Educational Research, Hawthorn, VIC.
