
Evaluate 1 - Summative Assessments

Artifact: Showcase an assessment created and include what method was used to assess the validity, reliability, and security.

Answer: What process was used to determine the validity, reliability, and security of the assessment? What was the result?

In this assignment I evaluated a district-wide benchmark that covers cell replication and genetics. Although I do not have access to all district-wide data, I did evaluate the data for all of my classes. A screenshot of part of the benchmark is included below. Because the questions are the property of USATestprep and because this is a secure district assessment, I cannot post the full assessment or a link to the document here.



Validity:
The benchmark was created in our online learning system and item bank, USATestprep. To ensure the validity of the test, I reviewed the content weights for the Georgia Milestones Biology EOC (pg 9 in this PDF) to determine which domains and strands (standards and sub-standards) carry the most weight. When selecting items for the test, I reviewed the available data for each question (domain, standard, and DOK level) to ensure that the questions were representative of the weights on the state test and aligned with the high-priority learning objectives. I also tried to ensure that the questions required application of content, not just recall of information.
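To give a sense of that process, here is a rough Python sketch of the blueprint check described above. The domain weights and item tags are placeholders for illustration, not the actual Georgia Milestones values or my actual item list:

```python
# Blueprint check sketch: compare the share of selected items per domain
# against the target content weights. All values below are placeholders.
target_weights = {"Cells": 0.40, "Genetics": 0.35, "Ecology": 0.25}

# Domain tag for each item selected for the benchmark (hypothetical)
selected_items = ["Cells", "Cells", "Genetics", "Cells", "Genetics",
                  "Ecology", "Cells", "Genetics", "Ecology", "Cells"]

for domain, target in target_weights.items():
    actual = selected_items.count(domain) / len(selected_items)
    print(f"{domain}: target {target:.0%}, selected {actual:.0%}")
```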

Reliability:
Our district has only recently implemented benchmarking in science, and many of the teachers involved have questioned the reliability of the tests. In light of those discussions, I decided to assess the reliability of the benchmark using the Kuder-Richardson Formula 20, or KR(20), test. Using a mathematical formula, this test considers the total number of test items, the proportion of test-takers who pass each item, the proportion of test-takers who fail each item, and the variance of the total test scores. While I don't have access to all the district data, I did conduct the KR(20) test using the data for all four of my classes. The calculation produced a KR(20) score of +0.70.
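For anyone curious about the underlying math, the formula is KR(20) = (k / (k - 1)) × (1 - Σpq / σ²), where k is the number of items, p is the proportion of students answering an item correctly, q = 1 - p, and σ² is the variance of the total scores. Below is a minimal Python sketch of the calculation; the response matrix is hypothetical, standing in for my actual class data:

```python
import statistics

def kr20(item_scores):
    """item_scores: one row per student; each row is a list of 0/1 item scores."""
    k = len(item_scores[0])                     # number of test items
    totals = [sum(row) for row in item_scores]  # each student's total score
    variance = statistics.pvariance(totals)     # variance of total scores
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / len(item_scores)  # proportion correct
        pq_sum += p * (1 - p)                                      # p * q for this item
    return (k / (k - 1)) * (1 - pq_sum / variance)

# Hypothetical responses: 6 students, 5 items (1 = correct, 0 = incorrect)
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]
print(round(kr20(scores), 2))  # roughly 0.55 for this made-up data
```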

According to an article on edassess.net, “The closer the KR(20) is to +1.0 the more reliable an exam is considered because its questions do a good job consistently discriminating among higher and lower performing students… The interpretation of the KR(20) depends on the purpose of the test. Most high stakes exams are intended to distinguish those students who have mastered the material from those who have not. For these, shoot for a KR(20) of +0.50 or higher. A KR(20) of less than +0.30 is considered poor no matter the sample size.” Statisticshowto.com also states in this article that a score above +0.5 is usually considered reasonable.

I was fairly pleased with my KR(20) findings, especially given that I was the one who curated the questions for this particular benchmark!

Security:
The test was administered online through USATestprep. To access the test, students are required to log into their USATestprep accounts with their usernames and passwords. On this site, the instructor has the ability to assign the test to specific groups (or individual students) with a specific start and end date. Instructors can also lock and unlock the test at any time. There are also features to restrict the number of attempts, shuffle question order, and prevent students from viewing results until a later time.

The test remained locked until all students were logged in to a device management system called Net Support. The system allows teachers to view the content of students' screens, lock or unlock student devices, and restrict their browsers' access to specific websites if desired. During the test, students were not allowed to leave the USATestprep website. I monitored the students' screens from my desktop during the test to ensure appropriate progress and prevent cheating. Once all students within a class had completed the test, it was locked.

Results:
The results of the test were good overall: my students scored 8 points higher than the school average. When I broke the data down by class period, I found that my regular ed and inclusion classes were consistent with the school average, and my honors classes scored as much as 20-22 points higher than the school average (something I expected). Item analysis revealed that the wording and format of two questions may have been problematic for students: only 37% of students marked the correct answer for number 9, and only 54% marked the correct answer for number 3. Because students performed well on other questions assessing the same standard, I do not believe this was an issue of content mastery; I believe it was an issue of question format.
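For context, the item analysis itself is straightforward. Here is a short Python sketch of the percent-correct check described above; the response matrix is made up for illustration rather than taken from my actual classes:

```python
# Item analysis sketch: percent correct per question, flagging weak items.
# The response matrix is hypothetical (1 = correct, 0 = incorrect).
responses = [
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
]

n_students = len(responses)
for item in range(len(responses[0])):
    pct = 100 * sum(row[item] for row in responses) / n_students
    flag = "  <-- review wording/format" if pct < 60 else ""
    print(f"Question {item + 1}: {pct:.0f}% correct{flag}")
```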

If I’m being totally honest, before “Assessment Strategies” and “Assessment Uses” became part of our teacher evaluation on TKES, the most I had probably ever done in terms of analyzing assessments was a simple item analysis. But over the past few years, I believe that this practice of comparing results across classes and analyzing by standard has helped me to grow as a teacher. It helps me ensure that my assessments are reliable, valid, and secure, and I believe it has benefited my students tremendously.
