Published: Dec. 19, 2018

English language educators create and adapt tests continuously throughout their careers, but often assessment can be an afterthought in course design. Bachman and Palmer write that assessment developers must be held accountable for the tests they create, because those tests affect stakeholders in significant ways (2010). For example, tests can affect whether or not students enter university or pass a class. They also determine pass/fail rates in classrooms, which can influence educator pay or promotion. For these reasons, it is important that educators understand best practices in test design, especially aligning classroom exams with student learning outcomes (i.e., the overarching goals of a course) and objectives (steps to achieve the learning goals), vetting assessments, creating strong test questions and evaluating performance results.

The unity between an exam and the goals and objectives of a course is the foundation for a valid test, and paying attention to the interplay between these factors is important for all educators. Content validity means that “the test assesses course content and outcomes using formats familiar to the students” (Coombe, Folse & Hubley, 2007). Although this seems logical, the learning goals in a textbook and in a course may not completely overlap, so educators are responsible for adjusting their assessments to make up for this disconnect. Also, educators may use textbook-provided tests without critical analysis, or may create assessments from scratch without truly analyzing the learning goals of a course. If this is the case, educators may be using tests that are not valid and therefore not appropriate for their students.

Furthermore, students need to clearly understand how tests support learning goals and objectives. Hughes (2003) writes that objectives-based exams clearly show the degree to which students have reached learning goals, and that they promote strong teaching practices and course design, as educators and students are keenly aware of the course goals outlined in syllabi. This is why educators should focus on assessing goals and objectives within their exams.

Aligning Objectives and Test Specifications

The first step in creating or adapting exams is to perform a thorough evaluation of learning outcomes and objectives, often referred to as test specifications. Evaluating test specifications will help educators develop what Fulcher and Davidson call a “Blueprint” for the exam (2007). This “critical review” ensures test validity and reliability [or consistent results across students and classes] because it promotes a critical dialogue between the reviewer and the material. When educators skip this important step, they may test students on content areas that are not included in learning outcomes and objectives.

Once a review of test specifications is complete, educators can begin item (question) selection and test design with their objectives in mind. Although many educators have access to tests created either by textbook publishers or by colleagues, it is crucial that these exams be thoroughly reviewed. One cannot assume that externally provided exams include well-written items (even exams published with textbooks), or that exams written by educators connect to learning outcomes or objectives. Whether analyzing or creating test questions, it is important to read and apply best practices in question design. Carnegie Mellon University provides a great resource that offers information for aligning test questions with objectives. They argue that test objectives should be clear and obvious to the student, and that specific test items should unmistakably reflect objectives.
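The review described above can be sketched as a simple cross-check between a course's objectives and the items on a draft exam. This is a minimal illustration only: the objective labels (O1 to O4), the question labels, and the review_blueprint helper are all hypothetical and do not come from any of the sources cited here.

```python
# Hypothetical course objectives and draft exam items, invented for illustration.
course_objectives = {"O1", "O2", "O3"}

# Each draft item is tagged with the objective it is meant to assess.
item_objectives = {
    "Q1": "O1",
    "Q2": "O1",
    "Q3": "O2",
    "Q4": "O4",  # tagged with something that is not a course objective
}


def review_blueprint(objectives, items):
    """Return (items outside the objectives, objectives that no item tests)."""
    off_target = sorted(q for q, obj in items.items() if obj not in objectives)
    untested = sorted(objectives - set(items.values()))
    return off_target, untested


off_target, untested = review_blueprint(course_objectives, item_objectives)
print("Items outside the objectives:", off_target)
print("Objectives with no items:", untested)
```

In this made-up example, Q4 would need to be revised or dropped, and objective O3 would need at least one item written for it.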

Learning outcomes should also determine the balance of question types. Coombe, Folse and Hubley write that “The type of response can impact a student’s ability to demonstrate what she or he actually knows or can do” (p. 17). They classify response items into two categories: selection and supply. They point out that students who are able to select an answer (e.g., in a multiple choice question) may not be able to supply the same answer (e.g., in an essay question). For example, lower-level language learners may be able to choose the correct vocabulary word for a cloze exercise, but may not be able to use that same word in a sentence. So, it is important to choose the right types of questions for the students.

Coombe, Folse and Hubley also distinguish between objective and subjective test questions. Objective test questions (e.g., multiple choice or true/false) are difficult to write, but most of the work is done prior to test delivery, and reliability is enhanced by standardized answer keys. Subjective test questions (e.g., essay or short answer) require students to produce a longer response and are easier to create, but are less reliable and demand more time and attention when scoring. In general, it is best to balance the test question types, ensuring that exams include both selection and supply items, and both objective and subjective questions. This is the first point of focus when either analyzing or creating an exam; the next is the exam items themselves.

Objective Test Items

There are many types of objective test questions, such as multiple choice (MCQs), true/false (T/F), cloze and matching. In general, when writing these questions, Coombe, Folse and Hubley recommend that educators do the following:

  • Write questions at a lower language level than the content students respond to (e.g., the text or audio)
  • Paraphrase questions to reduce skimming for answers
  • List questions in the same order as they are in the text, as not doing so significantly increases difficulty
  • Test one skill at a time (e.g., inference vs. main idea)
  • Keep directions short and clear
  • Start exams with easier questions to reduce anxiety up front
  • Include a balance of difficult and easy questions 
  • Include a balance of question types

One addition to this list is to assign point values that differ by difficulty. For instance, vocabulary cloze questions may be worth one point for lower-level students, while vocabulary production questions may be worth two points for the same group.
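As a rough sketch of how difficulty-weighted point values add up, the short Python example below scores a set of responses using the one-point and two-point values mentioned above. The response data and the weighted_score helper are invented for illustration.

```python
# Point values from the example above: cloze = 1 point, production = 2 points.
POINTS = {"cloze": 1, "production": 2}


def weighted_score(responses):
    """responses: list of (question_type, is_correct) pairs.

    Returns (points earned, points possible).
    """
    earned = sum(POINTS[qtype] for qtype, correct in responses if correct)
    possible = sum(POINTS[qtype] for qtype, _ in responses)
    return earned, possible


# Hypothetical student: 2 of 3 cloze items and 1 of 2 production items correct.
responses = [
    ("cloze", True), ("cloze", True), ("cloze", False),
    ("production", True), ("production", False),
]
earned, possible = weighted_score(responses)
print(f"{earned}/{possible} points ({earned / possible:.0%})")
```

Here the student answers three of five questions correctly (60%) but earns 4 of 7 points (about 57%), because the missed production item carries more weight than the missed cloze item.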

When creating or adapting exams that include a variety of question types, educators should especially review tips or best practices on writing items they are not familiar with. For example, if educators don’t have much experience with MCQs, reading about best practices for creating this type of question is key. For a comprehensive overview of the many question types, Kansas State provides a very useful guide for professors and educators. A brief overview of many question types, including links to resources, is provided in the appendices so educators can access them on a need-to-know basis (see Appendices A-D).

Subjective Test Items

Since subjective test items usually demand longer responses from students, this testing format consequently requires more attention and focus from educators when grading. However, these items provide educators with opportunities to assess higher-order thinking skills like critical thinking, reflection, justification, and interpretation (Coker, Kolstad & Sosa, 1988). Although assessing these skills is important, the authors point out that subjective questions also have diminished reliability and equitability. Therefore, it is important to consider these factors when scoring and when choosing test items (see Appendices E & F).

Piloting, Editing, and Reviewing Test Items

Once an educator finishes writing or editing an exam, it is important to remember that it is just a draft. Fulcher and Davidson (2007) point out that the exam development process is not linear, but is “iterative and dynamic.” In other words, educators should pilot, edit and revise their exams until they are well developed. The best option for piloting is to offer the exam to students before use, but that is often difficult to accomplish. One alternative is for educators to take their own exam, and to ask a colleague to take it as well. This will not model the student experience or output, but could clear up any confusing language on the test or answers on the key. If a colleague working in the same context (e.g., an Intensive English Program) has difficulty understanding or answering a question, then it needs editing. Taking an exam also helps educators better understand whether or not the questions truly support learning outcomes and objectives. Most importantly, taking an exam will improve its quality, as typos and formatting issues can be fixed prior to administration.

Scoring Tests, Revising Test Items, and Evaluating Test Difficulty

When creating an answer key for an exam, it is essential to be as clear and explicit as possible for the grader. If there are multiple points available for a question, describe how students can earn each of those points. Hughes writes that educators should anticipate as many different answers as possible, especially for items that are worth multiple points.

If educators are using a rubric to evaluate writing or speaking, make sure that it too aligns with learning outcomes and objectives. The rubric should not assess a skill or knowledge that has not been learned in the course, unless it was taught in a previous course or unless students were expected to have mastered that skill prior to instruction. For more information on creating strong rubrics, Yale University has an excellent guide with examples of holistic and analytic rubrics for writing.

After students have taken an exam and scores have been collected, it is also important to analyze results. Coombe, Folse and Hubley point out that analyzing and interpreting student exam results is an “ethical responsibility” for educators, as assessments impact students significantly with regard to passing a course, completing a language program, or fulfilling education requirements. After an exam has been scored, look for obvious trends in the data. An Excel spreadsheet or a Word document can be useful for tracking them. Educators don’t often have a large enough sample size to throw out test questions based on student scores (unless they administer an exam to many different classes, sections, or the entire student body), but they can look for major trends. For example, if all or most students incorrectly answer a question, evaluating whether or not the question connected with objectives, was worded clearly, or was correct on the key will provide essential information for test revision. Also, it may be important to think about whether or not that skill was taught thoroughly or clearly prior to the exam. Datnow and Park (2014) write “The thoughtful use of data for instructional decision making cannot be divorced from reflection about one’s beliefs, assumptions, and practices around how students learn” (p. 3). Although it is difficult at times, modifying teaching in reaction to test results can be crucial for students to achieve learning goals.
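For educators who prefer a short script to a spreadsheet, the trend check described above might look like the sketch below. The student results, the helper names, and the 30% cutoff for "most students answered incorrectly" are all assumptions made for illustration; the cutoff in particular is not taken from the sources cited here and should be set by the educator.

```python
# Hypothetical results: one row per student, 1 = correct, 0 = incorrect.
results = {
    "Student A": {"Q1": 1, "Q2": 0, "Q3": 1},
    "Student B": {"Q1": 1, "Q2": 0, "Q3": 0},
    "Student C": {"Q1": 1, "Q2": 0, "Q3": 1},
    "Student D": {"Q1": 0, "Q2": 1, "Q3": 1},
}


def proportion_correct(results):
    """Return the share of students who answered each question correctly."""
    questions = next(iter(results.values())).keys()
    return {
        q: sum(row[q] for row in results.values()) / len(results)
        for q in questions
    }


def flag_questions(proportions, cutoff=0.3):
    """Return questions that at most `cutoff` of students got right.

    The 0.3 default is an arbitrary illustration, not a published standard.
    """
    return sorted(q for q, p in proportions.items() if p <= cutoff)


proportions = proportion_correct(results)
flagged = flag_questions(proportions)
print(proportions)
print("Questions to review:", flagged)
```

A flagged question is only a prompt for review: the educator still needs to check its link to the objectives, its wording, the answer key, and how thoroughly the skill was taught.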

Finally, it is vital to evaluate test difficulty. To do this, educators need to do some simple calculations; usually the mean, the median, the mode and the range will provide enough information to make decisions. Calculators.org has a well-designed tool that can calculate these numbers. The mean is the average score, the median is the middle score among all the students’ grades in the data set (which reduces the effect of outliers), the mode is the most common score, and the range is the spread between the highest and lowest scores. If the scores range, for example, from 80% to 100% correct, the test may be too easy. One may argue that on a criterion-referenced test (a test based on learning goals and outcomes, as opposed to a norm-referenced standardized test), the goal is to measure achievement, so a narrow range is good. This may be true for strong classes where most students can demonstrate the learning outcomes. However, the reality is that most classes contain a mixture of levels and skills, so tests must discriminate between these students. The same goes for an analysis of the mean. If the mean is low (e.g., 60-70%), then the test was probably too difficult, but if it is high (e.g., 80-100%), then the test was probably too easy. Unless the goal is for all students to pass with high marks (because the assessment is testing essential information to move forward in a course, like calculations for building bridges), a mean score of 70-79% suggests that the test is a reliable measure of skills (Coombe, Folse & Hubley).
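The four statistics discussed above can also be computed with Python's standard statistics module. The scores below are hypothetical percent scores for a class of nine students, invented for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical percent scores for one class.
scores = [62, 68, 71, 74, 74, 78, 81, 85, 93]

summary = {
    "mean": round(mean(scores), 1),      # average score
    "median": median(scores),            # middle score
    "mode": multimode(scores),           # most common score(s), as a list
    "lowest": min(scores),
    "highest": max(scores),
    "range": max(scores) - min(scores),  # spread between highest and lowest
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For these made-up scores the mean is 76.2, the median and mode are both 74, and the scores run from 62 to 93, a spread of 31 points. The multimode function is used because a set of scores can have more than one most common value.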

The most important consideration when evaluating test difficulty is analyzing the course objectives, the student needs, and the goals of the learning institution. The difficulty of the learning outcomes should be reflected by the difficulty of the test. For example, if students need to demonstrate that they can comprehend an article written at the C1 level of the CEFR scale, then the test should conform to that standard. If the shared goal of the students is to perform well in a top-tier university’s engineering department, then tests should prepare them for the rigor of future assessments. Finally, if there are issues with students passing levels before they are ready or with grade inflation at a learning institution, then creating difficult tests that truly discriminate between high- and low-performing students should be the goal. In the end, it is up to educators and their institutions to decide the best course of action.

Conclusion

Designing valid, reliable, and appropriately difficult tests can be challenging, but sharpening this skill should be a fundamental part of professional development for all educators. As this article outlined, the learning goals and objectives for a course should offer a blueprint for test design, and all decisions made concerning test creation or analysis should stem from them. Also, reviewing best practices for item design and critically analyzing existing tests are important steps in the test creation process. Finally, educators should pilot, edit and revise tests based on pre-test feedback and post-test results. Taking these steps when creating tests will help ensure that they accurately evaluate student performance and that educators develop important professional skills that they will use throughout their careers.

 

References

Bachman, L. & Palmer, A. (2010). Language Assessment in Practice. Oxford: Oxford University Press.

Brown, Race & Smith. (1996). 500 Tips on Assessment. London: Kogan Page.

Coker, D., Kolstad, R., & Sosa, A. (1988). Improving Essay Tests: Structuring the Items and Scoring Responses. The Clearing House, 61(6), 253-255. Retrieved from http://www.jstor.org.colorado.idm.oclc.org/stable/30188332

Coombe, C., Folse, K., & Hubley, N. (2007). A Practical Guide to Assessing English Language Learners. Ann Arbor: The University of Michigan Press.

Datnow, A., & Park, V. (2014). Data-driven leadership. Retrieved from https://ebookcentral.proquest.com

Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment: An advanced resource book. C. N. Candlin & R. Carter (Eds.). New York: Routledge.

Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge: Cambridge University Press.

Appendix A

Multiple Choice Questions

For MCQs, there are many credible websites that provide useful tips. For a thorough explanation of MCQs, Vanderbilt University offers comprehensive guidelines. For a shorter synopsis, see The University of Texas at Austin. The existence of these university websites suggests that MCQs are frequently used in university classes, and that even professors need support when creating them.

A few important factors to pay attention to when writing MCQs, as Brown, Race and Smith (1996) highlight, are to make sure that they have a clear stem [question or statement], that they have distractors [incorrect answers] that are actually in the text or audio, and that the answer key is correct. Coombe, Folse and Hubley add to this list by stating that all MCQs should have the same number of answer options, and that the options should be similar in length. Paying attention to these areas, and to the ones described on the websites above, will support the creation of reliable test questions.

Appendix B

True/False Questions

Another type of objective test question is True/False (T/F). Coombe, Folse and Hubley write that these types of questions are almost as popular on professional exams [including standardized tests like TOEFL, IELTS and SAT] as MCQs. The benefit of using T/F questions is that they are easy to grade, easy to incorporate into an exam and reliable—if there are at least 7-10 on the exam. The authors also highlight their drawbacks, which include a “50% guessing factor.” They suggest that test creators add an additional option like “not given” or “not enough information” to reduce that number to 33%. Also, educators can reduce the guessing factor by asking students to correct false answers. For more information on writing T/F questions, check out information from the University of Waterloo.
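The arithmetic behind the guessing factor can be sketched in a few lines of Python. This assumes a student who guesses blindly, with every option equally likely; the function names and the ten-question example are hypothetical.

```python
from fractions import Fraction


def guessing_factor(num_options):
    """Chance of guessing one item correctly when all options are equally likely."""
    return Fraction(1, num_options)


def expected_guess_score(num_items, num_options):
    """Average number of items a student would get right by guessing alone."""
    return num_items * guessing_factor(num_options)


two_options = guessing_factor(2)    # True / False
three_options = guessing_factor(3)  # True / False / Not given

print(f"T/F: {float(two_options):.0%} per item")
print(f"T/F/Not given: {float(three_options):.0%} per item")
print("Expected correct on 10 T/F items:", float(expected_guess_score(10, 2)))
```

Under this assumption, adding a third option lowers the chance of a lucky guess on each item from 50% to about 33%, and the expected number of correct guesses on ten items from 5 to about 3.3.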

Appendix C

Cloze or Gap-fill Items

When writing cloze or gap-fill items, Hughes points out that putting questions into context can help examinees make stronger choices. For example, when testing vocabulary, writing a story or paragraph rather than disconnected sentences provides an element of authenticity. Also, making sure that no more than one option fits into each blank is very important for these questions, especially when testing vocabulary and grammar. Finally, directions must be very specific for cloze questions. For example, will students lose points for an incorrect part of speech, word form or other errors that stray from the original task? Educators should use learning outcomes and objectives to support these decisions. If the objectives state that students should demonstrate understanding of new vocabulary and different word forms, then both skills should be counted and these questions should be weighted higher than those testing only one skill.

Appendix D

Matching Questions

Matching questions can be useful when testing students on ordering information, selecting vocabulary definitions, or classifying material. Coombe, Folse and Hubley write that educators should provide more answers than premises [questions or statements], should number the premises and letter the options, and should ask students to write the correct letter in a blank rather than draw lines, which can be confusing for graders. Finally, they suggest that all items in a matching activity be thematically related, which adds coherence. Many of the resources listed above offer tips for writing matching items, but two very useful ones come from the University of Waterloo and Kansas State.

Appendix E

Short Answer Questions

Short answer items are often used on tests to evaluate specific information either read in a text or heard via an audio or video clip. Coombe, Folse and Hubley state that these questions can be useful when testing productive skills like describing the main idea of a text, demonstrating understanding of certain sections, or responding with opinions in a short format. However, they can be difficult to score, especially when grammatical or lexical skills interfere with communication. Low intra-rater reliability (inconsistent scoring by a single grader) can also be a problem. For these reasons, it makes sense to use short answer questions on exams for high-intermediate to advanced students. To reduce grader subjectivity, educators should mark exams without looking at the student name, and/or have a colleague teaching a similar level check 2-3 of their exams, to make sure grades are not inflated or deflated.

Appendix F

Essay Questions

Essay test items are similar to short answer questions, but are longer and more complex. They are seemingly easy to write, but educators must ensure that the questions clearly address the learning outcomes and objectives for the course. It is also important that there is a clear grading scheme—for example point values and criteria for evaluation (usually outlined in a rubric). Coker, Kolstad and Sosa offer more recommendations for writing strong essay items:

  • Thoughtfully consider the objective or outcome and make sure the questions align with it.
  • Only use subjective formats when objective formats will not fully capture the complexity of the skill.
  • Make sure the question is specific enough to elicit the expected answer.
  • Avoid offering students a choice of many questions/prompts, as they will choose the easiest one.
  • Make sure enough time is allotted for planning, writing, and editing.
  • Create or outline an answer to the question before grading.
  • Communicate whether punctuation, grammar, vocabulary or penmanship will be evaluated.
  • Share the evaluation device (e.g., a rubric) with students prior to the exam.