Thursday, October 3, 2019

The Past, Present, and Future of Automated Scoring

“No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be…” – Isaac Asimov (5)

Introduction

Although some realities of the classroom remain constant – they wouldn’t exist without the presence, whether actual or virtual, of students and teachers – the technology age is changing not only the way that we teach, but also how students learn. While the implications of this affect all disciplines, the change is acutely evident in the teaching of writing. In the last twenty years, we have seen a rapid change in how we read, write, and process text. Compositionist Carl Whithaus maintains that “… writing is becoming an increasingly multimodal and multimedia activity” (xxvi). It is no surprise, then, that there are currently 100 million blogs in existence worldwide and 171 billion email messages sent daily (Olson 23), and the trend toward digitally-based writing is also moving into the classroom. The typical student today writes “almost exclusively on a computer, typically one equipped with automated tools to help them spell, check grammar, and even choose the right words” (Cavanagh 10). Furthermore, CCC notes that “[i]ncreasingly, classes and programs in writing require that students compose digitally” (785). Given the effect of technology on writing and the current culture of high-stakes testing ushered in by the mandates of the No Child Left Behind Act of 2001, a seemingly natural product of the combination of the two is computer-based assessment of writing.
Although the idea is still in its infancy, technological change in combination with federal testing mandates has resulted in several states incorporating “computer-based testing into their writing assessments, … not only because of students’ widespread familiarity with computers, but also because of the demands of college and the workplace, where word-processing skills are a must” (Cavanagh 10). Although it makes sense to have students who are accustomed to composing on a computer write in the same mode for high-stakes tests, does it make sense to score their writing by computer as well? This is a controversial question that has both supporters and detractors. Supporters like Stan Jones, Indiana’s Commissioner of Higher Education, believe that computerized essay grading is inevitable (Hurwitz n.p.), while detractors, primarily pedagogues, assert that such assessment defies what we know about writing and its assessment, because “[r]egardless of the medium … all writing is social; accordingly, response to and evaluation of writing are human activities” (CCC 786). Even so, the reality is that the law requires testing nationwide, and in all probability that mandate is not going to change anytime soon. With NCLB up for revision this year, even politicians like Sen. Edward Kennedy of Massachusetts agree that standards are a good idea and that testing is one way to ensure that they are met. At some point, we need to pull away from all-or-none polarization and create a new paradigm. The sooner we realize that “… computer technology will subsume assessment technology in some way” (Penrod 157), the sooner we will be able to address how we, as teachers of writing, can use technology effectively for assessment. Brian Huot notes that, in the past, teachers’ responses have been reactionary, “cobbled together at the last minute in response to an outside call …” (150).
Teachers need to be proactive in addressing “… technological convergence in the composition classroom, [because if we don’t], others will impose certain technologies on our teaching” (Penrod 156). Instead of passively leaving the development of assessment software solely to programmers, teachers need to be actively involved with the process in order to ensure the application of sound pedagogy in its creation and application. This essay will argue that automated essay scoring (AES) is an inevitability that provides many more positive possibilities than negative ones. While the research presented here spans K-16 education, this essay will primarily address the application of AES in secondary environments, focusing on high school juniors, a group currently consisting of approximately 4 million students in the United States, because this group represents the targeted population for secondary school high-stakes testing in this country (U.S. Census Bureau). It will first present a brief history of AES, then explore the current state of AES, and finally consider the implications of AES for writing instruction and assessment in the future.

A Brief History of Computers and Assessment

Standardized objective testing in writing first occurred in 1916 at the University of Missouri as part of a Carnegie Foundation sponsored study (Savage 284). As the 20th century continued, these tests grew in popularity because of their efficiency and perceived reliability, and they are the cornerstone of what Kathleen Blake Yancey describes as the “first wave” of writing assessment (484). To articulate the progression of composition assessment, Yancey identifies three distinct, yet overlapping, waves (483).
The first wave, occurring approximately from 1950-1970, primarily focused on using objective (multiple choice) tests to assess writing simply because, as she quotes Michael Williams, they were the best response that could be “… tied to testing theory, to institutional need, to cost, and ultimately to efficiency” (Yancey 489). During Yancey’s first wave of composition assessment, another wave was forming in the parallel universe of computer software design, where developers began to address the possibilities of not only programming computers to mimic the process of human reading, but “… to emulate the value judgments that human readers make when they read student writing in the context of large scale assessment” (Herrington and Moran 482). Herrington and Moran identify The Analysis of Essays by Computer, a 1968 book by Ellis Page and Dieter Paulus, as one of the first composition studies books to address AES. Page and Paulus’s goal was to “evaluate student writing as reliably as human readers, … [and] they attempted to identify computer-measurable text features that would correlate with the kinds of intrinsic features … that are the basis for human judgments …, [settling on] thirty quantifiable features, … [which included] essay length in words, average word length, amount and kind of punctuation, number of common words, and number of spelling errors” (Herrington and Moran 482). In their study, they found a statistical correlation, .71, high enough to support the use of the computer to score student writing. Herrington and Moran note that the response of the composition community in 1968 to Page and Paulus’s book was one of indignation and uproar. In 2007, not much has changed in terms of the composition community’s position regarding computer-based assessment of student writing. To many, it is an unknown, mystifying Orwellian entity waiting in the shadows for the perfect moment to jump out and usurp teachers’ autonomy in the classroom.
Nancy Patterson describes computerized writing assessment as “a horror story that may come sooner than we realize” (56). Furthermore, P.L. Thomas offers the following question and response: “How can a computer determine accuracy, originality, valuable elaboration, empty language, language maturity, and a long list of similar qualities that are central to assessing writing? Computers can’t. We must ensure that the human element remains the dominant factor in the assessing of student writing” (29). Herrington and Moran make the issue a central one in the teaching of writing and have “… serious concerns about the potential effects of machine reading of student writing on our teaching, on our students’ learning, and therefore on the profession of English” (495). Finally, CCC definitively writes, “We oppose the use of machine-scored writing in the assessment of writing” (789). While the argument against AES is clear here, the responses appear to be based on a lack of understanding of the technology and an unwillingness to change. Instead of taking a reactionary position, it might be more constructive for teachers to assume the inevitability of computerized assessment technology – it is not going away – and to use that assumption as the basis for taking a proactive role in its implementation.

The Current Culture of High-Stakes Testing

At any given time in the United States, there are approximately 16 million 15-18 year-olds, the majority of whom receive a high school education (U.S. Census). Even when factoring in a maximum of 10 percent (1.6 million) who may drop out or otherwise not receive a diploma, there is a significant number of students, 14-15 million, who are attending high school. The majority of these students are members of the public school system and as such must be tested annually according to NCLB, though the most significant focus group for high-stakes testing is 11th grade students.
Currently in Michigan, 95% of any given public high school’s junior population must sit for the MME, Michigan Merit Exam, in order for the school to qualify for AYP, Adequate Yearly Progress[1]. Interestingly, those students do not all have to pass currently, though by 2014 the government mandates a 100% passing rate, a number that most admit is an impossibility and that will probably be addressed as the NCLB Act is up for review this year. In the past, as part of the previous 11th grade examination, the MEAP, Michigan Educational Assessment Program, required students to complete an essay response, which was assessed by a variety of people, mostly college students and retired teachers, for a minimal amount of money, usually in the $7.50 – $10.00 per hour range. As a side note, neighboring Ohio sends its writing test to North Carolina to be scored by workers receiving $9.50 per hour (Patterson 57), a wage that fast food employees make in some states. Because of this, it was consistently difficult for the state to assess these writings in a short period of time, causing huge delays in distributing the results of the exams back to the school districts. This posed a serious problem, because schools could not use the testing information to address educational shortfalls of their students or programs in a timely manner, which is one of the purposes of getting prompt feedback. This year (2007), as a result of increased graduation requirements and testing mandates driven by NCLB, the Michigan Department of Education began administering a new examination to 11th graders, the MME, an ACT-based assessment, since ACT was awarded the testing contract. The MME consists of several sections and requires most high schools to administer it over a period of 2-3 days. Day one consists of the ACT + Writing, a 3.5 hour test that includes an argumentative essay.
Days two/three (depending on district implementation) consist of the ACT WorkKeys, a basic work skills test of math and English, further mathematics testing (to address curricular content not covered by the ACT + Writing), and a social studies test, which incorporates another essay that the state combines with the argumentative essay in the ACT + Writing in order to determine an overall writing score. Miraculously, under the auspices of ACT, students received their ACT + Writing scores in the mail approximately three weeks after testing, unlike the MEAP, where some schools did not receive test scores for six months. In 2005, a MEAP official admitted that the cost of scoring the writing assessment was forcing the state to go another route (Patterson 57), and now it has. So how is this related to automated essay scoring? My hypothesis is that as states are required to test writing as part of NCLB, there will not be enough qualified people to read and assess student essays and determine results within a reasonable amount of time to purposefully inform necessary curricular and instructional change, which is supposed to be the point of testing in the first place. Four million plus essays to evaluate each year on a national level (sometimes more if more writing is required, as in Michigan, which requires two essays) is a huge number. Michigan Virtual University’s Jamey Fitzpatrick says, “Let’s face it. It’s a very labor-intensive task to sit down and read essays” (Stover n.p.). Furthermore, it only makes sense that instead of states working on their own test management, they will contract state-wide testing to larger testing agencies, as Michigan and Illinois have with ACT, to reduce costs and improve efficiency. Because of the move to contract ACT, my guess is that we are moving in the direction of having all of these writings scored by computer.
In email correspondence that I had in early 2007 with Harry Barfoot at Vantage Learning, a company that creates and markets AES software, he said, “Ed Roeber has been to visit us and he is the high stakes assessment guru in Michigan, and who was part of the MEAP 11th grade becoming an ACT test, which [Vantage] will end up being part of under the covers of ACT.” This indicates the inevitability of AES as part of high-stakes testing. In spite of the fact that there are no states that rely on computer assessment of writing yet, “… state education officials are looking at the potential of this technology to limit the need for costly human scorers – and reduce the time needed to grade tests and get them back in the hands of classroom teachers” (Stover n.p.). Because we live in an age where the budget axe frequently cuts funding to public education, it is in the interest of states to save money any way they can, and “[s]tates stand to save millions of dollars by adopting computerized writing assessment” (Patterson 56). Although AES is not a reality yet, every indication is that we are moving toward it as a solution to the cost and efficiency issues of standardized testing. Herrington and Moran observe that “[p]ressures for common assessments across state public K-12 systems and higher education – both for placement and for proficiency testing – make attractive a machine that promises to assess the writing of large numbers of students in a fast and reliable way” (481). To date, one of the two readers (the other is still human) for the GMAT is e-Rater, an AES software program, and some universities are using Vantage’s WritePlacerPlus software in order to place first year university students (Herrington and Moran 480). However, one of the largest obstacles in bringing AES to K-12 is one of access.
In order for students’ writing to be assessed electronically, it must be entered electronically, meaning that every student will have to compose their essays via computer. Sean Cavanagh’s article of two months ago maintains that ACT has already suggested delivering computers to districts that do not have sufficient technology in order to accommodate technology differences (10). As of last month, March 2007, Indiana is the only state that relies on computer scoring of 11th grade essays for the state-mandated English examination (Stover n.p.) for 80 percent of their 60,000 11th graders (Associated Press), though their Assistant Superintendent for Assessment, Research, and Information, West Bruce, says that the state’s computer software assigns a confidence rating to each essay, where low confidence essays are referred to a human scorer (Stover n.p.). In addition, in 2005 West Virginia began using an AES program to grade 44,000 middle and high school writing samples from the state’s writing assessment (Stover n.p.). At present, only ten percent of states “… currently incorporate computers into their writing assessments, and two more [are] piloting such exams” (Cavanagh 10). As technology becomes more accessible for all public education students, the possibilities for not only computer-based assessment but also AES become very real.

Automated Essay Scoring

Weighing the technological possibilities against logistical considerations, however, when might we expect to see full-scale implementation of AES? Semire Dikli, a Ph.D. candidate from Florida State University, writes that “… for practical reasons the transition of large-scale writing assessment from paper to computer delivery will be a gradual one” (2). Similarly, Russell and Haney “… suspect that it will be some years before schools generally … develop the capacity to administer wide-ranging assessments via computer” (16 of 20).
The natural extension of this, then, is that AES cannot happen on a large scale until we are able to provide conditions that allow each student to compose essays via computer with Internet access to upload files. At issue as well is the reliability of the company contracted to do the assessing. A March 24, 2007 Steven Carter article in The Oregonian reports that access issues resulted in the state of Oregon canceling its contract with Vantage and signing a long-term contract with American Institutes for Research, the long-standing company that does NAEP testing. Even though the state tests only reading, science, and math this way (not writing), it nevertheless indicates that reliable access is an ongoing issue that must be resolved. Presently, there are four commercially available AES systems: Project Essay Grade (Measurement, Inc.), Intelligent Essay Assessor (Pearson), Intellimetric (Vantage), and e-Rater (ETS) (Dikli 5). All of these incorporate the same process in the software, in which “[f]irst, the developers identify relevant text features that can be extracted by computer (e.g., the similarity of the words used in an essay to the words used in high-scoring essays, the average word length, the frequency of grammatical errors, the number of words in the response). Next, they create a program to extract those features. Third, they combine the extracted features to form a score. And finally, they evaluate the machine scores empirically” (Dikli 5). At issue with the programming, however, is that “[t]he weighting of text features derived by an automated scoring system may not be the same as the one that would result from the judgments of writing experts” (Dikli 6).
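To make the four steps Dikli describes more concrete, here is a minimal sketch in Python. It is an illustration only and does not reproduce how any commercial system works: the three features, the function names, and any weights passed in are hypothetical choices of mine, and a real system would use many more features and fit its weights statistically against human scores.

```python
import re
from statistics import mean


def extract_features(essay, reference_vocab):
    """Steps 1-2: pull a few computer-measurable features from the text.

    reference_vocab is a hypothetical set of words drawn from
    high-scoring essays, used here as a crude similarity measure.
    """
    words = re.findall(r"[A-Za-z']+", essay.lower())
    if not words:
        return {"word_count": 0, "avg_word_length": 0.0, "vocab_overlap": 0.0}
    overlap = sum(1 for w in words if w in reference_vocab) / len(words)
    return {
        "word_count": len(words),
        "avg_word_length": mean(len(w) for w in words),
        "vocab_overlap": overlap,
    }


def combine_features(features, weights, intercept=0.0):
    """Step 3: combine the features into one score as a weighted sum.

    The weights are assumptions supplied by the caller, not values
    taken from any real scoring engine.
    """
    return intercept + sum(weights[name] * value
                           for name, value in features.items())


def pearson(xs, ys):
    """Step 4: evaluate machine scores against human scores
    using the Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined when one side never varies
    return cov / (sx * sy)
```

In use, one would call `extract_features` on each essay, pass the result to `combine_features`, and then compare the list of machine scores with the human scores through `pearson`. The weighting question Dikli raises lives entirely in the `weights` argument: whoever chooses those numbers decides what the software rewards.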
There is still a significant difference between “statistically optimal approaches” to measurement and scientific or educational approaches to measurement, where the aspects of writing that students need to focus on to improve their scores “are not the ones that writing experts most value” (Dikli 6). This is the tension that Diane Penrod addresses in Composition in Convergence, mentioned earlier, in which she recommends that teachers and compositionists become proactive by getting involved in the creation of the software instead of leaving it exclusively to programmers. And this makes sense. Currently, there are 50-60 features of writing that can be extracted from text, but current programs only use about 8-12 of the most predictive features of writing to determine scores (Powers et al. 413). Moreover, Thomas writes that “[c]omposition experts must determine what students learn about writing; if that is left to the programmers and the testing experts, we have failed” (29). If compositionists and teachers can enmesh themselves in the creation of software, working with programmers, then the product would likely be one that is more palatable and better suited to what we know good writing is. While the aura of mystery behind the creation of AES software is of concern to educators, it could be easily addressed by education and involvement. CCC reasons that “… since we can not know the criteria by which the computer scores the writing, we can not know whether particular kinds of bias may have been built into the scoring” (489). It stands to reason, then, that if we take an active role in the development of the software, we will have more control over issues such as bias.
Another point of contention with moving toward computer-based writing and assessment is the concern that high-stakes testing will result in students having a narrow view of good writing, particularly those moving to the college level, where writing skill is expected to be more comprehensive than a prompt-based five-paragraph essay written in 30 minutes. Grand Valley State University’s Nancy Patterson opposes computer scoring of high-stakes testing, saying that no computer can evaluate subtle or creative styles of writing, nor can it judge the quality of an essay’s intellectual content (Stover n.p.). She also writes that “… standardized writing assessment is already having an adverse effect on the teaching of writing, luring many teachers into more formulaic approaches and an over-emphasis on surface features” (Patterson 57). Again, education is key here, specifically teacher education. Yes, we live in a culture of high-stakes testing, and students must be prepared to write successfully for this genre. But test-writing is just that, a genre, and should be taught as such – just not to the detriment of the rest of a writing program – something that the authors of Writing on Demand assert when they write: “We believe it is possible to integrate writing on demand into a plan for teaching based on best practices” (5). AES is not an attack on best practices, but a tool for cost-effective and efficient scoring. Even though Thomas warns against “the demands of standards and high stakes testing” becoming the entire writing program, we still must realize that computers for composition and assessment can have positive results, and “[m]any of the roadblocks to more effective writing instruction – the paper load, the time involved in writing instruction and assessment, the need to address surface features individually – can be lessened by using computer programs” (29).
In addition to pedagogical concerns, skeptics of AES are leery of the companies themselves, especially the aggressive marketing tactics that teachers perceive to be threats not only to their autonomy, but to their jobs. To begin, companies aggressively market because we live in a capitalist society and they are out to make money. But, to cite Penrod, “both computers and assessment are by-products of capitalist thinking applied to education, in that the two reflect speed and efficiency in textual production” (157). This is no different from the first standardized testing experiments by the Carnegie Foundation at the beginning of the 20th century, and it is definitely nothing new. Furthermore, Herrington and Moran admit that “computer power has increased exponentially, text- and content-analysis programs have become more plausible as replacements for human readers, and our administrators are now the targets of heavy marketing from companies that offer to read and evaluate student writing quickly and cheaply” (480). In addition, they see a threat in companies marketing programs that “define the task of reading, evaluating, and responding to student writing not as a complex, demanding, and rewarding aspect of our teaching, but as a ‘burden’ that should be lifted from our shoulders” (480). In response to their first concern, teachers becoming involved in the process of creating assessment software will help to define the task the computers perform. Also, teachers will always read, evaluate, and respond, but probably differently. Not all writing is for high-stakes testing. Secondly, and maybe I’m alone in this (but I think not), I’d love to have the tedious task of assessing student writing lifted from my plate, especially on sunny weekends when I’m stuck inside for most of the daylight hours assessing student work.
To be a dedicated writing teacher does not necessarily involve martyrdom, and if some of the tedious work is removed, it can give us more time to actually teach writing. Imagine that!

The Future of Automated Essay Scoring

On March 14th, 2007, an article in Education Week reported that beginning in 2011, the National Assessment of Educational Progress will begin testing the writing of 8th and 12th grade students by having the students compose on computers, a decision unanimously approved as part of its new writing assessment framework. This new assessment will require students to write two 30-minute essays and will evaluate students’ ability to write to persuade, to explain, and to convey experience, tasks typically deemed necessary both in school and in the workplace (Olson 23). Currently, NAEP testing is assessed by AIR (mentioned above), which will no doubt incorporate AES for assessing these writings. In response, Kathleen Blake Yancey, Florida State University professor and president-elect of NCTE, said the framework “[p]rovides for a more rhetorical view of writing, where purpose and audience are at the center of writing tasks,” while also requiring students to write at the keyboard, providing “a direct link to the kind of composing writers do in college and in the workplace, thus bringing assessment in line with lifelong composing practices” (Olson 23). We are on the cusp of a new era. With the excitement of new possibilities, though, we must remember, as P.L. Thomas reminds us, that while “technology can be a wonderful thing, it has never been and never will be a panacea” (29). At the same time, we must also discard our tendency to avoid change and embrace the overwhelming possibilities of incorporating computers and technology with writing instruction. Thomas also says that “[w]riting teachers need to see the inevitability of computer-assisted writing instruction and assessment as a great opportunity.
We should work to see that this influx of technology can help increase the time students spend actually composing in our classrooms and increase the amount of writing students produce” (29). Moreover, we must consider that the methods used to program AES software are not very different from the rubrics that classroom teachers use in holistic scoring, something Penrod identifies as having “numerous subsets and criteria that do indeed divide the students’ work into pieces” (93). I argue that our time is better spent working within the system to ensure that its inevitable changes reflect sound pedagogy, because the trend that we’re seeing is not substantially different from previous ones. The issue is in how we choose to address it. Instead of eschewing change, we should embrace it and make the most of its possibilities.
