IML White Papers
Automated Essay Evaluation
– Introducing Intergrader
Human grading of essays and open text responses [ORA] is time consuming and subject to inconsistencies and biases. While there is a significant body of research on automated grading of essays [AES], existing systems have been criticized for their lack of context awareness, their need for large volumes of training data, their data-driven biases and their black box nature. Further, in online courses such as MOOCs, the common practice is to use peer grading. Peer grading has serious drawbacks as a grading mechanism, not the least of which is a natural tendency toward grade inflation. There is a critical need for an AES which can evaluate the essay or ORA in its context, gather intelligence from a sparse data set of labeled essays, be transparent in its reasoning and be tractable so that it can be improved in an understandable manner. In addition, it must be possible to configure the AES rapidly to reflect course-specific requirements.
We present a novel AES using a computational linguistics pipeline that relies on the rhetorical structure of the essay, the essay prompt and related subject content. We do not rely on any statistical machine learning methods. Our engine overcomes many of the limitations of current solutions: it does not need a large amount of training data, it is completely context aware and its reasoning is traceable. Our method also enables easy configuration of a course-level or ORA question-level assessment rubric. We present results from two real world implementations to demonstrate the method’s efficacy. A larger benchmarking study using the ASAP competition data is underway.
Essays and open text responses are a common form of evaluation used in many settings to assess effective learning. Their use has generally declined because such responses are difficult to grade manually, and multiple choice assessment has become popular for this simple reason. Yet a holistic and complete understanding of any subject can only be assessed using essays and ORAs. It is well understood that human grading of essays is fraught with inconsistencies and other human limitations such as recency bias. Different graders are quite likely to grade the same response differently.
Automated essay evaluation systems have been introduced over the last couple of decades as a way of overcoming these challenges. Most if not all of the AES use word-based statistical models to assign grades to an essay. Word-based models are based on learning word patterns from essays such as co-occurrence, collocation, and so on. They do not attempt to process the text as it is written and consider the meaning of the sentences, paragraphs and the full essay directly. These word-based statistical models, including the more recent language models like BERT and XLNet, do not capture the real meaning of the essay and its relevance to the essay prompt and the related learning content. They also suffer from other serious limitations – the need for large data, the lack of traceability and the lack of tractability. Traceability refers to the ability to see the details of the model results in a meaningful way. Tractability here refers to the ease with which we can improve the predictive model in an assured way without just adding more data and hoping for the best as is the case with black box models. Besides, since they are entirely dependent on the data set that the models learn from, they have been criticized for not being able to recognize gibberish essays [Kolowich, 2014] and essays that reflect divergent thinking. They are also not feasible for course specific essay assignments since the rubric may vary from course to course and there will not be enough data for training a statistical model in such cases.
In this paper, we present Intergrader, an AES based on a proprietary computational linguistics pipeline which does not rely on a statistical model. Intergrader relies on a comprehensive analysis of the syntactical, linguistic and rhetorical properties of the essay. Intergrader uses an extensible 3-dimensional assessment framework comprising syntactical, linguistic and rhetorical assessments, with each dimension including a number of metrics. The dimensions go beyond the syntactic and linguistic properties considered in the literature so far to include a complete rhetorical decomposition of the essay to assess its completeness and relevance with respect to the essay prompt and the learning content [e.g., course material].
2. Prior Work
While there has been considerable research related to AES, almost all work has focused on linguistic features of the essay outside the content of the essay. Page first introduced the idea of automated essay grading [Page, 1966] and later developed Project Essay Grade [Page, 1968], which used multiple linear regression to evaluate essays. Since then, numerous AES have been developed using a variety of statistical approaches. For a summary of these studies, see Janda et al. [Janda, 2019]. All of them have focused on generating a predictive model based on a large corpus of essays. Model features used in these studies can be classified into two groups: syntactical and semantic similarity. Syntactical features draw on the syntactical organization of the text. For example, parts of speech and phrases such as noun or verb phrases are considered syntactic in nature. Semantic similarity features, on the other hand, measure the semantic similarity between the essay and the essay prompt, based on the belief that such similarity can reflect the relevance of the essay to the essay prompt [Zupanc, 2017].
Syntactic features widely used in the literature include the following:
Fig. 1: Illustrative Syntactic Features
Semantic Similarity. The extent to which Natural Language Processing techniques have been used in AES appears to be limited to determining semantic similarity using a variety of techniques, both supervised and unsupervised. Most approaches here have relied on word-level models like co-occurrence, word vectors using neural networks, and word embeddings [Janda, 2019].
Usually, the unsupervised approaches assume that lexical cohesion, i.e., repetition of words and phrases in an essay, is indicative of a good essay. Various statistical methods such as Latent Semantic Analysis, probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation have been used to measure the similarity of words and phrases. The semantic relationship between chunks of text has also been modeled as a conceptual graph, which can help obtain information about the coherence of the text through pattern recognition [Chein, 2008; Pawar, 2019].
Irrespective of the method used, semantic similarity only measures the similarity between the words in the essay and essay prompt. And lexical cohesion is not a measure of the appropriateness of the essay content. At best it is a measure of writing quality. These attempts do not measure how well the essay answers the prompt. In fact, we believe that pure similarity is a highly incomplete measure of relevance and completeness.
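The gap between similarity and relevance can be illustrated with a minimal Python sketch. It uses a bag-of-words cosine similarity as a simple stand-in for the word-level measures discussed above. The sentences and function names are hypothetical and are not drawn from any of the cited systems.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

prompt = "Explain three factors that cause climate change"
# Hypothetical response that only reorders the prompt's words.
shuffled = "Change climate cause that factors three explain"
# Hypothetical response that addresses the prompt in different words.
answer = "Burning coal, clearing forests and raising cattle release greenhouse gases"

sim_shuffled = cosine_similarity(bag_of_words(prompt), bag_of_words(shuffled))
sim_answer = cosine_similarity(bag_of_words(prompt), bag_of_words(answer))
print(sim_shuffled, sim_answer)
```

In this toy case the reordered prompt receives the maximum similarity while the substantive response shares no words with the prompt, which is the failure mode described above.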
There have also been attempts to incorporate sentiment in AES by using subjectivity lexicon(s) to obtain the polarity of sentences [Klebanov, 2012]. Stab and Gurevych [Stab, 2014] attempt to find an argumentative discourse structure in persuasive essays by classifying argument components as either supportive or not supportive. Klebanov et al. [Klebanov, 2014] attempted to find the sentiment of sentences in essays by examining the sentiment of multi-word expressions. Farra et al. [Farra, 2015] matched the opinion expressions in the essays to their respective targets and used the extracted features to predict the scores. There are several challenges with such approaches. For one, entries in a sentiment lexicon can carry polar opposite sentiments depending on the context in which they are used. For example, “oil prices are going up” can be positive for the oil industry and negative for many other industries. This makes the use of standardized lexicon(s) problematic unless there is an accompanying effort to fully understand the context. Further, sentiment analysis is local to the sentence: it does not easily measure the aggregate sentiment of complex sentences and certainly does not capture the aggregate sentiment of a paragraph or a long document. The most important challenge is that pure sentiments are only a small subset of the universe of rhetorical expressions and arguments that an essay may be expected to contain in support of its theses.
The Educational Testing Service [ETS] uses e-rater [Attali, 2006] for generating scores and feedback. The feature set used with e-rater includes measures of grammar, usage, mechanics, style, organization, development, lexical complexity, and prompt-specific vocabulary usage. The discourse analysis approach used to measure organization and development assumes the essay can be segmented into sequences of discourse elements according to a rigid, fixed discourse structure – introductory material, a thesis statement, main ideas, and a conclusion. The identification of discourse elements was done by training a model on a large corpus of human annotated essays [Higgins, 2004]. E-rater suffers from several drawbacks. First, it presupposes a rigid discourse structure. Second, the model is trained on a hand annotated corpus of essays tagged according to that discourse structure, and so it has all the issues associated with a word-occurrence based predictive model. E-rater does not really attempt to understand the essay content in its full substance in order to assess it.
More recently, Rodriguez et al. [Rodriguez, 2019] applied language models – BERT and XLNet – to develop a classification model for essay grading. Their model provides superior performance over human or rule-based techniques on the ASAP data set. But it suffers from the same limitations that all statistical models suffer from – the need for a large data set, data bias, and the lack of traceability and tractability.
As Perelman [Perelman, 2020] has forcefully pointed out, all the AES reported in the literature are largely word-based and use features devoid of the rhetorical content in the essay to develop these models.
2.2. Peer Grading in Online Courses
Grading open text responses is a significant challenge in Massive Open Online Courses [MOOCs], which necessitate a timely, accurate, and meaningful assessment of course assignments for hundreds of thousands of students. There have been two responses to this challenge: first, most MOOCs seem to avoid open text assessment in favor of multiple-choice questions; second, where there are such assessments, several use ‘peer grading’.
Peer grading is defined as “an arrangement for learners to consider and specify the level, value, or quality of a product or performance of other equal-status learners” [Bachelet, 2015]. There has been extensive research on the reliability and validity of peer grading, primarily in the context of face-to-face higher education [Cheng, 1999; Cho, 2006; Falchikov, 2000; Zhang, 2008]. Reliability is measured by the consistency of scores given by multiple student graders, and validity is commonly calculated as the correlation coefficient between student-assigned scores and instructor-assigned scores. Some research has found peer grading to be reliable and valid, while other studies have reported contradictory findings [Cheng, 1999; Korman, 1971; Mowl, 1995].
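As a minimal sketch of the validity measure described above, the following Python code computes a Pearson correlation coefficient between peer-assigned and instructor-assigned scores. The choice of Pearson's coefficient, the function name and the scores are assumptions made for illustration; they are not data from any of the cited studies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # If one grader gives every essay the same score, the
        # coefficient is undefined rather than zero.
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five essays on a 15-point scale.
peer_scores = [12, 14, 13, 15, 11]
instructor_scores = [9, 13, 10, 14, 8]

validity = pearson(peer_scores, instructor_scores)
print(round(validity, 3))
```

Note that the coefficient cannot be computed at all when every peer score is identical, so uniformly perfect peer grades carry no information about validity under this measure.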
Research to date on the efficacy of peer grading has been in the context of traditional college degree courses with small or moderate enrollments and relatively homogeneous student populations. Its applicability in the case of MOOCs remains largely unknown. In fact, as reported in a subsequent section of this study, we found such grading to be highly inflated and inconsistent with instructor grading.
3.1. Essay Grading Rubric
Our approach to automated essay grading addresses the limitations identified in prior sections. We do not need large training data, our approach is completely traceable, it does not suffer from bias, and it is completely tractable. The Intergrader can be deployed even where there are not a large number of essays. Its assessment logic can be configured to address course specific needs.
The Intergrader relies on an assessment framework comprising 3 dimensions – Completeness/Relevance, Writing Quality and Grammar. Completeness/Relevance refers to the degree to which the essay is relevant to the essay prompt and reflects an understanding of the underlying subject material. Writing Quality refers to the structure of the essay, including linguistic properties such as cohesion, coherence, and readability. It can also include syntactical properties listed in Fig. 2. Finally, Grammar refers to the extent of grammatical errors in the essay, such as errors in subject-verb agreement, incorrect usage, incorrect prepositions, sentence fragments, article usage and others.
While the literature has used many features related to Writing Quality and Grammar, its assessment of relevance is very weak and limited to semantic similarity measures. To be adequate, a response should reflect an understanding of the subject material that is the focus of the course or exercise. We therefore believe the AES needs to assess the essay content in much more depth by examining the rhetorical structure of the essay and the rhetorical expressions between the various concepts/ideas in the response. Where the AES has access to subject material, e.g., course content, we believe it needs to integrate the relevant portions of the content with the essay prompt and assess the essay response against such integrated prompt material.
Fig. 2: Intergrader Essay Grading Framework
Intergrader has the ability to integrate the relevant portions of the underlying content with the essay prompt and assess the essay against such integrated material. It creates a semantic knowledge graph of the integrated content. The knowledge graph is a graph of concepts and expressed relationships between them. The relationships are normalized to a relationship category. For example, the relationship ‘is a’ is categorized as a ‘type of’ relationship [hypernymy]. As an example, “Benz is a car” signifies a ‘type of’ relationship between Benz and Car. Another commonly expected relationship would be causal, e.g., “Explain 3 factors that cause climate change”.
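One way to picture such a graph is as a set of (concept, relationship category, concept) triples. The Python sketch below is only an illustration under that assumption. The `Triple` class, the `RELATION_CATEGORIES` table and the `normalize` function are hypothetical names, and nothing here reflects Intergrader's proprietary implementation.

```python
from dataclasses import dataclass

# Hypothetical lookup from surface phrases to relationship categories.
RELATION_CATEGORIES = {
    "is a": "type_of",
    "is a kind of": "type_of",
    "causes": "causal",
    "leads to": "causal",
}

@dataclass(frozen=True)
class Triple:
    """One edge of the graph: source concept, relationship category, target concept."""
    source: str
    relation: str
    target: str

def normalize(source, phrase, target):
    """Map the relationship phrase to its category and lowercase the concepts."""
    category = RELATION_CATEGORIES.get(phrase.lower().strip(), "unknown")
    return Triple(source.lower(), category, target.lower())

# A two-edge graph built from the "Benz is a car" example and a causal statement.
graph = {
    normalize("Benz", "is a", "car"),
    normalize("Deforestation", "leads to", "climate change"),
}
print(sorted(graph, key=lambda t: t.source))
```

Storing edges as hashable triples keeps the graph a plain set, so two graphs can later be compared with ordinary set operations.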
Intergrader leverages all the rhetorical relationships identified in the literature, e.g., Mann and Thompson [Mann, 1998]. These relationships were refined into a smaller set of manageable categories. In practice, most courses will have qualitative relevance/completeness items in their rubric which are mapped to one or more of these relationships to configure Intergrader specific to that course.
Fig. 3 shows a sample, minimal Knowledge Graph for the Essay Prompt, the Rubric and the underlying course materials (relevant excerpt) for an Engineering course. The figure also shows the potential expected responses to which Intergrader would assign the highest score.
Fig. 3: Sample Expected Knowledge Graph for a Course
Fig. 3: Expansion of Prompt and Rubrics Semantic Knowledge Graph using Course Content
Where there is no access to the underlying material, Intergrader relies on the essay prompt or questions. Optionally, it can expand the essay prompt or question by automatically finding related concepts which might form part of a global understanding of that concept and allows the examiner to select the ones that should be used in the assessment. Alternatively, the Intergrader also allows the examiner to expand the prompt manually by specifying a more elaborate rubric. Intergrader also allows the examiner to specify the types of rhetorical relationships that are expected in the response. Our approach to assessing relevance involves determining the overlap between the knowledge graphs from the prompt + subject material and the essay response.
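The overlap idea can be sketched in Python as the fraction of expected graph edges that also appear in the response graph. This is a deliberately simplified illustration: the triples, the `relevance_overlap` function and the exact-match comparison are assumptions, and the actual Intergrader overlap computation is not described in this paper.

```python
def relevance_overlap(expected, response, relation_types=None):
    """Fraction of expected triples that also appear in the response graph.

    If relation_types is given, only expected triples whose relationship
    category is in that set are counted.
    """
    if relation_types is not None:
        expected = {t for t in expected if t[1] in relation_types}
    if not expected:
        return 0.0
    return len(expected & response) / len(expected)

# Hypothetical graph built from the prompt plus subject material.
expected_graph = {
    ("fossil fuels", "causal", "climate change"),
    ("deforestation", "causal", "climate change"),
    ("methane", "type_of", "greenhouse gas"),
}
# Hypothetical graph extracted from a student's response.
response_graph = {
    ("fossil fuels", "causal", "climate change"),
    ("methane", "type_of", "greenhouse gas"),
    ("recycling", "causal", "less waste"),
}

score = relevance_overlap(expected_graph, response_graph)
causal_only = relevance_overlap(expected_graph, response_graph, {"causal"})
print(score, causal_only)
```

The optional `relation_types` argument mirrors the examiner's ability to specify which rhetorical relationships are expected in the response.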
There is another weakness in most prior studies. Most of the features they use in building their AES may not be relevant in practice in a large number of settings. We find that the relative importance of each of the 3 dimensions – relevance, writing quality and grammar, varies significantly depending on the context. In some cases, all three dimensions matter; in others only relevance matters. In all cases, outside of where writing is the main objective, relevance carries significantly larger weight compared to writing quality and grammar.
3.2. The Engine
The Intergrader Engine works in three stages: Setup, Model Build, Assessment.
Setup allows the user to configure the assessment rubric, which includes defining weights for the three dimensions and, where appropriate, weights for specific metrics within each dimension. Alternatively, a weight can simply be set to 0 to exclude a dimension or 1 to include it, in which case the actual weights are determined implicitly as part of the classifier that is built from the labeled data set. In the case of completeness/relevance, each rubric item is also mapped to a corresponding set of expected rhetorical expressions.
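A minimal Python sketch of explicit dimension weights follows, assuming each dimension has already been scored on a 0 to 1 scale and that the overall score is a weighted average. The dictionary keys, the scores and the averaging rule are illustrative assumptions rather than the documented Intergrader configuration format.

```python
def overall_score(dimension_scores, weights):
    """Weighted average of dimension scores; a weight of 0 excludes a dimension."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    weighted = sum(dimension_scores[name] * w for name, w in weights.items())
    return weighted / total_weight

# Hypothetical per-dimension scores for one essay, each between 0 and 1.
scores = {"relevance": 0.8, "writing_quality": 0.6, "grammar": 0.2}

# Hypothetical rubric in which grammar is excluded from the grade.
weights = {"relevance": 0.7, "writing_quality": 0.3, "grammar": 0.0}

result = overall_score(scores, weights)
print(round(result, 2))
```

With grammar weighted at 0, the low grammar score has no effect on the result, which corresponds to a rubric where only relevance and writing quality are graded.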
Model Build computes the metrics identified in Fig. 1 according to the preferences defined in the Setup phase and allows the user to iterate to a traceable assessor for the labeled data set. This is accomplished by perturbing the thresholds for the various metrics and finding a combination that best predicts the labels.
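The threshold search can be pictured with the Python sketch below. It substitutes an exhaustive grid search for the perturbation procedure, assumes a simple rule in which an essay is acceptable only if every metric meets its threshold, and uses made-up metric names, values and labels. It is an illustration of the idea, not the Intergrader algorithm.

```python
from itertools import product

def predict(metrics, thresholds):
    """Label an essay 1 (acceptable) only if every metric meets its threshold."""
    return int(all(metrics[name] >= t for name, t in thresholds.items()))

def fit_thresholds(essays, labels, grid):
    """Try every threshold combination in the grid and keep the most accurate one."""
    names = sorted(grid)
    best_thresholds, best_accuracy = None, -1.0
    for combo in product(*(grid[name] for name in names)):
        thresholds = dict(zip(names, combo))
        correct = sum(predict(e, thresholds) == y for e, y in zip(essays, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:  # ties keep the earliest combination
            best_thresholds, best_accuracy = thresholds, accuracy
    return best_thresholds, best_accuracy

# Hypothetical metric values for four labeled essays.
essays = [
    {"relevance": 0.9, "cohesion": 0.7},
    {"relevance": 0.4, "cohesion": 0.8},
    {"relevance": 0.8, "cohesion": 0.3},
    {"relevance": 0.7, "cohesion": 0.6},
]
labels = [1, 0, 0, 1]
grid = {"relevance": [0.3, 0.5, 0.7], "cohesion": [0.2, 0.5, 0.8]}

thresholds, accuracy = fit_thresholds(essays, labels, grid)
print(thresholds, accuracy)
```

Because the fitted model is just a set of named thresholds, each prediction can be traced to the metric that passed or failed, which is the sense of traceability used in this paper.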
Finally, the Assessment step applies the assessor to new essays.
Intergrader Assessment Process
Fig. 4: The Process of Automated Essay Evaluation in Intergrader.
Fig. 4 outlines the end to end Intergrader process for an implementation. As mentioned earlier, ‘Setup’ refers to the process of setting up the Intergrader for the course or environment. It includes mapping the rubric desired by the institution to the metrics computed by the Intergrader engine. The Model step refers to the automated development of an appropriate grading model using the Setup that achieves the desired level of accuracy and recall on the sample data. The Integrate step reflects the ability of the Intergrader platform to integrate with external learning management systems and context systems in order to access student demographic data and also the necessary content to assess the essays.
4. Case Studies
4.1. A Prominent Higher Ed Institution
A prominent, globally respected higher educational institution currently uses peer grading in its online courses for open text response assessment. Intergrader was tested on a sample question from an engineering course. We selected an assignment from a course that was already graded. For the chosen assignment, the peer graders had given each other perfect scores.
Our team gathered all of the relevant materials for this assignment – the prompt, the rubric, the weights, and supplemental reading materials that students used to answer the assignment. We also collected the assignments that the students submitted. The assignment had 5 questions in total, and each question could be scored from 0 to 3 (with 3 being the highest score). The maximum total score a student could receive was 15 points.
The Intergrader was configured with the rubric the professor had specified. In this case, the professor only wanted the writing quality and completeness/relevance graded. No weight was given to grammar. After Intergrader was configured, all of the submitted assignments were assessed through Intergrader. In total, we assessed the assignments of 10 students.
We then requested 3 TAs to grade the same essays manually with the intent of obtaining and comparing against a human baseline.
Fig. 5: Intergrader Essay Grading Framework
Fig. 5 presents the results of the pilot. As we can see, the peer grades were perfect scores for all 10 students. The TA grades showed a wide dispersion for the same essay, and the Intergrader scores were broadly aligned with the TA grades. For each of the Intergrader assessments, a detailed reasoning report was provided to the professor/institution explaining how the Intergrader arrived at the final score/grade. In all the cases, the professor and TAs agreed with the Intergrader assessment.
4.2. A Global Test Prep Company
One of the largest test prep companies in the world (2M+ users), which prepares students for various tests, including the International English Language Testing System (IELTS), has a scaling and user experience challenge with its current practice of manually grading its open text responses. IELTS is an English language proficiency test for non-native English language speakers.
This test prep company helps its users on the writing assessment part of the IELTS test. Once users have written their responses to the IELTS prompts, they submit them to the test prep company, which has humans grade them manually. Given the complexity of the IELTS rubric, the grading process is very tedious, and there are huge inconsistencies from grader to grader. This ultimately hurts the student’s ability to receive high-quality and consistent feedback.
The test prep company provided the IELTS rubric, select prompt, and 734 writing responses from its users.
Fig. 6: The IELTS Rubric
While each response was graded by one individual, there were many graders involved in grading the 734 essays. The manual grades were thus subject to grader inconsistencies. The objective of the exercise was to compare the Intergrader results with manual grades and then examine the cases where Intergrader suggested a different grade to see if that was justified.
- As seen in Fig. 7, 88% of Intergrader’s overall scores, an overwhelming majority, fall within a 1.5-point difference of the manual graders’ scores
- 17% of Intergrader’s overall scores were exactly the same as the manual graders’ scores
- 82% of Intergrader’s Grammatical Range and Accuracy scores fall within a 1.5-point difference of the manual graders’ scores
- 77% of Intergrader’s Task Achievement scores fall within a 1.5-point difference of the manual graders’ scores
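Agreement figures of this kind can be computed as in the short Python sketch below. The band scores are hypothetical and are not taken from the 734 responses in this study, and the function name is an assumption.

```python
def agreement_rates(auto_scores, manual_scores, tolerance=1.5):
    """Return (exact-match rate, within-tolerance rate) for paired scores."""
    pairs = list(zip(auto_scores, manual_scores))
    if not pairs:
        raise ValueError("no scores to compare")
    exact = sum(a == m for a, m in pairs) / len(pairs)
    within = sum(abs(a - m) <= tolerance for a, m in pairs) / len(pairs)
    return exact, within

# Hypothetical band scores for four responses.
auto = [6.0, 6.5, 5.0, 7.0]
manual = [6.0, 7.5, 7.0, 6.5]

exact_rate, within_rate = agreement_rates(auto, manual)
print(exact_rate, within_rate)
```

The exact-match rate is always at most the within-tolerance rate, since an exact match has a difference of 0.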
In the case of Lexical Resource scores there was no expectation for a match as it was already known that those scores suffered from grader inconsistencies. The objective in Lexical Resource was more to use Intergrader to understand the potential inconsistencies and to analyze if any changes were required in how this aspect was graded in an automated manner.
Fig. 7: Intergrader VS Manual Grades – Test Prep
The Intergrader results did point out a number of areas of improvement with respect to assessing the Lexical Resource dimension.
In all 734 cases, the Intergrader results provided a complete tracing of the reasoning behind the automated scores which allowed an informed comparison and validation of Intergrader’s efficacy.
In this paper, we have presented a novel AES which attempts to assess the writer’s understanding of the material by comprehensively analyzing the rhetorical structure of the response and comparing appropriate elements of that structure with the course material and essay prompt. Intergrader relies on an adaptive, extensible framework organized along 3 dimensions – syntactic, linguistic and rhetorical. A course specific rubric is mapped to the appropriate set of measurements in these dimensions to configure the Intergrader for a specific assessment instance. With its proprietary computational linguistics engine and its ability to generate a granular understanding of the essay response, Intergrader overcomes significant limitations of current AES – it does not require a large training set, its reasoning is fully traceable, and it is tractable.
Intergrader is the only AES which is designed to be customized for specific courses. Most current AES are designed for competitive tests like TOEFL where a large data set is available for training. They are inappropriate for course specific assessments. We reported the results from two implementations of Intergrader – a test prep company, where we used it to evaluate an essay prompt used at large scale, and a course at a higher ed institution, which required a course specific rubric implementation.
In general, we believe that open text responses are a critically ignored aspect of assessment and that Intergrader will enable institutions to increase their reliance on them. We believe that full understanding of any subject content can only be assessed through open text responses, not through multiple-choice questions alone.
[Attali, 2006] Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning, and Assessment, 4(3).
[Bachelet, 2015] Bachelet, R., Zongo, D. & Bourelle, A. (2015). Does peer grading work? How to implement and improve it? Comparing instructor and peer assessment in MOOC GdP. European Stakeholder Summit. 224-233.
[Berry, 2008] Berry, K. J., Johnston, J. E., & Mielke, P. W. (2008). Weighted Kappa for Multiple Raters. Perceptual and Motor Skills, 107(3), 837–848.
[Burstein, 2001] Burstein, Jill & Leacock, Claudia & Swartz, Richard. (2001). Automated evaluation of essays and short answers. Tech. Rep., 2001.
[Burstein, 2003] Burstein, J. (2003). The E-rater® scoring engine: Automated essay scoring with natural language processing.
[Chein, 2008] Chein, Michel & Mugnier, Marie-Laure. (2008). A Graph-Based Approach to Knowledge Representation: Computational Foundations of Conceptual Graphs (Part. I). 10.1007/978-1-84800-286-9.
[Chen, 2012] Chen, H., He, B., Luo, T., Li, B. 2012. A Ranked-Based Learning Approach to Automated Essay Scoring. 2012 Second International Conference on Cloud and Green Computing, Xiangtan, 2012, pp. 448-455.
[Chen, 2018] Chen, M., Li, X. 2018. Relevance-Based Automated Essay Scoring via Hierarchical Recurrent Model. 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 2018, pp. 378-383.
[Cheng, 1999] Cheng, W., Warren, M. (1999). Peer and Teacher Assessment of the Oral and Written Tasks of a Group Project, Assessment & Evaluation in Higher Education, 24:3, 301-314, DOI: 10.1080/0260293990240304
[Cho, 2006] Cho, K., Schunn, C., Wilson, R. (2006). Validity and Reliability of Scaffolded Peer Assessment of Writing From Instructor and Student Perspectives. Journal of Educational Psychology. 98. 891-901. 10.1037/0022-0663.98.4.891.
[Darus, 2009] Darus, S. & Subramaniam, K. (2009). Error analysis of the written English essays of secondary school students in Malaysia: A case study. European Journal of Social Sciences. vol. 8, no. 3, pp. 483–495, 2009.
[Falchikov, 2000] Falchikov, N., Goldfinch, J. (2000). Student Peer Assessment in Higher Education: A Meta-Analysis Comparing Peer and Teacher Marks. Review of Educational Research, 70(3), 287-322.
[Farra, 2015] Farra, Noura & Somasundaran, Swapna & Burstein, Jill. (2015). Scoring Persuasive Essays Using Opinions and their Targets. 10.3115/v1/W15-0608.
[Foltz, 1998] Foltz, P., Kintsch, W., Landauer, T. (1998). The measurement of textual coherence with latent semantic analysis, Discourse Processes, 25:2-3, 285-307.
[Foltz, 2013] Foltz, P., Streeter, L., Lochhaum, K., Landauer, T. (2013). Implementation and Applications of the Intelligent Essay Assessor. In Handbook of Automated Essay Evaluation, M. Shermis and J. Burstein, Eds. New York, NY, USA: Routledge, 2013, pp. 68–88.
[He, 2017] He, Z., Gao, S., Xiao, L., Liu, D., He, H., Barber, D. 2017. Wider and deeper, cheaper and faster: tensorized LSTMs for sequence learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 1–11.
[Higgins, 2004] Higgins, D., Burstein, J., Marcu, D., & Gentile, C. (2004). Evaluating Multiple Aspects of Coherence in Student Essays. HLT-NAACL.
[Janda, 2019] H. K. Janda, A. Pawar, S. Du and V. Mago, “Syntactic, Semantic and Sentiment Analysis: The Joint Effect on Automated Essay Evaluation,” in IEEE Access, vol. 7, pp. 108486-108503, 2019.
[Jin, 2007] Jin, W., Srihari, R. 2007. Graph-based text representation and knowledge discovery. In Proceedings of the 2007 ACM symposium on Applied computing (SAC ’07). Association for Computing Machinery, New York, NY, USA, 807–811.
[Jin, 2018] Jin, C., He, B., Hui, K., Sun, Le. 2018. TDNN: A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring. 10.18653/v1/P18-1100.
[Klebanov, 2012] Klebanov, Beata & Burstein, Jill & Madnani, Nitin & Faulkner, Adam & Tetreault, Joel. (2012). Building Subjectivity Lexicon(s) From Scratch For Essay Data. 591-602. 10.1007/978-3-642-28604-9_48.
[Klebanov, 2014] Klebanov, B., Madnani, N., Burstein, J., Somasundaran, S. (2014). Content Importance Models for Scoring Writing From Sources. 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 – Proceedings of the Conference. 2. 247-252. 10.3115/v1/P14-2041.
[Kolowich, 2014] Kolowich, S. 2014. Writing Instructor, Skeptical of Automated Grading, Pits Machine vs. Machine. In The Chronicle of Higher Education, April 28, 2014. Retrieved from https://www.chronicle.com/article/Writing-Instructor-Skeptical/146211 on January 20, 2020.
[Korman, 1971] Korman, M, Stubblefield, R. 1971. Medical School Evaluation and Internship Performance. In the Journal of Academic Medicine. 1971.
[Li, 2006] Li, Y., McLean, D., Bandar, Z., O’Shea, J., Crockett, K. (2006). Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering. vol. 18, no. 8, pp. 1138–1150, Aug. 2006.
[Mann, 1998] Mann, William & Thompson, Sandra. (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text. 8. 243-281. 10.1515/text.1.1988.8.3.243.
[McNamara, 2010] McNamara, D. S., Crossley, S. A., & McCarthy, P. M. (2010). Linguistic Features of Writing Quality. Written Communication, 27(1), 57–86.
[Miller, 2003] Miller, T. (2003). Essay Assessment with Latent Semantic Analysis. Journal of Educational Computing Research, 29(4), 495–512.
[Miltsakaki, 2000] Miltsakaki, E., & Kukich, K. Automated Evaluation of Coherence in Student Essays. In Proc. LREC, 2000, pp. 1–8.
[Mowl, 1995] Mowl, Graham & Pain, Rachel. (1995). Using Self and Peer Assessment to Improve Students’ Essay Writing: A Case Study from Geography. Innovations in Education and Training International. 32. 10.1080/1355800950320404.
[Norton, 1990] Norton, L. 1990. Essay-writing: What really counts? Higher Educ., vol. 20, no. 4, pp. 411–442, Dec. 1990. doi: 10.1007/BF00136221.
[Ong, 2014] Ong, N., Litman, D., Brusilovsky, A. 2014. Ontology-Based Argument Mining and Automatic Essay Scoring. In Proc. 1st Workshop Argumentation Mining, 2014, pp. 24–28. 10.3115/v1/W14-2104.
[Page, 1966] Page, E. (1966). The Imminence of… Grading Essays by Computer. The Phi Delta Kappan, 47(5), 238-243.
[Page, 1968] Page, E. (1968). The Use of the Computer in Analyzing Student Essays. International Review of Education / Internationale Zeitschrift Für Erziehungswissenschaft / Revue Internationale De L’Education, 14(2), 210-225.
[Pawar, 2019] Pawar, A., Mago, V. 2019. “Challenging the Boundaries of Unsupervised Learning for Semantic Similarity,” in IEEE Access, vol. 7, pp. 16291-16308, 2019.
[Perelman, 2020] Perelman, L., 2020. Babel Essay Generator. Retrieved from http://lesperelman.com/writing-assessment-robo-grading/babel-generator/ on January 20, 2020.
[Phandi, 2015] Phandi, P., Chai, K.M., & Ng, H.T. (2015). Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression. EMNLP.
[Rodriguez, 2019] Rodriguez, P., Jafari, A., & Ormerod, C.M. (2019). Language models and Automated Essay Scoring. ArXiv, abs/1909.09482.
[Rubenstein, 1965] Rubenstein, H., Goodenough, J. 1965. Contextual correlates of synonymy. Commun. ACM, vol. 8, no. 10, pp. 627–633, Oct. 1965.
[Shermis, 2003] Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Lawrence Erlbaum Associates Publishers.
[Shermis, 2013] Shermis, M.D., & Burstein, J. (2013). Handbook of automated essay evaluation: current applications and new directions. Evanston, IL, USA: Routledge, 2013.
[Shultz, 2013] Schultz, M. (2013). The IntelliMetric™ Automated Essay Scoring Engine – A Review and an Application to Chinese Essay Scoring. In Handbook of Automated Essay Scoring: Current Applications and Future Directions, M.D.Shermis and J. Burstein, Eds. New York, NY, USA: Routledge, 2013, pp. 89–98.
[Somasundaran, 2016] Somasundaran, S., Riordan, B., Gyawali, B., & Yoon, S. (2016). Evaluating Argumentative and Narrative Essays using Graphs. COLING, 2016, pp. 1568–1578.
[Stab, 2014] Stab, C., Gurevych, I. (2014). Identifying Argumentative Discourse Structures in Persuasive Essays, EMNLP, pp. 46-56, 2014.
[Taguchi, 2013] Taguchi, N. & Crawford, W., Wetzel, D. (2013). What Linguistic Features Are Indicative of Writing Quality? A Case of Argumentative Essays in a College Composition Program. TESOL Quarterly, vol. 47, no. 2, pp. 420–430, 2013.
[Zhang, 2008] Zhang, Bo & Johnston, Lucy & Kilic, Gulsen & Leblebicioglu, Gulsen. (2008). Assessing the reliability of s