Educational data mining is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students and the settings in which they learn. By applying educational data mining techniques to the analysis of the Grade Grinder corpus, we aim to develop techniques and approaches that will be subsequently reusable by colleagues wishing to exploit the other large educational data sets that are likely to become ubiquitous as, for example, learning management systems become more widely used. The following sections outline what we're doing with the data and the questions we seek to address.
- The taxonomisation of student errors: Educational data mining techniques will be used to identify a detailed taxonomy of errors made by students as they learn to reason in this formal domain. Through detailed analyses of patterns of error instances across students, and within individual students across time, we aim to identify distinct types of error, such as misconceptions and slips. In pilot work on a small subset of the data (Barker-Plummer, Cox, Dale and Etchemendy, 2008), we have so far identified three top-level error types (which we call structural, connective and atomic) that account for a significant proportion of the data. We also propose to discover the mal-rules that students appear to use in producing the error patterns we observe. The error types and mal-rules will reflect current cognitive theories of human problem solving, reasoning and comprehension and will take into account individual differences in reasoning style.
In pilot work (Barker-Plummer et al., 2008; Dale, Barker-Plummer and Cox, 2009) we have established that a major source of difficulty for students stems from the fact that conditionals (e.g. "if") and quantifiers (e.g. "some") are used quite differently in natural language compared to their use in logic. This result mirrors existing work in mathematics education, and indicates the generality of the results that we will obtain in this work. Although our corpus contains work in undergraduate logic, the general task of learning to manipulate and use formal expressions in a careful way underlies all of mathematics, technology, engineering and science. We are confident that the results that we obtain will carry over into these other domains and may be used to inform education beyond the undergraduate logic curriculum.
- Informing the design of student learning support: The results of the error taxonomy analyses will, inter alia, inform the design of automated diagnostic and remedial extensions to the current e-assessment system. Our pilot work here has been promising, with an approach to classification based on regular expression patterns correctly identifying an average of 85% of errors. The aim is for the e-assessment system ultimately to provide highly targeted, personalised support to learners. We expect that the techniques we develop will also generalise to a wide array of other domains and subject areas.
- The development of innovative language technologies: We will explore the use of statistical and symbolic corpus analysis methods from computational linguistics and language technology for the purpose of generating appropriate English paraphrases of students' submitted logic sentences. The goal here is to improve the effectiveness of e-assessment system feedback, and in so doing to make it possible for more students to come to grips with this traditionally difficult subject.
- Studying the time-course of student learning: Individual student submissions are time-stamped. By analysing successive exercise submissions by individuals, we can examine individual students' learning trajectories, the time-course of their learning, and learning impasses. In pilot work (Dale, Barker-Plummer and Cox, 2009) we have identified a useful measure of learning that we term stickiness. This is defined as the number of attempts it takes for a student to determine a correct answer once they have made their initial mistake. We would like to research this metric further and use it as an outcome measure in learning evaluation studies.
- Studying the role of diagrams in learning: The corpus contains diagrams as well as sentences of logic. Students use desktop applications to build or manipulate blocks worlds such that sentences of natural language or logic are true in them. Hence we are able to triangulate students' performance in the linguistic domain (natural language, logic) with their performance in the graphical (diagrammatic) domain. A preliminary study of a small data subset (Cox, Dale, Etchemendy and Barker-Plummer, 2008) has revealed theoretically significant findings. For example, errors in diagramming sentences such as "not a small cube" are manifested much more frequently with respect to the object's size than with respect to its shape.
- The construction of an open-access front-end: We would like to make our corpus of data accessible to the wider academic community. To that end, we propose to develop OpenFace, a user-friendly web-based front-end designed to facilitate data filtering, sharing and re-use. Users will be encouraged to grow the resource by submitting the results of their analyses, and ancillary materials such as copies of publications. A discussion forum will also be provided. We plan to accommodate interoperability requirements (e.g. with existing data mining tools). We intend to structure the corpus in terms of the learning tasks posed to the learner and in terms of a philosophical logic curriculum (i.e., a hierarchy of conceptual pre- and co-requisites).
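
As an illustration of the stickiness measure described under the time-course item above, the following is a minimal Python sketch, not the project's actual implementation. It assumes each submission for a single student and exercise can be reduced to a (timestamp, correct) pair; the function name `stickiness`, the input shape, and the counting convention (the initial mistake itself is not counted, the first correct submission is) are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def stickiness(submissions: Iterable[Tuple[float, bool]]) -> Optional[int]:
    """Count the attempts made after the first mistake, up to and
    including the first correct submission that follows it.

    `submissions` holds hypothetical (timestamp, correct) pairs for one
    student on one exercise. Returns None if the student never made a
    mistake, or never submitted a correct answer after the first mistake.
    """
    attempts = None  # stays None until the initial mistake is seen
    # Order by timestamp so the count follows the real sequence of attempts.
    for _, correct in sorted(submissions, key=lambda s: s[0]):
        if attempts is None:
            if not correct:
                attempts = 0  # initial mistake found; start counting after it
            continue
        attempts += 1
        if correct:
            return attempts
    return None


# Hypothetical data: wrong, wrong, then right.
example = [(1.0, False), (2.0, False), (3.0, True)]
print(stickiness(example))
```

Under this convention a student who corrects a mistake on the very next submission has a stickiness of 1. A different convention, such as counting the initial mistake as well, would shift every value by one.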
