Towards Linguistically-informed Representations for English as a Second or Foreign Language: Review, Construction and Application
A new annotated dataset treats non-native English as its own linguistic system, not just broken grammar — potentially improving AI language tools for billions of learners.

The Thesis
Roughly 1.5 billion people use English as a second or foreign language, yet most AI language tools are built assuming native-speaker norms. This paper argues that non-native English — called ESFL, for English as a Second or Foreign Language — is a coherent linguistic system with its own patterns, not merely a corrupted version of standard English. The authors build a hand-annotated dataset of 1,643 sentences that maps how grammar and meaning connect in non-native English, using a framework called Construction Grammar, which treats recurring form-meaning pairings (for example, the way a learner might say 'I am agree' to express agreement) as the fundamental building blocks of language. The catch is scale: 1,643 sentences is small by modern NLP standards, and the paper is primarily a resource-construction and pilot study, not a deployed system.
Catalyst
Large language models trained predominantly on native-speaker text have made the performance gap for non-native users more visible and commercially consequential. Simultaneously, the field of Second Language Acquisition research has developed richer theoretical frameworks — particularly Construction Grammar — that give researchers principled tools for annotating learner language rather than just flagging errors. These converging pressures created demand for a structured, theoretically grounded dataset that simply did not exist before.
What's New
Prior learner corpora — collections of non-native writing such as the Cambridge Learner Corpus — were mostly annotated for surface errors (wrong verb tense, missing article) without modeling the underlying syntax-semantics relationship. That approach treats learner language as deficient native English rather than as a system in its own right. This paper instead applies Construction Grammar annotation, capturing the form-meaning mappings unique to ESFL and enabling research questions about how learners' linguistic patterns differ structurally, not just superficially.
The Counter
A dataset of 1,643 sentences is too small to train or meaningfully fine-tune any modern language model; for context, even modest NLP benchmarks typically run to tens of thousands of examples. The paper is largely a theoretical framework and annotation exercise, and its pilot study of the Linguistic Niche Hypothesis is exploratory rather than confirmatory. Construction Grammar, while intellectually appealing, is a contested theoretical framework with limited uptake in mainstream NLP, in part because its categories are hard to annotate consistently at scale. There is no evidence here that models trained or evaluated on this resource would outperform current systems on any practical task like grammar correction or language tutoring. The gap between a gold-standard academic resource and a commercially deployable product for ESFL learners remains enormous, and better-resourced teams at companies like Duolingo or Grammarly are likely already building large proprietary learner datasets without publishing them.
Longs
- DUOL (Duolingo) — direct exposure to AI-powered ESFL instruction tools
- CHGG (Chegg) — writing and tutoring services for non-native English students
- COUR (Coursera) — large non-native English learner base on platform
Shorts
- Generic grammar checkers (e.g., traditional Grammarly rule engine) — flagging ESFL as 'wrong' rather than understanding it structurally becomes a product liability as users demand more nuanced feedback
- Standard benchmark providers — leaderboards built on native-English test sets systematically undercount model quality for the majority of real-world English users
Enablers (Picks & Shovels)
- Universal Dependencies — open annotation standard that this work builds on and extends
- spaCy / Stanza — open-source NLP pipelines used to process and tag learner text
- Cambridge Learner Corpus and similar existing learner corpora — the raw material prior datasets drew from
Private Watchlist
- Quill.org — grammar and writing tools for learners
- Elsa Speak — AI pronunciation and fluency coaching for non-native speakers
- Grammarly — already serves large non-native user base, research could inform product
Resources
The Paper
The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax-semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL's unique characteristics, resulting in a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank's practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.