Alumni
Bio
JC is a geeky polymath and passionate dreamer who believes that R&D is the most efficient way to improve human well-being. He studied medicine for four years but was drawn to a more comprehensive research field, so he shifted his career toward computer science.
He is currently pursuing two majors online, computer science and industrial engineering. Additionally, he works at an educational startup, where he does educational research and develops a financial literacy program for kids.
Project: Identifying Native Language Using Machine Learning
English is the default language for science, technology and international trade. Around the world, about 2 billion people speak English; still, it is the native language (L1) for only 20% of them, making English the most-learned second language (L2). In a process called language transfer, people tend to apply previous knowledge about the structure of their L1 to an L2. This results in characteristic L2 traits that depend on the structure of the L1. Assuming the existence of language transfer, many applied linguistics researchers have succeeded in identifying a writer’s L1 by analyzing how they write L2 texts, particularly in English. Native language identification has many useful applications, from better automatic spellchecking and style correction methods to forensic linguistics.
The goal of this project was to identify a writer’s L1 from a piece of written English text. To tackle this task, we used L1-labeled part-of-speech, word and character n-grams (with 1 ≤ n ≤ 6) from the English learner corpus to train hidden Markov models (HMMs).
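As an illustration of the feature extraction step, the following is a minimal Python sketch of collecting word and character n-grams for n from 1 to 6. It is not the project's implementation: the function names and the sample sentence are hypothetical, tokenization is plain whitespace splitting, and part-of-speech n-grams are omitted because they would require a tagger.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all word n-grams of a text as tuples (whitespace tokenization)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Return all character n-grams of a text as strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, max_n=6):
    """Count word and character n-grams for every n from 1 to max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(("word", n, g) for g in word_ngrams(text, n))
        counts.update(("char", n, g) for g in char_ngrams(text, n))
    return counts

# Hypothetical learner sentence, invented for illustration only.
sample = "I am agree with this idea"
counts = ngram_counts(sample)
```

Each key records the n-gram type, its order n and the n-gram itself, so counts from texts sharing an L1 label could be pooled before training a model per language.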
References
[1] M. P. Lewis, G. F. Simons and C. D. Fennig, eds., “English in the Language Cloud,” Ethnologue: Languages of the World, Nineteenth ed. (Sep 15, 2016). www.ethnologue.com/cloud/eng.
[2] Wikipedia, “List of Languages by Total Number of Speakers.” (Jun 29, 2016) en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers.
[3] Wikipedia, “Language (Structure).” (Jun 29, 2016) en.wikipedia.org/wiki/Language#Structure.
[4] Wikipedia, “Language Transfer.” (Jun 29, 2016) en.wikipedia.org/wiki/Language_transfer.
[5] E. Kochmar, “Identification of a Writer’s Native Language by Error Analysis,” PhD dissertation, University of Cambridge, Cambridge. www.cl.cam.ac.uk/~ek358/Native_Language_Detection.pdf.
[6] J. Tetreault, D. Blanchard and A. Cahill, “A Report on the First Native Language Identification Shared Task,” in Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta: Association for Computational Linguistics, 2013 pp. 48–57. citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.5931&rep=rep1&type=pdf.
[7] M. Koppel, J. Schler and K. Zigdon, “Determining an Author’s Native Language by Mining a Text for Errors,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York: ACM Publications, 2005 pp. 624–628. u.cs.biu.ac.il/~schlerj/schler_kdd05.pdf.