This talk presents work in progress on computational tools for building a balanced corpus of Twitter data for the study of New Zealand English (NZE), specifically, in our case, the use of Māori loanwords. A wealth of studies document the use of Māori loanwords in newspaper language (Deverson 1991, Davies and Maclagan 2006, Macalister 2000, 2001, 2004, 2006, 2007, 2008, 2009), and a small number consider spoken language (from the late 1990s; Kennedy 2001, Calude et al. 2017), children's picture books (Daly 2007, 2009, 2017) and TV news broadcasts (de Bres 2006). We aim to complement this body of data with analyses of social media language. To this end, we devised a novel method, based on machine learning techniques, of building a corpus of NZE Tweets that is both (relatively) clean and large (1.2M Tweets). The MLT Corpus (Māori Loanword Twitter Corpus) affords the study of Twitter language diachronically (over a ten-year period) and idiolectally (by user ID profile). Because our main interest lies with the use of Māori loanwords, we discuss two research questions we are currently pursuing with this dataset: (1) analysing the frequency and internal structure of hybrid hashtags (#tereostories, #growingupkiwi), and (2) studying semantic representations of Māori loanwords using word embeddings (such as Word2Vec; Mikolov et al. 2013).