Posts by Collection

portfolio

publications

Māori loanwords: A corpus of New Zealand English tweets

Published in 57th Annual Meeting of the Association for Computational Linguistics, 2019

Quick links: paper, poster, data, code

Māori loanwords are widely used in New Zealand English for various social functions by New Zealanders within and outside of the Māori community. Motivated by the lack of linguistic resources for studying how Māori loanwords are used in social media, we present a new corpus of New Zealand English tweets. We collected tweets containing selected Māori words that are likely to be known by New Zealanders who do not speak Māori. Since over 30% of these words turned out to be irrelevant, we manually annotated a sample of our tweets into relevant and irrelevant categories. This data was used to train machine learning models to automatically filter out irrelevant tweets.

Recommended citation: Trye, D., Calude, A. S., Bravo-Marquez, F., & Keegan, T. T. (2019). Māori loanwords: A corpus of New Zealand English tweets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 136-142. Florence, Italy: Association for Computational Linguistics. http://doi.org/10.18653/v1/P19-2018

Hybrid hashtags: #YouKnowYoureAKiwiWhen your tweet contains Māori and English

Published in Frontiers Special Issue in Computational Sociolinguistics, 2020

Quick links: paper, data, code

Twitter constitutes a rich resource for investigating language contact phenomena. In this paper, we report findings from the analysis of a large-scale diachronic corpus of over one million tweets, containing loanwords from te reo Māori, the indigenous language spoken in New Zealand, into (primarily, New Zealand) English. Our analysis focuses on hashtags comprising mixed-language resources (which we term hybrid hashtags), bringing together descriptive linguistic tools (investigating length, word class, and semantic domains of the hashtags) and quantitative methods (Random Forests and regression analysis). Our work has implications for language change and the study of loanwords (we argue that hybrid hashtags can be linked to loanword entrenchment), and for the study of language on social media (we challenge proposals of hashtags as “words,” and show that hashtags have a dual discourse role: a micro-function within the immediate linguistic context in which they occur and a macro-function within the tweet as a whole).

Recommended citation: Trye, D., Calude, A. S., Bravo-Marquez, F., & Keegan, T. T. (2020). Hybrid hashtags: #YouKnowYoureAKiwiWhen your tweet contains Māori and English. Frontiers Special Issue in Computational Sociolinguistics, 3. https://doi.org/10.3389/frai.2020.00015

Harnessing Indigenous Tweets: The Reo Māori Twitter corpus

Published in Language Resources and Evaluation, 2022

Quick links: paper, data, code

Te reo Māori, the Indigenous language of Aotearoa New Zealand, is a distinctive feature of the nation’s cultural heritage. This paper documents our efforts to build a corpus of 79,000 Māori-language tweets using computational methods. The Reo Māori Twitter (RMT) Corpus was created by targeting Māori-language users identified by the Indigenous Tweets website, pre-processing their data and filtering out non-Māori tweets, together with other sources of noise. Our motivation for creating such a resource is three-fold: (1) it serves as a rich and unique dataset for linguistic analysis of te reo Māori on social media; (2) it can be used as training data to develop and augment Natural Language Processing (NLP) tools with robust, real-world Māori-language applications; and (3) it will potentially promote awareness of, and encourage positive interaction with, the growing community of Māori tweeters, thereby increasing the use and visibility of te reo Māori in an online environment. While the corpus captures data from 2007 to 2020, our analysis shows that the number of tweets in the RMT Corpus peaked in 2014, and the number of active tweeters peaked in 2017, although at least 600 users were still active in 2020. To the best of our knowledge, the RMT Corpus is the largest publicly-available collection of social media data containing (almost) exclusively Māori text, making it a useful resource for language experts, NLP developers and Indigenous researchers alike.

Recommended citation: Trye, D., Keegan, T. T., Mato, P., & Apperley, M. (2022). Harnessing Indigenous Tweets: The Reo Māori Twitter corpus. Language Resources and Evaluation, 56, 1229-1268. https://doi.org/10.1007/s10579-022-09580-w

A hybrid architecture for labelling bilingual Māori-English tweets

Published in Findings of the Association for Computational Linguistics: AACL-IJCNLP, 2022

Quick links: paper, video, code, slides, MET Corpus Explorer, Interactive Error Analysis

Most large-scale language detection tools perform poorly at identifying Māori text. Moreover, rule-based and machine learning-based techniques devised specifically for the Māori-English language pair struggle with interlingual homographs. We develop a hybrid architecture that couples Māori-language orthography with machine learning models in order to annotate mixed Māori-English text. This architecture is used to label a new bilingual Twitter corpus at both the token (word) and tweet (sentence) levels. We use the collected tweets to show that the hybrid approach outperforms existing systems with respect to language detection of interlingual homographs and overall accuracy. We also evaluate its performance on out-of-domain data. Two interactive visualisations are provided for exploring the Twitter corpus and comparing errors across the new and existing techniques. The architecture code and visualisations are available online, and the corpus is available on request.
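As a rough illustration of the orthographic side of such a hybrid approach (this is not the authors' implementation), the sketch below checks whether a token could be spelled in Māori at all. It assumes the standard Māori alphabet: five vowels that may carry a macron, the consonants h, k, m, n, p, r, t, w plus the digraphs ng and wh, and syllables that always end in a vowel. The function names and the regular expression are hypothetical and exist only for this example.

```python
import re
import unicodedata

# Assumed simplification of Māori orthography: an optional consonant
# (h, k, m, n, p, r, t, w, ng, wh) followed by one vowel, repeated.
_SYLLABLE = r"(?:ng|wh|[hkmnprtw])?[aeiouāēīōū]"
_MAORI_WORD = re.compile(rf"(?:{_SYLLABLE})+")


def looks_maori(token: str) -> bool:
    """Return True if the token is orthographically possible in Māori."""
    # NFC keeps macronised vowels as single characters before matching.
    normalised = unicodedata.normalize("NFC", token).lower()
    return _MAORI_WORD.fullmatch(normalised) is not None


def label_tokens(text: str) -> list[tuple[str, str]]:
    """Label each word by spelling alone (no context, no model)."""
    text = unicodedata.normalize("NFC", text)
    words = re.findall(r"[^\W\d_]+", text)
    return [(w, "maybe-maori" if looks_maori(w) else "not-maori") for w in words]


if __name__ == "__main__":
    for word, label in label_tokens("Kia ora, he tweet"):
        print(word, label)
```

Words such as "he", "to" and "me" pass this check even though they are also English words. Spelling alone cannot resolve interlingual homographs of this kind, which is why a spelling rule would need to be combined with a context-aware model.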

Recommended citation: Trye, D., Yogarajan, V., König, J., Keegan, T. T., Bainbridge, D., & Apperley, M. (2022). A hybrid architecture for labelling bilingual Māori-English tweets. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pp. 119-130. https://aclanthology.org/2022.findings-aacl.11/

When loanwords are not lone words: Using networks and hypergraphs to explore Māori loanwords in New Zealand English

Published in International Journal of Corpus Linguistics, 2023

Quick links: accepted manuscript, code

Networks are being used to model an increasingly diverse range of real-world phenomena. This paper introduces an exploratory approach to studying loanwords in relation to one another, using networks of co-occurrence. While traditional studies treat individual loanwords as discrete items, we show that insights can be gained by focusing on the various loanwords that co-occur within each text in a corpus, especially when leveraging the notion of a hypergraph. Our research involves a case-study of New Zealand English (NZE), which borrows Indigenous Māori words on a large scale. We use a topic-constrained corpus to show that: (i) Māori loanword types tend not to occur by themselves in a text; (ii) infrequent loanwords are nearly always accompanied by frequent loanwords; and (iii) it is not uncommon for texts to contain a mixture of listed and unlisted loanwords, suggesting that NZE is still riding a wave of borrowing importation from Māori.

Recommended citation: Trye, D., Calude, A. S., Keegan, T. T., & Falconer, J. (2023). When loanwords are not lone words: Using networks and hypergraphs to explore Māori loanwords in New Zealand English. International Journal of Corpus Linguistics, 28(4), 461-499. https://doi.org/10.1075/ijcl.21124.try

Intensifying expletive constructions and their use on social media: Innovative functions of the hashtag #wokeAF in English tweets

Published in Discourse, Context & Media, 2023

Quick links: paper

The hashtag has seen increasing attention in the linguistics literature, in recognition of its prevalence on social media and in other modes of communication. Here, we report on a diachronic analysis of the hashtag #wokeAF in English-language tweets posted between 2012 and 2022. First, we trace the use of the word woke from verb to adjective, with novel uses arising in African American Vernacular English. Such uses then spread into mainstream standard English, eventually being used in a novel construction: the intensifying expletive ([adjective+as+expletive]). Although examples of the intensifying expletive are listed in the Urban Dictionary, to our knowledge, this is the first linguistic analysis of the construction. Second, we analyse semantic interpretations and syntactic characteristics of the intensifying expletive #wokeAF, by documenting its occurrence in tweets spanning eleven years. Analysis of the discourse and the context in which the hashtag appears allows us to uncover its novel use as a collective noun, which in our data, is linked to a pejorative stance. In general, we find innovation in the semantic scope of the hashtag and versatility in its position and integration within tweets. Given the pervasiveness of the word woke in the public consciousness, as evidenced by its occurrence in the popular press, this article aims to fill a timely gap while providing an interesting example of language innovation online.

Recommended citation: Calude, A. S., Anderson, A., & Trye, D. (2023). Intensifying expletive constructions and their use on social media: Innovative functions of the hashtag #wokeAF in English tweets. Discourse, Context & Media, 56. https://authors.elsevier.com/c/1hzQ17suQFuPIa

Extending the Heatmap Matrix: Pairwise analysis of multivariate categorical data

Published in 27th International Conference on Information Visualisation (IV), 2023

Quick links: paper, video, slides

Analysts are often interested in understanding the association between variables within a dataset. This paper describes a set of techniques for augmenting the Heatmap Matrix, which represents pairwise intersections of categorical variables. The proposed extensions include adapting the design and layout of the matrix to enhance its readability, expanding the number of metrics that can be presented, displaying matching records in a coordinated table view, and embedding the Chi-square test of independence. These features are demonstrated on two datasets using the empirical prototype that has been developed.
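To make the idea of pairwise intersections concrete, here is a minimal sketch that builds one contingency table for every pair of categorical variables in a dataset. It is not the prototype described above. The function name, the record format (a list of dictionaries) and the variable names in the usage example are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_tables(records, variables):
    """Map each pair of variables to a Counter of (category, category) counts."""
    return {
        (a, b): Counter((record[a], record[b]) for record in records)
        for a, b in combinations(variables, 2)
    }


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"stance": "for", "directive": "yes", "year": "2020"},
        {"stance": "for", "directive": "no", "year": "2020"},
        {"stance": "against", "directive": "yes", "year": "2021"},
    ]
    tables = pairwise_tables(records, ["stance", "directive", "year"])
    for pair, counts in tables.items():
        print(pair, dict(counts))
```

With k variables this yields k(k-1)/2 tables, one for each cell block of a matrix-style display.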

Recommended citation: Trye, D., Apperley, M., & Bainbridge, D. (2023). Extending the Heatmap Matrix: Pairwise analysis of multivariate categorical data. In 2023 27th International Conference Information Visualisation (IV), pp. 29-36. Tampere, Finland: IEEE. https://doi.org/10.1109/IV60283.2023.00016

talks

What can social media tell us about Māori loanwords?

Published:

Quick links: slides

Joint presentation with Andreea S. Calude, in collaboration with Felipe Bravo-Marquez and Te Taka Keegan.

This talk presents work in progress involving computational tools to build a balanced corpus of Twitter data for the purpose of studying New Zealand English, in our case, specifically Māori loanwords. Following a wealth of studies which document the use of Māori loanwords in newspaper language (Deverson 1991, Davies and Maclagan 2006, Macalister 2000, 2001, 2004, 2006, 2007, 2008, 2009) and a small number considering spoken language (from the late 1990s, Kennedy 2001, Calude et al 2017), children’s picture books (Daly 2007, 2009, 2017) and TV news broadcasts (de Bres 2006), we aim to complement this body of data with analyses of social media language. To this end, we devised a novel method of building a corpus of NZE Tweets which is both (relatively) clean and large (1.2M Tweets), using machine learning techniques. The MLT Corpus (Māori Loanword Twitter Corpus) affords the study of Twitter language diachronically (over a ten-year period) and idiolectally (by user ID profile). Because our main interest lies with the use of Māori loanwords, we discuss two main research questions we are currently pursuing using this dataset, namely (1) analysing the frequency and internal structure of hybrid hashtags (#tereostories, #growingupkiwi), and (2) studying semantic representations of Māori loanwords using word embeddings (such as Word2Vec; Mikolov et al 2013).

#KiwiIngenuity: creative uses of Māori loanwords in NZE Twitter posts

Published:

Quick links: slides

Joint presentation with Andreea S. Calude, in collaboration with Felipe Bravo-Marquez and Te Taka Keegan.

In this talk, we outline how computational tools can be used to obtain a large corpus of Tweets, and discuss trends identified in the use of Māori loanwords in this data.

Following a wealth of studies which document the use of Māori loanwords in newspaper language (Calude et al In press, Deverson 1991, Davies and Maclagan 2006, Levendis and Calude 2019, Macalister 2006, 2007, 2008, 2009) and a small number considering spoken language (from the late 1990s, Kennedy 2001, Calude et al 2017), children’s picture books (Daly 2007, 2008, 2017) and TV news broadcasts (de Bres 2006), we complement this body of data with analyses of social media language. To this end, we propose a novel method of building a corpus of NZE Tweets which is both (relatively) clean and large, using machine learning techniques. The MLT Corpus (Māori Loanword Twitter Corpus) affords the study of Twitter language diachronically (over a ten-year period) and idiolectally (by user ID profile). Our data comprises a mix of manually labelled and automatically categorised Tweets (in total, approx. 1 million Tweets, and nearly 20 million word tokens).

Because our main interest lies with the use of Māori loanwords, we analyse patterns observed in the use of hybrid hashtags containing (at least) one Māori word and (at least) one native English word, e.g., #tereostories, #growingupkiwi. We first extracted all the hashtags in the MLT corpus, and then manually inspected them in order to find the 100 most frequently occurring hybrid hashtags. In the talk, we discuss (1) diachronic patterns of these hashtags over the ten-year period analysed, (2) their syntactic structure (categorising them into compound hybrid hashtags and phrasal hybrid hashtags; see Caleffi 2015), (3) their discourse function, and (4) trends in their position within Tweets (that is, whether they occur within the main text of the tweet, or as annexed tags at its periphery). Our findings point to creative and novel uses of Māori loanwords on Twitter, not unlike the phenomena classified under “word play” by Zirker and Winter-Froemel (2015).
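The distinction between hashtags in the main text and hashtags annexed at the periphery can be approximated with a simple heuristic. The sketch below is an assumption made for illustration, not the procedure used in the study: it treats any unbroken run of hashtags at the end of a tweet as "annexed" and every other hashtag as "inline". The function name and the example tweet are made up.

```python
def hashtag_positions(tweet: str) -> list[tuple[str, str]]:
    """Label each hashtag 'inline' or 'annexed' (part of a trailing run of tags)."""
    tokens = tweet.split()

    # Walk backwards to find where the trailing run of hashtags begins.
    start = len(tokens)
    while start > 0 and tokens[start - 1].startswith("#"):
        start -= 1

    return [
        (token, "annexed" if index >= start else "inline")
        for index, token in enumerate(tokens)
        if token.startswith("#")
    ]


if __name__ == "__main__":
    example = "Loving the #tereostories series tonight #growingupkiwi #nz"
    print(hashtag_positions(example))
```

A fuller treatment would also need to decide how to handle tweet-initial hashtags, which can be either syntactically integrated or annexed.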

We hope that this work can contribute to current knowledge of the use of Māori loanwords and to methods in large-scale corpus building.

Exploring Loanword Networks

Published:

Quick links: programme, thread

Joint presentation with Andreea S. Calude for “the first international Twitter conference on linguistics”.

Word borrowing is often investigated using frequency-based measures, such as types and tokens in a corpus. We introduce an alternative approach to studying loanwords, which involves building collocation networks, based on sets of borrowings that co-occur within the same text.

For more than a hundred years, linguists have puzzled over questions regarding word borrowing. Empirical studies generally capture loanword use either by analysing types and tokens of borrowed words in a corpus, or by comparing type/token frequency with near-synonyms native to the receiver language. In such studies, the unit of measurement for investigating loanwords is frequency of use.

This presentation will introduce an alternative approach to studying loanwords. Our method involves building networks of collocation by extracting sets of borrowings that co-occur within the same text. We refer to these sets as “intra-textual relationships”. Collocation is usually operationalised a priori with a specified window size (e.g. five words to the left or right of the keyword); however, the texts analysed are typically much larger than this window and may differ in length. Consequently, we extend the notion of collocation to what we term “collotextualisation”: capturing co-occurrence across the entire text, regardless of size.

We present a case-study of how collotextualisation can be used to complement conventional frequency-of-use measures when exploring loanwords. The data in our analysis consists of New Zealand English newspaper articles, which we use to study indigenous Māori words. Our corpus is themed around Matariki, the Māori New Year, and spans a period of ten years (2007-2016). The corpus comprises 91,958 words and 194 texts, with a borrowing rate of 29 loanwords per 1,000 words. After extracting 107 borrowings that occur at least five times in the corpus, we analysed the data by leveraging a special type of network (called a hypergraph) that preserves intra-textual relationships involving multiple loanwords. This allowed us to bypass the limitations of a standard network, which flattens the data into (less meaningful) pairwise co-occurrences.
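The contrast between a hypergraph and a standard network can be sketched in a few lines. In this illustrative example (not the code used for the study), each text contributes one hyperedge, namely the set of loanwords found anywhere in it, and a second function flattens those hyperedges into weighted pairs. The function names, the whitespace tokenisation and the toy texts are all assumptions.

```python
from collections import Counter
from itertools import combinations


def hyperedges(texts, loanwords):
    """One hyperedge per text: the set of loanwords found anywhere in it."""
    edges = []
    for text in texts:
        found = frozenset(w for w in text.lower().split() if w in loanwords)
        if found:
            edges.append(found)
    return edges


def pairwise_edges(edges):
    """Flatten hyperedges into weighted pairs, as a standard network would."""
    pairs = Counter()
    for edge in edges:
        pairs.update(combinations(sorted(edge), 2))
    return pairs


if __name__ == "__main__":
    # Toy texts, invented for this example.
    texts = [
        "matariki whānau kai celebration",
        "the whānau shared kai",
        "matariki stars",
    ]
    loanwords = {"matariki", "whānau", "kai"}
    edges = hyperedges(texts, loanwords)
    print([sorted(edge) for edge in edges])
    print(dict(pairwise_edges(edges)))
    print(sum(len(edge) for edge in edges) / len(edges))  # mean hyperedge size
```

In the flattened version, the three-way relationship in the first text becomes three separate pairs and the single-loanword text disappears entirely, which is the information loss that a hypergraph avoids.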

We show that hypergraphs can uncover fresh insights into loanword use, especially when explored over time or by examining the (average) size of the intra-textual relationships. We report three main findings. First, most loanwords in our data occur with at least four others (i.e. loans occur in sets rather than in isolation). Second, there is an inverse relationship between intra-textual co-occurrence size and frequency of use, which means that newspaper articles are unlikely to contain an infrequent loanword and no frequently occurring ones. This is consistent with the idea that loanwords might occur in vocabulary frequency bands. Third, frequent loanwords take part in more distinct and recurrent relationships than infrequent ones, and are typically the first to occur in a given text.

Collotextualisation: An alternative approach to studying loanwords.

Published:

Quick links: video, slides

Joint presentation with Andreea S. Calude.

In traditional studies, word borrowing has been investigated through frequency-based measures, such as number of types and tokens in a corpus (see Poplack 2018 and references within; New Zealand English examples include Davies & Maclagan 2006, de Bres 2006 and Macalister 2006). This talk introduces an alternative approach to the study of loanwords, which involves building networks of collocation (Firth 1957; see also its grammatical parallel, collostruction, Stefanowitsch & Gries 2003), by extracting sets of borrowings that co-occur within the same text. Collocation is usually operationalised a priori with a specified window size (e.g. five words to the left or right of the keyword); however, the texts analysed are typically much larger than this window and may differ in length. Consequently, we extend the notion of collocation to what we term “collotextualisation”: capturing co-occurrence across the entire text, regardless of size.

We present a case-study of how collotextualisation can be used to complement conventional frequency-of-use measures when exploring loanwords. We compare Māori loanword use across three different corpora of New Zealand English newspaper articles and report three main findings. First, most loanwords in our data occur with several others, supporting the notion that loanwords occur in sets rather than in isolation (see also Macdonald & Daly 2013). Second, there is an inverse relationship between the length of a set and frequency of use, which means that newspaper articles are unlikely to contain infrequent loanwords and no frequently occurring ones. This is consistent with the idea that loanwords might occur in vocabulary frequency bands (as proposed for measuring L2 vocabulary; see Laufer & Nation 1995 and Nation 2006). Third, frequent loanwords take part in more distinct and recurrent relationships than infrequent ones, and are typically the first to occur in a given text.

Visualising Multivariate Categorical Data

Published:

Quick links: paper, poster, preview video, supplementary figure

Recommended citation: Trye, D. (2022, April 11-14). Visualising multivariate categorical data [Poster presentation]. 2022 IEEE 15th Pacific Visualization Symposium (PacificVis), Tsukuba, Japan. https://dgt12.github.io/files/pvis22_paper.pdf

Despite categorical dimensions being common in real-world datasets, few visualisation techniques support the analysis of multiple categorical variables at the same time. Those methods that do exist do not scale well, or do not consider relationships between all variables simultaneously, instead breaking them down into more restricted views or reflecting a hierarchy of variables. Drawing inspiration from set-based tools, this paper introduces a novel technique for visualising multivariate categorical data, by aggregating different combinations of categories. Advantages of this approach include the ability to easily compare frequencies among both variable categories and their combinations, the absence of line crossings, and a non-hierarchical layout that does not privilege one variable above all others.
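The core idea of aggregating combinations of categories can be illustrated with a short sketch. This is only a guess at the underlying data preparation, not the technique's actual implementation: it counts how often each full combination of categories occurs across records, alongside the per-variable category frequencies. The function names and the record format are hypothetical.

```python
from collections import Counter


def combination_counts(records, variables):
    """Count how many records share each exact combination of categories."""
    return Counter(tuple(record[v] for v in variables) for record in records)


def category_counts(records, variables):
    """Count category frequencies separately for each variable."""
    return {v: Counter(record[v] for record in records) for v in variables}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"genre": "news", "language": "en"},
        {"genre": "news", "language": "en"},
        {"genre": "blog", "language": "mi"},
    ]
    print(combination_counts(records, ["genre", "language"]).most_common())
    print(category_counts(records, ["genre", "language"]))
```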

Aggregating Hypergraphs by Node Attributes

Published:

Quick links: paper, poster

Recommended citation: Trye, D., Apperley, M., & Bainbridge, D. (2022). Aggregating hypergraphs by node attributes. In Angelini, P., & von Hanxleden, R. (Eds.), Graph Drawing and Network Visualization: 30th International Symposium, GD 2022, Tokyo, Japan, September 13–16, 2022, Revised Selected Papers (Vol. 13764, pp. 487-489). Springer Nature. https://doi.org/10.1007/978-3-031-22203-0

PAOHVis (Valdivia et al., 2021) displays hypergraphs in a matrix where rows represent nodes (dots) and columns represent hyperedges (vertical lines). We propose extensions to PAOHVis for leveraging repeated hyperedges in non-simple hypergraphs, and displaying multiple node attributes. This is accomplished through two aggregation functions: count-based, which targets low-level detail, and binary, for high-level overview. In doing so, we introduce a domain-agnostic framework for consolidating hypergraphs by one or more categorical node attributes.

Reference:

  • Valdivia, P., Buono, P., Plaisant, C., Dufournaud, N., Fekete, J.D. (2021). Analyzing dynamic hypergraphs with parallel aggregated ordered hypergraph visualization. IEEE Transactions on Visualization and Computer Graphics, 27(1), 1–13. https://doi.org/10.1109/TVCG.2019.2933196
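One possible reading of the count-based and binary aggregation functions described above is sketched below. The exact definitions are assumptions, as are the function name and data layout: each hyperedge is reduced to the categories of its nodes, either with per-category node counts (count-based) or as presence only (binary), and hyperedges that become identical after this reduction are merged and tallied.

```python
from collections import Counter


def aggregate(hyperedges, attribute, mode="count"):
    """Merge hyperedges that look the same once nodes are replaced by categories.

    hyperedges: iterable of node collections (repeats allowed).
    attribute:  dict mapping each node to one categorical value.
    mode:       "count" keeps per-category node counts, "binary" keeps presence only.
    """
    merged = Counter()
    for edge in hyperedges:
        per_category = Counter(attribute[node] for node in edge)
        if mode == "count":
            key = tuple(sorted(per_category.items()))
        elif mode == "binary":
            key = tuple(sorted(per_category))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        merged[key] += 1
    return merged


if __name__ == "__main__":
    # Toy hypergraph, invented for this example.
    attribute = {"a": "noun", "b": "noun", "c": "verb", "d": "verb"}
    edges = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c"}]
    print(dict(aggregate(edges, attribute, mode="count")))
    print(dict(aggregate(edges, attribute, mode="binary")))
```

Under these assumptions the binary mode collapses more hyperedges together than the count-based mode, matching the overview-versus-detail contrast in the abstract.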

Interactive Techniques for Visualising Categorical Data in Linguistics

Published:

Quick links: slides

So much of linguistic analysis involves categorical variables (Levshina, 2015; Stefanowitsch, 2020), from phonological, to lexical and grammatical features (see, for instance, Dryer & Haspelmath, 2013), and even language contact phenomena (Trye et al., 2020). Crucially, though, information visualisation is needed to make sense of large quantities of data. Yet, few visualisation techniques support the analysis of more than a handful of categorical variables at the same time. As the size and complexity of datasets continue to increase, more powerful visualisation tools are needed to facilitate their effective exploration. In this talk, I introduce two interactive techniques for visualising multivariate categorical data, which are being developed into free online tools. These techniques can also be applied to datasets of mixed types, provided the continuous variables are appropriately ‘binned’. The functionality of both tools is demonstrated using a COVID-19 Twitter dataset coded for the use of directives and users’ stance towards government measures in New Zealand (Burnette & Calude, 2022).

The first technique I will introduce is the Staircase Plot, which employs a space-efficient, matrix-based layout to display multiple bivariate summaries. The main visualisation is a heatmap depicting all possible two-way contingency tables for the given collection of variables. This allows the user to quickly identify associations between variables, and to detect any cells with zero frequencies or exceedingly low/high counts. The display can also be changed to show proportions or Pearson residuals instead of raw frequencies. Moreover, there is built-in support for the Chi-squared test of independence: all variable pairs that satisfy the criteria are coloured according to the effect size, as measured by Cramér’s V. This removes the burden of manual computation, visually reinforces correct interpretations and enables all results to be conveniently displayed in one place.
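For readers who want to see what such a display automates, the following sketch computes the chi-squared statistic, Cramér's V and the Pearson residuals for a single two-way contingency table using the standard textbook formulas. It is not taken from the tool itself. The function name and the example table are assumptions, and the p-value is omitted because it would require a statistics library.

```python
from math import sqrt


def chi_square_summary(table):
    """Return (chi-squared, Cramér's V, Pearson residuals) for a table of counts.

    table is a list of rows, each a list of observed frequencies.
    """
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    k = min(len(row_totals), len(col_totals)) - 1
    if k < 1:
        raise ValueError("each variable needs at least two categories")

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            residual = (observed - expected) / sqrt(expected)
            residual_row.append(residual)
            chi2 += residual ** 2  # chi-squared is the sum of squared residuals
        residuals.append(residual_row)

    cramers_v = sqrt(chi2 / (n * k))
    return chi2, cramers_v, residuals


if __name__ == "__main__":
    # Made-up 2x2 table, for illustration only.
    chi2, v, residuals = chi_square_summary([[10, 20], [20, 10]])
    print(round(chi2, 3), round(v, 3))
    print([[round(r, 3) for r in row] for row in residuals])
```

The sketch assumes every row and column total is non-zero, which holds for any table built from observed category pairs.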

The second technique, MultiCat (Trye, 2022), is designed for examining higher-order categorical relationships: that is, multivariate rather than pairwise associations. The display enables the comparison of category frequencies, foregrounding precise combinations of categories that are commonly observed across individual data items. This can provide insights that are distinct from but complementary to those revealed by Staircase Plots. The proposed techniques have practical value for linguists who frequently deal with categorical data and wish to enhance their workflows for data exploration, anomaly detection, knowledge discovery and hypothesis testing. The interactive nature of these tools encourages the user to uncover patterns that may otherwise go unnoticed, by examining the data from multiple perspectives.

References:

  • Burnette, J., & Calude, A. S. (2022). Wake up New Zealand! Directives, politeness and stance in Twitter #Covid19NZ posts. Journal of Pragmatics, 196, 6-23.
  • Dryer, M. S., & Haspelmath, M. (2013). The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info, Accessed on 2022-10-10.)
  • Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. John Benjamins Publishing Company.
  • Stefanowitsch, A. (2020). Corpus linguistics: A guide to the methodology. Language Science Press.
  • Trye, D. (2022, April 11-14). Visualising multivariate categorical data. In Proceedings of the IEEE Pacific Visualization Symposium (PacificVis), Tsukuba, Japan.

From Heatmaps to Hypergraphs: Visualising Language Data in Aotearoa

Published:

Quick links: video

The 3MT Doctoral Competition challenges students to summarise their thesis in three minutes to a non-specialist audience.

Abstract: Visualisations can be extremely helpful for making sense of data. My research investigates novel techniques for visualising both categories and sets. I am currently applying these techniques to language data in order to better understand how te reo Māori is used on Twitter and in New Zealand newspaper articles.

Intensifying expletive constructions in English tweets: the case of #wokeAF

Published:

Quick links: slides

Presented by Andreea S. Calude, as part of a joint project with Amber Anderson and David Trye.

The hashtag has seen increasing attention in the linguistics literature, in recognition of its prevalence on social media and in other modes of communication. In this talk, we report on a diachronic analysis of the hashtag #wokeAF in English-language tweets posted between 2012 and 2022. First, we trace the use of the word woke from verb to adjective, with novel uses arising in African American Vernacular English and spreading to standard English. We argue that such uses led to a novel construction: the intensifying expletive ([adjective + as + expletive]). Although examples of the intensifying expletive are listed in the Urban Dictionary, to our knowledge, this is the first linguistic analysis of the construction. Second, we analyse semantic interpretations and syntactic characteristics of the intensifying expletive #wokeAF, by documenting its use in tweets spanning eleven years. Analysis of the discourse and the context in which the hashtag appears allows us to uncover its novel use as a collective noun, which in our data, is linked to a strongly pejorative stance. In general, we find innovation in the semantic scope of the hashtag and versatility in its position and integration within tweets. Given the pervasiveness of the word woke in the public consciousness as evidenced by its occurrence in the popular press, we hope to fill a timely gap, while also tackling broader issues around the role of social media in language change.

teaching
