Visualising Categorical Data: Linguistic Case Studies from te Reo Māori and New Zealand English

Published in Waikato Research Commons, 2024

Recommended citation: Trye, D. (2024). Visualising categorical data: Linguistic case studies from te Reo Māori and New Zealand English (Doctoral thesis, The University of Waikato, Hamilton, New Zealand). The University of Waikato. https://hdl.handle.net/10289/17049

Quick links: PhD thesis, CatVis database, MultiCat

Categorical variables are prevalent in real-world datasets across numerous domains, yet few visualisation techniques accommodate them effectively. This is especially true of datasets comprising three or more categorical variables, termed multivariate categorical data. Visualising such data is challenging due to the lack of inherent ordering of nominal categories, the so-called ‘curse of dimensionality’, and the potential variability in the number of categories per variable. Corpus linguistics, which involves the study of large digital collections of naturally occurring language, serves as the primary application domain in this thesis. This domain was chosen because it is rich in multivariate categorical data, yet its data is often visualised using only basic techniques.

This thesis contributes to the area of categorical data visualisation in several ways. First, we propose a taxonomy of techniques for visualising categorical data, highlight limitations of existing solutions, and identify relevant analysis tasks. Building on this foundation, the thesis introduces novel techniques and enhancements for visualising datasets involving multiple categorical variables. We focus on adapting the layout and interactive capabilities of an existing technique that uses a matrix of heatmaps to represent pairwise category intersections. These modifications show that directly visualising statistical test results for categorical data can be beneficial for exploring bivariate patterns and associations. Furthermore, we contribute the design, implementation and evaluation of a novel technique called MultiCat, which is not restricted to pairwise intersections but rather facilitates analysis of relationships among multiple variables simultaneously. Both these techniques are interactive and offer greater scalability than existing alternatives, thereby affording new possibilities for analysing multivariate categorical data. However, since categorical variables can occur within more complex data structures, we also consider their presence in networks and hypergraphs, which require specialised methods.
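As a rough illustration of what a matrix of pairwise category intersections involves, the sketch below builds one contingency table per pair of variables and derives a simple association statistic for each cell. It is a minimal sketch under stated assumptions, not the implementation described in the thesis: the toy records, the function names, and the choice of Pearson residuals and a chi-square statistic as the displayed test result are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Toy records invented purely for illustration (not thesis data).
records = [
    {"colour": "red", "size": "small", "shape": "round"},
    {"colour": "red", "size": "small", "shape": "square"},
    {"colour": "blue", "size": "large", "shape": "round"},
    {"colour": "blue", "size": "large", "shape": "square"},
]


def contingency(rows, var_a, var_b):
    """Count how often each pair of categories occurs together."""
    return Counter((r[var_a], r[var_b]) for r in rows)


def pearson_residuals(rows, var_a, var_b):
    """Return (observed - expected) / sqrt(expected) for every cell.

    Expected counts assume the two variables are independent.
    """
    counts = contingency(rows, var_a, var_b)
    n = len(rows)
    row_totals = Counter(r[var_a] for r in rows)
    col_totals = Counter(r[var_b] for r in rows)
    residuals = {}
    for a in row_totals:
        for b in col_totals:
            expected = row_totals[a] * col_totals[b] / n
            observed = counts.get((a, b), 0)
            residuals[(a, b)] = (observed - expected) / expected ** 0.5
    return residuals


def chi_square(residuals):
    """The chi-square statistic is the sum of squared Pearson residuals."""
    return sum(r * r for r in residuals.values())


# One "heatmap" (here, a dict of cell values) per pair of variables.
variables = ["colour", "size", "shape"]
matrix = {
    (a, b): pearson_residuals(records, a, b)
    for a, b in combinations(variables, 2)
}

for pair, cells in matrix.items():
    print(pair, "chi-square =", round(chi_square(cells), 3))
```

In a visualisation, each cell value could be mapped to a diverging colour scale, so that category pairs occurring more or less often than expected under independence stand out. The number of tables grows quadratically with the number of variables, which is one reason layout and interaction matter for this kind of display.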

To demonstrate the application of these techniques, we draw on two linguistic case studies that focus on languages of special significance in Aotearoa New Zealand. Addressing the low-resource status of Māori, the country’s Indigenous language, we first contribute two related Twitter datasets (a monolingual Māori corpus and a mixed-language Māori–English corpus), together with an architecture for differentiating Māori and English words. Our initial case study uses the monolingual Māori corpus and the proposed visualisation techniques to investigate grammatical possession in Māori, offering fresh insights into the linguistic practices of contemporary speakers. The second case study uses networks and hypergraphs with categorical attributes to explore Māori loanword co-occurrence in New Zealand English newspaper articles. We find that loanwords tend not to occur in isolation and that New Zealanders are still importing new (unlisted) borrowings from Māori.
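To make the idea of a co-occurrence hypergraph with categorical attributes more concrete, the following sketch treats each article as a hyperedge over the loanwords it contains and then projects the hypergraph onto an ordinary weighted network. Everything here is hypothetical: the article identifiers, the placeholder words (`loan_a` and so on), the `status` attribute and its values, and the function name are assumptions made for illustration, and none of it reproduces the thesis data or method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical hyperedges: each article maps to the set of loanwords it contains.
articles = {
    "article_1": {"loan_a", "loan_b", "loan_c"},
    "article_2": {"loan_a", "loan_b"},
    "article_3": {"loan_d"},
}

# Hypothetical categorical attribute attached to each node (word).
status = {
    "loan_a": "listed",
    "loan_b": "listed",
    "loan_c": "unlisted",
    "loan_d": "listed",
}


def project_to_network(hyperedges):
    """Collapse hyperedges into weighted pairwise edges.

    The weight of an edge is the number of hyperedges (articles)
    in which both words appear together.
    """
    weights = Counter()
    for words in hyperedges.values():
        for pair in combinations(sorted(words), 2):
            weights[pair] += 1
    return weights


network = project_to_network(articles)

# Articles whose hyperedge has a single member: the word occurs in isolation.
isolated = [name for name, words in articles.items() if len(words) == 1]

# Articles that combine words from more than one attribute category.
mixed = [
    name
    for name, words in articles.items()
    if len({status[w] for w in words}) > 1
]

print(dict(network))
print("isolated:", isolated)
print("mixed categories:", mixed)
```

The projection discards information: after it, three words appearing in one article are indistinguishable from three pairs appearing in separate articles. Keeping the hyperedges themselves preserves that distinction, which is one motivation for working with hypergraphs rather than networks alone.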

Ultimately, the techniques developed in this thesis have broad applications both within and beyond the corpus linguistics community. By enabling more effective visualisation and analysis of multivariate categorical data, this research has the potential to facilitate deeper insights into domains as diverse as education, healthcare, business and science.