Spoken Language and Text Corpora
Language is perhaps the most remarkable innovation in the history of the human species, giving us an effective means to cooperate in groups, pursue complex ideas and develop unique perspectives on our world. Our Centre is investigating languages as diverse, dynamic and evolving systems that interact with our perceptual processes in intricate ways. Understanding why the world’s languages are designed so differently, and how our minds acquire and exploit them to achieve different outcomes, will help generate important scientific insights and exciting new technologies.
We are investigating how languages vary, how we learn them, how we process them and how they evolve.
We aim to integrate typology and descriptive linguistics, evolutionary approaches, and studies of learning and processing across a wide range of linguistic types, in order to set up a new approach to language that places diversity, variation and change at centre stage.
We are building an ever-increasing number of corpora from languages of our region, in particular indigenous languages of Australia, and Papuan and Austronesian languages of Indonesia, PNG, and island Melanesia.
Corpora are collections of texts, either spoken or written, that are initially compiled within language documentation projects. Language documentation involves a very open research approach to a little-known language, comprising recording of different communicative events, their transcription and translation, as well as documentation of vocabulary, cultural and encyclopaedic knowledge, and speakers’ judgments about language structures.
Whether drawn from older written material, audio recordings or newer video recordings, corpora are made amenable to linguistic research through relevant mark-up and metadata, so that linguists can retrieve relevant linguistic information and relate it to its structural context and to information about speakers, the occasion of communicative events, audiences, etc.
The systematic compilation and use of corpora for research on lesser-studied languages is still a relatively recent development. Nonetheless, many of our corpora contain substantial amounts of text data, collected over decades of research in communities in and around Australia. We are also developing new techniques of targeted data collection and data mark-up that will support more systematic corpus linguistic research across diverse languages. It is our belief that making more of this material accessible to a broader public of scholars will be a valuable contribution to the empirical scientific study of language.
Our researchers are at the forefront of different aspects of this emerging development in language research. Our corpora serve as the basis for grammars, dictionaries, and other studies (see Nafsan). In other projects, we make older specimens of data from a range of languages of a particular area accessible by digitizing older tape recordings and presenting the corresponding media files and other information on a dedicated web landing page (see Daly River languages). Yet other projects are concerned explicitly with the comparison of diverse languages with regard to specific linguistic structures, bringing newly developed systems of data mark-up to bear on long-standing questions about the universality of certain patterns of language use (see Multi-CAST and SCOPIC).