How data expertise is fostering endangered languages

Like many indigenous peoples, the Native American Cherokee Nation want their younger people to speak and pass on their language, but this isn’t easy given the potentially overwhelming influence of English, the colonising language of the US.

Like many indigenous languages in the wake of European colonisation, the Cherokee language had been actively suppressed, often by Christian missionaries, but also by an alien education system.

The Cherokee are one of the biggest Native American nations and currently include some 3,000 speakers of the Cherokee language. But in 2002, when the Cherokee surveyed its population, . This sparked the creation of the Cherokee Language Revitalisation Program.

An important part of any effort to revitalise endangered indigenous languages and heritage is making earlier records, like audio recordings and text, easily available for use today and ensuring ongoing records are made for the benefit of future generations.

But the task of managing this mass of information can be daunting, and it is crucial that the information be managed in a way that makes it accessible over time.

Archived records can be difficult to access if the necessary permissions aren’t clearly applied to each item and if the systems that store the records are built in a way that doesn’t allow access to authorised users.

In Australia, at the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), we have created and curated a major collection of language records covering indigenous languages of the Pacific, including Australia and Papua New Guinea.

Through this work, recordings that were originally made on reel-to-reel tapes and cassettes are now digitised and readily available even on mobile phones. It means the voices of past generations can be heard and learned from by their descendants.

The methods we have established at PARADISEC over the last 20 years caught the attention of the Cherokee and we have now worked with them to help organise their recordings, transcripts, manuscripts, images, and texts, all now stored on hard disks and computers in a range of locations.

They are working to record every current Cherokee speaker, but keeping track of all of this new and past material isn’t easy. It requires a classification and archiving methodology that we have had some experience in formulating at PARADISEC.

How records like these are classified is crucial not only in promoting access, so that the records are actually used, but also in the ongoing research effort to understand how humans make language.

To appreciate the importance of language classification, we can think about the vast range of plant and animal types biologists need to know before they can understand biology more broadly. The same goes for languages. The more records of languages we have, the better will be our understanding of the origins of human languages and the many diverse types of languages there are in the world.

At PARADISEC, we are concerned to create research objects and records that can be used widely and far into the future. That means we avoid using software for content management that could make it difficult or impossible to access the data itself.

Within our collection of 15,500 hours of audio, a user can find a single sentence and instantly hear it played. The collection holds text or media in over 1,300 languages and takes up 180 terabytes of storage – that’s equivalent to more than 700 average laptops.
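As a rough check on that comparison, the short calculation below shows how many laptops 180 terabytes would fill. The 256-gigabyte drive size is an assumption made here for illustration; the article does not say what an "average laptop" holds.

```python
# Rough check of the laptop comparison, using decimal units (1 TB = 1,000 GB).
ARCHIVE_TB = 180   # size of the collection, from the text
LAPTOP_GB = 256    # ASSUMPTION: drive size of an "average" laptop

laptops = ARCHIVE_TB * 1000 / LAPTOP_GB
print(f"About {laptops:.0f} laptops of {LAPTOP_GB} GB each")
```

Under that assumption the figure comes out just above 700, in line with the comparison in the text.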

This model of curating language data in open formats that can be accessed using any software, and adopting the relevant standards for metadata (data providing information on other data), appealed to the Cherokee Language Revitalisation Program.

So, in June 2022 my colleague Marco La Rosa and I travelled to Tahlequah in Oklahoma for a week of training and developing a system for describing their existing collections. We trained them in the use of software to allow them to describe their collections using standard terms.

This method allows us to take individual items from our collection to build sub-collections that can be returned to the source communities.

For example, if we want to take all items we have digitised for the Solomon Islands National Museum, we can copy them to a hard disk. We have written a service that looks at just that set of files and creates a catalogue of the collection in HTML, the standard mark-up language for web browsing. This means that the recipient can browse the catalogue and follow its links to the relevant files and information.
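The idea of generating a browsable catalogue from a folder of files can be sketched in a few lines of Python. This is an illustration only, not the PARADISEC service: the function name `build_catalogue`, the one-folder-per-item layout and the `index.html` output name are all assumptions made for the example.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_catalogue(collection_dir: str, title: str = "Collection catalogue") -> Path:
    """Write an index.html that links to every file under collection_dir."""
    root = Path(collection_dir)

    # Group files by the folder that holds them.
    # ASSUMPTION: each item in the collection lives in its own folder.
    items: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        # Skip any earlier catalogue page so re-running does not list it.
        if path.is_file() and path.name != "index.html":
            rel = path.relative_to(root)
            items.setdefault(rel.parent.as_posix(), []).append(rel)

    lines = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>',
        "<body>",
        f"<h1>{escape(title)}</h1>",
    ]
    for item, files in items.items():
        heading = "(top level)" if item == "." else item
        lines.append(f"<h2>{escape(heading)}</h2>")
        lines.append("<ul>")
        for rel in files:
            # Percent-encode the link target; HTML-escape the visible name.
            href = quote(rel.as_posix())
            lines.append(f'<li><a href="{href}">{escape(rel.name)}</a></li>')
        lines.append("</ul>")
    lines += ["</body>", "</html>"]

    out = root / "index.html"
    out.write_text("\n".join(lines), encoding="utf-8")
    return out
```

Calling `build_catalogue("/path/to/disk/collection")` (a hypothetical path) would leave an `index.html` beside the files. Because the links are relative, the catalogue keeps working when the disk is plugged into another computer.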

We have also put the same set of items onto Raspberry Pi computers – low-cost computers developed by the charity the Raspberry Pi Foundation – which can broadcast a local wifi network, allowing the information to be viewed on mobile phones with no need for an internet connection.
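To give a sense of how little software this kind of offline access needs, here is a minimal sketch that serves a folder of files to any phone or laptop on the same local network, using only Python's standard library. It is not the software PARADISEC runs: the function name `make_server`, the port and the folder path are assumptions for illustration, and setting the Raspberry Pi up as a wifi access point is a separate step not shown here.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Create a web server that shares the files in `directory`."""
    # Serve files from the given folder rather than the current working directory.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # "0.0.0.0" listens on every network interface, including the local wifi.
    return ThreadingHTTPServer(("0.0.0.0", port), handler)


# Example use (hypothetical path; runs until interrupted):
#   server = make_server("/path/to/collection", port=8000)
#   server.serve_forever()
```

A phone joined to the device's wifi network could then open `http://<device-address>:8000/` in its browser, where the address is whatever the local network assigns to the device.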

PARADISEC, then, is a real-world service both for accessing cultural records and for citing research data created by linguists, musicologists, and anthropologists.

Having this repository and the expertise we have developed means that records aren’t lost at the end of each research funding cycle and that we can keep building on the existing materials for future exploration of human language, not just in our own Pacific region but, as the work with the Cherokee shows, across the world.
