Rebecca Woods on Large Language Models, Language and Meaning, and How Children Learn Languages

In this episode, we talk to Rebecca Woods, a Senior Lecturer in Language and Cognition at Newcastle University. We have an amazing chat about language learning in AI, and she tells us how language is crucial to how GPT functions. She's also an expert in how children learn languages, and she compares this to teaching AI how to process language.

Rebecca is a Senior Lecturer in Language and Cognition in the School of English Language, Literature and Linguistics at Newcastle University. Prior to joining Newcastle in September 2019, she was Senior Lecturer in Language Acquisition at Linguistics at Huddersfield, within the University of Huddersfield. She joined Huddersfield from the University of York, where she completed her MA in Psycholinguistics (2012) and PhD in Linguistics (2016). She gained her BA in French and Linguistics from the University of Sheffield (2010). She is primarily interested in questions, both main and embedded, and their syntax, semantics and acquisition. She is also working on clausal embedding and discourse particles, while past projects have included bilingual first language acquisition, dative constructions, clitic doubling, sentential adverbs and possessives.

READING LIST:

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922

Bowerman, Melissa. (1974). Learning the structure of causative verbs: A study in the relationship of cognitive, semantic, and syntactic development. Papers and reports on child language development, 8, 142-178. URL: https://pure.mpg.de/rest/items/item_1451201/component/file_1451200/content

Christine Cuskley, Rebecca Woods and Molly Flaherty. Under revision. Language is more than text: The limitations of large language models for understanding human language and cognition. Submitted to Open Mind. Pre-print URL: https://blogs.ncl.ac.uk/rebeccawoods/files/2023/09/LanguageIsMoreThanText_V3.docx.pdf

Ingason, Anton Karl. 2016. Realizing morphemes in the Icelandic Noun Phrase. Doctoral dissertation, University of Pennsylvania. URL: http://repository.upenn.edu/edissertations/1776

Watt, Dominic and William Allen (2003). Tyneside English. Journal of the International Phonetic Association, Volume 33, Issue 2, December 2003, pp. 267–271. DOI: https://doi.org/10.1017/S0025100303001397

Watt, Dominic and Jennifer Tillotson (2001). A spectrographic analysis of vowel fronting in Bradford English. English World-Wide 22:2 (2001), 269–302. DOI: https://doi.org/10.1075/eww.22.2.05wat

Yang, Yu’an. 2022. Are you asking me or telling me? Learning clause types and speech acts in English and Mandarin. Doctoral dissertation, University of Maryland, College Park. URL: https://lingbuzz.net/lingbuzz/006730

Yong, Zheng-Xin, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, Alham Fikri Aji (2023). Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages. Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 43–63. URL: https://aclanthology.org/2023.calcs-1.5.pdf

Becky's Linguistic Reads Recommendations!

Adger, David. 2019. Language Unlimited. Oxford: Oxford University Press

McCulloch, Gretchen. 2019. Because Internet. London: Penguin

Sutton-Spence, Rachel and Bencie Woll. 1999. The Linguistics of British Sign Language: An introduction. Cambridge: Cambridge University Press

Yang, Charles. 2016. The price of linguistic productivity: how children learn to break the rules of language. Cambridge, MA: MIT Press

Becky's blog 😉: https://blogs.ncl.ac.uk/rebeccawoods/thoughts-on-language/

IPA links

A really good IPA keyboard: https://ipa.typeit.org/full/

"Seeing speech" - IPA charts with ultrasound/MRI videos for each sound: https://www.seeingspeech.ac.uk/ipa-charts/

TRANSCRIPT:

KERRY MCINERNEY:

Hi! I’m Dr Kerry McInerney. Dr Eleanor Drage and I are the hosts of The Good Robot podcast. Join us as we ask the experts: what is good technology? Is it even possible? And how can feminism help us work towards it? If you want to learn more about today's topic, head over to our website, www.thegoodrobot.co.uk, where we've got a full transcript of the episode and a specially curated reading list by every guest. We love hearing from listeners, so feel free to tweet or email us, and we’d also so appreciate you leaving us a review on the podcast app. But until then, sit back, relax, and enjoy the episode!

ELEANOR DRAGE:

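In this episode, we talk to Rebecca Woods, a Senior Lecturer in Language and Cognition at Newcastle University. We have an amazing chat about language learning in AI, and she tells us how language is crucial to how GPT functions. She's also an expert in how children learn languages, and she compares this to teaching AI how to process language. It's a super fun episode, and we hope you enjoy the show.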

KERRY MCINERNEY:

Thank you so much for coming on the show. I've really been looking forward to this one because you study something that is of utmost relevance to all of us now. So could you introduce yourself, what you do and tell us a little bit about what's brought you to studying language and technology?

REBECCA WOODS:

Yeah, no, I've been so excited to join you as well. So thank you for having me. My name is Rebecca Woods. I'm a senior lecturer in language and cognition at Newcastle University, and one of my main research interests is actually how children acquire language. And it's through that, as well as having some excellent colleagues here that I've come to think a little bit more about how language and technology interact.

I think a lot of our students, when they join us here, think that language and technology means mostly things like how technology can influence new words, like additions into our lexicon, like, of course, the ever present Google. One fun example I like to pull out that intersects with my research is how the word 'like' in Icelandic has actually completely changed its morphological pattern, just because of Facebook. So we can have a chat about that.

KERRY MCINERNEY:

Wait, what does that mean? Sorry...

I'm like very un languagey, like I just about managed to communicate in like very normal tones. So what do you mean it's changed its morphological pattern?

REBECCA WOODS:

Right, so, the verb 'like', líkar, in Icelandic was typically a verb, well, it's a bit like the English 'like', right, where it expresses not really an action. When you like something, you're not necessarily saying that you are performing some action in order to like it, but you're saying you are expressing some kind of feeling.

And so, in Icelandic, rather than using the equivalent of English 'I', you'd actually use something like the equivalent of English 'to me' as a subject. So, 'to me likes', and then whatever the thing you like is, because it's really expressing this relationship of sort of feeling and experiencing rather than doing.

But then along comes Facebook and I think the Icelandic translation of Facebook comes along in the late noughties. And they need some way of labeling their Icelandic like button, right? And that is an action, right? And they choose the verb líkar. And it means that Icelandic goes from this language that can't possibly say I like something to a language which now can say that I like something and not just to me something is liked.

Because you express the action of liking by clicking that button labeled líkar. So technology in that sense has changed up the shape of the grammar of Icelandic, in a small way, but in a way that's still pretty cool.

KERRY MCINERNEY:

Oh my gosh, that's nuts. Sorry. It's amazing. Cause I feel like I was thinking more 'LOL' or whatever, like just the kinds of internet parlance that, like, seep their way into everyday life.

But I also very rudely interrupted you. So you were saying that's how we often think about tech and language, but how are you also thinking about it beyond sort of changing language forms?

REBECCA WOODS:

So, a big debate or a big topic of interest in child language acquisition is, like, what tools are children actually using in order to take all of the language that they're hearing or seeing, and break it down, make sense of it in units such that they can build it back up again into whatever it is they want to say or sign.

There's a question and this was really sort of introduced by Chomsky in the 50s and 60s about whether or not there's some part of our cognition that is specialized for language and language alone. It's not just part of our general learning capacities. It's quite a, it's a vigorous debate.

It can get occasionally quite nasty. And some researchers who would fall very heavily on the side of no, there is no sort of language specific capacity in the brain, have been using large language models to try and make this case, looking at how large language models clearly are able to produce language, and we'll stick at producing for the minute, without any kind of language specific component, they claim, because it principally runs off statistics, right? And my colleague, Chrissy Cuskley, here in Newcastle, another colleague at Davidson College in the States, Molly Flaherty, and I took issue with this. So we wrote a paper to try and think a little bit more about how LLMs learn, and how they learn differently and/or similarly from humans, just trying to really break down some of the massive hype around LLMs at the moment.

I mean, they're seeping into our working lives in just about every way imaginable, writing of essays and even writing of assignments in some cases. But certainly in, in terms of that, they're getting on the territory of my research here, and so I've got to understand better what's going on with them.

ELEANOR DRAGE:

Well, thank God we have experts like you to do that for us! So, can you help us answer our three good robot questions? What is good technology? Is it even possible? And how can feminism help us work towards it?

REBECCA WOODS:

I mean, I'm sure everybody starts by saying these are three massive questions. And I guess I started to plug into them a little bit by thinking about an example again about how technology and language really entrench some aspects of a society that are not positive. So I guess it's a negative way of going around thinking what good technology is in that for me, good technology would be technology, which, you know, enhances our lives, but it doesn't continually remind and reinforce in the user how they might be different from what, whatever arbitrary standard society has set up for us, you know, almost like technological microaggressions.

So an example, um, that we use quite a lot, I'm based in the Northeast, right? We have a very, very distinctive local accent and dialect, local variety. And for a long time, when automated banking systems and telephone systems started to be used, people in the Northeast were locked out and it's all because of one of the vowels in the variety, right?

So if you ask someone a question up here and they want to answer in the negative, they won't say no [nəʊ], they'll say, they'll say something more like no [no:], okay? And those vowels, if I'm understanding this right anyway, especially in these early systems were really, really crucial for speech recognition. So to determine between a yes and a no, all that the computers were really working over were the parts of the sound wave that pertain to the vowels. So yes is easy, and that's the same in Tyneside English, Geordie, as it is in sort of standard either British or American English. But Geordie no is very, very different.

A standard British English no [nəʊ] is made up of two vowels [ə & ʊ] and there's a movement between them and that’s quite a nice thing to track in the sound wave for the automated system, but a Geordie no [no:] is just one vowel [o:] and it's not either of the vowels that you typically find in a British English no and so continually you're trying to just go about your day, you're being told that this is easier for you to use, then you're locked out just because of the way you speak, and that certainly for me, would be a barrier against something being judged a good technology.
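As a rough illustration of the point about tracking the movement between two vowels, here is a toy Python sketch. It is not how any real banking or telephone system worked: the formant values, the threshold and the function names are all made-up assumptions for illustration only. It simply shows why a detector that listens for a gliding vowel would accept a two-vowel no [nəʊ] and miss a one-vowel no [no:].

```python
import math

# Hypothetical (F1, F2) formant values in Hz at the start and end of the vowel.
# These numbers are rough illustrations, not measurements.
VOWELS = {
    "standard no [nəʊ]": ((500.0, 1500.0), (400.0, 1000.0)),
    "Geordie no [no:]": ((450.0, 850.0), (450.0, 850.0)),
}

# Assumed cut-off for deciding that "the vowel moved".
MOVEMENT_THRESHOLD_HZ = 200.0


def vowel_movement(start, end):
    """Straight-line distance between the start and end formant points."""
    return math.dist(start, end)


def sounds_like_no(start, end):
    """Toy rule: only a vowel that glides from one position to another counts as 'no'."""
    return vowel_movement(start, end) >= MOVEMENT_THRESHOLD_HZ


for label, (start, end) in VOWELS.items():
    verdict = "recognised" if sounds_like_no(start, end) else "not recognised"
    print(label, "->", verdict)
```

Under these assumed numbers the gliding vowel passes the rule and the steady vowel does not, which is the lock-out described above in miniature.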

ELEANOR DRAGE:

Can I ask really quickly, what are the two vowels?

REBECCA WOODS:

So the two vowels in a standard English ‘no' are the /ə/ at the start moving towards the /ʊ/ at the end.

So if you're watching the video, you'll have fun watching what my face is doing, but also my tongue's doing something slightly different as well. So we move from a relatively open vowel that's sort of central that /ə/, which is so central in the mouth. It's called a schwa. It's got a special name.

ELEANOR DRAGE:

Oh, that's a schwa.

REBECCA WOODS:

That's a schwa. Yeah. The /ə/ sound. And then, oh, now I'm not actually a phonologist - I need to make sure I get this right. The /ʊ/ sound, we end up with the tongue is high at the back, but my lips rounded as well. So my tongue has moved backwards. My lips have moved forwards. And those are the two vowels. So yeah, it's a very open position, a relaxed tongue to a high back at the tongue and lips at the front. Those are the two positions that you move between in producing the sound “o”.

ELEANOR DRAGE:

Okay, so like the Southern 'o' has two vowels.

And why, why is there that, this differentiation?

I've always been interested in how the Tower of Babel came to be. Do you have any idea why it's different?

REBECCA WOODS:

For those, for those sorts of questions, I'd really have to ask my colleagues who work more on regional variation and also who work on historical change.

But it's true that there are other varieties in the UK that have what we would call a monophthongal vowel there, which is to say one vowel as opposed to a diphthong, which is two. So I spent quite a lot of time in Yorkshire, that's where I did all of my studies. And when I was living in Leeds, it's more like a no [nɜ:].

So again, you just got that [ɜ:] - one single tongue position, lip position, and you push through and that's your [nɜ:]. So it's possible in the British context, it has to do with like contact, very early contact with other languages, so particularly Old Norse. But that is a guess, and you would definitely have to check with a historical linguist for that one.

KERRY MCINERNEY:

I mean, that's fascinating. And it's interesting because I feel like, from my non-linguistics, non-pronunciation perspective, like, I'm from New Zealand and I grew up there and I feel like diphthongs are something I really associate with like New Zealand English, like floor [flʊɐ][1], as opposed to like floor [flɔ:], however, like British, like floor, like I feel like that pattern.

So it's really interesting to hear that distinction in as fundamental a word as no in this context.

REBECCA WOODS:

Yeah, yeah, absolutely. And, there's all sorts of other examples of that. So, for example, people in parts of the North of England will, like, really drastically reduce certain words.

So, like, a really stereotypical one for Yorkshire English is the word 'the'. So, you might say, “Oh, something's on t’telly”, as opposed to “on the telly”. And that, I mean, that also is a really fundamental function word in English, right? But, uh, it means that any time somebody from Yorkshire is using speech to text recognition or Alexa, or any digital sort of assistant, they're highly likely to have to modify in quite fundamental ways the way they're speaking in order to be understood. Digital assistants were something else I was thinking of when I was thinking about a good technology, you know, a technology which has these applications in a context of disability, and, you know, creating access and inclusion. But again, if you're having to modify the way that you speak in these fundamental ways, it's inclusion at a price, it's inclusion but reminding you of just another way in which you might be excluded in some way.

Um, so yeah, these technologies are super, super complex.

KERRY MCINERNEY:

Yeah, I completely agree. It's one of the things that Eleanor and I see a lot with AI applications, that they have this promise of scale, this idea that they can be widely applied. And, you know, there's a lot of efficiencies that come with scale, but there's also huge risks, huge capacities for harms and exclusions.

Everything from someone might not be able to use this technology, through to actually now everyone is changing their vocal patterns to fit a particular predetermined cultural ideal or type. And that is itself like a really sad loss. And especially when it's something as fundamental as the word no, or as fundamental as accessing your finances.

And on a side note, you know, my current bank fail is my husband's bank thought that he was being scammed by me, his gold digger partner to the extent where they asked him if he had ever met his wife or his so called wife. And I can confirm he has met me. And he's a slam poet, which is not exactly the kind of career you go for if you're really trying to get a lot of money from someone, but I digress.

I mean, obviously you've been thinking really in depth about language and technology in lots of different ways. I want to bring you to something that you were talking about at the beginning, which was large language models. Most of you listening probably have heard of what a large language model is or have used one before, seen them in the news.

So ChatGPT is the most famous example. There's a lot of them. You also mentioned there's a lot of hype around these models. So can you do some myth busting for us to start us off? Could you tell us as a linguist, what would you want us to know about large language models?

REBECCA WOODS:

Yeah, for sure. I guess the main one is about whether what they're producing is actually meaning.

Are they, are they creating meaning like humans? I mean, to the extent that I saw on, it might be the, the social media site that can no longer be named, X, the other day, somebody had discussed their personal problems with ChatGPT and, and said, “Oh, well, people have talked to me about having therapy and, and surely this is it.”

There's a lot to unpack there. A really well known linguist who is incredibly experienced in AI, Emily Bender, and her colleagues wrote the very famous paper about Stochastic Parrots, right? And I don't think actually it can be stressed enough the extent to which LLMs, large language models, really are parrots in this way, that as far as we can tell, and this becomes incredibly obvious as soon as you move out of English, they are not making meanings in the way that humans are making meaning, right?

Meaning is not restricted to semantic meaning on a lexical level. It's not just about getting a word with the right meaning and then stringing it up with another one that tends to follow it, right? Meaning is created, we'd call it compositionally, by bringing together various units of language.

In creating a unit of language, you might move actually quite a distance from what an individual part of that unit would mean, and you want, you need to look across these bigger units of language as to how they combine and then how they're interpreted by humans and that just does not happen in the way that LLMs work, right?

They have had input into them all sorts of information about the semantic valency of words, so whether they are positively valent or they're negative, dictionary definitions, and of course, an awful lot of human input. And I think that's something that also gets forgotten a lot, how many humans are actually working behind these things as well and just not being recognized for their labor and goodness only knows how well they're being paid for it. So it has a lot of information, um, but this still doesn't operate over the kind of the size of units where as humans we just naturally interpret language.

Hopefully that makes sense. Meaning really has to sort of project past the level of the word. And in terms of the kind of information that LLMs have access to reliably, it doesn't often get an awful lot past that.

ELEANOR DRAGE:

How much has that got to do with context?

REBECCA WOODS:

In what sense?

ELEANOR DRAGE:

In the sense that... understanding something. You know, when we create a sentence when we're trying to communicate with someone, it's very context dependent. So it also says a lot about where we are and what our context is. But what happens to context with GPT because it doesn't have its own, I mean, I guess you can argue it has a source of computational context. But yeah, how does GPT learn, what could it tell us about the way that we learn, and the importance of context?

REBECCA WOODS:

Yeah, this is a hugely important question and it's one that, that Chrissy and Molly and I really deal with in our paper: that children do not learn from strings of language alone, in the way, certainly not in the way that large language models do.

In terms of how context affects child language acquisition, we know very early on things like learning to follow a point with your eyes is key in children starting to learn more about how words like 'I' and 'you' can shift meaning within an utterance. And so, of course, I am always 'I' to me. But then somebody else is 'I' to them and never to me.

I may be 'you' to the person I'm talking to, but then they are 'you' to me. So pronouns and how pronouns shift are really, really fascinating. They're an area where children show lots of different stages of development and we can even, we think, pinpoint sometimes in individual children how things like a family holiday where you talk with more people and more people at once can actually push on their acquisition of 'you' and how to understand what 'you' means just because they have heard it in context.

That's also one reason, right, why children don't require, you know, 3.1 trillion words worth of input in order to acquire language, right, because the context does so much work for them. There's this super interesting dissertation that came out of the University of Maryland by a friend of mine called Yu'an Yang, and she was interested in how children learn to make assertions, ask questions, and make commands, and also recognize how each of those is structured differently.

And when she created a small computational model and when she fed it with just strings of words, so sentences that children have heard, real child directed sentences, the models were not very accurate at all. When she added in pragmatic information about how these strings were being used, the model’s accuracy improved dramatically.

And this, this just goes to show how crucial a role context plays in child language acquisition and large language models don't have access to any of that. So you can give them a string like, let's think of a good one, oh, what's the name of Taylor Swift's new, uh, squeeze? Um, what's his name?

KERRY MCINERNEY:

Oh, the footballer, uh, Travis... I had to pronounce his last name. Is it Kelce? But it begins, like, I know what it looks like, I just don't know how to pronounce it.

REBECCA WOODS:

Right, okay, we'll go with Kelce, why not? "Didn't Travis Kelce play well at the weekend?" Right, and depending on the intonation I give to that utterance and context, so what I already know, what I know about what you know, that can actually mean some quite different things.

Like, I could be being patronizing, I could be being astonished, I could really be asking for your opinion, right, based on all of these factors. And again, none of this is information that LLMs have access to because they operate over text and text is such a specific mode of language in use.

I'm not entirely sure I've answered your question.

ELEANOR DRAGE:

It does answer the question of why there's so many miscommunications over WhatsApp. I'm dating at the moment. And I think, I like, obviously have no idea how to message, but my messages are always being communicated wrong, interpreted not in the way I intended, and like, I don't understand like- jokes and sarcasm are totally lost on me, and I think this means that, you know, the success of my romantic life hinges on this thing that I have no control over.

KERRY MCINERNEY:

I also felt like during COVID as well, when it was so much more text-based communication, I feel like I saw so many different relationships break down in different ways.

There were just so many more capacities for miscommunication. I do think it's something to do with the specific modality of being text only was making life pretty hard for people.

REBECCA WOODS:

Yeah, I think it also means that conventions are like sort of developing and arising all the time and they're not arising the same way in different communities, especially when it comes to texting, right?

Because we've got generations who grew up entirely without mobile phones, those who grew up with much less complex mobile phones, and now we've got instant messaging in a very accessible way, and the whole suite of emojis and all the wonderful things you can do with those.

Yeah, there's a couple of people have talked about punctuation, particularly how punctuation is used in WhatsApp messages just recently, like... I wouldn't necessarily be averse to sending a WhatsApp message with a full stop at the end if I finish, but apparently if my students did that, my 18 year old first years, um, then they would be signaling, “No, I am annoyed at you”.

Right. So all these sorts of conventions that are just arising within groups, it's almost like lexical variation. You know, sort of words that different generations use, but it's on a more subtle level than that. It's on the kind of syntactic, structural sort of level that these things are being interpreted.

Yeah, so it makes it really hard for those of us who've grown up somewhere in the middle.

ELEANOR DRAGE:

I had a Californian boyfriend at uni. He didn't use punctuation at all in his messages. And so I kind of started doing that because I was lame. And, um, now I'll send messages to people and they'll be like, “no punctuation!”

You know, like, “are you going out later” But like, I don't want, I'm not too interested, there's not a, there's not a question mark. You know, um, it's terrible. It's, yeah, it's a disease. Anyway, I'm sorry.

KERRY MCINERNEY:

It reminds me of like that line in that Lorde song where she's like, “I overthink your punctuation use”.

And I feel like that's the like integral part now of like relationships work. I actually wanted to ask you a little bit as well about large language models, because you mentioned Stochastic Parrots, which is a fantastic paper, and I will be linking that in the full transcript of this episode, which is on our website, www.thegoodrobot.co.uk. We'll also have some readings, some books, ones produced by Becky, but also her recommendations, so we'll get you to send those through after. But in Stochastic Parrots, one of the main issues that the authors flag is not only the way that these models, they argue, don't really produce meaning, but also that there's a lot of harmful and discriminatory effects that come from these models, and they might harm certain communities more than others.

And so I wanted to ask you around, um, large language models, I guess, sort of, who's included in the possibilities that they offer, and who is excluded?

REBECCA WOODS:

That's a fab question, I think I've already talked a little bit about how large language models are necessarily based off one mode of language, they're just based off text, right? And thousands of languages that are used in the world do not have writing systems. So to some extent those people are automatically excluded, and sign language users as well. Sign languages typically don't have written forms and they are obviously, their structure is completely different because when you're using the visual-manual modality that sign languages use, you can actually express multiple units of language at the same time, right, because your hands might be doing one thing, your face might be doing, or different parts of your face might be doing another one or two things, to create a whole complex of meaning in one space. It's like a quantum language, certainly relative to the spoken modality where necessarily physically you can only produce one sound at a time. Largely, that's what happens. I think you can actually co-articulate sounds, especially in Geordie. But, you know, your meaningful units of language tend to actually ultimately then emerge in a linear fashion as opposed to at the same time. So the users of languages that have no writing system and particularly sign language users are excluded.

I mean, you might want to ask whether or not this is actually a problem for them in some ways because I was actually, in preparation for this chat, reading over a little bit of your interview with Su Lin Blodgett. She pointed out a really important point, that is, inclusion doesn't necessarily mean actually getting everybody in sort of an accessible space online because, you know, those languages could be used for surveillance, for example, so it's not necessarily a positive thing to be included because these technologies, I guess, in themselves are not necessarily always being used in positive ways.

And I think that's true for signed language users to a really great extent because there's lots of technological sort of fiddling in sign languages, like, “Oh, I know, let's create a sign language glove that can help interpret sign languages” and sign language users sort of uniformly seem to be going, “No, these things are really, really not helpful”.

What we need is frankly much, much more basic, sometimes not even technological interventions, just like greater access to interpreters, in the spaces in which they're moving, um, you know, inclusion in legal bills, for example. So, uh, British Sign Language now, well, is, is still just on its way to becoming a legally recognized language of the UK, right?

And that really in no small part is due to Strictly Come Dancing from, uh, when Rose Ayling-Ellis was involved in it. Um, yeah, BSL has been recognized in a greater capacity in Scotland for longer, but not in England and Wales. Right, so, these users of these languages are excluded, but at the same time, actually, when thinking about exclusion, we want to think whether or not they would want to be included, and whether or not they would be included on the terms and in the terms that they need.

ELEANOR DRAGE:

It's that constant debate about whether we should be leaning in or out, whether we should be resisting these technologies or, or whether we want some kind of reform. And I think it's always a really interesting point of contention because, you know, we've been working with organizations that are trying to help more languages be represented in AI, but there are all these sorts of consequences to being part of the system. So it's kind of interesting to think with that tension and the temporalities that they register, whether it's like a long-term improvement or just a short-term intervention. I'm really interested in bilingualism, um, because I've got a lot of bilingual friends and a few of them just can't finish a sentence in one language. They always kind of go for the easiest word or the word that comes to them first. And there's been some interesting work around large language models and bilingualism trying to improve the way that they process bilingual speech.

But are there any other issues around bilingualism? How can LLMs improve their understanding of the way that multilingual and bilingual people communicate?

REBECCA WOODS:

Yeah, I'm so glad you're asking me about this. I'm just, I love thinking about the multilingual context, partly because one of the things that I like to look at in my research, as well as code switching, so the mixing of multiple languages, exactly like you just mentioned, how somebody who has multiple languages will rarely, given the right situation, confine themselves to just one because there's certain things that you feel you can express better in a different language that you can express with fewer words in certain languages or simply it's just more fun and it represents who you are better.

So yeah, it would seem like LLMs, on the whole, are not so good with multilingual data sets, which again is kind of not surprising - there's so few languages represented and as soon as you introduce multilingual texts it's going to mess up totally your probability calculations in terms of how words follow on from each other. So yeah, a couple of recent papers have suggested that in terms of LLMs being able to provide translations of code switching they just do a really, really poor job of it. And in terms of producing code switched texts, a really interesting paper by Zheng Xin Yong and her colleagues noted that actually it really depended on the combination of the languages as well, which probably has an awful lot to do with how well they're represented in text and especially in internet text.

So, for example, when they asked ChatGPT to produce a short passage in Singlish, which is a variety of English spoken in Singapore that has a lot of influence from Mandarin Chinese and Malay in particular, it did an okay job with Singlish, but when they asked ChatGPT to produce a text that was Tamil and English code switching, it was full of grammatical errors. It was semantically not meaningful because again, right, semantics is not just about the individual words, but how they actually combine. It even started mixing scripts, which suggested it didn't really fully understand what sorts of text it was dealing with. And then in some other examples, it started introducing the wrong languages.

Like it, you know, language tagging is not a straightforward process, especially in highly multilingual situations. So an example that Yong and her colleagues gave was when they asked ChatGPT to give a Chinese-English mixed text as a Malaysian speaker. And as soon as ChatGPT saw Malaysian, it went, “Oh, well, let's throw a whole load of words from Malay in there.”

Then it got lost along the way and was naming half, tagging half the Malay words as Chinese. And again, we're getting very close to the areas of stereotypes and other sorts of potentially offensive uses of language because it's wading into a political and social context that it has no way of dealing with.

So I thought that was really interesting. And, and as you say, Eleanor, like, these are ways in which people speak all the time. That said, it's not a way in which people necessarily write all that much. And so large language models are just not being fed with this kind of language at all.

KERRY MCINERNEY:

That's so fascinating. And I want to ask you a little bit more about that just because for like undesirable, unnecessary context about myself, I grew up around a lot of different kinds of pidgin Englishes. So forms of English that are spoken in different parts of the world that like bear a lot of their words and some of those similarities from English as a root language, but they sound and they operate very differently in a lot of ways. So my dad is fluent in Bislama, which is a pidgin English spoken in Vanuatu.

I spent a couple of years growing up in Sierra Leone, which speaks a pidgin English called Krio. And my own family who are from Fiji, they, when they spoke English, I would say they spoke it with a different kind of intonation, different grammatical structures, though I wouldn't say that it counted as like pidgin English itself.

Um, but you know, I think there was definitely a difference in how they spoke English compared to say, how I would speak English, like at school. And I know there's like so many other kinds of pidgin English than the ones that I've named. And so I guess I want to ask you like two very difficult questions just to hear your thoughts on that.

I guess the first was thinking about like, you know, what would it mean for large language models to actually be able to like meaningfully grapple with not only writing say in English, but also understanding like the complexities that go beyond not necessarily even like the specific context of where a phrase is being used, but also the context of how the language itself has traveled.

But then also, I guess the big zoom out question that I really wanted to ask you was, when you're thinking about these issues to do with language and technology, like, how do they help you think about the broader relationship between history and politics and languages themselves, and their often very deeply complex and colonial histories?

REBECCA WOODS:

Absolutely. Um, so I mentioned, and Eleanor mentioned in her sort of intro to the question as well that identity is so, so key in code-mixing and it leads to meaning being created on these really sort of micro-levels within specific communities, right.

Because even if you take two of the languages that are in constant contact today on a very wide scale like Spanish and English, you still will find very, very different patterns of code mixing, especially in a lexical sense between Mexican Spanish and U.S. English speakers on the one hand, and then British English speakers and say Castilian Spanish on the other.

There's a real point to be made here about how language and language names themselves, language designations are really, really slippery things. We don't really have a definition for what a language is as opposed to a dialect, as opposed to an idiolect even.

It's very fluid, the definition, the distinction between a language and a dialect. Without being embodied in some way and in use in a, in a community, there's, there's always going to be exclusion in the way that LLMs use language in general, but certainly in multilingual ways.

It's definitely the case that thinking about the intersection of language and technology has really made me stop and think a lot harder about politics, social issues and history and how they impact on language because I guess we're in a situation here where, demonstrably given some of the controversies in, in the tech industry, there is a ruling dominant group that is determining exactly what goes into these things and is gatekeeping them.

It's an analogy, I guess, for what has been happening for years and years and years in all of these different spheres, and the difference is, I guess, that we can see it happening and it's very, very hard to stop. It's very, very hard to navigate access into those spaces and, and to, to cause change, especially because to cause change in a meaningful way, what you're looking at is slowing down, spending more money, asking and involving more people and getting views that are not necessarily going to align with your own and with your aims. And there's clearly a real unwillingness to do that, to come back to the point about inclusion, and how inclusion doesn't just mean putting somebody in a space - that's so clear in the case of LLMs, um, because you can feed one of these models with all of the language data that you may have, and of course, so many languages, like I said, don't have the kind of data you can put in, but then what they put out is still not necessarily going to do that community any good.

ELEANOR DRAGE:

I think if I'd known this I would have done English language A-level and not just English literature because I didn't realize there's a politics to language until I studied French at uni. And I, I arrived in Paris and I was staying in the 13th, and a lot of the people in my building were French Moroccan, and they used a different language called Verlan, which is like an inverse language, I guess.

And you know, I was like, this is so much more fun than just pure Académie Française French. Académie Française, right, is this literal building on the banks of the Seine, and it's very beautiful and very grand, and they decide which are the words that will be official French and which aren't, and kind of wonderfully, I think it's getting less and less important every year, hopefully, because French is remarkably diverse and is so beautiful in its diversity, and I've always loved poets and thinkers like Edouard Glissant, who play so much with the language to expose the shame and the sadness of colonization and the traveling of French across the waters.

And then when I was doing my PhD, I discovered this document from 2007, which is super recent, and it showed how the French government was still trying to push the French language on Vietnamese elites, French language and culture. And I suddenly thought, Oh God, why didn't I do English language A-level, you know, like it would have been so interesting.

And I'm such a pedant anyway, that I think it really would have suited me. I had a debate with our Centre's Director Stephen over first versus firstly, and like, I love him lots. He's a great, he's a great boss, but I still think he's wrong. So tell us, go on, advertise English language and linguistics. I hope we've convinced listeners already with this podcast, but if we haven't, why should they go into it?

REBECCA WOODS:

Just to give like a very brief definition, we'd call it basically the scientific study of language, in all its parts and understanding why it is, to the best of our knowledge, it still is a uniquely human behavior.

Because of that, you're able to study human nature in the most direct way possible, right? Everything else is mediated through language, which means if you don't understand language deeply, and where it comes from, how the mind takes these units and puts them together to produce meaning, then to some extent - I feel I'm going to go out and say it - you'll never really understand people, and what they're doing, because again, so much of what they do is mediated through what they're saying, and being able to hear that in the finest detail, to be able to pick up on the intonation, to be able to pick up on the subtext, on some of the subtle choices of grammar, that's crucial to being able to really understand them.

And it's... I mean, that part of it is about being pedantic in terms of getting down to that level. But getting down to that level itself is really, really important because that's where the, that's where all those differences lie. Yeah, getting down into the really small details in language, just helps you understand how they all come together to actually, you know, express some really, really big, important, different ideas about who we are as people.

Um, and, you know, to be fair, like, I think French and the other languages that you've studied, Eleanor, as well, like, those things complement linguistics so much. So linguistics, it's also good if you, if you're really, really bad at making a choice, right? Because if you choose to come and study linguistics, you can do everything else you might possibly want to do within that lens.

So you can study, um, it's not just about studying English language, you know, English is just one language that could be an object of study. You could study French, you could study German, you can study Bislama, any, any of these languages through the lens of linguistics, but you can also look at the biology of language, in terms of the vocal tract or, you know, the manual side of sign language, you can study psychology, sociology, computer science, um, education, literally every other field of study through the lens of linguistics.

So, you know, if you can't make up your mind about what you want to do, come and study linguistics and we'll help you, we'll help you get there.

KERRY MCINERNEY:

I mean, I feel like this whole episode is just the most wonderful advertisement for linguistics as a field of study. And I say this as someone who's like the complete opposite of a grammar nerd like Eleanor, who very kindly proofreads all our work.

But before you go, so you've given like the really good deep answers why you should do linguistics. What's everyone's favorite word or word fact, to give a very surface level reason of why you should do linguistics. So, Becky, you're up first.

REBECCA WOODS:

Um, so something that, uh, children do when they are learning their first language is they've got to try and work out what we call predicates, so words essentially that relate things in the world - exactly how they work and what the structure of them is. And so it's really, really common for kids learning their first language to you know make a guess, get it wrong and produce these really lovely new utterances.

So just this morning my two and a half year old said “My toy happies me”, because he clearly is still working out what's a verb, what's an adjective, they're both types of predicates, but what's the structure in which they can occur? So he's gone and done something called transitivization where he's decided that this thing is capable of happying him, of making him happy.

It's something that verbs can do in English and it just made me really smile. So I share that with you.

KERRY MCINERNEY:

It happied you.

REBECCA WOODS:

Exactly. It happied me.

ELEANOR DRAGE:

Oh, that's so lovely. I hope he says that forever.

Um, okay. I didn't have time to think about this before. But a couple of things, one, so, I had a French boyfriend who wrote me the best ever French oral at uni for my like final exam, and it was on words that the French won't use because they think they came from English, but actually they come from the old French.

And, uh, I think the only one I remember is competition. So they'll say, the French will try not to say compétition, they tried to say concours instead. But compétition actually does come from the old French, well it's Latin. But I think the earliest usage was in French to mean rivalry.

Um, and I love that. This is why like being snooty about language is just so silly.

KERRY MCINERNEY:

That's super interesting. My very un-fun one, but one that really links very well to like what we're talking about, like bilingualism and multilingualism in terms of like how you combine languages, is in Fijian, goodbye is moce. It's M O C E, I think, because C is “th” /θ/, and then stupid is doce. And obviously I don't speak any real languages like fully fluently apart from English, but I know stupid and many different words in Fijian because that is my family summed up. So our family goodbye is always “Moce moce don't be doce!”.

And so I don't think ChatGPT would be able to produce that level of, you know, linguistic flair. So there we go. Most importantly, thank you so much, Becky, for coming on. This has been incredibly informative. You have really sold big linguistics to the world, but more importantly, I hope given lots of people things to think about when it comes to language and technology and sort of all the promises and risks and provocations that come with that.

REBECCA WOODS:

Well, thank you so much for having me. It's been a ton of fun.

ELEANOR DRAGE:

This episode was made possible thanks to our previous funder, Christina Gaw, and our current funder Mercator Stiftung, a private and independent foundation promoting science, education and international understanding. It was written and produced by Dr Eleanor Drage and Dr Kerry Mackereth, and edited by Laura Samulionyte.

[1] Thanks to Rory Turnbull (Newcastle University) for help with the transcription here.
