Subregular Languages

Jacob Johnson

Suppose you are given the following words and told that they all come from an unknown language:

shashekush  sasokos  shakushush  sakesas
sheshukush  sesukos  shekeshush  sekoses
shishokesh  sisekos  shikishosh  sikisis
shoshokesh  sosakos  shokashesh  sokusas
shushakish  susukos  shukoshosh  sukasus
shashokush  sasakus  shakushish  sakisos
sheshokish  sesukis  shekeshosh  sekeses
shishukosh  sisokos  shikeshash  sikasos
shoshokush  sosikas  shokeshish  sokosis
shushakesh  susokis  shukoshesh  sukesas

Now suppose you are asked to determine, for each of the below pairs, which word in the pair also came from this same language:

shoshukosh  shosukosh
sukesus  sukeshus
sisakus  sishakus
shakashosh  shakasosh
shokushosh  shokusosh
susekus  sushekus
sikasus  sikashus
shashakosh  shasakosh
soshakas  sosakas
shusekish  shushekish
sakushes  sakuses
shekosash  shekoshash
sashukes  sasukes
shesokash  sheshokash
sokashas  sokasas
shukesish  shukeshish

Take a minute to look through this list for yourself. Pronounce the words out loud to get a better feel for them and consider writing down your answers to check back later. This may seem like a strange task, but it was a real one for participants in studies by Lai (2015) and Avcu & Hestvik (2020). And if you think about it, you have been doing the same thing your whole life as you have learned languages: for English, when you hear the words "gnocchi", "tzatziki" and "sriracha", you can probably tell that they originated from another language, based on the way they sound. Conversely, when you hear the words "blick", "nord", or "saf" you might get the impression that they could have been natural-sounding English words — maybe a name, or an item — even though they don't have any English meaning to you. However, you likely don't get that same impression about "bnick", "ngord", or "sapf".

Why is this the case? And what were researchers Lai and Avcu & Hestvik investigating with their studies?

In this article, we will investigate the answers to these questions, look at foundational material in phonotactics and computability theory, and gain insight along the way into the connection between language and computation. By the end, you will understand what these researchers predicted about your answers to the above questions, and why they predicted that.

Phonotactics

Phonotactics describes the patterns of sounds (called phonemes when referring to types of sounds, or segments when referring to individual instances of a phoneme) that are used by a language. We've already seen examples of this: a phonologist (a linguist who studies speech sounds) would describe our observation that "bnick" doesn't sound English by saying that "bnick" violates English phonotactics. Phonologists refer to this property of a string of sounds, the property that a speaker of the language in question would say it sounds strange, as ungrammaticality or ill-formedness.

Phonotactics is an additional dimension of a language's phonology (its complete set of sound patterns) beyond its phonemic inventory (the set of types of sounds, or phonemes, it contains). For example, you might associate Spanish with rolled r's, French with nasalized vowels, or German and Arabic with throaty sounds — and it's true that different languages do have different phonemic inventories (English is uncommon in having its "th" sounds, both the kind in "that" and the kind in "think"), but this is a separate phonetic property of a language. Phonotactics, instead, has to do with the sounds that a language allows to cooccur next to each other or within a word — the "bn" consonant cluster from "bnick" is not allowed in English, for example. These phonotactic restrictions, just like phonemic inventories, vary from language to language: German allows clusters like "kn" and "pf" which English prohibits, while many languages allow the "ng" sound from the end of "sing" to occur in places other than the end of a syllable, which is the only position where English permits it. Some languages, on the other hand, are more restrictive than English in some ways, like Spanish, which doesn't allow clusters of "s" followed by a consonant (except when the cluster bridges a syllable boundary), or Turkish, which has restrictions on which vowels can cooccur in the same word.

Local Restrictions

Many of these phonotactic rules, you may have noticed, prohibit certain phonemes from occurring adjacent to each other in a word. Computational phonologists call these sorts of restrictions, and the languages that contain them, Strictly Local (McNaughton & Papert, 1971). The name derives from the fact that you can tell whether a word is well formed simply by looking at each substring of the word (that is, each contiguous piece of it, without skipping any segments) and making sure none of them is restricted: "blick" sounds fine to English speakers because "bl" is fine (as evidenced by "blend", "blunder", "able"), "li" is fine ("little", "flint", "alliteration"), and "ick" is fine ("sick", "tricky", "pick" — notice that we treat "ck" as just one sound).
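To make this concrete, here is a minimal Python sketch of a Strictly Local check: a word is well formed exactly when none of its adjacent pairs of segments appears on a list of forbidden pairs. The tiny list of forbidden pairs is an illustrative assumption of mine, not a full model of English phonotactics.

```python
# A minimal sketch of a Strictly 2-Local grammar: a word is well formed
# exactly when no adjacent pair of segments is forbidden.
# The forbidden pairs below are illustrative, not a full model of English.

FORBIDDEN_PAIRS = {("b", "n"), ("p", "f")}  # rules out "bnick", "sapf"

def is_sl2_grammatical(segments):
    """Return True if no adjacent pair of segments is forbidden."""
    return all(
        (first, second) not in FORBIDDEN_PAIRS
        for first, second in zip(segments, segments[1:])
    )

print(is_sl2_grammatical(["b", "l", "i", "ck"]))  # True:  "blick"
print(is_sl2_grammatical(["b", "n", "i", "ck"]))  # False: "bnick"
```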

Non-Local Restrictions

However, many languages have restrictions which hold at a distance, like Turkish, which I introduced as having restrictions on which vowels can cooccur in the same word. Specifically, vowels in Turkish words agree, or harmonize, in backness: whether the speaker's tongue is forward in the mouth, as in 'e' (rhymes with 'cake'), 'i' ('see'), 'ü', and 'ö', or back in the mouth, as in 'a' ('father'), 'ı', 'u' ('goose'), and 'o' ('yawn'); they also harmonize in rounding: whether the speaker's lips are rounded, as in 'ü', 'ö', 'u', and 'o', or unrounded, as in 'e', 'i', 'a', and 'ı'. This interaction occurs between a suffix's vowel and the vowel before it, even though there are consonants in between. For instance, the Wikipedia article on Turkish vowel harmony gives 'kapıdır' as grammatical (its vowels 'a', 'ı', and 'ı' are all back and unrounded), while 'kapıdür' (backness mismatch) and 'kapıdur' (rounding mismatch) are not.

Heinz et al. (2011) propose to account for this observation by introducing the idea of a phonological tier: "a level of representation where not all speech sounds are present". That is, Turkish vowels may be separated by consonants when they are pronounced, but we can also think of them as existing on a tier, or projection, of the word, which contains only the subsequence of segments that participate in these interactions ('aıı' for 'kapıdır' in this case). Within this tier, the restriction against front vowels cooccurring with back vowels is just Strictly Local again: without the consonants in the way, the interacting vowels are right next to each other. This very important idea introduced the concept of Tier-based Strictly Local (TSL) languages: ones whose patterns can be expressed as a tier together with a set of locally restricted substrings on that tier. Heinz et al. (2011) also introduced what is known as the Subregular Hypothesis, which starts to bring us back to the study question from the beginning of the article, but we are not quite ready to understand what that means, or why it is called that, without some introductory computability theory.
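As a quick sketch of how a tier-based check works, the following Python snippet projects the vowel tier of a word (spelled, for simplicity, with one symbol per segment) and then applies an ordinary Strictly Local check for backness agreement on that tier. The vowel classes are the ones just described; everything else about the snippet is an illustrative simplification of mine, not a full account of Turkish.

```python
# A sketch of a Tier-based Strictly 2-Local (TSL-2) check for backness
# harmony: project the vowel tier, then forbid adjacent front/back pairs
# on that tier.

FRONT = set("eiüö")
BACK = set("aıuo")
VOWELS = FRONT | BACK

def vowel_tier(word):
    """Keep only the segments that project to the vowel tier."""
    return [seg for seg in word if seg in VOWELS]

def obeys_backness_harmony(word):
    """On the vowel tier, adjacent vowels must agree in backness."""
    tier = vowel_tier(word)
    return all(
        (v1 in FRONT) == (v2 in FRONT)
        for v1, v2 in zip(tier, tier[1:])
    )

print(obeys_backness_harmony("kapıdır"))  # True:  tier is ['a', 'ı', 'ı']
print(obeys_backness_harmony("kapıdür"))  # False: 'ı' and 'ü' disagree
```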

Computability Theory

Lamp

Suppose you have a lamp. When you first get it, it is turned off, but you can turn it on by pressing a button, or off again by pressing the same button. You might represent this lamp with the following diagram:

[FSA diagram: two states, "Lamp Off" and "Lamp On", with "Lamp Off" marked as the starting state; a "press button" transition leads from each state to the other.]

Let's notice a few things about this diagram. First, there are two states: "Lamp Off" and "Lamp On", represented here by circles. Secondly, there is an extra arrow pointing to the "Lamp Off" state, indicating that the lamp is turned off when you first begin using it. Finally, there are transitions between the states, which in this case both represent the action of pressing the button. Specifically, if you are in the "Lamp Off" state and you want to know what state you will be in after pressing the button, you follow the arrow with the "press button" label to arrive in the "Lamp On" state.

Computer scientists refer to the sort of systems that can be represented by this sort of diagram as finite-state automata, or FSAs. In this example, there are only two states: "Lamp On" and "Lamp Off". But in theory, you can imagine an FSA with arbitrarily (but finitely!) many states. The only requirements are that an FSA should have:

- a finite set of states,
- a designated start state (marked by the extra incoming arrow),
- a finite alphabet of input symbols (here, just the action of pressing the button),
- a transition for each state and input symbol, telling you which state to move to next, and
- a set of final (or accepting) states, which we will use shortly to decide which strings of inputs the automaton accepts.

RGB Lamp I

Suppose you decide that a single-color lamp is too boring, so you decide to upgrade. You buy a lamp that allows you to select between three colors: red, green, and blue. Correspondingly, this new lamp has more than just one button; it has four: one for each color (labeled 'r', 'g', and 'b', respectively), plus one to turn the lamp off (labeled 'x'). Unfortunately for you, this lamp shipped with a defect: if you ever press the 'g' button while the lamp is already shining green, the whole lamp will short-circuit and stop working. To visualize this lamp as an FSA, your diagram might look something like the following:

[FSA diagram: five states, Off (the start state), Red, Green, Blue, and Broken, with every state except Broken final. From Off, Red, and Blue, the buttons 'r', 'g', 'b', and 'x' lead to Red, Green, Blue, and Off, respectively. From Green, 'r', 'b', and 'x' likewise lead to Red, Blue, and Off, but 'g' leads to Broken. Every button pressed in Broken leads back to Broken.]

Looking at this new diagram, we see that even though we only have three more states and three more symbols than before, the automaton is much more complicated than before! I used Jove and edotor.net to create and render the dot code for these diagrams, and they manage to keep up, but you can already tell that automata like this quickly become difficult to visualize.

But importantly, look at the function performed by the final versus non-final states: any string of transitions that ends with the FSA in a final state corresponds to a string of button presses that leaves you with a functional lamp, while any string of transitions that ends with the FSA in the non-final "Broken" state corresponds to a string of button presses that leaves you with a broken lamp. Try this for yourself: the strings 'rgbrgbrgb', 'rrrrrrrrrrr', and 'gbgrgxgbgrgxgbgrgx' all leave you in a final state and with a still-functioning lamp. Because the FSA accepts these strings, they are said to be in the FSA's formal language (a set of strings defined by a mathematical rule; in this case, the rule of being accepted by this FSA). On the other hand, 'gg', 'rgbrgbrgbrrggbb', and 'grgbgg' are strings not in the language of this FSA.
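If you would rather not trace the diagram by hand, here is a small Python sketch of this lamp as an FSA. The state names and button labels follow the diagram above; the dictionary plays the role of the transition function, and a string is accepted exactly when the run ends in a final state.

```python
# The first RGB lamp as a finite-state automaton: a transition table,
# a start state, and a set of final (accepting) states.

TRANSITIONS = {
    state: {"r": "Red", "g": "Green", "b": "Blue", "x": "Off"}
    for state in ("Off", "Red", "Green", "Blue")
}
TRANSITIONS["Green"]["g"] = "Broken"              # the defect: 'g' while green
TRANSITIONS["Broken"] = {button: "Broken" for button in "rgbx"}

START = "Off"
FINAL_STATES = {"Off", "Red", "Green", "Blue"}    # every state except Broken

def accepts(button_presses):
    """Run the FSA over a string of button presses and check the last state."""
    state = START
    for button in button_presses:
        state = TRANSITIONS[state][button]
    return state in FINAL_STATES

print(accepts("rgbrgbrgb"))  # True
print(accepts("gg"))         # False
print(accepts("grgbgg"))     # False
```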

RGB Lamp II

Suppose you take your new lamp apart and reassemble it, and you don't succeed at fixing the short in the green button, but you do manage to unlock the hidden colors that the manufacturer was trying to keep from you: Yellow, Cyan, Magenta, and White. Now, instead of a new color immediately deactivating an old color, multiple colors are allowed to shine at the same time, which means you can have Red+Green to make Yellow, Red+Blue to make Magenta, and Green+Blue to make Cyan. If you press a color's button a second time, it will turn that color's bulb off (except for green, which still causes the lamp to short). The 'x' button shuts off all three bulbs. Here is an FSA diagram to model your upgraded lamp:

G Off Off ->Off Broken Broken Broken->Broken r Broken->Broken g Broken->Broken b Broken->Broken x Green Green Green->Broken g Cyan Cyan Green->Cyan b Yellow Yellow Green->Yellow r Green->Off x Red Red Magenta Magenta Red->Magenta b Red->Yellow g Red->Off r Red->Off x Cyan->Broken g Cyan->Green b White White Cyan->White r Cyan->Off x Magenta->Red b Magenta->White g Magenta->Off x Blue Blue Magenta->Blue r Yellow->Broken g Yellow->Green r Yellow->White b Yellow->Off x White->Broken g White->Cyan r White->Yellow b White->Off x Off->Green g Off->Red r Off->Off x Off->Blue b Blue->Cyan g Blue->Magenta r Blue->Off b Blue->Off x

Notice that you now have to be even more careful about when you press the 'g' button: no matter how many times you press the 'r' and 'b' buttons in between two presses of the 'g' button, you will still break your lamp, unless you press the 'x' button in between! This means that the strings 'rgbrgbrgb' and 'gbgrgxgbgrgxgbgrgx', which were in the language of the previous FSA, are not in the language of this new FSA. Strings such as 'rbrbrb', 'gxgxgx', and 'rgbrxgbrrbbxgxbbbbbb' are still accepted by this FSA (try them out for yourself!).
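In fact, this new lamp's language has a compact description that doesn't mention the nine states at all: project just the 'g' and 'x' presses, and forbid two 'g's in a row on that projection. Here is a Python sketch of that check; it restates the pattern rather than encoding the full diagram above.

```python
# A tier-based restatement of the second lamp's language: on the tier of
# 'g' and 'x' presses, the substring "gg" is forbidden.

TIER = {"g", "x"}

def lamp_survives(button_presses):
    """True iff no two 'g' presses occur without an 'x' in between."""
    tier = [b for b in button_presses if b in TIER]
    return all(not (b1 == "g" and b2 == "g") for b1, b2 in zip(tier, tier[1:]))

print(lamp_survives("rbrbrb"))                # True
print(lamp_survives("gxgxgx"))                # True
print(lamp_survives("rgbrxgbrrbbxgxbbbbbb"))  # True
print(lamp_survives("rgbrgbrgb"))             # False
print(lamp_survives("gbgrgxgbgrgxgbgrgx"))    # False
```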

Regular Languages and the Chomsky Hierarchy

There are infinitely many FSAs possible, and each FSA defines a corresponding language: the set of every string it accepts. The set of all sets of strings for which an FSA can be devised is known as the Regular Language Family. FSAs are generally considered to be the simplest model of computation: if you have a system that can exist in one of finitely many states, and that transitions from one state to the next based on some input, it can be considered a finite-state automaton, and it can therefore act as a classifier that sorts strings of inputs into those within and those outside of its language.

However, not every possible pattern of symbols in strings can be modeled as transitions between the states of an FSA. Consider the language of properly-nested brackets, which contains the strings '[ ]', '[ [ ] ]', '[ ] [ ]', and '[ [ [ ] ] [ ] ] [ [ [ ] ] ]', but not '[ [ [ [ [', '] ] ] [ [ [', or '[ ] ] ] ] ] ]'. Try to design an FSA where each transition is labeled with '[' or ']', and where only valid arrangements of open and close brackets bring it to a final state. It can't be done: a valid string in this language might have arbitrarily many open brackets, but an FSA only has finitely many states with which to keep track of how many open brackets it has seen so far. For any FSA with \(n\) states that you might claim encodes the language of properly nested brackets, I can come up with a string nested more than \(n\) brackets deep, which forces your FSA to revisit a state while reading the open brackets and therefore to misclassify some string.

As this bracket language shows, the regular languages are only a subset of all possible formal languages. The bracket language is known to be a Context-Free Language, a member of a more powerful class of models of computation. Where regular languages can be modeled with a finite-state automaton, context-free languages require a push-down automaton, which is very similar to an FSA but has an extra stack of memory that it can push symbols onto and pop them off of, which allows it to, for instance, keep track of how many open brackets it has read so far. Both regular languages and context-free languages are part of the Chomsky hierarchy of formal language complexity, whose most complex automaton is the Turing Machine, which you might have heard of in connection with the operation of computers. A Turing Machine is able to compute anything that can possibly be computed.
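To see concretely what kind of memory the bracket language demands, here is a short Python check for proper nesting. The counter below can grow without bound, which is exactly the resource a finite set of states cannot supply and a push-down automaton's stack can.

```python
# Checking properly-nested brackets with an unbounded counter: the depth
# variable can grow arbitrarily large, which no fixed, finite set of
# states can keep track of.

def brackets_nest_properly(string):
    """True iff every ']' closes an earlier '[' and every '[' gets closed."""
    depth = 0
    for symbol in string:
        if symbol == "[":
            depth += 1
        elif symbol == "]":
            depth -= 1
            if depth < 0:     # a ']' with nothing left to close
                return False
    return depth == 0         # every '[' was eventually closed

print(brackets_nest_properly("[[[]][]][[[]]]"))  # True:  '[ [ [ ] ] [ ] ] [ [ [ ] ] ]'
print(brackets_nest_properly("]]][[["))          # False: '] ] ] [ [ ['
print(brackets_nest_properly("[[[[["))           # False: '[ [ [ [ ['
```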

Although the regular languages are the least complex of the formal language families, they still allow for an incredible range of detailed languages. Chess, for instance, contains:

- an enormous, but finite, number of possible board configurations (Shannon (1950) estimated the number of positions at around \(10^{43}\)),
- a finite set of legal moves out of each configuration, and
- rules for determining when a game has ended.

So the set of all possible sequences of moves in chess is a regular language, despite how mind-bogglingly complex it is! This vast diversity of the Regular Language Family brings us back to the Subregular Hypothesis from earlier.

Subregular Hypothesis

If you were reading closely in both the Phonotactics and Computability Theory sections above, you might have noticed something similar about the natural language phonotactic restrictions and the formal languages of the FSAs devised to model the RGB lamps: the same kinds of restrictions existed in both cases! In English, the substring 'bn' of phonemes is restricted, just like the substring 'gg' is restricted in the first RGB lamp. In Turkish, the subsequence 'ıü' is restricted (over the tier consisting of all vowels), while in your modified lamp, the subsequence 'gg' is restricted (over the tier consisting of 'g' and 'x'). This shows us that regular language patterns can be used to model natural language phonotactics! In fact, we don't even need the full breadth of the regular languages: Heinz et al. (2011) posit that a particular subset of the regular languages, including TSL patterns, is enough to model these sorts of restrictions. This subset, known as the Subregular Languages, might be able to model all phonotactic patterns. This is the Subregular Hypothesis.

A lot of the work that has gone into Subregular Phonology has been related to drawing connections between these formal subsets of the regular languages and real phonotactic patterns from natural languages. De Santo & Graf (2019), for instance, define Input-Sensitive Tier-based Strictly-Local (ITSL) languages, where the decision of whether a segment is projected to a tier depends not only on which phoneme it represents, but also on the phonemes represented by the adjacent segments. The same paper also points out that real-world natural languages enforce multiple phonotactic restrictions simultaneously (perhaps some language has a Turkish-like vowel harmony while also restricting 'bn', like English does), and so devises the Multiple Tier-based Strictly-Local (MTSL) and Multiple Input-sensitive Tier-based Strictly-Local (MITSL) language classes, which could be the subject of their own article.

Additionally, there has been work related to Grammatical Inference, a field introduced by Gold (1967) (see also De la Higuera, 2010). Grammatical inference is concerned with developing algorithms that derive a model of a formal language (such as an FSA) from a set of strings in that language. Human language acquisition can be seen as a complex case of grammatical inference, and particular algorithms can help prove that certain patterns are learnable in theory. In particular, it is well known that humans learn primarily from positive data: you were never explicitly told that "bnick" is not a word; you figured that out from the distribution of sounds you have heard in things that are words. Aksënova (2020), De Santo & Aksënova (2021), and Lambert (2021) are all recent works in this field that relate to the development and evaluation of these grammatical inference algorithms, and they have been very important to my own research (Johnson & De Santo, 2023).
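To give a flavor of what learning from positive data only can look like, here is a deliberately simplified Python sketch (not a reconstruction of any algorithm from the works cited above): fix a tier, read a sample of grammatical words, record which pairs of tier-adjacent segments actually occur, and treat every pair that was never seen as forbidden. The little Turkish-like sample is purely illustrative.

```python
# A toy illustration of grammatical inference from positive data only:
# collect the tier bigrams attested in the sample and forbid everything else.

from itertools import product

def learn_tsl2(sample, tier_symbols):
    """Return the set of tier-adjacent pairs never seen in the positive sample."""
    attested = set()
    for word in sample:
        tier = [seg for seg in word if seg in tier_symbols]
        attested.update(zip(tier, tier[1:]))
    return set(product(tier_symbols, repeat=2)) - attested

# Illustrative mini-sample of harmonic words, with the vowels as the tier.
sample = ["kapıdır", "evlerde", "okulu", "gelir"]
forbidden = learn_tsl2(sample, tier_symbols=set("aıoueiöü"))

print(("ı", "ü") in forbidden)  # True:  never attested, so treated as forbidden
print(("a", "ı") in forbidden)  # False: attested in 'kapıdır'
```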

So where did the experiment from the beginning of this article come from, and how does it connect to Subregular Languages?

If you look closely at the language sample that was given to you at first, you might notice that each word contains either "sh" or "s", but never both (note that the sound English speakers represent with "s" is not part of the sound represented by "sh"; even though the letter "s" is technically reused, this is a quirk of English orthography and does not indicate anything deeper about the speech sounds themselves). That is, the subsequences "s" followed by "sh", and "sh" followed by "s", are forbidden on the tier consisting of {"s", "sh"}. The subregular hypothesis predicts that people learning to recognize these words will pick up on this restriction and, when the two sounds later cooccur in a word in the test pairs, notice that the word does not match the input they learned from. Did you pick up on this pattern? Or maybe some other pattern? Most of the participants in these studies did!
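Concretely, the check the participants were implicitly being asked to learn looks just like the tier-based checks from earlier. Here is a small Python sketch; writing words as lists of segments so that "sh" counts as a single sound is an encoding choice of mine, not anything taken from the studies themselves.

```python
# Sibilant harmony as a tier-based check: project the sibilants and require
# that they all match.

def sibilant_tier(segments):
    """Keep only the segments on the sibilant tier."""
    return [seg for seg in segments if seg in {"s", "sh"}]

def harmonizes(segments):
    """True iff all sibilants in the word are identical."""
    tier = sibilant_tier(segments)
    return all(s1 == s2 for s1, s2 in zip(tier, tier[1:]))

print(harmonizes(["sh", "o", "sh", "u", "k", "o", "sh"]))  # True:  'shoshukosh'
print(harmonizes(["sh", "o", "s", "u", "k", "o", "sh"]))   # False: 'shosukosh'
```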

Conclusion

The findings of subregular phonology demonstrate a computational pattern underlying human language behavior. It is hoped that further research will provide insight into the human language faculty and word learning, and will allow us to create better models of phonotactics and other language phenomena.

But even just as a thought experiment, this is an interesting journey into ideas about how to quantify the complexity of patterns, and how to design the minimal computer-like system that can detect them! It shows us how rewarding it can be to create mathematical models of systems, even ones as complicated as language.

References

Aksënova, A. (2020). Tool-assisted induction of subregular languages and mappings (Doctoral dissertation, State University of New York at Stony Brook).

Avcu, E., & Hestvik, A. (2020). Unlearnable phonotactics. Glossa: a journal of general linguistics, 5(1).

De la Higuera, C. (2010). Grammatical inference: learning automata and grammars. Cambridge University Press.

De Santo, A., & Aksënova, A. (2021, February). Learning Interactions of Local and Non-Local Phonotactic Constraints from Positive Input. In Proceedings of the Society for Computation in Linguistics 2021 (pp. 167-176).

De Santo, A., & Graf, T. (2019). Structure sensitive tier projection: Applications and formal properties. In Formal Grammar: 24th International Conference, FG 2019, Riga, Latvia, August 11, 2019, Proceedings 24 (pp. 35-50). Springer Berlin Heidelberg.

Gold, E. M. (1967). Language identification in the limit. Information and control, 10(5), 447-474.

Heinz, J., Rawal, C., & Tanner, H. G. (2011, June). Tier-based strictly local constraints for phonology. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human language technologies (pp. 58-64).

Johnson, J. K., & De Santo, A. (2023). Evaluating a phonotactic learner for MITSL-(2,2) languages. Proceedings of the Society for Computation in Linguistics, 6, Article 37. https://doi.org/10.7275/crgk-6g04

Lai, R. (2015). Learnable vs. unlearnable harmony patterns. Linguistic Inquiry, 46(3), 425-451.

Lambert, D. (2021, August). Grammar interpretations and learning TSL online. In International Conference on Grammatical Inference (pp. 81-91). PMLR.

McNaughton, R., & Papert, S. A. (1971). Counter-Free Automata (MIT research monograph no. 65). The MIT Press.

Shannon, C. E. (1950). XXII. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314), 256-275.