Published: April 25, 2022

Author: Olivia Walt
Nominator: Rebecca Scarborough
Course: LING 3100 Language Sound Structures, Fall 2021
LURA 2022

Music has become such a common tool in language classrooms across the globe that anyone who takes a language class is practically guaranteed to use it at some point to help facilitate their learning. Language learners everywhere can be found learning songs that help them memorize specific, often unique linguistic features, from alphabets and sound structures to eccentric colloquial phrases. This is possible because, like understanding language, listening to music involves carefully interpreting complex patterns of sound.

Given this evident overlap between music and spoken language, it may seem reasonable to assume that an affinity for one automatically grants an advantage in the other. However, it turns out that this is not always the case.

To discover how much (if at all) an affinity for music affects a person’s ability to recognize and process English vowels, I conducted an independent study in which I asked a variety of individuals, from both musical and non-musical backgrounds, to listen to, identify, and then replicate three synthesized vowel sounds made to emulate real spoken vowels. Vowels are a lot like musical chords in that each consists of a complex combination of multiple different frequencies, or pitches, produced at the same time. I therefore began my research expecting to see a generally positive relationship between the extent of a person’s musicality and the accuracy with which they were able to perceive and reproduce the study sounds.

Created using Praat software, the sounds I provided for my study participants were combinations of “pure” tones (i.e., sounds with a single frequency) layered on top of one another, resulting in highly digital-sounding composite tones when played together. For example, to create a synthesized vowel /i/ (as in “beet”), I combined three tones with measured frequencies of 280 Hz, 2207 Hz, and 2254 Hz. You can hear the resulting sound, and see a visual representation of it, in the recording and graph below:
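The layering idea can be sketched in a few lines of code. The original sounds were made in Praat; the following is only an illustrative stand-in in Python (using numpy, with assumed sample rate and duration) that sums equal-amplitude sine waves at the three /i/ frequencies listed above:

```python
import numpy as np

SAMPLE_RATE = 44100              # samples per second (an assumption)
DURATION = 1.0                   # seconds (an assumption)
I_FORMANTS = [280, 2207, 2254]   # Hz, the three pure tones used for /i/

def synthesize_vowel(frequencies, sample_rate=SAMPLE_RATE, duration=DURATION):
    """Sum equal-amplitude sine waves at the given frequencies."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    tone = sum(np.sin(2 * np.pi * f * t) for f in frequencies)
    return tone / len(frequencies)   # normalize so the signal stays in [-1, 1]

vowel_i = synthesize_vowel(I_FORMANTS)
```

Writing `vowel_i` to a WAV file (e.g., with the standard-library `wave` module) would give a rough digital-sounding approximation of the composite tone described here.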

Link to recording of Synthesized Vowel 1 (/i/):!AiwsjjI_oD8lzzQ8-XzfI_NEr_7y?e=qLMtKH 


Spectrogram of Synthesized Vowel 1 (/i/)

The lines of red dots shown on the above spectrogram represent the individual “pure” tone frequencies (called “formants” in natural speech) that, when played simultaneously, produce a synthesized version of the corresponding vowel (EdUHK, 2021; Piché, 1997; “Using Formants”, 2019). In addition to /i/, I also created synthesized versions of the vowels /o/ (as in “orchard”) and /æ/ (as in “hat”). All were fairly simple to create and were intended to be relatively easy to distinguish from one another, regardless of whether they proved easy to identify.
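To see why those dotted lines sit where they do, one can take the spectrum of such a composite tone and read off its peaks. Below is a minimal sketch in Python: the 280/2207/2254 Hz values are the /i/ tones from above, but the simple peak-picking is a hypothetical illustration, not Praat’s formant tracker. Using a one-second signal gives 1 Hz frequency resolution, enough to separate the closely spaced 2207 Hz and 2254 Hz components:

```python
import numpy as np

SAMPLE_RATE = 44100
FREQS = [280, 2207, 2254]   # Hz, the pure tones used for the synthetic /i/

# Build one second of the composite tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
signal = sum(np.sin(2 * np.pi * f * t) for f in FREQS)

# Magnitude spectrum; rfft keeps only the non-negative frequencies.
spectrum = np.abs(np.fft.rfft(signal))
freq_axis = np.fft.rfftfreq(len(signal), d=1 / SAMPLE_RATE)

# Read off the three largest spectral peaks.
peak_bins = np.argsort(spectrum)[-3:]
peaks = sorted(freq_axis[b] for b in peak_bins)
# peaks recovers the three component frequencies: 280, 2207, and 2254 Hz
```

Because each frequency is a whole number of cycles per second, each sine lands exactly on one FFT bin, so the three tallest bins correspond exactly to the three component tones.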

While “easy” might not be the best word for the participants’ actual responses to the study activity, the overall results were nonetheless significant. Interestingly, of all the individuals who participated, those who identified as “non-musicians” appeared the most likely to recreate each study sound as they heard it (i.e., their perceptions of the sounds matched their own reproductions of them). By the same token, the participants whose perceptions did not match their productions often added more acoustic features to their reproductions than they described hearing, for example giving the sound a more nasalized quality or combining two sounds together. Many of these individuals seemed to do this unconsciously, perhaps further suggesting a tendency to decode sounds through a perspective centered on musicality rather than linguistic meaning.

In terms of correct identification, /i/ was accurately identified the most often of the three synthesized vowels. It also had the fewest instances of perceptions that did not align with productions. This may indicate an inverse relationship between how often a synthesized vowel is accurately identified and how often it is perceived one way but produced another.

Considering all of the above, it appears that a person with a stronger musical background may actually have more difficulty identifying a vowel from a set of raw, pure-tone frequencies alone. This runs counter to my initial prediction: as musicians become skilled at interpreting tones as individual “musical” notes, that very skill may get in the way of their ability to interpret the tones together as a spoken vowel sound.

How, then, do we best support language learners who may find it especially difficult to learn a language’s sound system because they are more inclined to process individual speech sounds in terms of perceived “musical” attributes than phonemic meaning? Using my initial research as a starting point, I hope to dive further into this question, and the countless others that have arisen from it, with the intent of continuing to bridge the gap between language and music in education and beyond.

Header image credit:

Boersma, P. & Weenink, D. (2016). Praat (Version 6.1.16). University of Amsterdam.

EdUHK. (2021). 2.2 Formants of vowels. Phonetics and phonology.

Piché, J. (Ed.). (1997). Table III: Formant values. The Csound manual (version 3.48): A manual for the audio processing system and supporting programs with tutorials. Analog Devices Incorporated.

Using formants to synthesize vowel sounds. (2019, July 17). SoundBridge.