Learning About wav and the Human Voice

Monday November 18, 2013

I'm playing with audio tracks for a project.

First, I read the Wikipedia article on the standard method for digitizing an audio signal, pulse-code modulation (PCM). Remember that sound is just vibration, and any vibration can be written as a sum of sine waves of different frequencies, so we can digitize it by measuring the waveform's amplitude at regular intervals. It's pretty amazing that the huge variety of sounds we hear is just sine waves of different frequencies layered on top of each other. Really mind-blowing.
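To make the PCM idea concrete, here is a minimal sketch in Python that layers a couple of sine waves, samples them at a fixed rate, and rounds each sample to a signed integer. The function name `pcm_samples`, the 16-bit depth, and the 440 Hz and 880 Hz tones are my own illustrative choices, not anything taken from the Wikipedia article.

```python
import math

def pcm_samples(freqs_hz, sample_rate_hz, duration_s, bit_depth=16):
    """Sample a sum of equal-amplitude sine waves and quantize to signed ints."""
    max_amp = 2 ** (bit_depth - 1) - 1           # 32767 for 16-bit audio
    n = round(sample_rate_hz * duration_s)       # number of samples to take
    samples = []
    for i in range(n):
        t = i / sample_rate_hz                   # time of the i-th sample
        # Average the sines so the combined wave stays within [-1, 1].
        value = sum(math.sin(2 * math.pi * f * t) for f in freqs_hz) / len(freqs_hz)
        samples.append(round(value * max_amp))   # quantize to an integer
    return samples

# Hypothetical example: two tones, 10 ms of audio at 8000 Hz.
samples = pcm_samples([440, 880], 8000, 0.01)
print(len(samples), samples[:5])
```

The list of integers is essentially what an uncompressed PCM file stores after its header.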

Back on Wikipedia: the voice band used in telephony runs from roughly 300 Hz to 3400 Hz (according to the same Wikipedia page). By the Nyquist-Shannon sampling theorem, the sampling rate has to be more than twice the highest frequency you want to reconstruct, so at least 6800 Hz, which is why 8000 Hz is the usual choice. Alright, hypothesis: 8000 Hz will still sound good, but start getting to 4000 Hz or below and it will be pretty crappy.

Fire up Audacity. Tracks -> Resample (to 8000 Hz). Static-y, but everything is perfectly clear. Now let's go down to 4000 Hz. It's notably worse, but still understandable. Now let's really start taxing it and crank it down to 2000 Hz. And now I'm noticing something interesting: it's primarily the lowest tones that are coming through. And of course this makes perfect sense, because a 2000 Hz sampling rate can only represent frequencies below 1000 Hz, so only the low tones survive the subsampling. This is so satisfying. Subsampling further (1000 Hz, etc.) is interesting, both to hear the resulting effects and to see how far the audio stays understandable.
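The Nyquist arithmetic behind that experiment can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it takes the 300 to 3400 Hz range quoted above and assumes everything above half the sampling rate is simply lost. The helper names `nyquist_limit` and `band_fraction_kept` are made up for this illustration.

```python
VOICE_BAND_HZ = (300, 3400)  # the range quoted above

def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2

def band_fraction_kept(sample_rate_hz, band=VOICE_BAND_HZ):
    """Fraction of the band (by width in Hz) that lies below the Nyquist limit."""
    low, high = band
    kept_top = min(high, nyquist_limit(sample_rate_hz))
    return max(0.0, kept_top - low) / (high - low)

# The sampling rates tried in Audacity.
for rate in (8000, 4000, 2000, 1000):
    print(rate, nyquist_limit(rate), round(band_fraction_kept(rate), 2))
```

At 8000 Hz the whole band fits under the limit, while at 2000 Hz only the part below 1000 Hz remains, which matches what I heard.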

Alright, now let's try to understand how to read wav files. This Stanford research group's page has a simply phenomenal explanation.

First off, if you need an audio file to play with, I found a bunch of samples by googling speech wav. Here's a good example wav file.

First take a look at the clip with the file command. The output is below:

meet_parents_bring_u_down.wav: RIFF (little-endian) data, WAVE audio, MPEG Layer 3, mono 22050 Hz

Now we know what we're dealing with. Alright, let's use hexdump meet_parents_bring_u_down.wav | head. Looking at the output and reading the website, you can start to see all the bytes line up. First the ChunkID, the ChunkSize, etc. I used Calculator.app to convert between dec and hex to help understand it. And then based on your knowledge of the file, you can start predicting the header values before you actually confirm they're there in the hex dump. And then you know you understand it.
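The same header walk can be done in code instead of by eye. Below is a rough Python sketch that unpacks the first 36 bytes of a RIFF/WAVE file with the standard `struct` module. It assumes the `fmt ` subchunk comes immediately after the RIFF header, which is the usual layout but is not guaranteed. The field names follow the common ChunkID/ChunkSize naming, and `parse_wav_header` plus the synthetic demo header are my own inventions for illustration.

```python
import struct

# Little-endian layout: ChunkID, ChunkSize, Format, Subchunk1ID, Subchunk1Size,
# AudioFormat, NumChannels, SampleRate, ByteRate, BlockAlign, BitsPerSample.
HEADER_FORMAT = "<4sI4s4sIHHIIHH"  # 36 bytes in total

def parse_wav_header(data):
    """Parse the first 36 bytes of a WAV file (assumes 'fmt ' is the first subchunk)."""
    (chunk_id, chunk_size, fmt, sub1_id, sub1_size, audio_format,
     num_channels, sample_rate, byte_rate, block_align,
     bits_per_sample) = struct.unpack(HEADER_FORMAT, data[:36])
    if chunk_id != b"RIFF" or fmt != b"WAVE" or sub1_id != b"fmt ":
        raise ValueError("not a RIFF/WAVE header in the expected layout")
    return {
        "ChunkSize": chunk_size,
        "Subchunk1Size": sub1_size,
        "AudioFormat": audio_format,   # 1 means uncompressed PCM
        "NumChannels": num_channels,
        "SampleRate": sample_rate,
        "ByteRate": byte_rate,
        "BlockAlign": block_align,
        "BitsPerSample": bits_per_sample,
    }

# Synthetic header for a hypothetical mono, 22050 Hz, 16-bit PCM file with no data.
demo = struct.pack(HEADER_FORMAT, b"RIFF", 36, b"WAVE", b"fmt ", 16,
                   1, 1, 22050, 44100, 2, 16)
header = parse_wav_header(demo)
print(header)
```

For a real file, something like `parse_wav_header(open("meet_parents_bring_u_down.wav", "rb").read(36))` should work, as long as the file follows this layout. Predicting ByteRate as SampleRate * NumChannels * BitsPerSample / 8 before reading it is the same check as predicting values in the hex dump.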

In short, I now know a bit more about sound, the human voice, audio encoding, and the wav file format. I'll be using this to start reading wav files in order to work on my speech recognition engine.
