Categories
AI/ML Behaviour bio chicken_research control deep dev ears evolution highly_speculative neuro UI

Hierarchical Temporal Memory

Here I’m continuing with the task of unsupervised detection of audio anomalies, hopefully for the purpose of detecting chicken stress vocalisations.

After much fussing around with the old Numenta NuPic codebase, I’m porting the older nupic.audio and nupic.critic code, over to the more recent htm.core.

These are the main parts:

  • Sparse Distributed Representation (SDR)
  • Encoders
  • Spatial Pooler (SP)
  • Temporal Memory (TM)

I’ve come across a very intricate implementation and documentation, about understanding the important parts in the HTM model, way deep, like how did I get here? I will try implement the ‘critic’ code, first. Or rather, I’ll try port it from nupic to htm. After further investigation, there’s a few options, and I’m going to try edit the hotgym example, and try shove wav files frequency band scalars through it instead of power consumption data. I’m simplifying the investigation. But I need to make some progress.

I’m using this docker to get in, mapping my code and wav file folder in:

docker run -d -p 8888:8888 --name jupyter -v /media/chrx/0FEC49A4317DA4DA/sounds/:/home/jovyan/work 3rdman/htm.core-jupyter:latest



So I've got some code working that writes to 'nupic format' (.csv) and code that reads the amplitudes from the csv file, and then runs it through htm.core. 

So it takes a while, and it's just for 1 band (of 10 bands). I see it also uses the first 1/4 of so of the time to know what it's dealing with.  Probably need to run it through twice to get predictive results in the first 1/4. 

Ok no, after a few weeks, I've come back to this point, and realise that the top graph is the important one.  Prediction is what's important.  The bottom graphs are the anomaly scores, used by the prediction.  
Frequency Band 0

The idea in nupic.critic, was to threshold changes in X bands. Let’s see the other graphs…

Frequency band 0: 0-480Hz ?
Frequency band 2: 960-1440Hz ?
Frequency band 3: 1440-1920Hz ?
Frequency band 4: 1920-2400Hz ?
Frequency band 5: 2400-2880Hz ?
Frequency band 6: 2880-3360Hz ?

Ok Frequency bands 7, 8, 9 were all zero amplitude. So that’s the highest the frequencies went. Just gotta check what those frequencies are, again…

Opening 307.wav
Sample width (bytes): 2
Frame rate (sampling frequency): 48000
Number of frames: 20771840
Signal length: 20771840
Seconds: 432
Dimensions of periodogram: 4801 x 2163

Ok with 10 buckets, 4801 would divide into 
Frequency band 0: 0-480Hz
Frequency band 1: 480-960Hz
Frequency band 2: 960-1440Hz
Frequency band 3: 1440-1920Hz
Frequency band 4: 1920-2400Hz
Frequency band 5: 2400-2880Hz
Frequency band 6: 2880-3360Hz

Ok what else. We could try segment the audio by band, so we can narrow in on the relevant frequency range, and then maybe just focus on that smaller range, again, in higher detail.

Learning features with some labeled data, is probably the correct way to do chicken stress vocalisation detections.

Unsupervised anomaly detection might be totally off, in terms of what an anomaly is. It is probably best, to zoom in on the relevant bands and to demonstrate a minimal example of what a stressed chicken sounds like, vs a chilled chicken, and compare the spectrograms to see if there’s a tell-tale visualisable feature.

A score from 1 to 5 for example, is going to be anomalous in arbitrary ways, without labelled data. Maybe the chickens are usually stressed, and the anomalies are when they are unstressed, for example.

A change in timing in music might be defined, in some way. like 4 out of 7 bands exhibiting anomalous amplitudes. But that probably won’t help for this. It’s probably just going to come down to a very narrow band of interest. Possibly pointing it out on a spectrogram that’s zoomed in on the feature, and then feeding the htm with an encoding of that narrow band of relevant data.


I’ll continue here, with some notes on filtering. After much fuss, the sox app (apt-get install sox) does it, sort of. Still working on python version.

                                                                              $ sox 307_0_50.wav filtered_50_0.wav sinc -n 32767 0-480
$ sox 307_0_50.wav filtered_50_1.wav sinc -n 32767 480-960
$ sox 307_0_50.wav filtered_50_2.wav sinc -n 32767 960-1440
$ sox 307_0_50.wav filtered_50_3.wav sinc -n 32767 1440-1920
$ sox 307_0_50.wav filtered_50_4.wav sinc -n 32767 1920-2400
$ sox 307_0_50.wav filtered_50_5.wav sinc -n 32767 2400-2880
$ sox 307_0_50.wav filtered_50_6.wav sinc -n 32767 2880-3360


So, sox does seem to be working.  The mel spectrogram is logarithmic, which is why it looks like this.

Visually, it looks like I'm interested in 2048 to 4096 Hz.  That's where I can see the chirps.

Hmm. So I think the spectrogram is confusing everything.

So where does 4800 come from? 48 kHz. 48,000 Hz (48 kHz) is the sample rate “used for DVDs“.

Ah. Right. The spectrogram values represent buckets of 5 samples each, and the full range is to 24000…?

Sample width (bytes): 2
0.     5.    10.    15.    20.    25.    30.    35.    40.    45.    50.    55.    60.    65.    70.    75.    80.    85.    90.    95.    100.   105.   110.   115.   120.   125.   130.   135.   140.   145.
...
 23950. 23955. 23960. 23965. 23970. 23975. 23980. 23985. 23990. 23995. 24000.]

ok. So 2 x 24000. Maybe 2 channels? Anyway, full range is to 48000Hz. In that case, are the bands actually…

Frequency band 0: 0-4800Hz
Frequency band 1: 4800-9600Hz
Frequency band 2: 9600-14400Hz
Frequency band 3: 14400-19200Hz
Frequency band 4: 19200-24000Hz
Frequency band 5: 24000-28800Hz
Frequency band 6: 28800-33600Hz

Ok so no, it’s half the above because of the sample width of 2.

Frequency band 0: 0-2400Hz
Frequency band 1: 2400-4800Hz
Frequency band 2: 4800-7200Hz
Frequency band 3: 7200-9600Hz
Frequency band 4: 9600-12000Hz
Frequency band 5: 12000-14400Hz
Frequency band 6: 14400-16800Hz

So why is the spectrogram maxing at 8192Hz? Must be spectrogram sampling related.

ol_hann_win
From Berkeley document

So the original signal is 0 to 24000Hz, and the spectrogram must be 8192Hz because… the spectrogram is made some way. I’ll try get back to this when I understand it.

sox 307_0_50.wav filtered_50_0.wav sinc -n 32767 0-2400
sox 307_0_50.wav filtered_50_1.wav sinc -n 32767 2400-4800
sox 307_0_50.wav filtered_50_2.wav sinc -n 32767 4800-7200
sox 307_0_50.wav filtered_50_3.wav sinc -n 32767 7200-9600
sox 307_0_50.wav filtered_50_4.wav sinc -n 32767 9600-12000
sox 307_0_50.wav filtered_50_5.wav sinc -n 32767 12000-14400
sox 307_0_50.wav filtered_50_6.wav sinc -n 32767 14400-16800

Ok I get it now.

Ok i don’t entirely understand the last two. But basically the mel spectrogram is logarithmic, so those high frequencies really don’t get much love on the mel spectrogram graph. Buggy maybe.

But I can estimate now the chirp frequencies…

sox 307_0_50.wav filtered_bird.wav sinc -n 32767 1800-5200

Beautiful. So, now to ‘extract the features’…

So, the nupic.critic code with 1 bucket managed to get something resembling the spectrogram. Ignore the blue.

But it looks like maybe, we can even just threshold and count peaks. That might be it.

sox 307.wav filtered_307.wav sinc -n 32767 1800-5200
sox 3072.wav filtered_3072.wav sinc -n 32767 1800-5200
sox 237.wav filtered_237.wav sinc -n 32767 1800-5200
sox 98.wav filtered_98.wav sinc -n 32767 1800-5200

Let’s do the big files…

Ok looks good enough.

So now I’m plotting the ‘chirp density’ (basically volume).

’98.wav’
‘237.wav’
‘307.wav’
‘3072.wav’

In this scheme, we just proxy chirp volume density as a variable representing stress.  We don’t know if it is a true proxy.
As you can see, some recordings have more variation than others.  

Some heuristic could be decided upon, for rating the stress from 1 to 5.  The heuristic depends on how the program would be used.  For example, if it were streaming audio, for an alert system, it might alert upon some duration of time spent above one standard deviation from the rolling mean. I’m not sure how the program would be used though.

If the goal were to differentiate stressed and not stressed vocalisations, that would require labelled audio data.   

(Also, basically didn’t end up using HTM, lol)