UPDATE 29 Jan - ignore all this, problem solved. But I'll leave it here for posterity.
So, I have a whole pile of threads at different priorities, but the ones I care about for the purpose of this little mystery are the producer thread that does synthesis (next highest priority), and the consumer thread that does output to ALSA, which runs at the highest priority of the threads I spawn. There's a double buffer between them, 64 frames (i.e. samples) long, some 1.45 ms at 44.1kHz, and the ALSA thread works to keep latency under control by managing the maximum depth to which the ALSA FIFO is filled.
So the ALSA thread waits for the FIFO to reach a certain amount of emptiness - set currently to 256 frames, longer than I'd like - then grabs the read half of the double buffer, spews it down the throat of the ALSA device, then loops, forever. Really simple. The 'wait for emptiness' is done by checking snd_pcm_delay; if it's higher than the 256 frame threshold it sleeps for a quarter buffer period. That's 1/4 of 1.45ms which is some 360us.
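For anyone wanting to check the numbers, here is a minimal sketch of the timing arithmetic in Python. The constant and function names are made up for illustration and are not taken from the synth's code; only the 44.1kHz rate, the 64 frame buffer and the 256 and 128 frame thresholds come from the post.

```python
# Timing arithmetic only; names are hypothetical, values are from the post.
SAMPLE_RATE_HZ = 44_100
BUFFER_FRAMES = 64             # one half of the double buffer
FIFO_THRESHOLD_FRAMES = 256    # current 'emptiness' threshold
WISHED_THRESHOLD_FRAMES = 128  # the threshold I'd prefer


def frames_to_us(frames, rate_hz=SAMPLE_RATE_HZ):
    """Duration of `frames` audio frames in microseconds."""
    return frames * 1_000_000 / rate_hz


buffer_period_us = frames_to_us(BUFFER_FRAMES)
sleep_request_us = buffer_period_us / 4  # quarter buffer period

print(f"buffer period       : {buffer_period_us:7.1f} us")
print(f"sleep request       : {sleep_request_us:7.1f} us")
print(f"256 frame threshold : {frames_to_us(FIFO_THRESHOLD_FRAMES):7.1f} us")
print(f"128 frame threshold : {frames_to_us(WISHED_THRESHOLD_FRAMES):7.1f} us")
```

A 128 frame threshold is only about 2.9ms of audio, so if the thread goes to sleep with the FIFO just above that mark, any wake-up more than roughly 2.5ms late would drain it completely.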
And all sorts of weirdness is going on when I try to sleep. I'm at priority 99 - and I actually *am* at priority 99 after yesterday's stupidity elimination - and yet when I sleep, sometimes I don't wake up for over 3ms, which is the reason my FIFO threshold is set so high. I'd like to set it to 128 sample times, but it goes titsup when I do that.
I was puzzled, because other stats out of the program were indicating that my consumer thread was never waiting on an unready buffer - literally, every single time the ALSA thread needs data, the synth thread has delivered it, no exceptions. And that's the biggest change since thread priorities got fixed, the synth never goes walkabout for multiple ms at a time. But my nanosleep in my highest priority thread wakes up very, very late, and I'm puzzled by that. Here's a dump of what I'm seeing - I'm capturing a histogram of 'lateness', so I snapshot the time, do a nanosleep, resnapshot the time, and index the table by how many microseconds late I have woken up, i.e. (now-then)-sleeprequest. Each bin is 256us wide, so 'late[0]' is anywhere from zero to 256 us of lateness. And as you can see below, there's a big spread up to some 2ms, and even worse, some outliers are up at worse than 8ms.
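The measurement itself is easy to reproduce. The sketch below is a Python stand-in, not the code from the synth: it uses time.sleep rather than nanosleep, and all the function names are hypothetical. The clamping of very late wake-ups into the last bin is an assumption based on the dump stopping at late[31].

```python
import time

BIN_WIDTH_US = 256  # from the post: each bin is 256us wide
NUM_BINS = 32       # the dump shows late[00] .. late[31]


def bin_index(lateness_us):
    """Map a lateness in microseconds to a histogram bin.

    Assumption: negative values go to bin 0 and anything beyond the
    last bin is counted in the last bin.
    """
    index = int(max(lateness_us, 0)) // BIN_WIDTH_US
    return min(index, NUM_BINS - 1)


def measure_lateness(sleep_request_us, iterations):
    """Sleep repeatedly and histogram how late each wake-up was."""
    late = [0] * NUM_BINS
    for _ in range(iterations):
        then = time.monotonic_ns()
        time.sleep(sleep_request_us / 1_000_000)
        now = time.monotonic_ns()
        # (now - then) - sleeprequest, all in microseconds
        lateness_us = (now - then) / 1_000 - sleep_request_us
        late[bin_index(lateness_us)] += 1
    return late


def format_bin(index, count):
    """Format one line in the same layout as the dump."""
    return f"late[{index:02d} = {index * BIN_WIDTH_US:04d}]={count:07d}"


if __name__ == "__main__":
    for i, count in enumerate(measure_lateness(360, 200)):
        print(format_bin(i, count))
```

The numbers this prints depend entirely on the machine, scheduler policy and load, so they are not comparable with the dump unless the thread is run at the same real-time priority.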
So in summary, I ask the CPU, whilst running at the highest priority I'm able to muster, to go sleep for 360us, and I wake up 8ms later.
The nasty outliers only happen during program startup, where the SD card is being read hard, and if I turn off the GPU thread so I'm never doing any OpenGL there's a slight shift to faster behaviour, but still this spread of up to 3ms happens.
I do wonder if this is finally a case for needing an RT kernel?
late[00 = 0000]=0185248
late[01 = 0256]=0001388
late[02 = 0512]=0000077
late[03 = 0768]=0003682
late[04 = 1024]=0058063
late[05 = 1280]=0018317
late[06 = 1536]=0003403
late[07 = 1792]=0000089
late[08 = 2048]=0000020
late[09 = 2304]=0000005
late[10 = 2560]=0000011
late[11 = 2816]=0000019
late[12 = 3072]=0000002
late[13 = 3328]=0000001
...
late[23 = 5888]=0000000
late[24 = 6144]=0000000
late[25 = 6400]=0000000
late[26 = 6656]=0000000
late[27 = 6912]=0000001
late[28 = 7168]=0000001
late[29 = 7424]=0000000
late[30 = 7680]=0000000
late[31 = 7936]=0000008
Reading the post I'm afraid I'm more puzzled by the call to nanosleep() at all. Are you really calling a sleep function directly, or is it an indirect call within the ALSA library?
Assuming there's nothing in the RPi drivers to make ALSA in some way "bad" then I'd normally run audio code like this with one thread and snd_pcm_wait() to get the thread to sleep until the hardware was ready.
If you want to keep the 64 sample DSP grain size then you can run the single ALSA thread as eight 64 sample buffers (rather than double buffering 256 sample buffers).
Hi
Apologies for barging in here, I can't find a contact email address for you. I've been following your blog for a while, I think what you're doing with Piana is really cool.
I'm helping to organise the Edinburgh Mini Maker Faire in April, would you be interested in exhibiting/demoing Piana? Here's more stuff on the Faire: http://makerfaireedinburgh.com/call-for-makers/
Cheers,
Al
Hi Al
Good to hear from you. Love Edinburgh (as does my wife) but it's a heck of a distance ...
ping me on pisynth at omenie dot com
cheers
Phil
Yes, I am explicitly calling sleep. The double buffer (between producer and consumer) is 64 samples in size, the 256 sample buffer is the USB audio device FIFO. It's actually a triple buffer of 64 samples in that the synth computes into a private buffer then copies into the double buffer, but that's a detail that is trivially optimizable away if necessary.
But I am certainly open to any suggestions to improve this. My ALSA experience is extremely limited. I'm more used to Core Audio, and on the iPad when I sleep for 320us it pretty much sleeps for 320us, any overruns are in microseconds not milliseconds ... so in summary, I am definitely open to suggestions on how to modify this, and if snd_pcm_wait is a better way to do the backend I'll take a look at refactoring it. But even if there are better ways to achieve this, the approach I've taken is reasonable, and I think a request issued at the highest level of priority to sleep for 320 us should not result in a wake up 8ms later.
By the way, when you say 'normally', are you talking a reasonable performance PC or a low-performance embedded device like the Pi? The key issue is responsiveness under thread rescheduling - irrespective of whether it's a single thread with multiple buffers or multiple threads, at some point the thread feeding the FIFO will get swapped out, and when the FIFO gets low it needs to get swapped back in fast enough to not underrun.
Damnit, blogspot seems to hate me (I always seem to lose the first post and have to retype it).
My first post was elegantly structured, this one is short(ish) and terse...
1. Low performance embedded (set top box and BD-ROM). We know about these
sort of latencies (though they've got a lot better in the last few years)
but, unless we're doing karaoke we can just use long buffers to survive
these latency spikes without any artefact.
2. You are totally right, using snd_pcm_wait() will not improve scheduler
response. It merely simplifies the threading model a bit.
3. In terms of absorbing one off latency spikes (8-1)*64 is better than
(2-1)*256.
4. USB might conspire to prevent interrupt every 64 samples (my memory is
muddy here, USB has limited interrupt rates, but 64 samples will not
exceed what I think the max interrupt rate is).
5. rt kernel is a good idea...
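To put numbers on point 3, here is a small Python sketch. It assumes the 44.1kHz rate from the post and reads "(n-1)*size" as the audio still queued while one buffer is being refilled; the function names are invented for illustration.

```python
SAMPLE_RATE_HZ = 44_100  # rate quoted in the post


def spike_headroom_frames(buffer_count, buffer_frames):
    """Frames still queued while one buffer is being refilled: (n - 1) * size."""
    return (buffer_count - 1) * buffer_frames


def frames_to_ms(frames, rate_hz=SAMPLE_RATE_HZ):
    """Duration of `frames` audio frames in milliseconds."""
    return frames * 1000 / rate_hz


for count, size in ((8, 64), (2, 256)):
    headroom = spike_headroom_frames(count, size)
    print(f"{count} x {size:3d} frames: {headroom} frames of headroom, "
          f"{frames_to_ms(headroom):.2f} ms")
```

Both layouts hold 512 frames in total, but the eight small buffers leave 448 frames (about 10.2ms) to ride out a scheduling spike against 256 frames (about 5.8ms) for the two large ones.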
Hi Daniel
Yes, blogger can be a bugger! Especially when it wrecks your piece of TLS wordmanship and forces you into Daily Mail mode in time-is-short / cannot-be-arsed frustration ...
The original structure (and it's a keystroke away from being this again) was 8x 64 buffers (or at least nx 64, n settable at boot time), but without a fat buffer at the back underruns always happened, so I just tied it into 2 buffers between the two threads and a much deeper FIFO threshold. I'm beginning to think it really does need an RT kernel, but apparently the USB subsystem on the Pi SOC is a bit of a nightmare, and ditto the SD card, so it may be that an RT kernel would only partially solve the problem, and that the only way to defend against this is to have a deeper threshold on the ALSA FIFO than I would like.
And the thread structure is pretty complex anyway - well, compared to a car infotainment system it's trivial, but there are 5 to 8 threads depending on the configuration - so it actually made coding the thing much easier, and offered more flexibility, to separate out synthesis and ALSA playback with a tiny bit of elastic between them and a bigger piece of elastic within the ALSA playback thread.
And an explicit 2 buffers between threads makes keeping control of multiple concurrent synths easier (e.g. a drum machine plus 2 mono PD synths on one Pi), in terms of consistency of fullness of the buffers and hence maximum availability of buffer space for load spikes. I didn't explain that well but I know what I mean ...
All of which obviously stems from the original design decision anyway, but it's now much easier to have an explicit double buffer between each synth and the ALSA thread.
By adding extra buffering between ALSA and your engine all you are doing is
a) making your life more complicated
b) adding additional latency (and possibly fooling yourself about the real synth latency).
If I understand your explanations correctly, you have a 64 sample latency directly on the ALSA side and add 256 samples in your buffering.
If I were you, I'd dump all the intermediate buffering and work straight from the ALSA callback. That is where you SHOULD compute the audio. If you underrun, simply increase the native ALSA buffer size until you get none and then you will know exactly your latency and won't have to do all that juggling. I use it on the RPi through RtAudio and it runs just fine.
On the thread count level there's no real reason to have much more than 2: one for the audio, the other for user interaction.
hope this helps
Marc.
Marc
I want to keep the architecture the same on this and all other supported platforms for ease of maintenance, so I need the computation to be in threads rather than a single callback to take advantage of multicore machines. OK, 'need' is a bit strong, but for me personally this is way easier managed in a bunch of threads.
FYI if it helps in comparison with what you are doing, I wait for the ALSA FIFO to drain until there are no more than 64 frames remaining, at which point I grab from the synth 2 packets of 64 samples - these come out of the double buffer, so worst case latency is
64 samples being computed in the synth
2 64 sample buffers
64 sample FIFO draining
so 320 samples. IIRC you are running at 256 sample latency in your synth, so I'm a little behind. On a quad-core CPU running 4 synth engines I grab 4 sets of double buffers on the FIFO hitting its low water mark then mix the buffers, so am able to keep 4 cores busy with synthesis.
Interestingly if I wait for the FIFO to drain then try to grab just one buffer and loop, I always underrun. Entertaining, eh?!
Unless you run a heavy synth, if it runs directly from the audio loop on the RPi, it should pretty much do it on any other machine you might want to use, so there's no need for extra-core processing.
FWIW, my system allows the buffer size to be set at startup, so it really depends on the type of machine, the synth that is instantiated and the polyphony you want to achieve.
I just like the idea of going as low as you can and the extra simplicity in code handling ;)
With you on all the above, but having a single application architecture where on different platforms I can just instantiate more synths in more threads and exploit more cores was too much of a temptation. Increased complexity was minimal. Well, minimal in theory ... the number of f*ckups I've managed to squeeze in says more about my distractedness during the project than about any code complexity!!
My previously razor-sharp mental arithmetic has descended into the realms somewhere beneath CSE grade 1, apparently - 64 + 64x2 + 64 is 256, not 320.
Old age - there's no defence against it, it's as unstoppable as Tyson in his prime - it just keeps on coming, and soon it will just get you.
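For the record, the corrected worst-case sum can be checked with a few lines of Python. The stage labels are paraphrased from the breakdown earlier in the thread and the 44.1kHz rate is the one quoted in the post; the variable names are just for illustration.

```python
SAMPLE_RATE_HZ = 44_100  # rate quoted in the post

# Worst-case stages, in frames, as listed in the thread.
stages = [
    ("block being computed in the synth", 64),
    ("double buffer (2 x 64)", 2 * 64),
    ("ALSA FIFO draining to its low water mark", 64),
]

total_frames = sum(frames for _, frames in stages)
total_ms = total_frames * 1000 / SAMPLE_RATE_HZ

for name, frames in stages:
    print(f"{frames:4d} frames  {name}")
print(f"{total_frames:4d} frames  total, {total_ms:.2f} ms at {SAMPLE_RATE_HZ} Hz")
```

So the worst case comes out at 256 frames, a little under 6ms.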
Question...: WHEN?!?
When what?