Tuesday, 26 August 2014

I just had to wrestle another few percent out of PIANA

It just wasn't fast enough on a 950MHz device, and I really want to target 950 rather than 1GHz. So more tweaking and tuning while watching the non-event that was the Guardian Kate Bush liveblog.

Not yet tested on a Pi, but I think I have got another 6% or so out of it. For sure its 'low load' performance will now be much better, but that's of less interest than the key question - 'will it be able to play Popcorn with 5-10% CPU left over for a GUI, clocked at 950MHz?'

Testing in about 20 minutes I think.

UPDATE : I broke my build system, had to wait overnight to run tests. Which are positive - here's PIANA running Popcorn at 950MHz, all synths active and bashing away - and I have 12% of the CPU free, as evidenced by the two 'thrash' threads. This was captured during one of those 'double time drumming' chunks of the track, where CPU load is highest.


The big question now is, is this 'under 90%' of the un-overclocked state, or the overclocked state, given how the Pi CPU governor works? So now I need to run the same test at 800MHz and see if I run out of puff.

Tried it - 800MHz falls down in a heap when the drums come in, which is where not only do I get 4 more notes of polyphony, but they are all filtered and hence more expensive. So can't do 800MHz. 900MHz - sort of worked, sort of almost. No audio drop outs that I could hear, but that tiny drop in CPU relative to 950MHz is enough for the MIDI timing to go a tiny bit astray, as the CPU load squeezes out MIDI response in favour of synthesis. 

So there you have it. We look reliable at 950MHz - but there is no GUI - and very nearly reliable at 900. So as long as the GUI refresh is kept at a lower priority than MIDI input, it should pass the Popcorn test if nothing else. 

Also it's worth pointing out, based on the screen grab above, not only is 'top -H' eating up about 2% of my precious CPU, but I have 2 SSH sessions active - 3 actually, as the file system is SSH-mounted - so the network will be getting in my way. 

And one final thing - I just re-ran at 1GHz / full-on Turbo overclocking, and the performance difference is non-linear, which is to be expected - systems do collapse non-linearly as they hit saturation, so no surprise that a non-linear amount of CPU is freed up by a wee bit more clock rate - but even so, nice to see almost 30% of the CPU free all the time. Sort of all the time - occasional dips to 29% free. 

And this stupid denormal issue has not yet gone away completely ... I see about 4% CPU variance after I play then stop vs no sounds at all, so there is still work to be done there. But when I run in fixed-point I don't have enough performance - maybe the ARM code exiting the compiler isn't quite optimal for fixed-point, but whatever, I need to stay on floats now. 

One more tweak has resulted in more reliable behaviour with slightly fewer explicit exponent checks - I now clamp floats at strategic points in the pipeline to a rather hideous 2^-24 - so I now get float performance with fixed-point precision, as all my old audio work was done in Apple's 8.24 format.
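For the curious, here is a minimal sketch of what a clamp like that might look like. It assumes 'clamp' means 'treat anything quieter than 2^-24 as silence'; the name clampTiny and the macro are made up for illustration and are not lifted from PIANA.

```c
/* Hypothetical helper: 2^-24 is one LSB of an 8.24 fixed-point sample,
   and it sits a very long way above the float denormal range. */
#define TINY_FLOAT 0x1p-24f   /* C99 hex float literal for 2^-24 */

static inline float clampTiny ( float s )
{
    /* Anything strictly inside (-2^-24, +2^-24) becomes exactly zero. */
    return (s < TINY_FLOAT && s > -TINY_FLOAT) ? 0.0f : s;
}
```

A plain magnitude compare like this needs no bit twiddling, which is presumably why fewer explicit exponent checks are needed.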

Tuesday, 19 August 2014

Making the most of the Pi's limits

With maximum overclock ('Turbo') I can synthesize reliably 8 notes of polyphony, with 3 or 4 of these filter-enabled. I think an 8-note polysynth with the filter on will stutter, sadly. These 8 notes (or monosynths, however you want to think of them) all have to be in 'cheap' mode (44.1kHz rendering, no oversample) or the Pi will run out of grunt, but 'cheap' mode sounds pretty good. Not as good as 'expensive' mode, but the up front aliasing minimization makes it sound sweet enough. The two recent posts with audio featured all synths in 'cheap' mode.

This 8 note limit isn't bad to be honest, because the synth has 128 rapidly-accessible presets, and for the cost of 2 audio packets worth of rendering a preset can be swapped in. So mid-song, if you issue a 'Program Change' on MIDI channel 4, synth 4 will transplant its personality for a whole new one, within just 3ms, with no interruption of audio rendering. The 2 audio packets is a block on further incoming commands, not a block on rendering. So worst case the next Note On might trigger 3ms late.
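The 3ms figure is just packet arithmetic. A tiny sketch, assuming the 64-sample packet size mentioned in the Popcorn post and the 44.1kHz 'cheap' rendering rate (the function name is hypothetical):

```c
/* Time, in milliseconds, covered by a number of audio packets. */
static inline double packetsToMs ( int packets, int samplesPerPacket, double sampleRate )
{
    return 1000.0 * packets * samplesPerPacket / sampleRate;
}
```

With those assumptions, packetsToMs(2, 64, 44100.0) comes out at a shade over 2.9ms, which is where 'within just 3ms' comes from.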

Basically you can swap synths in and out between musical phrases. This is computationally less expensive than leaving an extra synth running and hardly using it, so within a fixed polyphony gives you access to a much wider audio palette. Of course any reverb tail from the old synth will continue to sound after you swap it out, so the transitions sound effective. If you are careful with what is sounding when, you can potentially get a big, complex arrangement out of these 8 notes.

Which is nice.

Being Boiled on a Raspberry Pi - excerpt

Those of you with long memories may recollect that I had 2 sign-off tests for my little baby synthcluster. One was that it should be able to perform a 'passable' glamsynth version of Mama Weer All Crazee Now. Whatever 'passable' means in that context - that was always sort of the comedy goal, driven by my exercise music loop. 

But this - THIS - was the serious one - can PIANA running on a 'humble' (read 'feeble'!) Raspberry Pi do a reasonable emulation of one of the most iconic and classic analog synth tracks from 1978? And most important of all - hence the Slade thing - can it pass the glam rock handclap test?!?

What do you think? Listen to the voice of Buddha and judge - just 5 synths here, just 2 of them pitched, 3 percussive. Not bad is it, for £20 of hardware? No samples, all computed, made with code indeed. Synth music totally beats a plastic 3D printed bracelet ... 

Monday, 18 August 2014

Beware denormalized floats on the Pi folks

Word of warning folks. My project had what I thought were all the right settings, i.e. -Ofast -ffast-math - but still I was being absolutely hammered by a performance problem associated with denormalized floats. When injecting the Popcorn MIDI into the Pi, I print out an occasional 'last packet rendered at effective %f samples/sec', and most of the time everything was comfortable, running between 60-75k, bottoming out mid-50s. But as soon as I hit stop, performance fell through the floor - 40.5k, 30k, suddenly a remarkably horrible 2.3k samples/sec as ALSA also got annoyed with me for failing to deliver packets in a timely manner. As soon as I introduced a denorm fixer into 2 key places - the reverb unit and the stereo delay - it all sorted itself. And obviously, reverting to my previous fixed-point delay and reverb implementations also worked great.
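If you want the same kind of 'effective samples/sec' readout in your own project, something along these lines should do. This is a sketch under assumptions, not PIANA's actual instrumentation: the 64-sample packet size and all the names here are hypothetical, and renderPacket stands in for whatever renders one packet.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define PACKET_SAMPLES 64   /* assumed packet size */

/* Seconds elapsed between two CLOCK_MONOTONIC readings. */
static inline double elapsedSeconds ( const struct timespec *start, const struct timespec *end )
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) * 1e-9;
}

/* The sample rate this packet's render time would sustain.
   Returns 0 if the timer resolution was too coarse to measure anything. */
static inline double effectiveRate ( double renderSeconds )
{
    return (renderSeconds > 0.0) ? PACKET_SAMPLES / renderSeconds : 0.0;
}

/* Time one call of a (hypothetical) packet renderer. */
static inline double timePacket ( void (*renderPacket)(void) )
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    renderPacket();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return effectiveRate(elapsedSeconds(&t0, &t1));
}
```

Anything comfortably above the output rate means the packet was rendered faster than real time; anything below it means ALSA is about to get annoyed.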

So, buyer beware - I don't know how to set compiler flags to stop this happening in other projects, but at least I have an emergency drop-in denorm fixer, which looks like this -


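#include <stdint.h>
#include <string.h>

/* Flush denormals, plus the smallest normal binade, to zero. */
static inline void denormFloat ( float *s )
{
    uint32_t bits;
    memcpy(&bits, s, sizeof bits);        /* read the bits without breaking strict aliasing */
    uint32_t exp = bits & 0x7f800000u;    /* isolate the 8-bit exponent field */
    if (exp < 0x1000000u) (*s) = 0;       /* exponent field 0 or 1: flush */
}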

and which, if my exponent head is working correctly, has a whole power of two guard band in it before going denormalized.
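A quick way to check the guard band claim: restate the same exponent test as a predicate and poke it with values either side of the boundary. The name wouldFlush is made up for this sketch.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Same exponent test as denormFloat, restated so this stands alone:
   returns 1 if the value would be flushed to zero, 0 if it would be kept. */
static inline int wouldFlush ( float s )
{
    uint32_t bits;
    memcpy(&bits, &s, sizeof bits);
    return (bits & 0x7f800000u) < 0x1000000u;
}
```

FLT_MIN is the smallest normal float, with an exponent field of 1, so wouldFlush(FLT_MIN) is true, while wouldFlush(2 * FLT_MIN) is false. Everything from FLT_MIN up to just under twice that gets flushed while still perfectly normal, which is the power-of-two guard band.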

p.s. this isn't a Pi 'problem' apart from the question 'are compiler settings not being honoured' - my Mac does exactly the same, so the consistent IEEE 754 implementation between Intel and ARM is to be commended. But it is a massive Pi problem if it hits you, because you have absolutely no performance headroom. On the Mac I can afford to drop 20x and will never miss a packet deadline. So be pure, be vigilant, behave.

Sunday, 17 August 2014

RESULT!!!

And it is done.

May I present to you, one Raspberry Pi Model B, one $5 USB MIDI interface, one £20(ish) Behringer USB audio interface, 7 Virtual Analog synthesizers, 9 notes of polyphony, a bunch (4 or 5?) of stereo delays, a global reverb straight out of the upcoming Jordantron, and ladies and gentlemen - Popcorn!

Recorded straight out of the phono outputs of the Behringer into my Mac, no processing, exactly the bytes emitted by the Pi. Here we go ...




No glitches - amazing. Worst-case micropacket of 64 samples rendered at 47,300 samples per second. I instrumented every 64-sample chunk, the worst one took 93% of the Pi to calculate, leaving very little room for the other 8 threads ... thank heavens for a bit of elastic buffering, eh? But the point is, there is no headroom AT ALL here. But it worked. Hell, the whole thing works - all those synths, all those delays, reverb, all on a teeny tiny Pi.
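The two numbers above hang together if you assume a 44.1kHz output rate: 44,100 divided by 47,300 is about 0.93. A sketch of that bookkeeping, with hypothetical names and no claim that this is how PIANA tracks it:

```c
/* Tracks the slowest effective render rate seen so far (0 = nothing seen yet). */
typedef struct {
    double worstRate;
} RenderStats;

static inline void noteRate ( RenderStats *s, double rate )
{
    if (s->worstRate == 0.0 || rate < s->worstRate) s->worstRate = rate;
}

/* Fraction of real time spent rendering: 1.0 means no headroom at all. */
static inline double renderLoad ( double effectiveRate, double outputRate )
{
    return outputRate / effectiveRate;
}
```

Feed noteRate the per-chunk rate after every 64-sample chunk, and renderLoad(stats.worstRate, 44100.0) gives the worst-case share of the CPU.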

Two years on, my work here is almost done.

UPDATE : for those who care, the difference between this version and the last one is that I went back to a fixed-point reverb and fixed-point delay. The performance crash was happening - bizarrely - after all notes stopped being voiced. I could see by inspection that all the oscillators were idle, so the only thing consuming any compute was the reverb unit, and running Instruments on the Mac yielded the same result - suddenly, terrible performance after all the notes stopped. So I am thankful for consistent floating-point implementations between Intel and ARM! Floating-point was entering some bogosity via denormals and causing a performance plummet, even though I believed I had all the 'force denormals to zero' flags set. So, a quick revert to fixed-point made it all work (single compiler flag, in the makefile hoorah!) and once it was clear that this was causing the problem I dug in and I have ended up having to manually flush denormals at the input side of reverb and the delay. And now I can build with floating-point on and it still works, without the sudden drop to 20% of performance. Mighty bogus though - I have -ffast-math on and -Ofast, which according to my cursory reading around should deliver minimum checking, minimum adherence to spec and maximum performance. No such luck.
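To show why the trouble starts after the notes stop, here is a toy feedback delay with the flush on the input side. It is a sketch only: the struct, the names and the tiny 8-sample line are invented for illustration, and the real delay and reverb are far more involved.

```c
#include <stdint.h>
#include <string.h>

#define DELAY_LEN 8   /* tiny line, just for illustration */

typedef struct {
    float buf[DELAY_LEN];
    int   pos;
    float feedback;   /* 0..1, how much of the output is fed back in */
} Delay;

/* Returns 0 for denormals and the smallest normal binade, else the value. */
static inline float flushDenorm ( float s )
{
    uint32_t bits;
    memcpy(&bits, &s, sizeof bits);
    return ((bits & 0x7f800000u) < 0x1000000u) ? 0.0f : s;
}

/* Process one sample: read the delayed value, write input plus feedback. */
static inline float delayTick ( Delay *d, float in )
{
    float out = d->buf[d->pos];
    d->buf[d->pos] = flushDenorm(in + out * d->feedback);   /* flush at the input side */
    d->pos = (d->pos + 1) % DELAY_LEN;
    return out;
}
```

Once the input goes silent, each trip round the loop multiplies the tail by the feedback factor, so it shrinks forever. Without the flush it would eventually decay into the denormal range and sit there; with it, the buffer settles to exact zeros.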

p.s. the 'Tau' platform plays this sequence with 80% of both cores free. As opposed to 7% of the single core free. It is pretty immense. And I don't need to manually flush denormals. This whole thing with denormalization / compiler settings  / performance in the toilet remains a puzzle, and I hate having makefile variations like this, but there you go.

Friday, 15 August 2014

Popcorn

This tune was the first time I ever heard a Moog. Fitting that I should try to get my Pi to play it. Almost successful ... but I'm assuming it's a momentary CPU load spike that lets it down, then RtAudio and ALSA get their knickers into a right old twist and it all goes to hell in a handbasket.


It's all in the video description, but for those who can't be bothered to click through, Logic is sequencing and recording the Pi audio output. The Pi is being fed USB MIDI and is delivering USB audio. 6 synths are active, one of them is three note polyphonic for total 8 note polyphony, and 2 other synths are configured but doing nothing (as I miscounted when I set it up!). On screen are Logic, and a terminal with 2 shells open, one to launch piana and one to run top -H to keep an eye on CPU burden. And again, no samples, everything is computed live, everything has a delay but these are turned way down, reverb is also down low, the BPCVOs are doing their alias management thing, doing wild and crazy Phase Distortion, but only the two percussive instruments have a filter turned on. 

Enjoy!

UPDATE : I'm onto the track of something quite strange here. It seems to not be ALSA's problem, nor RtMidi (although USB MIDI still eats my cycles like they are going out of fashion), there is some state being entered inside the synthesis loop that is consuming way too many cycles, and the problem only manifests on the Raspberry Pi, because it has so few cycles to spare. Testing and developing on a woefully underpowered platform can be a bloody good thing ... 

A bit more in-the-loop diagnostics indicate that the thing runs fine apart from a couple of sections of this song, where rendering performance suddenly plummets, even though no oscillators are sounding. Something is going subtly wrong internally, and it may take time to find, but I'll track it down. And once I do, there will be more popcorn, without hiccups.

Thursday, 14 August 2014

USB MIDI is still ugly on the Pi

Here's a snapshot of PIANA running, playing the usual Fox on the Run, but this time configured with a 'Plump Lady' configuration, eliminating a synth ('Eerie Noise' has gone) and making Phil Collins monophonic rather than duophonic, to give the thing more headroom. Basically, can I get this tune to play in a recognisable way on the Pi?

The top snapshot in yellow is the steady state, bottom one in green is a couple of seconds after I power off the attached USB MIDI keyboard. It's not sending anything - no timing, no active sensing - but now it's not connected I'm consuming an entire 9% less CPU. 9% of the CPU being burned just because a device is attached. Yikes ...


I'm starting to wonder if the USB issues I'm seeing with both audio and video are brown-outs caused by the Pi not being able to get enough power into the USB devices. I'll run it through a hub tomorrow morning before putting it away for another month. But the most recent PIANA / POLYANA mission is successfully accomplished, which is get the codebase back onto linux, get it running, get a handle on performance. So now it can go to sleep again quite happily for four more weeks. 

Wednesday, 13 August 2014

Performance snapshot

This is from the unPi / Tau - on the Pi this workload (11 synths, 2 active playing left hand and right hand parts of Fox on the Run) consumes 80% of the CPU, a chunk of that I'm sure down to the poor USB implementation. Here 80% of each core on the dual-core machine is free. So instead of 20% of an ARM11 free, I have 160% of a Cortex A9. Big, big difference.

Brilliant. This will make the most A-FLIPPING-MAZING multisynth / LeagueStation. And it has enough performance to throw a number of mellotron-like sample players on it at the same time, for Linn drums plus Sopranos plus quality sampled piano plus tons and tons of synths, everything with a private delay and with a global reverb chucked in. Should be good for at least 30 notes of full-on synth polyphony, probably 200 notes of sampled polyphony, or mix'n'match.

Totally, totally, totally flipping brilliant.



For the nerdcurious : there are 2 dummy workloads 'thrash1' and 'thrash2', to ensure a dual-core machine is kept busy and that percentage measurements are truly capturing percentage of peak. 

Three synth threads - all called 'synththread' - are there to loadbalance the synthesis work. They do all the oscillator, EG, LFO and mod matrix stuff. The load balancing algorithm is far from ideal - I think it distributes work across threads per-synth rather than per-oscillator - but it does at least scatter chunks of synthesizing across cores, as evidenced by the dump above. These threads also do the per-synth delay unit. The thread called 'piana' is mainly a high hit because it grabs all the outputs from the synth threads, which are separated out into a 'raw' and a 'reverb send', and reverbs them before combining them for output. I'm not entirely sure where the ALSA callbacks are accounted for, probably in 'nativeaudio' (which is the RtAudio / RtMidi launcher) as I can't imagine any other reason for that to show up high in the list. 'feedcallback' is a dumb little thing that isolates the reverb from the elastic decoupler in the ALSA callback, buffering 64 sample pairs at a time, so adds minimal latency. And 'synthMIDI' handles all MIDI packets, routing them to the right synth, and actually - not sure why I did it this way! - goes grubbing around inside the oscillators to hand over notes and controllers. There must be a good reason ... 
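A sketch of what per-synth balancing could look like, to make the 'far from ideal' point concrete. This is a guess at the general shape, not the actual PIANA algorithm: the greedy least-loaded rule, the cost measure (note count) and every name here are assumptions.

```c
#define NUM_THREADS 3

/* Assign each synth, whole, to whichever thread is least loaded so far.
   cost[i] is a hypothetical per-synth cost, e.g. its notes of polyphony.
   threadOf[i] receives the chosen thread; load[t] the total per thread. */
static inline void assignSynths ( const int *cost, int numSynths, int *threadOf, int *load )
{
    for (int t = 0; t < NUM_THREADS; t++) load[t] = 0;

    for (int i = 0; i < numSynths; i++) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (load[t] < load[best]) best = t;   /* ties go to the lower thread */
        threadOf[i] = best;
        load[best] += cost[i];
    }
}
```

The weakness shows up straight away: a three-note polysynth lands on one thread in one lump, so with costs of 3, 1, 1 and 1 the loads end up 3, 2 and 1 rather than an even 2, 2, 2. Balancing per-oscillator would even that out.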

But it all hangs together nicely and sounds glorious. Audio to follow soon. 

UPDATE : it plays the Fat Lady with pretty much the same load - I'm seeing 80% free on each core consistently. The Pi just dies. 

Tuesday, 12 August 2014

So damn close - but not quite

Well, PIANA's up and running again, building natively on the Pi, and making noises. And sounding divine. RtAudio for sound, RtMidi for MIDI, both MIDI and audio over USB, some 10 lines of code that are wrappered in #ifdef iOS, so the portification was a giant success, and - dammit dammit dammit!!!! - the Pi doesn't quite have enough performance to do the Fat Lady.

The workload shown by top -H quickly gets up into the 90s, then audio starts to stutter and break up. And this with the full-on 1 GHz turbo mode. I suspect a big pile of effort optimizing the linux image may squeeze a bit of reliability back, but really, I've been in the world of diminishing returns for a long time, this will just have to do. No Fat Lady. Should be comfy for 8 notes of 'no filter' polyphony, maybe 8 of 'low quality filter', some less ambitious Human League-lite performances will still be achievable from a single Pi, but at this point, it will do what it does. There shall be no further tuning.

HOWEVER - my Tangerine Tau is TOTALLY ROCKING. It doesn't have a snd_seq module in the kernel so RtMidi can't initialize a MIDI input port for me, but my test sequence (yes, it's still Fox On The Run) runs with the Fat Lady configuration, and is storming along.

A reminder - the Fat Lady configures 11 synths, with IIRC 14 notes of polyphony, so 14 oscillators are trying to sound. All but 2 of them are silent but still burning cycles as they do their thing, because they are never quite certain whether they are silent or not (envelope generators keep ticking until the amplitude gets down to 1 part in 64k, delays run forever), so even with no notes playing it's a big workload. Playing Fox on the Run the Pi burns about 70% of itself. The Tau (no, that isn't its real name!) is showing around 24% utilization. So the Tau is turning out to be about an iPad 2, as expected.
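To put a rough number on 'keep ticking', here is a sketch that counts how long a simple exponential release takes to reach that 1-in-64k threshold. The per-sample multiply is an assumed envelope shape, not necessarily the one PIANA uses, and the names are hypothetical.

```c
#define SILENCE (1.0f / 65536.0f)   /* "1 part in 64k" */

/* Count the samples an exponential release keeps ticking before the
   amplitude drops below the silence threshold.
   decayPerSample must be below 1.0 or this never terminates. */
static inline long samplesUntilSilent ( float amplitude, float decayPerSample )
{
    long n = 0;
    while (amplitude >= SILENCE) {
        amplitude *= decayPerSample;
        n++;
    }
    return n;
}
```

With a decay factor of 0.999 per sample, a full-scale note takes roughly 11,000 samples to get there, about a quarter of a second at 44.1kHz, and the oscillator behind it costs the same per sample the whole time.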

So - big disappointment on the Pi front, it was always going to be a stretch but I had hoped it would make it.

Sunday, 10 August 2014

... it seems the only way

Portified. Now running on command-line OSX, MIDI in via keys, audio out via Behringer USB audio thing. Still on GCD semaphores as OSX semaphores are in a state ...

And now working with named semaphores. Anonymous semaphores simply don't work on OSX - I tend to forget that. But on a positive note, RtAudio and RtMidi are completely and utterly brilliant. It all just worked. Sounds came out, keyboard inputs were recognised, all for a few dozen lines of code.

This should just compile and run on a Pi ... literally, zero source code changes. I should need to just grab all the files into one place and construct a giant CFLAGS line for all the #defines and there it will be - for the first time in over 12 months. Blimey. In fact the Pi is the least interesting platform right now, *really* interested to get it onto my Tangerine Tau!

Update update - still something amiss with my named semaphores, but on linux I shall be using anonymous semaphores anyway, so there's nothing to worry about. But annoying, I will have a whole half dozen lines of code that differ between OSX and linux.
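For anyone hitting the same wall, the half dozen differing lines can be hidden behind a small wrapper. A sketch, assuming POSIX semaphores on both sides; semCreate and semDestroy are invented names, not what PIANA calls them.

```c
#include <fcntl.h>
#include <semaphore.h>
#include <stdlib.h>

/* Hypothetical wrapper: named semaphore on OSX, anonymous everywhere else.
   Returns NULL on failure. */
static inline sem_t *semCreate ( const char *name, unsigned int initial )
{
#ifdef __APPLE__
    /* Note: if a stale semaphore of this name already exists, O_CREAT
       reuses it and 'initial' is ignored. */
    sem_t *s = sem_open(name, O_CREAT, 0600, initial);
    if (s == SEM_FAILED) return NULL;
    sem_unlink(name);   /* keep the handle, drop the name so nothing lingers */
    return s;
#else
    (void) name;        /* unused: anonymous semaphores have no name */
    sem_t *s = malloc(sizeof *s);
    if (s != NULL && sem_init(s, 0, initial) != 0) { free(s); s = NULL; }
    return s;
#endif
}

static inline void semDestroy ( sem_t *s )
{
#ifdef __APPLE__
    sem_close(s);
#else
    sem_destroy(s);
    free(s);
#endif
}
```

After that, sem_wait and sem_post are used identically on both platforms. On older linux toolchains this may need -pthread on the link line.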