Old 03-30-2021, 02:37 PM   #1
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9
WDL convolution and very small buffers

I've been successfully using WDL_ConvolutionEngine_Div for short impulses for a while now. Recently, I've needed to handle longer (4 seconds plus) IRs. Switching to Tale's WDL_ConvolutionEngine_Thread works on my reasonably fast Windows desktop, but I'm having trouble getting it to run without underruns on a less powerful Raspberry Pi 4.

At first I thought that the Raspberry Pi just didn't have enough processing power, but it isn't maxing out the CPU. And it works if I increase the processing buffer size to something large like 1024 (rather than my normal 32-sample buffer), but then latency is unacceptable.
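
To put the two buffer sizes in perspective, here is a minimal sketch that converts a buffer size to per-buffer latency. The function name is hypothetical, and the sample rate is an assumption, since the post does not state one.

```cpp
#include <cassert>

// Latency contributed by one buffer of `frames` samples, in milliseconds.
// The sample rate is an assumption; the thread only mentions 44.1 kHz in passing.
double bufferLatencyMs(int frames, double sampleRateHz) {
    assert(frames > 0 && sampleRateHz > 0.0);
    return 1000.0 * frames / sampleRateHz;
}
```

At an assumed 44.1 kHz, `bufferLatencyMs(32, 44100.0)` is about 0.73 ms and `bufferLatencyMs(1024, 44100.0)` is about 23.2 ms per buffer, before any interface or driver latency is added.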

Any ideas for other things I could try? Thanks!
Old 04-01-2021, 07:01 AM   #2
Ric Vega
Human being with feelings
 
Join Date: May 2020
Posts: 19

I've been using the Convolution Engine with up to 250,000 points per channel on a 2018 MacBook Pro 2.3 GHz Dual-Core Intel Core i5, and it works perfectly but it uses a big chunk of CPU. My buffer size is usually at 128, which I think has negligible latency. But a buffer size of 32 is way too small for this kind of algorithm.
Old 04-02-2021, 11:08 AM   #3
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

Quote:
Originally Posted by Ric Vega
But a buffer size of 32 is way too small for this kind of algorithm.

A buffer size of 32 samples is aggressive, but not at all unachievable on modern computers and audio interfaces.

I am running guitar fx processing using a 32-sample buffer without dropouts on a humble Raspberry Pi 4 with a USB audio interface using convolution for guitar cabinet impulse responses.
Old 04-02-2021, 11:39 AM   #4
Ric Vega
Human being with feelings
 
Join Date: May 2020
Posts: 19

Quote:
Originally Posted by ladron
A buffer size of 32 samples is aggressive, but not at all unachievable on modern computers and audio interfaces.

I am running guitar fx processing using a 32-sample buffer without dropouts on a humble Raspberry Pi 4 with a USB audio interface using convolution for guitar cabinet impulse responses.

Oh I see. But in that case you certainly don't need such a long IR. I'd say 10ms is more than enough to completely profile a guitar cabinet. I've been probing some digital cabs with a delta function and after 100 or 200 points (@ 44.1kHz) the IR is pretty much zero.
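
The delta-function probe described above can be sketched as follows. This is only an illustration: `probeWithDelta` uses a one-pole lowpass as a hypothetical stand-in for a digital cab, and `effectiveLength` and its threshold are assumptions, not anything taken from WDL.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Number of leading samples worth keeping: one past the last sample whose
// magnitude exceeds `threshold`. Returns 0 if no sample does.
std::size_t effectiveLength(const std::vector<float>& ir, float threshold) {
    assert(threshold >= 0.0f);
    for (std::size_t i = ir.size(); i > 0; --i) {
        if (std::fabs(ir[i - 1]) > threshold) return i;
    }
    return 0;
}

// Hypothetical device under test: a one-pole lowpass fed a unit impulse
// (1 at sample 0, 0 afterwards). The output is its impulse response.
std::vector<float> probeWithDelta(std::size_t n, float pole) {
    assert(pole >= 0.0f && pole < 1.0f);
    std::vector<float> out(n);
    float state = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float in = (i == 0) ? 1.0f : 0.0f;
        state = (1.0f - pole) * in + pole * state;
        out[i] = state;
    }
    return out;
}
```

Calling `effectiveLength(probeWithDelta(1024, 0.9f), 1e-4f)` reports how many samples of the probe response stay above the threshold, which is the length a trimmed IR would need. For scale, 10 ms at 44.1 kHz is 441 samples.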
Old 04-02-2021, 11:46 AM   #5
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

Quote:
Originally Posted by Ric Vega
Oh I see. But in that case you certainly don't need such a long IR.

Yes - the cab IRs work fine. I'm trying to add spring reverb, though, so I need longer ones...
Old 04-02-2021, 12:08 PM   #6
Ric Vega
Human being with feelings
 
Join Date: May 2020
Posts: 19

Quote:
Originally Posted by ladron
Yes - the cab IRs work fine. I'm trying to add spring reverb, though, so I need longer ones...

Unfortunately I'm not very technical, so I don't know how different our setups are, but I carried out a test and was able to process a stereo 250,000 point IR in my setup at buffer size 32 without problems. Maybe you could check your ProcessBlock and FFTConvolution functions to see that there isn't anything slowing the algorithm down?
Old 04-02-2021, 12:31 PM   #7
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

Quote:
Originally Posted by Ric Vega
I carried out a test and was able to process a stereo 250,000 point IR in my setup at buffer size 32 without problems.

It works fine for me on a fast computer, too. My issues are on a much slower Raspberry Pi 4.

For longer IRs, I'm getting inconsistent performance from buffer pass to buffer pass. Worst case is over 3x the CPU usage of the best case, and is enough to push me into dropout territory on the Raspberry Pi.
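
One possible explanation for this kind of pass-to-pass variation is that larger FFT stages only come due on some callbacks. The sketch below is a simplified model, not WDL's actual scheduling: it assumes stages whose block sizes double from the host buffer size, each running one FFT of twice its block size whenever its block fills, with FFT cost modelled as n * log2(n). All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rough relative cost of one FFT of size n; a model, not a measurement.
double fftCost(int n) { return n * std::log2(static_cast<double>(n)); }

// For each host callback of `hostBlock` samples, sums the modelled FFT cost of
// every stage whose block boundary falls on that callback. Stage k has block
// size hostBlock << k and uses an FFT of twice that size.
std::vector<double> costPerCallback(int hostBlock, int stages, int callbacks) {
    assert(hostBlock > 0 && stages > 0 && stages < 20 && callbacks > 0);
    std::vector<double> cost(callbacks, 0.0);
    for (int c = 0; c < callbacks; ++c) {
        const long samplesDone = static_cast<long>(c + 1) * hostBlock;
        for (int k = 0; k < stages; ++k) {
            const long block = static_cast<long>(hostBlock) << k;
            if (samplesDone % block == 0) {
                cost[c] += fftCost(static_cast<int>(2 * block));
            }
        }
    }
    return cost;
}
```

With `costPerCallback(32, 3, 4)`, the fourth callback (where all three stages fire) costs roughly 8.7 times the first under this model. Any scheme that does the large FFTs inside the audio callback would show spikes of this kind; moving them to a worker thread is what smooths them out.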
Old 04-02-2021, 12:41 PM   #8
Ric Vega
Human being with feelings
 
Join Date: May 2020
Posts: 19

Quote:
Originally Posted by ladron
It works fine for me on a fast computer, too. My issues are on a much slower Raspberry Pi 4.

For longer IRs, I'm getting inconsistent performance from buffer pass to buffer pass. Worst case is over 3x the CPU usage of the best case, and is enough to push me into dropout territory on the Raspberry Pi.

Try a 128-sample buffer size, or even 256, in that case. Latency shouldn't be a problem there.
Old 04-03-2021, 01:30 AM   #9
Tale
Human being with feelings
 
Join Date: Jul 2008
Location: The Netherlands
Posts: 3,252

Assuming the Raspberry Pi 4 is fast enough to pull this off (which I don't know, but it might), then maybe tweaking thread priorities would help? Ideally the worker thread needs to be of lower priority than the main audio thread, but higher than any GUI threads.
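
On a Linux system such as Raspberry Pi OS, one way to try this is with POSIX realtime scheduling. The following is a minimal sketch under that assumption; the helper names are hypothetical, the choice of SCHED_FIFO is an assumption, and the actual priority values of the audio and GUI threads depend on the audio backend in use.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Picks a worker priority strictly between the GUI and audio priorities when
// there is room; otherwise falls back to the GUI priority.
int workerPriority(int guiPriority, int audioPriority) {
    assert(guiPriority <= audioPriority);
    if (audioPriority - guiPriority < 2) return guiPriority;
    return guiPriority + (audioPriority - guiPriority) / 2;
}

// Requests SCHED_FIFO at `priority` for the calling thread.
// Returns 0 on success, or an error number (commonly EPERM when the
// process lacks permission to use realtime scheduling).
int setCurrentThreadFifoPriority(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}
```

The worker thread would call `setCurrentThreadFifoPriority(workerPriority(gui, audio))` once at startup and check the return value, since the request can be refused.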
Old 04-05-2021, 02:38 PM   #10
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

I'm pretty sure the issue is that the Raspberry Pi is only fast enough to do a WDL_ConvolutionEngine_Div of a certain size before it can no longer handle the larger FFT chunks in time for a 32-sample buffer.

Forcing maxfft_size to 2048 in WDL_ConvolutionEngine_Thread::SetImpulse() gives me a workable WDL_ConvolutionEngine_Div, but that makes for a smaller 4096 FFT size for the threaded convolution, so it is less efficient. Still, it lets me get stable 1-second convolutions at a 64-sample buffer, which is an improvement. Much past 1 second, though, and the threaded convolution gets too expensive.
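
A rough way to see why the smaller FFT is less efficient: in a uniformly partitioned FFT convolution, every output block needs one spectrum multiply-accumulate per impulse partition, so the work per sample grows with the partition count. The sketch below assumes each partition holds half the FFT size in impulse samples; that layout and the function name are assumptions, not a statement about WDL's internals.

```cpp
#include <cassert>

// Number of impulse partitions, assuming each one holds fftSize / 2 samples.
int partitionCount(int impulseLen, int fftSize) {
    assert(impulseLen > 0 && fftSize >= 2 && fftSize % 2 == 0);
    const int block = fftSize / 2;
    return (impulseLen + block - 1) / block;  // ceiling division
}
```

Assuming 44.1 kHz, a 1-second IR (44,100 samples) needs 22 partitions at an FFT size of 4096, while a 4-second IR (176,400 samples) needs 87. At an FFT size of 16384 the same 4-second IR needs 22, about the same count as the 1-second case at 4096.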

Does the threaded convolution's FFT size have to be twice the realtime WDL_ConvolutionEngine_Div size? I've tried making it a different (larger) size, but the code stopped working...
Old 04-05-2021, 04:20 PM   #11
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

I think the best solution may be to have a WDL_ConvolutionEngine_Div engine for both real-time and background processing, with an FFT size threshold to cut over from one to the other.

That seems to be the way this implementation is working:

https://chromium.googlesource.com/ch...bConvolver.cpp
Old 11-22-2021, 03:24 AM   #12
KeroLine
Banned
 
Join Date: Nov 2021
Posts: 10

Have you found a solution to your problem? Can you give me a hint to fix this?
Old 11-22-2021, 10:06 AM   #13
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

No, I haven't done any more work on this. I'm pretty sure that it can be improved, though.
Old 11-22-2021, 11:50 AM   #14
Justin
Administrator
 
Join Date: Jan 2005
Location: NYC
Posts: 13,697

A solution:

Run two separate convolutions:

1) Short (the first thousand or so samples of the impulse response), runs in the realtime thread directly. It can be brute force + a short FFT if using _Div, or just a short FFT if you can accept a little latency.

2) The rest of the impulse response, processed in a worker thread at lower priority. This can be processed at a larger FFT size or using _Div depending on how short/long the short impulse response is.

The second impulse response, since it is the remainder of the impulse and does not include the beginning, can be delayed (allowing time for the data to be transferred to the worker thread, processed, and transferred back)...
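
The reason the split works is that convolution is linear: convolving with the head and with the tail separately, then mixing the tail result back in delayed by the head length, gives the same output as convolving with the whole impulse. The sketch below demonstrates this with brute-force convolution only; it uses no WDL classes and no threads, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct (brute-force) convolution; output length is x.size() + h.size() - 1.
std::vector<double> convolve(const std::vector<double>& x,
                             const std::vector<double>& h) {
    assert(!x.empty() && !h.empty());
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size(); ++k)
            y[n + k] += x[n] * h[k];
    return y;
}

// Same result, computed as two convolutions: the head (first `split` samples
// of h) is applied immediately, the tail is applied separately and its output
// is mixed in `split` samples late.
std::vector<double> convolveSplit(const std::vector<double>& x,
                                  const std::vector<double>& h,
                                  std::size_t split) {
    assert(split > 0 && split < h.size());
    const auto mid = h.begin() + static_cast<std::ptrdiff_t>(split);
    const std::vector<double> head(h.begin(), mid);
    const std::vector<double> tail(mid, h.end());
    const std::vector<double> yHead = convolve(x, head);
    const std::vector<double> yTail = convolve(x, tail);
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < yHead.size(); ++i) y[i] += yHead[i];
    for (std::size_t i = 0; i < yTail.size(); ++i) y[i + split] += yTail[i];
    return y;
}
```

In a real implementation the delay of `split` samples is the time budget the worker thread has to receive the input, process it, and hand the result back.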
Old 11-22-2021, 04:37 PM   #15
ladron
Human being with feelings
 
Join Date: Oct 2020
Posts: 9

Thanks for the response, Justin.

I think what you are suggesting is pretty much what Tale's threaded version is doing. It uses _Div for the short section, and a regular FFT convolution for the background thread.

The issue I was having is that it seems to require the long FFT size to be twice the size of the short one. On the Raspberry Pi, I couldn't get a short FFT small enough to complete in time while having the long FFT be sufficiently large to be performant enough...
Old Yesterday, 04:53 PM   #16
Justin
Administrator
 
Join Date: Jan 2005
Location: NYC
Posts: 13,697

For the secondary processor you can use a _Div instance, which uses a short FFT to keep the latency manageable but a long FFT for the big tails.
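
To illustrate the idea of short FFTs near the start and long FFTs for the tail, here is a sketch of a non-uniform partition layout in which slice lengths double up to a cap. It is an illustration only: the doubling rule, the `Partition` struct, and `layoutPartitions` are assumptions and do not describe how _Div actually divides an impulse.

```cpp
#include <cassert>
#include <vector>

struct Partition {
    int offset;  // start of this slice within the impulse response, in samples
    int length;  // impulse samples in this slice (FFT size would be 2 * length)
};

// The first `firstBlock` samples are left for the realtime path. After that,
// slices start at `firstBlock` samples long and double until they reach
// `maxBlock`, so no slice starts earlier than its own length.
std::vector<Partition> layoutPartitions(int impulseLen, int firstBlock, int maxBlock) {
    assert(impulseLen > 0 && firstBlock > 0 && maxBlock >= firstBlock);
    std::vector<Partition> parts;
    int offset = firstBlock;
    int length = firstBlock;
    while (offset < impulseLen) {
        const int remaining = impulseLen - offset;
        parts.push_back({offset, length < remaining ? length : remaining});
        offset += length;
        if (length * 2 <= maxBlock) length *= 2;
    }
    return parts;
}
```

Because each slice starts at least as late as it is long, a stage that has to collect a full block of input before it can run still has its output due no earlier than that block completes. This leaves no slack for compute or thread hand-off time, so a practical layout would likely push the larger slices later.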