Menu Close

Fast and Efficient Pitch Detection: Revisited

Reload

Last year, I wrote a series of articles about Bitstream Autocorrelation:

  1. Synth Tracking
  2. Bliss!
  3. Bitstream Autocorrelation

The code has matured significantly from its inception. It is long overdue, but now, as promised, it’s time to write about the technical details. Check out the Q Audio DSP Library where this is being actively developed.

Note: Reading up on the previous articles is a prerequisite for understanding this post.

Quick Overview

For those who tuned in late, Bitstream Autocorrelation, or BACF, is an accurate, extremely fast and efficient, time-domain pitch detection algorithm. BACF can be at least as accurate as any Autocorrelation based pitch detection schemes. Unlike the standard Autocorrelation function, or ACF, BACF works on zero crossings —the Bitstream, using bitwise operations.

BACF is extremely fast. The computation consumes an average of 50 nanoseconds per sample, on a 2016 MacBook Pro, tracking complex guitar phrases that exhibit typical playing techniques such as legato, hammer-on, pull-off, vibrato, staccato, and right-hand tapping.

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals. (From Wikipedia)

I was not the first… Initially, I noted that “As far as I can tell, this is my invention, but please tell me if there’s something I am missing and if I’m not the first to actually do it this way”. Recently, I was informed that there exists a prior invention in the 80s for a hardware based approach. The inventor is David Warrender and here’s the patent for his invention: Pitch analyzer. The patent is now expired.

Note: While I developed the code for a very specific source input: the guitar, the algorithm is generic and can be applied to other sources. Venturing outside the guitar realm is something I am very interested in, but unfortunately I have no time for… Yet…

Data Flow

Zero Crossing

A zero_crossing class saves zero-crossing information necessary to extract accurate timing information such as periods between pulses for performing analysis.

Zero Crossing Event

Each zero crossing pulse is saved in a ring buffer of info elements. Data include the maximum height of the waveform bounded by the pulse, the pulse width, as well as the leading edge and trailing edge frame positions (number of samples from the start) and y coordinates (the sample values before and after each zero crossing) of the zero crossings.

Sub-sample accurate period computation can be obtained using linear-interpolation of the x and y locations of each zero-crossing event. See image at the right. The two points P1 and P2 are the previous and current positions of the zero-crossing event from the raw data samples. The actual zero-crossing is the point where the dark blue line intersects the horizontal dotted line.

Bitset

After collecting twice the BACF window’s worth of zero-crossing information, where the BACF window size is the period of the lowest frequency we want to detect, we transfer the information to a temporary bitset class, which stores the zero crossing bits as 32-bit or 64-bit integers (using the native integer size of the platform). Only significant pulses —those whose pulse heights are above a certain percentage (currently 80%) of the maximum pulse height of all pulses collected, are transferred to the bitset. This will eliminate spurious lower-level pulses and significantly improve detection.

BACF

Now we compute the bitstream autocorrelation using the XOR operation. A single XOR operation works on N bits of an integer. We get a speedup factor of N (e.g. 64 for a 64 bit machine) compared to standard ACF, and moreover, integer bit operations are a lot faster than floating-point multiplications.

With XOR you get a one when there’s a mismatch:

0 ^ 0 = 0
0 ^ 1 = 1
1 ^ 0 = 1
1 ^ 1 = 0

After XOR, the number of bits (set to 1) is counted. The lower the count, the higher the periodicity. A count of zero gives perfect correlation: there is no mismatch. Modern CPUs provide operations for counting bits in an integer. These are exposed by modern C++ compilers as built-in functions (e.g. __builtin_popcountll). The bit count, together with the window size, are used to provide a normalized periodicity value that ranges from 0.0 to 1.0, where 1.0 signifies perfect correlation.

Sub-Collector

Unlike standard autocorrelation (image at the right) where the signal window is successively shifted/delayed and correlated against the unshifted reference signal to get the peaks (or valleys, depending on detection scheme), we only need to correlate using the zero-crossing edges —another significant speedup.

Potential periods are the distances from each of the collected zero-crossing leading edge. For each potential period, we compute the BACF, discard ones with low periodicity, and collect the most significant results. Longer periods typically (but not always) have better BACF results due to better resolution from more samples. But longer periods may also be sub-harmonics of shorter periods. This is taken into account, and detected by determining if the durations are close to integer multiples of a known shorter period.

The sub-collector returns the period (taking advantage of sub-sample accurate computation) of the leading-edge-pair with the best periodicity index, divided by the integer multiple of the period of the shortest leading-edge-pair, with a sufficiently good periodicity index. The sub-collector also returns the periodicity index.

Essentially, the BACF result is only used as a hint used to choose the best edges that demarcate the correct cycle. The accuracy obtained from computing the period from the zero-crossing edges is superior to that found by extrapolating the period from the BACF’s periodicity peak.

For illustration, the figure below exemplifies a specific BACF in action.

The waveform at the top is the raw audio. The middle is the zero-crossing waveform, with non-significant pulses removed by the zero-crossing to bitset transfer filter as described in the Bitset section. The third is the computed BACF periodicity.

There are two distinct candidate periods, p1 and p2, demarcated by edge-pairs, with a duration less than the BACF window (the third, p3 is not distinct; it has the same duration as p2). These periods correspond to the dotted red BACF sections in the graph. It is obvious in this example that the second period, p2, is the best candidate since it has the best periodicity index. The period of p2 is returned, along with its periodicity index (the peak in the BACF result).

Suppose, however that there are more edges that correspond to multiple peaks in the BACF. As described in the Sub-Collector section, we return the period of the best edge-pair (best periodicity), divided by the integer multiple of the period of the shortest edge-pair (with sufficiently good periodicity). The image below is an example with multiple BACF peaks.

Higher frequencies will tend to have more peaks within the BACF window of interest. An octave above the lowest frequency of interest will have two significant peaks. Two octaves above, like in the example above, will have four. We take advantage of the multiple significant peaks to improve precision due to the fact that longer periods will average out to better precision —there are more samples to count (the period is proportional to the number of samples between two edges). That, and sub-sample accuracy from linear interpolation of the zero-crossing edges, contribute to excellent overall precision from the lowest to the highest frequency. 

Bias

Consider this BACF snapshot:

That is the BACF of the guitar’s D string as it evolves around 14 seconds after note onset. I amplified the raw waveform to make it clear that the second harmonic dominates the fundamental. Here, it is clear that the BACF still detects the fundamental (the higher peak), but at some point, the second harmonic will become more powerful and its BACF peak will eventually catch up. At that point, without taking into account the history of detected frequencies, the second harmonic will be detected as the fundamental. There will be a sudden octave jump.

The bias section takes into account the history and deals with abrupt harmonic shifts (not limited to the first harmonic only). Sudden jumps are not normal. The bias section compares the incoming frequency with the current detected frequency. If a sudden jump is detected, and the incoming frequency is a harmonic of the previous frequency, within a specific deviation (plus minus a few cents), the bias section will prefer the older stable frequency, using the incoming frequency (which may have shifted a bit), divided by the integer multiple of the previous stable frequency.

Finally, the bias section also employs a 3-point median filter in case there’s true (non-harmonic) shift. The median filter eliminates spurious spikes that can happen every once in a while.

Results

The Q Audio DSP Library includes a comprehensive test suite. Here are the results from the latest tests involving pitch detection:

Test CaseAverage Error (cents)Min Error (cents)Max Error (cents)
Test_middle_C 0.001236 0.00009617 0.00231753
Test_middle_A 0.001998 0.0 0.00552
Test_low_E 0.00003447 0.00003447 0.00003447
Test_E_12th 0.0000359 0.00003447 0.0001258
Test_E_24th 0.00018 0.00003447 0.000606
Test_A 0.00012 0.00012 0.00012
Test_A_12th 0.000001413 0.000120 0.00012
Test_A_24th 0.000264 0.0 0.00108
Test_D 0.000189 0.0000207 0.000339
Test_D_12th 0.000733 0.0000207 0.00164
Test_D_24th 0.00151 0.0000207 0.009556
Test_G 0.0000601 0.0000601 0.0000601
Test_G_12th 0.000206 0.0000601 0.00021
Test_G_24th 0.000345 0.0000601 0.00668
Test_B 0.000643 0.00000166 0.0015
Test_B_12th 0.00436 0.00000166 0.0104
Test_B_24th 0.00332 0.00000166 0.0295
Test_high_E 0.00076 0.0000344 0.00308
Test_high_E_12th 0.00282 0.0000345 0.00805
Test_high_E_24th 0.01705 0.000286 0.0404
Test_phase_offsets0.0001210.0000340.000286
Test_missing_fundamental0.0011810.0000340.004614
Test_non_integer_harmonics1.0215050.9355031.127085

Phrase Tests

Here are the updated phrase tests from audio recordings. The left channel is the raw audio file. The right channel pitch-tracks the left channel, synthesizing a square wave, and an envelope-follower and envelope shaper, with control of attack, decay, sustain and release, which then controls gain and filter cutoff.

The tests exhibit typical playing techniques such as legato, hammer-on, pull-off, vibrato, staccato, and right-hand tapping. All the rest are recorded using the Nu multichannel pickup, except for the ShortStaccato which uses a short synthesized sine-burst.

Future Improvements

The pitch detector is very usable now, but it is still undergoing continuous refinement. One thing I want to improve is early detection on or immediately before onsets. Sometimes, there’s a very short “glitch” when the pitch detector does not have a good estimate yet. It is barely perceptible and can just be taken as part of the attack.

Let me quote my previous post on the matter. For reference, here’s the latest updated “Low E” sample:

Prediction and Zero-Latency Attacks

In theory, any Autocorrelation based frequency detector requires at least two cycles. Essentially, a cycle needs to be correlated with the next cycle in order to determine the period. However, one thing I noticed is that we can predict the frequency earlier by looking at the running stream of peaks or the zero-crossings on, or immediately preceding note onset. For the guitar, the prediction is quite accurate given one full cycle. For the low-E string, that amounts to a latency of 12ms (1/82.41Hz). That should be the minimum latency, right? Well, ehmmm…

Go back and listen to the “Low E” sample. Did you notice that there’s no latency at all? Yes, zero latency. How can that happen? Well, I faked it! Attack transients are special. They need not be periodic at all! Non-periodic attack transients are common. For example, Hammond organs are known to have a “key click” (regarded as an unfortunate defect by its designer, but embraced by users as part of the Hammond sound). That violin attack? That’s basically bow noise. The human voice may have unvoiced consonants. That guitar pick attack may sound like “clik”, “clak”, “clok”, “cluk”, depending on where you pick.

Any percussive or maybe even semi-periodic sound below 20ms is fair game and will not sound out of tune to the human ear, regardless of frequency. They will impart character, but will not contribute in establishing the note’s fundamental frequency. Listen to the “Low E” sample again. Did you notice a pitch shift at note onset? No, you can’t. The human ear can’t detect tones with such a short fragment.

There is still that very brief pitch shift if you look at the waveform:

One additional thing that I should mention is my observation that while transitions from a wrong higher frequency down to a correct lower frequency is barely perceptible, the opposite is not true. Transitions from a wrong lower frequency to a correct higher frequency are more pronounced. There are a few of those in the phrase tests above.

Here’s a specific example:

One possible solution is to bias the initial prediction to a higher frequency when the probability of obtaining a good prediction is low. Another possibility is to randomize the initial prediction, if it is indeterminate to begin with. Whatever the correct solution is, just-do-something is better than wait-and-do-nothing. A latent onset, especially with fast attacks, is displeasing to the ear. At any rate, this is one issue that needs further investigation.

Beyond the Guitar

These tests and the development surrounding it, are obviously guitar-centric. I do have tests extended for lower frequencies from the bass guitar, as well as 8-string guitar samples, but those are still guitar samples. I’d definitely love to get hold of samples from other instruments!

If you’ve read this far, then I suppose you must be truly interested. If you want this technology applied to other instruments, then the first thing that needs to be done is to add tests; and lots of it. You can help by sending me recordings of single notes covering all frequencies as well as phrase samples covering various styles typical of the instrument.