[Suggestion] Extend the pitch range of RVC models #63

Open
TheTrustedComputer opened this issue Jun 24, 2024 · 9 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed

Comments

@TheTrustedComputer

Is your feature request related to a problem? Please describe.
Currently, RVC models with pitch guidance seem to have an f0 range from 50 Hz to 1.1 kHz. When I feed an audio sample outside of this range, it produces distorted breath sounds or subharmonics of the fundamental.

Describe the solution you'd like
I would like to see this extended since humans are more than capable of going higher and lower than that (vocal fry, women's screams, and whistles are some examples). RMVPE's range appears to be set from 30 Hz to 8 kHz.

Describe alternatives you've considered
I have tried changing f0_min and f0_max to other values to test in real-time inference, but it has no effect. RMVPE does react to this a little, but the changes are not obvious.

Additional context
I recommend setting the lower bound to 20 Hz (the lower limit of human hearing) and the upper bound to roughly 8 kHz, like RMVPE, to cover the entire vocal range and beyond. Better yet, consider allowing the user to fully customize this in training and inference.
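To illustrate why input outside a fixed range may not be representable, here is a minimal sketch of how a pitch-guided model could quantize f0 into a fixed number of bins. The mel-style formula, the 255-bin count, and the function name `coarse_f0` are assumptions for illustration only; they are not confirmed against this repository's code.

```python
import numpy as np

def coarse_f0(f0_hz, f0_min=50.0, f0_max=1100.0, n_bins=255):
    """Quantize f0 (Hz) to integer bins 1..n_bins on a mel-like scale (illustrative)."""
    f0_hz = np.atleast_1d(np.asarray(f0_hz, dtype=np.float64))
    mel = 1127.0 * np.log1p(f0_hz / 700.0)
    mel_min = 1127.0 * np.log1p(f0_min / 700.0)
    mel_max = 1127.0 * np.log1p(f0_max / 700.0)
    # Map voiced frames linearly so that f0_min -> bin 1 and f0_max -> bin n_bins
    voiced = mel > 0
    mel[voiced] = (mel[voiced] - mel_min) * (n_bins - 1) / (mel_max - mel_min) + 1
    # Everything below or above the range saturates at the first or last bin
    return np.clip(np.rint(mel), 1, n_bins).astype(int)

# 0 Hz stands for an unvoiced frame in this sketch
print(coarse_f0([0, 30, 50, 440, 1100, 2000, 8000]))
```

Under these assumptions, 30 Hz lands in the same bin as an unvoiced frame, and 2 kHz and 8 kHz both land in the same bin as 1.1 kHz, so a model trained on such bins has no way to tell them apart.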

@fumiama fumiama added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed labels Jun 24, 2024
@alexlnkp
Contributor

alexlnkp commented Jun 26, 2024

See, the pitch range used for f0 analysis covers the root pitch of the voice, in other words, no overtones. That essentially means the root note of the voice can be anywhere from 50 Hz to 1.1 kHz. Now, it's important to state that no human can produce sounds below their lowest speaking range. You can test this yourself:

  1. Open a guitar/piano tuner app on your phone
  2. Try to produce as low of a sound as you can
  3. Try to produce as high of a sound as you can

You will notice that no matter how hard you try (don't try too hard, you might hurt your vocal cords), you are unable to go above 1100Hz or below 10Hz. There's science behind that, but generally speaking, there's a lowest pitch a human can produce and a highest pitch a human can produce.
My range, for example, is from 110Hz to 670Hz, so the range of 10Hz to 1100Hz fits well enough. Generally speaking, the human vocal range can only go from ~30Hz to ~900Hz, so I'd say that the range used in RVC is quite generous.
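What a tuner app does can be approximated with a simple autocorrelation pitch estimate. The sketch below is only an illustration on a synthetic tone; the function name `estimate_f0` and the test signal are made up for this example, and real tuners and RVC's f0 methods are more sophisticated. Note how `f0_min` and `f0_max` bound the periods that are searched.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f0_min=50.0, f0_max=1100.0):
    """Estimate the fundamental via autocorrelation, searching only f0_min..f0_max."""
    signal = np.asarray(signal, dtype=np.float64)
    signal = signal - signal.mean()
    # Autocorrelation for non-negative lags
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sample_rate / f0_max)  # shortest period considered
    lag_max = int(sample_rate / f0_min)  # longest period considered
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sample_rate / lag

sr = 16000
t = np.arange(1600) / sr  # 0.1 seconds
# A 200 Hz "root note" with two weaker overtones
tone = (np.sin(2 * np.pi * 200 * t)
        + 0.5 * np.sin(2 * np.pi * 400 * t)
        + 0.25 * np.sin(2 * np.pi * 600 * t))
print(estimate_f0(tone, sr))
```

The overtones at 400 Hz and 600 Hz do not change the estimate: the strongest repetition period is still that of the 200 Hz root.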

That said, I implemented a similar "change the range" setting in my own fork a while ago, so I do think it is useful. I also think a lot of people will misunderstand it, like you did. I don't mean to insult or anything; it's just a common misconception regarding the f0 that a lot of people seem to have :)

@alexlnkp
Contributor

Regarding distorted audio on seemingly out-of-range input: the culprit IS the f0 range slider, but not in the way you assumed. Generally speaking, a scream (whether it's a fry scream, false cord, or even just a regular human yell) is extremely noisy compared to what the f0 methods are intended for, which is regular singing. So, to get cleaner results, you have to limit the range even more to minimize the noise (you may look at this as a sort of noise gate that cuts off all frequencies above/below the specified ones). I've experimented with this on my Mangio-RVC-Tweaks fork, and the results always favor limiting the ceiling of the range in order to reduce the noise caused by yelling/screaming/growling/etc.
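The "noise gate" idea can be sketched as a mask over an estimated f0 curve: frames whose pitch falls outside the chosen range are treated as unvoiced. This is a hypothetical illustration; the function name `limit_f0_range` and the convention that 0 means "unvoiced" are assumptions, not the actual code of the fork mentioned above.

```python
import numpy as np

def limit_f0_range(f0, f0_min, f0_max):
    """Zero out (mark as unvoiced) every frame whose f0 lies outside [f0_min, f0_max]."""
    f0 = np.asarray(f0, dtype=np.float64)
    out = f0.copy()
    out[(f0 < f0_min) | (f0 > f0_max)] = 0.0
    return out

# A made-up f0 curve in Hz: a sung note with two spurious detections (40 Hz and 3 kHz)
curve = [0.0, 40.0, 220.0, 225.0, 3000.0, 230.0]
print(limit_f0_range(curve, f0_min=50.0, f0_max=1100.0))
```

Lowering `f0_max` further would reject more of the spurious high detections that noisy input tends to produce, at the cost of also rejecting genuinely high notes.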

@TheTrustedComputer
Author

TheTrustedComputer commented Jun 26, 2024

RVC's pitch range, as it stands, is probably designed to cover the broad spectrum of normal speech and singing without considering techniques to increase their range like vocal fry and whistling. Although some might view extending the range as unnecessary, it may be essential for things like whistling, to be honest.

As for the f0 misconception, this is just my (and perhaps many others') interpretation of its behavior as a regular user. I'm curious to know how the myth came about. Are you basically saying it filters out frequencies outside the range that may introduce noise, especially in non-speech vocalizations? Please explain as if I know nothing, though I do know how harmonics or overtones work.

I can understand that vocalizations such as yells, screams, and growls are quite noisy and thus cause pitch detection algorithms made for singing to struggle; whistles, on the other hand, are less noisy and should theoretically be easier to capture. This may be wishful thinking, but I still want RVC to convert them better than it does now for added realism. For example: depending on the model, it inferred coughs and sneezes (also noisy) quite decently in my experiments, but not perfectly. Meanwhile, BigVGAN's potential promises better clarity.

If a pitch range adjustment feature is eventually included in RVC, which I more or less doubt, it should stick to the current defaults of a minimum of 50 Hz and a maximum of 1.1 kHz.

@alexlnkp
Contributor

alexlnkp commented Jun 26, 2024

@TheTrustedComputer
Let's take a look at the Fourier transform of a wave using Python:

import json

import numpy as np
from matplotlib import pyplot as plt
from scipy.fft import fft, fftfreq
import scipy.io.wavfile as wav

# Note name -> frequency in Hz
with open("notes_map.json", "r") as f:
    NOTES_MAP = json.load(f)

WAVE_LOCATION = "rd.wav"
DURATION = 5  # Seconds

SAMPLE_RATE, data = wav.read(WAVE_LOCATION)
if data.ndim > 1:
    data = data.mean(axis=1)  # Downmix stereo to mono
segment = data[: SAMPLE_RATE * DURATION].astype(np.float64)

# Use the actual segment length in case the file is shorter than DURATION
n = len(segment)
yf = fft(segment)
xf = fftfreq(n, 1 / SAMPLE_RATE)

# Keep positive frequencies only and scale the magnitude to 0..1
positive = xf > 0
xf = xf[positive]
y = np.abs(yf[positive])
if y.max() > 0:
    y = y / y.max()

plt.plot(xf, y)
plt.xlim([0, 3e3])

# Set a threshold for the (normalized) magnitude
threshold = 0.05  # Try reducing the threshold value

# Bin indices sorted so the highest magnitudes come first
order = np.argsort(y)[::-1]

# Get the top 10 distinct frequencies, rounded to the nearest Hz
bucket = []
for idx in order:
    if len(bucket) == 10:
        break
    freq = round(float(xf[idx]))
    if freq not in bucket:
        bucket.append(freq)

# Label each of them with the first note found within +/- 4 Hz
for freq in bucket:
    for note, note_freq in NOTES_MAP.items():
        if freq - 4 < note_freq < freq + 4:
            idx = np.argmin(np.abs(xf - note_freq))
            if y[idx] > threshold:
                plt.scatter(xf[idx], y[idx], c="r")
                plt.annotate(
                    note,
                    (xf[idx], y[idx]),
                    textcoords="offset points",
                    xytext=(0, 10),
                    ha="center",
                )
            break

plt.show()

This code calculates the Fourier transform using scipy and plots it using matplotlib. The output for a recording of me singing one sustained note is here:
[Figure_1: magnitude spectrum of the recording from 0 to 3000 Hz, with the most prominent notes labeled]

The axis from 0.0 to 1.0 is the magnitude, and the axis from 0 to 3000 is the frequency in Hz. The most prominent notes are labeled at the frequency they map to. Here's the note-to-frequency mapping:

{
   "C0":   16.35,
  "C#0":   17.32,
  "Db0":   17.32,
   "D0":   18.35,
  "D#0":   19.45,
  "Eb0":   19.45,
   "E0":   20.60,
   "F0":   21.83,
  "F#0":   23.12,
  "Gb0":   23.12,
   "G0":   24.50,
  "G#0":   25.96,
  "Ab0":   25.96,
   "A0":   27.50,
  "A#0":   29.14,
  "Bb0":   29.14,
   "B0":   30.87,
   "C1":   32.70,
  "C#1":   34.65,
  "Db1":   34.65,
   "D1":   36.71,
  "D#1":   38.89,
  "Eb1":   38.89,
   "E1":   41.20,
   "F1":   43.65,
  "F#1":   46.25,
  "Gb1":   46.25,
   "G1":   49.00,
  "G#1":   51.91,
  "Ab1":   51.91,
   "A1":   55.00,
  "A#1":   58.27,
  "Bb1":   58.27,
   "B1":   61.74,
   "C2":   65.41,
  "C#2":   69.30,
  "Db2":   69.30,
   "D2":   73.42,
  "D#2":   77.78,
  "Eb2":   77.78,
   "E2":   82.41,
   "F2":   87.31,
  "F#2":   92.50,
  "Gb2":   92.50,
   "G2":   98.00,
  "G#2":  103.83,
  "Ab2":  103.83,
   "A2":  110.00,
  "A#2":  116.54,
  "Bb2":  116.54,
   "B2":  123.47,
   "C3":  130.81,
  "C#3":  138.59,
  "Db3":  138.59,
   "D3":  146.83,
  "D#3":  155.56,
  "Eb3":  155.56,
   "E3":  164.81,
   "F3":  174.61,
  "F#3":  185.00,
  "Gb3":  185.00,
   "G3":  196.00,
  "G#3":  207.65,
  "Ab3":  207.65,
   "A3":  220.00,
  "A#3":  233.08,
  "Bb3":  233.08,
   "B3":  246.94,
   "C4":  261.63,
  "C#4":  277.18,
  "Db4":  277.18,
   "D4":  293.66,
  "D#4":  311.13,
  "Eb4":  311.13,
   "E4":  329.63,
   "F4":  349.23,
  "F#4":  369.99,
  "Gb4":  369.99,
   "G4":  392.00,
  "G#4":  415.30,
  "Ab4":  415.30,
   "A4":  440.00,
  "A#4":  466.16,
  "Bb4":  466.16,
   "B4":  493.88,
   "C5":  523.25,
  "C#5":  554.37,
  "Db5":  554.37,
   "D5":  587.33,
  "D#5":  622.25,
  "Eb5":  622.25,
   "E5":  659.26,
   "F5":  698.46,
  "F#5":  739.99,
  "Gb5":  739.99,
   "G5":  783.99,
  "G#5":  830.61,
  "Ab5":  830.61,
   "A5":  880.00,
  "A#5":  932.33,
  "Bb5":  932.33,
   "B5":  987.77,
   "C6": 1046.50,
  "C#6": 1108.73,
  "Db6": 1108.73,
   "D6": 1174.66,
  "D#6": 1244.51,
  "Eb6": 1244.51,
   "E6": 1318.51,
   "F6": 1396.91,
  "F#6": 1479.98,
  "Gb6": 1479.98,
   "G6": 1567.98,
  "G#6": 1661.22,
  "Ab6": 1661.22,
   "A6": 1760.00,
  "A#6": 1864.66,
  "Bb6": 1864.66,
   "B6": 1975.53,
   "C7": 2093.00,
  "C#7": 2217.46,
  "Db7": 2217.46,
   "D7": 2349.32,
  "D#7": 2489.02,
  "Eb7": 2489.02,
   "E7": 2637.02,
   "F7": 2793.83,
  "F#7": 2959.96,
  "Gb7": 2959.96,
   "G7": 3135.96,
  "G#7": 3322.44,
  "Ab7": 3322.44,
   "A7": 3520.00,
  "A#7": 3729.31,
  "Bb7": 3729.31,
   "B7": 3951.07,
   "C8": 4186.01,
  "C#8": 4434.92,
  "Db8": 4434.92,
   "D8": 4698.64,
  "D#8": 4978.03,
  "Eb8": 4978.03
}

The logic is that the dominant frequency (the one with the highest amplitude) IS the f0, a.k.a. the root note. The other notes and frequencies are overtones and noise introduced by a lot of factors. Where exactly overtones come from is a bit of a boring thing to cover, but if you wish, there are a lot of research papers on the matter :)
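Picking the dominant frequency can be condensed into a few lines. This is a minimal sketch on a synthetic signal; the name `dominant_frequency` and the 385 Hz test tone are made up for illustration. It assumes the root note really is the strongest component, which holds for this signal but is not guaranteed for every voice or recording.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin in the magnitude spectrum."""
    signal = np.asarray(signal, dtype=np.float64)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sample_rate)
    spectrum[0] = 0.0  # ignore any DC offset
    return float(freqs[np.argmax(spectrum)])

sr = 8000
t = np.arange(sr) / sr  # one second, so the bins are 1 Hz apart
# A root note at 385 Hz with two weaker overtones
voice_like = (np.sin(2 * np.pi * 385 * t)
              + 0.4 * np.sin(2 * np.pi * 770 * t)
              + 0.2 * np.sin(2 * np.pi * 1155 * t))
print(dominant_frequency(voice_like, sr))
```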

Please take note of this:
[image: close-up of the highest peak, slightly below the G4 label]

The highest peak is a bit below the G4 note; however, since it's still a valid frequency, it IS considered the F0 for the wave, even though it does not map onto the standard 12-note scale. The note labels are used for demonstration purposes only.
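How far a peak sits from the nearest note can be expressed in cents (hundredths of a semitone). The sketch below assumes 12-tone equal temperament with A4 = 440 Hz; the 385 Hz input is a made-up value standing in for "a bit below G4", not a measurement from the plot.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return (note name, offset in cents) for the closest equal-tempered note."""
    semitones_from_a4 = 12.0 * math.log2(freq_hz / a4_hz)
    nearest = round(semitones_from_a4)
    cents = 100.0 * (semitones_from_a4 - nearest)
    midi = 69 + nearest  # MIDI note number; A4 is 69
    name = f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"
    return name, cents

name, cents = nearest_note(385.0)
print(name, round(cents, 1))
```

A negative offset means the frequency is flat relative to the named note. The f0 itself stays at the measured frequency; the note name is only a label.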

As to where the confusion comes from: my only guess is a misconception of what a frequency is, combined with people mixing up the sample rate and the frequencies. If you were to record your own voice at a 44.1kHz sample rate, it would sound natural to you. However, if you were to resample the recording of your voice to 2.2kHz (2200Hz since we want to be able to represent frequencies up to 1100Hz, see the Nyquist–Shannon sampling theorem), then the recording would sound muffled and unnatural to you, which is understandable and even expected. Humans expect to hear everything in the ranges they're familiar with, so from ~16Hz to ~20kHz. If the human ear does not hear frequencies above a certain point, it assumes that something is wrong, which, in my opinion, might explain where the confusion comes from.

You see, during F0 curve computation, we don't care about overtones, as those are specific to each voice; therefore, to perform a more accurate voice conversion that takes the notes into account, we have to remove everything except for the root frequency (F0) from the audio.
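The link between sample rate and representable frequency can be checked numerically. In this sketch (the function name `apparent_frequency` is made up for the example), a pure tone is sampled at 2200 Hz and the strongest frequency in the samples is reported. A tone above the 1100 Hz Nyquist limit cannot be represented and shows up at a lower, aliased frequency instead.

```python
import numpy as np

def apparent_frequency(tone_hz, sample_rate, seconds=1.0):
    """Sample a pure tone at sample_rate and return the strongest frequency in the samples."""
    n = int(sample_rate * seconds)
    t = np.arange(n) / sample_rate
    samples = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(n, 1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

print(apparent_frequency(700, 2200))   # 700 Hz is below the 1100 Hz Nyquist limit
print(apparent_frequency(1500, 2200))  # 1500 Hz is above it and gets aliased
```

Both calls report about 700 Hz: the 1500 Hz tone folds back to 2200 - 1500 = 700 Hz. This is why a proper resampler filters out everything above the Nyquist limit first, which is also what makes the low-rate recording sound muffled.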

@Mojobones

Regarding distorted audio on seemingly outside ranges - the culprit IS the f0 range slider, but not in the way you assumed. Generally speaking, a scream (whether it's a fry scream, false cord, or even just a regular human yell) is extremely noisy compared to what the f0 methods are intended for, regular singing. So, to get cleaner results, you have to limit the range even more, as to minimize the noise (you may look at this as a sort of noise gate that cuts off all frequencies above/below specified ones). I've experimented with this on my Mangio-RVC-Tweaks fork, results are always in the favor of limiting the ceiling range in order to reduce the noise caused by yelling/screaming/growling/etc

Is this why laughing is often so poor on RVC? I assume it'd fall under a similar umbrella as things like yelling or other such noisy noises that it's trying to convert.

@alexlnkp
Contributor

Yeah! Although laughter is much harder to deal with. To get a perfect result, you'd have to make the high-frequency cut slide up and down as the vocal cords contract and relax rapidly.
You could still get away with just setting the high-frequency cut to the frequency the vocal cords produce when they're most relaxed, but it might still sound a bit unnatural.
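One way to picture a sliding cut is a per-frame ceiling that follows the local pitch. The sketch below is purely hypothetical: the function name `sliding_ceiling`, the median-based ceiling, and the `window` and `ratio` parameters are assumptions for illustration and do not describe any existing RVC feature.

```python
import numpy as np

def sliding_ceiling(f0, window=5, ratio=1.5):
    """Mark a frame as unvoiced (0) when its f0 exceeds ratio x the local median f0."""
    f0 = np.asarray(f0, dtype=np.float64)
    out = f0.copy()
    half = window // 2
    for i in range(len(f0)):
        if f0[i] <= 0:
            continue  # already unvoiced
        neighbours = f0[max(0, i - half): i + half + 1]
        voiced = neighbours[neighbours > 0]
        ceiling = ratio * np.median(voiced)  # the ceiling moves with the local pitch
        if f0[i] > ceiling:
            out[i] = 0.0
    return out

# A made-up f0 curve in Hz with one spurious jump to 800 Hz
print(sliding_ceiling([200.0, 210.0, 800.0, 205.0, 0.0, 220.0]))
```

With a fixed ceiling, the cut would have to sit above the highest pitch in the whole clip; here the 800 Hz outlier is rejected because it is far above its neighbours.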

@Mojobones

makes sense, thank you! is that option you mentioned in your fork where it cuts off frequencies something that'd be worth porting to this repository? sounds like it could be useful!

@alexlnkp
Contributor

I'd say it would be nice to have, but ultimately the current development stage of this repo is focused on refining the original so that it can be made into a package; after that though, I'd say there's no reason not to add it.

@Mojobones

Understood, thank you so much for the insight!
