[Suggestion] Extend the pitch range of RVC models #63
Comments
See, the pitch range used for f0 analysis covers the root pitch of the voice; in other words, no overtones. That essentially means the root note of the voice can be anywhere from 50 Hz to 1.1 kHz. Now, it's important to state that no human can produce a note below their lowest, or above their highest, physically producible pitch. You can test this yourself:
You will notice that no matter how hard you try (don't try too hard, you might hurt your vocal cords), you cannot go above 1100 Hz, nor below 10 Hz. There's science behind that, but generally speaking, there is a lowest pitch a human can produce and a highest pitch a human can produce. That said, I implemented a similar "change the range" setting in my own fork a while ago, so I do think it's useful; however, I also think a lot of people will misunderstand it, like you did. I don't mean to insult you; it's just a common misconception about f0 that a lot of people seem to have :)
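To make the "root pitch only" point concrete, here is a minimal sketch of extracting an f0 curve constrained to the 50 Hz to 1.1 kHz band. This is not RVC's actual code; it assumes `librosa` is installed and uses a hypothetical input file `voice.wav`:

```python
import librosa
import numpy as np

# Hypothetical input file; any mono speech or singing recording works.
y, sr = librosa.load("voice.wav", sr=16000)

# pYIN returns one fundamental-frequency estimate per frame;
# fmin/fmax play the same role as the f0_min/f0_max bounds discussed here.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=1100, sr=sr)

# f0 is NaN on unvoiced frames. Overtones never appear in this curve,
# only the estimated root pitch of each frame.
print(np.nanmin(f0), np.nanmax(f0))
```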
Regarding distorted audio on input that is seemingly outside the range: the culprit IS the f0 range slider, but not in the way you assumed. Generally speaking, a scream (whether a fry scream, false cord, or even just a regular yell) is extremely noisy compared to what the f0 methods are intended for, namely regular singing. So, to get cleaner results, you have to limit the range even more, to minimize the noise (you can look at this as a sort of noise gate that cuts off all frequencies above/below the specified ones). I've experimented with this in my Mangio-RVC-Tweaks fork, and the results always favor lowering the ceiling to reduce the noise caused by yelling/screaming/growling/etc.
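A minimal sketch of that "noise gate" idea (my own illustration, not code from either repo): after extracting the per-frame f0 curve, drop estimates outside a narrower, user-chosen band before conversion.

```python
import numpy as np

def gate_f0(f0, floor_hz=65.0, ceiling_hz=600.0):
    """Drop f0 estimates outside a narrower band.

    floor_hz/ceiling_hz are hypothetical user settings, tighter than the
    model's native 50-1100 Hz range, meant to suppress spurious pitch
    estimates on screams/growls.
    """
    f0 = np.asarray(f0, dtype=float).copy()
    out_of_band = (f0 < floor_hz) | (f0 > ceiling_hz)
    f0[out_of_band] = 0.0  # 0 is commonly treated as "unvoiced"
    return f0

# Example: the 1500 Hz frame is likely an octave error on a noisy scream.
print(gate_f0([110.0, 220.0, 1500.0, 40.0, 330.0]))
```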
RVC's pitch range, as it stands, is probably designed to cover the broad spectrum of normal speech and singing without considering range-extending techniques like vocal fry and whistling. Although some might view extending the range as unnecessary, it may be essential for things like whistling, to be honest. As for the f0 misconception, these are just my (and perhaps many others') interpretations of its behavior as a regular user, and I'm curious how the myth came about. Are you basically saying it filters out frequencies outside the range that may introduce noise, especially in non-speech vocalizations? Please explain as if I know nothing, though I do understand how harmonics and overtones work. I can see that vocalizations such as yells, screams, and growls are quite noisy and thus make pitch detection algorithms designed for singing struggle; whistles, on the other hand, are less noisy and should theoretically be easier to capture. This may be wishful thinking, but I still want RVC to convert them better than it does now for added realism. For example, depending on the model, it inferred coughs and sneezes (also noisy) quite decently in my experiments, though not perfectly. Meanwhile, BigVGAN's potential promises better clarity. If a pitch range adjustment feature is eventually included in RVC, which I more or less doubt, stick to the supplied defaults of a 50 Hz minimum and a 1.1 kHz maximum.
@TheTrustedComputer

```python
import json

import numpy as np
from matplotlib import pyplot as plt
from scipy.fft import fft, fftfreq
import scipy.io.wavfile as wav

# Note-name -> frequency table (see the JSON listing below).
with open("notes_map.json", "r") as f:
    NOTES_MAP = json.load(f)

WAVE_LOCATION = "rd.wav"
DURATION = 5  # seconds of audio to analyze

SAMPLE_RATE, data = wav.read(WAVE_LOCATION)
# If the file is stereo, average the channels down to mono first.
if data.ndim > 1:
    data = data.mean(axis=1)

# FFT of the first DURATION seconds of the recording.
yf = fft(data[: SAMPLE_RATE * DURATION])
xf = fftfreq(SAMPLE_RATE * DURATION, 1 / SAMPLE_RATE)

# Normalize magnitudes to [0, 1] so the threshold below is meaningful.
y = np.abs(yf)
y = y / y.max()

plt.plot(xf, y)
plt.xlim([0, 3e3])

# Minimum normalized magnitude for a peak to receive a note label.
threshold = 0.05  # try reducing this if too few peaks are labeled

# Map positive frequencies to their magnitudes.
d = {}
for i in range(len(y)):
    if xf[i] > 0:
        d[f"{xf[i]}"] = y[i]

# Sort the keys so the strongest frequencies come first.
d = sorted(d, key=d.get, reverse=True)

# Collect the 10 strongest distinct (rounded) frequencies.
bucket = []
for i in d:
    if len(bucket) == 10:
        break
    i = round(float(i))
    if i not in bucket:
        bucket.append(i)

# Map each strong frequency to the nearest note (within +/- 4 Hz).
notes = []
for i in bucket:
    for note in NOTES_MAP:
        note_freq = NOTES_MAP[note]
        if i - 4 < note_freq < i + 4:
            notes.append(note)
            break

# Label the matched notes on the plot.
for i in bucket:
    for note in NOTES_MAP:
        note_freq = NOTES_MAP[note]
        if i - 4 < note_freq < i + 4:
            idx = np.argmin(np.abs(xf - note_freq))
            if y[idx] > threshold:
                plt.scatter(xf[idx], y[idx], c="r")
                plt.annotate(
                    note,
                    (xf[idx], y[idx]),
                    textcoords="offset points",
                    xytext=(0, 10),
                    ha="center",
                )

plt.show()
```

This code computes the Fourier transform with scipy and plots it with matplotlib. The output for a recording of me singing one sustained note is the attached plot: the axis from 0.0 to 1.0 is the magnitude, and the axis from 0 to 3000 is the frequency in Hz. The most prominent notes are labeled at the frequencies they map to. Here's the note-to-frequency mapping:

```json
{
"C0": 16.35,
"C#0": 17.32,
"Db0": 17.32,
"D0": 18.35,
"D#0": 19.45,
"Eb0": 19.45,
"E0": 20.60,
"F0": 21.83,
"F#0": 23.12,
"Gb0": 23.12,
"G0": 24.50,
"G#0": 25.96,
"Ab0": 25.96,
"A0": 27.50,
"A#0": 29.14,
"Bb0": 29.14,
"B0": 30.87,
"C1": 32.70,
"C#1": 34.65,
"Db1": 34.65,
"D1": 36.71,
"D#1": 38.89,
"Eb1": 38.89,
"E1": 41.20,
"F1": 43.65,
"F#1": 46.25,
"Gb1": 46.25,
"G1": 49.00,
"G#1": 51.91,
"Ab1": 51.91,
"A1": 55.00,
"A#1": 58.27,
"Bb1": 58.27,
"B1": 61.74,
"C2": 65.41,
"C#2": 69.30,
"Db2": 69.30,
"D2": 73.42,
"D#2": 77.78,
"Eb2": 77.78,
"E2": 82.41,
"F2": 87.31,
"F#2": 92.50,
"Gb2": 92.50,
"G2": 98.00,
"G#2": 103.83,
"Ab2": 103.83,
"A2": 110.00,
"A#2": 116.54,
"Bb2": 116.54,
"B2": 123.47,
"C3": 130.81,
"C#3": 138.59,
"Db3": 138.59,
"D3": 146.83,
"D#3": 155.56,
"Eb3": 155.56,
"E3": 164.81,
"F3": 174.61,
"F#3": 185.00,
"Gb3": 185.00,
"G3": 196.00,
"G#3": 207.65,
"Ab3": 207.65,
"A3": 220.00,
"A#3": 233.08,
"Bb3": 233.08,
"B3": 246.94,
"C4": 261.63,
"C#4": 277.18,
"Db4": 277.18,
"D4": 293.66,
"D#4": 311.13,
"Eb4": 311.13,
"E4": 329.63,
"F4": 349.23,
"F#4": 369.99,
"Gb4": 369.99,
"G4": 392.00,
"G#4": 415.30,
"Ab4": 415.30,
"A4": 440.00,
"A#4": 466.16,
"Bb4": 466.16,
"B4": 493.88,
"C5": 523.25,
"C#5": 554.37,
"Db5": 554.37,
"D5": 587.33,
"D#5": 622.25,
"Eb5": 622.25,
"E5": 659.26,
"F5": 698.46,
"F#5": 739.99,
"Gb5": 739.99,
"G5": 783.99,
"G#5": 830.61,
"Ab5": 830.61,
"A5": 880.00,
"A#5": 932.33,
"Bb5": 932.33,
"B5": 987.77,
"C6": 1046.50,
"C#6": 1108.73,
"Db6": 1108.73,
"D6": 1174.66,
"D#6": 1244.51,
"Eb6": 1244.51,
"E6": 1318.51,
"F6": 1396.91,
"F#6": 1479.98,
"Gb6": 1479.98,
"G6": 1567.98,
"G#6": 1661.22,
"Ab6": 1661.22,
"A6": 1760.00,
"A#6": 1864.66,
"Bb6": 1864.66,
"B6": 1975.53,
"C7": 2093.00,
"C#7": 2217.46,
"Db7": 2217.46,
"D7": 2349.32,
"D#7": 2489.02,
"Eb7": 2489.02,
"E7": 2637.02,
"F7": 2793.83,
"F#7": 2959.96,
"Gb7": 2959.96,
"G7": 3135.96,
"G#7": 3322.44,
"Ab7": 3322.44,
"A7": 3520.00,
"A#7": 3729.31,
"Bb7": 3729.31,
"B7": 3951.07,
"C8": 4186.01,
"C#8": 4434.92,
"Db8": 4434.92,
"D8": 4698.64,
"D#8": 4978.03,
"Eb8": 4978.03 The logic is that, the dominant frequency (the one with highest amplitude) IS the f0, a.k.a. as the root note. The other notes and frequencies are overtones and noise that is introduced by a lot of factors. Where exactly overtones come from is a bit of a boring thing to cover, but if you wish - there's a lot of research papers on the matter :) The highest peak is a bit below the G4 note, however since it's still a valid frequency - it IS considered the F0 for the wave, even though it does not map to the standard 12-note scale system. The note labels are used for demonstration purposes. As to where the confusion comes from - my only guess is the misconception of what a frequency is, combined with people mixing up samplerate and the frequencies. So, if you were to record your own voice in 44.1kHz samplerate, it would sound natural to you. However, if you were to resample the recording of your voice to 2.2kHz (2200Hz since we want to be able to represent frequencies up to 1100Hz, see Nyquist–Shannon sampling theorem), then the recording would sound muffled and unnatural to you, which is understandable and even expected. Humans expect to hear everything in the ranges they're familiar with, so from ~16Hz to ~20kHz. If the human ear does not hear frequencies above a certain point - it assumes that something is wrong, which, in my opinion, might explain where the confusion comes from. You see, during F0 curve computation, we don't care about overtones, as those are specific from voice-to-voice; therefore, to perform a more accurate voice conversion that takes in account the notes, we have to remove everything except for the root frequency (F0) from the audio. |
Is this why laughing often comes out so poorly in RVC? I assume it falls under a similar umbrella as yelling and the other noisy vocalizations it's trying to convert.
Yeah! Although laughter is much harder to deal with. To get a perfect result, you'd have to make the high-frequency cut slide up and down as the vocal cords contract and relax rapidly.
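A rough sketch of what that sliding cut could look like (purely speculative on my part, not anything implemented in either fork): instead of one fixed ceiling, apply a per-frame ceiling that tracks a smoothed pitch trajectory, so rapid laugh pulses aren't gated by a single global limit.

```python
import numpy as np

def sliding_ceiling(f0, base_ceiling=1100.0, headroom=1.5, win=9):
    """Per-frame f0 ceiling following a smoothed pitch trajectory.

    All parameters here are made-up illustrations: each frame's ceiling is
    `headroom` times a moving median of nearby voiced f0, capped at
    `base_ceiling`.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = np.where(f0 > 0, f0, np.nan)  # treat 0 as unvoiced
    pad = win // 2
    padded = np.pad(voiced, pad, mode="edge")
    # Moving median over `win` frames, ignoring unvoiced (NaN) frames.
    smooth = np.array([
        np.nanmedian(padded[i : i + win]) for i in range(len(f0))
    ])
    ceiling = np.minimum(
        np.nan_to_num(smooth, nan=base_ceiling) * headroom, base_ceiling
    )
    gated = np.where(f0 > ceiling, 0.0, f0)  # 0 = mark frame unvoiced
    return gated, ceiling
```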
Makes sense, thank you! Is that option you mentioned in your fork, where it cuts off frequencies, something that'd be worth porting to this repository? It sounds like it could be useful!
I'd say it would be nice to have, but ultimately the current development stage of this repo is about refactoring the original into a more "refined" shape so that it can be made into a package; after that, though, I'd say there's no reason not to add it.
Understood, thank you so much for the insight!
Is your feature request related to a problem? Please describe.
Currently, RVC models with pitch guidance seem to have an f0 range from 50 Hz to 1.1 kHz. When I feed an audio sample outside of this range, it produces distorted breath sounds or subharmonics of the fundamental.
Describe the solution you'd like
I would like to see this extended since humans are more than capable of going higher and lower than that (vocal fry, women's screams, and whistles are some examples). RMVPE's range appears to be set from 30 Hz to 8 kHz.
Describe alternatives you've considered
I have tried changing `f0_min` and `f0_max` to something else to test in real-time inference, but it has no effect. RMVPE does react to this a little, but the changes aren't obvious.

Additional context
I recommend assigning the lower bound to 20 Hz (the minimum range of human hearing) and the upper bound to roughly 8 kHz like RMVPE to cover the entire vocal range and beyond. Better yet, consider allowing the user to fully customize this in training and inference.
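For reference, RVC-derived pipelines typically hard-code these bounds and convert them to the mel scale before quantizing the pitch curve. The sketch below shows that pattern; it is paraphrased from memory, so treat the names and bin count as approximate rather than the repo's exact code:

```python
import numpy as np

# Typical hard-coded bounds in RVC-style pipelines (the values this
# issue proposes making configurable).
f0_min = 50.0
f0_max = 1100.0

# Bounds are mapped to mel (mel(f) = 1127 * ln(1 + f / 700)) so pitch
# can be quantized uniformly in perceptual space.
f0_mel_min = 1127.0 * np.log(1.0 + f0_min / 700.0)
f0_mel_max = 1127.0 * np.log(1.0 + f0_max / 700.0)

def coarse_f0(f0, bins=255):
    """Quantize a voiced f0 curve into `bins` coarse pitch classes.

    Sketch of the usual RVC-style post-processing: f0 is converted to
    mel, scaled into [1, bins], and rounded. Anything outside
    [f0_min, f0_max] saturates at the edges, which may be why editing
    f0_min/f0_max alone has little audible effect at inference time.
    """
    f0_mel = 1127.0 * np.log(1.0 + np.asarray(f0, dtype=float) / 700.0)
    scaled = (f0_mel - f0_mel_min) * (bins - 1) / (f0_mel_max - f0_mel_min) + 1
    return np.rint(np.clip(scaled, 1, bins)).astype(int)

print(coarse_f0([50.0, 440.0, 1100.0, 4000.0]))  # 4000 Hz saturates at the top
```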