-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changes in trim_silence function regarding the agressive trimming issue #46
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there, thanks for helping out with the project!
This is an interesting approach. Quite different to what I was expecting to address this issue.
The biggest question I have is whether the benefits of this approach outweigh the added complexity of the implementation?
The complexity arises in that we would be adding both numpy and pandas as requirements for this small task. It's also more complex for people wanting to read, understand and work on the code in the future.
Interested to hear your thoughts, and what the benefits that this approach brings over doing something like adding a manual buffer of x
seconds to the start and end trims.
Thanks again :)
backend/app/audio.py
Outdated
@@ -1,28 +1,28 @@ | |||
"""audio processing, etc""" | |||
from pydub import AudioSegment | |||
import pandas as pd | |||
import numpy as np | |||
import soundfile as sf | |||
|
|||
|
|||
class Audio: | |||
silence_threshold = -50.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this would no longer be used anywhere is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it will be used in saving the audio
backend/app/audio.py
Outdated
while sound[trim_ms:trim_ms + Audio.chunk_size].dBFS \ | ||
< Audio.silence_threshold and trim_ms < len(sound): | ||
trim_ms += Audio.chunk_size | ||
def _detect_leading_silence(sound: bytearray, sample_rate: int) -> list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method should probably be changed to reflect the new nature of it. It is no longer "detecting leading silence" it is generating a mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I am working on it
backend/app/audio.py
Outdated
|
||
|
||
class Audio: | ||
silence_threshold = -50.0 | ||
chunk_size = 10 | ||
threshold_value = 0.0007 # Tweak the value of threshold to get the aggressive trimming |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this threshold value represent? ie what are the units?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the mask threshold value it will cut the mean value got from rolling windows and discard the value below threshold
backend/app/audio.py
Outdated
y_mean = y.rolling(window=int(sample_rate / 20), | ||
min_periods=1, | ||
center=True).max() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you break this down for me? What are we getting from this and how do we know that detects silence?
backend/app/audio.py
Outdated
|
||
return trim_ms | ||
return [True if mean > Audio.threshold_value else False for mean in y_mean] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of a comparison is already a Boolean, eg x > y
evaluates to either True or False. So this can be simplified to something like:
return [mean > Audio.threshold_value for mean in y_mean]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup I think this is right
backend/app/audio.py
Outdated
trimmed_sound = sound[start_trim:duration - end_trim] | ||
sound, rate = sf.read(path + ".wav") | ||
mask = Audio._detect_leading_silence(sound, rate) | ||
trimmed_sound = sound[mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to remove all periods of silence, rather than only silence at the beginning and end of the clip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will remove all the silence from signal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be misunderstanding but won't this also remove any apparent "silence" during the speech? Meaning any natural pauses will be cut out and the speech will sound very run together?
I don't think this is the behaviour we are after for TTS recording. Is that your intended use case?
backend/app/audio.py
Outdated
import pandas as pd | ||
import numpy as np | ||
import soundfile as sf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally keep these as their full length names. This makes the code throughout more readable as you don't need to return and check what sf
refers to for example.
I think you above statement is right NumPy and pandas are adding stress into the alpine version of python. meanwhile, I am also working on a different approach which is less heavy and computationally efficient |
above approach also need some tweak in dockerfile |
why are we using 2 channels while saving the file |
Hey, glad that you're finding the project helpful and able to modify it to fit your use case. The removal of all silence makes more sense now as it seems you're recording single words rather than whole sentences. It feels like these changes would be well suited to be configuration options set in the docker-compose.yaml eg:
|
How to use this template
Under each heading below is a short outline of the information required. When submitting a PR, please delete this text and replace it with your own.
The CLA section can be deleted entirely.
Description
update the trimming function using rolling windows to create the mask around the signal
{“fixes #{35}”}
If needed follow up with as much detail as required.
Type of PR
Documentation
The most important and tweakable part is threshold_value do play around with the value to get your desired trimming
CLA
To protect you, the project, and those who choose to use Mycroft technologies in systems they build, we ask all contributors to sign a Contributor License Agreement.
This agreement clarifies that you are granting a license to the Mycroft Project to freely use your work. Additionally, it establishes that you retain the ownership of your contributed code and intellectual property. As the owner, you are free to use your code in other work, obtain patents, or do anything else you choose with it.
If you haven't already signed the agreement and been added to our public Contributors repo then please head to https://mycroft.ai/cla to initiate the signing process.