@ChanJianHao

ChanJianHao commented Dec 22, 2024

Hi @evermoving,

I've made some enhancements to your excellent project and would like to submit them as a pull request. These additions focus on expanding functionality, improving performance, and increasing user customization.

Here's a summary of the key contributions:

Expanded Model Support (Faster-Whisper): I've integrated additional faster-whisper models, providing users with more options for balancing speed and accuracy in their transcriptions. This should improve processing time, especially for longer audio/video files.

Added Translation Capabilities: In addition to transcription, the program now supports translation. Users can generate English subtitles for foreign-language films (or other audio/video content), which opens up a whole new range of use cases for the project.

Increased Customization Options: I've added several options to allow users to fine-tune the program's behavior:

  • Thread Control: Users can now adjust the number of threads used for processing, allowing them to optimize performance based on their hardware.
  • Timeout Setting: A timeout setting has been added to prevent the program from hanging indefinitely on problematic files.
  • Source Language Selection: Users can now explicitly specify the source language, which can improve transcription accuracy in some cases.

Minor UI Improvements: I've made some small improvements to the user interface to enhance usability and clarity.

Blacklist for Hallucinations: Filters out certain common sentences that the model hallucinates during silence.

I believe these changes would enhance the project's functionality and make it even more valuable to users. Thank you for creating such a fantastic project! I've really enjoyed working with it.

I'm eager to hear your feedback on these changes. Please let me know if there are any adjustments needed. If the updates are acceptable to you, please help to update the releases too!

Thank you :)

@evermoving
Owner

Hi, thanks for your contribution. I will review it soon.

@ChanJianHao
Author

ChanJianHao commented Dec 24, 2024

Hello,

I've made a few more commits in the past 24 hours to improve the handling of hallucinations and to make the filter optional. I have also made writing the transcription to disk optional. However, these changes seem to be triggering a Windows Defender detection when building with PyInstaller. I can't resolve it at the moment without considerable effort.

pyinstaller/pyinstaller#5668

Edit: It seems the latest commit, which includes some refactoring, fixed the Microsoft false positive.

@fznx922

fznx922 commented Dec 31, 2024

awesome work, been using your branch as it's exactly what i was looking for. an issue i did find: when using the turbo models and translating from JA to EN, it would display japanese instead of the translation. swapping to another model solved this. wondering if it's a model limitation or code based? either way thanks so much for this, from both of you :)

Owner

evermoving left a comment

  • Some of the models you added to the dropdown menu (e.g. large-v3-turbo) don't appear to be supported; the console produces an error with it selected.

  • I don't want the auto language detection to be entirely replaced with explicit language declaration, there are situations where auto is useful, like when language is unknown or changes multiple times in a conversation. A better approach would be to have auto detection enabled with a checkbox by default, and language selection be optional.

  • The hallucinations.txt should be empty by default, users might or might not want to filter these words. It needs a different name as well, like filteredwords.txt, as it appears that it's filtering correctly detected words, as opposed to hallucinations.

Other than that I would be happy to approve the PR as the features seem useful.

@ChanJianHao
Author

Hi @evermoving, thanks for the review. Thank you @fznx922 for the positive feedback too! I am glad that translation is something that other users are interested in 👍🏻

  • Some of the models you added to the dropdown menu (e.g. large-v3-turbo) don't appear to be supported; the console produces an error with it selected.

Sure, let's remove those non-working ones. Strangely they were listed as available models by Faster-Whisper. My bad.

  • I don't want the auto language detection to be entirely replaced with explicit language declaration, there are situations where auto is useful, like when language is unknown or changes multiple times in a conversation. A better approach would be to have auto detection enabled with a checkbox by default, and language selection be optional.

How about we leave the source language textbox as an empty string "" by default and change the tooltip to let users know that it is optional? This way the program should default to auto language detection without needing another checkbox (there are already several checkboxes now).

If you are ok with this, I will proceed to code and test out if it can work.
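
For illustration, the "empty string = auto" idea could look roughly like the sketch below. This is not the code in this PR: the helper name `resolve_language` is hypothetical, and the commented call site only assumes that faster-whisper's `WhisperModel.transcribe` treats `language=None` as "detect the language automatically".

```python
from typing import Optional


def resolve_language(textbox_value: str) -> Optional[str]:
    """Map the raw textbox contents to a language code, or None for auto-detect.

    Hypothetical helper: the name and the trimming/lowercasing rules are
    assumptions made for this sketch.
    """
    cleaned = textbox_value.strip().lower()
    # Empty (or whitespace-only) input means "let the model detect the language".
    return cleaned or None


# Hypothetical call site; `model` would be a faster_whisper.WhisperModel:
# segments, info = model.transcribe(
#     audio, language=resolve_language(value), task="translate"
# )
```

With this shape, no extra checkbox is needed: leaving the textbox blank keeps auto detection, and typing a code such as `ja` overrides it.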

  • The hallucinations.txt should be empty by default, users might or might not want to filter these words. It needs a different name as well, like filteredwords.txt, as it appears that it's filtering correctly detected words, as opposed to hallucinations.

I actually thought hallucinations.txt could be pre-filled for the convenience of all System Captioner users, while keeping the filter itself optional. Those words/sentences are typical hallucinations that appear whenever there are silences during translation, as mentioned in many issues and as I found during my hours of testing translation on foreign films:

openai/whisper#928
SYSTRAN/faster-whisper#826
openai/whisper#1606

It would take each end user significantly more time to compile their own list just to get rid of the typical hallucinations. A pre-filled list lets them simply turn the feature on and enjoy it, a sort of "plug and play" convenience. I agree that it may filter out a correct translation, especially if the speaker really did say something like "Thank you for watching", but that is simply a consequence of the training data for Whisper models. That is why the hallucination filter can be disabled and the text file can be freely edited by the user.

Sometimes the console log may also print that it is filtering because it detected extra spaces or new lines. I am still trying to find the right balance for this.
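
To make the whitespace issue concrete, here is a rough sketch of how an optional phrase filter could normalize text before comparing. It is an illustration under stated assumptions, not the implementation in this branch: the function names are hypothetical, and matching the whole segment (rather than substrings) is a design choice of this sketch.

```python
import re
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase, collapse runs of whitespace (including newlines), and trim."""
    return re.sub(r"\s+", " ", text).strip().lower()


def load_filter_phrases(lines: Iterable[str]) -> Set[str]:
    """Build the phrase set from the lines of the filter file, skipping blanks."""
    return {normalize(line) for line in lines if line.strip()}


def is_filtered(segment_text: str, phrases: Set[str], enabled: bool = True) -> bool:
    """Return True only when the whole segment equals a listed phrase."""
    if not enabled:
        # The filter is optional; when disabled nothing is dropped.
        return False
    return normalize(segment_text) in phrases
```

Because only whole-segment matches are dropped, a longer sentence that merely contains a listed phrase is kept, and extra spaces or newlines neither cause nor prevent a match.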

@evermoving
Owner

@ChanJianHao The 'empty string = auto' approach is a good idea.

Regarding the hallucinations, the program already has a VAD (voice activity detection) filter that makes the program not process any major silence, which should eliminate the hallucinatory behavior seen in those more basic whisper implementations. Have you experienced those hallucinations with System Captioner?

@ChanJianHao
Author

ChanJianHao commented Jan 2, 2025

Hi @evermoving

Happy New Year!

Great! I have made the relevant changes for language to make it optional, and have also removed the turbo models which are giving errors. As I lack a powerful GPU to test the larger models, please feel free to edit my branch should there be any more faulty models on the list.

I have also renamed the hallucination file to make its purpose clearer, as per your feedback. Indeed, the VAD (voice activity detection) filter is very useful for English transcription and rarely has issues.

However, from my testing on several machines with Translation Mode for Japanese and Chinese, hallucination can be quite common, and hence my implementation of an optional pre-built filter.

To see hallucination when translating, you may test it with something like:
https://www.youtube.com/watch?v=D_DtKgsr9WQ

Try running it for a while with the 'small' or 'tiny' model, pausing from time to time to create silences. You'll notice that without the hallucination filter, despite VAD, it starts producing strange output even though there's no audio, frequently saying things like "Thank you" or "I am sorry" even though the speaker in the video is giving a very different speech.

Thank you.

ChanJianHao requested a review from evermoving January 4, 2025 02:28

@evermoving
Owner

@ChanJianHao Hi, unfortunately, because of personal circumstances, including traveling without my main PC, I haven't been active on GitHub in the last few weeks. Thanks for letting me know that VAD doesn't solve the hallucinations for other languages. If so, the filter is worth including. I will review your changes soon.

@ChanJianHao
Author

@evermoving No worries, thanks for the update and looking forward to your return! 😊 Please do not hesitate to let me know if any more changes are required.
