I tried a Python script from this discussion with a few changes:

```python
if __name__ == "__main__":
    import threading

    import pyaudio
    from RealtimeSTT import AudioToTextRecorder

    # Audio stream configuration constants
    CHUNK = 4 * 1024          # Number of audio samples per buffer
    FORMAT = pyaudio.paInt16  # Sample format (16-bit integer)
    CHANNELS = 1              # Mono audio
    RATE = 48000              # Sampling rate in Hz (expected by the recorder)

    # Initialize the audio-to-text recorder without using the microphone directly.
    # Since we are feeding audio data manually, set use_microphone to False.
    recorder = AudioToTextRecorder(
        use_microphone=False,  # Disable built-in microphone usage
        spinner=False          # Disable spinner animation in the console
    )

    # Event to signal when to stop the threads
    stop_event = threading.Event()

    def feed_audio_thread():
        """Thread function to read audio data and feed it to the recorder."""
        p = pyaudio.PyAudio()
        # Open an input audio stream with the specified configuration
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK
        )
        try:
            print("Speak now")
            while not stop_event.is_set():
                # Read audio data from the stream (in the expected format)
                data = stream.read(CHUNK, exception_on_overflow=False)
                # Feed the audio data to the recorder
                recorder.feed_audio(data)
        except Exception as e:
            print(f"feed_audio_thread encountered an error: {e}")
        finally:
            # Clean up the audio stream
            stream.stop_stream()
            stream.close()
            p.terminate()
            print("Audio stream closed.")

    def recorder_transcription_thread():
        """Thread function to handle transcription and process the text."""
        def process_text(full_sentence):
            """Callback function to process the transcribed text."""
            print("Transcribed text:", full_sentence)
            # Check for the stop command in the transcribed text
            if "stop recording" in full_sentence.lower():
                print("Stop command detected. Stopping threads...")
                stop_event.set()
                recorder.abort()
        try:
            while not stop_event.is_set():
                # Get transcribed text and process it using the callback
                recorder.text(process_text)
        except Exception as e:
            print(f"transcription_thread encountered an error: {e}")
        finally:
            print("Transcription thread exiting.")

    try:
        # Create and start the audio feeding thread
        audio_thread = threading.Thread(target=feed_audio_thread)
        audio_thread.daemon = False  # Ensure the thread doesn't exit prematurely
        audio_thread.start()

        # Create and start the transcription thread
        transcription_thread = threading.Thread(target=recorder_transcription_thread)
        transcription_thread.daemon = False  # Ensure the thread doesn't exit prematurely
        transcription_thread.start()

        # Wait for both threads to finish
        audio_thread.join()
        transcription_thread.join()
    except KeyboardInterrupt:
        print("Recording and transcription have stopped.")
        print("exiting...")
    finally:
        recorder.shutdown()
```

The changes I made are:
It previously worked without any problem; however, when I tried again, something seems to have changed and the program no longer processes the audio. Here's how I set things up before running the script:

```shell
# create fake/virtual/dummy sink
$ pactl load-module module-null-sink sink_name=steam

# ensure the sink is created
$ pactl list short sinks
798 steam PipeWire float32le 2ch 48000Hz RUNNING

# search audio index for chrome process
$ pactl list sink-inputs short
6573 798 6572 PipeWire float32le 2ch 48000Hz

# redirect the audio to the dummy sink
$ pactl move-sink-input 6573 steam

# test streaming ffmpeg from steam.monitor through a pipe to ffplay (and it works as expected)
$ ffmpeg -f pulse -i steam.monitor -f s16le -ar 16k -acodec pcm_s16le -ac 1 -loglevel quiet - | ffplay -f s16le -ar 48k -

$ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/reyuki/software/open-source/RealtimeSTT/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib/"

# setup virtual environment and activate it
$ ffmpeg -f pulse -i steam.monitor -f s16le -ar 48k -acodec pcm_s16le -ac 1 -loglevel quiet - | python ./main.py
ALSA lib pcm_dsnoop.c:567:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2722:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2722:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2722:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_dsnoop.c:567:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
Speak now
Recording and transcription have stopped.
exiting...
RealtimeSTT shutting down
RealTimeSTT: root - ERROR - Error receiving data from connection: handle is closed
^CException ignored in: <module 'threading' from '/usr/lib/python3.12/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1624, in _shutdown
    lock.acquire()
KeyboardInterrupt:
^CException ignored in atexit callback: <function _exit_function at 0x7b6bf7776020>
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/util.py", line 363, in _exit_function
    _run_finalizers()
  File "/usr/lib/python3.12/multiprocessing/util.py", line 303, in _run_finalizers
    finalizer()
  File "/usr/lib/python3.12/multiprocessing/util.py", line 227, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/queues.py", line 219, in _finalize_join
    thread.join()
  File "/usr/lib/python3.12/threading.py", line 1149, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1169, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt:
```

I also tried the example browser demo to make sure it's not a hardware issue, and it works as expected and generates transcribed text. I'm new to audio stuff (I only learned the very basics a few days ago) and have zero knowledge of Python and AI, so I'm definitely missing something. Please point me in the right direction, thanks :)

Some references I used:
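One detail worth noting about the setup above: the final ffmpeg command pipes s16le PCM into the script's stdin, but the script opens its own PyAudio input device and never reads that pipe. If the intent is to consume the piped stream instead, a minimal sketch of a stdin pump could look like this (hypothetical helper, not from the thread; `pump_pcm` and its parameters are illustrative names):

```python
import sys

def pump_pcm(feed, stream=None, chunk_size=1024):
    """Read raw PCM bytes from a binary stream (stdin by default) and pass
    each chunk to `feed` until the stream is exhausted."""
    if stream is None:
        stream = sys.stdin.buffer  # binary stdin, e.g. the ffmpeg pipe
    while True:
        data = stream.read(chunk_size)
        if not data:  # EOF: the upstream ffmpeg process ended
            break
        feed(data)

# In the script, this could replace the PyAudio read loop, e.g.:
#   pump_pcm(recorder.feed_audio)
```

With this approach, the ffmpeg side would have to deliver exactly what the recorder expects (16-bit mono PCM at 16 kHz, i.e. `-ar 16k -ac 1`), and PyAudio would not be needed at all.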
Replies: 1 comment 2 replies
feed_audio needs raw PCM chunks at 16,000 Hz, mono, 16-bit, or NumPy arrays with the sample rate passed as a parameter. Ideally the chunks are 1024 bytes, but it's flexible about sizes.

Since you're grabbing audio at 48,000 Hz from PyAudio, you'll need to downsample it to 16,000 Hz first, because Whisper and Silero are built for 16 kHz audio. Use something like scipy.signal.resample to handle the downsampling. Also, make sure the chunks stay in sync and are passed to feed_audio in real time (which should be given here, since PyAudio records and delivers chunks in real time).
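Following the advice above, a minimal sketch of a per-chunk 48 kHz → 16 kHz downsampling helper (illustrative names; assumes NumPy and SciPy are installed):

```python
import numpy as np
from scipy.signal import resample

SRC_RATE = 48000  # rate the PyAudio stream delivers
DST_RATE = 16000  # rate the recorder (Whisper/Silero) expects

def downsample_chunk(data: bytes) -> bytes:
    """Convert one 16-bit mono PCM chunk from 48 kHz to 16 kHz."""
    samples = np.frombuffer(data, dtype=np.int16)
    # 48000 / 16000 = 3, so the output has a third as many samples
    target_len = len(samples) * DST_RATE // SRC_RATE
    resampled = resample(samples, target_len)  # FFT-based resampling
    return resampled.astype(np.int16).tobytes()

# In feed_audio_thread, the feed line would then become:
#   recorder.feed_audio(downsample_chunk(data))
```

For a fixed integer ratio like 3:1, `scipy.signal.resample_poly(samples, 1, 3)` is a common alternative that avoids FFT edge artifacts on short chunks.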