I am using OpenAI’s real-time API (gpt-4o-realtime-preview-2024-12-17) in a React-based application for live transcription and response generation. However, I am facing an issue where the transcribed text and the generated speech output do not align properly. Sometimes the text appears earlier than expected, or the audio plays with a delay.
Implementation Details:
The application uses WebSockets to stream real-time audio to OpenAI.
I am using the RealtimeClient from OpenAI's API to send and receive live audio responses.
The WavRecorder and WavStreamPlayer are used to handle audio streaming and playback, since the audio is in 16-bit PCM format.
The text responses are updated dynamically as they arrive via the API.
This is the code for connecting to the API:
const connectConversation = useCallback(async () => {
  const client = clientRef.current;
  const wavRecorder = wavRecorderRef.current;
  const wavStreamPlayer = wavStreamPlayerRef.current;
  await wavRecorder.begin();
  await wavStreamPlayer.connect();
  try {
    const response = await client.connect();
    if (response) {
      setLoading(false);
      client.sendUserMessageContent([{ type: "input_text", text: "Hello!" }]);
      if (client.getTurnDetectionType() === "server_vad") {
        await wavRecorder.record((data) => client.appendInputAudio(data.mono));
      }
    }
  } catch (error) {
    console.error("Error connecting:", error);
  }
}, []);
This is the code for handling the response:
client.on("conversation.updated", async ({ item, delta }) => {
  if (item.role === "assistant" && delta?.audio) {
    wavStreamPlayer.add16BitPCM(delta.audio, item.id);
    textRef.current = item.formatted.transcript; // Text updates immediately
  } else if (delta?.text) {
    textRef.current = item.formatted.transcript;
  }
  if (item.status === "completed" && item.formatted.audio?.length) {
    const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
    setAudiosrc(wavFile.url);
  }
});
Problem observed
I couldn't get the text to scroll in sync with the audio.
The scrolling logic is based on a duration estimated at 150 words per minute:
const scrollText = () => {
  if (!scrollContainerRef.current) return;
  const container = scrollContainerRef.current;
  const currentTime = Date.now();
  const elapsed = currentTime - scrollStartTimeRef.current;
  const duration = getScrollDuration(text);
  if (elapsed >= duration) {
    container.scrollTop = container.scrollHeight - container.clientHeight;
    return;
  }
  const progress = elapsed / duration;
  const targetScrollTop = container.scrollHeight - container.clientHeight;
  // Smooth easing function for better scrolling
  const easeInOutQuad = (t) =>
    t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;
  container.scrollTop = targetScrollTop * easeInOutQuad(progress);
  animationFrameRef.current = requestAnimationFrame(scrollText);
};
Approach taken
Converting 16-bit PCM into an audio source
const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
setAudiosrc(wavFile.url);
- However, conversion takes time depending on the length of the response, causing desynchronization.
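For context on the timing, the audio length can be computed directly from the raw PCM size without decoding it to a WAV file first. The following is only a minimal sketch, assuming mono 16-bit PCM at 24 kHz (the rate passed to WavRecorder.decode above); pcmDurationMs and createDurationTracker are hypothetical helper names, not part of the Realtime client.

```javascript
// Assumption: mono, 16-bit (2 bytes per sample) PCM at 24 kHz.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2;

// Returns the playback duration in milliseconds for a given number of PCM bytes.
function pcmDurationMs(byteLength, sampleRate = SAMPLE_RATE) {
  const samples = byteLength / BYTES_PER_SAMPLE;
  return (samples / sampleRate) * 1000;
}

// Hypothetical accumulator: sums the duration of each audio chunk as it arrives.
// Any object with a byteLength (Int16Array, ArrayBuffer) can be passed to add().
function createDurationTracker() {
  let totalMs = 0;
  return {
    add(chunk) {
      totalMs += pcmDurationMs(chunk.byteLength);
      return totalMs;
    },
    get totalMs() {
      return totalMs;
    },
  };
}
```

In the handler, something like tracker.add(delta.audio) could be called for each audio delta, so the total duration received so far is known while the response is still streaming.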
Scrolling based on word count (150 WPM rule)
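const getScrollDuration = (text) => {
  const wordsPerMinute = 150;
  // Split on any run of whitespace so repeated spaces are not counted as words
  const words = text.trim().split(/\s+/).length;
  return (words / wordsPerMinute) * 60 * 1000; // duration in milliseconds
};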
This works for short responses but fails for larger responses due to variation in speech speed.
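One direction I am considering instead of a fixed words-per-minute estimate is to drive the scroll position from the fraction of audio that has actually played. This is only a sketch under assumptions: playedMs and totalMs are assumed to be known (for example, derived from PCM byte counts), and scrollTopForProgress is a hypothetical helper name.

```javascript
// Maps audio playback progress to a scrollTop value for the container.
// playedMs: milliseconds of audio already played (assumed to be tracked elsewhere)
// totalMs: milliseconds of audio received so far for this response
function scrollTopForProgress(playedMs, totalMs, scrollHeight, clientHeight) {
  // Maximum distance the container can scroll; never negative
  const maxScroll = Math.max(0, scrollHeight - clientHeight);
  if (totalMs <= 0) return 0;
  // Clamp progress to [0, 1] so late or early timestamps cannot overshoot
  const progress = Math.min(1, Math.max(0, playedMs / totalMs));
  return maxScroll * progress;
}
```

Inside the requestAnimationFrame loop this would replace the elapsed / duration calculation, e.g. container.scrollTop = scrollTopForProgress(playedMs, totalMs, container.scrollHeight, container.clientHeight). It still assumes the speech is spread evenly across the transcript, and both the total duration and the text height keep growing while the response streams, so I am not sure it is accurate enough.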
Questions:
- How can I accurately sync the text scroll with the real-time audio playback?
- Are there any existing libraries or best practices for handling text-audio synchronization in real-time applications?
Any insights or suggestions would be greatly appreciated!