Voximplant now lets developers build full-cascade voice AI pipelines in VoxEngine without sacrificing turn-taking quality. This release adds five new capabilities: Voice Activity Detection (VAD), end-of-turn detection, an OpenAI Chat Completions connector, an OpenAI Responses connector, and "bring your own LLM" support for third-party LLMs via OpenAI-compatible APIs. Combined with Voximplant's extensive Cloud Communications and Voice AI capabilities, developers get complete control over every stage of a speech-to-speech pipeline while keeping conversations fast and natural.
Voximplant's existing Voice AI connectors bundle speech input handling into a single end-to-end integration. That works well when a provider's built-in models cover your language, dialect, and voice requirements. However, when you need a specialized speech-to-text (STT) engine, a specific Large Language Model (LLM), or a particular text-to-speech (TTS) voice, the ability to assemble your own pipeline, choosing the best STT, LLM, and TTS components independently, becomes critical. With today's release, Voximplant gives you the building blocks to create flexible Voice AI pipelines without sacrificing the conversational quality that makes voice agents feel human.
Highlights
- End-of-turn detection — Understand when a caller has actually finished their thought, even through pauses, filler words like "ahh" and "ummm," and natural speech disfluencies. This prevents the agent from cutting in prematurely and enables rapid back-and-forth without awkward delays or crosstalk.
- Voice Activity Detection (VAD) — Detects when a caller starts and stops speaking so your application knows exactly when to capture audio, route it to an STT engine, or stop recording. Available at no additional cost.
- OpenAI Chat Completions API connector — A native VoxEngine module that connects directly to OpenAI's Chat Completions API. Ideal for developers who already use Chat Completions for text-based bots and want to extend the same LLM configuration to voice without rebuilding their infrastructure.
- OpenAI Responses API connector — A native VoxEngine module for OpenAI's newer Responses API, which supports multi-turn state handling, built-in tools, and WebSocket transport. The WebSocket interface makes it particularly well-suited for long-running, tool-call-heavy voice workflows.
- OpenAI-compatible connectors for third-party LLMs — Because several third-party LLM vendors implement the OpenAI API specification, both the Chat Completions and Responses connectors work with OpenAI-compatible models beyond OpenAI itself — allowing you to “Bring your own LLM” while using a consistent integration pattern.
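To make the "bring your own LLM" idea concrete, here is a minimal framework-free sketch of why OpenAI compatibility works: the Chat Completions request shape stays identical across providers, and only the base URL and model name change. The endpoints shown are real provider URLs, but the helper function and payload below are illustrative, not part of the Voximplant SDK:

```javascript
// Illustrative only: the same OpenAI-style Chat Completions payload works
// against any OpenAI-compatible endpoint; only baseUrl and model differ.
function buildChatRequest(baseUrl, model, userText) {
  return {
    url: `${baseUrl}/chat/completions`,
    body: {
      model,
      messages: [
        {role: "system", content: "You are a concise phone assistant."},
        {role: "user", content: userText},
      ],
      stream: true, // stream chunks so TTS can start before the full reply
    },
  };
}

// Same builder, two providers: only the endpoint and model name change.
const openai = buildChatRequest("https://api.openai.com/v1", "gpt-4o-mini", "Hi");
const groq = buildChatRequest("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile", "Hi");
```

This is the property the connectors rely on: pointing the same client at a different base URL is enough to swap model vendors.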
About Voice AI pipelines
Voximplant's existing Voice AI connectors — for platforms like Grok, Deepgram, Cartesia Line, and others — handle speech input, reasoning, and speech output as a single integrated stream. They are the fastest way to ship a voice agent when the provider's built-in speech and model capabilities match your requirements.
However, many production voice applications need more flexibility. Developers may need a speech-to-text engine that handles a specific dialect or industry vocabulary. You may want to route reasoning through a fine-tuned model that is not available inside an integrated connector. Or you may need a particular TTS voice that matches your brand. A full cascade pipeline lets you choose each component independently: your preferred STT feeds text to your preferred LLM, which feeds text to your preferred TTS — all orchestrated through VoxEngine and connected to phone numbers, SIP trunks, WhatsApp, or WebRTC.
The challenge with cascaded pipelines has always been interactivity. Without integrated speech detection and turn-taking, the result feels robotic — the agent either talks over the caller or waits too long to respond. Today's VAD and end-of-turn detection modules solve this directly inside VoxEngine, so full cascade pipelines can deliver the same natural conversational flow as integrated connectors.
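Conceptually, VAD applies hysteresis: speech "starts" when the per-frame speech probability crosses a threshold, and "ends" only after the probability stays below it for a minimum silence duration, so brief pauses do not split an utterance. Here is a simplified, framework-free sketch of that logic (parameter names mirror the VAD options used in the code example later in this post; the real module runs directly on the call's audio stream inside VoxEngine):

```javascript
// Simplified VAD-style segmenter. Frames are {tMs, prob} speech probabilities.
// Speech starts when prob >= threshold; it ends only after prob stays below
// the threshold for at least minSilenceDurationMs.
function segmentSpeech(frames, {threshold, minSilenceDurationMs, frameMs}) {
  const events = [];
  let speaking = false;
  let silenceMs = 0;
  for (const f of frames) {
    if (f.prob >= threshold) {
      if (!speaking) {
        speaking = true;
        events.push({type: "speech_start", tMs: f.tMs});
      }
      silenceMs = 0; // speech resumed, reset the silence counter
    } else if (speaking) {
      silenceMs += frameMs;
      if (silenceMs >= minSilenceDurationMs) {
        speaking = false;
        events.push({type: "speech_end", tMs: f.tMs});
        silenceMs = 0;
      }
    }
  }
  return events;
}
```

Note how a pause shorter than minSilenceDurationMs never emits a speech_end event, which is exactly what keeps mid-sentence hesitations from being treated as the end of a turn.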
Developer notes
- VAD module — Load with the Silero module. VAD detects voice activity on the audio stream and fires an event once the threshold and minimum silence duration you set are exceeded. You can use this event to trigger actions like starting STT capture or stopping a recording. The interface also includes a speech padding parameter that lets you adjust how aggressively audio is clipped around detected speech.
- Turn detection module — Load with the Pipecat module. Turn detection analyzes speech patterns to determine when a caller has finished speaking and expects a response. It handles variable pauses and speech disfluencies so the agent responds at the right moment. This API currently takes a single threshold parameter. We recommend using the turn-taking helper referenced in the example below, which integrates VAD with the additional timers that production Voice AI applications often need.
- OpenAI Chat Completions client — Load the OpenAI module and create a client via OpenAI.createChatCompletionsClient(). Pass your API key, model, and messages array. The client manages the WebSocket connection and streams completion chunks back to your scenario for TTS playback.
- OpenAI Responses client — Load the OpenAI module and create a client via OpenAI.createResponsesAPIClient(). The Responses API supports multi-turn conversation state, built-in tools, and a persistent WebSocket connection — well suited for agentic workflows that require function calling and extended interactions.
- Third-party OpenAI-compatible models — Both the Chat Completions and Responses clients accept a custom baseUrl parameter, so you can point them at any LLM provider that implements the OpenAI API specification. This gives you a single integration pattern for multiple model providers.
- Combining existing modules — VAD and turn detection work alongside Voximplant's existing STT modules (ASR, Deepgram ASR, etc.) and TTS modules (Cartesia, Inworld, ElevenLabs, etc.). Wire them together in a single VoxEngine scenario to build a complete pipeline.
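To illustrate the kind of timer logic the turn-taking helper layers on top of raw end-of-turn probability, here is a simplified sketch of choosing a submit timeout for a final transcript. This is not the helper's actual implementation; the parameter names mirror the policy options used in the full code example below:

```javascript
// Simplified sketch of a submit-timeout policy: short fragments that might
// continue ("I want to...") get a longer hold, while short complete
// utterances ("hey", "yes") are submitted quickly.
function submitTimeoutMs(transcript, confidence, policy) {
  const trimmed = transcript.trim();
  const words = trimmed.split(/\s+/).filter(Boolean);
  const isShort = trimmed.length <= policy.shortUtteranceMaxChars &&
                  words.length <= policy.shortUtteranceMaxWords;
  if (!isShort) return policy.userSpeechTimeoutMs;
  // Low-confidence short finals are held longer so a correction can replace them.
  if (confidence < policy.lowConfidenceShortUtteranceThreshold) {
    return policy.shortUtteranceExtensionMs;
  }
  return policy.fastShortUtteranceTimeoutMs;
}
```

The point of this shape is that the end-of-turn model's probability alone is not enough in production; the helper also has to reason about transcript length and STT confidence before committing a turn to the LLM.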
Demo
See the demo and code walkthrough video below.
Pricing and availability
All five capabilities are generally available and ready for use inside VoxEngine today.
End-of-turn detection is priced at $0.001 per stream for every 15 seconds of activity (0.4¢ per minute). We expect to halve this price in the near future.
Everything else is free from Voximplant:
Voice Activity Detection (VAD) is completely free, with no limits.
There is also no Voximplant charge for the OpenAI Chat Completions or Responses API clients; as always, text-based communication over our WebSocket gateways is free of charge. You provide your own API key for OpenAI or another LLM provider and are billed by that provider according to your account terms with them.
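For budgeting, the end-of-turn rate works out as follows. This assumes billing rounds up to whole 15-second increments per stream, which is our reading of the rate above rather than a documented guarantee:

```javascript
// End-of-turn detection: $0.001 per stream for every 15 seconds of activity.
// Assumes activity is billed in whole 15-second increments (an assumption,
// not confirmed pricing-page wording).
function endOfTurnCostUsd(activeSeconds) {
  const increments = Math.ceil(activeSeconds / 15);
  return increments * 0.001;
}
```

One minute of activity is four increments, i.e. $0.004, which is where the 0.4¢-per-minute figure comes from; a 10-minute fully active call costs about $0.04.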
Code example
This example includes:
- Full turn-taking controls
- The OpenAI Responses API client
- Use of a third-party LLM vendor (Groq) via OpenAI compatibility
- A full cascaded pipeline using Voximplant's built-in speech recognition (ASR/STT) options and streaming TTS
Load the turn-taking helper code from here into a new scenario. Then create another new scenario with the code below. Make sure the vox-turn-taking scenario is included in your routing rule together with the scenario below.
See the full guide for more details.
/**
* Full-cascade Voice AI demo: Deepgram STT + Groq Llama Responses API + Inworld TTS
* Scenario: answer an incoming call using VoxTurnTaking for turn management.
*
* Include `vox-turn-taking` in the routing rule sequence.
*
* Groq's Responses API is OpenAI-compatible, but it does not currently support
* `previous_response_id`. To keep this example simple, each turn is submitted
* independently instead of rebuilding prior conversation history locally.
*/
require(Modules.ASR);
require(Modules.OpenAI);
require(Modules.Inworld);
require(Modules.ApplicationStorage);
const SYSTEM_PROMPT = `
You are Voxi, a helpful phone assistant for Voximplant. Keep responses short, polite, and telephony-friendly (usually 1-2 sentences).
Reply in English.
`;
VoxEngine.addEventListener(AppEvents.CallAlerting, async ({call}) => {
  let stt;
  let responsesClient;
  let ttsPlayer;
  let turnTaking;

  const terminate = () => {
    stt?.stop();
    responsesClient?.close();
    turnTaking?.close();
    VoxEngine.terminate();
  };

  call.addEventListener(CallEvents.Disconnected, terminate);
  call.addEventListener(CallEvents.Failed, terminate);

  try {
    call.answer();
    call.record({hd_audio: true, stereo: true}); // optional recording

    stt = VoxEngine.createASR({
      profile: ASRProfileList.Deepgram.en_US,
      interimResults: true,
      request: {
        language: "en-US",
        model: "nova-2-phonecall",
        keywords: ["Voximplant:4", "OpenAI:2"],
      },
    });

    responsesClient = await OpenAI.createResponsesAPIClient({
      apiKey: (await ApplicationStorage.get("GROQ_API_KEY")).value,
      baseUrl: "https://api.groq.com/openai/v1",
      storeContext: false,
      onWebSocketClose: (event) => {
        Logger.write("===Groq.WebSocket.Close===");
        if (event) Logger.write(JSON.stringify(event));
        terminate();
      },
    });

    ttsPlayer = Inworld.createRealtimeTTSPlayer({
      createContextParameters: {
        create: {
          voiceId: "Ashley",
          modelId: "inworld-tts-1.5-mini",
          speakingRate: 1.1,
          temperature: 1.3,
        },
      },
    });

    // The VoxTurnTaking module is loaded as part of the routing rule
    turnTaking = await VoxTurnTaking.create({
      call,
      stt,
      vadOptions: {
        threshold: 0.5, // sensitivity for detecting speech vs silence
        minSilenceDurationMs: 350, // silence required before VAD marks speech end
        speechPadMs: 10, // small padding around detected speech
      },
      turnDetectorOptions: {
        threshold: 0.5, // end-of-turn probability needed from Pipecat
      },
      policy: {
        transcriptSettleMs: 500, // grace period for a final STT chunk after end-of-turn
        userSpeechTimeoutMs: 1000, // default fallback submit timeout after speech ends
        shortUtteranceExtensionMs: 1800, // longer hold for fragments that may continue
        fastShortUtteranceTimeoutMs: 700, // faster submit for short complete utterances like "hey"
        shortUtteranceMaxChars: 12, // max chars still treated as a short fragment
        shortUtteranceMaxWords: 2, // max words still treated as a short fragment
        lowConfidenceShortUtteranceThreshold: 0.75, // keep short low-confidence finals replaceable
      },
      enableLogging: true,
      onUserTurn: (input) => { // send the transcript text on end-of-turn
        responsesClient.createResponses({
          model: "llama-3.3-70b-versatile",
          instructions: SYSTEM_PROMPT,
          input,
        });
      },
      onInterrupt: () => {
        ttsPlayer?.clearBuffer(); // stop any in-progress TTS audio
      },
    });

    responsesClient.addEventListener(OpenAI.ResponsesAPIEvents.ResponseTextDelta, (event) => {
      const text = event?.data?.payload?.delta;
      if (!text || !turnTaking.canPlayAgentAudio()) return;
      ttsPlayer.send({send_text: {text}});
    });

    responsesClient.addEventListener(OpenAI.ResponsesAPIEvents.ResponseTextDone, (event) => {
      const text = event?.data?.payload?.text;
      Logger.write(`===AGENT=== ${text}`);
      ttsPlayer.send({flush_context: {}}); // tell TTS to process all buffered text immediately
    });

    // Event logging to illustrate available OpenAI Responses API client events
    [
      OpenAI.ResponsesAPIEvents.ResponseCreated,
      OpenAI.ResponsesAPIEvents.ResponseFailed,
      OpenAI.ResponsesAPIEvents.ResponsesAPIError,
      OpenAI.ResponsesAPIEvents.ResponseInProgress,
      OpenAI.ResponsesAPIEvents.ResponseCompleted,
      OpenAI.ResponsesAPIEvents.ResponseOutputItemAdded,
      OpenAI.ResponsesAPIEvents.ResponseContentPartAdded,
      OpenAI.ResponsesAPIEvents.ConnectorInformation,
      OpenAI.ResponsesAPIEvents.Unknown,
      OpenAI.Events.WebSocketMediaStarted,
      OpenAI.Events.WebSocketMediaEnded,
    ].forEach((eventName) => {
      responsesClient.addEventListener(eventName, (event) => {
        Logger.write(`===${event?.name || eventName}===`);
        if (event?.data) Logger.write(JSON.stringify(event.data));
      });
    });

    // Attach the caller media
    call.sendMediaTo(stt);
    ttsPlayer.sendMediaTo(call);

    // Tell the LLM to talk first and greet the user
    responsesClient.createResponses({
      model: "llama-3.3-70b-versatile",
      instructions: SYSTEM_PROMPT,
      input: "Greet the caller briefly.",
    });
  } catch (error) {
    Logger.write("===UNHANDLED_ERROR===");
    Logger.write(error);
    terminate();
  }
});
References
General Voice AI
- Voximplant Voice AI platform — https://voximplant.ai
- Full-cascade with “bring your own LLM” guide — https://docs.voximplant.ai/voice-ai-connectors/openai/full-cascade-groq
- Pricing information — https://voximplant.com/pricing
- Sign up for Voximplant — https://manage.voximplant.com/auth/sign_up
OpenAI
- OpenAI product page — https://voximplant.com/products/openai-client
- Chat Completions API Client Guide — https://voximplant.com/docs/voice-ai/openai/chat-completions-client
- Responses API Client Guide — https://voximplant.com/docs/voice-ai/openai/responses-client
- OpenAI module API reference — https://voximplant.com/docs/references/voxengine/openai
VAD and Turn Detection
- VAD and Turn Detection product page — https://voximplant.com/products/turn-detection
- VAD and Turn Guides — https://docs.voximplant.ai/capabilities/speech-flow-control/
- Silero Module (VAD) API reference — https://voximplant.com/docs/references/voxengine/silero
- Pipecat Module (Turn detection) API reference — https://voximplant.com/docs/references/voxengine/pipecat