VoiceBot API
  • 12 Jul 2022
Summary

Voice Channel Adapter API

The channelURL can be any valid URL that implements this API. It is configured in the system when a new channel needs access.

Voice Service Input API

  • POST /voiceservice/sessions - Create a session.
  • POST /voiceservice/sessions/{sessionId}/inputs - Receive an input. The voice service receives call input.
  • DELETE /voiceservice/sessions/{sessionId} - Delete a session.
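As a minimal sketch of how a client might compose these three endpoint paths, here is a hypothetical Python helper. `BASE_URL` is an assumption; substitute the actual voice service host.

```python
# Hypothetical helper for building the Voice Service Input API paths above.
# BASE_URL is an assumption, not part of the documented API.
BASE_URL = "https://voiceservice.example.com"

def sessions_url(session_id: str = "") -> str:
    """URL for creating (POST) or deleting (DELETE) a session."""
    url = BASE_URL + "/voiceservice/sessions"
    return url + "/" + session_id if session_id else url

def inputs_url(session_id: str) -> str:
    """URL for POSTing call input into an existing session."""
    return sessions_url(session_id) + "/inputs"
```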

Voice Service Notification API

Voice Bot Service API

STT & TTS API

Provide STT (Speech to Text) and TTS (Text to Speech) capabilities.

Endpoints

Voice Channel Adapter API

The channelURL can be any valid URL that implements this API. It is configured in the system when a new channel needs access.

The Voice Channel Adapter receives input.

POST /{channelURL}

Parameters

Request body
The request body is a Voice Message Object with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| sessionId | string | yes | | channelIdentifier. |
| content | VoiceAction[] | yes | | |

example:

  {
      "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
      "content": [{
          "type": "playAudioAction",
          "content": {
              "type": "voice",
              "voice": "string",
              "voiceConfig": {
                  "encoding": "LINEAR16",
                  "sampleRateHertz": 8000
              }
          }
      }]
  }
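A minimal sketch of building this Voice Message payload in Python; the function name and defaults are assumptions for illustration, not part of the API.

```python
import json

def play_audio_message(session_id: str, voice_b64: str,
                       encoding: str = "LINEAR16",
                       sample_rate: int = 8000) -> str:
    """Build the Voice Message Object POSTed to the channelURL (sketch)."""
    message = {
        "sessionId": session_id,
        "content": [{
            "type": "playAudioAction",
            "content": {
                "type": "voice",
                "voice": voice_b64,  # base64-encoded audio bytes
                "voiceConfig": {
                    "encoding": encoding,
                    "sampleRateHertz": sample_rate,
                },
            },
        }],
    }
    return json.dumps(message)
```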

Response

Response
HTTP/1.1 200 OK

Voice Service API

Create Sessions

POST /voiceservice/sessions

Parameters

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| channel | string | yes | | Type of the channel, including Twilio, SIP. |
| channelIdentifier | string | yes | | The unique ID corresponding to voicebotId, such as a phone number or SIP URI. |
| visitor | Visitor Object | no | | Visitor information. |
| variables | Variable[] | no | | Variables. |
| ttsVoiceConfig | TTSVoiceConfig Object | text to speech | | The configuration of the response audio. |

example:

  {
      "channel": "Twilio",
      "channelIdentifier": "q3f5b438-xw31-44af-b729-64swaf3d0b56",
      "visitor": {
          "phone": "123-4355-212"
      },
      "variables": [{
          "name": "abc",
          "value": "123"
      }]
  }

Response

The response body contains data with the following structure:

| Name | Type | Description |
|---|---|---|
| sessionId | Guid | The unique ID of the call. |
| content | VoiceAction[] | Greeting output. |

Response

HTTP/1.1 200 OK
Content-Type: application/json

  {
      "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
      "content": [{
          "type": "playAudioAction",
          "content": {
              "type": "voice",
              "voice": "string",
              "voiceConfig": {
                  "encoding": "LINEAR16",
                  "sampleRateHertz": 8000
              }
          }
      }]
  }
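A sketch of consuming this Create Session response on the caller side; the variable names are illustrative assumptions.

```python
import json

# The Create Session response body, as documented above.
RESPONSE = """
{
    "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
    "content": [{
        "type": "playAudioAction",
        "content": {
            "type": "voice",
            "voice": "string",
            "voiceConfig": {"encoding": "LINEAR16", "sampleRateHertz": 8000}
        }
    }]
}
"""

body = json.loads(RESPONSE)
session_id = body["sessionId"]   # keep for later /inputs and DELETE calls
greeting = body["content"]       # VoiceAction[] greeting output to play
```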

Receive an input

POST /voiceservice/sessions/{sessionId}/inputs

Parameters

Path parameters

| Name | Type | Required | Description |
|---|---|---|---|
| sessionId | Guid | yes | |

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| type | string | yes | | Type of the input, including textInput, voiceInput. |
| textInput | string | When type is textInput | | Text input to the voice bot. |
| voiceInput | string | When type is voiceInput | | The audio data bytes encoded as specified in VoiceConfig. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use a base64-encoded string. |
| sttVoiceConfig | STTVoiceConfig Object | When type is voiceInput | | Provides information to the recognizer that specifies how to process the request. |
| ttsVoiceConfig | TTSVoiceConfig Object | text to speech | | The configuration of the response audio. |

example:

{
    "type": "voiceInput",
    "voiceInput": "string",
    "sttVoiceConfig": {
        "encoding": "AMR",
        "sampleRateHertz": 8000
    },
    "ttsVoiceConfig": {
        "encoding": "AMR",
        "sampleRateHertz": 8000
    }
}
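Since `voiceInput` carries raw audio as a base64 string, a client has to encode the bytes before building this body. A hedged sketch (function name and defaults are assumptions):

```python
import base64
import json

def voice_input_body(audio: bytes, encoding: str = "AMR",
                     sample_rate: int = 8000) -> str:
    """Build the request body for POST /voiceservice/sessions/{id}/inputs."""
    cfg = {"encoding": encoding, "sampleRateHertz": sample_rate}
    return json.dumps({
        "type": "voiceInput",
        # JSON representations carry the audio bytes base64-encoded
        "voiceInput": base64.b64encode(audio).decode("ascii"),
        "sttVoiceConfig": cfg,
        "ttsVoiceConfig": cfg,
    })
```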

Response

The response body contains data with the following structure:

| Name | Type | Description |
|---|---|---|
| content | VoiceAction[] | |

Response

HTTP/1.1 200 OK
Content-Type: application/json

  {
      "content": [{
          "type": "playAudioAction",
          "content": {
              "type": "voice",
              "voice": "string",
              "voiceConfig": {
                  "encoding": "LINEAR16",
                  "sampleRateHertz": 8000
              }
          }
      }]
  }

Delete a Session

DELETE /voiceservice/sessions/{sessionId}

Parameters

Path parameters

| Name | Type | Required | Description |
|---|---|---|---|
| sessionId | Guid | yes | |

Response

HTTP/1.1 200 OK

Voice Service Notification API

Voice Service receives notifications

POST /voiceservice/sessions/{sessionId}/notifications

Parameters

Path parameters

| Name | Type | Required | Description |
|---|---|---|---|
| sessionId | Guid | yes | |

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| content | VoiceBotAction[] | yes | | |

example:

  {
      "content": [{
          "type": "playText",
          "content": {
              "message": "Hi there! I'm a VoiceBot, here to help answer your questions"
          }
      }]
  }
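A notification receiver has to dispatch on each action's `type` field. A minimal illustrative sketch that only handles `playText` (the function name is an assumption):

```python
def spoken_messages(actions):
    """Collect the messages from playText actions in a notification payload."""
    spoken = []
    for action in actions:
        if action.get("type") == "playText":
            spoken.append(action["content"]["message"])
        # other VoiceBotAction types (playAudio, transferCall, endCall, ...)
        # would be dispatched here in a real receiver
    return spoken
```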

Response

HTTP/1.1 200 OK

Voice Bot Service API

Create A New Voice Bot Session

POST /voicebot/voicebots/{voicebotId}/sessions

Parameters

Path parameters

| Name | Type | Required | Description |
|---|---|---|---|
| voicebotId | Guid | yes | |

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| visitor | Visitor Object | No | | Visitor information. |
| variables | Variable[] | No | | Variables. |

example:

  {
      "visitor": {
          "phone": "123-4355-212"
      },
      "variables": [{
          "name": "string",
          "value": "string"
      }]
  }

Response

The response body contains data with the following structure:

| Name | Type | Description |
|---|---|---|
| sessionId | Guid | The unique ID of the session. |
| content | VoiceBotAction[] | |

Response

HTTP/1.1 200 OK
Content-Type: application/json

  {
      "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
      "content": [{
          "type": "playText",
          "content": {
              "message": "Hi there! I'm a VoiceBot, here to help answer your questions."
          }
      }]
  }

Voice Bot receives a message

POST /voicebot/sessions/{sessionId}/messages

Parameters

Path parameters

| Name | Type | Required | Description |
|---|---|---|---|
| sessionId | Guid | yes | Session ID of the voice conversation. |

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| textInput | String | Yes | | Text input to the voice bot. |
| isTransferFailed | Bool | No | | Whether the bot's transfer of the chat to an agent failed. |
| isLowSTTConfidence | Bool | No | | Whether the STT confidence is low. |

example:

  {
      "textInput": "I want to buy NBN",
      "isTransferFailed": false
  }

Response

The response is a VoicebotOutput Object.

Response

HTTP/1.1 200 OK
Content-Type: application/json

  {
      "id": "f9928d68-92e6-4487-a2e8-8234fc9d1f48",
      "content": [
          {
              "type": "playText",
              "content": {
                  "message": "Hi, what can I do for you?"
              }
          }
      ]
  }

Delete Voice Bot session

DELETE /voicebot/sessions/{sessionId}

Parameters

Path parameters

| Name | Type | Required | Description |
|---|---|---|---|
| sessionId | Guid | yes | |

example:

Using curl

curl -X DELETE "https://api11.comm100.io/v4/voicebot/sessions/{sessionId}"
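The same request can be sketched in Python with the standard library; the session ID below is the example value from earlier in this article.

```python
import urllib.request

# Example session ID taken from the responses shown above.
session_id = "d3f5b968-ad51-42af-b759-64c0afc40b84"

req = urllib.request.Request(
    "https://api11.comm100.io/v4/voicebot/sessions/" + session_id,
    method="DELETE",
)
# urllib.request.urlopen(req) would send the request; a 204 response
# (no body) indicates the session was deleted.
```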

Response

HTTP/1.1 204 No Content

STT & TTS API

Speech To Text

Performs synchronous speech recognition: receive results after all audio has been sent and processed.

POST /stttts/stt:speechToText

Parameters

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| config | STTVoiceConfig Object | yes | | Provides information to the recognizer that specifies how to process the request. |
| audio | string | yes | | The audio data to be recognized. |
| engine | string | no | google | e.g. google. |

example:

  {
      "config": {
          "encoding": "AMR",
          "sampleRateHertz": 8000,
          "languageCode": "en-US"
      },
      "audio": "string"
  }

Response

The response body contains data with the following structure:

| Name | Type | Description |
|---|---|---|
| results | SpeechRecognitionResult[] | Sequential list of transcription results corresponding to sequential portions of audio. |

Response

HTTP/1.1 200 OK
Content-Type: application/json

  {
      "results": [
          {
              "alternatives": [
                  {
                      "transcript": "string",
                      "confidence": 0.8
                  }
              ]
          }
      ]
  }
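A caller usually wants the highest-confidence transcript out of this structure. A hedged sketch (the function, its name, and the 0.5 threshold are illustrative assumptions, not part of the API):

```python
def best_transcript(results, min_confidence=0.5):
    """Return the highest-confidence transcript, or None if all are too low.

    `results` is the SpeechRecognitionResult[] array from the STT response.
    """
    best = None
    for result in results:
        for alt in result.get("alternatives", []):
            conf = alt.get("confidence", 0.0)
            if conf >= min_confidence and (best is None or conf > best[0]):
                best = (conf, alt["transcript"])
    return best[1] if best else None
```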

Text To Speech

Synthesizes speech synchronously: receive results after all text input has been processed.

POST /stttts/tts:textToSpeech

Parameters

Request body
The request body contains data with the following structure:

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| input | SynthesisInput Object | yes | | The synthesizer requires plain text as input. |
| config | TTSVoiceConfig Object | yes | | The configuration of the synthesized audio. |
| engine | string | no | google | e.g. google. |

example:

  {
      "input": {
          "text": "string"
      },
      "config": {
          "encoding": "Mulaw",
          "sampleRateHertz": 8000,
          "languageCode": "en-US",
          "gender": "MALE"
      }
  }

Response

The response body contains data with the following structure:

| Name | Type | Description |
|---|---|---|
| voiceContent | string | The audio data bytes encoded as specified in the request, including the header for encodings that are wrapped in containers (e.g. MP3, OGG_OPUS). For LINEAR16 audio, we include the WAV header. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use a base64-encoded string. |

Response

 HTTP/1.1 200 OK 
  Content-Type:  application/json 
  { 
      "voiceContent": "string"
  } 
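Since `voiceContent` is base64-encoded, the client decodes it back into audio bytes before playback. A minimal sketch; the RIFF check simply reflects the note above that LINEAR16 responses include a WAV header.

```python
import base64

def decode_voice_content(voice_content: str) -> bytes:
    """Decode the base64 voiceContent field back into raw audio bytes."""
    return base64.b64decode(voice_content)

def looks_like_wav(audio: bytes) -> bool:
    """LINEAR16 responses include a WAV header, which starts with 'RIFF'."""
    return audio[:4] == b"RIFF"
```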

Model

Voice Message Object

| Name | Type | Default | Description |
|---|---|---|---|
| id | Guid | | |
| content | VoiceAction[] | | |

VoicebotOutput Object

| Name | Type | Default | Description |
|---|---|---|---|
| id | Guid | | |
| content | VoiceBotAction[] | | |

VoiceBotAction Object

| Name | Type | Default | Description |
|---|---|---|---|
| type | enum | | Type of the response, including PlayAudio, PlayText, CollectDTMFDigits, CollectSpeechResponse, IVRMenu, TransferCall, EndCall. |
| content | object | | The response content. The object type matches the value of type: a PlayAudio Object when type is PlayAudio, a PlayText Object when type is PlayText, and so on for each type. |

VoiceAction Object

| Name | Type | Default | Description |
|---|---|---|---|
| type | enum | | Type of the response, including PlayAudioAction, CollectDTMFDigitsAction, TransferCallAction, EndCallAction. |
| content | object | | The response content. The object type matches the value of type: a PlayAudioAction Object when type is PlayAudioAction, and so on for each type. |

PlayAudio Object

The PlayAudio response is represented as a simple flat JSON object with the following keys:

| Name | Type | Default | Description |
|---|---|---|---|
| audioPath | String | | The audio file URL. |

PlayText Object

The PlayText response is represented as a simple flat JSON object with the following keys:

| Name | Type | Default | Description |
|---|---|---|---|
| message | String | | The text to play. |
| delayTime | int | | In seconds. |

CollectDTMFDigits Object

The CollectDTMFDigits response is represented as a simple flat JSON object with the following keys:

| Name | Type | Default | Description |
|---|---|---|---|
| message | String | | |
| numberOfDigits | String | | Enumeration: 1, 2, … 29, 30, Variable. The number of digits entered by the caller on the dialer. Default: Not sure. |
| stopGatherAfterPresskey | String | | Enumeration: *, #. Available when Number of Digits is Not sure. |

CollectSpeechResponse Object

The CollectSpeechResponse response is represented as a simple flat JSON object with the following keys:

| Name | Type | Default | Description |
|---|---|---|---|
| message | String | | |
| lowSTTConfidenceMessage | String | | A default STT confidence score is set for all voice bots at the system level; customers cannot change it in this version. |
| lowSTTConfidenceRepeatTimes | Int | | Available values: 0 - 9. Default: 2. |
| isConfirmationRequired | Bool | | Whether the bot replies with the answer to the caller to confirm. |
| confirmationMessage | String | | Only available when isConfirmationRequired is "true". |
| confirmationText | String | | The visitor can speak this text to confirm the input. This text will not be read to visitors. |
| confirmationKey | int | | Enumeration: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, *, #. The visitor can press this key to confirm. |
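To illustrate how the confirmation fields fit together, here is a hypothetical helper that checks a caller's follow-up against a CollectSpeechResponse object; the function and its argument names are assumptions, not part of the API.

```python
def is_confirmed(action_content, caller_reply=None, caller_key=None):
    """Check a caller's follow-up against CollectSpeechResponse confirmation.

    action_content: the CollectSpeechResponse object (a dict).
    caller_reply:   spoken text, matched against confirmationText.
    caller_key:     pressed key, matched against confirmationKey.
    """
    if not action_content.get("isConfirmationRequired"):
        return True  # no confirmation step configured
    text = action_content.get("confirmationText", "")
    if caller_reply is not None and caller_reply.strip().lower() == text.lower():
        return True
    return caller_key is not None and caller_key == action_content.get("confirmationKey")
```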

TransferCall Object

| Name | Type | Default | Description |
|---|---|---|---|
| transferTo | String | | Supports phone number and SIP URI. |

EndCall Object

The EndCall object has no properties.

IVRMenu Object

| Name | Type | Default | Description |
|---|---|---|---|
| message | String | | |

Visitor Object

| Name | Type | Default | Description |
|---|---|---|---|
| name | String | | Name of the visitor. |
| callerID | String | | Phone or SIP URI of the visitor. |
| state | String | | State/province of the visitor. |
| country | String | | Country/region of the visitor. |
| city | String | | City of the visitor. |

STTVoiceConfig Object

| Name | Type | Default | Description |
|---|---|---|---|
| encoding | enum (STTAudioEncoding) | | Encoding of the audio data. For details, see STTAudioEncoding. |
| sampleRateHertz | Int | | Sample rate in Hertz of the audio data. Valid values: 8000-48000; 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that is not possible, use the native sample rate of the audio source (instead of re-sampling). This field is optional for FLAC and WAV audio files, but is required for all other audio formats. |
| languageCode | string | | The language of the voice expressed as a BCP-47 language tag, e.g. "en-US". |

STTAudioEncoding Object

The encoding of the audio data.

For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, MP3, and WEBM_OPUS.

The FLAC and WAV audio file formats include a header that describes the included audio content. You can request recognition for WAV files that contain either LINEAR16 or MULAW encoded audio. If you specify an AudioEncoding when you send FLAC or WAV audio, the encoding configuration must match the encoding described in the audio header.

Enums

| Value | Description |
|---|---|
| ENCODING_UNSPECIFIED | Not specified. |
| LINEAR16 | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
| FLAC | FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless (recognition is not compromised) and requires only about half the bandwidth of LINEAR16. FLAC stream encoding supports 16-bit and 24-bit samples; however, not all fields in STREAMINFO are supported. |
| MULAW | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. |
| AMR | Adaptive Multi-Rate Narrowband codec. sampleRateHertz must be 8000. |
| AMR_WB | Adaptive Multi-Rate Wideband codec. sampleRateHertz must be 16000. |
| OGG_OPUS | Opus encoded audio frames in Ogg container (OggOpus). sampleRateHertz must be one of 8000, 12000, 16000, 24000, or 48000. |
| SPEEX_WITH_HEADER_BYTE | Although the use of lossy encodings is not recommended, if a very low bitrate encoding is required, OGG_OPUS is highly preferred over Speex encoding. The Speex encoding supported by Cloud Speech API has a header byte in each block, as in MIME type audio/x-speex-with-header-byte. It is a variant of the RTP Speex encoding defined in RFC 5574. The stream is a sequence of blocks, one block per RTP packet. Each block starts with a byte containing the length of the block, in bytes, followed by one or more frames of Speex data, padded to an integral number of bytes (octets) as specified in RFC 5574. In other words, each RTP header is replaced with a single byte containing the block length. Only Speex wideband is supported. sampleRateHertz must be 16000. |
| WEBM_OPUS | Opus encoded audio frames in WebM container (OggOpus). sampleRateHertz must be one of 8000, 12000, 16000, 24000, or 48000. |
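The fixed-rate rules above can be enforced before sending a request. A hedged sketch of a validator derived from the encoding notes (the helper itself is an assumption, not part of the API):

```python
# sampleRateHertz values implied by the encoding notes above; encodings not
# listed here fall back to the general 8000-48000 range from STTVoiceConfig.
ALLOWED_RATES = {
    "AMR": {8000},
    "AMR_WB": {16000},
    "SPEEX_WITH_HEADER_BYTE": {16000},
    "OGG_OPUS": {8000, 12000, 16000, 24000, 48000},
    "WEBM_OPUS": {8000, 12000, 16000, 24000, 48000},
}

def check_rate(encoding: str, sample_rate: int) -> bool:
    """Return True if sample_rate is valid for the given STT encoding."""
    allowed = ALLOWED_RATES.get(encoding)
    if allowed is None:
        return 8000 <= sample_rate <= 48000
    return sample_rate in allowed
```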

TTSVoiceConfig Object

| Name | Type | Default | Description |
|---|---|---|---|
| encoding | enum (TTSAudioEncoding) | | Encoding of the audio data. For details, see TTSAudioEncoding. |
| sampleRateHertz | Int | | Sample rate in Hertz of the audio data. Valid values: 8000-48000; 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that is not possible, use the native sample rate of the audio source (instead of re-sampling). This field is optional for FLAC and WAV audio files, but is required for all other audio formats. |
| languageCode | String | | The language of the voice expressed as a BCP-47 language tag, e.g. "en-US". |
| gender | enum | | MALE, FEMALE. |

TTSAudioEncoding Object

Configuration to set up the audio encoder. The encoding determines the desired output audio format.

Enums

| Value | Description |
|---|---|
| AUDIO_ENCODING_UNSPECIFIED | Not specified. |
| LINEAR16 | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
| MP3 | MP3 audio at 32kbps. |
| OGG_OPUS | Opus encoded audio wrapped in an Ogg container. The result will be a file which can be played natively on Android and in browsers (at least Chrome and Firefox). The quality of the encoding is considerably higher than MP3 while using approximately the same bitrate. |
| MULAW | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. Audio content returned as MULAW also contains a WAV header. |
| ALAW | 8-bit samples that compand 14-bit audio samples using G.711 A-law. Audio content returned as ALAW also contains a WAV header. |

SpeechRecognitionResult Object

A speech recognition result corresponding to a portion of the audio.

| Name | Type | Default | Description |
|---|---|---|---|
| alternatives | SpeechRecognitionAlternative[] | | May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |

SpeechRecognitionAlternative Object

Alternative hypotheses (a.k.a. n-best list).

| Name | Type | Default | Description |
|---|---|---|---|
| transcript | String | | Transcript text representing the words that the user spoke. |
| confidence | Number | | The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal=true. This field is not guaranteed to be accurate, and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set. |

SynthesisInput Object

Contains text input to be synthesized. The input size is limited to 5000 characters.

| Name | Type | Default | Description |
|---|---|---|---|
| text | String | | The raw text to be synthesized. |

Variable Object

| Name | Type | Default | Description |
|---|---|---|---|
| name | String | | The name of a variable in a form. |
| value | String | | The value of a variable. |

PlayAudioAction Object

| Name | Type | Required | Description |
|---|---|---|---|
| type | String | yes | Type of the response, including voice, url. |
| voice | string | When type is voice | The audio data bytes encoded as specified in VoiceConfig. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use a base64-encoded string. |
| voiceConfig | TTSVoiceConfig Object | When type is voice | The encoding of the voice data sent in the request. |
| audioPath | String | When type is url | The audio file URL. |

CollectDTMFDigitsAction Object

| Name | Type | Default | Description |
|---|---|---|---|
| numberOfDigits | String | | Enumeration: 1, 2, … 29, 30, Variable. The number of digits entered by the caller on the dialer. Default: Not sure. |
| stopGatherAfterPresskey | String | | Enumeration: *, #. Available when Number of Digits is Not sure. |

TransferCallAction Object

| Name | Type | Default | Description |
|---|---|---|---|
| transferTo | String | | Supports phone number and SIP URI. |

EndCallAction Object

The EndCallAction object has no properties.
