VoiceBot API

12 Jul 2022

Summary

Voice Channel Adapter API

The channelURL can be any valid URL that implements this API. It is configured in the system when a new channel needs access.

POST /{channelURL} - Voice Channel Adapter receives input.

Voice Service Input API

POST /voiceservice/sessions - Create a session. Voice Service creates a session.
POST /voiceservice/sessions/{sessionId}/inputs - Receive an input. Voice Service receives call input.
DELETE /voiceservice/sessions/{sessionId} - Delete a session. Voice Service deletes a session.

Voice Service Notification API

POST /voiceservice/sessions/{sessionId}/notifications - Voice Service receives notifications

Voice Bot Service API

POST /voicebot/voicebots/{voicebotId}/sessions - Create a session. Voice Bot creates a session.
POST /voicebot/sessions/{sessionId}/messages - Receive a message. Voice Bot receives input.
DELETE /voicebot/sessions/{sessionId} - Delete a session. Voice Bot deletes the session.

STT & TTS API

Provide STT (Speech to Text) and TTS (Text to Speech) capabilities.

POST /stttts/stt:speechToText - Speech To Text
POST /stttts/tts:textToSpeech - Text To Speech

Endpoints

Voice Channel Adapter API

The channelURL can be any valid URL that implements this API. It is configured in the system when a new channel needs access.

Voice Channel Adapter receives Input.

POST /{channelURL}

Parameters

Request body
The request body is a Voice Message Object with the following structure:

Name	Type	Required	Default	Description
`sessionId`	string	yes		channelIdentifier .
`content`	VoiceAction[]	yes

example:

{
  "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
  "content": [{
    "type": "playAudioAction",
    "content": {
      "type": "voice",
      "voice": "string",
      "voiceConfig": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 8000
      }
    }
  }]
}
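A channel adapter will usually want to check an incoming Voice Message Object before acting on it. The following is a minimal sketch in Python, not part of the documented API: the function name `validate_voice_message` and the error strings are illustrative assumptions. The accepted action types come from the VoiceAction Object table in the Model section and are compared case-insensitively, because the examples write `playAudioAction` while the table writes `PlayAudioAction`.

```python
from typing import Any, Dict, List

# Action types listed for the VoiceAction Object in this document (lower-cased).
VOICE_ACTION_TYPES = {
    "playaudioaction",
    "collectdtmfdigitsaction",
    "transfercallaction",
    "endcallaction",
}


def validate_voice_message(body: Dict[str, Any]) -> List[str]:
    """Return the problems found in a Voice Message Object (empty list if none)."""
    errors: List[str] = []

    session_id = body.get("sessionId")
    if not isinstance(session_id, str) or not session_id:
        errors.append("sessionId must be a non-empty string")

    content = body.get("content")
    if not isinstance(content, list):
        errors.append("content must be an array of VoiceAction objects")
        return errors

    for index, action in enumerate(content):
        if not isinstance(action, dict):
            errors.append(f"content[{index}] must be an object")
            continue
        action_type = action.get("type")
        if not isinstance(action_type, str) or action_type.lower() not in VOICE_ACTION_TYPES:
            errors.append(f"content[{index}].type is not a known VoiceAction type")
    return errors
```

An adapter could respond with `200 OK` when the returned list is empty and reject the request otherwise; how to report errors is not specified in this document.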

Response

Response
HTTP/1.1 200 OK

Voice Service API

Create Session

POST /voiceservice/sessions

Parameters

Request body
The request body contains data with the following structure:

Name	Type	Required	Description
`channel`	string	yes	Type of the channel, one of `Twilio`, `SIP`.
`channelIdentifier`	string	yes	The unique ID corresponding to the voicebotId, such as a phone number or SIP URI.
`visitor`	Visitor Object	no	visitor information
`variables`	Variable[]	no	Variables
`ttsVoiceConfig`	TTSVoiceConfig Object	text to speech	The configuration of the response audio.

example:

{
  "channel": "Twilio",
  "channelIdentifier": "q3f5b438-xw31-44af-b729-64swaf3d0b56",
  "visitor": {
    "phone": "123-4355-212"
  },
  "variables": [{
    "name": "abc",
    "value": "123"
  }]
}

Response

The Response body contains data with the following structure:

Name	Type	Description
`sessionId`	Guid	the unique id of the call
`content`	VoiceAction[]	Greeting output

Response

HTTP/1.1 200 OK
Content-Type: application/json

{
  "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
  "content": [{
    "type": "playAudioAction",
    "content": {
      "type": "voice",
      "voice": "string",
      "voiceConfig": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 8000
      }
    }
  }]
}
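The request described above could be assembled as follows. This is a sketch under stated assumptions: `BASE_URL` is borrowed from the curl example in the Delete Voice Bot session section and may differ for your deployment, the helper name `build_create_session_request` is hypothetical, and authentication headers are omitted because this document does not describe them. The function only builds the request; it does not send it.

```python
import json
import urllib.request

# Assumption: host taken from the curl example in this document.
BASE_URL = "https://api11.comm100.io/v4"


def build_create_session_request(channel, channel_identifier, visitor=None, variables=None):
    """Build (but do not send) a POST /voiceservice/sessions request."""
    body = {"channel": channel, "channelIdentifier": channel_identifier}
    # Optional fields are only included when the caller supplies them.
    if visitor is not None:
        body["visitor"] = visitor
    if variables is not None:
        body["variables"] = variables
    return urllib.request.Request(
        url=f"{BASE_URL}/voiceservice/sessions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the `sessionId` in the JSON response is then used in the path of the input and delete endpoints.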

Receive an input

POST /voiceservice/sessions/{sessionId}/inputs

Parameters

Path parameters

Name	Type	Required	Description
`sessionId`	Guid	yes

Request body
The request body contains data with the following structure:

Name	Type	Required	Description
`type`	string	yes	Type of the input, either `textInput` or `voiceInput`.
`textInput`	string	When type is textInput	Text input to voice robot
`voiceInput`	string	When type is voiceInput	The audio data bytes encoded as specified in VoiceConfig. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use base64. A base64-encoded string.
`sttVoiceConfig`	STTVoiceConfig Object	When type is voiceInput	Provides information to the recognizer that specifies how to process the request.
`ttsVoiceConfig`	TTSVoiceConfig Object	text to speech	The configuration of the response audio.

example:

{
  "type": "voiceInput",
  "voiceInput": "string",
  "sttVoiceConfig": {
    "encoding": "AMR",
    "sampleRateHertz": 8000
  },
  "ttsVoiceConfig": {
    "encoding": "AMR",
    "sampleRateHertz": 8000
  }
}

Response

The Response body contains data with the following structure:

Name	Type	Description
`content`	VoiceAction[]

Response

HTTP/1.1 200 OK
Content-Type: application/json

{
  "content": [{
    "type": "playAudioAction",
    "content": {
      "type": "voice",
      "voice": "string",
      "voiceConfig": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 8000
      }
    }
  }]
}
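Because `voiceInput` carries audio bytes as a base64 string, a caller has to encode the raw audio before placing it in the JSON body. The sketch below shows one way to build both input variants; the helper names are hypothetical, and reusing one encoding and sample rate for `sttVoiceConfig` and `ttsVoiceConfig` simply mirrors the example above rather than any stated requirement.

```python
import base64


def build_voice_input(audio: bytes, encoding: str, sample_rate_hertz: int) -> dict:
    """Build the body for POST /voiceservice/sessions/{sessionId}/inputs (voice)."""
    voice_config = {"encoding": encoding, "sampleRateHertz": sample_rate_hertz}
    return {
        "type": "voiceInput",
        # JSON cannot carry raw bytes, so the audio is sent as base64 text.
        "voiceInput": base64.b64encode(audio).decode("ascii"),
        "sttVoiceConfig": dict(voice_config),
        "ttsVoiceConfig": dict(voice_config),
    }


def build_text_input(text: str) -> dict:
    """Build the body for the same endpoint when the input is plain text."""
    return {"type": "textInput", "textInput": text}
```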

Delete a Session

DELETE /voiceservice/sessions/{sessionId}

Parameters

Path parameters

Name	Type	Required	Description
`sessionId`	Guid	yes

Response

HTTP/1.1 200 OK

Voice Service Notification API

Voice Service receives notifications

POST /voiceservice/sessions/{sessionId}/notifications

Parameters

Path parameters

Name	Type	Required	Description
`sessionId`	Guid	yes

Request body
The Request body contains data with the following structure:

Name	Type	Required	Default	Description
`content`	VoiceBotAction[]	yes

example:

{
  "content": [{
    "type": "playText",
    "content": {
      "message": "Hi there! I'm a VoiceBot, here to help answer your questions"
    }
  }]
}

Response

HTTP/1.1 200 OK

Voice Bot Service API

Create A New Voice Bot Session

POST /voicebot/voicebots/{voicebotId}/sessions

Parameters

Path parameters

Name	Type	Required	Description
`voicebotId`	Guid	yes

Request body
The request body contains data with the following structure:

Name	Type	Required	Default	Description
`visitor`	Visitor Object	No		Visitor information.
`variables`	Variable[]	No		Variables

example:

{
  "visitor": {
    "phone": "123-4355-212"
  },
  "variables": [{
    "name": "string",
    "value": "string"
  }]
}

Response

The Response body contains data with the following structure:

Name	Type	Description
`sessionId`	Guid	the unique id of the session
`content`	VoiceBotAction[]

Response

HTTP/1.1 200 OK
Content-Type: application/json

{
  "sessionId": "d3f5b968-ad51-42af-b759-64c0afc40b84",
  "content": [{
    "type": "playText",
    "content": {
      "message": "Hi there! I'm a VoiceBot, here to help answer your questions."
    }
  }]
}

Voice Bot receives a message

POST /voicebot/sessions/{sessionId}/messages

Parameters

Path parameters

Name	Type	Required	Description
`sessionId`	Guid	yes	Session id of the voice conversation

Request body
The request body contains data with the following structure:

Name	Type	Required	Description
`textInput`	String	Yes	Text input to voice robot
`isTransferFailed`	Bool	No	If the bot Transfer Chat to agent failed
`isLowSTTConfidence`	Bool	No	If the STT is low confidence

example:

{
  "textInput": "I want to buy NBN",
  "isTransferFailed": false
}

Response

The response is a VoicebotOutput Object.

Response

HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": "f9928d68-92e6-4487-a2e8-8234fc9d1f48",
  "content": [
    {
      "type": "playText",
      "content": {
        "message": "Hi, what can I do for you?"
      }
    }
  ]
}
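A caller of this endpoint typically walks the `content` array and handles each action by its `type`. As an illustration only, the hypothetical helper below pulls the text of every `playText` action out of a VoicebotOutput Object. It compares the type case-insensitively, since the examples use `playText` while the Model section lists `PlayText`; which spelling the service actually emits is an assumption to verify.

```python
from typing import Any, Dict, List


def collect_play_text(output: Dict[str, Any]) -> List[str]:
    """Return the messages of all playText actions in a VoicebotOutput."""
    messages: List[str] = []
    for action in output.get("content", []):
        if str(action.get("type", "")).lower() != "playtext":
            continue  # other action types are ignored in this sketch
        message = (action.get("content") or {}).get("message")
        if message is not None:
            messages.append(message)
    return messages
```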

Delete Voice Bot session

DELETE /voicebot/sessions/{sessionId}

Parameters

Path parameters

Name	Type	Required	Description
`sessionId`	Guid	yes

example:

Using curl

curl -X DELETE https://api11.comm100.io/v4/voicebot/sessions/{sessionId}

Response

HTTP/1.1 204 No Content

STT & TTS API

Speech To Text

Performs synchronous speech recognition: receive results after all audio has been sent and processed.

POST /stttts/stt:speechToText

Parameters

Request body
The request body contains data with the following structure:

Name	Type	Required	Default	Description
`config`	STTVoiceConfig Object	yes		Provides information to the recognizer that specifies how to process the request.
`audio`	string	yes		The audio data to be recognized.
`engine`	string	no	google	e.g. google

example:

{
  "config": {
    "encoding": "AMR",
    "sampleRateHertz": 8000,
    "languageCode": "en-US"
  },
  "audio": "string"
}

Response

The Response body contains data with the following structure:

Name	Type	Description
`results`	SpeechRecognitionResult[]	Sequential list of transcription results corresponding to sequential portions of audio.

Response

HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "string",
          "confidence": 0.8
        }
      ]
    }
  ]
}
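To turn this response into a single piece of text, a client can take the first (most probable) alternative of each result and join them in order. The sketch below does that and also flags low confidence. The helper name `top_transcript` and the `min_confidence` default of 0.5 are hypothetical; this document does not state the threshold the system uses.

```python
from typing import Any, Dict, Tuple


def top_transcript(response: Dict[str, Any], min_confidence: float = 0.5) -> Tuple[str, bool]:
    """Join the top alternative of each result and report low confidence.

    min_confidence is an illustrative threshold, not a documented value.
    """
    parts = []
    low_confidence = False
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        top = alternatives[0]  # alternatives are ordered most probable first
        parts.append(top.get("transcript", ""))
        # 0.0 is documented as a sentinel meaning "confidence was not set".
        confidence = top.get("confidence", 0.0)
        if confidence != 0.0 and confidence < min_confidence:
            low_confidence = True
    return " ".join(parts).strip(), low_confidence
```

The returned pair maps naturally onto the `textInput` and `isLowSTTConfidence` fields of the Voice Bot message request, although whether the services are wired together this way is not stated here.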

Text To Speech

Synthesizes speech synchronously: receive results after all text input has been processed.

POST /stttts/tts:textToSpeech

Parameters

Request body
The request body contains data with the following structure:

Name	Type	Required	Default	Description
`input`	SynthesisInput Object	yes		The Synthesizer requires plain text as input.
`config`	TTSVoiceConfig Object	yes		The configuration of the synthesized audio.
`engine`	string	no	google	e.g. google

example:

{
  "input": {
    "text": "string"
  },
  "config": {
    "encoding": "MULAW",
    "sampleRateHertz": 8000,
    "languageCode": "en-US",
    "gender": "MALE"
  }
}

Response

The Response body contains data with the following structure:

Name	Type	Description
`voiceContent`	string	The audio data bytes encoded as specified in the request, including the header for encodings that are wrapped in containers (e.g. MP3, OGG_OPUS). For LINEAR16 audio, we include the WAV header. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use base64. A base64-encoded string.

Response

 HTTP/1.1 200 OK 
  Content-Type:  application/json 
  { 
      "voiceContent": "string"
  }
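A client of this endpoint needs two small steps around the HTTP call: building the request body and decoding `voiceContent` back into audio bytes. The following sketch covers both under stated assumptions: the helper names and default argument values are hypothetical, and the 5000-character check reflects the limit given for the SynthesisInput Object in the Model section.

```python
import base64

MAX_INPUT_CHARS = 5000  # limit stated for the SynthesisInput Object


def build_tts_request(text, encoding="LINEAR16", sample_rate_hertz=8000,
                      language_code="en-US", gender="MALE"):
    """Build the JSON body for POST /stttts/tts:textToSpeech."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input text exceeds {MAX_INPUT_CHARS} characters")
    return {
        "input": {"text": text},
        "config": {
            "encoding": encoding,
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "gender": gender,
        },
    }


def decode_voice_content(response: dict) -> bytes:
    """Decode the base64 voiceContent field into raw audio bytes."""
    return base64.b64decode(response["voiceContent"])
```

For LINEAR16 output the decoded bytes should begin with a WAV header, so they can be written directly to a `.wav` file.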

Model

Voice Message Object

Name	Type	Default	Description
`id`	Guid
`content`	VoiceAction[]

VoicebotOutput Object

Name	Type	Default	Description
`id`	Guid
`content`	VoiceBotAction[]

VoicebotAction Object

Name	Type	Default	Description
`type`	enum		Type of the response, one of `PlayAudio`, `PlayText`, `CollectDTMFDigits`, `CollectSpeechResponse`, `IVRMenu`, `TransferCall`, `EndCall`.
`content`	object		The content of the response. Its structure is given by `type`: a PlayAudio object for `PlayAudio`, a PlayText object for `PlayText`, a CollectDTMFDigits object for `CollectDTMFDigits`, a CollectSpeechResponse object for `CollectSpeechResponse`, an IVRMenu object for `IVRMenu`, a TransferCall object for `TransferCall`, and an EndCall object for `EndCall`.

VoiceAction Object

Name	Type	Default	Description
`type`	enum		Type of the response, one of `PlayAudioAction`, `CollectDTMFDigitsAction`, `TransferCallAction`, `EndCallAction`.
`content`	object		The content of the response. Its structure is given by `type`: a PlayAudioAction object for `PlayAudioAction`, a CollectDTMFDigitsAction object for `CollectDTMFDigitsAction`, a TransferCallAction object for `TransferCallAction`, and an EndCallAction object for `EndCallAction`.

PlayAudio Object

PlayAudio is represented as a simple flat JSON object with the following keys:

Name	Type	Default	Description
`audioPath`	String		String

PlayText Object

PlayText is represented as a simple flat JSON object with the following keys:

Name	Type	Default	Description
`message`	String		String
`delayTime`	int		Delay time, in seconds.

CollectDTMFDigits Object

CollectDTMFDigits is represented as a simple flat JSON object with the following keys:

Name	Type	Description
`message`	String	String
`numberOfDigits`	String	Enumeration. 1, 2, … 29, 30, Variable. The number of digits entered by the caller in dialer. Default: Not sure.
`stopGatherAfterPresskey`	String	Enumeration. *, #. Available when Number of Digits is Not sure.

CollectSpeechResponse Object

CollectSpeechResponse is represented as a simple flat JSON object with the following keys:

Name	Type	Description
`message`	String	String
`lowSTTConfidenceMessage`	String	We set a default STT Confidence Score for all Voice Bots in system level, customers cannot change in this version.
`lowSTTConfidenceRepeatTimes`	Int	Available value: 0 - 9. Default: 2.
`isConfirmationRequired`	Bool	Whether the bot repeats the answer back to the caller for confirmation.
`confirmationMessage`	String	Only available when Is Confirmation Required is “true”.
`confirmationText`	String	Visitor can speak the text to confirm the input. This text will not be read to visitors.
`confirmationKey`	int	Enumeration. 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, *, #. Visitor can press the key to confirm.

TransferCall Object

Name	Type	Default	Description
`transferTo`	String		Support Phone Number and Sip URI.

EndCall Object

Name	Type	Default	Description

IVRMenu Object

Name	Type	Default	Description
`message`	String		String

Visitor Object

Name	Type	Description
`name`	String	Name of the visitor
`callerID`	String	Phone or sipURI of the visitor
`state`	String	State/province of the visitor
`country`	String	Country/region of the visitor
`city`	String	City of the visitor

STTVoiceConfig Object

Name	Type	Description
`encoding`	enum(STTAudioEncoding)	Encoding of audio data. For details, see AudioEncoding.
`sampleRateHertz`	Int	Sample rate in Hertz of the audio data. Valid values are: 8000-48000. 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that is not possible, use the native sample rate of the audio source (instead of re-sampling). This field is optional for FLAC and WAV audio files, but is required for all other audio formats. For details, see AudioEncoding.
`languageCode`	string	The language of the voice expressed as a BCP-47 language tag, e.g. "en-US".

STTAudioEncoding Object

The encoding of the audio data.

For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, MP3, and WEBM_OPUS.

The FLAC and WAV audio file formats include a header that describes the included audio content. You can request recognition for WAV files that contain either LINEAR16 or MULAW encoded audio. If you specify an AudioEncoding when you send FLAC or WAV audio, the encoding configuration must match the encoding described in the audio header.

Enums
`ENCODING_UNSPECIFIED`	Not specified.
`LINEAR16`	Uncompressed 16-bit signed little-endian samples (Linear PCM).
`FLAC`	FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless--therefore recognition is not compromised--and requires only about half the bandwidth of LINEAR16. FLAC stream encoding supports 16-bit and 24-bit samples, however, not all fields in STREAMINFO are supported.
`MULAW`	8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law.
`AMR`	Adaptive Multi-Rate Narrowband codec. sampleRateHertz must be 8000.
`AMR_WB`	Adaptive Multi-Rate Wideband codec. sampleRateHertz must be 16000.
`OGG_OPUS`	Opus encoded audio frames in Ogg container (OggOpus). sampleRateHertz must be one of 8000, 12000, 16000, 24000, or 48000.
`SPEEX_WITH_HEADER_BYTE`	Although the use of lossy encodings is not recommended, if a very low bitrate encoding is required, OGG_OPUS is highly preferred over Speex encoding. The Speex encoding supported by Cloud Speech API has a header byte in each block, as in MIME type audio/x-speex-with-header-byte. It is a variant of the RTP Speex encoding defined in RFC 5574. The stream is a sequence of blocks, one block per RTP packet. Each block starts with a byte containing the length of the block, in bytes, followed by one or more frames of Speex data, padded to an integral number of bytes (octets) as specified in RFC 5574. In other words, each RTP header is replaced with a single byte containing the block length. Only Speex wideband is supported. sampleRateHertz must be 16000.
`WEBM_OPUS`	Opus encoded audio frames in WebM container (OggOpus). sampleRateHertz must be one of 8000, 12000, 16000, 24000, or 48000.
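The sample-rate constraints in the enum table above can be checked on the client before a request is sent. The sketch below is one way to do so; the helper name `is_valid_stt_config` is hypothetical, the 8000-48000 range comes from the STTVoiceConfig table, and the rule that `sampleRateHertz` is optional for FLAC and WAV is not modelled.

```python
# Sample rates allowed for the Opus-based encodings, per the table above.
OPUS_RATES = {8000, 12000, 16000, 24000, 48000}

# Encodings whose sample rate is restricted to specific values.
REQUIRED_RATES = {
    "AMR": {8000},
    "AMR_WB": {16000},
    "OGG_OPUS": OPUS_RATES,
    "SPEEX_WITH_HEADER_BYTE": {16000},
    "WEBM_OPUS": OPUS_RATES,
}


def is_valid_stt_config(encoding: str, sample_rate_hertz: int) -> bool:
    """Check an encoding / sample-rate pair against the documented constraints."""
    # General range stated for sampleRateHertz in STTVoiceConfig.
    if not 8000 <= sample_rate_hertz <= 48000:
        return False
    allowed = REQUIRED_RATES.get(encoding)
    # Encodings without a specific rule only need to be within the range.
    return allowed is None or sample_rate_hertz in allowed
```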

TTSVoiceConfig Object

Name	Type	Description
`encoding`	enum(TTSAudioEncoding)	Encoding of audio data. For details, see AudioEncoding.
`sampleRateHertz`	Int	Sample rate in Hertz of the audio data. Valid values are: 8000-48000. 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that is not possible, use the native sample rate of the audio source (instead of re-sampling). This field is optional for FLAC and WAV audio files, but is required for all other audio formats. For details, see AudioEncoding.
`languageCode`	String	The language of the voice expressed as a BCP-47 language tag, e.g. "en-US".
`gender`	enum	MALE, FEMALE.

TTSAudioEncoding Object

Configuration to set up audio encoder. The encoding determines the output audio format that we'd like.

Enums
`AUDIO_ENCODING_UNSPECIFIED`	Not specified.
`LINEAR16`	Uncompressed 16-bit signed little-endian samples (Linear PCM).
`MP3`	MP3 audio at 32kbps.
`OGG_OPUS`	Opus encoded audio wrapped in an ogg container. The result will be a file which can be played natively on Android, and in browsers (at least Chrome and Firefox). The quality of the encoding is considerably higher than MP3 while using approximately the same bitrate.
`MULAW`	8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. Audio content returned as MULAW also contains a WAV header.
`ALAW`	8-bit samples that compand 14-bit audio samples using G.711 PCMA/A-law. Audio content returned as ALAW also contains a WAV header.

SpeechRecognitionResult Object

A speech recognition result corresponding to a portion of the audio.

Name	Type	Default	Description
`alternatives`	SpeechRecognitionAlternative[]		May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

SpeechRecognitionAlternative Object

Alternative hypotheses (a.k.a. n-best list).

Name	Type	Default	Description
`transcript`	String		Transcript text representing the words that the user spoke.
`confidence`	Number		The confidence estimates between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where isFinal=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.

SynthesisInput Object

Contains text input to be synthesized. The input size is limited to 5000 characters.

Name	Type	Default	Description
`text`	String		The raw text to be synthesized.

Variable Object

Name	Type	Default	Description
`name`	String		the name of a variable in a form.
`value`	String		the value of a variable.

PlayAudioAction Object

Name	Type	Required	Description
`type`	String	yes	Type of the audio content, either `voice` or `url`.
`voice`	string	When type is `voice`	The audio data bytes encoded as specified in VoiceConfig. Note: as with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use base64. A base64-encoded string.
`voiceConfig`	TTSVoiceConfig Object	When type is `voice`	The encoding of the voice data sent in the request.
`audioPath`	String	When type is `url`	The URL of the audio file.

CollectDTMFDigitsAction Object

Name	Type	Default	Description
`numberOfDigits`	String		Enumeration. 1, 2, … 29, 30, Variable. The number of digits entered by the caller in dialer. Default: Not sure.
`stopGatherAfterPresskey`	String		Enumeration. *, #. Available when Number of Digits is Not sure.
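How a channel adapter applies these two fields is not spelled out here, so the following is only one plausible reading, sketched in Python with a hypothetical helper `gather_digits`. It assumes that `Variable` means "collect until the stop key is pressed" and that a numeric value means "collect at most that many key presses"; the `#` default for the stop key is also an assumption.

```python
def gather_digits(keys: str, number_of_digits: str, stop_key: str = "#") -> str:
    """Return the digits collected from a sequence of key presses.

    keys is the raw sequence the caller pressed, e.g. "1234#".
    """
    if number_of_digits == "Variable":
        # Variable length: keep everything before the first stop key.
        collected, _, _ = keys.partition(stop_key)
        return collected
    # Fixed length: keep at most the requested number of key presses.
    return keys[: int(number_of_digits)]
```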

TransferCallAction Object

Name	Type	Default	Description
`transferTo`	String		Support Phone Number and Sip URI.

EndCallAction Object

Name	Type	Default	Description
