While the use of audio in video surveillance systems is still not widespread, having audio can enhance a system’s ability to detect and interpret events, as well as enable audio communication over an IP network. The use of audio, however, can be restricted in some countries, so it is a good idea to check with local authorities.

Topics covered in this chapter include application scenarios, audio equipment, audio modes, audio detection alarm, audio compression and audio/video synchronization.


Audio applications

Having audio as an integrated part of a video surveillance system can be an invaluable addition to the system’s ability to detect and interpret events and emergency situations. Because audio can cover a 360-degree area, it enables a video surveillance system to extend its coverage beyond a camera’s field of view. When an audio alarm is raised, the system can instruct a PTZ camera or a PTZ dome camera (or alert its operator) to visually verify the alarm.

Audio also lets users not only listen in on an area but also communicate orders or requests to visitors or intruders. For instance, if a person in a camera’s field of view behaves suspiciously, such as by loitering near a bank machine, or is seen entering a restricted area, a remote security guard can issue a verbal warning to that person. When a person has been injured, being able to communicate remotely with the victim and confirm that help is on the way can also be beneficial. Access control (that is, a remote ‘doorman’ at an entrance) is another area of application. Other applications include remote helpdesks (e.g., in an unmanned parking garage) and video conferencing. An audiovisual surveillance system increases the effectiveness of a security or remote monitoring solution by enhancing a remote user’s ability to receive and communicate information.

Audio support and equipment

Audio support can be implemented more easily in a network video system than in an analog CCTV system. In an analog system, separate audio and video cables must be installed from endpoint to endpoint; that is, from the camera and microphone location to the viewing/recording location. If the distance between the microphone and the station is too great, balanced audio equipment must be used, which increases installation costs and difficulty. In a network video system, a network camera with audio support processes the audio and sends both audio and video over the same network cable for monitoring and/or recording. This eliminates the need for extra cabling and makes synchronizing the audio and video much easier.

A network video system with integrated audio support. Audio and video streams are sent over the same network cable.

Some video encoders have built-in audio, making it possible to add audio even if analog cameras are used in an installation.

A network camera or video encoder with integrated audio functionality often provides a built-in microphone and/or a mic-in/line-in jack. With mic-in/line-in support, users have the option of using a different type or quality of microphone than the one built into the camera or video encoder. It also enables the network video product to connect to more than one microphone, and the microphone can be located some distance away from the camera. The microphone should always be placed as close as possible to the source of the sound to reduce noise. In two-way, full-duplex mode, the microphone should face away from the speaker and be placed some distance from it to reduce feedback.

Many Axis network video products do not come with a built-in speaker. An active speaker (a speaker with a built-in amplifier) can be connected directly to a network video product with audio support. If a speaker has no built-in amplifier, it must first be connected to an amplifier, which is in turn connected to the network camera or video encoder.

To minimize disturbance and noise, always use a shielded audio cable and avoid running the cable near power cables and cables carrying high-frequency switching signals. Audio cables should also be kept as short as possible. If a long audio cable is required, balanced audio equipment (that is, a cable, amplifier and microphone that are all balanced) should be used to reduce noise.

Audio modes

Depending on the application, there may be a need to send audio in only one direction or both directions, which can be done either simultaneously or in one direction at a time. There are three basic modes of audio communication: simplex, half duplex and full duplex.


Simplex

In simplex mode, audio is sent in one direction only. In this case, audio is sent by the camera to the operator. Applications include remote monitoring and video surveillance.

In this example of simplex mode, audio is sent by the operator to the camera. It can be used, for instance, to provide spoken instructions to a person seen on camera or to scare a potential car thief away from a parking lot.

Half duplex

In half-duplex mode, audio is sent in both directions, but only one party at a time can send. This is similar to a walkie-talkie.

Full duplex

In full-duplex mode, audio is sent to and from the operator simultaneously. This mode of communication is similar to a telephone conversation. Full duplex requires that the client PC have a sound card with support for full-duplex audio.
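The difference between the three modes comes down to who is allowed to send audio at a given moment. The following is a minimal sketch of that rule only; the names AudioMode and allowed_senders, and the party labels "camera" and "operator", are hypothetical and do not correspond to any product API.

```python
from enum import Enum

class AudioMode(Enum):
    SIMPLEX_CAMERA_TO_OPERATOR = "simplex (camera to operator)"
    SIMPLEX_OPERATOR_TO_CAMERA = "simplex (operator to camera)"
    HALF_DUPLEX = "half duplex"
    FULL_DUPLEX = "full duplex"

def allowed_senders(mode, current_talker=None):
    """Return the set of parties that may send audio right now."""
    if mode is AudioMode.SIMPLEX_CAMERA_TO_OPERATOR:
        return {"camera"}
    if mode is AudioMode.SIMPLEX_OPERATOR_TO_CAMERA:
        return {"operator"}
    if mode is AudioMode.HALF_DUPLEX:
        # Either party may talk, but only one at a time (walkie-talkie style).
        return {current_talker} if current_talker else {"camera", "operator"}
    # Full duplex: both parties may send simultaneously.
    return {"camera", "operator"}

for mode in AudioMode:
    senders = sorted(allowed_senders(mode, current_talker="operator"))
    print(f"{mode.value}: {senders}")
```

In half-duplex mode the sketch assumes that whoever is currently talking holds the channel until they release it.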

Audio detection alarm

An audio detection alarm can complement video motion detection because it can react to events in areas too dark for video motion detection to work properly. It can also be used to detect activity in areas outside the camera’s view.

When sounds, such as the breaking of a window or voices in a room, are detected, they can trigger a network camera to send and record video and audio, send e-mail or other alerts, and activate external devices such as alarms. Similarly, alarm inputs such as motion detection and door contacts can be used to trigger video and audio recordings. In a PTZ camera or a PTZ dome camera, an audio detection alarm can trigger the camera to turn automatically to a preset location such as a specific window.
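One simple way to picture audio detection is as a level check: the loudness of each short frame of samples is compared with a configured threshold. The sketch below assumes this level-based approach; the function names, the sample values and the threshold are all hypothetical, and an actual product may use a different detection method.

```python
import math

def rms_level(samples):
    """Return the root-mean-square level of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_audio_alarm(frames, threshold):
    """Return the indices of frames whose RMS level exceeds the threshold."""
    return [i for i, frame in enumerate(frames) if rms_level(frame) > threshold]

# Hypothetical 16-bit sample frames: background noise, a loud event, noise again.
quiet = [10, -12, 8, -9]
loud = [12000, -15000, 14000, -13000]

alarms = detect_audio_alarm([quiet, loud, quiet], threshold=1000)
print("Frames that would raise an alarm:", alarms)
```

Each index returned here stands for a point at which the camera could start recording, send an alert, or move to a preset position.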

Audio compression

Analog audio signals must be converted into digital audio through a sampling process and then compressed to reduce their size for efficient transmission and storage. The conversion and compression are done using an audio codec, an algorithm that codes and decodes audio data.
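The sampling step can be illustrated with a short sketch that reads a pure tone at regular intervals and rounds each reading to a 16-bit integer. The function name, the 1 kHz tone and the 16-bit sample size are assumptions chosen for the example; compression by a codec would follow as a separate step and is not shown.

```python
import math

def sample_tone(tone_hz, sampling_hz, num_samples, bits=16):
    """Sample a sine tone and quantize each sample to a signed integer."""
    peak = 2 ** (bits - 1) - 1  # largest positive value, 32767 for 16 bits
    return [
        round(peak * math.sin(2 * math.pi * tone_hz * i / sampling_hz))
        for i in range(num_samples)
    ]

# One full cycle of a 1 kHz tone sampled at 8 kHz gives eight samples.
samples = sample_tone(tone_hz=1000, sampling_hz=8000, num_samples=8)
print(samples)
```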

Sampling frequency

There are many different audio codecs supporting different sampling frequencies and levels of compression. Sampling frequency refers to the number of times per second a sample of an analog audio signal is taken and is expressed in hertz (Hz). In general, the higher the sampling frequency, the better the audio quality and the greater the bandwidth and storage needs.
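The link between sampling frequency and bandwidth can be made concrete with a little arithmetic. Before compression, the bit rate of digital audio is the sampling frequency multiplied by the number of bits per sample and the number of channels, and the highest audio frequency that can be captured is half the sampling frequency. In the sketch below, the function names and the 16-bit mono setup are assumptions for illustration.

```python
def pcm_bit_rate(sampling_hz, bits_per_sample, channels=1):
    """Uncompressed bit rate in bit/s."""
    return sampling_hz * bits_per_sample * channels

def highest_representable_frequency(sampling_hz):
    """Upper limit of the audio band that can be captured (half the sampling rate)."""
    return sampling_hz / 2

for rate_hz in (8000, 16000):
    kbit_s = pcm_bit_rate(rate_hz, bits_per_sample=16) / 1000
    limit_hz = highest_representable_frequency(rate_hz)
    print(f"{rate_hz} Hz sampling: {kbit_s:.0f} kbit/s uncompressed, audio up to {limit_hz:.0f} Hz")
```

Doubling the sampling frequency doubles the uncompressed bit rate, which is why a codec is needed to keep bandwidth and storage manageable.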

Bit rate

The bit rate is an important setting in audio since it determines the level of compression and, thereby, the quality of the audio. In general, the higher the compression level (the lower the bit rate), the lower the audio quality. The differences in the audio quality of codecs may be particularly noticeable at high compression levels (low bit rates), but not at low compression levels (high bit rates). Higher compression levels may also introduce more latency or delay, but they enable greater savings in bandwidth and storage.

The bit rates most often selected with audio codecs are between 32 kbit/s and 64 kbit/s. Audio bit rates, as with video bit rates, are an important consideration when calculating total bandwidth and storage requirements.
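As a rough worked example of how audio bit rate feeds into storage planning, the sketch below computes the space needed for continuous audio at a constant bit rate. It assumes 1 kbit equals 1000 bits and ignores packet and file-format overhead; the function names are hypothetical.

```python
def audio_storage_bytes(bit_rate_kbit_s, duration_s):
    """Bytes needed to store audio at a constant bit rate (1 kbit = 1000 bits)."""
    return bit_rate_kbit_s * 1000 * duration_s // 8

def total_bandwidth_kbit_s(video_kbit_s, audio_kbit_s):
    """Combined bit rate of one video stream and one audio stream."""
    return video_kbit_s + audio_kbit_s

SECONDS_PER_DAY = 24 * 60 * 60

for rate in (32, 64):
    megabytes = audio_storage_bytes(rate, SECONDS_PER_DAY) / 1_000_000
    print(f"{rate} kbit/s for 24 hours: {megabytes:.1f} MB")
```

At 64 kbit/s, one day of continuous audio comes to about 691 MB per stream, which is added to the video storage when sizing a system.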

Audio codecs

Axis network video products support three audio codecs. The first is AAC-LC (Advanced Audio Coding - Low Complexity), also known as MPEG-4 AAC, which requires a license. AAC-LC, particularly at a sampling rate of 16 kHz or higher and at a bit rate of 64 kbit/s, is the recommended codec to use when the best possible audio quality is required. The other two codecs are G.711 and G.726, which are non-licensed technologies.

Audio and video synchronization

Synchronization of audio and video data is handled by a media player (a computer software program used for playing back multimedia files) or by a multimedia framework such as Microsoft DirectX, which is a collection of application programming interfaces that handles multimedia files.

Audio and video are sent over a network as two separate packet streams. In order for the client or player to perfectly synchronize the audio and video streams, the audio and video packets must be time-stamped. The timestamping of video packets using Motion JPEG compression may not always be supported in a network camera. If this is the case and if it is important to have synchronized video and audio, the video format to choose is MPEG-4 or H.264 since such video streams, along with the audio stream, are sent using RTP (Real-time Transport Protocol), which timestamps the video and audio packets. There are many situations, however, where synchronized audio is less important or even undesirable; for example, if audio is to be monitored but not recorded.
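To show why timestamps make synchronization possible, here is a minimal sketch of how a player might place audio and video packets on a common timeline. RTP timestamps count ticks of a per-stream clock (90 kHz is the usual rate for video, while audio typically uses its sampling rate), and a reference point that pairs an RTP timestamp with a wall-clock time lets the player convert between the two. The class name, field names and example values are hypothetical, and the sketch assumes each packet comes after its stream's reference point.

```python
from dataclasses import dataclass

@dataclass
class StreamClock:
    clock_rate: int       # RTP timestamp ticks per second for this stream
    ref_rtp_ts: int       # RTP timestamp at the reference point
    ref_wallclock: float  # wall-clock time in seconds at the reference point

    def to_wallclock(self, rtp_ts):
        """Convert a packet's RTP timestamp to wall-clock seconds."""
        # RTP timestamps are 32-bit counters, so take the difference modulo 2**32.
        delta_ticks = (rtp_ts - self.ref_rtp_ts) % 2**32
        return self.ref_wallclock + delta_ticks / self.clock_rate

# Hypothetical reference points for one video and one audio stream.
video = StreamClock(clock_rate=90000, ref_rtp_ts=1000, ref_wallclock=100.0)
audio = StreamClock(clock_rate=8000, ref_rtp_ts=500, ref_wallclock=100.0)

video_time = video.to_wallclock(91000)  # 90000 ticks after the reference
audio_time = audio.to_wallclock(8900)   # 8400 ticks after the reference
print(f"audio is {audio_time - video_time:+.3f} s relative to video")
```

Once both packets are expressed in the same time base, the player can delay whichever stream is ahead so that sound and picture are presented together.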