Open Access Journal Article

Cost-effective solution to synchronised audio-visual data capture using multiple sensors

TLDR
This work centralises the synchronisation task by recording all trigger- or timestamp signals with a multi-channel audio interface, and shows that a consumer PC can currently capture 8-bit video data with 1024x1024 spatial- and 59.1Hz temporal resolution, from at least 14 cameras, together with 8 channels of 24-bit audio at 96kHz.
About
This article was published in Image and Vision Computing on 2011-09-01 and is currently open access. It has received 32 citations to date. The article focuses on the topics Timestamp and Sensor fusion.


Citations
Journal Article

A Multimodal Database for Affect Recognition and Implicit Tagging

TL;DR: Results show the potential uses of the recorded modalities and the significance of the emotion elicitation protocol; single-modality and modality-fusion results are reported for both emotion recognition and implicit tagging experiments.
Proceedings Article

Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions

TL;DR: A new multimodal corpus of spontaneous collaborative and affective interactions in French, RECOLA, is presented and is being made available to the research community; self-report measures of users were collected during task completion.
Journal Article

The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent

TL;DR: A large audiovisual database is created as a part of an iterative approach to building Sensitive Artificial Listener agents that can engage a person in a sustained, emotionally colored conversation.
Journal Article

A Review of Human Activity Recognition Methods

TL;DR: This work proposes a categorization of human activity methodologies and divides human activity classification methods into two large categories according to whether they use data from different modalities or not, and examines the requirements for an ideal human activity recognition dataset.
Proceedings Article

The SEMAINE corpus of emotionally coloured character interactions

TL;DR: A new corpus of emotionally coloured conversations, recorded while users held conversations with an operator who adopts in sequence four roles designed to evoke emotional reactions, is made available to the scientific community through a web-accessible database.
References
Journal Article

High-quality video view interpolation using a layered representation

TL;DR: This paper shows how high-quality video-based rendering of dynamic scenes can be accomplished using multiple synchronized video streams combined with novel image-based modeling and rendering algorithms, and develops a novel temporal two-layer compressed representation that handles matting.
Journal Article

Optical properties of human skin, subcutaneous and mucous tissues in the wavelength range from 400 to 2000 nm

TL;DR: In this article, the optical properties of human skin, subcutaneous adipose tissue and human mucosa were measured in the wavelength range 400-2000 nm using a commercially available spectrophotometer with an integrating sphere.
Journal Article

High performance imaging using large camera arrays

TL;DR: A unique array of 100 custom-built video cameras is described, and the authors' experiences using this array in a range of imaging applications are summarized.
Frequently Asked Questions (22)
Q1. What have the authors contributed in "Cost-effective solution to synchronised audio-visual data capture using multiple sensors"?

Furthermore, the authors show that a consumer PC can currently capture 8-bit video data with 1024x1024 spatial and 59.1Hz temporal resolution, from at least 14 cameras, together with 8 channels of 24-bit audio at 96kHz. The authors thus improve the quality/cost ratio of multi-sensor data capture systems.
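To put these figures in perspective, a back-of-the-envelope calculation of the sustained data rate is sketched below. It is not taken from the paper; it simply assumes uncompressed 8-bit monochrome frames and the resolutions and channel counts quoted above.

```python
# Rough estimate of the sustained data rate for the capture setup quoted above
# (illustrative only; assumes uncompressed 8-bit monochrome video).

def video_rate_bytes_per_s(width, height, bytes_per_pixel, fps, n_cameras):
    """Raw video bandwidth for n_cameras identical streams."""
    return width * height * bytes_per_pixel * fps * n_cameras

def audio_rate_bytes_per_s(sample_rate, bytes_per_sample, n_channels):
    """Raw audio bandwidth for a multi-channel interface."""
    return sample_rate * bytes_per_sample * n_channels

video = video_rate_bytes_per_s(1024, 1024, 1, 59.1, 14)   # 14 cameras, 8-bit pixels
audio = audio_rate_bytes_per_s(96_000, 3, 8)              # 8 channels of 24-bit samples

print(f"video: {video / 1e6:.0f} MB/s")   # roughly 868 MB/s across all cameras
print(f"audio: {audio / 1e6:.1f} MB/s")   # roughly 2.3 MB/s
```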

If a monochrome camera is used, a monochromatic light source can improve image sharpness with low-cost lenses by preventing chromatic aberration.

Since many audio processing methods are vulnerable to noise, the microphone setup is an important factor for accurate multimodal data capture. 

Any type of sensor can be synchronised with the audio data, as long as it produces a measurable signal at the data capture moment, and its output data include reliable sample counts or timestamps relative to the first sample. 

For computer vision applications involving moving objects, such as human beings or parts of the human body, progressive scan global shutter sensors are the primary choice. 

The timestamp signals from multiple PCs can be recorded as separate channels in a multi-channel audio interface, making use of the hardware-synchronisation between the different audio channels. 

When the trigger output of the master camera is used as the input to the slave cameras, the resulting delay of the slave cameras is approximately 30µs.

Due to irregularities in sensor production, or the influence of radiation, some sensor locations have a defect that causes their pixel read-out values to be significantly higher (hot) or lower (cold) than the correct measurements. 

Capture software running on different PCs can be synchronised by letting each PC transmit its CPU cycle count as a timestamp signal output via the serial port.

In their recordings, the authors used the MOTU 8pre at a 48kHz sampling rate and configured the serial port to transmit at 9600 bits per second (bps).
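A minimal sketch of this idea is given below, assuming the pyserial package and an illustrative port name; the paper itself does not provide code, and the cycle-count source shown here (a monotonic nanosecond counter) is a stand-in.

```python
# Minimal sketch (not the authors' code): periodically transmit a local
# high-resolution timestamp over the serial port at 9600 bps, so that the
# electrical signal can be recorded in a spare channel of the audio interface.
import struct
import time

import serial  # pyserial

PORT = "/dev/ttyS0"  # illustrative; use the port wired to the audio interface

with serial.Serial(PORT, baudrate=9600, bytesize=8, parity="N", stopbits=1) as ser:
    for _ in range(10):
        # Stand-in for the CPU cycle count used in the paper: a monotonic
        # nanosecond counter packed as an 8-byte unsigned integer.
        stamp = time.perf_counter_ns()
        ser.write(struct.pack("<Q", stamp))
        ser.flush()
        time.sleep(1.0)  # one timestamp per second is ample for a linear fit
```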

To prevent communication with the PCI graphics card from reducing the storage WTR, the authors had to disable the live video display.

Because of the shortcomings and high costs of commercially available video capture systems, many researchers have already sought custom solutions that meet their own requirements. 

The problem of the high cost of custom solutions and specialised professional hardware is that it keeps accurately synchronised multi-sensor data capture out of reach for most computer vision and pattern recognition researchers. 

They used a tree of trigger connections between the processing boards (that each control one camera) to synchronise the cameras with a difference of 200 nanoseconds between subsequent levels of the tree. 

Using a photo diode that is sensitive to IR, the authors could record these flashes as a sensor trigger signal in one of the audio channels and estimate the accuracy of synchronisation of the gaze data. 
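A minimal numpy sketch of that edge localisation is shown below; the threshold, sampling rate, and synthetic signal are assumptions for illustration, not the authors' processing pipeline.

```python
# Sketch of locating rising edges of a flash/trigger signal that was recorded
# in one audio channel (e.g. a photodiode picking up IR flashes).
import numpy as np

def rising_edges(channel, threshold=0.5):
    """Return sample indices where the signal crosses the threshold upwards."""
    above = channel > threshold
    return np.flatnonzero(~above[:-1] & above[1:]) + 1

# Example with a synthetic 48 kHz channel containing two short pulses.
fs = 48_000
sig = np.zeros(fs)
sig[10_000:10_050] = 1.0
sig[30_000:30_050] = 1.0

edges = rising_edges(sig)
print(edges / fs)  # edge times in seconds: [0.2083..., 0.625]
```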

To overcome this, the authors propose solutions and present findings regarding the two most important difficulties in using low-cost Commercial Off-The-Shelf (COTS) components: reaching the required bandwidth for data capture and achieving accurate multi-sensor synchronisation. 

The authors could find a linear mapping between audio sample number and the time of the external system by applying a linear fit to all two-dimensional time synchronisation points (timestamps with corresponding audio time) received during a recording.
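The sketch below illustrates such a fit with numpy.polyfit on synthetic synchronisation points; the numbers are made up and only serve to show the mapping from audio sample number to external-system time.

```python
# Fit a line through (audio sample number, external timestamp) pairs and use
# it to convert any audio sample number to external-system time.
import numpy as np

audio_samples = np.array([48_000, 96_000, 144_000, 192_000])  # audio time axis
external_time = np.array([12.001, 13.000, 14.002, 15.001])    # seconds on the other system

slope, offset = np.polyfit(audio_samples, external_time, deg=1)

def external_from_sample(n):
    """Map an audio sample number to external-system time via the linear fit."""
    return slope * n + offset

print(external_from_sample(120_000))  # about 13.5 s
```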

These signals can be recorded in a parallel audio channel as well, and can even be used as a common time base to synchronise multiple asynchronous audio interfaces. 

The maximum number of cameras that can be connected to one FireWire bus is typically limited to 4 or 8 (DMA channels), depending on the bus hardware. 

Assuming that the processes of transmission and reception are symmetric, the transmission latency can be found as half of the time needed for transmitting and receiving the timestamp signal, compensated by the duration of the signal.
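One plausible reading of that compensation is sketched below; the exact bookkeeping in the paper may differ, and the example values are made up.

```python
# One-way serial latency under the symmetry assumption described above:
# half of the measured round-trip time, after subtracting the time the
# timestamp signal itself occupies on the wire.

def one_way_latency(round_trip_s, signal_duration_s):
    """Estimate one-way latency assuming symmetric transmit/receive paths."""
    return (round_trip_s - signal_duration_s) / 2.0

# An 8-byte timestamp at 9600 bps, with 10 bits per byte on the wire
# (start bit, 8 data bits, stop bit), occupies about 8.3 ms.
signal_duration = 8 * 10 / 9600
print(one_way_latency(round_trip_s=0.00838, signal_duration_s=signal_duration))
```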

The experiments discussed above show that synchronisation by transmitting timestamp signals through the serial port can be done with an accuracy of approximately 20µs.

This means that, with an audio sampling rate of 48kHz, the uncertainty of localising the rising camera trigger edge is around 20µs.
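That figure is essentially one audio sampling period, as the small check below shows.

```python
# The ~20 µs uncertainty quoted above corresponds to one audio sampling
# period: a rising edge can only be localised to within about one sample.
fs = 48_000                    # audio sampling rate in Hz
period_us = 1e6 / fs           # sampling period in microseconds
print(f"{period_us:.1f} us")   # prints 20.8 us
```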