Author

Simon Dobrišek

Bio: Simon Dobrišek is an academic researcher from the University of Ljubljana. The author has contributed to research in topics: Facial recognition system & Computer science. The author has an h-index of 13, co-authored 59 publications receiving 552 citations.


Papers
Journal ArticleDOI
TL;DR: In the latest version of the system, all the modules of the early version are integrated into a user interface that offers basic web-browsing functionality and a mouse-controlled text-to-speech screen-reader function.
Abstract: Blind and visually-impaired people face many problems in interacting with information retrieval systems. State-of-the-art spoken language technology offers potential to overcome many of them. In the mid-nineties our research group decided to develop an information retrieval system suitable for Slovene-speaking blind and visually-impaired people. A voice-driven text-to-speech dialogue system was developed for reading Slovenian texts obtained from the Electronic Information System of the Association of Slovenian Blind and Visually Impaired Persons Societies. The evolution of the system is presented. The early version of the system was designed to deal explicitly with the Electronic Information System, where the available text corpora are stored in a plain text file format without any, or with just some, basic non-standard tagging. Further improvements to the system became possible with the decision to transfer the available corpora to a new web portal dedicated exclusively to blind and visually-impaired users. The text files were reformatted into common HTML/XML pages, which comply with the basic recommendations set by the Web Access Initiative. In the latest version of the system, all the modules of the early version are integrated into a user interface that offers basic web-browsing functionality and a mouse-controlled text-to-speech screen-reader function.

95 citations

Journal ArticleDOI
TL;DR: The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like.
Abstract: The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via the maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when bu...
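The audio branch described above adapts utterance-specific GMMs from a Universal Background Model via MAP estimation. A minimal NumPy sketch of the classic relevance-factor MAP adaptation of the UBM component means is shown below. This is an illustrative stand-in, not the authors' implementation; the function name, the relevance factor default of 16, and the diagonal-covariance assumption are all choices made here for the sketch.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_covars, features, r=16.0):
    """MAP-adapt the mean vectors of a diagonal-covariance GMM (the UBM)
    to the feature vectors of one utterance, using the relevance-factor
    formulation. Returns the adapted (K, D) mean matrix.

    ubm_means:   (K, D) component means
    ubm_weights: (K,)   component weights
    ubm_covars:  (K, D) diagonal covariances
    features:    (N, D) feature frames (e.g. MFCCs) of one utterance
    r:           relevance factor controlling adaptation strength
    """
    # Log-density of each frame under each diagonal Gaussian.
    diff = features[:, None, :] - ubm_means[None, :, :]          # (N, K, D)
    log_gauss = -0.5 * (np.sum(diff**2 / ubm_covars, axis=2)
                        + np.sum(np.log(2 * np.pi * ubm_covars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss                   # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)              # stabilize
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # responsibilities

    n_k = post.sum(axis=0)                                       # (K,) soft counts
    ml_means = (post.T @ features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + r))[:, None]                           # adaptation coeff.
    # Components that saw little data stay close to the UBM means.
    return alpha * ml_means + (1.0 - alpha) * ubm_means
```

Components with large soft counts move toward the utterance statistics, while unobserved components remain at the UBM prior, which is what makes the adapted models robust for short utterances.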

71 citations

Journal ArticleDOI
TL;DR: A comprehensive experimental evaluation of several recent face detectors for their performance with masked-face images is conducted and the usefulness of multiple off-the-shelf deep-learning models for recognizing correct face-mask placement is investigated.
Abstract: The new Coronavirus disease (COVID-19) has seriously affected the world. By the end of November 2020, the global number of new coronavirus cases had already exceeded 60 million and the number of deaths 1,410,378 according to information from the World Health Organization (WHO). To limit the spread of the disease, mandatory face-mask rules are now becoming common in public settings around the world. Additionally, many public service providers require customers to wear face-masks in accordance with predefined rules (e.g., covering both mouth and nose) when using public services. These developments inspired research into automatic (computer-vision-based) techniques for face-mask detection that can help monitor public behavior and contribute towards constraining the COVID-19 pandemic. Although existing research in this area resulted in efficient techniques for face-mask detection, these usually operate under the assumption that modern face detectors provide perfect detection performance (even for masked faces) and that the main goal of the techniques is to detect the presence of face-masks only. In this study, we revisit these common assumptions and explore the following research questions: (i) How well do existing face detectors perform with masked-face images? (ii) Is it possible to detect a proper (regulation-compliant) placement of facial masks? and (iii) How useful are existing face-mask detection techniques for monitoring applications during the COVID-19 pandemic? To answer these and related questions we conduct a comprehensive experimental evaluation of several recent face detectors for their performance with masked-face images. Furthermore, we investigate the usefulness of multiple off-the-shelf deep-learning models for recognizing correct face-mask placement.
Finally, we design a complete pipeline for recognizing whether face-masks are worn correctly or not and compare the performance of the pipeline with standard face-mask detection models from the literature. To facilitate the study, we compile a large dataset of facial images from the publicly available MAFA and Wider Face datasets and annotate it with compliant and non-compliant labels. The annotated dataset, called the Face-Mask-Label Dataset (FMLD), is made publicly available to the research community.
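The pipeline described in the abstract chains a face detector with a per-face mask-placement classifier. A minimal sketch of that two-stage structure, with both models injected as callables so any off-the-shelf detector and classifier can be plugged in, might look as follows. All names here (MaskResult, mask_placement_pipeline, the label strings) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

@dataclass
class MaskResult:
    box: Box
    label: str      # "compliant" or "non_compliant"
    score: float    # classifier confidence in [0, 1]

def mask_placement_pipeline(
    image,
    detect_faces: Callable[[object], List[Box]],
    classify_placement: Callable[[object, Box], Tuple[str, float]],
) -> List[MaskResult]:
    """Two-stage check: detect all faces first, then classify each detected
    face region as wearing a mask correctly or not. Detection errors on
    masked faces propagate directly into the final result, which is why the
    study evaluates the detectors separately."""
    results = []
    for box in detect_faces(image):
        label, score = classify_placement(image, box)
        results.append(MaskResult(box=box, label=label, score=score))
    return results
```

Separating the stages makes it possible to benchmark the detector and the placement classifier independently, which mirrors the research questions (i) and (ii) posed in the abstract.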

65 citations

Proceedings ArticleDOI
04 May 2015
TL;DR: A comparison of the two speech synthesis modules integrated in the proposed DROPSY-based approach reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.
Abstract: The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Because a Diphone RecOgnition step and a sPeech SYnthesis step are used during the de-identification, we refer to the developed technique as DROPSY. With this approach the acoustical models of the recognition and synthesis modules are completely independent of each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language-dependent, text-independent and capable of running in real-time due to the relatively simple computing methods used. When designing speaker de-identification technology, two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., with the HMM-based and the diphone TD-PSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and evaluated in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.
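The de-identified speech above is evaluated in speaker verification experiments: de-identification succeeds when a verification trial against the original speaker's enrollment model fails. A minimal sketch of such a trial, using plain cosine scoring of fixed-length speaker embeddings as a simpler stand-in for the paper's i-vector/PLDA system, could look like this. The function names and the threshold value are illustrative assumptions, not from the paper.

```python
import numpy as np

def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between two fixed-length speaker embeddings
    (e.g. i-vectors): 1.0 for identical directions, 0.0 for orthogonal."""
    a = enroll_vec / np.linalg.norm(enroll_vec)
    b = test_vec / np.linalg.norm(test_vec)
    return float(a @ b)

def verify(enroll_vec, test_vec, threshold=0.5):
    """Accept the claimed identity if the score exceeds the threshold.
    For well de-identified speech, the embedding of the processed signal
    should score below the threshold against the original speaker's
    enrollment, i.e. verification should fail."""
    return cosine_score(enroll_vec, test_vec) >= threshold
```

In the paper's setting, PLDA scoring replaces the raw cosine similarity, but the success criterion is the same: the verification system should no longer link the de-identified speech to the source speaker.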

40 citations

Journal ArticleDOI
TL;DR: This paper presents the Slovene-language spoken resources that were acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana over the past ten years.
Abstract: This paper presents the Slovene-language spoken resources that were acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana over the past ten years. The resources consist of:
• isolated-spoken-word corpora designed for phonetic research of the Slovene spoken language;
• read-speech corpora from dialogues relating to air flight information;
• isolated-word corpora designed for studying the Slovene spoken diphthongs;
• Slovene diphone corpora used for text-to-speech synthesis systems;
• a weather forecast speech database, as an attempt to capture radio and television broadcast news in the Slovene language; and
• read- and spontaneous-speech corpora used to study the effects of the psychophysical conditions of the speakers on their speech characteristics.
All the resources are accompanied by relevant text transcriptions, lexicons and various segmentation labels. The read-speech corpora relating to the air flight information domain are also annotated prosodically and semantically. The words in the orthographic transcription were automatically tagged for their lemma and morphosyntactic description. Many of the mentioned speech resources are freely available for basic research purposes in speech technology and linguistics. In this paper we describe all the resources in more detail and give a brief description of their use in the spoken language technology products developed at LUKS.

34 citations


Cited by
01 Jan 2004
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance, and describes numerous important application areas such as image-based rendering and digital libraries.
Abstract: From the Publisher: The accessible presentation of this book gives both a general view of the entire computer vision enterprise and also offers sufficient detail to be able to build useful applications. Users learn techniques that have proven to be useful by first-hand experience and a wide range of mathematical methods. A CD-ROM with every copy of the text contains source code for programming practice, color images, and illustrative movies. Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance. Topics are discussed in substantial and increasing depth. Application surveys describe numerous important application areas such as image-based rendering and digital libraries. Many important algorithms are broken down and illustrated in pseudocode. Appropriate for use by engineers as a comprehensive reference to the computer vision enterprise.

3,627 citations

Patent
11 Jan 2011
TL;DR: In this article, an intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions.
Abstract: An intelligent automated assistant system engages with the user in an integrated, conversational manner using natural language dialog, and invokes external services when appropriate to obtain information or perform various actions. The system can be implemented using any of a number of different platforms, such as the web, email, smartphone, and the like, or any combination thereof. In one embodiment, the system is based on sets of interrelated domains and tasks, and employs additional functionally powered by external services with which the system can interact.

1,462 citations

Journal ArticleDOI
TL;DR: This first-of-its-kind, comprehensive literature review of the diverse field of affective computing focuses mainly on the use of audio, visual and text information for multimodal affect analysis, and outlines existing methods for fusing information from different modalities.

969 citations