
Showing papers by "Andreas Spanias" published in 2018


Proceedings Article
29 Apr 2018
TL;DR: In this paper, attention models are applied to clinical time-series modeling for the first time; the proposed SAnD architecture employs a masked self-attention mechanism and uses positional encoding and dense interpolation strategies to incorporate temporal order.
Abstract: With widespread adoption of electronic health records, there is an increased emphasis on predictive models that can effectively deal with clinical time-series data. Powered by Recurrent Neural Network (RNN) architectures with Long Short-Term Memory (LSTM) units, deep neural networks have achieved state-of-the-art results in several clinical prediction tasks. Despite the success of RNNs, their sequential nature prohibits parallelized computing, thus making them inefficient, particularly when processing long sequences. Recently, architectures based solely on attention mechanisms have shown remarkable success in transduction tasks in NLP, while being computationally superior. In this paper, for the first time, we utilize attention models for clinical time-series modeling, thereby dispensing with recurrence entirely. We develop the SAnD (Simply Attend and Diagnose) architecture, which employs a masked self-attention mechanism and uses positional encoding and dense interpolation strategies for incorporating temporal order. Furthermore, we develop a multi-task variant of SAnD to jointly infer models for multiple diagnosis tasks. Using the recent MIMIC-III benchmark datasets, we demonstrate that the proposed approach achieves state-of-the-art performance in all tasks, outperforming LSTM models and classical baselines with hand-engineered features.

340 citations
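
A minimal NumPy sketch of the two building blocks the abstract names: sinusoidal positional encoding and a causally masked scaled dot-product self-attention layer. The single-head form, array shapes, and toy data are illustrative assumptions, not the SAnD configuration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding for injecting temporal order."""
    pos = np.arange(seq_len)[:, None]                       # (T, 1)
    i = np.arange(d_model)[None, :]                         # (1, d)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (T, d)

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask,
    so each time step only attends to current and past measurements."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])                 # (T, T)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)  # future steps
    scores[mask] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ v                                      # (T, d)

# Toy clinical time series: T = 48 hourly steps, d = 16 features per step.
rng = np.random.default_rng(0)
T, d = 48, 16
x = rng.normal(size=(T, d)) + positional_encoding(T, d)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
print(masked_self_attention(x, Wq, Wk, Wv).shape)   # (48, 16)
```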


Journal ArticleDOI
TL;DR: The idea of using deep architectures to perform kernel machine optimization, for both computational efficiency and end-to-end inferencing, is explored, and kernel dropout regularization is introduced to improve training convergence.
Abstract: Building highly nonlinear and nonparametric models is central to several state-of-the-art machine learning systems. Kernel methods form an important class of techniques that induce a reproducing kernel Hilbert space (RKHS) for inferring nonlinear models through the construction of similarity functions from data. These methods are particularly preferred in cases where the training data sizes are limited and when prior knowledge of the data similarities is available. Despite their usefulness, they are limited by their computational complexity and their inability to support end-to-end learning with a task-specific objective. On the other hand, deep neural networks have become the de facto solution for end-to-end inference in several learning paradigms. In this paper, we explore the idea of using deep architectures to perform kernel machine optimization, for both computational efficiency and end-to-end inferencing. To this end, we develop the deep kernel machine optimization framework, which creates an ensemble of dense embeddings using Nystrom kernel approximations and utilizes deep learning to generate task-specific representations through the fusion of the embeddings. Intuitively, the filters of the network are trained to fuse information from an ensemble of linear subspaces in the RKHS. Furthermore, we introduce kernel dropout regularization to enable improved training convergence. Finally, we extend this framework to the multiple kernel case by coupling a global fusion layer with pretrained deep kernel machines for each of the constituent kernels. Using case studies with limited training data and no explicit feature sources, we demonstrate the effectiveness of our framework over conventional model inferencing techniques.

50 citations
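
A compact sketch of the Nystrom step the abstract refers to: approximating an RBF kernel map with a low-dimensional dense embedding built from a subset of landmark points. The kernel choice, landmark count, and variable names are illustrative, and the deep fusion and kernel-dropout stages are not shown.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(X, landmarks, gamma=0.5, eps=1e-8):
    """Map X to a dense feature space whose inner products approximate
    the full kernel: K ~= Phi Phi^T, with Phi = K_xl K_ll^{-1/2}."""
    K_ll = rbf_kernel(landmarks, landmarks, gamma)       # (m, m)
    K_xl = rbf_kernel(X, landmarks, gamma)               # (n, m)
    vals, vecs = np.linalg.eigh(K_ll)
    vals = np.maximum(vals, eps)
    K_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return K_xl @ K_inv_sqrt                             # (n, m) dense embedding

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
landmarks = X[rng.choice(len(X), size=50, replace=False)]
Phi = nystrom_embedding(X, landmarks)
# Phi can now be fed to dense network layers; an ensemble of such embeddings
# (different kernels / landmark sets) would be fused by the learned layers.
print(Phi.shape)   # (500, 50)
```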


Proceedings ArticleDOI
15 May 2018
TL;DR: The study describes remote fault detection using machine learning approaches, power output optimization using cloud movement prediction, and consensus-based solar array parameter estimation, which are used in the development of a utility-scale solar cyber-physical system.
Abstract: This paper describes three methods used in the development of a utility-scale solar cyber-physical system. The study describes remote fault detection using machine learning approaches, power output optimization using cloud movement prediction, and consensus-based solar array parameter estimation. Dynamic cloud movement, shading, and soiling lead to fluctuations in power output and loss of efficiency. For optimization of the output power, a cloud movement prediction algorithm is proposed. Integrated fault detection methods are also described to predict and bypass failing modules. Finally, the fully connected solar array, which is fitted with multiple sensors, is operated as an Internet of Things network. Integrated with each module are sensors and radio electronics communicating all data to a fusion center. Gathering data at the fusion center to compute and transmit analytics requires secure low-power communication solutions. To optimize resources and power consumption, we describe a method to integrate fully distributed algorithms designed for wireless sensor networks into this cyber-physical system.

22 citations


Proceedings ArticleDOI
10 Sep 2018
TL;DR: An NSF Research Experiences for Undergraduates (REU) site was established to embed students in research projects related to integrated sensor and signal processing systems; this paper describes the REU activities, modules, training, projects, and their assessment.
Abstract: Arizona State University (ASU) established an NSF Research Experiences for Undergraduates (REU) site to embed students in research projects related to integrated sensor and signal processing systems. The program includes both sensor hardware and algorithm/software design for a variety of applications including health monitoring. The site was funded in February 2017 and the Co-PIs recruited nine students from different universities and community colleges to spend the summer of 2017 in research laboratories at ASU. The program included structured training with modules in sensor design, signal processing, and machine learning. Cross-cutting training included research ethics, IEEE manuscript development, and building presentation skills. Nine undergraduate research projects were launched and the program went through an assessment by an independent evaluator. This paper describes the REU activities, modules, training, projects, and their assessment.

13 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work-in-progress paper describes HTML5 software that enables online machine learning experiments in an undergraduate DSP course; the software embeds several digital signal processing functions and provides a user-friendly visualization of phoneme recognition tasks.
Abstract: This work-in-progress paper describes software that enables online machine learning experiments in an undergraduate DSP course. This software operates in HTML5 and embeds several digital signal processing functions. The software can process natural signals such as speech and can extract various features for machine learning applications. For example, in the case of speech processing, LPC coefficients and formant frequencies can be computed. In this paper, we present speech processing, feature extraction, and clustering of features using the K-means machine learning algorithm. The primary objective is to provide a machine learning experience to undergraduate students. The functions and simulations described provide a user-friendly visualization of phoneme recognition tasks. These tasks make use of the Levinson-Durbin linear prediction and K-means machine learning algorithms. The exercise was assigned as a class project in our undergraduate DSP class. The exercise is described along with assessment results.

12 citations
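
The J-DSP software itself is Java/HTML5; the NumPy/scikit-learn sketch below mirrors the pipeline the abstract describes: autocorrelation, Levinson-Durbin LPC, and K-means clustering of the resulting feature vectors. The frame length, LPC order, and synthetic "phoneme" data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for LPC coefficients a[1..p]
    from autocorrelation lags r[0..p] (Levinson-Durbin recursion)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1][:i]    # update a_1..a_i
        err *= (1.0 - k * k)
    return a[1:]                              # LPC coefficients

def lpc_features(frame, order=8):
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return levinson_durbin(r[:order + 1], order)

# Two synthetic "phoneme" classes: resonances at different frequencies.
rng = np.random.default_rng(2)
fs, n = 8000, 240
frames, t = [], np.arange(n) / fs
for f0 in (500, 1500):
    for _ in range(40):
        frames.append(np.sin(2 * np.pi * f0 * t) + 0.1 * rng.normal(size=n))
feats = np.array([lpc_features(f) for f in frames])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
print(labels[:40].mean(), labels[40:].mean())  # clusters should largely separate the classes
```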


Posted Content
TL;DR: This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization and proposes to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion.
Abstract: In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.

12 citations
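
An illustrative NumPy sketch of the triplet loss used for metric learning in diarization pipelines like the one above. The attention-based embedding network is abstracted away as pre-computed vectors, and the margin value is an assumption.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on L2-normalized embeddings: pull
    same-speaker segments together, push different speakers apart."""
    def dist(a, b):
        return np.sum((a - b) ** 2, axis=-1)
    return np.maximum(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Toy embeddings for three speech segments (anchor/positive from the same
# speaker, negative from a different speaker); in an end-to-end system these
# would be produced jointly by the attention-based sequence encoder.
rng = np.random.default_rng(3)
emb = rng.normal(size=(3, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
anchor, positive, negative = emb
print(triplet_loss(anchor, positive, negative))
```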


Journal ArticleDOI
TL;DR: It is shown that the proposed algorithm can be used to estimate the center and radius of the (not necessarily location-based) data at sensor nodes in a distributed way, thereby providing information and insights about the global data at each sensor.
Abstract: A fully distributed algorithm for estimating the center and the radius of the smallest sphere that contains a wireless sensor network is proposed. The center-finding problem is formulated as a convex optimization problem in summation form by using a soft-max approximation to the maximum function. A diffusion adaptation method is used in which the node states converge to the estimated center in a distributed fashion. Then, distributed max consensus is used to compute the radius. The proposed algorithm is fully distributed in the sense that each node in the network only needs to know its own location and nodes do not need to be pre-labeled. The algorithm works for any connected graph structure. The performance of the proposed algorithm is analyzed and it is shown that there is a tradeoff: a larger design parameter results in a more accurate center estimate but also makes the convergence of the distributed algorithm slower. It is also shown that the proposed algorithm can be used to estimate the center and radius of the (not necessarily location-based) data at sensor nodes in a distributed way, thereby providing information and insights about the global data at each sensor. Simulation results corroborating the theory are also provided.

11 citations
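
A centralized NumPy sketch of the soft-max surrogate described in the abstract: the maximum distance to the nodes is replaced by a log-sum-exp approximation and minimized by gradient descent. The diffusion-adaptation and max-consensus steps that make the algorithm fully distributed are not shown, and the design parameter beta, step size, and iteration count are assumptions.

```python
import numpy as np

def softmax_center(points, beta=10.0, lr=0.02, iters=2000):
    """Minimize (1/beta) * log sum_i exp(beta * ||x - a_i||^2), a smooth
    surrogate for the maximum squared distance, by gradient descent."""
    x = points.mean(axis=0)                      # start at the centroid
    for _ in range(iters):
        d2 = np.sum((x - points) ** 2, axis=1)   # squared distances
        w = np.exp(beta * (d2 - d2.max()))       # numerically stable soft-max weights
        w /= w.sum()
        grad = 2.0 * np.sum(w[:, None] * (x - points), axis=0)
        x -= lr * grad
    radius = np.sqrt(np.max(np.sum((x - points) ** 2, axis=1)))
    return x, radius                             # approximate center and radius

rng = np.random.default_rng(4)
nodes = rng.uniform(0, 10, size=(30, 2))         # sensor locations
center, radius = softmax_center(nodes)
print(center, radius)
```

A larger beta makes the surrogate track the true maximum more closely (mirroring the accuracy/convergence tradeoff noted in the abstract), at the cost of slower, less stable descent.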


Proceedings ArticleDOI
02 Sep 2018
TL;DR: In this paper, the authors employ attention models to learn embeddings and the metric jointly in an end-to-end fashion for speaker diarization in automatic speech processing systems.
Abstract: In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.

10 citations


Proceedings ArticleDOI
23 Jul 2018
TL;DR: This paper develops a solar array simulation in J-DSP, introduces a multi-layer perceptron model for PV fault detection, and deploys and assesses the simulation by disseminating it to a group of users who provide feedback.
Abstract: When collecting solar energy via photovoltaic (PV) panel arrays, one common issue is the potential occurrence of faults. Faults arise from panel short-circuits, soiling, shading, ground leakage, and other sources. Machine learning algorithms have enabled data-based classification of faults. In this paper, we present an Internet-based PV array fault monitoring simulation using the Java-DSP (J-DSP) simulation environment. We first develop a solar array simulation in J-DSP and then form appropriate graphics to examine V-I curves, maximum power point tracking, and faults. We then introduce a multi-layer perceptron model for PV fault detection. We deploy and assess the simulation by disseminating it to a group of users who provide feedback.

10 citations
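
The simulation above runs in J-DSP; here is a scikit-learn sketch of the classification stage only: a small multi-layer perceptron trained on synthetic per-module measurements with made-up "shading" and "soiling" labels. The feature set, fault effects, and labels are placeholders, not the paper's data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-module measurements:
# columns ~ [voltage, current, irradiance, temperature]
rng = np.random.default_rng(5)
n = 1200
X = rng.normal(loc=[30.0, 8.0, 800.0, 40.0], scale=[2.0, 1.0, 150.0, 8.0], size=(n, 4))
y = np.zeros(n, dtype=int)                      # 0 = normal
shaded = rng.choice(n, size=200, replace=False)
X[shaded, 1] *= 0.5                             # shading: current drops sharply
y[shaded] = 1
soiled = rng.choice(np.setdiff1d(np.arange(n), shaded), size=200, replace=False)
X[soiled, 1] *= 0.8                             # soiling: mild current loss
X[soiled, 0] *= 0.95
y[soiled] = 2

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0))
clf.fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))
```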


Book
02 Mar 2018
TL;DR: This book addresses the problem of estimating the structure of distributed WSNs, reviews existing consensus algorithms including average consensus and max consensus, and introduces a distributed algorithm for counting the total number of nodes in a wireless sensor network with noisy communication channels.
Abstract: The area of detection and estimation in a distributed wireless sensor network (WSN) has several applications, including military surveillance, sustainability, health monitoring, and the Internet of Things (IoT). Compared with a wired centralized sensor network, a distributed WSN has many advantages, including scalability and robustness to sensor node failures. In this book, we address the problem of estimating the structure of distributed WSNs. First, we provide a literature review of: (a) graph theory; (b) network area estimation; and (c) existing consensus algorithms, including average consensus and max consensus. Second, a distributed algorithm for counting the total number of nodes in a wireless sensor network with noisy communication channels is introduced. Then, a distributed network degree distribution estimation (DNDD) algorithm is described. The DNDD algorithm is based on average consensus and in-network empirical mass function estimation. Finally, a fully distributed algorithm for estimating...

10 citations
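
An idealized, noise-free NumPy sketch of one primitive the book builds on: average consensus over a random geometric graph, used here to count nodes by initializing a single anchor node at 1 and the rest at 0, so the consensus value approaches 1/N. The graph model, step size, and iteration count are assumptions; the book's algorithms additionally handle noisy links.

```python
import numpy as np

def random_geometric_graph(n, radius, rng):
    """Adjacency matrix: nodes within 'radius' of each other are neighbors."""
    pos = rng.uniform(0, 1, size=(n, 2))
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    return ((d < radius) & ~np.eye(n, dtype=bool)).astype(float)

rng = np.random.default_rng(6)
n = 40
A = random_geometric_graph(n, radius=0.35, rng=rng)   # assumed connected
deg = A.sum(axis=1)
eps = 0.9 / deg.max()                                  # consensus step size

x = np.zeros(n)
x[0] = 1.0                                             # a single anchor node
for _ in range(400):
    x = x + eps * (A @ x - deg * x)                    # average with neighbors

# Each node's state approaches 1/n, so 1/x estimates the node count.
print("estimated node count at a few nodes ~", 1.0 / x[:5])
```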


Posted Content
TL;DR: In this paper, localization using narrowband communication signals with time-of-arrival (TOA) measurements is considered, and the Cramer-Rao lower bound for localization error is derived under different assumptions on the fading coefficients.
Abstract: In this paper, localization using narrowband communication signals is considered in the presence of fading channels with time-of-arrival measurements. When narrowband signals are used for localization, due to existing hardware constraints, fading channels play a crucial role in localization accuracy. In a location estimation formulation, the Cramer-Rao lower bound for the localization error is derived under different assumptions on the fading coefficients. For the same level of localization accuracy, the loss in performance due to Rayleigh fading with known phase is shown to be about 5 dB compared to the case with no fading. Unknown phase causes an additional 1 dB loss. The maximum likelihood estimators are also derived. In an alternative distributed detection formulation, each anchor receives a noisy signal from a node with known location if the node is active. Each anchor makes a decision as to whether the node is active or not and transmits a bit to a fusion center once a decision is made. The fusion center combines all the decisions and uses a design parameter to make the final decision. We derive optimal thresholds and calculate the probabilities of false alarm and detection under different assumptions on the knowledge of channel information. Simulations corroborate our analytical results.

Proceedings ArticleDOI
29 Aug 2018
TL;DR: This paper develops a fast dynamic-texture prediction method using tools from non-linear dynamical modeling and fast approaches for approximate regression, and applies it to shading prediction in utility-scale solar arrays.
Abstract: This paper aims to develop a fast dynamic-texture prediction method, using tools from non-linear dynamical modeling, and fast approaches for approximate regression. We consider dynamic textures to be described by patch-level non-linear processes, thus requiring tools such as delay-embedding to uncover a phase-space where dynamical evolution can be more easily modeled. After mapping the observed time-series from a dynamic texture video to its recovered phase-space, a time-efficient approximate prediction method is presented which utilizes locality-sensitive hashing approaches to predict possible phase-space vectors, given the current phase-space vector. Our experiments show the favorable performance of the proposed approach, both in terms of prediction fidelity, and computational time. The proposed algorithm is applied to shading prediction in utility scale solar arrays.
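
A NumPy sketch of two ingredients the abstract names: delay embedding of a scalar (patch-level) time series into a phase space, and prediction of the next sample from the most similar past phase-space vector. The paper accelerates this lookup with locality-sensitive hashing; for brevity the sketch uses a brute-force nearest neighbor instead, and the embedding dimension and delay are assumptions.

```python
import numpy as np

def delay_embed(x, dim=4, tau=2):
    """Map scalar series x into phase-space vectors of 'dim' delayed samples."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

def predict_next(history, dim=4, tau=2):
    """Predict the next sample from the value that followed the most similar
    past phase-space vector (brute-force stand-in for an LSH lookup)."""
    emb = delay_embed(history, dim, tau)
    query = emb[-1]
    past, targets = emb[:-1], history[(dim - 1) * tau + 1:]
    idx = np.argmin(np.linalg.norm(past - query, axis=1))
    return targets[idx]

# Toy "shading intensity" series for one image patch.
t = np.arange(600)
x = np.sin(0.07 * t) + 0.05 * np.random.default_rng(7).normal(size=len(t))
print("prediction:", predict_next(x), "noise-free next value:", np.sin(0.07 * 600))
```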

Posted Content
TL;DR: This paper proposes to use attention models for effective feature learning and develops two novel architectures that exploit inter-layer dependencies for building multi-layered graph embeddings; it also shows that using simple random features is an effective choice, even in cases where explicit node attributes are not available.
Abstract: Modern data analysis pipelines are becoming increasingly complex due to the presence of multi-view information sources. While graphs are effective in modeling complex relationships, in many scenarios a single graph is rarely sufficient to succinctly represent all interactions, and hence multi-layered graphs have become popular. Though this leads to richer representations, extending solutions from the single-graph case is not straightforward. Consequently, there is a strong need for novel solutions to solve classical problems, such as node classification, in the multi-layered case. In this paper, we consider the problem of semi-supervised learning with multi-layered graphs. Though deep network embeddings, e.g. DeepWalk, are widely adopted for community discovery, we argue that feature learning with random node attributes, using graph neural networks, can be more effective. To this end, we propose to use attention models for effective feature learning, and develop two novel architectures, GrAMME-SG and GrAMME-Fusion, that exploit the inter-layer dependencies for building multi-layered graph embeddings. Using empirical studies on several benchmark datasets, we evaluate the proposed approaches and demonstrate significant performance improvements in comparison to state-of-the-art network embedding strategies. The results also show that using simple random features is an effective choice, even in cases where explicit node attributes are not available.

Proceedings ArticleDOI
23 Jul 2018
TL;DR: An intuitive method for optimizing activity detection data is presented; it utilizes different Microcontroller Units (MCUs) with embedded sensors for activity detection and incorporates supervised learning to generate a predictive model for activity optimization.
Abstract: The Internet of Things (IoT) has enabled several applications related to data analytics. In this paper, an intuitive method for optimizing activity detection data is presented. Further applications include exploring detection accuracies of physical activities such as walking intensity and movement on stairs. This method utilizes different Microcontroller Units (MCUs) with embedded sensors, which are used for activity detection. Additionally, the method incorporates supervised learning, more specifically a fine Gaussian SVM, to generate a predictive model for activity optimization.
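
"Fine Gaussian SVM" is MATLAB's name for an RBF-kernel SVM with a small kernel scale; below is a roughly equivalent scikit-learn sketch on synthetic accelerometer-style window features. The features, class structure, and gamma value are assumptions, not the paper's data or settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for per-window accelerometer features:
# columns ~ [mean |a|, std |a|, dominant frequency]
rng = np.random.default_rng(8)
def windows(mu, n=150):
    return rng.normal(loc=mu, scale=[0.05, 0.05, 0.2], size=(n, 3))

X = np.vstack([
    windows([1.0, 0.05, 0.5]),   # standing / idle
    windows([1.2, 0.30, 1.8]),   # walking
    windows([1.4, 0.45, 2.2]),   # stairs
])
y = np.repeat([0, 1, 2], 150)

# A "fine" Gaussian SVM roughly corresponds to an RBF kernel with a small
# kernel scale, i.e. a relatively large gamma.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=2.0, C=1.0))
print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```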

Proceedings ArticleDOI
01 Oct 2018
TL;DR: The analysis of a distributed consensus algorithm for estimating the maximum of the node initial state values in a network is considered in the presence of communication noise, wherein, at each iteration, the state values are multiplied by a random matrix characterized by the noise distribution.
Abstract: The analysis of a distributed consensus algorithm for estimating the maximum of the node initial state values in a network is considered in the presence of communication noise. Conventionally, the maximum is estimated by updating the node state value with the largest received measurements in every iteration at each node. However, due to additive channel noise, the estimate of the maximum at each node has a positive drift at each iteration and this results in nodes diverging from the true max value. Max-plus algebra is used to study this ergodic process, wherein, at each iteration, the state values are multiplied by a random matrix characterized by the noise distribution. The growth rate of the state values due to noise is studied by analyzing the Lyapunov exponent of the product of noise matrices in a max-plus semiring. The growth rate of the state values is bounded by a constant which depends on the spectral radius of the network and the noise variance. Simulation results supporting the theory are also presented.
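
A small NumPy simulation of the phenomenon the paper analyzes: running max consensus over noisy links makes every node's estimate drift upward at a roughly constant rate, because each node keeps the largest noisy value it hears. The ring topology, noise variance, and iteration count are assumptions; the paper's contribution is bounding this growth rate via the max-plus Lyapunov exponent, which the sketch only measures empirically.

```python
import numpy as np

rng = np.random.default_rng(9)
n, iters, sigma = 30, 200, 0.1

# Ring network: each node exchanges values with its two neighbors.
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]

x = rng.uniform(0, 10, size=n)          # initial state values
true_max = x.max()
history = []
for _ in range(iters):
    x_new = x.copy()
    for i in range(n):
        # Each received value is corrupted by additive channel noise.
        noisy = [x[j] + sigma * rng.normal() for j in neighbors[i]]
        x_new[i] = max(x_new[i], max(noisy))
    x = x_new
    history.append(x.mean())

drift = (history[-1] - history[iters // 2]) / (iters - iters // 2)
print("true max:", round(true_max, 3),
      "final mean state:", round(history[-1], 3),
      "per-iteration drift ~", round(drift, 4))
```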

Journal ArticleDOI
TL;DR: This paper derives a new data-driven complete basis that is similar to the deterministic Bernstein polynomial basis and develops two methods for performing basis expansions of functionals of two distributions.
Abstract: A number of fundamental quantities in statistical signal processing and information theory can be expressed as integral functions of two probability density functions. Such quantities are called density functionals as they map density functions onto the real line. For example, information divergence functions measure the dissimilarity between two probability density functions and are useful in a number of applications. Typically, estimating these quantities requires complete knowledge of the underlying distribution followed by multidimensional integration. Existing methods make parametric assumptions about the data distribution or use nonparametric density estimation followed by high-dimensional integration. In this paper, we propose a new alternative. We introduce the concept of “data-driven basis functions”—functions of distributions whose value we can estimate given only samples from the underlying distributions without requiring distribution fitting or direct integration. We derive a new data-driven complete basis that is similar to the deterministic Bernstein polynomial basis and develop two methods for performing basis expansions of functionals of two distributions. We also show that the new basis set allows us to approximate functions of distributions as closely as desired. Finally, we evaluate the methodology by developing data-driven estimators for the Kullback–Leibler divergences and the Hellinger distance and by constructing empirical estimates of tight bounds on the Bayes error rate.

23 Jun 2018
TL;DR: The efforts to develop an online simulation environment that will support web-based laboratories for training undergraduate students from Electrical Engineering and other disciplines in sensors and machine learning are described.
Abstract: Integrating sensing and machine learning is important in elevating precision in several Internet of Things (IoT) and mobile applications. In our Electrical Engineering classes, we have begun developing self-contained modules to train students in this area. We focus specifically in developing modules in machine learning including pre-processing, feature extraction and classification. We have also embedded in these modules software to provide hands-on training. In this paper, we describe our efforts to develop an online simulation environment that will support web-based laboratories for training undergraduate students from Electrical Engineering and other disciplines in sensors and machine learning. We also present our efforts to enable students to visualize and understand the inner workings of various machine learning algorithms along with descriptions of their performance with several types of synthetic and sensor data.

Posted Content
TL;DR: In this paper, the authors perform a detailed analysis of GAT models and present insights into their behavior; in particular, they show that the models are vulnerable to heterogeneous rogue nodes and propose novel regularization strategies to improve the robustness of GAT models.
Abstract: Machine learning models that can exploit the inherent structure in data have gained prominence. In particular, there is a surge in deep learning solutions for graph-structured data, due to its wide-spread applicability in several fields. Graph attention networks (GAT), a recent addition to the broad class of feature learning models in graphs, utilizes the attention mechanism to efficiently learn continuous vector representations for semi-supervised learning problems. In this paper, we perform a detailed analysis of GAT models, and present interesting insights into their behavior. In particular, we show that the models are vulnerable to heterogeneous rogue nodes and hence propose novel regularization strategies to improve the robustness of GAT models. Using benchmark datasets, we demonstrate performance improvements on semi-supervised learning, using the proposed robust variant of GAT.
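
A NumPy sketch of the core GAT operation the analysis targets: computing attention coefficients over a node's neighborhood and aggregating neighbor features. The single-head form, LeakyReLU slope, and toy graph are illustrative assumptions, and the proposed robustness regularizers are not shown.

```python
import numpy as np

def leaky_relu(z, alpha=0.2):
    return np.where(z > 0, z, alpha * z)

def gat_layer(H, A, W, a):
    """Single-head graph attention layer (Velickovic et al. style):
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), softmax over j in N(i) U {i}."""
    n = H.shape[0]
    Wh = H @ W                                             # (n, d_out)
    out = np.zeros_like(Wh)
    A_self = A + np.eye(n)                                 # include self-loops
    for i in range(n):
        nbrs = np.flatnonzero(A_self[i])
        e = leaky_relu(np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in nbrs]))
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # attention coefficients
        out[i] = alpha @ Wh[nbrs]
    return out

rng = np.random.default_rng(10)
n, d_in, d_out = 6, 8, 4
H = rng.normal(size=(n, d_in))
A = (rng.uniform(size=(n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                             # undirected toy graph
out = gat_layer(H, A, rng.normal(size=(d_in, d_out)), rng.normal(size=2 * d_out))
print(out.shape)   # (6, 4)
```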

Posted Content
TL;DR: In this paper, the authors adapt coverage-based sample designs, originally developed for 2-D image analysis, to machine learning applications by constructing a parameterized family of designs with provably improved coverage characteristics and by developing algorithms for effective sample synthesis.
Abstract: Sampling one or more effective solutions from large search spaces is a recurring idea in machine learning, and sequential optimization has become a popular solution. Typical examples include data summarization, sample mining for predictive modeling and hyper-parameter optimization. Existing solutions attempt to adaptively trade-off between global exploration and local exploitation, wherein the initial exploratory sample is critical to their success. While discrepancy-based samples have become the de facto approach for exploration, results from computer graphics suggest that coverage-based designs, e.g. Poisson disk sampling, can be a superior alternative. In order to successfully adopt coverage-based sample designs to ML applications, which were originally developed for 2-d image analysis, we propose fundamental advances by constructing a parameterized family of designs with provably improved coverage characteristics, and by developing algorithms for effective sample synthesis. Using experiments in sample mining and hyper-parameter optimization for supervised learning, we show that our approach consistently outperforms existing exploratory sampling methods in both blind exploration, and sequential search with Bayesian optimization.
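
A minimal dart-throwing sketch of Poisson disk sampling in the unit square, the coverage-based baseline the abstract contrasts with discrepancy-based designs. The radius and rejection budget are assumptions, and the paper's contribution (a parameterized family with improved coverage plus faster synthesis algorithms) goes beyond this baseline.

```python
import numpy as np

def poisson_disk_dart_throwing(radius, max_tries=20000, rng=None):
    """Accept uniformly drawn points in [0,1]^2 only if they are at least
    'radius' away from every previously accepted point."""
    rng = rng or np.random.default_rng(0)
    samples = []
    for _ in range(max_tries):
        p = rng.uniform(0, 1, size=2)
        if all(np.linalg.norm(p - q) >= radius for q in samples):
            samples.append(p)
    return np.array(samples)

pts = poisson_disk_dart_throwing(radius=0.08)
print(len(pts), "samples with pairwise separation >= 0.08")
# Such a point set could seed the exploration phase of a sequential search
# (e.g. Bayesian optimization) in place of a random or low-discrepancy design.
```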

23 Jun 2018
TL;DR: The paper describes the technical details of the research activities and summarizes an independent assessment of the projects and learning experiences.
Abstract: A unique Research Experiences for Undergraduates (REU) site was established at our University to address education and research problems in integrated sensor device and DSP algorithm design. The site will recruit and train nine undergraduate students each summer and engage them in research endeavors on the design of sensors including student training in mathematical methods for extracting information from sensor systems. The program was launched in 2017, and nine undergraduate research projects advised by a team of faculty advisors started in the summer. The projects embedded REU students in tasks whose focus was to design sensors and interpret their data by studying and programming appropriate machine learning algorithms. The paper describes the technical details of the research activities and summarizes an independent assessment of the projects and learning experiences.

Posted Content
TL;DR: It is shown that the estimated state sequence is asymptotically unbiased and converges toward the sample quantile in the mean-square sense; applications to distributed estimation of the trimmed mean, computation of the median, maximum, or minimum values, and identification of outliers are provided through simulation.
Abstract: A quantile is defined as a value below which random draws from a given distribution fall with a given probability. In a centralized setting where the cumulative distribution function (CDF) is unknown, the empirical CDF (ECDF) can be used to estimate such quantiles after aggregating the data. In a fully distributed sensor network, however, it is challenging to estimate quantiles. This is because each sensor node observes local measurement data with limited storage and data transmission power, which makes it difficult to obtain the global ECDF. This paper proposes consensus-based quantile estimation for such a distributed network. The states of the proposed algorithm are recursively updated in two steps at each iteration: one is a local update based on the measurement data and the current state, and the other is averaging the updated states with neighboring nodes. We consider the realistic case of communication links between nodes being corrupted by independent random noise. It is shown that the estimated state sequence is asymptotically unbiased and converges toward the sample quantile in the mean-square sense. The two step-size sequences corresponding to the averaging and local update steps result in a mixed time-scale algorithm with proper decay rates in order to achieve convergence. We also provide applications to distributed estimation of the trimmed mean, computation of the median, maximum, or minimum values, and identification of outliers through simulation.
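
A NumPy sketch of the two-step structure described in the abstract: each node nudges its state toward the target quantile using only its own measurements (a sign-type stochastic subgradient of the quantile check loss), then averages its state with neighbors over noisy links. The ring topology, step-size schedules, and noise level are assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 20, 0.9                                   # number of nodes, target quantile
data = [rng.normal(loc=5.0, scale=2.0, size=50) for _ in range(n)]
all_data = np.concatenate(data)

# Ring topology with a doubly stochastic weight matrix.
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

x = np.array([d.mean() for d in data])           # initial states
for t in range(1, 3001):
    a_t = 5.0 / t**0.9                           # local-update step size
    b_t = 1.0 / t**0.6                           # averaging step size
    # Step 1: local update from one randomly drawn local measurement.
    draws = np.array([rng.choice(d) for d in data])
    x = x + a_t * (p - (draws <= x).astype(float))
    # Step 2: average with neighbors over links corrupted by additive noise.
    noisy_states = x + 0.05 * rng.normal(size=n)
    x = (1 - b_t) * x + b_t * (W @ noisy_states)

print("consensus estimate ~", x.mean(),
      "pooled sample 0.9-quantile:", np.quantile(all_data, p))
```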

Patent
12 Jan 2018
TL;DR: In this paper, an image recognition algorithm implemented by a hardware control system which operates directly on data from a compressed sensing camera is presented, allowing faster operation and reducing the computing requirements of the system.
Abstract: The disclosure relates to an image recognition algorithm implemented by a hardware control system which operates directly on data from a compressed sensing camera. A computationally expensive image reconstruction step can be avoided, allowing faster operation and reducing the computing requirements of the system. The method may implement an algorithm that can operate at speeds comparable to an equivalent approach operating on a conventional camera's output. In addition, at high compression ratios, the algorithm can outperform approaches in which an image is first reconstructed and then classified.
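
A scikit-learn sketch of the general idea behind the patent: classify directly from compressed (random-projection) measurements instead of reconstructing the image first. The digits dataset is a stand-in for camera data, and the Gaussian measurement matrix, compression ratio, and logistic-regression classifier are assumptions that do not reflect the patented algorithm's specifics.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for camera frames: 8x8 digit images flattened to 64-dim vectors.
X, y = load_digits(return_X_y=True)

# Simulated compressed-sensing acquisition: y = Phi x with a random
# Gaussian measurement matrix at a 4:1 compression ratio (64 -> 16).
rng = np.random.default_rng(12)
m = 16
Phi = rng.normal(size=(m, X.shape[1])) / np.sqrt(m)
X_compressed = X @ Phi.T

Xtr, Xte, ytr, yte = train_test_split(X_compressed, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
print("accuracy on compressed measurements:", clf.score(Xte, yte))
# No reconstruction step is performed; the classifier operates directly on
# the m-dimensional compressed measurements.
```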

Posted Content
TL;DR: This paper argues that it is crucial to carefully design a metric learning pipeline, namely the loss function, the sampling strategy and the discriminative margin parameter, for building robust diarization systems and proposes to adopt a fine-grained validation process to obtain a comprehensive evaluation of the generalization power of metric learning pipelines.
Abstract: State-of-the-art speaker diarization systems utilize knowledge from external data, in the form of a pre-trained distance metric, to effectively determine relative speaker identities for unseen data. However, much of the recent focus has been on choosing the appropriate feature extractor, ranging from pre-trained i-vectors to representations learned via different sequence modeling architectures (e.g. 1D-CNNs, LSTMs, attention models), while adopting off-the-shelf metric learning solutions. In this paper, we argue that, regardless of the feature extractor, it is crucial to carefully design a metric learning pipeline, namely the loss function, the sampling strategy, and the discriminative margin parameter, for building robust diarization systems. Furthermore, we propose to adopt a fine-grained validation process to obtain a comprehensive evaluation of the generalization power of metric learning pipelines. To this end, we measure diarization performance across different language speakers and variations in the number of speakers in a recording. Using empirical studies, we provide interesting insights into the effectiveness of different design choices and make recommendations.