
Showing papers on "Function (mathematics) published in 2018"


Journal ArticleDOI
TL;DR: This study proposes two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). It also suggests that the more traditional approach of on-policy learning with eligibility traces and softmax action selection, instead of experience replay, can be competitive with DQN without the need for a separate target network.
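As a point of reference, here is a minimal NumPy sketch of the two activations as defined above (SiLU is x·sigmoid(x) and dSiLU is its derivative); the function names are ours, not the paper's:

```python
import numpy as np

def silu(x):
    """Sigmoid-weighted linear unit: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def dsilu(x):
    """Derivative of SiLU, used as an activation in its own right:
    sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))
```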

696 citations


Proceedings ArticleDOI
15 Feb 2018
TL;DR: This paper explores the structure of neural loss functions and the effect of loss landscapes on generalization using a range of visualization methods, and examines how network architecture affects the loss landscape and how training parameters affect the shape of minimizers.
Abstract: Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, is not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
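A hedged PyTorch sketch of the "filter normalization" idea described above: draw a random direction with the same shape as the trained weights and rescale each filter of the direction to match the norm of the corresponding filter of the weights (treatment of biases and normalization layers is simplified here and may differ from the released code):

```python
import torch

def filter_normalized_direction(model):
    """Random direction in parameter space, rescaled filter-by-filter so that
    each filter of the direction has the same norm as the corresponding
    filter of the trained model (a simplified reading of filter normalization)."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:                    # conv / linear weights: per-filter rescale
            for d_filt, p_filt in zip(d, p):
                d_filt.mul_(p_filt.norm() / (d_filt.norm() + 1e-10))
        else:                              # 1-D params (biases, BN): match overall norm
            d.mul_(p.norm() / (d.norm() + 1e-10))
        direction.append(d)
    return direction
```

The loss surface is then visualized by evaluating the loss at the trained weights perturbed along one or two such normalized directions.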

554 citations


Journal ArticleDOI
TL;DR: In this article, a new fractional derivative with respect to another function is introduced, the so-called ψ-Hilfer fractional derivative, which can be used to obtain results on uniformly convergent sequences of functions and uniformly continuous functions, with examples including the one-parameter Mittag-Leffler function.

485 citations


Journal ArticleDOI
Guannan Qu1, Na Li1
TL;DR: It is shown that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth, and a novel gradient estimation scheme is proposed that uses history information to achieve fast and accurate estimation of the average gradient.
Abstract: There has been a growing effort in studying the distributed optimization problem over a network. The objective is to optimize a global function formed by a sum of local functions, using only local computation and communication. The literature has developed consensus-based distributed (sub)gradient descent (DGD) methods and has shown that they have the same convergence rate $O(\frac{\log t}{\sqrt{t}})$ as the centralized (sub)gradient methods (CGD), when the function is convex but possibly nonsmooth. However, when the function is convex and smooth, under the framework of DGD, it is unclear how to harness the smoothness to obtain a faster convergence rate comparable to CGD's convergence rate. In this paper, we propose a distributed algorithm that, despite using the same amount of communication per iteration as DGD, can effectively harness the function smoothness and converge to the optimum with a rate of $O(\frac{1}{t})$ . If the objective function is further strongly convex, our algorithm has a linear convergence rate. Both rates match the convergence rate of CGD. The key step in our algorithm is a novel gradient estimation scheme that uses history information to achieve fast and accurate estimation of the average gradient. To motivate the necessity of history information, we also show that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth.
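A hedged NumPy sketch of a gradient-tracking scheme of the kind described above, in which each agent maintains an estimate of the average gradient and corrects it with the difference of its consecutive local gradients (one common form; the step size, mixing matrix, and iteration count here are illustrative):

```python
import numpy as np

def gradient_tracking(W, local_grads, x0, eta=0.01, iters=1000):
    """W           : (N, N) doubly stochastic mixing matrix of the network
       local_grads : list of N callables, local_grads[i](x) = gradient of f_i at x
       x0          : (N, d) initial local iterates."""
    x = x0.copy()
    g = np.stack([local_grads[i](x[i]) for i in range(len(local_grads))])
    s = g.copy()                      # s_0: each agent starts from its own local gradient
    for _ in range(iters):
        x_new = W @ x - eta * s       # consensus step plus descent along the tracked gradient
        g_new = np.stack([local_grads[i](x_new[i]) for i in range(len(local_grads))])
        s = W @ s + g_new - g         # update the average-gradient estimate using history
        x, g = x_new, g_new
    return x
```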

440 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, a weakly supervised temporal action localization algorithm is proposed, which learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations.
Abstract: We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module and fuse the key segments through adaptive temporal pooling. Our loss function is comprised of two terms that minimize the video-level action classification error and enforce the sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with its weak supervision.
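A hedged PyTorch sketch of the two-term loss described above (video-level classification on attention-pooled features plus a sparsity penalty on the attention weights); shapes and the weighting coefficient are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def weak_localization_loss(segment_logits, attention, video_labels, beta=0.1):
    """segment_logits : (T, C) per-segment class scores
       attention      : (T,)  per-segment attention weights in [0, 1]
       video_labels   : (C,)  multi-hot video-level labels."""
    # attention-weighted temporal pooling -> video-level class scores
    pooled = (attention.unsqueeze(1) * segment_logits).sum(dim=0) / (attention.sum() + 1e-8)
    cls_loss = F.binary_cross_entropy_with_logits(pooled, video_labels.float())
    sparsity = attention.abs().mean()     # encourages selecting a sparse set of key segments
    return cls_loss + beta * sparsity
```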

353 citations


Journal ArticleDOI
TL;DR: In this paper, an analysis of evolution equations generated by three fractional derivatives, namely the Riemann-Liouville, Caputo-Fabrizio, and Atangana-Baleanu derivatives, is presented.
Abstract: We present an analysis of evolution equations generated by three fractional derivatives, namely the Riemann–Liouville, Caputo–Fabrizio, and Atangana–Baleanu fractional derivatives. For each evolution equation, we present the exact solution in the time variable and study the semigroup principle. The Riemann–Liouville fractional operator satisfies the semigroup principle, but the associated evolution equation does not. The Caputo–Fabrizio fractional derivative does not satisfy the semigroup principle, yet, surprisingly, its exact solution satisfies the semigroup principle very well. For small time, the Atangana–Baleanu derivative behaves like a stretched exponential and does not satisfy the semigroup property as an operator; for large time it coincides with the Riemann–Liouville fractional derivative and thus satisfies the semigroup principle as an operator, while the solution of the associated evolution equation, as in the Riemann–Liouville case, does not. Using the connection between semigroup theory and Markovian processes, we find that the Atangana–Baleanu fractional derivative exhibits both Markovian and non-Markovian behavior. We conclude that fractional differential operators need not satisfy the semigroup properties, as they portray memory effects, which are not always Markovian. We present the exact solutions of some evolution equations using the Laplace transform. In addition, we present the numerical solution of a nonlinear equation and show that the model with the Atangana–Baleanu fractional derivative exhibits a random walk for small time. We also observe that the Mittag-Leffler function is a better filter than the exponential and power-law functions, which makes the Atangana–Baleanu fractional derivative a powerful mathematical tool for modeling complex real-world problems.
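For reference, the three derivatives discussed above are commonly defined, for $0<\alpha<1$ and with normalization functions $M(\alpha)$ and $B(\alpha)$ as in the original papers (exact normalizations vary slightly across the literature), by

$$
{}^{RL}D_t^{\alpha}f(t)=\frac{1}{\Gamma(1-\alpha)}\frac{d}{dt}\int_0^t (t-\tau)^{-\alpha}f(\tau)\,d\tau,
$$
$$
{}^{CF}D_t^{\alpha}f(t)=\frac{M(\alpha)}{1-\alpha}\int_0^t f'(\tau)\exp\!\Big(-\frac{\alpha(t-\tau)}{1-\alpha}\Big)\,d\tau,
$$
$$
{}^{ABC}D_t^{\alpha}f(t)=\frac{B(\alpha)}{1-\alpha}\int_0^t f'(\tau)\,E_{\alpha}\!\Big(-\frac{\alpha(t-\tau)^{\alpha}}{1-\alpha}\Big)\,d\tau,
$$

where $E_{\alpha}$ is the one-parameter Mittag-Leffler function that appears as the non-singular kernel of the Atangana–Baleanu derivative.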

289 citations


Journal ArticleDOI
TL;DR: This brief addresses the trajectory tracking control problem of a fully actuated surface vessel subject to asymmetrically constrained input and output; with the proposed control, the constraints are never violated during operation and all system states are bounded.
Abstract: This brief addresses the trajectory tracking control problem of a fully actuated surface vessel subjected to asymmetrically constrained input and output. The controller design process is based on the backstepping technique. An asymmetric time-varying barrier Lyapunov function is proposed to address the output constraint. To overcome the difficulty of nondifferentiable input saturation, a smooth hyperbolic tangent function is employed to approximate the asymmetric saturation function. A Nussbaum function is introduced to compensate for the saturation approximation and ensure the system stability. The command filters and auxiliary systems are integrated with the control law to avoid the complicated calculation of the derivative of the virtual control in backstepping. In addition, the bounds of uncertainties and disturbances are estimated and compensated with an adaptive algorithm. With the proposed control, the constraints will never be violated during operation, and all system states are bounded. Simulation results and comparisons with a standard method illustrate the effectiveness and advantages of the proposed controller.

266 citations


Journal ArticleDOI
TL;DR: This paper presents a new algorithm, termed truncated amplitude flow (TAF), to recover an unknown vector from a system of quadratic equations, and proves that as soon as the number of equations is on the order of the number of unknowns, TAF recovers the solution exactly.
Abstract: This paper presents a new algorithm, termed truncated amplitude flow (TAF), to recover an unknown vector $x$ from a system of quadratic equations of the form $y_i=|\langle a_i, x\rangle|^2$, where the $a_i$'s are given random measurement vectors. This problem is known to be NP-hard in general. We prove that as soon as the number of equations is on the order of the number of unknowns, TAF recovers the solution exactly (up to a global unimodular constant) with high probability and complexity growing linearly with both the number of unknowns and the number of equations. Our TAF approach adapts the amplitude-based empirical loss function and proceeds in two stages. In the first stage, we introduce an orthogonality-promoting initialization that can be obtained with a few power iterations. Stage two refines the initial estimate by successive updates of scalable truncated generalized gradient iterations, which are able to handle the rather challenging nonconvex and nonsmooth amplitude-based objective function. In particular, when the vectors $x$ and $a_i$ are real valued, our gradient truncation rule provably eliminates erroneously estimated signs with high probability to markedly improve upon its untruncated version. Numerical tests using synthetic data and real images demonstrate that our initialization returns more accurate and robust estimates relative to spectral initializations. Furthermore, even under the same initialization, the proposed amplitude-based refinement outperforms existing Wirtinger flow variants, corroborating the superior performance of TAF over state-of-the-art algorithms.
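A hedged NumPy sketch of the refinement stage: a generalized gradient step on the amplitude-based loss $\frac{1}{2m}\sum_i(|a_i^{\top}z|-\psi_i)^2$ with $\psi_i=\sqrt{y_i}$, in which equations whose current fit is far off are dropped from the gradient. The threshold rule, step size, and iteration count below are illustrative choices, not the paper's tuned parameters:

```python
import numpy as np

def taf_refinement(A, psi, z0, gamma=0.7, mu=0.6, iters=500):
    """A   : (m, n) real measurement matrix whose rows are the a_i
       psi : (m,) measured amplitudes sqrt(y_i)
       z0  : (n,) initial estimate, e.g. from an orthogonality-promoting init."""
    m, _ = A.shape
    z = z0.copy()
    for _ in range(iters):
        Az = A @ z
        keep = np.abs(Az) >= psi / (1.0 + gamma)   # truncation: discard badly fit equations
        resid = Az[keep] - psi[keep] * np.sign(Az[keep])
        z = z - mu * (A[keep].T @ resid) / m       # generalized gradient step
    return z
```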

266 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that fractional operators obeying the index law cannot model real-world problems taking place in two states; more precisely, because they are scaling invariant, they cannot describe phenomena taking place beyond their boundaries. The results show that mathematical models based on these differential operators are not able to describe the inverse memory, meaning the full history of a physical problem cannot be described accurately using derivatives with index-law properties.
Abstract: Recently, fractional differential operators with non-index-law properties have been recognized as bringing new tools to accurately model real-world problems, particularly those with non-Markovian processes. This paper has two aims: first, to demonstrate the inadequacy and failure of index-law fractional calculus, and second, to show the application of fractional differential operators with no index-law properties to statistical and dynamical systems. To achieve this, we present the historical construction of the concept of fractional differential operators from Leibniz to date. Using a matrix based on the fractional differential operators, we prove that fractional operators obeying the index law cannot model real-world problems taking place in two states; more precisely, they cannot describe phenomena taking place beyond their boundaries, as they are scaling invariant. Our results show that mathematical models based on these differential operators are not able to describe the inverse memory, meaning the full history of a physical problem cannot be described accurately using derivatives with index-law properties. On the other hand, we prove that differential operators with no index-law properties are scaling variant, thus can describe situations taking place in different states and are able to localize the frontiers between two states. We present the renewal-process properties included in differential equations built from the Atangana–Baleanu fractional derivative and the associated counting process, which is connected to its inter-arrival time distribution, the Mittag–Leffler distribution, which is the kernel of these derivatives. We present the connection of each derivative to a statistical family; for instance, the Riemann–Liouville–Caputo derivatives are connected to the Pareto statistic, which has no well-defined average when alpha is less than 1, corresponding to the interval where fractional operators are mostly defined. We establish new properties and theorems for the Atangana–Baleanu derivative of an analytic function; in particular, we prove that it is a convolution of the Mittag–Leffler function with the Riemann–Liouville–Caputo derivative. To assess the accuracy of the non-index-law derivatives in modeling real chaotic problems, four examples are considered, including the nine-term 3-D novel chaotic system, the King Cobra chaotic system, the Ikeda delay system, and the chaotic chameleon system. The numerical simulations show very interesting and novel attractors. The King Cobra system with the Atangana–Baleanu derivative presents a very novel attractor, where at early times we observe a random walk and at later times the real shape of the cobra. The Ikeda model with the Atangana–Baleanu derivative presents a different attractor for each value of the fractional order; in particular, we obtain square and circular explosions. The results obtained in this paper show that the future of modeling real-world problems relies on fractional differential operators with the non-index-law property. Our numerical results suggest that refusing to model physical problems with fractional differential operators with non-singular kernels, and instead imposing the index law in fractional calculus, amounts to living with closed eyes without ever taking the risk to open them.

261 citations


Proceedings Article
21 Feb 2018
TL;DR: In this article, instancewise feature selection is introduced as a methodology for model interpretation, based on learning a function to extract a subset of features that are most informative for each given example.
Abstract: We introduce instancewise feature selection as a methodology for model interpretation. Our method is based on learning a function to extract a subset of features that are most informative for each given example. This feature selector is trained to maximize the mutual information between selected features and the response variable, where the conditional distribution of the response variable given the input is the model to be explained. We develop an efficient variational approximation to the mutual information, and show the effectiveness of our method on a variety of synthetic and real data sets using both quantitative metrics and human evaluation.

257 citations


Book ChapterDOI
08 Sep 2018
TL;DR: The contextual loss proposed in this paper is based on both context and semantics: it compares regions with similar semantic meaning while considering the context of the entire image, so that, for example, when transferring the style of one face to another it translates eyes-to-eyes and mouth-to-mouth.
Abstract: Feed-forward CNNs trained for image transformation problems rely on loss functions that measure the similarity between the generated image and a target image. Most of the common loss functions assume that these images are spatially aligned and compare pixels at corresponding locations. However, for many tasks, aligned training pairs of images will not be available. We present an alternative loss function that does not require alignment, thus providing an effective and simple solution for a new space of problems. Our loss is based on both context and semantics – it compares regions with similar semantic meaning, while considering the context of the entire image. Hence, for example, when transferring the style of one face to another, it will translate eyes-to-eyes and mouth-to-mouth. Our code can be found at https://www.github.com/roimehrez/contextualLoss.
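A hedged NumPy sketch of a contextual-similarity computation of the general form described above (cosine distances between feature sets, converted into normalized affinities, followed by a best-match average); the centering, bandwidth, and direction of the max follow one common reading and may differ from the released code at the URL above:

```python
import numpy as np

def contextual_similarity(X, Y, h=0.5, eps=1e-5):
    """X: (N, d) features of the generated image, Y: (M, d) features of the target."""
    mu = Y.mean(axis=0, keepdims=True)                   # center by the target's mean feature
    Xn = (X - mu) / (np.linalg.norm(X - mu, axis=1, keepdims=True) + eps)
    Yn = (Y - mu) / (np.linalg.norm(Y - mu, axis=1, keepdims=True) + eps)
    d = 1.0 - Xn @ Yn.T                                  # pairwise cosine distances, (N, M)
    d_rel = d / (d.min(axis=1, keepdims=True) + eps)     # distances relative to the best match
    w = np.exp((1.0 - d_rel) / h)                        # affinities with bandwidth h
    cx = w / w.sum(axis=1, keepdims=True)                # normalized contextual similarity
    return cx.max(axis=0).mean()                         # average best match over target features

def contextual_loss(X, Y):
    return -np.log(contextual_similarity(X, Y) + 1e-12)
```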

Proceedings Article
15 Feb 2018
TL;DR: In this article, the authors investigate the tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations, and demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
Abstract: In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization, such as full-batch training or using random labels, correspond to lower robustness, while factors associated with good generalization, such as data augmentation and ReLU non-linearities, give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
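A hedged PyTorch sketch of one of the sensitivity metrics discussed above, the Frobenius norm of the input-output Jacobian at a single test point (computed here with plain autograd; it assumes the model maps a flat input vector to a vector of outputs, and a stochastic estimator would be preferable for large output dimensions):

```python
import torch

def jacobian_frobenius_norm(model, x):
    """Frobenius norm of d model(x) / d x at one input x of shape (d,)."""
    x = x.clone().requires_grad_(True)
    y = model(x)
    rows = []
    for k in range(y.shape[-1]):
        grad_k, = torch.autograd.grad(y[k], x, retain_graph=True)
        rows.append(grad_k)
    return torch.stack(rows).norm()    # stack rows into the Jacobian, take its norm
```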

Posted Content
TL;DR: A Law of Large Numbers and a Central Limit Theorem for the empirical distribution are established, which together show that the approximation error of the network universally scales as $O(n^{-1})$; the scale and nature of the noise introduced by stochastic gradient descent are also quantified.
Abstract: Neural networks, a central tool in machine learning, have demonstrated remarkable, high fidelity performance on image recognition and classification tasks. These successes evince an ability to accurately represent high dimensional functions, potentially of great use in computational and applied mathematics. That said, there are few rigorous results about the representation error and trainability of neural networks, as well as how they scale with the network size. Here we characterize both the error and scaling by reinterpreting the standard optimization algorithm used in machine learning applications, stochastic gradient descent, as the evolution of a particle system with interactions governed by a potential related to the objective or "loss" function used to train the network. We show that, when the number $n$ of parameters is large, the empirical distribution of the particles descends on a convex landscape towards a minimizer at a rate independent of $n$. We establish a Law of Large Numbers and a Central Limit Theorem for the empirical distribution, which together show that the approximation error of the network universally scales as $o(n^{-1})$. Remarkably, these properties do not depend on the dimensionality of the domain of the function that we seek to represent. Our analysis also quantifies the scale and nature of the noise introduced by stochastic gradient descent and provides guidelines for the step size and batch size to use when training a neural network. We illustrate our findings on examples in which we train neural network to learn the energy function of the continuous 3-spin model on the sphere. The approximation error scales as our analysis predicts in as high a dimension as $d=25$.
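In this particle-system view, the network output is typically written with a $1/n$ scaling so that its parameters play the role of interacting particles; in notation that is illustrative rather than the paper's own,

$$
f_n(\boldsymbol{x}) \;=\; \frac{1}{n}\sum_{i=1}^{n} c_i\,\varphi(\boldsymbol{x},\boldsymbol{z}_i),
\qquad
\mu_n \;=\; \frac{1}{n}\sum_{i=1}^{n}\delta_{(c_i,\boldsymbol{z}_i)},
$$

and the Law of Large Numbers and Central Limit Theorem mentioned above are statements about the empirical distribution $\mu_n$ of the parameters as $n$ grows.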

Proceedings Article
02 Mar 2018
TL;DR: Surprisingly, the paths between minima of recent neural network architectures on CIFAR10 and CIFAR100 are essentially flat, which implies that neural networks have enough capacity for structural changes, or that these changes are small between minima.
Abstract: Training neural networks involves finding minima of a high-dimensional non-convex loss function. Knowledge of the structure of this energy landscape is sparse. Relaxing from linear interpolations, we construct continuous paths between minima of recent neural network architectures on CIFAR10 and CIFAR100. Surprisingly, the paths are essentially flat in both the training and test landscapes. This implies that neural networks have enough capacity for structural changes, or that these changes are small between minima. Also, each minimum has at least one vanishing Hessian eigenvalue in addition to those resulting from trivial invariance.
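A hedged PyTorch sketch of the linear-interpolation baseline that the paper relaxes from: evaluate the loss along the straight line between two sets of trained parameters (it assumes all state-dict entries are floating-point tensors; the paper goes further and optimizes curved paths between the minima):

```python
import copy
import torch

def loss_along_line(model, state_a, state_b, loss_fn, data_loader, steps=21):
    """Average loss at evenly spaced points on the segment between two minima."""
    probe = copy.deepcopy(model)
    losses = []
    for t in torch.linspace(0.0, 1.0, steps):
        probe.load_state_dict({k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a})
        probe.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for inputs, targets in data_loader:
                total += loss_fn(probe(inputs), targets).item() * inputs.size(0)
                count += inputs.size(0)
        losses.append(total / count)
    return losses
```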

Journal ArticleDOI
TL;DR: A new form of K-filters with time-varying low-gain is introduced in this paper to compensate for unmeasurable/unknown states of stochastic feedforward systems with unknown control coefficients and unknown output function.

Journal ArticleDOI
TL;DR: Comparisons with other published methods demonstrate that the proposed GCPSO method produces very good results in the extraction of the PV model parameters; it finds highly accurate solutions while demanding a reduced computational cost.

Journal ArticleDOI
TL;DR: The aim of this paper is to design a locally optimal time-varying estimator to simultaneously estimate both the system states and the fault signals such that, at each sampling instant, the covariance of the estimation error has an upper bound that is minimized by properly designing the estimator gain.

Journal ArticleDOI
TL;DR: In this paper, a sparsely-connected depth-4 neural network is constructed to approximate a function f on a d-dimensional manifold, and its error in approximating f is bounded. The size of the network depends on the dimension and curvature of the manifold.

Journal ArticleDOI
TL;DR: Several quadratic transformation inequalities for the Gaussian hypergeometric function are presented, and analogs of the duplication inequalities for the generalized Grötzsch ring function are found.
Abstract: In the article, we present several quadratic transformation inequalities for the Gaussian hypergeometric function and find analogs of the duplication inequalities for the generalized Grötzsch ring function.
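For context, the Gaussian hypergeometric function referred to above is the classical series

$$
{}_2F_1(a,b;c;x)\;=\;\sum_{n=0}^{\infty}\frac{(a)_n\,(b)_n}{(c)_n}\,\frac{x^n}{n!},\qquad |x|<1,
$$

where $(a)_n=a(a+1)\cdots(a+n-1)$ denotes the Pochhammer symbol; the generalized Grötzsch ring function studied in the paper is commonly built from it via generalized elliptic integrals.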

Proceedings ArticleDOI
18 Jun 2018
TL;DR: The authors disentangle the identity and attributes of faces, and then recombine the identity vector and the attribute vector to synthesize a new face of the subject with the extracted attribute.
Abstract: We propose a framework based on Generative Adversarial Networks to disentangle the identity and attributes of faces, such that we can conveniently recombine different identities and attributes for identity preserving face synthesis in open domains. Previous identity preserving face synthesis processes are largely confined to synthesizing faces with known identities that are already in the training dataset. To synthesize a face with identity outside the training dataset, our framework requires one input image of that subject to produce an identity vector, and any other input face image to extract an attribute vector capturing, e.g., pose, emotion, illumination, and even the background. We then recombine the identity vector and the attribute vector to synthesize a new face of the subject with the extracted attribute. Our proposed framework does not need to annotate the attributes of faces in any way. It is trained with an asymmetric loss function to better preserve the identity and stabilize the training process. It can also effectively leverage large amounts of unlabeled training face images to further improve the fidelity of the synthesized faces for subjects that are not presented in the labeled training face dataset. Our experiments demonstrate the efficacy of the proposed framework. We also present its usage in a much broader set of applications including face frontalization, face attribute morphing, and face adversarial example detection.

Journal ArticleDOI
TL;DR: In this article, the standard and nonlocal nonlinear Schrödinger (NLS) equations obtained from the coupled NLS system of equations (the Ablowitz-Kaup-Newell-Segur (AKNS) system) are studied using the Hirota bilinear method.
Abstract: We study standard and nonlocal nonlinear Schrödinger (NLS) equations obtained from the coupled NLS system of equations (Ablowitz-Kaup-Newell-Segur (AKNS) equations) by using standard and nonlocal reductions, respectively. By using the Hirota bilinear method, we first find soliton solutions of the coupled NLS system of equations; then, using the reduction formulas, we find the soliton solutions of the standard and nonlocal NLS equations. We give examples for particular values of the parameters and plot the function $|q(t,x)|^2$ for the standard and nonlocal NLS equations.
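For orientation, in one common convention (sign and scaling conventions vary across papers, including this one), the standard NLS equation and the nonlocal NLS equation obtained from the AKNS system by a nonlocal reduction read

$$
i q_t + q_{xx} + 2\sigma\,|q|^2 q = 0,
\qquad
i q_t(t,x) + q_{xx}(t,x) + 2\sigma\, q^2(t,x)\,\bar{q}(t,-x) = 0,
\qquad \sigma = \pm 1 .
$$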

Posted Content
TL;DR: Conditions for global convergence of the standard optimization algorithm used in machine learning applications, stochastic gradient descent (SGD), are established and the scaling of its error with the size of the network is quantified.
Abstract: Neural networks, a central tool in machine learning, have demonstrated remarkable, high fidelity performance on image recognition and classification tasks. These successes evince an ability to accurately represent high dimensional functions, but rigorous results about the approximation error of neural networks after training are few. Here we establish conditions for global convergence of the standard optimization algorithm used in machine learning applications, stochastic gradient descent (SGD), and quantify the scaling of its error with the size of the network. This is done by reinterpreting SGD as the evolution of a particle system with interactions governed by a potential related to the objective or "loss" function used to train the network. We show that, when the number $n$ of units is large, the empirical distribution of the particles descends on a convex landscape towards the global minimum at a rate independent of $n$, with a resulting approximation error that universally scales as $O(n^{-1})$. These properties are established in the form of a Law of Large Numbers and a Central Limit Theorem for the empirical distribution. Our analysis also quantifies the scale and nature of the noise introduced by SGD and provides guidelines for the step size and batch size to use when training a neural network. We illustrate our findings on examples in which we train neural networks to learn the energy function of the continuous 3-spin model on the sphere. The approximation error scales as our analysis predicts in as high a dimension as $d=25$.

Journal ArticleDOI
TL;DR: Stochastic homogenization theory allows us to better understand the convergence of the algorithm, and a stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
Abstract: Entropy-SGD is a first-order optimization method which has been used successfully to train deep neural networks. This algorithm, which was motivated by statistical physics, is now interpreted as gradient descent on a modified loss function. The modified, or relaxed, loss function is the solution of a viscous Hamilton–Jacobi partial differential equation (PDE). Experimental results on modern, high-dimensional neural networks demonstrate that the algorithm converges faster than the benchmark stochastic gradient descent (SGD). Well-established PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. Stochastic homogenization theory allows us to better understand the convergence of the algorithm. A stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
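For reference, the viscous Hamilton–Jacobi PDE referred to above is usually written (with $\beta$ an inverse-temperature parameter and $f$ the original loss) as

$$
u_t \;=\; -\tfrac{1}{2}\,|\nabla u|^2 \;+\; \tfrac{\beta^{-1}}{2}\,\Delta u,
\qquad u(x,0)=f(x),
$$

and the relaxed loss used by Entropy-SGD corresponds to the solution $u(x,T)$ at a small positive time $T$.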

Journal ArticleDOI
TL;DR: In this paper, consensus equilibrium (CE) is introduced, which generalizes regularized inversion to include a much wider variety of both forward (or data fidelity) components and prior (or regularity) components without the need for either to be expressed using a cost function.
Abstract: Regularized inversion methods for image reconstruction are used widely due to their tractability and their ability to combine complex physical sensor models with useful regularity criteria. Such methods motivated the recently developed Plug-and-Play prior method, which provides a framework to use advanced denoising algorithms as regularizers in inversion. However, the need to formulate regularized inversion as the solution to an optimization problem limits the expressiveness of possible regularity conditions and physical sensor models. In this paper, we introduce the idea of consensus equilibrium (CE), which generalizes regularized inversion to include a much wider variety of both forward (or data fidelity) components and prior (or regularity) components without the need for either to be expressed using a cost function. CE is based on the solution of a set of equilibrium equations that balance data fit and regularity. In this framework, the problem of MAP estimation in regularized inversion is replaced by...

Journal Article
TL;DR: The soft set is generalized to the hypersoft set by transforming the function F into a multi-attribute function, and the hybrids of Crisp, Fuzzy, Intuitionistic Fuzzy, Neutrosophic, and Plithogenic Hypersoft Set are introduced.
Abstract: In this paper, we generalize the soft set to the hypersoft set by transforming the function F into a multi-attribute function. Then we introduce the hybrids of Crisp, Fuzzy, Intuitionistic Fuzzy, Neutrosophic, and Plithogenic Hypersoft Set.

Journal ArticleDOI
TL;DR: Monotonicity of the local value iteration ADP algorithm is presented, which shows that under some special conditions on the initial value function and the learning rate function, the iterative value function converges monotonically to the optimum.
Abstract: In this paper, convergence properties are established for the newly developed discrete-time local value iteration adaptive dynamic programming (ADP) algorithm. The present local iterative ADP algorithm permits an arbitrary positive semidefinite function to initialize the algorithm. Employing a state-dependent learning rate function, for the first time, the iterative value function and iterative control law can be updated in a subset of the state space instead of the whole state space, which effectively relaxes the computational burden. A new analysis method for the convergence property is developed to prove that the iterative value functions will converge to the optimum under some mild constraints. Monotonicity of the local value iteration ADP algorithm is presented, which shows that under some special conditions of the initial value function and the learning rate function, the iterative value function can monotonically converge to the optimum. Finally, three simulation examples and comparisons are given to illustrate the performance of the developed algorithm.
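A hedged NumPy sketch of a value-iteration update with a state-dependent learning rate over a finite state/action grid; this is a generic illustration of the idea described above (tabular, illustrative data layout), not the paper's algorithm:

```python
import numpy as np

def local_value_iteration(U, F, alpha, V0, n_iters=100):
    """U[x, u]  : stage cost, shape (S, A)
       F[x, u]  : next-state index, shape (S, A), integer
       alpha[x] : learning rate in [0, 1] per state; alpha[x] = 0 leaves state x
                  untouched in a sweep, giving the "local" update over a subset
       V0       : initial (positive semidefinite) value function, shape (S,)."""
    V = V0.copy()
    for _ in range(n_iters):
        Q = U + V[F]                              # Q[x, u] = U(x, u) + V(F(x, u))
        V = (1.0 - alpha) * V + alpha * Q.min(axis=1)
    return V
```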

Journal ArticleDOI
TL;DR: A trust-region model-based algorithm for solving unconstrained stochastic optimization problems is presented, which utilizes random models of an objective function f(x) obtained from stochastic observations of the function or its gradient.
Abstract: In this paper, we propose and analyze a trust-region model-based algorithm for solving unconstrained stochastic optimization problems. Our framework utilizes random models of an objective function f(x), obtained from stochastic observations of the function or its gradient. Our method also utilizes estimates of function values to gauge progress that is being made. The convergence analysis relies on requirements that these models and these estimates are sufficiently accurate with high enough, but fixed, probability. Beyond these conditions, no assumptions are made on how these models and estimates are generated. Under these general conditions we show an almost sure global convergence of the method to a first order stationary point. In the second part of the paper, we present examples of generating sufficiently accurate random models under biased or unbiased noise assumptions. Lastly, we present some computational results showing the benefits of the proposed method compared to existing approaches that are based on sample averaging or stochastic gradients.

Proceedings Article
01 Feb 2018
TL;DR: A new paradigm for discovering disentangled representations of class structure is proposed, together with a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions.
Abstract: Deep-embedding methods aim to discover representations of a domain that make explicit the domain's class structure and thereby support few-shot learning. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm suitable for both goals. We propose and evaluate a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions. By ensuring that distinct classes are well separated on a subset of embedding dimensions, we obtain embeddings that are useful for few-shot learning. By not requiring separation on all dimensions, we encourage the discovery of disentangled representations. Our embedding method matches or beats state-of-the-art, as evaluated by performance on recall@$k$ and few-shot learning tasks. Our method also obtains performance superior to a variety of alternatives on disentangling, as evaluated by two key properties of a disentangled representation: modularity and explicitness. The goal of our work is to obtain more interpretable, manipulable, and generalizable deep representations of concepts and categories.
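For reference, a hedged NumPy sketch of the classical one-way ANOVA $F$ statistic computed per embedding dimension (between-class variance over within-class variance); the paper builds its loss on this statistic, but the exact way it is turned into a differentiable training loss is not reproduced here:

```python
import numpy as np

def f_statistic(embeddings, labels):
    """embeddings : (N, d) array, labels : (N,) integer class labels.
       Returns the F statistic for each of the d embedding dimensions."""
    classes = np.unique(labels)
    N, K = embeddings.shape[0], len(classes)
    grand_mean = embeddings.mean(axis=0)
    between = np.zeros(embeddings.shape[1])
    within = np.zeros(embeddings.shape[1])
    for c in classes:
        Xc = embeddings[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (K - 1)) / (within / (N - K) + 1e-12)
```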

Posted Content
TL;DR: Findings suggest ways to run the QAOA that reduce or eliminate the use of the outer loop optimization and may allow us to find good solutions with fewer calls to the quantum computer.
Abstract: The Quantum Approximate Optimization Algorithm, QAOA, uses a shallow depth quantum circuit to produce a parameter dependent state. For a given combinatorial optimization problem instance, the quantum expectation of the associated cost function is the parameter dependent objective function of the QAOA. We demonstrate that if the parameters are fixed and the instance comes from a reasonable distribution then the objective function value is concentrated in the sense that typical instances have (nearly) the same value of the objective function. This applies not just for optimal parameters as the whole landscape is instance independent. We can prove this is true for low depth quantum circuits for instances of MaxCut on large 3-regular graphs. Our results generalize beyond this example. We support the arguments with numerical examples that show remarkable concentration. For higher depth circuits the numerics also show concentration and we argue for this using the Law of Large Numbers. We also observe by simulation that if we find parameters which result in good performance at say 10 bits these same parameters result in good performance at say 24 bits. These findings suggest ways to run the QAOA that reduce or eliminate the use of the outer loop optimization and may allow us to find good solutions with fewer calls to the quantum computer.
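For reference, in the standard notation the QAOA prepares the parameter-dependent state and objective

$$
|\boldsymbol{\gamma},\boldsymbol{\beta}\rangle
= e^{-i\beta_p B}\,e^{-i\gamma_p C}\cdots e^{-i\beta_1 B}\,e^{-i\gamma_1 C}\,|+\rangle^{\otimes n},
\qquad
F_p(\boldsymbol{\gamma},\boldsymbol{\beta}) = \langle \boldsymbol{\gamma},\boldsymbol{\beta}|\,C\,|\boldsymbol{\gamma},\boldsymbol{\beta}\rangle,
$$

where $C$ is the diagonal cost operator of the combinatorial problem and $B=\sum_j X_j$ is the mixing operator; the concentration result above says that $F_p$ at fixed $(\boldsymbol{\gamma},\boldsymbol{\beta})$ is nearly the same across typical instances.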

Journal ArticleDOI
TL;DR: It is proved that the result of each of the four basic operations on fuzzy numbers introduced based on the proposed approach leads to a fuzzy number, and the condition for the existence of the granular derivative of a fuzzy function is provided by a theorem.
Abstract: In this paper, using the concept of horizontal membership functions, a new definition of fuzzy derivative called granular derivative is proposed based on granular difference. Moreover, a new definition of fuzzy integral called granular integral is defined, and its relation with the granular derivative is given. A new definition of a metric—granular metric—on the space of type-1 fuzzy numbers, and a concept of continuous fuzzy functions are also presented. Restrictions associated to previous approaches—Hukuhara differentiability, strongly generalized Hukuhara differentiability, generalized Hukuhara differentiability, generalized differentiability, Zadeh's extension principle, and fuzzy differential inclusions—dealing with fuzzy differential equations (FDEs) are expressed. It is shown that the proposed approach does not have the drawbacks of the previous approaches. It is also demonstrated how this approach enables researchers to solve FDEs more conveniently than ever before. Moreover, we showed that this approach does not necessitate that the diameter of the fuzzy function be monotonic. It is also proved that the result of each of the four basic operations on fuzzy numbers introduced based on the proposed approach leads to a fuzzy number. Moreover, the condition for the existence of the granular derivative of a fuzzy function is provided by a theorem. Additionally, by two examples, it is shown that the existence of the granular derivative of a fuzzy function does not imply the existence of the generalized Hukuhara differentiability of the fuzzy function, and vice versa. The terms doubling property and unnatural behavior in modeling phenomenon are also introduced. Furthermore, using some examples, the paper proceeds to elaborate on the efficiency and effectiveness of the proposed approach. Moreover, as an application of the proposed approach, the response of Boeing 747 to impulsive elevator input is obtained in the presence of uncertain initial conditions and parameters.