A Probabilistic Model of the Categorical Association Between Colors.

Open Access Proceedings Article

Jason Chuang, +2 more

- pp 6-11

TLDR

The paper presents a non-parametric probabilistic model that can be used to encode relationships in color naming datasets, and shows that the uniqueness of a color name (color saliency) can be captured using the entropy of the probability distribution.

Abstract:

In this paper we describe a non-parametric probabilistic model that can be used to encode relationships in color naming datasets. This model can be used with datasets with any number of color terms and expressions, as well as terms from multiple languages. Because the model is based on probability theory, we can use classic statistics to compute features of interest to color scientists. In particular, we show that the uniqueness of a color name (color saliency) can be captured using the entropy of the probability distribution. We demonstrate this approach by applying this model to two different datasets: the multi-lingual World Color Survey (WCS), and a database collected via the web by Dolores Labs. We demonstrate how saliency clusters similarly named colors for both datasets, and compare our WCS results to those of Kay and his colleagues. We compare the two datasets to each other by converting them to a common colorspace (IPT).

Introduction

There has been growing interest in how to use color naming data to improve color models. Better color name databases[7, 10, 11, 12, 14, 2] and online naming studies[18, 8] have stimulated recent work. Color naming databases and associated models have been useful in color transfer[5], gamut mapping[19, 20], and methods for specifying or selecting colors in an image[15, 16, 17].

In this paper, we examine the issue of how to represent and quantify the association between colors induced by names. Current methods that incorporate naming data represent the category associated with a color using a single name[5, 6], a vector[19], or a set of fuzzy logic memberships[1, 2, 17]. We present a probabilistic framework for working with colors. We define the categorical association of a color $c$ as a conditional probability $P(C \mid c)$ over colors $C$ in the color space $\mathcal{C}$. For a color $c$, the probability $P(C \mid c)$ represents how likely other colors in the space $\mathcal{C}$ are to be assigned the same linguistic label as $c$.
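As a reading aid, the definition can be written out in standard notation. This is a sketch under assumptions: the discrete sum assumes a finite set of sampled colors in the color space (written $\mathcal{C}$ here), and the symbol $H(c)$ is introduced for illustration only. The first expression simply states that the association is a probability distribution; the second is the standard Shannon entropy, which the abstract names as the basis for color saliency.

```latex
\sum_{C \in \mathcal{C}} P(C \mid c) = 1,
\qquad
H(c) = -\sum_{C \in \mathcal{C}} P(C \mid c)\,\log P(C \mid c)
```

Under this reading, a low value of $H(c)$ means the distribution concentrates its mass on a few colors, while a high value means the associated colors are spread widely across the space.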
Our choice of a probability over colors in our framework is motivated by the following criteria, which current approaches do not meet. Our model satisfies three design goals. (1) Our approach can incorporate categorical effects from any number of color words, expressions involving multiple words, and different languages. (2) Our framework is based on a non-parametric model which can capture the differences in color name distributions, such as “yellow” having a narrow focus and “green” having a wide distribution[21]. (3) Embedding our representation in a probabilistic framework enables us to apply a wide array of statistical and probabilistic tools to further analyze and study the effect of categories on colors.

We implement our model on two datasets. We extract color naming data from six languages in the World Color Survey, which contains naming information for 330 colors on the surface of the Munsell solid[7]. We also investigate online naming data collected by DoloresLabs, which contains names given to 10,000 randomly sampled colors in the RGB cube[8]. Our framework can incorporate cross-linguistic data and combine contributions from color words with similar meanings.

We introduce the concept of salient colors based on the statistical notion of entropy. Salient colors from our approach show good correspondence with the basic color terms identified by Berlin and Kay[3]. Our approach also reveals two regions that are consistently named in the sRGB cube but do not correspond to typical basic color terms. We qualitatively compare the differences in salient name regions between the World Color Survey and the DoloresLabs datasets.

Motivations and Related Work

The goal of this paper is to present a computational framework for modeling color categories derived from experimental data. Our framework is motivated by three issues that are at best partially addressed in the current literature.

1. We would like a framework that can include all possible words for describing a color and not be limited to a predefined list of terms.
2. We would like a non-parametric model capable of capturing the details in categorical association but still be robust to noise in the naming dataset.
3. We would like a framework that can support a rich set of computational and mathematical operations, so that more in-depth studies of categorical effects can be built on the framework. In particular, our approach is grounded in probability theory.

The first issue addresses how to account for the many potential expressions for describing a color. In 1969, Berlin and Kay defined color words as basic color terms if their meanings cannot be derived from other words, and proposed that there are a total of eleven basic color terms. Basic color terms were shown to be universal across languages. While some languages such as English contain all eleven terms, others may have developed only a subset of the words[3]. Subsequent studies confirmed that basic color terms are words with the highest consensus between speakers[4], but found twelve basic color terms in Russian, contradicting the limit on the number of terms[22]. Kay and McDaniel hypothesized that as languages evolve, some individuals may consider additional words such as aqua/turquoise (green and blue), chartreuse/lime (yellow and green), and maroon/burgundy (red and black) as basic color terms[9].

Many existing methods assume eleven or some other fixed number of color categories and cannot process the full set of responses from recent surveys such as the HP Labs Multilingual Naming Experiment[18] and the DoloresLabs Naming Dataset[8], which have hundreds of color words. Chang et al.’s category-preserving color transfer algorithm defines eleven convex regions in the color space corresponding to the basic color terms[5].
Motomura’s categorical color mapping algorithm maps foci of the eight chromatic basic color terms between the source and target gamuts[19]. Moroney’s system for translating colors to names operates on the n most frequently used color words. We want a framework where all words are included and contributions from words with similar cognitive concepts such as “maroon” and “burgundy” are combined based on their similarity.

Secondly, color names exhibit different naming distributions. Colors such as “red” and “yellow” are known to have a narrow and well-defined center, while colors such as “green” and “blue” are known to be composed of a broad range of hues[21]. We want our framework to have the flexibility to capture the details in the distributions while being robust to noise in the data. Current approaches tend to model color categories as a volume in color space, either with parameterized models or with non-parameterized approaches such as histograms. Approaches that partition the color space[12, 5] assume color names occupy discrete and non-overlapping regions in the color space. Motomura’s gamut-mapping algorithm assumes that each basic term has an ellipsoid-shaped distribution and models the distributions using an 81-parameter covariance matrix[19]. Benavente models the color naming space using a set of 6-parameter Sigmoid-Gaussian distributions[1]. One advantage of parameterized models is that they are constructed from a small number of parameters which can be estimated accurately. In his adaptive lexical classification system, Moroney proposes an alternative implementation in which color names are represented as non-parametric histograms[16]. While histograms can capture any shape of distribution, Moroney reported noise in the data due to the limited number of data points and suggested that smoothing operators or hedging be applied to post-process the histograms.¹

¹ We should emphasize that our application differs from Moroney’s in that his work is on modeling the distribution of color names, while our work is on modeling the association between colors due to naming effects.

Finally, we would like a framework capable of supporting a rich set of computational and mathematical tools. Instead of being merely a representation, the framework should allow us to perform further computation and analysis on how categories affect the way we associate colors. Treating the association between colors as a probability distribution positions our framework within the well-studied domain of probability theory.

Methodology

Colors and Color Words

A naming dataset consists of a list of responses in the form of “color”-“color word” pairs that record the words used to describe a color. A “color” refers to the stimulus shown to a respondent and varies between datasets, from Munsell color chips viewed under controlled lighting to rectangles of color displayed on uncalibrated monitors. Unconstrained surveys allow respondents to use any expression, whereas constrained surveys ask respondents to choose from a predefined list of words. An unconstrained color expression could include, e.g., “granny smith apple green”, “light robin’s egg pastel blue”, or “mix all the paint together”. In practice, most expressions recorded in unconstrained surveys consist of a single word or a simple set of words such as “blue” or “bluish green”. We will use the term “color words” from this point on even though it could refer to any possible expression for describing a color.

A naming dataset can be tabulated using a word count table where the list of all colors presented in the survey is displayed along the columns, and a list of all color words recorded is displayed along the rows. Each entry in the table indicates the number of times a corresponding color word is used to describe the corresponding color. Depending on the nature of the naming dataset, the density of the word count table may vary.
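The tabulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy responses are hypothetical, and `color_association` uses one plausible construction, P(C|c) = Σ_W P(C|W) P(W|c), which is an assumption that the text above does not state.

```python
def word_count_table(responses):
    """Tabulate (color, word) pairs: rows are color words, columns are colors."""
    colors = sorted({color for color, _ in responses})
    words = sorted({word for _, word in responses})
    col = {color: j for j, color in enumerate(colors)}
    row = {word: i for i, word in enumerate(words)}
    table = [[0] * len(colors) for _ in words]
    for color, word in responses:
        table[row[word]][col[color]] += 1
    return table, words, colors


def density(table):
    """Fraction of non-zero entries in the word count table."""
    cells = [n for r in table for n in r]
    return sum(1 for n in cells if n > 0) / len(cells)


def color_association(table):
    """ASSUMED construction (not from the text): P(C|c) = sum_W P(C|W) * P(W|c).

    Returns a matrix whose row c is a distribution over colors C.
    """
    n_words, n_colors = len(table), len(table[0])
    row_tot = [sum(r) for r in table]  # responses per word
    col_tot = [sum(table[w][c] for w in range(n_words)) for c in range(n_colors)]
    assoc = [[0.0] * n_colors for _ in range(n_colors)]
    for c in range(n_colors):
        for w in range(n_words):
            p_w_given_c = table[w][c] / col_tot[c]
            for other in range(n_colors):
                p_other_given_w = table[w][other] / row_tot[w]
                assoc[c][other] += p_w_given_c * p_other_given_w
    return assoc


# Hypothetical toy responses: two colors, three color words.
responses = [
    ("c1", "blue"), ("c1", "blue"), ("c1", "bluish green"),
    ("c2", "green"), ("c2", "bluish green"),
]
table, words, colors = word_count_table(responses)
assoc = color_association(table)
```

For these toy responses the table has 4 non-zero entries out of 6, and each row of `assoc` sums to 1 because it is built from two normalized conditional distributions. Every color and word in the table has at least one response by construction, so the divisions are safe here; real data with empty rows or columns would need a guard.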
The World Color Survey (WCS)[7] is cross-linguistic and unconstrained, and collects naming data on a set of 330 colors. The word count table for the WCS consists of 2300 rows by 330 columns with 20% non-zero entries. In comparison, the DoloresLabs color name dataset[8], while also unconstrained, uses 10,000 randomly sampled colors. A total of 1966 expressi

References

Divergence measures based on the Shannon entropy

Basic Color Terms: Their Universality and Evolution

Part I. Theory

Salience of chromatic basic color terms confirmed by three measures

Locating basic colours in the Munsell space
