
Showing papers in "Annals of Mathematical Statistics in 1957"


Journal ArticleDOI
TL;DR: In this article, maximum likelihood estimates of the transition probabilities of a Markov chain of arbitrary order and their asymptotic distribution are obtained; the case of a single observation of a long chain and the relation between likelihood ratio criteria and contingency-table $\chi^2$-tests are also discussed.
Abstract: Maximum likelihood estimates and their asymptotic distribution are obtained for the transition probabilities in a Markov chain of arbitrary order when there are repeated observations of the chain. Likelihood ratio tests and $\chi^2$-tests of the form used in contingency tables are obtained for testing the following hypotheses: (a) that the transition probabilities of a first order chain are constant, (b) that in case the transition probabilities are constant, they are specified numbers, and (c) that the process is a $u$th order Markov chain against the alternative it is $r$th but not $u$th order. In case $u = 0$ and $r = 1$, case (c) results in tests of the null hypothesis that observations at successive time points are statistically independent against the alternate hypothesis that observations are from a first order Markov chain. Tests of several other hypotheses are also considered. The statistical analysis in the case of a single observation of a long chain is also discussed. There is some discussion of the relation between likelihood ratio criteria and $\chi^2$-tests of the form used in contingency tables.
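The estimates and the case (c) test with $u = 0$, $r = 1$ can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the two-state chain, sample length, and seed are arbitrary choices:

```python
import math
import random
from collections import Counter

def transition_mle(chain, states):
    # MLE of p_ij is n_ij / n_i, the observed fraction of i -> j transitions.
    pairs = Counter(zip(chain, chain[1:]))
    rows = Counter(chain[:-1])
    return {(i, j): pairs[(i, j)] / rows[i]
            for i in states for j in states if rows[i]}

def lr_independence(chain, states):
    # Likelihood-ratio statistic for case u = 0, r = 1: independence against a
    # first-order chain; asymptotically chi^2 with (s-1)^2 d.f. under the null.
    pairs = Counter(zip(chain, chain[1:]))
    rows = Counter(chain[:-1])
    cols = Counter(chain[1:])
    N = len(chain) - 1
    return 2 * sum(n * math.log((n / rows[i]) / (cols[j] / N))
                   for (i, j), n in pairs.items())

random.seed(1)
iid = [random.choice("ab") for _ in range(2000)]
print(transition_mle(iid, "ab"))   # each row of estimates sums to 1
print(lr_independence(iid, "ab"))  # small for an independent sequence
```

The statistic is nonnegative by construction, since the first-order MLE can only raise the log-likelihood over the independence MLE.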

1,401 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of selecting practically useful experimental designs is discussed, and in this connection the concept of the variance function for an experimental design is introduced; designs whose variance function is spherical are called rotatable.
Abstract: Suppose that a relationship $\eta = \varphi(\xi_1, \xi_2, \cdots, \xi_k)$ exists between a response $\eta$ and the levels $\xi_1, \xi_2, \cdots, \xi_k$ of $k$ quantitative variables or factors, and that nothing is assumed about the function $\varphi$ except that, within a limited region of immediate interest in the space of the variables, it can be adequately represented by a polynomial of degree $d$. A $k$-dimensional experimental design of order $d$ is a set of $N$ points in the $k$-dimensional space of the variables so chosen that, using the data generated by making one observation at each of the points, all the coefficients in the $d$th degree polynomial can be estimated. The problem of selecting practically useful designs is discussed, and in this connection the concept of the variance function for an experimental design is introduced. Reasons are advanced for preferring designs having a "spherical" or nearly "spherical" variance function. Such designs insure that the estimated response has a constant variance at all points which are the same distance from the center of the design. Designs having this property are called rotatable designs. When such arrangements are submitted to rotation about the fixed center, the variances and covariances of the estimated coefficients in the fitted series remain constant. Rotatable designs having satisfactory variance functions are given for $d = 1, 2$; and $k = 2, 3, \cdots, \infty$. Blocking arrangements are derived. The simplification in the form of the confidence region for a stationary point resulting from the use of a second order rotatable design is discussed.
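A minimal numerical illustration of rotatability, assuming a first-degree model and the $2^2$ factorial (which is rotatable for $d = 1$); this example is not from the paper. The variance function depends on a design point only through its distance from the center:

```python
import math

# Model: y = b0 + b1*x1 + b2*x2.  For the 2^2 factorial, X'X = 4*I, so
# Var(yhat(x))/sigma^2 = x'(X'X)^{-1} x = (1 + x1^2 + x2^2)/4, which is a
# function of the radius rho = sqrt(x1^2 + x2^2) only: the design is rotatable.
design = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def variance_function(x1, x2, design):
    # (X'X)^{-1} computed directly for this orthogonal first-order design
    n = len(design)
    s1 = sum(a * a for a, _ in design)
    s2 = sum(b * b for _, b in design)
    return 1.0 / n + x1 * x1 / s1 + x2 * x2 / s2

rho = 0.7
for theta in (0.0, 1.0, 2.5):  # three points on the same circle of radius rho
    x1, x2 = rho * math.cos(theta), rho * math.sin(theta)
    print(round(variance_function(x1, x2, design), 6))  # identical values
```

All three printed values equal $(1 + \rho^2)/4 = 0.3725$, so the estimated response has constant variance on any circle about the center, as the abstract states.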

1,332 citations


Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of examining a "random sample" of permutations and making the decision to accept or reject a hypothesis on the basis of those permutations only.
Abstract: Suppose $X_1, \cdots, X_m, Y_1, \cdots, Y_n$ are $m + n = N$ independent random variables, the $X$'s identically distributed and the $Y$'s identically distributed, each with a continuous cdf. Let $$z = (z_1, \cdots, z_m, z_{m + 1}, \cdots, z_N) = (x_1, \cdots, x_m, y_1, \cdots, y_n)$$ represent an observation on the $N$ random variables and let $$u(z) = (1/m) \sum^m_{i = 1} z_i - (1/n) \sum^N_{i = m + 1} z_i = \bar x - \bar y$$. Consider the $r = N!$ $N$-tuples obtained from $(z_1, \cdots, z_N)$ by making all permutations of the indices $(1, \cdots, N)$. Since we assume continuous cdf's, then with probability one, these $r$ $N$-tuples will be distinct. Denote them by $z^{(1)}, \cdots, z^{(r)}$, and suppose that they have been ordered so that $$u(z^{(1)}) \geqq \cdots \geqq u(z^{(r)})$$. Notice that since $$\bar x - \bar y = (1/m) \sum^N_{i = 1} z_i - (N/m)\bar y = (N/n)\bar x - (1/n) \sum^N_{i = 1} z_i,$$ the same ordering can be induced by choosing $u(z) = c\bar x$ or $u(z) = - c\bar y$ for any $c > 0$. Assuming that the cdf's of $X_1, Y_1$ are of the form $F(x), F(x - \Delta)$ respectively, Pitman [2] suggested essentially the following test of the hypothesis $H'$ that $\Delta = 0$. Select a set of $k$ $(k > 0)$ integers $i_1, \cdots, i_k$ $(1 \leqq i_1 < \cdots < i_k \leqq r)$ and reject $H'$ if and only if the observed $z$ is one of $z^{(i_1)}, \cdots, z^{(i_k)}$; the resulting test has size $k/r > 0$. A practical shortcoming of this procedure is the great difficulty in enumerating the points $z^{(i)}$ and the evaluation of $u(z^{(i)})$ for each of them. For instance, even after eliminating those permutations which always give the same value of $u$, then for sample sizes $m = n = 5$, there are $\binom{10}{5} = 252$ permutations to examine, and for sample sizes $m = n = 10$, there are $\binom{20}{10} = 184,756$ permutations to examine. In the following section, we propose the almost obvious procedure of examining a "random sample" of permutations and making the decision to accept or reject $H'$ on the basis of those permutations only.
Bounds are determined for the ratio of the power of the original procedure to the modified one. Some numerical values of these bounds are given in Table 1. The bounds there listed correspond to tests which in both original and modified form have size $\alpha$, and for which the modified test is based on a random sample of $s$ permutations drawn with replacement. These have been computed for a certain class of alternatives which is described below. For simplicity, we have restricted the main exposition to the two-sample problem. In Section 5, we point out extensions to the more general hypotheses of invariance studied in [1].
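The modified procedure is easy to sketch. The sample values, the number $s = 999$ of sampled permutations, and the seed below are arbitrary illustrative choices, not the paper's:

```python
import random

def perm_pvalue(x, y, s=999, seed=0):
    # Modified procedure: approximate the one-sided permutation p-value for
    # u(z) = xbar - ybar from s random relabelings drawn with replacement.
    rng = random.Random(seed)
    z = list(x) + list(y)
    m, n = len(x), len(y)
    obs = sum(x) / m - sum(y) / n
    hits = 1  # the observed labeling counts as one of the s + 1 values
    for _ in range(s):
        rng.shuffle(z)
        if sum(z[:m]) / m - sum(z[m:]) / n >= obs:
            hits += 1
    return hits / (s + 1)

x = [2.1, 2.5, 3.0, 2.8, 2.6]
y = [1.0, 1.2, 0.8, 1.1, 0.9]
print(perm_pvalue(x, y))  # every x exceeds every y, so p is near its minimum
```

Only $s + 1$ of the $\binom{10}{5} = 252$ distinct relabelings' statistics are ever evaluated here, which is the practical point of the modification.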

718 citations



Journal ArticleDOI
TL;DR: An expository paper giving an account of the "goodness of fit" test and the "two sample" test based on the empirical distribution function, tests which were initiated by the four authors cited in the title.
Abstract: 1. Preface. This is an expository paper giving an account of the "goodness of fit" test and the "two sample" test based on the empirical distribution function: tests which were initiated by the four authors cited in the title. An attempt is made here to give a fairly complete coverage of the history, development, present status, and outstanding current problems related to these topics. The reader is advised that the relative amount of space and emphasis allotted to the various phases of the subject does not necessarily reflect their intrinsic merit and importance, but rather the author's personal interest and familiarity. Also, for the sake of uniformity the notation of many of the writers quoted has been altered, so that when referring to the original papers it will be necessary to check their nomenclature. 2. The empirical distribution function and the tests. Let $X_1, X_2, \cdots, X_n$ be independent random variables (observations) each having the same distribution
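As a sketch of the "goodness of fit" side, here is a minimal Kolmogorov-Smirnov statistic computed from the empirical distribution function. This is illustrative only; the uniform null, sample size, and seed are arbitrary choices:

```python
import math
import random

def ks_statistic(sample, cdf):
    # D_n = sup_x |F_n(x) - F(x)|; the supremum is attained at a jump of F_n,
    # so it suffices to check both sides of each order statistic.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return d

# Uniform(0, 1) sample drawn under the null hypothesis
random.seed(3)
u = [random.random() for _ in range(200)]
d = ks_statistic(u, lambda t: min(max(t, 0.0), 1.0))
print(d)  # for n = 200 the 5% critical value is about 1.36/sqrt(200) ~ 0.096
```

The same $D_n$ computation underlies both the one-sample test and, with two empirical d.f.'s, the two-sample test the paper surveys.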

520 citations


Journal ArticleDOI
TL;DR: In the present paper, the following five topics are selected for mathematical discussion and new results are presented: random assortment of subunits of a gene, senescence in paramecium due to random assortment of chromosomes, the process of natural selection in a finite population, the chance of fixation of mutant genes, and population structure and evolution.
Abstract: In genetics, stochastic processes arise at all levels of organization ranging from subunits of the gene to natural populations. Types of stochastic processes involved are also diverse. In the present paper, the following five topics have been selected for mathematical discussion and new results are presented: (1) Random assortment of subunits of a gene. (2) Senescence in paramecium due to random assortment of chromosomes. (3) Process of natural selection in a finite population (interaction between selection and random genetic drift). (4) Chance of fixation of mutant genes. (5) Population structure and evolution. Finally it is pointed out that new mathematical techniques will be needed for a satisfactory treatment of Wright's theory of evolution.
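Topic (4) can be illustrated by simulation. This is a hedged sketch assuming a neutral Wright-Fisher model (the paper also treats selection); the population size, initial frequency, trial count, and seed are arbitrary:

```python
import random

def fixation_probability(N, p0, trials, seed=0):
    # Neutral Wright-Fisher drift: each generation resamples 2N genes
    # binomially from the current frequency.  For a neutral mutant the
    # fixation probability should approach its initial frequency p0.
    rng = random.Random(seed)
    fixed = 0
    for _ in range(trials):
        k = int(2 * N * p0)  # current count of the mutant gene
        while 0 < k < 2 * N:
            p = k / (2 * N)
            k = sum(1 for _ in range(2 * N) if rng.random() < p)
        fixed += (k == 2 * N)
    return fixed / trials

est = fixation_probability(N=20, p0=0.25, trials=400)
print(est)  # expected to be near 0.25
```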

439 citations


Journal ArticleDOI
TL;DR: In this paper, given a fixed number of observations on a variate $x$ with the Inverse Gaussian probability density function, the conditional expectation given the sample mean of any unbiased estimator of the $r$th cumulant is derived.
Abstract: Given a fixed number $n$ of observations on a variate $x$ which has the Inverse Gaussian probability density function $$\exp\big\{-\frac{\phi^2x}{2\lambda} + \phi - \frac{\lambda}{2x}\big\} \sqrt{\frac{\lambda}{2\pi x^3}},\quad 0 < x < \infty$$, for which $E(x) = \lambda/\phi = \mu$, it is shown how to find functions of the sample mean $m$ whose expectations can be expressed suitably in terms of the parameter $\phi$ (or $\mu$). In particular, it is shown that the conditional expectation of any unbiased estimator $\tilde{\kappa}_r$ of the $r$th cumulant $\kappa_r$ is $$E(\tilde{\kappa}_r \mid m) = 2m(\frac{1}{2}\lambda n^2)^{r - 1}e^{\frac{1}{2}g} \int^\infty_1 (u - 1)^{2r - 3}e^{-\frac{1}{2}gu^2} du/(r - 2)!$$ where $g = \lambda n/m$. This expectation may be evaluated either by series given in the paper or by using published tables of numerical values of certain functions to which it can be related. The conditional variance of the usual mean square estimator $s^2$ of $\kappa_2$ is also found. These results give an asymptotic series for the conditional variance of a generalization $\chi^2_s = (n - 1)s^2/E(s^2 \mid m)$ of a statistic discussed by Cochran. Exact formulae for the expectation of the statistic $s^2/m^3$ and its mean square error as an estimator of $\lambda^{-1}$ are given or described. This statistic is a consistent estimator of $\lambda^{-1}$ and has asymptotically an efficiency of $\phi/(\phi + 3)$.
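A simulation sketch of the final claim that $s^2/m^3$ is a consistent estimator of $\lambda^{-1}$. This assumes the standard $(\mu, \lambda)$ parameterization of the Inverse Gaussian, which matches the density above with $\mu = \lambda/\phi$, and uses the Michael-Schucany-Haas sampling transform, which is a later tool and not part of this paper; the parameter values and seed are arbitrary:

```python
import math
import random

def sample_ig(mu, lam, rng):
    # Michael-Schucany-Haas transform for IG(mu, lam); this (mu, lambda)
    # form agrees with the density in the abstract when mu = lam/phi.
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = (mu + mu * mu * y / (2 * lam)
         - (mu / (2 * lam)) * math.sqrt(4 * mu * lam * y + (mu * y) ** 2))
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

rng = random.Random(7)
lam, phi = 2.0, 2.0                 # so mu = lam/phi = 1
xs = [sample_ig(lam / phi, lam, rng) for _ in range(20000)]
m = sum(xs) / len(xs)
s2 = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
print(s2 / m ** 3)  # consistent for 1/lam = 0.5
```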

417 citations


Journal ArticleDOI
TL;DR: In this article, charts are developed for the simple family of transformations, defined to include all transformations of the form $z = (y + c)^p$ together with all their limits, such as $z = \log (y + c)$ and $z = e^{my}$, and all linear transformations of these.
Abstract: The attention of statisticians has usually been focussed on single transformations, rather than on families of transformations. With a growing appreciation of the advantages of examining the behavior of data or approximations over whole families of transformations (Moore and Tukey [2], Anscombe and Tukey [1]), there arises a need for rationally planned charts for representing families of transformations. The contributions which (i) the topology of the family and (ii) a definition of the strength of a transformation can make to charting are studied in general and applied to the charting of the simple family of transformations. This family is defined to include all transformations of the form $$y \text{ is replaced by } z = (y + c)^p$$ and all their limits. It thus includes $z = \log (y + c), z = e^{my}$ and the special case \begin{equation*}z = \begin{cases}0, & y = y_\min,\\1, & \text{otherwise},\end{cases}\end{equation*} where $y_\min$ is the least value of $y$ either (i) present in the data or (ii) possible, as well as all linear transformations of these transformations. Experience having shown that transformations with $p \leqq 1$ are much more frequently useful than any others, the charts developed, presented, and exemplified here are restricted to the part of the simple family--its central region--for which $p \leqq 1$. Separate charts are presented for two cases which should cover most cases which arise in practice: (1) Where, as with counted data and small counts, the least reasonable value for $y + c = 0$, and this value is likely to occur; (2) Where $y + c$ is always safely $>0$, and the range of $y$ is through not many powers of 10.
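A small sketch of the simple family, using the linearly matched form $((y + c)^p - 1)/p$. This is a linear transformation of $(y + c)^p$ (so it belongs to the family as defined) chosen so that the $p \to 0$ limit is exactly $\log (y + c)$; the numerical values are arbitrary:

```python
import math

def simple_family(y, c, p):
    # z = (y + c)^p, linearly rescaled so the p -> 0 limit is log(y + c).
    if p == 0:
        return math.log(y + c)
    return ((y + c) ** p - 1) / p

# Values of the transformation of y = 9 (with c = 1) as p decreases toward 0:
for p in (1.0, 0.5, 1e-6, 0.0):
    print(round(simple_family(9.0, 1.0, p), 5))
```

The printed values show the transformation strengthening continuously as $p$ decreases, with $p = 10^{-6}$ already indistinguishable from the logarithm, which is why the family can be charted as a single continuum.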

404 citations


Journal ArticleDOI
TL;DR: In this paper, the spectral analysis of wide sense stationary time series which possess a spectral density function and whose fourth moment functions satisfy an integrability condition (which includes Gaussian processes) is studied.
Abstract: This paper is concerned with the spectral analysis of wide sense stationary time series which possess a spectral density function and whose fourth moment functions satisfy an integrability condition (which includes Gaussian processes). Consistent estimates are obtained for the spectral density function as well as for the spectral distribution function and a general class of spectral averages. Optimum consistent estimates are chosen on the basis of criteria involving the notions of order of consistency and asymptotic variance. The problem of interpolating the estimated spectral density, so that only a finite number of quantities need be computed to determine the entire graph, is also discussed. Both continuous and discrete time series are treated.
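A minimal consistent-estimation sketch in the spirit of the paper: averaging the raw periodogram over neighboring Fourier frequencies, one member of the general class of spectral averages. The white-noise series, bandwidth, target frequency, and seed are arbitrary illustrative choices:

```python
import cmath
import math
import random

def smoothed_periodogram(x, j, half_width):
    # Average the raw periodogram I(w_k) = |sum_t x_t e^{-i w_k t}|^2 / (2 pi n)
    # over 2*half_width + 1 neighboring Fourier frequencies w_k = 2 pi k / n.
    # Unlike the raw periodogram, the average is a consistent estimate.
    n = len(x)
    vals = []
    for k in range(j - half_width, j + half_width + 1):
        w = 2 * math.pi * k / n
        s = sum(x[t] * cmath.exp(-1j * w * t) for t in range(n))
        vals.append(abs(s) ** 2 / (2 * math.pi * n))
    return sum(vals) / len(vals)

random.seed(5)
x = [random.gauss(0, 1) for _ in range(512)]
est = smoothed_periodogram(x, j=64, half_width=8)
print(est)  # white noise has flat spectral density sigma^2/(2 pi) ~ 0.159
```

Widening the averaging window lowers the asymptotic variance at the cost of bias, which is the trade-off behind the paper's notion of an optimum consistent estimate.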

358 citations



Journal ArticleDOI
TL;DR: In this paper, it was shown that in many stochastic structures where the distribution function (d.f.) depends continuously upon the parameters and d.f.'s of the chance variables in the structure, those parameters and d.f.'s which are identified (uniquely determined by the d.f. of the structure) can be strongly consistently estimated by the minimum distance method.
Abstract: The present paper gives the formal statements and proofs of the results illustrated in [1]. In a series of papers ([2], [3], [4]) the present author has been developing the minimum distance method for obtaining strongly consistent estimators (i.e., estimators which converge with probability one). The method of the present paper is much superior, in simplicity and generality of application, to the methods used in the papers [2] and [4] cited above. Roughly speaking, the present paper can be summarized by saying that, in many stochastic structures where the distribution function (d.f.) depends continuously upon the parameters and d.f.'s of the chance variables in the structure, those parameters and d.f.'s which are identified (uniquely determined by the d.f. of the structure) can be strongly consistently estimated by the minimum distance method of the present paper. Since identification is obviously a necessary condition for estimation by any method, it follows that, in many actual statistical problems, identification implies estimatability by the method of the present paper. Thus problems of long standing like that of Section 5 below are easily solved. For this problem the whole canonical complex (Section 6 below; see [1]) has never, to the author's knowledge, been estimated by any other method. The directional parameter of the structure of Section 4 seems to be here estimated for the first time. As the identification problem is solved for additional structures it will be possible to apply the minimum distance method. The proofs in the present paper are of the simplest and most elementary sort. In Section 8 we treat a problem in estimation for nonparametric stochastic difference equations. Here the observed chance variables are not independent, but the minimum distance method is still applicable. The treatment is incomparably simpler than that of [4], where this and several other such problems are treated. 
The present method can be applied to the other problems as well. Application of the present method is routine in each problem as soon as the identification question is disposed of. In this respect it compares favorably with the method of [4], whose application was far from routine. As we have emphasized in [1], the present method can be applied with very many definitions of distance (this is also true of the earlier versions of the minimum distance method). The definition used in the present paper has the convenience of making a certain space conditionally compact and thus eliminating the need for certain circumlocutions. Since no reason is known at present for preferring one definition of distance to another we have adopted a convenient definition. It is a problem of great interest to decide which, if any, definition of distance yields estimators preferable in some sense. The definition of distance used in this paper was employed in [9]. As the problem is formulated in Section 2 below (see especially equation (2.1)), the "observed" chance variables $\{X_i\}$ are known functions (right members of (2.1)) of the "unobservable" chance variables $\{Y_i\}$ and of the unknown constants $\{\theta_i\}$. In the problems treated in [3], [9], and [11], it is the distribution of the observed chance variables which is a known function of unobservable chance variables and of unknown constants, and not the observed chance variables themselves. However, the latter problems can easily be put in the same form as the former problems. Moreover, in the method described below the values of the observed chance variables are used only to estimate the distribution function of the observed chance variables (by means of the empiric distribution function). Consequently there is no difference whatever in the treatment of the problems by the minimum distance method, no matter how the problems are formulated. 
The unobservable chance variables $\{Y_i\}$ correspond to what in [11] are called "incidental parameters"; the unknown constants $\{\theta_i\}$ are called in [11] "structural parameters". In [9] there is a discussion of the fact that in some problems treated in the literature the incidental parameters are considered as constants and in other problems as chance variables. In contradistinction to the present paper [3] (in particular its Section 5) treats the incidental parameters as unknown constants. The fundamental idea of both papers is the same: The estimator is chosen to be such a function of the observed chance variables that the d.f. of the observed chance variables (when the estimator is put in place of the parameters and distributions being estimated) is "closest" to the empiric d.f. of the observed chance variables. The details of application are perhaps easier in the present paper; the problems are different and of interest per se.
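A toy sketch of the minimum distance idea for a single location parameter: choose the parameter value whose model d.f. is closest, in sup distance, to the empiric d.f. The normal model, grid search, sample size, and seed are illustrative assumptions, not the paper's general setup:

```python
import math
import random

def Phi(t):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))

def min_distance_location(sample, grid):
    # Pick theta minimizing sup_x |F_n(x) - Phi(x - theta)|, i.e. the
    # minimum distance estimator under the sup-distance definition.
    xs = sorted(sample)
    n = len(xs)

    def dist(theta):
        return max(max(abs((i + 1) / n - Phi(x - theta)),
                       abs(i / n - Phi(x - theta)))
                   for i, x in enumerate(xs))

    return min(grid, key=dist)

random.seed(11)
data = [random.gauss(3.0, 1.0) for _ in range(300)]
grid = [i / 100 for i in range(200, 401)]  # candidate thetas in [2, 4]
print(min_distance_location(data, grid))   # should land near the true 3.0
```

Only the empiric d.f. of the observed variables enters the criterion, which is the point the abstract makes about the method being indifferent to how the problem is formulated.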



Journal ArticleDOI
TL;DR: In this paper, it is shown by the method of invariance that in certain sequential decision problems there exists, among the class of all sequential decision functions, a minimax procedure that observes the process for a constant length of time.
Abstract: The main purpose of this paper is to prove, by the method of invariance, that in certain sequential decision problems (discrete and continuous time) there exists a minimax procedure $\delta^\ast$ among the class of all sequential decision functions such that $\delta^\ast$ observes the process for a constant length of time. In the course of proving these results a general invariance theorem will be proved (Sec. 3) under conditions which are easy to verify in many important examples (Sec. 2). A brief history of the invariance theory will be recounted in the next paragraph. The theorem of Sec. 3 is to be viewed only as a generalization of one due to Peisakoff [1]; the more general setting here (see Sec. 2; the assumptions of [1] are discussed under Condition 2b) is convenient for many applications, and some of the conditions of Sec. 2 (and the proofs that they imply the assumptions) are new; but the method of proof used in Sec. 3 is only a slight modification of that of [1]. The form of this extension of [1] in Secs. 2 and 3, and the results of Secs. 4 and 5, are new as far as the author knows. In 1939 Pitman [2] suggested on intuitive grounds the use of best invariant procedures in certain problems of estimation and testing hypotheses concerning scale and location parameters. In the same year Wald [3] had the idea that the theorem of Sec. 3 should be valid for certain nonsequential problems of estimating a location parameter; unfortunately, as Peisakoff points out, there seems to be a lacuna in Wald's proof. During the war Hunt and Stein [4] proved the theorem for certain problems in testing hypotheses in their famous unpublished paper whose results have been described by Lehmann in [5a], [5b]. 
Peisakoff's previously cited work [1] of 1950 contains a comprehensive and fairly general development of the theory and includes many topics such as questions of admissibility and consideration of vector-valued risk functions which will not be considered in the present paper (the latter could be included by using the device of taking linear combinations of the components of the risk vector). Girshick and Savage [6] at about the same time gave a proof of the theorem for the location parameter case with squared error or bounded loss function. In their book [7], Blackwell and Girshick in the discrete case prove the theorem for location (or scale) parameters. The referee has called the author's attention to a paper by H. Kudo in the Nat. Sci. Report of the Ochanomizu University (1955), in which certain nonsequential invariant estimation problems are treated by extending the method of [7]. All of the results mentioned above are nonsequential. Peisakoff [1] mentions that sequential analysis can be considered in his development, but (see Sec. 4) his considerations would not yield the results of the present paper. A word should be said about the possible methods of proof. (The notation used here is that of Sec. 2 but will be familiar to readers of decision theory.) The method of Hunt and Stein, extended to problems other than testing hypotheses, is to consider for any decision function $\delta$ a sequence of decision functions $\{\delta_i\}$ defined by $$\delta_i(x,\Delta) = \int_{G_i} \delta(gx,g\Delta)\mu(dg)/\mu(G_i)$$ where $\mu$ is left Haar measure on a group $G$ of transformations leaving the problem invariant and $\{G_n\}$ is a sequence of subsets of $G$ of finite $\mu$-measure and such that $G_n \rightarrow G$ in some suitable sense. If $G$ were compact, we could take $\mu(G) = 1$ and let $G_1 = G$; it would then be clear that $\delta_1$ is invariant and that $\sup_F r_{\delta_1}(F) \leqq \sup_F r_\delta(F)$, yielding the conclusion of the theorem of Sec. 3. 
If $G$ is not compact, an invariant procedure $\delta_0$ which is the limit in some sense of the sequence $\{\delta_i\}$ must be obtained (this includes proving that, in Lehmann's terminology, suitable conditions imply that any almost invariant procedure is equivalent to an invariant one) and $\sup_F r_{\delta_0}(F) \leqq \sup_F r_\delta(F)$ must be proved. Peisakoff's method differs somewhat from this, in that for each $\delta$ one considers a family $\{\delta_g\}$ of procedures obtained in a natural way from $\delta$, and shows that an average over $G_n$ of the supremum risks of the $\delta_g$ does not exceed that of $\delta$ as $n \rightarrow \infty$; there is an obvious relationship between the two methods. Similarly, in [7] the average of $r_\delta(gF_0)$ for $g$ in $G_n$ and some $F_0$ is compared with that of an optimum invariant procedure (the latter can thus be seen to be Bayes in the wide sense); the method of [6] is in part similar. In some problems it is convenient (see Example iii and Remark 7 in Sec. 2) to apply the method of Hunt and Stein to a compact group as indicated above in conjunction with the use of Peisakoff's method for a group which is not compact. The possibility of having an unbounded weight function does not arise in the Hunt-Stein work. Peisakoff handles it by two methods, only one of which is used in the present paper, namely, to truncate the loss function. The other method (which also uses a different assumption from Assumption 5) is to truncate the region of integration in obtaining the risk function. Peisakoff gives several conditions (usually of symmetry or convexity) which imply Assumption 4 of Sec. 2 or the corresponding assumption for his second method of proof in the cases treated by him, but does not include Condition 4b or 4c of Sec. 2. 
Blackwell and Girshick use Condition 4b for a location parameter in the discrete case with $W$ continuous and not depending on $x$, using a method of proof wherein it is the region of integration rather than the loss function which is truncated. (The proof in [6] is similar, using also the special form of $W$ there.) It is Condition 4c which is pertinent for many common weight functions used in estimating a scale parameter, e.g., any positive power of relative error in the problem of estimating the standard deviation of a normal d.f. The overlap of the results of Secs. 4 and 5 of the present paper with previous publications will now be described. There are now three known methods for proving the minimax character of decision functions. Wolfowitz [8] used the Bayes method for a great variety of weight functions for the case of sequential estimation of a normal distribution with unknown mean (see also [9]). Hodges and Lehmann [10] used their Cramer-Rao inequality method for a particular weight function in the case of the normal distribution with unknown mean and gamma distribution with unknown scale (as well as in some other cases not pertinent here) to obtain a slightly weaker minimax result (see the discussion in Sec. 6.1 of [12]) than that obtainable by the Bayes method. The Bayes method was used in the sequential case by Kiefer [11] in the case of a rectangular distribution with unknown scale or exponential distribution with unknown location, for a particular weight function. This method was used by Dvoretzky, Kiefer and Wolfowitz in [12] for discrete and continuous time sequential problems involving the Wiener, gamma, Poisson, and negative binomial processes, for particular classes of weight functions. The disadvantage of using the Cramer-Rao method is in the limitation of its applicability in weight function and in regularity conditions which must be satisfied, as well as in the weaker result it yields. 
The Bayes method has the disadvantage that, when a least favorable a priori distribution does not exist, computations become unpleasant in proving the existence (if there is one) of a constant-time minimax procedure unless an appropriate sequence of a priori distributions can be chosen in such a way that the a posteriori expected loss at each stage does not depend on the observations (this is also true in problems where we are restricted to a fixed experimentation time or size, but it is less of a complication there); thus, the weight functions considered in [12] for the gamma distribution were only those relative to which such sequences could be easily guessed, while the proof in [11] is made messy by the author's inability to guess such a sequence, and even in [8] the computations become more involved in the case where an unsymmetric weight function is treated. (If, e.g., $\mathscr{F}$ is isomorphic to $G$, the sequence of a priori distributions obtained by truncating $\mu$ to $G_n$ in the previous paragraph would often be convenient for proving the minimax character by the Bayes method if it were not for the complication just noted.) The third method, that of invariance, has the obvious shortcoming of yielding little unless the group $G$ is large enough and/or there exists a simple sequence of sufficient statistics; however, when it applies to the extent that it does in the examples of Secs. 4 and 5, it reduces the minimax problem to a trivial problem of minimization. Several other sequential problems treated in Section 4 seem never to have been treated previously by any method or for any weight function; some of these involve both an unknown scale and unknown location parameter. A multivariate example is also treated in Sec. 4. In example xv of Sec. 4 will be found some remarks which indicate when the method used there can or cannot be applied successfully. In Sec. 
5, in addition to treating continuous time sequential problems in a manner similar to that of Sec. 4, we consider another type of problem where the group $G$ acts on the time parameter of the process rather than on the values of the sample function.

Journal ArticleDOI
TL;DR: In this article, the authors investigate the extent to which a statistician, by determining the order in which treatments are administered, and not revealing to the experimenter which treatment comes next until after the individual who is to receive it has been selected, can control the selection bias.
Abstract: Suppose an experimenter $E$ wishes to compare the effectiveness of two treatments, $A$ and $B$, on a somewhat vaguely defined population. As individuals arrive, $E$ decides whether they are in the population, and if he decides that they are, he administers $A$ or $B$ and notes the result, until $n$ $A$'s and $n$ $B$'s have been administered. Plainly, if $E$ is aware, before deciding whether an individual is in the population, which treatment is to be administered next, he may, not necessarily deliberately, introduce a bias into the experiment. This bias we call selection bias. We propose to investigate the extent to which a statistician $S$, by determining the order in which treatments are administered, and not revealing to $E$ which treatment comes next until after the individual who is to receive it has been selected, can control this selection bias. Thus a design $d$ is a distribution over the set $T$ of the $\binom{2n}{n}$ sequences of length $2n$ containing $n$ $A$'s and $n$ $B$'s. We shall measure the bias of a design by the maximum expected number of correct guesses which an experimenter can achieve, knowing $d$, attempting to guess the successive elements of a sequence $t \in T$ selected by $d$, and being told after each guess whether or not it is correct. The distribution of the number $G$ of correct guesses depends both on $d$ and on the prediction method $p$ used by the experimenter. We shall consider particularly two designs, the truncated binomial, in which the successive treatments are selected independently with probability $\frac{1}{2}$ each until $n$ treatments of one kind have occurred, and the sampling design, in which all $\binom{2n}{n}$ sequences are equally likely. 
We shall consider particularly two prediction methods, the convergent prediction, which predicts that treatment which has hitherto occurred less often, and the divergent prediction, which predicts that treatment which has hitherto occurred more often, except that after $n$ treatments of one kind have been administered, the divergent prediction agrees with the convergent predictions that the other treatment will follow; when both treatments have occurred equally often, either method predicts $A$ or $B$ by tossing a fair coin, independently for each case of equality. We find that among all designs, the truncated binomial minimizes the maximum expected number of correct guesses. For this design, the expected number of correct guesses is independent of the prediction method, and is $$n + n \binom{2n}{n} \big/ 2^{2n} \sim n + (n/\pi)^{1/2}$$ With the truncated binomial design, the variance in the number of correct guesses is largest for the divergence strategy and is $$3n/2 - D - D^2/4 \sim (3\pi - 2)n/2\pi - 2(n/\pi)^{1/2},$$ where $D = n \binom{2n}{n} \big/ 2^{2n - 1}$, and is smallest for the convergence strategy, and is $n/2 - D^2/4 \sim (\pi - 1)n/2\pi$. For the sampling design, convergent prediction maximizes the expected number of correct guesses; this maximum is $$n + 2^{2n - 1} \big/ \binom{2n}{n} - \frac{1}{2} \sim n + (\pi n/4)^{1/2}.$$ Finally we note that, if treatments are selected independently at random, bias of the kind we discuss disappears, but the treatment numbers can no longer be preassigned. Three such designs are considered: the fixed total design, in which the total number of treatments is a fixed number $s$, the fixed factor design, in which we continue until $1/X + 1/Y \leqq 2/n$, where $X$ is the number of $A$ treatments and $Y$ is the number of $B$ treatments administered, and the fixed minimum design, in which we continue until $\min (X, Y) = n$. 
For the fixed total design, we find that, for $s = 2n + 4$, $\mathrm{Pr}(1/X + 1/Y \leqq 2/n) \sim 0.955$ for large $n$; at the expense of 4 extra observations, we have a bias-free design whose variance factor will with probability 0.955 be smaller than that in which treatment numbers are preassigned. For the fixed factor design, the additional number of observations required to achieve the given precision has for large $n$ the distribution of the square of a normal deviate. For the fixed minimum design, in which we guarantee precision for the estimated effect of each treatment, the expected number of additional observations is roughly $1.13\,n^{1/2}$.
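The closed form $n + n\binom{2n}{n}/2^{2n}$ can be checked by exact dynamic programming. A minimal sketch (my own check, not from the paper): under the truncated binomial design every guess during the coin-flipping phase is correct with probability $\frac{1}{2}$ whatever the guesser does, while every guess in the forced tail is correct under convergent prediction, so a recursion over the treatment counts gives the expectation exactly.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

def expected_correct_guesses(n):
    """Exact E[G] for convergent guessing against the truncated binomial
    design: treatments are fair coin flips until one kind reaches n, after
    which the remainder is forced (and hence always guessed correctly)."""
    @lru_cache(maxsize=None)
    def f(a, b):
        # a, b = numbers of A's and B's administered so far
        if a == n or b == n:
            return Fraction(2 * n - a - b)   # forced tail: all guesses correct
        # free flip: any fixed guess is correct with probability 1/2
        return Fraction(1, 2) + Fraction(1, 2) * (f(a + 1, b) + f(a, b + 1))
    return f(0, 0)

# Matches n + n*C(2n,n)/2^(2n) from the abstract, exactly
for n in (1, 2, 3, 6):
    assert expected_correct_guesses(n) == n + Fraction(n * comb(2 * n, n), 4 ** n)
```

For $n = 1$ this reproduces the hand computation: the first (tied) guess is correct with probability $\frac{1}{2}$ and the forced second element is always guessed, giving $3/2 = 1 + \binom{2}{1}/4$.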

Book ChapterDOI
TL;DR: In this article, the theory of Part I is extended to problems in which it is permitted not to come to a definite conclusion regarding one or more of the questions under consideration, and some problems are also investigated in which, from a single set of observations, one wishes to answer a number of questions in sequence.
Abstract: The theory of Part I is extended to problems in which it is permitted not to come to a definite conclusion regarding one or more of the questions under consideration. Some problems are also investigated in which, from a single set of observations, one wishes to answer a number of questions in sequence. Here the nature of the question at a later stage will depend on the answers obtained at the earlier stages.

Journal ArticleDOI
TL;DR: In this article, it is shown that for a binary function of a stationary Markov process with transition matrix $M$, the sequence probabilities satisfy a recurrence relation whose coefficients, and hence the entire joint distribution, are determined by eight probabilities.
Abstract: Let $M = \| m_{ij} \|$ be a $4 \times 4$ irreducible aperiodic Markov matrix such that $h_1 \neq h_2, h_3 \neq h_4$, where $h_i = m_{i1} + m_{i2}$. Let $x_1, x_2, \cdots$ be a stationary Markov process with transition matrix $M$, and let $y_n = 0$ when $x_n = 1$ or 2, $y_n = 1$ when $x_n = 3$ or 4. For any finite sequence $s = (\epsilon_1, \epsilon_2, \cdots, \epsilon_n)$ of 0's and 1's, let $p(s) = \mathrm{Pr}\{y_1 = \epsilon_1, \cdots, y_n = \epsilon_n\}$. If \begin{equation*}\tag{1}p^2(00) \neq p(0)p(000) \text{ and } p^2(01) \neq p(1)p(010),\end{equation*} the joint distribution of $y_1, y_2, \cdots$ is uniquely determined by the eight probabilities $p(0), p(00), p(000), p(010), p(0000), p(0010), p(0100), p(0110)$, so that two matrices $M$ determine the same joint distribution of $y_1, y_2, \cdots$ whenever the eight probabilities listed agree, provided (1) is satisfied. The method consists in showing that the function $p$ satisfies the recurrence relation \begin{equation*}\tag{2}p(s, \epsilon, \delta, 0) = p(s, \epsilon, 0)a(\epsilon, \delta) + p(s, \epsilon)b(\epsilon, \delta)\end{equation*} for all $s$ and $\epsilon = 0$ or 1, $\delta = 0$ or 1, where $a(\epsilon, \delta), b(\epsilon, \delta)$ are (easily computed) functions of $M$, and noting that, if (1) is satisfied, $a(\epsilon, \delta)$ and $b(\epsilon, \delta)$ are determined by the eight probabilities listed. The class of doubly stochastic matrices yielding the same joint distribution for $y_1, y_2, \cdots$ is described somewhat more explicitly, and the case of a larger number of states is considered briefly.
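The recurrence (2) can be probed numerically: pick a matrix $M$ with $h_1 \neq h_2$, $h_3 \neq h_4$, compute $p(s)$ by the forward algorithm for the lumped chain, solve for $a(\epsilon, \delta), b(\epsilon, \delta)$ from the two shortest prefixes, and verify that the same coefficients work for every longer prefix. A sketch; the matrix below is an arbitrary example of mine, not one from the paper.

```python
import itertools

# An arbitrary 4-state transition matrix with h1 != h2 and h3 != h4,
# where h_i = m_i1 + m_i2 (NOT a matrix from the paper).
M = [[0.50, 0.20, 0.20, 0.10],
     [0.10, 0.30, 0.30, 0.30],
     [0.20, 0.20, 0.40, 0.20],
     [0.25, 0.05, 0.20, 0.50]]
label = [0, 0, 1, 1]            # y = 0 on states {1,2}, y = 1 on states {3,4}

# Stationary distribution by power iteration (chain is irreducible, aperiodic)
pi = [0.25] * 4
for _ in range(2000):
    pi = [sum(pi[i] * M[i][j] for i in range(4)) for j in range(4)]

def p(s):
    """p(s) = Pr{y_1 = s[0], ..., y_n = s[n-1]} via the forward algorithm."""
    alpha = [pi[i] if label[i] == s[0] else 0.0 for i in range(4)]
    for eps in s[1:]:
        alpha = [sum(alpha[i] * M[i][j] for i in range(4))
                 if label[j] == eps else 0.0 for j in range(4)]
    return sum(alpha)

def solve_ab(eps, delta):
    """a(eps,delta), b(eps,delta) from the two shortest prefixes s = (), (0,)."""
    (c11, c12, r1), (c21, c22, r2) = [
        (p(pre + (eps, 0)), p(pre + (eps,)), p(pre + (eps, delta, 0)))
        for pre in ((), (0,))]
    det = c11 * c22 - c12 * c21          # nonzero when condition (1) holds
    return (r1 * c22 - r2 * c12) / det, (c11 * r2 - c21 * r1) / det

# The same a, b must then satisfy (2) for every longer prefix s
for eps, delta in itertools.product((0, 1), repeat=2):
    a, b = solve_ab(eps, delta)
    for m in range(1, 4):
        for s in itertools.product((0, 1), repeat=m):
            lhs = p(s + (eps, delta, 0))
            rhs = p(s + (eps, 0)) * a + p(s + (eps,)) * b
            assert abs(lhs - rhs) < 1e-9
```

The recurrence works because, given the observed past, the conditional distribution over the two hidden states within the current superstate is summarized by one number, so two-step probabilities are affine in one-step ones.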

Journal ArticleDOI
TL;DR: In this article, it was shown that under certain conditions the distributions of the sample size under the two hypotheses uniquely determine a generalized sequential probability ratio test (GSPRT), and that GSPRT's are a complete class with respect to the probabilities of the two types of error and the average distribution of the sampled size over a finite set of other distributions.
Abstract: Generalized sequential probability ratio tests (hereafter abbreviated GSPRT's) for testing between two simple hypotheses have been defined in [1]. The present paper, divided into four sections, discusses certain properties of GSPRT's. In Section 1 it is shown that under certain conditions the distributions of the sample size under the two hypotheses uniquely determine a GSPRT. In the second section, the admissibility of GSPRT's is discussed, admissibility being defined in terms of the probabilities of the two types of error and the distributions of the sample size required to come to a decision; in particular, notwithstanding the result of Section 1, many GSPRT's are inadmissible. In Section 3 it is shown that, under certain monotonicity assumptions on the probability ratios, the GSPRT's are a complete class with respect to the probabilities of the two types of error and the average distribution of the sample size over a finite set of other distributions. In Section 4, finer characterizations are given of GSPRT's which minimize the expected sample size under a third distribution satisfying certain monotonicity properties relative to the other two distributions; these characterizations give monotonicity properties of the decision bounds.
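As context for the constant-bound special case that GSPRT's generalize, here is a minimal sketch of an ordinary SPRT for Bernoulli observations with Wald's approximate bounds; the parameter values and the Monte Carlo check are illustrative assumptions of mine, not material from the paper.

```python
import random
from math import log

def sprt_bernoulli(p0, p1, alpha, beta, draw):
    """Ordinary SPRT with Wald's constant bounds A = (1-beta)/alpha and
    B = beta/(1-alpha); a GSPRT would let the bounds vary with the sample
    size n. Returns (accept_H1, sample_size)."""
    upper = log((1 - beta) / alpha)
    lower = log(beta / (1 - alpha))
    llr, n = 0.0, 0
    while lower < llr < upper:
        x = draw()                     # one more Bernoulli observation
        n += 1
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
    return llr >= upper, n

rng = random.Random(7)
# Under H0 (true p = 0.3) the rate of wrongly accepting H1 should be
# near (in practice at most about) alpha = 0.05.
errs = sum(sprt_bernoulli(0.3, 0.7, 0.05, 0.05,
                          lambda: rng.random() < 0.3)[0]
           for _ in range(4000))
error_rate = errs / 4000
```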

Journal ArticleDOI
TL;DR: In this paper, the problem of obtaining maximum likelihood estimates for the parameters involved in a stationary single-channel, Markovian queuing process is considered and a method of taking observations is presented which simplifies this problem to that of determining a root of a certain quadratic equation.
Abstract: The problem of obtaining maximum likelihood estimates for the parameters involved in a stationary single-channel, Markovian queuing process is considered. A method of taking observations is presented which simplifies this problem to that of determining a root of a certain quadratic equation. A useful and even simpler rational approximation is also studied.
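The paper's observation scheme and its quadratic equation are not reproduced in the abstract; as a hypothetical illustration of maximum likelihood estimation for a single-channel Markovian (M/M/1) queue, suppose instead that the queue is observed continuously on $[0, T]$. The transition-rate MLEs are then arrivals per unit observed time and service completions per unit busy time.

```python
import random

def simulate_mm1(lam, mu, t_end, seed=0):
    """Continuously observe an M/M/1 queue on [0, t_end]; return the
    number of arrivals, number of service completions, and total busy time."""
    rng = random.Random(seed)
    t, q, arrivals, services, busy = 0.0, 0, 0, 0, 0.0
    next_arr = rng.expovariate(lam)
    next_dep = float("inf")
    while True:
        t_next = min(next_arr, next_dep, t_end)
        if q > 0:
            busy += t_next - t         # server is busy while the queue is nonempty
        t = t_next
        if t >= t_end:
            break
        if next_arr <= next_dep:       # arrival event
            arrivals += 1
            q += 1
            next_arr = t + rng.expovariate(lam)
            if q == 1:                 # server was idle: start a service
                next_dep = t + rng.expovariate(mu)
        else:                          # service completion
            services += 1
            q -= 1
            next_dep = t + rng.expovariate(mu) if q > 0 else float("inf")
    return arrivals, services, busy

T = 20_000.0
a, s, busy = simulate_mm1(2.0, 3.0, T)
lam_hat = a / T     # MLE of the arrival rate: arrivals per unit observed time
mu_hat = s / busy   # MLE of the service rate: completions per unit busy time
```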

Journal ArticleDOI
TL;DR: In this article, the authors extend the saddle-point theorem in various directions and give some applications to distributions connected with the multinomial distribution, especially to the distribution of $\chi^2$ and to the maximum entry in a multINomial distribution.
Abstract: Many problems in the theory of probability and statistics can be solved by evaluating coefficients in a generating function, or, for continuous differentiable distributions, by an analogous process with Laplace or Fourier transforms. As pointed out for example by H. E. Daniels [2], these problems can often be solved by asymptotic series derived by the saddle-point method from integrals containing a large parameter. Daniels gave a form of saddle-point theorem that is convenient for applications to probability and statistics. In the present paper we extend the theorem in various directions and give some applications to distributions connected with the multinomial distribution, especially to the distribution of $\chi^2$ and to the distribution of the maximum entry in a multinomial distribution.
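A minimal numerical instance of the idea (an assumed example of mine, not one of the paper's applications): the leading saddle-point term for the central binomial coefficient gives $\binom{2n}{n} \sim 4^n/\sqrt{\pi n}$, with relative error of order $1/n$.

```python
from math import comb, exp, log, pi

# Leading saddle-point/Stirling term: C(2n, n) ~ 4**n / sqrt(pi*n).
# Work on the log scale to avoid float overflow for large n.
def rel_error(n):
    log_exact = log(comb(2 * n, n))                 # exact big-integer coefficient
    log_saddle = n * log(4.0) - 0.5 * log(pi * n)   # first saddle-point term
    return abs(exp(log_saddle - log_exact) - 1.0)

errors = [rel_error(n) for n in (10, 100, 1000)]
```

The next term of the expansion would reduce the error from $O(1/n)$ to $O(1/n^2)$, which is the kind of refinement the asymptotic series provides.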

Journal ArticleDOI
TL;DR: In this article, the authors considered the problem of computing the probability integral of the statistic $y = (x_{\lbrack p\rbrack} - x)/s_\nu$, where $x_{\lbrack p\rbrack}$ is the maximum of $p$ normal independent chance variables with common mean and common unknown variance.
Abstract: The statistic $y = (x_{\lbrack p\rbrack} - x)/s_\nu$ is studied where $x_{\lbrack p\rbrack}$ is the maximum of $p$ normal independent chance variables with common mean and common unknown variance $\sigma^2$, $x$ is another independent normal chance variable with the same mean and the same variance $\sigma^2$, and $s^2_\nu$ (distributed as $\sigma^2\chi^2_\nu/\nu$ with $\nu$ degrees of freedom) is an estimate of the common variance which is independent of each one of the above $p + 1$ chance variables. Several different methods are proposed and studied for computing the probability integral of $y$ and percentage points of $y$; in addition, a method for computing percentage points without first computing the probability integral of $y$ is considered. A table of (upper) percentage points of $y$ is given as Table I at the end of the paper. Applications of the statistic $y$ to several ranking and selection problems are mentioned in Section 2. Moments of $y$ are given in Section 3. In Section 7 it is shown that Table I can be used to obtain an approximation and bounds to the percentage points of a related statistic.


Journal ArticleDOI
TL;DR: In this article, it was shown that any analytic function of the root can be expanded in Lagrange's series, which provides a way of actually computing the transition probabilities of the process.
Abstract: The system to be studied consists of a service unit and a queue of customers waiting to be served. Service-times of customers are independent, nonnegative variates with the common distribution $B(v)$ having a finite first moment $b_1$. Customers arrive in a Poisson process (see Feller [4], p. 364) of intensity $\lambda$; they form a queue and are served in order of arrival, with no defections from the queue. For previous work on this queueing system see for instance Pollaczek [11], Khintchine [9], Lindley [10], Kendall [7], [8], Smith [12], Bailey [1], and Takacs [14]. Let $W(t)$ be the time a customer would have to wait if he arrived at $t$. The forward Kolmogorov equation for the distribution of $W(t)$ is solved in principle by the use of Laplace integrals, and $E\{\exp\{ - sW(t)\}\}$ is determined in terms of $W(0)$ and the root of a possibly transcendental equation. It is shown that any analytic function of the root can be expanded in Lagrange's series, which provides a way of actually computing the transition probabilities of the process. Let $z$ be the first zero of $W(t)$. A series for $E\{\exp\{ - \tau z\}\}$ is obtained, and it is proved that $\mathrm{pr}\{z < \infty\} = 1$ if and only if $\lambda b_1 \leqq 1$. From a functional relation between $E\{W(t)\}$ and $\mathrm{pr}\{W(t) = 0\}$ the covariance function $R$ of $W(t)$ is determined. If the service-time distribution $B(v)$ has a finite third moment, then $R$ is absolutely integrable, and the spectral distribution of $W(t)$ is absolutely continuous.
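A quick way to probe such results numerically is Lindley's recurrence $W_{k+1} = \max(W_k + S_k - A_k, 0)$ for the waiting times of successive customers, a standard companion to the virtual waiting time $W(t)$ of the abstract. The Pollaczek-Khinchine mean $\lambda b_2/(2(1 - \lambda b_1))$, with $b_2$ the second moment of the service time, is a classical formula assumed here as a check, not a result stated in the abstract.

```python
import random

def mean_wait_lindley(lam, service, n_customers=200_000, seed=3):
    """Average waiting time of successive customers via Lindley's recurrence
    W_{k+1} = max(W_k + S_k - A_k, 0), with Poisson(lam) arrivals and
    service(rng) drawing one service time."""
    rng = random.Random(seed)
    w, total = 0.0, 0.0
    for _ in range(n_customers):
        total += w
        w = max(w + service(rng) - rng.expovariate(lam), 0.0)
    return total / n_customers

# M/D/1 with unit service time and lam = 0.5 (rho = lam*b1 = 0.5 < 1: stable);
# Pollaczek-Khinchine gives E[W] = lam * b2 / (2 * (1 - rho)) = 0.5
w_bar = mean_wait_lindley(0.5, lambda rng: 1.0)
```

When $\lambda b_1 > 1$ the recurrence drifts upward and the running mean diverges, which mirrors the abstract's criterion $\mathrm{pr}\{z < \infty\} = 1$ if and only if $\lambda b_1 \leqq 1$.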




Journal ArticleDOI
TL;DR: In this article, upper and lower bounds for the variance of the Mann-Whitney statistic are obtained in terms of $p = \mathrm{Pr}\{Y < X\}$, both for arbitrary $X, Y$ and for stochastically comparable $X, Y$.
Abstract: Let $X, Y$ be independent random variables with continuous cumulative probability functions and let $$p = \mathrm{Pr}\{Y < X\}.$$ For the variance of the Mann-Whitney statistic $U,$ upper and lower bounds are obtained in terms of $p$, for the case of any $X$ and $Y$ as well as for the case of stochastically comparable $X, Y$. The results for the case of stochastic comparability are new, while the inequalities in the case of arbitrary $X, Y$ have either been obtained by van Dantzig or are a consequence of other inequalities due to van Dantzig.


Journal ArticleDOI
TL;DR: In this paper, orthogonal matrices with elements depending on certain random vectors are used to derive the distributions of Hotelling's $T^2$ and Wilks' generalized variance, the Bartlett decomposition, and the Wishart distribution.
Abstract: Orthogonal matrices having elements depending on certain random vectors provide a useful tool in various distribution problems in multivariate analysis. The method is applied to the derivation of the distributions of Hotelling's $T^2$ and Wilks' generalized variance, the Bartlett decomposition, and the Wishart distribution.
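A numerical sketch of the Bartlett decomposition in its standard explicit form (assumed here; the abstract does not spell it out): with $A$ lower triangular, $A^2_{ii} \sim \chi^2_{\nu - i + 1}$ and $A_{ij} \sim N(0, 1)$ for $i > j$, the matrix $W = AA'$ is Wishart$_p(\nu, I)$, so the average of many draws should approach $\nu I$.

```python
import random
from math import sqrt

def wishart_draw(nu, p, rng):
    """One Wishart_p(nu, I) draw via the Bartlett decomposition: W = A A',
    A lower triangular with A[i][i]^2 ~ chi-square with nu - i degrees of
    freedom (0-based i) and A[i][j] ~ N(0,1) below the diagonal."""
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        A[i][i] = sqrt(rng.gammavariate((nu - i) / 2.0, 2.0))  # chi variate
        for j in range(i):
            A[i][j] = rng.gauss(0.0, 1.0)
    return [[sum(A[i][k] * A[j][k] for k in range(p)) for j in range(p)]
            for i in range(p)]

rng = random.Random(11)
nu, p, reps = 5, 2, 20_000
mean_W = [[0.0] * p for _ in range(p)]
for _ in range(reps):
    W = wishart_draw(nu, p, rng)
    for i in range(p):
        for j in range(p):
            mean_W[i][j] += W[i][j] / reps
# E[W] = nu * I for the Wishart(nu, I) distribution
```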

Journal ArticleDOI
TL;DR: In this paper, the authors consider a population consisting of nk elements and show that the sample mean is an unbiased estimate for the population mean and its variance is given by a simple sampling procedure.
Abstract: Consider a finite population consisting of $N$ elements $y_1, y_2, \cdots, y_N$. Throughout the paper we will assume that $N = nk$. A systematic sample of $n$ elements is drawn by choosing one element at random from the first $k$ elements $y_1, \cdots, y_k$, and then selecting every $k$th element thereafter. Let $y_{ij} = y_{i + (j - 1)k}(i = 1, \cdots, k; j = 1, \cdots, n)$; obviously systematic sampling is equivalent to selecting one of the $k$ "clusters" $$C_i = \{y_{ij}; j = 1, \cdots, n\}$$ at random. From this it follows that the sample mean $\bar y_i = 1/n \sum^n_{j = 1} y_{ij}$ is an unbiased estimate for the population mean $\bar y = 1/N \sum^k_{i = 1} \sum^n_{j = 1} y_{ij}$ and that $\operatorname{Var} \bar y_i = 1/k \sum^k_{i = 1} (\bar y_i - \bar y)^2$. We will denote this variance by $V^{(1)}_{sy}$ indicating by the superscript that only one cluster is selected at random. $V^{(1)}_{sy}$ can be written as \begin{equation*}\tag{1}V^{(1)}_{sy} = S^2 - \frac{1}{k} \sum^k_{i = 1} S^2_i, \text{ where } S^2 = \frac{1}{N} \sum^k_{i = 1} \sum^n_{j = 1} (y_{ij} - \bar y)^2,\end{equation*} \\ \begin{equation*} S^2_i = \frac{1}{n} \sum^n_{j = 1} (y_{ij} - \bar y_i)^2.\end{equation*} It is natural to compare systematic sampling with stratified random sampling, where one element is chosen independently in each of the $n$ strata $\{y_1, \cdots, y_k\}, \{y_{k + 1}, \cdots, y_{2k}\}, \cdots$, and with simple random sampling using sample size $n$. The corresponding variances of the sample mean will be denoted by $V^{(1)}_{st}, V^{(n)}_{ran}$ respectively. We consider now the following generalization of systematic sampling which appears to have been suggested by J. Tukey (see [3], p. 96, [4], [5]). Instead of choosing at first only one element at random we select a simple random sample of size $s$ (without replacement) from the first $k$ elements and then every $k$th element following those selected.
In this way we obtain a sample of $ns$ elements and, if $i_1, i_2, \cdots, i_s$ are the serial numbers of the elements first chosen, the sample mean $1/s(\bar y_{i_1} + \cdots + \bar y_{i_s})$ can be used as an estimate for the population mean. This sampling procedure is clearly equivalent to drawing a simple random sample of size $s$ from the $k$ clusters $C_i(i = 1, \cdots, k)$. It therefore follows (see, for example, [2], Chapter 2.3 to 2.4) that the sample mean is an unbiased estimate for the population mean and that its variance, which we denote by $V^{(s)}_{sy}$, is given by \begin{equation*}\tag{2}V^{(s)}_{sy} = \frac{k - s}{ks} \frac{1}{k - 1} \sum^k_{i = 1} (\bar y_i - \bar y)^2 = \frac{1}{s} \frac{k - s}{k - 1} V^{(1)}_{sy}.\end{equation*} Again, it is natural to compare this sampling procedure with stratified random sampling, where a simple random sample of size $s$ is drawn independently in each of the $n$ strata $\{y_1, \cdots, y_k\}, \{y_{k + 1}, \cdots, y_{2k}\}, \cdots$ or with simple random sampling employing sample size $ns$. We denote the corresponding variances of the sample mean (which in both cases is an unbiased estimate for the population mean) by $V^{(s)}_{st}, V^{(ns)}_{ran}$ respectively. From well-known variance formulae (see, for example, [2], Chapters 2.4 and 5.3) it follows that \begin{equation*}\tag{3}V^{(s)}_{st} = \frac{1}{s} \frac{k - s}{k - 1} V^{(1)}_{st},\\ V^{(ns)}_{ran} = \frac{N - ns}{s(N - n)} V^{(n)}_{ran} = \frac{1}{s} \frac{k - s}{k - 1} V^{(n)}_{ran}. \end{equation*} Thus the relative magnitudes of the three variances $V^{(s)}_{sy}, V^{(s)}_{st}, V^{(ns)}_{ran}$ are the same as for $V^{(1)}_{sy}, V^{(1)}_{st}, V^{(n)}_{ran}$, of which comparisons were made for several types of populations by W. G. Madow and L. H. Madow [6] and W. G. Cochran [1]. Some of the results will be reviewed in Section 3.
The object of this note is to compare systematic sampling with $s$ random starts, as described above, with systematic sampling employing only one random start but using a sample of the same size $ns$. To make this comparison we obviously have to assume that $k$ is an integral multiple of $s$, say $k = ls$. The latter procedure then consists in choosing one element at random from the first $l$ elements $\{y_1, \cdots, y_l\}$ and selecting every $l$th consecutive element. We denote the variances of the sample mean of the two procedures by $V^{(s)}_k, V^{(1)}_l$ respectively, indicating by the subscript the size of the initial "counting interval." (In our notation $V^{(s)}_{sy} \equiv V^{(s)}_k$.) We shall show in Section 4 that $V^{(1)}_l = V^{(s)}_k$ in the case of a population "in random order," but $V^{(1)}_l < V^{(s)}_k$ for a population with a linear trend or with a positive correlation between the elements which is a decreasing convex function of their distance apart. Some numerical results on the relative precision of the two procedures will be given in Section 5 for the case of a large population with an exponential correlogram.
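The linear-trend claim can be verified by brute-force enumeration of all start sets; the population $y_i = i$ and the values $n = 4$, $k = 12$, $s = 3$, $l = 4$ below are illustrative choices of mine, not taken from the paper.

```python
from itertools import combinations
from statistics import mean

def var_systematic(y, k, s):
    """Exact variance of the sample mean under systematic sampling with s
    random starts and counting interval k (all C(k,s) start sets equally
    likely); the estimate is the average of the chosen cluster means."""
    cluster_means = [mean(y[i::k]) for i in range(k)]
    ybar = mean(y)
    ests = [mean(cluster_means[i] for i in starts)
            for starts in combinations(range(k), s)]
    return mean((e - ybar) ** 2 for e in ests)

y = list(range(1, 49))                 # linear trend, N = 48 = n*k with n = 4, k = 12
v_s_starts = var_systematic(y, 12, 3)  # s = 3 starts, interval k = 12
v_one_start = var_systematic(y, 4, 1)  # one start, interval l = k/s = 4, same size ns
# For a linear trend, one long systematic sample beats s interleaved ones
```

For this population the enumeration gives $V^{(s)}_k = 3.25$ and $V^{(1)}_l = 1.25$, so $V^{(1)}_l < V^{(s)}_k$ as Section 4 asserts.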