Panel data analysis—advantages and challenges

Cheng Hsiao, +1 more
- 16 Mar 2007 - 
- Vol. 16, Iss: 1, pp 1-22
WISE WORKING PAPER SERIES
WISEWP0602
Panel Data Analysis - Advantages and Challenges
Cheng Hsiao
April 19, 2006
COPYRIGHT© WISE, XIAMEN UNIVERSITY, CHINA

Panel Data Analysis - Advantages and Challenges
Cheng Hsiao
Department of Economics
University of Southern California
Los Angeles, CA 90089-0253
and
Wang Yanan Institute for Studies in Economics
Xiamen University, China
April 19, 2006
ABSTRACT
We explain the proliferation of panel data studies in terms of (i) data availability, (ii)
the more heightened capacity for modeling the complexity of human behavior than a single
cross-section or time series data can possibly allow, and (iii) challenging methodology.
Advantages and issues of panel data modeling are also discussed.
Keywords: Panel data; Longitudinal data; Unobserved heterogeneity; Random effects;
Fixed effects
I would like to thank Irene C. Hsiao for helpful discussion and editorial assistance and
Kannika Damrongplasit for drawing the figures. Some of the arguments presented here
also appear in Hsiao (2005, 2006).

1. Introduction
Panel data or longitudinal data typically refer to data containing time series observations
of a number of individuals. Therefore, observations in panel data involve at least
two dimensions: a cross-sectional dimension, indicated by subscript i, and a time series
dimension, indicated by subscript t. However, panel data could have a more complicated
clustering or hierarchical structure. For instance, variable y may be the measurement of
the level of air pollution at station ℓ in city j of country i at time t (e.g. Antweiler (2001),
Davis (1999)). For ease of exposition, I shall confine my presentation to a balanced panel
involving N cross-sectional units, i = 1, ..., N, over T time periods, t = 1, ..., T.
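The two-dimensional indexing described above can be made concrete with a small sketch. The helper names `make_balanced_panel` and `is_balanced` are illustrative assumptions, not notation from the paper; the code only shows that a balanced panel holds exactly one observation for every pair (i, t), so it has N × T observations in total.

```python
from itertools import product


def make_balanced_panel(N, T):
    """Return the index pairs (i, t) of a balanced panel.

    Every unit i = 1, ..., N is observed in every period t = 1, ..., T.
    """
    return list(product(range(1, N + 1), range(1, T + 1)))


def is_balanced(index_pairs):
    """Check that each unit is observed in the same set of periods."""
    periods_by_unit = {}
    for i, t in index_pairs:
        periods_by_unit.setdefault(i, set()).add(t)
    period_sets = list(periods_by_unit.values())
    return all(s == period_sets[0] for s in period_sets)


# A balanced panel with N = 3 units and T = 4 periods.
panel = make_balanced_panel(N=3, T=4)

# Dropping a single (i, t) observation makes the panel unbalanced.
unbalanced = [pair for pair in panel if pair != (2, 4)]
```

Here `panel` contains 12 index pairs, while `unbalanced` is missing unit 2 in period 4 and therefore fails the check.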
There is a proliferation of panel data studies, be they methodological or empirical. In
1986, when the first edition of Hsiao’s Analysis of Panel Data (Hsiao (1986)) was published,
there were 29 studies listing the key words “panel data” or “longitudinal data”, according to
the Social Sciences Citation Index. By 2004, there were 687 and by 2005, there were 773. The growth
of applied studies and the methodological development of new econometric tools of panel
data have been simply phenomenal since the seminal paper of Balestra and Nerlove (1966).
There are at least three factors contributing to the geometric growth of panel data
studies: (i) data availability, (ii) greater capacity for modeling the complexity of human
behavior than a single cross-section or time series data, and (iii) challenging methodology.
In what follows, we shall briefly elaborate on each of these in turn. However, it is impossible
to do justice to the vast literature on panel data. For further reference, see Arellano
(2003), Baltagi (2001), Hsiao (2003), Mátyás and Sevestre (1996), and Nerlove (2002).
2. Data Availability
The collection of panel data is obviously much more costly than the collection of
cross-sectional or time series data. However, panel data have become widely available in both
developed and developing countries.
The two most prominent panel data sets in the US are the National Longitudinal
Surveys of Labor Market Experience (NLS) and the University of Michigan’s Panel Study
of Income Dynamics (PSID). The NLS began in the mid-1960s. It contains five separate
annual surveys covering distinct segments of the labor force with different spans: men
whose ages were 45 to 59 in 1966, young men 14 to 24 in 1966, women 30 to 44 in 1967,
young women 14 to 24 in 1968, and youth of both sexes 14 to 21 in 1979. In 1986, the NLS
expanded to include annual surveys of the children born to women who participated in
the National Longitudinal Survey of Youth 1979. The list of variables surveyed runs
into the thousands, with emphasis on the supply side of the labor market.
The PSID began with the collection of annual economic information from a representative
national sample of about 6,000 families and 15,000 individuals in 1968 and has continued
to the present. The data set contains over 5,000 variables (Becketti, Gould, Lillard and
Welch (1988)). In addition to the NLS and PSID data sets, there are many other panel
data sets that could be of interest to economists; see Juster (2000).
In Europe, many countries have their own annual or more frequent national surveys, such
as the Netherlands Socio-Economic Panel (SEP), the German Socio-Economic Panel
(GSOEP), the Luxembourg Social Panel (PSELL), the British Household Panel Survey
(BHPS), etc. Starting in 1994, the National Data Collection Units (NDUs) of the Statistical
Office of the European Communities have been coordinating and linking existing national
panels with centrally designed multi-purpose annual longitudinal surveys. The European
Community Household Panel (ECHP) is published in Eurostat’s reference data base New
Cronos in three domains: health, housing, and income and living conditions.
Panel data have also become increasingly available in developing countries. In these
countries, there may not have been a long tradition of statistical collection. It is of special
importance to obtain original survey data to answer many important questions.
Many international agencies have sponsored and helped to design panel surveys. For
instance, the Dutch non-governmental organization (NGO) ICS Africa collaborated with
the Kenya Ministry of Health to carry out a Primary School Deworming Project (PSDP).
The project took place in Busia district, a poor and densely-settled farming region in
western Kenya. The 75 project schools include nearly all rural primary schools in this
area, with over 30,000 enrolled pupils between the ages of six and eighteen from 1998 to 2001.
Another example is the Development Research Institute of the Research Center for Rural
Development of the State Council of China, in collaboration with the World Bank, which
undertook an annual survey of 200 large Chinese township and village enterprises from
1984 to 1990.
3. Advantages of Panel Data
Panel data, by blending the inter-individual differences and intra-individual dynamics,
have several advantages over cross-sectional or time-series data:
(i) More accurate inference of model parameters. Panel data usually contain more
degrees of freedom and less multicollinearity than cross-sectional data, which may
be viewed as a panel with T = 1, or time series data, which is a panel with N = 1,
hence improving the efficiency of econometric estimates (e.g. Hsiao, Mountain
and Ho-Illman (1995)).
(ii) Greater capacity for capturing the complexity of human behavior than a single
cross-section or time series data. These include:
(ii.a) Constructing and testing more complicated behavioral hypotheses. For instance,
consider the example of Ben-Porath (1973) that a cross-sectional
sample of married women was found to have an average yearly labor-force
participation rate of 50 percent. This could be the outcome of random
draws from a homogeneous population or could be draws from heterogeneous
populations in which 50% were from the population who always work
and 50% from the population who never work. If the sample was from the former,
each woman would be expected to spend half of her married life in the labor force and half out of
the labor force. The job turnover rate would be expected to be frequent and
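
The contrast drawn in the Ben-Porath example can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's own method: the function names, the sample sizes, and the choice of independent draws with probability 0.5 for the homogeneous case are all hypothetical. It shows that the two populations are indistinguishable in a single cross-section (both give a participation rate near 50 percent) but differ sharply in the period-to-period turnover that only a panel can reveal.

```python
import random


def simulate_panel(N, T, heterogeneous, seed=0):
    """Simulate N women's work status (1 = works, 0 = does not) over T periods.

    Homogeneous case (assumption): each woman works in each period with
    probability 0.5, independently across periods.
    Heterogeneous case: half of the women always work, half never work.
    """
    rng = random.Random(seed)
    panel = []
    for i in range(N):
        if heterogeneous:
            p = 1.0 if i < N // 2 else 0.0
        else:
            p = 0.5
        panel.append([1 if rng.random() < p else 0 for _ in range(T)])
    return panel


def participation_rate(panel, t):
    """Cross-sectional participation rate in period t."""
    return sum(row[t] for row in panel) / len(panel)


def turnover_rate(panel):
    """Share of consecutive-period pairs in which work status changes."""
    changes = sum(row[t] != row[t - 1]
                  for row in panel for t in range(1, len(row)))
    pairs = sum(len(row) - 1 for row in panel)
    return changes / pairs


homogeneous = simulate_panel(N=2000, T=10, heterogeneous=False)
heterogeneous = simulate_panel(N=2000, T=10, heterogeneous=True)

# One cross-section (period 0) looks alike in both populations.
rate_hom = participation_rate(homogeneous, 0)
rate_het = participation_rate(heterogeneous, 0)

# Turnover over time separates them: frequent in the homogeneous case,
# absent in the heterogeneous case.
turn_hom = turnover_rate(homogeneous)
turn_het = turnover_rate(heterogeneous)
```

In this sketch the heterogeneous population has a participation rate of exactly one half and no turnover at all, while the homogeneous population has a participation rate close to one half and changes status in roughly half of all consecutive periods.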