# Measuring labour earnings inequality in post-apartheid South Africa

Abstract: This paper investigates the validity of household survey data published by Statistics South Africa since 1993 and later integrated into the Post-Apartheid Labour Market Series (PALMS). A series of statistical adjustments are proposed, compared, and applied to primary data with the purpose of generating time-comparable, unbiased estimates, and accurate standard errors of labour earnings inequality coefficients.

## Summary

### 1 The Post-Apartheid Labour Market Series

- No consensus has been reached on the quality of long-run time series.
- This project produced the so-called Post-Apartheid Labour Market Series: a stacked cross-section consisting of a harmonized compilation of four household surveys conducted after 1993 and focused on socioeconomic topics (Kerr et al. 2013).
- The full description given in Kerr and Wittenberg (2019b: 16) is as follows: 'Monthly REAL earnings variable generated from the earnings amount data (not bracket information) across all waves where earnings amounts were asked and data have been released (all waves except OHS 1996 and QLFS waves 2008, 2009 and 2012).'
- For this reason, PALMS has generated a new strand of academic literature that explores the short- and long-term dynamics of wage inequality in post-transition South Africa, as well as a vibrant discussion on the need for higher-quality, time-consistent, and more frequent microeconomic data.
- While it is not feasible to fully address all problems pertaining to primary data collection, the final remarks discuss what assumptions are needed in order to make defensible comparisons over time.

### 2 Labour income in post-apartheid South Africa: a literature review

- A number of attempts to quantify inequality dynamics since the advent of democracy in South Africa explore the quality of surveys and censuses available in the country and eventually comment on the comparability of relevant variables over time.
- Cichello et al. (2005) compare 1993 and 1998 earnings in the KwaZulu Natal Income Dynamics Study and reach different results when using the data as a panel and as a cross-section.
- By contrast, the panel data indicate that workers who were already employed in the formal sector in 1993 experienced a fall in earnings, while informal workers started at a much lower average earnings point but experienced a rise due to mobility towards formal employment.
- Wittenberg (2017c) effects further adjustments to yield PALMSv2.1 and calculates wage inequality through the Gini coefficient.
- He argues that despite some noise in the estimates, the measurements made after the LFS 2007:1 are noticeably higher than those made from 2000 to 2006.

### 3 Working with PALMS

- In PALMS, the variable reporting real earnings with no adjustment returns a mean of ZAR8,784 per month and a median of ZAR3,225.
- The number of observations, $N_{\text{original}}$, in the original file is 963,492; this is higher than in any of the other approaches because every possible earner is included.
- Plotting raw data against time also evidences the presence of issues.
- Figure 1 displays a clear trend of average real earnings and a puzzling volatility, with suspicious falls in 1994 and after 2000 and rises in 2012, among other things.

### 3.1 The benchmark data set

- The first issue encountered in exploring the statistical properties of PALMSv3.3 is that unrealistic values are found with respect to the age category, with 708 respondents supposedly reporting as more than 100 years old (one individual is recorded as being 142 years old).
- For this reason, the sample is restricted to those typically assumed to be in the labour force—that is, to respondents in the age group 18–65 (see also Finn and Leibbrandt 2018).
- Secondly, given that analysis is restricted to labour income, the unemployed, who can be assumed to receive zero earnings, are also excluded.
- In the LFS, Wittenberg and Pirouz (2013: 6) note the impossibility of identifying ‘those working for themselves (employers/self-employed) and those working for others’.
- Because of the shift in recorded self-employment, Wittenberg (2017c) accounts for inequality across wage-earners only.
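The sample restrictions described above can be sketched as a simple filter. This is only an illustration: the field names `age`, `employed`, and `self_employed` are hypothetical and do not correspond to actual PALMS variable names, and excluding the self-employed follows the wage-earner-only choice attributed to Wittenberg (2017c).

```python
def benchmark_sample(records, min_age=18, max_age=65):
    """Keep employed wage-earners aged min_age..max_age (inclusive).

    `records` is a list of dicts with hypothetical keys
    'age', 'employed', and 'self_employed'.
    """
    kept = []
    for r in records:
        if not (min_age <= r["age"] <= max_age):
            continue  # drops implausible ages such as 142
        if not r["employed"]:
            continue  # the unemployed are excluded
        if r["self_employed"]:
            continue  # wage-earners only
        kept.append(r)
    return kept


people = [
    {"age": 34, "employed": True, "self_employed": False},
    {"age": 142, "employed": True, "self_employed": False},
    {"age": 45, "employed": False, "self_employed": False},
    {"age": 52, "employed": True, "self_employed": True},
    {"age": 17, "employed": True, "self_employed": False},
]
sample = benchmark_sample(people)
print(len(sample))
```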

### 3.2 Outliers

- In the latter case, real earnings in logs are linearly regressed over gender, race, a quadratic in age, education, and occupation levels.
- After removal of outliers, the scatterplot of studentized residuals shows no presence of extreme values; mean earnings are now ZAR7,035 and the median is ZAR3,293, and the final number of millionaires is two out of a total 435,048 observations.
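A studentized-residual check can be sketched as follows. The regression described above uses gender, race, a quadratic in age, education, and occupation; for brevity this sketch assumes a single covariate, uses internally studentized residuals, and applies a cut-off of 3, all of which are illustrative assumptions rather than the paper's exact procedure.

```python
import math


def studentized_residuals(x, y):
    """Internally studentized residuals from a one-covariate OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    out = []
    for xi, e in zip(x, resid):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        out.append(e / (s * math.sqrt(1 - leverage)))
    return out


# Synthetic data: log earnings rise with age, with one contaminated record.
ages = [20 + i for i in range(20)]
log_earnings = [7 + 0.05 * a + (0.1 if i % 2 == 0 else -0.1)
                for i, a in enumerate(ages)]
log_earnings[10] += 5.0  # implausibly high earnings

r = studentized_residuals(ages, log_earnings)
flagged = [i for i, ri in enumerate(r) if abs(ri) > 3]  # assumed cut-off
print(flagged)
```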

### 3.3 Zero-earners

- Zero-earners are workers who report null labour income, for various possible reasons: (i) the respondent earns a positive income but is lying; (ii) zero surplus at the end of the period is equated with zero income; (iii) the individual is receiving not monetary pay but experience, income in kind.
- Kerr and Wittenberg (2019a) report 476 flagged outliers, but using an old version of the data (Kerr: personal communication).
- According to Wittenberg and Pirouz (2013), zero-earners represent a problem only among the LFS’ self-employed due to the simplification of the instrument and increased coverage of informal subsistence workers.
- This is perhaps the most common approach used by researchers working with household survey data in South Africa.
- On the other hand, there are 2,824 zero earnings that are flagged as implausible and imputed (see Section 3.7). The values imputed for these records are only slightly lower than observed values, indicating that workers with such characteristics would earn a positive monthly wage if they worked in paid employment; in other words, the reported zeros are implausible records.

### 3.4 Sample weights

- While sample weights are usually designed as inverse inclusion probabilities, Stats SA instead implements a post-stratification adjustment based on auxiliary population totals to reflect the race, gender, and age group distribution.
- Due to the cross-sectional nature of the data, post-stratification weighting corrects sampling errors (i.e. non-response rates and out-of-date sampling frame) given the external information available at the particular year in question.
- Along these lines of thought, Branson (2010) first proposed a cross-entropy estimation approach to create a new set of individual weights—to be common within households—which inflates the sample to a time-consistent external total, while maintaining the post-stratification sampling correction applied by Stats SA.
- In order to account for higher survival rates and a growing population, PALMSv3.3 updates cross-entropy weights using Stats SA population estimates for 2019.
- In PALMSv3.3, the variable ‘ceweight1’ is the recommended weight to use in conjunction with realearnings.
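To make the idea of post-stratification concrete, the sketch below scales each respondent so that weighted cell counts match external population totals. It is a plain cell-based adjustment with hypothetical cell labels and totals; it is not the cross-entropy procedure of Branson (2010), which additionally keeps weights common within households.

```python
from collections import Counter


def poststratified_weights(sample_cells, population_totals):
    """Weight = external population total of a cell / sample count in that cell."""
    counts = Counter(sample_cells)
    return [population_totals[cell] / counts[cell] for cell in sample_cells]


# Hypothetical cells (gender x age group) and hypothetical population totals.
cells = ["F_18-34", "F_18-34", "M_18-34"]
totals = {"F_18-34": 1000, "M_18-34": 900}

weights = poststratified_weights(cells, totals)
print(weights)
```

By construction the weights within each cell sum to that cell's population total, so the weighted sample reproduces the external demographic distribution.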

### 3.5 Bracket responses

- Given that individuals may be reluctant to disclose the exact rand amount that they earn, in many surveys it is customary to offer respondents the option of providing their income information in bands (Juster and Smith 1997).
- Ardington et al. (2005) claim that respondents who answer in brackets usually have higher incomes than those who give point values, and show that inequality levels are generally underestimated as a result of collecting income information in bands, although fortunately not by much.
- This strategy will outperform either imputation whenever the distribution of the variable in question is markedly different from the distributional assumptions implicit in the imputation strategy (earnings follow a log-normal distribution).
- PALMS’ realearnings variable does not include bracket information, which is instead registered as missing values.
- The variable bracketweight is the product of $w_i$ (the inverse inclusion probability of a point value response in a particular bracket in a particular wave) and $p_h$ (the cross-entropy weight for that particular individual created from the Stats SA 2019 demographic model).
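The product structure of the bracket weight can be illustrated with a small sketch. It assumes that the probability of a point-value response within a bracket and wave is estimated by the share of respondents in that cell who gave a rand amount; the function name, arguments, and numbers are hypothetical.

```python
def bracket_weight(n_point, n_bracket_only, ce_weight):
    """Reweight a point-value respondent to also stand in for bracket responders.

    n_point        -- respondents in the bracket/wave cell who gave a rand amount
    n_bracket_only -- respondents in the same cell who gave only the bracket
    ce_weight      -- the individual's cross-entropy weight
    """
    if n_point <= 0:
        raise ValueError("need at least one point-value response in the cell")
    # Inverse of the estimated probability of giving a point value.
    inverse_probability = (n_point + n_bracket_only) / n_point
    return inverse_probability * ce_weight


# 80 point responses and 20 bracket-only responses in the cell.
print(bracket_weight(80, 20, 1200.0))
```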

### 3.6 OHS 1996: a special case of bracket responses

- The 1996 OHS (wave 3 in PALMSv3.3) was a much smaller survey (around 15,000 respondents), given that it ran immediately after the census, and it displayed a much simpler instrument than usual since it captured no earnings amounts but only brackets.
- In the above, Kerr et al. (2019) used five nearest neighbours to draw from.

### 3.7 Multiple imputations over missing observations

- Ardington et al. (2005) were the first to implement multiple imputations for missing income values in the South African data.
- Unlike imputation of (iii) in the previous section, the procedure for filling missing real earnings values in (i), as described in Wittenberg (2017b), requires a preliminary passage, which is: for each wave of the data set, an ordered logit model—with province, gender, education, race, a quadratic in age, and occupation as explanatory variables—is used to impute the brackets.
- The predicted brackets are then used (along with the covariates gender and education) as independent variables in the linear regression to multiply impute rand amounts using PMM, exactly as in the second stage of the OHS 1996 imputation.
- PALMSv3.3miincomes maintains some of the issues of PALMSv3.3, such as implausibly old workers and different extreme values, such that the authors cannot always make use of the multiple imputations already existing in PALMSv3.3miincomes.
- MI estimates can be non-replicable in the sense that the estimates one person reports from a sample of m imputed data sets can differ substantially from the estimates that someone else would get if they re-imputed the data and obtained a different sample of m imputed data sets.
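The core of predictive mean matching (PMM) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses a single hypothetical covariate, and it omits the step in which regression coefficients are drawn from their posterior distribution before predicting for the missing cases. Each missing value is filled with the observed value of a donor drawn at random from the k observations whose predicted means are closest.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b


def pmm_impute(x, y, k=5, rng=None):
    """Return a completed copy of y, where None marks a missing value."""
    if rng is None:
        rng = random.Random()
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([xi for xi, _ in observed], [yi for _, yi in observed])
    # Pair each observed value with its predicted mean.
    donor_pool = [(a + b * xi, yi) for xi, yi in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = a + b * xi
        nearest = sorted(donor_pool, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])  # borrow an observed value
    return completed


# Hypothetical data: years of schooling and log wages with two gaps.
schooling = [6, 8, 10, 12, 12, 14, 15, 16, 9, 13]
log_wage = [7.1, 7.4, 7.9, 8.3, None, 8.8, 9.0, 9.4, None, 8.6]

m = 5  # number of imputed data sets
imputed_sets = [pmm_impute(schooling, log_wage, k=5, rng=random.Random(seed))
                for seed in range(m)]
```

Because donors are real observations, imputed values always lie within the range of observed earnings, which is one reason PMM is attractive when the log-normal assumption is doubtful.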

### 3.8 Breaks in the series

- The South African labour income series is bedevilled by breaks.
- The earnings question did reappear in late 2009, but data was released only from 2010, in the separate LMDSA (Wittenberg 2017b).
- The proportion of missing values in each imputation wave does not exceed 30 per cent of total observations.
- The imputed data is also checked numerically by generating descriptive statistics.

### 3.9 Under-reporting

- A number of studies compare the QLFS earnings data against other sources, particularly administrative data released by the South African Revenue Service since 2011, and suggest that it under-reports high incomes (Bassier and Woolard 2018; Seekings 2007; van der Berg et al.
- Furthermore, when comparing the wage figures in the QLFS and the SARS data set, Wittenberg (2017b) notes that the gap is relatively uniform, at around 40 per cent, across different deciles.
- These considerations necessarily imply that the estimate of the Gini coefficient through PALMS in the years 2000–19 will be lower than actually observed, yet higher than estimated through alternative data sources that completely ignore the lower deciles.
- The problem of under-reporting earnings is inherent to the LFS waves too.
- It is widely acknowledged that between the last OHS (October 1999) and the first LFS (February 2000), there was an increase in coverage of marginal workers, and a consequent decline in earnings (Kerr and Wittenberg 2019a).

### 3.10 Quarter frequency (1993–2007)

- Given the relationship of this paper to subsequent research on the relationship between monetary policy and income inequality, and considering the short timeframe in which monetary policy shocks propagate through the economy, it is necessary to derive sub-annual frequencies (i.e. quarterly) from annual or biannual surveys.
- Each group will then represent a quarter of the year.
- This approach is both elementary and simplistic, given that it negates any real shift among quarters.

### 4 Measuring wage inequality

- The final step of multiple imputations for missing data is to perform the desired analysis on each mth complete data set, then combine the results of the m analyses from every round, and finally average over the m estimates to obtain a point value with associated standard errors.
- Figures 4 and 5 show the frequency distribution of real wages across workers with their respective moments, at two distinct points in time: real wages are more evenly distributed in the last quarter of 1994 than in the first quarter of 2015.
- This is confirmed by the Gini index plotted in Figure 6a.
- In what follows I shall describe the changes in inequality over time and across multiple measures.
- These figures are based on individual monthly wage income (excluding the self-employed and the unemployed) at gross level (pre-tax).
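The combination step for multiply imputed data is commonly carried out with Rubin's rules, sketched below. The Gini estimates and standard errors in the example are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m per-data-set estimates and their sampling variances.

    Returns the pooled point estimate and its standard error.
    Requires m >= 2.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical Gini coefficients from m = 3 imputed data sets.
ginis = [0.50, 0.52, 0.54]
std_errors = [0.01, 0.01, 0.01]

gini_hat, gini_se = pool_rubin(ginis, [se ** 2 for se in std_errors])
print(round(gini_hat, 3), round(gini_se, 4))
```

The pooled standard error exceeds the per-data-set standard errors because it also carries the between-imputation variance, that is, the uncertainty due to the missing values themselves.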

### 4.3 The P90/P50 dispersion ratio

- A similar explanation can be applied to the P90/P50 ratio, which compares the wage at the 90th percentile with the median (50th percentile) of the wage distribution.
- The average ratio is 4.7, which implies that a worker at the 90th percentile earns almost five times as much as the median worker.
- Figure 8 shows a well-defined, positive trend that peaked in the first quarter of 2015.
- This evidence suggests that the wage differential between the ninth and the fifth decile of the wage distribution has been increasing over time: while the richest have become richer, the wage of the poorest 50 per cent has not increased proportionally.
- Yet, it seems that P50 changes are more closely related to P90 than P10.
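A weighted percentile ratio can be computed as in the sketch below. The quantile rule used here (the smallest wage at which the cumulative weight reaches the requested share of the total) is one common convention and is an assumption; the wages and weights are hypothetical.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


def percentile_ratio(values, weights, upper=0.9, lower=0.5):
    """Ratio of two weighted quantiles, e.g. P90/P50."""
    return (weighted_quantile(values, weights, upper)
            / weighted_quantile(values, weights, lower))


# Hypothetical monthly wages (ZAR) and survey weights.
wages = [1000, 2000, 3000, 4000, 20000]
weights = [300, 300, 200, 150, 50]

print(percentile_ratio(wages, weights))
```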

### 4.4 The generalized entropy index

- Measures from the generalized entropy (GE) class are sensitive to changes at the higher end of the distribution if the weight given to distances between incomes at different parts of the income distribution is high.
- The GE index calculated here employs a parameter equal to 2 such that the index is especially sensitive to the existence of large incomes.
- Figure 9 reveals the worrying presence of high incomes around 2000, while it confirms previous observations over the 2014–16 period.
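For reference, the standard textbook form of the generalized entropy class is given below in unweighted notation; the symbols $y_i$ (individual wage), $\bar{y}$ (mean wage), and $N$ (number of earners) are introduced here for illustration, and survey weights are omitted for simplicity.

$$
GE(\alpha) = \frac{1}{\alpha(\alpha - 1)} \left[ \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i}{\bar{y}} \right)^{\alpha} - 1 \right], \qquad \alpha \neq 0, 1
$$

With $\alpha = 2$ the index reduces to half the squared coefficient of variation:

$$
GE(2) = \frac{1}{2} \left[ \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i}{\bar{y}} \right)^{2} - 1 \right] = \frac{1}{2} \, \frac{\sigma^2}{\bar{y}^2}
$$

where $\sigma^2$ is the variance of wages computed with divisor $N$. As a small check, two earners with wages 1 and 3 have $\bar{y} = 2$, so $GE(2) = \tfrac{1}{2}\left[(0.25 + 2.25)/2 - 1\right] = 0.125$. Because incomes enter squared, a few very large wages can move the index substantially.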

### 5 Conclusions

- A number of problems have been inherited from the primary data used to compile PALMS in the first place, and as such they have no post-fieldwork solution.
- Inevitably, the time series plotted in Figures 6 to 9 may still feature characteristics that should be ascribed more to methodological than to real variation.
- Nevertheless, this work contributes to the previous literature on South African disaggregated data by improving existing data quality, delivering a robust time series of labour income inequality among wage employees, and thus facilitating long-run dynamic policy analysis.

##### Citations

### Cites background or methods from "Measuring labour earnings inequalit..."

...This section relies heavily on my previous work (Merrino 2020)....

[...]

...If this is the case, the functional distribution of income remains overall an inadequate proxy for labour income inequality in the South African case (Merrino 2020)....

[...]

...By looking at the evolution of the labour share in the post-apartheid era, it emerges that it has been moving in the same direction as wage inequality (Merrino 2020)....

[...]

### Cites methods from "Measuring labour earnings inequalit..."

...This analysis exploits the recent improvement of the household survey data integrated in the Post-Apartheid Labour Market Series (PALMS), which allows a reliable comparison over time of inequality measures (Merrino 2020)....

[...]

...The harmonized data include the Household Surveys from 1994 to 1999, the Labour Force Surveys from 2000 to 2007, and the Quarterly Labour Force Surveys from 2008 to 2019 (see Finn and Leibbrandt 2018; Kerr and Wittenberg 2019; Merrino 2020)....

[...]

##### References

### "Measuring labour earnings inequalit..." refers methods in this paper

...The process of PMM imputation is repeated m times to obtain m imputed data sets to be eventually analysed as though they were complete (Rubin 1987)....

[...]

### "Measuring labour earnings inequalit..." refers methods or result in this paper

...Leibbrandt et al. (2010) include all forms of labour earnings from three comparable national household survey data sets: the PSLSD for 1993, the LFS and IES for 2000, and the National Income Dynamics Study (NIDS) for 2008....

[...]

...Finn (2015) calculates the Gini wage inequality in PALMS using the same data-cleaning procedure suggested by Wittenberg (2014b): in contrast to Leibbrandt et al. (2010), who calculated overall income inequality, the Gini coefficient of real wages in 2003:1 (0.553) was almost identical in 2012:1…...

[...]

...As already observed by Casale et al. (2004) and Leibbrandt et al. (2010), Wittenberg and Pirouz (2013) also evidence how the change in coverage between the OHSs and the successive LFSs generated a gap in the earnings series at the year 2000....

[...]

### "Measuring labour earnings inequalit..." refers methods in this paper

...Therefore, I follow Wittenberg (2017b) and compare three procedures that distinctly detect contaminating observations: the BACON algorithm (Billor et al. 2000), a robust regression through iteratively reweighted least squares, and a studentized residuals approach....

[...]

### "Measuring labour earnings inequalit..." refers background in this paper

...Schenker and Taylor (1996) ran simulations with k = 3 and k = 10, finding small differences in performance, although with k = 3 there was less bias and more sampling variation....

[...]