Sunday, 2 September 2012

Using the R package lme4 in empirical social science

“most producers and consumers of comparative political economy are intrinsically interested in specific cases. Why not cater to this interest by keeping our cases visible?”
Michael Shalev, 2007

In recent years I have been moving toward the use of so-called time-series cross-section (TSCS) data, that is, datasets that combine repeated observations over time (time series, TS) with observations of several different units (cross section, CS). TSCS is usually used as a label for panel data in which the units are countries, often fifteen to twenty rich countries, the time unit is the year, and there are roughly twenty to fifty observations (years) per country.

When I look at this literature, which deals with what determines the income distribution in the rich countries, unemployment, the effects of central bank independence, the size of the welfare state (1, 2, 3), taxes, the rate of wage growth in the EMU and so on, I am surprised that so little attention has been paid to variation in causes and effects between countries. The vast majority of these studies estimate one coefficient per independent variable: a "total effect" that is expected to hold in every country included.

Within comparative political economy using TSCS data, a de facto methodological standard has reigned since Beck and Katz published their extremely influential articles in 1995 and 1996, which displaced the previously dominant Parks method based on feasible GLS. Beck and Katz instead advocated OLS with a lagged dependent variable among the independent variables, panel-corrected standard errors (in 2007 Beck wrote that people had misunderstood what the P in PCSE stood for: it stands for "panel", not "panacea", he remarked sourly), and eventually also country dummies. The country dummies are almost always treated as "fixed effects", unmodeled unit effects that follow no distribution, rather than as "random effects", which are modeled effects assumed to follow a normal distribution. Heterogeneity between countries then becomes a difference in levels (varying intercepts), not in effects (varying slopes/coefficients).
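
To make the contrast concrete, the standard specification described above can be written roughly as follows. This is a sketch in my own notation, not copied from Beck and Katz: y_it is the outcome for country i in year t, x_it the vector of covariates, and phi the coefficient on the lagged dependent variable.

```latex
y_{it} = \alpha_i + \phi\, y_{i,t-1} + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it},
\qquad i = 1,\dots,N, \quad t = 1,\dots,T
% fixed effects:  each \alpha_i is a free parameter (one dummy per country)
% random effects: \alpha_i \sim \mathcal{N}(\alpha, \sigma_\alpha^2)
```

In both variants the coefficient vector beta carries no country subscript: countries may differ in level, but every covariate is assumed to have the same effect everywhere.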

In the last five years or so, however, the methodologists, including Beck and Katz, seem to have become more interested in varying effects. In an article in a 2007 special issue on TSCS of Political Analysis, the leading methodology journal in political science, Beck and Katz open the door to random coefficient models, in which not only the intercepts are allowed to vary and are modeled as following a distribution, but the beta coefficients are as well. In the introduction to the issue, Beck, as its editor, describes the article:
"My article with Katz also treats TSCS data as hierarchical, although estimation is done classically (via maximum likelihood). We show that the classical model, which is usually known as the random coefficients model, has very nice theoretical and statistical properties. This model allows for all model parameters to vary randomly over units (the Shor et al. article only considers random variation of the intercepts) and thus allows for the process that translates the covariates into the dependent variable to vary randomly across units, but not to vary in a completely arbitrary way (i.e., there is some, but not complete, uniformity in the models that describe the various units). We show that the maximum likelihood estimator for this model performs well, in that it accurately estimates variability (and also does not find variability when none exists)."
As can be seen, he makes clear that estimation is done by maximum likelihood; otherwise the pioneers of varying effects in the social sciences have been Bayesians, and the Bayesian Bruce Western's article "Causal Heterogeneity in Comparative Research" from 1998 now stands out as a paper ahead of its time. (Larry Bartels's paper "Pooling Disparate Observations" from 1996 was also written from a Bayesian perspective.) As late as 2001 Beck wrote that it was unclear how well random coefficient models would work with TSCS data, and Western's 1998 paper has only 154 citations on Google Scholar, compared with 3193 for Beck and Katz (1995) and 725 for Beck and Katz (1996).
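
For reference, a random coefficient model of the kind discussed here can be sketched in generic textbook form as below. The notation is my own assumption rather than the exact specification in Beck and Katz (2007): both the intercept and the slopes carry a country subscript i, and they are treated as draws from a common multivariate normal distribution.

```latex
y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}_i + \varepsilon_{it},
\qquad
\begin{pmatrix} \alpha_i \\ \boldsymbol{\beta}_i \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \alpha \\ \boldsymbol{\beta} \end{pmatrix},\ \boldsymbol{\Sigma}
\right),
\qquad
\varepsilon_{it} \sim \mathcal{N}(0, \sigma^2)
```

The covariance matrix Sigma governs how much the countries are allowed to differ: if it is close to zero the model collapses to the fully pooled regression, and as it grows the country estimates approach those from separate regressions per country.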


What Beck and Katz are now opening the door to, then, is RCMs. Beck (2006, pdf) presents RCMs, in comparison with separate regressions for each country (which is of course also a way of capturing causal heterogeneity), as follows:
"If T is large enough it is not ridiculous to estimate N separate time-series (and time series analysis on single countries surely has a long tradition). But T will typically not be large enough for unit time series analyses to be sensible. (Beck and Katz, N.d found with simulated data that the fully pooled model gives better estimates of the unit i even when there is heterogeneity when T < 30.) And even for larger T’s, separate time series analyses on each country make it difficult to claim that one is doing comparative politics. A very nice compromise is the 'random coefficients model' (RCM). This allows for unit heterogeneity, but also assumes that the various unit level coefficients are draws from a common (normal) distribution." (p. 9)
Beck argues that Western (1998) was in fact the first to introduce RCMs in a political science context, which says something about how new this is, astonishingly enough. Beck also notes that there has been some skepticism toward the method: "While until recently it was hard for researchers to estimate the RCM, this can no longer be an excuse for not using it." As a further argument that RCMs are no longer so exotic, he points out that Bayesian methods are not needed to estimate them; it can be done by maximum likelihood with the R package nlme by José Pinheiro and Douglas Bates. In 2006 Beck argues that researchers should use RCMs routinely, if only to diagnose whether their samples really are coherent or instead show great heterogeneity. In 2007 Beck and Katz describe the situation:
"Many analysts allow for unit-specific intercepts, that is, fixed effects. But there are relatively few attempts to go beyond this limited heterogeneity. /.../
At first glance, the commitment to homogeneity is a bit odd since a model that allows for heterogeneity, the random coefficient model (RCM), has been known under various names (hierarchical, mixed, multilevel, random coefficient, and varying parameter models, at least) for over half a century. Such models were considered in the light of comparative political economy by Western (1998). /.../
Western has described the RCM in a Bayesian context. Although his work is reasonably well cited, we have found precious few (if any) applications of Western’s method to substantive issues in comparative politics. In this article we show that the RCM, estimated via classical maximum likelihood, performs very well and should be more utilized by students of comparative political economy." (183f)
In recent years nlme has also been complemented by lme4, in which Bates is involved, and by plm ("panel linear models") by Yves Croissant and Giovanni Millo. plm is "a package doing panel data 'from the econometrician's viewpoint'", and uses GLS rather than maximum likelihood. In plm, the pvcm command can be used to run models with (time- or unit-) varying coefficients.
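
As a rough illustration of what a random coefficient model looks like in lme4, here is a minimal sketch on simulated data. The data, the variable names (y, x, country) and all parameter values are made up for the example; only lmer(), fixef(), VarCorr() and coef() are actual lme4 functions.

```r
library(lme4)

set.seed(1)
n_countries <- 15   # hypothetical panel: 15 countries
n_years     <- 30   # 30 yearly observations per country

# Country-specific intercepts and slopes drawn from normal distributions
alpha <- rnorm(n_countries, mean = 2,   sd = 1)
beta  <- rnorm(n_countries, mean = 0.5, sd = 0.3)

d <- data.frame(
  country = factor(rep(seq_len(n_countries), each = n_years)),
  x       = rnorm(n_countries * n_years)
)
idx <- as.integer(d$country)
d$y <- alpha[idx] + beta[idx] * d$x + rnorm(nrow(d), sd = 0.5)

# Random coefficient model: intercept and slope of x both vary by country.
# REML = FALSE requests maximum likelihood instead of the default REML.
fit <- lmer(y ~ x + (1 + x | country), data = d, REML = FALSE)

fixef(fit)     # average intercept and slope across countries
VarCorr(fit)   # estimated spread of the country intercepts and slopes

# Partially pooled estimates, one row per country
country_coefs <- coef(fit)$country
head(country_coefs)
```

Replacing `(1 + x | country)` with `(1 | country)` gives the more common varying-intercept model, in which only the level differs between countries. Comparing the two fits is one way to check whether the data support varying slopes at all.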

As quoted above, in 2007 Beck and Katz found (and they ought to know) "precious few (if any)" applications of the RCM that Western championed. Five years have now passed, and I have searched Google Scholar for uses of
a) Douglas Bates's introductions to the nlme and lme4 packages (e.g. "Fitting linear mixed models in R" from 2005)
and
b) Western (1998)
and
c) Beck and Katz (2007)
Below is what I found of interest from the social sciences. It is notable that there really is not that much: Bates's lme4 seems to be used more in zoology, linguistics and so on than in social science. In fact, practically the only social science work I can find that cites Bates (2005) comes from a couple of political scientists at Columbia University and Princeton; presumably they know the master statistician (and Bayesian) Andrew Gelman, who also turns up in my search himself with a paper that redoes the "What's the matter with What's the matter with Kansas?" thing for Mexico.

---
Samples from the literature:

"We use a multilevel logistic regression model, estimated using the GLMER ('generalized linear mixed effects in R') function (Bates 2005). For data with hierarchical structure (e.g., individuals within states within regions), multilevel modeling is generally an improvement over classical regression. Rather than using 'fixed' (or 'unmodeled') effects, the model uses 'random' (or 'modeled') effects, at least for some predictors. The effects within a grouping of variables (say, state-level effects) are related to each other by their grouping structure and thus are partially pooled toward the group mean, with greater pooling when group-level variance is small and for less-populated groups. The degree of pooling within the grouping emerges from the data endogenously. This is equivalent to assuming errors are correlated within a grouping structure. (See Gelman and Hill 2007, 244–8, 254–8, 262–5.)" (p. 384)
Jeffrey R Lax and Justin H Phillips, "Gay Rights in the States: Public Opinion and Policy Responsiveness" (pdf), American Political Science Review Vol. 103, No. 3 August 2009

"We study the relationship between state-level public opinion and the roll call votes of senators on Supreme Court nominees. Applying recent advances in multilevel modeling, we use national polls on nine recent Supreme Court nominees to produce state-of-the-art estimates of public support for the confirmation of each nominee in all 50 states. We show that greater public support strongly increases the probability that a senator will vote to approve a nominee, even after controlling for standard predictors of roll call voting." (from the abstract, my emphasis)
"Multilevel models to produce state estimates and analyze roll call votes were estimated using the LMER command in R (Bates 2005)" (p. 13n)
Jonathan P. Kastellec, Jeffrey R. Lax and Justin Phillips, "Public Opinion and Senate Confirmation of Supreme Court Nominees" (pdf), paper, Columbia University, 2008

"To determine the strength of our prior data, we need to know how much these state relative positions vary from election to election. For this, we need data from several elections. Let d_{s,y} be the relative position for state s in year y. We first estimate /.../ With only seven data points for each state, however, these estimates could be unreliable. We could get around this problem by assuming a common variance estimate for all states, but rather than forcing either one common estimate or fifty individual estimates, we use shrinkage estimation (also called partial pooling). Exactly how much to pull each estimate to the common mean is determined via a hierarchical model which we fit in R using lmer (Bates, 2005) and is ultimately based upon comparisons of within-state and between-state variability."
Kari Lock and Andrew Gelman, "Bayesian Combination of State Polls and Election Forecasts", paper, 2010

"For the logistic models (1), positive slopes β_j correspond to richer voters being more likely to support the PAN candidate. We summarize the models by plotting the curves logit^-1(α_j + β_j x) (for the logistic models) for each of the 32 states, and by plotting the estimated intercepts α_j and estimated slopes β_j vs. u_j, the state-level GDP per capita. We fit the models using the lmer function in R (R Development Core Team 2006, Bates 2005), following the approach of Gelman et al. (2007)."
J Cortina, A Gelman, NL Blanco, "One vote, many Mexicos: Income and vote choice in the 1994, 2000, and 2006 presidential elections" (pdf), 2008

"We estimate the multilevel models using the GLMER command in R (Bates 2005)."
footnote in Charles M Cameron, Jonathan Kastellec and Jee-Kwang Park, "Voting for Justices: Change and Continuity in Confirmation Voting 1937-2010" (pdf), paper, Princeton, 2010

"Beck and Katz (2007) show that the maximum-likelihood estimator is more efficient for random coefficient models of time-series-cross-section data than the GLS estimator used here. For the sake of consistency and comparability with the models of Iversen and Soskice as well as Persson and Tabellini, we report the results of our analysis using the same estimator as in our earlier models."
footnote in Noam Lupu and Jonas Pontusson, "Income Inequality, Electoral Rules and the Politics of Redistribution" (pdf), paper, 2008

"Methodologically, we employ a mixed logit model with random coefficients. This allows us to not only examine the factors that affect which party obtains the prime ministership, but to also explore how the influence of these factors varies across prime ministerial party selection opportunities due to unique aspects of each case that are difficult or impossible to capture in a quantitative model." (from the abstract)
"we employ a mixed logit model specified to allow for random coefficients (Train 1998, McFadden & Train 2000, Glasgow 2001). This model treats the prime ministerial party selection opportunity as the unit of analysis and allows our model coefficients to vary for unobserved or unmeasured contextual reasons. This approach allows us to strike a balance between assuming that the only meaningful variation between prime ministerial selection opportunities is captured by our independent variables and assuming that each case is so unique that it cannot be meaningfully compared to others. More generally, our application demonstrates that a random coefficients approach can help quantitative researchers address the heterogeneity and causal complexity that underlies almost all comparative politics research (Beck & Katz 2007, Western 1998)." (p. 4)
Garrett Glasgow, Matt Golder and Sona N Golder, "Who 'wins'? Determining the party of the prime minister" (pdf), paper, 2009; notably, two of these three authors (GG and MG) teach methodology at the Essex summer school, so they should be at the methodological "cutting edge".

"To take possible heteroscedasticity within the panel into account a random coefficient approach is specified. Preliminary analysis yielded that consideration of a random coefficient is sufficient for the persistence term α1 and the coefficient for the long term interest rates. Thus we assume that α(i)1 is a random variable following a normal distribution with parameters μα1 and σα1 and for one of the β we assume that it follows a normal distribution with parameters μβ3 and σβ3 , while all other parameters are constant in i. Estimation is done via the Maximum Likelihood method, see Beck and Katz (2007)." 
Christian Aßmann, Jens Boysen-Hogrefe and Nils Jannsen, "Costs of Housing Crises: International Evidence" (pdf), Kiel Institute, 2009

"I estimate two-way random effects models with varying country and year intercepts that more appropriately capture the complex relationships highlighted by the VoC approach (Western, 1998; Beck, 2001; Beck and Katz, 2007). The models present several unique advantages. First, group effects capture unit heterogeneity, that is, country-specific, time-invariant unobserved factors. Second, year effects capture differences over time that are common to all groups. The random structure of the model comes from employing varying intercept parameters for countries and years. This makes it unnecessary to use one of the indicator dummies as a base category. By employing varying intercepts for units, countries are neither assumed to be unique nor are their differences ignored (Beck, 2001:124–25).17 Instead, country effects are assumed to vary and this variance is estimated conditional on the data and parameters of the model. This partial pooling is particularly  desirable for unbalanced panels since it allows more accurate estimates of country effects. Partial pooling also alleviates the problem of slow-moving or completely time-invariant predictors (Shor et al., 2007:168). /.../
A more complicated random intercept, random slope model—also known as a random coefficient model (RCM)—can also be estimated (Western, 1998; Beck, 2001; Beck and Katz, 2007). In addition to the intercepts, RCMs allow the effect of covariates to vary by country, proving themselves useful in situations in which clusters of variables behave differently depending on the context. Neocorporatism and firm-level cooperation, for example, could be said to have effects on inequality that are different depending on the country, since high levels of neocorporatism are said to be associated with left governments, proportional representation electoral systems, highly unionized labor movements, and trade-dependent economies. In these models, all other independent variables affect the dependent variable through the variables with the random slopes. These variables are usually time-invariant institutional covariates that interact with the remaining time-varying covariates. However, the assumption that the main variables of interest—neocorporatism and firm-level cooperation—vary in their effects by country is a strong one. Even if these variables vary little over time, they cannot be said to be completely time invariant. Rather than attempt to estimate a RCM then, I let differences in context enter the model through the country and year intercepts, making the simple two-way fixed effects model a random effects multilevel model. Model coefficients can then be interpreted as reflecting the effect of a unit change in each of the independent
variables while controlling for the effect of other independent variables." (pp. 10f)
José A. Aleman, "Cooperative Institutions and Inequality in the OECD: Bringing the Firm Back In" (pdf), Social Science Quarterly 2011
