Thursday, May 24, 2018

Do you study individual differences? A Challenge

Can you solve the following problem that I think is hard, fun, and important? I cannot.

The problem is that of characterizing individual differences for people performing cognitive tasks.  Each task has a baseline and an experimental condition, and the difference, the effect, is the target of interest.  Each person performs a great number of trials in each condition of each task, and the outcome on each trial is quite variable (necessitating the multiple trials).  There are a number of tasks, and the goal is to estimate the correlation matrix among the task effects.  That is, if a person has a large effect in Task 1, are they more likely to have a large effect in Task 2?

Let's try an experiment with simulated data: 200 people, 6 tasks, and 150 replicates per person per task per condition.  When you factor in the two conditions, there are 360,000 observations in total.  Our goal is to estimate the 15 unique correlation coefficients in the 6-by-6 correlation matrix of task effects.  Note that we have what seems like a lot of data, 360K observations, for just 15 critical parameters.  Seems easy.

Unfortunately, the problem is more difficult than one might think, at least for settings that I think are realistic for priming and context tasks.

Here is code to make what I consider realistic data:

library(mvtnorm)

set.seed(123)
I=200 #ppl
J=6 #tasks
K=2 #conditions
L=150 # reps
N=I*J*K*L
sub=rep(1:I,each=J*K*L)
task=rep(rep(1:J,each=K*L),I)
cond=rep(rep(1:K,each=L),I*J)
subtask=cbind(sub,task)

# true correlation matrix among the task effects
myCor=diag(J)
myCor[1,2]=.8
myCor[3,4]=.8
myCor[5,6]=.8
myCor[1,3]=.4
myCor[1,4]=.4
myCor[2,3]=.4
myCor[2,4]=.4
myCor[lower.tri(myCor)]  <- t(myCor)[lower.tri(myCor)]
myVar=.02^2*myCor

t.alpha=rmvnorm(I,rep(1,J),diag(J)*.2^2)  # true baselines, per person per task
t.mu=rep(.06,J)                           # mean effect per task
t.theta=rmvnorm(I,t.mu,sigma=myVar)       # true effects, per person per task
t.s2=.25^2                                # trial-level noise variance
t.cell=t.alpha[subtask]+(cond-1)*t.theta[subtask]  # true cell means
rt=rnorm(N,t.cell,sqrt(t.s2))             # observed response times
dat=data.frame(sub,task,cond,rt)

When you create the data, you are trying to estimate the correlation matrix of t.theta, the true effects per person per task:

cor(t.theta)

You will notice that Tasks 1 and 2 are highly correlated, Tasks 3 and 4 are highly correlated, and Tasks 5 and 6 are highly correlated.  And there is moderate correlation between Tasks 1 and 3, 1 and 4, 2 and 3, and 2 and 4.  The rest are down in the weeds, near zero.  Can you estimate that pattern?

If you just take means as estimators, you are swamped by measurement error.  The tight correlation among the pairs of tasks is greatly attenuated.  Here is the code:

mrt=tapply(dat$rt,list(dat$sub,dat$task,dat$cond),mean)  # cell means per person per task per condition
sample.effect=mrt[,,2]-mrt[,,1]  # observed effect per person per task
cor(sample.effect)
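
A quick back-of-the-envelope calculation shows why.  With the true values used above, the reliability of each sample effect is only about .3, so even the .8 correlations should show up near .26 in the sample:

# rough reliability of a sample effect, using the true simulation values:
# true effect variance is .02^2; the noise variance of a difference between
# two means of L trials each is 2*t.s2/L
rel = .02^2/(.02^2 + 2*t.s2/L)
rel        # about .32
.8*rel     # a true correlation of .8 is expected to appear as roughly .26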


I guess I am wondering if there is any way to recover the correlations with acceptable precision.  Perhaps they are forever lost to measurement noise.  I certainly cannot with my home-baked, roll-your-own Bayesian models.  If I tune the priors, I can get high correlations where I am supposed to, but the other ones are too variable to be useful.  So either the problem is not so tractable or my models/methods are inferior.  I can share what I did if you wish.

So, can you recover the 15 correlations with acceptable precision?  I appreciate your help and insight.

Best,
Jeff

6 comments:

Mike Lawrence said...

Back in my frequentist days, I'd have done:

# create condition contrasts for each task
for(i in unique(dat$task)){
  dat$temp = ifelse(dat$cond==1, -.5, .5)
  dat$temp[dat$task!=i] = 0
  names(dat)[ncol(dat)] = paste0('t', i)
}

library(lme4)
dat$sub = factor(dat$sub)
fit = lmer(
  data = dat
  , formula = rt ~ t1+t2+t3+t4+t5+t6 +
      (1+t1+t2+t3+t4+t5+t6|sub)
)
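
The model-implied correlations among the task effects should then sit in the fitted random-effect covariance; something along these lines ought to pull them out:

# correlation matrix of the random effects: row/column 1 is the intercept,
# rows/columns 2:7 are the task effects t1..t6
vc = VarCorr(fit)$sub
round(attr(vc, "correlation")[2:7, 2:7], 2)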

Jeff Rouder said...

Thanks Mike, I am doing something similar, I think:

dat$sub=as.factor(dat$sub)
dat$task=as.factor(dat$task)
dat$cond=as.factor(dat$cond)
freq=lmer(rt~cond*task+(1+task*cond|sub),data=dat)

Maybe your approach is better.

Jeff Rouder said...

Mike, this is the output with your code. Mine doesn't ever finish; at least yours does. As you can see, the correlations are nonsense given the true values in cor(t.theta). This is one hard problem. Thanks for the input.


> cor(est)
t1 t2 t3 t4 t5 t6
t1 1.0000000 0.9647629 0.8244395 0.4975749 -0.0959138 -0.5465952
t2 0.9647629 1.0000000 0.6934528 0.2552608 -0.3476396 -0.7467056
t3 0.8244395 0.6934528 1.0000000 0.6791414 0.3895206 -0.1599528
t4 0.4975749 0.2552608 0.6791414 1.0000000 0.7466473 0.4520473
t5 -0.0959138 -0.3476396 0.3895206 0.7466473 1.0000000 0.8431023
t6 -0.5465952 -0.7467056 -0.1599528 0.4520473 0.8431023 1.0000000

Daniel Heck said...

Hi Jeff,

I think it is necessary to correct for measurement error, which attenuates the estimated descriptive correlations. In a frequentist setting, the standard attenuation correction (https://en.wikipedia.org/wiki/Correction_for_attenuation) can be applied to the full correlation matrix (assuming that the measurement error of the observations is comparable across participants):


##### (A) compute reliability and use attenuation correction:

# standard error of the means per person:
se <- function(x) sd(x)/sqrt(length(x))
sert <- tapply(dat$rt, list(dat$sub,dat$task,dat$cond), se)

# (squared) standard error of difference in means:
se2.diff <- colMeans(sert[,,2]^2 + sert[,,1]^2)

# (descriptive) variance of effect size estimates:
var.diff <- apply(sample.effect, 2, var)

# reliability of estimates:
# = ( VAR(estimates) - SE(estimates)^2 ) / VAR(estimates)
reliability <- (var.diff - se2.diff) / var.diff

# attenuation correction:
cor(sample.effect) / sqrt(reliability %o% reliability) # (ignore the diagonal!)


##### (B) same results, but more straightforward computation:

# subtract the variance that is due to measurement error of the difference in means:
cov.diff <- cov(sample.effect) - diag(se2.diff)

# get correlation matrix:
cov2cor(cov.diff)


1 2 3 4 5 6
1 1.00000000 0.92105424 0.69400657 0.09616434 0.0939602 0.01367065
2 0.92105424 1.00000000 0.07098673 -0.02379419 -0.3019451 -0.24332759
3 0.69400657 0.07098673 1.00000000 0.55259234 0.1436169 -0.28754694
4 0.09616434 -0.02379419 0.55259234 1.00000000 0.2136787 0.07872665
5 0.09396020 -0.30194507 0.14361689 0.21367874 1.0000000 0.69322743
6 0.01367065 -0.24332759 -0.28754694 0.07872665 0.6932274 1.00000000

Note that with I=3000, the estimates are even closer to the true correlations.
Moreover, it should be possible to construct a similar model in a Bayesian setting.
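
For instance, reusing the contrast columns t1, ..., t6 that Mike created above, a hierarchical Bayesian version could be sketched with brms (untested here; default priors on the random-effect correlation matrix and default sampler settings):

# untested sketch: the group-level correlation block of this model is the
# latent analogue of cor(t.theta)
library(brms)
fit.b <- brm(rt ~ t1+t2+t3+t4+t5+t6 + (1+t1+t2+t3+t4+t5+t6 | sub),
             data = dat, chains = 4, cores = 4)
summary(fit.b)  # correlations are listed under "Group-Level Effects"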

Jeff Rouder said...

Daniel, thank you. That level of estimation accuracy is comparable with the hierarchical Bayesian approaches I am using. The sad state is that this simulation is a realistic best-case scenario. Sure, I can push I to 400 or so, but L is already pretty big. The problem, I think, is that for these settings the likelihood ridge over possible correlation matrices is pretty darn flat. And that is why I think these types of studies may fail.

Daniel Heck said...

Yes, I completely agree. The error variances within and between persons add up and make it close to impossible in practice to get a precise point estimate of the "true" correlation matrix.

However, it might still be possible to test specific theories about the overall structure of the matrix. As in SEM, we could test the null hypothesis (a diagonal correlation matrix) against a specific, theoretically derived alternative correlation structure (e.g., a model with only two freely estimated correlations: r[1,2]=r[3,4]=r[5,6] and r[1,3]=r[1,4]=r[2,3]=r[2,4]).
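
A rough sketch of that idea (not a proper SEM; the parameterization and starting values below are just placeholders): model the observed per-person effects as multivariate normal with a covariance built from a structured true-effect part plus the measurement-error variances se2.diff from above, and compare the structured model against the diagonal one by maximum likelihood.

library(mvtnorm)

# the two-parameter structure: r[1,2]=r[3,4]=r[5,6] and r[1,3]=r[1,4]=r[2,3]=r[2,4]
structured.cor <- function(r.within, r.across) {
  R <- diag(6)
  R[1,2] <- R[3,4] <- R[5,6] <- r.within
  R[1,3] <- R[1,4] <- R[2,3] <- R[2,4] <- r.across
  R[lower.tri(R)] <- t(R)[lower.tri(R)]
  R
}

# negative log-likelihood of the observed effects under a structured model
negll <- function(par, R.fun) {
  s <- exp(par[1:6])                                   # true-effect SDs per task
  Sigma <- diag(s) %*% R.fun(par) %*% diag(s) + diag(se2.diff)
  if (min(eigen(Sigma, only.values = TRUE)$values) <= 0) return(1e10)
  -sum(dmvnorm(sample.effect, colMeans(sample.effect), Sigma, log = TRUE))
}

# null: diagonal correlation structure
fit0 <- optim(rep(log(.03), 6), negll, R.fun = function(par) diag(6),
              control = list(maxit = 5000))
# alternative: two free correlations (tanh keeps them in (-1, 1))
fit1 <- optim(c(rep(log(.03), 6), 1, .4), negll,
              R.fun = function(par) structured.cor(tanh(par[7]), tanh(par[8])),
              control = list(maxit = 5000))
# likelihood-ratio comparison with 2 extra parameters
pchisq(2 * (fit0$value - fit1$value), df = 2, lower.tail = FALSE)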