Tuesday, November 17, 2015

Race At Mizzou: Comments on the Events of Last Week.

All week long people have been asking me about events at The University of Missouri, a system which serves 77,000 students across four campuses.  For those of you who don't know, what started as a hunger strike by a single student over the racial climate on campus garnered national attention when the football team decided they would not play football until the system-wide president, Tim Wolfe, stepped down.  He subsequently resigned along with Mizzou Chancellor Bowen Loftin.

Since Saturday, I have had numerous conversations with family and friends, both in person and online.  They have been illuminating to say the least.  Even after all the reporting, the events seem too incredible for many.  So I thought I would weigh in.   For the record, I am a White, male faculty member who has been here for 15 years and has no expertise in race relations.  This is my take; use it as you will.

Why the hunger strike?

If you read the media, you might think the hunger strike and protests were about one very gross swastika written in human excrement.   And you would be wrong.  The hunger strike, along with a series of protests, was motivated by calls for the administration to understand the racial climate and take concrete steps to improve it.

So, what is the racial climate at Mizzou?

According to the students and faculty, whom I fully believe, there are repeated incidents of overt racial hostility such as use of the N-word.  This is not to say that all Black people are called the N-word every day.  But it happens, and it happens far too often.  One of the more infamous incidents was the spreading of cotton balls at the Black Cultural Center.  Nothing says, "You are slaves and you don't belong here" like cotton.  (Infuriating aside: the perpetrators were convicted of littering.)  While overt racism is pretty easy to spot, covert racism is no less damaging.  More than a small fraction of White students wonder whether Black students are here because they are Black.  Black students rarely receive the same benefit of the doubt, and their place on campus is under constant but subtle assault.  And well-meaning White allies are often allies not out of any deep empathy for the experiences of Black students, but to feel better about themselves.  How many White people know a Black person in more than just a passing role?  How many know their story: how many siblings they have, where they grew up, what the formative moments in their lives were?

I suspect the racial climate here is about the same as it is on most campuses. Even so, Mizzou is intimately tied to St. Louis.  We are St. Louis' little school in Columbia, and St. Louis is one of the most segregated, disparate, inequitable cities in the US.  We necessarily have St. Louis' racial problems.  And the events here need to be viewed in the context of racial justice in the wake of Ferguson, just 100 miles away.

Aren't the Protesters Oversensitive Crybabies?

Times have changed, and we need to change with them.  Up to about 15 years ago, perhaps every single men's bathroom stall in America had a disparaging comment about gays, women, Blacks, or Jews.  It was so common that nobody I knew thought anything of it.  It was just what was in bathrooms, along with toilets and toilet paper.  As kids, we called each other all sorts of epithets.   The most common insult was "you're so gay."  We were not sensitive enough then.  Period.   I like our newfound sensitivity.  It is a good thing.  Why should I have to experience swastikas in bathrooms?  Fuck that.  Why should these kids have to tolerate any indignities related to their Blackness?  Let's agree that racist assholes have the right to say what they do; the rest of us have the right to be pissed off about it and to isolate them.  We also have the right to have our leaders condemn them and develop a culture that isolates them.

Outsiders should know the true character of these impressive young people who protested.  They were cool, collected, and focused.  I never saw any hint of aggression or violence or even recrimination.   Other than the media, everyone was welcome.  Two elements were evident in the protest.  The first was God's grace. These young people had a love of people that they attributed to their love of God.  It was a theme, and it was an instantiation of the best of Christianity.  The second was intersectionality.   Intersectionality refers to a common set of dynamics that marginalized people may experience due to race, religion, ethnicity, gender, sexual orientation, position in society, disability, and so on.  The protesters clearly understood intersectionality and saw their protest rooted in an administration that had not responded to graduate-student rights, had been slow to respond to campus rape culture, and had been hostile toward Planned Parenthood.  This is not to say that all young people held this intersectionality, but it was in the consciousness of the Concerned Student 1950 leaders.

Was It Fair To Call For Wolfe's Head?

There is nothing that has perplexed outsiders as much as the protesters' call for Wolfe's resignation.  "What did Wolfe do?" asked Joe Scarborough.  Tim Wolfe has a sterling reputation in town as a genuinely nice guy and a tireless advocate for the University.  In actuality, he is not responsible for the campus racial climate---that is the Chancellor's responsibility.   So, it might seem that Wolfe was a victim.  People in town are upset that a good man was railroaded, and they are equally upset with the lack of process.  It is obvious to me that Wolfe cares deeply about the University.  To his credit, his resignation was voluntary, and he left without any severance package.

Nonetheless, Wolfe suffered perhaps two self-inflicted wounds.  Had he avoided either of them, he probably would be system president today.  Here they are:

1. Wolfe was not sufficiently empathetic to the plight of Black students on campus.  He froze when the students stopped his car at homecoming, and did not go out and talk to them.  He was defensive when he approached the protesters at a KC fundraiser.  There he said systematic oppression is "when you believe you don't have the same opportunity."  In this response, it was painfully and nakedly obvious that he put the onus back on the students.  Perhaps it is no coincidence that the football team made their boycott decision the very next day.  The bottom line is the impression that Wolfe did not or could not feel for his students.  Perhaps if Tim Wolfe had taken a diversity course at Mizzou or Harvard he would have been more empathetic.

2. Wolfe was too ideological in his corporate orientation.  Wolfe seemingly ran the University like a mid-cap technology company.  Except the University is far more varied and organic than a company.  Wolfe alienated faculty by treating them as dispensable labor.   He had each campus set strategic plans but gave us few resources to meet them.  Instead, he made the units give back 2% of their budgets to the system for redistribution toward these goals.  We were supposed to hire superstar senior faculty to improve our research reputation.  That is a very expensive, difficult, and slow way of improving.  And it by-and-large failed.  If you are in an underrated, out-of-the-way, difficult-to-travel-to small city like Columbia, the best strategy is to hire hungry, appreciative junior faculty.   Loyalty comes from sound development of these junior faculty.    My department's experience is illustrative.  We have not been able to hire, and after this year we will comprise 35 senior and 2 junior members.  That is an unhealthy proportion.

Additionally, the raise structure was draconian and inequitable.  In one year, the top few performers received 20% and the rest received virtually nothing.  The next year there were no raises at all.  Morale plummeted.  Previously, I had always felt respected at Mizzou not only by my department but by the administration.  This year, in contrast, I felt respected by my department but not by the administration.  Instead, I felt disposable.

By last week, Wolfe had few allies among the faculty on the Mizzou campus.  I suspect if he had simply listened fearlessly to the students with an empathetic ear or had been doing better by the bulk of the faculty and staff, he would still be here.

What About Chancellor Loftin?

One of the least appreciated aspects of this story is the forced resignation of Chancellor Loftin.  He had actually won over the support of the protesters, and they wanted him to stay on.  Instead, he was forced out by the deans.  In an unprecedented move, nine of the dozen or so deans got together and publicly called for his resignation.  It was a mutiny.  And for the record, they did so before Wolfe resigned, though the letter was published afterwards.  It was a brave action, and it was the right thing to do; had the deans not done so, more and more departments, or perhaps the full faculty body, would have formally expressed no confidence.   Loftin was a deeply flawed leader whose actions, words, and deeds led to his forced resignation.  I would rather not spend any more time on this part of the story.

What Actions Could the Administration Have Taken About Race?

Many view Wolfe as unfairly victimized.  I argued above that his inability to address students and his alienation of the faculty through an extreme business view of the University were factors in his undoing.  Yet unaddressed is the all-important question of official University policy and action to address the racial climate.  Here is some background as I understand it.  Our former chancellor, Brady Deaton, said all the right things.  He started a Chancellor's Diversity Initiative and a unity campaign called "One Mizzou."  We had a Chief Diversity Officer, and my department had a commitment from the administration that if we could hire minority faculty they would help with funding.  The number of Black undergraduate students has doubled since 2000, which is faster than the pace of overall student growth.

But the momentum stalled around 2010.  Diversity became more about Mizzou's image than any real culture change.  One Mizzou was co-opted into a slogan to sell football tickets; diversity was about getting the right colored students on the brochures and webpages.   Black faculty were resigning at the same rate they were being hired, often after just a few years here.  The faculty voted against a diversity-course requirement.   The chief diversity officer resigned, and was replaced by a staff person without tenure rather than a faculty member with tenure.  As the strategic focus at Mizzou shifted to research, diversity was shoved aside.

I will defer to more knowledgeable others about what should be done.  Clearly, what we need is fearless listening and good will from faculty, administrators, and students.  We also need resources.  I suspect Mizzou has a Black faculty retention problem that needs critical attention.  I do not know whether the reasons Black faculty leave at the rate they do have been thoroughly explored or what can be done, but retention should clearly come into the spotlight.

The Blowback.

The most distressing consequence to me has been the severe blowback.  Individuals have had death threats made against them, and three White students terrorized the campus with threats of mass violence.  Outsiders have created fake Twitter accounts posting horrible things to discredit the protesters.  And we have been excoriated in the right-wing media.  Politicians lambast us, including those who control our purse strings.  Around town and across the state, people view the events negatively.   Some people are worried that we are or will be viewed as a racist, backwater, redneck place.  Some people are worried that White people will feel threatened.  It has been demoralizing and emotionally exhausting to say the least.

I am Proud of Mizzou.

I think the events of last week were hugely positive.  The Black students went from marginal to central, and their voices were truly heard perhaps for the first time.  The institution through its own dynamics asserted to the Curators, to the politicians, and to the people of Missouri that we are a university rather than a business.  We have our own culture and dynamics that cannot be micromanaged by political interests.

Mizzou is embarking on a difficult conversation about diversity.  I think much good will come of it, and we have already become a better institution for the protests.  We are now leading; stay tuned.

Sunday, May 31, 2015

Simulating Bayes Factors and p-Values

I see people critiquing Bayes factors based on simulations these days; examples include recent blog posts by Uri Simonsohn and Dr.-R. These authors assume some truth, say a true effect size of .4, and then simulate what the distribution of Bayes factors looks like across many replicate samples.  The resulting claim is that Bayes factors are biased and don't control long-run error rates.   I think the use of such simulations is not helpful.  With tongue in cheek, I consider them frequocentrist.  Yeah, I just made up that word.  Let's pronounce it "freak-quo-centrist."  It refers to using frequentist criteria and standards to evaluate Bayesian arguments.

To show that frequocentric arguments are lacking, I am going to do the reverse here.  I am going to evaluate p-values with a Bayescentric simulation.

I created a set of 40,000 replicate experiments of 10 observations each.  Half of these sets were from the null model; half were from an alternative model with a true effect size of .4.   Let's suppose you picked one of these 40,000 and asked if it were from the null model or from the effect model.  If you ignore the observations entirely, then you would rightly think it is a 50-50 proposition.  The question is how much do you gain from looking at the data.

Figure 1A shows the histograms of observed effect sizes for each model.  The top histogram (salmon) is for the effect model; the bottom, downward going histogram (blue) is for the null model.  I drew it downward to reduce clutter.

The arrows highlight the bin between .5 and .6.  Suppose we had observed an effect size there.  According to the simulation, 2,221 of the 20,000 replicates under the alternative model are in this bin.  And 599 of the 20,000 replicates under the null model are in this bin.   If we had observed an effect size in this bin, then the proportion of times it comes from the null model is 599/(2,221+599) = .21.  So, with this observed effect size, the probability goes from 50-50 to 20-80.  Figure 1B shows the proportion of replicates from the null model, and the dark point is for the highlighted bin.  As a rule, the proportion of replicates from the null decreases with effect size.

We can see how well p-values match these probabilities.  The dark red solid line shows the one-tailed p-values, and they are miscalibrated.  They clearly overstate the evidence against the null and for an effect.  Bayes factors, in contrast, get this problem exactly right---it is the problem they are designed to solve.  The dashed lines show the probabilities derived from the Bayes factors, and they are spot on.  Of course, we didn't need simulations to show this concordance.  It falls directly from the law of conditional probability.
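The whole exercise is easy to reproduce.  Below is a minimal sketch in Python (the original analysis may well have been run in other software, and exact counts will differ with the random seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)
n, reps = 10, 20000                   # 20,000 replicates per model, 40,000 total

def observed_effect_sizes(true_d):
    x = rng.normal(true_d, 1.0, size=(reps, n))
    return x.mean(axis=1) / x.std(axis=1, ddof=1)

d_null = observed_effect_sizes(0.0)   # replicates under the null model
d_alt = observed_effect_sizes(0.4)    # replicates under the effect model

# Of the replicates landing in the .5-.6 bin, what proportion came from the null?
null_count = np.sum((d_null >= 0.5) & (d_null < 0.6))
alt_count = np.sum((d_alt >= 0.5) & (d_alt < 0.6))
prop_null = null_count / (null_count + alt_count)
print(round(prop_null, 2))            # about .21, as in the text

# The one-tailed p-value at the bin's center is far smaller, that is,
# it overstates the evidence against the null
p_value = stats.t.sf(0.55 * np.sqrt(n), df=n - 1)
print(round(p_value, 2))              # roughly .06, not .21
```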

Some of you might find this demonstration unhelpful because it misses the point of what a p-value is and what it does.  I get it.  It's exactly how I feel about others' simulations of Bayes factors.

This blog post is based on my recent PBR paper, Optional Stopping: No Problem for Bayesians.  It shows that Bayes factors solve the problem they are designed to solve even in the presence of optional stopping.

Monday, May 18, 2015

Ben's Letter: Who Raised This Kid?

UPDATE (5/21): With hurt feelings all around, Ben has been excused for the two days.  We are grateful.  It is a tough situation because their primary concern is safety, and I get that.  Hopefully, the hurt feelings will slowly melt, because the camp folks have done right by us over the years.



"I just wrote camp a letter," said Ben.  "Oh God," I thought.  "Please, this is a delicate situation," I said to myself.  "He is just going to make it worse...."

My 16-year-old son has been going to camp for some six or seven years, and he loves it there.  We trusted this camp with our kids and they have delivered year after year.  We have a relationship of sorts.  He is scheduled to be a first-year counselor, and is excited.

My mother-in-law  is 90 and is in failing health.  She won't be able to travel to my nephew's wedding or my daughter's Bar Mitzvah in the coming year, assuming she is still alive then.

The conflict is that after much wrangling over dates, my wife's family is honoring my mother-in-law the first weekend of training for Ben's counselor gig.  Everyone will be there, and we thought it would be a no-brainer for camp to excuse him for the weekend.  He knew everyone; he knew camp; he knew the routines.  But they did not.  And we are very hurt.

We have been going back and forth with them, expressing our hurt and listening to their reasons, and Ben has been cc'd on the emails.

And then, with dread, I read this from him:

Dear (I redacted the name, it is not important),

I'm writing this E-mail because I don't think my Mom is going to say this - you are completely missing the point.  In your emails you used words such as "Birthday Party" to describe the event that we wish to attend.  Not only is this inaccurate, it completely undertones the value of this event.  This isn't just some "Birthday Party." It’s meant to be the last time my Grandma will ever be able to see her ENTIRE family alive.  It’s about being able to celebrate my Grandma's life while she is still alive, because the scary truth is, if I don't go, the next time I will be in California will probably be her funeral.  To our family, this isn't a "Birthday Party," it is like a Bar/Bat Mitzvah, and to others in our family it is even more important and valuable than one. To define this event as a "Birthday Party" is not only severely incorrect, but just reflects a lack of understanding of what this means to my Mom and my Family.  

I understand that missing two days is very inconvenient for the camp, but if I do go the plan is to come back on the 14th. If camp starts on the 21st that gives me 7 days to bond with the other counselor's (whom I already know) and to learn the camp rules.  I also understand that you have to think about the safety of the campers and I truly do respect that, but to tell me that missing two days of content is going to endanger my campers, and that I won't be able to make up those two days of content, is absurd.  If missing two days is truly going to put my campers at risk, then tell us what we’re missing, prove us wrong, because to us it sounds like you're putting bonding time (with people I already know) over seeing my grandma for a possible last time, especially since you said that you would be able to handle it if I missed a few days for a family emergency.  If anything I said at all reflects that I don't understand the gravity of missing camp, then tell me exactly how missing camp will put my campers in danger, because I'm trying to understand your situation but I really don't. To us, your situation sounds like an excuse compared to what could be the last time our entire family is united before my grandma dies.  

The only point you've made in this argument that I've seen as valid is you mentioning the contract.  Yes, I signed the contract saying that I would be there, and yes, technically by not going I would NOT be honoring my contract.  But what is more important in life, and dare I say it, in Judaism - honoring a contract or honoring a family?  I personally feel like family is a much more important concept in Judaism than honoring a contract, and I feel like that should also be a value a Jewish Camp respects. It shouldn't be camp policy to turn someone down because they want to try to see their grandma one last time with their entire family.  My family has already tried to change the date. They tried everything before making me aware of this date, and it just won't work any other day.  You mentioned that there would be an exception if there was a family crisis (funeral), but that only further reflected your lack of understanding towards our scenario. In saying this, you send the message that it’s more important to celebrate the life of someone when they are dead, rather than celebrating their life when they are alive.  I feel like this contradicts many important values in Judaism.  Rules and policies shouldn't restrict the celebration of family, or the values of Judaism.  

You could tell me that the fact that I'm putting family over camp is my decision and not the camp's decision, but the point that we are trying to make is that it is camp's policy that is forcing me to have to make a decision, and that is disgraceful.  I shouldn't have to choose between my two families, especially if the one forcing the decision is Jewish, but right now I'm being forced to all because in your eyes, two days of bonding is more important than seeing my entire family together one last time.  Family is supposed to be an important value in Judaism and should not be a topic you can deescalate by calling our important gathering a mere "Birthday Party".  I hope this Email both makes our anger, and disappointment towards your decision clear, but also shows how we view your perspective.  If you could help us better understand how your two days of training is more important than seeing a scattered family united one last time before my Grandma dies, then maybe this decision will be easier to make.  If the only way you will excuse us is if we have a family emergency, then consider this a family emergency. That is how important this is.    

Ben Rouder

It is a wondrous feeling when your child is more elegant, logical, articulate, and authentic than you could have imagined.

Sunday, May 17, 2015

To Better Know A Bayesian

The Self-Propagated Myth of Bayesian Unity

Substantive psychologists are really uncomfortable with disagreements in the methodological and statistical communities.  The reason is clear enough---substantive psychologists by-and-large just want to follow the rules and get on with it.

Although our substantive colleagues would prefer a unified and uniform set of rules, we methodologists don't abide.  Statistics and methodology are varied fields with important, different points of view that need to be read, understood, and discussed.

Bayesian thought itself is not uniform.  There are critical, deep, and important differences among us, so much so that behind closed doors we have sharp and negative opinions about what others advocate.  Yet, at least in the psychological press, we have been fairly tame and reticent to critique each other.   We fear our rule-seeking substantive colleagues may use these differences as an excuse to ignore Bayesian methods altogether.  That would be a shame.

In what follows, I give the briefest and coarsest description of the types of Bayesians out there.  In the interest of being brief and coarse, I am going to do some points-of-view an injustice.  Write me a nice comment if you want to point out a particular injustice.  My hope is simply to do more good than harm.

Also, I am not taking names.  You all know who you are:

Strategic vs. Complete Bayesians:

The first and most important dimension of difference is whether one uses Bayes Rule completely or strategically.  

Complete Bayesians are those who use Bayes rule always, usually in the form of Bayes factors.  They are willing to place probabilities on models themselves and use Bayes rule to update these probabilities in light of data.  The outline of the endeavor is that theories naturally predict constraint in data, and these constraints are captured by models.  Model comparison provides a means of assessing competing theoretical statements of constraint, and the appropriate model comparison is by Bayes factors or posterior odds.  In this view, models predict relations among observables, and parameters are convenient devices to make conditional statements about these relations.  Statements about theories are made based on predictions about data rather than about parameter values.  This usage follows immediately and naturally from Bayes rule.
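As a toy numeric illustration of this complete use of Bayes rule (the numbers here are invented for the example): if two models start at even odds and the data are six times more probable under the alternative, the posterior probability of the alternative is 6/7.

```python
prior_odds = 1.0      # models start at even odds (an assumption of the example)
bayes_factor = 6.0    # data are six times more probable under the alternative
posterior_odds = bayes_factor * prior_odds
posterior_prob = posterior_odds / (1.0 + posterior_odds)
print(round(posterior_prob, 3))   # 0.857, i.e., 6/7
```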

Strategic Bayesians are those who use Bayes rule for updating parameters and related quantities, but not for updating beliefs about models themselves.  In this view, parameters and their estimates become the quantities of interest, and the results are naturally interpretable in theoretical contexts. These Bayesians stress highest density regions, posterior predictive p-values, and estimation precision.  Strategic Bayesians may argue that the level of specification needed for Bayes factors is difficult to justify in practice, especially given the attractiveness of estimation.

The Difference: The difference between Complete and Strategic Bayesians may sound small, but it is quite large.  At stake are the very premises of why we model, what a model is, how it relates to data, what counts as evidence, and what roles parameters and predictions play.  Some statisticians, philosophers, and psychologists take these elements very seriously.  I am not sure anyone is willing to die on a hill in battle for these positions, but maybe.

I would argue that the difference between Complete and Strategic Bayesians is the most important one in understanding the diversity of Bayesian thought in the social sciences.   It is also the most difficult and the most papered over.

Subjectivity vs. Objectivity in Analysis

The nature of subjectivity is debated in the Bayesian community.  I have broken out here a few positions that might be helpful.

Subjective Bayesians ask analysts to query their beliefs and represent them as probability statements on parameters and models as part of the process of model specification.  For example, if a researcher believes that an effect should be small in size and positive, they may place a normal on effect size centered at .3 with a standard deviation of .2.  This prior would then provide constraint for posterior beliefs.
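For the normal-prior example above, the update is conjugate and can be sketched directly.  The observed effect size and its standard error below are hypothetical values chosen for the illustration:

```python
import numpy as np

mu0, sd0 = 0.3, 0.2      # subjective prior from the text: Normal(.3, .2)
d_obs, se = 0.5, 0.15    # hypothetical observed effect size and its standard error

# Conjugate normal-normal update, working in precisions (1/variance)
prior_prec, data_prec = 1 / sd0**2, 1 / se**2
post_prec = prior_prec + data_prec
post_mean = (mu0 * prior_prec + d_obs * data_prec) / post_prec
post_sd = np.sqrt(1 / post_prec)
print(round(post_mean, 2), round(post_sd, 2))   # 0.43 0.12 -- belief pulled toward the data
```

The posterior mean lands between the prior center and the data, weighted by their precisions, which is exactly the constraint the prior provides.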

A variant of the subjective approach is to consider the beliefs of a generic, reasonable analyst rather than personal beliefs. For example, I might personally have no faith in a finding (or, in my case, most findings), yet I may still assign probabilities to parameters and hypotheses that I think capture what a reasonable colleague might believe.  This process is familiar and natural---we routinely take the position of others in professional communication.

Objective Bayesians stipulate desirable properties of posteriors and updating factors, and choose priors that ensure these properties hold.  A simple example might be that in the large-sample limit, the Bayesian posterior of a parameter should converge to the true value.  Such a desideratum would necessitate priors with appropriate support, say all positive reals for a variance parameter or all values between 0 and 1 for a probability parameter.

There are more subtle examples.  Consider a comparison of a null model vs. an alternative model.  It may be desirable to place the following constraint on the Bayes factor.  As the t-value increases without bound, the Bayes factor should favor without bound the alternative.  This constraint is met if a Cauchy prior is placed on effect size, but it is not met if a normal prior is placed on effect size.

There are many other desiderata that have been proposed to place constraints on priors in a variety of situations, and understanding these desiderata and their consequences remains the topic of objective Bayesian development.

The Difference:

My own view is that there is not as much difference between the objective and subjective points of view as there might seem.

1.  Almost all objective criteria yield flexibility that still needs to be subjectively nailed down.  For example, if one uses a Cauchy prior on effect size, one still needs to specify a scale setting.  This specification is subjective.

2.  Objective Bayesian statisticians often value substantive information and are eager to incorporate it when available.  The call to use desiderata is usually made in the absence of such substantive information.

3.  Most subjective Bayesians understand that the desiderata are useful as constraints and most subjective priors adopt some of these properties.

4.  My colleagues and I try to merge and balance subjective and objective considerations in our default priors.  We think these are broadly though not universally useful.  We always recommend they be tuned to reflect reasoned beliefs about the phenomena under consideration.  People who accuse us of being too objective may be surprised by the degree of subjectivity we recommend; those who accuse us of being too subjective may be surprised by the desiderata we follow.

Take Home

Bayesians do disagree over when and how to apply Bayes rule, and these disagreements are critical.  They also disagree about the role of belief and more objectively-defined desiderata, but these disagreements seem more overstated, especially in light of the disagreements over how and when Bayes rule should be used.  

Sunday, May 10, 2015

Using Git and GitHub to Archive Data

This blog post is for those of you who have never used Git or GitHub.   I use Git and GitHub to archive my behavioral data.    These data are uploaded to GitHub, an open web repository, where they may be viewed by anyone at any time without restriction.  This upload occurs nightly; that is, the data are available within 24 hours of their creation.  The upload is automatic---no lab personnel are needed to start it or approve it.  The upload is comprehensive in that all data files from all experiments are uploaded, even those that correspond to aborted experimental runs or pilot experiments.  The data are uploaded with time stamps and with an automatically generated log.  The system is versioned so that any changes to data files are logged, and both the new and old versions are saved.  In summary, if we collect it, it is there, and it is transparent.   I call data generated this way Born Open Data.

Since setting up the born-open-data system, I have gotten a few queries about Git and GitHub, the heart of the system.  Git is the versioning software; GitHub is a place on the web (github.com) where the data are stored.  They work hand in hand.
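For the curious, the nightly job can be a small script run by cron or a task scheduler.  Here is a hypothetical Python sketch; the repository path, branch, and commit-message format are placeholders, and my actual setup surely differs in its details:

```python
import subprocess
from datetime import date

DATA_DIR = "/path/to/lab/data"   # hypothetical location of the local data repository

def build_commit_message(day):
    """Timestamped log entry recorded with each nightly snapshot."""
    return f"Nightly data upload, {day.isoformat()}"

def nightly_upload(repo=DATA_DIR):
    """Stage all new and changed files, commit, and push to GitHub.

    Git versions the files, so edits to existing data files are logged
    and old versions remain recoverable.  In practice the commit step
    should be skipped when nothing has changed, since git reports an
    error for an empty commit.
    """
    def git(*args):
        subprocess.run(["git", "-C", repo, *args], check=True)
    git("add", "--all")
    git("commit", "-m", build_commit_message(date.today()))
    git("push", "origin", "master")

# Scheduled with a crontab line such as:  0 2 * * *  python nightly_upload.py
```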

In this post, I walk through a few steps of setting up GitHub for archiving.  I take the perspective of Kirby, my dog, who wishes to archive the following four photos of himself:

Here are Kirby's steps:

1. The first step is to create a repository on the GitHub server.

1a.  Kirby goes to GitHub (github.com)  and signs up for a free account (last option).  Once the account is set up (with user name KirbyHerby) he is given a screen with a lot of options for exploring GitHub.  He ignores these as they are not relevant for his task.

1b.  To create his first repository on the server, Kirby presses the green button that says "+ New repository" on the bottom left.

1c. Kirby now has to make some choices about the repository.  He names it "data," enters a description of the repository, makes it public, initializes it with a README, and does not specify which files to ignore or a license.  He then presses the green "Create repository" button on the bottom, and is given his first view of the repository.

Kirby's repository is now at github.com/KirbyHerby/data, and he will bark out this URL to anyone interested.  The repository contains only the README.md file at this point.

2.  The next step is getting a linked copy of this repository on Kirby's local computer.

2a. Kirby downloads the GitHub application for his operating system (mac.github.com or windows.github.com), and on installation, chooses to install the command-line tools (trust me, you will use these some day).

2b.  Kirby enters his GitHub username (``KirbyHerby") and password.

2c. He next has to create a local repository and link it to the one on the server.  To do so, he chooses ``Add repository" and is given a choice to ``Add," ``Create," or ``Clone."  Since the repository already exists at GitHub, he presses ``Clone."  A list of his repositories shows up, and in this case, it is a short list of one repository, ``data."  Kirby then selects ``data" and presses the bottom button ``Clone repository."  The repository now exists on the local computer under the folder ``data."  There are two separate copies of the same repository: one on the GitHub server and one on Kirby's local machine.

3. Kirby wishes to add files to the server repository so others may see them.

3a. Kirby first adds the photo files to the local repository as follows: he copies the photos into the repository folder in the usual way, which for Mac OS X means using the Finder.  The following screen shot shows the Finder window in the foreground and the GitHub client window in the background.  As can be seen, Kirby has added three files, and these show up in both applications.  Kirby has no more need for the Finder and closes it to get a better view of the local repository in the GitHub client window.

3b.  Kirby is now going to save the updated state of the local repository, which is called committing it.  Committing is a local action and can be thought of as taking a snapshot of the repository at this point in time.  Kirby turns his attention to the bottom part of the screen.  To commit, Kirby must add a log entry, which in this case is ``Added three great photos."  The log will contain not only this message but also a description of what files were added, when, and by whom.  This log message is enforced---one cannot make a commit without it.  Finally, Kirby presses ``Commit to master."

3c.  Kirby now has to push his changes to the repository to the GitHub server so everyone may see them.  He can do so by pressing the ``sync" button.

That's it.  Kirby's additions are now available to everyone at github.com/KirbyHerby/data.

Suppose Kirby realizes that he had forgotten his absolutely favorite photo of him hugging his favorite toy, Panda.  So he copies the photo over in Finder, commits a new version of the repository with a new message, and syncs up the local with the GitHub server version.
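The add-commit-sync cycle Kirby performs by hand is exactly what my nightly born-open upload automates.  Here is a minimal sketch in Python using Git's command-line tools (the ones installed in step 2a); the function names, repository path, and log message are my illustrations, not the lab's actual configuration:

```python
import subprocess
from datetime import date

def nightly_commands(repo_path, today=None):
    """Build the git command sequence for one born-open nightly upload.
    A hypothetical sketch; the path and remote names are illustrative."""
    today = today or date.today().isoformat()
    message = f"Automated nightly data upload, {today}"
    return [
        ["git", "-C", repo_path, "add", "--all"],              # stage all new/changed data files
        ["git", "-C", repo_path, "commit", "-m", message],     # local snapshot with enforced log entry
        ["git", "-C", repo_path, "push", "origin", "master"],  # sync the snapshot to GitHub
    ]

def nightly_sync(repo_path):
    """Run the sequence; schedule it (e.g., from cron) so no lab personnel are needed."""
    for cmd in nightly_commands(repo_path):
        subprocess.run(cmd, check=False)   # the commit step may report "nothing to commit"
```

Note the design: separating the command-building from the running makes the script easy to inspect before trusting it with your data.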

There is a lot more to Git and GitHub than this.  Git and GitHub are very powerful, so much so that they are the default for open-source software development worldwide.  Multiple people may work on multiple parts of the same project.  Git and GitHub have support for branches, tagging versions, merging files, and resolving conflicts.  More about the system may be learned by studying the wonderful Git Book at git-scm.com/book/en/v2.

Finally, you may wonder why Kirby wanted to post these photos.  Well, Kirby doesn't know anything about Bayesian statistics, but he is loyal.  He knows I advocate Bayes factors.  He also knows that others who advocate ROPEs and credible intervals sell their wares with photos of dogs.  Kirby happens to believe that by posting these, he is contributing to my Bayes-factor cause.  After all, he is cuter than Kruschke's puppies and perhaps he is more talented.  He does know Git and GitHub and has his own repository to prove it.

Thursday, April 30, 2015

Brainstorming: What Is A Good Data-Sharing Policy?

This blog post is coauthored with Richard Morey.

Many talented and generous people are working hard to change the culture of psychological science for the better.  Key in this change is the call for transparency and openness in the production, transmission, and consumption of knowledge in our field.  And change is happening---the field is becoming more transparent.   It is delightful to see.

To keep up with these changes, I have recently reconfigured how data are curated in my lab.  In that process I examined APA's statement on the ethical obligation of researchers to share data:

8.14 Sharing Research Data for Verification
(a) After research results are published, psychologists do not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis and who intend to use such data only for that purpose, provided that the confidentiality of the participants can be protected and unless legal rights concerning proprietary data preclude their release. This does not preclude psychologists from requiring that such individuals or groups be responsible for costs associated with the provision of such information. 
(b) Psychologists who request data from other psychologists to verify the substantive claims through reanalysis may use shared data only for the declared purpose. Requesting psychologists obtain prior written agreement for all other uses of the data.

More commentary is provided in APA's 6th Edition Publication Manual (pp. 12-13):

To avoid misunderstanding, it is important for the researcher requesting the data and the researcher providing the data to come to a written agreement about the conditions under which the data are to be shared.  Such an agreement must specify how the shared data may be used (e.g., for verification of already published results, for inclusion in meta-analytic studies, for secondary analysis).  The written agreement should also include a formal statement about limits on the distribution of shared data....  Furthermore, the agreement should specify limits on the dissemination .... of the results of analyses performed on the data and authorship expectations.

What's Good About This Policy?

The above policy shines a spotlight on verification.  Verification is needed for a healthy psychological science.  Researchers are prone to mistakes, and allowing others to see the reasoning from data to conclusion provides an added layer of integrity to the process.  It also provides at least some expectation that data may be shared, albeit perhaps with stringent limits.

Understanding The Difference Between "Data" and "Interpretation"

To act ethically, we need to differentiate data from the interpretation of data.  

Data serve as facts, and like facts, are interpreted in the context of theories and explanations.  The data themselves are incontrovertible inasmuch as the actual values stand on their own.  The interpretation of these facts---whether and how they support certain theories, explanations, and accounts---is not incontrovertible.  These interpretations are just that: interpretations.  They are creative, insightful, varied, judged, negotiated, personal, etc.

Who is qualified to interpret these data?  We all feel like we are uniquely qualified to interpret our own data, but we are not.  On the contrary, we should respect the abilities of others to interpret the data, and we should respect that their interpretations may differ from ours.  Our interpretations should derive no special authority or consideration from the fact that we collected the data.  When we collect data, we gain the right to interpret them first, but not last.  

What's Wrong With The Policy

Jelte Wicherts and Marjan Bakker have recently pointed out the flaws in APA's policies.  Their unpublished paper and commentary in Nature are important reads in understanding the possible ramifications.

The APA policy does not ensure others' rights to independently interpret the data.  The PI explicitly retains sanctioned control of subsequent interpretations.  They may exercise this control by refusing to share without being granted authorship, by limiting the scope of use of the data, or by limiting where the results may appear.  The APA policy serves to privilege the data collectors' interpretation where no such privilege should ethically exist.

What Does A Good Policy Look Like?

Let's brainstorm this as a community.  As a starting point, we propose the following statement: Ethical psychologists endeavor to ensure that others may independently interpret their data without constraint.  How should that play out?  We welcome your ideas.

Tuesday, April 21, 2015

How many participants do I need? Insights from a Dominance Principle

The Twitter Fight About Numbers of Participants

About a month back, there was an amusing Twitter fight about how many participants one needs in a within-subject design.  Cognitive and perceptual types tend to use just a few participants but run many replicates per subject.  For example, we tend to run about 30 people with 100 trials per condition.  Social psychologists tend to run between-subject designs with one or a handful of trials per condition, but with many more than 30 people.  The Twitter discussion was a critique of the small sample sizes used in cognitive experiments.

In this post, I ask how wise the cognitive psychologists are by examining the ramifications of these small numbers of participants.  This examination is informed by a dominance principle, a reasonable conjecture about how people differ from one another.  I show why the cognitive psychologists are right---why only small numbers of people are needed even to detect small effects.


Consider a small priming effect, say an average 30 ms difference between primed and unprimed conditions.  There are a few sources of variation in such an experiment:

The variability within a person and condition across trials.  What happens when the same person responds repeatedly in the same condition?  In response time measures, this variation is usually large, say a standard deviation of 300 ms.  We can call this within-cell variability.

The variability across people.  Regardless of condition, some people just flat out respond faster than others.  Let's suppose we knew exactly how fast each person was, that is, we had many repetitions per condition.  In my experience, across-people variability is a tad less than within-person-condition variability.  Let's take the standard deviation at 200 ms.  We can call this across-people variability.

The variability in the effect across people.  Suppose we knew exactly how fast each person was in both conditions.  The difference is the true effect, and we should assume that this true effect varies.  Not everyone is going to have an exact 30 ms priming effect.  Some people are going to have a 20 ms effect, others are going to have a 40 ms effect.   How big is the variability of the effect across people?    Getting a handle on this variability is critical because it is the limiting factor in within-subject designs.   And this is where the dominance principle comes in.

A Dominance Principle

The dominance principle here is that nobody has a true reversal of the priming effect.  We might see a reversal in observed data, but this reversal is only because of sample noise.  Had we enough trials per person per condition, we would see that everyone has a positive priming effect.  Responses are unambiguously quicker in the primed condition---they dominate.

The Figure below shows the dominance principle in action.  Shown are two distributions of true effects across people---an exponential and a truncated normal.  The dominance principle stipulates that the true effect is in the same direction for everyone, that is, there is no mass below zero.  And if there is no mass below zero and the average is 30 ms, then the distributions cannot be too variable.  Indeed, the two shown distributions have a mean of 30 ms, and the standard deviations for these exponential and truncated normal distributions are 30 ms and 20 ms, respectively.  This variability is far less than the 300 ms of within-cell variability or the 200 ms of across-people variability.  The effect size across people, 30 ms divided by these standard deviations, is actually quite large.  It is 1 and 1.5 respectively for the shown distributions.
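The exponential panel is easy to check by simulation: an exponential's standard deviation equals its mean, so a 30 ms mean pins the spread at 30 ms and the across-people effect size at 1.  A quick sketch (the truncated-normal panel would need its own tuning and is omitted here):

```python
import random, statistics

random.seed(2015)

# True priming effects across people: exponential with a 30 ms mean,
# so every person's effect is positive, as dominance requires.
effects = [random.expovariate(1 / 30) for _ in range(200_000)]

mean_effect = statistics.mean(effects)   # close to 30 ms
sd_effect = statistics.stdev(effects)    # also close to 30 ms: an exponential's SD equals its mean
effect_size = mean_effect / sd_effect    # close to 1, which is quite large
```

Dominance is doing all the work here: forcing every true effect to be positive with a 30 ms mean caps how spread out the effects can be.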

You may see the wisdom in the dominance principle or you may be skeptical.  If you are skeptical (and why not), then hang on.  I am first going to explore the ramifications of the principle, and then I am going to show it is probably ok.

Between and Within Subject Designs

The overall variability in a between-subject design is the sum of the variabilities, and it is determined in large part by the much larger within-cell and across-people variabilities.  This is why it might be hard to see a 30 ms priming effect in a typical between-subject design.  The effect size is somewhere south of .1.

The overall variability in a within-subject design depends on the number of trials per participant.  In these designs, we calculate each person's mean effect.  This difference has two properties: first, it effectively subtracts out across-participant variability; second, its within-cell variability decreases with the number of trials per participant.  If this number is large, then the overall variability is limited by the variability in the effect across people.  As stated above, under the dominance principle this variability is small, say about the size of the effect under consideration.  Therefore, as we increase the number of observations per person, we can expect effect sizes of 1 or even bigger.
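The arithmetic here can be made concrete.  With \(L\) trials per condition, a person's mean difference carries averaging noise of variance \(2\sigma^2/L\) on top of the true-effect variability \(\sigma_\theta^2\), so the design's effective effect size is \( \mu / \sqrt{\sigma_\theta^2 + 2\sigma^2/L} \).  A sketch with the numbers used in this post (the function name is mine):

```python
import math

def within_subject_effect_size(mu, sd_effect, sd_cell, trials):
    """Effective standardized effect in a within-subject design: each person's
    mean difference has variance sd_effect^2 + 2*sd_cell^2/trials."""
    return mu / math.sqrt(sd_effect**2 + 2 * sd_cell**2 / trials)

# 30 ms effect, 30 ms effect variability (dominance), 300 ms within-cell noise
few   = within_subject_effect_size(30, 30, 300, trials=10)      # few replicates per person
many  = within_subject_effect_size(30, 30, 300, trials=100)     # typical cognitive design
limit = within_subject_effect_size(30, 30, 300, trials=10_000)  # near the asymptote
```

With the 20 ms truncated-normal spread instead of 30 ms, the asymptote rises to 1.5, which is the "or even bigger" case.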

Simulating Power for Within Subject Designs

Simulations seem to convince people of points perhaps even more than math.  So here are mine to show off the power of within-subject designs under the dominance principle.  I used the 300 ms within-cell and 200 ms across-people variabilities and sampled 100 observations per person per condition.  Each person had a true positive effect, and these effects were sampled from the truncated normal distribution with an overall mean of \( \mu \).  Here are the power results for several sample sizes (numbers of people) and values of the average effect \( \mu \).

The news is quite good.  Although a 10 ms effect cannot be resolved with fewer than a hundred participants, the power for larger effects is reasonable.  For example the power to resolve a 30 ms effect with 30 participants is .93!  Indeed, cognitive psychologists know that even small effects can be successfully resolved with limited participants in massively-repeated within-subjects designs.  It's why we do it routinely.
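My simulation code is not shown here, so what follows is a reconstruction under the stated assumptions, simplified in one way: rather than simulating every trial, each person's observed mean effect is sampled directly, with averaging-noise SD \( \sqrt{2 \times 300^2/100} \approx 42 \) ms.  The truncated-normal parameters are my tuning choices, not the original code:

```python
import math, random, statistics

random.seed(11)

def sample_true_effect(mu0=20, sd0=27):
    """Rejection-sample a normal truncated at zero.  These underlying values are
    my tuning; they give true effects with mean near 30 ms and SD near 20 ms."""
    while True:
        theta = random.gauss(mu0, sd0)
        if theta > 0:
            return theta

def significant_experiment(n_people=30, trials=100, sd_cell=300):
    """One simulated experiment: is the one-sample t across people significant?"""
    noise_sd = math.sqrt(2 * sd_cell**2 / trials)  # ~42 ms noise on each person's mean effect
    d = [sample_true_effect() + random.gauss(0, noise_sd) for _ in range(n_people)]
    t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n_people))
    return abs(t) > 2.045   # two-sided .05 critical t for df = 29 (valid for n_people = 30 only)

power = sum(significant_experiment() for _ in range(2000)) / 2000
```

Run as is, the estimated power lands in the neighborhood of the .93 reported above for a 30 ms effect with 30 people.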

The bottom line message is that if one assumes the dominance principle, then the power of within-subject designs is surprisingly high.  Of course, without dominance all bets are off.  Power remains a function of the variability of the effect across people, which must be specified.

Logic and Defense of the Dominance Principle

You may be skeptical of the dominance principle.  I suspect, however, that you will need to assert it.

1. The sizes of effects are difficult to interpret without the dominance principle.  Let's suppose that the dominance principle is massively violated.  In what sense is the mean effect useful or interpretable?  For example, suppose one has a 30 ms average effect with 60% of people having a true positive effect and 40% of people having a true negative priming effect.  The value of the 30 ms seems unimportant.  What is critically important in this case is why the effect differs in direction across people.  A good start is exploring which person variables are associated with positive and negative priming.

2. The dominance principle is testable.  All you have to do is collect a few thousand trials per person to beat the 300 ms within-cell variability.  If you want, say, 10 ms resolution per person, just collect 1000 observations per person.  I have done it on several occasions, collecting as many as 8,000 trials per person in some studies (see Ratcliff and Rouder, 1998, Psych Sci).  I cannot recall a violation, though I have no formal analysis...yet.  The key is making sure you do not confound within-participant variability, which is often large, with between-participant variability.  You need a lot of trials per individual to deconfound these sources.  If you know of a dominance violation, then please pass the info along.

Odds are you are not going to collect enough data to test for dominance.  And odds are that you are going to want to interpret the average effect size across people as meaningful.  And to do so, in my view, you will therefore need to assume dominance!  And this strikes me as a good thing.  Dominance is reasonable in most contexts, strengthens the interpretation of effects, and leads to high power even with small sample sizes in within-subject designs.

Thursday, April 9, 2015

Reply to Uri Simonsohn's Critique of Default Bayesian Tests

How we draw inferences about theoretical positions from data, how we know things, is central to any science.  My colleagues and I advocate Bayes factors, and Uri Simonsohn provides a critique.  I welcome Uri’s comments  because they push the conversation forward.  I am pleased to be invited to reply, and am helped by Joe Hilgard, an excellent graduate student here at Mizzou.

The key question is what to conclude from small effects when we observe them.  It is a critical question because psychological science is littered with these pesky small effects---effects that are modest in size and barely significant.  Such observed effects could reflect many possibilities including null effects, small effects, or even large effects.

According to Uri, Bayes factors are prejudiced against small effects because small effects when true may be interpreted as evidence for the null. By contrast, Ward Edwards (1965, Psy Bull, p. 400) described classical methods as violently biased against the null because if the null is true, these methods never provide support for it.

Uri's argument assumes that observed small effects reflect true small effects, as shown in his Demo 1. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths?  I discuss the difference between these questions, why the Bayes factor question is more intellectually satisfying, and how to simulate data to assess the veracity of Bayes factors in my 2014 Psy Bull Rev paper.

The figure below shows the Bayes-factor evidence for the alternative model relative to the null model conditional on data, which in this case is the observed effect size.  We also show the same for p-values, and while the figures are broadly similar, they also have dissimilarities.  I am not sure we can discern which is more reasonable without deeper analysis.  Even though we think Uri's Demo 1 is not that helpful, it is still noteworthy that the Bayes factors favor the null to a greater extent as the true effect size decreases.

The question remains how to interpret observed small effects.  We use Bayes factors to assess the relative evidence for the alternative vs. the null.  Importantly, we give the null a fighting chance.  Nulls are statements of invariance, sameness, conservation, lawfulness, constraint, and regularity.  We suspect that if nulls are taken seriously, that is, if one can state evidence for or against them as dictated by data, then they will prove useful in building theory.  Indeed, statements of invariance, constraint, lawfulness, and the like have undergirded scientific advancement for over four centuries.  It is time we joined this bandwagon.

Bayes factors are formed by comparing predictions of competing models.  The logic is intuitive and straightforward.  Consider two competing explanations.  Before we observe the data, we use the explanations to predict where the observed effect size might be.  We do this by making probability statements.  For example, we may calculate the probability that the observed effect size is between .2 and .3.  Making predictions under the null is easy, and these probabilities follow gracefully from the t-distribution.

Making predictions under the alternative is not much harder for simple cases.  We specify a reasonable distribution over true effect sizes, and based on these, we can calculate probabilities on effect size intervals as with the null.

Once the predictions are stated, the rest is a matter of comparing how well the observed effect size matches the predictions.  That's it.  Did the data conform to one prediction better than another?  And that is exactly what the Bayes factor tells us.  Prediction is evidence.
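The interval logic above can be sketched numerically.  Under the null, observed effect sizes scatter tightly around zero; under an alternative that places a standard normal on true effect size, they scatter more broadly.  The Bayes factor for "the observed effect size lands in [.2, .3]" is just the ratio of the two predicted probabilities.  Here is a Monte Carlo sketch using a large-sample normal approximation in place of the exact t machinery (my simplification, not the full model):

```python
import random

random.seed(7)
n = 50            # participants; observed effect size d is roughly N(delta, 1/n)
sims = 200_000

def predicted_prob(alternative):
    """Predicted probability that the observed d lands in [.2, .3] under a model."""
    hits = 0
    for _ in range(sims):
        delta = random.gauss(0, 1) if alternative else 0.0  # true effect size
        d = random.gauss(delta, (1 / n) ** 0.5)             # observed effect size
        hits += (0.2 < d < 0.3)
    return hits / sims

p_null, p_alt = predicted_prob(False), predicted_prob(True)
bf_null = p_null / p_alt   # > 1: an observed d in [.2, .3] was better predicted by the null
```

Notice the perhaps surprising outcome: with 50 participants, an observed effect size between .2 and .3 was actually predicted somewhat better by the null than by this diffuse alternative.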

My sense is that researchers need some time to get used to making predictions because it is new in psychological science.   I realize that some folks would rather not make predictions---it requires care, thought, and judiciousness.  Even worse, the predictions are bold and testable.  My colleagues and I are here to help, whether it is help in specifying models that make predictions or deriving the predictions themselves.

I would recommend that researchers who are unwilling to specify models of the alternative hypothesis avoid inference altogether.  There is no principled way of pitting a null model that makes predictions against a vague alternative that does not.  Inference is not always appropriate, and for some perfectly good research endeavors, description is powerful and persuasive.  What is hard to defend, though, is continuing what we are doing---rejecting nulls with small effects on scant evidence while denying the beauty and usefulness of invariances in psychological science.

Sunday, March 22, 2015

What Are the Units In Bayes Rule?

I often better understand statistics by drawing physical analogs. Expected value is the balancing of a distribution on a pin, skewness is a misloaded washing machine banging away, and so on. Sometimes the analogs come easily, and other times I have to work at it. I find that I tend to look for the units of things.

Surprisingly, consideration of units is not too common. Let me give you an example. Suppose I am modeling the mass of Brazil nuts (in grams) as a normal with some mean and variance. Most analysts know immediately that the units of the mean and variance parameters are \(gm\) and \(gm^2\), respectively. How about the units for the likelihood function for 10 observations? For any hypothetical value of the mean and variance, the evaluation of the likelihood is a number, and it has physical units. I'll get to them later.

Considering the units of Bayes Rule has affected how I view what it means, and describing this effect is the point of this blog post. Let's start with the units of density, progress to the units of likelihood, and finally to Bayes Rule itself.

The Units of Density

The figure shows the typical normal curve, and I am using it here to describe the mass of Brazil nuts. Most of us know there are numbers on the y-axis even though we hardly ever give them any thought. What units are they in? We start with probability, and note that the area under the curve on an interval is the probability that an observation will occur in that interval. For example, the shaded area between 1.15 gm and 1.25 gm is .121, and, consequently, the probability that the next observation will fall between 1.15 and 1.25 is .121. We can approximate this area, \(A\), for a small interval between \(x-\Delta/2\) and \(x+\Delta/2\) as:
\[ A=f(x) \times \Delta. \]
The figure shows the construction for \(x=1.2\) and \(\Delta=.1\). Now let's look at the units. The area \(A\) is a probability, so it is a pure number without units. The interval \(\Delta\) is an interval on mass in grams, so it has units \(gm\). To balance the units, the remaining term, \(f(x)\), must be in units of \(1/gm\). That is, the density is the rate of change in the probability measure per gram. No wonder the curve is called a probability density function! It is just like a physical density---a change in (probability) mass per unit of the variable.

In general, we may write the probability that an observation \(X\) is in a small interval around \(x\) as
\[ \Pr(X \in [x-\Delta/2,x+\Delta/2]) = f(x)\Delta. \]
Here the units of \(f(x)\) (in \(gm^{-1}\)) and of \(\Delta\) (in \(gm\)) cancel out as they should.
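This bookkeeping is easy to check numerically.  The post does not state the normal's parameters, so the values below (mean 1 gm, SD 0.25 gm) are my guesses chosen to roughly reproduce the shaded area; the point is the unit cancellation, not the particular numbers:

```python
import math

# Assumed Brazil-nut model: mass ~ Normal(mu = 1 gm, sigma = 0.25 gm).
# These parameters are illustrative; the post does not state them.
mu, sigma = 1.0, 0.25

def density(x):
    """Normal density; its units are 1/gm because of the leading 1/sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def interval_prob(lo, hi):
    """Exact area under the curve on [lo, hi]: a unitless probability."""
    z = lambda x: (x - mu) / (sigma * math.sqrt(2))
    return 0.5 * (math.erf(z(hi)) - math.erf(z(lo)))

exact = interval_prob(1.15, 1.25)  # pure number, near the figure's shaded area
approx = density(1.2) * 0.1        # (1/gm) times (gm): the units cancel
```

The rectangle approximation `density(1.2) * 0.1` agrees with the exact area to about three decimal places, and its units cancel exactly as the prose says they must.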

All density functions are in units of the reciprocal of the measurement unit. For example, the normal density is
\[ f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \]
The unit is determined by the \(1/\sigma\) in the leading term. And where \(\sigma\) is in units of \(x\), \(1/\sigma\) is in units of \(1/x\). A valid density is always in reciprocal measurement units.

The Probability of An Observation is Zero

One of the questions we can ask about our density is the probability that a Brazil nut weighs a specific value, say 1 gram. The answer, perhaps somewhat surprisingly, is 0. No Brazil nut weighs exactly 1.0000… grams to infinite precision. The area under a point is zero. This is why we always ask about probabilities that observations are in intervals rather than at points.

The Unit of Likelihood Functions

Let's look at the likelihood function for two parameters, \(\mu\) and \(\sigma^2\), and for two observations, \(x_1\) and \(x_2\).  What are the units of the likelihood?  The likelihood is the joint density of the observations, which may be denoted \(f(x_1,x_2;\mu,\sigma^2)\), treated as a function of the parameters.  Now this density describes how probability mass varies as a function of both \(x_1\) and \(x_2\); that is, it lives on a plane.  We may write
\[ \Pr(X_1 \in [x_1-\Delta/2,x_1+\Delta/2] \wedge X_2 \in [x_2-\Delta/2,x_2+\Delta/2]) = f(x_1,x_2)\Delta^2. \]
It is clear from here that the units of \(\Delta^2\) are in \(gm^2\), and, hence, the units of \(f(x_1,x_2)\) are in \(gm^{-2}\). If we had \(n\) observations, the unit of the joint density among them is \(gm^{-n}\).

The likelihood is just the joint density repeatedly evaluated at a point for different parameter values. The evaluation, though, is that of a density, and the units are those of the density. Hence, the likelihood function's units are the reciprocal of those of the data taken jointly. If the mass of 10 Brazil nuts is observed in grams, the likelihood function has units \(gm^{-10}\).

The Units of Bayes Rule

Bayes Rule is the application of the Law of Conditional Probability to parameters in a statistical model. The Law of Conditional Probability is that
\[ \Pr(A|B) = \frac{\Pr(A \wedge B)}{\Pr(B)}. \]
There are no units to worry about here as all numbers are probabilities without units.

The application for a statistical model goes as follows: Let \(y=y_1,y_2,\ldots,y_n\) be a sequence of \(n\) observations that are modeled as \(Y=Y_1,Y_2,\ldots,Y_n\), a sequence of \(n\) random variables. Let \(\Theta=\Theta_1,\Theta_2,\ldots,\Theta_p\) be a sequence of \(p\) parameters, which, because we are Bayesian, are also random variables. Bayes Rule tells us how to update beliefs about a particular parameter value \(\theta\), and it is often written as:
\[ \Pr(\theta|y) = \frac{\Pr(y|\theta)\Pr(\theta)}{\Pr(y)}, \]
which is shorthand for the more obtuse
\[ \Pr(\Theta=\theta|Y=y) = \frac{\Pr(Y=y|\Theta=\theta)\Pr(\Theta=\theta)}{\Pr(Y=y)}. \]

At first glance, this equation is not only obtuse, but makes no sense. If \(Y\) and \(\Theta\) are continuous, then all the probabilities are identically zero. Bayes rule is then \(0=(0\times 0)/0\), which is not very helpful.

The problem is that Bayes rule is usually written with hard-to-understand shorthand. The equation is not really about probabilities of the random quantities at set points, but about how random quantities fall into little intervals around the points. For example, \(\Pr(\Theta_1=\theta_1)\) is horrible shorthand for \(\Pr(\Theta_1 \in [\theta_1-\Delta_{\theta_1}/2,\theta_1+\Delta_{\theta_1}/2])\). Fortunately, it may be written as \(f(\theta_1)\Delta_{\theta_1}\). The same holds for the joint probabilities as well. For example, \(\Pr(\Theta=\theta)\) is shorthand for \(f(\theta)\Delta_\theta\). In the Brazil-nut example, let \(\theta_1\) and \(\theta_2\) be the mean and variance parameters in \(gm\) and \(gm^2\), respectively. Then \(\Delta_\theta\) is in units of \(gm\times gm^2\) (or \(gm^3\)) and \(f(\theta)\) is in units of \(1/gm^3\).

With this notation, we may rewrite Bayes Rule as
\[ f(\theta|y)\Delta_\theta = \frac{f(y|\theta)\Delta_y f(\theta)\Delta_\theta}{f(y)\Delta_y}. \]
The units of \(\Delta_\theta\) are \(gm^3\); the units of \(\Delta_y\) are \(gm^n\); the units of \(f(y)\) and \(f(y|\theta)\) are in \(gm^{-n}\); the units of \(f(\theta)\) and \(f(\theta|y)\) are in \(gm^{-3}\). Everything is in balance as it must be.

Of course, we can cancel out the \(\Delta\) terms yielding the following form of Bayes Rule for continuous parameters and data:
\[ f(\theta|y) = \frac{f(y|\theta)f(\theta)}{f(y)}. \]

Bayes Rule describes a conservation of units, and that conservation becomes more obvious when each side is unit free. Let's move the terms around so that Bayes Rule is expressed as
\[ \frac{f(\theta|y)}{f(\theta)} = \frac{f(y|\theta)}{f(y)}. \]
The left side describes how beliefs about a parameter value \(\theta\) should change in light of observations \(y\). The right side describes how much better the parameter value \(\theta\) predicts the data than the average across all parameter values. The symmetry here is obvious and beautiful. It is my go-to explanation of Bayes Rule, and I am going to write more about it in future blog posts.
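That conservation can also be verified numerically.  For a toy conjugate example of my choosing---prior \(\theta \sim N(0,1)\) and one observation \(y|\theta \sim N(\theta,1)\)---every density in Bayes Rule has a closed form, and the two ratios match at any \(\theta\) and \(y\):

```python
import math

def normal_pdf(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy conjugate model: theta ~ N(0, 1); y | theta ~ N(theta, 1).
theta, y = 0.7, 1.3                         # arbitrary evaluation point
prior      = normal_pdf(theta, 0, 1)        # f(theta)
likelihood = normal_pdf(y, theta, 1)        # f(y | theta)
marginal   = normal_pdf(y, 0, 2)            # f(y): variance 1 + 1 by conjugacy
posterior  = normal_pdf(theta, y / 2, 0.5)  # f(theta | y): N(y/2, 1/2) by conjugacy

belief_change   = posterior / prior         # left side of the rearranged rule
predictive_gain = likelihood / marginal     # right side
```

Both sides are unit-free ratios, and they agree to machine precision; here \(\theta = 0.7\) predicts \(y = 1.3\) better than average, so beliefs about it grow by exactly that factor.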

Sometimes, Bayes Rule is written as
\[ f(\theta|y) \propto f(y|\theta) f(\theta), \]
or, more informally:
\[ \mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior}. \]
Jackman (2009) and Rouder and Lu (2005) take this approach in their explanations of Bayes Rule. For example, Jackman's Figure 1.2, the one that depicts updating, is not only without units, it is without a vertical axis altogether! When we don't track units, the role of \(f(y)\) becomes hidden, and the above conservation is missed. My own view is that while the proportional expression of Bayes Rule is exceedingly useful for computations, it precludes the deepest understanding of Bayes Rule.

Saturday, March 14, 2015

Estimating Effect Sizes Requires Some Thought

The small effect sizes observed in the Many Labs replication project have me thinking….

Researchers think that effect size is a basic, important, natural measure for summarizing experimental research. The approach goes by “estimation” because the measured sample effect size is an estimate of the true or population effect size. Estimation seems pretty straightforward: the effect-size estimate is the sample mean divided by the sample standard deviation. Who could disagree with that?

Let's step back and acknowledge that true effect size is not a real thing in nature. It is a parameter in a model of data. Models are assuredly our creations. In practice, we tend to forget about the models and think about effect sizes as real quantities readily meaningful and accessible. And it is here that we can occasionally run into trouble, especially for smaller effect-size measurements.

Estimating Coin-Flip Outcomes

Consider the following digression to coin flips to build intuition about estimation. Suppose a wealthy donor came to you offering to help fund your research. The only question is the amount. The donor says:

I have a certain coin. I let you see 1000 flips first, and then you can estimate the outcome for the next 1000. If you get it right, I will donate $10,000 to your research. If you're close, I will reduce the amount by the squared error. So, if you are off by 1, I will donate $9,999; off by 2, I will donate $9,996; if you are off by 20, I will donate $9,600, and so on. If you are off by more than 100, then there will be no donation.

Let's suppose the outcome on the first 1000 flips is 740 heads. It seems uncontroversial to think that perhaps this value, 740, is the best estimate for the next 1000. But suppose that the outcome is 508 heads. Now we have a bit of drama. Some of you might think that 508 is the best estimate for the next 1000, but I suspect that we are chasing a bit of noise. Perhaps the coin is fair. The fair-coin hypothesis is entirely plausible---after all, have you ever seen a biased coin? Moreover, the result of the first 1000 flips, the 508 heads, should bolster confidence in this fair-coin hypothesis. And if we are so convinced after seeing the 508 heads, the best estimate is 500 rather than 508. Why leave money on the table?

Now let's ratchet up the tension. Suppose you observed 540 heads in the first 1000 flips. I chose this value purposefully. If you were bold enough to specify the possibilities, a fair-coin null and a uniformly distributed alternative, then 540 is an equivalence point. If you observe between 460 and 540 heads, you should gain greater confidence in the null. If you observe fewer than 460 or more than 540 heads, then you should gain greater confidence in the uniform alternative. At 540, you remain unswayed. If we believe going in that the null and the alternative are equally plausible, and our beliefs do not change, then we should average. That is, the best estimate, the one that minimizes squared-error loss, is halfway in between. If we want to maximize the donation, then we should estimate neither 500 nor 540 but 520!

The estimate of the number of heads, \(\hat{Y}\), is a weighted average. Let \(X\) be the number of heads in the first 1000 flips and \(\mbox{Pr}(\mbox{Fair}|X)\) be our belief that the coin is fair after observing \(X\) heads.
\[ \hat{Y} = \mbox{Pr}(\mbox{Fair}|X) \times 500 + (1-\mbox{Pr}(\mbox{Fair}|X)) X.\]
The following figure shows how the estimation works. The left plot is a graph of \(\mbox{Pr}(\mbox{Fair}|X)\) as a function of \(X\). (The computations assume that the unfair-coin probability follows a uniform distribution a priori and that, a priori, the fair and unfair coin models are equally likely.) It shows that the belief increases for values near 500 and decreases for values away from 500. The right plot shows the weighted-average estimate \(\hat{Y}\). Note the departure from the diagonal: the possibility of a fair coin provides an expanded region where the fair-coin estimate (500) is influential. This influence depends on the first 1000 flips; if the number of heads is far from 500, the fair-coin hypothesis has virtually no influence.
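The weighted-average rule above is easy to check numerically. Here is a minimal sketch (my own, not the post's original computation) in which the fair model is scored by the binomial likelihood at p = .5 and the uniform alternative by its marginal likelihood, which works out to 1/(N+1):

```python
from math import comb

N = 1000  # flips in each batch

def posterior_fair(x, prior_fair=0.5):
    """Pr(Fair | x heads): point null at p = .5 vs. a uniform alternative."""
    lik_fair = comb(N, x) * 0.5 ** N   # Binomial(x; N, .5)
    lik_unif = 1 / (N + 1)             # integral of Binomial(x; N, p) over p in [0, 1]
    return prior_fair * lik_fair / (prior_fair * lik_fair + (1 - prior_fair) * lik_unif)

def estimate(x):
    """Model-averaged estimate of heads in the next N flips."""
    p = posterior_fair(x)
    return p * (N / 2) + (1 - p) * x

for x in (508, 540, 740):
    print(x, round(posterior_fair(x), 3), round(estimate(x), 1))
```

Under these assumptions the sketch reproduces the story above: at 508 heads the estimate is pulled almost all the way back to 500, at 540 heads the posterior probability of a fair coin sits near one half and the estimate lands near 520, and at 740 heads the fair-coin model has essentially no influence.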

Model Averaging For Effect Sizes

I think we should estimate effect sizes by model averaging, including the possibility that some effects are identically zero. The following figure shows the case for a sample size of 50. There is an ordinary effects model, in which effect size is distributed a priori as a standard normal, and a null model. When the sample effect size is small, the plausibility of the null model increases, and the estimate is shrunk toward zero. When the sample effect size is large, the effects model dominates, and the estimate is very close to the sample value.

Of interest to me are the small effect sizes. Consider, say, a sample effect size of .06. The model-averaged effect-size estimate is .0078, about 13% of the sample statistic. We see dramatic shrinkage for small effect sizes, as there should be when the data increase our confidence in the null. I wonder whether many of the reported effect sizes in Many Labs 3 would be more profitably shrunk, perhaps dramatically, toward zero.
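For concreteness, here is a minimal sketch of this kind of model averaging under simplifying assumptions of my own: the variance is treated as known, so the sample effect size d is normally distributed with variance 1/n around the true effect size, which is either exactly zero (null model) or a priori standard normal (effects model).

```python
from math import exp, pi, sqrt

def normpdf(x, var):
    """Density of a mean-zero normal with variance `var`, evaluated at x."""
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

def averaged_effect(d, n, prior_null=0.5):
    """Model-averaged effect-size estimate: point null vs. delta ~ N(0, 1)."""
    lik_null = normpdf(d, 1 / n)      # sampling density of d when delta = 0
    lik_alt = normpdf(d, 1 + 1 / n)   # marginal density of d when delta ~ N(0, 1)
    p_null = prior_null * lik_null / (prior_null * lik_null + (1 - prior_null) * lik_alt)
    post_mean_alt = d * n / (n + 1)   # posterior mean of delta under the effects model
    return (1 - p_null) * post_mean_alt  # the null model contributes exactly 0

print(averaged_effect(0.06, 50))  # heavy shrinkage for a small sample effect size
print(averaged_effect(0.60, 50))  # little shrinkage for a large one
```

With n = 50, this sketch gives about .0078 at d = .06, close to the figure quoted above, while at d = .60 the estimate stays near the sample value.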

But the Null is Never True… What?

Cohen and Meehl were legendary contributors. One of their most famous dictums was that the null is never true to arbitrary precision. If I had $10 for every researcher who repeats this proposition, I would be quite well off. I have three replies: 1. the dictum is irrelevant; 2. the dictum is assuredly wrong in some cases; and 3. in other cases, the dictum is better treated as an empirical question than as a statement of faith.

  • Irrelevant: The relevant question is not whether the null is true or not. It is whether the null is theoretically important. Invariances, regularities, and lawfulness are often of interest even if they never hold to arbitrary precision. Jupiter, for example, does not orbit the sun in a perfect ellipse; there are, after all, small tugs from other objects. Yet Kepler's Laws are of immense importance even though they hold only approximately. My own take on psychological science is that if we allowed people to treat the null as important and interesting, they surely would, and this would be good.
  • Wrong in some cases: Assuredly, there are cases where the point null does hold exactly. Consider the random-number generator in common packages, say those that produce a uniform distribution between 0 and 1. I could hope that the next number is high, say above .9. I contend that this hope has absolutely no effect whatsoever on the next generated number.
  • Testable in others: The evidence for the null versus a specified alternative may be assessed. But to do so, you need to place real mass on the null. And that is what we do with our Bayes factor implementations. So why be convinced a priori of something that you know is wrong in some cases and can be tested in others?

If You've Gotten This Far

If you are convinced by my arguments, then may I suggest you downweight effect size as a useful measure. Effect-size estimates are necessarily marginal, averaged across different models. Of more meaning to me is not the averaged size of effects but the probability of an effect. If you turn your attention there, then welcome to Bayesian model comparison!

Monday, March 9, 2015

High Reliability in the Lab

Ten days ago, Tim pulled me aside.  "Jeff, there's the keyboard from Room 2," he said pointing at the trash.  "When I pressed the 'z' key, I had to hit it three times for it to respond.  I wanted to tell you first because I know it affects some of the experiments."

"Thanks, we'll deal with it," I said as that sinking feeling set in.   The questions immediately came to mind.  How long has this keyboard been bad?  What data are compromised?  What can we do to catch this sooner?  Are our systems working?  What the hell are we doing?

About five years ago I had the good fortune of serving on a nursing PhD committee.  The candidate was applying the theory of highly reliable organizations to the nursing setting in hopes of improving patient safety.  As I read the characteristics of highly reliable organizations in the thesis, I started making comparisons to my lab.  We made mistakes in the lab, and although none had bitten us in the ass, perhaps we were more fortunate than good.  And I knew my way of dealing with mistakes, asking people to be more careful, wasn't that effective.  But the described characteristics of highly reliable organizations seemed abstract and not immediately translatable.  They were for large-scale organizations like hospitals and air-traffic control, and for dangerous environments where people could die.  My lab is small.  Not only have we never injured anyone, it seems highly unlikely that we ever will.

But mistakes and other adverse events not only cost us time, they affect the trust other people may place in our data and analyses.  We need to keep them at a minimum and understand them when they occur.  So I began adapting the characteristics of highly reliable organizations for my lab.   Making changes does take time and effort, but I suspect the rewards are well worth it.

Here are the characteristics of highly reliable organizations (http://en.wikipedia.org/wiki/High_reliability_organization) and our evolving implementations: 
  1. Sensitivity to Operations.  Sensitivity to operations in the context of a lab means enhanced attention to the processes by which data (and interpretations) are produced.  We implemented this characteristic by (a) defining a clear audit trail of lab activities via a MySQL relational database, which has tabs for IRBs, experiments, experimenters, participants, and runs; (b) automating many tasks so that smooth operation is less reliant on human meticulousness; and (c) codifying uniformity across different experiments and experimenters so that more people have a greater understanding of more elements of the operations.
  2. Preoccupation with Failure.  My goal is to anticipate some of the mistakes that can be made and think of ways to prevent them.  For me this has been achieved primarily in two ways.  First, I am structuring the lab so computers do more and people do less.  Examples include writing scripts that all experiments use to name files, add metadata, log demographics, populate the MySQL database, and upload data to repositories.  These steps have increased the accessibility and quality of the data and metadata we collect, and the system is still evolving.  Second, we discuss near-miss mistakes a lot.  Near-miss mistakes are things that didn't quite go right even though they had no effect on the integrity of the lab.  A near-miss mistake, if not identified, could be an experiment-ruining mistake the next time around.
  3. Resilience.  It is hard to define a resilient lab.  For us, it means that we accept that mistakes will happen and we put in place procedures to deal with them and, hopefully, learn from them.  We have an adverse-events tab in our MySQL database.  All issues are logged as an adverse event without blame or shame.  What's the problem?  What's the short term solution?  What are the long-term solutions and needed policy changes?  Adverse events are discussed at lab meetings, perhaps repeatedly, until we can get some resolution. 
  4. Reluctance to Simplify Interpretations.  The simple interpretation of mistakes in my lab is that people occasionally make mistakes and hardware occasionally fails; we should be more careful where we can and live with the rest.  We, however, are reluctant to take this stance.  Mistakes are a failure (by the PI) to anticipate a certain constellation of conditions, and we should try to identify such constellations as best we can and adapt where necessary.
  5. Deference to Expertise.  All lab members have some expertise in data collection.  Perhaps the most qualified experts are the graduate students and post-docs.  They write the code, organize the undergraduates, and analyze the data.   The undergrads are on the front lines.  They are setting up the machines, briefing, debriefing, and interacting with the participants.  I bring a long-run view.  Together, with deference to appropriate expertise, we figure out solutions.
So, with last week's keyboard failure, we can see some of these principles in action.  I logged an adverse event and typed up the problem.  I asked Tim how and why he discovered it (he was changing an experiment in Room 2).  We brought it up at the next lab meeting and divided our discussion into immediate concerns and long-term solutions as follows:

The immediate concerns were about the integrity of the data in a few experiments that use the "z" key.  We had a record of which participants were run in these experiments.  We also had a record of which room and computer they were run on; it is logged automatically.  I had previously set up an environment variable for each machine, and in the startup scripts this variable is read and written to the database along with other session variables.  Therefore we know exactly which data are at risk: about 10% of participants.  Moreover, the bad "z" key was used for catch trials whose response times weren't analyzed, so the failure is plausibly inconsequential to the integrity of the experiments.  A bit of good luck here.
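For readers who want a similar safeguard, the machine-tagging step can be sketched as follows.  This is my own illustration, not the lab's actual code; the variable name LAB_MACHINE_ID is hypothetical, and sqlite3 stands in for the lab's MySQL database.

```python
import os
import sqlite3

# Hypothetical variable name; each lab machine would set its own value once.
os.environ.setdefault("LAB_MACHINE_ID", "room2-pc")
machine = os.environ["LAB_MACHINE_ID"]

con = sqlite3.connect(":memory:")  # stand-in for the lab's MySQL database
con.execute("CREATE TABLE sessions (participant INTEGER, machine TEXT)")

# At session start, log the machine identifier alongside the participant,
# so a later hardware failure can be traced to the affected sessions.
con.execute("INSERT INTO sessions VALUES (?, ?)", (101, machine))

row = con.execute("SELECT machine FROM sessions WHERE participant = 101").fetchone()
print(row[0])
```

Because the tag is written automatically at session start, no one has to remember to record which machine was used; the audit trail simply accumulates.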

The long-term solutions were harder.  Hardware will fail.  Yet the single-sticky-key keyboard failure is pretty rare in that it is so hard to detect.  The two most common hardware failures, hard-drive and motherboard failures, are immediately detectable.  We may go another decade without seeing a hard-to-detect hardware failure.  With this in mind, we considered a few options.  One option was doing nothing; after all, the failure is rare and we were able to mitigate the damage.  Another option was a hardware-assurance protocol that could be run weekly.  Yet spending much time on such a rare, mitigable problem was deemed wasteful of lab resources.  So we settled on the following.  When each of us develops a new experiment, we are required to run it fully on ourselves once to confirm that the conditions appear as we intended, that they appear at the frequencies we intended, and that the data are analyzable as intended.  We switch experiments about once a month, so this check occurs that often.  Most experimenters run these checks in Rooms 1 and 4 or on their own machines.  From now on, the checks are to be run in the lab, and the machine on which the check was run is noted in the database.  Experimenters are asked to see which machine was last checked and to make sure all machines get checked now and again.  In this manner, we can detect these failures more quickly without any additional time or effort.  Another solution I thought of after our meeting is asking participants whether they noticed anything strange that we need to be aware of.  I will bring up this issue at the next lab meeting, to my experts, to see if it is helpful.

Thursday, March 5, 2015

Just Tell Us What Is Proper and Appropriate: Thoughts on the Upcoming ASA Statement

These are great times to be a methodologist.  There is a crisis in reproducibility; there is a loss of confidence across several domains; and standards are being debated and changed.  It is a time of flux, innovation, and chaos.  None of us is quite sure how it is going to shake out.

The most recent installment in this drama is the banning of significance tests in a social-psychology journal.  This banning has received much attention, and the comment that caught my eye is from the American Statistical Association's (ASA) Ronald Wasserstein.  In it, Wasserstein reports that a group of more than two dozen experts is developing an official ASA response.  Until then, ASA "encourages journal editors and others... [to] not discard the proper and appropriate use of statistical inference." (emphasis added)  ASA's statement provided me an opportunity to more closely examine the relationship between scientists and statisticians.  Here is my message to ASA:

We psychologists tend to put you statisticians on a pedestal as experts who tell us what to do and what not to do.  You statisticians should not welcome being put on this pedestal, for there is a dark side.  What we psychologists tend to do is cleave off analysis from the other parts of the research process.  Whereas we see our research overall as perhaps creative and insightful, many of us see analysis as procedural, formulaic, hard, and done to meet sanctioned standards.  In this view, we psychologists are shifting our responsibilities to you statisticians.  We ask you to sanction our methods, to declare our calculations kosher, to absolve us from being thoughtful, judicious, transparent, and accountable in analysis.  My sense is that this transfer of responsibility is as rampant as it is problematic.  It results in a less transparent, more manufactured, more arbitrary psychological science.

When ASA weighs in, it should be mindful of the current pedestal/responsibility dynamic.  ASA should encourage researchers to be responsible for their analyses; that is, to justify them without recourse to blind authority or mere convention.  ASA should encourage thoughtfulness rather than adherence.  Telling us which options are *proper and appropriate* won't do.  Promoting responsibility and thoughtfulness, however, seems easier said than done.  Good luck.  Have fun.  Knock yourselves out.

Perhaps my experience is helpful.  Statisticians add immeasurable value by helping me instantiate competing theories as formal models.  These models imply competing constraints on data.  We work as a team in developing a computationally convenient analysis plan for assessing the evidence the data provide for the models.  And if we do a good job instantiating theories as models, then we may interpret our statistical inferences as inferences about the theories.  In the end we share responsibility as a team for understanding how the constraint in the models is theoretically relevant and how patterns in the data may be interpreted in theoretical terms.