Sunday, April 10, 2016

This Summer's Challenge: Share Your Data

"It would take me weeks of going through my data and coordinating them, documenting them, and cleaning them if I were to share them." anonymous senior faculty member

"Subject 7 didn't show. There is an empty file. Normally the program would label the next person Subject 8 and we would just exclude Subject 7 in analysis. But now that we are automatically posting data, what should I do? Should I delete the empty file so the next person is Subject 7?" anonymous student in my lab

"Why? Data from a bad study is, by definition, no good." @PsychScienctists, in response to my statement that all data should be curated and available.

All three quotes illustrate a common way of thinking about data: our data reflect something about us. When we share them, we share something deep and meaningful about ourselves. Our data may be read as statements about our competence, our organizational skills, our meticulousness, our creativity, and our lab culture. Even the student in my lab feels this pressure, worrying that our shared data won't look sufficiently systematic because there is nothing for Subject 7. Maybe we want to present a better image.

The Data-Are-The-Data Mindset

I don't subscribe to the Judge-Me-By-My-Data mindset. Instead, I think of data as follows:
  • Scientific data are precious resources collected for the common good.
  • We should think in terms of stewardship rather than ownership: be good stewards.
  • Data are neither good nor bad, neither neat nor messy. They just are.
  • We should judge each other by the authenticity of our data.

Mistake-Free Data Stewardship through Born-Open Data

To be good stewards and to ensure authentic data, we upload everything, automatically, every night. Nobody has to remember anything, and nobody makes decisions---it all just happens. Data are uploaded to GitHub, where everyone can see them. In fact, I don't even use locally stored data for analysis; I point my analyses to the copy on GitHub. We upload data from well-thought-out experiments. We upload data from poorly-thought-out, bust experiments. We upload pilot data. We upload incomplete data. If we collected it, it is uploaded. We have an accurate record of what happened in the lab, and you are all welcome to look in on our GitHub account. I call this approach born-open data, and I have an in-press paper about it. We have been practicing born-open data for about a year.
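
For the curious, the nightly upload amounts to little more than a small script plus a scheduler. Here is a minimal sketch in R; the path, repository, branch, and schedule are all hypothetical, and a plain shell script would do the same job:

    # nightly-sync.R --- a minimal sketch of an automatic nightly upload.
    # Assumes the lab's data directory is a clone of a GitHub repository
    # and that push credentials are already configured on the machine.
    # Schedule it with cron, e.g.:  0 2 * * * Rscript ~/nightly-sync.R

    data_dir <- "~/lab-data"                          # hypothetical local data directory
    setwd(data_dir)

    system("git add -A")                              # stage every new or changed data file
    msg <- paste("Nightly data upload:", Sys.Date())  # timestamped commit message
    system(sprintf('git commit -m "%s"', msg))        # harmless no-op if nothing changed
    system("git push origin master")                  # publish to GitHub

Because the script runs unattended every night, there is no decision for anyone to make---empty files like Subject 7's get uploaded along with everything else.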

So far, the main difference I have noticed is an increase in quality control with no energy or time spent maintaining it. Nothing ever gets messed up, and there is no after-the-fact reconstruction of what happened. There is only one master copy of the data---the one on GitHub. Analysis code points to the GitHub version, so we never analyze wrong or incomplete data. And it is trivially easy to share our analyses among lab members and others; in fact, we can build the analyses right into our papers with knitr and Markdown. Computers are so much more meticulous than we will ever be. They never take a night off!
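
Pointing an analysis at the GitHub copy is equally simple. Here is a sketch, again in R, with a hypothetical user, repository, and file name; the same lines can sit inside a code chunk of an R Markdown document and be knit with knitr so the paper always reflects the master copy:

    # Read a subject's data straight from GitHub rather than from a local copy.
    # The user, repository, and file name below are hypothetical.
    url <- "https://raw.githubusercontent.com/your-lab/lab-data/master/sub01.csv"
    dat <- read.csv(url)   # recent versions of R read https URLs directly
    summary(dat)           # quick sanity check of the freshly pulled data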

This Summer's Challenge: Automatic Data Curation

I'd like to propose a challenge: set up your own automatic data-curation system for the new data you collect. Work with your IT people. Set up the scripts. Hopefully, when next fall rolls around, you too will be practicing born-open data!


