Monday, March 9, 2015

High Reliability in the Lab

Ten days ago, Tim pulled me aside.  "Jeff, there's the keyboard from Room 2," he said pointing at the trash.  "When I pressed the 'z' key, I had to hit it three times for it to respond.  I wanted to tell you first because I know it affects some of the experiments."

"Thanks, we'll deal with it," I said as that sinking feeling set in.  Questions immediately came to mind.  How long has this keyboard been bad?  What data are compromised?  What can we do to catch this sooner?  Are our systems working?  What the hell are we doing?

About five years ago I had the good fortune of being on a nursing PhD committee.  The candidate was applying the theory of highly reliable organizations to the nursing setting in hopes of improving patient safety.  As I read the characteristics of highly reliable organizations in the thesis, I started making comparisons to my lab.  We made mistakes in the lab, and although none had bitten us in the ass, perhaps we were more fortunate than good.  And I knew my way of dealing with mistakes, asking people to be more careful, wasn't that effective.  But the described characteristics of highly reliable organizations seemed abstract and not immediately translatable.  They were for large-scale organizations like hospitals and air-traffic control, and they were for dangerous environments where people could die.  My lab is small.  Not only have we never injured anyone, it seems highly unlikely that we ever will.

But mistakes and other adverse events not only cost us time, they affect the trust other people may place in our data and analyses.  We need to keep them to a minimum and understand them when they occur.  So I began adapting the characteristics of highly reliable organizations for my lab.  Making changes does take time and effort, but I suspect the rewards are well worth it.

Here are the characteristics of highly reliable organizations and our evolving implementations:
  1. Sensitivity to Operations.  Sensitivity to operations in the context of a lab means enhanced attention to the processes by which data (and interpretations) are produced.  We implemented the characteristic by (a) defining a clear audit trail of lab activities via a MySQL relational database, which has tables for IRBs, experiments, experimenters, participants, and runs; (b) automating many tasks so that smooth operation is less reliant on human meticulousness; and (c) codifying a uniformity among different experiments and experimenters so that more people have a greater understanding of more elements of the operations.
  2. Preoccupation with Failure.  My goal is to anticipate some of the mistakes that can be made and think of ways to prevent them.  For me this has been achieved primarily in two ways.  First, I am structuring the lab so computers do more and people do less.  Examples include writing scripts that all experiments use to name files, add metadata, log demographics, populate the MySQL databases, and upload data to repositories.  These steps have increased the accessibility and quality of the data and metadata we collect.  It's still evolving.  Second, we discuss near-miss mistakes a lot.  Near-miss mistakes are things that didn't quite go right even though they had no effect on the integrity of the lab.  A near-miss that goes unidentified could become an experiment-ruining mistake the next time around.
  3. Resilience.  It is hard to define a resilient lab.  For us, it means that we accept that mistakes will happen and we put in place procedures to deal with them and, hopefully, learn from them.  We have an adverse-events table in our MySQL database.  All issues are logged as an adverse event without blame or shame.  What's the problem?  What's the short-term solution?  What are the long-term solutions and needed policy changes?  Adverse events are discussed at lab meetings, perhaps repeatedly, until we can get some resolution.
  4. Reluctance to Simplify Interpretations.  The simple interpretation of mistakes in my lab is that people occasionally make mistakes and hardware occasionally fails, so we should be more careful where we can and live with the rest.  We, however, are reluctant to take this stance.  Mistakes are a failure (by the PI) to anticipate a certain constellation of conditions, and we should try to identify as many of these constellations as we can envision and adapt where necessary.
  5. Deference to Expertise.  All lab members have some expertise in data collection.  Perhaps the most qualified experts are the graduate students and post-docs.  They write the code, organize the undergraduates, and analyze the data.   The undergrads are on the front lines.  They are setting up the machines, briefing, debriefing, and interacting with the participants.  I bring a long-run view.  Together, with deference to appropriate expertise, we figure out solutions.
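
As a rough illustration of the adverse-events log described in item 3, here is a minimal sketch.  It is not our actual lab code: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name, column names, and function names are all hypothetical.  Note that there is deliberately no "who did it" column, in keeping with logging without blame or shame.

```python
import sqlite3
from datetime import date


def open_lab_db(path=":memory:"):
    """Open the database and make sure the (hypothetical) adverse_events table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS adverse_events (
               id INTEGER PRIMARY KEY,
               logged_on TEXT NOT NULL,
               problem TEXT NOT NULL,
               short_term_solution TEXT,
               long_term_solution TEXT,
               resolved INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def log_adverse_event(conn, problem, short_term=None, long_term=None, logged_on=None):
    """Record what happened and any proposed fixes; return the new event id."""
    logged_on = logged_on or date.today().isoformat()
    cur = conn.execute(
        "INSERT INTO adverse_events "
        "(logged_on, problem, short_term_solution, long_term_solution) "
        "VALUES (?, ?, ?, ?)",
        (logged_on, problem, short_term, long_term),
    )
    conn.commit()
    return cur.lastrowid


def open_events(conn):
    """Return (id, problem) for every unresolved event, for the lab-meeting agenda."""
    return conn.execute(
        "SELECT id, problem FROM adverse_events WHERE resolved = 0 ORDER BY id"
    ).fetchall()
```

Something like `log_adverse_event(conn, "Sticky 'z' key on the Room 2 keyboard")` would add an entry, and `open_events(conn)` would list everything still awaiting resolution.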

So, with last week's keyboard failure, we can see some of these principles in action.  I logged an adverse event and typed up the problem.  I asked Tim how and why he discovered it (he was changing an experiment in Room 2).  We brought it up at the next lab meeting and divided our discussion into immediate concerns and long-term solutions as follows:

The immediate concerns were about the integrity of the data in a few experiments that use the "z" key.  We had a record of which participants were run on these experiments.  We also have a record of which room and computer they were run in.  It is automatically logged.  I had previously set up an environment variable on each machine; the start-up scripts read this variable and write it to the database along with other session variables.  Therefore we know exactly which data are at risk, about 10% of participants.  Moreover, the bad "z" key was used for catch trials.  The response times weren't analyzed, so it is plausible that the failure is inconsequential to the integrity of the experiments.  A bit of good luck here.
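
For readers who want to set up something similar, here is a minimal sketch of the idea, not our actual scripts.  It assumes a per-machine environment variable that I am calling LAB_MACHINE here and a runs table with hypothetical column names, and it again uses sqlite3 in place of MySQL so that the example is self-contained.

```python
import os
import sqlite3


def open_runs_db(path=":memory:"):
    """Open the database and create a (hypothetical) runs table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(participant TEXT, experiment TEXT, machine TEXT)"
    )
    return conn


def log_run(conn, participant, experiment, env=None):
    """Called from an experiment's start-up script: store the run with its machine."""
    env = os.environ if env is None else env
    # LAB_MACHINE is an assumed variable name, set once on each lab computer.
    machine = env.get("LAB_MACHINE", "unknown")
    conn.execute(
        "INSERT INTO runs (participant, experiment, machine) VALUES (?, ?, ?)",
        (participant, experiment, machine),
    )
    conn.commit()
    return machine


def at_risk(conn, machine, experiments):
    """List participants run on the given machine in any of the given experiments."""
    placeholders = ",".join("?" for _ in experiments)
    rows = conn.execute(
        "SELECT participant FROM runs "
        f"WHERE machine = ? AND experiment IN ({placeholders}) "
        "ORDER BY participant",
        [machine, *experiments],
    ).fetchall()
    return [row[0] for row in rows]
```

Because the machine is recorded without anyone having to remember to type it in, a query like `at_risk(conn, "room2", ["exp_with_z_key"])` is enough to find the affected sessions after the fact.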

The long-term solutions were harder.  Hardware will fail.  Yet the single-sticky-key keyboard failure is pretty rare in that it is so hard to detect.  The two most common hardware failures are hard-drive and motherboard failures, which are immediately detectable.  We may go another decade without seeing a hard-to-detect hardware failure.  So, with this in mind, we considered a few options.  One option we considered was doing nothing.  After all, the failure is rare and we were able to mitigate the damage.  Another option was to create a hardware-assurance protocol that could be run weekly.  Yet spending much time on such a rare problem that we could mitigate seemed wasteful of lab resources.  So we settled on the following.  When each of us develops a new experiment, we are required to run it on ourselves once fully to confirm that the conditions appear as we intended, that they appear at the frequencies we intended, and that the data are analyzable as intended.  We switch experiments about once a month, so this check happens about as often.  Most experimenters run these checks in Rooms 1 and 4 or on their own machines.  From now on, the checks are to be run in the lab, and the machine on which the check was run is noted in the database.  Experimenters are asked to see which machine was last checked and make sure all machines get checked now and again.  In this manner, we can detect these failures more quickly without any additional time or effort.  Another solution I thought of after our meeting is asking participants if they noticed anything strange that we need to be aware of.  I will bring up this issue at the next lab meeting, to my experts, to see if it is helpful.
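
Here is one way the "which machine should I run my check on?" question could be answered from the logged checks.  This is only a sketch under assumptions: the function name and the list-of-pairs format are hypothetical, and the machine labels in the usage note are illustrative.

```python
from datetime import date


def next_machine_to_check(machines, checks):
    """Pick the machine that has gone longest without a check.

    machines: list of machine names in the lab.
    checks:   list of (machine, date) pairs, one per logged self-run.
    Machines that have never been checked come first; ties go to the
    machine listed earliest in `machines`.
    """
    last_check = {m: None for m in machines}
    for machine, day in checks:
        if machine in last_check:
            previous = last_check[machine]
            if previous is None or day > previous:
                last_check[machine] = day
    # Sort key: never-checked machines (False) before checked ones (True),
    # then by the date of the most recent check, oldest first.
    return min(
        machines,
        key=lambda m: (last_check[m] is not None, last_check[m] or date.min),
    )
```

With machines `["Room 1", "Room 2", "Room 3", "Room 4"]` and checks logged only for Rooms 1 and 4, the function returns "Room 2", nudging the next experimenter toward a machine that has not been exercised recently.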


Anonymous said...

Great post! I have an adverse event to log: The link to the Wikipedia article is missing the final "n" and thus does not work.

Jeff Rouder said...

Thank you for the kind words. The link is fixed. Low reliability blogging from a guy trying to practice high reliability theory.