Be Prepared: Lessons from an Extended Outage of a Hospital’s EHR System
If your hospital has deployed an electronic health record (EHR) system, you probably have a contingency plan in the event of a system outage. After all, computing systems go down, and when an EHR system is not working, it affects nearly every aspect of a hospital’s operations, from patient care to admissions to finance to supply chain and more.
Having an effective response plan is critical for mitigating the impact of downtime, and your organization has likely put a tremendous amount of thought and care into its contingency plan. But your plan may have an Achilles’ heel that your organization is completely unaware of—a weakness that could leave your organization as poorly prepared as if you had no contingency plan at all. Where are the holes in your plan? Find them by asking a simple question: What is the longest hypothetical outage you have planned for?
Are you prepared for an EHR outage of an hour? Eight hours? Twenty-four hours? What about 10 days?
We mention 10 days for a good reason. That is the length of an EHR outage that Boulder Community Hospital in Colorado experienced earlier this year. And we are not alone. We have heard and read about similar extended EHR outages at other hospitals, so this was not an isolated experience. It could happen anywhere, and when it does, it can undermine even the best contingency plan. Our plan’s “worst case scenario” for an EHR outage did not anticipate a situation where our entire hospital information system would be down for 10 days. We got through the outage thanks to a tremendous amount of hard work and ingenuity by members of our clinical team, administrative team and technology team, and we learned a lot of lessons and developed new best practices through that experience.
The purpose of this article is to share what we learned so that your organization can integrate these lessons into its contingency plan and be better prepared for a similar technology failure, if (or when) it happens.
How Long Is Long-Term?
One of the biggest lessons we learned through this experience is that our imagination was not pessimistic enough when anticipating what would constitute “long-term” downtime for the EHR system. Most hospitals define a “short” outage as an hour or less, followed by an escalating set of protocols that work their way up to a “long-term” outage of 12 to 18 hours at the most. That approach measures potential outages in hours, but our experience, and the recent experiences of other hospitals, shows that contingency plans should redefine long-term outages to be measured in days or even weeks, in order to adequately prepare for what might happen if your EHR software goes down.
In addition to the extended length of our outage, there were three additional aspects of the situation that are important to discuss:
- Our outage involved something that most hospitals are probably unprepared for: a loss of data from the EHR system. EHR software is architected with numerous safeguards to protect against the loss of patient data in the event of downtime, so most health care organizations probably do not have a section of their contingency plans dedicated to that possibility. But it happened to us, and it could easily happen to your organization. That data loss added a layer of complexity to our recovery, because we had to run a data recovery process in parallel with the EHR system recovery process. This involved recovering data from the memories of disparate medical devices (e.g., lab instruments and diagnostic imaging equipment) and information systems across the hospital in a sleuthing process that would have made Sherlock Holmes proud. It was a process we had to figure out on the fly. Your contingency plans can build in procedures that prepare for that possibility and that organize a rapid-response initiative if data is discovered missing. While we are on the topic of data recovery, we should note that our involvement in a Health Information Exchange (HIE) played an important role in helping us recover from this data loss. In 2011, we became the first hospital in Colorado to participate in the statewide HIE operated by the Colorado Regional Health Information Organization (CORHIO). As a result of that effort, lab tests, diagnostic imaging results and transcribed reports like operative reports, discharge summaries and inpatient progress notes were available for use in re-creating the patient record. Our HIE involvement was a significant asset in our recovery process, and it may play a similarly important role for your organization.
- Another important lesson we learned through this process is the importance of having a predetermined timeframe for enacting your contingency plan for a long-term outage. In our case, in the early hours and days of our downtime, it appeared several times that our system would be brought back online. Those expectations did not become reality, and they delayed our decision to officially enact our long-term plan. If we had it to do over again, we would start the clock as soon as the outage began and we would officially implement the highest-level actions once we got to hour X, rather than assume that we would be saved by the potential fixes being worked on by the vendor and our diligent IT department. We also learned the importance of giving the IT staff adequate time to diagnose and repair the problem before jumping to conclusions about when things would be “back online.” It is important to have realistic expectations about the length of time it will take to fully diagnose and resolve the technical problems, and it is equally important to move ahead with implementation of the contingency plan at a predetermined time while the IT staff continues to do their job.
- Another important lesson is about preparedness for secondary emergencies that occur during an IT outage. An outage of this length could easily coincide with a separate, unrelated crisis, creating a catastrophic situation that most hospitals’ contingency plans would be unprepared for. Most hospitals have extensive plans for how to deal with a natural disaster or other incident that results in IT downtime, but contingency plans do not always anticipate the possibility of an unrelated event occurring while you are already struggling with a major IT issue. Many hospital contingency plans assume that everything else is static at the hospital, but what if a mass casualty event occurs during the downtime? Are you prepared to manage a second emergency while you are trying to get your EHR system running, and vice versa? We were lucky that no second event emerged during the 10-day period of our EHR outage, but luck is not a plan. We are now revising our contingency plans to address the simultaneous occurrence of a “traditional” emergency and an IT outage.
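The "start the clock" lesson above can be expressed as a simple rule: the active response level depends only on how long the system has been down, never on anyone's estimate of when it will be fixed. The following is a minimal sketch of that idea in Python. The tier names and hour thresholds are hypothetical placeholders chosen for illustration, not values from our plan; each organization would substitute its own "hour X."

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    starts_after: timedelta  # elapsed downtime at which this tier takes effect


# Hypothetical thresholds, for illustration only.
TIERS = [
    Tier("short-outage procedures", timedelta(hours=0)),
    Tier("extended-downtime procedures", timedelta(hours=4)),
    Tier("long-term contingency plan", timedelta(hours=12)),
]


def current_tier(outage_start: datetime, now: datetime, tiers=TIERS) -> Tier:
    """Return the tier that applies, based purely on elapsed downtime."""
    elapsed = now - outage_start
    if elapsed < timedelta(0):
        raise ValueError("now is earlier than outage_start")
    active = [t for t in tiers if t.starts_after <= elapsed]
    if not active:
        raise ValueError("no tier covers this elapsed time")
    # The tier with the largest threshold already reached wins.
    return max(active, key=lambda t: t.starts_after)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 8, 0)  # hypothetical outage start
    for hours in (1, 5, 13):
        tier = current_tier(start, start + timedelta(hours=hours))
        print(f"{hours:>2} h down: {tier.name}")
```

Note that the function takes no "expected time to repair" input. That omission is deliberate: it mirrors the point that hopeful repair estimates should not delay enacting the long-term plan.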
Never Say Never Again to Paper
EHR systems help achieve many efficiencies, but during an extended IT outage you are going to appreciate a paper-based fallback system like a kid wanting cold, soothing ice cream after a tonsillectomy. It was not so long ago that our hospital was still using paper records extensively: as recently as two years ago, paper was still very prevalent in our operations, but we learned that it does not take long for paper-based procedures to become a distant memory.
If your contingency plans for an IT outage simply suggest using your old paper records as a fallback plan until the computers are back up, you will be setting yourself up for a lot of headaches. The first obstacle to that plan is a very practical one: hospital staffs change fast, and it does not take long for your clinical and administrative team to have a lot of new faces who:
- Come from other hospitals that used very different paper processes, which means they do not have familiarity with your prior paper processes; and
- Are young employees whose entire training has been on electronic systems, which means they have never actually seen a paper medical record.
In our case, we experienced both obstacles and there was a learning curve to help get everyone on the same page (literally).
Another hurdle to simply dusting off old paper records is that they quickly become out of date. Old paper records that have been sitting on a shelf are static documents that have not kept up with the evolution of actual processes at the hospital, which means they quickly become a poor substitute for electronic software. Old paper forms will likely require a thorough review and updating in order to make sure they are usable. Reviewing and updating during an outage will cause you to lose valuable time. As a best practice, we have adopted a policy of doing the review and update proactively on a concurrent basis to ensure that paper medical records can immediately be used in the case of an emergency.
We should also mention that our staff came up with a novel solution for makeshift paper medical records that worked for a while before we recognized its limitations. Once we realized that our EHR system would not be brought back online immediately, our clinical staff began using a printed version of the online interface of our EHR software. It was exactly what they would see on the screen, but it was printed on paper. That had the advantage of being familiar, was in a format that would make data entry simple once the EHR system was restored, and reflected the current processes and workflow built into the EHR. It was a clever solution and it worked for a day or so before its limitations became clear: it does not take long for this style of medical record to grow to 50+ pages long, making them monstrous documents that are difficult to work with. Each time the nursing unit got new lab results or a new progress note from a doctor, it meant another piece of paper to put in the chart. The pages grew quickly, becoming unwieldy and confusing to the staff. The answer was to have true “traditional” paper records, which we implemented in time to avoid any additional confusion.
As a follow-up to our outage, one of the critical updates we are making to our contingency plans is an organizational commitment to have continuously updated paper records that can be used at a moment’s notice. We have also made a commitment to regularly train staff, particularly new team members, to ensure that they are familiar with our paper system.
Get Ready to Put on a New Hat
By design, hospitals are regimented work environments with clear roles and responsibilities for both clinical and non-clinical staff, which ensure patient safety and allow us to efficiently manage the minute-to-minute operations of a large, complex health care organization. Everyone knows their role and how they are supposed to do their jobs, but an extended EHR system outage will likely push everyone into unfamiliar territory that alters or completely changes the hats they wear at the hospital. Everyone needs to be flexible and adaptable in order to make it work, and that starts at the top.
The first team to be impacted by the outage was our hospital’s IT staff (myself included, as CIO), who shifted into emergency response mode when the hospital information system went down. The team instituted its contingency plan and recognized the need to restore from a backup, but the specific nature of the outage prevented the plan from being effective. The IT team entered into an intensive problem-solving process. They quickly discovered that the most recent backup was corrupt, so the team could not perform a near-term restore. The team then had to go back to the most recent full backup, but that was also corrupt. What they discovered was that two major EHR failures had occurred:
- The problem that caused the outage actually started several hours before the system became unavailable, and it eroded the system in a way that went undetected for more than eight hours.
- The error was replicated in the backups, which made several of them unusable.
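The search our team performed, walking backward through backups until an uncorrupted one turned up, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure or tooling we used: the `Backup` record, the `is_intact` check (which might be a checksum comparison or a test restore), and all timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    label: str


def latest_usable_backup(
    backups: Iterable[Backup],
    is_intact: Callable[[Backup], bool],
) -> Optional[Backup]:
    """Return the newest backup that passes the integrity check, or None."""
    for backup in sorted(backups, key=lambda b: b.taken_at, reverse=True):
        if is_intact(backup):  # hypothetical check supplied by the caller
            return backup
    return None


def data_gap(backup: Backup, outage_at: datetime) -> timedelta:
    """Window of data that must be recovered from other sources."""
    return outage_at - backup.taken_at


if __name__ == "__main__":
    # Hypothetical backup catalog; the two newest copies are corrupt.
    catalog = [
        Backup(datetime(2024, 1, 1, 12, 0), "noon"),
        Backup(datetime(2024, 1, 1, 15, 0), "afternoon"),
        Backup(datetime(2024, 1, 1, 18, 0), "evening"),
    ]
    corrupt = {"afternoon", "evening"}
    chosen = latest_usable_backup(catalog, lambda b: b.label not in corrupt)
    if chosen is not None:
        gap = data_gap(chosen, datetime(2024, 1, 1, 20, 0))
        print(f"restore from {chosen.label}; recover {gap} of data elsewhere")
```

Running an integrity check like this routinely, rather than for the first time during an outage, is one way a plan could detect replicated corruption before the backups are needed.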
The most recent successful and uncorrupted backup had been at noon on the day of the outage. Nine file servers needed to be restored back to that point in time. To make things more complicated, the hospital staff had to work through locating the eight hours of data that was no longer in the system. Per the master contingency plan, the hospital activated an incident command center to oversee the system restore efforts and to assist the rest of the hospital’s process of working without the EHR system. Having a single command center is a best practice outlined in our contingency plan, but the complexity and scope of the IT issues (involving our internal technical team, the EHR vendor’s support team, other external vendors and data center operations teams) supported our decision to split the command center into two: one focused on the technology restore process and one focused on hospital operations. There were some downsides to operating two parallel command centers but the strategic decision to have the IT team manage its own “war room” was necessary during the early stages of the outage.
While our IT team was responding to technical challenges of the outage as its full scope became visible, the rest of the hospital also had to demonstrate its flexibility with many people adapting or changing their roles altogether. In many ways, our senior management team may have made the most dramatic changes in their roles. With the EHR system down, the built-in clinical communication it provides between departments also went down. Senior management team members set aside their typical duties and spent a lot of time on the hospital floors acting as liaisons, serving as messengers between key team members and departments, and in many ways acting as assistants to clinical staff so they could stay focused on patient care.
Entire departments at our hospital put on new hats as well. For example, the billing and quality teams were not able to do their normal activities because of the EHR outage, so we re-assigned them to provide support in other areas of need such as data collection and restoration. Our CFO and his team set aside their usual duties to tackle high-priority tasks such as ensuring that payroll went out on time when the IT outage interfered with data transmission to external third-party entities like the bank, insurance carriers and tax authorities. A number of clinical staff members were re-assigned to support special roles that replicated built-in processes in the EHR, including coordination with the lab and imaging departments, helping those departments track where patients were, delivering lab results and scans to doctors and nurses in a timely manner, ensuring that the pharmacy had accurate dosage information that was previously tracked by our hospital information system, and much more. An EHR outage means that in many areas it will not be business as usual and having a contingency plan that anticipates those role changes will help avoid the delay of figuring it out during the emergency.
One final thought on the role of hospital leadership in preparing for this kind of emergency: many organizations leave this contingency planning up to the individual department, without strong oversight from the hospital’s senior leadership team. Few organizations establish a protocol that gives the organization’s leadership team a central role in developing emergency plans and ensuring that all departmental plans are cohesive and consistent, but this should be a best practice that all organizations adopt. It is critical for hospital leadership to play an active role in the process of developing, coordinating and testing contingency plans. For many executives, this will be a new hat to wear, but it may be one of the most important.
Start Preparing for Afterward while the IT Outage Is Still Going On
At a certain point in an outage, the IT team, clinical staff, and the administrative team will all turn the corner and have a firm handle on their responsibilities. They will have adapted to the new “normal.” That is not a time for you to take a breather, though. That is when the people heading up the command centers and other senior executives and department managers need to prepare for the work that will need to be done when the EHR system is fully restored, including reentering the missing data, performing a forensic investigation, etc.
An outage of several days will create major challenges for accurate billing after the outage ends. All those paper medical records will need to be deciphered and translated into billing, and that process could take weeks for an outage of this duration. A lengthy EHR outage may also delay filings with insurance companies and federal agencies. Our team had to proactively alert insurance payers and relevant agencies that we might miss deadlines for providing billing information and reports about health care services/outcomes.
In addition to being proactive regarding billing, it is also important to assess and track the costs of the event for insurance purposes. We formed a special cost recovery team to account for all expenses related to the downtime, including overtime for staff. This has been critical in working with our insurance carrier to quantify our claim for business interruption losses.
Fixing the EHR system was just the beginning of the process. There was a lot to do after the EHR was back up and running, and the time to prepare for that is before the error messages on the computer screens are resolved. Our team is revising our contingency plans to incorporate all of those forward-looking steps.
Food for Thought
We have spent most of this article talking about process-focused advice for responding to an EHR outage, but it is important to remember that a hospital is not a machine. It is an organization of people, and this kind of emergency will put a tremendous amount of pressure and stress on them. Staff get tired, hungry and worn out. Taking care of our most precious asset, our people, is just as important as everything else we have discussed above.
The nurses, doctors, lab technicians, IT team members and other hospital staff are the glue that will hold things together until your systems are restored. You cannot underestimate how important it is for the hospital staff to see and feel the support of the leadership team. Walking the floor, talking to team members, lending a hand, and looking for ways to be a problem-solver greatly impact morale and efficiency.
These situations also take a significant physical toll on the staff because of the extra work and hours that are often involved. Lending a hand and thanking them for their efforts goes a long way, but so do snacks and naps. Do not overlook the kind of impact you can have if you order a thank-you pizza for a hungry and tired IT team, or deliver a tray of fruit and granola bars to the nursing staff who have kept patients happy despite the situation unfolding behind the scenes. Keep an eye out for signs of sleep deprivation and insist that people head home for some sleep. Issue a batch of Starbucks gift cards to department heads to hand out to team members in recognition of their above-and-beyond efforts. Having the right contingency plans and emergency training is important, but many times it’s the human touch that makes the greatest impact in keeping people motivated and engaged when you need them most.