
Open data and the four tiers of health data sharing

on Sat, 02/23/2013 - 06:44

Today is Open Data Day. Open Data enthusiasts, activists, developers, hackers, scientists, and other entrepreneurial data geeks are gathering around the world to demand more open data, work in hackathons or code-a-thons, and engage in data discussions. Just google "open data day events" to see the scope. It's very encouraging.

The benefits of releasing open data are manifold. Take as an example open government data: It increases trust in the government by providing more transparency and accountability. It helps improve public services. It can stimulate economic activity and generate jobs. It helps governments improve the use of their own data. It helps increase the exchange of information among different departments and ministries (which are often siloed) and improve collaboration. And as an additional perk, open data will also lead to savings by reducing work on specific data requests.

There are plenty of examples where this works really well. The release of weather data has led to great weather apps, insurance products, and other services; GPS data powers fabulous apps and services on almost every mobile device; public traffic data are used to make commuting and traveling easier; and so on. The situation for health data is slightly different.

Releasing health data as open data requires consent from subjects and careful attention to privacy, and there are specific regulations aimed at data collected in the delivery of health care (e.g. HIPAA in the US) as well as oversight by Institutional Review Boards (IRBs). At the same time, the stakes in health are higher than in many other fields. The sharing of health data enables data users to provide evidence for policy and decision making, track performance, evaluate and improve quality, identify effective interventions, optimize health care pathways, and improve the health of individuals and populations. In short, sharing health data saves lives.

There are four tiers, or degrees of openness, for health data sharing. Given the potential impact, the goal for every organization holding health data should be to publish as much data as possible at the highest level of detail possible, while protecting subjects and complying with regulations.

Tier 1: Open Indicators

Aggregated or tabulated data should always be shared as open data through as many channels as possible, including organizations' websites, data visualization sites, data aggregators, and open data portals. Examples can be found on healthdata.gov or the health section of data.gov.uk, at WHO and the World Bank, and in the recent data release of the Global Burden of Disease study (GBD 2010 regional results; my employer, IHME, is the coordinating organization of the study).

Tier 2: Open Microdata

Detailed data or microdata at the respondent or individual level can often be carefully de-identified and shared as open data. Sample surveys, mortality data, and even hospital discharge data are often shared openly, e.g. CDC's Reproductive Health Survey series on IHME's Global Health Data Exchange (yes, that's the platform that I manage) or the public use datasets for US mortality from NCHS. If funding or other considerations require registration for access, it should be fast (ideally instantaneous) and free, as is the case for microdata from the Demographic & Health Surveys at MeasureDHS.
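To make the de-identification step concrete, here is a minimal sketch of the typical operations involved, written in Python with pandas. The file name, column names, and thresholds are all hypothetical; real de-identification has to be tailored to each dataset and its disclosure risks:

```python
import pandas as pd

# Hypothetical respondent-level survey file; column names are illustrative.
df = pd.read_csv("survey_microdata.csv")

# 1. Drop direct identifiers outright.
df = df.drop(columns=["name", "address", "phone_number"])

# 2. Coarsen quasi-identifiers: exact dates become years,
#    exact ages become five-year bands.
df["interview_year"] = pd.to_datetime(df["interview_date"]).dt.year
df["age_group"] = pd.cut(df["age"], bins=range(0, 105, 5), right=False)
df = df.drop(columns=["interview_date", "age"])

# 3. Suppress small geographic areas that could single out respondents
#    (the threshold of 30 is illustrative, not a standard).
counts = df["district"].value_counts()
rare_districts = counts[counts < 30].index
df.loc[df["district"].isin(rare_districts), "district"] = "other"

df.to_csv("survey_microdata_deidentified.csv", index=False)
```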

Tier 3: Data Use Agreements

When data cannot be shared without restrictions, there should be a clearly defined process for data users to request access to more detailed or partially identified data (provided individuals consented to sharing). These processes need to balance the proposed purpose of using the data against the risk of identifying individuals, and provide proper oversight and safeguards that protect subjects' privacy. US mortality data with county identifiers, for example, are only available under a Data Use Agreement.

Tier 4: Fully controlled data access

If data are too sensitive to hand out at all, data owners can offer options to access and analyze data on their own premises, and allow data users to only take the results of their analyses with them. The US Census Bureau operates Census Research Data Centers (RDCs), where researchers can access the full detail of data on controlled premises; no microdata can be taken out and research results are carefully vetted before being released to the researchers. Short of implementing full-fledged programs, data owners can also collaborate with researchers to provide this kind of access.

Last but not least, sharing information about collected data is a minimum requirement. Over the past few years, my team at the Institute for Health Metrics and Evaluation (IHME) has cataloged and published information for over 8,000 health-related datasets in the GHDx, and we are adding more daily. We are cataloguing data from 200 countries around the world, and it is often incredibly hard to even identify what data have been collected and whom to contact for access. Websites come in different languages and structures, are constantly in flux, can be down for periods of time, and data available one day may be gone the next. Data, and information about them, are often only available in reports, statistical yearbooks, or the published literature. Data owners should make an effort to add information about their data, and at least aggregated results, to open data platforms and catalogues to make them easier to find, and subsequently try to release as much data as possible in each of the four tiers.

Happy Open Data Day!

 

Launch of the Global Burden of Disease Study 2010 results

on Fri, 12/21/2012 - 05:23

On Thursday, 12/13/2012, The Lancet published seven papers with the results of the Global Burden of Diseases, Injuries and Risk Factors Study 2010. The epic five-year study involved hundreds of collaborators who compiled and analyzed all available data on health outcomes globally. My role in the project focused on finding and obtaining input data, managing data at IHME, and creating visualizations. This is the first in a series of blog posts in which I’ll discuss the sources of data used in the different components of the study, the availability of health outcomes data in general, and the metrics that were generated, and share some stories from the trenches. This post provides an introduction to the study. Follow me on Twitter or subscribe to my RSS feed to find out about future installments on mortality, causes of death, non-fatal health outcomes, and covariates.

The Global Burden of Diseases, Injuries and Risk Factors Study 2010 (GBD 2010) is arguably the most comprehensive assessment of human health ever conducted. Richard Horton, Editor of The Lancet, and Peter Piot, Director of the London School of Hygiene and Tropical Medicine, compared GBD 2010 to the Human Genome Project in terms of scope and importance. The results were published by The Lancet in seven papers that took up an entire triple issue of the journal; it is the first time in the 189-year history of the Lancet (founded in 1823) that an entire issue was dedicated to a single study. The results were officially presented at a launch event at the Royal Society in London last week.

GBD 2010 was coordinated by the Institute for Health Metrics and Evaluation (IHME) – my employer – in collaboration with six other organizations: the University of Queensland, Harvard School of Public Health, Johns Hopkins Bloomberg School of Public Health, the University of Tokyo, Imperial College London, and the World Health Organization (WHO). Professors Christopher Murray, director of IHME, and Alan Lopez, head of the School of Population Health at the University of Queensland, developed the approach and methodology for global burden of disease analysis in the 1990s and oversaw this iteration, which included a complete revision of all steps of the analytic process.

GBD 2010 mapped all known diseases and injuries to 291 causes that were then analyzed for the burden they cause through fatal and non-fatal outcomes. This required compiling and analyzing all available published and unpublished data and evidence on health outcomes (notice the emphasis on available? More on that later). Data sources include censuses, surveys, vital statistics, disease registries, hospital records, and many more. Especially for non-fatal outcomes and risk factors, systematic literature reviews were a key source of data. Hundreds of researchers provided data and expertise, and the seven published papers list 486 authors from over 50 countries. The analysis encompassed 18 highly interconnected components (see the overview paper for details).

Compiling these data was a monumental task, but analyzing the overall global burden provided a key advantage over studies that focus on one or a few diseases or injuries. There are 235 causes that can lead to death, and in GBD 2010, deaths from these causes always sum to all-cause mortality in each age-sex-region group, i.e. every death is counted only once or, in scientific terms, the all-cause mortality estimates constrain the cause-specific estimates. Studies that estimate mortality for only one or a few causes don't have this constraint and will often report higher numbers of deaths.
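To illustrate the constraint with a toy example: within one age-sex-region group, independently estimated cause-specific deaths can be rescaled proportionally so they sum exactly to the all-cause envelope. The numbers below are invented, and GBD's actual correction machinery is far more sophisticated (it operates on uncertainty draws, not point estimates), but the idea is the same:

```python
# Hypothetical cause-specific death estimates for one age-sex-region group.
cause_deaths = {
    "ischemic heart disease": 1300.0,
    "stroke": 900.0,
    "road injury": 350.0,
}
all_cause_envelope = 2400.0  # hypothetical all-cause mortality for the group

# The independent estimates overshoot the envelope (2550 > 2400),
# so they are rescaled proportionally to sum to it.
total = sum(cause_deaths.values())
scale = all_cause_envelope / total
constrained = {cause: d * scale for cause, d in cause_deaths.items()}

assert abs(sum(constrained.values()) - all_cause_envelope) < 1e-9
print(constrained)  # every death is now counted exactly once
```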

A key challenge for all parts of this study was the availability of input data (this is the main reason for me to write this series). For many developed countries, data to estimate mortality and non-fatal health outcomes by cause and risk factor are readily available. However, in many of the 187 countries that are part of the 21 GBD regions, these data are not collected at all, or are incomplete, of poor quality, insufficiently documented, only available on paper, or stuck on obsolete storage media. In addition, data are often simply not shared, for political or other reasons, even for this kind of research that provides a global public good. A fundamental take-away from this study is that we need to improve the collection and distribution of health-related data, in developing as well as developed countries. For mortality data, this means improving civil registration and vital statistics systems to make sure that we track every death and its cause everywhere in the world. Survey, census, and other health-related data are just as important for governments to share. For understanding non-fatal health outcomes, access to health records, disease registries, and other detailed health data is essential, of course with proper attention to privacy, confidentiality, and the consent of individuals. All of these obstacles can be overcome, but that requires commitment and political will from the data owners.

GBD 2010 compiled all data on health outcomes available to us. For the first time ever, we now have estimates of mortality and non-fatal health outcomes by cause and risk factor that were developed with a consistent methodology for several points in time (1990, 2005, and 2010). To convey the availability and consistency of the input data, 95% uncertainty intervals were calculated at each step, propagated throughout the analytic process, and are available for all results. GBD 2010 provides estimates even for causes, risk factors, regions, or age groups where data were limited or unavailable; these were imputed using different statistical methods and covariates like GDP, education, and many others to inform the estimates. We believe that estimates based on limited data are better for policy and decision making than no evidence at all. The result is a gigantic database of structured results by age and sex that are comparable across geographies and time, all publicly available. The data will allow global health practitioners, policy makers, donors, media, researchers, and others to explore patterns and trends in health over time.
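Propagating uncertainty through many analytic steps is easiest to picture with draws: instead of carrying a single point estimate through each step, you carry a sample of plausible values and summarize only at the end. Here is a minimal sketch of that idea in Python; the distributions and numbers are invented for illustration and do not reflect GBD's actual models:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 1000

# Step 1 (hypothetical): a mortality rate estimated with uncertainty,
# represented as 1,000 draws instead of a single point estimate.
mort_rate = rng.normal(loc=0.012, scale=0.001, size=n_draws)

# Step 2 (hypothetical): multiply by an uncertain population estimate;
# the uncertainty flows through the arithmetic draw by draw.
population = rng.normal(loc=5_000_000, scale=100_000, size=n_draws)
deaths = mort_rate * population

# Summarize at the end: a mean estimate with a 95% uncertainty interval.
lower, upper = np.percentile(deaths, [2.5, 97.5])
print(f"{deaths.mean():,.0f} deaths (95% UI {lower:,.0f}-{upper:,.0f})")
```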

Here are your key resources to explore further:

Global Development Data Jam at the White House: 10 take-aways

on Tue, 12/11/2012 - 13:41
Yesterday, I was honored to participate in the Global Development Data Jam at the Eisenhower Executive Office Building at the White House. It was a great crowd of sharp, motivated data geeks with a passion for development. A series of insightful and inspiring presentations on data, open data, data collection, and more was followed by working sessions to come up with concrete project ideas that can be implemented over the next 90 days. Here are my 10 take-aways for data for development, in order of appearance during the day:
  1. We need to focus on the big opportunities. Todd Park kicked the day off with his usual display of boundless energy and can-do attitude, posing a fundamental question: "What is the next GPS of development?" What are the vital datasets that should be broadly available to enable innovative solutions (examples for GPS include mapping, directions, location-enabled services, etc.)? Great question. A list of suggestions will follow in a separate blog post.
  2. Data are essential infrastructure for development. Making data broadly available will speed up an evidence-based process of planning, implementing, measuring, and adjusting.
  3. Engaging the crowd to clean or digitize datasets, map infrastructure, and do other related tasks can be very successful, and all the tools needed are available. Examples: USAID cleaned 10,000 records in 16 hours with 300 volunteers at 85% accuracy; Ushahidi's SwiftRiver enables users to let the crowd filter and verify data and organize and present the results.
  4. Existing (and free) social media and mobile phone usage data can be mined for early detection, real-time feedback (disaster assessment), and prediction of trends (flu trends, food prices). UN Global Pulse's Robert Kirkpatrick showed a number of great examples, including many from developing countries. Did you know that Nigeria has 100 million mobile users, that Senegal adds 100,000 new Facebook users per month, that Jakarta is one of the world's "tweetiest" cities, and that 24% of residents in Mogadishu check into Facebook at least once a month? Me neither.
  5. Where these data don't suffice, companies like Jana and Mobile Accord can help roll out short mobile phone surveys in any country in a matter of days.
  6. More organizations and platforms are providing comprehensive access to their data, including UNDP (as of last month), the Millennium Challenge Corporation, Foreignassistance.gov, and others.
  7. Open data can be done anywhere. Literally: Development Seed's Eric Gundersen featured an open data platform for Election Data in Afghanistan.  
  8. Development funding needs more coordination. That's not exactly a new insight, but it's a problem we can tackle from two sides: AidData and partners geocoded all 550 current development projects in Malawi, with a volume of $5.6bn. The World Bank is mapping and sharing data for their project portfolio (Mapping for Results). More countries and donor agencies should do the same.
  9. Without data scientists, you can "share data until the cows come home" without results. True words from DataKind founder Jake Porway. It's a key issue in development that is only partly mitigated by organizations like DataKind. We need more training in statistics for civil society groups, journalists, and others.
  10. In the White House, even paper cups and napkins feature the Seal of the President of the United States.

USAID Administrator Rajiv Shah summed the topic up nicely: The single biggest thing we can do to eradicate poverty? Open data! Data are turning into essential infrastructure for development. And events like the Global Development Data Jam help connect people, organizations, and fields within the development arena. Thanks to the White House Office of Public Engagement, the Office of Science and Technology Policy, and the U.S. Agency for International Development (USAID) for hosting a very inspiring event.

Things to watch at Strata Rx: 5 underlying challenges for sharing health data

on Tue, 10/16/2012 - 06:37

This week brings us the first Strata Rx conference, which explores the role of data and data science in health care. Very timely, because health care is at a crossroads. In many more developed countries, rising costs combined with stagnating outcomes and aging populations make health care systems unsustainable. In less developed countries, a dual or triple disease burden and stagnating development assistance for health hamper progress. Tim O'Reilly said in a recent conversation on health care (worth watching!) that "change happens when the pain of not changing is greater than the pain of changing". Health care is there, ready to be disrupted, and data is key to driving that disruption. It's one of our biggest challenges in the 21st century.

Changes in technology have revolutionized the possibilities for collecting and analyzing health and health-related data (sorry about the buzzword bingo): patient data are captured in electronic health records, smartphones capture and transmit volumes of personal data, social media capture health self-assessments, wearable sensors enable uninterrupted data collection and transmission, genome sequencing is now almost affordable, and cloud computing, open source software, machine learning, and big data management enable sophisticated analysis of all these data. With all these opportunities, leveraging health data to fix health care is not only one of the biggest, but also one of the coolest challenges of the 21st century.

However, there are five underlying challenges for leveraging data to fix health care, all of which center on transparency and accessibility.

  1. Privacy: Sharing data about individuals requires protecting their privacy. However, there is a trade-off: the more identifiers you remove, the better protected the individuals, but the less useful the data become. In addition, linking data from different sources enables much more powerful analysis but also increases privacy risks. When sharing useful health data, there always remains an (often very low) risk of identification. Therefore, we need strong de-identification techniques as well as powerful legal deterrents against using data to identify individuals. And we need to earn individuals' trust that their data are handled responsibly.
  2. Consent: Individuals need to agree that their data are shared with others. They should be able to decide exactly what their data can be used for, and be able to withdraw that consent if they wish. Currently, there is limited transparency and very little control for patients over how their data are shared.
  3. Data Use Agreements: Fully de-identified data (i.e. data with a very low risk of identifying individuals) should be shared as open data. Data with identifiers can be shared as limited-use data for appropriate purposes and under data use agreements. However, there are currently no standards for these kinds of agreements and their stipulations, which often makes them difficult to negotiate and implement.
  4. Research ethics: Research that involves collecting data from individuals or using data with direct identifiers often requires ethics oversight, e.g. by Institutional Review Boards. Regulations like the United States' HIPAA detail what can be shared and how. While this oversight is necessary, it often hampers progress by being too strict and difficult to implement. Regulations and their interpretations need to keep pace with the rapid developments in data collection and analysis, the globalization of research, and individuals' changing attitudes toward data sharing, e.g. in social media.
  5. Incentives for sharing: There are powerful arguments for sharing. Open data can create entire ecosystems. Sharing unlocks external creativity and analysis, and most of the world's smartest people don't work for you. Most importantly, sharing and using health data can save lives, so sharing becomes a moral imperative. However, many reasons beyond privacy and consent keep data owners from sharing: competition, fear of misuse, reluctance to give up the power of information, political agendas, academic publication plans, etc. The fragmentation of health systems multiplies the number of players, each with its own motivations for not sharing health data. We need better incentives and frameworks to encourage and facilitate data sharing. Patients can take a lead role here by sharing their own data and asking providers and others to share their data responsibly.

The next two days will touch heavily on these areas, and I'm looking forward to connecting with other health data innovation enthusiasts. Follow me on Twitter for instant updates, and stay tuned for follow-up posts.

Providing access to detailed demographic and health data: Census Research Data Centers

on Tue, 09/25/2012 - 10:43

Sharing health data and making them 'open' isn't easy, and one of the key reasons is privacy. You can remove direct identifiers, but other detailed data, like treatment dates, can still make it possible to identify subjects. Quickly growing amounts of information in marketing databases and social media further add to the risk of identification. On the other hand, the more detail you remove from a dataset, the less useful it becomes for analysis and research. In the end, it's about balancing the risk of identification with the usefulness of the data.
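One common way to quantify that identification risk is k-anonymity: every combination of quasi-identifiers (age band, sex, region, treatment month, and so on) should be shared by at least k records. Here is a minimal sketch of such a check in Python with pandas; the file name, column names, and threshold are all hypothetical:

```python
import pandas as pd

def smallest_group_size(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Size of the smallest group of records sharing the same combination
    of quasi-identifier values -- the 'k' in k-anonymity."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical hospital discharge dataset and quasi-identifier columns.
df = pd.read_csv("discharges.csv")
k = smallest_group_size(df, ["age_group", "sex", "region", "discharge_month"])

if k < 5:  # the threshold is illustrative; acceptable k depends on context
    print(f"k = {k}: some records are nearly unique and easier to re-identify")
else:
    print(f"k = {k}: every quasi-identifier combination covers {k}+ records")
```

A low k means some combination of attributes singles out a handful of people, which is exactly when coarsening or suppressing more detail becomes necessary.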

To make highly detailed data, including direct and indirect identifiers, available to researchers, data owners need to create controlled environments in which researchers can use the data for approved purposes and take away only results that create little or no risk of identification. The US Census Bureau runs 14 Research Data Centers (RDCs) across the US that do just that. The latest one, the Northwest Census Research Data Center (NWCRDC), was opened yesterday at the University of Washington in Seattle by the acting Director of the Census Bureau, Tom Mesenbourg, and the Director of the NWCRDC, Dr. Mark Ellis.

Like the other Census Research Data Centers, the NWCRDC provides access to demographic, economic, and health microdata (i.e. respondent-level data), including censuses, surveys, administrative data, and health data from the National Center for Health Statistics (NCHS) and the Agency for Healthcare Research and Quality (AHRQ). The datasets go back to the 1970s, including the 1970 decennial census, and the Census Bureau is working with the University of Minnesota to make the microdata from the 1960 census available. The data are available to qualified researchers for projects that are reviewed carefully to prevent abuse. Once approved, however, researchers can link individuals across datasets in the RDC and even link in their own datasets with identifiers. This provides unique opportunities for research that would otherwise not be possible and is a fantastic resource for researchers in the Pacific Northwest.

Why should data holders consider going this route? For the US Census Bureau, the most important benefits are new estimates and data products, efficiency, expanded measurement capabilities, and improved documentation of their own data. They are tapping into creative and innovative thinkers to find additional uses for the data that benefit the Census Bureau and the American public. Currently, more than 650 researchers are working on 150 projects across the 14 RDCs.

Researchers who want to use the RDC need to write a proposal about their planned research (more details here), which the US Census Bureau reviews for its scientific merit and its benefit to the Census Bureau and the public. Proposals to use data provided by other agencies like NCHS and AHRQ are reviewed by those organizations. All work has to be conducted at the NWCRDC on campus at the University of Washington.

More data owners or data holders should consider making more detailed data available. Research Data Centers are one possible solution. I'll discuss others in future posts.
