Providing access to detailed demographic and health data: Census Research Data Centers

on Tue, 09/25/2012 - 10:43

Sharing health data and making them 'open' isn’t easy, and one of the key reasons is privacy. You can remove direct identifiers, but other detailed data such as treatment dates can still make it possible to identify subjects. Rapidly growing amounts of information in marketing databases and social media further add to the risk of identification. On the other hand, the more details you remove from a dataset, the less useful it becomes for analysis and research. In the end, it’s about balancing the risk of identification with the usefulness of the data.
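To make that trade-off concrete, here is a minimal Python sketch. It uses a k-anonymity style check, which is my choice of illustration and not a method prescribed by any of the organizations mentioned here. The records, field names and generalization rules are all made up. The idea is to count how many records share the same combination of indirect identifiers: a smallest group of 1 means somebody is unique on those fields.

```python
from collections import Counter

# Hypothetical records; direct identifiers (names, IDs) are already removed.
records = [
    {"zip": "98105", "birth_year": 1975, "treatment_date": "2012-03-14"},
    {"zip": "98105", "birth_year": 1975, "treatment_date": "2012-03-29"},
    {"zip": "98105", "birth_year": 1978, "treatment_date": "2012-03-02"},
    {"zip": "98112", "birth_year": 1978, "treatment_date": "2012-04-10"},
    {"zip": "98112", "birth_year": 1971, "treatment_date": "2012-04-21"},
    {"zip": "98112", "birth_year": 1974, "treatment_date": "2012-04-05"},
]


def smallest_group(rows, key):
    """Size of the smallest group of rows that share the same key.

    A result of 1 means at least one record is unique on those fields.
    """
    counts = Counter(key(row) for row in rows)
    return min(counts.values())


def detailed(row):
    # Full detail: 5-digit ZIP, exact birth year, exact treatment date.
    return (row["zip"], row["birth_year"], row["treatment_date"])


def coarse(row):
    # Assumed generalization: 3-digit ZIP, birth decade, treatment month.
    return (row["zip"][:3], row["birth_year"] // 10 * 10, row["treatment_date"][:7])


k_detailed = smallest_group(records, detailed)
k_coarse = smallest_group(records, coarse)
print(k_detailed, k_coarse)
```

With full detail every record in this toy dataset is unique. After coarsening, each record shares its values with at least two others, but exact dates and locations are no longer available for analysis.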

In order to make data with lots of detail along with direct and indirect identifiers available to researchers, data owners need to create controlled environments in which researchers can use the data for approved purposes and retrieve results which create little or no risk of identification. The US Census Bureau runs 14 Research Data Centers (RDC) across the US that do just that. The latest one, the Northwest Census Research Data Center (NWCRDC), was opened yesterday at the University of Washington in Seattle by the acting Director of the Census Bureau, Tom Mesenbourg, and the Director of the NWCRDC, Dr. Mark Ellis.

Like the other Census Research Data Centers, the NWCRDC provides access to demographic, economic and health microdata (i.e. respondent-level data), including censuses, surveys, administrative data, and health data from the National Center for Health Statistics (NCHS) and the Agency for Healthcare Research and Quality (AHRQ). The datasets go back to the 1970s, including the 1970 decennial census, and the Census Bureau is working with the University of Minnesota to make the microdata from the 1960 census available. The data are available to qualified researchers for projects that are reviewed carefully to prevent abuse. Once approved, however, researchers can link individuals across datasets in the RDC and even link in their own datasets with identifiers. This provides unique opportunities for research that would otherwise not be possible, and it is a fantastic resource for researchers in the Pacific Northwest.
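As a rough illustration of what linking individuals across datasets means in practice, here is a small Python sketch. Everything in it is hypothetical: the `person_id` field, the two toy datasets and the `link` helper are made up, and the actual tools and identifiers used inside an RDC may look quite different.

```python
# Hypothetical survey data keyed by a made-up person identifier.
survey = [
    {"person_id": "A1", "household_income": 52000},
    {"person_id": "B2", "household_income": 87000},
    {"person_id": "C3", "household_income": 31000},
]

# Hypothetical health data using the same made-up identifier.
health = [
    {"person_id": "A1", "doctor_visits": 4},
    {"person_id": "C3", "doctor_visits": 9},
    {"person_id": "D4", "doctor_visits": 1},
]


def link(left, right, key):
    """Inner join: keep only individuals that appear in both datasets."""
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the fields of both records into one row.
            linked.append({**row, **match})
    return linked


linked = link(survey, health, "person_id")
for row in linked:
    print(row)
```

Only individuals present in both datasets end up in the linked result. The same identifier that makes this kind of analysis possible is also what makes the data sensitive, which is why the work happens in a controlled environment.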

Why should data holders consider going this route? For the US Census Bureau, the most important benefits are new estimates and data products, efficiency, expanded measurement capabilities, and improved documentation of their own data. They are tapping into creative and innovative thinkers to find additional uses for the data that contribute to the Census Bureau and the American public. Currently, more than 650 researchers are working on 150 projects across the 14 RDCs.

Researchers who want to use the RDC need to write a proposal about their planned research (more details here), which is reviewed by the US Census Bureau for its scientific merit and its benefit to the US Census Bureau and the public. Proposals to use data provided by other agencies like NCHS and AHRQ are reviewed by those organizations. All work has to be conducted at the NWCRDC on campus at the University of Washington.

More data owners or data holders should consider making more detailed data available. Research Data Centers are one possible solution. I'll discuss others in future posts.

A Buffet of Health Data

on Wed, 09/19/2012 - 14:55

This is a cross-post from the blog and co-authored by Aman Bhandari (@GHideas) and Steven Randazzo (@worksteven). Aman and Steven work with US CTO Todd Park and are driving forces behind the Department of Health and Human Services' Health Data Initiative.

Hundreds of codeathons are held throughout this country every year resulting in the development of innovative applications, like the “Like” button on Facebook, or solutions to critical social and health problems, like childhood obesity. 

The Department of Health and Human Services is interested in the development of innovative applications and in solving critical social and health problems. To help you make the most of the opportunity to solve some of the most critical health issues this country faces, we have developed a site populated with resources for developers, entrepreneurs and people who just want to play around with health data. Over 300 datasets are listed there, including everything from the FDA adverse events reporting database to information on over 120,000 clinical trials to Head Start locations nationwide to the Health Indicators Warehouse. In addition to the robust amount of general health data, the Centers for Medicare and Medicaid Services (CMS) has national compare data available, ranging from hospital compare to nursing compare to dialysis compare data, all of which can be found online. To help you navigate the available datasets, we have a slide deck, our health data starter kit, that will take you through an introduction to some of the datasets we have available.

We have already seen success from developers who have taken open health data and leveraged it to tackle important issues like FDA recalls at the Hokie Hackathon in Blacksburg, Virginia, childhood obesity at the Cajun Codefest in Lafayette, Louisiana, or unemployment and its contributing factors at unWIREd in Baltimore, Maryland. Whether you participate in your own codeathon or in the upcoming codeathon with the Greater Baltimore Tech Council, Groundwork, on September 28th and 29th in Baltimore, Maryland, data is the fuel to solve some of the biggest health care problems in the nation.

This past June we held our 3rd Annual Health Datapalooza, the largest showcase of what developers and entrepreneurs are doing with open health data. With over 1500 participants, we profiled how over 75 companies are using open government data to power their services, applications and insights. If you need some inspiration or ideas, we have video of all the companies presenting at the 2012 Health Datapalooza.

If you want to stay abreast of related events and what we have going on, you can sign up for our HHS Innovation Update and our weekly data news feed that focuses on the intersection between data, health and technology. Finally, we will be opening up a call for applications to present at the 4th Annual Health Datapalooza in December; anyone can apply.

Codeathons across the country have used open data as a raw material to supply their creations. In addition, open health data is being leveraged in several prize competitions that are currently open. Some data-focused and non-data-focused examples are listed below.


Health Data Platform Simple Sign-On Challenge

  • Deadline for submissions: October 3, 2012
  • Total Prizes: $35,000

Health Data Platform Metadata Challenge

  • Deadline for submissions: October 3, 2012
  • Total Prizes: $35,000

My Air, My Health Challenge

  • Deadline for submissions: October 6, 2012
  • Total Prizes: $160,000

The Million Hearts Risk Check Challenge

  • Deadline for submissions: October 31, 2012
  • Total Prizes: $125,000

Ocular Imaging Challenge

  • Deadline for submissions: November 9, 2012
  • Total Prizes: $150,000

Medicaid Provider Enrollment Screening Challenge Series

  • Deadline for submissions: November 16, 2012
  • Total Prizes: $500,000

Challenge: Reducing Cancer Among Women of Color

  • Deadline for submissions: February 5, 2013
  • Total Prizes: $100,000


Open Government Data at IOGDC

on Sun, 09/16/2012 - 23:07

Below is a presentation I just put together with insights from the International Open Government Data Conference (IOGDC) which took place in July 2012 in Washington, D.C. I am presenting this deck at an international work group meeting tomorrow and would love to get your feedback or additional insights.

If you didn't have a chance to attend the conference, I wrote an overview of open health data and some nuggets of wisdom from the conference, as well as thoughts on creating an open data ecosystem. There are also lots of presentations and great materials posted online on the conference website.

10 key ingredients of health data innovation

on Thu, 09/13/2012 - 21:57

As a reader of this blog, you have already seen various aspects of health data innovation. This post starts a series of more concise overviews of its 10 key ingredients. If you have feedback or ingredients to add, I'd be happy to discuss.

Why do we need health data innovation? Rising health care costs are reaching unsustainable levels while health improvements are stagnating. Health data innovation aims to improve health and reduce costs through creative, scientific and entrepreneurial use of health data. The open data movement provides a great blueprint: share data, market the hell out of them, and encourage entrepreneurs, developers, and other interested folks to create transparency, accountability, new products and services, economic activity, and jobs. This benefits the innovators, but also the field overall and the data sharers themselves; weather and GPS data are good examples of this.

In the case of health, sharing data becomes vital in the truest sense of the word: data can save lives by providing evidence for research and evidence-based medicine, health care, and public and global health. However, the fact that those data cover human subjects creates issues around privacy and consent that require a layered approach to data sharing. The privacy and rights of the subjects need to be balanced with broad data access for innovation. Facilitating access to health data is the responsibility of the data holder but requires the consent of the individual (patient or healthy individual). Patients can request a copy of their health data and share them. And other stakeholders can create the incentives and frameworks that encourage health data sharing and innovation. The graph on the right provides a semi-structured overview of related key players, types of data and trends.

There are 10 key activities to create and foster health data innovation:

Holders of health data, including providers, payers, producers and researchers, should do what they can to make data available and get them used.
  1. Provide individuals with access to their own data and ensure their authority over other uses of those data
  2. Maximize the quality of data, metadata, and documentation, and adhere to standards where possible
  3. Make fully de-identified data publicly available as open health data at the highest level of detail possible
  4. Use restricted access mechanisms for data where individuals can be identified
  5. Make it easy for data users to find and use relevant data
  6. Contribute to a health data ecosystem that encourages innovation

Patients and healthy individuals play an increasingly active role in health data innovation, leveraging technology to access their health records and collect data about themselves (quantified self).

  7. Get individuals to share their own health and quantified self data

Other stakeholders like governments, academic journals, regulatory authorities, and funders can leverage their influence over organizations that hold health data.

  8. Create incentives (financial, academic, and other) and requirements (regulatory or tied to funding or publication) for data holders to share data
  9. Create and enhance the regulatory framework to facilitate data sharing
  10. Create and foster innovation infrastructure by supporting entrepreneurship, technology, and education

Before I start going into details, let's pause. Do you agree? Are there ingredients / activities to add? Let me know in the comments or contact me.

Olympic Games coverage & what IOC and NBC can learn from the open data movement

on Sun, 08/12/2012 - 10:54

Today, the Games of the XXX Olympiad are coming to a close. Every four years in August, the Olympic Summer Games become part of people’s lives around the world. In 2008, 4.7 billion people or 2/3 of the world's population saw part of the Beijing Olympics, according to Nielsen. Naturally, conversations during those two weeks (and afterwards) keep coming back to the Games. And in an increasingly connected world, conversations about the Olympic Games are going global on Twitter, Facebook, blogs and other social media.

One would think that NBC, as the exclusive broadcast and online rights holder in the US, would help inform and facilitate that global conversation about the Games. But as a never-ending stream of rants (follow #NBCfail on Twitter) and ample coverage on blogs and news sites show, they are failing their audience. Top events including the opening and closing ceremonies are not broadcast live but delayed until prime time to maximize ad revenue. Online streams are only available to subscribers of specific cable packages, and they are low-res and choppy on top of that. Broadcast content posted online by fans of the Olympic Games gets removed quickly because of copyright infringement. And NBC prime time coverage fixates on American athletes, largely ignoring foreign athletes and sports where Americans are not likely to medal. I would love to see the best of Olympic sports without a national angle, but that’s nowhere to be found in the US.

The open data movement is currently gaining a lot of traction. Governments and organizations get several benefits and opportunities from opening up their data. Open data obviously increase transparency by providing interested parties a closer look. More accessibility of data will enable others to hold data publishers accountable, but also to provide feedback and input. “No matter who you are, most of the smartest people in the world don’t work for you” (Sun co-founder Bill Joy). Opening up data provides those smart people a chance to create products, services, analyses and insights from the data that the data publisher could never have dreamed of. And by enabling innovators like entrepreneurs, developers, journalists and others to develop innovative and (potentially) useful products and services, it can power whole new ecosystems like the ones around weather data and GPS data. Open data can help data publishers and at the same time contribute to the greater public good. 

How does this relate to the Olympic Games? According to the Olympic Charter, "The International Olympic Committee (IOC) takes all necessary steps in order to ensure the fullest coverage by the different media and the widest possible audience in the world for the Olympic Games." How would the IOC get the ‘fullest coverage’? By opening up data from the Olympic Games and ideally video, audio and imagery along with them and making them available so that innovators can create more products and services that audiences want.

Obviously, that won't happen because the licensing of broadcast rights provides a major contribution to the budget for the Olympics. For the US alone, NBC spent $2.3B for broadcast and online rights for the 2010 Vancouver and 2012 London Olympics, and $4.38B for rights to all Games until 2020. So the IOC won’t be able to simply share the complete feed from the Olympic Broadcasting Services for free. But there are a few things they can do:

  • Share comprehensive data from the Olympic Games, including information about the athletes, real-time feeds with results, and other data. Right now, the IOC is still not using its data treasure troves to create an open data Olympics; even so, some amazing visualizations have brought more insights about the London Olympics.
  • Work with broadcasters to remove those legacy national viewing restrictions online (e.g. you can’t watch BBC in the US). These national Olympic Games video monopolies stifle competition and lead to sub-par coverage (did I mention #NBCfail?). Allowing competition between broadcasters from different countries will force all of them to provide the coverage that their audiences want (until they do this, there is always TunnelBear).
  • Prevent broadcasters from creating walled gardens around the content and ensure live coverage of all events. Sports need to be watched live (especially since the ubiquitous social media are natural born spoilers), and Jeff Jarvis argues that this also makes economic sense.  
  • Ask broadcast partners to share their content as "open content" and allow audiences to reuse and redistribute broadcast content. Broadcasters will have exclusive rights for first broadcast, but innovators and the crowd can then repackage the content, show highlights, show coverage from different countries, and so much more. This will create buzz, stimulate conversation, and may just drive up broadcast viewership overall. For now, however, people who want to share stories will have to get creative, like the Wall Street Journal with its homemade highlights.

The Olympic Games are iconic. They show that sports, competition and team play are important around the world. They can inspire us to lead more healthy and active lives (a message that would be more consistent if we could get rid of those counterproductive McDonald’s and Coca-Cola commercials and endorsements during the Games). By taking a lesson from the Open Data movement, the IOC and NBC have a huge opportunity to expand coverage and audiences for the Games, and make watching, following and talking about them even more fun, in person and online. Fingers crossed for Sochi 2014.