Improving data discoverability - an important step towards open health data?

on Fri, 03/25/2011 - 14:57

In global public health, open data is still in its infancy. Finding health-related data - and even information about those data - is a continuous challenge. The presentations and discussions at the recent Global Health Metrics and Evaluations (GHME) conference showed again that essential data are often not available; other participants, such as Karen Grepin and Amanda Makulec, noted the same. Of course, relevant data are often simply not collected. In many cases, however, data that do exist are not being made available (see some thoughts on that in my post on the Global Health Data Charter). Knowing more thoroughly what data exist, where they live, and what exactly they contain can help increase the availability of data for health analysis. (Read on below the presentation.)

These are the slides from my launch presentation for the Global Health Data Exchange (GHDx) at GHME. The current primary objective for our new data catalog is to improve data discoverability. Finding existing data is currently a very labor-intensive process. Some health indicators are available from sites like the WHO Global Health Observatory or World Bank Open Data. Data repositories like IPUMS, IHSN's data catalog, SodaPOP, or Dataverse provide good starting points for certain types of underlying data. Open government initiatives like HealthData.gov, data.gov.uk, or opengov.se are starting to be good sources of data for selected countries. However, most data are only mentioned or made available on a wide variety of websites, such as those of ministries of health, statistics bureaus, and other organizations (I wrote about health data sources recently). Discovering data from those different sources means running web searches, browsing websites, searching library catalogs, and conducting literature reviews.

Cataloging datasets is currently the most direct path to discoverability, and the GHDx is aimed at providing this path. But cataloging is itself very labor-intensive. Once a dataset is discovered, its title, covered geographies and dates, and other metadata need to be further researched and validated manually. Only certain tasks can be automated (like assigning keywords) or at least supported by software (like automated searches and web scraping). We will also explore how to collect and update some of the information via crowdsourcing.
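
To make the keyword-assignment step a bit more concrete, here is a minimal sketch in Python. It is an illustration only and not a description of how the GHDx works: the record fields, the small vocabulary, the example dataset, and the function name `assign_keywords` are all hypothetical. It assumes each catalog record has a title and a free-text description, and that keywords come from a controlled vocabulary mapping each keyword to a few trigger terms.

```python
import re
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary: keyword -> terms that trigger it.
VOCABULARY = {
    "mortality": ["mortality", "death", "deaths"],
    "immunization": ["immunization", "vaccination", "vaccine"],
    "household survey": ["household survey"],
}


@dataclass
class DatasetRecord:
    """Hypothetical catalog record; the fields mirror the metadata named above."""
    title: str
    geography: str
    years: tuple
    description: str = ""
    keywords: list = field(default_factory=list)


def assign_keywords(record, vocabulary=VOCABULARY):
    """Attach every vocabulary keyword whose trigger terms appear as whole
    words or phrases in the record's title or description."""
    text = f"{record.title} {record.description}".lower()
    found = []
    for keyword, terms in vocabulary.items():
        if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms):
            found.append(keyword)
    record.keywords = sorted(found)
    return record


# Made-up example record, not a real dataset.
record = DatasetRecord(
    title="Example Household Survey 2008",
    geography="Exampleland",
    years=(2008, 2008),
    description="Covers child vaccination coverage and deaths under five.",
)
print(assign_keywords(record).keywords)
```

Simple term matching like this only suggests keywords. Titles, geographies, and dates would still need the manual research and validation described above.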

With a catalog like the GHDx in place to provide accurate and reliable information about data, it becomes straightforward to find data, download them directly or contact data owners, and use them for analysis. The catalog will also show gaps in data collection and illuminate where data are systematically not being shared. In addition, it provides insight into what data are most needed and used, enables proper credit to data collectors and owners, allows data owners to find new audiences for their data, and shows the usefulness and impact of additional data availability. This in turn will hopefully motivate data owners to embark on the path towards open data (ideally completely open '5-star' linked data) for global health and healthcare.