
Why the Open Data movement is no joke

on Fri, 05/04/2012 - 06:40

On May Day, Tom Slee wrote a post labeled "Why the "Open Data Movement" is a Joke". While this is quite ridiculous (pun intended), the article prompted a number of very thoughtful responses:

 
Together, they provide a lot of information about Open Data (see graphic by Justin Grimes on the right) with good examples and additional resources, as well as a great list of arguments for Open Data:
  • Open Data creates transparency and accountability
  • Open Data leads to better informed and broader policy debates
  • Open Data enables innovators and developers to create new products and services
  • Open Data creates economic opportunities and new data ecosystems
  • Open Data impacts people's lives through better services, apps and more
  • Open Data doesn't equal Open Government, but is an important part of it
  • Open Data helps government become more efficient
  • Open Data benefits local government
 
So if you haven't followed the open data movement, here is a chance to get a concise summary. And then engage. Which may just be the opposite of what Tom Slee wanted to achieve ...

Key question for quantified self (QS): how do we set incentives to get everyone to capture relevant QS data?

on Mon, 04/02/2012 - 13:00

Today, Kent Bottles wrote a very information-rich article about quantified self (QS) on the Health Care Blog. In "Will the Quantified Self Movement Take Off in Health Care?", Bottles provides a whirlwind tour of QS with lots of examples and links. Two key pieces are worth pointing out:

He quotes a New York Times article on "The Data-Driven Life", which mentions the four key enablers of QS. Good summary:
  1. Small electronic sensors (capture data)
  2. Mobile computing devices (e.g. cell phones; store and compute data)
  3. Social media (help share and engage with others)
  4. The cloud (stores everything)

While it is now possible to capture and use data on almost every aspect of your life (and there are lots of examples of people who do), why would you? Bottles mentions Jay Parkinson, who argues that "Health and Social Media don't mix". While Parkinson doesn't mention QS, he provides a useful differentiation of patients into three main groups with very different information needs:

  1. Young active people who don't want to think about health issues
  2. People newly diagnosed with a chronic illness (I would include in this group patients who have an acute disease that needs treatment and management for a limited period of time)
  3. Chronically ill patients that have to think about their disease every day

Parkinson makes the case that none of these groups has a continuous need to engage about health in social media (and only group 2 will likely do so temporarily), and Bottles concludes that "The potential to improve the life of patients with chronic diseases is clearly apparent; whether most people will use the increasingly sophisticated tools being developed is open to debate."

While that's true, I would add that we (as in public, patients and providers) should focus on how to collect relevant QS data for everyone. People with different health statuses (healthy, acutely ill, chronically ill) will have different propensities to 'quantify themselves', just as people are engaged in their health or disease management to different extents. It is obvious that QS information can help manage health and treat diseases. In an ideal world, health care providers and patients would collaborate to collect all relevant data, both clinical and QS, to make better decisions about a patient's health and disease management. And the data would be stored and made accessible when needed with the individual's permission, including for patient care, public health, or other research. The technology is all there. Now we need the right incentives to do it.

Fabulous new health data for Mozambique: INCAM study uses verbal autopsy to provide information on causes of death

on Wed, 03/21/2012 - 17:05

A new report, "Mortality in Mozambique: Results from a 2007-2008 Post-Census Mortality Survey", provides valuable data on mortality and causes of death in Mozambique. The underlying survey, known in Portuguese as Inquérito Sobre Causas de Mortalidade (INCAM), was conducted by the National Institute of Statistics in collaboration with the Ministry of Health in Mozambique. The study followed up deaths reported in the 2007 census with a country-wide verbal autopsy survey. It is representative on the national and provincial level and includes information on area of residence, age group, sex and other characteristics, such as the use of health services prior to death.

Verbal autopsy is an innovative method to determine the cause of death of a deceased person where the cause had not previously been established (e.g. by a physician or coroner in a death certificate). In those cases, a trained interviewer administers a standardized questionnaire to someone familiar with the deceased person, covering his or her symptoms, known diagnoses, demographic characteristics etc. Often, the responses are then analyzed by two physicians to determine the cause of death; a third physician is consulted if the two disagree. Lately, innovative algorithms have been developed that use machine learning or other methods to determine the cause of death. These machine learning tools now even outperform physicians (disclosure: the cited paper was published by my employer, IHME).
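The physician review step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in INCAM: the function name is made up, and the tie-breaking rule (the third physician's diagnosis counts only if it matches one of the first two) is my assumption, since the text only says that a third physician is consulted.

```python
from typing import Optional


def adjudicate_cause(first: str, second: str, third: Optional[str] = None) -> Optional[str]:
    """Return the agreed cause of death, or None if no agreement is reached.

    Hypothetical helper: the majority rule for the third reviewer is an
    assumption made for this sketch, not a documented protocol.
    """
    # Compare diagnoses case-insensitively and ignore stray whitespace.
    a = first.strip().lower()
    b = second.strip().lower()

    # Two physicians agree: the cause is settled.
    if a == b:
        return a

    # Disagreement and no third opinion yet: the case stays open.
    if third is None:
        return None

    # Third physician breaks the tie only by siding with one of the first two.
    c = third.strip().lower()
    if c in (a, b):
        return c
    return None
```

A case where all three physicians name different causes stays undetermined in this sketch; a real study would need its own rule for that situation.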

The INCAM survey is particularly valuable because Mozambique does not have a complete civil registration system, and cause of death information had not been available on a nationally representative level.

Useful tools to review, refine, clean, analyze, visualize and publish data

on Fri, 03/02/2012 - 16:47
Over the last few days, O'Reilly's Alex Howard (aka @digiphile) has published a series of very informative interviews with data journalists. As journalists get more and more sophisticated in collecting, collating, analyzing and visualizing data, their learnings are really useful for anyone working with data. The interviews contain lots of great insight, very useful information, and interesting links to more resources and examples, and I encourage you to read them in their entirety (see links below).

However, most interesting to me are the tools that the interviewees mention and which Alex calls the "Newsroom Stack". Any number of those tools may be used in sequence to get from your set of data to useful insights. I used the additional comments from the journalists to add to my own list of useful data tools; some key ones below, the rest on the Health Data Innovation Tools page. Let me know what other tools you think I should add.

Data tools: conversion, exploration, analysis

  • Microsoft Excel - still the standard for many as the easy first stop to review data
  • Data Science Toolkit - collection of useful tools to extract and convert text, GIS and other data (my overview here)
  • ScraperWiki - provides software and instructions to extract data and information from web sites
  • Google Refine - clean, organize, refine (duh!) and explore new datasets
  • Overview - clean, visualize and interactively explore large document and data sets (started by AP)
  • The PANDA Project - the new newsroom data appliance
  • Stat/Transfer - converts data between formats of statistical analysis packages
  • Ruby on Rails - powerful open source web framework for budding programmers, written in Ruby, with helpful add-ons like Remote Table (mapping); Django is the comparable framework for Python
  • Python - programming language, very useful for data analysis and visualization
  • JavaScript - prototype based scripting language
  • R - open source software environment for statistical computing and graphics
  • Git - to track versions of code and share with others

Data visualization and GIS packages

  • Protovis/D3 - JavaScript-based library of very slick visualizations
  • MetaLayer - discover and share insights from data via infographics
  • WEAVE - Web-based Analysis and Visualization Environment
  • PostGIS - spatially enabled PostgreSQL server
  • Tilemill - design studio to create maps, powered by MapBox
  • Leaflet - JavaScript library to create interactive maps

Databases

  • MySQL
  • PostgreSQL - open source object-relational database system
  • SQLite - lightweight embedded database engine; with the SQLite Manager Firefox extension it allows SQL queries without setting up a full database
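To illustrate what "SQL queries without setting up a full database" means in practice, here is a minimal sketch using the sqlite3 module from Python's standard library. The table name, column names and all values are invented placeholders for this example, not real data.

```python
import sqlite3


def totals_by_cause(rows):
    """Load (province, cause, deaths) rows into SQLite and total them by cause."""
    # ":memory:" creates a throwaway database in RAM; no server or file needed.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE deaths (province TEXT, cause TEXT, n INTEGER)")
        conn.executemany("INSERT INTO deaths VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT cause, SUM(n) AS total FROM deaths "
            "GROUP BY cause ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Placeholder rows, made up purely to show the mechanics.
sample_rows = [
    ("province_a", "cause_x", 10),
    ("province_b", "cause_x", 5),
    ("province_a", "cause_y", 7),
]

print(totals_by_cause(sample_rows))
```

Replacing ":memory:" with a file name would keep the data on disk between runs, which is usually what you want once the dataset outgrows a spreadsheet.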

Here are the articles; check back on the O'Reilly Radar data page for more:

Interview 1: Liliana Bounegru (@bb_liliana), project coordinator of SYNC3
Interview 2: Dan Nguyen (@dancow), news app developer at ProPublica
Interview 3: Derek Willis (@derekwillis), news developer at New York Times
Interview 4: Ben Welsh (@palewire), Web developer at Los Angeles Times
Interview 5: Michelle Minkoff (@MichelleMinkoff), investigative developer/journalist at AP

This should put you in the right mood to have a look at the "Effective Data Visualization" presentation by Hjalmar Gislason (aka @datamarket) at Strata this week. It's a great account of the considerations necessary for anyone that wants to create visualizations. Very useful: if you download the PDF from Slideshare, the slides contain links to more information online.

 

How Target figured out a teenage girl was pregnant before her father ... and why their analytic energy would be better spent on health

on Fri, 02/17/2012 - 18:23
In a fascinating article, "How Companies Learn Your Secrets", New York Times reporter Charles Duhigg provides a great case study on how Target uses sophisticated analysis to identify pregnant women and target them (pun intended) with well-timed coupons and ads. He goes on to describe the Science of Habit Formation and the interaction of cue, routine and reward (also the subject of Duhigg's upcoming book, The Power of Habit).
 
I was particularly intrigued by Duhigg's description of the analytics activities in Target's marketing effort. Their analysis is based on very rich and deep data on their customers. Target marketers know that the best opportunity to get consumers to change purchasing habits is at life-changing events like weddings, moves, divorces, and particularly with the arrival of a baby. So they focused their activities on identifying pregnant women. Quoting from the article:

For decades, Target has collected vast amounts of data on every person who regularly walks into one of its stores. Whenever possible, Target assigns each shopper a unique code — known internally as the Guest ID number — that keeps tabs on everything they buy. “If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we’ve sent you or visit our Web site, we’ll record it and link it to your Guest ID,” Pole said. “We want to know everything we can.”

Also linked to your Guest ID is demographic information like your age, whether you are married and have kids, which part of town you live in, how long it takes you to drive to the store, your estimated salary, whether you’ve moved recently, what credit cards you carry in your wallet and what Web sites you visit. Target can buy data about your ethnicity, job history, the magazines you read, if you’ve ever declared bankruptcy or got divorced, the year you bought (or lost) your house, where you went to college, what kinds of topics you talk about online, whether you prefer certain brands of coffee, paper towels, cereal or applesauce, your political leanings, reading habits, charitable giving and the number of cars you own.


Quite the arsenal of data. Target’s Guest Marketing Analytics department then crunches the data to develop insights that help target marketing and advertising campaigns. Target analysts developed a pregnancy prediction score based on a customer's purchasing history of 25 products; they can also estimate a woman's due date "within a small window" and send women advertising that aligns closely with the stage of their pregnancy.
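The article does not describe how Target's model actually works, so the following is only a sketch of how a purchase-based prediction score could be built in principle. It assumes a simple logistic score over product weights; the product names, weights and bias are all hypothetical placeholders, and only three products are shown instead of 25.

```python
import math
from typing import Iterable

# Hypothetical weights: how strongly each product purchase shifts the score.
# A real model would estimate these from historical data.
HYPOTHETICAL_WEIGHTS = {
    "product_a": 1.2,
    "product_b": 0.8,
    "product_c": 0.5,
}
HYPOTHETICAL_BIAS = -2.0  # baseline for a shopper who bought none of the products


def prediction_score(purchases: Iterable[str]) -> float:
    """Return a score between 0 and 1 from a shopper's purchased products."""
    bought = set(purchases)
    # Add up the weights of the tracked products this shopper has bought.
    z = HYPOTHETICAL_BIAS + sum(
        weight for product, weight in HYPOTHETICAL_WEIGHTS.items() if product in bought
    )
    # The logistic function squeezes the sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
```

The more of the tracked products a shopper buys, the higher the score. The same structure would work with health-related signals as inputs, which is the point of this post.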

The targeting can be quite successful, as an example in the article illustrates. The example fits so well it almost seems to be made up:

About a year after [Target] created [their] pregnancy-prediction model, a man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation.

“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

What does that have to do with Health Data Innovation? "The best minds of my generation are thinking about how to make people click ads," says Jeff Hammerbacher, co-founder and Chief Scientist at Cloudera. "That sucks." This quote also applies here. There is a lot of money to be made in understanding consumer behavior. But there is a lot of health gain to be made by putting the same energy into improving people's health by enhancing prevention, identifying early warning signs for diseases, better understanding the effectiveness and side-effects of drugs and devices, and treating specific combinations of diseases. Instead of analyzing people's purchasing records, browsing history and reaction to different types of discounts, the smartest minds of our generation should analyze health records, personal history and reaction to different types of treatment.

A lot of health data are being collected on each individual. There are health records at general practitioners, outpatient facilities, hospitals, and emergency services; pharmacy records; vital registration records; health insurance data; responses to surveys, census, and other data collection efforts; participation in clinical trials; and many others. In addition, data used for marketing analytics can also be used for identifying health issues, including purchasing and browsing history, social media data, and many of the other data points listed in Target's data arsenal above. There are many triggers that could suggest that an individual should consult a physician, e.g. if people start buying unusual quantities of over-the-counter drugs, research specific symptoms online, tweet about sudden weight loss (or gain), or change their reading habits. Putting all these data sources together could provide a much deeper picture of someone's health than any single provider's record.

Most consumers are appalled when they learn how much data retailers and other organizations amass on them. Compiling health data should happen with the permission of the patient. Some health care providers have started to collect broader data about their patients and use those data for prevention and more effective treatment. Kaiser Permanente has been a front runner in the quest to better use data for the benefit of the patient; they call it "collecting information for personalized high quality care". They just launched an Android App to make it easier for patients to access and submit information about their health.

There are many ways to improve how we currently deliver health care. Having more of the best minds of our generation working on data to improve health instead of improving ad responses and click rates would certainly help. Instead of a retailer knowing that a girl is pregnant (before her dad does), a physician should know that a patient is increasingly likely to have a heart attack (before the patient has one).
