CareSet Labs Releases Root NPI Graph – a Greatly Improved Medicare Doctor Referral Dataset

CareSet Founder and CTO Fred Trotter

CareSet Labs will be releasing Root NPI Graph, a new version of the “shared Medicare patients in time” provider teaming dataset on June 25 at the 2017 Academy Health ARM Meeting. Conference attendees will be able to get free copies of the academic licensed dataset.  This is the next version of the “DocGraph” dataset. This dataset is usually called the Doctor Referral dataset, and it includes both implicit and explicit referral relationships between doctors, hospitals and other types of entities. Medicare is the largest single payer in the United States, and the implicit map of the healthcare system that can be generated from its claims database is one of the most comprehensive pictures of the healthcare system in the United States.

The initial release of this dataset will be simplified from the currently available FOIA release. The reasons for this are complicated, and you really have to care deeply about the details for a full explanation to matter. If you are a general reader, this is where you should stop reading. You know the big news: a reimagined classic dataset is going to be released soon. What follows is a geeky breakdown of what the new dataset is, how it differs from the old referral dataset, and why. You might need to really care about this to continue reading. You have been warned.

The geeky details

Technically, “referral dataset” is a misnomer, because any frequent collaboration between healthcare providers (e.g. outpatient surgery and outpatient anesthesiology) that is detectable in the Medicare claims database will appear in the dataset. But calling the dataset a “referral database” is really the simplest way to talk about it in shorthand, so if the reader will permit us that simplification, we will refer to the dataset that way from now on. That is also the source of the name “Root” in “Root NPI Graph”: it stands for “Referrals Or Other Things.”

Normally, we do not pre-announce our dataset releases, preferring to have big “conference announcements.” But hundreds of organizations rely on this dataset, in some cases as the backbone of their analytics processes. This data will be changing in both structure and meaning, and we want to provide a heads-up about these changes and start explaining why they are happening.

First, a little context. Over the last year, we’ve been working with CMS to release improved and verified versions of the original DocGraph dataset that address some of the core weaknesses of the dataset. The original DocGraph dataset, which we released in 2012, was based on a FOIA request we made to CMS almost 7 years ago. We have recently withdrawn that FOIA request.

There are several reasons for withdrawing the FOIA request. The first, and most important, is that CMS now offers several mechanisms that allow private organizations to gain substantial access to claims data. CareSet uses VRDC innovator access to reach Medicare data, and there is also the Qualified Entity Program; both allow private entities to commercialize deidentified and aggregated Medicare data under specific parameters. Using our VRDC access, we are now in a position to create our own “patient-sharing dataset”. So the first reason we are working with CMS to sunset the current Medicare referral dataset is that we are able to generate RootGraph, which is a compelling replacement.

The next reason is that the current dataset structure for the FOIA data release has too many bugs for us to continue investing in maintaining the dataset through the CMS FOIA office. These problems are not the fault of the CMS FOIA office, which has generously spent time working with us to maintain this data set for several years. Instead, the most substantial of these issues relate to the “origin story” of the dataset.

When we originally requested the dataset in 2011, there was a federal court order that prevented CMS from releasing physician payment information under the FOIA process. In the same year, The Wall Street Journal initiated court proceedings to get that injunction overturned. In 2013, they won in court, overturning the injunction. This is quite an accomplishment and a victory for government transparency. It is something that the Wall Street Journal and their parent company, Dow Jones, deserve ongoing credit for. This was the end result of a battle that Consumers Checkbook and others had been fighting for years. After this court victory, CMS announced a change in policy that would ultimately enable them to release the physician utilization dataset, which explicitly detailed physician payments. (Of note, the public comments on the policy are truly fascinating!) Finally in 2014, CMS released the physician-level data which detailed financial payments to Medicare physicians.

What does all of this have to do with the referral data?

In 2011, when I first started designing the FOIA request that would inspire the DocGraph referral dataset, my main concern was developing a “shared patient in time” algorithm that could not be used to reverse engineer physician salaries. In short, the design of the original DocGraph referral dataset was not “the best graph dataset we could design”; it was “the best graph dataset we could design that was not against current FOIA policy.”

The CMS FOIA office has generously worked with us over the years to improve the dataset. There have been at least three major iterations and improvements in the design of the FOIA DocGraph teaming dataset. But the original design constraints of the data specification we provided in the FOIA requests remain problematic, most especially, the window problem. There are other issues with the dataset and some of them are more important than the “window problem.” But based on our current beta-testing we think the planned changes with the sliding window algorithm might be the most controversial and wanted to open the discussion to the wider community.

The Window Problem

The original DocGraph FOIA referral dataset has become somewhat famous in the data science community. As far as we know, it remains the largest publicly available graph data set that uses real names. The Twitter, Facebook, Gmail and LinkedIn graphs are all much larger, of course, but they are not available in their totality to researchers. The dataset has also been used extensively by healthcare researchers. But there is no research that we appreciate more than the work done by Martin Zand’s team at the University of Rochester Medical Center. They specifically considered the impact that the data mining algorithm had on the resulting structure of the networks in the data. Reading this research helped us start to carefully consider what the issues were with a sliding window approach in building the healthcare graph.

In discussing the issues with the algorithm, let’s first examine the algorithm itself. Essentially, all shared patient graphs are built by folding a bipartite graph of physicians and patients into a unipartite graph of just providers. In a picture:


So Dr. Smith and Dr. Jones share Fred Trotter as a patient. We remove the patients and directly connect the two providers with the number of shared patients as the weight of the edge of the new graph. (BTW, we try to stay loyal to the formal terminology from Graph Theory. Unless it gets annoying, then we don’t.)
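To make the folding step concrete, here is a minimal Python sketch. It is only an illustration, not the code used to build any released dataset; the function name `fold_bipartite` and the toy patients and providers other than the ones named above are made up for the example.

```python
from collections import Counter
from itertools import combinations


def fold_bipartite(patient_to_providers):
    """Fold a patient -> providers mapping into weighted provider-provider edges."""
    edge_weights = Counter()
    for providers in patient_to_providers.values():
        # Every unordered pair of distinct providers seen by one patient gains +1.
        for a, b in combinations(sorted(set(providers)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights


# Hypothetical toy data: each patient maps to the providers they saw.
visits = {
    "Fred Trotter": {"Dr. Smith", "Dr. Jones"},
    "Patient 2": {"Dr. Smith", "Dr. Jones", "Dr. Lee"},
    "Patient 3": {"Dr. Lee"},
}
shared = fold_bipartite(visits)
print(shared[("Dr. Jones", "Dr. Smith")])  # number of patients the two share
```

The patients disappear from the result; only provider pairs and their shared-patient weights remain.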

Should we make a graph of every patient a provider has ever shared? Well, we could, but that would be very noisy. Patients move, patients go on vacations, etc. If we included a connection for every patient that two providers had EVER shared, there would be a huge number of connections that would not be meaningful in understanding how typical healthcare collaborations work. Here, CMS privacy policy comes to the rescue. We cannot reveal facts about patient cohorts of fewer than 11 patients. We have found that a 1-year data period, along with the 11-patient threshold, can reveal a pretty decent map of the recent patient flows in the healthcare system. Data sets generated with these kinds of constraints are typically around 50 million rows. So limiting to 1 year’s worth of data with an 11-patient threshold gives a comprehensive map of Medicare providers.
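The threshold itself is simple to express. The sketch below assumes the edge weights for one year have already been counted; the names `MIN_COHORT_SIZE` and `apply_threshold` and the sample counts are hypothetical.

```python
MIN_COHORT_SIZE = 11  # cohorts smaller than this cannot be reported


def apply_threshold(edge_weights, minimum=MIN_COHORT_SIZE):
    """Keep only the provider pairs that share at least `minimum` patients."""
    return {pair: n for pair, n in edge_weights.items() if n >= minimum}


# Hypothetical one-year shared-patient counts.
one_year_counts = {("A", "B"): 10, ("A", "C"): 11, ("B", "C"): 250}
publishable = apply_threshold(one_year_counts)
print(publishable)
```

The pair with 10 shared patients is dropped entirely, which is also what removes most of the incidental "patient on vacation" connections.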

But we also want to have some measure of the degree to which two providers are collaborating. We certainly concede that sharing 100 patients between two providers in one year could take very different forms and intensity. Two providers who live in a very small town might share many patients incidentally, without any efforts to coordinate care. Alternatively, two providers could be working on the same patients at the same time (like a surgeon does with an anesthesiologist, for instance).

One way to correct this is to limit the counting using a fixed-length sliding window, which is how the original DocGraph data is designed. Ideally a sliding window works like this, across a timeline:


In this example we have two providers, A and B, and one patient X. This diagram shows how patient X sees the two providers in January and February. First, the patient visits provider A on Jan 10th. Each time a visit occurs, the sliding window’s clock starts (30 days in this example), and if a different provider sees the same patient within 30 days, then this event is added to the directed graph. This diagram has this happening in both directions, since one event occurred where A was seen first and B second, and a second event occurred where B was seen first and A second. We use a slightly modified version of a standard sequence notation to illustrate what this picture says:


Which is intended to read “patient X saw provider A and then 15 days later saw provider B and then 18 days later saw provider A again”.  Let’s suppose on the second visit, the patient saw both provider A and B, then we would write


So now that we have a useful “claim sequence shorthand” we can discuss patterns that are problematic with the sliding window.
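Before turning to those patterns, here is a rough Python sketch of the sliding-window pairing described above, run on the patient X example (A on Jan 10th, B 15 days later, A again 18 days after that). It is an illustration under stated assumptions rather than the algorithm as actually implemented: the function name `windowed_pairs`, the year, and the choice to skip same-day visits when assigning direction are all assumptions made for the example.

```python
from collections import Counter
from datetime import date, timedelta


def windowed_pairs(visits, window_days=30):
    """Count directed (first, second) provider pairs for one patient.

    visits is a list of (visit_date, provider) tuples. A pair is counted
    when a different provider is seen within window_days of an earlier visit.
    """
    window = timedelta(days=window_days)
    ordered = sorted(visits)
    pairs = Counter()
    for i, (first_date, first) in enumerate(ordered):
        for second_date, second in ordered[i + 1:]:
            if second_date - first_date > window:
                break  # visits are sorted, so later ones are outside the window too
            # Assumption: same-day visits get no direction and are skipped here.
            if second != first and second_date > first_date:
                pairs[(first, second)] += 1
    return pairs


patient_x = [
    (date(2016, 1, 10), "A"),
    (date(2016, 1, 25), "B"),  # 15 days after the first visit
    (date(2016, 2, 12), "A"),  # 18 days after the visit to B
]
pairs = windowed_pairs(patient_x)
print(pairs)
```

The two visits to A are 33 days apart, so the only pairings are A followed by B and B followed by A, one in each direction.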

Overcounting transactions

The current FOIA dataset counts transactions as the number of total visits (as opposed to patients) paired between two providers. Given that a treatment pattern of


(which, if you do not like our notation, just means that every other day the patient saw a different provider) would result in a single +1 for the patient count for both A->B and B->A, but a +21 for the transaction count (assuming I am calculating the combinatorics correctly).

Basically there is a flowering of transaction pairings that occur because every instance of treatment from B is paired with every previous instance of treatment of A, because they are all in the same window.

Because this is a way for a specific treatment pattern of a single patient to count tremendously more than the transactions that come from typical treatment patterns, it can really skew the numbers. 5 patients treated in this manner between two physicians could easily account for more “transactions” than 50 or even 100 treated in typical sequences.
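A small Python sketch shows how quickly the pairings multiply. The visit pattern below is a hypothetical one chosen for illustration (12 visits alternating between A and B every other day), and `count_pairings` is an assumed stand-in for the pairing rule, not the code behind the FOIA data.

```python
from datetime import date, timedelta


def count_pairings(visits, window_days=30):
    """Return {(first, second): transaction_count} for one patient's visits."""
    ordered = sorted(visits)
    counts = {}
    for i, (d1, p1) in enumerate(ordered):
        for d2, p2 in ordered[i + 1:]:
            if (d2 - d1).days > window_days:
                break
            if p1 != p2:
                # Every later visit to the other provider pairs with this visit.
                counts[(p1, p2)] = counts.get((p1, p2), 0) + 1
    return counts


# Hypothetical pattern: A, B, A, B, ... every other day, 12 visits in total.
start = date(2016, 3, 1)
alternating = [(start + timedelta(days=2 * i), "AB"[i % 2]) for i in range(12)]

transactions = count_pairings(alternating)
patients = {pair: 1 for pair in transactions}  # one patient, so +1 per direction
print(transactions, patients)
```

With this pattern a single patient contributes 21 transactions in the A->B direction and 15 in the B->A direction, while the patient count for each direction is still 1.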

Currently, our policy on interpreting transaction counts is “it’s a reliable measure when it is small, but when it is big, it can no longer be interpreted simply and should not be considered reliable.” This is not a great component of this model.

Plus one day on the window size

First, there is a substantial problem where the typical time between a patient visiting provider A and provider B is just the window size plus one day. Let’s suppose a managed care organization (called “Maginary Care”) has a policy that a cardiologist should be seen exactly 35 days after the primary care doctor was seen. For non-urgent care, assuming careful enforcement, that wait time would be an improvement on some industry wait times, but an analysis using a 30-day window would miss it entirely.

One way to handle the basic limitations of a window size is to perform the calculation for more than one window size. So if we run an identical sliding window algorithm for 60 days, Maginary Care providers will start showing up in the graph. When we add a third or fourth windowed data set, Maginary will not add additional edges or additional weights to existing edges. They are entirely hidden in the 30-day windowed data, and they are completely present in anything larger than a 35-day window.
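The Maginary Care case can be checked in a few lines of Python. This is a sketch only: the helper `seen_within`, the visit date, and the particular set of window sizes are assumptions for illustration.

```python
from datetime import date, timedelta


def seen_within(first_visit, second_visit, window_days):
    """True when the second visit falls inside the window opened by the first."""
    gap = second_visit - first_visit
    return timedelta(0) < gap <= timedelta(days=window_days)


primary_care_visit = date(2016, 4, 1)  # hypothetical date
cardiology_visit = primary_care_visit + timedelta(days=35)

detected = {
    window: seen_within(primary_care_visit, cardiology_visit, window)
    for window in (30, 60, 180)
}
print(detected)
```

The pairing is invisible at 30 days and fully visible at 60 and 180 days, so the larger windows tell us nothing more about it than the 60-day window already did.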

The problem with the edges and false directionality

When you run a sliding window algorithm, you are artificially constraining things by adding an “edge” on either side of the time period.

When you start the algorithm, usually on Jan 1 of a given year, you are already breaking pairs that would have occurred between events in December, and events in January.

The algorithm does not understand “John Doe went to see his primary care doctor on December 20th who referred him to his cardiologist on Jan 10th.” The algorithm assumes that the first event on Jan 10th was ‘Genesis’ for John Doe. When John Doe returns to his primary care doctor in Feb, the algorithm models that event as a Cardiologist -> Primary Care “referral.”
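The John Doe example can be reproduced with a short Python sketch. It ignores the window length and looks only at the effect of cutting the data at Jan 1; the function name, the years, and the exact February date are assumptions made for the illustration.

```python
from datetime import date


def first_directed_pair(visits, period_start):
    """Return the first (earlier, later) provider pair on or after period_start."""
    in_period = sorted(v for v in visits if v[0] >= period_start)
    for (_, p1), (_, p2) in zip(in_period, in_period[1:]):
        if p1 != p2:
            return (p1, p2)
    return None


john_doe = [
    (date(2016, 12, 20), "primary care"),
    (date(2017, 1, 10), "cardiologist"),
    (date(2017, 2, 3), "primary care"),  # hypothetical February date
]

full_history = first_directed_pair(john_doe, date(2016, 1, 1))
truncated = first_directed_pair(john_doe, date(2017, 1, 1))
print(full_history, truncated)
```

With the December visit included, the first pairing runs from primary care to the cardiologist. Starting the data on Jan 1 flips it to cardiologist first, which is the false directionality described above.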

Many people have been interpreting, incorrectly, the “directedness” in the original DocGraph FOIA data as an indication of “seen first” but that is not actually what it means. It is an indication of sequence, but not one that is intuitively simple to conceptualize.

The second problem with the edges is that the algorithm currently requires a claims runout. A 30-day window requires 13 months’ worth of data to calculate, so that all events that are 30 days “after” an event on Dec 31st can be properly considered for the second half of a pairing. A 180-day window requires 18 months of data for the same reason.

This means there is an implicit delay in releasing any of these teaming datasets, assuming a windowed algorithm is used.
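The size of that delay follows from simple date arithmetic. The sketch below assumes a calendar-year dataset and uses a hypothetical helper, `last_claim_date_needed`, to find the last day of claims a given window has to see.

```python
from datetime import date, timedelta


def last_claim_date_needed(year, window_days):
    """Last date whose claims can still pair with a visit on Dec 31 of `year`."""
    return date(year, 12, 31) + timedelta(days=window_days)


needed_30 = last_claim_date_needed(2016, 30)
needed_180 = last_claim_date_needed(2016, 180)
print(needed_30, needed_180)
```

For a 2016 dataset, a 30-day window reaches to the end of January 2017 (a 13th month of data) and a 180-day window reaches to the end of June 2017 (an 18th month), before any time spent waiting for those claims to be filed and processed.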

Evaluating Basketball teams

Suppose you believed that the height of NCAA or NBA basketball teams was an important factor in their success. Of course, you would start with saying “what is the average height of each basketball team?”

Taking an average like this is one of the best ways to approximate data. There are all kinds of mathematical reasons for this, but you might say, “well, I also want to know the mode or median” (which can be better measures in some specific circumstances).

What you would probably not want to do is say: Give me the count of players who are over 7ft, the count of players over 6ft6, the number over 6ft, the number over 5ft6 and the number over 5ft.

In some ways, that is a good representation of how having multiple window sizes in the data sets work. Of course, we would love to provide an “average amount of time the patient waited” but such an algorithm is problematic too.

Suppose you were to take the average time period between visits, and you had the following data:


Or, in English, provider a was seen, and then 10 days later provider b was seen and then 50 days later provider b was seen again.

In this case, do you average 10 days and 50 days together? Do you just take the first pair? Finding an algorithm that is roughly equivalent to the “average height of basketball players” is problematic. But this concept now influences our current logic.
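To see how much the answer depends on the rule chosen, here is a small Python sketch of three candidate "averages" for that sequence (a, then b 10 days later, then b again 50 days after that). The three rules and their variable names are assumptions for illustration; none of them is a proposed definition.

```python
from statistics import mean

# (day offset, provider) for the sequence described above.
visits = [(0, "a"), (10, "b"), (60, "b")]

a_days = [day for day, provider in visits if provider == "a"]
b_days = [day for day, provider in visits if provider == "b"]

# Rule 1: only the first time b follows a.
first_pair_only = b_days[0] - a_days[0]

# Rule 2: average the gaps between consecutive visits, whoever the provider is.
consecutive_gaps = mean(
    [later - earlier for (earlier, _), (later, _) in zip(visits, visits[1:])]
)

# Rule 3: average the gap from every a visit to every later b visit.
every_a_to_b = mean([b - a for a in a_days for b in b_days if b > a])

print(first_pair_only, consecutive_gaps, every_a_to_b)
```

The same three visits give 10, 30 or 35 days depending on the rule, and each rule is defensible.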

The Root NPI Graph concept.

So how do we address these concerns? To start, we realize there are many other things people really want to understand about how patients flow through the Medicare ecosystem. We appreciate these ‘wants’, but believe that a simpler base graph dataset, with additional layers added on top, is a better approach.

With that in mind, the Root NPI Graph is very simply the count of all Medicare beneficiaries shared between two NPIs over the course of a given year. No sliding window, no claims run out, no same-day count and no transaction counts.

Our plan is to provide additional decorations to this data set as time moves forward. RootGraph may seem like a step backwards from the original FOIA version of the DocGraph, but more importantly, it is a step forward in expediency and accuracy. In the future, the dataset will be released in a more timely manner, since it does not need to wait for claims runout. We will find more accurate ways to categorize the time between paired patient visits and, hopefully, other clinically meaningful descriptions to attach as additional data points on the edges.

More differences between the FOIA data and Root NPI Graph include the “directedness”. The Root NPI Graph dataset is undirected, so whenever you see a score for A,B, the same score applies to B,A. That means the new data acts like Facebook (reciprocal “friend” relationships) rather than like Twitter (nonreciprocal “follow” relationships). As we mentioned before, this is because it was difficult to accurately interpret what “one going first” means.
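Put together, the Root NPI Graph idea can be sketched in Python as follows. This is a sketch under stated assumptions, not the production pipeline: the row layout `(npi, beneficiary_id, service_year)`, the function names, and the NPIs are hypothetical, and the 11-patient threshold is left out to keep the example small.

```python
from collections import defaultdict
from itertools import combinations


def root_graph(claims, year):
    """Count distinct beneficiaries shared by each unordered NPI pair in one year."""
    npis_by_beneficiary = defaultdict(set)
    for npi, beneficiary_id, service_year in claims:
        if service_year == year:
            npis_by_beneficiary[beneficiary_id].add(npi)
    edges = defaultdict(int)
    for npis in npis_by_beneficiary.values():
        for pair in combinations(sorted(npis), 2):
            edges[pair] += 1  # a beneficiary counts once per pair, however many visits
    return dict(edges)


def shared(edges, npi_a, npi_b):
    """Look up an undirected edge; the order of the two NPIs does not matter."""
    return edges.get(tuple(sorted((npi_a, npi_b))), 0)


# Hypothetical claim rows.
claims = [
    ("1111111111", "bene1", 2016),
    ("2222222222", "bene1", 2016),
    ("2222222222", "bene1", 2016),  # a repeat visit adds nothing
    ("1111111111", "bene2", 2016),
    ("2222222222", "bene2", 2015),  # outside the year, ignored
]
edges = root_graph(claims, 2016)
print(shared(edges, "1111111111", "2222222222"))
print(shared(edges, "2222222222", "1111111111"))
```

Storing each pair under a sorted key is one way to guarantee that A,B and B,A always return the same score.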

As its name implies, Root NPI Graph is intended to grow into a series of datasets that meets everyone’s needs without being confusing or unreliable. We know that we are going to be releasing “decorations” to the Root NPI Graph that at least emulate, if not replicate, the functionality implied in the original FOIA data release. We might even release “decorations” to Root NPI Graph that are specific to particular treatment pathways or types of providers. Please get ahold of us (Twitter is good for this: @CareSet) if you would like to discuss what we should do next with these data releases.