How incomplete information & bias are driving bad assumptions and inappropriate action
Right now the world is in pandemonium about the risks associated with covid-19, most of which appear to be less about the virus’s symptoms and more about the larger social implications of the panic.
What are the current data limitations?
Our information is currently limited to what we’re hearing from the media, a variety of dissenting healthcare professionals and leaders, and a slew of dashboards & visualizations, often powered by the following dataset compiled by Johns Hopkins University: https://github.com/CSSEGISandData/COVID-19.
The covid-19 reporting method used up to February 15th has fundamentally changed
Beyond the lack of understanding and the differing opinions, the data we are depending on has a variety of major limitations.
Hubei province officials have changed the case definition for what counts as a positive covid-19 case; as a result, case counts from before and after mid-February are not directly comparable.
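One way to respect that break when working with the data is to avoid computing growth across it. The snippet below is a minimal sketch, not a recipe: the cutoff date is an assumption taken from the heading above, the counts are made up, and `split_at_definition_change` is a hypothetical helper rather than anything that ships with the dataset.

```python
from datetime import date

# Assumption: the cutoff comes from the heading above, not from the dataset itself.
DEFINITION_CHANGE = date(2020, 2, 15)

def split_at_definition_change(series, cutoff=DEFINITION_CHANGE):
    """Split (date, count) pairs into the segments before and from the cutoff."""
    before = [(d, n) for d, n in series if d < cutoff]
    after = [(d, n) for d, n in series if d >= cutoff]
    return before, after

# Hypothetical daily counts, for illustration only.
series = [
    (date(2020, 2, 13), 100),
    (date(2020, 2, 14), 110),
    (date(2020, 2, 15), 300),
    (date(2020, 2, 16), 320),
]

before, after = split_at_definition_change(series)
print("before:", before)
print("after:", after)
```

Any growth rate would then be computed within `before` or within `after`, never from one to the other, since the jump between them partly reflects the new definition rather than new infections.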
How covid-19 testing is introducing bias
Think about the bias implicit in the approach to reporting cases of covid-19. The only way we can know that someone has covid-19 is if they’ve had a positive result on the covid-19 test. There are various components of this reporting approach that create issues:
- Price: Tests are expensive. As you can imagine, people in some income brackets may be more likely to take the test than others. Income alone is unlikely to be the single determining factor, but it could play a part. With some governments stepping in to subsidize tests, the effect that age and other potential factors have on who gets tested is likely to be seriously inconsistent from one place to another.
- Supply: With tests in short supply, we are at the mercy of whatever covid-19 test deployment approach medical professionals are using. Without detail about that approach, it is impossible for us to account for any nuances they may be incorporating.
Symptoms & demographics
Beyond price and supply, there are a number of other factors that create additional issues for current testing & reporting going forward.
To put it simply, if you were an 18-to-70-year-old with no underlying health concerns and came down with what felt like a common cold, you likely wouldn’t go to any great length to be tested, especially prior to two weeks ago, before news and panic about the virus had started to accelerate.
As you’d imagine, the severity of coronavirus symptoms likely dictates whether someone seeks testing. As we’re seeing, those at greatest risk and those experiencing the most severe symptoms are people over the age of 70 with underlying health issues. What I’m suggesting is that the number of positive tests is likely correlated with the age of those who get tested at all, so a country’s confirmed case count could very well be, in part, a reflection of its age distribution.
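To make the mechanism concrete, here is a small back-of-the-envelope sketch. Every number in it is hypothetical and chosen only for illustration: both age groups are assumed to be infected at the same rate, and the only difference is how likely an infected person is to end up tested.

```python
# All numbers are hypothetical; they illustrate the mechanism, not real data.
population = {"under_70": 900_000, "70_plus": 100_000}
infection_rate = 0.02  # assumed identical in both groups
test_seeking = {"under_70": 0.05, "70_plus": 0.60}  # share of infected who get tested

def confirmed_by_group(population, infection_rate, test_seeking):
    """Confirmed cases per group: infected people who actually got tested."""
    return {
        group: population[group] * infection_rate * test_seeking[group]
        for group in population
    }

confirmed = confirmed_by_group(population, infection_rate, test_seeking)

# Share of the 70+ group among confirmed cases vs. among all infections.
share_70_plus_confirmed = confirmed["70_plus"] / sum(confirmed.values())
share_70_plus_infected = population["70_plus"] / sum(population.values())

print(f"70+ share of confirmed cases: {share_70_plus_confirmed:.1%}")
print(f"70+ share of all infections:  {share_70_plus_infected:.1%}")
```

Under these made-up rates, the older group is a small slice of the infections but a majority of the confirmed cases, purely because of who gets tested.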
Growing awareness & confusion and their effect on testing demand & availability
We are experiencing widespread panic as individuals are beginning to be tested. With greater awareness comes a growing desire for anyone with a cold to be tested.
Think of it this way: Tom Hanks and his wife recently tested positive for covid-19. Let’s consider the following two scenarios:
A. Tom & Rita get a fever, fatigue, and a persistent cough in January. Probable outcome: they follow the typical cold remedies and don’t seek medical attention. Even if they do seek it, they elect not to take, or are discouraged from enduring, the painful covid-19 test (the pain being an additional factor).
B. They experience the same symptoms in March and feel compelled to see if they have ‘the virus’.
What is compelling them to take the test in the March scenario? The growing societal unknowns, panic, and fear. What people are asking is, “Should I be more worried?” As a result, they are defaulting to action when it comes to testing.
Now apply these two scenarios to the entire American population. As testing becomes more available, we may be on the verge of seeing an enormous expansion of ‘incidents’, an expansion due to an increase in testing rather than an increase in contagiousness.
The coverage of virus-related topics has led to increased fear, pandemonium, government intervention to make testing more available, and, as a final result, more testing.
I’m not discouraging continued testing. What I am concerned about is the societal response we will likely see when the number of confirmed domestic cases shoots through the roof.
Point of clarification
Let me clarify why I think confirmed cases may skyrocket. Prior to March, we all thought this was a common cold, but China has been experiencing the disease since late fall. What I am suggesting is that covid-19 may have made its way to US soil well before it was being “spoken about”. If this is the case, there may be many individuals with mild cold symptoms who have yet to take a test, as outlined in the two Tom Hanks scenarios. With the dawn of mass reporting on the topic, increased individual testing seems imminent.
The takeaway here is that if we see a massive increase in the number of reported incidents, we can’t attribute that growth solely to how easily the disease is passed, as it could also be due to the influx of testing.
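A toy calculation shows how this can play out. The sketch below assumes, purely for illustration, that the true share of infected people stays flat while the number of tests doubles each day, and that tests are given to a random sample of the population (a simplification; real testing is anything but random). All numbers are hypothetical.

```python
# Hypothetical numbers: prevalence is held flat while testing doubles each day.
prevalence = 0.01
tests_per_day = [100, 200, 400, 800, 1600]

def expected_positives(tests, prevalence):
    """Expected positives per day if tests go to a random sample of people."""
    return [t * prevalence for t in tests]

def day_over_day_growth(counts):
    """Ratio of each day's count to the previous day's count."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]

positives = expected_positives(tests_per_day, prevalence)
case_growth = day_over_day_growth(positives)

# Share of tests that come back positive on each day.
positivity = [p / t for p, t in zip(positives, tests_per_day)]

print("expected positives:", positives)
print("day-over-day growth:", case_growth)
print("share of tests positive:", positivity)
```

In this setup confirmed cases double every day even though nothing about the disease has changed. The share of tests that come back positive stays flat, which is one rough signal that could help separate "more testing" from "more spread", assuming test counts are reported alongside case counts.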
Why is this an issue?
Additional testing is not, in and of itself, a bad thing. The issue is the potential interpretation of an increase in incidents as an indicator of covid-19’s propensity to be passed quickly.
Many different factors feed into how a society responds, and a sudden influx of reported incidents may lead to outcomes even worse than those we’ve seen thus far.
Our domestic growth rate of the virus is being closely compared to that of Italy. One statement that has people worried is that our growth is 10 days behind Italy.
Here are some of the issues with this statement:
- As alluded to, the similar covid-19 growth rates in Italy & the US could be due to similar approaches to deploying virus testing.
- There is very little questioning of how these growth numbers relate to healthcare capacity. The suggestion that we will have the exact same healthcare shortages as Italy due to a similar growth rate gives no consideration to the almost certainly different capacities of the two countries.
- With differing demographics and underlying health issues, the two countries will likely see different outcomes in both the number of infected and mortality rates.
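The demographic point in particular is easy to illustrate with arithmetic. In the sketch below, two hypothetical countries share exactly the same fatality rate within each age group; only the age mix of their cases differs. The rates and case counts are invented for illustration and are not estimates for Italy, the US, or anywhere else.

```python
# Hypothetical age-specific fatality rates, identical in both countries.
age_specific_fatality = {"under_70": 0.002, "70_plus": 0.08}

# Hypothetical case counts: same total, different age mix.
country_a_cases = {"under_70": 6_000, "70_plus": 4_000}
country_b_cases = {"under_70": 9_000, "70_plus": 1_000}

def crude_fatality_rate(cases, rates):
    """Overall deaths divided by overall cases, ignoring age structure."""
    deaths = sum(cases[group] * rates[group] for group in cases)
    return deaths / sum(cases.values())

rate_a = crude_fatality_rate(country_a_cases, age_specific_fatality)
rate_b = crude_fatality_rate(country_b_cases, age_specific_fatality)

print(f"country A crude fatality rate: {rate_a:.2%}")
print(f"country B crude fatality rate: {rate_b:.2%}")
```

The headline rates differ by more than a factor of three even though the disease behaves identically within each age group, which is why comparing raw national numbers without looking at demographics can mislead.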
What is my conclusion to all of this?
Seek to collect the best information possible & make sure you understand the data you do have. Bias reflects incomplete information, and there can be many drivers of a dataset’s incompleteness. We never actually have all the data; hence the purpose of inference. The problem is not just the lack of data, but rather the limited consideration we give to understanding and handling that data when we use it to inform decisions and behaviors.
It could be that covid-19 is spreading like wildfire domestically, or it could be that covid-19 has been in the country for months and that increased awareness and testing capability are driving mass pandemonium. In any case, it’s important to familiarize ourselves with our data and its limitations, and to never overreact to a potentially incomplete picture.
I hope this proves helpful as you go about thinking of data acquisition, assumptions, inference, data modeling, and the like. Happy data science-ing!