Guide to Exploratory Data Analysis with JHU COVID-19 Data

There is a lot of pandemonium and energy around COVID-19 and its potential implications. There are many parties out there saying many things. One of the amazing things about being a data scientist is having the ability to dive into available data on your own.

Let’s dive into some data currently being accumulated by Johns Hopkins University.

Before we dive in, it’s worth saying that we are subject to the biases included in this data.

There is heavy selection bias, in that those who are being confirmed are those who are being tested, and there are many discriminatory forces that determine who gets tested: age, availability, price/demographic, and so on. Beyond that, these forces are likely to grow or weaken with time. Different countries are investing massively to make these tests more available, which would have major implications for what’s reported.

The more you work with a broad variety of real-world datasets, the better your ability to reason about biases like these will be.

Now with that out of the way, let’s dive in!

Load up the following packages! You can load them separately, or you could alternatively call library(tidyverse).
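
Going by the functions used in the rest of this post, the individual packages are most likely the ones below. Treat the exact list as an inference from the code rather than a definitive one.

```r
# Packages inferred from the functions used later in this post
library(readr)      # read_csv()
library(dplyr)      # rename(), filter(), group_by(), summarise(), mutate(), glimpse()
library(tidyr)      # pivot_longer()
library(ggplot2)    # ggplot(), geom_line(), geom_bar()
library(lubridate)  # mdy(), days()
```

One thing to watch: library(tidyverse) only attaches lubridate from tidyverse 2.0.0 onward, so on older versions you’ll need to load it separately.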


Data Acquisition

First things first we’ve got to pull the data down. Here is the link that you’ll want to pass into the readr read_csv function:

This link is from the Johns Hopkins University GitHub page, where they are providing updated data. There are many dashboards online right now that are powered (at least in part) by this dataset.

If you visit the link directly you’ll see that it’s the actual CSV of the positive test cases by province, country, and date.

In RStudio, you’ll want to build the string as follows and pass it into the read_csv function.

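# Build the URL in pieces so that no single line gets too long
jhu <- paste("",
             "COVID-19/master/csse_covid_19_data/", "csse_covid_19_time_series/",
             "time_series_19-covid-Confirmed.csv", sep = "")

# read_csv accepts a URL and returns a tibble
jhu_df <- read_csv(jhu)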

Now that we have our data pulled in, let’s start digging in.

A few great functions to initially explore your data are head, glimpse or str, dim, and summary. dplyr provides glimpse which effectively does away with the need to use the classic str & dim as it combines the functionality. It also allows you to see more records in a cleaner format than str.

Head will give us a sample of the first 6 rows of data. As we can see, the data is set up such that provinces/countries are rows and each day is cast out as a column.
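
To see that behaviour without downloading anything, here’s a minimal sketch on a made-up table in the same wide shape. The toy_wide name and every value in it are hypothetical, not real JHU numbers.

```r
library(tibble)

# Hypothetical wide table: one row per province, one column per day (made-up values)
toy_wide <- tibble(
  province  = paste("Province", 1:8),
  `1/22/20` = 1:8,
  `1/23/20` = (1:8) * 2
)

first_rows <- head(toy_wide)         # n defaults to 6
first_two  <- head(toy_wide, n = 2)  # ask for fewer rows explicitly
print(first_rows)
```

Even though the toy table has 8 rows, head only hands back the first 6 unless you pass n yourself.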


Glimpse is used to understand the dimensions of the dataset, the datatypes, as well as a sample of the data.
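
As a rough illustration, here is glimpse next to dim on a tiny made-up table (the toy_wide name and values are hypothetical):

```r
library(dplyr)

# Hypothetical miniature of the wide layout; the values are made up
toy_wide <- tibble(
  province       = c("Hubei", "Beijing"),
  country_region = c("China", "China"),
  `1/22/20`      = c(10, 1),
  `1/23/20`      = c(25, 3)
)

glimpse(toy_wide)         # row/column counts, then one line per column with its type
toy_dim <- dim(toy_wide)  # rows first, then columns
print(toy_dim)
```

glimpse prints the row and column counts at the top, which is why a separate dim call is rarely needed.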


Summary is standard and in this case can give us an idea of the average cases per province day over day. We see a mean of 1.2 on the first day, whereas the mean on the last day is 392. There are additional ways to investigate these numbers, but that gives us a start.
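
Here’s a small sketch of where those per-day means come from, using made-up numbers (toy_wide and its values are hypothetical). summary reports a mean for every numeric column, and colMeans pulls out just the means if that’s all you’re after.

```r
# Hypothetical wide table; check.names = FALSE keeps the date-style column names intact
toy_wide <- data.frame(
  province  = c("Hubei", "Beijing", "Shanghai"),
  `1/22/20` = c(10, 1, 1),
  `1/23/20` = c(25, 3, 2),
  check.names = FALSE
)

summary(toy_wide)  # min, quartiles, mean, and max for each numeric column

# Mean cases per province for each day, numeric columns only
day_means <- colMeans(toy_wide[sapply(toy_wide, is.numeric)])
print(day_means)
```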

Often in exploratory data analysis you might make a correlation matrix to see how numeric fields relate to one another, and you might also leverage proportion tables for categoricals. In this case, however, we’re really dealing with the same variable over time, so we’ll keep it simple.
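
For reference, those two checks look something like this. The sketch uses mtcars, a dataset that ships with R, purely as a stand-in, since the JHU data doesn’t lend itself to either one.

```r
# mtcars ships with R and stands in for a dataset with several numeric fields
num_cols <- mtcars[, c("mpg", "wt", "hp")]

# Correlation matrix: pairwise Pearson correlations between the numeric columns
cor_mat <- cor(num_cols)
print(round(cor_mat, 2))

# Proportion table: share of rows falling in each category of cyl
cyl_share <- prop.table(table(mtcars$cyl))
print(round(cyl_share, 2))
```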

Restructuring the dataset

I’m going to restructure this dataset such that it’s tidy and easier to work with for our purposes.

I am going to clean up a couple of naming conventions using dplyr’s rename function, and will next use pivot_longer to make my dataset tidy. What this effectively means is that the fields included in our vector (province, country_region, Lat, and Long) stay as they are. All other columns move under a single column called Date, and their values become the values of a final field: cumulative_cases.
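
If pivot_longer is new to you, here’s a minimal sketch of that reshaping on a made-up two-province table. The toy_wide and toy_long names and all values are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one column per day (made-up values)
toy_wide <- tibble(
  province       = c("Hubei", "Beijing"),
  country_region = "China",
  `1/22/20`      = c(10, 1),
  `1/23/20`      = c(25, 3)
)

# Every column NOT listed in -c(...) is stacked into Date / cumulative_cases
toy_long <- pivot_longer(
  toy_wide,
  -c(province, country_region),
  names_to  = "Date",
  values_to = "cumulative_cases"
)
print(toy_long)
```

Two provinces times two day columns gives four rows, each holding one province, one date, and one count.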

To keep it simple, let’s filter to China.

We will then group by the province, country_region, and date to summarise cumulative infections up to a given date. From there we take the one-day lagged difference, which gives us the number of new cases each day.
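
The lagged difference is easiest to see on a plain vector. A quick sketch with made-up cumulative counts:

```r
# Hypothetical cumulative case counts for five consecutive days
cumulative <- c(10, 25, 60, 60, 75)

# diff() returns day-over-day changes and is one element shorter than its input,
# so we pad the first day with 0 (there is no previous day to compare against)
incident <- c(0, diff(cumulative))
print(incident)
```

Note that padding with 0 means the first day’s own cases never show up as new cases, which is a simplification to keep in mind when reading the early part of the chart.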

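jhu_clean <- jhu_df %>%
  # Swap the awkward column names for syntactic ones
  rename(province = `Province/State`,
         country_region = `Country/Region`) %>%
  # Stack every date column into a Date / cumulative_cases pair
  pivot_longer(-c(province, country_region, Lat, Long),
               names_to = "Date", values_to = "cumulative_cases") %>%
  filter(country_region == 'China') %>%
  # Parse the text dates (shifted back one day) before any sorting happens
  mutate(Date = mdy(Date) - days(1)) %>%
  group_by(province, country_region, Date) %>%
  summarise(cumulative_cases = sum(cumulative_cases)) %>%
  arrange(country_region, province, Date) %>%
  # Still grouped by province here, so diff() runs within each province
  mutate(incident_cases = c(0, diff(cumulative_cases)))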
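
One thing worth checking before taking a lagged difference is that the rows really are in chronological order. Column names like 1/22/20 are just text until they are parsed, and text sorts character by character. A small sketch with made-up date strings (method = "radix" is there only so the text sort behaves the same on every machine):

```r
library(lubridate)

# Hypothetical date strings in the same month/day/year style as the column names
raw_dates <- c("2/2/20", "1/31/20", "2/10/20", "2/1/20")

# Sorted as text, "2/10/20" lands before "2/2/20"
as_text <- sort(raw_dates, method = "radix")
print(as_text)

# Parsed with mdy() first, the same values sort chronologically
as_dates <- sort(mdy(raw_dates))
print(as_dates)
```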

Let’s visualize

Now that we have the data how we want it, let’s start visualizing. Using the dataset we just created, let’s plot the Chinese provinces over time.

Let’s take our jhu_clean and create a line chart of the cumulative cases by province.

jhu_clean %>% 
  ggplot(aes(x = Date, y = cumulative_cases, col = province))+
  geom_line(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 90))

A quick consideration here: this does not take into account recovered, active, or deceased cases of COVID-19.

The good thing we can quickly see is that the curve appears to be flattening. Because we are tracking cumulative cases independent of recovery, this curve cannot go downward, so a flat curve is a very good sign.

Let’s look at the same chart, but for new cases each day rather than the cumulative total (each day added to the previous day’s total).

Here is the same data as a bar chart, filtered down to just Hubei province, where Wuhan is.

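jhu_clean %>% 
  # Keep Hubei only, from 11 February onward
  filter(Date > as.Date('2020-02-10') & province == 'Hubei') %>%
  # incident_cases is the daily new-case column created when building jhu_clean
  ggplot(aes(x = Date, y = incident_cases, col = province, fill = province))+
  geom_bar(stat = 'identity')+
  theme(axis.text.x = element_text(angle = 90))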

What’s interesting here is that we’re seeing three major spikes followed by a consistent drop-off… hopefully a glimmer of hope. The spikes may have been driven by the availability of testing kits, among other things.

While this data has given us a start, we aren’t able to break things out by recovered, deceased, and active cases, and it seems that the data for the US and other countries is not consistent with what’s being reported elsewhere.


Hopefully this proves a helpful introduction to familiarizing yourself with a dataset and starting to dive in on visualization. With more complete data detailing things like active vs recovered vs deaths there is a lot more to be done.

Happy data science-ing!

