Tutorial: Plotting Lyme Disease Cases across Michigan counties with ggplot2

Here is a tutorial for anyone interested in learning how to create maps with ggplot2. We’re going to investigate the incidence of Lyme Disease cases across Michigan.

output

Disclaimer: This is not very elegant coding, it’s more brute force coding.  Various ways to get the same output with less and cleaner code.

1) Getting the Data

CDC has data on Lyme Disease Incidence  and so let’s download it and put it into a working directory.

Go to this website CDC Lyme Disease statistics and scroll to the bottom where we can see a data file called ” County-level Lyme disease data from 2000-2015 Microsoft Excel file“. The main file’s name was too wordy so I renamed it ‘lyme_county_0015.csv”.

#All the packages we're going to need
library(ggplot2)
library(dplyr)
library(viridis)
library(reshape2)
library(gganimate)
library(animation)
library(maps)
devtools::install_github("dgrtwo/gganimate")
#The Lyme Disease Data
main.dat = read.csv("lyme_county_0015.csv")
head(main.dat)

2) Subsetting to only include Michigan

MI. main.dat = subset(main.dat, STNAME == "MI")
MI.main.dat = MI.main.dat[-nrow(MI.main.dat),]
colnames(MI.main.dat)[2] <- 'county'

The subset function is telling R to give me data that only relates to Michigan and I decided to put this in my new subsetted variable called MI.main.dat. Also the last row is the total cases in Michigan, so I removed it. I also renamed the CTYNAME to county


3) Getting spatial data of Michigan counties

I need a map with the Michigan counties or specifically, latitude/longitude data for all of Michigan counties. Thankfully, ggplot2 and maps has us covered. Make sure you have both packages loaded.

map = map_data("county", "michigan")
colnames(map)[6] <- 'county'

We use the ‘map_data‘ function  and when we view it, we can see the important spatial information to produce a map of Michigan.  Now we want to merge our two data frames or in other words, we want the spatial data and our cases in a data frame.


4) Merging our data frames together

If I had a longer data-set, I say that the following method is not very safe. But I noticed that for both data frames, the county names are the same except some minute differences (capitalization, ‘county’ inclusion, periods,etc).

Why is this important? To merge data frames, we need to have similar entries! So I’ll use the county column in the map data frame to replace the county column in the Lyme Disease data frane. The ‘unique’ function retrieves a vector of only unique elements. If this seems confusing, run through the code first:

cbind(MI.main.dat$county, unique(map$county))
MI.main.dat$county =  unique(map$county)

When you eyeball it, the counties are the same! So I’m saying to R, replace the county column of the tick data frame with the county column of the map data frame.
I use a  function from the {dplyr} package where I merge the data frames together.

tick.MI.map =  left_join(map,MI.main.dat, by= c('county'))

Specifically, I am telling R to join the map data with the tick cases data by the county column.

tick.dat.map_sub =  tick.dat.map[,c(1,2,3,6,10:25)]
colnames(tick.dat.map_sub)[5:20]=(seq(2000,2015,1))

I only want some columns so I’m creating a final variable that I will be plotting with called tick.dat.map_sub.

I also decided to rename the columns that say Cases2000 to the actual years. I’ll show you why in a bit


5) Melting your data

If you look at your data, it’s in something called a wide format. This means that every dependent variable (the cases for each year) is in its column. If you think about it, plotting this will be somewhat difficult. The location of the counties will stay the same, but the Lyme Diseases cases across the years will be different. Plotting each column is not a great idea, it will be about 15 lines of code for each year!

So we have to change it to a long-format and the package ‘reshape2’ is best and we use the function called melt.

tick_melt= melt(tick.dat.map_sub, id = c('long','lat','group','county'))

Take a look at it and try to understand what happened!


6) ggPlotting

Now to make a ggplot graph.

ggplot(tick_melt, aes(x = long, y= lat, group=group))

Let’s go part by part. I am telling R to use the tick_melt data and for mapping, I need to list the longitude, latitude, and to group it by the group column. That is the data part of my ggplot code.

The second part is how I am going to represent it?

ggplot(tick_melt, aes(x = long, y= lat, group=group)) +
geom_polygon(aes(fill=value))

geom_polygon is used for maps and the fill function is specifying that the color is going to be dependent on the number of Lyme Disease cases. But I want to see the map year by year so I use a facet_wrap which separates the graphs by the years.

ggplot(tick_melt, aes(x = long, y= lat, group=group)) +
geom_polygon(aes(fill=value))+&nbsp; facet_wrap(~variable)

I want the map to have a nicer projection so I use coord_map() and I want to have a better color palette and the {viridis} package is a good one. Finally, I want to make the axis, background, all white for better aesthetics so I use a function called theme_void()

ggplot(tick_melt, aes(x = long, y= lat, group=group)) +
geom_polygon(aes(fill=value))+facet_wrap(~variable)+
coord_map()+
scale_fill_viridis(option='plasma')+theme_void()


Extra: Creating it into a gif

This is a bit more involved and getting ImageMagick to work can be very finicky

  1. Install ImageMagick
  2. If you’re using Windows10, click the option to install “legacies” (very important, will not work if you don’t)
  3. Go back to Rstudio and install {animation}
  4. Install {devtools}
  5. Type in console: devtools::install_github(“dgrtwo/gganimate”)
  6. Load all the libraries

We want the frame by frame to be year, so I simply add ‘frame= Year’ and assign the whole ggplot to a variable which I call ‘p’  and then I call the gganimate function that will create a gif in the working directory.

p = ggplot(tick_melt, aes(x = long, y= lat, group=group, frame=Year)) +
geom_polygon(aes(fill=value))+facet_wrap(~variable)+
 coord_map()+
scale_fill_viridis(option='plasma')+theme_void()

gganimate(p, "output.gif")

Here is the full code on my gist. So we learned that Yoopers (The people that live in the upper peninsula) seem to be facing a bit more Lyme Disease problems than the Trolls (The people who live in the mitten peninsula)


How could this be better?

This method works best for one state but might be a bit more difficult when doing all counties across United States. An alternative would be avoiding strings but utilizing  FIPS code. An easier alternative would have been using a package that specifically deal with this kind of spatial mapping (choroplethr).

About damiepak

PhD student in Bjornstad Lab
This entry was posted in News. Bookmark the permalink.

One Response to Tutorial: Plotting Lyme Disease Cases across Michigan counties with ggplot2

  1. Cool! Nicely explained.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s