I'm teaching a Software Carpentry workshop in December which will be R based for MPAC, so I wanted to see if I could work with data similar to the kind they might be interested in. I mainly use R to analyze kinetics and microarray data. Having recently been exposed to the wonders of dplyr, I thought I would see how difficult it would be to download some HUD data and plot it over a US map. All the tutorials I found tended to plot values that were already embedded in the shape file, which is nice but doesn't make clear how to get your own data in.

library(dplyr)
library(RColorBrewer)
library(classInt)
library(maptools)

I downloaded the Fair Market Rent data from HUD, excel direct link here.

hud_data <- read.csv('FY2014_4050_Final.csv', header = TRUE)
head(hud_data)
##    fips2000  fips2010 fmr2 fmr0 fmr1 fmr3 fmr4 county State CouSub
## 1 100199999 100199999  710  566  597  976 1161      1     1  99999
## 2 100399999 100399999  827  536  698 1219 1431      3     1  99999
## 3 100599999 100599999  583  428  431  726  779      5     1  99999
## 4 100799999 100799999  743  524  627  976 1098      7     1  99999
## 5 100999999 100999999  743  524  627  976 1098      9     1  99999
## 6 101199999 101199999  566  454  477  748 1002     11     1  99999
##       countyname       Metro_code                                 Areaname
## 1 Autauga County METRO33860M33860                       Montgomery, AL MSA
## 2 Baldwin County NCNTY01003N01003                       Baldwin County, AL
## 3 Barbour County NCNTY01005N01005                       Barbour County, AL
## 4    Bibb County METRO13820M13820 Birmingham-Hoover, AL HUD Metro FMR Area
## 5  Blount County METRO13820M13820 Birmingham-Hoover, AL HUD Metro FMR Area
## 6 Bullock County NCNTY01011N01011                       Bullock County, AL
##   county_town_name pop2010 state_alpha fmr_type metro
## 1   Autauga County   54571          AL       40     1
## 2   Baldwin County  182265          AL       40     0
## 3   Barbour County   27457          AL       40     0
## 4      Bibb County   22915          AL       40     1
## 5    Blount County   57322          AL       40     1
## 6   Bullock County   10914          AL       40     0

I'm not interested in the county level data for this exercise, so I made a simpler dataframe

hud_simple <- select(hud_data, state_alpha, pop2010, fmr0, fmr1, fmr2, fmr3, fmr4)
head(hud_simple)
##   state_alpha pop2010 fmr0 fmr1 fmr2 fmr3 fmr4
## 1          AL   54571  566  597  710  976 1161
## 2          AL  182265  536  698  827 1219 1431
## 3          AL   27457  428  431  583  726  779
## 4          AL   22915  524  627  743  976 1098
## 5          AL   57322  524  627  743  976 1098
## 6          AL   10914  454  477  566  748 1002

I'm interested in the average rent for each state, so let's group_by state and calculate averages:

states <- group_by(hud_simple, state_alpha)
avg_rent <- summarise(states, rent0 = mean(fmr0),
                      rent1 = mean(fmr1),
                      rent2 = mean(fmr2),
                      rent3 = mean(fmr3),
                      rent4 = mean(fmr4))
head(avg_rent)
## Source: local data frame [6 x 6]
## 
##   state_alpha    rent0    rent1     rent2     rent3     rent4
## 1          AK 657.8966 750.3793  959.8621 1281.7586 1480.4828
## 2          AL 461.7612 498.3582  624.4776  836.2239  939.3433
## 3          AR 431.2933 459.1333  597.7733  804.1200  931.9733
## 4          AZ 523.0000 601.1333  772.8000 1074.2000 1253.6667
## 5          CA 734.8276 845.4483 1093.8966 1548.7414 1810.4655
## 6          CO 563.2500 624.8906  807.5938 1109.9062 1284.7188
dim(avg_rent)
## [1] 54  6

Now that I have the average rents for each state, let's load in a map to color. I'm sure there are better shape files out there, but I used states_21basic from ArcGIS.com

US_map <-readShapePoly('states_21basic/states.shp')
plot(US_map)

plot of chunk unnamed-chunk-5

Eh… from what I've read, there are ways to rescale and move Alaska & Hawaii, but they require rgdal, which isn't available as a binary for Mac OS X and has other dependencies. So I'll just remove them.

US_map <- US_map[US_map@data$STATE_ABBR != "HI",]
US_map <- US_map[US_map@data$STATE_ABBR != "AK",]
plot(US_map)

plot of chunk unnamed-chunk-6

Ok, now to assign values and color them by state average rent. What I struggled with most is that the color values have to be in the same order they are listed in the shape file. Of course they're not in alphabetical order in the shape file.

First, we need to make sure our averaged data has the same number of rows as our shape file does. Using the 'filter' command, we can make sure only rows where state_alpha is in the US_map@data$STATE_ABBR set are kept.

dim(avg_rent)
## [1] 54  6
avg_rent <- filter(avg_rent, state_alpha %in% US_map@data$STATE_ABBR)
dim(avg_rent)
## [1] 49  6

So the order of the avg_rent dataframe is by state, in alphabetical order:

avg_rent$state_alpha
##  [1] AL AR AZ CA CO CT DC DE FL GA IA ID IL IN KS KY LA MA MD ME MI MN MO
## [24] MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA
## [47] WI WV WY
## 54 Levels: AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY ... WY

But the order of the states in US_map is not:

 US_map@data$STATE_ABBR
##  [1] WA MT ME ND SD WY WI ID VT MN OR NH IA MA NE NY PA CT RI NJ IN NV UT
## [24] CA OH IL DC DE WV MD CO KY KS VA MO AZ OK NC TN TX NM AL MS GA SC AR
## [47] LA FL MI
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY

I couldn't figure out how to reorder the US_map data so I added an index column in avg_rent that corresponds to the order US_map

avg_rent$state_index <- match(avg_rent$state_alpha,US_map@data$STATE_ABBR)
avg_rent <- arrange(avg_rent, state_index)
head(avg_rent)
## Source: local data frame [6 x 7]
## 
##   state_alpha    rent0    rent1    rent2     rent3     rent4 state_index
## 1          WA 511.0513 604.2821 777.9744 1074.8205 1286.4359           1
## 2          MT 472.1250 516.3393 654.0893  902.7679 1037.3929           2
## 3          ME 542.9154 609.2650 747.3665  979.3985 1095.1466           3
## 4          ND 518.6981 536.5660 674.6604  937.6415 1006.6226           4
## 5          SD 447.3485 485.0909 623.6515  868.6818  934.0152           5
## 6          WY 532.0435 573.7391 726.2609 1011.7826 1143.0870           6

Now to set up our colors.

colors <- brewer.pal(11, "RdBu")
brks <- classIntervals(avg_rent$rent1, n=11)
brks <- brks$brks
brks
##  [1]  459.1333  487.3274  503.0389  520.0542  537.0152  560.1556  574.8900
##  [8]  602.8508  644.7227  747.5739  878.3443 1239.0000

This defines our colors, then figures out where the break points will be, then overwrites the brks variable with what we care about, the break point values.

plot(US_map, col = colors[findInterval(avg_rent$rent1, brks)])

plot of chunk unnamed-chunk-12

Mapping dplyr data onto a shapefile

Leave a Reply

Your email address will not be published. Required fields are marked *