Data Visualization

http://nycdatascience.com/part4_en/

Data Visualization
class 5

Vivian Zhang | Scott Kostyshak
CTO @Supstat Inc | Data Scientist @Supstat Inc

2/4/14, 7:31 AM

Data visualization
We will study basic and advanced plotting functions in R, with a focus on
methods for exploring data through visualization.
· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information
Case study and exercise: Analyzing the NBA data with graphics

Why use visualization?

Data visualization
A figure is worth a thousand words.
data <- read.table('data/anscombe.txt', header=TRUE)  # first row holds the column names
data <- data[,-1]                                     # drop the first column
head(data)

  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04

Data visualization
Try calculating some summary statistics. First calculate the mean of each variable, and then
calculate the correlation coefficient of each of the four x-y pairs.
colMeans(data)

x1 x2 x3 x4 y1 y2 y3 y4
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5

sapply(1:4,function(x) cor(data[,x],data[,x+4]))

[1] 0.816 0.816 0.816 0.817
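
The same check can be sketched with the copy of Anscombe's quartet that ships with R, assuming the built-in `anscombe` data frame (columns `x1`..`x4`, `y1`..`y4`) matches the text file used above. The panel layout and axis limits are illustrative choices, not taken from the slides. The summary statistics nearly coincide, while the four plots look very different.

```r
# Built-in copy of Anscombe's quartet (assumed to match data/anscombe.txt)
quartet <- anscombe

# Means of all eight columns
round(colMeans(quartet), 2)

# Correlation of each x-y pair
cors <- sapply(1:4, function(i) {
  cor(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]])
})
round(cors, 3)

# Draw the four sets side by side, each with its least-squares line
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red4")
}
par(op)
```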

Some basic principles
1. Determine the target of visualization from the beginning
· Exploratory visualization
· Explanatory visualization
2. Understand the characteristics of the data and the audience
· Which variables are important and interesting
· Consider the role and background of the audience
· Select a proper mapping
3. Keep it concise but give enough information

Mapping elements of a graph:
1. Coordinate position
2. Line
3. Size
4. Color
5. Shape
6. Text
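
As a rough illustration of these mapping elements, the sketch below maps several variables of the built-in `mtcars` data to position, size, color, shape and text in one ggplot2 plot. The choice of dataset and of which variable goes to which element is an assumption made only for this example.

```r
library(ggplot2)

# Position: weight (x) and fuel economy (y)
# Size: horsepower; color: number of cylinders; shape: transmission type
# Text: the car name, taken from the row names
p_map <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp, colour = factor(cyl), shape = factor(am))) +
  geom_text(aes(label = rownames(mtcars)), size = 2.5, vjust = -1)
print(p_map)
```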

Visualization functions in R
· base graphics
· lattice
· ggplot2

Elementary graphing functions
plot(cars$dist~cars$speed)

Elementary graphing functions
plot(cars$dist,type='l')

Elementary graphing functions
plot(cars$dist,type='h')

Elementary graphing functions
hist(cars$dist)

lattice package
library(lattice)
num <- sample(1:3,size=50,replace=T)
barchart(table(num))

lattice package
qqmath(rnorm(100))

lattice package
stripplot(~ Sepal.Length | Species, data = iris,layout=c(1,3))

lattice package
densityplot(~ Sepal.Length, groups=Species, data = iris,plot.points=FALSE)

lattice package
bwplot(Species~ Sepal.Length, data = iris)

lattice package
xyplot(Sepal.Width~ Sepal.Length, groups=Species, data = iris)

lattice package
splom(iris[1:4])

lattice package
histogram(~ Sepal.Length | Species, data = iris,layout=c(1,3))

Three-dimensional graphs in the lattice package
library(plyr)
func3d <- function(x,y) {
sin(x^2/2 - y^2/4) * cos(2*x - exp(y))
}
vec1 <- vec2 <- seq(0,2,length=30)
para <- expand.grid(x=vec1,y=vec2)
result6 <- mdply(.data=para,.fun=func3d)

Three-dimensional graphs in the lattice package
library(lattice)
wireframe(V1~x*y,data=result6,scales = list(arrows = FALSE),
drape = TRUE, colorkey = F)
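
Because `func3d` only uses vectorized arithmetic, the grid can also be evaluated without plyr. The following is a sketch under that assumption; the names `grid3d`, `z` and `w` are chosen here for illustration and do not appear in the slides.

```r
library(lattice)

# Same surface function as in the slides
func3d <- function(x, y) {
  sin(x^2/2 - y^2/4) * cos(2*x - exp(y))
}

# Build the 30 x 30 grid and evaluate the function on every row at once
vec <- seq(0, 2, length.out = 30)
grid3d <- expand.grid(x = vec, y = vec)
grid3d$z <- with(grid3d, func3d(x, y))

# Draw the surface; the column is named z here instead of V1
w <- wireframe(z ~ x * y, data = grid3d,
               scales = list(arrows = FALSE),
               drape = TRUE, colorkey = FALSE)
print(w)
```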

ggplot package
Data, Mapping and Geom
library(ggplot2)
p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy)) + geom_point()
print(p)

ggplot package
Observe the internal structure
summary(p)

data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class [234x11]
mapping: x = cty, y = hwy
faceting: facet_null()
-----------------------------------
geom_point: na.rm = FALSE
stat_identity:
position_identity: (width = NULL, height = NULL)

ggplot package
Add other data mappings
p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy,colour=factor(year)))
p <- p + geom_point()
print(p)

ggplot package
Add a statistical transformation such as a smooth
p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy,colour=factor(year)))
p <- p + geom_smooth()
print(p)

ggplot package
Add points and smooth lines on the plot layer
p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy)) +
geom_point(aes(colour=factor(year))) +
geom_smooth()

ggplot package
Scale control
p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy)) +
geom_point(aes(colour=factor(year))) +
geom_smooth() +
scale_color_manual(values=c('blue2','red4'))

ggplot package
Facet control
p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy)) +
geom_point(aes(colour=factor(year))) +
geom_smooth() +
scale_color_manual(values=c('blue2','red4')) +
facet_wrap(~ year,ncol=1)

ggplot package
Polishing your plots for publication
p <- ggplot(data=mpg, mapping=aes(x=cty,y=hwy)) +
geom_point(aes(colour=class,size=displ),
alpha=0.5,position = "jitter") +
geom_smooth() +
scale_size_continuous(range = c(4, 10)) +
facet_wrap(~ year,ncol=1) +
ggtitle('Vehicle model and fuel consumption') +
labs(y='Highway miles per gallon',
x='Urban miles per gallon',
size='Displacement',
colour = 'Model')

ggplot exercise I
Change the coordinate system of the plot below, for example with coord_flip(), coord_polar() or coord_cartesian().
p <- ggplot(data=mpg, mapping=aes(x=cty,y=hwy)) +
geom_point(aes(colour=factor(year),size=displ), alpha=0.5,position = "jitter")+
stat_smooth()+
scale_color_manual(values =c('steelblue','red4'))+
scale_size_continuous(range = c(4, 10))
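
As a starting point for the exercise, here is a minimal sketch of how the three coordinate functions attach to a plot. It uses a simplified scatter plot rather than the full plot above, and the names `base_plot`, `flipped`, `polar` and `zoomed` as well as the zoom limits are chosen only for this example.

```r
library(ggplot2)

# A simplified version of the plot, used only to show the coordinate systems
base_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5)

flipped <- base_plot + coord_flip()                       # swap the x and y axes
polar   <- base_plot + coord_polar()                      # x becomes the angle, y the radius
zoomed  <- base_plot + coord_cartesian(xlim = c(10, 20))  # zoom in without dropping data

print(flipped)
print(polar)
print(zoomed)
```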

The properties of a single variable

Histogram
library(ggplot2)
p <- ggplot(data=iris,aes(x=Sepal.Length))+
geom_histogram()
print(p)

Histogram
We can customize the histogram as follows:
p <- ggplot(iris,aes(x=Sepal.Length))+
geom_histogram(binwidth=0.1, # Set the bin width
fill='skyblue', # Set the fill color
colour='black') # Set the border color

Histograms plus density curve
The main role of the histogram is to show counts by group and the shape of the distribution. The
distribution of a sample is of central importance in traditional statistics. Another way to show
the distribution of the data is the kernel density estimation curve: a density curve that
represents the distribution is estimated from the data. We can display the histogram and the
density curve at the same time.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..),
fill='skyblue',
color='black') +
geom_density(color='black',
linetype=2,adjust=2)

Density curve
Similar to a window width (bandwidth) parameter, the adjust parameter controls how smooth the
density curve is. We try different values to draw multiple density curves. The smaller the value,
the more volatile and sensitive the curve.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..), # Note: set y to density instead of count
fill='gray60',
color='gray') +
geom_density(color='black',linetype=1,adjust=0.5) +
geom_density(color='black',linetype=2,adjust=1) +
geom_density(color='black',linetype=3,adjust=2)
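
The effect of adjust can also be checked numerically with base R's `density()`, which multiplies the selected bandwidth by `adjust`. This is a small sketch with base graphics; the object names `d_half`, `d_one` and `d_double` are chosen for illustration.

```r
x <- iris$Sepal.Length

# Same data, three values of adjust
d_half   <- density(x, adjust = 0.5)
d_one    <- density(x, adjust = 1)
d_double <- density(x, adjust = 2)

# The bandwidth actually used scales linearly with adjust
c(half = d_half$bw, one = d_one$bw, double = d_double$bw)

# Smaller adjust gives a wigglier curve, larger adjust a smoother one
plot(d_half, lty = 1, main = "Effect of adjust on the density estimate")
lines(d_one, lty = 2)
lines(d_double, lty = 3)
legend("topright", legend = c("adjust = 0.5", "adjust = 1", "adjust = 2"), lty = 1:3)
```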

Density curve
Density curves are also convenient for comparing groups. For example, to compare the
Sepal.Length distributions of the three iris species:
p <- ggplot(iris,aes(x=Sepal.Length,fill=Species)) + geom_density(alpha=0.5,color='gray')
print(p)

Boxplot
In addition to histograms and density plots, we can also use boxplots to show the distribution of
one-dimensional data. The boxplot is also convenient for comparing groups.
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_boxplot()
print(p)

Violin plot
A violin plot contains more information than a boxplot about the (sub-)distributions of the data:
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_violin()
print(p)

Violin plot plus points
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,
fill=Species)) +
geom_violin(fill='gray',alpha=0.5) +
geom_dotplot(binaxis = "y", stackdir = "center")
print(p)

Displaying compositions

Bar chart
The number of vehicles of each class in the mpg dataset
p <- ggplot(mpg,aes(x=class)) +
geom_bar()
print(p)

Stacked bar chart
The number of vehicles of each class in the mpg dataset, broken down by year
mpg$year <- factor(mpg$year)
p <- ggplot(mpg,aes(x=class,fill=year)) +
geom_bar(color='black')
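
The bar charts above show counts. If proportions are wanted, one possible sketch is to compute the shares with `prop.table()` and to let ggplot2 rescale each bar to 1 with `position = "fill"`. The names `class_share` and `p_fill` and the axis labels are assumptions for this example.

```r
library(ggplot2)

# Share of each vehicle class in the whole dataset
class_share <- prop.table(table(mpg$class))
round(class_share, 3)

# Within each class, the share of each model year (every bar has height 1)
p_fill <- ggplot(mpg, aes(x = class, fill = factor(year))) +
  geom_bar(position = "fill", colour = "black") +
  labs(y = "Proportion", fill = "Year")
print(p_fill)
```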

Dodged bar chart
Place the bars for each year side by side instead of stacking them
p <- ggplot(mpg,aes(x=class,fill=year)) +
geom_bar(color='black',
position=position_dodge())

Pie chart
p <- ggplot(mpg, aes(x = factor(1), fill = factor(class))) +
geom_bar(width = 1)+
coord_polar(theta = "y")

Rose diagram
The wind rose, a graphic commonly used by meteorologists, describes the distribution of wind speed
and direction at a specific place.

set.seed(1)
# Randomly generate 100 wind directions and divide them into 16 intervals.
dir <- cut_interval(runif(100,0,360),n=16)
# Randomly generate 100 wind speeds and divide them into 4 intensity levels.
mag <- cut_interval(rgamma(100,15),4)
sample <- data.frame(dir=dir,mag=mag)
# Map wind direction to the x-axis, frequency to the y-axis and speed to the fill color, then transform to polar coordinates.
p <- ggplot(sample,aes(x=dir,fill=mag)) +
geom_bar()+ coord_polar()

Mosaic Plot
Divide the data according to different variables, and then use rectangles of different sizes to
represent different groups of data. Let's look at the gender breakdown of survivors:
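
The slide gives no code for this plot. A minimal sketch, assuming the survivors in question are those in R's built-in `Titanic` table and using base R's `mosaicplot()`, could look like this. The name `sex_survival` and the colors are chosen for illustration.

```r
# Titanic is a 4-way contingency table: Class x Sex x Age x Survived
# Collapse it to the two dimensions of interest: Sex (2) and Survived (4)
sex_survival <- margin.table(Titanic, c(2, 4))
sex_survival

# The area of each rectangle is proportional to the cell count
mosaicplot(sex_survival,
           main = "Titanic: survival by sex",
           color = c("red4", "steelblue"))
```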

The proportion structure of continuous data
data <- read.csv('data/soft_impact.csv',T)
library(reshape2)
data.melt <- melt(data,id='Year')
p <- ggplot(data.melt,aes(x=Year,y=value,
group=variable,fill=variable)) +
geom_area(color='black',size=0.3,
position=position_fill()) +
scale_fill_brewer()
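
Since data/soft_impact.csv is not included here, the idea can be tried on made-up numbers. The sketch below is only an illustration: the data frame `wide`, its columns A, B and C, and the values are all hypothetical. It reshapes the data to long format with base R's `stack()` and checks that the shares within each year sum to 1, which is what `position = "fill"` displays.

```r
library(ggplot2)

# Hypothetical wide data: one row per year, one column per category
set.seed(1)
wide <- data.frame(Year = 2000:2009,
                   A = runif(10, 1, 5),
                   B = runif(10, 1, 5),
                   C = runif(10, 1, 5))

# Long format: columns Year, values, ind (ind holds the category name)
long <- data.frame(Year = rep(wide$Year, times = 3),
                   stack(wide[c("A", "B", "C")]))

# Share of each category within its year
long$share <- ave(long$values, long$Year, FUN = function(v) v / sum(v))

# Filled area chart: each year is rescaled to a total of 1
p_area <- ggplot(long, aes(x = Year, y = values, fill = ind)) +
  geom_area(position = "fill", colour = "black") +
  scale_fill_brewer()
print(p_area)
```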

The relationship between variables

Scatter diagram
Show the relationship between two variables with a scatter diagram.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point()
print(p)

Scatter plot of multidimensional data
mpg$year <- factor(mpg$year)
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year))
print(p)

Scatter plot of multidimensional data
Represent different years with different shapes
mpg$year <- factor(mpg$year)
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year,shape=year))
print(p)

Scatter plot of multidimensional data
With large data sets, the points in a scatter plot may obscure each other due to overplotting. We
can add a small random disturbance (jitter) to the points to solve this problem.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year),alpha=0.5,position =
print(p)
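
To see how much overplotting there is, one can count the distinct (cty, hwy) pairs. The sketch below uses `geom_jitter()`, a shortcut for a point layer with jittered positions. The jitter widths and the names `n_total`, `n_distinct` and `p_jit` are illustrative assumptions.

```r
library(ggplot2)

# cty and hwy are whole numbers, so many cars share exactly the same point
n_total    <- nrow(mpg)
n_distinct <- nrow(unique(mpg[, c("cty", "hwy")]))
c(total = n_total, distinct = n_distinct)

# Add a small random offset to each point so that overlapping cars become visible
p_jit <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(aes(colour = factor(year)), width = 0.3, height = 0.3, alpha = 0.5)
print(p_jit)
```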

Scatter plot of multidimensional data
To show the trend in the scatter plot, we can add a regression line.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year),alpha=0.5,position = "jitter") +
geom_smooth(method='lm')
print(p)

Scatter plot of multidimensional data
In addition to color, we can also use the size of the dot to reflect another variable, such as
engine displacement. Some refer to plots like this as "bubble charts".
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year,size=displ),alpha=0.5,position = "jitter") +
geom_smooth(method='lm') +
scale_size_continuous(range = c(4, 10))

Scatter plot of multidimensional data
Although we can show all the variables in a picture, we can also split it into multiple pictures to show
the characteristics of different variables. This method is called grouping, conditioning, or faceting.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(colour=class,size=displ),
alpha=0.5,position = "jitter") +
geom_smooth() +
scale_size_continuous(range = c(4, 10)) +
facet_wrap(~ year,ncol=1)

ggplot exercise II
· Make a scatter plot of the diamonds data
· Use transparency and small points; look into the size and alpha options of geom_point()
· Use a binned chart to observe the density of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use
  + coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))

Scatter plot of multidimensional data
The typical scatter plot is to show a relationship between two variables. When you want to look at
many bivariate relationships at once, you can use a scatter plot matrix.
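
The slide shows no code for the matrix. One way to sketch it is with base R's `pairs()` on the four numeric columns of iris; the lattice function `splom()` shown earlier is an alternative. The coloring by species and the names `num_vars` and `cor_mat` are assumptions for this example.

```r
# The four numeric measurements of the iris data
num_vars <- iris[1:4]

# One scatter plot for every pair of variables, colored by species
pairs(num_vars,
      col = as.integer(iris$Species) + 1,
      pch = 19, cex = 0.6,
      main = "Scatter plot matrix of the iris measurements")

# The matching table of pairwise correlations
cor_mat <- cor(num_vars)
round(cor_mat, 2)
```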

Scatter plot of multidimensional data
When there are many numerical variables, they can be displayed together in one condensed figure.

Change over time
When visualizing time series data, the first step is to look at how the variable changes over time.
For example, we'll look at US unemployment data from the economics dataset.
fillcolor <- ifelse(economics$unemploy[440:470]<8000,'steelblue','red4')
p <- ggplot(economics[440:470,],aes(x=date,y=unemploy)) +
geom_bar(stat='identity',
fill=fillcolor)

Change over time
For a time series with a small amount of data, we can use a bar graph, with different colors to
distinguish positive and negative values. For a time series with a large amount of data, the bars
become crowded, so lines and points can be used in place of the bars.

p <- ggplot(economics[300:470,],aes(x=date,ymax=psavert,ymin=0)) +
geom_linerange(color='grey20',size=0.5) +
geom_point(aes(y=psavert),color='red4') +
theme_bw()

Change over time
When the data are denser, we can use a line graph or an area chart to show the trend.
Important time points or intervals can also be marked in the time series graph, such as
marking the 1980s as a key period.
fill.color <- ifelse(economics$date > '1980-01-01' &
economics$date < '1990-01-01',
'steelblue','red4')
p <- ggplot(economics,aes(x=date,ymax=psavert,ymin=0)) +
geom_linerange(color=fill.color,size=0.9) +
annotate('text',x=as.Date("1985-01-01",'%Y-%m-%d'),y=13,label="1980s") +
theme_bw()

Geographic information visualization

Map
Two ways to draw a map
· Download geographic information data, then draw the geographical boundaries and
mark areas and locations as needed
· Download bitmap data from Google Maps, then mark location and path information on the
Google map

Map
world map
library(ggplot2)
world <- map_data("world")
worldmap <- ggplot(world, aes(x=long, y=lat, group=group)) +
geom_path(color='gray10',size=0.3) +
geom_point(x=114,y=30,size=10,shape='*') +
scale_y_continuous(breaks=(-2:2) * 30) +
scale_x_continuous(breaks=(-4:4) * 45) +
coord_map("ortho", orientation=c(30, 120, 0)) +
theme(panel.grid.major = element_line(colour = "gray50"),
panel.background = element_rect(fill = "white"),
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank())

map of the U.S.
map <- map_data('state')
arrests <- USArrests
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))
usmap <- ggplot(data=arrests) +
geom_map(map =map,aes(map_id = region,fill = murder),color='gray40' ) +
expand_limits(x = map$long, y = map$lat) +
scale_fill_continuous(high='red2',low='white') +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
legend.position = c(0.95,0.28),
legend.background=element_rect(fill="white", colour="white"))+ coord_map('mercator'

Drawing a map of China based on a bitmap
Another way to draw a map of China is to download a bitmap image from Google or OpenStreetMap and
then overlay point and line elements on it with ggplot2. The downloaded file contains no latitude
or longitude information, only a plain bitmap, which makes for fast mapping.
library(ggmap)
library(XML)
webpage <-'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
tables <- readHTMLTable(webpage,stringsAsFactors = FALSE)
raw <- tables[[6]]
data <- raw[,c(1,3,4)]
names(data) <- c('date','lan','lon')
data$lan <- as.numeric(data$lan)
data$lon <- as.numeric(data$lon)
data$date <- as.Date(data$date, "%Y-%m-%d")
#Read the map data from Google by the ggmap package, and mark the previous data on the map.
earthquake <- ggmap(get_googlemap(center = 'china', zoom=4,maptype='terrain'),extent='device'
geom_point(data=data,aes(x=lon,y=lan),colour = 'red',alpha=0.7)+
theme(legend.position = "none")

R and interactive visualization
googleVis is an R package that provides an interface between R and the Google Visualization API. It
allows the user to use the Google Visualization API for data visualization without the need to upload data.
We want to compare the development trajectories of 20 countries over the past several years. To
obtain the data, we selected three variables from the World Bank database, which reflect the
change in GDP, CO2 emissions and life expectancy from 2001 to 2009.
library(googleVis)
library(WDI)
DF <- WDI(country=c("CN","RU","BR","ZA","IN",'DE','AU','CA','FR','IT','JP','MX','GB','US'
M <- gvisMotionChart(DF, idvar="country", timevar="year",
xvar='EN.ATM.CO2E.KT',
yvar='NY.GDP.MKTP.CD')
plot(M)

Case study and exercise

Exercise III: Analyzing NBA data
· Calculate the winning rate for each season, and draw a bar chart
· Calculate the winning rate for each season at home and on the road, and draw a bar chart
· From the home side's scores by season, draw a set of four histograms
· From the home side's scores by season, draw boxplots for five seasons
· Draw boxplots of the scores in all games for the home side and the opposing side
· Calculate the average score and winning percentage against each opponent, and make a scatter plot
to find the strong and the weak teams.
