数据来自 2011 年全国大学生数据建模竞赛 A 题,下载数据到本地后,首先读取城市地形数据,即采样点的位置及所属功能区。功能区代码 1 对应「生活区」,2 对应「工业区」,3 对应「山区」,4 对应「交通区」和 5 对应「公园绿地区」。
library(readxl)
# 采样点的城市地形数据
dat1 <- read_xls(
path = "data/cumcm2011A附件_数据.xls",
col_names = TRUE, sheet = "附件1", range = "A3:E322"
)
dat1
# # A tibble: 319 × 5
# 编号 `x(m)` `y(m)` `海拔(m)` 功能区
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 74 781 5 4
# 2 2 1373 731 11 4
# 3 3 1321 1791 28 4
....
library(tibble)
tmp <- tribble(
~`功能区编号`, ~`功能区名称`,
1, "生活区",
2, "工业区",
3, "山区",
4, "交通区",
5, "公园绿地区"
)
# 合并数据
dat2 <- merge(x = dat1, y = tmp, by.x = "功能区", by.y = "功能区编号", sort = FALSE)
dat2
# 功能区 编号 x(m) y(m) 海拔(m) 功能区名称
# 1 4 1 74 781 5 交通区
# 2 4 2 1373 731 11 交通区
# 3 4 3 1321 1791 28 交通区
# 4 4 44 8457 8991 21 交通区
# 5 4 5 1049 2127 12 交通区
....
读取地表土壤各种重金属浓度的数据,它在同一个Excel文件的另一个工作区里。
# 土壤重金属浓度
dat3 <- read_xls(
path = "data/cumcm2011A附件_数据.xls",
col_names = TRUE, sheet = "附件2", range = "A3:I322"
)
# 篇幅所限,铅 Pb (μg/g) 和锌 Zn (μg/g) 两列未显示
dat3
# # A tibble: 319 × 9
# 编号 `As (μg/g)` `Cd (ng/g)` `Cr (μg/g)` `Cu (μg/g)` `Hg (ng/g)` `Ni (μg/g)`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 7.84 154. 44.3 20.6 266 18.2
# 2 2 5.93 146. 45.0 22.5 86 17.2
# 3 3 4.9 439. 29.1 64.6 109 10.6
....
将采样点的地形数据和土壤重金属浓度数据合并在一起。
dat4 <- merge(x = dat2, y = dat3, by.x = "编号", by.y = "编号", sort = TRUE)
dat4
# 编号 功能区 x(m) y(m) 海拔(m) 功能区名称 As (μg/g) Cd (ng/g) Cr (μg/g) Cu (μg/g)
# 1 1 4 74 781 5 交通区 7.84 153.8 44.31 20.56
# 2 2 4 1373 731 11 交通区 5.93 146.2 45.05 22.51
# 3 3 4 1321 1791 28 交通区 4.90 439.2 29.07 64.56
# 4 4 2 0 1787 4 工业区 6.56 223.9 40.08 25.17
# 5 5 4 1049 2127 12 交通区 6.35 525.2 59.35 117.53
....
以上8种主要重金属元素的背景参考值如下:
# 土壤重金属浓度的背景参考值
dat5 <- read_xls(
path = "data/cumcm2011A附件_数据.xls",
col_names = TRUE, sheet = "附件3", range = "A3:D11"
)
dat5
# # A tibble: 8 × 4
# 元素 平均值 标准偏差 范围
# <chr> <dbl> <dbl> <chr>
# 1 As (μg/g) 3.6 0.9 1.8~5.4
# 2 Cd (ng/g) 130 30 70~190
# 3 Cr (μg/g) 31 9 13~49
# 4 Cu (μg/g) 13.2 3.6 6.0~20.4
# 5 Hg (ng/g) 35 8 19~51
# 6 Ni (μg/g) 12.3 3.8 4.7~19.9
# 7 Pb (μg/g) 31 6 19~43
# 8 Zn (μg/g) 69 14 41~97
既然提供了各重金属浓度的背景参考值,可以先将原浓度数据标准化。
# 相比于 transform 函数 within 更友好一些,特别是在列名处理上
# 详见 https://bugs.r-project.org/show_bug.cgi?id=17890
dat6 <- within(dat4, {
`As (μg/g)` <- (`As (μg/g)` - 3.6) / 0.9
`Cd (ng/g)` <- (`Cd (ng/g)` - 130) / 30
`Cr (μg/g)` <- (`Cr (μg/g)` - 31) / 9
`Cu (μg/g)` <- (`Cu (μg/g)` - 13.2) / 3.6
`Hg (ng/g)` <- (`Hg (ng/g)` - 35) / 8
`Ni (μg/g)` <- (`Ni (μg/g)` - 12.3) / 3.8
`Pb (μg/g)` <- (`Pb (μg/g)` - 31) / 6
`Zn (μg/g)` <- (`Zn (μg/g)` - 69) / 14
})
# 篇幅所限,仅展示部分列
dat6
# 编号 功能区 x(m) y(m) 海拔(m) 功能区名称 As (μg/g) Cd (ng/g) Cr (μg/g) Cu (μg/g)
# 1 1 4 74 781 5 交通区 4.71111 0.79333 1.478889 2.04444
# 2 2 4 1373 731 11 交通区 2.58889 0.54000 1.561111 2.58611
# 3 3 4 1321 1791 28 交通区 1.44444 10.30667 -0.214444 14.26667
# 4 4 2 0 1787 4 工业区 3.28889 3.13000 1.008889 3.32500
# 5 5 4 1049 2127 12 交通区 3.05556 13.17333 3.150000 28.98056
....
为了方便后续数据处理和分析,重命名数据框各个列名。
# 查看各个列名
colnames(dat6)
# [1] "编号" "功能区" "x(m)" "y(m)" "海拔(m)" "功能区名称"
# [7] "As (μg/g)" "Cd (ng/g)" "Cr (μg/g)" "Cu (μg/g)" "Hg (ng/g)" "Ni (μg/g)"
# [13] "Pb (μg/g)" "Zn (μg/g)"
# 重命名各个列
colnames(dat6) <- c(
"编号", "功能区", "x", "y", "海拔", "功能区名称",
"As", "Cd", "Cr", "Cu", "Hg", "Ni", "Pb", "Zn"
)
# 调色板
# RColorBrewer::brewer.pal(n = 5, name = "Set2")
# 查看颜色
# RColorBrewer::display.brewer.pal(n = 5, name = "Set2")
colorize_factor <- function(x) {
# 注意因子水平个数
scales::col_factor(palette = "Set2", levels = unique(x))(x)
}
# 给每个功能区设置一个颜色
dat6 <- transform(dat6, color = colorize_factor(`功能区名称`))
dat6
# 编号 功能区 x y 海拔 功能区名称 As Cd Cr Cu
# 1 1 4 74 781 5 交通区 4.71111 0.79333 1.478889 2.04444
# 2 2 4 1373 731 11 交通区 2.58889 0.54000 1.561111 2.58611
# 3 3 4 1321 1791 28 交通区 1.44444 10.30667 -0.214444 14.26667
# 4 4 2 0 1787 4 工业区 3.28889 3.13000 1.008889 3.32500
# 5 5 4 1049 2127 12 交通区 3.05556 13.17333 3.150000 28.98056
....
查看各个功能区采样点的数量,以及各个功能区的配色。
aggregate(`编号` ~ color + `功能区名称`, data = dat6, FUN = function(x) length(unique(x)))
# color 功能区名称 编号
# 1 #66C2A5 交通区 138
# 2 #8DA0CB 公园绿地区 35
# 3 #A6D854 山区 66
# 4 #FC8D62 工业区 36
# 5 #E78AC3 生活区 44
采样点在各个功能区的分布情况,如图1所示,城市地势西南低东北高,西北边界主要分布工业区,交通连接城市各个功能区,东北方向主要是山区。
library(lattice)
# 绘图涉及中文,调用 showtext 包处理
library(showtext)
showtext_auto()
# 三维分组散点图
cloud(`海拔` ~ x * y,
data = dat6, pch = 19,
col = dat6$color, # 散点颜色映射功能区
# z 轴标签旋转 90 度
scales = list(
arrows = FALSE, col = "black",
z = list(rot = 90)
),
key = list( # 制作图例
# space = "right",
corner = c(1.0, 0.5), # 右侧居中
points = list(col = c("#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3", "#A6D854"), pch = 19),
text = list(c("交通区", "工业区", "公园绿地区", "生活区", "山区"))
),
# 减少三维图形的边空
lattice.options = list(
layout.widths = list(
left.padding = list(x = -.6, units = "inches"),
right.padding = list(x = -1.0, units = "inches")
),
layout.heights = list(
bottom.padding = list(x = -.8, units = "inches"),
top.padding = list(x = -1.0, units = "inches")
)
),
# 设置坐标轴字体大小
par.settings = list(
axis.line = list(col = "transparent"),
fontsize = list(text = 15, points = 10)
),
# 设置三维图的观察方位
screen = list(z = 30, x = -65, y = 0)
)
接下来,根据此数据框 data.frame 类型构造空间数据类型 sp (对应 Spatial 类),以便后续调用空间数据分析方法。
library(sp)
coordinates(dat6) <- ~ x + y
summary(dat6)
# Object of class SpatialPointsDataFrame
# Coordinates:
# min max
# x 0 28654
# y 0 18449
# Is projected: NA
# proj4string : [NA]
# Number of points: 319
# Data attributes:
# 编号 功能区 海拔 功能区名称 As
# Min. : 1.0 Min. :1.00 Min. : 0.0 Length:319 Min. :-2.211
# 1st Qu.: 80.5 1st Qu.:2.50 1st Qu.: 14.0 Class :character 1st Qu.: 0.189
# Median :160.0 Median :4.00 Median : 29.0 Mode :character Median : 1.900
# Mean :160.0 Mean :3.26 Mean : 42.5 Mean : 2.307
# 3rd Qu.:239.5 3rd Qu.:4.00 3rd Qu.: 55.5 3rd Qu.: 3.522
# Max. :319.0 Max. :5.00 Max. :308.0 Max. :29.478
# Cd Cr Cu Hg Ni
# Min. :-3.00 Min. :-1.74 Min. : -3.0 Min. : -3.3 Min. :-2.11
# 1st Qu.: 0.62 1st Qu.: 0.31 1st Qu.: 1.6 1st Qu.: -0.6 1st Qu.: 0.05
# Median : 3.62 Median : 1.22 Median : 4.1 Median : 1.9 Median : 0.97
# Mean : 5.75 Mean : 2.50 Mean : 11.6 Mean : 33.1 Mean : 1.31
# 3rd Qu.: 7.97 3rd Qu.: 2.61 3rd Qu.: 11.1 3rd Qu.: 9.7 3rd Qu.: 1.97
# Max. :49.66 Max. :98.87 Max. :698.7 Max. :1995.6 Max. :34.26
# Pb Zn color
# Min. :-1.89 Min. : -2.58 Length:319
# 1st Qu.: 0.45 1st Qu.: 0.46 Class :character
# Median : 2.47 Median : 2.67 Mode :character
# Mean : 5.12 Mean : 9.44
# 3rd Qu.: 6.68 3rd Qu.: 9.11
# Max. :73.58 Max. :263.70
spplot(dat6,
zcol = c("海拔"),
main = "",
as.table = TRUE, # 面板自左上开始
scales = list(
draw = TRUE, # 坐标轴刻度
# 去掉图形上边、右边多余的刻度线
x = list(alternating = 1, tck = c(1, 0)),
y = list(alternating = 1, tck = c(1, 0))
),
colorkey = TRUE,
alpha = 0.7,
key.space = "right"
)
由图2不难看出,采样点的位置是以图中左下角的点为参照,城市整体上的地势是西南低东北高,城市中间间或有两座小山。
As <- bubble(dat6, zcol = c("As"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Cd <- bubble(dat6, zcol = c("Cd"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Cr <- bubble(dat6, zcol = c("Cr"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Cu <- bubble(dat6, zcol = c("Cu"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Hg <- bubble(dat6, zcol = c("Hg"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Ni <- bubble(dat6, zcol = c("Ni"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Pb <- bubble(dat6, zcol = c("Pb"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
Zn <- bubble(dat6, zcol = c("Zn"), col = c("#4dac26", "#d01c8b"), fill = F, key.space = "bottom")
# 4 列 2 行,图按列
print(As, split = c(1, 1, 4, 2), more = TRUE)
print(Cd, split = c(1, 2, 4, 2), more = TRUE)
print(Cr, split = c(2, 1, 4, 2), more = TRUE)
print(Cu, split = c(2, 2, 4, 2), more = TRUE)
print(Hg, split = c(3, 1, 4, 2), more = TRUE)
print(Ni, split = c(3, 2, 4, 2), more = TRUE)
print(Pb, split = c(4, 1, 4, 2), more = TRUE)
print(Zn, split = c(4, 2, 4, 2), more = FALSE)
图中,绿色气泡表示重金属浓度低于背景值,红色气泡反之,气泡大小对应不同区间的浓度值,根据浓度值数据的五个分位点划分区间,以砷 As 为例,浓度的分位点如下:
quantile(dat6$As)
# 0% 25% 50% 75% 100%
# -2.2111 0.1889 1.9000 3.5222 29.4778
现在数据准备好了,通过上述探索,也有了基本的了解,接下来的问题是如何找到城市的重金属污染源?