这是为 Pandas 新用户准备的简短的入门介绍,我补充了对应的 R 语言实现。对 R 语言用户来说是熟悉 Python 语言的机会,对 Python 用户来说是了解 R 语言的机会 – 左手用 R 右手用 Python 。
注:Python 中涉及索引位置是从 0 开始计数的,而 R 是从 1 开始计数的。这显得 R 更加贴近人,Python 更加贴近计算机。
Python 的类型表示、检查比 R 语言更加丰富、严格,这是它的优势,带来的不便是使用上没有 R 语言那么灵活。
首先导入两个 Python 模块:
import numpy as np
import pandas as pd
R 内置一套数据操作的工具,不需要加载扩展包,也可以导入 dplyr 或 data.table 包。
1 基本的数据结构
- 序列 Series:带标签的一维数组,R 语言中等价的数据结构是向量 vector。
- 数据框 DataFrame:二维数据结构,如二维数组、表格,R 语言中等价的数据结构是数据框 data.frame。
2 创建对象
创建序列
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
## 0 1.0
## 1 3.0
## 2 5.0
## 3 NaN
## 4 6.0
## 5 8.0
## dtype: float64
r <- c(1, 3, 5, NA, 6, 8)
r
## [1] 1 3 5 NA 6 8
创建数据框
dates = pd.date_range("20130101", periods=6)
dates
## DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
## '2013-01-05', '2013-01-06'],
## dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
## A B C D
## 2013-01-01 -0.542846 -1.551642 1.134287 0.495023
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649
## 2013-01-04 -0.071331 -0.682826 0.627705 -0.052117
## 2013-01-05 -2.247319 1.595135 0.517725 0.813387
## 2013-01-06 -1.026931 0.205139 -1.057070 -0.457724
set.seed(20252025)
dates <- seq(from = as.Date("2013-01-01"), length.out = 6)
dates
## [1] "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-05"
## [6] "2013-01-06"
df <- data.frame(
matrix(
data = rnorm(24), nrow = 6, ncol = 4,
dimnames = list(NULL, c("A", "B", "C", "D"))
),
row.names = dates
)
df
## A B C D
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
## 2013-01-02 -1.2680 -1.1996 1.8483 1.67844
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484
## 2013-01-04 -0.6239 2.0038 0.4186 -1.03359
## 2013-01-05 0.8007 0.9745 -0.3079 0.45199
## 2013-01-06 0.4495 1.3587 -0.1422 0.41400
通过指定列名和列值创建数据框,数据框的列有不同的数据类型
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
## A B C D E F
## 0 1.0 2013-01-02 1.0 3 test foo
## 1 1.0 2013-01-02 1.0 3 train foo
## 2 1.0 2013-01-02 1.0 3 test foo
## 3 1.0 2013-01-02 1.0 3 train foo
df2.dtypes
## A float64
## B datetime64[s]
## C float32
## D int32
## E category
## F object
## dtype: object
df2 <- data.frame(
A = 1,
B = as.Date("2013-01-02"),
C = rep(1, 4),
D = rep(3L, 4),
E = c("test", "train", "test", "train"),
F = "foo"
)
df2
## A B C D E F
## 1 1 2013-01-02 1 3 test foo
## 2 1 2013-01-02 1 3 train foo
## 3 1 2013-01-02 1 3 test foo
## 4 1 2013-01-02 1 3 train foo
str(df2)
## 'data.frame': 4 obs. of 6 variables:
## $ A: num 1 1 1 1
## $ B: Date, format: "2013-01-02" "2013-01-02" ...
## $ C: num 1 1 1 1
## $ D: int 3 3 3 3
## $ E: chr "test" "train" "test" "train"
## $ F: chr "foo" "foo" "foo" "foo"
3 查看数据
查看数据框的前几行、后三行、行名(索引)和列名
df.head()
## A B C D
## 2013-01-01 -0.542846 -1.551642 1.134287 0.495023
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649
## 2013-01-04 -0.071331 -0.682826 0.627705 -0.052117
## 2013-01-05 -2.247319 1.595135 0.517725 0.813387
df.tail(3)
## A B C D
## 2013-01-04 -0.071331 -0.682826 0.627705 -0.052117
## 2013-01-05 -2.247319 1.595135 0.517725 0.813387
## 2013-01-06 -1.026931 0.205139 -1.057070 -0.457724
df.index
## DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
## '2013-01-05', '2013-01-06'],
## dtype='datetime64[ns]', freq='D')
df.columns
## Index(['A', 'B', 'C', 'D'], dtype='object')
head(df)
## A B C D
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
## 2013-01-02 -1.2680 -1.1996 1.8483 1.67844
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484
## 2013-01-04 -0.6239 2.0038 0.4186 -1.03359
## 2013-01-05 0.8007 0.9745 -0.3079 0.45199
## 2013-01-06 0.4495 1.3587 -0.1422 0.41400
tail(df, 3)
## A B C D
## 2013-01-04 -0.6239 2.0038 0.4186 -1.034
## 2013-01-05 0.8007 0.9745 -0.3079 0.452
## 2013-01-06 0.4495 1.3587 -0.1422 0.414
rownames(df)
## [1] "2013-01-01" "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-05"
## [6] "2013-01-06"
colnames(df)
## [1] "A" "B" "C" "D"
转化为数组,这需要原数据框中各列的元素类型都一致,否则,类型转化就可能出乎你的预料。
df.to_numpy()
## array([[-0.5428463 , -1.55164217, 1.13428719, 0.49502319],
## [ 2.42645044, 0.06697779, -0.64322811, 1.3949035 ],
## [ 0.5921552 , 0.10811436, 0.27123201, -0.86064924],
## [-0.07133073, -0.68282644, 0.62770526, -0.05211696],
## [-2.24731859, 1.59513481, 0.51772517, 0.81338676],
## [-1.02693104, 0.20513924, -1.05707 , -0.45772433]])
as.matrix(df)
## A B C D
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
## 2013-01-02 -1.2680 -1.1996 1.8483 1.67844
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484
## 2013-01-04 -0.6239 2.0038 0.4186 -1.03359
## 2013-01-05 0.8007 0.9745 -0.3079 0.45199
## 2013-01-06 0.4495 1.3587 -0.1422 0.41400
数据汇总描述和转置,转置操作适合对数组、矩阵操作,对数据框最好不要,因为列的类型可能不一样。
df.describe()
## A B C D
## count 6.000000 6.000000 6.000000 6.000000
## mean -0.144970 -0.043184 0.141775 0.222137
## std 1.582214 1.045502 0.828585 0.837768
## min -2.247319 -1.551642 -1.057070 -0.860649
## 25% -0.905910 -0.495375 -0.414613 -0.356322
## 50% -0.307089 0.087546 0.394479 0.221453
## 75% 0.426284 0.180883 0.600210 0.733796
## max 2.426450 1.595135 1.134287 1.394903
df.T
## 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
## A -0.542846 2.426450 0.592155 -0.071331 -2.247319 -1.026931
## B -1.551642 0.066978 0.108114 -0.682826 1.595135 0.205139
## C 1.134287 -0.643228 0.271232 0.627705 0.517725 -1.057070
## D 0.495023 1.394903 -0.860649 -0.052117 0.813387 -0.457724
summary(df)
## A B C D
## Min. :-1.268 Min. :-1.200 Min. :-0.79941 Min. :-1.1248
## 1st Qu.:-0.356 1st Qu.:-0.548 1st Qu.:-0.26648 1st Qu.:-0.7983
## Median : 0.511 Median : 0.398 Median :-0.00434 Median : 0.1608
## Mean : 0.152 Mean : 0.381 Mean : 0.19182 Mean : 0.0489
## 3rd Qu.: 0.744 3rd Qu.: 1.263 3rd Qu.: 0.34733 3rd Qu.: 0.4425
## Max. : 0.978 Max. : 2.004 Max. : 1.84834 Max. : 1.6784
t(as.matrix(df))
## 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
## A 0.97844 -1.268 0.5727 -0.6239 0.8007 0.4495
## B -0.67156 -1.200 -0.1781 2.0038 0.9745 1.3587
## C -0.79941 1.848 0.1335 0.4186 -0.3079 -0.1422
## D -0.09235 1.678 -1.1248 -1.0336 0.4520 0.4140
重新排列各列,按列名重排各列(降序排列)、按某列的值重排各行(升序排列)
df.sort_index(axis=1, ascending=False)
## D C B A
## 2013-01-01 0.495023 1.134287 -1.551642 -0.542846
## 2013-01-02 1.394903 -0.643228 0.066978 2.426450
## 2013-01-03 -0.860649 0.271232 0.108114 0.592155
## 2013-01-04 -0.052117 0.627705 -0.682826 -0.071331
## 2013-01-05 0.813387 0.517725 1.595135 -2.247319
## 2013-01-06 -0.457724 -1.057070 0.205139 -1.026931
df.sort_values(by="B")
## A B C D
## 2013-01-01 -0.542846 -1.551642 1.134287 0.495023
## 2013-01-04 -0.071331 -0.682826 0.627705 -0.052117
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649
## 2013-01-06 -1.026931 0.205139 -1.057070 -0.457724
## 2013-01-05 -2.247319 1.595135 0.517725 0.813387
df[, sort(colnames(df), decreasing = T)]
## D C B A
## 2013-01-01 -0.09235 -0.7994 -0.6716 0.9784
## 2013-01-02 1.67844 1.8483 -1.1996 -1.2680
## 2013-01-03 -1.12484 0.1335 -0.1781 0.5727
## 2013-01-04 -1.03359 0.4186 2.0038 -0.6239
## 2013-01-05 0.45199 -0.3079 0.9745 0.8007
## 2013-01-06 0.41400 -0.1422 1.3587 0.4495
df[order(df$B, decreasing = F),]
## A B C D
## 2013-01-02 -1.2680 -1.1996 1.8483 1.67844
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484
## 2013-01-05 0.8007 0.9745 -0.3079 0.45199
## 2013-01-06 0.4495 1.3587 -0.1422 0.41400
## 2013-01-04 -0.6239 2.0038 0.4186 -1.03359
4 筛选
4.1 提取元素
根据列名、行序、索引提取数据框中的数据
df["A"]
## 2013-01-01 -0.542846
## 2013-01-02 2.426450
## 2013-01-03 0.592155
## 2013-01-04 -0.071331
## 2013-01-05 -2.247319
## 2013-01-06 -1.026931
## Freq: D, Name: A, dtype: float64
df[0:3]
## A B C D
## 2013-01-01 -0.542846 -1.551642 1.134287 0.495023
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649
df["20130102":"20130104"]
## A B C D
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649
## 2013-01-04 -0.071331 -0.682826 0.627705 -0.052117
df[, "A", drop = F]
## A
## 2013-01-01 0.9784
## 2013-01-02 -1.2680
## 2013-01-03 0.5727
## 2013-01-04 -0.6239
## 2013-01-05 0.8007
## 2013-01-06 0.4495
df[1:3, ]
## A B C D
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
## 2013-01-02 -1.2680 -1.1996 1.8483 1.67844
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484
df[rownames(df) %in% c(as.Date("2013-01-02"), as.Date("2013-01-03"), as.Date("2013-01-04")), ]
## A B C D
## 2013-01-02 -1.2680 -1.1996 1.8483 1.678
## 2013-01-03 0.5727 -0.1781 0.1335 -1.125
## 2013-01-04 -0.6239 2.0038 0.4186 -1.034
4.2 按列名筛选
df.loc[dates[0]]
## A -0.542846
## B -1.551642
## C 1.134287
## D 0.495023
## Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:, ["A", "B"]]
## A B
## 2013-01-01 -0.542846 -1.551642
## 2013-01-02 2.426450 0.066978
## 2013-01-03 0.592155 0.108114
## 2013-01-04 -0.071331 -0.682826
## 2013-01-05 -2.247319 1.595135
## 2013-01-06 -1.026931 0.205139
df.loc["20130102":"20130104", ["A", "B"]]
## A B
## 2013-01-02 2.426450 0.066978
## 2013-01-03 0.592155 0.108114
## 2013-01-04 -0.071331 -0.682826
df.loc[dates[0], "A"]
## np.float64(-0.5428462981141139)
df.at[dates[0], "A"]
## np.float64(-0.5428462981141139)
df[1, ]
## A B C D
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
df[, c("A", "B")]
## A B
## 2013-01-01 0.9784 -0.6716
## 2013-01-02 -1.2680 -1.1996
## 2013-01-03 0.5727 -0.1781
## 2013-01-04 -0.6239 2.0038
## 2013-01-05 0.8007 0.9745
## 2013-01-06 0.4495 1.3587
df[rownames(df) %in% seq(as.Date("2013-01-02"), as.Date("2013-01-04")), c("A", "B")]
## A B
## 2013-01-02 -1.2680 -1.1996
## 2013-01-03 0.5727 -0.1781
## 2013-01-04 -0.6239 2.0038
df[1, "A"]
## [1] 0.9784
df[1, "A"]
## [1] 0.9784
4.3 按位置筛选
df.iloc[3]
## A -0.071331
## B -0.682826
## C 0.627705
## D -0.052117
## Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5, 0:2]
## A B
## 2013-01-04 -0.071331 -0.682826
## 2013-01-05 -2.247319 1.595135
df.iloc[[1, 2, 4], [0, 2]]
## A C
## 2013-01-02 2.426450 -0.643228
## 2013-01-03 0.592155 0.271232
## 2013-01-05 -2.247319 0.517725
df.iloc[:, 1:3]
## B C
## 2013-01-01 -1.551642 1.134287
## 2013-01-02 0.066978 -0.643228
## 2013-01-03 0.108114 0.271232
## 2013-01-04 -0.682826 0.627705
## 2013-01-05 1.595135 0.517725
## 2013-01-06 0.205139 -1.057070
df.iloc[1, 1]
## np.float64(0.06697778509268705)
df.iat[1, 1]
## np.float64(0.06697778509268705)
df[4, ]
## A B C D
## 2013-01-04 -0.6239 2.004 0.4186 -1.034
df[4:5, 1:2]
## A B
## 2013-01-04 -0.6239 2.0038
## 2013-01-05 0.8007 0.9745
df[c(2, 3, 5), c(1, 3)]
## A C
## 2013-01-02 -1.2680 1.8483
## 2013-01-03 0.5727 0.1335
## 2013-01-05 0.8007 -0.3079
df[, c(2, 3)]
## B C
## 2013-01-01 -0.6716 -0.7994
## 2013-01-02 -1.1996 1.8483
## 2013-01-03 -0.1781 0.1335
## 2013-01-04 2.0038 0.4186
## 2013-01-05 0.9745 -0.3079
## 2013-01-06 1.3587 -0.1422
df[1, 1]
## [1] 0.9784
df[1, 1]
## [1] 0.9784
4.4 按布尔索引筛选
df[df["A"] > 0]
## A B C D
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649
df[df > 0]
## A B C D
## 2013-01-01 NaN NaN 1.134287 0.495023
## 2013-01-02 2.426450 0.066978 NaN 1.394903
## 2013-01-03 0.592155 0.108114 0.271232 NaN
## 2013-01-04 NaN NaN 0.627705 NaN
## 2013-01-05 NaN 1.595135 0.517725 0.813387
## 2013-01-06 NaN 0.205139 NaN NaN
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
## A B C D E
## 2013-01-01 -0.542846 -1.551642 1.134287 0.495023 one
## 2013-01-02 2.426450 0.066978 -0.643228 1.394903 one
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649 two
## 2013-01-04 -0.071331 -0.682826 0.627705 -0.052117 three
## 2013-01-05 -2.247319 1.595135 0.517725 0.813387 four
## 2013-01-06 -1.026931 0.205139 -1.057070 -0.457724 three
df2[df2["E"].isin(["two", "four"])]
## A B C D E
## 2013-01-03 0.592155 0.108114 0.271232 -0.860649 two
## 2013-01-05 -2.247319 1.595135 0.517725 0.813387 four
df[df$A > 0,]
## A B C D
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484
## 2013-01-05 0.8007 0.9745 -0.3079 0.45199
## 2013-01-06 0.4495 1.3587 -0.1422 0.41400
df[df>0]
## [1] 0.9784 0.5727 0.8007 0.4495 2.0038 0.9745 1.3587 1.8483 0.1335 0.4186
## [11] 1.6784 0.4520 0.4140
df2 <- df
df2$E = c("one", "one", "two", "three", "four", "three")
df2
## A B C D E
## 2013-01-01 0.9784 -0.6716 -0.7994 -0.09235 one
## 2013-01-02 -1.2680 -1.1996 1.8483 1.67844 one
## 2013-01-03 0.5727 -0.1781 0.1335 -1.12484 two
## 2013-01-04 -0.6239 2.0038 0.4186 -1.03359 three
## 2013-01-05 0.8007 0.9745 -0.3079 0.45199 four
## 2013-01-06 0.4495 1.3587 -0.1422 0.41400 three
df2[df2$E %in% c("two", "four"),]
## A B C D E
## 2013-01-03 0.5727 -0.1781 0.1335 -1.125 two
## 2013-01-05 0.8007 0.9745 -0.3079 0.452 four
4.5 赋值
在数据框中添加新列时,通过索引自动对齐数据。而在 R 语言中不借助索引,如果要实现与 Python 一样的效果,则需通过 Join (关联)的方式。
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1
## 2013-01-02 1
## 2013-01-03 2
## 2013-01-04 3
## 2013-01-05 4
## 2013-01-06 5
## 2013-01-07 6
## Freq: D, dtype: int64
# df 添加新列 F
df["F"] = s1
# 在数据框中对给定元素赋值
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
df
## A B C D F
## 2013-01-01 0.000000 0.000000 1.134287 5.0 NaN
## 2013-01-02 2.426450 0.066978 -0.643228 5.0 1.0
## 2013-01-03 0.592155 0.108114 0.271232 5.0 2.0
## 2013-01-04 -0.071331 -0.682826 0.627705 5.0 3.0
## 2013-01-05 -2.247319 1.595135 0.517725 5.0 4.0
## 2013-01-06 -1.026931 0.205139 -1.057070 5.0 5.0
df2 = df.copy()
df2[df2 > 0] = -df2
df2
## A B C D F
## 2013-01-01 0.000000 0.000000 -1.134287 -5.0 NaN
## 2013-01-02 -2.426450 -0.066978 -0.643228 -5.0 -1.0
## 2013-01-03 -0.592155 -0.108114 -0.271232 -5.0 -2.0
## 2013-01-04 -0.071331 -0.682826 -0.627705 -5.0 -3.0
## 2013-01-05 -2.247319 -1.595135 -0.517725 -5.0 -4.0
## 2013-01-06 -1.026931 -0.205139 -1.057070 -5.0 -5.0
df[1, "A"] = 0
df[1, 2] = 0
df[, "D"] = rep(5, nrow(df))
df2 = df
df2[df2 > 0]
## [1] 0.5727 0.8007 0.4495 2.0038 0.9745 1.3587 1.8483 0.1335 0.4186 5.0000
## [11] 5.0000 5.0000 5.0000 5.0000 5.0000
df2
## A B C D
## 2013-01-01 0.0000 0.0000 -0.7994 5
## 2013-01-02 -1.2680 -1.1996 1.8483 5
## 2013-01-03 0.5727 -0.1781 0.1335 5
## 2013-01-04 -0.6239 2.0038 0.4186 5
## 2013-01-05 0.8007 0.9745 -0.3079 5
## 2013-01-06 0.4495 1.3587 -0.1422 5
5 缺失数据
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
df1
## A B C D F E
## 2013-01-01 0.000000 0.000000 1.134287 5.0 NaN 1.0
## 2013-01-02 2.426450 0.066978 -0.643228 5.0 1.0 1.0
## 2013-01-03 0.592155 0.108114 0.271232 5.0 2.0 NaN
## 2013-01-04 -0.071331 -0.682826 0.627705 5.0 3.0 NaN
# 去掉含缺失值的行
df1.dropna(how="any")
## A B C D F E
## 2013-01-02 2.42645 0.066978 -0.643228 5.0 1.0 1.0
# 将缺失值填充为 5
df1.fillna(value=5)
## A B C D F E
## 2013-01-01 0.000000 0.000000 1.134287 5.0 5.0 1.0
## 2013-01-02 2.426450 0.066978 -0.643228 5.0 1.0 1.0
## 2013-01-03 0.592155 0.108114 0.271232 5.0 2.0 5.0
## 2013-01-04 -0.071331 -0.682826 0.627705 5.0 3.0 5.0
# 是否为缺失值
pd.isna(df1)
## A B C D F E
## 2013-01-01 False False False False True False
## 2013-01-02 False False False False False False
## 2013-01-03 False False False False False True
## 2013-01-04 False False False False False True
airquality 是 R 内置的数据集,含有缺失值。
df1 = head(airquality)
df1
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
# 去掉含缺失值的行
df1[complete.cases(df1), ] |> head()
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
# 填充缺失值
fillna <- function(x) {x[is.na(x)] = 5; x}
apply(df1, 2, fillna) |> head()
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 5 5 14.3 56 5 5
## 6 28 5 14.9 66 5 6
# 逐列检查是否含有缺失值
apply(df1, 2, anyNA) |> head()
## Ozone Solar.R Wind Temp Month Day
## TRUE TRUE FALSE FALSE FALSE FALSE
# 逐列检查是否是缺失值
apply(df1, 2, is.na) |> head()
## Ozone Solar.R Wind Temp Month Day
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE
## 5 TRUE TRUE FALSE FALSE FALSE FALSE
## 6 FALSE TRUE FALSE FALSE FALSE FALSE
# 整行数据是否完整无缺失
complete.cases(df1) |> head()
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
6 算子
6.1 统计
# 求列平均值
df.mean()
## A -0.054496
## B 0.215423
## C 0.141775
## D 5.000000
## F 3.000000
## dtype: float64
# 求行平均值
df.mean(axis=1)
## 2013-01-01 1.533572
## 2013-01-02 1.570040
## 2013-01-03 1.594300
## 2013-01-04 1.574710
## 2013-01-05 1.773108
## 2013-01-06 1.624228
## Freq: D, dtype: float64
colMeans(df)
## A B C D
## -0.0115 0.4932 0.1918 5.0000
rowMeans(df)
## 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
## 1.050 1.095 1.382 1.700 1.617 1.666
7 合并
7.1 拼接
df = pd.DataFrame(np.random.randn(10, 4))
df
## 0 1 2 3
## 0 1.230407 -1.268378 -0.789585 -0.387475
## 1 -1.001668 -0.573205 -0.330400 -0.534907
## 2 1.456553 1.464092 2.234690 -1.725228
## 3 -1.231932 1.162669 -1.265748 -0.329205
## 4 -0.750532 -0.347237 0.862690 -1.828584
## 5 -0.803690 -1.471544 -0.491834 -0.576663
## 6 0.692912 0.951068 -1.502720 2.565378
## 7 -0.962528 -0.488740 -0.203435 -0.711255
## 8 0.343757 -0.225558 -0.023401 -2.106459
## 9 0.174665 1.574763 -0.619298 0.761226
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
## 0 1 2 3
## 0 1.230407 -1.268378 -0.789585 -0.387475
## 1 -1.001668 -0.573205 -0.330400 -0.534907
## 2 1.456553 1.464092 2.234690 -1.725228
## 3 -1.231932 1.162669 -1.265748 -0.329205
## 4 -0.750532 -0.347237 0.862690 -1.828584
## 5 -0.803690 -1.471544 -0.491834 -0.576663
## 6 0.692912 0.951068 -1.502720 2.565378
## 7 -0.962528 -0.488740 -0.203435 -0.711255
## 8 0.343757 -0.225558 -0.023401 -2.106459
## 9 0.174665 1.574763 -0.619298 0.761226
df <- data.frame(
matrix(data = rnorm(40), nrow = 10, ncol = 4)
)
df
## X1 X2 X3 X4
## 1 0.01941 -0.1804 0.2889 1.3787
## 2 -0.05706 1.0076 0.2686 -0.1908
## 3 0.07462 0.1104 -2.3951 -0.7010
## 4 1.57430 -0.4883 0.6992 0.1575
## 5 1.04954 1.0060 0.4420 0.5075
## 6 0.79425 1.3962 -1.3361 0.7282
## 7 0.94989 2.0674 0.2382 -1.2264
## 8 -0.83755 -0.2801 0.2264 -0.1165
## 9 -0.47313 1.6676 0.3761 -2.6264
## 10 -0.43360 -0.2053 0.0847 -1.1447
rbind(df[1:3, ], df[3:7, ], df[7:10, ])
## X1 X2 X3 X4
## 1 0.01941 -0.1804 0.2889 1.3787
## 2 -0.05706 1.0076 0.2686 -0.1908
## 3 0.07462 0.1104 -2.3951 -0.7010
## 31 0.07462 0.1104 -2.3951 -0.7010
## 4 1.57430 -0.4883 0.6992 0.1575
## 5 1.04954 1.0060 0.4420 0.5075
## 6 0.79425 1.3962 -1.3361 0.7282
## 7 0.94989 2.0674 0.2382 -1.2264
## 71 0.94989 2.0674 0.2382 -1.2264
## 8 -0.83755 -0.2801 0.2264 -0.1165
## 9 -0.47313 1.6676 0.3761 -2.6264
## 10 -0.43360 -0.2053 0.0847 -1.1447
7.2 关联
按某列合并两张表
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
left
## key lval
## 0 foo 1
## 1 foo 2
right
## key rval
## 0 foo 4
## 1 foo 5
pd.merge(left, right, on="key")
## key lval rval
## 0 foo 1 4
## 1 foo 1 5
## 2 foo 2 4
## 3 foo 2 5
left <- data.frame(
key = c("foo", "foo"),
lval = c(1, 2)
)
right <- data.frame(
key = c("foo", "foo"),
rval = c(4, 5)
)
merge(x = left, y = right, by = "key")
## key lval rval
## 1 foo 1 4
## 2 foo 1 5
## 3 foo 2 4
## 4 foo 2 5
按主键(唯一性)合并两张表
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
left
## key lval
## 0 foo 1
## 1 bar 2
right
## key rval
## 0 foo 4
## 1 bar 5
pd.merge(left, right, on="key")
## key lval rval
## 0 foo 1 4
## 1 bar 2 5
left <- data.frame(
key = c("foo", "bar"),
lval = c(1, 2)
)
right <- data.frame(
key = c("foo", "bar"),
rval = c(4, 5)
)
merge(x = left, y = right, by = "key")
## key lval rval
## 1 bar 2 5
## 2 foo 1 4
8 聚合
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
## A B C D
## 0 foo one 0.275127 1.290274
## 1 bar one 0.658550 1.379370
## 2 foo two -1.068897 -1.259664
## 3 bar three 0.915246 1.208480
## 4 foo two 1.587377 -0.425338
## 5 bar two -0.665659 -0.814551
## 6 foo one 1.385820 -0.631904
## 7 foo three -0.026600 0.614469
df.groupby("A")[["C", "D"]].sum()
## C D
## A
## bar 0.908137 1.773299
## foo 2.152827 -0.412162
df.groupby(["A", "B"]).sum()
## C D
## A B
## bar one 0.658550 1.379370
## three 0.915246 1.208480
## two -0.665659 -0.814551
## foo one 1.660947 0.658370
## three -0.026600 0.614469
## two 0.518479 -1.685001
df <- data.frame(
A = c("foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"),
B = c("one", "one", "two", "three", "two", "two", "one", "three"),
C = rnorm(8),
D = rnorm(8)
)
df
## A B C D
## 1 foo one -0.258537 1.3789
## 2 bar one -1.517600 0.8628
## 3 foo two -1.026278 0.6548
## 4 bar three 0.002687 -1.7290
## 5 foo two -0.049517 -0.2230
## 6 bar two -0.652734 -1.1685
## 7 foo one 1.534983 2.5455
## 8 foo three 0.217812 -0.7702
aggregate(df, cbind(C, D) ~ A, sum)
## A C D
## 1 bar -2.1676 -2.035
## 2 foo 0.4185 3.586
aggregate(df, . ~ A + B, sum)
## A B C D
## 1 bar one -1.517600 0.8628
## 2 foo one 1.276446 3.9244
## 3 bar three 0.002687 -1.7290
## 4 foo three 0.217812 -0.7702
## 5 bar two -0.652734 -1.1685
## 6 foo two -1.075795 0.4319
9 重塑
9.1 stack 宽格式变长格式
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
## A B
## first second
## bar one -1.330748 1.464349
## two -0.357819 1.886293
## baz one 0.078505 1.613801
## two -0.473406 1.385060
stacked = df2.stack(future_stack=True)
stacked
## first second
## bar one A -1.330748
## B 1.464349
## two A -0.357819
## B 1.886293
## baz one A 0.078505
## B 1.613801
## two A -0.473406
## B 1.385060
## dtype: float64
9.2 pivot_table 长格式变宽格式
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
df
## A B C D E
## 0 one A foo -2.119611 -0.290515
## 1 one B foo 1.277883 0.201935
## 2 two C foo -0.364860 -0.780446
## 3 three A bar -0.667161 0.471151
## 4 one B bar -0.045493 -1.684878
## 5 one C bar 1.060607 0.053233
## 6 two A foo -0.660434 0.060553
## 7 three B foo -0.033984 -0.793361
## 8 one C foo 0.533071 0.735222
## 9 one A bar -0.442919 -1.810009
## 10 two B bar 1.220791 -0.705138
## 11 three C bar -0.442783 -1.004603
pd.pivot_table(df, values=["D", "E"], index=["A", "B"], columns=["C"])
## D E
## C bar foo bar foo
## A B
## one A -0.442919 -2.119611 -1.810009 -0.290515
## B -0.045493 1.277883 -1.684878 0.201935
## C 1.060607 0.533071 0.053233 0.735222
## three A -0.667161 NaN 0.471151 NaN
## B NaN -0.033984 NaN -0.793361
## C -0.442783 NaN -1.004603 NaN
## two A NaN -0.660434 NaN 0.060553
## B 1.220791 NaN -0.705138 NaN
## C NaN -0.364860 NaN -0.780446
df <- data.frame(
A = rep(c("one", "one", "two", "three"), 3),
B = rep(c("A", "B", "C"), 4),
C = rep(c("foo", "foo", "foo", "bar", "bar", "bar"), 2),
D = rnorm(12),
E = rnorm(12)
)
df
## A B C D E
## 1 one A foo -1.88805 0.19452
## 2 one B foo -0.60136 0.59975
## 3 two C foo -0.01720 0.48475
## 4 three A bar -1.40703 -0.63300
## 5 one B bar -0.33923 -1.79990
## 6 one C bar 0.23079 0.54017
## 7 two A foo 0.61020 -1.72101
## 8 three B foo -2.26313 -1.03960
## 9 one C foo -0.08435 0.66869
## 10 one A bar 0.74246 -0.86170
## 11 two B bar -0.08440 -0.09692
## 12 three C bar -0.85306 0.64536
# 长变宽格式
reshape(data = df, direction = "wide",
idvar = c("A", "B"), timevar = "C", v.names = c("D", "E"))
## A B D.foo E.foo D.bar E.bar
## 1 one A -1.88805 0.1945 0.7425 -0.86170
## 2 one B -0.60136 0.5997 -0.3392 -1.79990
## 3 two C -0.01720 0.4848 NA NA
## 4 three A NA NA -1.4070 -0.63300
## 6 one C -0.08435 0.6687 0.2308 0.54017
## 7 two A 0.61020 -1.7210 NA NA
## 8 three B -2.26313 -1.0396 NA NA
## 11 two B NA NA -0.0844 -0.09692
## 12 three C NA NA -0.8531 0.64536