SciPy 做科学计算

1 线性代数
2 积分
3 优化
4 稀疏数组
5 统计

SciPy 是基于 NumPy 做科学计算的 Python 模块，它的官网没有类似 NumPy 和 Pandas 的快速入门文档，但有一个按功能分类的用户指南。本文介绍 SciPy 提供的基础算法，涵盖优化、积分、插值、特征值问题、代数方程、微分方程、统计学等功能。

下面蜻蜓点水式地介绍一点 SciPy 做科学计算及等价 R 语言实现。

1 线性代数

在线性代数运算方面，scipy.linalg 包含 numpy.linalg 的所有函数，且功能比 numpy.linalg 更多。scipy.linalg 总是添加了 BLAS/LAPACK 支持，而这在 numpy.linalg 是可选的。

推荐使用 NumPy 提供的多维数组类型。

import numpy as np
from scipy import linalg
A = np.array([[1,2],[3,4]])
A

## array([[1, 2],
##        [3, 4]])

linalg.inv(A)

## array([[-2. ,  1. ],
##        [ 1.5, -0.5]])

b = np.array([[5,6]]) #2D array
b

## array([[5, 6]])

b.T

## array([[5],
##        [6]])

A*b #not matrix multiplication!

## array([[ 5, 12],
##        [15, 24]])

A.dot(b.T) #matrix multiplication

## array([[17],
##        [39]])

b = np.array([5,6]) #1D array
b

## array([5, 6])

b.T  #not matrix transpose!

## array([5, 6])

A.dot(b)  #does not matter for multiplication

## array([17, 39])

A = array(data = c(1, 3, 2, 4), dim = c(2, 2))
A

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

solve(A)

##      [,1] [,2]
## [1,] -2.0  1.0
## [2,]  1.5 -0.5

b = array(data = c(5, 6), dim = c(1, 2))
b

##      [,1] [,2]
## [1,]    5    6

t(b)

##      [,1]
## [1,]    5
## [2,]    6

# A * b 不可以
A %*% t(b) # 矩阵乘法

##      [,1]
## [1,]   17
## [2,]   39

2 积分

这是一个关于贝塞尔函数的积分，贝塞尔函数是内置的特殊函数。

import scipy.integrate as integrate
import scipy.special as special
result = integrate.quad(lambda x: special.jv(2.5,x), 0, 4.5)
result

## (1.1178179380783249, 7.866317216380692e-09)

from numpy import sqrt, sin, cos, pi
I = sqrt(2/pi)*(18.0/27*sqrt(2)*cos(4.5) - 4.0/27*sqrt(2)*sin(4.5) +
                sqrt(2*pi) * special.fresnel(3/sqrt(pi))[0])
I

## np.float64(1.117817938088701)

print(abs(result[0]-I))

## 1.0376144388146713e-11

integrate(f = besselJ, nu = 2.5, lower = 0, upper = 4.5)

## 1.118 with absolute error < 1.4e-07

3 优化

SciPy 的优化功能也不少，局部优化和全局优化问题，约束优化和无约束优化问题等都有涉及。下面是一个关于 5 维的香蕉函数的求极小值的问题。

笔者曾在书《R 语言数据分析实战》专门整理了 R 语言做数值优化的问题，很全面，没啥要补充的，不在此一一对照。

import numpy as np
from scipy.optimize import minimize

def rosen(x):
    """The Rosenbrock function"""
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='L-BFGS-B')
               
print(res.x)

## [0.99999957 0.99999915 0.99999834 0.99999667 0.99999336]

fn <- function(x) {
  n <- length(x)
  sum(100*(x[-1] - x[-n]^2  )^2 +  (1 - x[-n])^2 )
}
optim(
  par = c(1.3, 0.7, 0.8, 1.9, 1.2), fn = fn, 
  method = "L-BFGS-B"
)

## $par
## [1] 1.0000 0.9999 0.9999 0.9997 0.9994
## 
## $value
## [1] 1.007e-07
## 
## $counts
## function gradient 
##       27       27 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

4 稀疏数组

稀疏数组应用很广泛，特别是在图关系数据（邻接矩阵）和文本数据分析（词文档矩阵）领域。下面是一个二维稀疏数组（即矩阵）的构造过程。

import scipy as sp
import numpy as np
dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
sparse = sp.sparse.coo_array(dense)
dense

## array([[1, 0, 0, 2],
##        [0, 4, 1, 0],
##        [0, 0, 5, 0]])

sparse

## <COOrdinate sparse array of dtype 'int64'
## 	with 5 stored elements and shape (3, 4)>

sparse.max()

## np.int64(5)

dense.max()

## np.int64(5)

sparse.argmax()

## 10

dense.argmax()

## np.int64(10)

sparse.mean()

## np.float64(1.0833333333333333)

dense.mean()

## np.float64(1.0833333333333333)

sparse.nnz

## 5

sparse.mean(axis=1)

## array([0.75, 1.25, 1.25])

在 R 语言中，Matrix 包主要用于稀疏矩阵操作。

library(Matrix)
dense = matrix(data = c(1, 0, 0, 2,
                        0, 4, 1, 0,
                        0, 0, 5, 0),
               byrow = T, nrow = 3, ncol = 4)
sparse = Matrix(dense, sparse=TRUE) 
dense

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    2
## [2,]    0    4    1    0
## [3,]    0    0    5    0

sparse

## 3 x 4 sparse Matrix of class "dgCMatrix"
##             
## [1,] 1 . . 2
## [2,] . 4 1 .
## [3,] . . 5 .

max(dense)

## [1] 5

max(sparse)

## [1] 5

which.max(dense)

## [1] 9

# which.max(sparse)
mean(dense)

## [1] 1.083

mean(sparse)

## [1] 1.083

rowMeans(sparse)

## [1] 0.75 1.25 1.25

5 统计

独立两样本方差相等的均值检验，即 t 检验，而在方差不等的情况下，是 Welch’s t-test 。

import scipy.stats as stats
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=8, scale=10, size=500)
stats.ttest_ind(rvs1, rvs2, equal_var=True)

## TtestResult(statistic=np.float64(-4.677832576759619), pvalue=np.float64(3.298043097151713e-06), df=np.float64(998.0))

函数 ttest_ind 的参数说明见文档。

rvs1 = rnorm(n = 500, mean = 5, sd = 10)
rvs2 = rnorm(n = 500, mean = 8, sd = 10)
t.test(rvs1, rvs2, var.equal = TRUE)

## 
## 	Two Sample t-test
## 
## data:  rvs1 and rvs2
## t = -4.5, df = 998, p-value = 7e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.102 -1.621
## sample estimates:
## mean of x mean of y 
##     5.625     8.487

在 Python 和 R 语言中，都是随机生成的两组样本，样本并不相同，检验统计量的值、P 值等自然也不同。

黄湘云

1 线性代数

2 积分

3 优化

4 稀疏数组

5 统计