統計検定の
python で
分散分析に ついて
史上最強 図解
には、
* バラツキの
* 統計資料の
* 統計資料の
分散分析の できる ライブラリ
調べた
spipy
scipy.stats.f_oneway(*args)
で、一元配置分散分析が 可能です。 statsmodels
statsmodels.api.stats.anova_lm()
で、一元配置分散分析、 二限配置分散分析が 可能です。 pyvttbl
python3.x では動作しなそうですが、 pyvttbl と いう ライブラリが あり、 この ライブラリでも 分散分析が 可能です。
参考
以下、
* pythonで
* Four ways to conduct one-way ANOVAs with Python - Erik Marsja
* https://www.marsja.se/three-ways-to-carry-out-2-way-anova-with-python/
* 私の
* Anova in Python | plotly
実際に スクリプトを 書いて 試す
spipy、
spipy
spipy で
使用する
import pandas as pd from scipy import stats url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv" data = pd.read_csv(url) data
Unnamed: 0 | weight | group | |
---|---|---|---|
0 | 1 | 4.17 | ctrl |
1 | 2 | 5.58 | ctrl |
2 | 3 | 5.18 | ctrl |
3 | 4 | 6.11 | ctrl |
4 | 5 | 4.50 | ctrl |
5 | 6 | 4.61 | ctrl |
6 | 7 | 5.17 | ctrl |
7 | 8 | 4.53 | ctrl |
8 | 9 | 5.33 | ctrl |
9 | 10 | 5.14 | ctrl |
10 | 11 | 4.81 | trt1 |
11 | 12 | 4.17 | trt1 |
12 | 13 | 4.41 | trt1 |
13 | 14 | 3.59 | trt1 |
14 | 15 | 5.87 | trt1 |
15 | 16 | 3.83 | trt1 |
16 | 17 | 6.03 | trt1 |
17 | 18 | 4.89 | trt1 |
18 | 19 | 4.32 | trt1 |
19 | 20 | 4.69 | trt1 |
20 | 21 | 6.31 | trt2 |
21 | 22 | 5.12 | trt2 |
22 | 23 | 5.54 | trt2 |
23 | 24 | 5.50 | trt2 |
24 | 25 | 5.37 | trt2 |
25 | 26 | 5.29 | trt2 |
26 | 27 | 4.92 | trt2 |
27 | 28 | 6.15 | trt2 |
28 | 29 | 5.80 | trt2 |
29 | 30 | 5.26 | trt2 |
# CSVデータは縦持ちなので、横持ちのデータに変換する ctrl = data['weight'][data.group == 'ctrl'] grps = pd.unique(data.group.values) d_data = {grp:data['weight'][data.group == grp] for grp in grps} d_data
{'ctrl': 0 4.17 1 5.58 2 5.18 3 6.11 4 4.50 5 4.61 6 5.17 7 4.53 8 5.33 9 5.14 Name: weight, dtype: float64, 'trt1': 10 4.81 11 4.17 12 4.41 13 3.59 14 5.87 15 3.83 16 6.03 17 4.89 18 4.32 19 4.69 Name: weight, dtype: float64, 'trt2': 20 6.31 21 5.12 22 5.54 23 5.50 24 5.37 25 5.29 26 4.92 27 6.15 28 5.80 29 5.26 Name: weight, dtype: float64}
# stats.f_onewayを実行して結果を出力する result = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2']) print(result)
F_onewayResult(statistic=4.846087862380136, pvalue=0.015909958325622899)
statistic は、
有意水準5%と
statsmodels
statsmodels で、
一元配置分散分析
import pandas as pd from scipy import stats url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv" data = pd.read_csv(url) data
Unnamed: 0 | weight | group | |
---|---|---|---|
0 | 1 | 4.17 | ctrl |
1 | 2 | 5.58 | ctrl |
2 | 3 | 5.18 | ctrl |
3 | 4 | 6.11 | ctrl |
4 | 5 | 4.50 | ctrl |
5 | 6 | 4.61 | ctrl |
6 | 7 | 5.17 | ctrl |
7 | 8 | 4.53 | ctrl |
8 | 9 | 5.33 | ctrl |
9 | 10 | 5.14 | ctrl |
10 | 11 | 4.81 | trt1 |
11 | 12 | 4.17 | trt1 |
12 | 13 | 4.41 | trt1 |
13 | 14 | 3.59 | trt1 |
14 | 15 | 5.87 | trt1 |
15 | 16 | 3.83 | trt1 |
16 | 17 | 6.03 | trt1 |
17 | 18 | 4.89 | trt1 |
18 | 19 | 4.32 | trt1 |
19 | 20 | 4.69 | trt1 |
20 | 21 | 6.31 | trt2 |
21 | 22 | 5.12 | trt2 |
22 | 23 | 5.54 | trt2 |
23 | 24 | 5.50 | trt2 |
24 | 25 | 5.37 | trt2 |
25 | 26 | 5.29 | trt2 |
26 | 27 | 4.92 | trt2 |
27 | 28 | 6.15 | trt2 |
28 | 29 | 5.80 | trt2 |
29 | 30 | 5.26 | trt2 |
import statsmodels.api as sm from statsmodels.formula.api import ols # 最小二乗法 ols 実行。 正直あまり、ここで何故登場するのか意味がわかっておりません。 mod = ols('weight ~ group', data=data).fit() print(mod.summary())
OLS Regression Results ============================================================================== Dep. Variable: weight R-squared: 0.264 Model: OLS Adj. R-squared: 0.210 Method: Least Squares F-statistic: 4.846 Date: Sun, 08 Jul 2018 Prob (F-statistic): 0.0159 Time: 04:18:39 Log-Likelihood: -26.810 No. Observations: 30 AIC: 59.62 Df Residuals: 27 BIC: 63.82 Df Model: 2 Covariance Type: nonrobust ================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 5.0320 0.197 25.527 0.000 4.628 5.436 group[T.trt1] -0.3710 0.279 -1.331 0.194 -0.943 0.201 group[T.trt2] 0.4940 0.279 1.772 0.088 -0.078 1.066 ============================================================================== Omnibus: 1.835 Durbin-Watson: 2.704 Prob(Omnibus): 0.400 Jarque-Bera (JB): 1.406 Skew: 0.524 Prob(JB): 0.495 Kurtosis: 2.835 Cond. No. 3.73 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# 分散分析 を実行 aov_table = sm.stats.anova_lm(mod, typ=2) # The type of Anova test to perform. See notes. print(aov_table)
sum_sq df F PR(>F) group 3.76634 2.0 4.846088 0.01591 Residual 10.49209 27.0 NaN NaN
skipy と
二元配置分散分析
続いて、
データは、
# 歯の成長データを取得 data = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/ToothGrowth.csv') data
Unnamed: 0 | len | supp | dose | |
---|---|---|---|---|
0 | 1 | 4.2 | VC | 0.5 |
1 | 2 | 11.5 | VC | 0.5 |
2 | 3 | 7.3 | VC | 0.5 |
3 | 4 | 5.8 | VC | 0.5 |
4 | 5 | 6.4 | VC | 0.5 |
5 | 6 | 10.0 | VC | 0.5 |
6 | 7 | 11.2 | VC | 0.5 |
7 | 8 | 11.2 | VC | 0.5 |
8 | 9 | 5.2 | VC | 0.5 |
9 | 10 | 7.0 | VC | 0.5 |
10 | 11 | 16.5 | VC | 1.0 |
11 | 12 | 16.5 | VC | 1.0 |
12 | 13 | 15.2 | VC | 1.0 |
13 | 14 | 17.3 | VC | 1.0 |
14 | 15 | 22.5 | VC | 1.0 |
15 | 16 | 17.3 | VC | 1.0 |
16 | 17 | 13.6 | VC | 1.0 |
17 | 18 | 14.5 | VC | 1.0 |
18 | 19 | 18.8 | VC | 1.0 |
19 | 20 | 15.5 | VC | 1.0 |
20 | 21 | 23.6 | VC | 2.0 |
21 | 22 | 18.5 | VC | 2.0 |
22 | 23 | 33.9 | VC | 2.0 |
23 | 24 | 25.5 | VC | 2.0 |
24 | 25 | 26.4 | VC | 2.0 |
25 | 26 | 32.5 | VC | 2.0 |
26 | 27 | 26.7 | VC | 2.0 |
27 | 28 | 21.5 | VC | 2.0 |
28 | 29 | 23.3 | VC | 2.0 |
29 | 30 | 29.5 | VC | 2.0 |
30 | 31 | 15.2 | OJ | 0.5 |
31 | 32 | 21.5 | OJ | 0.5 |
32 | 33 | 17.6 | OJ | 0.5 |
33 | 34 | 9.7 | OJ | 0.5 |
34 | 35 | 14.5 | OJ | 0.5 |
35 | 36 | 10.0 | OJ | 0.5 |
36 | 37 | 8.2 | OJ | 0.5 |
37 | 38 | 9.4 | OJ | 0.5 |
38 | 39 | 16.5 | OJ | 0.5 |
39 | 40 | 9.7 | OJ | 0.5 |
40 | 41 | 19.7 | OJ | 1.0 |
41 | 42 | 23.3 | OJ | 1.0 |
42 | 43 | 23.6 | OJ | 1.0 |
43 | 44 | 26.4 | OJ | 1.0 |
44 | 45 | 20.0 | OJ | 1.0 |
45 | 46 | 25.2 | OJ | 1.0 |
46 | 47 | 25.8 | OJ | 1.0 |
47 | 48 | 21.2 | OJ | 1.0 |
48 | 49 | 14.5 | OJ | 1.0 |
49 | 50 | 27.3 | OJ | 1.0 |
50 | 51 | 25.5 | OJ | 2.0 |
51 | 52 | 26.4 | OJ | 2.0 |
52 | 53 | 22.4 | OJ | 2.0 |
53 | 54 | 24.5 | OJ | 2.0 |
54 | 55 | 24.8 | OJ | 2.0 |
55 | 56 | 30.9 | OJ | 2.0 |
56 | 57 | 26.4 | OJ | 2.0 |
57 | 58 | 27.3 | OJ | 2.0 |
58 | 59 | 29.4 | OJ | 2.0 |
59 | 60 | 23.0 | OJ | 2.0 |
import statsmodels.api as sm from statsmodels.formula.api import ols formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)' model = ols(formula, data).fit() print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: len R-squared: 0.794 Model: OLS Adj. R-squared: 0.775 Method: Least Squares F-statistic: 41.56 Date: Sun, 08 Jul 2018 Prob (F-statistic): 2.50e-17 Time: 04:18:54 Log-Likelihood: -159.35 No. Observations: 60 AIC: 330.7 Df Residuals: 54 BIC: 343.3 Df Model: 5 Covariance Type: nonrobust ================================================================================================ coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------------ Intercept 13.2300 1.148 11.521 0.000 10.928 15.532 C(supp)[T.VC] -5.2500 1.624 -3.233 0.002 -8.506 -1.994 C(dose)[T.1.0] 9.4700 1.624 5.831 0.000 6.214 12.726 C(dose)[T.2.0] 12.8300 1.624 7.900 0.000 9.574 16.086 C(supp)[T.VC]:C(dose)[T.1.0] -0.6800 2.297 -0.296 0.768 -5.285 3.925 C(supp)[T.VC]:C(dose)[T.2.0] 5.3300 2.297 2.321 0.024 0.725 9.935 ============================================================================== Omnibus: 0.336 Durbin-Watson: 2.025 Prob(Omnibus): 0.846 Jarque-Bera (JB): 0.324 Skew: 0.164 Prob(JB): 0.850 Kurtosis: 2.852 Cond. No. 9.77 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
aov_table = sm.stats.anova_lm(model, typ=2) print(aov_table)
sum_sq df F PR(>F) C(supp) 205.350000 1.0 15.571979 2.311828e-04 C(dose) 2426.434333 2.0 91.999965 4.046291e-18 C(supp):C(dose) 108.319000 2.0 4.106991 2.186027e-02 Residual 712.106000 54.0 NaN NaN
import math print("C(supp) pvalue", 2.311828 * pow(math.e, -4)) print("C(dose) pvalue", 4.046291 * pow(math.e, -18)) print("C(supp):C(dose) pvalue", 2.186027 * pow(math.e, -2))
C(supp) pvalue 0.042342606820864576 C(dose) pvalue 6.162492997121306e-08 C(supp):C(dose) pvalue 0.29584658320788276
C(supp) の
C(dose) の
supp のみが、
まとめ
skipy と、
- 「統計用語 + python」 で
検索すると、 statsmodel 記事が よく 引っかかる。 - skipy と、
statsmodel だったら、 一元配置分散分析も、 二元配置分散分析も、 同じ 手順で 実装できる statsmodel の 方が 良いと 思った。 - 因子分析と、
分散分析の 違いが よくわからなくなったが、 因子が 決まっていてどれくらいの 影響が あるのかを 調べるのが 分散分析で、 因子が 何かわからない 状況で 探すのが 因子分析。 と、 個人的には 理解した。
以上です。
コメント