[Kaggle] Trying LightGBM on the Divorce Prediction Dataset

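I came across this one on Kaggle: "Divorce Prediction". Apparently 170 couples in Turkey were each asked 54 questions, and their answers are compiled into a dataset. Of the 170 couples, 84 are already divorced and the rest are not. So the data holds 54 features for each of the 170 couples, plus a divorced/not-divorced target.
As a married person, I can't resist finding out which features (questions) carry the most importance. So I'll go with LightGBM again, since it did well last time in "Trying LightGBM on the Scikit-learn California Housing Dataset".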

Checking and preparing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
from sklearn import metrics
from sklearn.model_selection import train_test_split

Import the usual set of libraries.

df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/divorce.csv")
df.head()

Load the CSV as a DataFrame and display the first rows.

	Sorry_end	Ignore_diff	begin_correct	Contact	Special_time	No_home_time	2_strangers	enjoy_holiday	enjoy_travel	common_goals	harmony	freeom_value	entertain	people_goals	dreams	love	happy	marriage	roles	trust	likes	care_sick	fav_food	stresses	inner_world	anxieties	current_stress	hopes_wishes	know_well	friends_social	Aggro_argue	Always_never	negative_personality	offensive_expressions	insult	humiliate	not_calm	hate_subjects	sudden_discussion	idk_what's_going_on	calm_breaks	argue_then_leave	silent_for_calm	good_to_leave_home	silence_instead_of_discussion	silence_for_harm	silence_fear_anger	I'm_right	accusations	I'm_not_guilty	I'm_not_wrong	no_hesitancy_inadequate	you're_inadequate	incompetence	Divorce_Y_N
0	2	2	4	1	0	0	0	0	0	0	1	0	1	1	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	1	1	2	1	2	0	1	2	1	3	3	2	1	1	2	3	2	1	3	3	3	2	3	2	1	1
1	4	4	4	4	4	0	0	4	4	4	4	3	4	0	4	4	4	4	3	2	1	1	0	2	2	1	2	0	1	1	0	4	2	3	0	2	3	4	2	4	2	2	3	4	2	2	2	3	4	4	4	4	2	2	1
2	2	2	2	2	1	3	2	1	1	2	3	4	2	3	3	3	3	3	3	2	1	0	1	2	2	2	2	2	3	2	3	3	1	1	1	1	2	1	3	3	3	3	2	3	2	3	2	3	1	1	1	2	2	2	1
3	3	2	3	2	3	3	3	3	3	3	4	3	3	4	3	3	3	3	3	4	1	1	1	1	2	1	1	1	1	3	2	3	2	2	1	1	3	3	4	4	2	2	3	2	3	2	2	3	3	3	3	2	2	2	1
4	2	2	1	1	1	1	0	0	0	0	0	1	0	1	1	1	1	1	2	1	1	0	0	0	0	2	1	2	1	1	1	1	1	1	0	0	0	0	2	1	0	2	3	0	2	2	1	2	3	2	2	2	1	0	1

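According to Google Translate, the answer scale is
(0 = never, 1 = seldom, 2 = average, 3 = frequently, 4 = always).
Each column is a short label for one of the questions,
and the rightmost column holds the target: 1 = divorced, 0 = not divorced.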

Here is what info() shows.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 55 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Sorry_end                      170 non-null    int64
 1   Ignore_diff                    170 non-null    int64
 2   begin_correct                  170 non-null    int64
 3   Contact                        170 non-null    int64
 4   Special_time                   170 non-null    int64
 5   No_home_time                   170 non-null    int64
 6   2_strangers                    170 non-null    int64
 7   enjoy_holiday                  170 non-null    int64
 8   enjoy_travel                   170 non-null    int64
 9   common_goals                   170 non-null    int64
 10  harmony                        170 non-null    int64
 11  freeom_value                   170 non-null    int64
 12  entertain                      170 non-null    int64
 13  people_goals                   170 non-null    int64
 14  dreams                         170 non-null    int64
 15  love                           170 non-null    int64
 16  happy                          170 non-null    int64
 17  marriage                       170 non-null    int64
 18  roles                          170 non-null    int64
 19  trust                          170 non-null    int64
 20  likes                          170 non-null    int64
 21  care_sick                      170 non-null    int64
 22  fav_food                       170 non-null    int64
 23  stresses                       170 non-null    int64
 24  inner_world                    170 non-null    int64
 25  anxieties                      170 non-null    int64
 26  current_stress                 170 non-null    int64
 27  hopes_wishes                   170 non-null    int64
 28  know_well                      170 non-null    int64
 29  friends_social                 170 non-null    int64
 30  Aggro_argue                    170 non-null    int64
 31  Always_never                   170 non-null    int64
 32  negative_personality           170 non-null    int64
 33  offensive_expressions          170 non-null    int64
 34  insult                         170 non-null    int64
 35  humiliate                      170 non-null    int64
 36  not_calm                       170 non-null    int64
 37  hate_subjects                  170 non-null    int64
 38  sudden_discussion              170 non-null    int64
 39  idk_what's_going_on            170 non-null    int64
 40  calm_breaks                    170 non-null    int64
 41  argue_then_leave               170 non-null    int64
 42  silent_for_calm                170 non-null    int64
 43  good_to_leave_home             170 non-null    int64
 44  silence_instead_of_discussion  170 non-null    int64
 45  silence_for_harm               170 non-null    int64
 46  silence_fear_anger             170 non-null    int64
 47  I'm_right                      170 non-null    int64
 48  accusations                    170 non-null    int64
 49  I'm_not_guilty                 170 non-null    int64
 50  I'm_not_wrong                  170 non-null    int64
 51  no_hesitancy_inadequate        170 non-null    int64
 52  you're_inadequate              170 non-null    int64
 53  incompetence                   170 non-null    int64
 54  Divorce_Y_N                    170 non-null    int64
dtypes: int64(55)
memory usage: 73.2 KB

Let's also check how many rows are divorced and how many are not.

df["Divorce_Y_N"].value_counts()

0    86
1    84
Name: Divorce_Y_N, dtype: int64

Nothing looks off, so define the features and the target, then split them into training and test sets.

x=df.drop("Divorce_Y_N",axis=1)
t=df.loc[:,["Divorce_Y_N"]]

x_train,x_test,t_train,t_test=train_test_split(x,t,test_size=0.3,random_state=0)
print(x_train.shape,x_test.shape,t_train.shape,t_test.shape)

(119, 54) (51, 54) (119, 1) (51, 1)
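The shapes above follow from simple arithmetic on the 170 rows. As a minimal sketch in plain Python (not scikit-learn's actual implementation; the helper name `holdout_split` is made up for illustration, and its shuffle will not pick the same rows as `train_test_split`), a shuffled hold-out split with a fractional test size could look like this:

```python
import math
import random

def holdout_split(n_samples, test_size, seed=0):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration only.
    """
    n_test = math.ceil(n_samples * test_size)  # the test share is rounded up
    n_train = n_samples - n_test
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is reproducible
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = holdout_split(170, 0.3)
print(len(train_idx), len(test_idx))  # 119 51
```

With 170 rows and a test share of 0.3, that gives 51 test rows and 119 training rows, matching the printed shapes.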

Training with LightGBM

import lightgbm as lgb

lgb_train=lgb.Dataset(x_train,t_train)
lgb_eval=lgb.Dataset(x_test,t_test)

params = {"metric":"auc",
          "objective":"binary", 
          "max_depth":10}

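# Note: lgb.train accepts early_stopping_rounds and verbose_eval only in LightGBM < 4.0.
# From 4.0 on, pass callbacks=[lgb.early_stopping(100), lgb.log_evaluation(50)] instead.
lgbm = lgb.train(params,
                lgb_train,
                valid_sets=lgb_eval,
                num_boost_round=500,
                early_stopping_rounds=100,
                verbose_eval=50)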

AUC is the area under the ROC curve: the closer it is to 1, the better the classifier. It is an evaluation metric specific to binary classification.
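To make "area under the ROC curve" a bit more concrete: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. Here is a minimal sketch in plain Python. The function name `pairwise_auc` and the toy labels and scores are made up for illustration and are not taken from the divorce data.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): one positive (0.4) is scored below one negative (0.6).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(pairwise_auc(labels, scores))
```

In this toy case 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.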

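valid_sets	The validation data (here, the test split is reused for this)
num_boost_round	The number of boosting rounds (the number of trees)
early_stopping_rounds	Stop training if the validation metric has not improved for this many rounds
verbose_eval	Print the evaluation result every this many boosting rounds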
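How `num_boost_round` and `early_stopping_rounds` interact can be sketched in plain Python. This only illustrates the stopping rule and is not LightGBM's own code; the function name `best_round_with_patience` and the score sequence are hypothetical.

```python
def best_round_with_patience(scores, patience):
    """Scan per-round validation scores (higher is better) and stop once
    `patience` rounds have passed without beating the best score so far.

    Returns (best_round, stopped_round), both 1-based.
    """
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:           # only a strict improvement resets the count
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no  # stopped early
    return best_round, len(scores)       # ran through every round

# Hypothetical AUC values: peak at round 3, then no further improvement.
val_auc = [0.90, 0.95, 0.99, 0.99, 0.98, 0.98, 0.97, 0.97]
print(best_round_with_patience(val_auc, patience=3))  # (3, 6)
```

Under the same rule, a best iteration of 45 with a patience of 100 would mean training stops around round 145, well short of the 500 rounds allowed.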

With that, training is done. As for the output:

Training until validation scores don't improve for 100 rounds.
[50]	valid_0's auc: 0.993827
[100]	valid_0's auc: 0.989198
Early stopping, best iteration is:
[45]	valid_0's auc: 0.993827

That is what it printed.

Which features (questions) matter most?

Now that the model is trained, let's look at the results.

np.round(lgbm.predict(x_train),3)

array([0.994, 0.994, 0.995, 0.011, 0.985, 0.006, 0.006, 0.012, 0.995,
       0.02 , 0.015, 0.012, 0.007, 0.01 , 0.008, 0.994, 0.996, 0.994,
       0.013, 0.008, 0.994, 0.994, 0.009, 0.995, 0.011, 0.007, 0.007,
       0.006, 0.006, 0.009, 0.994, 0.009, 0.014, 0.996, 0.007, 0.011,
       0.908, 0.994, 0.994, 0.005, 0.994, 0.011, 0.995, 0.013, 0.995,
       0.993, 0.995, 0.996, 0.995, 0.996, 0.015, 0.008, 0.983, 0.995,
       0.994, 0.006, 0.01 , 0.994, 0.008, 0.011, 0.994, 0.9  , 0.006,
       0.012, 0.994, 0.994, 0.009, 0.995, 0.995, 0.994, 0.013, 0.995,
       0.982, 0.01 , 0.009, 0.014, 0.994, 0.006, 0.009, 0.995, 0.994,
       0.995, 0.008, 0.009, 0.995, 0.993, 0.013, 0.996, 0.009, 0.008,
       0.009, 0.996, 0.01 , 0.994, 0.01 , 0.994, 0.995, 0.006, 0.008,
       0.009, 0.996, 0.995, 0.995, 0.994, 0.009, 0.008, 0.996, 0.996,
       0.012, 0.01 , 0.994, 0.01 , 0.995, 0.995, 0.959, 0.013, 0.995,
       0.008, 0.995])

These are the predictions on the training data. I see, I see.
The accuracy is:

print(metrics.accuracy_score(t_train.values,np.round(lgbm.predict(x_train))))
print(metrics.accuracy_score(t_test.values,np.round(lgbm.predict(x_test))))

1.0
0.9803921568627451
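The two numbers above come from rounding each predicted probability to 0 or 1 and comparing it with the true label. Here is a plain-Python sketch of that step. `accuracy_from_probs` is a made-up helper and the toy probabilities are hypothetical. `np.round` rounds an exact 0.5 down to 0 (round half to even), which the sketch mirrors with a strict `> 0.5`.

```python
def accuracy_from_probs(y_true, probs, threshold=0.5):
    """Threshold probabilities into 0/1 labels and return the fraction correct."""
    if len(y_true) != len(probs):
        raise ValueError("y_true and probs must have the same length")
    y_pred = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example (hypothetical values): the third prediction is wrong.
y_true = [1, 0, 1, 0]
probs = [0.994, 0.011, 0.300, 0.006]
print(accuracy_from_probs(y_true, probs))  # 0.75

# One miss out of 51 test rows corresponds to the test accuracy printed above.
print(50 / 51)
```

Read this way, the test accuracy of 0.9803921568627451 is 50/51: exactly one of the 51 test couples was misclassified.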

LightGBM delivers, as expected.
Now for the main topic: the feature importances.

imp=pd.DataFrame(np.round(lgbm.feature_importance(importance_type="gain"),2), index=x.columns, columns=["importance"])
imp=imp.sort_values("importance",ascending=False)
imp.head()

Question	importance
idk_what's_going_on	388.66
marriage	381.65
sudden_discussion	51.93
trust	17.57
Contact	5.0

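With importance_type="gain", the importance of a feature is the total gain, that is, the total reduction in the training loss, summed over all splits that use that feature.
Without it, the default ("split") is the number of times the feature is used in the decision trees.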
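The difference between the two importance types can be sketched in plain Python. The list of splits below is entirely hypothetical (the feature names are borrowed from the columns, the gain values are invented) and only illustrates the bookkeeping.

```python
from collections import defaultdict

# Each tuple is (feature used at a split, loss reduction achieved by that split).
# Hypothetical values for illustration; not read from the trained model.
splits = [
    ("marriage", 120.0),
    ("trust", 2.5),
    ("trust", 1.5),
    ("trust", 1.0),
    ("marriage", 80.0),
]

split_importance = defaultdict(int)    # analogous to importance_type="split"
gain_importance = defaultdict(float)   # analogous to importance_type="gain"
for feature, gain in splits:
    split_importance[feature] += 1
    gain_importance[feature] += gain

print(dict(split_importance))  # {'marriage': 2, 'trust': 3}
print(dict(gain_importance))   # {'marriage': 200.0, 'trust': 5.0}
```

In this toy case `trust` is used more often, but `marriage` dominates by gain, which is why the two rankings can disagree.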
So it looks like the top two questions above are almost enough to make the prediction on their own.
Concretely, these are the two questions:

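idk_what's_going_on
We’re just starting a discussion before I know what’s going on.
(Google Translate's Japanese rendering, translated back: "Before knowing what is going on, we have only just started to talk it over.")
marriage
My spouse and I have similar ideas about how marriage should be.
(Google Translate's Japanese rendering, translated back: "My spouse and I hold similar ideas about how marriage should be.")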

That is how Google Translate renders them. Hmm? Let's look at the correlations.

df_=df.loc[:,["marriage","idk_what's_going_on","Divorce_Y_N"]]
df_.corr()

Question	marriage	idk_what's_going_on	Divorce_Y_N
marriage	1.0	0.8760012777696867	0.9232083178110088
idk_what's_going_on	0.8760012777696867	1.0	0.9386836321317147
Divorce_Y_N	0.9232083178110088	0.9386836321317147	1.0

Both questions show a strong positive correlation with the target (1: divorced, 0: not divorced).
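`corr()` computes the Pearson correlation coefficient by default. As a sketch of what that number measures, here is the formula in plain Python. `pearson_r` is a made-up helper and the five toy answers are hypothetical, not rows from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): answers on the 0-4 scale and a 0/1 target.
answers = [0, 1, 0, 3, 4]
target = [0, 0, 0, 1, 1]
print(round(pearson_r(answers, target), 3))
```

A value near +1 means that higher answers go together with a target of 1. It describes the direction and strength of a linear relationship, not why the two move together.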
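If the first question means that arguments kick off before either side has grasped the situation, then it does seem plausible that it leads to divorce. The second one, though, would mean that couples who hold similar ideas about how marriage should be are more likely to divorce. Hmm. So is the point that having ideas about how marriage "should be" is bad in the first place, and having them match is even worse???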