[kaggle] Trying LightGBM on the Pokemon data

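I found a dataset on Kaggle called "Pokemon with stats". It collects the basic stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, and so on) of 800 Pokemon. It also records whether each Pokemon is legendary or not, so I'd like to look at which features make a Pokemon legendary. As in my previous post, "[kaggle] Trying LightGBM on the divorce prediction data", I'll use LightGBM here too.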

Checking and preparing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
from sklearn import metrics
from sklearn.model_selection import train_test_split

import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder

These are the usual imports. Since the data contains a categorical column, I also import LabelEncoder.

df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Pokemon.csv")
df.head(10)

Read the file into a DataFrame and display the first rows.

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	FALSE
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	FALSE
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	FALSE
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	FALSE
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	FALSE
5	5	Charmeleon	Fire	NaN	405	58	64	58	80	65	80	1	FALSE
6	6	Charizard	Fire	Flying	534	78	84	78	109	85	100	1	FALSE
7	6	CharizardMega Charizard X	Fire	Dragon	634	78	130	111	130	85	100	1	FALSE
8	6	CharizardMega Charizard Y	Fire	Flying	634	78	104	78	159	115	100	1	FALSE
9	7	Squirtle	Water	NaN	314	44	48	65	50	64	43	1	FALSE

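This is what it looks like. I know almost nothing about Pokemon, but presumably Type 1 is the main category and Type 2 a subcategory. Generation is probably something like the order in which they were released.
The Legendary column says whether each Pokemon is legendary or not.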

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB

That is the overall structure. Let's also look at the number of unique values per column.

df.nunique()

#             721
Name          800
Type 1         18
Type 2         18
Total         200
HP             94
Attack        111
Defense       103
Sp. Atk       105
Sp. Def        92
Speed         108
Generation      6
Legendary       2
dtype: int64

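I see, Type 1 has as many as 18 distinct values. Some Pokemon have no Type 2 (only 414 of the 800 rows have one), so I'll drop that column and work with the rest.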

df_=df.drop(["#","Name","Type 2","Legendary"],axis=1)
df_.head(10)

	Type 1	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation
0	Grass	318	45	49	49	65	65	45	1
1	Grass	405	60	62	63	80	80	60	1
2	Grass	525	80	82	83	100	100	80	1
3	Grass	625	80	100	123	122	120	80	1
4	Fire	309	39	52	43	60	50	65	1
5	Fire	405	58	64	58	80	65	80	1
6	Fire	534	78	84	78	109	85	100	1
7	Fire	634	78	130	111	130	85	100	1
8	Fire	634	78	104	78	159	115	100	1
9	Water	314	44	48	65	50	64	43	1

Next, convert Type 1 to numbers. Since LightGBM is based on decision trees, I simply label-encode it as is.

x=df_

type1_le=LabelEncoder()
x["Type 1"]=type1_le.fit_transform(x["Type 1"])
x.head(10)

	Type 1	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation
0	9	318	45	49	49	65	65	45	1
1	9	405	60	62	63	80	80	60	1
2	9	525	80	82	83	100	100	80	1
3	9	625	80	100	123	122	120	80	1
4	6	309	39	52	43	60	50	65	1
5	6	405	58	64	58	80	65	80	1
6	6	534	78	84	78	109	85	100	1
7	6	634	78	130	111	130	85	100	1
8	6	634	78	104	78	159	115	100	1
9	17	314	44	48	65	50	64	43	1
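To see what those integers mean, here is a minimal sketch on a hypothetical four-value sample (not the Kaggle file). LabelEncoder sorts the distinct labels and numbers them in that order, so the codes reflect only alphabetical position, and `classes_` and `inverse_transform` recover the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Type 1 values, not the real dataset
sample = ["Grass", "Fire", "Water", "Fire"]

le = LabelEncoder()
codes = le.fit_transform(sample)

# classes_ holds the sorted unique labels; each code is an index into it
mapping = {label: int(i) for i, label in enumerate(le.classes_)}
print(mapping)  # {'Fire': 0, 'Grass': 1, 'Water': 2}

# inverse_transform maps the codes back to the original strings
restored = list(le.inverse_transform(codes))
print(restored)
```

With all 18 types present, the same rule yields the codes in the table above (Fire is 6, Grass is 9, Water is 17).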

The features are now ready. Next comes the target.

t=df.iloc[:,-1]
t=t.astype(int)  # np.int was removed in NumPy 1.24; the built-in int does the same

This part is easy: casting True/False to an integer type turns the values into 1 and 0. After that, split the data as usual.

x_train,x_test,t_train,t_test=train_test_split(x,t,test_size=0.3,random_state=0)
print(x_train.shape,x_test.shape,t_train.shape,t_test.shape)

(560, 9) (240, 9) (560,) (240,)
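As a small check of the two steps above, this sketch runs them on a hypothetical 100-row target with 10 True values. The `stratify` argument is an addition for illustration and is not part of the code above; it keeps the share of positives equal in both halves, which can matter when one class is rare.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target: 10 True out of 100 rows (not the real Legendary column)
flags = pd.Series([True] * 10 + [False] * 90)
t_demo = flags.astype(int)  # True -> 1, False -> 0
x_demo = pd.DataFrame({"dummy": range(100)})

# stratify=t_demo is an assumption added here; the article's split omits it
x_tr, x_te, t_tr, t_te = train_test_split(
    x_demo, t_demo, test_size=0.3, random_state=0, stratify=t_demo
)

# Share of positives in each half
print(t_tr.mean(), t_te.mean())
```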

With that, the data is ready.

Training with LightGBM

lgb_train=lgb.Dataset(x_train,t_train)
lgb_eval=lgb.Dataset(x_test,t_test)

params = {"metric":"auc",
          "objective":"binary", 
          "max_depth":7}

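lgbm = lgb.train(params,
                lgb_train,
                valid_sets=[lgb_eval],
                num_boost_round=1000,
                # LightGBM 4.0 removed the early_stopping_rounds / verbose_eval
                # arguments of train(); these callbacks do the same job
                callbacks=[lgb.early_stopping(stopping_rounds=50),
                           lgb.log_evaluation(period=50)])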

Training until validation scores don't improve for 50 rounds.
[50]	valid_0's auc: 0.986375
Early stopping, best iteration is:
[45]	valid_0's auc: 0.988179

As usual, training finishes in an instant. The accuracy is:

print(metrics.accuracy_score(t_train.values,np.round(lgbm.predict(x_train))))
print(metrics.accuracy_score(t_test.values,np.round(lgbm.predict(x_test))))

0.9946428571428572
0.9625

As expected, the scores are excellent: about 99.5% accuracy on the training data and 96.25% on the test data.
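When reading accuracy on a target with a rare positive class, it may help to compare against a trivial baseline. The sketch below uses hypothetical labels (10 positives in 100 rows, not the Pokemon data) to show that a predictor which always answers "not legendary" still scores high on accuracy while finding none of the positives.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 10 positives out of 100 (not the real data)
y_true = np.array([1] * 10 + [0] * 90)
y_all_zero = np.zeros_like(y_true)  # baseline that never predicts 1

# Accuracy looks good, recall on the positive class is zero
acc = metrics.accuracy_score(y_true, y_all_zero)
rec = metrics.recall_score(y_true, y_all_zero)
print(acc, rec)
```

For the model above, calling `metrics.recall_score` or `metrics.confusion_matrix` on `t_test` and the rounded predictions would give the same kind of breakdown.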

Which features make a Pokemon legendary?

imp=pd.DataFrame(np.round(lgbm.feature_importance(importance_type="gain"),2), index=x.columns, columns=["importance"])
imp=imp.sort_values("importance",ascending=False)
imp