Decision trees (IV and last)
In this article we repeat the exercise from the previous one, but this time we build a C5.0 model.
As you will recall, our classification problem is to predict possible churn among the customers of a mobile operator.
The steps we will follow are, as always:
Getting the data
Exploring and preparing the data
Building the model
Evaluating its performance
Possible improvements
Getting the data
We load the data again:
library(C50)
library(modeldata)
data(mlc_churn)
churn <- mlc_churn
Exploring and preparing the data
We already went through this exercise in an earlier article, but let's refresh our knowledge of the dataset:
str(churn)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 20 variables:
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num 265 162 243 299 167 ...
## $ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num 245 254 163 197 187 ...
## $ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
We have 5000 observations and 20 variables: 19 of them are predictors and 1, churn, is our target variable.
How many observations correspond to customers who churned?
table(churn$churn)
##
## yes no
## 707 4293
And as proportions:
prop.table(table(churn$churn))
##
## yes no
## 0.1414 0.8586
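These proportions give a useful reference point for judging any accuracy figure. As a minimal sketch that uses only the counts printed above, a trivial classifier that always predicts "no" would already be right about 86% of the time:

```r
# Class counts taken from the table(churn$churn) output above
counts <- c(yes = 707, no = 4293)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(counts) / sum(counts)
baseline_accuracy
```

Any model worth keeping should clearly beat this baseline.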
Let's now prepare the data for building the model. We will split them into a training set (used to build the model) and a test set (used to evaluate the model's performance). There are more sophisticated schemes, such as cross-validation, which we will see later; for now we will keep things simple.
To split the data into the training set and the test set we will use random sampling, a procedure that selects observations at random from the whole dataset. We will send a random 90% of the observations to the training set and the remaining 10% to the test set:
set.seed(127)
train_idx <- sample(nrow(churn), 0.9*nrow(churn))
churn_train <- churn[train_idx,]
churn_test <- churn[-train_idx,]
The class proportions of the two samples are indeed reasonably similar:
prop.table(table(churn_train$churn))
##
## yes no
## 0.1388889 0.8611111
prop.table(table(churn_test$churn))
##
## yes no
## 0.164 0.836
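With plain random sampling the share of "yes" differs somewhat between the two sets (about 13.9% versus 16.4%). If closer proportions are wanted, one option is to sample within each class. The following is only a sketch in base R, not the procedure used in this article; it works on a synthetic stand-in for churn$churn with the same class counts, and the names y, idx_by_class, y_train and y_test are made up for the illustration:

```r
# Synthetic stand-in for churn$churn with the same class counts (707 yes, 4293 no)
set.seed(127)
y <- factor(rep(c("yes", "no"), times = c(707, 4293)), levels = c("yes", "no"))

# Take 90% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.9 * length(idx)))
}), use.names = FALSE)

y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Both sets now hold roughly 14% "yes"
prop.table(table(y_train))
prop.table(table(y_test))
```

The caret package offers createDataPartition() for the same purpose.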
Building the model
We will use the C5.0 algorithm from the C50 package. We already loaded it above (library(C50)); the data, in turn, come from the modeldata package.
As a first approximation we will use the default options (trials = 1, costs = NULL, rules = FALSE, weights = NULL, control = C5.0Control()). For clarity I will spell them out:
C50_churn_model <- C5.0(x = churn_train[-20],
y = churn_train$churn,
trials = 1,
rules = FALSE,
weights = NULL,
control = C5.0Control(),
costs = NULL)
As you can see, this algorithm has many options. In particular, look at the algorithm's control function:
C5.0Control()
## $subset
## [1] TRUE
##
## $bands
## [1] 0
##
## $winnow
## [1] FALSE
##
## $noGlobalPruning
## [1] FALSE
##
## $CF
## [1] 0.25
##
## $minCases
## [1] 2
##
## $fuzzyThreshold
## [1] FALSE
##
## $sample
## [1] 0
##
## $earlyStopping
## [1] TRUE
##
## $label
## [1] "outcome"
##
## $seed
## [1] 2993
It is worth taking a look at ?C5.0 and ?C5.0Control.
Let's see the resulting model:
C50_churn_model
##
## Call:
## C5.0.default(x = churn_train[-20], y = churn_train$churn, trials = 1, rules
## = FALSE, weights = NULL, control = C5.0Control(), costs = NULL)
##
## Classification Tree
## Number of samples: 4500
## Number of predictors: 19
##
## Tree size: 29
##
## Non-standard options: attempt to group attributes
We see that the tree has a size of 29, that is, 29 leaves (terminal nodes). Let's look at its decisions:
summary(C50_churn_model)
##
## Call:
## C5.0.default(x = churn_train[-20], y = churn_train$churn, trials = 1, rules
## = FALSE, weights = NULL, control = C5.0Control(), costs = NULL)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 30 12:32:34 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 4500 cases (20 attributes) from undefined.data
##
## Decision tree:
##
## number_customer_service_calls > 3:
## :...total_day_minutes <= 162.7:
## : :...total_eve_charge <= 19.83: yes (105/5)
## : : total_eve_charge > 19.83:
## : : :...total_day_minutes <= 134.5: yes (17/1)
## : : total_day_minutes > 134.5: no (16/3)
## : total_day_minutes > 162.7:
## : :...international_plan = yes:
## : :...total_night_calls <= 96: yes (10/1)
## : : total_night_calls > 96: no (11/3)
## : international_plan = no:
## : :...total_day_minutes > 263.4:
## : :...voice_mail_plan = no: yes (15/2)
## : : voice_mail_plan = yes: no (5)
## : total_day_minutes <= 263.4:
## : :...total_eve_charge > 11.48: no (158/21)
## : total_eve_charge <= 11.48:
## : :...total_day_minutes <= 201.3: yes (10)
## : total_day_minutes > 201.3: no (9/3)
## number_customer_service_calls <= 3:
## :...total_day_minutes <= 244.6:
## :...international_plan = yes:
## : :...total_intl_minutes > 13: yes (60)
## : : total_intl_minutes <= 13:
## : : :...total_intl_calls <= 2: yes (48)
## : : total_intl_calls > 2: no (214/5)
## : international_plan = no:
## : :...total_day_minutes <= 220.8: no (2932/72)
## : total_day_minutes > 220.8:
## : :...total_eve_charge <= 22.7: no (368/18)
## : total_eve_charge > 22.7:
## : :...voice_mail_plan = no: yes (35/4)
## : voice_mail_plan = yes: no (11)
## total_day_minutes > 244.6:
## :...voice_mail_plan = yes: no (115/8)
## voice_mail_plan = no:
## :...total_eve_minutes > 201:
## :...total_night_charge > 9.5: yes (75)
## : total_night_charge <= 9.5:
## : :...total_day_minutes > 264.6: yes (57/3)
## : total_day_minutes <= 264.6:
## : :...total_eve_minutes <= 242.4: no (25/3)
## : total_eve_minutes > 242.4: yes (21/5)
## total_eve_minutes <= 201:
## :...total_day_minutes <= 277.7:
## :...international_plan = no: no (112/12)
## : international_plan = yes:
## : :...total_intl_calls <= 2: yes (6)
## : total_intl_calls > 2: no (14/3)
## total_day_minutes > 277.7:
## :...total_eve_minutes > 167.3: yes (23)
## total_eve_minutes <= 167.3:
## :...total_night_charge > 9.31: yes (10)
## total_night_charge <= 9.31:
## :...total_day_minutes <= 303: no (15)
## total_day_minutes > 303: yes (3)
##
##
## Evaluation on training data (4500 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 29 172( 3.8%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 474 151 (a): class yes
## 21 3854 (b): class no
##
##
## Attribute usage:
##
## 100.00% total_day_minutes
## 100.00% number_customer_service_calls
## 89.29% international_plan
## 16.20% total_eve_charge
## 12.04% voice_mail_plan
## 8.02% total_eve_minutes
## 7.16% total_intl_minutes
## 6.27% total_intl_calls
## 4.58% total_night_charge
## 0.47% total_night_calls
##
##
## Time: 0.1 secs
This shows clearly the decisions that create the branches of the tree. The numbers in parentheses give the number of training samples that reach each leaf and how many of them are misclassified. For example, 105 samples reach the leaf on the third line of the tree and are classified as “yes”; 5 of them are misclassified.
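As a quick consistency check on this notation, the leaf counts can be added up by hand. The sketch below simply transcribes the (n/m) pairs of the 29 leaves printed above, top to bottom (a leaf printed without "/m" has no errors); the vector names n and m are chosen only for this illustration:

```r
# Cases reaching each leaf (n), transcribed from the tree printed above
n <- c(105, 17, 16, 10, 11, 15, 5, 158, 10, 9,
       60, 48, 214, 2932, 368, 35, 11,
       115, 75, 57, 25, 21, 112, 6, 14, 23, 10, 15, 3)
# Misclassified cases in each leaf (m); 0 where no "/m" is printed
m <- c(5, 1, 3, 1, 3, 2, 0, 21, 0, 3,
       0, 0, 5, 72, 18, 4, 0,
       8, 0, 3, 3, 5, 12, 0, 3, 0, 0, 0, 0)

length(n)                        # number of leaves, i.e. the tree size
sum(n)                           # all training cases
sum(m)                           # all training errors
round(100 * sum(m) / sum(n), 1)  # training error rate in percent
```

The totals match the summary: 29 leaves, 4500 cases and 172 errors (3.8%).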
Evaluating the model
Decision trees tend to overfit the examples presented to them in the training set. Any machine learning algorithm will usually do worse on the test set than on the training set (after all, it has never seen those data ;-), but with decision trees the gap can be larger. Let's check. We make a prediction for the samples in the test set:
C50_predictions <- predict(C50_churn_model, churn_test)
The resulting confusion matrix is:
library(caret)
C50_cm <- confusionMatrix(data = C50_predictions,
reference = churn_test$churn)
C50_cm$table
## Reference
## Prediction yes no
## yes 60 5
## no 22 413
The accuracy (correct predictions over the total number of cases) is:
C50_cm$overall["Accuracy"]
## Accuracy
## 0.946
Not bad at all; on the training set it was only slightly higher, 96.2% (172 errors in 4500 cases). But let's dig a little deeper.
The sensitivity (of all the true “yes”, how many were classified as such? or: if an example is truly “yes”, what is the probability that we classified it correctly?) is:
C50_cm$byClass["Sensitivity"]
## Sensitivity
## 0.7317073
Since, of 82 true “yes”, only 60 were classified as such.
The specificity (of all the true “no”, how many were classified as such? or: if an example is truly “no”, what is the probability that we classified it correctly?) is:
C50_cm$byClass["Specificity"]
## Specificity
## 0.9880383
Since, of 418 true “no”, 413 were classified as such.
Therefore the false positive rate (of all the true “no”, how many were classified as “yes”?), which is 1 minus the specificity, is:
1 - as.numeric(C50_cm$byClass["Specificity"])
## [1] 0.01196172
And the false negative rate (of all the true “yes”, how many were classified as “no”?), which is 1 minus the sensitivity:
1 - as.numeric(C50_cm$byClass["Sensitivity"])
## [1] 0.2682927
We can also talk about the positive predictive value (of all the “yes” predictions, how many really were “yes”? or: if we classified an observation as “yes”, what is the probability that it really is one?):
C50_cm$byClass["Pos Pred Value"]
## Pos Pred Value
## 0.9230769
(since 65 “yes” were predicted and only 60 really were)
And the negative predictive value (of all the “no” predictions, how many really were “no”? or: if we classified an observation as “no”, what is the probability that it really is one?):
C50_cm$byClass["Neg Pred Value"]
## Neg Pred Value
## 0.9494253
(since 435 “no” were predicted and only 413 really were)
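All of these figures can be recomputed by hand from the four cells of the confusion matrix, which is a good way to check that the definitions are understood. A minimal sketch in base R, with the cell values copied from the C50_cm$table output above and “yes” taken as the positive class (the variable names are chosen only for this illustration):

```r
# Confusion matrix from the test set: rows are predictions, columns the reference
cm <- matrix(c(60, 22, 5, 413), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"),
                             Reference  = c("yes", "no")))

TP <- cm["yes", "yes"]  # true "yes" predicted as "yes"
FN <- cm["no",  "yes"]  # true "yes" predicted as "no"
FP <- cm["yes", "no"]   # true "no" predicted as "yes"
TN <- cm["no",  "no"]   # true "no" predicted as "no"

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
fpr         <- FP / (FP + TN)   # equals 1 - specificity
fnr         <- FN / (FN + TP)   # equals 1 - sensitivity
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, fpr = fpr, fnr = fnr,
        ppv = ppv, npv = npv), 4)
```

Note how the high accuracy hides the weakest point of the model: more than a quarter of the customers who really churn go undetected.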
Possible improvements
Boosting
C5.0 lets us use a mechanism called adaptive boosting, a process in which many decision trees are built and then “vote” to decide the class of each observation.
Boosting can be applied to any machine learning algorithm, not only to decision trees. For now we will just mention that the underlying idea is to combine a set of weak classifiers into a classifier more powerful than any of them.
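To make the idea of “voting” concrete, here is a toy sketch of a weighted vote among three classifiers. It illustrates only the combination step, not how C5.0 builds or weights its trees internally; the votes and the weights are invented for the example:

```r
# Hypothetical predictions of three weak classifiers for four observations
votes <- cbind(tree1 = c("yes", "no", "no",  "yes"),
               tree2 = c("no",  "no", "yes", "no"),
               tree3 = c("yes", "no", "yes", "no"))

# Hypothetical weights: classifiers assumed more reliable get a larger say
w <- c(tree1 = 4, tree2 = 3, tree3 = 2)

# Total weight in favour of "yes" for each observation
yes_weight <- as.vector((1 * (votes == "yes")) %*% w)

# "yes" wins when it gathers more than half of the total weight
combined <- ifelse(yes_weight > sum(w) / 2, "yes", "no")
combined
```

In the third observation the two lighter classifiers outvote the heaviest one, and in the fourth the heaviest classifier alone is not enough to carry the vote.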
The C5.0() function makes boosting very easy to use: the trials argument simply specifies the number of trees to use. Ten trees are commonly used, which, according to some studies, tends to reduce the error rate by roughly 25%.
trials sets the upper limit on the number of trees to add; if adding trees does not improve accuracy significantly, no more trees are added (see the earlyStopping option of C5.0Control(), which is TRUE by default).
C50_churn_model_boost10 <- C5.0(x = churn_train[-20],
y = churn_train$churn,
trials = 10,
rules = FALSE, # Default
weights = NULL, # Default
control = C5.0Control(), # Default
costs = NULL # Default
)
Let's examine the resulting model. Note that a few more lines appear:
C50_churn_model_boost10
##
## Call:
## C5.0.default(x = churn_train[-20], y = churn_train$churn, trials = 10, rules
## = FALSE, weights = NULL, control = C5.0Control(), costs = NULL)
##
## Classification Tree
## Number of samples: 4500
## Number of predictors: 19
##
## Number of boosting iterations: 10
## Average tree size: 30.1
##
## Non-standard options: attempt to group attributes
Indeed, the number of boosting iterations (trials) and the average tree size now appear.
With summary() we can inspect each of the trees:
summary(C50_churn_model_boost10)
##
## Call:
## C5.0.default(x = churn_train[-20], y = churn_train$churn, trials = 10, rules
## = FALSE, weights = NULL, control = C5.0Control(), costs = NULL)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 30 12:32:35 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 4500 cases (20 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## number_customer_service_calls > 3:
## :...total_day_minutes <= 162.7:
## : :...total_eve_charge <= 19.83: yes (105/5)
## : : total_eve_charge > 19.83:
## : : :...total_day_minutes <= 134.5: yes (17/1)
## : : total_day_minutes > 134.5: no (16/3)
## : total_day_minutes > 162.7:
## : :...international_plan = yes:
## : :...total_night_calls <= 96: yes (10/1)
## : : total_night_calls > 96: no (11/3)
## : international_plan = no:
## : :...total_day_minutes > 263.4:
## : :...voice_mail_plan = no: yes (15/2)
## : : voice_mail_plan = yes: no (5)
## : total_day_minutes <= 263.4:
## : :...total_eve_charge > 11.48: no (158/21)
## : total_eve_charge <= 11.48:
## : :...total_day_minutes <= 201.3: yes (10)
## : total_day_minutes > 201.3: no (9/3)
## number_customer_service_calls <= 3:
## :...total_day_minutes <= 244.6:
## :...international_plan = yes:
## : :...total_intl_minutes > 13: yes (60)
## : : total_intl_minutes <= 13:
## : : :...total_intl_calls <= 2: yes (48)
## : : total_intl_calls > 2: no (214/5)
## : international_plan = no:
## : :...total_day_minutes <= 220.8: no (2932/72)
## : total_day_minutes > 220.8:
## : :...total_eve_charge <= 22.7: no (368/18)
## : total_eve_charge > 22.7:
## : :...voice_mail_plan = no: yes (35/4)
## : voice_mail_plan = yes: no (11)
## total_day_minutes > 244.6:
## :...voice_mail_plan = yes: no (115/8)
## voice_mail_plan = no:
## :...total_eve_minutes > 201:
## :...total_night_charge > 9.5: yes (75)
## : total_night_charge <= 9.5:
## : :...total_day_minutes > 264.6: yes (57/3)
## : total_day_minutes <= 264.6:
## : :...total_eve_minutes <= 242.4: no (25/3)
## : total_eve_minutes > 242.4: yes (21/5)
## total_eve_minutes <= 201:
## :...total_day_minutes <= 277.7:
## :...international_plan = no: no (112/12)
## : international_plan = yes:
## : :...total_intl_calls <= 2: yes (6)
## : total_intl_calls > 2: no (14/3)
## total_day_minutes > 277.7:
## :...total_eve_minutes > 167.3: yes (23)
## total_eve_minutes <= 167.3:
## :...total_night_charge > 9.31: yes (10)
## total_night_charge <= 9.31:
## :...total_day_minutes <= 303: no (15)
## total_day_minutes > 303: yes (3)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## number_customer_service_calls > 3:
## :...state in {AK,AR,CA,CT,DC,DE,FL,GA,IA,ID,KS,KY,LA,MA,ME,MI,MN,MO,MS,MT,NC,
## : : ND,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TX,VT,WA,WI,WV,
## : : WY}: yes (416.8/109.8)
## : state in {AL,AZ,CO,HI,IL,IN,MD,NE,TN,UT,VA}: no (98.7/16.7)
## number_customer_service_calls <= 3:
## :...total_day_minutes > 236.3:
## :...total_night_charge <= 7.32: no (164.2/31.4)
## : total_night_charge > 7.32:
## : :...voice_mail_plan = no: yes (444.3/118.8)
## : voice_mail_plan = yes: no (129.1/49.3)
## total_day_minutes <= 236.3:
## :...international_plan = yes:
## :...total_intl_charge <= 3.51: no (218.1/61.6)
## : total_intl_charge > 3.51: yes (42.6)
## international_plan = no:
## :...total_eve_minutes <= 167: no (682.2/70.4)
## total_eve_minutes > 167:
## :...total_day_calls <= 77: no (268.3/30.4)
## total_day_calls > 77:
## :...state in {AK,AR,CO,CT,DE,GA,HI,IA,IL,KS,KY,MA,MO,NM,OK,PA,
## : RI,SD}: no (517.9/5.3)
## state in {AL,AZ,CA,DC,FL,ID,IN,LA,MD,ME,MI,MN,MS,MT,NC,ND,
## : NE,NH,NJ,NV,NY,OH,OR,SC,TN,TX,UT,VA,VT,WA,WI,WV,
## : WY}:
## :...total_eve_charge > 27.69: yes (28/6.1)
## total_eve_charge <= 27.69:
## :...account_length > 151: no (192.6/93)
## account_length <= 151:
## :...total_night_calls > 135: no (30.4)
## total_night_calls <= 135:
## :...total_eve_calls > 137: no (29.6)
## total_eve_calls <= 137:
## :...total_intl_calls > 7: no (115.9/15.6)
## total_intl_calls <= 7:
## :...total_day_minutes > 210.5:
## :...state in {AL,CA,FL,MI,MS,NV,OH,
## : : OR,SC,TX,UT,VA,VT,WA,
## : : WI}: yes (180.7/64.2)
## : state in {AZ,DC,ID,IN,LA,MD,ME,
## : MN,MT,NC,ND,NE,NH,NJ,
## : NY,TN,WV,
## : WY}: no (72.2/3.8)
## total_day_minutes <= 210.5:
## :...state in {AL,FL,MD,MI,NV,NY,SC,
## : VT,
## : WI}: no (182.4)
## state in {AZ,CA,DC,ID,IN,LA,ME,
## : MN,MS,MT,NC,ND,NE,NH,
## : NJ,OH,OR,TN,TX,UT,VA,
## : WA,WV,WY}: [S1]
##
## SubTree [S1]
##
## total_day_minutes > 202.4: no (33.4)
## total_day_minutes <= 202.4:
## :...total_day_minutes > 198.4: yes (50.4/15.2)
## total_day_minutes <= 198.4:
## :...total_eve_charge > 17.87: no (266.1/56.3)
## total_eve_charge <= 17.87:
## :...state in {AZ,ID,IN,LA,NH,NJ,OR,VA,WY}: no (75.2)
## state in {CA,DC,ME,MN,MS,MT,NC,ND,NE,OH,TN,TX,UT,WA,WV}:
## :...total_eve_charge <= 17.75: no (213.3/98.6)
## total_eve_charge > 17.75: yes (47.6/5.3)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## international_plan = yes:
## :...total_intl_calls <= 2: yes (147.2)
## : total_intl_calls > 2:
## : :...total_intl_minutes <= 13: no (278.8/73.9)
## : total_intl_minutes > 13: yes (78.1)
## international_plan = no:
## :...total_day_minutes > 266:
## :...total_eve_charge <= 13.32: no (73.9/11.4)
## : total_eve_charge > 13.32: yes (204.1/46.3)
## total_day_minutes <= 266:
## :...total_intl_minutes <= 3.8: no (50.5/0.6)
## total_intl_minutes > 3.8:
## :...total_day_calls > 146: yes (48.3/18.6)
## total_day_calls <= 146:
## :...state in {AK,AL,AZ,CT,DE,FL,HI,IL,KS,MA,MO,NH,NM,OK,PA,RI,SD,
## : VA,VT,WI}: no (1188.3/154.4)
## state in {AR,CA,CO,DC,GA,IA,ID,IN,KY,LA,MD,ME,MI,MN,MS,MT,NC,
## : ND,NE,NJ,NV,NY,OH,OR,SC,TN,TX,UT,WA,WV,WY}:
## :...number_vmail_messages > 36: no (89.9/6.9)
## number_vmail_messages <= 36:
## :...total_eve_calls <= 63: no (49.2/3.2)
## total_eve_calls > 63:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 120.5: yes (27.9)
## : total_day_minutes > 120.5:
## : :...total_eve_charge > 23: no (25.6)
## : total_eve_charge <= 23:
## : :...total_intl_calls > 7: no (17.8/1.8)
## : total_intl_calls <= 7:
## : :...total_day_minutes <= 148.6: yes (23.4)
## : total_day_minutes > 148.6:
## : :...state in {AR,DC,GA,IA,KY,LA,MI,
## : : MN,MS,MT,NE,NJ,NV,OH,
## : : TN,TX,WA,
## : : WY}: yes (178.6/62.2)
## : state in {CA,CO,ID,IN,MD,ME,NC,
## : ND,NY,OR,SC,UT,
## : WV}: no (73.3/8.9)
## number_customer_service_calls <= 3:
## :...total_day_minutes <= 78.4: no (37)
## total_day_minutes > 78.4:
## :...total_day_minutes <= 82.7: yes (31/6.1)
## total_day_minutes > 82.7:
## :...total_day_charge <= 16.37: no (35.6)
## total_day_charge > 16.37:
## :...total_eve_minutes > 243:
## :...state in {DC,GA,IA,ID,IN,KY,LA,
## : : ME,MS,NC,OH,
## : : TN}: no (95.2/8.5)
## : state in {AR,CA,CO,MD,MI,MN,MT,
## : : ND,NE,NJ,NV,NY,OR,SC,
## : : TX,UT,WA,WV,WY}: [S1]
## total_eve_minutes <= 243:
## :...account_length <= 48: no (117.3/0.6)
## account_length > 48:
## :...account_length > 128: [S2]
## account_length <= 128: [S3]
##
## SubTree [S1]
##
## number_vmail_messages > 23: no (30.8/5.7)
## number_vmail_messages <= 23:
## :...total_night_charge <= 9.02: no (85.5/31.9)
## total_night_charge > 9.02: yes (205.9/37.5)
##
## SubTree [S2]
##
## state in {AR,CA,CO,GA,IA,KY,ME,MI,ND,NE,NV,TX}: no (77.2)
## state in {DC,ID,IN,LA,MD,MN,MS,MT,NC,NJ,NY,OH,OR,SC,TN,UT,WA,WV,WY}:
## :...total_day_calls <= 91: no (53/8.3)
## total_day_calls > 91:
## :...total_intl_minutes > 15.2: yes (23.5/1.2)
## total_intl_minutes <= 15.2:
## :...total_intl_minutes <= 12.2: yes (225.8/76.9)
## total_intl_minutes > 12.2: no (26.1)
##
## SubTree [S3]
##
## total_eve_charge > 17.87: no (214.5/20.1)
## total_eve_charge <= 17.87:
## :...total_intl_minutes > 13.9: no (40)
## total_intl_minutes <= 13.9:
## :...total_eve_charge <= 14.43: no (239.5/30.6)
## total_eve_charge > 14.43:
## :...state in {AR,CO,DC,IA,IN,KY,LA,MD,NV,NY,OR,UT,WA,
## : WY}: no (106.9)
## state in {CA,GA,ID,ME,MI,MN,MS,MT,NC,ND,NE,NJ,OH,SC,TN,TX,WV}:
## :...total_day_calls > 121: yes (54.4/9.8)
## total_day_calls <= 121:
## :...account_length <= 105: yes (202.1/80.8)
## account_length > 105: no (43.9)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## international_plan = yes:
## :...total_intl_calls <= 2: yes (117.3)
## : total_intl_calls > 2:
## : :...total_intl_minutes > 13: yes (62.2)
## : total_intl_minutes <= 13:
## : :...number_customer_service_calls > 4: yes (20.1/1.6)
## : number_customer_service_calls <= 4:
## : :...state in {AK,AL,CA,DC,FL,HI,IA,ID,IL,IN,KY,LA,MI,MN,NC,ND,NE,
## : : NJ,NM,NV,NY,OK,OR,PA,RI,SC,UT,VT,WA,WV,
## : : WY}: no (96.3)
## : state in {AR,AZ,CO,CT,DE,GA,KS,MA,MD,ME,MO,MS,MT,NH,OH,SD,TN,
## : TX,VA,WI}: yes (166.4/65.3)
## international_plan = no:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 180.8: yes (284.5/85)
## : total_day_minutes > 180.8:
## : :...total_eve_minutes <= 135.1: yes (40.3/14.7)
## : total_eve_minutes > 135.1:
## : :...total_night_charge <= 11.41: no (216.9/25.2)
## : total_night_charge > 11.41: yes (53.8/22)
## number_customer_service_calls <= 3:
## :...total_day_minutes > 221.8:
## :...total_day_charge > 53.65: yes (25.2)
## : total_day_charge <= 53.65:
## : :...voice_mail_plan = yes:
## : :...state in {AK,AL,AR,AZ,CA,CO,CT,DC,DE,GA,HI,IA,ID,IL,IN,KS,
## : : : KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NM,NY,
## : : : OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,
## : : : WY}: no (189.6)
## : : state in {FL,NJ,NV}: yes (28.1/7.9)
## : voice_mail_plan = no:
## : :...total_eve_charge > 18.21: yes (326.7/98.9)
## : total_eve_charge <= 18.21:
## : :...total_intl_minutes <= 14.7: no (452.8/115.3)
## : total_intl_minutes > 14.7: yes (36.7/7.1)
## total_day_minutes <= 221.8:
## :...state in {AK,AL,AR,CT,DE,FL,HI,IA,KS,KY,MA,MO,NM,OK,
## : PA}: no (422.7)
## state in {AZ,CA,CO,DC,GA,ID,IL,IN,LA,MD,ME,MI,MN,MS,MT,NC,ND,NE,NH,
## : NJ,NV,NY,OH,OR,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY}:
## :...total_day_minutes <= 78.5: no (41.2)
## total_day_minutes > 78.5:
## :...total_day_calls <= 59: no (25.4)
## total_day_calls > 59:
## :...total_day_calls <= 63: yes (34/14.2)
## total_day_calls > 63:
## :...account_length <= 57: no (191.6/17.7)
## account_length > 57:
## :...total_intl_minutes <= 8.2: no (341.4/43.4)
## total_intl_minutes > 8.2:
## :...total_eve_charge <= 11.42: no (66.4)
## total_eve_charge > 11.42:
## :...total_eve_charge <= 12.26: yes (75.5/23.9)
## total_eve_charge > 12.26:
## :...total_day_calls <= 74: no (57.3)
## total_day_calls > 74: [S1]
##
## SubTree [S1]
##
## total_night_calls <= 64: no (29.7)
## total_night_calls > 64:
## :...total_night_charge <= 6.52: no (110.3/6.6)
## total_night_charge > 6.52:
## :...total_eve_charge <= 20.91:
## :...state in {CO,GA,IL,MI,NH,NJ,NV,OR,RI,SD,WI,WV,WY}: no (163.6)
## : state in {AZ,CA,DC,ID,IN,LA,MD,ME,MN,MS,MT,NC,ND,NE,NY,OH,SC,TN,TX,
## : : UT,VA,VT,WA}:
## : :...number_vmail_messages <= 32: no (530.1/174.3)
## : number_vmail_messages > 32: yes (55/18.2)
## total_eve_charge > 20.91:
## :...total_day_calls > 121: no (17.6)
## total_day_calls <= 121:
## :...state in {AZ,CA,CO,DC,GA,ID,IL,IN,LA,MD,ME,MS,NC,ND,NY,OH,OR,
## : SC,SD,VT}: no (35)
## state in {MI,MN,MT,NE,NH,NJ,NV,RI,TN,TX,UT,VA,WA,WI,WV,
## WY}: yes (186.2/56.9)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## total_day_minutes > 253.1:
## :...total_day_charge > 53.65: yes (30.2)
## : total_day_charge <= 53.65:
## : :...voice_mail_plan = yes: no (138.5/31.9)
## : voice_mail_plan = no:
## : :...total_eve_charge <= 15.74: no (211.2/86.3)
## : total_eve_charge > 15.74:
## : :...total_night_charge <= 6.5: no (44.7/9.6)
## : total_night_charge > 6.5: yes (210.5/19.3)
## total_day_minutes <= 253.1:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 138.7: yes (91.2/13.9)
## : total_day_minutes > 138.7:
## : :...total_eve_charge <= 16.21: yes (188.3/81.5)
## : total_eve_charge > 16.21:
## : :...number_vmail_messages > 26: no (62.1/0.4)
## : number_vmail_messages <= 26:
## : :...international_plan = no: no (226.1/53.6)
## : international_plan = yes: yes (26.2/6.8)
## number_customer_service_calls <= 3:
## :...total_day_calls > 147: no (43.6/6.7)
## total_day_calls <= 147:
## :...international_plan = yes:
## :...total_intl_calls <= 2: yes (64.8)
## : total_intl_calls > 2:
## : :...total_intl_minutes > 13.1: yes (26.3)
## : total_intl_minutes <= 13.1:
## : :...total_eve_charge <= 23.47: no (206.7/9.7)
## : total_eve_charge > 23.47: yes (42.8/14.1)
## international_plan = no:
## :...state in {AK,AL,AR,DE,FL,GA,HI,IA,MA,ME,MI,MO,NH,NM,NV,OK,PA,
## : RI,SD,TN,VA,VT,WI,WV}: no (1091.2/149.8)
## state in {AZ,CA,CO,CT,DC,ID,IL,IN,KS,KY,LA,MD,MN,MS,MT,NC,ND,
## : NE,NJ,NY,OH,OR,SC,TX,UT,WA,WY}:
## :...number_vmail_messages > 36: no (76.5/5.6)
## number_vmail_messages <= 36:
## :...total_night_charge <= 5.38: no (74.4/5.6)
## total_night_charge > 5.38:
## :...total_day_calls > 128: yes (120.7/57.5)
## total_day_calls <= 128:
## :...total_night_calls > 135: no (57.6)
## total_night_calls <= 135:
## :...total_day_minutes > 236.3: yes (169.9/84.2)
## total_day_minutes <= 236.3:
## :...total_eve_calls <= 75: no (138.8/61.3)
## total_eve_calls > 75:
## :...total_night_calls <= 72: no (57.1)
## total_night_calls > 72:
## :...state in {AZ,CO,CT,KY,MD,
## : NY}: no (126.9/5.9)
## state in {CA,DC,ID,IL,IN,KS,LA,
## : MN,MS,MT,NC,ND,NE,NJ,
## : OH,OR,SC,TX,UT,WA,WY}: [S1]
##
## SubTree [S1]
##
## number_vmail_messages > 28: no (94.1/10.6)
## number_vmail_messages <= 28:
## :...total_day_minutes > 226.1: no (65.3/4.8)
## total_day_minutes <= 226.1:
## :...total_day_calls <= 74: no (58.5/5.3)
## total_day_calls > 74:
## :...total_night_minutes > 294.8: yes (32.2/7)
## total_night_minutes <= 294.8:
## :...total_eve_charge <= 12.11: no (63.4/5.3)
## total_eve_charge > 12.11:
## :...state in {KS,MN,MT,NC,ND,NE,NJ,OR,UT,WY}: no (272/56.6)
## state in {CA,DC,ID,IL,IN,LA,MS,OH,SC,TX,WA}:
## :...total_eve_charge <= 17.06: yes (225.5/85.1)
## total_eve_charge > 17.06: no (162.7/48.7)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## total_day_charge > 45.1:
## :...voice_mail_plan = yes: no (108.1/37.2)
## : voice_mail_plan = no:
## : :...total_eve_charge <= 11.75: no (39.2/10.7)
## : total_eve_charge > 11.75: yes (288.4/50.1)
## total_day_charge <= 45.1:
## :...number_customer_service_calls > 4:
## :...total_day_minutes <= 135.7: yes (28.9)
## : total_day_minutes > 135.7: no (220.5/99.6)
## number_customer_service_calls <= 4:
## :...total_eve_charge > 21.56:
## :...total_day_charge <= 35.31: no (422.7/115.5)
## : total_day_charge > 35.31:
## : :...voice_mail_plan = yes: no (25.7/0.3)
## : voice_mail_plan = no:
## : :...total_night_charge <= 7.85: no (57.5/14.2)
## : total_night_charge > 7.85: yes (191.9/25.3)
## total_eve_charge <= 21.56:
## :...total_night_charge <= 3.82: yes (44.9/19.5)
## total_night_charge > 3.82:
## :...international_plan = yes:
## :...total_intl_calls <= 2: yes (47.1)
## : total_intl_calls > 2:
## : :...total_intl_minutes <= 13: no (195.4/15.4)
## : total_intl_minutes > 13: yes (29.8)
## international_plan = no:
## :...total_intl_minutes <= 4: no (29.2)
## total_intl_minutes > 4:
## :...total_eve_calls <= 66: no (80.2/3.9)
## total_eve_calls > 66:
## :...total_eve_calls <= 68: yes (38.6/14.5)
## total_eve_calls > 68:
## :...total_night_calls <= 78:
## :...state in {AK,AL,AR,AZ,DC,FL,GA,HI,IA,ID,IL,
## : : IN,KS,MA,MI,MN,MO,MS,MT,NE,NH,NM,
## : : NV,NY,OH,OK,PA,RI,TX,VA,VT,WA,WI,
## : : WY}: no (184.2/4.4)
## : state in {CA,CO,CT,DE,KY,LA,MD,ME,NC,ND,NJ,
## : OR,SC,SD,TN,UT,
## : WV}: yes (190.8/70.7)
## total_night_calls > 78:
## :...total_eve_charge <= 14.19: no (571.2/63.7)
## total_eve_charge > 14.19:
## :...state in {AZ,CO,DE,HI,IL,KY,MA,MO,NH,
## : NM,OK,PA,SD,TN,WV,
## : WY}: no (297.1/10.4)
## state in {AK,AL,AR,CA,CT,DC,FL,GA,IA,
## : ID,IN,KS,LA,MD,ME,MI,MN,MS,
## : MT,NC,ND,NE,NJ,NV,NY,OH,OR,
## : RI,SC,TX,UT,VA,VT,WA,WI}:
## :...account_length > 174: no (30.7)
## account_length <= 174: [S1]
##
## SubTree [S1]
##
## number_customer_service_calls > 3: no (142.9/59.1)
## number_customer_service_calls <= 3:
## :...total_night_charge <= 5.39: no (30.4)
## total_night_charge > 5.39:
## :...account_length <= 95:
## :...state in {AK,CA,CT,FL,IN,LA,MD,ME,MI,NC,ND,NE,NJ,NV,NY,RI,SC,UT,
## : : VA}: no (211.9)
## : state in {AL,AR,DC,GA,IA,ID,KS,MN,MS,MT,OH,OR,TX,VT,WA,WI}:
## : :...total_intl_calls > 6: no (41/1.5)
## : total_intl_calls <= 6:
## : :...total_intl_calls <= 4: no (156.4/40.3)
## : total_intl_calls > 4: yes (99.4/39.3)
## account_length > 95:
## :...state in {AK,AL,AR,GA,IA,KS,MN,NV,RI,VT,WI}: no (90.3/0.3)
## state in {CA,CT,DC,FL,ID,IN,LA,MD,ME,MI,MS,MT,NC,ND,NE,NJ,NY,OH,OR,
## : SC,TX,UT,VA,WA}:
## :...total_day_calls <= 79: no (42.2/4.6)
## total_day_calls > 79:
## :...number_vmail_messages > 35: yes (38.7/11.4)
## number_vmail_messages <= 35:
## :...number_vmail_messages > 28: no (33.5)
## number_vmail_messages <= 28:
## :...total_intl_minutes <= 9.9: no (173.9/45.3)
## total_intl_minutes > 9.9:
## :...state in {FL,ME,NC,ND,UT,VA}: no (28.3)
## state in {CA,CT,DC,ID,IN,LA,MD,MI,MS,MT,NE,NJ,
## : NY,OH,OR,SC,TX,WA}:
## :...total_day_charge <= 16.15: no (14.2)
## total_day_charge > 16.15:
## :...number_vmail_messages > 26: yes (21.1/2)
## number_vmail_messages <= 26:
## :...total_eve_calls <= 85: yes (48/7.4)
## total_eve_calls > 85: no (206/94.6)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## number_customer_service_calls > 3:
## :...total_day_minutes <= 173.5: yes (347.9/83.5)
## : total_day_minutes > 173.5:
## : :...total_eve_calls > 123: no (49.9/3.7)
## : total_eve_calls <= 123:
## : :...total_intl_calls > 6: no (34.3/3.5)
## : total_intl_calls <= 6:
## : :...total_day_calls > 127: yes (26.5/2.7)
## : total_day_calls <= 127:
## : :...total_day_calls > 124: no (18.1)
## : total_day_calls <= 124:
## : :...state in {AK,AL,AZ,CA,CT,FL,HI,IL,KY,MA,MO,MT,NE,NH,NM,
## : : NV,NY,OH,OK,PA,RI,SC,SD,TN,UT,VA,VT,
## : : WI}: no (74.3/2)
## : state in {AR,CO,DC,DE,GA,IA,ID,IN,KS,LA,MD,ME,MI,MN,MS,
## : NC,ND,NJ,OR,TX,WA,WV,WY}: yes (206.8/84.9)
## number_customer_service_calls <= 3:
## :...total_day_minutes > 253.5:
## :...total_night_charge > 9.41: yes (231.2/57.4)
## : total_night_charge <= 9.41:
## : :...total_eve_charge <= 17.44: no (159.9/43.8)
## : total_eve_charge > 17.44: yes (142.2/44.1)
## total_day_minutes <= 253.5:
## :...total_eve_charge > 21.56:
## :...number_vmail_messages > 32: no (28.3/1.7)
## : number_vmail_messages <= 32:
## : :...total_night_minutes > 232.3: yes (171.7/56.2)
## : total_night_minutes <= 232.3:
## : :...state in {AK,AL,AZ,CO,CT,DC,DE,FL,HI,IA,KS,KY,LA,MA,ME,MN,
## : : MS,MT,ND,NE,NJ,OH,OK,OR,PA,RI,SD,TN,UT,VT,
## : : WA}: no (190.8/23.8)
## : state in {AR,CA,GA,ID,IL,IN,MD,MI,MO,NC,NH,NM,NV,NY,SC,TX,
## : VA,WI,WV,WY}: yes (227.2/77.7)
## total_eve_charge <= 21.56:
## :...total_day_calls <= 62: no (65.5/2.2)
## total_day_calls > 62:
## :...total_eve_calls <= 66: no (59/3.5)
## total_eve_calls > 66:
## :...state in {AK,DE,HI,IA,MA,MO,NH,NM,NV,OK,PA,
## : RI}: no (254.2/17.8)
## state in {AL,AR,AZ,CA,CO,CT,DC,FL,GA,ID,IL,IN,KS,KY,LA,MD,
## : ME,MI,MN,MS,MT,NC,ND,NE,NJ,NY,OH,OR,SC,SD,TN,TX,
## : UT,VA,VT,WA,WI,WV,WY}:
## :...total_night_calls <= 70: no (179.8/18.7)
## total_night_calls > 70:
## :...account_length > 119:
## :...state in {AL,AR,AZ,CA,CT,GA,KS,KY,ME,MI,SD,TN,
## : : TX,VA,VT,WI,WV}: no (164.4/8.1)
## : state in {CO,DC,FL,ID,IL,IN,LA,MD,MN,MS,MT,NC,
## : : ND,NE,NJ,NY,OH,OR,SC,UT,WA,WY}:
## : :...total_intl_minutes <= 5.8: yes (43.2/9.5)
## : total_intl_minutes > 5.8: no (519.7/192.2)
## account_length <= 119:
## :...state in {CO,FL,IL,KY,NY,OR,SC,WV,
## : WY}: no (220.3/5.4)
## state in {AL,AR,AZ,CA,CT,DC,GA,ID,IN,KS,LA,MD,
## : ME,MI,MN,MS,MT,NC,ND,NE,NJ,OH,SD,TN,
## : TX,UT,VA,VT,WA,WI}:
## :...total_eve_charge <= 11.34: no (52.6/1.3)
## total_eve_charge > 11.34:
## :...total_intl_minutes <= 7.3: no (100.1/4.7)
## total_intl_minutes > 7.3:
## :...total_night_charge <= 7.47: no (183.7/19.1)
## total_night_charge > 7.47:
## :...total_day_calls > 129: no (52.4/3.9)
## total_day_calls <= 129: [S1]
##
## SubTree [S1]
##
## total_eve_calls <= 77: yes (76.7/32.4)
## total_eve_calls > 77:
## :...state in {AZ,DC,GA,ID,KS,MI,UT}: no (101.7/3.4)
## state in {AL,AR,CA,CT,IN,LA,MD,ME,MN,MS,MT,NC,ND,NE,NJ,OH,SD,TN,TX,VA,VT,
## : WA,WI}:
## :...total_eve_calls <= 103: no (235/58.5)
## total_eve_calls > 103:
## :...total_night_calls <= 84: yes (71.5/15.6)
## total_night_calls > 84: no (211.1/80.4)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## total_day_minutes > 283.9: yes (167.5/49.9)
## total_day_minutes <= 283.9:
## :...international_plan = yes:
## :...total_intl_calls <= 2: yes (135.9)
## : total_intl_calls > 2:
## : :...total_intl_minutes <= 13: no (296.6/73.1)
## : total_intl_minutes > 13: yes (108.1)
## international_plan = no:
## :...number_customer_service_calls > 3:
## :...total_day_minutes > 188: no (271.2/66.3)
## : total_day_minutes <= 188:
## : :...state in {AK,AR,CA,CT,DE,FL,ID,KS,KY,LA,MA,MN,MO,MS,MT,ND,NE,
## : : NH,NJ,NM,NV,NY,OK,PA,RI,SC,SD,TN,UT,
## : : WA}: yes (198.8/20.9)
## : state in {AL,AZ,CO,DC,GA,HI,IA,IL,IN,MD,ME,MI,NC,OH,OR,TX,VA,
## : VT,WI,WV,WY}: no (186.9/51.8)
## number_customer_service_calls <= 3:
## :...total_eve_minutes <= 167: no (614.4/82.7)
## total_eve_minutes > 167:
## :...voice_mail_plan = yes: no (588.1/97.5)
## voice_mail_plan = no:
## :...total_day_minutes <= 210.5:
## :...state in {AK,AL,AR,CT,DE,FL,GA,HI,IA,IL,KS,KY,MA,MO,ND,
## : : NH,NJ,NM,NV,OK,PA,RI,SD,UT,
## : : VT}: no (322.2)
## : state in {AZ,CA,CO,DC,ID,IN,LA,MD,ME,MI,MN,MS,MT,NC,NE,
## : : NY,OH,OR,SC,TN,TX,VA,WA,WI,WV,WY}:
## : :...account_length > 151: yes (135.8/60.6)
## : account_length <= 151:
## : :...total_intl_minutes <= 7.2: no (60.8/8.7)
## : total_intl_minutes > 7.2:
## : :...total_intl_minutes <= 7.4: yes (30.8/4.1)
## : total_intl_minutes > 7.4: no (646.1/173.2)
## total_day_minutes > 210.5:
## :...total_night_charge <= 8.55:
## :...total_eve_charge <= 22.85: no (240.1/44.8)
## : total_eve_charge > 22.85: yes (23.3/5)
## total_night_charge > 8.55:
## :...total_day_charge > 45.17: yes (38.1)
## total_day_charge <= 45.17:
## :...total_eve_charge > 23.53: yes (54.5/5.1)
## total_eve_charge <= 23.53:
## :...account_length <= 36: no (18.8/0.8)
## account_length > 36:
## :...state in {IL,OK}: yes (0)
## state in {AK,AZ,CO,DC,FL,HI,LA,MA,MT,
## : NC,NE,NH,NM,PA,RI,SD,VA,WI,
## : WY}: no (51.3)
## state in {AL,AR,CA,CT,DE,GA,IA,ID,IN,
## : KS,KY,MD,ME,MI,MN,MO,MS,ND,
## : NJ,NV,NY,OH,OR,SC,TN,TX,UT,
## : VT,WA,WV}:
## :...total_intl_calls <= 6: yes (270.3/87.2)
## total_intl_calls > 6: no (35.3/11.5)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## international_plan = yes:
## :...total_intl_calls <= 2: yes (116.4)
## : total_intl_calls > 2:
## : :...total_intl_minutes > 13: yes (90.5)
## : total_intl_minutes <= 13:
## : :...number_customer_service_calls <= 3: no (258.7/84.8)
## : number_customer_service_calls > 3: yes (89.6/24)
## international_plan = no:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 134.6: yes (113.1/7.2)
## : total_day_minutes > 134.6:
## : :...total_day_calls > 138: yes (26.4/1.6)
## : total_day_calls <= 138:
## : :...total_eve_charge <= 11.48: yes (89/22.2)
## : total_eve_charge > 11.48:
## : :...voice_mail_plan = yes: no (116.1/18.8)
## : voice_mail_plan = no:
## : :...total_day_minutes <= 160.5: yes (97.5/24.1)
## : total_day_minutes > 160.5:
## : :...total_day_minutes <= 241.5: no (233.2/62.7)
## : total_day_minutes > 241.5: yes (84.2/21.1)
## number_customer_service_calls <= 3:
## :...total_night_minutes <= 116.9: no (133.1)
## total_night_minutes > 116.9:
## :...total_day_minutes > 241.9:
## :...voice_mail_plan = yes: no (173.3/20.3)
## : voice_mail_plan = no:
## : :...total_day_charge > 51.07: yes (33.1)
## : total_day_charge <= 51.07:
## : :...total_eve_charge <= 15.75:
## : :...total_intl_minutes <= 13.6: no (173.6/29.7)
## : : total_intl_minutes > 13.6: yes (44/7.9)
## : total_eve_charge > 15.75:
## : :...total_intl_minutes <= 7.2: no (43.4/9.9)
## : total_intl_minutes > 7.2: yes (214.6/40.3)
## total_day_minutes <= 241.9:
## :...total_eve_charge <= 14.43: no (400.8/11)
## total_eve_charge > 14.43:
## :...state in {AK,AZ,FL,HI,IA,IL,KY,MA,NH,OH,OK,PA,SD,VA,
## : WI}: no (317.1/0.7)
## state in {AL,AR,CA,CO,CT,DC,DE,GA,ID,IN,KS,LA,MD,ME,MI,MN,
## : MO,MS,MT,NC,ND,NE,NJ,NM,NV,NY,OR,RI,SC,TN,TX,UT,
## : VT,WA,WV,WY}:
## :...total_eve_calls <= 70: no (95.2/2.2)
## total_eve_calls > 70:
## :...total_eve_calls <= 73: yes (69.6/29.9)
## total_eve_calls > 73:
## :...total_night_calls <= 67: no (73.9)
## total_night_calls > 67:
## :...total_day_calls <= 72: no (94.1/3)
## total_day_calls > 72:
## :...total_intl_minutes <= 5.2: no (48/1.6)
## total_intl_minutes > 5.2:
## :...total_intl_minutes <= 5.8: yes (30.5/11.7)
## total_intl_minutes > 5.8:
## :...state in {CO,GA,OR,RI,
## : VT}: no (108.8)
## state in {AL,AR,CA,CT,DC,DE,ID,
## : IN,KS,LA,MD,ME,MI,MN,
## : MO,MS,MT,NC,ND,NE,NJ,
## : NM,NV,NY,SC,TN,TX,UT,
## : WA,WV,WY}: [S1]
##
## SubTree [S1]
##
## total_day_calls > 124: no (146.9/59.4)
## total_day_calls <= 124:
## :...number_vmail_messages > 28: no (81)
## number_vmail_messages <= 28:
## :...total_day_calls > 121: no (36.5)
## total_day_calls <= 121:
## :...total_eve_charge <= 20.68: no (513.7/119.4)
## total_eve_charge > 20.68:
## :...total_day_minutes <= 165.2: no (123.9/18.5)
## total_day_minutes > 165.2: yes (192.2/74.1)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## international_plan = yes:
## :...total_intl_calls <= 2: yes (96.2)
## : total_intl_calls > 2:
## : :...total_intl_minutes > 13: yes (74.8)
## : total_intl_minutes <= 13:
## : :...number_customer_service_calls > 4: yes (23.5)
## : number_customer_service_calls <= 4:
## : :...state in {AK,AL,CA,DC,FL,HI,IA,ID,IL,IN,KY,LA,MA,MI,MN,MO,NC,
## : : ND,NE,NJ,NM,NV,NY,OK,OR,PA,RI,SC,UT,VA,VT,WA,WV,
## : : WY}: no (105.2)
## : state in {AR,AZ,CO,CT,DE,GA,KS,MD,ME,MS,MT,NH,OH,SD,TN,TX,
## : WI}: yes (211.1/87.7)
## international_plan = no:
## :...number_customer_service_calls > 3:
## :...total_day_minutes <= 134.6: yes (87.6)
## : total_day_minutes > 134.6:
## : :...state in {AL,AZ,FL,HI,ID,IL,IN,MD,NC,ND,NE,NJ,NM,OK,PA,UT,VA,
## : : VT}: no (199.8/36.4)
## : state in {AR,CA,LA,MS,MT}: yes (56.5)
## : state in {AK,CO,CT,DC,DE,GA,IA,KS,KY,MA,ME,MI,MN,MO,NH,NV,NY,OH,OR,
## : : RI,SC,SD,TN,TX,WA,WI,WV,WY}:
## : :...total_day_calls <= 68: yes (38.8/4.3)
## : total_day_calls > 68:
## : :...total_eve_charge <= 10.75: yes (23.4)
## : total_eve_charge > 10.75:
## : :...total_intl_calls > 7: no (31.6/2.8)
## : total_intl_calls <= 7:
## : :...total_night_calls <= 108: no (209.2/80.8)
## : total_night_calls > 108: yes (155.8/47.6)
## number_customer_service_calls <= 3:
## :...total_day_minutes <= 208.3: no (1579.5/41.1)
## total_day_minutes > 208.3:
## :...voice_mail_plan = yes: no (253.6/7.2)
## voice_mail_plan = no:
## :...total_day_minutes > 265.9:
## :...total_eve_minutes <= 167.3: no (108.5/40.8)
## : total_eve_minutes > 167.3: yes (177.9/14.3)
## total_day_minutes <= 265.9:
## :...total_eve_charge > 22.69: yes (177.2/58.2)
## total_eve_charge <= 22.69:
## :...total_night_charge <= 7.25: no (135.3)
## total_night_charge > 7.25:
## :...total_eve_charge <= 14.19: no (72.1)
## total_eve_charge > 14.19:
## :...total_day_minutes > 253.5: yes (96.3/38.1)
## total_day_minutes <= 253.5:
## :...state in {AK,AL,AZ,CO,DC,FL,HI,IA,IL,IN,KY,
## : LA,MA,ME,MN,MO,MT,NC,NE,NH,NM,NV,
## : NY,OK,PA,RI,SC,SD,TN,VA,VT,WI,WV,
## : WY}: no (244/5.5)
## state in {AR,CA,CT,DE,GA,ID,KS,MD,MI,MS,ND,
## : NJ,OH,OR,TX,UT,WA}:
## :...total_intl_minutes > 12.8: yes (25/3.1)
## total_intl_minutes <= 12.8:
## :...total_night_charge <= 13.33: no (230.6/66.2)
## total_night_charge > 13.33: yes (13.6)
##
##
## Evaluation on training data (4500 cases):
##
## Trial     Decision Tree
## -----   ----------------
##           Size      Errors
##
##    0        29  172( 3.8%)
##    1        24  559(12.4%)
##    2        36  591(13.1%)
##    3        32  527(11.7%)
##    4        32  524(11.6%)
##    5        37  506(11.2%)
##    6        31  649(14.4%)
##    7        22  372( 8.3%)
##    8        33  348( 7.7%)
##    9        25  293( 6.5%)
## boost            101( 2.2%)   <<
##
##
##     (a)   (b)    <-classified as
##    ----  ----
##     531    94    (a): class yes
##       7  3868    (b): class no
##
##
## Attribute usage:
##
## 100.00% international_plan
## 100.00% total_day_minutes
## 100.00% total_day_charge
## 100.00% number_customer_service_calls
## 99.93% total_eve_charge
## 97.38% total_intl_minutes
## 95.84% state
## 93.69% total_day_calls
## 93.27% total_night_charge
## 89.80% total_eve_calls
## 87.73% total_eve_minutes
## 84.69% total_night_minutes
## 84.47% total_night_calls
## 75.69% voice_mail_plan
## 71.44% account_length
## 69.18% number_vmail_messages
## 45.76% total_intl_calls
## 6.89% total_intl_charge
##
##
## Time: 0.6 secs
We can see each of the trees that were built and their performance on the training set.
The new classifier misclassifies 101 of the 4500 observations in the training set (the boost row above), an error rate of 2.2% compared with the 4.53% that our previous model had on the training set. That is roughly a 50% reduction in training error, but what really matters is how the new model behaves on data it has not seen so far, the test set:
C50_predictions_boost10 <- predict(C50_churn_model_boost10, churn_test)
C50_cm_boost10 <- confusionMatrix(data = C50_predictions_boost10,
                                  reference = churn_test$churn)
C50_cm_boost10$table
## Reference
## Prediction yes no
## yes 55 3
## no 27 415
C50_cm_boost10
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 55 3
## no 27 415
##
## Accuracy : 0.94
## 95% CI : (0.9155, 0.9592)
## No Information Rate : 0.836
## P-Value [Acc > NIR] : 1.619e-12
##
## Kappa : 0.752
##
## Mcnemar's Test P-Value : 2.679e-05
##
## Sensitivity : 0.6707
## Specificity : 0.9928
## Pos Pred Value : 0.9483
## Neg Pred Value : 0.9389
## Prevalence : 0.1640
## Detection Rate : 0.1100
## Detection Prevalence : 0.1160
## Balanced Accuracy : 0.8318
##
## 'Positive' Class : yes
##
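Accuracy was reported as rising from 0.95 to 0.962, that is, the error rate fell from 0.05 for the previous model to 0.038 for the new one (a 24% improvement, practically what we expected), with sensitivity, specificity, positive predictive value and negative predictive value improving as well. Those figures do not match the output printed above, which gives an accuracy of 0.94 (30 errors in 500 test cases) and a sensitivity of 0.6707, presumably because it comes from a different random train/test split; expect the exact numbers to vary from run to run.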
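To make the statistics in the caret output less of a black box, here is a minimal sketch that recomputes them by hand from the confusion matrix printed above, using base R only. The object names (`cm`, `metrics`) and the helper `relative_error_reduction()` are illustrative and do not belong to caret or C50.

```r
# Test-set confusion matrix for the boosted model, copied from the output above
# (rows = prediction, columns = reference; "yes" is the positive class).
cm <- matrix(c(55, 27, 3, 415), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"),
                             Reference  = c("yes", "no")))

tp <- cm["yes", "yes"]  # churners correctly flagged
fp <- cm["yes", "no"]   # loyal customers flagged as churners
fn <- cm["no", "yes"]   # churners we missed
tn <- cm["no", "no"]    # loyal customers correctly left alone

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
print(round(metrics, 4))

# Hypothetical helper (not part of caret or C50):
# relative reduction when the error rate goes from old_error to new_error.
relative_error_reduction <- function(old_error, new_error) {
  (old_error - new_error) / old_error
}

# The 0.05 -> 0.038 comparison quoted in the text
print(relative_error_reduction(0.05, 0.038))
```

The rounded values should agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value and Neg Pred Value lines of the caret output, and the last call reproduces the 24% relative improvement mentioned in the text.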
Penalizing errors
Doing nothing to stop a customer who is about to leave from actually leaving can be an expensive mistake. One way to reduce the number of false negatives is to assign a penalty to each type of error, discouraging the tree from making the most heavily penalized ones. C5.0 allows this through a cost matrix that specifies how much we want to penalize each type of error.
Let's build that matrix. First, its dimensions:
cost_matrix_dims <- list(c("no", "yes"), c("no", "yes"))
names(cost_matrix_dims) <- c("predicted", "actual")
cost_matrix_dims
## $predicted
## [1] "no" "yes"
##
## $actual
## [1] "no" "yes"
Now, the penalties:
error_cost <- matrix(c(0,1,20,0), nrow = 2, dimnames = cost_matrix_dims)
error_cost
## actual
## predicted no yes
## no 0 20
## yes 1 0
As you can see, a correct classification has no cost, a false positive carries a penalty of 1 and a false negative costs 20. We can now build the model:
C50_churn_model_cost <- C5.0(x = churn_train[-20],        # predictors: drop column 20, the churn target
                             y = churn_train$churn,
                             trials = 1,                  # Default
                             rules = FALSE,               # Default
                             weights = NULL,              # Default
                             control = C5.0Control(),     # Default
                             costs = error_cost           # penalize false negatives 20 times more
                             )
Let's see how well it predicts:
C50_predictions_cost <- predict(C50_churn_model_cost, churn_test)
C50_cm_cost <- confusionMatrix(data = C50_predictions_cost,
                               reference = churn_test$churn)
C50_cm_cost$table
## Reference
## Prediction yes no
## yes 72 128
## no 10 290
C50_cm_cost
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 72 128
## no 10 290
##
## Accuracy : 0.724
## 95% CI : (0.6826, 0.7628)
## No Information Rate : 0.836
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3623
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8780
## Specificity : 0.6938
## Pos Pred Value : 0.3600
## Neg Pred Value : 0.9667
## Prevalence : 0.1640
## Detection Rate : 0.1440
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.7859
##
## 'Positive' Class : yes
##
As you can see, false negatives have dropped from 19 to 10 at the cost of raising false positives from 6 to 128, which has also caused a substantial drop in accuracy (to 0.724). This trade-off may or may not suit us; if it does not, we should experiment with the assigned costs to see whether we can get a result closer to our needs.
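One way to judge whether the trade-off pays off is to weight each test-set confusion matrix by the same cost matrix used for training. The sketch below does this in base R for the two test-set tables printed above. It assumes the illustrative penalties of 1 and 20 reflect real business costs, and `total_cost()` is a hypothetical helper, not a function from C50 or caret.

```r
# Cost matrix from the article: rows = predicted, columns = actual
error_cost <- matrix(c(0, 1, 20, 0), nrow = 2,
                     dimnames = list(predicted = c("no", "yes"),
                                     actual    = c("no", "yes")))

# Test-set confusion matrices copied from the outputs above, rearranged
# to the same layout as error_cost (rows = predicted, columns = actual)
cm_boost10 <- matrix(c(415, 3, 27, 55), nrow = 2,
                     dimnames = dimnames(error_cost))
cm_cost    <- matrix(c(290, 128, 10, 72), nrow = 2,
                     dimnames = dimnames(error_cost))

# Hypothetical helper: multiply every cell count by its penalty and add up
total_cost <- function(cm, costs) sum(cm * costs)

cost_boost10 <- total_cost(cm_boost10, error_cost)  # 27 FN * 20 + 3 FP * 1
cost_cost    <- total_cost(cm_cost, error_cost)     # 10 FN * 20 + 128 FP * 1

print(c(boosted = cost_boost10, cost_sensitive = cost_cost))

# Average cost per customer in the 500-case test set
print(c(boosted = cost_boost10, cost_sensitive = cost_cost) / sum(cm_cost))
```

Under these assumed penalties the cost-sensitive tree comes out at 328 against 543 for the boosted model, despite its much lower accuracy. With a different ratio between the two penalties the ranking can flip, which is why the costs are worth tuning.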