电信行业客户流失分析

实验目的

熟悉使用R语言进行大数据综合分析的方法

实验原理

R工具是一套完整的的数据处理计算和制图软件系统。其功能包括：数据存储和处理系统；数组运算工具（其向量、矩阵运算方面功能尤其强大）；完整连贯的统计分析工具；优秀的的统计制图功能；简便而强大的的编程语言：可操纵数据的输入和输出以及可实现分支、循环等用户自定义功能。

实验背景

客户在电信运营商中的地位十分重要。在电信业新的市场格局重新确定后，各大电信运营商间的竞争往往首先发生在对客户资源的争夺上。如何有效地挽留住现有客户是电信运营商匝当前日趋激烈的市场竞争中必须重视的环节。然而，电信运营商往往会因为套餐内容或者价格的制定失误导致大量的客户流失。本实验正是想通过数据集中客户对电信运营商服务的使用情况来找出那些导致客户流失的因素，并且做到能够针对某一个客户预测他是否是潜在流失的客户。

实验步骤

熟悉R环境；
打开R云件环境；
在相应编程环境中修改和运行代码；
查看结果。

实验一：数据读取

使用提供的电信行业客户流失数据

安装必要的包，如提示包不存在可以用install.packages("包名")的方式安装

library('C50')
library('sjmisc')
library('sjstats')
library('ggplot2')
library('pROC')
train <- read.csv("/data/tele_train_small.csv", header = T)
train_big <- read.csv("/data/tele_train_big.csv", header = T)
test <- read.csv("/data/tele_train_small.csv", header = T)

实验二：数据探索

str(train_big)
'data.frame':    1706496 obs. of  20 variables:
 $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
 $ account_length               : int  128 107 137 84 75 118 121 147 117 141 ...
 $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
 $ international_plan           : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
 $ voice_mail_plan              : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
 $ number_vmail_messages        : num  25 26 0 0 0 0 24 0 0 37 ...
 $ total_day_minutes            : num  265 162 243 299 167 ...
 $ total_day_calls              : int  110 123 114 71 113 98 88 79 97 84 ...
 $ total_day_charge             : num  45.1 27.5 41.4 50.9 28.3 ...
 $ total_eve_minutes            : num  197.4 195.5 121.2 61.9 148.3 ...
 $ total_eve_calls              : num  99 103 110 88 122 101 108 94 80 111 ...
 $ total_eve_charge             : num  16.78 16.62 10.3 5.26 12.61 ...
 $ total_night_minutes          : num  245 254 163 197 187 ...
 $ total_night_calls            : int  91 103 104 89 121 118 118 96 90 97 ...
 $ total_night_charge           : num  11.01 11.45 7.32 8.86 8.41 ...
 $ total_intl_minutes           : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
 $ total_intl_calls             : num  5 3 5 7 3 6 7 6 4 5 ...
 $ total_intl_charge            : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
 $ number_customer_service_calls: int  1 1 0 2 3 0 3 0 1 0 ...
 $ churn                        : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

数据集中包含了19个变量，共1706496行。其中变量洲(state)、国际长途计划(international_plan)、信箱语音计划(voice_mail_plan)和是否流失(churn)为因子变量，其余变量均为数值变量。

实验三：数据预处理

这里的区域编码变量(area_code)没有任何实际意义，故考虑排除该变量。

#剔除无意义的区域编码变量
train <- train[,-3]
train_big <- train_big[,-3]
test <- test[,-3]
#由于模型中，更关心的是流失这个结果(churn=yes)，所以对该因子进行排序
train$churn <- factor(train$churn,levels = c('no','yes'), order = TRUE)
train_big$churn <- factor(train_big$churn,levels = c('no','yes'), order = TRUE)
test$churn <- factor(test$churn, ,levels = c('no','yes'), order = TRUE)

实验四：分析建模

对train_big数据构建逻辑回归模型：

#构建Logistic模型
model <- glm(formula = churn ~ ., data = train_big, family = 'binomial')
summary(model)

Call:
glm(formula = churn ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9364  -0.4994  -0.3130  -0.1663   3.0951  

Coefficients:
                                Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                   -9.752e+00  4.311e-02 -226.186  < 2e-16 ***
stateAL                        3.578e-01  3.370e-02   10.616  < 2e-16 ***
stateAR                        9.213e-01  3.324e-02   27.717  < 2e-16 ***
stateAZ                        1.097e-01  3.734e-02    2.938  0.00331 ** 
stateCA                        1.820e+00  3.460e-02   52.600  < 2e-16 ***
stateCO                        6.682e-01  3.369e-02   19.836  < 2e-16 ***
stateCT                        1.034e+00  3.206e-02   32.259  < 2e-16 ***
stateDC                        6.993e-01  3.573e-02   19.571  < 2e-16 ***
stateDE                        7.526e-01  3.313e-02   22.714  < 2e-16 ***
stateFL                        5.983e-01  3.365e-02   17.779  < 2e-16 ***
stateGA                        6.703e-01  3.441e-02   19.483  < 2e-16 ***
stateHI                       -2.158e-01  3.959e-02   -5.451 5.00e-08 ***
stateIA                        2.115e-01  3.988e-02    5.304 1.13e-07 ***
stateID                        8.743e-01  3.306e-02   26.448  < 2e-16 ***
stateIL                       -2.190e-01  3.684e-02   -5.944 2.78e-09 ***
stateIN                        4.549e-01  3.326e-02   13.677  < 2e-16 ***
stateKS                        1.069e+00  3.225e-02   33.153  < 2e-16 ***
stateKY                        8.000e-01  3.384e-02   23.640  < 2e-16 ***
stateLA                        5.638e-01  3.694e-02   15.261  < 2e-16 ***
stateMA                        1.179e+00  3.285e-02   35.896  < 2e-16 ***
stateMD                        1.157e+00  3.168e-02   36.505  < 2e-16 ***
stateME                        1.336e+00  3.220e-02   41.493  < 2e-16 ***
stateMI                        1.402e+00  3.153e-02   44.469  < 2e-16 ***
stateMN                        1.164e+00  3.163e-02   36.793  < 2e-16 ***
stateMO                        6.040e-01  3.422e-02   17.650  < 2e-16 ***
stateMS                        1.362e+00  3.219e-02   42.322  < 2e-16 ***
stateMT                        1.877e+00  3.168e-02   59.263  < 2e-16 ***
stateNC                        6.006e-01  3.334e-02   18.013  < 2e-16 ***
stateND                        1.485e-01  3.520e-02    4.219 2.45e-05 ***
stateNE                        3.055e-01  3.563e-02    8.574  < 2e-16 ***
stateNH                        1.186e+00  3.396e-02   34.907  < 2e-16 ***
stateNJ                        1.581e+00  3.138e-02   50.381  < 2e-16 ***
stateNM                        4.713e-01  3.478e-02   13.552  < 2e-16 ***
stateNV                        1.257e+00  3.205e-02   39.234  < 2e-16 ***
stateNY                        1.171e+00  3.180e-02   36.814  < 2e-16 ***
stateOH                        6.914e-01  3.298e-02   20.966  < 2e-16 ***
stateOK                        8.778e-01  3.343e-02   26.258  < 2e-16 ***
stateOR                        7.706e-01  3.253e-02   23.689  < 2e-16 ***
statePA                        1.156e+00  3.444e-02   33.565  < 2e-16 ***
stateRI                       -1.087e-01  3.626e-02   -2.998  0.00272 ** 
stateSC                        1.755e+00  3.259e-02   53.856  < 2e-16 ***
stateSD                        8.236e-01  3.367e-02   24.457  < 2e-16 ***
stateTN                        2.700e-01  3.627e-02    7.444 9.77e-14 ***
stateTX                        1.652e+00  3.129e-02   52.782  < 2e-16 ***
stateUT                        1.050e+00  3.289e-02   31.916  < 2e-16 ***
stateVA                       -4.317e-01  3.638e-02  -11.867  < 2e-16 ***
stateVT                        9.838e-02  3.444e-02    2.857  0.00428 ** 
stateWA                        1.422e+00  3.199e-02   44.462  < 2e-16 ***
stateWI                        2.955e-01  3.449e-02    8.569  < 2e-16 ***
stateWV                        5.767e-01  3.242e-02   17.789  < 2e-16 ***
stateWY                        2.944e-01  3.336e-02    8.826  < 2e-16 ***
account_length                 1.005e-03  6.334e-05   15.868  < 2e-16 ***
international_planyes          2.188e+00  6.762e-03  323.644  < 2e-16 ***
voice_mail_planyes            -2.130e+00  2.627e-02  -81.089  < 2e-16 ***
number_vmail_messages          3.823e-02  8.242e-04   46.384  < 2e-16 ***
total_day_minutes             -6.552e-02  5.946e-02   -1.102  0.27055    
total_day_calls                4.028e-03  1.265e-04   31.856  < 2e-16 ***
total_day_charge               4.625e-01  3.498e-01    1.322  0.18614    
total_eve_minutes              9.278e-01  7.503e-02   12.365  < 2e-16 ***
total_eve_calls                9.997e-04  1.277e-04    7.832 4.82e-15 ***
total_eve_charge              -1.082e+01  8.827e-01  -12.261  < 2e-16 ***
total_night_minutes           -2.416e-01  3.992e-02   -6.051 1.44e-09 ***
total_night_calls              1.270e-04  1.293e-04    0.982  0.32593    
total_night_charge             5.455e+00  8.871e-01    6.149 7.78e-10 ***
total_intl_minutes            -4.214e+00  2.427e-01  -17.363  < 2e-16 ***
total_intl_calls              -9.101e-02  1.137e-03  -80.016  < 2e-16 ***
total_intl_charge              1.592e+01  8.989e-01   17.707  < 2e-16 ***
number_customer_service_calls  5.358e-01  1.810e-03  296.056  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1412246  on 1706495  degrees of freedom
Residual deviance: 1060480  on 1706428  degrees of freedom
AIC: 1060616

Number of Fisher Scoring iterations: 6

可以发现绝大多数的变量已经显著。因为原始数据集条目过多，为了加速之后的试验训练过程，下面换用train_big数据的子集train进行下面的实验。

对train数据重新构建逻辑回归模型：

#构建Logistic模型
model <- glm(formula = churn ~ ., data = train, family = 'binomial')
summary(model)

Call:
glm(formula = churn ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9200  -0.4994  -0.3127  -0.1664   3.0363  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -9.753e+00  9.756e-01  -9.996  < 2e-16 ***
stateAL                        3.593e-01  7.627e-01   0.471 0.637573    
stateAR                        9.224e-01  7.521e-01   1.226 0.220040    
stateAZ                        1.089e-01  8.450e-01   0.129 0.897463    
stateCA                        1.821e+00  7.830e-01   2.326 0.020024 *  
stateCO                        6.675e-01  7.622e-01   0.876 0.381144    
stateCT                        1.034e+00  7.255e-01   1.426 0.153931    
stateDC                        6.979e-01  8.086e-01   0.863 0.388065    
stateDE                        7.521e-01  7.497e-01   1.003 0.315755    
stateFL                        5.983e-01  7.615e-01   0.786 0.432018    
stateGA                        6.685e-01  7.788e-01   0.858 0.390694    
stateHI                       -2.133e-01  8.960e-01  -0.238 0.811874    
stateIA                        2.115e-01  9.022e-01   0.234 0.814665    
stateID                        8.738e-01  7.480e-01   1.168 0.242719    
stateIL                       -2.206e-01  8.339e-01  -0.265 0.791350    
stateIN                        4.539e-01  7.526e-01   0.603 0.546413    
stateKS                        1.069e+00  7.297e-01   1.465 0.142794    
stateKY                        8.005e-01  7.658e-01   1.045 0.295850    
stateLA                        5.629e-01  8.360e-01   0.673 0.500722    
stateMA                        1.179e+00  7.433e-01   1.586 0.112673    
stateMD                        1.156e+00  7.169e-01   1.613 0.106744    
stateME                        1.335e+00  7.287e-01   1.832 0.066916 .  
stateMI                        1.403e+00  7.135e-01   1.966 0.049329 *  
stateMN                        1.163e+00  7.157e-01   1.625 0.104083    
stateMO                        6.023e-01  7.745e-01   0.778 0.436784    
stateMS                        1.362e+00  7.283e-01   1.871 0.061408 .  
stateMT                        1.878e+00  7.167e-01   2.620 0.008804 ** 
stateNC                        5.993e-01  7.546e-01   0.794 0.427099    
stateND                        1.479e-01  7.966e-01   0.186 0.852729    
stateNE                        3.057e-01  8.061e-01   0.379 0.704570    
stateNH                        1.186e+00  7.684e-01   1.543 0.122805    
stateNJ                        1.581e+00  7.100e-01   2.227 0.025959 *  
stateNM                        4.711e-01  7.870e-01   0.599 0.549388    
stateNV                        1.257e+00  7.252e-01   1.733 0.083099 .  
stateNY                        1.170e+00  7.195e-01   1.626 0.103851    
stateOH                        6.909e-01  7.462e-01   0.926 0.354524    
stateOK                        8.780e-01  7.564e-01   1.161 0.245733    
stateOR                        7.696e-01  7.361e-01   1.045 0.295799    
statePA                        1.156e+00  7.793e-01   1.484 0.137936    
stateRI                       -1.080e-01  8.203e-01  -0.132 0.895256    
stateSC                        1.754e+00  7.376e-01   2.379 0.017379 *  
stateSD                        8.233e-01  7.620e-01   1.080 0.279941    
stateTN                        2.687e-01  8.209e-01   0.327 0.743377    
stateTX                        1.651e+00  7.080e-01   2.332 0.019675 *  
stateUT                        1.050e+00  7.442e-01   1.411 0.158358    
stateVA                       -4.304e-01  8.231e-01  -0.523 0.601058    
stateVT                        9.606e-02  7.797e-01   0.123 0.901945    
stateWA                        1.422e+00  7.239e-01   1.965 0.049461 *  
stateWI                        2.965e-01  7.804e-01   0.380 0.703980    
stateWV                        5.759e-01  7.337e-01   0.785 0.432454    
stateWY                        2.953e-01  7.548e-01   0.391 0.695612    
account_length                 1.005e-03  1.433e-03   0.701 0.483141    
international_planyes          2.188e+00  1.530e-01  14.302  < 2e-16 ***
voice_mail_planyes            -2.131e+00  5.945e-01  -3.584 0.000339 ***
number_vmail_messages          3.824e-02  1.865e-02   2.050 0.040368 *  
total_day_minutes             -3.854e-01  3.379e+00  -0.114 0.909187    
total_day_calls                4.028e-03  2.861e-03   1.408 0.159188    
total_day_charge               2.344e+00  1.988e+01   0.118 0.906114    
total_eve_minutes              9.299e-01  1.698e+00   0.548 0.583939    
total_eve_calls                1.001e-03  2.889e-03   0.347 0.728889    
total_eve_charge              -1.085e+01  1.998e+01  -0.543 0.587089    
total_night_minutes           -2.431e-01  9.035e-01  -0.269 0.787842    
total_night_calls              1.233e-04  2.926e-03   0.042 0.966392    
total_night_charge             5.490e+00  2.008e+01   0.273 0.784494    
total_intl_minutes            -4.210e+00  5.492e+00  -0.767 0.443347    
total_intl_calls              -9.109e-02  2.575e-02  -3.537 0.000404 ***
total_intl_charge              1.590e+01  2.034e+01   0.782 0.434347    
number_customer_service_calls  5.359e-01  4.095e-02  13.084  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2758.3  on 3332  degrees of freedom
Residual deviance: 2071.2  on 3265  degrees of freedom
AIC: 2207.2

Number of Fisher Scoring iterations: 6

发现这时大多数的变量并不显著，故考虑剔除这些不显著的变量，这里使用逐步回归法进行变量的选择(需要注意的是，Logistic为非线性模型，回归系数是通过极大似然估计方法计算所得)。

#step函数实现逐步回归法
model2 <- step(object = model, trace = 0)
summary(model2)

Call:
glm(formula = churn ~ international_plan + voice_mail_plan + 
    number_vmail_messages + total_day_charge + total_eve_minutes + 
    total_night_charge + total_intl_calls + total_intl_charge + 
    number_customer_service_calls, family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1204  -0.5133  -0.3375  -0.1969   3.2421  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -8.067161   0.515870 -15.638  < 2e-16 ***
international_planyes          2.040338   0.145243  14.048  < 2e-16 ***
voice_mail_planyes            -2.003234   0.572352  -3.500 0.000465 ***
number_vmail_messages          0.035262   0.017964   1.963 0.049654 *  
total_day_charge               0.076589   0.006371  12.022  < 2e-16 ***
total_eve_minutes              0.007182   0.001142   6.290 3.17e-10 ***
total_night_charge             0.082547   0.024653   3.348 0.000813 ***
total_intl_calls              -0.092176   0.024988  -3.689 0.000225 ***
total_intl_charge              0.326138   0.075453   4.322 1.54e-05 ***
number_customer_service_calls  0.512256   0.039141  13.087  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2758.3  on 3332  degrees of freedom
Residual deviance: 2161.6  on 3323  degrees of freedom
AIC: 2181.6

Number of Fisher Scoring iterations: 6

从结果中发现，所有变量的P值均小于0.05，通过显著性检验，保留了相对重要的变量。模型各变量通过显著性检验的同时还需确保整个模型是显著的，只有这样才能保证模型是正确的、有意义的，下面对模型进行卡方检验。

#模型的显著性检验
anova(object = model2, test = 'Chisq')

Analysis of Deviance Table

Model: binomial, link: logit

Response: churn

Terms added sequentially (first to last)


                              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                           3332     2758.3              
international_plan             1  170.400      3331     2587.9 < 2.2e-16 ***
voice_mail_plan                1   41.868      3330     2546.0 9.765e-11 ***
number_vmail_messages          1    3.638      3329     2542.4 0.0564756 .  
total_day_charge               1  135.452      3328     2406.9 < 2.2e-16 ***
total_eve_minutes              1   30.874      3327     2376.1 2.753e-08 ***
total_night_charge             1    8.500      3326     2367.6 0.0035509 ** 
total_intl_calls               1   12.887      3325     2354.7 0.0003309 ***
total_intl_charge              1   16.210      3324     2338.5 5.668e-05 ***
number_customer_service_calls  1  176.839      3323     2161.6 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

从上图中可知，随着变量从第一个到最后一个逐个加入模型，模型最终通过显著性检验，说明由上述这些变量组成的模型是有意义的，并且是正确的。

虽然模型的偏回归系数和模型均通过显著性检验，但不代表模型能够非常准确的拟合实际值，这就需要对模型进行拟合优度检验，即通过比较模型的预测值与实际值之间的差异情况来进行检验。

Logistic回归模型的拟合优度检验一般使用偏差卡方检验、皮尔逊卡方检验和HL统计量检验三种方法，其中前两种检验适合模型中只有离散的自变量，而后一种适合模型中包含连续的自变量。拟合优度检验的原假设为“模型的预测值与实际值不存在差异”。

#模型的拟合优度检验
HL_test <- hoslem_gof(x = model)
HL_test

$chisq
[1] 13.81744

$df
[1] 8

$p.value
[1] 0.08664957

attr(,"class")
[1] "hoslem_test" "list"

从模型的拟合优度检验结果可知，该模型无法拒绝拟合优度检验的原假设，即可以认为实际值与模型的预测值之间比较接近，不存在显著差异。

以上各项指标均表示模型对电信行业客户流失数据拟合的比较理想，接下来就用该模型对测试集进行预测，预测一个未知的客户是否可能流失，从而起到流失预警的作用。

#模型对样本外数据(测试集)的预测精度
prob <- predict(object = model2, newdata= test, type = 'response')
pred <- ifelse(prob >= 0.5, 'yes','no')
pred <- factor(pred, levels =c('no','yes'), order = TRUE)
f <- table(test$churn, pred)
f

     pred
        no  yes
  no  1408   35
  yes  182   42

实验五：模型评价与优化

1).模型对非流失客户(no)的预测还是非常准确的（1408/(1408+35)=97.6%）;

2).模型对流失客户(yes)的预测不理想（42/(182+42)=18.8%）

3).模型的整体预测准确率为87.0%((1408+42)/(1408+35+182+42))，还算说得过去。

实验六：可视化输出

绘制ROC曲线

#绘制ROC曲线
roc_curve <- roc(test$churn,prob)
names(roc_curve)
x <- 1-roc_curve$specificities
y <- roc_curve$sensitivities
p <- ggplot(data = NULL, mapping = aes(x= x, y = y))
p + geom_line(colour = 'red') +geom_abline(intercept = 0, slope = 1) + annotate('text', x = 0.4, y = 0.5, label =paste('AUC=',round(roc_curve$auc,2))) + labs(x = '1-specificities',y = 'sensitivities', title = 'ROC Curve')

这里的AUC为ROC曲线和y=x直线之间的面积。在实际应用中，多个模型的比较可以通过面积大小来选择更佳的模型，选择标准是AUC越大越好。对于一个模型而言，一般AUC大于0.8就能够说明模型是比较合理的了。

案例五：电信行业客户流失分析