Loading model report | 正在读取模型报告
Model report not found, go back to "My data" | 未找到模型开发报告,返回“我的数据”
My data | 我的数据
id | Variable | Type | Count | Missing | Missing_Rate | Unique | Mean | Std | Min | 25% | 50% | 75% | 95% | 99% | Max | First | Last |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | Bin | Variable | Type | Value_Range | Value_Counts | Bad | Good | Bad_Rate | Bad_Total | Good_Total | Pct_Bad | Pct_Good | Pct_Bin | WoE | IV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Candidate Variables | 候选变量
Selected Features | 选取的变量
Stepwise Info | 逐步回归过程信息
KS Curve and ROC Curve | KS曲线与ROC曲线
Results | 结果
Model: | | Pseudo R-squared: | |
---|---|---|---|
Dependent Variable: | | AIC: | |
Date: | | BIC: | |
No. Observations: | | Log-Likelihood: | |
Df Model: | | LL-Null: | |
Df Residuals: | | LLR p-value: | |
Converged: | | Scale: | |
No. Iterations: | | | |
Parameters Estimation | 参数估计
Variable | Coeff. | Std.Err. | z | P>|z| | [0.025 | 0.975] | VIF |
---|---|---|---|---|---|---|---|
Rank Order Table | 等分排序表
# | Cnt Bin | Cnt Bad | Cnt Good | Bin Bad Rate | Remaining Bad Rate | Cum Bad | Cum Good | Cum Bad % | Cum Good % | KS | Score Min | Score Max | Score Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Histogram | 直方图
Model Configuration | 模型配置
Missing rate for variable exclusion | 被剔除变量缺失率阈值: | % |
String variable uniqueness | 字符型变量唯一值: | |
Keep binning WoE monotonic | 保持WoE值单调: | |
Difference of WoE between 2 adjacent bins | 相邻分箱之间的WoE差值: | |
Minimum population of a single bin | 单个分箱的最小占比: | % |
IV for variable exclusion | 用于剔除变量的IV阈值: | |
Correlation for variable exclusion | 用于剔除变量的相关系数阈值: | |
Force variable coefficients to be negative | 确保入模变量系数均为负值: | |
Significance level for entry | 变量进入模型的显著性水平: | |
Significance level for stay | 变量退出模型的显著性水平: | |
Time variable | 时间变量: | |
Variables excluded from modeling | 模型开发排除变量: |
Model Features | 特征变量
Model Prediction | 模型预测
Score =
Predicted Bad Rate =
Scorecard | 评分表
Variable | 变量 | Value | 取值 | Score | 分数 |
---|---|---|---|---|---|
Variable Binning Code | 变量分箱代码
Variable WoE Code | 变量 WoE 代码
Overall KS Curve and ROC Curve | 总体KS曲线与ROC曲线
Rank Order Table | 等分排序表
# | Count Bin | Count Bad | Count Good | Bin Bad Rate | Cum Bad | Cum Good | Cum Pct Bad | Cum Pct Good | KS | Score Min | Score Max | Score Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Monthly Stats | 每月指标
Monthly Variable Analysis | 每月变量分析
Weekly Stats | 每周指标
Weekly Variable Analysis | 每周变量分析
Overall KS Curve and ROC Curve | 总体KS曲线与ROC曲线
Rank Order Table | 等分排序表
# | Count Bin | Count Bad | Count Good | Bin Bad Rate | Cum Bad | Cum Good | Cum Pct Bad | Cum Pct Good | KS | Score Min | Score Max | Score Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Monthly Stats | 每月指标
Monthly Variable Analysis | 每月变量分析
Weekly Stats | 每周指标
Weekly Variable Analysis | 每周变量分析
About Data Report | 关于数据报告
Column Definition
- id: variable id number
- Variable: name of the variable
- Type: variable type
- Count: number of rows
- Missing: number of missing values
- Missing_Rate: percentage of missing values
- Unique: number of unique values
- Mean: the average of the values
- Std: the standard deviation of the values
- Min: the minimum of the values
- 25%: 25th percentile
- 50%: 50th percentile, a.k.a. the median
- 75%: 75th percentile
- 95%: 95th percentile
- 99%: 99th percentile
- Max: the max of the values
- First: the earliest value of datetime variables
- Last: the latest value of datetime variables
Data Report
Reading the data report is the first step of modeling, and it is one of the important tasks that cannot be done by a machine alone.
For example, in the business world, modeling data is prepared by joining together pieces of data from various sources, so missing values usually exist. A machine cannot tell whether it is correct that a variable named 'gender' has over 80% missing values.
As another example, a machine cannot tell whether a value of '180' in the variable 'age' should be corrected before modeling.
That's why it is important to read the data report scrupulously, so as to ensure the correctness of the modeling data.
Checklist
- Type: Is the data type correct? Is there any erroneous value that prevents a complete conversion to numeric?
- Missing_Rate: Does the missing rate make sense? The t1modeler algorithm handles missing values by itself, so converting missing values into placeholders such as -9999 is unnecessary.
- Unique: Is the number of unique values correct? A variable whose number of unique values equals 1 carries no information.
- Min: Is the minimum value correct? Is there any unexpected negative value?
- Max: Is the maximum value correct? Is there any outlier? (A sketch of how such a report can be computed follows this list.)
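The statistics above can be reproduced with plain pandas. The sketch below is illustrative only — the function name and exact layout are assumptions, not t1modeler's implementation:

```python
import pandas as pd

def data_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-variable summary statistics mirroring the columns described above."""
    rows = []
    for i, col in enumerate(df.columns):
        s = df[col]
        row = {
            "id": i,
            "Variable": col,
            "Type": str(s.dtype),
            "Count": len(s),
            "Missing": int(s.isna().sum()),
            "Missing_Rate": float(s.isna().mean()),
            "Unique": s.nunique(dropna=True),
        }
        if pd.api.types.is_numeric_dtype(s):
            q = s.quantile([0.25, 0.5, 0.75, 0.95, 0.99])
            row.update({"Mean": s.mean(), "Std": s.std(), "Min": s.min(),
                        "25%": q[0.25], "50%": q[0.5], "75%": q[0.75],
                        "95%": q[0.95], "99%": q[0.99], "Max": s.max()})
        elif pd.api.types.is_datetime64_any_dtype(s):
            row.update({"First": s.min(), "Last": s.max()})
        rows.append(row)
    return pd.DataFrame(rows)
```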
字段含义
- id: 变量 id 序号
- Variable: 变量名称
- Type: 变量类型
- Count: 数据集行数
- Missing: 缺失值数量
- Missing_Rate: 缺失值比例
- Unique: 唯一值数量
- Mean: 平均值
- Std: 标准差
- Min: 最小值
- 25%: 25 分位值
- 50%: 50 分位值,即中位数
- 75%: 75 分位值
- 95%: 95 分位值
- 99%: 99 分位值
- Max: 最大值
- First: datetime 类型变量的最早值
- Last: datetime 类型变量的最晚值
数据报告
阅读数据报告通常是进行模型开发的第一步,并且这是单凭机器算法无法解决的一个关键步骤。
例如,实际商业环境中的建模数据由不同数据源拼接而成,因此可能存在一定的缺失率。机器算法无法告诉分析师,缺失率为 80% 的“性别”变量是否存在由拼接导致的数据错误问题。
又例如,机器算法无法分辨“年龄”为 180 的数据,是否需要在建模前进行纠正。
因此分析师需要仔细阅读数据报告以确保建模数据的准确性。
检查清单
- Type: 数据类型是否正确,是否存在本应为数值型的变量,由于某些错误值的存在,导致无法被识别为数值型?
- Missing_Rate: 数据缺失率是否正常?本建模算法能够自动处理缺失值,因此不需要将其转换为 -9999 等特殊值。
- Unique: 唯一值数量是否正常?一个变量的唯一值数量为 1,则此变量没有任何建模价值。
- Min: 最小值是否正常?是否存在不应该出现的负数?
- Max: 最大值是否正常?是否存在过大的异常值?
About WoE Analysis | 关于WoE分析
Column Definition
- id: WoE id number
- Bin: bin number
- Variable: name of the variable
- Type: variable type
- Value_Range: value boundary of the bin
- Value_Counts: number of unique values in the bin
- Bad: number of bads in the bin (target value equals to 1)
- Good: number of goods in the bin (target value equals to 0)
- Bad_Rate: bad rate = Bad / number of observations in the bin
- Bad_Total: total number of bads
- Good_Total: total number of goods
- Pct_Bad: distribution of bads; adds up to 100% across all bins
- Pct_Good: distribution of goods; adds up to 100% across all bins
- Pct_Bin: distribution of observations; adds up to 100% across all bins
- WoE: Weight of Evidence
- IV: Information Value
WoE Analysis
WoE analysis performs variable-wise (univariate) analysis and variable binning.
By default, the t1modeler algorithm keeps the WoE monotonic (the greater the variable value, the higher or lower the bad rate); the variable's behaviour is then easy to perceive, but it tends to have lower discriminative power (lower IV).
The t1modeler settings can be changed to ignore monotonicity; the variable's behaviour is then harder to perceive, but it tends to have higher discriminative power (higher IV).
Although the t1modeler algorithm does most of the job, analysts are encouraged to check a few critical points. A sketch of the per-bin WoE and IV arithmetic follows this paragraph.
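A minimal sketch of the per-bin WoE and IV arithmetic, assuming the target is coded 1 = bad, 0 = good. The sign convention ln(Pct_Bad / Pct_Good) is one common choice and may differ from the one t1modeler uses; bins with zero goods or bads would need smoothing in practice:

```python
import numpy as np
import pandas as pd

def woe_iv(bin_labels: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Per-bin good/bad distributions, WoE and total IV (target: 1 = bad, 0 = good)."""
    tab = pd.crosstab(bin_labels, target).reindex(columns=[0, 1], fill_value=0)
    good, bad = tab[0], tab[1]
    pct_good = good / good.sum()      # distribution of goods, sums to 100% over bins
    pct_bad = bad / bad.sum()         # distribution of bads, sums to 100% over bins
    woe = np.log(pct_bad / pct_good)  # one common sign convention
    iv = ((pct_bad - pct_good) * woe).sum()
    out = pd.DataFrame({"Bad": bad, "Good": good, "Bad_Rate": bad / (bad + good),
                        "Pct_Bad": pct_bad, "Pct_Good": pct_good, "WoE": woe})
    out["IV"] = iv                    # IV is reported per variable
    return out
```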
Checklist
- Bad_Rate: Is the trend of the bad rate consistent with common sense? For instance, does the modeling variable show that the better the credit history, the lower the likelihood of future credit default? If not, it is probably a good idea to scrutinize the data preparation scripts.
- IV: Does a variable's discriminative power seem normal? Is there any variable that common sense says should have a superior IV but turns out to have an inferior IV? If there is, it is probably a good idea to scrutinize the data preparation scripts.
字段含义
- id: WoE id 序号
- Bin: 分箱序号
- Variable: 变量名称
- Type: 变量类型
- Value_Range: 分箱取值范围
- Value_Counts: 分箱中唯一值数量
- Bad: 分箱中坏样本数量(目标值为 1)
- Good: 分箱中好样本数量(目标值为 0)
- Bad_Rate: 分箱坏比例 = 坏样本数量 / 分箱样本数量
- Bad_Total: 总体坏样本
- Good_Total: 总体好样本
- Pct_Bad: 坏样本分布,分箱加总为 100%
- Pct_Good: 好样本分布,分箱加总为 100%
- Pct_Bin: 分箱样本分布,分箱加总为 100%
- WoE: 证据权重
- IV: 信息值
WoE 分析
WoE 分析进行单变量分析以及数值型变量的分箱。
在默认情况下 t1modeler 算法保持变量分箱的单调性(即变量值越大则 bad rate 越高/越低),使变量保持良好的解释性但降低区分能力;
也可以修改默认值使变量分箱舍弃单调性,即牺牲变量可解释性但提高区分能力。
虽然 t1modeler 算法完成了大部分工作,但是分析师依然需要检查一些关键点。
检查清单
- Bad_Rate: 坏样本比例趋势是否与一般认知一致?例如,数据是否展现出信用历史越好则未来违约的可能性越低?如果不是,可能需要检查数据清洗是否正确。
- IV: 变量区分能力是否正常?有没有一般认知中应该具有良好区分能力的变量,但在WoE分析中只有很低的区分能力?如果有,可能需要检查数据清洗是否正确。
About Stepwise Procedure | 关于逐步回归过程
Candidate Variables
Starting from all input variables, candidate variables are those that pass the selection criteria; they are used as the inputs of the stepwise procedure. Variables ruled out by the selection criteria generally add no value to the model.
Selection criteria:
- Remove variables whose missing rate is greater than a certain value (Settings - Missing rate for variable exclusion);
- Remove variables whose number of unique values equals 1;
- Remove datetime variables;
- Remove string variables whose number of unique values is greater than a certain value (Settings - String variable uniqueness);
- Remove user-defined excluded variables;
- Remove variables whose IV is less than a certain value (Settings - IV for variable exclusion);
- Calculate the pair-wise Pearson correlation coefficient among all remaining variables; if a coefficient is greater than a certain value, remove the variable with the lower IV from that pair (Settings - Correlation for variable exclusion). A sketch of this last step follows the list.
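One possible greedy implementation of the pair-wise correlation filter; the threshold default and names below are illustrative, and t1modeler's exact tie-breaking may differ:

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, iv: dict, threshold: float = 0.7) -> list:
    """Keep variables in descending IV order; drop any variable whose absolute
    Pearson correlation with an already-kept variable exceeds the threshold."""
    cols = sorted(df.columns, key=lambda c: iv[c], reverse=True)  # highest IV first
    corr = df[cols].corr().abs()
    kept = []
    for col in cols:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept
```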
Selected Features
Selected features are the variables chosen by the stepwise procedure; together they form the model.
Stepwise Info
Stepwise info logs the complete stepwise workflow.
In each iteration of stepwise procedure:
- Add to the model the variable that meets the significance level requirement and is the most significant (and that has a negative coefficient, if the setting is enabled) (Settings - Significance level for entry, Force variable coefficients to be negative);
- Among all variables already in the model, remove the variable that fails the significance level requirement and is the least significant (and that has a positive coefficient, if the setting is enabled) (Settings - Significance level for stay, Force variable coefficients to be negative);
- Repeat the steps above. When no variable can be added to the model, or no variable can be removed from it, modeling ends. A simplified sketch of this loop follows the list.
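A simplified sketch of such a forward-backward loop with statsmodels, assuming X holds the WoE-transformed candidate variables and y the binary target; it is an illustration of the described settings, not t1modeler's exact code, and a real implementation also guards against variables cycling in and out:

```python
import statsmodels.api as sm

def stepwise_logit(X, y, p_enter=0.05, p_remove=0.10, force_negative=True):
    """Forward-backward stepwise logistic regression (simplified illustration)."""
    selected = []
    while True:
        changed = False
        # Entry step: add the most significant eligible candidate.
        best_col, best_p = None, p_enter
        for col in (c for c in X.columns if c not in selected):
            fit = sm.Logit(y, sm.add_constant(X[selected + [col]])).fit(disp=0)
            if fit.pvalues[col] < best_p and (not force_negative or fit.params[col] < 0):
                best_col, best_p = col, fit.pvalues[col]
        if best_col is not None:
            selected.append(best_col)
            changed = True
        # Stay step: drop the least significant variable that fails the criteria.
        if selected:
            fit = sm.Logit(y, sm.add_constant(X[selected])).fit(disp=0)
            pvals = fit.pvalues.drop("const")
            worst = pvals.idxmax()
            bad_sign = force_negative and fit.params[worst] > 0
            if (pvals[worst] > p_remove or bad_sign) and worst != best_col:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected
```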
候选变量
候选变量是在原始建模变量的基础上,通过一定的筛选条件后,保留下来作为逐步回归过程的输入。被剔除的变量,通常认为对于模型开发没有价值。
筛选条件:
- 剔除缺失率大于指定值的变量(设置 - 被剔除变量缺失率阈值);
- 剔除唯一值数量为 1 的变量;
- 剔除时间日期型(datetime)变量;
- 剔除唯一值数量大于指定值的字符型变量(设置 - 字符型变量唯一值);
- 剔除模型开发前用户指定的排除变量;
- 剔除 IV 小于指定值的变量(设置 - 用于剔除变量的 IV 阈值);
- 计算所有未被排除的变量之间的两两皮尔逊相关系数,如果相关系数高于指定值,剔除 IV 较低的一个变量(设置 - 用于剔除变量的相关系数阈值)。
选取的变量
选取的变量指通过逐步回归过程,最终进入模型的变量。
逐步回归过程信息
逐步回归过程信息记录逐步回归的整个工作过程。
在每一次迭代中:
- 首先选入满足指定显著性水平且最显著(且系数为负,如有设置的话)的一个变量(设置 - 变量进入模型的显著性水平,确保入模变量系数均为负值);
- 计算已选入的所有变量的显著性水平,如果存在显著性水平高于指定值(且系数不为负,如有设置的话)的变量,剔除最不显著的一个变量(设置 - 变量退出模型的显著性水平,确保入模变量系数均为负值);
- 重复以上步骤。如果再也无法选入变量,或者再也无法剔除变量,则迭代结束,模型开发完毕。
About KS Curve and ROC Curve | 关于 KS 曲线与 ROC 曲线
KS Curve
The KS curve shows how well a model separates events from non-events (or good events from bad events).
As one of the most important indicators of model performance, it is commonly held that the higher the max KS value, the better the model. Analysts should still use their domain knowledge to check whether the max KS value is "too good to be true".
For instance, in the universe of credit risk modeling, reasonable max KS values are likely to fall between 20 and 60; a credit risk model with a max KS value of 90 probably has glitches in its data preparation. When it comes to handwritten digit recognition, however, a model with a max KS value of 95 is completely reasonable.
ROC Curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
One of the critical metrics is AUC (Area Under the Curve). Once again it is common to think that the higher the AUC, the better the model is.
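A minimal sketch of how the max KS and AUC can be computed with scikit-learn, assuming y_true is the binary target (1 = bad) and y_score the predicted probability of bad. Note that the KS values quoted above are on a 0-100 scale, i.e. the 0-1 statistic multiplied by 100:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ks_and_auc(y_true, y_score):
    """Max KS (max(TPR - FPR) along the ROC curve) and AUC."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ks = float(np.max(tpr - fpr))
    auc = roc_auc_score(y_true, y_score)
    return ks, auc
```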
KS 曲线
KS 曲线显示一个模型区分事件与非事件(或好事件与坏事件)的能力。
作为衡量模型性能的重要指标之一,通常认为最大 KS 值越高则模型越好。但是分析师需要结合领域知识,确认最大 KS 值是否“太完美以至于无法置信”。
例如,在信用风险领域,合理的模型最大 KS 值一般为 20 到 60 之间,如果一个信用风险模型的最大 KS 值为 90,则有可能是数据准备阶段出现了一些错误。但是在手写体数字识别领域,模型最大 KS 值高达 95 依然是合理的。
ROC 曲线
接收者操作特征曲线,或者 ROC 曲线,是一种坐标图式的分析工具,用于选择最佳的信号侦测模型、舍弃次佳的模型,在同一模型中设定最佳阈值。
其中一个关键指标为 AUC(曲线下面积)。同样地通常认为 AUC 值越高则模型越好。
About Results | 关于结果
Results
- Model: Name of the model
- Dependent Variable: The target variable
- Date: Time of modeling completion
- No. Observations: Number of observations
- Df Model: Degrees of freedom of model
- Df Residuals: Degrees of freedom of residuals
- Converged: Iteration convergence
- No. Iterations: Number of iterations until convergence
- Pseudo R-squared: McFadden's pseudo-R-squared
- AIC: Akaike information criterion
- BIC: Bayesian information criterion
- Log-Likelihood: Log-likelihood of the fitted model
- LL-Null: Value of the log-likelihood of the constant-only model
- LLR p-value: The chi-squared probability of getting a log-likelihood ratio statistic greater than LLR
- Scale: A scale parameter for the covariance matrix
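These statistics are the standard output of a logistic regression fit. A minimal, self-contained sketch with statsmodels (the synthetic data below is illustrative only; in practice X holds the WoE-transformed model features):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative data; replace with the WoE-transformed modeling dataframe.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
y = (rng.random(500) < 1.0 / (1.0 + np.exp(X["x1"] - X["x2"]))).astype(int)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(result.summary2())   # Model, AIC, BIC, Log-Likelihood, LL-Null, LLR p-value, Scale, ...
print(result.prsquared)    # McFadden's pseudo R-squared
```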
结果
- Model: 模型名称
- Dependent Variable: 目标变量
- Date: 模型开发完毕时间戳
- No. Observations: 观测值数量
- Df Model: 模型自由度
- Df Residuals: 残差自由度
- Converged: 迭代是否收敛
- No. Iterations: 达到收敛时迭代次数
- Pseudo R-squared: McFadden's pseudo-R-squared
- AIC: 赤池信息准则
- BIC: 贝叶斯信息准则
- Log-Likelihood: Log-likelihood of the fitted model
- LL-Null: Value of the log-likelihood of the constant-only model
- LLR p-value: The chi-squared probability of getting a log-likelihood ratio statistic greater than LLR
- Scale: A scale parameter for the covariance matrix
About Parameters Estimation | 关于参数估计
Parameters Estimation
Parameters estimation shows statistics for model features.
- Variable: variable name, const is the constant variable
- Coeff.: Coefficient, describes the size and direction of the relationship between a feature and the target variable
- Std.Err.: Standard Error of the Coefficient
- z: Z-Value, the ratio between the coefficient and its standard error
- P>|z|: P-Value, the significance level
- [0.025: lower boundary of the 95% confidence interval
- 0.975]: upper boundary of the 95% confidence interval
- VIF: Variance Inflation Factor
Notes
- The t1modeler algorithm ensures that the P-Values of model features are lower than a certain value (Settings - Significance level for entry);
- If the setting is enabled, the t1modeler algorithm ensures that the coefficients of model features are negative, which results in more reasonable models (Settings - Force variable coefficients to be negative);
- Generally speaking, it is a good idea to keep VIF less than 10. If there are variables with VIF greater than 10, remove the one with the greatest VIF, then start modeling again. A sketch of the VIF computation follows these notes.
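A sketch of the VIF computation mentioned in the notes, using statsmodels; the function name and layout are illustrative:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per model feature; a constant is added for the auxiliary regressions
    and dropped from the result."""
    Xc = sm.add_constant(X)
    vifs = [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])]
    return pd.Series(vifs, index=Xc.columns).drop("const")
```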
参数估计
参数估计展示每一个模型特征变量的相关统计值。
- Variable: 变量名称,其中 const 为常数项
- Coeff.: 即 Coefficient,特征变量的回归系数,描述其与目标变量之间关系的大小和方向
- Std.Err.: 即 Standard Error of the Coefficient,回归系数的标准误
- z: 即 Z-Value,回归系数与其标准误之间的比值
- P>|z|: 即 P-Value,特征变量的显著性水平
- [0.025: 下边界(95% 置信区间)
- 0.975]: 上边界(95% 置信区间)
- VIF: 即 Variance Inflation Factor,方差膨胀因子
备注
- t1modeler 算法保证进入模型的特征变量的 P-Value 在指定值以内(设置 - 变量进入模型的显著性水平);
- 如果配置为启用,t1modeler 算法保证进入模型的特征变量的系数为负值,使模型具有更好的可解释性(设置 - 确保入模变量系数均为负值);
- 通常认为特征变量的 VIF 不宜超过 10。如果有 VIF 超过 10 的变量,可以每次去除一个 VIF 最高的变量,再重新进行模型开发。
About Rank Order Table | 关于等分排序表
Column Definition
- #: group number
- Cnt Bin: count of the group
- Cnt Bad: count bad of the group
- Cnt Good: count good of the group
- Bin Bad Rate: bad rate of the group = count bad / count of the group
- Cum Bad: cumulative bad
- Cum Good: cumulative good
- Cum Bad %: percentage of cumulative bad
- Cum Good %: percentage of cumulative good
- KS: KS value
- Score Min: minimum score of the group
- Score Max: maximum score of the group
- Score Avg: average score of the group
Rank Order Table
According to the modeling result, observations are scored and sorted from lowest to highest, then segmented into 10 equal-count groups. The rank order table shows whether the bin bad rates of the groups follow the same rank order.
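A minimal sketch of such a table with pandas; column names are abbreviated and t1modeler's exact layout may differ:

```python
import pandas as pd

def rank_order_table(score: pd.Series, target: pd.Series, n_groups: int = 10) -> pd.DataFrame:
    """Score-sorted equal-count groups with per-group and cumulative bad/good statistics."""
    df = pd.DataFrame({"score": score, "bad": target})
    df["group"] = pd.qcut(df["score"], q=n_groups, labels=False, duplicates="drop")
    g = df.groupby("group").agg(cnt_bin=("bad", "size"), cnt_bad=("bad", "sum"),
                                score_min=("score", "min"), score_max=("score", "max"),
                                score_avg=("score", "mean"))
    g["cnt_good"] = g["cnt_bin"] - g["cnt_bad"]
    g["bin_bad_rate"] = g["cnt_bad"] / g["cnt_bin"]
    g["cum_bad_pct"] = g["cnt_bad"].cumsum() / g["cnt_bad"].sum()
    g["cum_good_pct"] = g["cnt_good"].cumsum() / g["cnt_good"].sum()
    g["ks"] = (g["cum_bad_pct"] - g["cum_good_pct"]).abs()
    return g
```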
字段定义
- #: 分组序号
- Cnt Bin: 分组样本数量
- Cnt Bad: 分组坏样本数量
- Cnt Good: 分组好样本数量
- Bin Bad Rate: 分组坏比例 = 分组坏样本数量 / 分组样本数量
- Cum Bad: 累积坏样本数量
- Cum Good: 累积好样本数量
- Cum Bad %: 累积坏样本百分比
- Cum Good %: 累积好样本百分比
- KS: KS 值
- Score Min: 分组最低分数
- Score Max: 分组最高分数
- Score Avg: 分组平均分数
等分排序表
根据模型开发结果,将样本打分后从低到高排列,分成等分的 10 个分组,观察模型坏比例是否能够同样保持从低到高的排序。
About Histogram | 关于直方图
Histogram
The histogram charts show the percentage distributions of scores, bads, and goods.
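One plausible reading of these charts is the score distribution of bads versus goods, each normalised to percentages. A sketch with matplotlib; the names and binning below are illustrative, not t1modeler's plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

def score_histogram(score, target, bins=20):
    """Percentage distribution of scores for bads (target == 1) and goods (target == 0)."""
    score, target = np.asarray(score), np.asarray(target)
    for label, mask in (("bad", target == 1), ("good", target == 0)):
        weights = np.full(mask.sum(), 100.0 / mask.sum())   # bars sum to 100% per group
        plt.hist(score[mask], bins=bins, weights=weights, alpha=0.5, label=label)
    plt.xlabel("score")
    plt.ylabel("percent")
    plt.legend()
    plt.show()
```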
直方图
直方图展示分数,坏样本以及好样本的百分比分布。
About Model Comprehension | 关于模型解读
Model Features
Model features and their bin values are listed here.
Select different bin values, and the predicted score and predicted bad rate are displayed on the chart.
Model Prediction
Model prediction shows the predicted score, and the comparison between overall bad rate and the selected sample bad rate.
Generally speaking, the predicted score and its rank ordering are the keys to effective model implementation. The bad rate of the selected sample is for reference only.
Click the buttons "The Best" and "The Worst" to display the maximum score and the minimum score respectively. The "Rank Order Table" in section "4. Model Result | 模型结果" also presents a maximum score and a minimum score. The difference between the two pairs is that the maximum and minimum scores in the "Rank Order Table" are calculated from the actual variable values in the modeling sample, while the maximum and minimum scores in "5. Model Comprehension | 模型解读" are calculated from the theoretically possible variable combinations.
Scorecard
Scorecard shows the P0 score and binning scores for each variable.
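A sketch of how a points-style scorecard of this shape is applied; the base score, bin boundaries and point values below are made up for illustration and are not taken from any actual report:

```python
def apply_scorecard(row: dict, base_score: float, bin_scores: dict) -> float:
    """Total score = base score (P0) + the bin score of each model feature."""
    total = base_score
    for variable, rules in bin_scores.items():
        for condition, points in rules:          # first matching bin wins
            if condition(row[variable]):
                total += points
                break
    return total

# Hypothetical single-variable scorecard: P0 = 600, three bins for "age".
bin_scores = {"age": [(lambda v: v is None, 10),
                      (lambda v: v < 30, 5),
                      (lambda v: True, 20)]}
print(apply_scorecard({"age": 25}, base_score=600, bin_scores=bin_scores))   # 605
```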
特征变量
包含所有入模特征变量以及各变量的取值。
选取不同的取值组合,可以观察对应的模型分数和坏比例。
模型预测
模型预测显示指定变量值组合所对应的分数,以及总体坏比例与指定变量值组合的坏比例的对比图。
一般情况下,模型分数及其排序性是正确使用模型的关键,指定变量值组合的坏比例仅供参考。
如果点击按钮 “最好” 以及 “最坏”,可以看到对应的最高分数以及最低分数。在 “4. Model Result | 模型结果” 页面中的 “等分排序表” 里面,也有模型的最高分数和最低分数。其不同之处在于,“等分排序表” 里面的最高分数和最低分数是模型开发样本中实际出现的最高分数和最低分数,而 “5. Model Comprehension | 模型解读” 里面的最高分数和最低分数是理论上可能得到的最高分数和最低分数。
评分表
评分表显示基础分数以及每一个变量的分箱分数。
About Bin & WoE Code | 关于 Bin 与 WoE 代码
Variable Binning Code
Python script for generating variable binnings.
Variable WoE Code
Python script for generating variable WoE values. Model scoring code is at the bottom.
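For orientation, generated WoE code typically has the shape sketched below; the variable name, bin boundaries and WoE values here are invented and will not match any actual report:

```python
import numpy as np
import pandas as pd

def woe_age(s: pd.Series) -> pd.Series:
    """Map raw values of a hypothetical feature "age" to the WoE of their bin."""
    conditions = [s.isna(), s <= 25, s <= 40, s > 40]   # evaluated in order
    woe_values = [0.45, 0.30, -0.10, -0.55]
    return pd.Series(np.select(conditions, woe_values), index=s.index)
```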
变量分箱代码
用于生成分箱变量的 Python 代码。
变量 WoE 代码
用于生成 WoE 变量的 Python 代码。脚本的最下方有模型打分代码。
Validation analysis is unable to complete | 验证集分析无法完成
Reasons why the validation analysis cannot be completed:
1. Numerical features: unexpected missing values are found in the validation dataframe that are not present in the development dataframe. For instance, the feature "age" has no missing values in the development dataframe, but missing values are found in the validation dataframe;
2. String features: unexpected categorical values are found in the validation dataframe that are not present in the development dataframe. For instance, the feature "occupation" contains only 2 unique values, "farmer" and "manager", but the value "sales" (or a missing value) is found in the validation dataframe.
Solutions:
1. Numerical features: figure out why there are unexpected missing values in the validation dataframe, and replace them with correct values;
2. String features: align the values in the validation dataframe with those in the development dataframe.
Please read the prompt and correct the values in the validation dataframe.
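A sketch of how these two checks can be performed before scoring, assuming dev and val are the development and validation dataframes and features lists the model features (all names illustrative):

```python
import pandas as pd

def check_validation(dev: pd.DataFrame, val: pd.DataFrame, features: list) -> dict:
    """Flag unexpected missing values (numeric) and unseen categories (string)."""
    problems = {}
    for col in features:
        new_missing = val[col].isna().any() and not dev[col].isna().any()
        if pd.api.types.is_numeric_dtype(dev[col]):
            if new_missing:
                problems[col] = "unexpected missing values in the validation set"
        else:
            unseen = set(val[col].dropna()) - set(dev[col].dropna())
            if unseen or new_missing:
                problems[col] = f"unseen or missing values: {sorted(unseen)}"
    return problems
```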
验证集分析无法完成的原因:
1. 数值型特征变量,验证集中出现了缺失值,但是开发集中没有缺失值,这会导致验证集中的缺失值无法被打分。 例如,特征变量 "age" 在开发集中没有缺失值,但是在验证集中却出现了缺失值;
2. 字符型特征变量,验证集中出现了开发集中没有出现的取值,这会导致验证集中的取值无法被打分。例如,特征变量 "occupation" 在开发集中只有两种取值,分别为 "farmer" 和 "manager",但是在验证集中却多了一种取值 "sales"(或者出现了缺失值)。
解决方法:
1. 数值型特征变量,仔细检查验证集中出现缺失值的原因,将缺失值填充为正确的取值;
2. 字符型特征变量,根据开发集的取值内容,对齐验证集的取值。
请根据提示内容修正验证集的特征变量。
Scoring Code | 打分代码
Info | 提示
An error occurred when generating the Excel report. We will fix it as soon as possible. Try downloading the report later.
创建 Excel 报告时出现未知错误。我们将尽快修复。可以稍后再度尝试下载报告。