File name | 文件名: uci_001_adult_data_set.zip
id | Variable | Type | Count | Missing | Missing_Rate | Unique | Mean | Std | Min | 25% | 50% | 75% | 95% | 99% | Max | First | Last |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
KS Curve and ROC Curve | KS 曲线与 ROC 曲线
Rank Order Table | 等分排序表
# | Cnt Bin | Cnt Bad | Cnt Good | Bin Bad Rate | Remaining Bad Rate | Cum Bad | Cum Good | Cum Bad % | Cum Good % | KS | Score Min | Score Max | Score Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Histogram | 直方图
Feature Importance | 变量重要性
No. | Feature | Total Gains | # |
---|---|---|---|
Model Configuration | 模型配置
Validation dataframe not provided | 未提供验证集
KS Curve and ROC Curve | KS 曲线与 ROC 曲线
Rank Order Table | 等分排序表
# | Cnt Bin | Cnt Bad | Cnt Good | Bin Bad Rate | Remaining Bad Rate | Cum Bad | Cum Good | Cum Bad % | Cum Good % | KS | Score Min | Score Max | Score Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Histogram | 直方图
About Data Report | 关于数据报告
Column Definition
- id: variable id number
- Variable: name of the variable
- Type: variable type
- Count: number of rows
- Missing: number of missing values
- Missing_Rate: percentage of missing values
- Unique: number of unique values
- Mean: the average of the values
- Std: the standard deviation of the values
- Min: the minimum of the values
- 25%: 25th percentile
- 50%: 50th percentile, a.k.a. the median
- 75%: 75th percentile
- 95%: 95th percentile
- 99%: 99th percentile
- Max: the maximum of the values
- First: the earliest value of a datetime variable
- Last: the latest value of a datetime variable
Data Report
Reading the data report is the first step of modeling, and it is one of the important tasks that a machine cannot do for the analyst.
For example, in business settings modeling data is usually assembled by joining pieces of data from various sources, so missing values are common. A machine cannot tell whether it is correct for a variable named 'gender' to have over 80% missing values, or whether the join went wrong.
Likewise, a machine cannot tell whether a value of '180' in the variable 'age' should be corrected before modeling.
That is why it is important to read the data report carefully and make sure the modeling data is correct.
Checklist
- Type: Is the data type correct? Are there erroneous values that prevent a column from being converted to numeric?
- Missing_Rate: Does the missing rate make sense? The t1modeler algorithm handles missing values itself, so converting them to a placeholder such as -9999 is unnecessary.
- Unique: Is the number of unique values correct? A variable with only 1 unique value carries no information.
- Min: Is the minimum value correct? Are there unexpected negative values?
- Max: Is the maximum value correct? Are there outliers?
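The checklist can be partly automated as a first pass before the analyst's own review. The sketch below is a minimal illustration in plain Python and is not part of t1modeler: the function name `check_column`, the 50% missing-rate threshold, and the `valid_range` argument are all assumptions chosen for the example. It only flags candidates; deciding whether a flagged value is actually wrong still takes domain knowledge.

```python
def check_column(name, values, expect_numeric=True, valid_range=None,
                 max_missing_rate=0.5):
    """Return checklist warnings for one column (hypothetical helper).

    None and "" are treated as missing. valid_range is an optional
    (low, high) pair supplied by the analyst, e.g. (0, 120) for age.
    """
    warnings = []
    present = [v for v in values if v is not None and v != ""]
    missing_rate = 1 - len(present) / len(values) if values else 0.0

    # Missing_Rate: flag columns that are mostly empty (threshold is an assumption).
    if missing_rate > max_missing_rate:
        warnings.append(f"{name}: missing rate {missing_rate:.0%}")

    # Unique: a single distinct value carries no information.
    if len(set(present)) == 1:
        warnings.append(f"{name}: only one unique value")

    if expect_numeric:
        numeric, bad = [], []
        for v in present:
            try:
                numeric.append(float(v))
            except (TypeError, ValueError):
                bad.append(v)
        # Type: values that block conversion to numeric.
        if bad:
            warnings.append(f"{name}: {len(bad)} non-numeric value(s), e.g. {bad[0]!r}")
        # Min / Max: compare against the analyst-supplied plausible range.
        if numeric and valid_range is not None:
            low, high = valid_range
            if min(numeric) < low:
                warnings.append(f"{name}: min {min(numeric):g} below expected {low:g}")
            if max(numeric) > high:
                warnings.append(f"{name}: max {max(numeric):g} above expected {high:g}")
    return warnings


# Usage with the two examples from the text: an implausible age and a sparse gender column.
for w in check_column("age", [25, 41, None, 180, "n/a"], valid_range=(0, 120)):
    print(w)
for w in check_column("gender", [None] * 9 + ["F"], expect_numeric=False):
    print(w)
```

The plausible range is passed in by the caller on purpose: the code cannot know that 180 is an unlikely age, which is the point the section makes.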
字段含义
- id: 变量 id 序号
- Variable: 变量名称
- Type: 变量类型
- Count: 数据集行数
- Missing: 缺失值数量
- Missing_Rate: 缺失值比例
- Unique: 唯一值数量
- Mean: 平均值
- Std: 标准差
- Min: 最小值
- 25%: 25 分位值
- 50%: 50 分位值,即中位数
- 75%: 75 分位值
- 95%: 95 分位值
- 99%: 99 分位值
- Max: 最大值
- First: datetime 类型变量的最早值
- Last: datetime 类型变量的最晚值
数据报告
阅读数据报告通常是进行模型开发的第一步,并且这是单凭机器算法无法解决的一个关键步骤。
例如,实际商业环境中的建模数据由不同数据源拼接而成,因此可能存在一定的缺失率。机器算法无法告诉分析师,缺失率为 80% 的“性别”变量是否存在由拼接导致的数据错误问题。
又例如,机器算法无法分辨“年龄”为 180 的数据,是否需要在建模前进行纠正。
因此分析师需要仔细阅读数据报告以确保建模数据的准确性。
检查清单
- Type: 数据类型是否正确?是否存在本应为数值型的变量,由于某些错误值的存在,导致无法被识别为数值型?
- Missing_Rate: 数据缺失率是否正常?本建模算法能够自动处理缺失值,因此不需要将其转换为 -9999 等特殊值。
- Unique: 唯一值数量是否正常?一个变量的唯一值数量为 1,则此变量没有任何建模价值。
- Min: 最小值是否正常?是否存在不应该出现的负数?
- Max: 最大值是否正常?是否存在过大的异常值?
About KS Curve and ROC Curve | 关于 KS 曲线与 ROC 曲线
KS Curve
The KS curve shows how well a model separates events from non-events (or good events from bad events).
The max KS value is one of the most important indicators of model performance, and it is commonly assumed that the higher it is, the better the model. Analysts should nonetheless use their domain knowledge to check whether a max KS value is "too good to be true".
For instance, in credit risk modeling a reasonable max KS value is likely to fall between 20 and 60; a credit risk model with a max KS of 90 probably has a problem in data preparation. In handwritten digit recognition, by contrast, a max KS of 95 is entirely reasonable.
ROC Curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
A key summary metric is AUC (Area Under the Curve): 0.5 corresponds to random ranking and 1.0 to perfect separation. As with KS, a higher AUC is generally taken to mean a better model.
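To make the two metrics concrete, the sketch below computes a max KS value and an AUC from a list of scores and labels using only the Python standard library. It is an illustration, not the code t1modeler runs. It assumes that label 1 marks a bad (event) observation, that a higher score means a higher likelihood of bad, and that KS is reported on a 0 to 100 scale as in the ranges quoted above; the function name `ks_and_auc` is hypothetical.

```python
def ks_and_auc(scores, labels):
    """Return (max KS on a 0-100 scale, AUC) for binary labels (1 = bad, 0 = good)."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])  # highest score first
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both classes must be present")

    cum_bad = cum_good = 0
    prev_tpr = prev_fpr = 0.0
    ks = auc = 0.0
    i = 0
    while i < len(pairs):
        # Move the threshold past all observations that share this score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_bad += pairs[j][1]
            cum_good += 1 - pairs[j][1]
            j += 1
        tpr = cum_bad / n_bad    # cumulative share of bad captured
        fpr = cum_good / n_good  # cumulative share of good captured
        ks = max(ks, abs(tpr - fpr))
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid under the ROC curve
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return 100 * ks, auc


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
ks, auc = ks_and_auc(scores, labels)
print(f"max KS = {ks:.1f}, AUC = {auc:.2f}")  # max KS = 50.0, AUC = 0.75
```

Both numbers come from the same pair of cumulative curves: KS is the largest vertical gap between them, while AUC is the area under the curve that plots one against the other.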
KS 曲线
KS 曲线显示一个模型区分事件与非事件(或好事件与坏事件)的能力。
作为衡量模型性能的重要指标之一,通常认为最大 KS 值越高则模型越好。但是分析师需要结合领域知识,确认最大 KS 值是否“太完美以至于无法置信”。
例如,在信用风险领域,合理的模型最大 KS 值一般为 20 到 60 之间,如果一个信用风险模型的最大 KS 值为 90,则有可能是数据准备阶段出现了一些错误。但是在手写体数字识别领域,模型最大 KS 值高达 95 依然是合理的。
ROC 曲线
接收者操作特征曲线,或者 ROC 曲线,是一种坐标图式的分析工具,用于选择最佳的信号侦测模型、舍弃次佳的模型,在同一模型中设定最佳阈值。
其中一个关键指标为 AUC(曲线下面积)。同样地通常认为 AUC 值越高则模型越好。
About Rank Order Table | 关于等分排序表
Column Definition
- #: group number
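- Cnt Bin: number of observations in the group
- Cnt Bad: number of bad observations in the group
- Cnt Good: number of good observations in the group
- Bin Bad Rate: bad rate of the group = Cnt Bad / Cnt Bin
- Cum Bad: cumulative number of bad observations up to and including the group
- Cum Good: cumulative number of good observations up to and including the group
- Cum Bad %: Cum Bad as a percentage of all bad observations
- Cum Good %: Cum Good as a percentage of all good observations
- KS: KS value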
- Score Min: minimum score of the group
- Score Max: maximum score of the group
- Score Avg: average score of the group
Rank Order Table
Observations are scored by the model, sorted from lowest to highest score, and split into 10 equal-sized groups. The rank order table shows whether the bin bad rate follows the same order across the groups.
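The construction described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than t1modeler's implementation: label 1 marks a bad observation, the per-row KS is taken as the absolute gap between Cum Bad % and Cum Good %, and the "Remaining Bad Rate" column is left out because its definition is not given here. The function name `rank_order_table` is hypothetical.

```python
def rank_order_table(scores, labels, n_groups=10):
    """Build rank order rows as dicts keyed by the column names (1 = bad, 0 = good)."""
    pairs = sorted(zip(scores, labels))  # lowest score first
    n = len(pairs)
    total_bad = sum(labels)
    total_good = n - total_bad
    rows = []
    cum_bad = cum_good = 0
    for g in range(n_groups):
        # Integer arithmetic keeps the group sizes as equal as possible.
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        cnt_bad = sum(y for _, y in chunk)
        cnt_good = len(chunk) - cnt_bad
        cum_bad += cnt_bad
        cum_good += cnt_good
        cum_bad_pct = 100 * cum_bad / total_bad if total_bad else 0.0
        cum_good_pct = 100 * cum_good / total_good if total_good else 0.0
        group_scores = [s for s, _ in chunk]
        rows.append({
            "#": g + 1,
            "Cnt Bin": len(chunk),
            "Cnt Bad": cnt_bad,
            "Cnt Good": cnt_good,
            "Bin Bad Rate": cnt_bad / len(chunk),
            "Cum Bad": cum_bad,
            "Cum Good": cum_good,
            "Cum Bad %": cum_bad_pct,
            "Cum Good %": cum_good_pct,
            "KS": abs(cum_bad_pct - cum_good_pct),  # assumed per-row definition
            "Score Min": min(group_scores),
            "Score Max": max(group_scores),
            "Score Avg": sum(group_scores) / len(group_scores),
        })
    return rows


# Synthetic data: 100 observations whose bad rate rises with the score.
scores = [i / 100 for i in range(100)]
labels = [1 if i % 10 < i // 10 else 0 for i in range(100)]
for row in rank_order_table(scores, labels):
    print(row["#"], row["Cnt Bin"], row["Cnt Bad"], f'{row["Bin Bad Rate"]:.1f}', f'{row["KS"]:.1f}')
```

In this synthetic example the bin bad rate rises steadily from group 1 to group 10, which is the pattern the table is meant to confirm. A reversal between neighboring groups would suggest the score does not rank risk consistently in that range.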
字段定义
- #: 分组序号
- Cnt Bin: 分组样本数量
- Cnt Bad: 分组坏样本数量
- Cnt Good: 分组好样本数量
- Bin Bad Rate: 分组坏比例 = 分组坏样本数量 / 分组样本数量
- Cum Bad: 累积坏样本数量
- Cum Good: 累积好样本数量
- Cum Bad %: 累积坏样本百分比
- Cum Good %: 累积好样本百分比
- KS: KS 值
- Score Min: 分组最低分数
- Score Max: 分组最高分数
- Score Avg: 分组平均分数
等分排序表
根据模型开发结果,将样本打分后从低到高排列,分成等分的 10 个分组,观察模型坏比例是否能够同样保持从低到高的排序。
About Histogram | 关于直方图
Histogram
The histogram charts show the percentage distributions of the score, the bad observations, and the good observations.
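One way to read this is as three histograms over the same score bins: one for all observations, one for the bad ones, and one for the good ones, each normalized to 100%. The sketch below follows that reading. The equal-width bins, the score range of 0 to 1, and the function name `score_histogram` are assumptions made for the example, not a description of how t1modeler draws its charts.

```python
def score_histogram(scores, labels, n_bins=10, low=0.0, high=1.0):
    """Return per-bin percentages for all scores, bad (label 1), and good (label 0)."""
    width = (high - low) / n_bins
    counts = {"score": [0] * n_bins, "bad": [0] * n_bins, "good": [0] * n_bins}
    for s, y in zip(scores, labels):
        # Clamp so that scores on or outside the edges land in the first or last bin.
        b = max(0, min(int((s - low) / width), n_bins - 1))
        counts["score"][b] += 1
        counts["bad" if y == 1 else "good"][b] += 1
    # Each series is normalized by its own total, so each one sums to 100.
    return {k: [100 * c / max(sum(v), 1) for c in v] for k, v in counts.items()}


hist = score_histogram([0.05, 0.15, 0.15, 0.95], [0, 0, 1, 1])
print(hist["score"][:2], hist["bad"][1], hist["good"][0])  # [25.0, 50.0] 50.0 50.0
```

Plotting the bad and good series side by side shows how much the two score distributions overlap, which is the visual counterpart of the KS value.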
直方图
直方图展示分数,坏样本以及好样本的百分比分布。
About Feature Importance | 关于变量重要性
Feature Importance
Feature importance shows total gains of each feature in the model.
The higher the number, the more important the feature is in the model.
变量重要性
变量重要性显示每一个变量在模型中的总增益。
数值越大,变量在模型中的重要性越高。
About Tree Viewer | 关于树浏览器
Tree Viewer
Depending on the model setting "num_iterations (num_trees)", a LightGBM model may have one or many trees.
A typical LightGBM model comprises dozens of trees, so analysts may find it difficult to understand the structure of the model.
By presenting every tree generated by LightGBM, the t1modeler tree viewer helps analysts understand the model better.
树浏览器
根据不同的模型设置“迭代次数(树的个数)”,LightGBM 模型可能含有 1 棵或者多棵树。
典型的 LightGBM 模型通常含有数十棵树,分析师可能会难以理解模型的构成。
通过展示 LightGBM 模型的每一棵树,t1modeler 树浏览器帮助分析师更好地理解模型。
About Scoring Code | 关于打分代码
Scoring Code
Code that generates the LightGBM score for a given dataframe.
打分代码
在给定数据集上生成 LightGBM 分数的代码。