This section covers model options configuration, model building execution and model file information.
In the “Model options” window, you can configure model parameters to build a better model. The window has four tabs – “Normal”, “Binary model”, “Regression model” and “Multiclassification model” – on which you configure options for building different types of models.
Normal
“Data preprocessing”: Whether to preprocess data before performing data modeling.
“Preprocess numerical target variable”: Whether to preprocess a numerical target variable before performing data modeling.
“Smart imputation”: Use smart data imputation approach to impute missing values.
“Perform normal imputation if smart imputation fails”: Fall back to the normal data imputation approach to rebuild the model if smart imputation fails.
“Ensemble method”: Either “Optimal model strategy” or “Simple model combination”. The former selects the best top N models to build a new model and involves a relatively large amount of computation; the latter simply combines all defined models to build a new model and involves a relatively small amount of computation.
“Best number of ensembles”: Select the best model combination. 0 means selecting the most effective model among the combinations; >0 means selecting the fixed top N models; and <0 means selecting the one among the top N models that makes the most effective combination. Default is 0.
“Ensemble function”: The option specifies the approach of combining models. You can select any function included in numpy. Default is np.mean.
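For example, the default np.mean averages the member models’ scores. A minimal sketch of how such a NumPy reduction combines predictions (the score values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical scores from three member models over five rows of test data
scores = np.array([
    [0.91, 0.12, 0.55, 0.80, 0.33],  # model 1
    [0.88, 0.20, 0.47, 0.75, 0.40],  # model 2
    [0.95, 0.05, 0.60, 0.83, 0.29],  # model 3
])

# np.mean, the default ensemble function, averages across models (axis=0),
# yielding one combined score per row of test data
combined = np.mean(scores, axis=0)
print(combined)  # ≈ [0.913 0.123 0.540 0.793 0.340]
```

Any other NumPy reduction with the same shape contract, such as np.median or np.max, could be selected instead.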
“Model evaluation criterion”: For a binary model, choose AUC, cross_entropy, KS, AP, Recall, Lift or f1_score (default is AUC); for a regression model, MSE, MAE, MAPE or R2; for a multiclassification model, cross_entropy.
“Percentage of test data”: The percentage of data reserved for testing.
“Import test data in batches”: Set the number of rows of test data to be imported in each batch.
“Adjust scoring results”: By default, the model scoring results are adjusted according to the average of the sample data; without adjustment, the score is the average of the balanced samples.
“Set random seeds”: Controls the randomness of model building; the default is 0. If the value is null, two model building executions produce independent random results. If the value is an integer n, two executions with the same n produce the same results, and executions with different ns produce different results. When the random seed is set to n, the random_state of every model is set to that same value and can’t be changed manually.
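This behavior matches the random_state convention in common modeling libraries; a minimal scikit-learn illustration of the same idea (assuming, as the option text suggests, that the seed is what fixes random_state):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)

# Same seed n -> two executions build identical models
m1 = RandomForestClassifier(random_state=7).fit(X, y)
m2 = RandomForestClassifier(random_state=7).fit(X, y)
assert (m1.predict_proba(X) == m2.predict_proba(X)).all()

# A different seed (or random_state=None, like a null seed) may give
# different results between executions
m3 = RandomForestClassifier(random_state=8).fit(X, y)
```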
Binary model
You can configure parameters for binary models on “Binary model” tab. A selected binary model will be used for model building.
There are 9 types of binary model: TreeClassification, GBDTClassification, RFClassification, LogicClassification, RidgeClassification, FNNClassification, XGBClassification, CNNClassification and PCAClassification.
“Number of samples” determines the number of samples used to build a model.
Below are parameter configuration directions for binary models.
Appendix 1: Binary model parameters
Type & Range: The type is always followed by an interval indicating the parameter’s value range. Square brackets indicate a closed interval and parentheses indicate an open interval. Braces are used for certain int and float parameters to represent drop-down-menu values in the format {start value, end value, interval}; for example, {1, 5, 1}=[1,2,3,4,5] and {1, 6, 2}=[1,3,5]. All available values of a string parameter are listed in the drop-down menu and can’t be entered manually; a null value must also be selected through the drop-down menu, as must the bool values true and false. Values of int and float parameters can be entered manually, and all their available values are listed in the drop-down menu.
For some float parameters, if their values are integers, they need to be followed by .0, like 0.0 and 1.0.
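The braced notation expands like an inclusive arithmetic range. A small, purely illustrative helper (not part of the product) shows the expansion:

```python
def expand(start, end, interval):
    """Expand the {start value, end value, interval} drop-down notation."""
    values, v = [], start
    while v <= end:
        values.append(round(v, 10))  # round to suppress float drift
        v += interval
    return values

print(expand(1, 5, 1))        # [1, 2, 3, 4, 5]
print(expand(1, 6, 2))        # [1, 3, 5]
print(expand(0.1, 0.9, 0.1))  # [0.1, 0.2, ..., 0.9]
```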
TreeClassification
Parameter | Type & Range | Description
criterion | string: ["gini", "entropy"] | Criterion for evaluating node splitting.
splitter | string: ["best", "random"] | The splitting strategy for each node.
max_depth | int: [1, +∞), {1, 100, 1}; null | Maximum tree depth.
min_samples_split | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data for node splitting. An int is the minimum number of samples; a float is its proportion of the whole data.
min_samples_leaf | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data at a leaf node. An int is the minimum number of samples; a float is its proportion of the whole data.
min_weight_fraction_leaf | float: [0, 1), {0, 0.1, 0.01} | Minimum weighted fraction of the input sampling data required at a leaf node.
max_features | int: [1, +∞), {10, 1000, 10}; float: (0, 1]; string: ["auto", "sqrt", "log2"]; null | The maximum number of variables considered for the optimal node split: an int gives the number of variables and a float gives the proportion of variables. If "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if null, max_features=n_features.
max_leaf_nodes | int: [1, +∞), {10, 1000, 10}; null | Grow a tree with at most this many leaf nodes in best-first fashion; null means no limit on the number of leaf nodes.
min_impurity_decrease | float: [0, 1) | The lowest impurity decrease required for node splitting.
class_weight | string: ["balanced"]; null | Weights associated with classes in the form {class_label: weight}.
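These parameter names and ranges match scikit-learn’s DecisionTreeClassifier, so a configuration chosen on this tab can be read as follows (a sketch under that assumption; the values are arbitrary examples, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = DecisionTreeClassifier(
    criterion="gini",         # node-splitting criterion
    splitter="best",          # splitting strategy per node
    max_depth=10,             # null on the tab corresponds to None here
    min_samples_split=20,     # int: minimum samples needed to split a node
    min_samples_leaf=0.05,    # float: minimum proportion of data per leaf
    max_features="sqrt",      # consider sqrt(n_features) variables per split
    class_weight="balanced",  # reweight classes inversely to their frequency
)
clf.fit(X, y)
print(clf.score(X, y))
```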
GBDTClassification
Parameter | Type | Description
loss | string: ["deviance", "exponential"] | The loss function.
learning_rate | float: (0, 1), {0.1, 0.9, 0.1} | The learning rate; a higher rate speeds up training but may miss the optimal solution.
n_estimators | int: [1, +∞), {10, 500, 10} | Number of boosting stages.
subsample | float: (0, 1], {0.1, 1, 0.1} | The ratio of sampling data used by each base learner.
criterion | string: ["mse", "friedman_mse", "mae"] | Criterion for evaluating node splitting.
min_samples_split | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data for node splitting. An int is the minimum number of samples; a float is its proportion of the whole data.
min_samples_leaf | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data at a leaf node. An int is the minimum number of samples; a float is its proportion of the whole data.
min_weight_fraction_leaf | float: [0, 1), {0, 0.1, 0.01} | Minimum weighted fraction of the input sampling data required at a leaf node.
max_depth | int: [1, +∞), {1, 100, 1}; null | Maximum tree depth.
min_impurity_decrease | float: [0, 1) | The lowest impurity decrease required for node splitting.
max_features | int: [1, +∞), {10, 1000, 10}; float: (0, 1]; string: ["auto", "sqrt", "log2"]; null | The maximum number of variables considered for the optimal node split: an int gives the number of variables and a float gives the proportion of variables. If "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if null, max_features=n_features.
max_leaf_nodes | int: [1, +∞), {10, 1000, 10}; null | Grow a tree with at most this many leaf nodes in best-first fashion; null means no limit on the number of leaf nodes.
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not.
RFClassification
Parameter | Type | Description
n_estimators | int: [1, +∞), {10, 500, 10} | The number of trees.
criterion | string: ["gini", "entropy"] | Criterion for evaluating node splitting.
max_depth | int: [1, +∞), {1, 100, 1}; null | Maximum tree depth.
min_samples_split | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data for node splitting. An int is the minimum number of samples; a float is its proportion of the whole data.
min_samples_leaf | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data at a leaf node. An int is the minimum number of samples; a float is its proportion of the whole data.
min_weight_fraction_leaf | float: [0, 1), {0, 0.1, 0.01} | Minimum weighted fraction of the input sampling data required at a leaf node.
max_features | int: [1, +∞), {10, 1000, 10}; float: (0, 1]; string: ["auto", "sqrt", "log2"]; null | The maximum number of variables considered for the optimal node split: an int gives the number of variables and a float gives the proportion of variables. If "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if null, max_features=n_features.
max_leaf_nodes | int: [1, +∞), {10, 1000, 10}; null | Grow a tree with at most this many leaf nodes in best-first fashion; null means no limit on the number of leaf nodes.
min_impurity_decrease | float: [0, 1) | The lowest impurity decrease required for node splitting.
bootstrap | bool | Whether to use bootstrap sampling when generating a tree.
oob_score | bool | Whether to use out-of-bag samples to estimate accuracy.
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not.
class_weight | string: ["balanced"]; null | Weights associated with classes in the form {class_label: weight}.
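RFClassification’s parameters likewise read as scikit-learn’s RandomForestClassifier (an assumption; values are arbitrary examples). Note how bootstrap and oob_score interact:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,        # number of trees
    criterion="gini",
    max_depth=None,          # null on the tab: no depth limit
    bootstrap=True,          # sample with replacement for each tree
    oob_score=True,          # requires bootstrap=True
    class_weight="balanced",
)
clf.fit(X, y)
print(clf.oob_score_)        # accuracy estimated from out-of-bag samples
```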
LogicClassification
Parameter | Type | Description
penalty | string: ["l1", "l2", "elasticnet", "none"] | Penalty regularization. The "newton-cg", "sag" and "lbfgs" solvers support "l2" only; "elasticnet" is supported by the "saga" solver only; "none" means no regularization and is not supported by the "liblinear" solver.
dual | bool | Dual or primal formulation. The dual formulation is only implemented for the l2 penalty with the "liblinear" solver. Prefer dual=false when n_samples > n_features.
tol | float: (0, 1) | The tolerance value before stopping iteration.
C | float: (0, 1] | Inverse of regularization strength; must be positive.
fit_intercept | bool | Whether to include an intercept term.
intercept_scaling | float: (0, 1] | Only works when solver="liblinear".
class_weight | string: ["balanced"]; null | Weights associated with classes in the form {class_label: weight}.
solver | string: ["newton-cg", "lbfgs", "liblinear", "sag", "saga"] | The optimization algorithm.
max_iter | int: [1, +∞), {10, 500, 10} | Maximum iterations; only works when solver is "newton-cg", "lbfgs" or "sag".
multi_class | string: ["ovr", "multinomial", "auto"] | The algorithm for handling multiple classes. "ovr" builds a model for each class; "multinomial" doesn’t work with solver="liblinear".
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not.
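The solver/penalty compatibility rules above are easiest to see in code; a sketch assuming the parameters map onto scikit-learn’s LogisticRegression (values are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# "lbfgs" supports only the "l2" penalty, so this pairing is valid
clf = LogisticRegression(
    penalty="l2",
    C=1.0,              # inverse of regularization strength; must be positive
    solver="lbfgs",
    max_iter=200,
    class_weight="balanced",
).fit(X, y)

# "elasticnet" is supported by the "saga" solver only
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, max_iter=500).fit(X, y)
```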
RidgeClassification
Parameter | Type | Description
alpha | float: [0, +∞), {0.0, 10.0, 0.1} | Regularization strength; must be positive.
fit_intercept | bool | Whether to include an intercept term.
normalize | bool | Whether to normalize data.
max_iter | int: [1, +∞), {10, 500, 10}; null | Maximum iterations.
tol | float: (0, 1) | Precision of the final solution.
class_weight | string: ["balanced"]; null | Weights associated with classes in the form {class_label: weight}. If null, all classes are assumed to have weight one; "balanced" adjusts weights automatically.
solver | string: ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"] | The optimization algorithm.
FNNClassification and CNNClassification
Parameters for these two types of binary model are not supported for the time being due to some special features of the neural networks.
XGBClassification
Parameter | Type | Description
max_depth | int: [1, +∞), {1, 100, 1} | Maximum tree depth.
learning_rate | float: (0, 1), {0.1, 0.9, 0.1} | The learning rate; a higher rate speeds up training but may miss the optimal solution.
n_estimators | int: [1, +∞), {10, 500, 10} | Number of booster trees.
objective | string: ["binary:logistic", "binary:logitraw", "binary:hinge"] | Learning objective. binary:logistic: binary logistic regression that outputs probability; binary:logitraw: binary logistic regression that outputs the score before the logistic transformation; binary:hinge: binary hinge loss that outputs class 0 or class 1 instead of a probability.
booster | string: ["gbtree", "gblinear", "dart"] | The booster type used.
gamma | float: [0, +∞) | The minimum loss reduction required for node splitting.
min_child_weight | int: [1, +∞), {10, 1000, 10} | The minimum sum of sampling weights of child nodes.
max_delta_step | int: [0, +∞), {0, 10, 1} | The maximum delta step allowed for estimating a tree’s weight.
subsample | float: (0, 1], {0.1, 1.0, 0.1} | The proportion of the whole set of samplings used as the subsample for training a model.
colsample_bytree | float: (0, 1], {0.1, 1.0, 0.1} | Proportion of features randomly sampled for each tree.
colsample_bylevel | float: (0, 1], {0.1, 1.0, 0.1} | Proportion of features randomly sampled at each level for node splitting.
reg_alpha | float: [0, +∞), {0.0, 10.0, 0.1} | L1 regularization term.
reg_lambda | float: [0, +∞), {0.0, 10.0, 0.1} | L2 regularization term.
scale_pos_weight | float: (0, +∞) | Controls the balance of positive and negative samples.
base_score | float: (0, 1), {0.1, 0.9, 0.1} | The initial prediction score.
missing | float: (-∞, +∞); null | The value to be treated as missing.
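These parameter names match the xgboost package’s XGBClassifier, so a tab configuration can be reproduced as follows (the training data and values are arbitrary examples):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

clf = XGBClassifier(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    objective="binary:logistic",  # output probabilities
    booster="gbtree",
    gamma=0.0,                    # minimum loss reduction to split a node
    min_child_weight=10,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.0,                # L1 regularization term
    reg_lambda=1.0,               # L2 regularization term
)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```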
PCAClassification
Parameter | Type | Description
n_components | int: [1, min(row count, column count)]; null | The number of principal components to retain; null indicates auto-config, which is the default.
whiten | bool | Whether to whiten the components so that they have unit variance.
svd_solver | string: ["auto", "full", "arpack", "randomized"] | The SVD solver used for PCA; default is "full".
tol | float: (0, 1) | Tolerance to use; default is 0.0001.
fit_intercept | bool | Whether to include an intercept term.
max_iter | int: [1, +∞), {100, 1000, 100} | Maximum number of iterations.
reg_solver | string: ["newton-cg", "lbfgs", "sag", "saga"] | The regression solver used with PCA; default is "lbfgs".
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not.
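PCAClassification reads as PCA-based dimensionality reduction followed by a regression classifier (note the mix of PCA and solver parameters above). Whether YModel works this way internally is an assumption, but scikit-learn’s Pipeline expresses the same idea:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = Pipeline([
    ("pca", PCA(n_components=5,     # null on the tab: auto-config
                whiten=True,        # scale components to unit variance
                svd_solver="full")),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=500)),
])
model.fit(X, y)
print(model.score(X, y))
```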
Regression model
You can configure parameters for regression models on “Regression model” tab. A selected regression model will be used for model building.
There are 11 types of regression models – TreeRegression, GBDTRegression, RFRegression, LRegression, LassoRegression, ENRegression, RidgeRegression, FNNRegression, XGBRegression, CNNRegression and PCARegression.
“Number of samples” determines the number of samples used to build a model.
Below are parameter configuration directions for regression models.
Appendix 2: Regression model parameters
Type & Range: The type is always followed by an interval indicating the parameter’s value range. Square brackets indicate a closed interval and parentheses indicate an open interval. Braces are used for certain int and float parameters to represent drop-down-menu values in the format {start value, end value, interval}; for example, {1, 5, 1}=[1,2,3,4,5] and {1, 6, 2}=[1,3,5]. All available values of a string parameter are listed in the drop-down menu and can’t be entered manually; a null value must also be selected through the drop-down menu, as must the bool values true and false. Values of int and float parameters can be entered manually, and all their available values are listed in the drop-down menu.
For some float parameters, if their values are integers, they need to be followed by .0, like 0.0 and 1.0.
TreeRegression
Parameter | Type | Description
criterion | string: ["mse", "friedman_mse", "mae"] | Criterion for evaluating node splitting.
splitter | string: ["best", "random"] | The splitting strategy for each node.
max_depth | int: [1, +∞), {1, 100, 1}; null | Maximum tree depth.
min_samples_split | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data for node splitting. An int is the minimum number of samples; a float is its proportion of the whole data.
min_samples_leaf | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data at a leaf node. An int is the minimum number of samples; a float is its proportion of the whole data.
min_weight_fraction_leaf | float: [0, 1), {0, 0.1, 0.01} | Minimum weighted fraction of the input sampling data required at a leaf node.
max_features | int: [1, +∞), {10, 1000, 10}; float: (0, 1]; string: ["auto", "sqrt", "log2"]; null | The maximum number of variables considered for the optimal node split: an int gives the number of variables and a float gives the proportion of variables. If "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if null, max_features=n_features.
max_leaf_nodes | int: [1, +∞), {10, 1000, 10}; null | Grow a tree with at most this many leaf nodes in best-first fashion; null means no limit on the number of leaf nodes.
min_impurity_decrease | float: [0, 1) | The lowest impurity decrease required for node splitting.
GBDTRegression
Parameter | Type | Description
loss | string: ["ls", "lad", "huber", "quantile"] | The loss function.
learning_rate | float: (0, 1), {0.1, 0.9, 0.1} | The learning rate; a higher rate speeds up training but may miss the optimal solution.
n_estimators | int: [1, +∞), {10, 500, 10} | Number of boosting stages.
subsample | float: (0, 1], {0.1, 1, 0.1} | The ratio of sampling data used by each base learner.
criterion | string: ["mse", "friedman_mse", "mae"] | Criterion for evaluating node splitting.
min_samples_split | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data for node splitting. An int is the minimum number of samples; a float is its proportion of the whole data.
min_samples_leaf | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data at a leaf node. An int is the minimum number of samples; a float is its proportion of the whole data.
min_weight_fraction_leaf | float: [0, 1), {0, 0.1, 0.01} | Minimum weighted fraction of the input sampling data required at a leaf node.
max_depth | int: [1, +∞), {1, 100, 1}; null | Maximum tree depth.
min_impurity_decrease | float: [0, 1) | The lowest impurity decrease required for node splitting.
max_features | int: [1, +∞), {10, 1000, 10}; float: (0, 1]; string: ["auto", "sqrt", "log2"]; null | The maximum number of variables considered for the optimal node split: an int gives the number of variables and a float gives the proportion of variables. If "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if null, max_features=n_features.
alpha | float: (0, 1), {0.1, 0.9, 0.1} | The alpha-quantile of the huber and quantile loss functions; only applies when loss="huber" or loss="quantile".
max_leaf_nodes | int: [1, +∞), {10, 1000, 10}; null | Grow a tree with at most this many leaf nodes in best-first fashion; null means no limit on the number of leaf nodes.
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not.
RFRegression
Parameter | Type | Description
n_estimators | int: [1, +∞), {10, 500, 10} | The number of trees.
criterion | string: ["mse", "mae"] | Criterion for evaluating node splitting.
max_depth | int: [1, +∞), {1, 100, 1}; null | Maximum tree depth.
min_samples_split | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data for node splitting. An int is the minimum number of samples; a float is its proportion of the whole data.
min_samples_leaf | int: [1, +∞), {10, 1000, 10}; float: (0, 1) | Minimum amount of sampling data at a leaf node. An int is the minimum number of samples; a float is its proportion of the whole data.
min_weight_fraction_leaf | float: [0, 1), {0, 0.1, 0.01} | Minimum weighted fraction of the input sampling data required at a leaf node.
max_features | int: [1, +∞), {10, 1000, 10}; float: (0, 1]; string: ["auto", "sqrt", "log2"]; null | The maximum number of variables considered for the optimal node split: an int gives the number of variables and a float gives the proportion of variables. If "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if null, max_features=n_features.
max_leaf_nodes | int: [1, +∞), {10, 1000, 10}; null | Grow a tree with at most this many leaf nodes in best-first fashion; null means no limit on the number of leaf nodes.
min_impurity_decrease | float: [0, 1) | The lowest impurity decrease required for node splitting.
bootstrap | bool | Whether to use bootstrap sampling when generating a tree.
oob_score | bool | Whether to use out-of-bag samples to estimate accuracy.
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not.
LRegression
Parameter | Type | Description
fit_intercept | bool | Whether to include an intercept term.
normalize | bool | Whether to normalize data.
LassoRegression
Parameter | Type | Description
fit_intercept | bool | Whether to include an intercept term.
alpha | float: [0, +∞), {0.0, 10.0, 0.1}; null | The regularization penalty factor. null means auto-configure; a float value disables cv and max_n_alphas.
normalize | bool | Whether to normalize data.
precompute | string: ["auto"]; bool | Whether to precompute the Gram matrix to speed up model building.
max_iter | int: [1, +∞), {10, 500, 10} | Maximum number of iterations.
cv | int: [2, 20] | The number of cross-validation folds used to select the turning point.
max_n_alphas | int: [1, +∞), {100, 1000, 100} | The maximum number of alpha values searched during cross-validation.
positive | bool | Whether to restrict coefficients to be positive.
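The cv and max_n_alphas parameters resemble scikit-learn’s cross-validated LASSO estimators such as LassoLarsCV (an assumption; values are arbitrary examples):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoLarsCV

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# alpha left null on the tab: choose alpha automatically by cross-validation
auto = LassoLarsCV(cv=5, max_n_alphas=1000).fit(X, y)
print(auto.alpha_)                 # the alpha selected by cross-validation

# alpha given as a float: cv and max_n_alphas are disabled; fit directly
fixed = Lasso(alpha=0.5, max_iter=500).fit(X, y)
print(fixed.coef_[:3])
```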
ENRegression
Parameter | Type | Description
alpha | float: [0, +∞), {0.0, 10.0, 0.1}; null | A constant that multiplies the penalty term; null indicates auto-config, which is the default.
l1_ratio | float: [0, 1], {0.0, 1.0, 0.1}; null | The mixing parameter: the penalty is pure L2 when l1_ratio=0, pure L1 when l1_ratio=1, and a mixture of the two when l1_ratio falls between 0 and 1; null indicates auto-config, which is the default.
n_alphas | int: [1, +∞), {100, 1000, 100} | The number of alpha values searched; invalid when alpha is a float.
cv | int: [2, 20] | The number of cross-validation folds used to select the turning point; invalid when both alpha and l1_ratio are floats.
fit_intercept | bool | Whether to include an intercept term.
normalize | bool | Whether to normalize data.
precompute | bool | Whether to precompute the Gram matrix to speed up model building.
max_iter | int: [1, +∞), {10, 500, 10} | Maximum iterations.
tol | float: (0, 1) | The tolerance value before stopping iteration.
warm_start | bool | If true, reuse the result of the previous iteration; if false, do not. Works only when cv is disabled; invalid when cv is in effect.
positive | bool | Whether to restrict coefficients to be positive.
selection | string: ["cyclic", "random"] | "cyclic" iterates over the variables in order; "random" updates a random coefficient at each iteration.
RidgeRegression
Parameter | Type | Description
alpha | float: [0, +∞), {0.0, 10.0, 0.1} | Regularization strength; must be positive.
fit_intercept | bool | Whether to include an intercept term.
normalize | bool | Whether to normalize data.
max_iter | int: [1, +∞), {10, 500, 10}; null | Maximum iterations.
tol | float: (0, 1) | Precision of the final solution.
solver | string: ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"] | The optimization algorithm.
FNNRegression and CNNRegression
Parameters for these two types of regression model are not supported for the time being due to some special features of the neural networks.
XGBRegression
Parameter | Type | Description
max_depth | int: [1, +∞), {1, 100, 1} | Maximum tree depth.
learning_rate | float: (0, 1), {0.1, 0.9, 0.1} | The learning rate; a higher rate speeds up training but may miss the optimal solution.
n_estimators | int: [1, +∞), {10, 500, 10} | Number of booster trees.
objective | string: ["reg:squarederror", "reg:squaredlogerror", "reg:logistic"] | Learning objective. reg:squarederror: squared error loss; reg:squaredlogerror: squared log error loss; reg:logistic: logistic regression.
booster | string: ["gbtree", "gblinear", "dart"] | The booster type used.
gamma | float: [0, +∞) | The minimum loss reduction required for node splitting.
min_child_weight | int: [1, +∞), {10, 1000, 10} | The minimum sum of sampling weights of child nodes.
max_delta_step | int: [0, +∞), {0, 10, 1} | The maximum delta step allowed for estimating a tree’s weight.
subsample | float: (0, 1], {0.1, 1.0, 0.1} | The proportion of the whole set of samplings used as the subsample for training a model.
colsample_bytree | float: (0, 1], {0.1, 1.0, 0.1} | Proportion of features randomly sampled for each tree.
colsample_bylevel | float: (0, 1], {0.1, 1.0, 0.1} | Proportion of features randomly sampled at each level for node splitting.
reg_alpha | float: [0, +∞), {0.0, 10.0, 0.1} | L1 regularization term.
reg_lambda | float: [0, +∞), {0.0, 10.0, 0.1} | L2 regularization term.
scale_pos_weight | float: (0, +∞) | Controls the balance of positive and negative samples.
base_score | float: (0, 1), {0.1, 0.9, 0.1} | The initial prediction score.
missing | float: (-∞, +∞); null | The value to be treated as missing.
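As with the binary case, these names match the xgboost package’s XGBRegressor (the data and values below are arbitrary examples):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

reg = XGBRegressor(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=200,
    objective="reg:squarederror",  # squared error loss
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,                # L2 regularization term
)
reg.fit(X, y)
print(reg.predict(X[:3]))
```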
PCARegression
Parameter | Type | Description
n_components | int: [1, min(row count, column count)]; null | The number of principal components to retain; null indicates auto-config, which is the default.
whiten | bool | Whether to whiten the components so that they have unit variance.
svd_solver | string: ["auto", "full", "arpack", "randomized"] | The SVD solver used for PCA; default is "full".
tol | float: (0, 1) | Tolerance to use; default is 0.0001.
fit_intercept | bool | Whether to include an intercept term.
normalize | bool | Whether to normalize data.
Multiclassification model
You can configure parameters for multiclassification models on the “Multiclassification model” tab. A selected multiclassification model will be used for model building.
There are two types of multiclassification models – XGBMultiClassification and CNNMultiClassification.
“Number of samples” determines the number of samples used to build a model.
The following Appendix 3 lists parameters and their descriptions for the multiclassification models.
Appendix 3: Multiclassification model parameters
XGBMultiClassification
Parameter | Type | Description
max_depth | int: [1, +∞), {1, 100, 1} | Maximum tree depth.
learning_rate | float: (0, 1), {0.1, 0.9, 0.1} | The learning rate; a higher rate speeds up training but may miss the optimal solution.
n_estimators | int: [1, +∞), {10, 500, 10} | The number of trees.
booster | string: ["gbtree", "gblinear", "dart"] | The booster type used.
gamma | float: [0, +∞) | The minimum loss reduction required for node splitting.
min_child_weight | int: [1, +∞), {10, 1000, 10} | The minimum sum of sampling weights of child nodes.
max_delta_step | int: [0, +∞), {0, 10, 1} | The maximum delta step allowed for estimating a tree’s weight.
subsample | float: (0, 1], {0.1, 1.0, 0.1} | The proportion of the whole set of samplings used as the subsample for training a model.
colsample_bytree | float: (0, 1], {0.1, 1.0, 0.1} | Proportion of features randomly sampled for each tree.
colsample_bylevel | float: (0, 1], {0.1, 1.0, 0.1} | Proportion of features randomly sampled at each level for node splitting.
reg_alpha | float: [0, +∞), {0.0, 10.0, 0.1} | L1 regularization term.
reg_lambda | float: [0, +∞), {0.0, 10.0, 0.1} | L2 regularization term.
scale_pos_weight | float: (0, +∞) | Controls the balance of positive and negative samples.
base_score | float: (0, 1), {0.1, 0.9, 0.1} | The initial prediction score.
missing | float: (-∞, +∞); null | The value to be treated as missing.
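XGBMultiClassification presumably maps onto xgboost’s XGBClassifier with a multi-class objective (an assumption; the data and values are arbitrary examples):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = rng.integers(0, 3, size=600)  # synthetic three-class target

clf = XGBClassifier(
    objective="multi:softprob",   # output one probability per class
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
)
clf.fit(X, y)
print(clf.predict_proba(X[:2]))   # rows sum to 1 across the three classes
```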
CNNMultiClassification
Parameters for this type of model are not supported for the time being due to some special features of the neural networks.
To build a predictive model, you must choose a target variable and then select a model file through “Model file”. By default, the model file is stored in the same directory as the loaded data and has the same name as the loaded data file; you can define the path and name yourself. A model file has the .pcf extension.
Click the “Modeling” option under “Run”, or click the corresponding button on the toolbar, to open the “Build model” window, where model building information is output.
Model building is finished when the message “Model building is finished” appears.
“Importance” is displayed after a model is built, as shown below. A variable’s importance indicates its influence on the predictive result: the higher the importance, the greater the influence, and a variable with zero importance has no impact on the result. In the example below, the Sex variable has the biggest influence on the result.
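The idea behind the display is the standard feature-importance measure of tree-based models; whether YModel computes it exactly this way is an assumption. A scikit-learn sketch on synthetic data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; zero means the variable never contributed to a split
importance = pd.Series(clf.feature_importances_,
                       index=[f"var{i}" for i in range(5)])
print(importance.sort_values(ascending=False))  # highest influence first
```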
Model presentation
YModel encapsulates multiple algorithms for model building: Decision Tree, Gradient Boosting, Logistic Regression, Neural Network, Random Forest, Elastic Net, LASSO Regression, Linear Regression, Ridge Regression and XGBoost.
After model building is finished, you can click “Model presentation” in the “Build model” window to view the algorithm(s) used to build the model, as shown below:
Model performance
Model performance is reflected through a series of metrics and figures.
Click “Model performance” in “Build model” window:
Models built on different types of target variables are evaluated with different metrics and presentations.
Below is model performance information about binary target variable “Survived”:
Here’s model performance information about numerical target variable “Age”: