Model building

This section covers model options configuration, model building execution and model file information.

Model options

In the “Model options” window, you can configure model parameters to build a better model. There are four tabs – “Normal”, “Binary model”, “Regression model”, and “Multiclassification model” – on which you configure options for building different types of models.

 

Normal

 

“Data preprocessing”: Whether to preprocess the data before performing data modeling.

“Preprocess numerical target variable”: Whether to preprocess a numerical target variable before performing data modeling.

“Smart imputation”: Use the smart data imputation approach to impute missing values.

“Perform normal imputation if smart imputation fails”: Use the normal data imputation approach to re-create the model if smart imputation fails.

“Ensemble method”: “Optimal model strategy” and “Simple model combination”. The former selects the best top N models to build a new model and involves a relatively large amount of computation. The latter simply combines all defined models to build a new model and involves a relatively small amount of computation.

“Best number of ensembles”: Select the best model combination. 0 means selecting the most effective model among the combinations; >0 means selecting the fixed top N models; <0 means selecting, among the top N models, the one that makes the most effective combination. The default is 0.

“Ensemble function”: This option specifies how the member models’ results are combined. You can select any aggregate function included in NumPy; the default is np.mean.
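
A minimal sketch, assuming nothing about YModel’s internals, of how “Simple model combination” could apply an ensemble function such as np.mean to the member models’ scores (the score values are hypothetical):

```python
import numpy as np

# Hypothetical scores produced by three member models
# for the same five rows of test data.
scores = np.array([
    [0.91, 0.12, 0.55, 0.30, 0.77],  # model 1
    [0.88, 0.20, 0.40, 0.35, 0.81],  # model 2
    [0.95, 0.08, 0.62, 0.28, 0.70],  # model 3
])

# np.mean (the default ensemble function) aggregates the member scores
# column by column; any other NumPy aggregate, such as np.median or
# np.max, could be selected instead.
ensembled = np.mean(scores, axis=0)
print(ensembled)  # [0.9133... 0.1333... 0.5233... 0.31 0.76]
```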

“Model evaluation criterion”: For a binary model, use AUC (the default), cross_entropy, KS, AP, Recall, Lift, or f1_score; for a regression model, use MSE, MAE, MAPE, or R2; for a multiclassification model, use cross_entropy.
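
To illustrate two of the binary criteria, the sketch below computes AUC and KS from hypothetical labels and scores with scikit-learn; YModel computes these internally, so this is only a reference:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model scores for a binary target.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.5])

# AUC, the default criterion for binary models.
auc = roc_auc_score(y_true, y_score)

# KS: the maximum gap between the cumulative true-positive rate and
# false-positive rate along the ROC curve.
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)

print(f"AUC={auc:.3f}, KS={ks:.3f}")
```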

“Percentage of test data”: Test data percentage.

“Import test data in batches”: Set the number of rows of test data to be imported in each batch.

“Adjust scoring results”: By default, model scoring results are adjusted according to the average of the sample data. Without adjustment, the score is the average of the balanced samples.

“Set random seeds”: This option controls the randomness of model building; the default is 0. If the value is null, two model building executions will each produce random, different results. If the value is set to an integer n, two executions with the same n will produce the same results, and executions with different ns will produce different results. When the random seed is set to n, the random_state value of all models is set to that same value and can’t be changed manually.
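
This matches the usual random_state convention of scikit-learn-style libraries. The sketch below illustrates the semantics with a train/test split; the use of train_test_split here is illustrative, not YModel’s internal code:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# With the same integer seed, two runs produce identical results...
train1, test1 = train_test_split(data, test_size=0.3, random_state=42)
train2, test2 = train_test_split(data, test_size=0.3, random_state=42)
assert train1 == train2 and test1 == test2

# ...while a null seed (random_state=None) re-randomizes on every run.
train3, test3 = train_test_split(data, test_size=0.3, random_state=None)
```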

 

Binary model

You can configure parameters for binary models on the “Binary model” tab. A selected binary model will be used for model building.

There are nine types of binary models: TreeClassification, GBDTClassification, RFClassification, LogicClassification, RidgeClassification, FNNClassification, XGBClassification, CNNClassification, and PCAClassification.

“Number of samples” determines the number of samples used to build a model.

Below are parameter configuration directions for binary models.

Appendix 1: Binary model parameters

Type & Range: The type is always followed by an interval indicating the parameter’s value range. Square brackets indicate a closed interval; parentheses indicate an open interval. Braces are used for certain int and float parameters to represent the values offered in the drop-down menu, in the format {start value, end value, interval}; for example, {1, 5, 1}=[1,2,3,4,5] and {1, 6, 2}=[1,3,5]. All available values of a string parameter are listed in the drop-down menu and can’t be entered manually; a null value must likewise be selected from the drop-down menu, as must the bool values true and false. Values of int and float parameters can be entered manually, and all their available values are also listed in the drop-down menu.

For some float parameters, integer values must be written with a trailing .0, such as 0.0 and 1.0.
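
As a concrete illustration of the notation, this hypothetical helper (not part of YModel) expands {start value, end value, interval} into the list of drop-down values:

```python
def dropdown_values(start, end, interval):
    """Expand the {start value, end value, interval} notation used in
    the parameter tables into the list shown in the drop-down menu."""
    values, v = [], start
    # Tolerate floating-point drift so that the end value is included.
    while v <= end + interval * 1e-9:
        values.append(round(v, 10))
        v += interval
    return values

print(dropdown_values(1, 5, 1))        # [1, 2, 3, 4, 5]
print(dropdown_values(1, 6, 2))        # [1, 3, 5]
print(dropdown_values(0.1, 0.9, 0.1))  # [0.1, 0.2, ..., 0.9]
```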

TreeClassification

Parameter

Type & Range

Description

criterion

string: ["gini", "entropy"]

Criterion for evaluating node splitting.

splitter

string: ["best", "random"]

Choose a splitting strategy for each node.

max_depth

int: [1, +∞), {1, 100, 1}

null

Maximum tree depth.

min_samples_split

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data for node splitting: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_samples_leaf

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data at a leaf node: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_weight_fraction_leaf

float: [0, 1), {0, 0.1, 0.01}

The minimum weighted fraction of the total sample weights required at a leaf node.

max_features

int: [1, +∞), {10, 1000, 10}

float: (0, 1]

string: ["auto", "sqrt", "log2"]

null

The maximum number of variables considered for the optimal node splitting.

An int gives the max number of variables.

A float gives the max proportion of variables.

If "auto", then max_features=sqrt(n_features).

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If null, then max_features=n_features.

max_leaf_nodes

int: [1, +∞), {10, 1000, 10}

null

Grow trees in best-first fashion with at most this number of leaf nodes; null means there’s no limit on the number of leaf nodes.

min_impurity_decrease

float: [0, 1)

The minimum impurity decrease required for node splitting.

class_weight

string: ["balanced"]

null

Weights associated with classes in the form {class_label: weight}.
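
These parameters mirror those of scikit-learn’s DecisionTreeClassifier. Below is a minimal sketch of an equivalent configuration, assuming that correspondence holds (YModel’s internal wrapper is not documented here):

```python
from sklearn.tree import DecisionTreeClassifier

# A sketch of a TreeClassification-style configuration, using the
# parameter names listed in the table above.
clf = DecisionTreeClassifier(
    criterion="gini",         # node-splitting criterion
    splitter="best",          # splitting strategy per node
    max_depth=None,           # null in the table = unlimited depth
    min_samples_split=10,
    min_samples_leaf=10,
    max_features="sqrt",      # max_features = sqrt(n_features)
    class_weight="balanced",  # auto-adjust class weights
)
# clf.fit(X_train, y_train); clf.predict_proba(X_test)
```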

 

GBDTClassification

Parameter

Type

Description

loss

string: ["deviance", "exponential"]

A loss function.

learning_rate

float: (0, 1), {0.1, 0.9, 0.1}

The learning rate, which is in direct ratio to the training speed; a higher rate trains faster but may fail to reach an optimal solution.

n_estimators

int: [1, +∞), {10, 500, 10}

Number of boosting stages.

subsample

float: (0, 1], {0.1, 1, 0.1}

The ratio of sampling data used by each base learner.

criterion

string: ["mse", "friedman_mse", "mae"]

Criterion for evaluating node splitting.

min_samples_split

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data for node splitting: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_samples_leaf

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data at a leaf node: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_weight_fraction_leaf

float: [0, 1), {0, 0.1, 0.01}

The minimum weighted fraction of the total sample weights required at a leaf node.

max_depth

int: [1, +∞), {1, 100, 1}

null

Maximum tree depth.

min_impurity_decrease

float: [0, 1)

The minimum impurity decrease required for node splitting.

max_features

int: [1, +∞), {10, 1000, 10}

float: (0, 1]

string: ["auto", "sqrt", "log2"]

null

The maximum number of variables considered for the optimal node splitting.

An int gives the max number of variables.

A float gives the max proportion of variables.

If "auto", then max_features=sqrt(n_features).

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If null, then max_features=n_features.

max_leaf_nodes

int: [1, +∞), {10, 1000, 10}

null

Grow trees in best-first fashion with at most this number of leaf nodes; null means there’s no limit on the number of leaf nodes.

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch.

 

RFClassification

Parameter

Type

Description

n_estimators

int: [1, +∞), {10, 500, 10}

The number of trees.

criterion

string: ["gini", "entropy"]

Criterion for evaluating node splitting.

max_depth

int: [1, +∞), {1, 100, 1}

null

Maximum tree depth.

min_samples_split

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data for node splitting: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_samples_leaf

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data at a leaf node: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_weight_fraction_leaf

float: [0, 1), {0, 0.1, 0.01}

The minimum weighted fraction of the total sample weights required at a leaf node.

max_features

int: [1, +∞), {10, 1000, 10}

float: (0, 1]

string: ["auto", "sqrt", "log2"]

null

The maximum number of variables considered for the optimal node splitting.

An int gives the max number of variables.

A float gives the max proportion of variables.

If "auto", then max_features=sqrt(n_features).

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If null, then max_features=n_features.

max_leaf_nodes

int: [1, +∞), {10, 1000, 10}

null

Grow trees in best-first fashion with at most this number of leaf nodes; null means there’s no limit on the number of leaf nodes.

min_impurity_decrease

float: [0, 1)

The minimum impurity decrease required for node splitting.

bootstrap

bool

Whether to use bootstrap when generating a tree.

oob_score

bool

Whether to use out-of-bag samples to predict accuracy.

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch.

class_weight

string: ["balanced"]

null

Weights associated with classes in the form {class_label: weight}.

 

LogicClassification

Parameter

Type

Description

penalty

string: ["l1", "l2", "elasticnet", "none"]

The penalty (regularization) type. The "newton-cg", "sag" and "lbfgs" solvers support "l2" only; "elasticnet" is supported by the "saga" solver only; "none" means no regularization and is not supported by the "liblinear" solver.

dual

bool

Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.

tol

float: (0, 1)

The tolerance value before stopping iteration.

C

float: (0, 1]

Inverse of regularization strength, which must be positive.

fit_intercept

bool

Whether to include an intercept item.

intercept_scaling

float: (0, 1]

Only works when solver="liblinear".

class_weight

string: ["balanced"]

null

Weights associated with classes in the form {class_label: weight}.

solver

string: ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]

The optimization algorithm.

max_iter

int: [1, +∞), {10, 500, 10}

Maximum number of iterations; only works when solver is "newton-cg", "lbfgs", or "sag".

multi_class

string: ["ovr", "multinomial", "auto"]

The algorithm for handling multiple classes. "ovr" builds a model for each class; "multinomial" doesn’t work with solver="liblinear".

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch.
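
The penalty/solver compatibility rules above match scikit-learn’s LogisticRegression. A minimal sketch, assuming LogicClassification wraps that estimator:

```python
from sklearn.linear_model import LogisticRegression

# Valid penalty/solver pairings per the rules above.
ok = LogisticRegression(penalty="l2", solver="lbfgs", C=0.5, max_iter=200)
also_ok = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=200)

# This pairing would raise an error at fit time, because the "lbfgs"
# solver supports the "l2" penalty only.
bad = LogisticRegression(penalty="l1", solver="lbfgs")
```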

 

RidgeClassification

Parameter

Type

Description

alpha

float: [0, +∞), {0.0, 10.0, 0.1}

Regularization strength, which must be positive.

fit_intercept

bool

Whether to include an intercept item.

normalize

bool

Whether to normalize data.

max_iter

int: [1, +∞), {10, 500, 10}

null

Maximum iterations.

tol

float: (0, 1)

Precision of the final solution.

class_weight

string: ["balanced"]

null

Weights associated with classes in the form {class_label: weight}. If not given, all classes are assumed to have weight one; "balanced" auto-adjusts the weights.

solver

string: ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"]

The optimization algorithm.

 

FNNClassification and CNNClassification

Parameter configuration for these two types of binary models is not supported for the time being due to special features of the neural networks.

 

XGBClassification

Parameter

Type

Description

max_depth

int: [1, +∞), {1, 100, 1}

Maximum tree depth.

learning_rate

float: (0, 1), {0.1, 0.9, 0.1}

The learning rate, which is in direct ratio to the training speed; a higher rate trains faster but may fail to reach an optimal solution.

n_estimators

int: [1, +∞), {10, 500, 10}

Number of booster trees.

objective

string: ["binary:logistic", "binary:logitraw", "binary:hinge"]

Learning objective:

binary:logistic: Binary logistic regression that outputs a probability;

binary:logitraw: Binary logistic regression that outputs the score before the logistic transformation;

binary:hinge: Binary hinge loss that outputs class 0 or class 1 instead of a probability.

booster

string: ["gbtree", "gblinear", "dart"]

The booster type used.

gamma

float: [0, +∞)

The minimum loss reduction required for node splitting.

min_child_weight

int: [1, +∞), {10, 1000, 10}

The minimum sum of sampling weights of child nodes.

max_delta_step

int: [0, +∞), {0, 10, 1}

The maximum delta step allowed when estimating each tree’s weight.

subsample

float: (0, 1], {0.1, 1.0, 0.1}

The proportion of the training data subsampled for building each tree.

colsample_bytree

float: (0, 1], {0.1, 1.0, 0.1}

Proportion of the random sampling from the features for each tree.

colsample_bylevel

float: (0, 1], {0.1, 1.0, 0.1}

Proportion of random sampling from the features on each horizontal level for node splitting.

reg_alpha

float: [0, +∞), {0.0, 10.0, 0.1}

L1 regularization term.

reg_lambda

float: [0, +∞), {0.0, 10.0, 0.1}

L2 regularization term.

scale_pos_weight

float: (0, +∞)

Control the balance of positive samples and negative samples.

base_score

float: (0, 1), {0.1, 0.9, 0.1}

The initial value for starting a prediction.

missing

float: (-∞, +∞)

null

Define a missing value.
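
These parameters match the scikit-learn API of XGBoost’s XGBClassifier. A minimal sketch, assuming XGBClassification wraps that estimator:

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,             # number of booster trees
    objective="binary:logistic",  # output probabilities
    booster="gbtree",
    gamma=0.0,                    # min loss reduction to split a node
    min_child_weight=10,
    subsample=0.8,                # row subsampling per tree
    colsample_bytree=0.8,         # column subsampling per tree
    scale_pos_weight=1.0,         # rebalance positive/negative samples
)
# clf.fit(X_train, y_train); clf.predict_proba(X_test)
```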

 

PCAClassification

Parameter

Type

Description

n_components

int or null: [1, min(row count, column count)]

The number of principal components to retain; null indicates auto-configuration, which is the default.

whiten

bool

Whether to whiten the components so that they have unit variance.

svd_solver

string: ["auto", "full", "arpack", "randomized"]

The SVD solver used for PCA; the default is "full".

tol

float: (0, 1)

Tolerance to use; default is 0.0001.

fit_intercept

bool

Whether to include an intercept item.

max_iter

int: [1, +∞), {100, 1000, 100}

Maximum number of iterations.

reg_solver

string: ["newton-cg", "lbfgs", "sag", "saga"]

The solver used for the regression step; the default is "lbfgs".

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch.
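
The parameter list suggests that PCAClassification chains a PCA step with a regression-based classifier. Below is a minimal sketch of such a pipeline in scikit-learn; the pairing with LogisticRegression is my assumption, not YModel’s documented internals:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    # PCA step: n_components=None (null) auto-configures the
    # number of retained components.
    PCA(n_components=None, whiten=False, svd_solver="full", tol=0.0001),
    # Hypothetical regression step on the principal components.
    LogisticRegression(fit_intercept=True, max_iter=500, solver="lbfgs"),
)
# clf.fit(X_train, y_train); clf.predict_proba(X_test)
```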

 

Regression model

You can configure parameters for regression models on the “Regression model” tab. A selected regression model will be used for model building.

There are eleven types of regression models – TreeRegression, GBDTRegression, RFRegression, LRegression, LassoRegression, ENRegression, RidgeRegression, FNNRegression, XGBRegression, CNNRegression, and PCARegression.

“Number of samples” determines the number of samples used to build a model.

Below are parameter configuration directions for regression models.

Appendix 2: Regression model parameters

Type & Range: The type is always followed by an interval indicating the parameter’s value range. Square brackets indicate a closed interval; parentheses indicate an open interval. Braces are used for certain int and float parameters to represent the values offered in the drop-down menu, in the format {start value, end value, interval}; for example, {1, 5, 1}=[1,2,3,4,5] and {1, 6, 2}=[1,3,5]. All available values of a string parameter are listed in the drop-down menu and can’t be entered manually; a null value must likewise be selected from the drop-down menu, as must the bool values true and false. Values of int and float parameters can be entered manually, and all their available values are also listed in the drop-down menu.

For some float parameters, integer values must be written with a trailing .0, such as 0.0 and 1.0.

 

TreeRegression

Parameter

Type

Description

criterion

string: ["mse", "friedman_mse", "mae"]

Criterion for evaluating node splitting.

splitter

string: ["best", "random"]

Choose a splitting strategy for each node. 

max_depth

int: [1, +∞), {1, 100, 1}

null

Maximum tree depth.

min_samples_split

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data for node splitting: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_samples_leaf

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data at a leaf node: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_weight_fraction_leaf

float: [0, 1), {0, 0.1, 0.01}

The minimum weighted fraction of the total sample weights required at a leaf node.

max_features

int: [1, +∞), {10, 1000, 10}

float: (0, 1]

string: ["auto", "sqrt", "log2"]

null

The maximum number of variables considered for the optimal node splitting.

An int gives the max number of variables.

A float gives the max proportion of variables.

If "auto", then max_features=sqrt(n_features).

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If null, then max_features=n_features.

max_leaf_nodes

int: [1, +∞), {10, 1000, 10}

null

Grow trees in best-first fashion with at most this number of leaf nodes; null means there’s no limit on the number of leaf nodes.

min_impurity_decrease

float: [0, 1)

The minimum impurity decrease required for node splitting.

 

GBDTRegression

Parameter

Type

Description

loss

string: ["ls", "lad", "huber", "quantile"]

A loss function.

learning_rate

float: (0, 1), {0.1, 0.9, 0.1}

The learning rate, which is in direct ratio to the training speed; a higher rate trains faster but may fail to reach an optimal solution.

n_estimators

int: [1, +∞), {10, 500, 10}

Number of boosting stages.

subsample

float: (0, 1], {0.1, 1, 0.1}

The ratio of sampling data used by each base learner.

criterion

string: ["mse", "friedman_mse", "mae"]

Criterion for evaluating node splitting.

min_samples_split

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data for node splitting: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_samples_leaf

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data at a leaf node: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_weight_fraction_leaf

float: [0, 1), {0, 0.1, 0.01}

The minimum weighted fraction of the total sample weights required at a leaf node.

max_depth

int: [1, +∞), {1, 100, 1}

null

Maximum tree depth.

min_impurity_decrease

float: [0, 1)

The minimum impurity decrease required for node splitting.

max_features

int: [1, +∞), {10, 1000, 10}

float: (0, 1]

string: ["auto", "sqrt", "log2"]

null

The maximum number of variables considered for the optimal node splitting.

An int gives the max number of variables.

A float gives the max proportion of variables.

If "auto", then max_features=sqrt(n_features).

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If null, then max_features=n_features.

alpha

float: (0, 1), {0.1, 0.9, 0.1}

The alpha-quantile of the Huber loss function and the quantile loss function; only used if loss="huber" or loss="quantile".

max_leaf_nodes

int: [1, +∞), {10, 1000, 10}

null

Grow trees in best-first fashion with at most this number of leaf nodes; null means there’s no limit on the number of leaf nodes.

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch.

 

RFRegression

Parameter

Type

Description

n_estimators

int: [1, +∞), {10, 500, 10}

The number of trees.

criterion

string: ["mse", "mae"]

Criterion for evaluating node splitting.

max_depth

int: [1, +∞), {1, 100, 1}

null

Maximum tree depth.

min_samples_split

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data for node splitting: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_samples_leaf

int: [1, +∞), {10, 1000, 10}

float: (0, 1)

Minimum amount of sampling data at a leaf node: an int gives the minimum number of samples, and a float gives the proportion of the whole data.

min_weight_fraction_leaf

float: [0, 1), {0, 0.1, 0.01}

The minimum weighted fraction of the total sample weights required at a leaf node.

max_features

int: [1, +∞), {10, 1000, 10}

float: (0, 1]

string: ["auto", "sqrt", "log2"]

null

The maximum number of variables considered for the optimal node splitting.

An int gives the max number of variables.

A float gives the max proportion of variables.

If "auto", then max_features=sqrt(n_features).

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If null, then max_features=n_features.

max_leaf_nodes

int: [1, +∞), {10, 1000, 10}

null

Grow trees in best-first fashion with at most this number of leaf nodes; null means there’s no limit on the number of leaf nodes.

min_impurity_decrease

float: [0, 1)

The minimum impurity decrease required for node splitting.

bootstrap

bool

Whether to use bootstrap when generating a tree.

oob_score

bool

Whether to use out-of-bag samples to predict accuracy.

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch.

 

LRegression

Parameter

Type

Description

fit_intercept

bool

Whether to include an intercept item.

normalize

bool

Whether to normalize data.

 

LassoRegression

Parameter

Type

Description

fit_intercept

bool

Whether to include an intercept item.

alpha

float or null: [0, +∞), {0.0, 10.0, 0.1}

The regularization penalty factor. A null value means auto-configuration; a float value disables cv and max_n_alphas.

normalize

bool

Whether to normalize data.

precompute

string: ["auto"]

bool

Whether to precompute the Gram matrix to speed up model building.

max_iter

int: [1, +∞), {10, 500, 10}

Maximum number of iterations.

cv

int: [2, 20]

The number of cross-validation folds.

max_n_alphas

int: [1, +∞) , {100, 1000, 100}

The maximum number of alphas searched during cross-validation.

positive

bool

Whether to force the coefficients to be positive.
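
The cv and max_n_alphas parameters correspond to scikit-learn’s cross-validated Lasso estimators. A minimal sketch of the two regimes described above, assuming LassoLarsCV for the auto-configured case and Lasso for a fixed alpha (the exact internal estimators are an assumption):

```python
from sklearn.linear_model import Lasso, LassoLarsCV

# alpha = null: the penalty is auto-configured by cross-validation,
# which is when cv and max_n_alphas take effect.
auto = LassoLarsCV(cv=5, max_n_alphas=1000, positive=False)

# alpha given as a float: fit with that fixed penalty;
# cv and max_n_alphas are disabled.
fixed = Lasso(alpha=0.1, fit_intercept=True, max_iter=500, positive=False)
```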

 

ENRegression

Parameter

Type

Description

alpha

float or null: [0, +∞), {0.0, 10.0, 0.1}

A constant that multiplies the penalty term; null indicates auto-configuration, which is the default.

l1_ratio

float or null: [0, 1], {0.0, 1.0, 0.1}

The mixing parameter: the penalty is pure L2 when l1_ratio=0, pure L1 when l1_ratio=1, and a mixture of the two when l1_ratio falls between 0 and 1; null indicates auto-configuration, which is the default.

n_alphas

int: [1, +∞), {100, 1000, 100}

The number of alpha values searched; invalid when alpha is a float.

cv

int: [2, 20]

The number of cross-validation folds; invalid when both alpha and l1_ratio are floats.

fit_intercept

bool

Whether to include an intercept item.

normalize

bool

Whether to normalize data.

precompute

bool

Whether to precompute the Gram matrix to speed up model building.

max_iter

int: [1, +∞), {10, 500, 10}

Maximum iterations.

tol

float: (0, 1)

The tolerance value before stopping iteration.

warm_start

bool

If true, reuse the result of the previous training; if false, train from scratch. This only takes effect when cv is disabled; it is invalid when cv is in effect.

positive

bool

Whether to force the coefficients to be positive.

selection

string: ["cyclic", "random"]

"cyclic" means iteration by loop by variables; "random" represents a random iteration coefficient.

 

RidgeRegression

Parameter

Type

Description

alpha

float: [0, +∞), {0.0, 10.0, 0.1}

Regularization strength, which must be positive.

fit_intercept

bool

Whether to include an intercept item.

normalize

bool

Whether to normalize data.

max_iter

int: [1, +∞), {10, 500, 10}

null

Maximum iterations.

tol

float: (0, 1)

Precision of the final solution.

solver

string: ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"]

The optimization algorithm.

 

FNNRegression and CNNRegression

Parameter configuration for these two types of regression models is not supported for the time being due to special features of the neural networks.

 

XGBRegression

Parameter

Type

Description

max_depth

int: [1, +∞), {1, 100, 1}

Maximum tree depth.

learning_rate

float: (0, 1), {0.1, 0.9, 0.1}

The learning rate, which is in direct ratio to the training speed; a higher rate trains faster but may fail to reach an optimal solution.

n_estimators

int: [1, +∞), {10, 500, 10}

Number of booster trees.

objective

string: ["reg:squarederror", "reg:squaredlogerror", "reg:logistic"]

Learning objective:

reg:squarederror: Squared error loss;

reg:squaredlogerror: Squared log error loss;

reg:logistic: Logistic regression.

booster

string: ["gbtree", "gblinear", "dart"]

The booster type used.

gamma

float: [0, +∞)

The minimum loss reduction required for node splitting.

min_child_weight

int: [1, +∞), {10, 1000, 10}

The minimum sum of sampling weights of child nodes.

max_delta_step

int: [0, +∞), {0, 10, 1}

The maximum delta step allowed when estimating each tree’s weight.

subsample

float: (0, 1], {0.1, 1.0, 0.1}

The proportion of the training data subsampled for building each tree.

colsample_bytree

float: (0, 1], {0.1, 1.0, 0.1}

Proportion of the random sampling from the features for each tree.

colsample_bylevel

float: (0, 1], {0.1, 1.0, 0.1}

Proportion of random sampling from the features on each horizontal level for node splitting.

reg_alpha

float: [0, +∞), {0.0, 10.0, 0.1}

L1 regularization term.

reg_lambda

float: [0, +∞), {0.0, 10.0, 0.1}

L2 regularization term.

scale_pos_weight

float: (0, +∞)

Control the balance of positive samples and negative samples.

base_score

float: (0, 1), {0.1, 0.9, 0.1}

The initial value for starting a prediction.

missing

float: (-∞, +∞)

null

Define a missing value.

 

PCARegression

Parameter

Type

Description

n_components

int or null: [1, min(row count, column count)]

The number of principal components to retain; null indicates auto-configuration, which is the default.

whiten

bool

Whether to whiten the components so that they have unit variance.

svd_solver

string: ["auto", "full", "arpack", "randomized"]

The SVD solver used for PCA; the default is "full".

tol

float: (0, 1)

Tolerance to use; default is 0.0001.

fit_intercept

bool

Whether to include an intercept item.

normalize

bool

Whether to normalize data.

 

Multiclassification model

You can configure parameters for multiclassification models on the “Multiclassification model” tab. A selected multiclassification model will be used for model building.

There are two types of multiclassification models – XGBMultiClassification and CNNMultiClassification.

“Number of samples” determines the number of samples used to build a model.

Appendix 3 below lists the parameters and their descriptions for multiclassification models.

Appendix 3: Multiclassification model parameters

XGBMultiClassification

Parameter

Type

Description

max_depth

int: [1, +∞), {1, 100, 1}

Maximum tree depth.

learning_rate

float: (0, 1), {0.1, 0.9, 0.1}

The learning rate, which is in direct ratio to the training speed; a higher rate trains faster but may fail to reach an optimal solution.

n_estimators

int: [1, +∞), {10, 500, 10}

The number of trees.

booster

string: ["gbtree", "gblinear", "dart"]

The booster type used.

gamma

float: [0, +∞)

The minimum loss reduction required for node splitting.

min_child_weight

int: [1, +∞), {10, 1000, 10}

The minimum sum of sampling weights of child nodes.

max_delta_step

int: [0, +∞), {0, 10, 1}

The maximum delta step allowed when estimating each tree’s weight.

subsample

float: (0, 1], {0.1, 1.0, 0.1}

The proportion of the training data subsampled for building each tree.

colsample_bytree

float: (0, 1], {0.1, 1.0, 0.1}

Proportion of the random sampling from the features for each tree.

colsample_bylevel

float: (0, 1], {0.1, 1.0, 0.1}

Proportion of random sampling from the features on each horizontal level for node splitting.

reg_alpha

float: [0, +∞), {0.0, 10.0, 0.1}

L1 regularization term.

reg_lambda

float: [0, +∞), {0.0, 10.0, 0.1}

L2 regularization term.

scale_pos_weight

float: (0, +∞)

Control the balance of positive samples and negative samples.

base_score

float: (0, 1), {0.1, 0.9, 0.1}

The initial value for starting a prediction.

missing

float: (-∞, +∞)

null

Define a missing value.
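
A minimal sketch, assuming XGBMultiClassification wraps XGBoost’s XGBClassifier; the "multi:softprob" objective is my assumption, since the table above doesn’t list an objective parameter:

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    objective="multi:softprob",  # per-class probabilities; the number
                                 # of classes is inferred from the labels
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    subsample=0.8,
)
# clf.fit(X_train, y_train); clf.predict_proba(X_test)
```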

 

CNNMultiClassification

Parameter configuration for this type of model is not supported for the time being due to special features of the neural networks.

 

Execute model building

To build a predictive model, you must choose a target variable and then specify a model file through “Model file”. By default, the model file is stored in the same directory as the loaded data and has the same name as the loaded data file; you can define the path and name yourself. A model file has the .pcf extension.

Click the “Modeling” option under “Run”, or click the corresponding button on the toolbar, to open the “Build model” window, where model building information is output.

Model building is finished when the message “Model building is finished” appears.

“Importance” will be displayed after a model is built, as shown below. A variable’s degree of importance indicates its influence on the predictive result: the higher the degree of importance, the bigger the variable’s influence. A variable with an importance of zero has no impact on the predictive result. In the example below, the Sex variable has the biggest influence on the result.

Model file information

Model presentation

YModel encapsulates multiple algorithms for model building: Decision Tree, Gradient Boosting, Logistic Regression, Neural Network, Random Forest, Elastic Net, LASSO Regression, Linear Regression, Ridge Regression, and XGBoost.

After model building is finished, you can click “Model presentation” in the “Build model” window to view the algorithm(s) used to build the model, as shown below:

 

Model performance

Model performance is reflected through a series of indicators and figures.

Click “Model performance” in the “Build model” window:

Models built on different types of target variables are evaluated with different indicators and in different forms.

Below is model performance information about binary target variable “Survived”:

Here’s model performance information about numerical target variable “Age”: