Before building a model, we need to configure a set of parameters: the data source, the variables on which the model is built, the model building options, the path where the model file is stored, whether to preprocess the data, and so on. On execution, these parameters are saved as an mcf file containing JSON strings. This section describes the parameters for model building.
Parameters
{
"modelType" // Source data type; default is 0, which represents a local file
"modelMetaData": //Metadata{
"fieldList": // List of variables [
{
"varName"// Variable name
"dataType" //Variable type: 0 – Default type auto-check; 1 – Binary variable; 2 – Unary variable; 3 – Categorical variable; 11 – Numerical variable; 12 – Count variable whose value is an integer; 13 – Datetime variable; 20 – ID; 21 – Text string
// Note: auto-check may detect the type incorrectly
"isTarget" // A target variable or not; a Boolean value of true or false; only one true value is allowed.
"isActionable" // Operatable or not; a Boolean value of true or false; default is true
"isSource" // Non-target variable or not; a Boolean value of true or false; default is true
"importance"// Variable importance degree (0-1) returned after model building is finished
"isComputeCol"// Computed variable (column) or not; a Boolean value of true or false
"isTypeDefined" // Data type is defined or not; a Boolean value of true or false
"isSourceDefined" // Variable selection is done or not; a Boolean value of true or false
},
… //Next variable
… //
]
"modelFile" // Model file (.pcf file) path
"dsType" // Loaded file type – 0, txt and csv
"dsConfig" // File loading configuration {
"srcFilePath" //Source file path
"hasTitle" // Load headers or not
"fieldNames"//Field name
"fieldTypes" //Field type
"useDisp" // Display line count or not
"dispNumber" //Display line count
"charset" // Character set configuration
"separator" // Separator
"isEscape"// Remove all quotation marks
"isTransQuota" // Use double quotation marks as escape character or not
"checkValid" // Check a line where column count does not match value count at line 1
"skipErrorRow" // Skip ineligible lines
"useTop" // Import Top N line or not
"topNumber" // Import Top N lines
"useBlock"// Import block by block or not
"blockIndex" // Block number
"blockCount" //Block count
"dateFormat" // Date format
"timeFormat" // Time format
"dateTimeFormat" // Datetime format
"missingFormat" // Missing value definition
"language" // Language for locale
"country" // Country for locale
"variant" //locale variable
"formats" // List of datetime formats
},
"needPrepare" // Preprocess source data or not
"parallelNumber" // Parallel tasks count for preprocessing
"isIntelligenceImpute" // Intelligent impute or not
"isResample" //Resample or not
"balanceParams" // Balanced sample ratio (int type); target variable balance parameters, whose value is the range 1-9; [1] means the ratio of majority sample and minority sample is 1:1
"advanceSelect" // Use advanced variable selection configuration or not; output all data for preprocessing when selected
"optimalParam" // Search optimal parameter or not
"resampleMultiple" //Sample multiplier
"testDataPercent" // Test data percentage is 1%-99%
"ensembleMethod" // “Best top N” and “Simple”; the former selects best N models to build a new model and involves a comparatively large computation amount; the latter just combines all defined models and involves a comparatively small computation amount
"resampleBestN" //If is_resample=true, resample and select multiple models, which is equivalent to Best N; recommended default N is 3
"resampleNumber" // Sample count
"ensembleFunc" // Ensemble function
"ensembleBestN" // Best N for model building ensemble
"dataBalances" // Target variable data ratio of majority to minority (float type)
"chunkSize" // Scoring result set chunk count
"adjustProb" // Adjust scoring result or not
"fixedSeed" // Fixed seeds or not
"randomState" // Set random seeds to control model building randomness; default is 0; 0 – get random model object after two executions; n – same integer in two executions generates same model objects; otherwise different model objects
"classModels": // List of classification model parameters
[
"{\"modelName\"//model name,\"count\"//sample count}",
]
"regressionModels": // List of regression model parameters
[
"{\"modelName\"// model name,\"count\"// sample count }",
"isEscape" // Remove all quotation marks at data loading
"classCategoryCount"// Category count
"numberCategoryCount"// Segment count
"groupMaxCount"// Max record count in a group
"groupMinCount"// Min record count in a group
"accuracySetting" // Threshold value display configuration {
"accuracyMin" // Min threshold value
"accuracyMax" // Max threshold value
"accuracyCount" // Segment count
}
],
"srcFilePath" //mtx file path
"rowCount" // Number of rows
"colCount" // Number of columns
}
}
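As a rough illustration of the variable entries described above, the following Python sketch builds fieldList entries and checks the rule that only one variable may be the target. The helper names make_field and check_single_target and the variable names CustomerId, Churn and Plan are made up for this sketch and are not part of any product API. The values for isActionable and isSource follow the defaults stated above, the remaining values are placeholders, and requiring exactly one target (rather than at most one) is an assumption.

```python
# Hypothetical helpers for illustration only; not part of any product API.

def make_field(var_name, data_type=0, is_target=False):
    """Return one fieldList entry using the keys listed above."""
    return {
        "varName": var_name,
        "dataType": data_type,     # 0 lets the type be auto-checked
        "isTarget": is_target,
        "isActionable": True,      # documented default
        "isSource": True,          # documented default
        "importance": 0,           # placeholder; returned after model building
        "isComputeCol": False,     # placeholder value
        "isTypeDefined": False,    # placeholder value
        "isSourceDefined": False,  # placeholder value
    }


def check_single_target(field_list):
    """Return the target's name, or raise ValueError if there is not exactly one."""
    targets = [f["varName"] for f in field_list if f["isTarget"]]
    if len(targets) != 1:
        raise ValueError(f"expected exactly one target variable, found {targets}")
    return targets[0]


fields = [
    make_field("CustomerId", data_type=20),             # 20: ID
    make_field("Churn", data_type=1, is_target=True),   # 1: binary variable
    make_field("Plan", data_type=3),                    # 3: categorical variable
]
print(check_single_target(fields))
```

Running the check before writing the configuration catches a missing or duplicated target early, instead of at model building time.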
Example
We’ll illustrate how to use a JSON file to build a model using the following data.
Titanic passenger data:
12 fields (variables) for model building: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
The target variable is Survived, a binary variable that indicates whether a passenger survived (1 means survived; 0 means did not survive).
The JSON file for model building is as follows:
{
"modelType": 0,
"modelMetaData":{
"fieldList": [
{
"varName": "PassengerId",
"dataType":20, // PassengerId is an ID type variable
"isTarget":false,
"isActionable": true,
"isSource": true,
"importance": 0 ,
"isComputeCol": false,
"isTypeDefined": false,
"isSourceDefined": false
},{
"varName": "Survived",
"dataType": 1, //Survived is a binary variable
"isTarget": true, //Survived is the target variable
"isActionable": true,
"isSource": true,
"importance": 0,
"isComputeCol": false,
"isTypeDefined": false,
"isSourceDefined": false
}, {
"varName": "Pclass",
"dataType": 3, // Pclass is the categorical variable
"isTarget": false,
"isActionable": true,
"isSource": true,
"importance": 0 ,
"isComputeCol": false,
"isTypeDefined": false,
"isSourceDefined": false
}, {
"varName": "Age",
"dataType": 11, //count variable
"isTarget": false,
"isActionable": true,
"isSource": true,
"importance": 0,
"isComputeCol": false,
"isTypeDefined": false,
"isSourceDefined": false
},
… //Next variable
… //
],
"modelFile": "C:\\Program Files\\raqsoft\\ymodel\\documents\\csv\\train.pcf",
"dsType": 0,
"dsConfig": {
"srcFilePath":"C:\\ProgramFiles\\raqsoft\\ymodel\\documents\\csv\\train.csv",
"hasTitle": true, //
"fieldNames": ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"],
"fieldTypes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"useDisp": false,
"dispNumber": 0,
"charset": "GBK",
"separator": ",",
"isEscape": true,
"isTransQuota": false,
"checkValid": true,
"skipErrorRow": false,
"useTop": false,
"topNumber": 10000,
"useBlock": false,
"blockIndex": 1,
"blockCount": 1,
"dateFormat": "yyyy/MM/dd",
"timeFormat": "HH:mm:ss",
"dateTimeFormat": "yyyy/MM/dd HH:mm:ss",
"missingFormat": "NULL|N/A",
"language": "zh",
"country": "CN",
"variant": "",
"formats": ["", "", "", "", "", "", "", "", "", "", "", ""]
},
"needPrepare": true,
"parallelNumber": 1,
"isIntelligenceImpute": true,
"isResample": true,
"balanceParams": [1],
"advanceSelect": false,
"optimalParam": false,
"resampleMultiple": 150,
"testDataPercent": 0,
"ensembleMethod": "best_n",
"resampleBestN": 3,
"resampleNumber": 5,
"ensembleFunc": "np.mean",
"ensembleBestN": 0,
"dataBalances": [1.5325202941894531],
"chunkSize": 1000000,
"adjustProb": true,
"fixedSeed": true,
"randomState": 0,
"classModels":[
"{\"modelName\"//model name:\"TreeClassification\",\"count\"//sample count:1}",
"{\"modelName\":\"GBDTClassification\",\"count\":1}",
"{\"modelName\":\"RFClassification\",\"count\":1}",
"{\"modelName\":\"LogicClassification\",\"count\":1}",
"{\"modelName\":\"RidgeClassification\",\"count\":1}",
"{\"modelName\":\"FNNClassification\",\"count\":1}",
"{\"modelName\":\"XGBClassification\",\"count\":1}"
],
"regressionModels": //List of regression model parameters [
"{\"modelName\":\"TreeRegression\",\"count\":1}",
"{\"modelName\":\"GBDTRegression\",\"count\":1}",
"{\"modelName\":\"RFRegression\",\"count\":1}",
"{\"modelName\":\"LRegression\",\"count\":1}",
"{\"modelName\":\"LassoRegression\",\"count\":1}",
"{\"modelName\":\"ENRegression\",\"count\":1}",
"{\"modelName\":\"RidgeRegression\",\"count\":1}",
"{\"modelName\":\"FNNRegression\",\"count\":1}",
"{\"modelName\":\"XGBRegression\",\"count\":1}"
],
"isEscape": true,
"classCategoryCount": 15,
"numberCategoryCount": 24,
"groupMaxCount": 3000000,
"groupMinCount": 1000000,
"accuracySetting": {
"accuracyMin": 0.05,
"accuracyMax": 0.95,
"accuracyCount": 20
}
},
"srcFilePath": "train3.mtx", mtx
"rowCount": 623,
"colCount": 12
}
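The listing above annotates the configuration with // comments and elides some variables with “…”, neither of which standard JSON allows. Whether the model building tool accepts such comments is not stated here, so the following Python sketch assumes they must be removed before parsing. It also shows that each classModels entry is itself a JSON string that needs a second decoding step. The function name strip_line_comments and the sample path are hypothetical.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character untouched
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip the comment but leave the newline for the next iteration.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# A shortened, hypothetical configuration in the same style as the listing above.
sample = r'''
{
  "modelFile": "C:\\models\\train.pcf", // hypothetical path
  "classModels": [
    "{\"modelName\":\"TreeClassification\",\"count\":1}",
    "{\"modelName\":\"GBDTClassification\",\"count\":1}"
  ]
}
'''

config = json.loads(strip_line_comments(sample))
# Each list entry is a JSON document stored as a string, so decode it again.
models = [json.loads(entry) for entry in config["classModels"]]
print(config["modelFile"])
print([m["modelName"] for m in models])
```

The scanner tracks whether it is inside a string literal, so a value that happens to contain // (a URL, for example) is left intact. Lines that consist only of “…” still have to be replaced with real entries before the text can be parsed.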