A parameter server (PS) is used to process a large number of offline and online training jobs. Scalable Multiple Additive Regression Tree (SMART) is a PS-based implementation of the gradient boosting decision tree (GBDT) algorithm. The PS-SMART Multiclass Classification component of Platform for AI (PAI) supports training jobs on tens of billions of samples and hundreds of thousands of features, and can run these jobs on thousands of nodes. The component also supports multiple data formats and optimization technologies, such as histogram-based approximation.
Limits
The input data of the PS-SMART Multiclass Classification component must meet the following requirements:
Data in the destination columns must be of numeric data types. If the data type in the MaxCompute table is STRING, the data must be converted into a numeric data type. For example, if the classification object is a string, such as Good/Medium/Bad, you must convert the string into 0/1/2.
If the data is in the key-value format, feature IDs must be positive integers and feature values must be real numbers. If the feature IDs are of the STRING type, you must use the serialization component to serialize the data. If the feature values are categorical strings, you must perform feature engineering, such as feature discretization, to process the values.
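As an illustration of the requirements above, the following Python sketch maps string labels such as Good/Medium/Bad to the numeric indices 0/1/2 and encodes categorical features into the key-value format, in which feature IDs are positive integers, key-value pairs are separated by spaces, and keys and values are separated by colons. The column names and ID assignments are made up for this example and are not part of PAI:

```python
# Hypothetical example: label and feature ID mappings are illustrative only.

# Map string class labels to numeric indices, as required for the label column.
LABEL_MAP = {"Good": 0, "Medium": 1, "Bad": 2}

# Assign each (feature, category) pair a positive-integer feature ID,
# which is the effect of one-hot encoding categorical features.
FEATURE_IDS = {("color", "red"): 1, ("color", "blue"): 2, ("size", "large"): 3}

def to_kv_row(label, features):
    """Convert one sample into (numeric_label, key-value string)."""
    numeric_label = LABEL_MAP[label]
    # Each present category becomes "id:1"; pairs are space-separated,
    # and keys and values are separated by colons.
    kv = " ".join(f"{FEATURE_IDS[(name, value)]}:1"
                  for name, value in sorted(features.items()))
    return numeric_label, kv

label, kv = to_kv_row("Medium", {"color": "red", "size": "large"})
print(label, kv)  # 1 1:1 3:1
```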
Usage notes
When you use the PS-SMART Multiclass Classification component, take note of the following items:
The PS-SMART Multiclass Classification component supports training jobs with hundreds of thousands of features. However, such jobs are resource-intensive and time-consuming. GBDT algorithms such as PS-SMART are best suited to training on continuous features. For categorical features, you can perform one-hot encoding and filter out low-frequency features before training. We recommend that you do not perform feature discretization on continuous features of numeric data types.
The PS-SMART algorithm may introduce randomness. For example, randomness may be introduced in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, optimization of the PS-SMART algorithm by using histograms for approximation, and the merging of local sketches into a global sketch. When jobs run on multiple worker nodes in distributed mode, the resulting tree structures can vary, but the training effect of the model is theoretically the same. Therefore, you may obtain different results even if you use the same data and parameters during training.
If you want to accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts a training job only after all requested resources are available, so the waiting period increases with the amount of requested resources.
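The note on randomness above can be illustrated outside of PAI: any sampling step produces a different subset on each run unless a random seed is fixed, which is why identical data and parameters can still yield different trees. A minimal Python sketch (the function below only mimics data sampling and is not the PS-SMART implementation):

```python
import random

def sample(data, ratio, seed=None):
    """Sample a fraction of data, mimicking data_sample_ratio.

    The selected subset depends on the random seed, so unseeded runs
    generally differ from each other while seeded runs are reproducible.
    """
    rng = random.Random(seed)
    k = int(len(data) * ratio)
    return sorted(rng.sample(data, k))

data = list(range(100))
# With a fixed seed, two runs pick the same subset (reproducible).
assert sample(data, 0.5, seed=42) == sample(data, 0.5, seed=42)
# Without a seed, two runs draw independent subsets and usually differ.
run_a, run_b = sample(data, 0.5), sample(data, 0.5)
print(run_a == run_b)
```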
Configure the component
Method 1: Configure the component in the PAI console
Add the PS-SMART Multiclass Classification component on the pipeline page of Machine Learning Designer. Configure the following parameters:
Category | Parameter | Description |
Fields Setting | Use Sparse Format | Specify whether the input data is in the sparse format. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. |
Feature Columns | Select the feature columns for training from the input table. If the data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If the data in the input table is key-value pairs in the sparse format, and keys and values are of numeric data types, only columns of the STRING type are supported. | |
Label Column | The label column in the input table. Columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, in multiclass classification with n classes, column values can be {0,1,2,...,n-1}. | |
Weight Column | Select the column that contains the weight of each row of samples. Columns of numeric data types are supported. | |
Parameters Setting | Classes | The number of classes for multiclass classification. If you set the parameter to n, the values of the label column are {0,1,2,...,n-1}. |
Evaluation Indicator Type | You can set this parameter to Multiclass Negative Log Likelihood or Multiclass Classification Error. | |
Trees | The number of trees. The value must be a positive integer. Training duration increases in proportion to the number of trees. | |
Maximum Decision Tree Depth | The maximum depth of a decision tree. The default value is 5, which indicates that each tree can have at most 32 (2^5) leaf nodes. | |
Data Sampling Ratio | The data sampling ratio when trees are built. The sample data is used to build a weak learner to accelerate training. | |
Feature Sampling Fraction | The feature sampling ratio when trees are built. The sample features are used to build a weak learner to accelerate training. | |
L1 Penalty Coefficient | The L1 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
L2 Penalty Coefficient | The L2 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
Learning Rate | Enter the learning rate. Valid values: (0,1). | |
Sketch-based Approximate Precision | Enter the threshold for selecting quantiles when you build a sketch. A smaller value indicates that more bins can be obtained. In most cases, the default value 0.03 is used. | |
Minimum Split Loss Change | Enter the minimum loss change required for splitting a node. A larger value indicates a lower probability of node splitting. | |
Features | Enter the number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage. | |
Global Offset | Enter the initial prediction values of all samples. | |
Random Seed | Enter the random seed. The value of this parameter must be an integer. | |
Feature Importance Type | The type of feature importance. Valid values: gain (the information gain brought by the feature), weight (the number of times the feature is used to split nodes), and cover (the number of samples covered by splits that use the feature). | |
Tuning | Cores | The number of cores. By default, the system determines the value. |
Memory Size per Core (MB) | The memory size of each core. Unit: MB. In most cases, the system determines the memory size. |
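Two of the parameters above have simple quantitative interpretations: a binary tree of maximum depth d has at most 2^d leaf nodes (so the default depth of 5 allows up to 32 leaves), and the histogram approximation yields on the order of 1/sketchEps candidate bins per feature. The arithmetic can be sketched as:

```python
def max_leaves(max_depth):
    """A binary tree of depth d has at most 2**d leaf nodes."""
    return 2 ** max_depth

def approx_bins(sketch_eps):
    """The sketch-based histogram approximation produces O(1 / sketch_eps) bins."""
    return int(1 / sketch_eps)

print(max_leaves(5))      # 32 leaves at the default maximum depth
print(approx_bins(0.03))  # roughly 33 bins at the default precision
```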
Method 2: Configure the component by using PAI commands
Use PAI commands to configure the PS-SMART Multiclass Classification component. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
--Training
PAI -name ps_smart
-project algo_public
-DinputTableName="smart_multiclass_input"
-DclassNum="3"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545859_2"
-DoutputImportanceTableName="pai_temp_24515_545859_3"
-DlabelColName="label"
-DfeatureColNames="features"
-DenableSparse="true"
-Dobjective="multi:softprob"
-Dmetric="mlogloss"
-DfeatureImportanceType="gain"
-DtreeCount="5"
-DmaxDepth="5"
-Dshrinkage="0.3"
-Dl2="1.0"
-Dl1="0"
-Dlifecycle="3"
-DsketchEps="0.03"
-DsampleRatio="1.0"
-DfeatureRatio="1.0"
-DbaseScore="0.5"
-DminSplitLoss="0"
--Prediction
PAI -name prediction
-project algo_public
-DinputTableName="smart_multiclass_input"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545860_1"
-DfeatureColNames="features"
-DappendColNames="label,features"
-DenableSparse="true"
-DkvDelimiter=":"
-Dlifecycle="28"
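The flags in the commands above follow a uniform pattern: `-name` and `-project` identify the algorithm, and each `-Dkey="value"` pair sets one parameter. As a convenience for generating the SQL Script body programmatically, the following hypothetical Python helper (not part of PAI or PyODPS) assembles such a command string from a dict:

```python
def build_pai_command(name, project, params):
    """Assemble a PAI command string of the form used above.

    `name` and `project` map to the -name and -project flags; each entry
    of `params` becomes a -Dkey="value" flag on its own line. This is
    purely illustrative string building, not a PAI API.
    """
    lines = [f"PAI -name {name}", f"-project {project}"]
    lines += [f'-D{key}="{value}"' for key, value in params.items()]
    return "\n".join(lines)

cmd = build_pai_command(
    "ps_smart",
    "algo_public",
    {"inputTableName": "smart_multiclass_input", "labelColName": "label"},
)
print(cmd)
```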
Module | Parameter | Required | Default value | Description |
Data parameters | featureColNames | Yes | N/A | The feature columns that are selected from the input table for training. If data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse data in the key-value format, and keys and values are of numeric data types, only columns of the STRING data type are supported. |
labelColName | Yes | N/A | The label column in the input table. Columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be {0,1,2,…,n-1} in multiclass classification. n indicates the number of classes. | |
weightCol | No | N/A | Select the column that contains the weight of each row of samples. Columns of numeric data types are supported. | |
enableSparse | No | false | Specify whether the input data is in the sparse format. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. | |
inputTableName | Yes | N/A | The name of the input table. | |
modelName | Yes | N/A | The name of the output model. | |
outputImportanceTableName | No | N/A | The name of the table that contains feature importance. | |
inputTablePartitions | No | N/A | The partitions that are selected from the input table for training. Format: ds=1/pt=1. | |
outputTableName | No | N/A | The generated MaxCompute table. The table content is stored in a binary format that cannot be read directly and can be parsed only by the PS-SMART prediction component. | |
lifecycle | No | 3 | The lifecycle of the output table. | |
Algorithm parameters | classNum | Yes | N/A | The number of classes for multiclass classification. If you set this parameter to n, the values of the label column are {0,1,2,...,n-1}. |
objective | Yes | N/A | The type of the objective function. If you use multiclass classification for training, specify the multi:softprob objective function. | |
metric | No | N/A | The evaluation metric type of the training set, which is written to stdout of the coordinator in the LogView. Valid values: mlogloss (multiclass negative log likelihood) and merror (multiclass classification error). | |
treeCount | No | 1 | The number of trees. Training time increases in proportion to the number of trees. | |
maxDepth | No | 5 | The maximum depth of a tree. Valid values: 1 to 20. | |
sampleRatio | No | 1.0 | The data sampling ratio that is used when trees are built. Valid values: (0,1]. If you set this parameter to 1.0, no sampling is performed and all data is used. | |
featureRatio | No | 1.0 | The feature sampling ratio that is used when trees are built. Valid values: (0,1]. If you set this parameter to 1.0, no sampling is performed and all features are used. | |
l1 | No | 0 | The L1 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
l2 | No | 1.0 | The L2 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
shrinkage | No | 0.3 | The learning rate. Valid values: (0,1). | |
sketchEps | No | 0.03 | The threshold for selecting quantiles when you build a sketch. The number of bins is O(1.0/sketchEps). A smaller value indicates that more bins can be obtained. In most cases, the default value is used. Valid values: (0,1). | |
minSplitLoss | No | 0 | The minimum loss change required for splitting a node. A larger value indicates a lower probability of node splitting. | |
featureNum | No | N/A | The number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage. | |
baseScore | No | 0.5 | The initial prediction values of all samples. | |
randSeed | No | N/A | The random seed. The value of this parameter must be an integer. | |
featureImportanceType | No | gain | The type of feature importance. Valid values: gain (the information gain brought by the feature), weight (the number of times the feature is used to split nodes), and cover (the number of samples covered by splits that use the feature). | |
Tuning parameters | coreNum | No | Automatically allocated | The number of cores used in computing. The speed of the computing algorithm increases with the value of this parameter. |
memSizePerCore | No | Automatically allocated | The memory size of each core. Unit: MB. |
PS-SMART model deployment
If you want to deploy the model generated by the PS-SMART Multiclass Classification component to Elastic Algorithm Service (EAS) as an online service, you must add the Model export component as a downstream node of the PS-SMART Multiclass Classification component and configure it. For more information, see Model export.
After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Deploy a model service in the PAI console.