A parameter server (PS) is used to process a large number of offline and online training jobs. Scalable Multiple Additive Regression Tree (SMART) is a PS-based implementation of the gradient boosting decision tree (GBDT) algorithm. The PS-SMART Multiclass Classification component of Platform for AI (PAI) supports training jobs on tens of billions of samples and hundreds of thousands of features, and can run these jobs on thousands of nodes. The component also supports multiple data formats and optimization technologies, such as histogram-based approximation.
Limits
The input data of the PS-SMART Multiclass Classification component must meet the following requirements:
Data in the destination columns must be of numeric data types. If the data type in the MaxCompute table is STRING, the data must be converted into a numeric data type. For example, if the classification object is a string, such as Good/Medium/Bad, you must convert the string into 0/1/2.
If the data is in the key-value format, feature IDs must be positive integers and feature values must be real numbers. If the feature IDs are of the STRING type, you must use the serialization component to serialize the data. If the feature values are categorical strings, you must perform feature engineering, such as feature discretization, to process the values.
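As an illustration of the requirements above, the following Python sketch maps string labels such as Good/Medium/Bad to the numeric indices 0/1/2 and encodes categorical features into the key-value format, in which feature IDs are positive integers, key-value pairs are separated by spaces, and keys and values are separated by colons. The column names and ID assignments are made up for this example and are not part of PAI:

```python
# Hypothetical example: label and feature ID mappings are illustrative only.

# Map string class labels to numeric indices, as required for the label column.
LABEL_MAP = {"Good": 0, "Medium": 1, "Bad": 2}

# Assign each (feature, category) pair a positive-integer feature ID,
# which is the effect of one-hot encoding categorical features.
FEATURE_IDS = {("color", "red"): 1, ("color", "blue"): 2, ("size", "large"): 3}

def to_kv_row(label, features):
    """Convert one sample into (numeric_label, key-value string)."""
    numeric_label = LABEL_MAP[label]
    # Each present category becomes "id:1"; pairs are space-separated,
    # and keys and values are separated by colons.
    kv = " ".join(f"{FEATURE_IDS[(name, value)]}:1"
                  for name, value in sorted(features.items()))
    return numeric_label, kv

label, kv = to_kv_row("Medium", {"color": "red", "size": "large"})
print(label, kv)  # 1 1:1 3:1
```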
Usage notes
When you use the PS-SMART Multiclass Classification component, take note of the following items:
The PS-SMART Multiclass Classification component supports training jobs with hundreds of thousands of features. However, such jobs are resource-intensive and time-consuming. GBDT algorithms such as PS-SMART are best suited to training on continuous features. For categorical features, you can perform one-hot encoding and filter out low-frequency features before training. We recommend that you do not perform feature discretization on continuous features of numeric data types.
The PS-SMART algorithm may introduce randomness. For example, randomness may be introduced in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, optimization of the PS-SMART algorithm by using histograms for approximation, and the merging of local sketches into a global sketch. When jobs run on multiple worker nodes in distributed mode, the resulting tree structures can vary, but the training effect of the model is theoretically the same. Therefore, you may obtain different results even if you use the same data and parameters during training.
If you want to accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts a training job only after all requested resources are available, so the waiting period increases with the amount of requested resources.
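The note on randomness above can be illustrated outside of PAI: any sampling step produces a different subset on each run unless a random seed is fixed, which is why identical data and parameters can still yield different trees. A minimal Python sketch (the function below only mimics data sampling and is not the PS-SMART implementation):

```python
import random

def sample(data, ratio, seed=None):
    """Sample a fraction of data, mimicking data_sample_ratio.

    The selected subset depends on the random seed, so unseeded runs
    generally differ from each other while seeded runs are reproducible.
    """
    rng = random.Random(seed)
    k = int(len(data) * ratio)
    return sorted(rng.sample(data, k))

data = list(range(100))
# With a fixed seed, two runs pick the same subset (reproducible).
assert sample(data, 0.5, seed=42) == sample(data, 0.5, seed=42)
# Without a seed, two runs draw independent subsets and usually differ.
run_a, run_b = sample(data, 0.5), sample(data, 0.5)
print(run_a == run_b)
```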
Configure the component
Method 1: Configure the component in the PAI console
Add the PS-SMART Multiclass Classification component on the pipeline page of Machine Learning Designer. Configure the following parameters:
Category | Parameter | Description |
Fields Setting | Use Sparse Format | Specify whether the input data is in the sparse format. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. |
Feature Columns | Select the feature columns for training from the input table. If the data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If the data in the input table is key-value pairs in the sparse format, and keys and values are of numeric data types, only columns of the STRING type are supported. | |
Label Column | The label column in the input table. Columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, in multiclass classification with n classes, column values can be {0,1,2,...,n-1}. | |
Weight Column | Select the column that contains the weight of each row of samples. Columns of numeric data types are supported. | |
Parameters Setting | Classes | The number of classes for multiclass classification. If you set the parameter to n, the values of the label column are {0,1,2,...,n-1}. |
Evaluation Indicator Type | You can set this parameter to Multiclass Negative Log Likelihood or Multiclass Classification Error. | |
Trees | The number of trees. The value must be a positive integer. Training duration increases in proportion to the number of trees. | |
Maximum Decision Tree Depth | The maximum depth of a decision tree. The default value is 5, which indicates that each tree can have at most 32 (2^5) leaf nodes. | |
Data Sampling Ratio | The data sampling ratio when trees are built. The sample data is used to build a weak learner to accelerate training. | |
Feature Sampling Fraction | The feature sampling ratio when trees are built. The sample features are used to build a weak learner to accelerate training. | |
L1 Penalty Coefficient | The L1 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
L2 Penalty Coefficient | The L2 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
Learning Rate | Enter the learning rate. Valid values: (0,1). | |
Sketch-based Approximate Precision | Enter the threshold for selecting quantiles when you build a sketch. A smaller value indicates that more bins can be obtained. In most cases, the default value 0.03 is used. | |
Minimum Split Loss Change | Enter the minimum loss change required for splitting a node. A larger value indicates a lower probability of node splitting. | |
Features | Enter the number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage. | |
Global Offset | Enter the initial prediction values of all samples. | |
Random Seed | Enter the random seed. The value of this parameter must be an integer. | |
Feature Importance Type | The type of feature importance. Valid values: gain (the information gain brought by the feature), weight (the number of times the feature is used to split nodes), and cover (the number of samples covered by splits that use the feature). | |
Tuning | Cores | The number of cores. By default, the system determines the value. |
Memory Size per Core (MB) | The memory size of each core. Unit: MB. In most cases, the system determines the memory size. |
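Two of the parameters above have simple quantitative interpretations: a binary tree of maximum depth d has at most 2^d leaf nodes (so the default depth of 5 allows up to 32 leaves), and the histogram approximation yields on the order of 1/sketchEps candidate bins per feature. The arithmetic can be sketched as:

```python
def max_leaves(max_depth):
    """A binary tree of depth d has at most 2**d leaf nodes."""
    return 2 ** max_depth

def approx_bins(sketch_eps):
    """The sketch-based histogram approximation produces O(1 / sketch_eps) bins."""
    return int(1 / sketch_eps)

print(max_leaves(5))      # 32 leaves at the default maximum depth
print(approx_bins(0.03))  # roughly 33 bins at the default precision
```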
Method 2: Configure the component by using PAI commands
Use PAI commands to configure the PS-SMART Multiclass Classification component. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
--Training
PAI -name ps_smart
-project algo_public
-DinputTableName="smart_multiclass_input"
-DclassNum="3"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545859_2"
-DoutputImportanceTableName="pai_temp_24515_545859_3"
-DlabelColName="label"
-DfeatureColNames="features"
-DenableSparse="true"
-Dobjective="multi:softprob"
-Dmetric="mlogloss"
-DfeatureImportanceType="gain"
-DtreeCount="5"
-DmaxDepth="5"
-Dshrinkage="0.3"
-Dl2="1.0"
-Dl1="0"
-Dlifecycle="3"
-DsketchEps="0.03"
-DsampleRatio="1.0"
-DfeatureRatio="1.0"
-DbaseScore="0.5"
-DminSplitLoss="0"
--Prediction
PAI -name prediction
-project algo_public
-DinputTableName="smart_multiclass_input"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545860_1"
-DfeatureColNames="features"
-DappendColNames="label,features"
-DenableSparse="true"
-DkvDelimiter=":"
-Dlifecycle="28"
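The flags in the commands above follow a uniform pattern: `-name` and `-project` identify the algorithm, and each `-Dkey="value"` pair sets one parameter. As a convenience for generating the SQL Script body programmatically, the following hypothetical Python helper (not part of PAI or PyODPS) assembles such a command string from a dict:

```python
def build_pai_command(name, project, params):
    """Assemble a PAI command string of the form used above.

    `name` and `project` map to the -name and -project flags; each entry
    of `params` becomes a -Dkey="value" flag on its own line. This is
    purely illustrative string building, not a PAI API.
    """
    lines = [f"PAI -name {name}", f"-project {project}"]
    lines += [f'-D{key}="{value}"' for key, value in params.items()]
    return "\n".join(lines)

cmd = build_pai_command(
    "ps_smart",
    "algo_public",
    {"inputTableName": "smart_multiclass_input", "labelColName": "label"},
)
print(cmd)
```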
Module | Parameter | Required | Default value | Description |
Data parameters | featureColNames | Yes | N/A | The feature columns that are selected from the input table for training. If data in the input table is in the dense format, only the columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse data in the key-value format, and keys and values are of numeric data types, only columns of the STRING data type are supported. |
labelColName | Yes | N/A | The label column in the input table. Columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be {0,1,2,…,n-1} in multiclass classification. n indicates the number of classes. | |
weightCol | No | N/A | Select the column that contains the weight of each row of samples. Columns of numeric data types are supported. | |
enableSparse | No | false | Specify whether the input data is in the sparse format. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. | |
inputTableName | Yes | N/A | The name of the input table. | |
modelName | Yes | N/A | The name of the output model. | |
outputImportanceTableName | No | N/A | The name of the table that contains feature importance. | |
inputTablePartitions | No | N/A | The partitions that are selected from the input table for training. Format: ds=1/pt=1. | |
outputTableName | No | N/A | The generated MaxCompute table. The table content is stored in a binary format that cannot be read directly and can be parsed only by the PS-SMART prediction component. | |
lifecycle | No | 3 | The lifecycle of the output table. | |
Algorithm parameters | classNum | Yes | N/A | The number of classes for multiclass classification. If you set this parameter to n, the values of the label column are {0,1,2,...,n-1}. |
objective | Yes | N/A | The type of the objective function. If you use multiclass classification for training, specify the multi:softprob objective function. | |
metric | No | N/A | The evaluation metric type of the training set, which is written to stdout of the coordinator in the LogView. Valid values: mlogloss (multiclass negative log likelihood) and merror (multiclass classification error). | |
treeCount | No | 1 | The number of trees. Training time increases in proportion to the number of trees. | |
maxDepth | No | 5 | The maximum depth of a tree. Valid values: 1 to 20. | |
sampleRatio | No | 1.0 | The data sampling ratio that is used when trees are built. Valid values: (0,1]. If you set this parameter to 1.0, no sampling is performed and all data is used. | |
featureRatio | No | 1.0 | The feature sampling ratio that is used when trees are built. Valid values: (0,1]. If you set this parameter to 1.0, no sampling is performed and all features are used. | |
l1 | No | 0 | The L1 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
l2 | No | 1.0 | The L2 penalty coefficient, which controls the size of leaf node outputs. A larger value indicates a more even distribution of leaf node values. If overfitting occurs, increase the parameter value. | |
shrinkage | No | 0.3 | The learning rate. Valid values: (0,1). | |
sketchEps | No | 0.03 | The threshold for selecting quantiles when you build a sketch. The number of bins is O(1.0/sketchEps). A smaller value indicates that more bins can be obtained. In most cases, the default value is used. Valid values: (0,1). | |
minSplitLoss | No | 0 | The minimum loss change required for splitting a node. A larger value indicates a lower probability of node splitting. | |
featureNum | No | N/A | The number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage. | |
baseScore | No | 0.5 | The initial prediction values of all samples. | |
randSeed | No | N/A | The random seed. The value of this parameter must be an integer. | |
featureImportanceType | No | gain | The type of feature importance. Valid values: gain (the information gain brought by the feature), weight (the number of times the feature is used to split nodes), and cover (the number of samples covered by splits that use the feature). | |
Tuning parameters | coreNum | No | Automatically allocated | The number of cores used in computing. The speed of the computing algorithm increases with the value of this parameter. |
memSizePerCore | No | Automatically allocated | The memory size of each core. Unit: MB. |
PS-SMART model deployment
If you want to deploy the model generated by the PS-SMART Multiclass Classification component to Elastic Algorithm Service (EAS) as an online service, you must add the Model export component as a downstream node of the PS-SMART Multiclass Classification component and configure it. For more information, see Model export.
After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Deploy a model service in the PAI console.