Tree Depth - Platform For AI - Alibaba Cloud Documentation Center

The depth of a tree is the number of edges on the path from the root vertex to the farthest leaf vertex. In a decision tree model, tree depth is an important parameter that affects the complexity and fitting ability of the model: a deeper tree can capture more patterns in the data but may overfit, whereas a shallower tree may underfit. Therefore, you must select an appropriate tree depth to ensure the performance and generalization capability of the model. The Tree Depth component takes an edge table as input and writes the root and depth of each vertex to an output table, as shown in the example in this topic.

Configure the component

Method 1: Configure the component on the pipeline page

You can add the Tree Depth component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Edge Table: Start Vertex Column	The start vertex column in the edge table.
Fields Setting	Edge Table: End Vertex Column	The end vertex column in the edge table.
Tuning	Workers	The number of worker nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter.
Tuning	Memory Size per Worker	The maximum size of memory that a single worker can use. Unit: MB. Default value: 4096. If memory usage exceeds the value of this parameter, an `OutOfMemory` error is reported.
Tuning	Data Split Size (MB)	The data split size. Unit: MB. Default value: 64.

Method 2: Configure the component by using PAI commands

You can configure the Tree Depth component by using PAI commands. You can use the SQL Script component to run PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component in the "SQL Script" topic.

PAI -name TreeDepth
    -project algo_public
    -DinputEdgeTableName=TreeDepth_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=TreeDepth_func_test_result;

Parameter	Required	Default value	Description
inputEdgeTableName	Yes	No default value	The name of the input edge table.
inputEdgeTablePartitions	No	Full table	The partitions in the input edge table.
fromVertexCol	Yes	No default value	The start vertex column in the input edge table.
toVertexCol	Yes	No default value	The end vertex column in the input edge table.
outputTableName	Yes	No default value	The name of the output table.
outputTablePartitions	No	No default value	The partitions in the output table.
lifecycle	No	No default value	The lifecycle of the output table.
workerNum	No	Not specified	The number of worker nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter.
workerMem	No	4096	The maximum size of memory that a single worker can use. Unit: MB. If memory usage exceeds the value of this parameter, an `OutOfMemory` error is reported.
splitSize	No	64	The data split size. Unit: MB.
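
The parameter table above can be turned into a command string programmatically. The following is a minimal Python sketch, not part of PAI: the helper name `build_tree_depth_command` is hypothetical, and it only assembles text from the parameter names listed in the table. It does not submit anything to PAI.

```python
# Hypothetical helper: assembles a PAI TreeDepth command as plain text.
# Only parameter names from the table above are accepted.
OPTIONAL_PARAMS = {
    "inputEdgeTablePartitions", "outputTablePartitions",
    "lifecycle", "workerNum", "workerMem", "splitSize",
}


def build_tree_depth_command(input_edge_table, from_col, to_col,
                             output_table, **optional):
    """Return the command text for the required and optional parameters."""
    unknown = set(optional) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    # Required parameters first, then any optional ones the caller passed.
    params = {
        "inputEdgeTableName": input_edge_table,
        "fromVertexCol": from_col,
        "toVertexCol": to_col,
        "outputTableName": output_table,
    }
    params.update(optional)

    lines = ["PAI -name TreeDepth", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_tree_depth_command(
    "TreeDepth_func_test_edge", "flow_out_id", "flow_in_id",
    "TreeDepth_func_test_result",
    workerMem=4096, splitSize=64,  # the documented default values
)
print(command)
```

The resulting text can be pasted into the SQL Script editor. Rejecting unknown keyword arguments keeps typos in parameter names from silently reaching the command.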

Example

Add an SQL Script component to the canvas. Deselect the Use Script Mode and Whether the system adds a create table statement check boxes, and enter the following SQL statements in the SQL Script editor. The statements create the sample edge table and an empty result table.

drop table if exists TreeDepth_func_test_edge;
create table TreeDepth_func_test_edge as
select * from
(
    select '0' as flow_out_id, '1' as flow_in_id
    union all
    select '0' as flow_out_id, '2' as flow_in_id
    union all
    select '1' as flow_out_id, '3' as flow_in_id
    union all
    select '1' as flow_out_id, '4' as flow_in_id
    union all
    select '2' as flow_out_id, '4' as flow_in_id
    union all
    select '2' as flow_out_id, '5' as flow_in_id
    union all
    select '4' as flow_out_id, '6' as flow_in_id
    union all
    select 'a' as flow_out_id, 'b' as flow_in_id
    union all
    select 'a' as flow_out_id, 'c' as flow_in_id
    union all
    select 'c' as flow_out_id, 'd' as flow_in_id
    union all
    select 'c' as flow_out_id, 'e' as flow_in_id
)tmp;
drop table if exists TreeDepth_func_test_result;
create table TreeDepth_func_test_result
(
  node string,
  root string,
  depth bigint
);

Data structure

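Figure: graph structure of the sample edge table (image not available). The edges form two separate components, one rooted at vertex 0 and one rooted at vertex a.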
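
Because the figure is not reproduced here, the structure of the sample data can be inspected with a short Python sketch. The edge list is copied from the `TreeDepth_func_test_edge` statements above. The variable names are illustrative only and are not part of PAI.

```python
# Edges copied from TreeDepth_func_test_edge: (flow_out_id, flow_in_id).
edges = [
    ("0", "1"), ("0", "2"), ("1", "3"), ("1", "4"), ("2", "4"),
    ("2", "5"), ("4", "6"), ("a", "b"), ("a", "c"), ("c", "d"), ("c", "e"),
]

# Build child and parent adjacency lists from the edge table.
children = {}
parents = {}
for start, end in edges:
    children.setdefault(start, []).append(end)
    parents.setdefault(end, []).append(start)

vertices = sorted({v for edge in edges for v in edge})

# A root is a vertex that never appears as an end vertex.
roots = [v for v in vertices if v not in parents]

# Vertices reachable from more than one start vertex.
multi_parent = [v for v in vertices if len(parents.get(v, [])) > 1]

print("roots:", roots)                # roots: ['0', 'a']
print("children of 0:", children["0"])  # children of 0: ['1', '2']
print("more than one parent:", multi_parent)  # more than one parent: ['4']
```

Note that vertex 4 is the end vertex of two edges (from 1 and from 2), so the component rooted at 0 is not a strict tree. Both paths from 0 to 4 have the same length in this sample.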

Add a second SQL Script component to the canvas. Deselect the Use Script Mode and Whether the system adds a create table statement check boxes, enter the following PAI commands in the SQL Script editor, and then connect the first SQL Script component to the second one.
```
drop table if exists ${o1};
PAI -name TreeDepth
    -project algo_public
    -DinputEdgeTableName=TreeDepth_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=${o1};
```
Click the run icon in the upper-left corner of the canvas to run the pipeline.

After the run is complete, right-click the SQL Script component that contains the PAI commands and choose View Data > SQL Script Output to view the results.

| node | root | depth |
| ---- | ---- | ----- |
| a    | a    | 0     |
| b    | a    | 1     |
| c    | a    | 1     |
| d    | a    | 2     |
| e    | a    | 2     |
| 0    | 0    | 0     |
| 1    | 0    | 1     |
| 2    | 0    | 1     |
| 3    | 0    | 2     |
| 4    | 0    | 2     |
| 5    | 0    | 2     |
| 6    | 0    | 3     |
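
The result table can be cross-checked offline with a breadth-first traversal. The following Python sketch is an illustration only: it assumes that a root is any vertex with no incoming edge and that depth is the number of edges on the shortest path from that root. The source does not state which algorithm the component uses internally, and `tree_depths` is a hypothetical function name.

```python
from collections import deque

# Edges copied from TreeDepth_func_test_edge: (flow_out_id, flow_in_id).
edges = [
    ("0", "1"), ("0", "2"), ("1", "3"), ("1", "4"), ("2", "4"),
    ("2", "5"), ("4", "6"), ("a", "b"), ("a", "c"), ("c", "d"), ("c", "e"),
]


def tree_depths(edges):
    """Map each vertex to (root, depth) by breadth-first search from roots."""
    children = {}
    has_parent = set()
    for start, end in edges:
        children.setdefault(start, []).append(end)
        has_parent.add(end)

    # Assumption: roots are the vertices that never appear as an end vertex.
    vertices = {v for edge in edges for v in edge}
    roots = sorted(vertices - has_parent)

    result = {}
    queue = deque((root, root, 0) for root in roots)
    while queue:
        node, root, depth = queue.popleft()
        if node in result:
            continue  # already reached by a path that is no longer
        result[node] = (root, depth)
        for child in children.get(node, []):
            queue.append((child, root, depth + 1))
    return result


result = tree_depths(edges)
for node in sorted(result):
    root, depth = result[node]
    print(node, root, depth)
```

For the sample edges, this sketch yields the same node, root, and depth values as the table above, for example `6 0 3` and `e a 2`.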