Platform For AI: Discrete Feature Analysis

Last Updated: Dec 30, 2024

Discrete feature analysis examines features that take a limited number of distinct values. The component reports the distribution of each discrete feature, calculates its Gini index and entropy, and evaluates feature importance by using Gini gain, information gain, and information gain ratio. These metrics help you identify the features that most affect model performance.
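
For reference, the metrics named above can be written out as follows. The notation is an assumption introduced here and is not taken from the component itself: n is the number of rows, n_v is the number of rows in which feature X equals v, and n_{v,c} is the number of those rows whose label is c. These are the standard decision-tree definitions with base-2 logarithms. They are consistent with the sample output in the Example section, but the component's actual implementation may differ in details.

```latex
\begin{aligned}
p_v &= \frac{n_v}{n}, \qquad
p_{c\mid v} = \frac{n_{v,c}}{n_v}, \qquad
p_c = \frac{1}{n}\sum_v n_{v,c} \\
\mathrm{gini}(v) &= p_v \Bigl(1 - \sum_c p_{c\mid v}^2\Bigr), \qquad
\mathrm{entropy}(v) = -\,p_v \sum_c p_{c\mid v} \log_2 p_{c\mid v} \\
\mathrm{gini}(X) &= \sum_v \mathrm{gini}(v), \qquad
\mathrm{entropy}(X) = \sum_v \mathrm{entropy}(v) \\
\mathrm{ginigain}(X) &= \Bigl(1 - \sum_c p_c^2\Bigr) - \mathrm{gini}(X) \\
\mathrm{infogain}(X) &= -\sum_c p_c \log_2 p_c - \mathrm{entropy}(X) \\
\mathrm{infogainratio}(X) &= \frac{\mathrm{infogain}(X)}{-\sum_v p_v \log_2 p_v}
\end{aligned}
```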

Configure the component

You can use one of the following methods to configure the Discrete Feature Analysis component.

Method 1: Configure the component on the pipeline page

On the pipeline details page in Machine Learning Designer, add the Discrete Feature Analysis component to the pipeline and configure the parameters described in the following table.

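+-----------------+------------------------------------------------------------------------------+
| Parameter       | Description                                                                  |
+-----------------+------------------------------------------------------------------------------+
| Feature Columns | The columns that represent the features of the training samples.             |
| Label Column    | The label column.                                                            |
| Sparse Matrix   | For sparse-format input data, features must be in the key-value pair format. |
+-----------------+------------------------------------------------------------------------------+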
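
To make the key-value pair format concrete, the following Python sketch splits one sparse cell into its keys and values. It is illustrative only: the function name `parse_kv_features` and the sample cell are hypothetical, and the defaults simply mirror the documented default values of the kvDelimiter (:) and itemDelimiter (,) parameters of the PAI command. How the component itself parses sparse input is not described in this topic.

```python
def parse_kv_features(cell, kv_delimiter=":", item_delimiter=","):
    """Split one sparse cell such as "1:red,3:large" into {key: value}.

    Hypothetical helper; the defaults mirror the documented default
    values of the kvDelimiter and itemDelimiter parameters.
    """
    if cell is None or cell == "":
        return {}
    features = {}
    for item in cell.split(item_delimiter):
        # Split on the first delimiter only, so values may contain it.
        key, value = item.split(kv_delimiter, 1)
        features[key.strip()] = value.strip()
    return features


print(parse_kv_features("1:red,3:large"))
```

Keys that are absent from a cell are simply missing from the resulting dictionary, which is what makes the format sparse.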

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.

PAI
-name enum_feature_selection
-project algo_public
-DinputTableName=enumfeatureselection_input
-DlabelColName=label
-DfeatureColNames=col0,col1
-DenableSparse=false
-DoutputCntTableName=enumfeatureselection_output_cntTable
-DoutputValueTableName=enumfeatureselection_output_valuetable
-DoutputEnumValueTableName=enumfeatureselection_output_enumvaluetable;

+--------------------------+----------+--------------------------+-----------------------------------------------------------------------------------------+
| Parameter                | Required | Default value            | Description                                                                             |
+--------------------------+----------+--------------------------+-----------------------------------------------------------------------------------------+
| inputTableName           | Yes      | No default value         | The name of the input table.                                                            |
| inputTablePartitions     | No       | Full table               | The partitions that are selected from the input table for training. Supported formats:  |
|                          |          |                          |   • Partition_name=value                                                                |
|                          |          |                          |   • name1=value1/name2=value2: multi-level partitions                                   |
|                          |          |                          | Note: If you specify multiple partitions, separate them with commas (,).                |
| featureColNames          | No       | No default value         | The feature columns that are selected from the input table for training.                |
| labelColName             | No       | No default value         | The name of the label column in the input table.                                        |
| enableSparse             | No       | false                    | Specifies whether the input data is in the sparse format. Valid values: true and false. |
| kvFeatureColNames        | No       | Full table               | The names of the feature columns that are in the key-value pair format.                 |
| kvDelimiter              | No       | :                        | The delimiter between a key and its value when the input data is in the sparse format.  |
| itemDelimiter            | No       | ,                        | The delimiter between key-value pairs when the input data is in the sparse format.      |
| outputCntTableName       | No       | N/A                      | The output distribution table that contains the enumerated values of discrete features. |
| outputValueTableName     | No       | N/A                      | The output table that contains the Gini and entropy values of the discrete features.    |
| outputEnumValueTableName | No       | N/A                      | The output table that contains the Gini and entropy of each enumerated feature value.   |
| lifecycle                | No       | No default value         | The lifecycle of the table.                                                             |
| coreNum                  | No       | Determined by the system | The number of cores that are used in computing. The value must be a positive integer.   |
| memSizePerCore           | No       | Determined by the system | The memory size of each core. Valid values: 1 to 65536. Unit: MB.                       |
+--------------------------+----------+--------------------------+-----------------------------------------------------------------------------------------+
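
If you generate PAI commands from a script, a small helper can assemble the parameters and check the documented value ranges before the command is submitted. The sketch below is one possible approach, not part of PAI: `build_pai_command` is a hypothetical function, and it only renders a command string from the parameter names listed in the preceding table.

```python
def build_pai_command(input_table, feature_cols, label_col, **options):
    """Render an enum_feature_selection command as text (hypothetical helper)."""
    # Range checks taken from the parameter descriptions.
    mem = options.get("memSizePerCore")
    if mem is not None and not 1 <= int(mem) <= 65536:
        raise ValueError("memSizePerCore must be between 1 and 65536 (MB)")
    cores = options.get("coreNum")
    if cores is not None and int(cores) <= 0:
        raise ValueError("coreNum must be a positive integer")

    params = {
        "inputTableName": input_table,
        "labelColName": label_col,
        "featureColNames": ",".join(feature_cols),
    }
    params.update(options)  # for example outputCntTableName, lifecycle

    lines = ["PAI", "-name enum_feature_selection", "-project algo_public"]
    lines += [f'-D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


print(build_pai_command(
    "enum_feature_selection_test_input",
    ["col_double", "col_string"],
    "col_bigint",
    outputCntTableName="enum_feature_selection_test_input_cnt_output",
    lifecycle=28,
))
```

The helper performs no check that the tables exist; it only formats text that you can paste into the SQL Script component.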

Example

Execute the following SQL statements to generate input data:

drop table if exists enum_feature_selection_test_input;
create table enum_feature_selection_test_input
as
select
    *
from
(
    select '00' as col_string, 1 as col_bigint, 0.0 as col_double from dual
    union all
    select cast(null as string) as col_string, 0 as col_bigint, 0.0 as col_double from dual
    union all
    select '01' as col_string, 0 as col_bigint, 1.0 as col_double from dual
    union all
    select '01' as col_string, 1 as col_bigint, cast(null as double) as col_double from dual
    union all
    select '01' as col_string, 1 as col_bigint, 1.0 as col_double from dual
    union all
    select '00' as col_string, 0 as col_bigint, 0.0 as col_double from dual
) tmp;

Input data:

+------------+------------+------------+
| col_string | col_bigint | col_double |
+------------+------------+------------+
| 01         | 1          | 1.0        |
| 01         | 0          | 1.0        |
| 01         | 1          | NULL       |
| NULL       | 0          | 0.0        |
| 00         | 1          | 0.0        |
| 00         | 0          | 0.0        |
+------------+------------+------------+
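
Before running the component, it can help to see what the distribution table is expected to contain. The following Python sketch tallies (feature value, label value) pairs over the six sample rows, with col_bigint as the label. The `count_pairs` function is a hypothetical stand-in written for this example and is not the component's implementation. `None` represents NULL and is counted as a value of its own, which is how NULL appears in enum_feature_selection_test_input_cnt_output.

```python
from collections import Counter

columns = ("col_string", "col_bigint", "col_double")
# The six rows created by the SQL statements above; None stands for NULL.
rows = [
    ("00", 1, 0.0),
    (None, 0, 0.0),
    ("01", 0, 1.0),
    ("01", 1, None),
    ("01", 1, 1.0),
    ("00", 0, 0.0),
]


def count_pairs(rows, columns, feature, label):
    """Tally (feature value, label value) pairs for one feature column."""
    f, l = columns.index(feature), columns.index(label)
    return Counter((row[f], row[l]) for row in rows)


for feature in ("col_double", "col_string"):
    pairs = count_pairs(rows, columns, feature, "col_bigint")
    for (value, label), cnt in pairs.items():
        print(feature, value, label, cnt)
```

Each printed line corresponds to one (colname, colvalue, labelvalue, cnt) row of the count table, although the row order and the formatting of the double values may differ.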
  • PAI command

    • Command

      drop table if exists enum_feature_selection_test_input_enum_value_output;
      drop table if exists enum_feature_selection_test_input_cnt_output;
      drop table if exists enum_feature_selection_test_input_value_output;
      PAI -name enum_feature_selection
          -project algo_public
          -DinputTableName="enum_feature_selection_test_input"
          -DfeatureColNames="col_double,col_string"
          -DlabelColName="col_bigint"
          -DenableSparse="false"
          -DkvDelimiter=","
          -DitemDelimiter=":"
          -DoutputCntTableName="enum_feature_selection_test_input_cnt_output"
          -DoutputValueTableName="enum_feature_selection_test_input_value_output"
          -DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output"
          -Dlifecycle="28";
    • Command output

      • enum_feature_selection_test_input_cnt_output

        +------------+------------+------------+------------+
        | colname    | colvalue   | labelvalue | cnt        |
        +------------+------------+------------+------------+
        | col_double | NULL       | 1          | 1          |
        | col_double | 0          | 0          | 2          |
        | col_double | 0          | 1          | 1          |
        | col_double | 1          | 0          | 1          |
        | col_double | 1          | 1          | 1          |
        | col_string | NULL       | 0          | 1          |
        | col_string | 00         | 0          | 1          |
        | col_string | 00         | 1          | 1          |
        | col_string | 01         | 0          | 1          |
        | col_string | 01         | 1          | 2          |
        +------------+------------+------------+------------+
      • enum_feature_selection_test_input_value_output

        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
        | colname    | gini                | entropy           | infogain            | ginigain            | infogainratio       |
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
        | col_double | 0.3888888888888889  | 0.792481250360578 | 0.20751874963942196 | 0.1111111111111111  | 0.14221913160264427 |
        | col_string | 0.38888888888888884 | 0.792481250360578 | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
        +------------+---------------------+-------------------+---------------------+---------------------+---------------------+
      • enum_feature_selection_test_input_enum_value_output

        +------------+------------+---------------------+--------------------+
        | colname    | colvalue   | gini                | entropy            |
        +------------+------------+---------------------+--------------------+
        | col_double | NULL       | 0.0                 | 0.0                |
        | col_double | 0          | 0.22222222222222224 | 0.4591479170272448 |
        | col_double | 1          | 0.16666666666666666 | 0.3333333333333333 |
        | col_string | NULL       | 0.0                 | 0.0                |
        | col_string | 00         | 0.16666666666666666 | 0.3333333333333333 |
        | col_string | 01         | 0.2222222222222222  | 0.4591479170272448 |
        +------------+------------+---------------------+--------------------+
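
The numbers in the two metric tables can be reproduced from the count table alone. The Python sketch below does this for col_double. It rests on assumptions that are not stated in this topic but agree with the sample output: logarithms are base 2, each value's Gini index and entropy are weighted by the share of rows that hold that value, the per-feature gini and entropy are the sums of those weighted values, and the information gain ratio divides the information gain by the entropy of the feature's own value distribution. The function names are hypothetical.

```python
from math import log2

# (feature value, label) -> row count, copied from the col_double rows of
# enum_feature_selection_test_input_cnt_output. None stands for NULL.
counts = {(None, 1): 1, (0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 1}


def gini(probs):
    """Gini impurity of a probability distribution."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


def analyze(counts):
    """Return (per-value metrics, per-feature metrics) for one feature."""
    n = sum(counts.values())
    by_value, by_label = {}, {}
    for (value, label), cnt in counts.items():
        by_value.setdefault(value, {})[label] = cnt
        by_label[label] = by_label.get(label, 0) + cnt

    # Each value's impurity is weighted by the share of rows holding it.
    per_value = {}
    for value, labels in by_value.items():
        n_v = sum(labels.values())
        probs = [c / n_v for c in labels.values()]
        per_value[value] = (n_v / n * gini(probs), n_v / n * entropy(probs))

    feature_gini = sum(g for g, _ in per_value.values())
    feature_entropy = sum(e for _, e in per_value.values())
    label_probs = [c / n for c in by_label.values()]
    info_gain = entropy(label_probs) - feature_entropy
    # Split information: entropy of the feature's own value distribution.
    split_info = entropy([sum(l.values()) / n for l in by_value.values()])
    summary = {
        "gini": feature_gini,
        "entropy": feature_entropy,
        "infogain": info_gain,
        "ginigain": gini(label_probs) - feature_gini,
        # Assumption: report 0.0 when the feature has a single value.
        "infogainratio": info_gain / split_info if split_info else 0.0,
    }
    return per_value, summary


per_value, summary = analyze(counts)
for value, (g, e) in per_value.items():
    print(value, g, e)
print(summary)
```

The results agree with the col_double rows of both tables up to floating-point rounding. Replacing `counts` with the col_string rows of the count table gives the col_string figures in the same way.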