You can use Alibaba Cloud Data Online Migration or the ossimport tool to migrate data from Amazon Redshift to AnalyticDB for PostgreSQL.
Prerequisites
An Amazon Redshift cluster is created.
Amazon S3 that is used to export Amazon Redshift data is activated.
Alibaba Cloud Object Storage Service (OSS) is activated. For more information, see What is OSS?
An OSS bucket is created. For more information, see Create a bucket.
Note: To facilitate subsequent data import, we recommend that you create the OSS bucket in the same region as the AnalyticDB for PostgreSQL instance.
An AnalyticDB for PostgreSQL instance is created. For information about how to select instance specifications, see the "Specification selection" section of this topic.
Specification selection
This section describes how to select specifications for an AnalyticDB for PostgreSQL instance based on the specifications of the source Amazon Redshift cluster.
An Amazon Redshift cluster consists of a leader node and multiple compute nodes.
Leader node: corresponds to the coordinator node of an AnalyticDB for PostgreSQL instance. The leader node communicates with clients, parses queries, develops query execution plans, and coordinates database operations.
Compute node: corresponds to the compute nodes of an AnalyticDB for PostgreSQL instance. Each compute node in an Amazon Redshift cluster is partitioned into node slices, and each node slice corresponds to a compute node of an AnalyticDB for PostgreSQL instance that stores data and performs query computing.
When you create an AnalyticDB for PostgreSQL instance, you can select the node specifications of the instance based on specifications of each node slice in the source Amazon Redshift cluster.
Example
An Amazon Redshift cluster consists of four compute nodes. Each compute node consists of two node slices. The specifications of each node slice are 2 cores, 16 GB memory, and 1 TB storage.
When you create an AnalyticDB for PostgreSQL instance, you can set the number of compute nodes to 8, node specifications to 2 cores and 16 GB memory, and single-node storage to 1,000 GB.
For information about how to create an AnalyticDB for PostgreSQL instance, see Create an instance.
We recommend that you set the storage type to Enhanced SSD (ESSD) because ESSDs provide better I/O performance than ultra disks.
Procedure
Step 1: Export data from an Amazon Redshift cluster to an Amazon S3 bucket
You can execute the UNLOAD statement to export data. For more information, see UNLOAD.
Syntax:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
AUTHORIZATION
[ option [ ... ] ]
where option is
{ [ FORMAT [ AS ] ] CSV | PARQUET
| PARTITION BY ( column_name [, ... ] ) [ INCLUDE ]
| MANIFEST [ VERBOSE ]
| HEADER
| DELIMITER [ AS ] 'delimiter-char'
| FIXEDWIDTH [ AS ] 'fixedwidth-spec'
| ENCRYPTED [ AUTO ]
| BZIP2
| GZIP
| ZSTD
| ADDQUOTES
| NULL [ AS ] 'null-string'
| ESCAPE
| ALLOWOVERWRITE
| CLEANPATH
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE [AS] max-size [ MB | GB ]
| REGION [AS] 'Amazon-region' }
We recommend that you use the FORMAT AS PARQUET or CSV parameter to specify the format of the exported data.
To improve data export efficiency, we recommend that you specify the PARALLEL ON parameter so that data is exported in parallel and more files are generated.
To generate more files, we recommend that you specify the MAXFILESIZE parameter to limit the maximum size of each exported file. We recommend that you choose a value that makes the number of generated files an integer multiple of the number of compute nodes in the AnalyticDB for PostgreSQL instance. This improves efficiency when the data is subsequently imported in parallel from OSS foreign tables into the AnalyticDB for PostgreSQL instance.
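The following statement is a minimal sketch of such an export. The schema1.table1 table, the S3 path, and the IAM role ARN are placeholders that you must replace with your own values.
-- Export schema1.table1 to Amazon S3 as CSV files, in parallel, with each file capped at 256 MB.
-- The S3 path and the IAM role ARN are placeholders.
UNLOAD ('SELECT * FROM schema1.table1')
TO 's3://example-bucket/redshift-export/table1_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-unload-role'
FORMAT AS CSV
PARALLEL ON
MAXFILESIZE 256 MB
ALLOWOVERWRITE;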
Step 2: Synchronize data from Amazon S3 to OSS
You can use Alibaba Cloud Data Online Migration or the ossimport tool to synchronize data from Amazon S3 to OSS.
Data Online Migration
Log on to the Data Transport console.
Create the source and destination data addresses for data synchronization.
In the left-side navigation pane, go to the data address management page.
Click Create Data Address to create a source data address.
In the Create Data Address panel, configure the parameters that are described in the following table and click OK.
Parameter
Required
Description
Name
Yes
The name of the source data address. The name must meet the following requirements:
The name is 3 to 63 characters in length.
The name can contain only lowercase letters, digits, hyphens (-), and underscores (_).
The name is encoded in the UTF-8 format and cannot start with a hyphen (-) or an underscore (_).
Type
Yes
The type of the source data address. Select AWS S3.
Domain Name
Yes
The endpoint of the source data address. For more information, see the "Amazon S3 endpoints" section of the Amazon Simple Storage Service endpoints and quotas topic.
AccessKeyId and SecretAccessKey
Yes
The access key pair of the Identity and Access Management (IAM) user that is used to read data from the source data address. Amazon S3 uses the access key pair to check whether the IAM user has the permissions to read data from the source data address.
Bucket
Yes
The name of the Amazon S3 bucket in which the data to be migrated is stored.
Note: The bucket name cannot contain invalid characters, such as spaces, line breaks, or tab characters.
Prefix
No
The prefix of the source data address. You can specify a prefix to migrate specific data.
If you specify a prefix, only the data in the directory that is specified by the prefix is migrated. A prefix cannot start with a forward slash (/) and must end with a forward slash (/).
If you do not specify a prefix, all data in the Amazon S3 bucket is migrated.
Tunnel
No
The name of the tunnel that you want to use.
Important: This parameter is required only when you migrate data to the cloud by using Express Connect circuits or VPN gateways or migrate data from self-managed databases to the cloud.
If data at the destination data address is stored in a local file system or you need to migrate data over an Express Connect circuit in an environment such as Alibaba Finance Cloud or Apsara Stack, you must create and deploy an agent.
Agent
No
The name of the agent that you want to use.
Important: This parameter is required only when you migrate data to the cloud by using Express Connect circuits or VPN gateways or migrate data from self-managed databases to the cloud.
You can select up to 30 agents at a time for a specific tunnel.
Click Create Data Address again to create a destination data address.
In the Create Data Address panel, configure the parameters that are described in the following table and click OK.
Parameter
Required
Description
Data Type
Yes
The type of the destination data address. Select OSS.
Data Name
Yes
The name of the destination data address. The name must be 3 to 63 characters in length. The name cannot contain special characters except for hyphens (-) and underscores (_).
Data Region
Yes
The region in which the destination data address resides.
Enable Transfer Acceleration
No
If you want to use Data Online Migration together with the transfer acceleration feature of Object Storage Service (OSS), enable the transfer acceleration feature for buckets. The transfer acceleration feature takes effect within 30 minutes after it is enabled. After you enable the feature, wait for 30 minutes before you create a migration job.
Important: If you use the transfer acceleration feature to access your OSS buckets, you are charged additional transfer acceleration fees. For more information, see Transfer acceleration.
OSS Endpoint
Yes
The endpoint of the destination data address. Select an endpoint based on the region in which the destination data address resides. For more information, see Regions and endpoints.
Note: When you migrate data from third-party cloud services to OSS, you must access OSS from a public endpoint over the Internet.
Access Key Id and Access Key Secret
Yes
The AccessKey pair of the RAM user that is used to write data to the destination data address. The AccessKey pair is used by OSS to check whether the RAM user has the permissions to write data to the destination data address.
OSS Bucket
Yes
The OSS bucket to which the data is migrated.
OSS Prefix
No
The prefix of the destination data address.
If you specify a prefix, the source data is migrated to the specified directory. The value cannot start with a forward slash (/) and must end with a forward slash (/). Example: data/to/oss/.
If you do not specify a prefix, the source data is migrated to the root directory of the destination bucket.
Important: If the name of a source file starts with a forward slash (/), you must specify an OSS prefix when you specify the destination data address. If no OSS prefix is specified, the migration job fails. For example, if the name of a file to be migrated is /test/test.png, you must specify an OSS prefix, such as oss/. After the migration is complete, the name of the file changes from /test/test.png to oss//test/test.png.
Create an online migration job.
In the left-side navigation pane, go to the migration job management page.
Click Create Job.
In the Create Job panel, read the terms of the migration service and select I agree to the above terms and activate the Data Transport service. Then, click Next.
In the Fee Reminder message, click Yes, Go Ahead.
In the Create Job panel, configure the parameters that are described in the following table and click Next.
Parameter
Required
Description
Job Name
Yes
The name of the migration job. The name must be 3 to 63 characters in length. The name cannot contain special characters except for hyphens (-) and underscores (_).
Source Data Address
Yes
The source data address that you created.
Destination Data Address
Yes
The destination data address that you created.
Specified Directory
No
The directories in which files and subdirectories are migrated or not migrated. Valid values:
Do not filter: migrates all data from the source data address.
Exclude: does not migrate the files or subdirectories in the specified directories.
For example, if you want to migrate all the directories in the root_dir/ directory except root_dir/bad_sub1/ and root_dir/bad_sub2/, you can select this method and specify bad_sub1/ and bad_sub2/.
Contain: migrates only the files and subdirectories in the specified directories.
For example, if you want to migrate only root_dir/good_sub1/ and root_dir/good_sub2/ in the root_dir/ directory, you can select this method and specify good_sub1/ and good_sub2/.
Note: A directory name can contain only digits and letters. If you use special characters, the migration job may fail.
A directory name cannot start with a forward slash (/) or a backslash (\), and cannot contain double forward slashes (//), double periods (..), or double quotation marks ("). The character string that consists of all the specified directory names cannot exceed 10 KB in length.
A directory name must end with a forward slash (/). For example, you can specify docs/ as the directory name.
You can specify up to 20 directories of the Exclude or Contain type.
Migration Type
Yes
The data migration mode. Valid values:
Full: The system migrates all data from the source data address to the destination data address. You must configure the Start Time Point of File parameter. The system migrates all data whose last modification time is later than the specified time. After all data is migrated, the migration job ends.
If the data at the source data address changes after the migration job is complete, you can submit another full migration job. In this case, the system migrates only the changed data.
Incremental: The system runs a migration job based on the values that you specified for the Migration Interval and Migration Times parameters.
You must configure the Start Time Point of File parameter. During the first migration, the system migrates all data whose last modification time is later than the specified time. After the first migration is complete, incremental migration is performed based on the value that you specified for the Migration Interval parameter. For each incremental migration, the system migrates only the data that is created or modified after the previous migration starts and before the current migration starts.
If you set the Migration Times parameter to N, full migration is performed once and then incremental migration is performed (N - 1) times.
For example, the current time is 08:00, March 10, 2019 and you set the Migration Interval parameter to 1, the Migration Times parameter to 5, and the Start Time Point of File parameter to 08:00, March 5, 2019. When the first migration starts, the system migrates all files whose last modification time is between 08:00, March 5, 2019 and 08:00, March 10, 2019. In this example, the first migration requires one hour to complete and the second migration starts at 10:00, March 10, 2019. The duration from 08:00 to 10:00 consists of one hour for the first migration process and one hour for the migration interval. During the second migration, the system migrates the files whose last modification time is between 08:00, March 10, 2019 and 10:00, March 10, 2019. The migration job consists of one full migration and four incremental migrations.
Important: Before full or incremental migration starts, the system compares the files at the source data address with the files at the destination data address. If a source file has the same name as a destination file, the destination file is overwritten when one of the following conditions is met:
The Content-Type values of the files at the source data address and the destination data address are different.
The last modification time of the file at the source data address is later than that of the file at the destination data address.
The size of the file at the source data address is different from that of the file at the destination data address.
Start Time Point of File
Yes
The point in time that is used to filter the data to be migrated. The system migrates the data that is created or modified after the specified time. Valid values:
All: migrates all files.
Assign: migrates only the files that are created or modified after the specified time.
For example, if you set the time to 08:00:00, November 1, 2018, only the files that are created or modified after 08:00:00, November 1, 2018 are migrated.
Migration Interval
This parameter is required if you set the Migration Type parameter to Incremental.
The default value is 1, and the maximum value is 24. Unit: hours.
Migration Times
This parameter is required if you set the Migration Type parameter to Incremental.
The default value is 1, and the maximum value is 30.
File Overwrite Method
Yes
The method that is used to process the file at the source data address whose name is the same as that of a file at the destination data address. Valid values:
LastModified: If a source file has the same name as a destination file, the system compares the last modified time indicated by the LastModified property of the files.
If the last modification time of the file at the source data address is later than that of the file at the destination data address, the file at the source data address is migrated and the file at the destination data address is overwritten.
If the last modification time of the file at the source data address is earlier than that of the file at the destination data address, the file at the source data address is not migrated and the file at the destination data address is retained.
If the two files have the same last modification time, the system checks the values of the Size and Content-Type properties of the files.
If the two files have the same Size and Content-Type values, the file at the source data address is not migrated. If the two files have different values of at least one property, the file at the source data address is migrated and the file at the destination data address is overwritten.
Condition: If a source file has the same name as a destination file, the system compares the values of the LastModified, Size, and Content-Type properties of the files.
If all property values of the two files are the same, the file at the source data address is not migrated.
If the two files have different values of at least one property, the file at the source data address is migrated and the file at the destination data address is overwritten.
All: If a source file has the same name as a destination file, the system performs no comparison and directly overwrites the destination file with the source file.
No: If a source file has the same name as a destination file, the system performs no comparison and directly skips the source file in the migration.
Warning: If you select Condition or LastModified, earlier files may overwrite later files.
If you select Condition or LastModified, make sure that the source files have LastModified, Size, and Content-Type values. Otherwise, the overwrite policy may become ineffective and data is not migrated as expected.
In the Performance step, go to the Data Prediction section and configure the Data Size and File Count parameters.
Note: To ensure that the migration job is successful, we recommend that you estimate the amount of data that you want to migrate as accurately as possible. For more information, see Prepare for a migration job.
(Optional) In the Performance step, go to the Flow Control section, specify the Time Range and Max Flow(MB/s) parameters, and then click Add.
Note: To ensure business continuity, we recommend that you configure the Time Range and Max Flow parameters based on traffic fluctuations.
Click Create and wait until the migration job is complete.
ossimport
Download and install ossimport in standalone mode. For more information, see Overview.
The ossimport tool in standalone mode has the following file structure:
ossimport
├── bin
│   └── ossimport2.jar    # The JAR package that contains the Master, Worker, TaskTracker, and Console modules.
├── conf
│   ├── local_job.cfg     # The job configuration file.
│   └── sys.properties    # The configuration file that contains system parameters.
├── console.bat           # The Windows command line utility that is used to run tasks in a distributed manner.
├── console.sh            # The Linux command line utility that is used to run tasks in a distributed manner.
├── import.bat            # The script that automatically imports files based on the conf/local_job.cfg configuration file in Windows. The configuration file contains parameters that specify data migration operations, such as start, migration, verification, and retry.
├── import.sh             # The script that automatically imports files based on the conf/local_job.cfg configuration file in Linux. The configuration file contains parameters that specify data migration operations, such as start, migration, verification, and retry.
├── logs                  # The directory that contains logs.
└── README.md             # The file that provides a description of ossimport. We recommend that you read the file before you use ossimport.
Configure ossimport in standalone mode.
You need to modify only the following parameters in the conf/local_job.cfg configuration file:
srcType=s3
srcAccessKey=<Your AWS access key>
srcSecretKey=<Your AWS secret key>
srcDomain=<The domain name that corresponds to the region in which the Amazon S3 bucket is deployed>
srcBucket=<The name of the Amazon S3 bucket>
destAccessKey=<Your Alibaba Cloud AccessKey ID>
destSecretKey=<Your Alibaba Cloud AccessKey secret>
destDomain=<The endpoint that corresponds to the region in which the OSS bucket is deployed>
destBucket=<The name of the OSS bucket>
destPrefix=
isSkipExistFile=true
For more information, see Overview.
Run ossimport to synchronize data to OSS. For information about how to use ossimport in standalone mode, see Standalone deployment.
Step 3: Create a destination table
Create a destination table in the AnalyticDB for PostgreSQL instance to load data from Amazon Redshift. The destination table must have the same schema as the source table. For more information, see CREATE TABLE.
Convert the DDL statements of database objects such as schemas, tables, functions, and views to the AnalyticDB for PostgreSQL syntax. For more information, see the "Convert DDL statements" section of this topic.
Step 4: Import data from OSS to the AnalyticDB for PostgreSQL instance
You can use the COPY statement or an OSS foreign table to import data to AnalyticDB for PostgreSQL.
For information about how to use the COPY statement to import OSS data, see Use the COPY or UNLOAD statement to import or export data between OSS foreign tables and AnalyticDB for PostgreSQL tables.
For information about how to use an OSS foreign table to import OSS data, see the "Use OSS foreign tables for data lake analysis" section of the Use OSS foreign tables for data lake analysis topic.
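The following statements are a minimal sketch of the foreign table approach. They assume the oss_ext external table syntax that is supported by earlier AnalyticDB for PostgreSQL versions; if your instance uses the oss_fdw foreign tables described in the linked topic, adapt the definition to that syntax. The endpoint, dir, id, key, and bucket values, as well as the table definition, are placeholders.
-- Create a readable OSS external table whose columns match the destination table schema1.table1.
-- The endpoint, dir, id, key, and bucket values are placeholders.
CREATE READABLE EXTERNAL TABLE oss_external_table.table1 (
    filed1 VARCHAR(100),
    filed2 INTEGER,
    filed3 INTEGER,
    filed4 BIGINT,
    filed5 INTEGER
)
LOCATION('oss://oss-cn-hangzhou.aliyuncs.com
    dir=redshift-export/ id=<AccessKey ID> key=<AccessKey secret> bucket=example-oss-bucket')
FORMAT 'csv'
ENCODING 'utf8';

-- Load the OSS data into the destination table in parallel.
INSERT INTO schema1.table1 SELECT * FROM oss_external_table.table1;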
Convert DDL statements
The DDL statements of Amazon Redshift are slightly different from the DDL statements of AnalyticDB for PostgreSQL and must be converted before you migrate data.
CREATE SCHEMA
Execute the CREATE SCHEMA statement of AnalyticDB for PostgreSQL to create a schema. Example:
CREATE SCHEMA schema1 AUTHORIZATION xxxpoc;
GRANT ALL ON SCHEMA schema1 TO xxxpoc;
GRANT ALL ON SCHEMA schema1 TO public;
COMMENT ON SCHEMA model IS 'for xxx migration poc test';
CREATE SCHEMA oss_external_table AUTHORIZATION xxxpoc;
CREATE FUNCTION
AnalyticDB for PostgreSQL does not support specific SQL functions of Amazon Redshift. You must modify or rewrite the SQL functions. Examples:
Replace the CONVERT_TIMEZONE(a,b,c) function with timezone(b, timezone(a,c)).
Replace the GETDATE() function with current_timestamp(0)::timestamp. For sample rewrites, see the statements at the end of this section.
Replace or optimize user-defined functions (UDFs). The following sample code provides an example of a UDF of Amazon Redshift:
CREATE OR REPLACE FUNCTION public.f_jdate(dt timestamp WITHOUT time zone)
RETURNS character varying AS '
from datetime import timedelta, datetime
if dt.hour < 4:
    d = timedelta(days=-1)
    dt = dt + d
return str(dt.date())'
LANGUAGE plpythonu IMMUTABLE;
COMMIT;
Replace the preceding function with the following expression: to_char(a - interval '4 hour', 'yyyy-mm-dd')
Replace other Amazon Redshift functions.
For information about standard PostgreSQL functions, see Functions and Operators. Modify or rewrite the SQL functions of Amazon Redshift that are not supported by AnalyticDB for PostgreSQL.
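The following statements are a minimal sketch of the CONVERT_TIMEZONE and GETDATE rewrites that are described in this section. The time zone names and the timestamp literal are only for illustration.
-- Amazon Redshift:
SELECT CONVERT_TIMEZONE('UTC', 'Asia/Shanghai', '2023-01-01 00:00:00');
SELECT GETDATE();

-- AnalyticDB for PostgreSQL equivalents:
SELECT timezone('Asia/Shanghai', timezone('UTC', '2023-01-01 00:00:00'::timestamp));
SELECT current_timestamp(0)::timestamp;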
CREATE TABLE
Modify the name of a table.
The name of an Amazon Redshift table can be up to 127 characters in length, whereas the name of an AnalyticDB for PostgreSQL table can be up to 63 characters in length. If the names of database objects such as tables, functions, and views are more than 63 characters in length, you must remove the excess characters.
Modify the compression algorithm.
Replace ENCODE XXX in a CREATE TABLE statement of Amazon Redshift with the following clause: WITH (COMPRESSTYPE={ZLIB|ZSTD|RLE_TYPE|NONE})
For information about the compression algorithms supported by AnalyticDB for PostgreSQL, see the "Data compression" section of the Define storage models for tables topic.
Modify distribution keys.
Amazon Redshift supports three distribution keys. For more information, see Distribution styles. You must modify distribution keys based on the following rules:
EVEN distribution: Use DISTRIBUTED RANDOMLY.
KEY distribution: Use DISTRIBUTED BY (column, [ ... ]).
ALL distribution: Use DISTRIBUTED REPLICATED.
Modify sort keys.
Replace the COMPOUND or INTERLEAVED option in the [ COMPOUND | INTERLEAVED ] SORTKEY (column_name [, ...]) clause of Amazon Redshift with the following clause: ORDER BY (column, [ ... ])
Examples:
Example 1:
The following sample code provides an example of a CREATE TABLE statement of Amazon Redshift:
CREATE TABLE schema1.table1 (
    filed1 VARCHAR(100) ENCODE lzo,
    filed2 INTEGER DISTKEY,
    filed3 INTEGER,
    filed4 BIGINT ENCODE lzo,
    filed5 INTEGER
)
INTERLEAVED SORTKEY (
    filed1,
    filed2
);
Convert the preceding statement into the following CREATE TABLE statement of AnalyticDB for PostgreSQL:
CREATE TABLE schema1.table1 (
    filed1 VARCHAR(100),
    filed2 INTEGER,
    filed3 INTEGER,
    filed4 BIGINT,
    filed5 INTEGER
)
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zstd)
DISTRIBUTED BY (filed2)
ORDER BY (filed1, filed2);

-- Sort
SORT schema1.table1;
MULTISORT schema1.table1;
Example 2:
The following sample code provides an example of a CREATE TABLE statement of Amazon Redshift, which includes the ENCODE and SORTKEY options:
CREATE TABLE schema2.table2 (
    filed1 VARCHAR(50) ENCODE lzo,
    filed2 VARCHAR(50) ENCODE lzo,
    filed3 VARCHAR(20) ENCODE lzo
)
DISTSTYLE EVEN
INTERLEAVED SORTKEY (
    filed1
);
Convert the preceding statement into the following CREATE TABLE statement of AnalyticDB for PostgreSQL:
CREATE TABLE schema2.table2 (
    filed1 VARCHAR(50),
    filed2 VARCHAR(50),
    filed3 VARCHAR(20)
)
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zstd)
DISTRIBUTED RANDOMLY
ORDER BY (filed1);

-- Sort
SORT schema2.table2;
MULTISORT schema2.table2;
Note: For more information about sort keys, see Use sort keys and rough set indexes to accelerate queries in column-oriented tables.
CREATE VIEW
Convert a CREATE VIEW statement of Amazon Redshift into an SQL statement that complies with the AnalyticDB for PostgreSQL syntax. The conversion rules are similar to the conversion rules of CREATE TABLE statements.
The WITH NO SCHEMA BINDING clause is not supported.
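The following statements are a minimal sketch of such a conversion. The view name and the query are placeholders.
-- Amazon Redshift: the view is created with the WITH NO SCHEMA BINDING clause.
CREATE VIEW schema1.v_table1 AS
SELECT filed1, filed2 FROM schema1.table1
WITH NO SCHEMA BINDING;

-- AnalyticDB for PostgreSQL: remove the WITH NO SCHEMA BINDING clause.
CREATE VIEW schema1.v_table1 AS
SELECT filed1, filed2 FROM schema1.table1;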
References
For information about automatic data migration, see Configure automatic data migration from an Amazon Redshift cluster to an AnalyticDB for PostgreSQL instance.