How to query CSV and JSON files using Python SDK - Object Storage Service

This topic describes how to query CSV and JSON files using the SelectObject operation in the Python SDK.

Usage notes

In this topic, the public endpoint of the China (Hangzhou) region is used. If you want to access OSS from other Alibaba Cloud services in the same region as OSS, use an internal endpoint. For more information about OSS regions and endpoints, see Regions and endpoints.
In this topic, an OSSClient instance is created by using an OSS endpoint. If you want to create an OSSClient instance by using custom domain names or Security Token Service (STS), see Initialization.
To query objects, you must have the oss:GetObject permission. For more information, see Attach a custom policy to a RAM user.
OSS supports querying only CSV and JSON files.

Python SDK example

The following code shows how to query CSV and JSON files:

import oss2
from oss2.credentials import EnvironmentVariableCredentialsProvider

def select_call_back(consumed_bytes, total_bytes = None):
    print('Consumed Bytes:' + str(consumed_bytes) + '\n')

# Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are set.
auth = oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider())

# Specify the endpoint of the region where the bucket is located. For example, if the bucket is in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com.
endpoint = "https://oss-cn-hangzhou.aliyuncs.com"

# Specify the region where the endpoint is located, such as cn-hangzhou. Note that this parameter is required for V4 signatures.
region = "cn-hangzhou"

# Replace yourBucketName with the bucket name.
bucket = oss2.Bucket(auth, endpoint, "yourBucketName", region=region)

key ='python_select.csv'
content ='Tom Hanks,USA,45\r\n'*1024
filename ='python_select.csv'

# Upload the CSV file.
bucket.put_object(key, content)
# Parameters for the SelectObject operation.
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
                    'RecordDelimiter': '\r\n',
                    'LineRange': (500, 1000)}

csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)

### JSON DOCUMENT
key =  'python_select.json'
content =  "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename =  'python_select.json'
# Upload the JSON DOCUMENT.
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

### JSON LINES
key =  'python_select_lines.json'
content =  "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename =  'python_select.json'
# Upload the JSON LINE.
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)

result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content =  result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
           "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

Python SelectObject

The following section describes the methods of the Python Select API, such as select_object, select_object_to_file, and create_select_object_meta.

select_object

Example: select_object
```
def select_object(self, key, sql,
                   progress_callback=None,
                   select_params=None
                   byte_range=None
                   headers=None
                   ):
```
The preceding example shows how to run an SQL statement on a file that is specified by a key and returns the query result.
- sql is the raw SQL string. Base64 encoding is not required.
- Progress_callback is an optional callback function used to report progress.
- select_params specifies various parameters and behaviors for the select operation.
- headers specifies the header information to include in the request. This parameter has the same behavior as the get_object method. For example, for CSV files, you can use this parameter to specify a byte range in the file for the SQL query.

Parameters supported by select_params

Parameter	Description
Json_Type	If Json_Type is not specified, the file is considered a CSV file by default. If Json_Type is set to DOCUMENT, the file is a JSON DOCUMENT file. If Json_Type is set to LINES, the file is a JSON LINE file.
CsvHeaderInfo	The header information of the CSV file. Valid values are None, Ignore, and Use. None: The file has no header information. Ignore: The file has header information, but it is not used in the SQL statement. Use: The file has header information, and the column names in the header are used in the SQL statement.
CommentCharacter	The comment character in the CSV file. Only one character is supported. The default value is None, which indicates that there is no comment character.
RecordDelimiter	The row delimiter in the CSV file. Only one or two characters are supported. The default value is \n.
OutputRecordDelimiter	The row delimiter in the select output. The default value is \n.
FieldDelimiter	The column delimiter for the CSV file. Only one character is supported. The default value is a comma (,).
OutputFieldDelimiter	The column delimiter in the select output. The default value is a comma (,).
QuoteCharacter	The quote character for columns in the CSV file. Only one character is supported. The default value is a double quotation mark ("). Row and column delimiters within quotation marks are treated as regular characters.
SplitRange	Performs a sharded query using splits. The format is (start, end), which is a closed interval. This indicates that the query range is from split #start to split #end.
LineRange	Performs a sharded query using rows. The format is (start, end), which is a closed interval. This indicates that the query range is from row number #start to row number #end.
CompressionType	The compression type. The value can be GZIP. The default value is None.
KeepAllColumns	If this parameter is set to true, columns in the original CSV file that are not in the select list are output as empty values, but their positions are retained. The default value is False. If the columns in the CSV file are firstname, lastname, and age, and the SQL statement is `select firstname, age from ossobject`: If KeepAllColumns is true, the output is firstname,,age (with an extra comma in the middle). If KeepAllColumns is false, the output is firstname,age. Note This option is introduced so that the code originally used to process data returned by GetObject can be switched to SelectObject without modification.
OutputRawData	If this parameter is True, the output is select data without frame packaging. If no data is returned for a long time, a timeout may occur. If this parameter is False, the output is data packaged in frames. The default value is False.
EnablePayloadCrc	Calculates a cyclic redundancy check (CRC) value for each frame. The default value is False.
OutputHeader	Used only for CSV files. This indicates that the first line of the output is the header information.
SkipPartialDataRecord	If this parameter is True and a column value in a CSV file or a key in a JSON file does not exist, the entire record is skipped. If this parameter is False, the column is treated as null. For example, a row contains the columns firstname,lastname,age. The SQL statement is `select _1, _4 from ossobject`. If the parameter is True, this row is skipped. If the parameter is False, firstname,\n is returned.
MaxSkippedRecordsAllowed	The maximum number of rows that can be skipped. The default value is 0, which indicates that an error is returned if any row is skipped.
ParseJsonNumberAsString	If this parameter is True, all numbers in the JSON file are parsed as strings. If this parameter is False, numbers are parsed as integers or floating-point numbers. The default value is False. If a JSON file contains high-precision floating-point numbers, parsing them directly as floating-point numbers may cause a loss of precision. To retain the original precision, set this parameter to True and cast the column to the decimal type in the SQL statement.

Return value: The method returns a SelectObjectResult object. This object supports the read() function to retrieve all query results. It also supports the __iter__ method.
Note
If the query result is large, calling the read() function blocks the process until the complete result is returned and can consume a large amount of memory. To process each chunk, use the __iter__ method, for example, `foreach chunk in result`. Using the __iter__ method reduces memory usage. It also allows the client to process each chunk as soon as it is returned by the OSS server, rather than waiting for the entire result.

select_object_to_file
```
def select_object_to_file(self, key, filename, sql,
                   progress_callback=None,
                   select_params=None
                   headers=None
                   ):
```
The preceding example shows how to run an SQL statement on a file that is specified by a key and write the query result to a specified file.
The parameters are the same as those for select_object.

create_select_object_meta

Syntax:
```
def create_select_object_meta(self, key, select_meta_params=None, header=None):
```
The preceding example shows how to create or retrieve select metadata for a file that is specified by a key. Select metadata includes the total number of rows, the total number of columns for CSV files, and the total number of splits for the file.
If metadata has already been created for the file, this function does not recreate the metadata unless OverwriteIfExists is set to true in the parameters.
Creating select metadata requires scanning the entire file.

Parameters supported in select_meta_params

Parameter	Description
Json_Type	If Json_Type is not specified, the file is considered a CSV file by default. If specified, the value must be LINES, which indicates that the file is a JSON LINES file. Note This operation is not supported for JSON DOCUMENT files.
RecordDelimiter	The line feed character for the CSV file.
FieldDelimiter	The column delimiter for the CSV file.
QuoteCharacter	The quote character for columns in the CSV file. Row and column delimiters within quotation marks are treated as regular characters.
CompressionType	The compression type. Currently, no compression types are supported. Therefore, the value can only be None.
OverwriteIfExists	Overwrites the original select metadata. You do not need to use this option in normal cases.

Return value: A GetSelectObjectMetaResult object that includes the rows and splits properties. For CSV files, the internal select_resp object also includes the columns value, which indicates the number of columns in the CSV file.

References

For the complete sample code that shows how to query files, see GitHub example.
For more information about the API operation used to query files, see SelectObject.