COPY FROM

You can use the COPY FROM statement to copy data from a file into a table.

Synopsis

COPY table_ident
  [ ( column_ident [, ...] ) ]
  [ PARTITION (partition_column = value [ , ... ]) ]
  FROM uri [ WITH ( option = value [, ...] ) ] [ RETURN SUMMARY ]

Description

The COPY FROM statement copies data from a URI into the specified table.

The nodes in the cluster will attempt to read the files available at the URI and import the data.

Here’s an example:

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json';
COPY OK, 3 rows affected (... sec)

File formats

CrateDB accepts both JSON and CSV inputs. The format is inferred from the file extension (.json or .csv respectively) if possible. The format can also be set as an option. If a format is not specified and the format cannot be inferred, the file will be processed as JSON.

JSON files must contain a single JSON object per line and must be UTF-8 encoded. Empty lines are skipped.

Example JSON data:

{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."}

A CSV file may or may not contain a header. See CSV header option for further details.

Example CSV data:

id,quote
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."

Example CSV data with no header:

1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."

See also: Importing data.

Data type checks

CrateDB checks whether the columns’ data types match the types in the import file, casting values implicitly where possible, and imports the data as given in the source file. Furthermore, CrateDB checks all column constraints.

For example, a WKT string cannot be imported into a column of geo_shape or geo_point type, since there is no implicit cast to the GeoJSON format.

Note

If the COPY FROM statement fails, the log output on the node will provide an error message. Any data imported up to that point has been written to the table and should be deleted before restarting the import.

Parameters

table_ident

The name (optionally schema-qualified) of the existing table into which the data should be imported.

column_ident

Used in an optional columns declaration, each column_ident is the name of a column in the table_ident table.

This currently only has an effect if using the CSV file format. See the header section for how it behaves.

uri

An expression or array of expressions. Each expression must evaluate to a string literal that is a well-formed URI.

URIs must use one of the supported URI schemes. CrateDB supports globbing for the file and s3 URI schemes.
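For example, a single statement can read from multiple locations at once (a sketch with hypothetical paths, reusing the quotes table from the earlier example):

cr> COPY quotes FROM ['file:///tmp/import_a/quotes.json', 'file:///tmp/import_b/quotes.json'];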

Note

If the URI scheme is missing, CrateDB assumes the value is a pathname and will prepend the file URI scheme (i.e., file://). So, for example, CrateDB will convert /tmp/file.json to file:///tmp/file.json.

URI globbing

With file and s3 URI schemes, you can use pathname globbing (i.e., * wildcards) with the COPY FROM statement to construct URIs that can match multiple directories and files.

Suppose you used file:///tmp/import_data/*/*.json as the URI. This URI would match all JSON files located in subdirectories of the /tmp/import_data directory.

So, for example, these files would match:

  • /tmp/import_data/foo/1.json

  • /tmp/import_data/bar/2.json

  • /tmp/import_data/1/boz.json

Caution

A file named /tmp/import_data/foo/.json would also match the file:///tmp/import_data/*/*.json URI. The * wildcard matches any number of characters, including none.

However, these files would not match:

  • /tmp/import_data/1.json (too few subdirectories)

  • /tmp/import_data/foo/bar/2.json (too many subdirectories)

  • /tmp/import_data/1/boz.js (file extension mismatch)
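Putting it together, a globbed import might look like this (a sketch reusing the quotes table from the earlier example):

cr> COPY quotes FROM 'file:///tmp/import_data/*/*.json';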

URI schemes

CrateDB supports the following URI schemes:

file

You can use the file:// scheme to specify an absolute path to one or more files accessible via the local filesystem of one or more CrateDB nodes.

For example:

file:///path/to/dir

The files must be accessible on at least one node and the system user running the crate process must have read access to every file specified.

By default, every node will attempt to import every file. If the file is accessible on multiple nodes, you can set the shared option to true in order to avoid importing duplicates.

Use RETURN SUMMARY to get information about what actions were performed on each node.
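For example, if the same file is mounted on every node (a sketch reusing the quotes table from the earlier example):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json' WITH (shared = true) RETURN SUMMARY;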

Tip

If you are running CrateDB inside a container, the file must be inside the container. If you are using Docker, you may have to configure a Docker volume to accomplish this.

Tip

If you are using Microsoft Windows, you must include the drive letter in the file URI.

For example:

file://C:\/tmp/import_data/quotes.json

Consult the Windows documentation for more information.

s3

You can use the s3:// scheme to access buckets on the Amazon Simple Storage Service (Amazon S3).

For example:

s3://[<accesskey>:<secretkey>@][<host>:<port>/]<bucketname>/<path>

S3-compatible storage providers can be specified with the optional host and port pair; if omitted, the client connects to Amazon S3.

Here is a more concrete example:

COPY t FROM 's3://accessKey:secretKey@s3.amazonaws.com:443/myBucket/key/a.json' with (protocol = 'https')

If no credentials are set, the S3 client will operate in anonymous mode. See the AWS Java documentation.

Using the s3:// scheme automatically sets the shared option to true.

Tip

A secretkey provided by Amazon Web Services can contain characters such as ‘/’, ‘+’ or ‘=’. These characters must be URL-encoded. For a detailed explanation, read the official AWS documentation.

To escape a secret key, you can use a snippet like this:

sh$ python -c "from getpass import getpass; from urllib.parse import quote_plus; print(quote_plus(getpass('secret_key: ')))"

This will prompt for the secret key and print the encoded variant.

Additionally, CrateDB versions prior to 0.51.x use HTTP for connections to S3. Since 0.51.x, these connections use HTTPS. Make sure to update your firewall rules to allow outgoing connections on port 443.

Other schemes

In addition to the schemes above, CrateDB supports all protocols supported by the URL implementation of its JVM (typically http, https, ftp, and jar). Please refer to the documentation of the JVM vendor for an accurate list of supported protocols.

Note

These schemes do not support wildcard expansion.
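For example, to import from an HTTP server (a sketch with a hypothetical URL, reusing the quotes table from the earlier example):

cr> COPY quotes FROM 'https://example.com/data/quotes.json';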

Clauses

The COPY FROM statement supports the following clauses:

PARTITION

If the table is partitioned, the optional PARTITION clause can be used to import data into one partition exclusively.

[ PARTITION ( partition_column = value [ , ... ] ) ]

partition_column

One of the column names used for table partitioning.

value

The respective column value.

All partition columns (specified by the PARTITIONED BY clause) must be listed inside the parentheses along with their respective values using the partition_column = value syntax (separated by commas).

Because each partition corresponds to a unique set of partition column row values, this clause uniquely identifies a single partition for import.
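For example, for a hypothetical table parted_quotes partitioned by a year column, an import into the 2023 partition might look like this:

cr> COPY parted_quotes PARTITION (year = 2023) FROM 'file:///tmp/import_data/quotes_2023.json';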

Tip

The SHOW CREATE TABLE statement will show you the complete list of partition columns specified by the PARTITIONED BY clause.

Caution

Partitioned tables do not store the row values for the partition columns; hence, every row will be imported into the specified partition regardless of the partition column values in the source data.

WITH

You can use the optional WITH clause to specify option values.

[ WITH ( option = value [, ...] ) ]

The WITH clause supports the following options:

bulk_size

CrateDB processes the lines it reads from the path in batches. This option specifies the size of one batch. The value must be greater than 0; the default is 10000.

fail_fast

A boolean value indicating whether the COPY FROM operation should abort early after an error. This is best-effort: due to the distributed execution, some records may still be processed before the operation aborts. Defaults to false.

wait_for_completion

A boolean value indicating if the COPY FROM should wait for the copy operation to complete. If set to false the request returns at once and the copy operation runs in the background. Defaults to true.
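For example, to run a large import in the background with a smaller batch size (a sketch reusing the quotes table from the earlier example):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json' WITH (bulk_size = 1000, wait_for_completion = false);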

shared

This option should be set to true if the URI’s location is accessible by more than one CrateDB node to prevent them from importing the same file.

The default value depends on the scheme of each URI.

If an array of URIs is passed to COPY FROM, this option overrides the default for all URIs.

node_filters

A filter expression to select the nodes to run the read operation.

It’s an object in the form of:

{
    name = '<node_name_regex>',
    id = '<node_id_regex>'
}

Only one of the keys is required.

The name regular expression is applied to the names of all execution nodes, whereas the id regex is applied to the node IDs.

If both keys are set, both regular expressions have to match for a node to be included.

If the shared option is false, a strict node filter might exclude nodes with access to the data leading to a partial import.

To verify which nodes match the filter, run the statement with EXPLAIN.
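For example, to restrict reading to nodes whose names match a pattern (a sketch with a hypothetical node naming scheme, reusing the quotes table):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json' WITH (node_filters = {name = 'data-node-.*'});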

num_readers

The number of nodes that will read the resources specified in the URI. The value must be an integer greater than 0 and defaults to the number of nodes available in the cluster. If the option is set to a number greater than the number of available nodes, each node is still used only once for the import.

If shared is set to false this option has to be used with caution. It might exclude the wrong nodes, causing COPY FROM to read no files or only a subset of the files.

compression

The default value is null. Set this option to gzip to read gzipped files.
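For example, to import a gzipped file (a sketch with a hypothetical path, reusing the quotes table):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json.gz' WITH (compression = 'gzip');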

protocol

Used with the s3 scheme only. The default protocol is https.

overwrite_duplicates

Default: false

COPY FROM by default won’t overwrite rows if a document with the same primary key already exists. Set to true to overwrite duplicate rows.
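For example (a sketch reusing the quotes table from the earlier example):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json' WITH (overwrite_duplicates = true);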

empty_string_as_null

If set to true, empty strings are converted into NULL. The default value is false, meaning empty strings are imported unchanged.

The option is only supported when using the CSV format, otherwise, it will be ignored.

delimiter

Specifies a single one-byte character that separates columns within each line of the file. The default delimiter is ,.

The option is only supported when using the CSV format, otherwise, it will be ignored.

format

This option specifies the format of the input file. Available formats are csv or json. If a format is not specified and the format cannot be guessed from the file extension, the file will be processed as JSON.
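For example, the CSV-related options can be combined (a sketch with a hypothetical pipe-delimited file, reusing the quotes table):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.data' WITH (format = 'csv', delimiter = '|', empty_string_as_null = true);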

validation

Default: true

Enables or disables parsing and validation of source data.

If set to false, CrateDB will not validate the data in any way. Use this with caution, and only if you are sure the data is in the right format, for example because you exported it using COPY TO.

All constraints, including type constraints, are bypassed. If you import invalid data this way, subsequent queries can fail.

If the table has generated columns or default expressions, validation will always take place and the parameter is ignored.
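For example, to re-import an export created with COPY TO without re-validating it (a sketch with a hypothetical path, reusing the quotes table):

cr> COPY quotes FROM 'file:///tmp/export/quotes_*.json' WITH (validation = false);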

RETURN SUMMARY

The optional RETURN SUMMARY clause returns a per-node result set containing information about possible failures and successfully inserted records.

[ RETURN SUMMARY ]

Each summary row contains the following columns (return type in parentheses):

node (OBJECT)

Information about the node that has processed the URI resource.

node['id'] (TEXT)

The id of the node.

node['name'] (TEXT)

The name of the node.

uri (TEXT)

The URI the node has processed.

error_count (BIGINT)

The total number of records which failed. A NULL value indicates a general URI reading error; the error will be listed inside the errors column.

success_count (BIGINT)

The total number of records which were inserted. A NULL value indicates a general URI reading error; the error will be listed inside the errors column.

errors (OBJECT)

Contains detailed information about all errors.

errors[ERROR_MSG] (OBJECT)

Contains information about one type of error.

errors[ERROR_MSG]['count'] (BIGINT)

The number of records that failed with this error.

errors[ERROR_MSG]['line_numbers'] (ARRAY)

The line numbers of the source URI where the error occurred, limited to the first 50 errors to avoid buffer pressure on clients.
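For example (a sketch reusing the quotes table from the earlier example; the actual summary values depend on your data and cluster):

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json' RETURN SUMMARY;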