Data manipulation

This section provides an overview of how to manipulate data (e.g., inserting rows) with CrateDB.

Table of contents

Inserting data

Inserting data to CrateDB is done by using the SQL INSERT statement.

Note

The column list is always ordered based on the column position in the CREATE TABLE statement of the table. If the insert columns are omitted, the values in the VALUES clauses must correspond to the table columns in that order.

Inserting a row:

cr> insert into locations (id, date, description, kind, name, position)
... values (
...   '14',
...   '2013-09-12T21:43:59.000Z',
...   'Blagulon Kappa is the planet to which the police are native.',
...   'Planet',
...   'Blagulon Kappa',
...   7
... );
INSERT OK, 1 row affected (... sec)

When inserting rows with the VALUES clause all data is validated in terms of data types compatibility and compliance with defined constraints, and if there are any issues an error message is returned and no rows are inserted.

Inserting multiple rows at once (aka. bulk insert) can be done by defining multiple values for the INSERT statement:

cr> insert into locations (id, date, description, kind, name, position) values
... (
...   '16',
...   '2013-09-14T21:43:59.000Z',
...   'Blagulon Kappa II is the planet to which the police are native.',
...   'Planet',
...   'Blagulon Kappa II',
...   19
... ),
... (
...   '17',
...   '2013-09-13T16:43:59.000Z',
...   'Brontitall is a planet with a warm, rich atmosphere and no mountains.',
...   'Planet',
...   'Brontitall',
...   10
... );
INSERT OK, 2 rows affected (... sec)

When inserting into tables containing Generated columns or Base Columns having the Default clause specified, their values can be safely omitted. They are generated upon insert:

cr> CREATE TABLE debit_card (
...   owner text,
...   num_part1 integer,
...   num_part2 integer,
...   check_sum integer GENERATED ALWAYS AS ((num_part1 + num_part2) * 42),
...   "user" text DEFAULT 'crate'
... );
CREATE OK, 1 row affected (... sec)
cr> insert into debit_card (owner, num_part1, num_part2) values
... ('Zaphod Beeblebrox', 1234, 5678);
INSERT OK, 1 row affected (... sec)
cr> select * from debit_card;
+-------------------+-----------+-----------+-----------+-------+
| owner             | num_part1 | num_part2 | check_sum | user  |
+-------------------+-----------+-----------+-----------+-------+
| Zaphod Beeblebrox |      1234 |      5678 |    290304 | crate |
+-------------------+-----------+-----------+-----------+-------+
SELECT 1 row in set (... sec)

For Generated columns, if the value is given, it is validated against the generation clause of the column and the currently inserted row:

cr> insert into debit_card (owner, num_part1, num_part2, check_sum) values
... ('Arthur Dent', 9876, 5432, 642935);
SQLParseException[Given value 642935 for generated column check_sum does not match calculation ((num_part1 + num_part2) * 42) = 642936]

Inserting data by query

It is possible to insert data using a query instead of values. Column data types of source and target table can differ as long as the values are castable. This gives the opportunity to restructure the tables data, renaming a field, changing a field’s data type or convert a normal table into a partitioned one.

Caution

When inserting data from a query, there is no error message returned when rows fail to be inserted, they are instead skipped, and the number of rows affected is decreased to reflect the actual number of rows for which the operation succeeded.

Example of changing a field’s data type, in this case, changing the position data type from integer to smallint:

cr> create table locations2 (
...     id text primary key,
...     name text,
...     date timestamp with time zone,
...     kind text,
...     position smallint,
...     description text
... ) clustered by (id) into 2 shards with (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)
cr> insert into locations2 (id, name, date, kind, position, description)
... (
...     select id, name, date, kind, position, description
...     from locations
...     where position < 10
... );
INSERT OK, 14 rows affected (... sec)

Example of creating a new partitioned table out of the locations table with data partitioned by year:

cr> create table locations_parted (
...     id text primary key,
...     name text,
...     year text primary key,
...     date timestamp with time zone,
...     kind text,
...     position integer
... ) clustered by (id) into 2 shards
... partitioned by (year) with (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)
cr> insert into locations_parted (id, name, year, date, kind, position)
... (
...     select
...         id,
...         name,
...         date_format('%Y', date),
...         date,
...         kind,
...         position
...     from locations
... );
INSERT OK, 16 rows affected (... sec)

Resulting partitions of the last insert by query:

cr> select table_name, partition_ident, values, number_of_shards, number_of_replicas
... from information_schema.table_partitions
... where table_name = 'locations_parted'
... order by partition_ident;
+------------------+-----------------+------------------+------------------+--------------------+
| table_name       | partition_ident | values           | number_of_shards | number_of_replicas |
+------------------+-----------------+------------------+------------------+--------------------+
| locations_parted | 042j2e9n74      | {"year": "1979"} |                2 |                  0 |
| locations_parted | 042j4c1h6c      | {"year": "2013"} |                2 |                  0 |
+------------------+-----------------+------------------+------------------+--------------------+
SELECT 2 rows in set (... sec)

Note

limit, offset and order by are not supported inside the query statement.

Upserts (ON CONFLICT DO UPDATE SET)

The ON CONFLICT DO UPDATE SET clause is used to update the existing row if inserting is not possible because of a duplicate-key conflict if a document with the same PRIMARY KEY already exists. This is type of operation is commonly referred to as an upsert, being a combination of “update” and “insert”.

cr> SELECT
...     name,
...     visits,
...     extract(year from last_visit) AS last_visit
... FROM uservisits ORDER BY NAME;
+----------+--------+------------+
| name     | visits | last_visit |
+----------+--------+------------+
| Ford     |      1 | 2013       |
| Trillian |      3 | 2013       |
+----------+--------+------------+
SELECT 2 rows in set (... sec)
cr> INSERT INTO uservisits (id, name, visits, last_visit) VALUES
... (
...     0,
...     'Ford',
...     1,
...     '2015-01-12'
... ) ON CONFLICT (id) DO UPDATE SET
...     visits = visits + 1;
INSERT OK, 1 row affected (... sec)
cr> SELECT
...     name,
...     visits,
...     extract(year from last_visit) AS last_visit
... FROM uservisits WHERE id = 0;
+------+--------+------------+
| name | visits | last_visit |
+------+--------+------------+
| Ford |      2 | 2013       |
+------+--------+------------+
SELECT 1 row in set (... sec)

It’s possible to refer to values which would be inserted if no duplicate-key conflict occurred, by using the special excluded table. This table is especially useful in multiple-row inserts, to refer to the current rows values:

cr> INSERT INTO uservisits (id, name, visits, last_visit) VALUES
... (
...     0,
...     'Ford',
...     2,
...     '2016-01-13'
... ),
... (
...     1,
...     'Trillian',
...     5,
...     '2016-01-15'
... ) ON CONFLICT (id) DO UPDATE SET
...     visits = visits + excluded.visits,
...     last_visit = excluded.last_visit;
INSERT OK, 2 rows affected (... sec)
cr> SELECT
...     name,
...     visits,
...     extract(year from last_visit) AS last_visit
... FROM uservisits ORDER BY name;
+----------+--------+------------+
| name     | visits | last_visit |
+----------+--------+------------+
| Ford     |      4 | 2016       |
| Trillian |      8 | 2016       |
+----------+--------+------------+
SELECT 2 rows in set (... sec)

This can also be done when using a query instead of values:

cr> CREATE TABLE uservisits2 (
...   id integer primary key,
...   name text,
...   visits integer,
...   last_visit timestamp with time zone
... ) CLUSTERED BY (id) INTO 2 SHARDS WITH (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)
cr> INSERT INTO uservisits2 (id, name, visits, last_visit)
... (
...     SELECT id, name, visits, last_visit
...     FROM uservisits
... );
INSERT OK, 2 rows affected (... sec)
cr> INSERT INTO uservisits2 (id, name, visits, last_visit)
... (
...     SELECT id, name, visits, last_visit
...     FROM uservisits
... ) ON CONFLICT (id) DO UPDATE SET
...     visits = visits + excluded.visits,
...     last_visit = excluded.last_visit;
INSERT OK, 2 rows affected (... sec)
cr> SELECT
...     name,
...     visits,
...     extract(year from last_visit) AS last_visit
... FROM uservisits ORDER BY name;
+----------+--------+------------+
| name     | visits | last_visit |
+----------+--------+------------+
| Ford     |      4 | 2016       |
| Trillian |      8 | 2016       |
+----------+--------+------------+
SELECT 2 rows in set (... sec)

Updating data

In order to update documents in CrateDB the SQL UPDATE statement can be used:

cr> update locations set description = 'Updated description'
... where name = 'Bartledan';
UPDATE OK, 1 row affected (... sec)

Updating nested objects is also supported:

cr> update locations set inhabitants['name'] = 'Human' where name = 'Bartledan';
UPDATE OK, 1 row affected (... sec)

It’s also possible to reference a column within the expression, for example to increment a number like this:

cr> update locations set position = position + 1 where position < 3;
UPDATE OK, 6 rows affected (... sec)

Note

If the same documents are updated concurrently an VersionConflictException might occur. CrateDB contains a retry logic that tries to resolve the conflict automatically.

Deleting data

Deleting rows in CrateDB is done using the SQL DELETE statement:

cr> delete from locations where position > 3;
DELETE OK, ... rows affected (... sec)

Import and export

Importing data

Using the COPY FROM statement, CrateDB nodes can import data from local files or files that are available over the network.

The supported data formats are JSON and CSV. The format is inferred from the file extension, if possible. Alternatively the format can also be provided as an option (see WITH). If the format is not provided and cannot be inferred from the file extension, it will be processed as JSON.

JSON files must contain a single JSON object per line.

Example JSON data:

{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."}

CSV files must contain a header with comma-separated values, which will be added as columns.

Example CSV data:

id,quote
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."

Note

  • The COPY FROM statement will convert and validate your data.

  • Values for generated columns will be computed if the data does not contain them, otherwise they will be imported and validated

  • Furthermore, column names in your data are considered case sensitive (as if they were quoted in a SQL statement).

For further information, including how to import data to Partitioned tables, take a look at the COPY FROM reference.

Example

Here’s an example statement:

cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json';
COPY OK, 3 rows affected (... sec)

This statement imports data from the /tmp/import_data/quotes.json file into a table named quotes.

Note

The file you specify must be available on one of the CrateDB nodes. This statement will not work with files that are local to your client.

For the above statement, every node in the cluster will attempt to import data from a file located at /tmp/import_data/quotes.json relative to the crate process (i.e., if you are running CrateDB inside a container, the file must also be inside the container).

If you want to import data from a file that on your local computer using COPY FROM, you must first transfer the file to one of the CrateDB nodes.

Consult the COPY FROM reference for additional information.

If you want to import all files inside the /tmp/import_data directory on every CrateDB node, you can use a wildcard, like so:

cr> COPY quotes FROM '/tmp/import_data/*' WITH (bulk_size = 4);
COPY OK, 3 rows affected (... sec)

This wildcard can also be used to only match certain files in a directory:

cr> COPY quotes FROM '/tmp/import_data/qu*.json';
COPY OK, 3 rows affected (... sec)

Detailed error reporting

If the RETURN_SUMMARY clause is specified, a result set containing information about failures and successfully imported records is returned.

cr> COPY locations FROM '/tmp/import_data/locations_with_failure/locations*.json' RETURN SUMMARY;
 +--...--+----------...--------+---------------+-------------+--------------------...-------------------------------------+
 | node  | uri                 | success_count | error_count | errors                                                     |
 +--...--+----------...--------+---------------+-------------+--------------------...-------------------------------------+
 | {...} | .../locations1.json |             6 |           0 | {}                                                         |
 | {...} | .../locations2.json |             5 |           2 | {"Cannot cast value...{"count": ..., "line_numbers": ...}} |
 +--...--+----------...--------+---------------+-------------+--------------------...-------------------------------------+
 COPY 2 rows in set (... sec)

If an error happens while processing the URI in general, the error_count and success_count columns will contains NULL values to indicate that no records were processed.

cr> COPY locations FROM '/tmp/import_data/not-existing.json' RETURN SUMMARY;
 +--...--+-----------...---------+---------------+-------------+------------------------...------------------------+
 | node  | uri                   | success_count | error_count | errors                                            |
 +--...--+-----------...---------+---------------+-------------+------------------------...------------------------+
 | {...} | .../not-existing.json |          NULL |        NULL | {"...not-existing.json (...)": {"count": 1, ...}} |
 +--...--+-----------...---------+---------------+-------------+------------------------...------------------------+
COPY 1 row in set (... sec)

See COPY FROM for more information.

Exporting data

Data can be exported using the COPY TO statement. Data is exported in a distributed way, meaning each node will export its own data.

Replicated data is not exported. So every row of an exported table is stored only once.

This example shows how to export a given table into files named after the table and shard ID with gzip compression:

cr> REFRESH TABLE quotes;
REFRESH OK...
cr> COPY quotes TO DIRECTORY '/tmp/' with (compression='gzip');
COPY OK, 3 rows affected ...

Instead of exporting a whole table, rows can be filtered by an optional WHERE clause condition. This is useful if only a subset of the data needs to be exported:

cr> COPY quotes WHERE match(quote_ft, 'time') TO DIRECTORY '/tmp/' WITH (compression='gzip');
COPY OK, 2 rows affected ...

For further details see COPY TO.