CREATE TABLE

Create a new table.

Table of contents

Synopsis

CREATE TABLE [ IF NOT EXISTS ] table_ident ( [
    {
        base_column_definition
      | generated_column_definition
      | table_constraint
    }
    [, ... ] ]
)
[ PARTITIONED BY (column_name [, ...] ) ]
[ CLUSTERED [ BY (routing_column) ] INTO num_shards SHARDS ]
[ WITH ( table_parameter [= value] [, ... ] ) ]

where base_column_definition:

column_name data_type
[ DEFAULT default_expr ]
[ column_constraint [ ... ] ]  [ storage_options ]

where generated_column_definition is:

column_name [ data_type ] [ GENERATED ALWAYS ]
AS [ ( ] generation_expression [ ) ]
[ column_constraint [ ... ] ]

where column_constraint is:

{ PRIMARY KEY |
  NOT NULL |
  INDEX { OFF | USING { PLAIN |
                        FULLTEXT [ WITH ( analyzer = analyzer_name ) ]  }
  [ CONSTRAINT constraint_name ] CHECK (boolean_expression)
}

where storage_options is:

STORAGE WITH ( option = value_expression [, ... ] )

and table_constraint is:

{ PRIMARY KEY ( column_name [, ... ] ) |
  INDEX index_name USING FULLTEXT ( column_name [, ... ] )
       [ WITH ( analyzer = analyzer_name ) ]
  [ CONSTRAINT constraint_name ] CHECK (boolean_expression)
}

Description

CREATE TABLE will create a new, initially empty table.

If the table_ident does not contain a schema, the table is created in the doc schema. Otherwise it is created in the given schema, which is implicitly created, if it didn’t exist yet.

A table consists of one or more base columns and any number of generated columns and/or table_constraints.

The optional constraint clauses specify constraints (tests) that new or updated rows must satisfy for an INSERT, UPDATE or COPY FROM operation to succeed. A constraint is an SQL object that helps define the set of valid values in the table in various ways.

There are two ways to define constraints: table constraints and column constraints. A column constraint is defined as part of a column definition. A table constraint definition is not tied to a particular column, and it can encompass more than one column. Every column constraint can also be written as a table constraint; a column constraint is only a notational convenience for use when the constraint only affects one column.

Table elements

Base Columns

A base column is a persistent column in the table metadata. In relational terms it is an attribute of the tuple of the table-relation. It has a name, a type, an optional default clause and optional constraints.

Base columns are readable and writable (if the table itself is writable). Values for base columns are given in DML statements explicitly or omitted, in which case their value is null.

Default clause

The optional default clause defines the default value of the column. The value is inserted when the column is a target of an INSERT or COPY FROM statement that doesn’t contain an explicit value for it.

The default clause expression is variable-free, it means that subqueries and cross-references to other columns are not allowed.

Note

Default values are not allowed for columns of type OBJECT:

cr> CREATE TABLE tbl (obj OBJECT DEFAULT {key='foo'})
SQLParseException[Default values are not allowed for object columns: obj]

They are allowed for sub columns of an object column. If an object column has at least one child with a default expression it will implicitly create the full object unless it’s within an array.

An example:

cr> CREATE TABLE object_defaults (id int, obj OBJECT AS (key TEXT DEFAULT ''))
CREATE OK, 1 row affected  (... sec)

cr> INSERT INTO object_defaults (id) VALUES (1)
INSERT OK, 1 row affected  (... sec)

cr> REFRESH TABLE object_defaults
REFRESH OK, 1 row affected  (... sec)

cr> SELECT obj FROM object_defaults
+-------------+
| obj         |
+-------------+
| {"key": ""} |
+-------------+
SELECT 1 row in set (... sec)

Generated columns

A generated column is a persistent column that is computed as needed from the generation_expression for every INSERT, UPDATE and COPY FROM operation.

The GENERATED ALWAYS part of the syntax is optional.

Note

A generated column is not a virtual column. The computed value is stored in the table like a base column is. The automatic computation of the value is what makes it different.

Table constraints

Table constraints are constraints that are applied to more than one column or to the table as a whole.

Column constraints

Column constraints are constraints that are applied on each column of the table separately.

Storage options

Storage options can be applied on each column of the table separately.

Parameters

table_ident

The name (optionally schema-qualified) of the table to be created.

column_name

The name of a column to be created in the new table.

data_type

The data type of the column. This can include array and object specifiers.

generation_expression

An expression (usually a function call) that is applied in the context of the current row. As such, it can reference other base columns of the table. Referencing other generated columns (including itself) is not supported. The generation expression is evaluated each time a row is inserted or the referenced base columns are updated.

IF NOT EXISTS

If the optional IF NOT EXISTS clause is used, this statement won’t do anything if the table exists already.

CLUSTERED

The optional CLUSTERED clause specifies how a table should be distributed across a cluster.

[ CLUSTERED [ BY (routing_column) ] INTO num_shards SHARDS ]
num_shards

Specifies the number of shards a table is stored in. Must be greater than 0. If not provided, the number of shards is calculated based on the number of currently active data nodes with the following formula:

num_shards = max(4, num_data_nodes * 2)

Note

The minimum value of num_shards is set to 4. This means if the calculation of num_shards does not exceeds its minimum it applies the minimum value to each table or partition as default.

routing_column

Specify a routing column that determines how rows are sharded.

All rows that have the same routing_column row value are stored in the same shard. If a primary key has been defined, it will be used as the default routing column, otherwise the internal document ID is used.

PARTITIONED BY

The PARTITIONED clause splits the created table into separate partitions for every distinct combination of row values in the specified partition columns.

[ PARTITIONED BY ( column_name [ , ... ] ) ]
column_name

The name of a column to be used for partitioning. Multiple columns names can be specified inside the parentheses and must be separated by commas.

The following restrictions apply:

  • Partition columns may not be part of the CLUSTERED clause

  • Partition columns must only contain primitive types

  • Partition columns may not be inside an object array

  • Partition columns may not be indexed with a fulltext index with analyzer

  • If the table has a PRIMARY KEY constraint, all of the partition columns must be included in the primary key definition

Caution

Partition columns cannot be altered by an UPDATE statement.

WITH

The optional WITH clause can specify parameters for tables.

[ WITH ( table_parameter [= value] [, ... ] ) ]
table_parameter

Specifies an optional parameter for the table.

Note

Some parameters are nested, and therefore need to be wrapped in double quotes in order to be set. For example:

WITH ("allocation.max_retries" = 5)

Nested parameters are those that contain a . between parameter names (e.g. write.wait_for_active_shards).

Available parameters are:

number_of_replicas

Specifies the number or range of replicas each shard of a table should have for normal operation, the default is to have 0-1 replica.

The number of replicas is defined like this:

min_replicas [ - [ max_replicas ] ]
min_replicas

The minimum number of replicas required.

max_replicas

The maximum number of replicas.

The actual maximum number of replicas is max(num_replicas, N-1), where N is the number of data nodes in the cluster. If max_replicas is the string all then it will always be N.

See also

Replication

number_of_routing_shards

This number specifies the hashing space that is used internally to distribute documents across shards.

This is an optional setting that enables users to later on increase the number of shards using ALTER TABLE. It’s not possible to update this setting after table creation.

refresh_interval

In CrateDB new written records are not immediately visible. A user has to either invoke the REFRESH statement or wait for an automatic background refresh.

The interval of this background refresh is specified in milliseconds using this refresh_interval setting.

By default it’s not specified, which causes tables to be refreshed once every second but only if the table is not idle. A table can become idle if no query accesses it for more than 30 seconds.

If a table is idle, the periodic refresh is temporarily disabled. A query hitting an idle table will trigger a refresh and enable the periodic refresh again.

When refresh_interval is set explicitly, table is refreshed regardless of idle state. Use ALTER TABLE RESET to switch to default 1 second refresh and freeze-on-idle behavior.

value

The refresh interval in milliseconds. A value smaller or equal than 0 turns off the automatic refresh. A value of greater than 0 schedules a periodic refresh of the table.

Note

A refresh_interval of 0 does not guarantee that new writes are NOT visible to subsequent reads. Only the periodic refresh is disabled. There are other internal factors that might trigger a refresh.

Note

On partitioned tables, the idle mechanism works per partition. This can be useful for time-based partitions where older partitions are rarely queried.

The downside is that if many partitions are idle and a query activates them, there will be a spike in refresh load. If you’ve such an access pattern, you may want to set an explicit refresh_interval to have a permanent background refresh.

write.wait_for_active_shards

Specifies the number of shard copies that need to be active for write operations to proceed. If less shard copies are active the operation must wait and retry for up to 30s before timing out.

value

all or a positive integer up to the total number of configured shard copies (number_of_replicas + 1).

A value of 1 means only the primary has to be active. A value of 2 means the primary plus one replica shard has to be active, and so on.

The default value is set to 1.

all is a special value that means all shards (primary + replicas) must be active for write operations to proceed.

Increasing the number of shard copies to wait for improves the resiliency of the system. It reduces the chance of write operations not writing to the desired number of shard copies, but it does not eliminate the possibility completely, because the check occurs before the write operation starts.

Replica shard copies that missed some writes will be brought up to date by the system eventually, but in case a node holding the primary copy has a system failure, the replica copy couldn’t be promoted automatically as it would lead to data loss since the system is aware that the replica shard didn’t receive all writes. In such a scenario, ALTER TABLE .. REROUTE PROMOTE REPLICA can be used to force the allocation of a stale replica copy to at least recover the data that is available in the stale replica copy.

Say you’ve a 3 node cluster and a table with 1 configured replica. With write.wait_for_active_shards=1 and number_of_replicas=1 a node in the cluster can be restarted without affecting write operations because the primary copies are either active or the replicas can be quickly promoted.

If write.wait_for_active_shards would be set to 2 instead and a node is stopped, the write operations would block until the replica is fully replicated again or the write operations would timeout in case the replication is not fast enough.

blocks.read_only

Allows to have a read only table.

value

Table is read only if value set to true. Allows writes and table settings changes if set to false.

blocks.read_only_allow_delete

Allows to have a read only table that additionally can be deleted.

value

Table is read only and can be deleted if value set to true. Allows writes and table settings changes if set to false. This flag should not be set manually as it’s used, in an automated way, by the mechanism that protects CrateDB nodes from running out of available disk space.

When a disk on a node exceeds the cluster.routing.allocation.disk.watermark.flood_stage threshold, this block is applied (set to true) to all tables on that affected node. Once you’ve freed disk space again and the threshold is undershot, the setting is automatically reset to false for the affected tables.

Note

During maintenance operations, you might want to temporarily disable reads, writes or table settings changes. To achieve this, please use the corresponding settings blocks.read, blocks.write, blocks.metadata, or blocks.read_only, which must be manually reset after the maintenance operation has been completed.

blocks.read

disable/enable all the read operations

value

Set to true to disable all read operations for a table, otherwise set false.

blocks.write

disable/enable all the write operations

value

Set to true to disable all write operations and table settings modifications, otherwise set false.

blocks.metadata

disable/enable the table settings modifications.

values

Disables the table settings modifications if set to true. If set to false, table settings modifications are enabled.

soft_deletes.enabled

Indicates whether soft deletes are enabled or disabled.

Soft deletes allow CrateDB to preserve recent deletions within the Lucene index. This information is used for shard recovery.

Before the introduction of soft deletes, CrateDB had to retain the information in the Translog. Using soft deletes uses less storage than the Translog equivalent and is faster.

Soft deletes are mandatory in CrateDB 5.0, therefore this setting can no longer be modified. It will always be set to true.

The setting will be removed in CrateDB 6.0.

soft_deletes.retention_lease.period

The maximum period for which a retention lease is retained before it is considered expired.

value

12h (default). Any positive time value is allowed.

CrateDB sometimes needs to replay operations that were executed on one shard on other shards. For example if a shard copy is temporarily unavailable but write operations to the primary copy continues, the missed operations have to be replayed once the shard copy becomes available again.

If soft deletes are enabled, CrateDB uses a Lucene feature to preserve recent deletions in the Lucene index so that they can be replayed. Because of that, deleted documents still occupy disk space, which is why CrateDB only preserves certain recently-deleted documents. CrateDB eventually fully discards deleted documents to prevent the index growing larger despite having deleted documents.

CrateDB keeps track of operations it expects to need to replay using a mechanism called shard history retention leases. Retention leases are a mechanism that allows CrateDB to determine which soft-deleted operations can be safely discarded.

If a shard copy fails, it stops updating its shard history retention lease, indicating that the soft-deleted operations should be preserved for later recovery.

However, to prevent CrateDB from holding onto shard retention leases forever, they expire after soft_deletes.retention_lease.period, which defaults to 12h. Once a retention lease has expired CrateDB can again discard soft-deleted operations. In case a shard copy recovers after a retention lease has expired, CrateDB will fall back to copying the whole index since it can no longer replay the missing history.

codec

By default data is stored using LZ4 compression. This can be changed to best_compression which uses DEFLATE for a higher compression ratio, at the expense of slower column value lookups.

values

default or best_compression

store.type

The store type setting allows you to control how data is stored and accessed on disk. It’s not possible to update this setting after table creation. The following storage types are supported:

fs

Default file system implementation. It will pick the best implementation depending on the operating environment, which is currently hybridfs on all supported systems but is subject to change.

niofs

The NIO FS type stores the shard index on the file system (Lucene NIOFSDirectory) using NIO. It allows multiple threads to read from the same file concurrently.

mmapfs

The MMap FS type stores the shard index on the file system (Lucene MMapDirectory) by mapping a file into memory (mmap). Memory mapping uses up a portion of the virtual memory address space in your process equal to the size of the file being mapped. Before using this type, be sure you have allowed plenty of virtual address space.

hybridfs

The hybridfs type is a hybrid of niofs and mmapfs, which chooses the best file system type for each type of file based on the read access pattern. Similarly to mmapfs be sure you have allowed plenty of virtual address space.

It is possible to restrict the use of the mmapfs and hybridfs store type via the node.store.allow_mmap node setting.

mapping.total_fields.limit

Sets the maximum number of columns that is allowed for a table. Default is 1000.

value

Maximum amount of fields in the Lucene index mapping. This includes both the user facing mapping (columns) and internal fields.

translog.flush_threshold_size

Sets size of transaction log prior to flushing.

value

Size (bytes) of translog.

translog.sync_interval

How often the translog is fsynced to disk. Defaults to 5s. When setting this interval, please keep in mind that changes logged during this interval and not synced to disk may get lost in case of a failure. This setting only takes effect if translog.durability is set to ASYNC.

value

Interval in milliseconds.

translog.durability

If set to ASYNC the translog gets flushed to disk in the background every translog.sync_interval. If set to REQUEST the flush happens after every operation.

value

REQUEST (default), ASYNC

routing.allocation.total_shards_per_node

Controls the total number of shards (replicas and primaries) allowed to be allocated on a single node. Defaults to unbounded (-1).

value

Number of shards per node.

routing.allocation.enable

Controls shard allocation for a specific table. Can be set to:

all

Allows shard allocation for all shards. (Default)

primaries

Allows shard allocation only for primary shards.

new_primaries

Allows shard allocation only for primary shards for new tables.

none

No shard allocation allowed.

allocation.max_retries

Defines the number of attempts to allocate a shard before giving up and leaving the shard unallocated.

value

Number of retries to allocate a shard. Defaults to 5.

routing.allocation.include.{attribute}

Assign the table to a node whose {attribute} has at least one of the comma-separated values.

routing.allocation.require.{attribute}

Assign the table to a node whose {attribute} has all of the comma-separated values.

routing.allocation.exclude.{attribute}

Assign the table to a node whose {attribute} has none of the comma-separated values.

unassigned.node_left.delayed_timeout

Delay the allocation of replica shards which have become unassigned because a node has left. It defaults to 1m to give a node time to restart completely (which can take some time when the node has lots of shards). Setting the timeout to 0 will start allocation immediately. This setting can be changed on runtime in order to increase/decrease the delayed allocation if needed.

column_policy

Specifies the column policy of the table. The default column policy is strict.

The column policy is defined like this:

WITH ( column_policy = {'dynamic' | 'strict'} )
strict

Rejecting any column on INSERT, UPDATE or COPY FROM which is not defined in the schema

dynamic

New columns can be added using INSERT, UPDATE or COPY FROM. New columns added to dynamic tables are, once added, usable as usual columns. One can retrieve them, sort by them and use them in WHERE clauses.

max_ngram_diff

Specifies the maximum difference between max_ngram and min_ngram when using the NGramTokenizer or the NGramTokenFilter. The default is 1.

max_shingle_diff

Specifies the maximum difference between min_shingle_size and max_shingle_size when using the ShingleTokenFilter. The default is 3.

merge.scheduler.max_thread_count

The maximum number of threads on a single shard that may be merging at once. Defaults to Math.max(1, Math.min(4, Runtime.getRuntime().availableProcessors() / 2)) which works well for a good solid-state-disk (SSD). If your index is on spinning platter drives instead, decrease this to 1.