nccopy - Copy a netCDF file, optionally changing format,
compression, or chunking in the output.
nccopy [-k kind_name] [-kind_code] [-d n] [-s] [-c chunkspec] [-u] [-w]
[-[v|V] var1,...] [-[g|G] grp1,...] [-m bufsize] [-h chunk_cache]
[-e cache_elems] [-r] [-F filterspec] [-L n] [-M n] infile outfile
The nccopy utility copies an input netCDF file in any
supported format variant to an output netCDF file, optionally converting the
output to any compatible netCDF format variant, compressing the data, or
rechunking the data. For example, if built with the netCDF-3 library, a
netCDF classic file may be copied to a netCDF 64-bit offset file, permitting
larger variables. If built with the netCDF-4 library, a netCDF classic file
may be copied to a netCDF-4 file or to a netCDF-4 classic model file as
well, permitting data compression, efficient schema changes, larger variable
sizes, and use of other netCDF-4 features.
If no output format is specified, with either -k kind_name
or -kind_code, then the output will use the same format as the input,
unless the input is classic or 64-bit offset and either chunking or
compression is specified, in which case the output will be netCDF-4 classic
model format. Attempting some kinds of format conversion will result in an
error, if the conversion is not possible. For example, an attempt to copy a
netCDF-4 file that uses features of the enhanced model, such as groups or
variable-length strings, to any of the other kinds of netCDF formats that
use the classic model will result in an error.
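The default-format rule above can be summarized as a small decision function. This is only a sketch of the rule as stated in the text, not the logic nccopy actually executes; the function name and the string labels for the format kinds are hypothetical, and conversion errors are not modeled.

```python
# Hypothetical labels for the format kinds; they mirror the names used by -k.
CLASSIC_KINDS = {"classic", "64-bit offset"}


def default_output_kind(input_kind, requested_kind=None,
                        chunking=False, compression=False):
    """Return the output kind implied by the rule described in the text."""
    if requested_kind is not None:
        # An explicit -k kind_name or -kind_code always wins.
        return requested_kind
    if input_kind in CLASSIC_KINDS and (chunking or compression):
        # Classic and 64-bit offset cannot hold chunked or compressed data.
        return "netCDF-4 classic model"
    # Otherwise the output keeps the input format.
    return input_kind


print(default_output_kind("classic", compression=True))
print(default_output_kind("netCDF-4", chunking=True))
```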
nccopy also serves as an example of a generic netCDF-4
program, with its ability to read any valid netCDF file and handle nested
groups, strings, and user-defined types, including arbitrarily nested
compound types, variable-length types, and data of any valid netCDF-4
type.
If DAP support was enabled when nccopy was built, the file
name may specify a DAP URL. This may be used to convert data on DAP servers
to local netCDF files.
- -k kind_name
- Use format name to specify the kind of file to be created and, by
inference, the data model (i.e. netcdf-3 (classic) or netcdf-4
(enhanced)). The possible arguments are:
- 'nc3' or 'classic' => netCDF classic format
- 'nc6' or '64-bit offset' => netCDF 64-bit format
- 'nc4' or 'netCDF-4' => netCDF-4 format (enhanced data model)
- 'nc7' or 'netCDF-4 classic model' => netCDF-4 classic model format
- Note: The old format numbers '1', '2', '3', and '4', equivalent to the
format names 'nc3', 'nc6', 'nc4', and 'nc7' respectively, are still accepted
but deprecated, because format numbers are easily confused with format
names.
- [-kind_code]
- Use format numeric code (instead of format name) to specify the kind of
file to be created and, by inference, the data model (i.e. netcdf-3
(classic) versus netcdf-4 (enhanced)). The numeric codes are:
- 3 => netCDF classic format
- 6 => netCDF 64-bit format
- 4 => netCDF-4 format (enhanced data model)
- 7 => netCDF-4 classic model format
The numeric code "7" is used because "7=3+4",
specifying the format that uses the netCDF-3 data model for compatibility
with the netCDF-4 storage format for performance. Credit is due to NCO for
use of these numeric codes instead of the old and confusing format
numbers.
- -d n
- For netCDF-4 output, including netCDF-4 classic model, specify deflation
level (level of compression) for variable data output. 0 corresponds to no
compression and 9 to maximum compression, with higher levels of
compression requiring marginally more time to compress or uncompress than
lower levels. Compression achieved may also depend on output chunking
parameters. If this option is specified for a classic format or 64-bit
offset format input file, it is not necessary to also specify that the
output should be netCDF-4 classic model, as that will be the default. If
this option is not specified and the input file has compressed variables,
the compression will still be preserved in the output, using the same
chunking as in the input by default.
- Note that nccopy requires all variables to be compressed using the
same compression level, but the API has no such restriction. With a
program you can customize compression for each variable
independently.
- -s
- For netCDF-4 output, including netCDF-4 classic model, specify shuffling
of variable data bytes before compression or after decompression.
Shuffling refers to interlacing of bytes in a chunk so that the first
bytes of all values are contiguous in storage, followed by all the second
bytes, and so on, which often improves compression. This option is ignored
unless a non-zero deflation level is specified. Using -d0 to specify no
deflation on input data that has been compressed and shuffled turns off
both compression and shuffling in the output.
- -u
- Convert any unlimited size dimensions in the input to fixed size
dimensions in the output. This can speed up variable-at-a-time access, but
slow down record-at-a-time access to multiple variables along an unlimited
dimension.
- -w
- Keep output in memory (as a diskless netCDF file) until output is closed,
at which time the output file is written to disk. This can greatly speed up
operations such as converting unlimited dimensions to fixed size (-u
option), chunking, rechunking, or compressing the input. It requires that
available memory be large enough to hold the output file. This option may
provide a larger speedup than careful tuning of the -m, -h, or -e options,
and it is simpler to use.
- -c
chunkspec
- For netCDF-4 output, including netCDF-4 classic model, specify chunking
(multidimensional tiling) for variable data in the output. This is useful
to specify the units of disk access, compression, or other filters such as
checksums. Changing the chunking in a netCDF file can also greatly speed up
access, by choosing chunk shapes that are appropriate for the most common
access patterns.
- The chunkspec argument has two forms. The first form is the
original, deprecated form and is a string of comma-separated associations,
each specifying a dimension name, a '/' character, and optionally the
corresponding chunk length for that dimension. No blanks should appear in
the chunkspec string, except possibly escaped blanks that are part of a
dimension name. A chunkspec names at least one dimension, and may omit
dimensions which are not to be chunked or for which the default chunk
length is desired. If a dimension name is followed by a '/' character but
no subsequent chunk length, the actual dimension length is assumed. If
copying a classic model file to a netCDF-4 output file and not naming all
dimensions in the chunkspec, unnamed dimensions will also use the actual
dimension length for the chunk length. An example of a chunkspec for
variables that use 'm' and 'n' dimensions might be 'm/100,n/200' to
specify 100 by 200 chunks. To see the chunking resulting from copying with
a chunkspec, use the '-s' option of ncdump on the output file.
- The chunkspec '/' that omits all dimension names and corresponding chunk
lengths specifies that no chunking is to occur in the output, so can be
used to unchunk all the chunked variables.
- As an I/O optimization, nccopy has a threshold for the minimum size
of non-record variables that get chunked, currently 8192 bytes. The -M
flag can be used to override this value.
- Note that nccopy requires variables that share a dimension to also
share the chunk size associated with that dimension, but the programming
interface has no such restriction. If you need to customize chunking for
variables independently, you will need to use the second form of
chunkspec. This second form of chunkspec has this syntax:
var:n1,n2,...,nn. This assumes that the variable named
"var" has rank n. The chunking to be applied to each dimension
of the variable is specified by the values of n1 through nn. This second
form of chunking specification can be repeated multiple times to specify
the exact chunking for different variables. If the variable is specified
but no chunk sizes are specified (i.e. -c var:) then chunking is
disabled for that variable. If the same variable is specified more than
once, the second and later specifications are ignored. Also, this second
form, per-variable chunking, takes precedence over any per-dimension
chunking except the bare "/" case.
- -v var1,...
- The output will include data values for the specified variables, in
addition to the declarations of all dimensions, variables, and attributes.
One or more variables must be specified by name in the comma-delimited
list following this option. The list must be a single argument to the
command, hence cannot contain unescaped blanks or other white space
characters. The named variables must be valid netCDF variables in the
input-file. A variable within a group in a netCDF-4 file may be specified
with an absolute path name, such as "/GroupA/GroupA2/var". Use
of a relative path name such as 'var' or "grp/var" specifies all
matching variable names in the file. The default, without this option, is
to include data values for all variables in the output.
- -V var1,...
- The output will include only the specified variables, together with all
dimensions and global or group attributes. One or more variables must be specified by
name in the comma-delimited list following this option. The list must be a
single argument to the command, hence cannot contain unescaped blanks or
other white space characters. The named variables must be valid netCDF
variables in the input-file. A variable within a group in a netCDF-4 file
may be specified with an absolute path name, such as
'/GroupA/GroupA2/var'. Use of a relative path name such as 'var' or
'grp/var' specifies all matching variable names in the file. The default,
without this option, is to include all variables in the
output.
- -g grp1,...
- The output will include data values only for the specified groups. One or
more groups must be specified by name in the comma-delimited list
following this option. The list must be a single argument to the command.
The named groups must be valid netCDF groups in the input-file. The
default, without this option, is to include data values for all groups in
the output.
- -G grp1,...
- The output will include only the specified groups. One or more groups must
be specified by name in the comma-delimited list following this option.
The list must be a single argument to the command. The named groups must
be valid netCDF groups in the input-file. The default, without this
option, is to include all groups in the output.
- -m bufsize
- An integer or floating-point number that specifies the size, in bytes, of
the copy buffer used to copy large variables. A suffix of K, M, G, or T
multiplies the copy buffer size by one thousand, million, billion, or
trillion, respectively. The default is 5 Mbytes, but will be increased if
necessary to hold at least one chunk of netCDF-4 chunked variables in the
input file. You may want to specify a value larger than the default for
copying large files over high latency networks. Using the '-w' option may
provide better performance, if the output fits in memory.
- -h chunk_cache
- For netCDF-4 output, including netCDF-4 classic model, an integer or
floating-point number that specifies the size in bytes of chunk cache
allocated for each chunked variable. This is not a property of the file,
but merely a performance tuning parameter for avoiding compressing or
decompressing the same data multiple times while copying and changing
chunk shapes. A suffix of K, M, G, or T multiplies the chunk cache size by
one thousand, million, billion, or trillion, respectively. The default is
4.194304 Mbytes (or whatever was specified for the configure-time constant
CHUNK_CACHE_SIZE when the netCDF library was built). Ideally, the
nccopy utility should accept only one memory buffer size and divide
it optimally between a copy buffer and chunk cache, but no general
algorithm for computing the optimum chunk cache size has been implemented
yet. Using the '-w' option may provide better performance, if the output
fits in memory.
- -e cache_elems
- For netCDF-4 output, including netCDF-4 classic model, specifies number of
chunks that the chunk cache can hold. A suffix of K, M, G, or T multiplies
the number of chunks that can be held in the cache by one thousand,
million, billion, or trillion, respectively. This is not a property of the
file, but merely a performance tuning parameter for avoiding compressing
or decompressing the same data multiple times while copying and changing
chunk shapes. The default is 1009 (or whatever was specified for the
configure-time constant CHUNK_CACHE_NELEMS when the netCDF library was
built). Ideally, the nccopy utility should determine an optimum
value for this parameter, but no general algorithm for computing the
optimum number of chunk cache elements has been implemented yet.
- -r
- Read netCDF classic or 64-bit offset input file into a diskless netCDF
file in memory before copying. Requires that input file be small enough to
fit into memory. For nccopy, this doesn't seem to provide any
significant speedup, so may not be a useful option.
- -L n
- Set the log level; only usable if nccopy supports netCDF-4
(enhanced).
- -M n
- Set the minimum chunk size; only usable if nccopy supports netCDF-4
(enhanced).
- -F
filterspec
- For netCDF-4 output, including netCDF-4 classic model, specify a filter to
apply to a specified variable in the output. As a rule, the filter is a
compression/decompression algorithm with a unique numeric identifier
assigned by the HDF Group (see
https://support.hdfgroup.org/services/filters.html).
- The filterspec argument has this general form:
fqn,filterid,param1,param2...paramn
The fqn (fully qualified name) is the name of a variable prefixed by its
containing groups with the group names separated by forward slash ('/'). An
example might be /g1/g2/var. Alternatively, just the variable name
can be given if it is in the root group: e.g. var. Backslash
escapes may be used as needed. The filterid is an unsigned positive integer
representing the id assigned by the HDF Group to the filter. Following the id
is a sequence of parameters defining the operation of the filter. Each
parameter is a 32-bit unsigned integer.
- This option may be repeated multiple times with different variable
names.
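The filterspec form fqn,filterid,param1,...,paramn described under -F can be illustrated with a short parser. This is a minimal sketch under stated assumptions, not nccopy's own parser: the function name is hypothetical, backslash escapes are not handled, and only plain decimal integers are accepted. The filter id 307 in the usage line is purely illustrative.

```python
def parse_filterspec(spec):
    """Split 'fqn,filterid,param1,...' into its parts (no escape handling)."""
    fields = spec.split(",")
    if len(fields) < 2 or not fields[0]:
        raise ValueError("expected 'fqn,filterid[,param...]'")
    fqn, filterid_text, *param_texts = fields

    filterid = int(filterid_text)
    if filterid <= 0:
        raise ValueError("filter id must be a positive integer")

    params = [int(text) for text in param_texts]
    for value in params:
        # Each parameter is described as a 32-bit unsigned integer.
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError("parameter out of 32-bit unsigned range")

    # A name without a leading '/' refers to a variable in the root group.
    path = [part for part in fqn.split("/") if part]
    return {"groups": path[:-1], "variable": path[-1],
            "filterid": filterid, "params": params}


print(parse_filterspec("/g1/g2/var,307,9"))
```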
Make a copy of foo1.nc, a netCDF file of any type, to foo2.nc, a
netCDF file of the same type:
nccopy foo1.nc foo2.nc
Note that the above copy will not be as fast as using cp or another
simple copy utility, because the file is copied using only the netCDF API.
If the input file has extra bytes after the end of the netCDF data, those
will not be copied, because they are not accessible through the netCDF
interface. If the original file was generated in "No fill" mode so
that fill values are not stored for padding for data alignment, the output
file may have different padding bytes.
Convert a netCDF-4 classic model file, compressed.nc, that uses
compression, to a netCDF-3 file classic.nc:
nccopy -k classic compressed.nc classic.nc
Note that 'nc3' could be used instead of 'classic'.
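The equivalence of 'nc3' and 'classic' follows from the kind names and numeric codes listed under -k and -kind_code. The lookup below is a sketch of those tables only; the function names and the short labels on the right-hand side are hypothetical. It also shows why the old format numbers are easy to confuse with the numeric codes.

```python
# Kind names accepted by -k, as listed in the text.
KIND_NAMES = {
    "nc3": "classic", "classic": "classic",
    "nc6": "64-bit offset", "64-bit offset": "64-bit offset",
    "nc4": "netCDF-4", "netCDF-4": "netCDF-4",
    "nc7": "netCDF-4 classic model",
    "netCDF-4 classic model": "netCDF-4 classic model",
}
# Deprecated format numbers that -k still accepts.
OLD_NUMBERS = {"1": "nc3", "2": "nc6", "3": "nc4", "4": "nc7"}
# Numeric codes used as options in their own right (-3, -6, -4, -7).
KIND_CODES = {3: "classic", 6: "64-bit offset",
              4: "netCDF-4", 7: "netCDF-4 classic model"}


def kind_from_name(name):
    """Resolve the argument of -k, including the deprecated numbers."""
    return KIND_NAMES[OLD_NUMBERS.get(name, name)]


def kind_from_code(code):
    """Resolve a -kind_code option such as -3 or -7."""
    return KIND_CODES[code]


print(kind_from_name("nc3"), "|", kind_from_name("classic"))
# '-k 3' (old number) and '-3' (numeric code) name different formats.
print(kind_from_name("3"), "|", kind_from_code(3))
```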
Download the variable 'time_bnds' and its associated attributes
from an OPeNDAP server and copy the result to a netCDF file named
'tb.nc':
nccopy 'http://test.opendap.org/opendap/data/nc/sst.mnmean.nc.gz?time_bnds' tb.nc
Note that URLs that name specific variables as command-line
arguments should generally be quoted, to avoid the shell interpreting
special characters such as '?'.
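When such a command line is generated by a script, the quoting can be left to a library. The sketch below uses Python's standard shlex module to build a shell-safe version of the command above; it only constructs and prints the string and does not run nccopy.

```python
import shlex

url = "http://test.opendap.org/opendap/data/nc/sst.mnmean.nc.gz?time_bnds"
argv = ["nccopy", url, "tb.nc"]

# shlex.quote adds single quotes only to arguments that need them,
# here the URL, because it contains '?'.
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)

# Splitting the quoted line again recovers the original argument list.
assert shlex.split(command_line) == argv
```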
Compress all the variables in the input file foo.nc, a netCDF file
of any type, to the output file bar.nc:
nccopy -d1 foo.nc bar.nc
If foo.nc was a classic or 64-bit offset netCDF file, bar.nc will
be a netCDF-4 classic model netCDF file, because the classic and 64-bit
offset format variants don't support compression. If foo.nc was a netCDF-4
file with some variables compressed using various deflation levels, the
output will also be a netCDF-4 file of the same type, but all the variables,
including any uncompressed variables in the input, will now use deflation
level 1.
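To make the -d and -s options more concrete, the following sketch applies byte shuffling as described under -s and then deflates the result with Python's zlib module at two levels. It is an illustration on synthetic data, not a measurement of nccopy; the sample values and the 4-byte integer type are assumptions, and actual sizes depend on the data and on chunking.

```python
import struct
import zlib


def shuffle(data, width):
    """Store the first bytes of all values, then the second bytes, and so on."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of the value width")
    return b"".join(data[i::width] for i in range(width))


def unshuffle(data, width):
    """Invert shuffle(), restoring the original byte order."""
    count = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * count:(i + 1) * count]
    return bytes(out)


# Assumption: slowly varying 4-byte integers stand in for variable data.
values = [1_000_000 + 3 * i for i in range(10_000)]
raw = struct.pack("<%di" % len(values), *values)

for level in (1, 9):
    plain = len(zlib.compress(raw, level))
    shuffled = len(zlib.compress(shuffle(raw, 4), level))
    print("level", level, "plain", plain, "shuffled", shuffled)
```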
Assume the input data includes gridded variables that use time,
lat, lon dimensions, with 1000 times by 1000 latitudes by 1000 longitudes,
and that the time dimension varies most slowly. Also assume that users want
quick access to data at all times for a small set of lat-lon points.
Accessing data for 1000 times would typically require accessing 1000 disk
blocks, which may be slow.
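The figure of 1000 disk blocks can be checked with a little arithmetic on a contiguous, row-major layout. The sketch below assumes 4-byte values and 4096-byte disk blocks, neither of which is stated in the text, and picks an arbitrary lat-lon point.

```python
NTIME, NLAT, NLON = 1000, 1000, 1000
ELEM_SIZE = 4      # assumption: 4-byte values
BLOCK_SIZE = 4096  # assumption: 4 KiB disk blocks


def byte_offset(t, lat, lon):
    """Offset of one value when time varies most slowly (row-major order)."""
    return ((t * NLAT + lat) * NLON + lon) * ELEM_SIZE


# Distance in bytes between consecutive times at the same lat-lon point.
stride = byte_offset(1, 0, 0) - byte_offset(0, 0, 0)

# Distinct disk blocks touched when reading all times at one point.
blocks_touched = {byte_offset(t, 500, 500) // BLOCK_SIZE for t in range(NTIME)}

print("stride in bytes:", stride)
print("blocks touched:", len(blocks_touched))
```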
Reorganizing the data into chunks on disk that have all the time
in each chunk for a few lat and lon coordinates would greatly speed up such
access. To chunk the data in the input file slow.nc, a netCDF file of any
type, to the output file fast.nc, you could use:
nccopy -c time/1000,lat/40,lon/40 slow.nc fast.nc
to specify data chunks of 1000 times, 40 latitudes, and 40
longitudes. If you had enough memory to contain the output file, you could
speed up the rechunking operation significantly by creating the output in
memory before writing it to disk on close (using the -w flag):
nccopy -w -c time/1000,lat/40,lon/40 slow.nc fast.nc
Alternatively, one could write this using the variable-specific chunking
specification, assuming that time, lat, and lon are variables:
nccopy -c time:1000 -c lat:40 -c lon:40 slow.nc fast.nc
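For the per-dimension chunkspec 'time/1000,lat/40,lon/40' used above, the resulting chunk layout can be worked out as follows. The parser is a simplified sketch with a hypothetical name; it does not handle escaped blanks in dimension names, and the 4-byte value size is an assumption.

```python
import math


def parse_dim_chunkspec(spec, dim_lengths):
    """Parse 'dim/len,...'; a missing length means the full dimension length."""
    if spec == "/":
        return None  # the bare '/' requests no chunking at all
    chunks = {}
    for item in spec.split(","):
        name, sep, length = item.partition("/")
        if not sep or not name:
            raise ValueError("expected 'dimname/length' in %r" % item)
        chunks[name] = int(length) if length else dim_lengths[name]
    return chunks


dims = {"time": 1000, "lat": 1000, "lon": 1000}
chunks = parse_dim_chunkspec("time/1000,lat/40,lon/40", dims)

# Number of chunks along each dimension, and in total.
per_dim = {d: math.ceil(dims[d] / chunks[d]) for d in dims}
total_chunks = math.prod(per_dim.values())

# Assumption: 4-byte values.
chunk_bytes = math.prod(chunks.values()) * 4

print(chunks, per_dim, total_chunks, chunk_bytes)
```

With this layout, all 1000 times for a given lat-lon point fall in a single chunk, because the time dimension is covered by one chunk.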
The complete set of chunking rules is captured here. As a rough
summary, these rules preserve all chunking properties from the input file.
These rules apply only when the selected output format supports chunking,
i.e. for the netcdf-4 variants.
The variable-specific chunking specification translates directly to
the corresponding "nc_def_var_chunking" API
call.
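As an illustration of the variable-specific form, the sketch below collects repeated 'var:n1,n2,...' arguments into per-variable chunk sizes, following the rules given under -c: an empty size list disables chunking, and a repeated variable keeps only its first specification. The function name is hypothetical and this is not nccopy's parser.

```python
def parse_var_chunkspecs(specs):
    """Map each variable name to its chunk sizes, or None for no chunking."""
    result = {}
    for spec in specs:
        name, sep, sizes = spec.partition(":")
        if not sep or not name:
            raise ValueError("expected 'var:n1,n2,...' in %r" % spec)
        if name in result:
            continue  # second and later specifications are ignored
        # 'var:' with no sizes means chunking is disabled for that variable.
        result[name] = [int(n) for n in sizes.split(",")] if sizes else None
    return result


print(parse_var_chunkspecs(["time:1000", "lat:40", "lon:40"]))
# Each list would supply the chunk sizes passed to nc_def_var_chunking
# for that variable; None would correspond to contiguous storage.
print(parse_var_chunkspecs(["v:10,20", "v:5,5", "w:"]))
```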
The original per-dimension chunking specification requires some
interpretation by nccopy. The following rules are applied in the given order
independently for each variable to be copied from input to output. The rules
are written assuming we are trying to determine the chunking for a given
output variable Vout that comes from an input variable Vin.
- 1.
- For each dimension of Vout explicitly specified on the command line (using
the '-c' option), apply the chunking value for that dimension regardless
of input format or input properties.
- 2.
- For dimensions of Vout not named on the command line, preserve chunk sizes
from the corresponding input variable, if it is chunked.
- 3.
- If Vin is contiguous, and none of its dimensions are named on the command
line, and chunking is not mandated by other options, then make Vout be
contiguous.
- 4.
- If the input variable is contiguous (or is some netcdf-3 variant) and
there are no options requiring chunking, or the '/' special case for the
'-c' option is specified, then the output variable Vout is marked as
contiguous.
- 5.
- Final, default case: some or all chunk sizes are not determined by the
command line or the input variable. This includes the non-chunked input
cases such as netcdf-3, cdf5, and DAP. In these cases retain all chunk
sizes determined by previous rules, and use the full dimension size as the
default. The exception is unlimited dimensions, where the default is 4
megabytes.
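The five rules above can be read as the following decision procedure for a single variable. This is one possible interpretation written as a sketch, not nccopy's implementation. All names are hypothetical. Because the text gives the unlimited-dimension default as a size in bytes rather than a chunk length, the sketch takes that length as a caller-supplied placeholder.

```python
def resolve_chunking(dims, dim_lengths, named, vin_chunks=None,
                     unchunk_all=False, chunking_required=False,
                     unlimited=frozenset(), unlimited_chunk_len=1):
    """Return chunk lengths for Vout, or None if Vout is contiguous.

    dims: dimension names of the variable, in order
    named: chunk lengths given per dimension with -c
    vin_chunks: chunk lengths of Vin, or None if Vin is contiguous
    unlimited_chunk_len: placeholder value, not nccopy's actual default
    """
    if unchunk_all:
        return None  # rule 4: the bare '/' special case
    nothing_named = not any(d in named for d in dims)
    if vin_chunks is None and nothing_named and not chunking_required:
        return None  # rules 3 and 4: contiguous stays contiguous
    chunks = []
    for i, d in enumerate(dims):
        if d in named:
            chunks.append(named[d])             # rule 1
        elif vin_chunks is not None:
            chunks.append(vin_chunks[i])        # rule 2
        elif d in unlimited:
            chunks.append(unlimited_chunk_len)  # rule 5, unlimited case
        else:
            chunks.append(dim_lengths[d])       # rule 5, full dimension
    return chunks


dims = ("time", "lat", "lon")
lengths = {"time": 1000, "lat": 1000, "lon": 1000}
print(resolve_chunking(dims, lengths, {"lat": 40, "lon": 40}))
print(resolve_chunking(dims, lengths, {"time": 1000}, vin_chunks=[1, 100, 100]))
print(resolve_chunking(dims, lengths, {}))
```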