Nextflow scripting
The Nextflow scripting language is an extension of the Groovy programming language. Groovy is a powerful programming language for the Java virtual machine. The Nextflow syntax has been specialized to ease the writing of computational pipelines in a declarative manner.
Nextflow can execute any piece of Groovy code or use any library for the JVM platform.
For a detailed description of the Groovy programming language, refer to the official Groovy documentation.
Below you can find a crash course in the most important language constructs used in the Nextflow scripting language.
Warning
Nextflow uses UTF-8 as the default character encoding for source files. Make sure to use UTF-8 encoding when editing Nextflow scripts with your preferred text editor.
Language basics
Hello world
Printing something is as easy as using one of the print or println methods:
println "Hello, World!"
The only difference between the two is that the println method implicitly appends a newline character to the printed string.
Variables
To define a variable, simply assign a value to it:
x = 1
println x
x = new java.util.Date()
println x
x = -3.1499392
println x
x = false
println x
x = "Hi"
println x
Lists
A List object can be defined by placing the list items in square brackets:
myList = [1776, -1, 33, 99, 0, 928734928763]
You can access a given item in the list with square-bracket notation (indexes start at 0):
println myList[0]
To get the length of the list, use the size method:
println myList.size()
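Beyond plain indexing, Groovy lists also accept negative indexes, ranges and the left-shift operator. The following is a minimal sketch of these standard Groovy features; the variable names lastItem and firstThree are only illustrative:

```groovy
myList = [1776, -1, 33, 99, 0, 928734928763]

// a negative index counts from the end of the list
lastItem = myList[-1]

// a range returns a sub-list
firstThree = myList[0..2]

// the left-shift operator appends an item in place
myList << 42

println lastItem
println firstThree
println myList.size()
```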
Learn more about lists:
Maps
Maps are used to store associative arrays or dictionaries. They are unordered collections of heterogeneous, named data:
scores = [ "Brett":100, "Pete":"Did not finish", "Andrew":86.87934 ]
Note that each of the values stored in the map can be of a different type: the value for Brett is an integer, the value for Pete is a string, and the value for Andrew is a floating-point number.
We can access the values in a map in two main ways:
println scores["Pete"]
println scores.Pete
To add data to or modify a map, the syntax is similar to adding values to a list:
scores["Pete"] = 3
scores["Cedric"] = 120
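As a small sketch of how these assignments behave, the snippet below rebuilds the map, applies both updates and then checks which keys are present. The key Maria is hypothetical and only used to show a missing entry:

```groovy
scores = ["Brett": 100, "Pete": "Did not finish", "Andrew": 86.87934]

scores["Pete"] = 3          // replaces the existing value
scores["Cedric"] = 120      // adds a new key-value pair

// containsKey tells a missing key apart from a key that holds a null value
hasCedric = scores.containsKey("Cedric")
hasMaria = scores.containsKey("Maria")

println scores.keySet()
println scores.size()
```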
Learn more about maps:
Multiple assignment
An array or a list object can be used to assign to multiple variables at once:
(a, b, c) = [10, 20, 'foo']
assert a == 10 && b == 20 && c == 'foo'
The three variables on the left of the assignment operator are initialized by the corresponding item in the list.
Read more about Multiple assignment in the Groovy documentation.
Conditional Execution
One of the most important features of any programming language is the ability to execute different code under different conditions. The simplest way to do this is to use the if construct:
x = Math.random()
if( x < 0.5 ) {
println "You lost."
}
else {
println "You won!"
}
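For a simple two-way choice, the same logic can also be written with the ternary operator, which this page uses later when reporting the result of mkdir and delete. A minimal sketch, with message as an illustrative variable name:

```groovy
x = Math.random()

// condition ? value-if-true : value-if-false
message = x < 0.5 ? "You lost." : "You won!"
println message
```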
Strings
Strings can be defined by enclosing text in single or double quotes (' or " characters):
println "he said 'cheese' once"
println 'he said "cheese!" again'
Strings can be concatenated with +:
a = "world"
print "hello " + a + "\n"
String interpolation
There is an important difference between single-quoted and double-quoted strings: double-quoted strings support variable interpolation, while single-quoted strings do not.
In practice, double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"
x = 'Hello'
println '$x + $y'
This code prints:
The quick brown fox
$x + $y
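To make the contrast concrete, here is a small sketch that evaluates the same text with both kinds of quotes. The values of x and y are arbitrary, and toString() is called only so that the result is a plain String:

```groovy
x = 3
y = 4

// double quotes: variables and expressions are evaluated
interpolated = "$x + $y = ${x + y}".toString()

// single quotes: the text is kept literally
literal = '$x + $y = ${x + y}'

println interpolated
println literal
```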
Multi-line strings
A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:
text = """
hello there James
how are you today?
"""
Note
Like before, multi-line strings inside double quotes support variable interpolation, while single-quoted multi-line strings do not.
As in Bash/shell scripts, terminating a line in a multi-line string with a \ character prevents a newline character from separating that line from the one that follows:
myLongCmdline = """
blastp \
-in $input_query \
-out $output_file \
-db $blast_database \
-html
"""
result = myLongCmdline.execute().text
In the preceding example, blastp and its -in, -out, -db and -html switches and their arguments are effectively a single line.
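The effect of the trailing backslash can be checked without running any external tool. The sketch below uses a hypothetical command and file name and only inspects the resulting string:

```groovy
// hypothetical command and file name, used only for illustration
input = 'reads.fa'

// the backslash after the opening quotes also suppresses the first newline
cmd = """\
wc \
-l \
$input
""".toString()

println cmd
println cmd.readLines().size()
```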
Implicit variables
Script implicit variables
The following variables are implicitly defined in the script global execution scope:
Name |
Description |
---|---|
|
The directory where the main workflow script is located (deprecated in favour of |
|
The directory where the workflow is run (requires version |
|
The directory where a module script is located for DSL2 modules or the same as |
|
Dictionary like object representing nextflow runtime information (see Nextflow metadata). |
|
Dictionary like object holding workflow parameters specified in the config file or as command line options. |
|
The directory where the main script is located (requires version |
|
The directory where tasks temporary files are created. |
|
Dictionary like object representing workflow runtime information (see Runtime metadata). |
Configuration implicit variables
The following variables are implicitly defined in the Nextflow configuration file:
Name |
Description |
---|---|
|
The directory where the main workflow script is located (deprecated in favour of |
|
The directory where the workflow is run (requires version |
|
The directory where the main script is located (requires version |
Process implicit variables
The following variables are implicitly defined in the task
object of each process:
Name |
Description |
---|---|
|
The current task attempt |
|
The task unique hash Id. NOTE: This is only available for processes that run native code via |
|
The task index (corresponds to |
|
The current task name. NOTE: This is only available for processes that run native code via |
|
The current process name |
|
The task unique directory. NOTE: This is only available for processes that run native code via |
The task object also contains the values of all process directives for the given task, which allows you to access these settings at runtime. For example:
process foo {
script:
"""
some_tool --cpus $task.cpus --mem $task.memory
"""
}
In the above snippet, task.cpus reports the value of the cpus directive and task.memory reports the current value of the memory directive, depending on the actual settings given in the workflow configuration file.
See Process directives for details.
Closures
Briefly, a closure is a block of code that can be passed as an argument to a function. Thus, you can define a chunk of code and then pass it around as if it were a string or an integer.
More formally, you can create functions that are defined as first class objects.
square = { it * it }
The curly brackets around the expression it * it tell the script interpreter to treat this expression as code.
The it identifier is an implicit variable that represents the value that is passed to the function when it is invoked.
Once compiled, the function object is assigned to the variable square, just like the variable assignments shown previously.
Now we can do something like this:
println square(9)
and get the value 81.
This is not very interesting until we find that we can pass the function square as an argument to other functions or methods.
Some built-in functions take a function like this as an argument. One example is the collect method on lists:
[ 1, 2, 3, 4 ].collect(square)
This expression says: create a list with the values 1, 2, 3 and 4, then call its collect method, passing in the closure we defined above. The collect method runs through each item in the list, calls the closure on the item, then puts the result in a new list, resulting in:
[ 1, 4, 9, 16 ]
For more methods that you can call with closures as arguments, see the Groovy GDK documentation.
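As a brief sketch of a few other list methods from the Groovy GDK that accept closures, the snippet below uses findAll, any and sum. The variable names are illustrative, and square is redefined so that the snippet is self-contained:

```groovy
square = { it * it }
numbers = [1, 2, 3, 4]

// findAll keeps the items for which the closure returns true
evens = numbers.findAll { it % 2 == 0 }

// any returns true as soon as one item satisfies the closure
hasLarge = numbers.any { it > 3 }

// sum applies the closure to each item and adds up the results
sumOfSquares = numbers.sum(square)

println evens
println hasLarge
println sumOfSquares
```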
By default, closures take a single parameter called it, but you can also create closures with multiple, custom-named parameters.
For example, the method Map.each() can take a closure with two arguments, to which it binds the key and the associated value for each key-value pair in the Map. Here, we use the obvious variable names key and value in our closure:
printMapClosure = { key, value ->
println "$key = $value"
}
[ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ].each(printMapClosure)
Prints:
Yue = Wu
Mark = Williams
Sudha = Kumari
A closure has two other important features. First, it can access variables in the scope where it is defined, so that it can interact with them.
Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the place where it needs to be used.
As an example showing both these features, see the following code fragment:
myMap = ["China": 1 , "India" : 2, "USA" : 3]
result = 0
myMap.keySet().each( { result+= myMap[it] } )
println result
Learn more about closures in the Groovy documentation
Regular expressions
Regular expressions are the Swiss Army knife of text processing. They provide the programmer with the ability to match and extract patterns from strings.
Regular expressions are available via the ~/pattern/ syntax and the =~ and ==~ operators.
Use =~ to check whether a given pattern occurs anywhere in a string:
assert 'foo' =~ /foo/ // return TRUE
assert 'foobar' =~ /foo/ // return TRUE
Use ==~ to check whether a string matches a given regular expression pattern exactly:
assert 'foo' ==~ /foo/ // evaluates to true
assert !('foobar' ==~ /foo/) // the exact match fails, so the negation holds
It is worth noting that the ~ operator creates a Java Pattern object from the given string, while the =~ operator creates a Java Matcher object.
x = ~/abc/
println x.class
// prints java.util.regex.Pattern
y = 'some string' =~ /abc/
println y.class
// prints java.util.regex.Matcher
Regular expression support is imported from Java. Java’s regular expression language and API are documented in the Pattern Java documentation.
You may also be interested in this post: Groovy: Don’t Fear the RegExp.
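When only the matched text is needed, the Groovy string methods find and findAll can be more direct than handling a Matcher yourself. A minimal sketch, using a hypothetical sample name:

```groovy
// hypothetical sample name, used only for illustration
name = 'sample_42_rep_7'

// find returns the first match, or null when there is none
first = name.find(/\d+/)

// findAll returns every match as a list of strings
all = name.findAll(/\d+/)

// a Matcher used as a condition is true when the pattern occurs in the string
found = (name =~ /rep_\d+/) ? true : false

println first
println all
println found
```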
String replacement
To replace pattern occurrences in a given string, use the replaceFirst and replaceAll methods:
x = "colour".replaceFirst(/ou/, "o")
println x
// prints: color
y = "cheesecheese".replaceAll(/cheese/, "nice")
println y
// prints: nicenice
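The replacement can also refer to captured groups. The sketch below shows two standard ways to do this, a $1 back-reference and a closure; the file name is hypothetical:

```groovy
// hypothetical file name, used only for illustration
fileName = 'sample_R1.fastq'

// $1 refers to the first capturing group; single quotes stop Groovy from interpolating it
renamed = fileName.replaceAll(/_R(\d)/, '_read$1')

// a closure receives the full match followed by each captured group
paired = fileName.replaceAll(/_R(\d)/) { full, digit -> "_pair${digit}" }

println renamed
println paired
```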
Capturing groups
You can match a pattern that includes groups. First create a matcher object with the =~ operator.
Then, you can index the matcher object to find the matches: matcher[0] returns a list representing the first match of the regular expression in the string. The first list element is the string that matches the entire regular expression, and the remaining elements are the strings that match each group.
Here’s how it works:
programVersion = '2.7.3-beta'
m = programVersion =~ /(\d+)\.(\d+)\.(\d+)-?(.+)/
assert m[0] == ['2.7.3-beta', '2', '7', '3', 'beta']
assert m[0][1] == '2'
assert m[0][2] == '7'
assert m[0][3] == '3'
assert m[0][4] == 'beta'
Applying some syntactic sugar, you can do the same in just one line of code:
programVersion = '2.7.3-beta'
(full, major, minor, patch, flavor) = (programVersion =~ /(\d+)\.(\d+)\.(\d+)-?(.+)/)[0]
println full // 2.7.3-beta
println major // 2
println minor // 7
println patch // 3
println flavor // beta
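Indexing a matcher that found nothing raises an IndexOutOfBoundsException, so it can help to test the matcher first. The helper below is a hypothetical sketch, not part of Nextflow, that returns the three numeric components or null:

```groovy
// hypothetical helper, not part of Nextflow
def parseVersion(String text) {
    def m = text =~ /(\d+)\.(\d+)\.(\d+)/
    // a Matcher evaluates to false when the pattern is not found
    if( !m ) return null
    // m[0] is [fullMatch, group1, group2, group3]; keep the groups as integers
    return m[0][1..3].collect { it.toInteger() }
}

println parseVersion('2.7.3-beta')
println parseVersion('unknown')
```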
Removing part of a string
You can remove part of a String value using a regular expression pattern. The first match found is replaced with an empty String:
// define the regexp pattern
wordStartsWithGr = ~/(?i)\s+Gr\w+/
// apply and verify the result
assert ('Hello Groovy world!' - wordStartsWithGr) == 'Hello world!'
assert ('Hi Grails users' - wordStartsWithGr) == 'Hi users'
Remove the first 5-character word from a string:
// only the word is removed, so the two spaces around it remain
assert ('Remove first match of 5 letter word' - ~/\b\w{5}\b/) == 'Remove  match of 5 letter word'
Remove the first number with its trailing whitespace from a string:
assert ('Line contains 20 characters' - ~/\d+\s+/) == 'Line contains characters'
Files and I/O
Opening files
To access and work with files, use the file method, which returns a file system object given a file path string:
myFile = file('some/path/to/my_file.file')
The file method can reference either files or directories, depending on what the string path refers to in the file system.
When using the wildcard characters *, ?, [] and {}, the argument is interpreted as a glob path matcher and the file method returns a list object holding the paths of files whose names match the specified pattern, or an empty list if no match is found:
listOfFiles = file('some/path/*.fa')
Note
Two asterisks (**) in a glob pattern work like * but also search through subdirectories.
By default, wildcard characters do not match directories or hidden files. For example, if you want to include hidden files in the result list, add the optional parameter hidden:
listWithHidden = file('some/path/*.fa', hidden: true)
Here are the options available for the file method:
Name |
Description |
---|---|
glob |
When |
type |
Type of paths returned, either |
hidden |
When |
maxDepth |
Maximum number of directory levels to visit (default: no limit) |
followLinks |
When |
checkIfExists |
When |
Note
Nextflow also provides a files() method, which is identical to file() except that it always returns a list, whereas file() only returns a list if it matches multiple files.
Tip
If you are a Java geek, you might be interested to know that the file method returns a Path object, which allows you to use the same methods you would use in a Java program.
See also: Channel.fromPath.
Basic read/write
Given a file variable, declared using the file method as shown in the previous example, reading a file is as easy as getting the value of the file’s text property, which returns the file content as a string value:
print myFile.text
Similarly, you can save a string value to a file by simply assigning it to the file’s text property:
myFile.text = 'Hello world!'
Note
The above assignment overwrites any existing file contents, and implicitly creates the file if it doesn’t exist.
To append a string value to a file without erasing existing content, you can use the append method:
myFile.append('Add this line\n')
Or use the left shift operator, a more idiomatic way to append text content to a file:
myFile << 'Add a line more\n'
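The three write styles can be combined and verified by reading the file back. This is a minimal sketch that assumes a temporary java.io.File stands in for the object returned by the file method, so that the snippet also runs as a plain Groovy script:

```groovy
// assumption: a temporary java.io.File replaces file('...') for this demo
notes = File.createTempFile('notes', '.txt')
notes.deleteOnExit()

notes.text = 'Hello world!\n'       // overwrites any existing content
notes.append('Add this line\n')     // appends without erasing
notes << 'Add a line more\n'        // left shift also appends

println notes.text
```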
Binary data can be managed in the same way, just using the file property bytes instead of text. Thus, the following example reads the file and returns its content as a byte array:
binaryContent = myFile.bytes
Or you can save a byte array data buffer to a file, by simply writing:
myFile.bytes = binaryBuffer
Warning
The above methods read and write the entire file contents at once, in a single variable or buffer. For this reason, when dealing with large files it is recommended that you use a more memory efficient approach, such as reading/writing a file line by line or using a fixed size buffer.
Read a file line by line
To read a text file line by line, you can use the method readLines() provided by the file object, which returns the file content as a list of strings:
myFile = file('some/my_file.txt')
allLines = myFile.readLines()
for( line : allLines ) {
println line
}
This can also be written in a more idiomatic syntax:
file('some/my_file.txt')
.readLines()
.each { println it }
Warning
The method readLines() reads the entire file at once and returns a list containing all the lines. For this reason, do not use it to read big files.
To process a big file, use the method eachLine, which reads only a single line at a time into memory:
count = 0
myFile.eachLine { str ->
println "line ${count++}: $str"
}
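Groovy's eachLine can also pass the line number to the closure as a second parameter, starting at 1, which avoids a separate counter. The sketch below assumes a temporary java.io.File in place of the object returned by the file method, and counts the non-empty lines:

```groovy
// assumption: a temporary java.io.File replaces file('...') for this demo
sample = File.createTempFile('sample', '.txt')
sample.deleteOnExit()
sample.text = 'alpha\n\nbeta\n'

nonEmpty = 0
lastNumber = 0
sample.eachLine { line, number ->
    // an empty string is false in a Groovy condition
    if( line ) nonEmpty++
    lastNumber = number
}

println "non-empty lines: $nonEmpty of $lastNumber"
```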
Advanced file reading operations
The classes Reader and InputStream provide fine control for reading text and binary files, respectively.
The method newReader creates a Reader object for the given file that allows you to read the content as single characters, lines or arrays of characters:
myReader = myFile.newReader()
String line
// compare against null so that empty lines do not end the loop early
while( (line = myReader.readLine()) != null ) {
    println line
}
myReader.close()
The method withReader works similarly, but automatically calls the close method for you when you have finished processing the file. So, the previous example can be written more simply as:
myFile.withReader {
    String line
    // compare against null so that empty lines do not end the loop early
    while( (line = it.readLine()) != null ) {
        println line
    }
}
The methods newInputStream and withInputStream work similarly. The main difference is that they create an InputStream object useful for reading binary data.
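As a sketch of the fixed-size buffer approach for binary data, the snippet below reads a file in 4096-byte chunks and adds up the bytes seen. It assumes a temporary java.io.File filled with zero bytes in place of a real input file:

```groovy
// assumption: a temporary java.io.File with dummy content replaces file('...')
blob = File.createTempFile('blob', '.bin')
blob.deleteOnExit()
blob.bytes = new byte[10000]        // 10000 zero bytes as sample data

total = 0
blob.withInputStream { stream ->
    byte[] buffer = new byte[4096]
    int n
    // read fills the buffer and returns -1 at the end of the stream
    while( (n = stream.read(buffer)) != -1 ) {
        total += n
    }
}

println total
```

Only one buffer's worth of data is held in memory at a time, whatever the file size.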
Here are the most important methods for reading from files:
Name | Description |
---|---|
getText | Returns the file content as a string value |
getBytes | Returns the file content as byte array |
readLines | Reads the file line by line and returns the content as a list of strings |
eachLine | Iterates over the file line by line, applying the specified closure |
eachByte | Iterates over the file byte by byte, applying the specified closure |
withReader | Opens a file for reading and lets you access it with a Reader object |
withInputStream | Opens a file for reading and lets you access it with an InputStream object |
newReader | Returns a Reader object to read a text file |
newInputStream | Returns an InputStream object to read a binary file |
Read the Java documentation for Reader and InputStream classes to learn more about methods available for reading data from files.
Advanced file writing operations
The Writer and OutputStream classes provide fine control for writing text and binary files, respectively, including low-level operations for single characters or bytes, and support for big files.
For example, given two file objects sourceFile and targetFile, the following code copies the first file’s content into the second file, replacing all U characters with X:
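sourceFile.withReader { source ->
    targetFile.withWriter { target ->
        String line
        // compare against null so that empty lines do not end the loop early
        while( (line = source.readLine()) != null ) {
            // readLine strips the line terminator, so write it back explicitly
            target << line.replaceAll('U','X') << '\n'
        }
    }
}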
Here are the most important methods for writing to files:
Name | Description |
---|---|
setText | Writes a string value to a file |
setBytes | Writes a byte array to a file |
write | Writes a string to a file, replacing any existing content |
append | Appends a string value to a file without replacing existing content |
newWriter | Creates a Writer object that allows you to save text data to a file |
newPrintWriter | Creates a PrintWriter object that allows you to write formatted text to a file |
newOutputStream | Creates an OutputStream object that allows you to write binary data to a file |
withWriter | Applies the specified closure to a Writer object, closing it when finished |
withPrintWriter | Applies the specified closure to a PrintWriter object, closing it when finished |
withOutputStream | Applies the specified closure to an OutputStream object, closing it when finished |
Read the Java documentation for the Writer, PrintWriter and OutputStream classes to learn more about methods available for writing data to files.
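As one example from the table, withPrintWriter is convenient for writing rows of text. The sketch below writes a small tab-separated table; the rows are made-up sample data and a temporary java.io.File is assumed in place of the object returned by the file method:

```groovy
// assumption: a temporary java.io.File replaces file('...') for this demo
table = File.createTempFile('table', '.tsv')
table.deleteOnExit()

// made-up sample data
rows = [['sample', 'reads'], ['A', 120], ['B', 95]]

table.withPrintWriter { writer ->
    // println adds the line terminator; the writer is closed automatically
    rows.each { row -> writer.println(row.join('\t')) }
}

println table.text
```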
List directory content
Let’s assume that you need to walk through a directory of your choice. You can define the myDir variable that points to it:
myDir = file('any/path')
The simplest way to get a directory list is by using the methods list or listFiles, which return a collection of first-level elements (files and directories) of a directory:
allFiles = myDir.list()
for( def file : allFiles ) {
println file
}
Note
The only difference between list and listFiles is that the former returns a list of strings, and the latter returns a list of file objects that allow you to access file metadata (size, last modified time, etc).
The eachFile method allows you to iterate through the first-level elements only (just like listFiles). As with other each- methods, eachFile takes a closure as a parameter:
myDir.eachFile { item ->
if( item.isFile() ) {
println "${item.getName()} - size: ${item.size()}"
}
else if( item.isDirectory() ) {
println "${item.getName()} - DIR"
}
}
Several variants of the above method are available. See the table below for a complete list.
Name | Description |
---|---|
eachFile | Iterates through first-level elements (files and directories) |
eachDir | Iterates through first-level directories only |
eachFileMatch | Iterates through files and dirs whose names match the given filter |
eachDirMatch | Iterates through directories whose names match the given filter |
eachFileRecurse | Iterates through directory elements depth-first |
eachDirRecurse | Iterates through directories depth-first (regular files are ignored) |
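To illustrate the recursive variant, the sketch below builds a small temporary directory tree and collects the names of all regular files with eachFileRecurse. The directory layout is made up, and a java.io.File is assumed in place of the object returned by the file method:

```groovy
// assumption: a temporary directory replaces file('any/path') for this demo
root = java.nio.file.Files.createTempDirectory('demo').toFile()
new File(root, 'sub').mkdirs()
new File(root, 'a.txt').text = 'a'
new File(root, 'sub/b.txt').text = 'b'

names = []
root.eachFileRecurse { item ->
    // directories are visited too, so keep regular files only
    if( item.isFile() ) names << item.name
}
names = names.sort()
println names

// clean up the temporary tree
root.deleteDir()
```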
See also: Channel fromPath method.
Create directories
Given a file variable representing a nonexistent directory, like the following:
myDir = file('any/path')
the method mkdir creates a directory at the given path, returning true if the directory is created successfully, and false otherwise:
result = myDir.mkdir()
println result ? "OK" : "Cannot create directory: $myDir"
Note
If the parent directories do not exist, the above method will fail and return false.
The method mkdirs creates the directory named by the file object, including any nonexistent parent directories:
myDir.mkdirs()
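The difference between the two methods can be observed directly. This sketch assumes a temporary java.io.File directory as the starting point, so it runs as a plain Groovy script; the nested path a/b/c is arbitrary:

```groovy
// assumption: a temporary directory is used as a scratch area
base = java.nio.file.Files.createTempDirectory('mk').toFile()
nested = new File(base, 'a/b/c')

// mkdir fails because the parent directories a/b do not exist yet
createdWithMkdir = nested.mkdir()

// mkdirs creates the missing parents as well
createdWithMkdirs = nested.mkdirs()
isDir = nested.isDirectory()

println "mkdir: $createdWithMkdir, mkdirs: $createdWithMkdirs"

// clean up the temporary tree
base.deleteDir()
```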
Create links
Given a file, the method mklink creates a file system link for that file using the path specified as a parameter:
myFile = file('/some/path/file.txt')
myFile.mklink('/user/name/link-to-file.txt')
Table of optional parameters:
Name |
Description |
---|---|
hard |
When |
overwrite |
When |
Copy files
The method copyTo copies a file into a new file or into a directory, or copies a directory to a new directory:
myFile.copyTo('new_name.txt')
Note
If the target file already exists, it will be replaced by the new one. Note also that, if the target is a directory, the source file will be copied into that directory, maintaining the file’s original name.
When the source file is a directory, all its content is copied to the target directory:
myDir = file('/some/path')
myDir.copyTo('/some/new/path')
If the target path does not exist, it will be created automatically.
Note
The copyTo method mimics the semantics of the Linux command cp -r <source> <target>, with the following caveat: while Linux tools often treat paths ending with a slash (e.g. /some/path/name/) as directories, and those not (e.g. /some/path/name) as regular files, Nextflow (due to its use of the Java files API) views both these paths as the same file system object. If the path exists, it is handled according to its actual type (i.e. as a regular file or as a directory). If the path does not exist, it is treated as a regular file, with any missing parent directories created automatically.
Move files
You can move a file by using the method moveTo:
myFile = file('/some/path/file.txt')
myFile.moveTo('/another/path/new_file.txt')
Note
When a file with the same name as the target already exists, it will be replaced by the source. Note also that, when the target is a directory, the file will be moved to (or within) that directory, maintaining the file’s original name.
When the source is a directory, all the directory content is moved to the target directory:
myDir = file('/any/dir_a')
myDir.moveTo('/any/dir_b')
Please note that the result of the above example depends on the existence of the target directory. If the target directory exists, the source is moved into the target directory, resulting in the path:
/any/dir_b/dir_a
If the target directory does not exist, the source is just renamed to the target name, resulting in the path:
/any/dir_b
Note
The moveTo method mimics the semantics of the Linux command mv <source> <target>, with the same caveat as that given above for copyTo.
Rename files
You can rename a file or directory by using the renameTo file method:
myFile = file('my_file.txt')
myFile.renameTo('new_file_name.txt')
Delete files
The file method delete deletes the file or directory at the given path, returning true if the operation succeeds, and false otherwise:
myFile = file('some/file.txt')
result = myFile.delete()
println result ? "OK" : "Cannot delete: $myFile"
Note
This method deletes a directory only if it does not contain any files or sub-directories. To delete a directory and all its contents (i.e. removing all the files and sub-directories it may contain), use the method deleteDir.
Check file attributes
The following methods can be used on a file variable created by using the file method:
Name |
Description |
---|---|
getName |
Gets the file name e.g. |
getBaseName |
Gets the file name without its extension e.g. |
getSimpleName |
Gets the file name without any extension e.g. |
getExtension |
Gets the file extension e.g. |
getParent |
Gets the file parent path e.g. |
size |
Gets the file size in bytes |
exists |
Returns |
isEmpty |
Returns |
isFile |
Returns |
isDirectory |
Returns |
isHidden |
Returns |
lastModified |
Returns the file last modified timestamp i.e. a long as Linux epoch time |
For example, the following line prints a file name and size:
println "File ${myFile.getName()} size: ${myFile.size()}"
Tip
The invocation of any method name starting with the get prefix can be shortcut by omitting the get prefix and () parentheses. Therefore, writing myFile.getName() is exactly the same as myFile.name and myFile.getBaseName() is the same as myFile.baseName and so on.
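The shortcut is a general Groovy property-access rule, so it can be checked on any object with getter methods. A minimal sketch, assuming a plain java.io.File with a hypothetical path (the file does not need to exist):

```groovy
// hypothetical path; no file is read or created
sample = new File('/some/path/sample.fastq.gz')

// property syntax calls the matching getter behind the scenes
sameName = (sample.getName() == sample.name)
sameParent = (sample.getParent() == sample.parent)

println sample.name
println sameName && sameParent
```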
Get and modify file permissions
Given a file variable representing a file (or directory), the method getPermissions returns a 9-character string representing the file’s permissions using the Linux symbolic notation, e.g. rw-rw-r--:
permissions = myFile.getPermissions()
Similarly, the method setPermissions sets the file’s permissions using the same notation:
myFile.setPermissions('rwxr-xr-x')
A second version of the setPermissions method sets a file’s permissions given three digits representing, respectively, the owner, group and other permissions:
myFile.setPermissions(7,5,5)
Learn more about File permissions numeric notation.
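The relation between the two notations is simple arithmetic: within each group of three characters, r counts 4, w counts 2 and x counts 1. The helper below is a hypothetical sketch of that conversion in plain Groovy; it is not a Nextflow function:

```groovy
// hypothetical helper, not part of Nextflow
def toDigits(String perms) {
    // split the 9 characters into owner, group and other triples
    perms.toList().collate(3).collect { triple ->
        (triple[0] == 'r' ? 4 : 0) +
        (triple[1] == 'w' ? 2 : 0) +
        (triple[2] == 'x' ? 1 : 0)
    }
}

println toDigits('rwxr-xr-x')
println toDigits('rw-rw-r--')
```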
HTTP/FTP files
Nextflow provides transparent integration of HTTP/S and FTP protocols for handling remote resources as local file system objects. Simply specify the resource URL as the argument of the file method:
pdb = file('http://files.rcsb.org/header/5FID.pdb')
Then, you can access it as a local file as described in the previous sections:
println pdb.text
The above one-liner prints the content of the remote PDB file. Previous sections provide code examples showing how to stream or copy the content of files.
Note
Write and list operations are not supported for HTTP/S and FTP files.
Counting records
countLines
The countLines method counts the lines in a text file.
def sample = file('/data/sample.txt')
println sample.countLines()
Files whose name ends with the .gz suffix are expected to be GZIP compressed and are automatically decompressed.
countFasta
The countFasta method counts the number of records in a FASTA formatted file.
def sample = file('/data/sample.fasta')
println sample.countFasta()
Files whose name ends with the .gz suffix are expected to be GZIP compressed and are automatically decompressed.
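For intuition about what a FASTA record count means, each record starts with a header line beginning with the > character. The sketch below counts such lines in plain Groovy over made-up data; it is only an illustration and makes no claim about how countFasta is implemented:

```groovy
// assumption: a temporary java.io.File with made-up sequences
fasta = File.createTempFile('seqs', '.fasta')
fasta.deleteOnExit()
fasta.text = '>seq1\nACGT\nACGT\n>seq2\nGGCC\n'

records = 0
fasta.eachLine { line ->
    // every FASTA record begins with a '>' header line
    if( line.startsWith('>') ) records++
}

println records
```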
countFastq
The countFastq method counts the number of records in a FASTQ formatted file.
def sample = file('/data/sample.fastq')
println sample.countFastq()
Files whose name ends with the .gz suffix are expected to be GZIP compressed and are automatically decompressed.