Chemistry::File - Molecule file I/O base class
# As a convenient interface for several mol readers:
use Chemistry::File qw(PDB MDLMol); # load PDB and MDL modules
# or try to use every file I/O module installed in the system:
use Chemistry::File ':auto';
my $mol1 = Chemistry::Mol->read("file.pdb");
my $mol2 = Chemistry::Mol->read("file.mol");
# as a base for a mol reader:
package Chemistry::File::Myfile;
use base qw(Chemistry::File);
use Chemistry::Mol;
Chemistry::Mol->register_format("myfile", __PACKAGE__);
# override the read_mol method
sub read_mol {
my ($self, $fh, %opts) = @_;
my $mol_class = $opts{mol_class} || "Chemistry::Mol";
my $mol = $mol_class->new;
# ... do some stuff with $fh and $mol ...
return $mol;
}
# override the write_mol method
sub write_mol {
my ($self, $fh, $mol, %opts) = @_;
print $fh $mol->name, "\n";
# ... do some stuff with $fh and $mol ...
}
The main use of this module is as a base class for other molecule
file I/O modules (for example, Chemistry::File::PDB). Such modules should
override and extend the Chemistry::File methods as needed. You only need to
care about the methods here if you are writing a file I/O module or if
you want a finer degree of control than what is offered by the simple read
and write methods in the Chemistry::Mol class.
From the user's point of view, this module can also be used as
shorthand for using several Chemistry::File modules at the same time.
use Chemistry::File qw(PDB MDLMol);
is exactly equivalent to
use Chemistry::File::PDB;
use Chemistry::File::MDLMol;
If you use the :auto keyword, Chemistry::File will autodetect and
load all the Chemistry::File::* modules installed in your system.
use Chemistry::File ':auto';
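How the detection is done is not described here. One plausible approach, shown purely as an assumption and not as the module's actual implementation, is to scan the library directories for files under Chemistry/File. The helper name find_file_modules is hypothetical.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of Chemistry::File): list the
# Chemistry::File::* module names found directly under the given
# library directories, which would typically be @INC.
sub find_file_modules {
    my (@dirs) = @_;
    my %seen;
    for my $dir (grep { !ref } @dirs) {    # @INC may contain code refs; skip them
        my $sub = File::Spec->catdir($dir, 'Chemistry', 'File');
        opendir my $dh, $sub or next;      # directory absent: nothing to find here
        for my $entry (readdir $dh) {
            $seen{"Chemistry::File::$1"} = 1 if $entry =~ /^(\w+)\.pm$/;
        }
        closedir $dh;
    }
    my @names = sort keys %seen;
    return @names;
}
```

Each name returned could then be loaded with require. A scan like this depends on the file system layout, which is one reason such a feature may not be fully portable.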
Before version 0.30, file I/O modules typically used only
parse_string, write_string, parse_file, and write_file, and they were
generally used as class methods. A file could contain one or more molecules
and only be read or written whole; reading it would return every molecule on
the file. This was problematic when dealing with large multi-molecule files
(such as SDF files), because all the molecules would have to be loaded into
memory at the same time.
While version 0.30 retains backward compatibility with that simple
model, it also provides a more flexible interface that supports reading one
molecule at a time, skipping molecules, and reading and writing file-level
information that is not associated with specific molecules. The following
diagram shows the global structure of a file according to the new model:
+-----------+
| header    |
+-----------+
| molecule  |
+-----------+
| molecule  |
+-----------+
| ...       |
+-----------+
| footer    |
+-----------+
In cases where the header and the footer are empty, the model
reduces to the pre-0.30 version. The low-level steps to read a file are the
following:
my $file = Chemistry::File::MyFormat->new(file => 'xyz.mol');
$file->open('<');
$file->read_header;
while (my $mol = $file->read_mol($file->fh, %opts)) {
    # do something with $mol...
}
$file->read_footer;
The "read" method does all the
above automatically, and it stores all the molecules read in the mols
property.
All the methods below include a list of options
%opts at the end of the parameter list. Each class
implementing this interface may have its own particular options. However,
the following options should be recognized by all classes:
- mol_class
- A class or object with a "new" method
that constructs a molecule. This is needed when the user wants to specify a
molecule subclass different from the default. When this option is not
defined, the module may use Chemistry::Mol or whichever class is
appropriate for that file format.
- format
- The name of the file format being used, as registered by
Chemistry::Mol->register_format.
- fatal
- If true, parsing errors should throw an exception; if false, the parser
should try to recover if possible. True by default.
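As a minimal sketch of how a parser might honor the mol_class and fatal options, consider the following. The helper parse_count_line and the stand-in class My::Mol are hypothetical and exist only to keep the example self-contained; a real module would default to Chemistry::Mol.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a molecule class; any class with a "new"
# method could be passed as mol_class instead.
package My::Mol;
sub new { my ($class) = @_; return bless { natoms => 0 }, $class }

package main;

# Hypothetical helper: parse an atom count from one line, honoring the
# mol_class and fatal options.
sub parse_count_line {
    my ($line, %opts) = @_;
    my $mol_class = $opts{mol_class} || 'My::Mol';
    # fatal is treated as true unless the caller passes a defined value
    my $fatal = defined $opts{fatal} ? $opts{fatal} : 1;

    unless ($line =~ /^\s*(\d+)\s*$/) {
        die "bad count line: '$line'\n" if $fatal;
        warn "bad count line: '$line'; skipping\n";
        return;    # recover by reporting "no molecule"
    }
    my $mol = $mol_class->new;
    $mol->{natoms} = $1;    # direct hash access only because My::Mol is a stand-in
    return $mol;
}
```

Calling parse_count_line("abc") dies, while parse_count_line("abc", fatal => 0) warns and returns nothing, leaving the caller to decide how to continue.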
The class methods in this class (or rather, its derived classes)
are usually not called directly. Instead, use Chemistry::Mol->read,
write, print, parse, and file. These methods also work if called as instance
methods.
- $class->parse_string($s, %options)
- Parse a string $s and return one or more molecule
objects. This is an abstract method, so it should be provided by all
derived classes.
- $class->write_string($mol, %options)
- Convert a molecule to a string. This is an abstract method, so it should
be provided by all derived classes.
- $class->parse_file($file, %options)
- Reads the file $file and returns one or more
molecules. The default method slurps the whole file and then calls
parse_string, but derived classes may choose to override it.
$file can be a filehandle, a filename, or a scalar
reference. See "new" for details.
- $class->write_file($mol, $file, %options)
- Writes a file $file containing the molecule
$mol. The default method calls write_string first
and then saves the string to a file, but derived classes may choose to
override it. $file can be either a filehandle or a
filename.
- $class->name_is($fname, %options)
- Returns true if a filename is of the format corresponding to the class. It
should look at the filename only, because it may be called with
non-existent files. It is used to determine with which format to save a
file. For example, Chemistry::File::PDB returns true if the filename ends
in .pdb.
- $class->string_is($s, %options)
- Examines the string $s and returns true if it has
the format of the class.
- $class->file_is($file, %options)
- Examines the file $file and returns true if it has
the format of the class. The default method slurps the whole file and then
calls string_is, but derived classes may choose to override it.
- $class->slurp
- Reads a file into a scalar. Automatic decompression of gzipped files is
supported if the Compress::Zlib module is installed. Files ending in .gz
are assumed to be compressed; otherwise it is possible to force
decompression by passing the gzip => 1 option (or no decompression with
gzip => 0).
- $class->new(file => $file, opts => \%opts)
- Create a new file object. This method is usually called indirectly via the
Chemistry::Mol->file method. $file may be a
scalar with a filename, an open filehandle, or a reference to a scalar. If
a reference to a scalar is used, the string contained in the scalar is
used as an in-memory file.
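The three kinds of $file argument accepted above (a filename, an open filehandle, or a scalar reference) can be turned into a read filehandle with core Perl alone. The following is a sketch under that assumption, not the module's own code; the helper name open_for_reading is hypothetical, and it does not handle compressed files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Chemistry::File): return a read
# filehandle for a filename, an open filehandle, or a scalar reference.
sub open_for_reading {
    my ($file) = @_;
    if (ref $file eq 'SCALAR') {
        # In-memory file: Perl can open a handle on a scalar reference
        open my $fh, '<', $file or die "cannot open in-memory file: $!\n";
        return $fh;
    }
    # Simplification: any other reference is assumed to be an open handle
    return $file if ref $file;
    open my $fh, '<', $file or die "cannot open '$file': $!\n";
    return $fh;
}
```

For example, open_for_reading(\$text) returns a handle that reads the contents of $text line by line, just as if it were a file on disk.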
Chemistry::File objects are derived from Chemistry::Obj and have
the same properties (name, id, and type), as well as the following ones:
- file
- The "file" as described above under
"new".
- fh
- The filehandle used for reading and writing molecules. It is opened by
"open".
- opts
- A hashref containing the options that are passed through to the old-style
class methods. The options are also passed to the instance methods to keep
a similar interface, although those methods could get them from
$self->opts anyway.
- mode
- '>' if the file is open for writing, '<' for reading, and false if
not open.
- mols
- "read" stores all the molecules that
were read in this property as an array reference.
"write" gets the molecules to write from
here.
These methods should be overridden, because they don't really do
much by default.
- $file->read_header
- Read whatever information is available in the file before the first
molecule. Does nothing by default.
- $file->read_footer
- Read whatever information is available in the file after the last
molecule. Does nothing by default.
- $self->slurp_mol($fh)
- Reads from the input stream until the end of the current molecule and
returns the "slurped" string. It does not parse the string. It
returns undef if there are no more molecules in the file. This method
should be overridden if needed; by default, it slurps until the end of the
file.
- $self->skip_mol($fh)
- Similar to slurp_mol, but it doesn't need to return anything except true
or false. It should also be overridden if needed; by default, it just
calls slurp_mol.
- $file->read_mol($fh, %opts)
- Read the next molecule in the input stream. It returns false if there are
no more molecules in the file. This method should be overridden by derived
classes; otherwise it will call slurp_mol and parse_string (for backwards
compatibility; it is recommended to override read_mol directly in new
modules).
Note: some old file I/O modules (written before the 0.30
interface) may return more than one molecule anyway, so it is
recommended to call read_mol in list context to be safe:
($mol) = $file->read_mol($fh, %opts);
- $file->write_header
- Write whatever information is needed before the first molecule. Does
nothing by default.
- $file->write_footer
- Write whatever information is needed after the last molecule. Does nothing
by default.
- $self->write_mol($fh, $mol, %opts)
- Write one molecule to $fh. By default and for
backward compatibility, it just calls
"write_string" and prints its return
value to $self->fh. New classes should override
it.
- $self->open($mode)
- Opens the file (held in $self->file) for
reading by default, or for writing if $mode eq
'>'. This method sets $self->fh
transparently regardless of whether $self->file
is a filename (compressed or not), a scalar reference, or a
filehandle.
- $self->close
- Close the file. For regular files this just closes the filehandle, but for
gzipped files it does some additional postprocessing. This method is
called automatically on object destruction, so it is not mandatory to call
it explicitly.
- $file->read
- Read the whole file. This calls open, read_header, read_mol until there
are no more molecules left, read_footer, and close. Returns a list of
molecules if called in list context, or the first molecule in scalar
context.
- $self->write
- Write all the molecules in $self->mols. It just
calls open, write_header, write_mol (once for each molecule), write_footer,
and close.
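To illustrate the kind of logic an overridden slurp_mol and skip_mol might contain, here is a sketch written as plain functions so that it runs on its own. It assumes each record ends with a line containing only $$$$, the terminator used in SDF files; the names slurp_record and skip_record are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical record reader: collect lines up to and including the
# terminator line and return them as one unparsed string.
sub slurp_record {
    my ($fh) = @_;
    my $s = '';
    while (defined(my $line = <$fh>)) {
        $s .= $line;
        return $s if $line =~ /^\$\$\$\$\s*$/;
    }
    # End of input: return a trailing unterminated record if any, else undef
    return length($s) ? $s : undef;
}

# Skipping only needs to report whether a record was there.
sub skip_record {
    my ($fh) = @_;
    return defined slurp_record($fh);
}
```

In a Chemistry::File subclass the same loop would live in a slurp_mol method taking ($self, $fh). Because nothing is parsed, skipping a record this way avoids building a molecule object for data the caller does not need.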
The :auto feature may not be entirely portable, but it is known to
work under Unix and Windows (either Cygwin or ActiveState).
<https://github.com/perlmol/Chemistry-Mol>
Ivan Tubert-Brohman <itub@cpan.org>
Copyright (c) 2005 Ivan Tubert-Brohman. All rights reserved. This
program is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.