The Internal Structure of Python Eggs#
STOP! This is not the first document you should read!
Eggs and their Formats#
A “Python egg” is a logical structure embodying the release of a specific version of a Python project, comprising its code, resources, and metadata. There are multiple formats that can be used to physically encode a Python egg, and others can be developed. However, a key principle of Python eggs is that they should be discoverable and importable. That is, it should be possible for a Python application to easily and efficiently find out what eggs are present on a system, and to ensure that the desired eggs’ contents are importable.
There are two basic formats currently implemented for Python eggs:
.egg
format: a directory or zipfile containing the project’s code and resources, along with anEGG-INFO
subdirectory that contains the project’s metadata.egg-info
format: a file or directory placed adjacent to the project’s code and resources, that directly contains the project’s metadata.
Both formats can include arbitrary Python code and resources, including static data files, package and non-package directories, Python modules, C extension modules, and so on. But each format is optimized for different purposes.
The .egg
format is well-suited to distribution and the easy
uninstallation or upgrades of code, since the project is essentially
self-contained within a single directory or file, unmingled with any
other projects’ code or resources. It also makes it possible to have
multiple versions of a project simultaneously installed, such that
individual programs can select the versions they wish to use.
The .egg-info
format, on the other hand, was created to support
backward-compatibility, performance, and ease of installation for system
packaging tools that expect to install all projects’ code and resources
to a single directory (e.g. site-packages
). Placing the metadata
in that same directory simplifies the installation process, since it
isn’t necessary to create .pth
files or otherwise modify
sys.path
to include each installed egg.
Its disadvantage, however, is that it provides no support for clean
uninstallation or upgrades, and of course only a single version of a
project can be installed to a given directory. Thus, support from a
package management tool is required. (This is why setuptools’ “install”
command refers to this type of egg installation as “single-version,
externally managed”.) Also, they lack sufficient data to allow them to
be copied from their installation source. easy_install can “ship” an
application by copying .egg
files or directories to a target
location, but it cannot do this for .egg-info
installs, because
there is no way to tell what code and resources belong to a particular
egg – there may be several eggs “scrambled” together in a single
installation location, and the .egg-info
format does not currently
include a way to list the files that were installed. (This may change
in a future version.)
Code and Resources#
The layout of the code and resources is dictated by Python’s normal import layout, relative to the egg’s “base location”.
For the .egg
format, the base location is the .egg
itself. That
is, adding the .egg
filename or directory name to sys.path
makes its contents importable.
For the .egg-info
format, however, the base location is the
directory that contains the .egg-info
, and thus it is the
directory that must be added to sys.path
to make the egg importable.
(Note that this means that the “normal” installation of a package to a
sys.path
directory is sufficient to make it an “egg” if it has an
.egg-info
file or directory installed alongside of it.)
Project Metadata#
If eggs contained only code and resources, there would of course be
no difference between them and any other directory or zip file on
sys.path
. Thus, metadata must also be included, using a metadata
file or directory.
For the .egg
format, the metadata is placed in an EGG-INFO
subdirectory, directly within the .egg
file or directory. For the
.egg-info
format, metadata is stored directly within the
.egg-info
directory itself.
The minimum project metadata that all eggs must have is a standard
Python PKG-INFO
file, named PKG-INFO
and placed within the
metadata directory appropriate to the format. Because it’s possible for
this to be the only metadata file included, .egg-info
format eggs
are not required to be a directory; they can just be a .egg-info
file that directly contains the PKG-INFO
metadata. This eliminates
the need to create a directory just to store one file. This option is
not available for .egg
formats, since setuptools always includes
other metadata. (In fact, setuptools itself never generates
.egg-info
files, either; the support for using files was added so
that the requirement could easily be satisfied by other tools, such
as distutils).
In addition to the PKG-INFO
file, an egg’s metadata directory may
also include files and directories representing various forms of
optional standard metadata (see the section on Standard Metadata,
below) or user-defined metadata required by the project. For example,
some projects may define a metadata format to describe their application
plugins, and metadata in this format would then be included by plugin
creators in their projects’ metadata directories.
Filename-Embedded Metadata#
To allow introspection of installed projects and runtime resolution of inter-project dependencies, a certain amount of information is embedded in egg filenames. At a minimum, this includes the project name, and ideally will also include the project version number. Optionally, it can also include the target Python version and required runtime platform if platform-specific C code is included. The syntax of an egg filename is as follows:
name ["-" version ["-py" pyver ["-" required_platform]]] "." ext
The “name” and “version” should be escaped using the to_filename()
function provided by pkg_resources
, after first processing them with
safe_name()
and safe_version()
respectively. These latter two
functions can also be used to later “unescape” these parts of the
filename. (For a detailed description of these transformations, please
see the “Parsing Utilities” section of the pkg_resources
manual.)
The “pyver” string is the Python major version, as found in the first
3 characters of sys.version
. “required_platform” is essentially
a distutils get_platform()
string, but with enhancements to properly
distinguish Mac OS versions. (See the get_build_platform()
documentation in the “Platform Utilities” section of the
pkg_resources
manual for more details.)
Finally, the “ext” is either .egg
or .egg-info
, as appropriate
for the egg’s format.
Normally, an egg’s filename should include at least the project name and version, as this allows the runtime system to find desired project versions without having to read the egg’s PKG-INFO to determine its version number.
Setuptools, however, only includes the version number in the filename
when an .egg
file is built using the bdist_egg
command, or when
an .egg-info
directory is being installed by the
install_egg_info
command. When generating metadata for use with the
original source tree, it only includes the project name, so that the
directory will not have to be renamed each time the project’s version
changes.
This is especially important when version numbers change frequently, and the source metadata directory is kept under version control with the rest of the project. (As would be the case when the project’s source includes project-defined metadata that is not generated from by setuptools from data in the setup script.)
Egg Links#
In addition to the .egg
and .egg-info
formats, there is a third
egg-related extension that you may encounter on occasion: .egg-link
files.
These files are not eggs, strictly speaking. They simply provide a way
to reference an egg that is not physically installed in the desired
location. They exist primarily as a cross-platform alternative to
symbolic links, to support “installing” code that is being developed in
a different location than the desired installation location. For
example, if a user is developing an application plugin in their home
directory, but the plugin needs to be “installed” in an application
plugin directory, running “setup.py develop -md /path/to/app/plugins”
will install an .egg-link
file in /path/to/app/plugins
, that
tells the egg runtime system where to find the actual egg (the user’s
project source directory and its .egg-info
subdirectory).
.egg-link
files are named following the format for .egg
and
.egg-info
names, but only the project name is included; no version,
Python version, or platform information is included. When the runtime
searches for available eggs, .egg-link
files are opened and the
actual egg file/directory name is read from them.
Each .egg-link
file should contain a single file or directory name,
with no newlines. This filename should be the base location of one or
more eggs. That is, the name must either end in .egg
, or else it
should be the parent directory of one or more .egg-info
format eggs.
As of setuptools 0.6c6, the path may be specified as a platform-independent
(i.e. /
-separated) relative path from the directory containing the
.egg-link
file, and a second line may appear in the file, specifying a
platform-independent relative path from the egg’s base directory to its
setup script directory. This allows installation tools such as EasyInstall
to find the project’s setup directory and build eggs or perform other setup
commands on it.
Standard Metadata#
In addition to the minimum required PKG-INFO
metadata, projects can
include a variety of standard metadata files or directories, as
described below. Except as otherwise noted, these files and directories
are automatically generated by setuptools, based on information supplied
in the setup script or through analysis of the project’s code and
resources.
Most of these files and directories are generated via “egg-info
writers” during execution of the setuptools egg_info
command, and
are listed in the egg_info.writers
entry point group defined by
setuptools’ own setup.py
file.
Project authors can register their own metadata writers as entry points
in this group (as described in the setuptools manual under “Adding new
EGG-INFO Files”) to cause setuptools to generate project-specific
metadata files or directories during execution of the egg_info
command. It is up to project authors to document these new metadata
formats, if they create any.
.txt
File Formats#
Files described in this section that have .txt
extensions have a
simple lexical format consisting of a sequence of text lines, each line
terminated by a linefeed character (regardless of platform). Leading
and trailing whitespace on each line is ignored, as are blank lines and
lines whose first nonblank character is a #
(comment symbol). (This
is the parsing format defined by the yield_lines()
function of
the pkg_resources
module.)
All .txt
files defined by this section follow this format, but some
are also “sectioned” files, meaning that their contents are divided into
sections, using square-bracketed section headers akin to Windows
.ini
format. Note that this does not imply that the lines within
the sections follow an .ini
format, however. Please see an
individual metadata file’s documentation for a description of what the
lines and section names mean in that particular file.
Sectioned files can be parsed using the split_sections()
function;
see the “Parsing Utilities” section of the pkg_resources
manual for
for details.
Dependency Metadata#
requires.txt
#
This is a “sectioned” text file. Each section is a sequence of
“requirements”, as parsed by the parse_requirements()
function;
please see the pkg_resources
manual for the complete requirement
parsing syntax.
The first, unnamed section (i.e., before the first section header) in
this file is the project’s core requirements, which must be installed
for the project to function. (Specified using the install_requires
keyword to setup()
).
The remaining (named) sections describe the project’s “extra”
requirements, as specified using the extras_require
keyword to
setup()
. The section name is the name of the optional feature, and
the section body lists that feature’s dependencies.
Note that it is not normally necessary to inspect this file directly;
pkg_resources.Distribution
objects have a requires()
method
that can be used to obtain Requirement
objects describing the
project’s core and optional dependencies.
setup_requires.txt
#
Much like requires.txt
except represents the requirements
specified by the setup_requires
parameter to the Distribution.
dependency_links.txt
#
A list of dependency URLs, one per line, as specified using the
dependency_links
keyword to setup()
. These may be direct
download URLs, or the URLs of web pages containing direct download
links. Please see the setuptools manual for more information on
specifying this option.
depends.txt
– Obsolete, do not create!#
This file follows an identical format to requires.txt
, but is
obsolete and should not be used. The earliest versions of setuptools
required users to manually create and maintain this file, so the runtime
still supports reading it, if it exists. The new filename was created
so that it could be automatically generated from setup()
information
without overwriting an existing hand-created depends.txt
, if one
was already present in the project’s source .egg-info
directory.
namespace_packages.txt
– Namespace Package Metadata#
A list of namespace package names, one per line, as supplied to the
namespace_packages
keyword to setup()
. Please see the manuals
for setuptools and pkg_resources
for more information about
namespace packages.
entry_points.txt
– “Entry Point”/Plugin Metadata#
This is a “sectioned” text file, whose contents encode the
entry_points
keyword supplied to setup()
. All sections are
named, as the section names specify the entry point groups in which the
corresponding section’s entry points are registered.
Each section is a sequence of “entry point” lines, each parseable using
the EntryPoint.parse
classmethod; please see the pkg_resources
manual for the complete entry point parsing syntax.
Note that it is not necessary to parse this file directly; the
pkg_resources
module provides a variety of APIs to locate and load
entry points automatically. Please see the setuptools and
pkg_resources
manuals for details on the nature and uses of entry
points.
The scripts
Subdirectory#
This directory is currently only created for .egg
files built by
the setuptools bdist_egg
command. It will contain copies of all
of the project’s “traditional” scripts (i.e., those specified using the
scripts
keyword to setup()
). This is so that they can be
reconstituted when an .egg
file is installed.
The scripts are placed here using the distutils’ standard
install_scripts
command, so any #!
lines reflect the Python
installation where the egg was built. But instead of copying the
scripts to the local script installation directory, EasyInstall writes
short wrapper scripts that invoke the original scripts from inside the
egg, after ensuring that sys.path includes the egg and any eggs it
depends on. For more about script wrappers, see the section below on
Installation and Path Management Issues.
Zip Support Metadata#
native_libs.txt
#
A list of C extensions and other dynamic link libraries contained in
the egg, one per line. Paths are /
-separated and relative to the
egg’s base location.
This file is generated as part of bdist_egg
processing, and as such
only appears in .egg
files (and .egg
directories created by
unpacking them). It is used to ensure that all libraries are extracted
from a zipped egg at the same time, in case there is any direct linkage
between them. Please see the Zip File Issues section below for more
information on library and resource extraction from .egg
files.
eager_resources.txt
#
A list of resource files and/or directories, one per line, as specified
via the eager_resources
keyword to setup()
. Paths are
/
-separated and relative to the egg’s base location.
Resource files or directories listed here will be extracted
simultaneously, if any of the named resources are extracted, or if any
native libraries listed in native_libs.txt
are extracted. Please
see the setuptools manual for details on what this feature is used for
and how it works, as well as the Zip File Issues section below.
zip-safe
and not-zip-safe
#
These are zero-length files, and either one or the other should exist.
If zip-safe
exists, it means that the project will work properly
when installed as an .egg
zipfile, and conversely the existence of
not-zip-safe
means the project should not be installed as an
.egg
file. The zip_safe
option to setuptools’ setup()
determines which file will be written. If the option isn’t provided,
setuptools attempts to make its own assessment of whether the package
can work, based on code and content analysis.
If neither file is present at installation time, EasyInstall defaults
to assuming that the project should be unzipped. (Command-line options
to EasyInstall, however, take precedence even over an existing
zip-safe
or not-zip-safe
file.)
Note that these flag files appear only in .egg
files generated by
bdist_egg
, and in .egg
directories created by unpacking such an
.egg
file.
top_level.txt
– Conflict Management Metadata#
This file is a list of the top-level module or package names provided by the project, one Python identifier per line.
Subpackages are not included; a project containing both a foo.bar
and a foo.baz
would include only one line, foo
, in its
top_level.txt
.
This data is used by pkg_resources
at runtime to issue a warning if
an egg is added to sys.path
when its contained packages may have
already been imported.
(It was also once used to detect conflicts with non-egg packages at installation time, but in more recent versions, setuptools installs eggs in such a way that they always override non-egg packages, thus preventing a problem from arising.)
SOURCES.txt
– Source Files Manifest#
This file is roughly equivalent to the distutils’ MANIFEST
file.
The differences are as follows:
The filenames always use
/
as a path separator, which must be converted back to a platform-specific path whenever they are read.The file is automatically generated by setuptools whenever the
egg_info
orsdist
commands are run, and it is not user-editable.
Although this metadata is included with distributed eggs, it is not actually used at runtime for any purpose. Its function is to ensure that setuptools-built source distributions can correctly discover what files are part of the project’s source, even if the list had been generated using revision control metadata on the original author’s system.
In other words, SOURCES.txt
has little or no runtime value for being
included in distributed eggs, and it is possible that future versions of
the bdist_egg
and install_egg_info
commands will strip it before
installation or distribution. Therefore, do not rely on its being
available outside of an original source directory or source
distribution.
Other Technical Considerations#
Zip File Issues#
Although zip files resemble directories, they are not fully
substitutable for them. Most platforms do not support loading dynamic
link libraries contained in zipfiles, so it is not possible to directly
import C extensions from .egg
zipfiles. Similarly, there are many
existing libraries – whether in Python or C – that require actual
operating system filenames, and do not work with arbitrary “file-like”
objects or in-memory strings, and thus cannot operate directly on the
contents of zip files.
To address these issues, the pkg_resources
module provides a
“resource API” to support obtaining either the contents of a resource,
or a true operating system filename for the resource. If the egg
containing the resource is a directory, the resource’s real filename
is simply returned. However, if the egg is a zipfile, then the
resource is first extracted to a cache directory, and the filename
within the cache is returned.
The cache directory is determined by the pkg_resources
API; please
see the set_cache_path()
and get_default_cache()
documentation
for details.
The Extraction Process#
Resources are extracted to a cache subdirectory whose name is based
on the enclosing .egg
filename and the path to the resource. If
there is already a file of the correct name, size, and timestamp, its
filename is returned to the requester. Otherwise, the desired file is
extracted first to a temporary name generated using
mkstemp(".$extract",target_dir)
, and then its timestamp is set to
match the one in the zip file, before renaming it to its final name.
(Some collision detection and resolution code is used to handle the
fact that Windows doesn’t overwrite files when renaming.)
If a resource directory is requested, all of its contents are recursively extracted in this fashion, to ensure that the directory name can be used as if it were valid all along.
If the resource requested for extraction is listed in the
native_libs.txt
or eager_resources.txt
metadata files, then
all resources listed in either file will be extracted before the
requested resource’s filename is returned, thus ensuring that all
C extensions and data used by them will be simultaneously available.
Extension Import Wrappers#
Since Python’s built-in zip import feature does not support loading
C extension modules from zipfiles, the setuptools bdist_egg
command
generates special import wrappers to make it work.
The wrappers are .py
files (along with corresponding .pyc
and/or .pyo
files) that have the same module name as the
corresponding C extension. These wrappers are located in the same
package directory (or top-level directory) within the zipfile, so that
say, foomodule.so
will get a corresponding foo.py
, while
bar/baz.pyd
will get a corresponding bar/baz.py
.
These wrapper files contain a short stanza of Python code that asks
pkg_resources
for the filename of the corresponding C extension,
then reloads the module using the obtained filename. This will cause
pkg_resources
to first ensure that all of the egg’s C extensions
(and any accompanying “eager resources”) are extracted to the cache
before attempting to link to the C library.
Note, by the way, that .egg
directories will also contain these
wrapper files. However, Python’s default import priority is such that
C extensions take precedence over same-named Python modules, so the
import wrappers are ignored unless the egg is a zipfile.
Installation and Path Management Issues#
Python’s initial setup of sys.path
is very dependent on the Python
version and installation platform, as well as how Python was started
(i.e., script vs. -c
vs. -m
vs. interactive interpreter).
In fact, Python also provides only two relatively robust ways to affect
sys.path
outside of direct manipulation in code: the PYTHONPATH
environment variable, and .pth
files.
However, with no cross-platform way to safely and persistently change
environment variables, this leaves .pth
files as EasyInstall’s only
real option for persistent configuration of sys.path
.
But .pth
files are rather strictly limited in what they are allowed
to do normally. They add directories only to the end of sys.path
,
after any locally-installed site-packages
directory, and they are
only processed in the site-packages
directory to start with.
This is a double whammy for users who lack write access to that
directory, because they can’t create a .pth
file that Python will
read, and even if a sympathetic system administrator adds one for them
that calls site.addsitedir()
to allow some other directory to
contain .pth
files, they won’t be able to install newer versions of
anything that’s installed in the systemwide site-packages
, because
their paths will still be added after site-packages
.
So EasyInstall applies two workarounds to solve these problems.
The first is that EasyInstall leverages .pth
files’ “import” feature
to manipulate sys.path
and ensure that anything EasyInstall adds
to a .pth
file will always appear before both the standard library
and the local site-packages
directories. Thus, it is always
possible for a user who can write a Python-read .pth
file to ensure
that their packages come first in their own environment.
Second, when installing to a PYTHONPATH
directory (as opposed to
a “site” directory like site-packages
) EasyInstall will also install
a special version of the site
module. Because it’s in a
PYTHONPATH
directory, this module will get control before the
standard library version of site
does. It will record the state of
sys.path
before invoking the “real” site
module, and then
afterwards it processes any .pth
files found in PYTHONPATH
directories, including all the fixups needed to ensure that eggs always
appear before the standard library in sys.path, but are in a relative
order to one another that is defined by their PYTHONPATH
and
.pth
-prescribed sequence.
The net result of these changes is that sys.path
order will be
as follows at runtime:
The
sys.argv[0]
directory, or an empty string if no script is being executed.All eggs installed by EasyInstall in any
.pth
file in eachPYTHONPATH
directory, in order first byPYTHONPATH
order, then normal.pth
processing order (which is to say alphabetical by.pth
filename, then by the order of listing within each.pth
file).All eggs installed by EasyInstall in any
.pth
file in each “site” directory (such assite-packages
), following the same ordering rules as for the ones onPYTHONPATH
.The
PYTHONPATH
directories themselves, in their original orderAny paths from
.pth
files found onPYTHONPATH
that were not eggs installed by EasyInstall, again following the same relative ordering rules.The standard library and “site” directories, along with the contents of any
.pth
files found in the “site” directories.
Notice that sections 1, 4, and 6 comprise the “normal” Python setup for
sys.path
. Sections 2 and 3 are inserted to support eggs, and
section 5 emulates what the “normal” semantics of .pth
files on
PYTHONPATH
would be if Python natively supported them.
For further discussion of the tradeoffs that went into this design, as
well as notes on the actual magic inserted into .pth
files to make
them do these things, please see also the following messages to the
distutils-SIG mailing list:
http://mail.python.org/pipermail/distutils-sig/2006-February/006026.html
http://mail.python.org/pipermail/distutils-sig/2006-March/006123.html
Script Wrappers#
EasyInstall never directly installs a project’s original scripts to a script installation directory. Instead, it writes short wrapper scripts that first ensure that the project’s dependencies are active on sys.path, before invoking the original script. These wrappers have a #! line that points to the version of Python that was used to install them, and their second line is always a comment that indicates the type of script wrapper, the project version required for the script to run, and information identifying the script to be invoked.
The format of this marker line is:
"# EASY-INSTALL-" script_type ": " tuple_of_strings "\n"
The script_type
is one of SCRIPT
, DEV-SCRIPT
, or
ENTRY-SCRIPT
. The tuple_of_strings
is a comma-separated
sequence of Python string constants. For SCRIPT
and DEV-SCRIPT
wrappers, there are two strings: the project version requirement, and
the script name (as a filename within the scripts
metadata
directory). For ENTRY-SCRIPT
wrappers, there are three:
the project version requirement, the entry point group name, and the
entry point name. (See the “Automatic Script Creation” section in the
setuptools manual for more information about entry point scripts.)
In each case, the project version requirement string will be a string
parseable with the pkg_resources
modules’ Requirement.parse()
classmethod. The only difference between a SCRIPT
wrapper and a
DEV-SCRIPT
is that a DEV-SCRIPT
actually executes the original
source script in the project’s source tree, and is created when the
“setup.py develop” command is run. A SCRIPT
wrapper, on the other
hand, uses the “installed” script written to the EGG-INFO/scripts
subdirectory of the corresponding .egg
zipfile or directory.
(.egg-info
eggs do not have script wrappers associated with them,
except in the “setup.py develop” case.)
The purpose of including the marker line in generated script wrappers is to facilitate introspection of installed scripts, and their relationship to installed eggs. For example, an uninstallation tool could use this data to identify what scripts can safely be removed, and/or identify what scripts would stop working if a particular egg is uninstalled.