pgreplay - PostgreSQL log file replayer for performance tests
pgreplay [parse options] [replay options]
[-d level] [infile]
pgreplay -f [parse options] [-o outfile]
[-d level] [infile]
pgreplay -r [replay options] [-d level]
[infile]
pgreplay reads a PostgreSQL log file (not a WAL
file), extracts the SQL statements and executes them in the same order and
relative time against a PostgreSQL database cluster. A final report gives
you a useful statistical analysis of your workload and its execution.
In the first form, the log file infile is replayed at the
time it is read.
With the -f option, pgreplay will not execute the
statements, but write them to a ‘replay file’ outfile
that can be replayed with the third form.
With the -r option, pgreplay will execute the
statements in the replay file infile that was created by the second
form.
If the execution of statements gets behind schedule, warning
messages are issued that indicate that the server cannot handle the load in
a timely fashion. The idea is to replay a real-world database workload as
exactly as possible.
To create a log file that can be parsed by pgreplay, you
need to set the following parameters in postgresql.conf:
- log_min_messages=error (or more)
log_min_error_statement=log (or more)
log_connections=on
log_disconnections=on
log_line_prefix='%m|%u|%d|%c|' (if you don't use CSV logging)
log_statement='all'
lc_messages must be set to English (encoding does not matter)
bytea_output=escape (from version 9.0 on, only if you want to replay
the log on 8.4 or earlier)
The database cluster against which you replay the SQL statements
must be a clone of the database cluster that generated the logs from the
time immediately before the logs were generated.
pgreplay is useful for performance tests, particularly in
the following situations:
- *
- You want to compare the performance of your PostgreSQL application on
different hardware or different operating systems.
- *
- You want to upgrade your database and want to make sure that the new
database version does not suffer from performance regressions that affect
you.
Moreover, pgreplay can give you some feeling as to how your
application might scale by allowing you to try to replay the workload
at a higher speed. Be warned, though, that 500 users working at double speed
is not really the same as 1000 users working at normal speed.
- -c
- Specifies that the log file is in 'csvlog' format (highly recommended) and
not in 'stderr' format.
- -b timestamp
- Only log entries greater or equal to that timestamp will be parsed. The
format is YYYY-MM-DD HH:MM:SS.FFF like in the log file. An optional
time zone part will be ignored.
- -e timestamp
- Only log entries less or equal to that timestamp will be parsed. The
format is YYYY-MM-DD HH:MM:SS.FFF like in the log file. An optional
time zone part will be ignored.
- -q
- Specifies that a backslash in a simple string literal will escape the
following single quote. This depends on configuration options like
standard_conforming_strings and is the default for server version
9.0 and less.
- -h hostname
- Host name where the target database cluster is running (or directory where
the UNIX socket can be found). Defaults to local connections.
This works just like the -h option of psql.
- -p port
- TCP port where the target database cluster can be reached.
- -W password
- By default, pgreplay assumes that the target database cluster is
configured for trust authentication. With the -W option you
can specify a password that will be used for all users in the
cluster.
- -s factor
- Speed factor for replay, by default 1. This can be any valid positive
floating point number. A factor less than 1 will replay the
workload in ‘slow motion’, while a factor greater
than 1 means ‘fast forward’.
- -E encoding
- Specifies the encoding of the log file, which will be used as client
encoding during replay. If it is omitted, your default client encoding
will be used.
- -j
- If all connections are idle, jump ahead to the next request instead of
sleeping. This will speed up replay. Execution delays will still be
reported correctly, but replay statistics will not contain the idle
time.
- -o outfile
- specifies the replay file where the statements will be written for later
replay.
- -d level
- Specifies the trace level (between 1 and 3). Increasing levels will
produce more detailed information about what pgreplay is
doing.
- -v
- Prints the program version and exits.
- PGHOST
- Specifies the default value for the -h option.
- PGPORT
- Specifies the default value for the -p option.
- PGCLIENTENCODING
- Specifies the default value for the -E option.
pgreplay can only replay what is logged by PostgreSQL. This
leads to some limitations:
- *
- COPY statements will not be replayed, because the copy data are not
logged.
- *
- Fast-path API function calls are not logged and will not be replayed.
Unfortunately, this includes the Large Object API.
- *
- Since the log file is always in the server encoding (which you can specify
with the -E switch of pgreplay), all SET
client_encoding statements will be ignored.
- *
- Since the preparation time of prepared statements is not logged (unless
log_min_messages is debug2 or more), these statements will
be prepared immediately before they are first executed during replay.
- *
- Because the log file contains only text, query parameters and return
values will always be in text and never in binary format. If you use
binary mode to, say, transfer large binary data, pgreplay can cause
significantly more network traffic than the original run.
- *
- Sometimes, if a connection takes longer to complete, the session ID
unexpectedly changes in the PostgreSQL log file. This causes
pgreplay to treat the session as two different ones, resulting in
an additional connection. This is arguably a bug in PostgreSQL.
Written by Laurenz Albe
<laurenz.albe@wien.gv.at>.