pcp2spark - pcp-to-spark metrics exporter
pcp2spark [-5CGHIjLnrRvV?] [-4 action] [-8|-9 limit] [-a archive]
[--archive-folio folio] [-A align] [-b|-B space-scale] [-c config]
[--container container] [--daemonize] [-e derived] [-g server]
[-h host] [-i instances] [-J rank] [-K spec] [-N predicate]
[-O origin] [-p port] [-P|-0 precision] [-q|-Q count-scale]
[-s samples] [-S starttime] [-t interval] [-T endtime]
[-y|-Y time-scale] metricspec [...]
pcp2spark is a customizable performance metrics exporter
tool from PCP to Apache Spark. Any available performance metric, live or
archived, system and/or application, can be selected for exporting using
either command line arguments or a configuration file.
pcp2spark acts as a bridge: it provides a network socket
stream on a given address/port, to which an Apache Spark worker task can
connect and pull the configured PCP metrics using the streaming
extensions of the Apache Spark API.
pcp2spark is a close relative of pmrep(1). Please
refer to pmrep(1) for the metricspec description accepted on the
pcp2spark command line and to pmrep.conf(5) for a description of
the overall syntax of the pcp2spark.conf configuration file. This page
describes pcp2spark specific options and configuration file
differences from pmrep.conf(5). pmrep(1) also lists some usage
examples, most of which are applicable to pcp2spark as well.
Only the command line options listed on this page are supported;
other options recognized by pmrep(1) are not supported.
Options set via environment variables (see pmGetOptions(3))
override the corresponding built-in default values (if any). Configuration
file options override the corresponding environment variables (if any).
Command line options override the corresponding configuration file options
(if any).
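The precedence order described above can be illustrated with a short Python sketch. This is not how pcp2spark is implemented; it only models the layering. The option values are hypothetical, apart from the defaults for spark_server, spark_port and the 1 second interval, which are taken from this page.

```python
from collections import ChainMap

# Hypothetical option layers, lowest to highest priority.
# Only the built-in defaults below are taken from this page.
builtin_defaults = {"spark_server": "127.0.0.1", "spark_port": 44325, "interval": "1s"}
environment = {"interval": "5s"}             # assumed to come from an environment variable
config_file = {"spark_port": 50000}          # assumed to come from pcp2spark.conf
command_line = {"spark_server": "0.0.0.0"}   # assumed to come from -g

# ChainMap looks names up left to right, so the highest-priority layer is first.
effective = ChainMap(command_line, config_file, environment, builtin_defaults)


def resolve(name):
    """Return the effective value of an option after applying all layers."""
    return effective[name]
```

With these layers, spark_server comes from the command line, spark_port from the configuration file and interval from the environment.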
A typical setup involves configuring pcp2spark with the PCP
metrics to export and then starting the pcp2spark application.
pcp2spark will then wait and listen on the given address/port for a
connection from an Apache Spark worker thread. Once the worker thread
has been started, it connects to pcp2spark.
When an Apache Spark worker thread has connected, pcp2spark
will begin streaming PCP metric data to Apache Spark until the worker thread
completes or the connection is interrupted. If the connection is interrupted
or the socket is closed by the Apache Spark worker thread, pcp2spark
will exit.
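The connection flow above can be tried without a Spark cluster by using a plain socket client as a stand-in for the worker thread. The following Python sketch is not the Spark example shipped with PCP. The helper name read_metric_lines is hypothetical, and it assumes that the stream is newline-delimited UTF-8 text, which this page does not specify.

```python
import socket


def read_metric_lines(host="127.0.0.1", port=44325, max_lines=None):
    """Connect to a listening exporter and yield text lines until it closes.

    The default host and port match the defaults documented on this page.
    The newline-delimited text framing is an assumption.
    """
    with socket.create_connection((host, port)) as conn:
        # makefile() provides line-oriented reads on top of the byte stream.
        with conn.makefile("r", encoding="utf-8") as stream:
            for count, line in enumerate(stream, start=1):
                yield line.rstrip("\n")
                if max_lines is not None and count >= max_lines:
                    # Leaving the loop closes the socket via the with blocks.
                    break
```

After starting pcp2spark with at least one metricspec, iterating over read_metric_lines() and printing each line should show the data a Spark worker would receive. Closing the connection on the client side causes pcp2spark to exit, as described above.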
For an example Apache Spark worker job which connects to a
pcp2spark instance on a given address/port and pulls in PCP metric
data, please see the example provided in the PCP examples directory for
pcp2spark (often provided by the PCP development package) or the
online version at
https://github.com/performancecopilot/pcp/blob/master/src/pcp2spark/.
pcp2spark uses a configuration file with overall syntax
described in pmrep.conf(5). The following options are common with
pmrep.conf: version, source, speclocal,
derived, header, globals, samples,
interval, type, type_prefer, ignore_incompat,
names_change, instances, live_filter, rank,
limit_filter, limit_filter_force, invert_filter,
predicate, omit_flat, precision,
precision_force, count_scale, count_scale_force,
space_scale, space_scale_force, time_scale,
time_scale_force. The output option is recognized but ignored
for pmrep.conf compatibility.
spark_server (string)
Specify the address on which pcp2spark will listen
for connections from an Apache Spark worker thread. Corresponding command line
option is -g. Default is 127.0.0.1.
spark_port (integer)
Specify the port on which pcp2spark will listen.
Corresponding command line option is -p. Default is 44325.
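As a rough illustration of how the two options above combine with their defaults, the following Python sketch parses a hypothetical configuration text and computes the listen address. It assumes the options live in an [options] section as described in pmrep.conf(5). The sample text and the helper name listen_address are illustrative only and are not part of pcp2spark.

```python
import configparser

# Hypothetical configuration text; the values are for illustration only.
SAMPLE_CONF = """
[options]
spark_server = 0.0.0.0
samples = 10
"""

# Defaults as documented on this page.
DEFAULTS = {"spark_server": "127.0.0.1", "spark_port": "44325"}


def listen_address(conf_text):
    """Return (server, port), falling back to the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    opts = parser["options"] if parser.has_section("options") else {}
    server = opts.get("spark_server", DEFAULTS["spark_server"])
    port = int(opts.get("spark_port", DEFAULTS["spark_port"]))
    return server, port
```

Here listen_address(SAMPLE_CONF) yields the configured address together with the default port, since spark_port is not set in the sample.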
The available command line options are:
- -0 precision, --precision-force=precision
- Like -P but this option will override per-metric
specifications.
- -4 action, --names-change=action
- Specify which action to take on receiving a metric names change
event during sampling. These events occur when a PMDA discovers new
metrics sometime after starting up, and informs running client tools like
pcp2spark. Valid values for action are update
(refresh metrics being sampled), ignore (do nothing - the default
behaviour) and abort (exit the program if such an event
happens).
- -5, --ignore-unknown
- Silently ignore any metric name that cannot be resolved. At least one
metric must be found for the tool to start.
- -8 limit, --limit-filter=limit
- Limit results to instances with values above/below limit. A
positive integer will include instances with values at or above the limit
in reporting. A negative integer will include instances with values at or
below the limit in reporting. A value of zero performs no limit filtering.
This option will not override possible per-metric specifications.
See also -J and -N.
- -9 limit, --limit-filter-force=limit
- Like -8 but this option will override per-metric
specifications.
- -a archive,
--archive=archive
- Performance metric values are retrieved from the set of Performance
Co-Pilot (PCP) archive log files identified by the argument
archive, which is a comma-separated list of names, each of which
may be the base name of an archive or the name of a directory containing
one or more archives.
- --archive-folio=folio
- Read metric source archives from the PCP archive folio created by
tools like pmchart(1) or, less often, manually with
mkaf(1).
- -A align,
--align=align
- Force the initial sample to be aligned on the boundary of a natural time
unit align. Refer to PCPIntro(1) for a complete description
of the syntax for align.
- -b scale,
--space-scale=scale
- Unit/scale for space (byte) metrics, possible values include
bytes, Kbytes, KB, Mbytes, MB, and so
forth. This option will not override possible per-metric
specifications. See also pmParseUnitsStr(3).
- -B scale,
--space-scale-force=scale
- Like -b but this option will override per-metric
specifications.
- -c config,
--config=config
- Specify the config file to use. The default is the first found of:
./pcp2spark.conf,
$HOME/.pcp2spark.conf,
$HOME/pcp/pcp2spark.conf, and
$PCP_SYSCONF_DIR/pcp2spark.conf. For
details, see the above section and pmrep.conf(5).
- --container=container
- Fetch performance metrics from the specified container, either
local or remote (see -h).
- -C, --check
- Exit before reporting any values, but after parsing the configuration and
metrics and printing possible headers.
- --daemonize
- Daemonize on startup.
- -e derived,
--derived=derived
- Specify derived performance metrics. If derived starts with
a slash (``/'') or with a dot (``.'') it will be interpreted as a derived
metrics configuration file, otherwise it will be interpreted as comma- or
semicolon-separated derived metric expressions. For details see
pmLoadDerivedConfig(3) and pmRegisterDerived(3).
- -g server,
--spark-server=server
- Address on which pcp2spark listens for connections from an Apache
Spark worker thread. Default is 127.0.0.1.
- -G,
--no-globals
- Do not include global metrics in reporting (see
pmrep.conf(5)).
- -h host,
--host=host
- Fetch performance metrics from pmcd(1) on host, rather than
from the default localhost.
- -H,
--no-header
- Do not print any headers.
- -i instances,
--instances=instances
- Report only the listed instances from current instances (if
present, see also -j). By default all instances, present and
future, are reported. This is a global option that is used for all metrics
unless a metric-specific instance definition is provided as part of a
metricspec. By default single-valued ``flat'' metrics without
multiple instances are still reported as usual; use -v to change
this. Please refer to pmrep(1) for more details on this
option.
- -I,
--ignore-incompat
- Ignore incompatible metrics. By default incompatible metrics (that is,
their type is unsupported or they cannot be scaled as requested) will
cause pcp2spark to terminate with an error message. With this
option all incompatible metrics are silently omitted from reporting. This
may be especially useful when requesting non-leaf nodes of the PMNS tree
for reporting.
- -j,
--live-filter
- Perform instance live filtering. This allows capturing all filtered
instances even if processes are restarted at some point (unlike without
live filtering). Performing live filtering over a very large number of
instances adds some internal overhead, so some caution is advised. See
also -n.
- -J rank,
--rank=rank
- Limit results to highest/lowest rank instances of set-valued
metrics. A positive integer will include highest valued instances in
reporting. A negative integer will include lowest valued instances in
reporting. A value of zero performs no ranking. See also -8.
- -K spec,
--spec-local=spec
- When fetching metrics from a local context (see -L), the -K
option may be used to control the DSO PMDAs that should be made
accessible. The spec argument conforms to the syntax described in
pmSpecLocalPMDA(3). More than one -K option may be
used.
- -L,
--local-PMDA
- Use a local context to collect metrics from DSO PMDAs on the local host
without PMCD. See also -K.
- -n,
--invert-filter
- Perform ranking before live filtering. By default instance live filtering
(when requested, see -j) happens before instance ranking (when
requested, see -J). With this option the logic is inverted and
ranking happens before live filtering.
- -N predicate,
--predicate=predicate
- Specify a comma-separated list of predicate filter reference
metrics. By default ranking (see -J) happens for each metric
individually. With predicates, ranking is done only for the specified
predicate metrics. When reporting, the rest of the metrics sharing the same
instance domain (see PCPIntro(1)) as the predicate will
include only the highest/lowest ranking instances of the corresponding
predicate.
So for example, using proc.memory.rss (resident memory size
of process) as the predicate metric together with
proc.io.total_bytes and mem.util.used as metrics to be
reported, only the processes using most/least (as per -J) memory will
be included when reporting total bytes written by processes. Since
mem.util.used is a single-valued metric (thus not sharing the same
instance domain as the process-related metrics), it will be reported as
usual.
- -O origin,
--origin=origin
- When reporting archived metrics, start reporting at origin within
the time window (see -S and -T). Refer to PCPIntro(1)
for a complete description of the syntax for origin.
- -p port,
--spark-port=port
- Port on which pcp2spark listens for connections from an Apache
Spark worker thread. Default is 44325.
- -P precision,
--precision=precision
- Use precision for numeric non-integer output values. The default is
to use 3 decimal places (when applicable). This option will not
override possible per-metric specifications.
- -q scale,
--count-scale=scale
- Unit/scale for count metrics, possible values include count x
10^-1, count, count x 10, count x 10^2, and so
forth from 10^-8 to 10^7. (These values are currently
space-sensitive.) This option will not override possible per-metric
specifications. See also pmParseUnitsStr(3).
- -Q scale,
--count-scale-force=scale
- Like -q but this option will override per-metric
specifications.
- -r, --raw
- Output raw metric values; do not convert cumulative counters to rates.
This option will override possible per-metric specifications.
- -R,
--raw-prefer
- Like -r but this option will not override per-metric
specifications.
- -s samples,
--samples=samples
- The argument samples defines the number of samples to be retrieved
and reported. If samples is 0 or -s is not specified,
pcp2spark will sample and report continuously (in real time mode)
or until the end of the set of PCP archives (in archive mode). See also
-T.
- -S starttime,
--start=starttime
- When reporting archived metrics, the report will be restricted to those
records logged at or after starttime. Refer to PCPIntro(1)
for a complete description of the syntax for starttime.
- -t interval,
--interval=interval
- Set the update interval to something other than the default of
1 second. The interval argument follows the syntax
described in PCPIntro(1), and in the simplest form may be an
unsigned integer (the implied units in this case are seconds). See also
the -T option.
- -T endtime,
--finish=endtime
- When reporting archived metrics, the report will be restricted to those
records logged before or at endtime. Refer to PCPIntro(1)
for a complete description of the syntax for endtime.
When used to define the runtime before pcp2spark will exit,
if no samples is given (see -s) then the number of reported
samples depends on interval (see -t). If samples is
given then interval will be adjusted to allow reporting of
samples during runtime. In case all of -T, -s, and
-t are given, endtime determines the actual time
pcp2spark will run.
- -v,
--omit-flat
- Omit single-valued ``flat'' metrics from reporting; only consider
set-valued metrics (i.e., metrics with multiple values) for reporting. See
-i and -I.
- -V, --version
- Display version number and exit.
- -y scale,
--time-scale=scale
- Unit/scale for time metrics, possible values include
nanosec, ns, microsec, us, millisec,
ms, and so forth up to hour, hr. This option will
not override possible per-metric specifications. See also
pmParseUnitsStr(3).
- -Y scale,
--time-scale-force=scale
- Like -y but this option will override per-metric
specifications.
- -?, --help
- Display usage message and exit.
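The instance selection semantics of -8 and -J in the option list above can be sketched in Python as follows. This is not the pcp2spark implementation. The function names are hypothetical, and treating a negative limit as an upper bound equal to its absolute value is an assumed reading of the -8 description.

```python
def limit_filter(instances, limit):
    """Model of -8: instances is a dict mapping instance name to value."""
    if limit > 0:
        # Positive limit: keep values at or above the limit.
        return {k: v for k, v in instances.items() if v >= limit}
    if limit < 0:
        # Negative limit: keep values at or below abs(limit) (assumed reading).
        return {k: v for k, v in instances.items() if v <= abs(limit)}
    # Zero: no limit filtering.
    return dict(instances)


def rank_filter(instances, rank):
    """Model of -J: keep the abs(rank) highest (rank > 0) or lowest (rank < 0)."""
    if rank == 0:
        return dict(instances)
    ordered = sorted(instances.items(), key=lambda kv: kv[1], reverse=rank > 0)
    return dict(ordered[:abs(rank)])
```

For instance values of 10, 50 and 200, a limit of 50 keeps the instances with 50 and 200, a limit of -50 keeps those with 10 and 50, and a rank of 2 keeps the two highest valued instances.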
Environment variables with the prefix PCP_ are used to
parameterize the file and directory names used by PCP. On each installation,
the file /etc/pcp.conf contains the local values for these variables.
The $PCP_CONF variable may be used to specify an alternative
configuration file, as described in pcp.conf(5).
For environment variables affecting PCP tools, see
pmGetOptions(3).
mkaf(1), PCPIntro(1), pcp(1),
pcp2elasticsearch(1), pcp2graphite(1), pcp2influxdb(1),
pcp2json(1), pcp2xlsx(1), pcp2xml(1),
pcp2zabbix(1), pmcd(1), pminfo(1), pmrep(1),
pmGetOptions(3), pmSpecLocalPMDA(3),
pmLoadDerivedConfig(3), pmParseUnitsStr(3),
pmRegisterDerived(3), LOGARCHIVE(5), pcp.conf(5),
pmns(5) and pmrep.conf(5).