likwid-pin - pin a sequential or threaded application to dedicated
    processors
likwid-pin [-vhSpqim] [-V <verbosity>]
    [-c/-C <corelist>] [-s <skip_mask>]
    [-d <delim>]
likwid-pin is a command line application to pin a
    sequential or multithreaded application to dedicated processors. It can be
    used as a replacement for taskset. Unlike taskset, it takes a list of
    single processors instead of an affinity mask. For multithreaded
    applications based on the pthread library, the pthread_create library
    call is overloaded through LD_PRELOAD and each created thread is pinned
    to a dedicated processor as specified in core_list.
By default, every generated thread is pinned to a core in the
    order of the calls to pthread_create. It is possible to skip single
    threads.
The OpenMP implementations of the GCC and ICC compilers are explicitly
    supported. Clang's OpenMP backend should also work as it is built on top of
    Intel's OpenMP runtime library. Others may also work. likwid-pin sets
    the environment variable OMP_NUM_THREADS for you if it is not already
    present. It is set to the number of processors in the pin expression. Be
    aware that with pthreads the parent thread is always pinned. If you create,
    for example, 4 threads with pthread_create and do not use the parent
    process as a worker, you still have to provide num_threads+1 processor
    IDs.
likwid-pin supports different numberings for pinning. See
    section CPU EXPRESSION for details.
For applications where the first-touch policy on NUMA systems cannot
    be employed, likwid-pin can be used to turn on interleaved memory
    placement. This can significantly speed up memory-bound
    multithreaded codes. All NUMA nodes to which the user pinned threads are
    used for interleaving.
  - -h,--help
 
  - prints a help message to standard output, then exits.
 
  - -v,--version
 
  - prints version information to standard output, then exits.
 
  - -V, --verbose
    <level>
 
  - verbose output during execution for debugging: 0 for errors only, 1 for
      informational output, 2 for detailed output and 3 for developer
      output.
 
  - -c,-C <cpu
    expression>
 
  - specify a numerical list of processors. The list may contain multiple
      items, separated by commas, and ranges, for example 0,3,9-11. Other
      formats are available, see the CPU EXPRESSION section.
 
  - -s, --skip
    <skip_mask>
 
  - Specify skip mask as HEX number. For each set bit the corresponding thread
      is skipped.
 
  - -S,--sweep
 
  - All ccNUMA memory domains belonging to the specified thread list will be
      cleaned before the run. Can solve file buffer cache problems on
    Linux.
 
  - -p
 
  - prints the available thread domains for logical pinning
 
  - -i
 
  - set NUMA memory policy to interleave involving all NUMA nodes involved in
      pinning
 
  - -m
 
  - set NUMA memory policy to membind involving all NUMA nodes involved in
      pinning
 
  - -d <delim>
 
  - usable with -p to specify the CPU delimiter in the cpulist
 
  - -q,--quiet
 
  - silent execution without output
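
The skip mask taken by -s is a bit field written as a hexadecimal number. The following Python sketch shows one way to derive such a value from the indices of the threads that should be skipped. It assumes that bit 0 corresponds to the first thread created and bit 1 to the second, and so on; the helper name skip_mask is hypothetical and not part of likwid-pin.

```python
def skip_mask(skipped_threads):
    """Return a hex string with one bit set per skipped thread index.

    Assumption: bit i stands for the i-th created thread (0-based).
    """
    mask = 0
    for index in skipped_threads:
        if index < 0:
            raise ValueError("thread indices must be non-negative")
        mask |= 1 << index  # set the bit that belongs to this thread
    return hex(mask)


if __name__ == "__main__":
    # Skip only the first created thread.
    print(skip_mask([0]))
    # Skip the second and third created threads.
    print(skip_mask([1, 2]))
```

The resulting string can be passed as the argument of -s.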
    
  
 
  - 1.
 
  - The most intuitive CPU selection method is a comma-separated list of
      hardware thread IDs. An example for this is 0,2 which schedules the
      threads on hardware threads 0 and 2. The physical numbering
      also allows the usage of ranges like 0-2 which results in the list
      0,1,2.
 
  - 2.
 
  - The CPUs can be selected by their indices inside an affinity domain.
      The affinity domain is optional and, if not given, Likwid assumes the
      domain 'N' for the whole node. The format is
      L:<indexlist> for selecting the CPUs inside domain
      'N' or L:<domain>:<indexlist> for selecting the
      CPUs inside the given domain. Assume a virtual affinity domain
      'P' that contains the CPUs 0,4,1,5,2,6,3,7. After sorting it
      to have physical hardware threads first we get: 0,1,2,3,4,5,6,7.
      The logical numbering L:P:0-2 results in the selection 0,1,2
      from the sorted list.
 
  - 3.
 
  - The expression syntax enables selection according to a selection
      function with variable input parameters. The format is either
      E:<affinity domain>:<numberOfThreads> to use the first
      <numberOfThreads> threads in affinity domain <affinity domain>
      or E:<affinity
      domain>:<numberOfThreads>:<chunksize>:<stride> to
      use <numberOfThreads> threads with <chunksize> threads
      selected in a row, where <stride> is the distance between the starts
      of two consecutive chunks in affinity domain
      <affinity domain>. Examples are E:N:4:1:2 for selecting the
      first four physical CPUs on a system with 2 hardware threads per CPU core
      or E:P:4:2:4 for choosing the first two threads in affinity domain
      P, skipping 2 threads and selecting two threads again. The
      resulting CPU list for the virtual affinity domain P is
      0,4,2,6.
 
  - 4.
 
  - The last format schedules the threads not only in a single affinity domain
      but distributes them evenly over all available affinity domains of the
      same kind. In contrast to the other formats, the selection is done using
      the physical hardware threads first and then the virtual hardware threads
      (aka SMT threads). The format is <affinity domain without
      number>:scatter, for example M:scatter to schedule the threads
      evenly in all available memory affinity domains. Assuming the two socket
      domains S0 = 0,4,1,5 and S1 = 2,6,3,7, the expression
      S:scatter results in the CPU list 0,2,1,3,4,6,5,7.
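
The logical, expression and scatter formats can be modelled with a few lines of Python. This is only a sketch that reproduces the examples given in this section, not the code used by likwid-pin. All function names are hypothetical. It assumes that an affinity domain lists the hardware threads core by core (as in 0,4,1,5,2,6,3,7 with 2 hardware threads per core) and that the expression selection simply stops at the end of the domain.

```python
def physical_first(domain, threads_per_core=2):
    """Reorder a core-by-core domain so physical hardware threads come first."""
    ordered = []
    for smt in range(threads_per_core):
        ordered.extend(domain[smt::threads_per_core])
    return ordered


def select_by_expression(domain, num_threads, chunk=1, stride=1):
    """Model E:<domain>:<numberOfThreads>:<chunksize>:<stride>."""
    if chunk < 1 or stride < 1:
        raise ValueError("chunk and stride must be at least 1")
    selected = []
    start = 0
    while len(selected) < num_threads and start < len(domain):
        for cpu in domain[start:start + chunk]:
            if len(selected) < num_threads:
                selected.append(cpu)
        start += stride  # jump to the start of the next chunk
    return selected


def scatter(domains, threads_per_core=2):
    """Model <domain>:scatter: round-robin over the physical-first lists."""
    ordered = [physical_first(d, threads_per_core) for d in domains]
    result = []
    for i in range(max(len(d) for d in ordered)):
        for d in ordered:
            if i < len(d):
                result.append(d[i])
    return result


if __name__ == "__main__":
    P = [0, 4, 1, 5, 2, 6, 3, 7]
    print(physical_first(P)[0:3])            # L:P:0-2
    print(select_by_expression(P, 4, 2, 4))  # E:P:4:2:4
    print(scatter([[0, 4, 1, 5], [2, 6, 3, 7]]))  # S:scatter
```

Run as a script, the three calls print the CPU lists derived in the text for L:P:0-2, E:P:4:2:4 and S:scatter.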
    
  
 
  - 1.
 
  - For standard pthread application:
 
  - likwid-pin -c
    0,2,4-6 ./myApp
 
  
The parent process is pinned to processor 0 which is likely to be
    thread 0 in ./myApp. Thread 1 is pinned to processor 2, thread 2 to
    processor 4, thread 3 to processor 5 and thread 4 to processor 6. If more
    threads are created than specified in the processor list, these threads are
    pinned to processor 0 as fallback.
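
The mapping described above can be sketched in Python as follows. The sketch expands a physical processor list such as 0,2,4-6 and assigns threads in creation order, with the parent process counted as thread 0. The function names are hypothetical, and the fallback to processor 0 follows the statement above rather than the likwid-pin source code.

```python
def parse_cpu_list(expression):
    """Expand a comma-separated list with ranges, e.g. '0,2,4-6'."""
    cpus = []
    for item in expression.split(","):
        if "-" in item:
            first, last = item.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(item))
    return cpus


def assign_threads(expression, num_threads, fallback_cpu=0):
    """Return the processor for each thread in creation order.

    Threads beyond the end of the list get fallback_cpu (assumed to be
    processor 0, as stated in the example).
    """
    cpus = parse_cpu_list(expression)
    return [cpus[i] if i < len(cpus) else fallback_cpu
            for i in range(num_threads)]


if __name__ == "__main__":
    # Parent process plus four created threads.
    print(assign_threads("0,2,4-6", 5))
    # One thread more than processors in the list.
    print(assign_threads("0,2,4-6", 6))
```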
  - 2.
 
  - For selection of CPUs inside of a CPUset only the logical numbering is
      allowed. Assuming CPUset 0,4,1,5:
 
  - likwid-pin
    -c L:1,3 ./myApp
 
  
This command pins ./myApp to CPU 4 and the thread
    started by ./myApp to CPU 5.
  - 3.
 
  - A common use-case for the numbering by expression is pinning of an
      application on the Intel Xeon Phi coprocessor with its 60 cores each
      having 4 SMT threads.
 
  - likwid-pin
    -c E:N:60:1:4 ./myApp
 
  
This command schedules one application thread per physical CPU
    core for ./myApp.
The detection of shepherd threads works for the Intel/LLVM OpenMP
    runtime (>=12.0), for GCC's OpenMP runtime as well as for PGI's OpenMP
    runtime. If you encounter problems with pinning, please set a proper skip
    mask to skip the undetected shepherd threads. The Intel OpenMP runtime
    11.0/11.1 requires a skip mask of 0x1.
Written by Thomas Gruber <thomas.roehl@googlemail.com>.
Report bugs at
  <https://github.com/RRZE-HPC/likwid/issues>.