DOKK Library

Persistent Software Transactional Memory in Haskell

Authors André Brinkmann Nicolas Krauter Patrick Raaf Peter Braam Reza Salkhordeh Sebastian Erdweg

License CC-BY-4.0

Persistent Software Transactional Memory in Haskell
NICOLAS KRAUTER∗ and PATRICK RAAF∗ , Johannes Gutenberg University, Germany
PETER BRAAM, University of Oxford, United Kingdom
REZA SALKHORDEH, Johannes Gutenberg University, Germany
SEBASTIAN ERDWEG, Johannes Gutenberg University, Germany
ANDRÉ BRINKMANN, Johannes Gutenberg University, Germany
Emerging persistent memory in commodity hardware allows byte-granular accesses to persistent state at
memory speeds. However, to prevent inconsistent state in persistent memory due to unexpected system
failures, different write-semantics are required compared to volatile memory. Transaction-based library
solutions for persistent memory facilitate the atomic modification of persistent data in languages where
memory is explicitly managed by the programmer, such as C/C++. For languages that provide extended
capabilities like automatic memory management, a more native integration into the language is needed to
maintain the high level of memory abstraction. It is shown in this paper how persistent software transactional
memory (PSTM) can be tightly integrated into the runtime system of Haskell to atomically manage values
of persistent transactional data types. PSTM has a clear interface and semantics extending that of software
transactional memory (STM). Its integration with the language’s memory management retains features like
garbage collection and allocation strategies, and is fully compatible with Haskell’s lazy execution model. Our
PSTM implementation demonstrates competitive performance with low level libraries and trivial portability
of existing STM libraries to PSTM. The implementation allows further interesting use cases, such as persistent
memoization and persistent Haskell expressions.
CCS Concepts: • Hardware → Non-volatile memory; • Software and its engineering → Runtime en-
vironments; Concurrency control; Multithreading; Memory management; Garbage collection; Allocation
/ deallocation strategies.
Additional Key Words and Phrases: persistent transaction support, non-volatile heap
ACM Reference Format:
Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann.
2021. Persistent Software Transactional Memory in Haskell. Proc. ACM Program. Lang. 5, ICFP, Article 63
(August 2021), 29 pages.

Most applications need to protect their internal state against data corruption. For example, a
financial application has to ensure that all acknowledged transactions of their customers are
correctly reflected in the corresponding balances even in case of a system crash. However, such a
crash results in irrevocable data loss if the application only resides in volatile memory and persistent
∗ Both   authors contributed equally to this research.

Authors’ addresses: Nicolas Krauter; Patrick Raaf, Johannes Gutenberg University, Mainz, Germany,,; Peter Braam, University of Oxford, Oxford, United Kingdom,; Reza Salkhordeh, Jo-
hannes Gutenberg University, Mainz, Germany,; Sebastian Erdweg, Johannes Gutenberg Uni-
versity, Mainz, Germany,; André Brinkmann, Johannes Gutenberg University, Mainz, Germany,


This work is licensed under a Creative Commons Attribution 4.0 International License.
© 2021 Copyright held by the owner/author(s).

                                      Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:2             Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

storage therefore has to be used to enable fail-safe execution. Persistent storage has been accessed,
until recently, mostly through the I/O path of the operating system which has high latencies and
primarily offers block granular access [Breuer 2003] for logging and synchronization.
   Emerging persistent memory (PM) in commodity hardware is byte-addressable, offers latencies
comparable to DRAM, and storage capacities that can significantly exceed DRAM. Persistent
memory can be transparently integrated into existing software architectures to protect application
state through the file system interface [Moti et al. 2021; Rao et al. 2014; Xu and Swanson 2016].
However, the full potential of PM can only be leveraged if applications directly access and modify
persistent data in-place so that data does not have to be synchronized between the volatile and
the persistence domain [Rudoff 2017]. Direct access to persistent state requires different write-
semantics compared to volatile memory to be able to recover from system crashes. Programmers
have to support power fail-safe atomic updates, correctly manage the interplay between volatile and
persistent memory, and steer cache flushes. In the last decade, a large number of libraries have been
developed to alleviate the unique challenges of PM [Brown and Avni 2016; Coburn et al. 2011; Liu
et al. 2017; Rudoff 2017; Volos et al. 2011]. These libraries offer suitable abstractions for languages
like C/C++, which support direct memory operations. However, library-based approaches are less
applicable in languages providing automatic memory management.
   In languages with automatic memory management, PM libraries cannot modify the underlying
runtime system to support PM and thus must implement their own separate memory model [Wu
et al. 2018]. Library users therefore need to explicitly identify and manage objects and data structures
that reside in PM and ensure their referential integrity. In contrast, the direct integration with
the language can sustain the original memory abstraction for PM, effectively reducing the burden
on the programmer. While considerable effort has been taken for imperative languages like Java
[Shull et al. 2019; Wu et al. 2020, 2018], the integration of PM into purely functional programming
languages has not yet been investigated.
   This paper focuses on the lazy, purely functional programming language Haskell [Hudak et al.
1992] and evaluates the idiomatic integration of PM into the Haskell language runtime. Transactions
offer proper semantics to express consistent state transitions [Haerder and Reuter 1983] and are
therefore used by many frameworks and library abstractions for PM. Haskell is an ideal candidate
for integration of persistent transactions as it already offers a well-accepted implementation of
software transactional memory (STM) in its runtime system [Harris et al. 2005] that enables the
concurrency-safe modification of memory contents. Haskell’s focus on pure functions limits the
scope of potential in-place modifications to specific mutable memory objects, even in its runtime
implementation. These can be controlled efficiently by individually tailored lightweight logging
solutions. Moreover, expressions of Haskell applications are represented in memory as a graph that
contains functions as well as data, which is reduced lazily in a call-by-need fashion. By automatically
persisting computations, execution state can be transparently shared across runs.
   We have extended STM transactions and their semantics in Haskell to provide atomicity, con-
sistency, isolation and durability, i.e. the well-known ACID guarantees, for PM. Our persistent
software transactional memory (PSTM) solution enables composable transactions in which persis-
tent Haskell variables can be created or modified. PSTM offers the same concurrency guarantees
as STM, provides almost the same interface as STM, and provides easy migration of STM code
to PSTM. PSTM ensures the recoverability of transactions in case of power-failures by correctly
ordering cache flushes to the persistence domain and lightweight logging of pointer updates of
persistent variables. Our solution is fully compatible with Haskell’s lazy execution model and PSTM
can be used to store arbitrary expressions in PM. In particular, unevaluated function applications
stored in PM are efficiently reduced on demand and retain their execution state across runs. This

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                               63:3

leads to interesting capabilities like persistent memoization and inherently restartable persistent
   Our PSTM solution is fully integrated into GHC’s runtime system, as it needs to understand the
difference between volatile DRAM and PM to support Haskell’s execution model as a first principle.
We have designed a hybrid heap where PM is provided as an additional, automatically managed
heap region. Memory allocation and garbage collection have been adapted within the PM region in
a way that the lifetime of objects must not be explicitly handled by the programmer but is defined
by their reachability.
   Our performance evaluation of PSTM shows a slowdown of persistent transactions compared
to a volatile STM implementation between 1.3𝑥 and 2.8𝑥 on the same memory technology. This
slowdown slightly increases when PSTM is running on today’s Intel Optane PM, which still has
increased access latencies compared to DRAM. A comparison with persistent C/C++ libraries, i.e.
Mnemosyne [Volos et al. 2011], PMDK [Rudoff 2017], Romolus [Correia et al. 2018] and OneFile
[Ramalhete et al. 2019], shows that the optimistic synchronization of PSTM outperforms dedicated
persistent libraries in the presence of many concurrent write transactions. Using two common
memoization libraries, we compared the performance of persistent graph reduction with their
volatile evaluation.
   Although our approach specifically targets the integration into the Haskell language, our concepts
can be applied to other (functional) programming languages. Limited mutability of well-defined
memory structures and reachability-based garbage collection can significantly reduce (transaction)
logging overheads and effectively allows to hide the specifics of PM from the programmer, e.g. by
automatically persisting related objects in a self-contained manner. Graph-based lazy execution
models can be adapted to automatically persist computation state across runs using only atomic
pointer updates.
   In summary, our work makes the following contributions:

    • Transparent PSTM extension of Haskell’s STM mechanism to provide power-fail safe persis-
      tence while maintaining composability
    • Integration of PM into Haskell’s heap design to enable the automatic memory management
      of persistent structures
    • Extending Haskell’s laziness capabilities to PM
    • Exemplary adaptation of existing libraries to PSTM and integration of persistent memoization
    • Performance evaluation and comparison of PSTM

The remaining paper is structured as follows: Section 2 presents the technical and methodological
background of this work. Section 3 discusses the design of the PSTM API and Section 4 presents its
implementation. Section 5 presents the results, including performance measurements of PSTM and
a comparison with library-based approaches. A summary is given in Section 6.

We will introduce a Haskell language-level abstraction to interact with PM. However, the devel-
opment of transparent persistent programming abstractions to interact with non-volatile storage
technologies has a long history of active research, which is described in the following section.
Subsequently, we discuss the peculiarities of PM as well as related work aiming to simplify its
usage. As our solution is integrated into the runtime system of the Glasgow Haskell Compiler [Hall
et al. 1992], we will give a basic overview of Haskell memory management and the principles of
the transaction mechanism STM.

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:4             Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

2.1    Persistent Programming
Programs that need to access persistent data mostly rely on dedicated I/O instructions that move
data between the persistent media and their volatile in-memory workspace. Explicitly interacting
with storage mechanisms outside the control of the application might result in hard to maintain code
bases, as correct handling of interfacing, typing, and concurrency have to be resolved. Persistent
programming languages can offer a much higher abstraction when interacting with persistent data
[Atkinson et al. 1982; Bläser 2007; Hosking and Chen 1999; Morrison et al. 2000]. They often rely on
the concept of orthogonal persistence introduced by Atkinson et al. [1983] and further described in
more detail by Atkinson and Morrison [1995]. It was derived from the language design principles
of correspondence, abstraction, and data type completeness by Tennent [1977] and Strachey [2000]
to offer the transparent integration of persistence within the language. Orthogonal persistence is
defined by three key principles:
   (1) The principle of persistence independence
   (2) The principle of data type orthogonality
   (3) The principle of persistence identification
Persistence independence relieves the programmer from explicitly managing data movement. In-
stead, data is moved automatically between transient and persistent storage when needed and the
interaction with persistent data resembles that of volatile data. Data type orthogonality means that
there should be no distinct types for volatile and persistent data to ensure compatibility. Persistence
identification relates to the ability of the system to identify objects that should be kept persistent.
Objects should survive as long as they can be referred to by the application. Usually, these are
identified by the reachability from a set of persistent roots. However, the concept of orthogonal
persistence is only an abstraction for the programmer. Internally, all data that needs to outlast
possible power-failures needs to be explicitly moved through the I/O path. For this, data often
needs to be transformed in such a way that consecutive runs can still interpret it. For example,
pointer swizzling [Kemper and Kossmann 1993], i.e. converting direct memory references to stable
identifiers, needs to be performed. This process is called serialization and needs to be designed in a
way suitable for the data structures.
   Persistent programming techniques were also investigated in functional programming languages,
e.g. by Harper [1985] who introduced a persistent heap to Standard ML. McNally and Davie [1991]
modelled a language that combines lazy functional programming with orthogonal persistence to re-
member executed computations persistently. Quintela and Sánchez [2001] presented an orthogonal
persistent implementation of Haskell also supporting laziness. Although orthogonal persistence
allows well-formed abstractions to interact with persistent storage, it does not necessary guarantee
consistency in terms of atomicity of interrelated operations or concurrency. Marquez et al. [2000]
have shown for Java that this issue can be resolved when orthogonal persistence is combined with
ACID transactions.

2.2    Persistent Memory
The term persistent memory covers a set of hardware technologies which offer storage-like persis-
tence and memory-like performance [Freitas and Wilcke 2008]. Presently, PM is provided in form
of DIMMs that reside on the memory bus. This allows the persistent modification of memory using
byte granular load/store instructions.
   Various PM technologies are under development, including phase-change memory [Raoux et al.
2008], spin-torque RAM [Kultursay et al. 2013], racetrack memory [Parkin et al. 2008], battery-
backed DRAM solutions, or Intel’s Optane DC persistent memory [Hady et al. 2017]. These solutions
differ in density and performance and cover different application areas. The widely available Intel

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                               63:5

Optane DC technology targets the server market and enables higher memory densities than DRAM
but its access latencies are also increased, e.g., by a factor of 4x for reads [Weiland et al. 2019].
   Application-layer access to PM can be established via the I/O path using a file system or by
directly mapping PM into the virtual address space of applications [Rudoff 2017]. The file system
approach offers the standard OS file operations and the consistency guarantees of the file system
apply [Dulloor et al. 2014; Moti et al. 2021; Xu et al. 2017]. Alternatively, mapping PM into the
virtual memory allows applications to directly use load and store instructions to access data with
no page cache involved and thus no further synchronization required. While this offers the fastest
possible access to PM, it also requires the application itself to ensure consistency, e.g., by correctly
flushing transient CPU caches to the persistence domain.
   Intel’s Persistent Memory Development Kit (PMDK) [Rudoff 2017] has been created to simplify
application development for PM. The low-level library libpmem within PMDK maps persistent
files into an application’s address space and provides low-level operations that follow the write
semantics of PM. Programmers can develop optimized PM algorithms which use explicit cache
flushes or memory fences without relying on hardware specific assembler instructions. Current
CPU architectures only support atomic writes up to the size of 8 bytes. Consistent modifications of
larger regions in PM therefore require additional techniques like logging. libpmemobj works on
top of libpmem and enables transactions that support arbitrary large atomic updates of multiple
structures by maintaining an undo log. As the virtual address mapping can change across runs
traditional pointers might become invalid. libpmemobj preserves the referential integrity of data in
PM by 128-bit persistent pointers with position-independent offsets that are translated at runtime.
   Other libraries operate at a similar abstraction level and offer scalability by integrating STM-
based transaction mechanisms [Felber et al. 2008; Gottschlich and Connors 2007; Herlihy et al.
2003; Saha et al. 2006]. The C library Mnemosyne [Volos et al. 2011] enables fine grained PM
transactions by relying on tinySTM [Felber et al. 2008] for concurrency control. OneFile uses a
redo-log that utilizes dual word compare-and-swap (CAS) operations [Ramalhete et al. 2019]. The
write-sets of transactions are shared across threads, so that they can be processed cooperatively.
Romulus allows consistent modifications by using two replicas of each data structure [Correia et al.
2018]. Modifications are performed in place on the first replica and are synchronized on commit
with the second. OneFile and Romulus only control memory accesses to variables wrapped in a
specific persistent class. However, they still require to explicitly allocate objects which should
reside in PM. Furthermore, all referenced structures from a persistent class have to be persistent to
be recoverable in case of a crash. Language-level abstractions like AutoPersist in Java by Shull et al.
[2019] significantly reduce the burden for the programmer. They leave the responsibility for moving
interrelated objects to PM to the runtime system, so that a self-contained state is automatically
ensured. AutoPersist and the C++ library NV-Heaps [Coburn et al. 2011] automatically garbage
collect data structures based on their reachability, effectively preventing the programmer from
persistent memory space leaks.

2.3   Haskell and Its Memory Management
Haskell is a functional programming language whose expressions operate on immutable data by
transforming data to new data. This is in contrast to the imperative instruction set of modern
processors which use statements to mutate data in-place. It is therefore the responsibility of
the Haskell compiler and runtime system to map the language’s execution model to the actual
architectures. We here focus on the popular Glasgow Haskell Compiler (GHC) [Hall et al. 1992].

2.3.1 Heap Layout. GHC represents a Haskell program as a graph in the heap that is transformed
during its execution by the abstract Spineless Tagless G-Machine, a graph reduction machine that

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:6                  Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

consists of a set of registers, a stack, and a heap [Peyton Jones 1992]. Haskell uses a non-strict
semantic and the graph contains data as well as unevaluated function applications, i.e. computations,
which can be lazily reduced as needed. While there are many different types of graph objects the
most important types for our discussion are:
     • Constructors correspond to values of immutable Haskell data constructors. They are stored
       in weak-head normal form (WHNF), which means that they cannot be reduced further as an
       outermost expression, but could still contain references to unevaluated expressions inside
       the graph.
     • Functions represent Haskell functions and their environment, i.e. their free variables, which
       are variables not bound to an input argument. The variables bound to arguments are applied
       explicitly via the stack.
     • Thunks resemble reducible function applications. Evaluation yields the corresponding result
       in WHNF. Thunks are replaced by indirections that refer to the result. During evaluation,
       they might be replaced by blackholes, which prevent other threads from evaluating the
       same thunk again.
   Heap objects in Haskell, called closures, have a standardized memory layout independent from
the stored data [GHC Team 2020b]. A closure contains data and references to other closures in
a payload section. Its structure is described in the closures info table. Each closure points to a
block of code associated with it, which is called entry code. For thunks this code describes the
sequence of instructions needed for their evaluation. The evaluation might result in the allocation
of new closures. Info tables and the entry code are shared amongst closures of the same type. They
are generated at compile time and reside in the binary which is mapped at runtime. This static
region also contains closures for expressions known at compile time which do not require dynamic
allocation, such as top level definitions. However, they share the layout of heap objects.
   Figure 1a shows a simplified example of a Haskell expression and its representation in memory.
Here 𝑥 = [1..] is the infinite list with only the first element, 1, being evaluated. The list consists of
all previously described closure types. The head of the list is a constructor that points to the first
element as well as a thunk, i.e. the unevaluated tail of the list. The thunk points to the function
𝜆𝑥 → 𝑥 + 1 that is used during evaluation to create a new list element with the predecessor as
argument. Figure 1b shows the list after the evaluation of the next item by forcing 𝑦. The thunk is
replaced by an indirection to the result, a new constructor pointing to the element as well as to a
new thunk representing the new tail of the list. It can be observed that 𝑦 evaluates to a reference to
the list element 2 instead of generating a new closure. This concept, called sharing, is heavily used
in GHC and possible because of the immutability of values.

                                                                      x                                 let y = x !! 1
            x                              let x = [1..]                        Cons                    print y

                                                                          Int      1            Ind
                Int      1       Thunk   empty

                                                                                         Cons                            \x → x+1
                                                           \x → x+1

                                                                                   Int      2         Thunk      empty

                (a) Partially evaluated infinite list                     (b) List after forcing the next item

       Fig. 1. Example of GHC’s memory representation and lazy graph reduction on an infinite list

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                               63:7

2.3.2 Allocation and Garbage Collection. During the reduction of the closure graph new closures
are created while others become unreachable. GHC’s memory management subsystem provides
functionality for closure allocation and recovers unused memory through a garbage collector.
Allocation is realized through a layered approach. Fixed-sized megablocks are allocated from the
operating system. Megablocks are internally divided in smaller regions, called blocks. Blocks are
described by metadata residing at the beginning of the megablock [GHC Team 2020a]. A Haskell
program is decoupled from these allocators; closures are created using a bump allocation strategy
in a per-thread region called nursery. It is provided by the runtime system as a contiguous region
of blocks where memory is claimed from within the entry code by simply incrementing a pointer.
   GHC uses a stop-the-world generational copying garbage collector to reclaim unreachable
closures. It traverses the heap starting from a set of roots of the generation that should be collected
and retains all reachable closures by moving them to a newly allocated memory region. Afterwards,
the blocks of the old one are reclaimed [Marlow et al. 2008]. In contrast, the recently introduced
Alligator collector by Gamari and Dietz [2020] inspired by the work of Ueno and Ohori [2016]
uses a mark-and-sweep based collection strategy, which allows to reclaim occupied space on a
per-closure basis concurrently to the mutator. It is integrated cooperatively with the regular copying
collector and can be chosen as an alternative collection strategy for the oldest generation. Reachable
closures are retained in place, significantly reducing copy overheads for the most long-living objects.
Alligator provides an own allocator for its non-moving heap on top of the block layer. It divides the
memory region into segments with fixed-sized sub-blocks. A sub-block stores exactly one closure
and differently sized closures are supported by segments with different block sizes.
2.3.3 Software Transactional Memory. Haskell focuses on immutable data structures, whereas
some problems are more easily expressed when using mutable state. To revisit the example from
the introduction of a financial application, the account balances might be stored as mutable state.
Threads processing transfers concurrently need to be isolated from each other to ensure consistency.
   STM enables programmers to express consistent state transitions through transactions [Herlihy
and Moss 1993]. If a thread needs to modify a set of variables atomically, the modifications are
declared in a transaction block. From the view of all other threads these changes happen atomically.
    data TVar a
    newTVar ::        a → STM (TVar a)
    readTVar :: TVar a → STM a
    writeTVar :: TVar a → a → STM ()
    atomically :: STM a → IO a
                                    Listing 1. Excerpt of Haskell’s STM API

   Haskell’s implementation of STM provides the high level API shown in Listing 1 [Peyton Jones
et al. 1996]. The mutability is limited to transactional variables called TVars. STM provides its own
monad, which allows the composition of operations that create, read, or modify TVars. Additionally,
it only allows revertible actions. This is for the reason that a transaction might try to commit
multiple times because used variables are changed by other threads in the meantime. STM can
only undo modifications to TVars, so the scope of STM is limited to these changes. This excludes
operations such as printing to a screen or storing some data on disk. atomically is used to perform
a series of STM actions atomically with respect to concurrency. The account example from before
can be implemented as shown in Listing 2. An account with an initial balance is created using
createAccount. Funds are moved between accounts with transfer.
GHC’s runtime provides the mechanisms to execute transactions in an atomic manner. A transaction
is first executed in thread local state by maintaining a local log containing the transactions read and

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:8              Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

       type Acc = TVar Int

       createAccount :: Int → IO Acc
       createAccount b = atomically $ newTVar b

       transfer :: Acc → Acc → Int → IO ()
       transfer a1 a2 amount = atomically $ do
                                   b1 <− readTVar a1
                                   b2 <− readTVar a2
                                   writeTVar a1 (b1 − amount)
                                   writeTVar a2 (b2 + amount)
        Listing 2. Example of a simple STM transaction transfering 𝑎𝑚𝑜𝑢𝑛𝑡 between accounts 𝑎1 and 𝑎2

write set. When either readTVar or writeTVar are called, an entry is added that tracks the current
as well as the potential new value of the variable. Finally, STM tries to commit the transaction in
order to update the global state. The involved variables are locked and their global current value is
checked to match the expected value. When no conflicts are detected, the global TVars are updated
and the locks are released. In createAccount no conflicts can occur as only new, thread-local
TVars are created. When transfer transactions are executed concurrently, conflicts are detected in
the validation phase. A transaction is restarted if the global state has changed during thread-local
   Libraries such as TCache [Corona 2017] allow STM transactions to interact with persistent data.
Transactions operate on a global volatile cache that is synchronized with a persistent backend.
However, TCache has not been designed for byte-addressable memory and it relies on serialization.
This restricts the possible values which can be stored in the backend and limits the lazy evaluation
capabilities of Haskell. TCache requires an explicit name for every persistent variable to allow
identification across runs. This also requires the adaptation of applications to explicitly interact
with persistent state, e.g. by managing the lifetime.

Compared to languages that allow explicit memory modifications, the high abstraction of functional
programming languages from the underlying memory requires a different approach to integrate
byte-addressable PM idiomatically. PM allows the direct modification of memory contents from
within the application without the need for any synchronization between the transient working
space and the persistent backend. Haskell nevertheless enforces the purity of values and has no
assignment operator to directly modify data. In this section we will present the design of PSTM
and its interface which aim to provide a concise abstraction of transactions to consistently interact
with PM while relieving programmers from HW-related details. In our design, PSTM serves as the
main programming model to persist and retrieve Haskell expressions and its relation to orthogonal
persistence will be described subsequently.

3.1      Key Design Principles
We identified the following key principles (KPs) during the design of this high-level abstraction:
   (1) Composability: Persistent transactions should be composable with other persistent and
       volatile transactions.
   (2) Direct Access: PM’s byte-addressable access semantics should be leveraged to avoid unnec-
       essary synchronization between volatile and persistent memory.

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                                63:9

  (3) Power-Fail Safety: Persistent data stored in PM should always be recoverable. PSTM needs
      to preserve two invariants at any point in time:
        (i) Modifications of data in the persistent region of the heap always ensure consistent state
       (ii) There are no unrecoverable pointers from PM to the volatile heap, as they lose their
            meaning across runs.
  (4) Identification: Persistent data needs to be retrievable in consecutive runs by providing a
      possibility of discovery.
  (5) Lifetime: An automatic memory management should preserve persisted data across runs
      while memory for structures which are no longer reachable should be automatically reclaimed.
  (6) Safety: The PSTM abstraction should allow to express consistent state transitions while
      reducing the potential of programming errors.
  (7) Compatibility: The native execution model of Haskell should be preserved as much as
  (8) Security: Persistent data has to be protected against unauthorized access and modification.

3.2     Persistent Transaction Interface
These considerations are reflected in our design of the Persistent Software Transactional Memory
(PSTM) interface, an extension of Haskell’s STM implementation. PSTM closely resembles the
interface of STM and programmers can simply define what variables inside a transaction should be
persisted rather than how to persist them. Listing 3 shows our interface extensions to STM.

      data PTVar a
      getRoot :: a → IO (PTVar a)

      newPTVar :: a → STM (PTVar a)
      readPTVar :: PTVar a → STM a
      writePTVar :: PTVar a → a → STM ()

                                    Listing 3. PSTM API extension of STM

   We introduce new persistent transactional variables (PTVars) analogous to STM’s TVars, but
residing in PM. Existing programs can be easily adapted to use PSTM by exchanging TVars with
PTVars without changing the semantics of the program. The use of the STM monad also for PTVars
enables the composition of joint atomic transactions including volatile TVars and persistent PTVars
(KP 1). All modifications of PTVars in a single transaction are performed power-fail safe (KP 3) and
the transitive persistence of newly assigned expressions is automatically ensured (invariant 3i).
   Expressions are persisted by transitively copying them into PM along with all volatile parts of
their environment when the IO action atomically is called to commit the transaction. As this
process is hidden from the programmer, unexpected overheads can emerge during the deep copy
of possibly unevaluated expressions to PM. Currently, this is the default trade-off we have chosen
in order to preserve the abstraction from the memory layer. For more fine-grained control over
structures entering PM, deep evaluation can be optionally enforced by the programmer to prevent
performance impacts for expressions that are comprised of large environments.
   readPTVar establishes a direct reference to an expression in PM (KP 2) which can be passed
outside the transaction by using atomically. By differentiating between TVars and PTVars at type
level, the programmer can be certain whether interacting with persistent or volatile state (KP 6).

                                 Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:10            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

          main = do
              root <− getRoot Nothing                                volatile heap              persistent heap
              rval <− atomically $ readPTVar root                                        root
              x <− case rval of                                            x                     PTVar1         Just
                  Nothing → atomically $ do
                                 pv2 <− newPTVar 5                                               PTVar2           5
                                 writePTVar root (Just pv2)
                                 return pv2                                y               PTVar3     Nothing
                  Just pv2 → return pv2
              y <− atomically $ newPTVar Nothing

                    Fig. 2. Creation of structures spanning volatile and persistence domain.

3.3     Identification of Persisted Data across Runs
PSTM must offer a possibility to retrieve variables across runs (KP 4). The PM region is obtained
by the application via a file on a filesystem which allows direct mapping and ensures valid access
permissions (KP 8). The root PTVar returned by the getRoot function provides the necessary
anchor to persistent data. It is assumed to act as an always existing reference into the persistent
heap while its value might need initialization. The programmer can flexibly define the type of value
attached to the root variable. This is done by passing an initial expression to the getRoot function
that determines the structure of the value, e.g. a list of other PTVars. Once it is assigned by either a
previous program run that used the same PM region or the first call of the function, further usages
refer to the already set root and the passed initialization parameter is ignored. Our general advise
is to use getRoot just once during the initialization phase of the program in order to obtain the
existing root or initialize a new root otherwise.
   The automated memory management reclaims all structures in PM which are no longer reachable
from either the volatile heap or the persistent root. That is, all persistent data that should be
accessible across program runs must be made reachable from the root (KP 5).
   Retrieving data structures across runs is a trivial task for programs that persist a single or only
few data structures. In the example in Figure 2, PTVar1 is returned by getRoot. PTVar2 can be
reached from it in consecutive runs. PTVar3 is also stored in PM and is directly referenced by y (KP
2), but it is not available in succeeding runs as it has no connection to PTVar1. In theory, persistent
expressions that are only referenced from the volatile expressions could be moved back to the

      getPTVar :: String → IO (Maybe (PTVar Int))
      getPTVar key = do root <− getRoot Map.empty
                        atomically $ do
                            map <− readPTVar root
                            return (Map.lookup key map)

      insertPTVar:: String → PTVar Int → IO ()
      insertPTVar key tv = do root <− getRoot Map.empty
                              atomically $ do
                                  map <− readPTVar root
                                  let map' = Map.insert key tv map
                                  writePTVar root map'
              Listing 4. Providing named access to PTVars by storing a map in the root variable

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                             63:11

volatile heap. However, by keeping them in PM the persistent heap can also function as a volatile
heap extension (PTVar3 in our example). PSTM currently only supports multiple runs of the same
binary for implementation reasons (see Section 4.1), thus static type checking is sufficient as no
external application can alter the types defined during compile time. In our example, the root PTVar
has the static type Maybe (PTVar Int) across runs.
  As PTVar2 was recovered directly through PTVar1 no explicit persistent identifiers were required.
However, more complex applications might need additional mapping structures to reload an
arbitrary set of PTVars between runs. Complex mappings can be flexibly built on top of the root
variable. For example, we can provide named access to PTVars by storing a Map String (PTVar
Int) in the root variable as shown in Listing 4.

3.4 Persistent Laziness
With PSTM even functions and unevaluated expressions can be assigned to PTVars and thus
be stored in PM, e.g. infinite lists. Our design is fully compatible with the lazy execution model
of Haskell by providing the corresponding runtime support (KP 7). Persistent expressions are
reduced as needed and are directly replaced by their results without any programmer intervention
required. This enables the development of applications that are inherently restartable and remember
execution state across runs.
   Listing 5 shows a simplified time integration implementation based on a particle simulation.
In the first run (when the root variable is []), we load the initial particle state, possibly from an
external configuration file. We then generate an unevaluated infinite list of steps using iterate and
store it in the root variable. After initialization, we request the result of the 1000th time step such
that all preceding time steps are executed lazily and their results are persisted. If the application is
interrupted in between, the evaluated particle configurations of the already performed time steps
are kept in the list and the computation resumes from the last executed step seamlessly.
   In general, storing unevaluated expressions in PM can trigger multiple copy processes, one for the
unevaluated structure and one additional for its result value after evaluation. In some circumstances
it can be beneficial for experienced programmers to enforce only strictly evaluated expressions
to be stored in PM. However, our data structure benchmarks (see Section 5.2) have shown that
unevaluated structures can also have more compact memory representations and thus offload
work from the transaction mechanism. Whether strict evaluation is more efficient or not highly

    doStep :: [Particle] → [Particle]
    doStep pl = ...

    main = do
          root <− getRoot []
          rootv <− atomically $ readPTVar root
          when ( rootv == []) $ do
              initialPState <− loadInitialConfig
              let stepList = iterate doStep initialPState
              atomically $ writePTVar root stepList

          −− retrieve list of steps (of type [[ Particle]])
          steps <− atomically $ readPTVar root
          print ( steps !! 1000)
               Listing 5. Example of a persistent and restartable time integration application.

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:12            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

depends on the computational overhead of the lazily evaluated structures compared to the copy
overhead, whether they will definitely be used, and on the memory footprint of the unevaluated vs.
the evaluated structure.

3.5     PSTM and Orthogonal Persistence
We describe our solution as nearly achieving orthogonal persistence. Due to direct access to per-
sistent expressions and transparent movement of values into PM, no explicit synchronization is
required by the programmer and thus persistence independence is provided. Moreover, by providing
automatic memory management to manage the lifetime based on reachability from either the
volatile heap or the persistent root variable, persistence identification is guaranteed. However, there
exist certain data types that we currently do not support to be persisted, weakening data type
orthogonality. These are mutable variable types in PM distinct from PTVars, such as MutVars (mu-
table references), mutable Arrays, or MVars which are variables, used for communication between
concurrent threads. These could be integrated by implementing specialised update mechanisms
that guarantee power-fail safe consistency. For MVars that are used for blocking of threads albeit
the question arises how reasonable it would be to persist them. However, they could be released
by the recovery mechanism to prevent them from blocking across runs. We focused this work on
transactional variables which are more universal and offer higher consistency guarantees. We have
chosen a rather explicit approach to persistence by differentiating between TVars and PTVars. In
principal, they could be expressed by the same type and truly orthogonal persistence would be
achieved by maintaining their persistence in correspondence to their reachability from the root
variable and move them to PM transparently.

The runtime system (RTS) of Haskell controls the execution of Haskell programs, provides transac-
tion support, and performs automatic memory management. The Haskell RTS did not include any
support for PM before the introduction of our PSTM extensions. We have therefore extended the
STM implementation to support persistent transactions and GHC’s memory management by an
additional heap that resides in PM. To fully support Haskell’s call-by-need semantics, we allow the
lazy evaluation of persistent expressions in PM. Finally, a new recovery step ensures consistent PM
state across program runs.

4.1     Persistent Heap and Automatic Memory Management
We introduce a hybrid heap design which extends the volatile heap with a persistent region. It
resides in a file passed as a runtime parameter (or created in the first run) which is directly mapped
into the virtual address space and can be accessed across runs. At program startup, the volatile and
the persistent heap are mapped to fixed, subsequent address ranges. The fixed mapping ensures
that references in PM retain their validity across runs and eliminates the need for offset-based
pointer resolution at runtime.
   A Haskell program is represented as a graph in volatile memory. Most closures, i.e. constructors
and functions, in this graph are immutable. Only variables and unevaluated expressions, i.e. thunks,
are mutated in-place. Assignments in transactions trigger the automatic copy process of these
expressions to PM and allow to replicate parts of the volatile graph in PM. All memory objects in
PM are stored in their native closure layout and can be directly accessed, so that data does not have
to be serialized before it is copied to PM. This native closure layout contains pointers to code in the
statically-linked binary, which currently restricts PM structures to be used by a single application
[Berthold 2010; Yang et al. 2015]. To support multiple binaries the code needs to be shared between
them, e.g. through dynamic linking, and the compatibility of Haskell data types across binaries

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                             63:13

must be ensured as they are not stable across recompilation. One solution could be the approach of
the Unison language [Chiusano 2020], which leverages hash based identifiers for definitions and
data declarations. These are stable across renaming and recompilation and allow sharing of the
code base between applications. When resolving these issues, distinct binaries that act on the same
PM region could be supported by including dynamic type-checkers [Abadi et al. 1991; Cheney and
Hinze 2002; Connor 1991]. The transaction mechanism and memory management on the same
region nevertheless would need to be externalized if concurrent usage is intended.
   Haskell programs constantly create new subgraphs of closures while others become unreachable.
We provide automatic memory management for closures residing in PM to ensure that the user does
not have to explicitly manage their lifetime. The Haskell RTS uses a stop-the-world generational
copying garbage collector for its volatile heap as default. Repetitive copying of data structures in
PM, which possibly outlive multiple runs, could induce a large overhead. We adapted the recently
introduced nonmoving Alligator collector [Gamari and Dietz 2020] to mitigate these overheads.
The nonmoving collector is integrated with the generational copying collector and operates on the
memory region containing the oldest generation. A separate thread is used to concurrently garbage
collect a snapshot taken during the stop-the-world phase of the generational copying collector. The
snapshot state contains the collection roots, i.e. the closures in the nonmoving region that were
found to be reachable by younger generations.
   As a proof-of-concept, we integrated the PM heap as part of this oldest generation. A closure
in PM is alive if it is reachable from either the persistent root or structures in the volatile heap.
Garbage collection of the PM heap is triggered with the regular collection cycles. However, the
nonmoving collector currently does not limit the heap size, whereas the size of the PM file is limited
at runtime to prevent it from growing indefinitely and persistently occupy large amounts of PM. We
introduced an additional triggering mechanism when running out of persistent heap space. Haskell
threads that require PM allocation are then suspended until the triggered nonmoving garbage
collection finishes. In future versions, they could cooperate with the collector thread to speed up
the collection and reduce the resulting mutator pause times.
   We introduce a new PM heap allocator extending the design of Alligator’s class-based allocator
which serves allocation requests in the nonmoving heap. The PM heap is structured in correspon-
dence to the layout assumed by the RTS, i.e. block descriptors at the beginning of each megablock
boundary are provided. It is further divided into segments that provide sub-blocks used to allocate
individual closures. Each segment has a state that describes its occupancy. Our allocator resides
in PM and the allocator state needs to outlive one run. Internally, states are represented as linked
lists, e.g., the global list of free segments, a thread-local list of partially filled segments or the list
of segments currently under collection. To prevent memory space leaks while moving segments,
we use a redo log in PM that stores state changes so that they can be recovered in the event
of a crash. The liveness of closures within a segment is described by a bitmap in the segment’s
metadata and updated during collection. Keeping those liveness bitmaps of the segments power
fail-safe consistent would result in large logging and flushing overheads in the collectors mark
phase. Instead, we decide to restore valid bitmap states during an initial heap traversal during
recovery (see 4.5).

4.2 Persistent Transaction Support
Figure 3 shows the execution of a PSTM transaction inside the RTS. Figure 3a shows the initial
memory state. The thread-local execution of the transaction creates a log that captures the read and
write set of the transaction using references to the respective closure graphs in memory, shown in
Figure 3b. In these phases the native STM mechanism is leveraged and there is no difference for
PSTM, except for the possibility to create new PTVars in PM.

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:14            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

               atomically $ do                               variable   current new
                    writePTVar ptv1 v1'
                    readPTVar ptv2                            &ptv1      &v1                   v1'
                                                              &ptv2      &v2
               volatile Heap
               persistent Heap

                    ptv1                 v1                      ptv1             v1

                    ptv2                 v2                      ptv2             v2

                           (a) initial state                       (b) local execution (volatile)

                  variable     current new
                   &ptv1         &v1     &v1'          v1'
                   &ptv2         &v2      &v2
                                                                                           volatile Heap
                                                                                         persistent Heap
                                                        Redo log
                     ptv1                v1             PTVar val              ptv1             v1
                                        v1'                                                     v1'

                     ptv2                v2                                    ptv2             v2

                               (c) copy-and-log phase                            (d) after commit

Fig. 3. Example PSTM transaction: changes are prepared locally in the volatile heap in (a) and (b). In (c) the
valid transaction commits. Values are moved to PM and changes are redo-logged. In (d), the previous values
(grey) can be reclaimed by the automatic memory management.

   A transaction enters the validation phase after its thread local execution (step between Figures
3b and 3c). The validation step locks all involved variables and compares them to the log’s read
set to ensure that these variables have not been modified by concurrent transactions. STM locks a
TVar by overwriting its global value field with a pointer to the thread local log containing its actual
value. If the validation fails, all involved TVars are unlocked by recovering the value from the log
and the thread local execution is restarted. We do not persist the transaction log; thus pointing
from the value field of a PTVar to it would result in data losses between runs and the actual value
cannot be recovered. We have introduced a separate lock field inside PTVars that ensures that the
global value is never overwritten to allow recovery after power failures. To choose the appropriate
locking scheme TVars and PTVars are distinguished based on their memory location.
   A transaction may commit if the validation succeeds. We have extended the commit logic of STM
by a copy-and-log phase for all involved PTVars to ensure that they always reference values in PM
(Invariant 3ii). We transitively copy all reachable volatile closures to PM if a newly assigned value
resides in the volatile heap (Figure 3c). The copy process can be truncated every time it reaches
closures which already reside in PM as these closures may only point to other persistent closures.
Postponing the copy process to the commit phase avoids unnecessary copies of thread-local state
due to re-execution. However, some closures might be currently under evaluation by another thread
and might be blocked by a blackhole. These cannot be copied power-fail safe in the current RTS

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                             63:15

implementation, as they neither contain a reference to the original thunk computation to revert
them, nor represent a valid result and thus have no meaning across runs. Instead, we restart the
transaction if a blackhole is encountered and block on it until the pending computation is finished.
Similarly, the transaction is restarted when the copy process fails for allocation reasons and a
garbage collection is required.
   The following four steps need to be performed to update PTVars in a power-fail safe manner
(Invariant 3i):
  (1) Copy newly assigned values of all involved PTVars transitively into PM and record the
      mapping of modified PTVars to new PM replicas in a redo-log
  (2) Activate the redo-log; from this point all PTVars are guaranteed to receive the update
  (3) Update all modified PTVars to their new value
  (4) Invalidate the redo-log
Before each step, all modifications of the previous step need to reach the persistence domain. We use
libpmem’s (which is part of Intel PMDK) optimized operations to asynchronously flush all changes
performed by the prior step from the CPU cache to PM. This is followed by a memory fence to ensure
a consistent ordering with writes of succeeding steps. Note that we only log updates of PTVars, but
do not require to control updates of their values as they are immutable, except for unevaluated
computations that are controlled separately. The commit phase is completed by unlocking all TVars
and PTVars. As we use redo-logging of the write-set we can unlock read-only variables directly
after the validation phase without violating transaction ordering as an optimization to improve
concurrency. Figure 3d shows the memory state after the transaction. Unreachable structures in
the volatile or persistent heap, including those created by transaction restarts, will be reclaimed by
the automatic memory management.
   Concurrency-safety is subject to the original STM mechanism and controlled per transactional
variable identified by its memory address. PTVars are created directly in PM and their memory
addresses do not change (e.g. by moving them from the volatile heap to PM). Moreover, TVars as
well as PTVars are locked during validation and commit phases. Flushing PTVar lock fields is not
required (in fact, they could also be volatile), as a consistent view on them is ensured by cache
coherence at runtime and they are reverted on crash recovery. Thus, the concurrency control is
not impaired. PTVars are not locked by overwriting the global value field. This does not impair
atomicity as the validation of local execution is only performed when all locks are held, thus the
read set is compared with an atomic view of the global state. Regarding power fail-safe atomicity,
Step 1 does not change any variable directly, while Step 2 activates the log in an atomic operation
and contains all information to reflect the updates in case of power failures and the transaction
thus is guaranteed to commit. Although the new values are not necessarily reachable for automatic
memory management as they are not integrated into the closure graph, no snapshot can be taken for
the garbage collector while a transaction commits and thus the new values survive until successful
commit or potential recovery. If Step 3 is interrupted, reassignment of any value during recovery
does not collide with other changes, as logically PTVars were not released and no other transaction
was allowed to modify them before the log was disabled in Step 4. Thus updates by PSTM are
durably consistent and the ACID guarantees are fulfilled.

4.3   Unevaluated Expressions in PM
Unevaluated Haskell expressions are represented as thunks. Figure 4 shows the evaluation of thunks
in PM which are performed orthogonal to the transaction mechanism. The entry code of a thunk
defines how the result value should be computed. For each thunk that needs to be evaluated, a
special stack frame, called update frame, is pushed onto the stack (Figure 4a). It is responsible for

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:16               Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

         volatile heap        persistent heap                                    volatile heap    persistent heap
                                                               UpdF                                   Thunk
             UpdF                Thunk
                                                               UpdF              Thunk
                                                                                                 Cons   5
                             Cons     5                                  Thunk        \x → x+1

                                                                                                      \x → x*2+1
                                      \x → x*2+1
                                                                      Cons   5     \x → x*2

            (a) Unevaluated PM thunk                   (b) Volatile intermediate structures created during evaluation
                      volatile heap       persistent heap                        volatile heap   persistent heap
            Stack                                                       Stack
             UpdF                             Thunk                                                   Ind

                           Ind                                                           Ind

                                          Cons     5
                         Cons 11                                                     Cons 11       Cons 11
                                                 \x → x*2+1


        (c) Intermediate thunks were updated to result           (d) The PM thunk was updated by its update frame

Fig. 4. Example of persistent thunk evaluation: when a thunk is evaluated, an update frame is pushed (a)
and the thunk’s entry code is entered. This might create short-lived intermediate structures (b), which are
reduced subsequently (c). Finally, the PM thunk’s update frame is again the top-most stack frame and the
evaluated result is copied and integrated into the graph (d).

integrating the evaluation result into the closure graph and updating the thunk with an indirection
to its result value.
   We have extended the logic of update frames to support lazy evaluation for persistent thunks.
PM thunks can create many very short lived intermediate thunks which represent sub-expressions
as illustrated in Figure 4b. We store results of intermediate thunks (Figure 4c) in fast DRAM and
only the final results are copied (and flushed) transitively into PM (Invariant 3ii). Thunks in PM can
then be updated with an indirection (Figure 4d) through a single and power-safe pointer update.
Our automatic memory management supports to shortcut these indirections using atomic pointer
updates allowing them to be reclaimed.
   Thunks describe pure function applications and they can be evaluated multiple times by concur-
rent threads without changing their result. However, thunk headers can be replaced by a blackhole
header to reduce compute overheads. This happens outside of the control of the update frame. The
original thunk header would, without further protection, be lost if a power failure occurs while it is
blackholed. We therefore store the original thunk info pointer as undo-information in front of the
closure when copying the thunk to PM. Thus, blackholes can be reverted to thunks in succeeding

4.4      Sharing of Structures in PM
In Haskell, the duplicate creation of the same (sub-)expression can be avoided by sharing heap
structures. Top-level definitions are shared globally while definitions through let are shared in

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                                63:17

their appropriate scope. If a shared expression is a thunk, it is evaluated once and all subsequent
usages refer to the result. PSTM often implicitly copies volatile expressions to PM by assigning
them to PTVars. This can lead to multiple evaluations of thunks as a volatile as well as a persistent
version exists. However, Haskell’s purity ensures that this only creates additional work and all
evaluations lead to the same result.
   We preserve sharing within a given PSTM transaction, i.e. volatile sub-expressions reachable
from multiple PTVars in the same transaction are only copied once. Internally, we use a map that
tracks the location of all newly copied values. This also allows us to copy cyclic structures as we can
trace whether a closure was already copied. We do not preserve sharing across multiple transactions
to avoid maintaining a global mapping of all closures and their PM replicas. Efficiently sustaining
such a mapping across runs would indeed be challenging as closures can be dynamically allocated
and are identified by their memory addresses, which are not unique between runs. Here, proper
indexing mechanisms to compare and find closure (sub-)graphs that contain the same expressions
are required. Assigning the same volatile value in distinct transactions therefore creates multiple
copies of this value. The programmer can prevent this overhead by assigning a volatile value to a
PTVar in a separate transaction to retrieve a direct reference before assigning its, now persisted,
value to other PTVars in following transactions.
   Care needs to be taken for top-level definitions which are allocated statically and their memory
locations are thus known at compile-time. Resulting static closures are created as part of the binary
in a section called static region. In contrast to references between heap closures, they are not
necessarily referenced via other closures’ payloads as an optimization of the compiler. Instead, the
entry code of a closure can hold direct memory references to these static closures [Ağacan 2020].
We copy a heap closure as well as all closures referenced from it to PM and reflect the previous
relation by updating the payload pointers. However, entry code references cannot be easily updated
on a per-closure basis as done for payload pointers as the same entry code might be shared between
multiple closures. This is illustrated in Figure 5, where a volatile heap structure as well as its PM
replica is shown. Both still point to the same entry code and consequently to the same static closure.
This is not an issue for immutable static closures such as constructors and functions as they are
available in every run. However, also mutable static thunks, called constant applicative forms (CAFs),
exist. CAFs are normally replaced by indirections into the volatile heap during their evaluation.
Already evaluated results of CAFs are therefore lost in succeeding runs when the static region is
reloaded from the binary. We copy CAFs or their results which are reachable from a persistent
closure to PM and update the original CAF to an indirection to the new PM CAF (illustrated by

                                  volatile Heap                             persistent Heap
                                         fun / thunk                               fun / thunk
                                                   ...                                        ...

                                                closure                       closure               caf

                                            &itbl2       ...               &itbl2       ...

                                                                               static closure
                          ...    itbl1     code1 ...           itbl2   code2 ...                     ...
                                static region

Fig. 5. CAFs are copied to PM if reachable from a PTVar and replaced with a persistent indirection (red

                                   Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:18               Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

the red arrow). The mapping between the original CAF and the PM CAF is held persistently to
allow its recovery across runs. Moving CAFs to PM ensures that effectively only one version, either
volatile or persistent, exists and thus the sharing of static closures is preserved.

4.5     Recovery Mechanism
We perform recovery during runtime system initialization. PSTM logs can be applied in any order
to redo uncommitted changes as each PTVar is at most contained in one log. Note that there is no
need to revert interrupted copy processes. Structures are either copied completely before they are
made reachable in the live PM graph or are automatically garbage collected. If the program crashes
while a segment state is changed by our memory management, i.e. while it is moved from one state
list to another, the move operation is completed after the crash.
   Since the locks of PTVars are part of the PTVar structure, they must be reverted to a valid
state which is done in a traversal of the PM heap starting from the root. Here, also interrupted
computations which are still blocked by blackholes are reverted to the original thunk. Additionally,
all CAFs that are reachable from the persistent root are restored. This process has an overhead
comparable to the mark phase of the garbage collector as each closure is traversed once. In fact,
the process is also used to set the segments liveness bitmaps as we do not rely on flushing them to
PM consistently. Running the recovery mechanism could be often avoided as an optimization by
identifying graceful application shut-downs. This would require to track the CAFs reachable from
PM and flush the valid segment bitmaps before shutdown.

In this section, we first compare PSTM with publicly available library approaches that offer persistent
transactions and analyze the overhead of garbage collecting PM structures. Next, we empirically
measure the overhead of persistent compared to volatile STM transactions. Additionally, the
cost of persistent lazy evaluation is measured based on persistent memoization. We provide our
implementation of PSTM, all benchmarks as well as pre-built artifacts as open source1

5.1     Methodology
The performance measurements have been conducted on a single-socket server comprised of a
10-core, 20-thread Intel Xeon Gold 5215 processor with 192GB DRAM and 6x128GB Intel Optane
DC persistent memory running CentOS 8. PSTM has been integrated into the GHC 8.9 branch
nonmoving-compact [Gamari and Ağacan 2019]. It provides an implementation of the volatile
nonmoving garbage collection strategy described by Gamari and Dietz [2020]. Benchmarks for
STM were run using a slightly adapted version of the more recent GHC 8.10.1 that allows capturing
STM commit statistics with an activated nonmoving collector.
   As we want to ensure best comparability, our evaluation is based on data structures and trans-
action mechanisms that are peer-reviewed open source projects. The Haskell benchmarks are
based on the benchmark suite for STM by Yates and Scott [2017, 2019], which we adapted to also
work with PSTM. The suite provides concurrent tree set implementations, including a red-black
tree (RBTREE), a randomized binary search tree (TREAP), and a hash array mapped trie (HAMT).
Additionally, we modified an existing concurrent hash table (HT) Haskell library [Robinson 2019]
to use PSTM instead of STM.
    • Red-Black Tree. Red-Black trees are self-balancing binary search trees that allow for lookup,
       insert and delete operations in O (log 𝑛). By labeling each node with either black or red it
       can be ensured that the longest path in the tree is at most twice as long as the shortest path.
1 Repository   can be found at

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                             63:19

        Rebalancing is performed during insert and delete operations ensuring that no two adjacent
        nodes are both red and that each path from root to leaf contains the same number of black
        nodes. In addition to the color, each node contains a key-value pair as well as pointers to the
        parent and the two child nodes.
      • Treap. A treap is a combined binary tree and heap data structure. A node in the tree consists
        out of two (P)TVars modelling the modifiable relation to the child nodes, as well as a key-value
        pair and a priority. While the key is used to maintain the left-to-right ordering in the tree, the
        priorities are assigned randomly and define at which depth a node has to be placed. Insertion
        of nodes at random heights leads to rotations that are expected to maintain logarithmic
        height, independent of the insertion order. Since the priority defining the height is obtained
        randomly and each lookup of a key starts from the root, it is likely that a transaction inserting
        a node nearby the root invalidates the read sets of concurrent transactions that also rely on
        that path, thus leading to conflicts.
      • Hashed Array Mapped Trie. A Hashed Array Mapped Trie (HAMT) is the combination of
        array mapped tries and hash tables. Each key is first hashed. The (hashed) key-value pair is
        then inserted into a trie [De La Briandais 1959; Mäsker et al. 2019]. The trie has two different
        kinds of nodes ś levels and leaves. A level represents an n-bit chunk of a key. Leaves contain
        key-value pairs. The hashing of the key ensures a well-balanced trie which tends to get more
        sparse towards the leaves. For a given maximum key size the trie has a constant depth. By
        mapping it onto an array this sparsity can be leveraged to allow for more economical usage
        of available memory.
      • Hash Table. A Hash Table uses a hash function to map keys directly to values. Given a
        key and the hash function an index is computed which points into an array of buckets.
        Buckets are implemented as a mutable linked list. Inserts and deletes only create writes in the
        corresponding bucket which keeps transaction logs small. For supporting arbitrary numbers
        of key-value pairs dynamic resizing of the array is desirable, as it keeps the mean size of the
        linked lists small. However, this induces large copy overheads.
   The data structures have been initialized with 50, 000 elements out of a key space of 100, 000
elements to ensure comparability with the benchmarks of Yates and Scott [2019]. Worker threads
perform lookup, insert, and delete operations. The lookup rate (𝑝𝑙𝑢 ) defines the ratios of the
respective operations. Inserts and deletes are performed with equal probability, given as (1 − 𝑝𝑙𝑢 )/2.
Thus, the number of elements in the structure is expected to remain constant. All benchmarks
were averaged over 5 runs. Each run has been restricted to a runtime of 60 seconds, except for
Mnemosyne where we were restricted to 3 seconds as we observed an unstable behavior for longer
runtimes. The relative standard error of the mean (SEM) was (far) below 3% for most of our results
and we include error bars for larger deviations. The experiments also include results for PSTM
and other persistent libraries completely running on DRAM to understand the impact of faster
PM technologies coming up in the future. PSTM and the other libraries running on DRAM still
perform all operations according to the standard PSTM protocol, whereas of course the data on
DRAM cannot be recovered in case of a crash.

5.2    Comparison with Library Based Approaches
We compare our approach with the libraries OneFile [Ramalhete et al. 2019], Romulus [Correia
et al. 2018], PMDK [Rudoff 2017], and Mnemosyne [Volos et al. 2011]. We used a benchmark suite
by Ramalhete et al. [2019] that offers a red-black tree and a hash table implementation for OneFile,
PMDK and Romulus. It ensures concurrency-safety of PMDK transactions by using a global lock.
While specialized implementations can leverage more fine-grained synchronization approaches,

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:20            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

             (a) RBTREE lookup ratios on PM                          (b) HT lookup ratio scaling on PM

           (c) RBTREE thread scaling (𝑝𝑙𝑢 =50%)                       (d) HT thread scaling (𝑝𝑙𝑢 =50%)

           (e) RBTREE thread scaling (𝑝𝑙𝑢 =90%)                       (f) HT thread scaling (𝑝𝑙𝑢 =90%)

   Fig. 6. Red-black tree (left) set and hash table (right) transaction throughput (Tx/s) scaling behavior.

the abstraction of the global locking approach is comparable to STM. For Mnemosyne we used
the red-black tree and hash table implementations of the vacation example of STAMP [Minh et al.
2008; Nalli 2018]. We have not been able to compare PSTM with Java-based implementations, as
AutoPersist and Espresso have not been available as publicly available repositories. We adapted all
benchmarks to the workload used by Yates and Scott [2019] described in the previous section. While
the implementations differ across languages, the well-defined operations on red-black trees and
hash tables ensure the same workload. As Mnemosyne has internal limits on the transaction read
and write set size, we used a fixed number of 6000 buckets for all hash table benchmarks. OneFile
provides two concurrency approaches called OneFilePTM-LF and OneFilePTM-WF. Both show very
similar results for these benchmarks and we therefore only present the results of OneFilePTM-LF
as OneFile. The same applies to RomulusLog and RomulusLR where we only show RomulusLR as
Romulus. The library approaches reclaim memory objects explicitly when they are replaced. To

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                             63:21

allow a fair comparison, we thus limit the heap size to 1500 Mb for PSTM to enforce a timely
collection of unreachable structures in PM. The mutator threads are blocked if the mutator outruns
the collector until space was reclaimed. However, we also include an unlimited configuration with
a large PM heap file where garbage is only collected at regular collection cycles. Here, the heap
is allowed to grow over time to show PSTM’s scaling potential when lower garbage collection
overheads occur.
   Figures 6 (a) and (b) show the impact of different lookup ratios on the transaction throughput
(Tx/s) for 20 concurrent threads for RBTREE and HT. These graphs only show results created on
Optane DC PM, as the DRAM results have shown very similar trends on this log-scale projection.
PSTM shows comparable performance to the library solutions and outperforms many approaches
for lookup ratios lower than 80%. Due to the optimistic synchronization approach of (P)STM by
concurrently preparing transactions in local state, a higher degree of parallelism at high update rates
can be achieved. The throughput of the other approaches decreases more drastically when many
write transactions are involved. Moreover, both shown data structures leverage lazy evaluation
for PSTM which allows to offload both, copy and computation overhead, from the transaction
mechanism. For RBTREE, we measured that approximately 1.2𝑥 more bytes were copied to PM
during lazy thunk updates than during the transaction commit phases, while thunk updates were
responsible for 2/3 of the bytes written to PM for HT. Updates of the RBTREE can cause rebalancing
steps and therefore many write operations per transaction. Only Mnemosyne shows a similar
performance compared to PSTM for the RBTREE when writes are involved, as it relies on a similar
redo logging mechanism. However, Mnemosyne needs to log all changes as well as allocation as it
does not integrate with an automatic memory management that prevents memory space leaks. PSTM
only has to use a redo log to allow multiple power-fail safe pointer updates of PTVars. Therefore
PSTM unlimited shows that PSTM can outperform Mnemosyne for write intensive workloads due
to more light-weight transaction logging when garbage collection impacts are reduced. Romulus
does not allow concurrent writes in transactions and the parallelism therefore degrades. OneFile
allows to cooperatively process the write set of one transaction with multiple threads but this also
induces a communication overhead. However, when only few variables are involved and writes are
sufficiently small, as for HT, OneFile and Romulus can outperform Mnemosyne. Although PMDK
transactions are executed sequentially, the used undo logging mechanism which needs to capture
the state prior to modification also introduces severe overheads on write transactions.
   Figure 6 (c) and (e) show the thread scaling behavior of the red-black trees for 50% and 90%
lookup rates. Generally, only Mnemosyne and PSTM show significant benefits from parallelism.
At 50% lookup rate, PSTM scales well for low thread counts while the throughput improvement
slows down for higher thread counts. For thread counts above 10 hyper-threading is used. Also,
the garbage collection impact increases noticeable with the number of threads for the blocking
version as will be shown in more detail in the next section. In contrast to the library approaches,
PSTM’s throughput diverges more between DRAM and Optane DC the more threads are involved.
This is due to bandwidth limitations of Optane DC as its maximum write bandwidth can only be
maintained for writes of 256 bytes and above [Izraelevitz et al. 2019]. Future work could further
optimize the allocator to match optimal write characteristics of the underlying PM hardware. For
90% lookup ratio this divergence is decreased, as less write transactions are involved. The impact
of garbage collection for PSTM becomes visible for 20 concurrent worker threads. The garbage
collection is performed by a dedicated OS thread and the throughput of the worker threads is
therefore impacted when the maximum hardware thread count is reached.
   Figures 6 (d) and 6(f) show the respective thread scaling results of the hash table. For a lookup
rate of 50% it is noticeable that the performance of Mnemosyne is in the same order of magnitude
as sequentialized PMDK transactions. Here, the benefit from parallelism is lower as the commit

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:22            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

phase of the transaction mechanism imposes an overhead for few writes while conflicts are unlikely.
PSTM’s overhead scales with the number of variables that participate in the transaction. OneFile
and Romulus scale with the likelihood of concurrent write transactions, which increases if more
threads participate.

5.3 Garbage Collection Impact
The previous results show large variance in throughput for HT for PSTM induced by blocking
times of the garbage collector. The nonmoving collector runs concurrently to the mutator which
can modify mutable memory objects, such as PTVars, while the collector marks the reachable
closures. In this phase mutator modifications are tracked in a so-called update remembered set to
remember the existing references at snapshot time. They are marked in a subsequent stop-the-world
phase after completion of the concurrent mark phase, resulting in sync times [Gamari and Dietz
2020]. When the maximum heap size is reached, as in the blocking runs, an additional collection
is triggered and threads requesting allocation in PM are blocked until space has been reclaimed,
resulting in block times.
   Figure 7 reports the sync and block times for the RBTREE and HT benchmarks as a fraction of
the runtime for the Optane DC runs. The sync times for the RBTREE unlimited benchmark are
relatively low, while they are more severe for HT unlimited. The sync time is largely dependent
on how many closures in the update remembered set have been missed during the concurrent

                                                     (a) RBTREE

                                                        (b) HT

Fig. 7. Blocking times of the nonmoving garbage collector for the RBTREE (top) and HT (bottom) benchmark
with and without extra blocking to enforce a limited PM heap size.

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                             63:23

mark phase and thus need to be marked retrospectively. For RBTREE many updates of PTVars
are caused by rebalancing where most values are still reachable afterwards and thus already have
been traced. Contrary, updating transactions of HT are shorter as less variables are involved and
values that were reachable at snapshot time can become unreachable more quickly. Moreover, the
higher number of thunk updates leads to more thunk references that were missed and must be
traced in the sync phase. It can be observed that the highest sync times for HT unlimited occur at a
lookup ratio of 50%, where they take almost half of the run time. This is due to the high number of
thunks newly written to PM which are then forced, i.e. executed, by the many concurrent read
transactions. However, also high deviations in sync times are induced by fluctuating numbers of
garbage collection cycles. As shown by the thread scaling results the sync time overhead increases
linear with the number of threads, except for 20 threads where the maximum hardware thread count
is reached. In the blocking configurations the overall blocking times are more predictable, as they
scale with both, the thread count and the lookup ratio. The mutator outruns the garbage collector
even for low thread counts in the thread scaling examples as the collection of the nonmoving heap
is performed single-threaded. An exception is HT at the lookup ratio of 90% where the amount
of allocations is low enough that no additional blocking is required for up to 10 threads despite
the high rate of missed closures during concurrent marking. As the blocking avoids the update
of references by the mutator, it reduces the sync times as seen for HT. The higher sync times for
RBTREE were induced by an increased number of enforced collections. In total, mutator pause
times of up to 48% of the runtime were observed for RBTREE and 73% for HT respectively.
   The nonmoving collector was not designed for maintaining many short-lived structures. However,
we see the potential to significantly reduce mutator blocking times in future versions by allowing
suspended threads to cooperate in the collection.

5.4 Persistence Overhead vs. STM
In this section we compare PSTM transactions to volatile STM transactions to estimate the overheads
of the power-fail safe update mechanism and migration of values to PM. We evaluated two different
data structures, a TREAP and a HAMT (Figure 8). We compared STM with PSTM in the unlimited
configuration where we do not introduce additional blocking as the nonmoving collector of the

        (a) TREAP ratio scaling        (b) TREAP thread scaling (𝑝𝑙𝑢 =50%)        (c) TREAP commit ratio

        (d) HAMT ratio scaling         (e) HAMT thread scaling (𝑝𝑙𝑢 =50%)         (f) HAMT commit ratio

                        Fig. 8. Comparison of HAMT and TREAP for STM and PSTM.

                                Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:24            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

GHC version used for the volatile STM benchmarks does not allow to limit the heap size. We
observed that for STM benchmarks the mutator can outperform the garbage collector resulting in
heap usages of up to 40 GB.
   While the TREAP uses strict evaluation, the HAMT leverages laziness. PSTM and STM show
similar scaling behaviors for varying read ratios and numbers of threads. Across the different read
ratios PSTM showed slowdowns between 1.4𝑥 and 1.7𝑥 on the same memory technology (1.8𝑥
and 2.1𝑥 on PM) for TREAP at lookup ratios lower than 90%. For 100% read ratio all transaction
mechanisms show very similar throughput. Slowdowns between 1.4𝑥 and 2.4𝑥 on DRAM (2.4𝑥
and 4.3𝑥 on PM) can be observed for HAMT using PSTM. Here, PSTM slightly outperforms STM
when only read transactions are executed. For STM, the data structure gradually ages, i.e. is moved
iteratively to older generations by the generational garbage collector. This results in repetitive
copy overheads until it is finally copied to the nonmoving heap, whereas PSTM moves the data
structure to the nonmoving PM heap on initialization. The measured overheads of PSTM when
writes are involved are partly induced by the need of moving values from the volatile heap to PM
in an on-commit copy-phase and the required flushing. Additionally, a write-ahead redo-log is
created in PM holding references to the new values of PTVars.
   PSTM induces a higher overhead on HAMT than on TREAP, as HAMT works on immutable
arrays rather than on single values. It therefore copies more bytes on updates. Averaged across
lookup ratios, HAMT copies 107 bytes for each write transaction, while TREAP only copies 44
bytes. Copying more bytes prolongs the commit phase and thereby increases the probability
of conflicts between transactions, inducing additional overheads. Figure 8(c) and (f) show the
respective successful commit rates. TREAP generally shows a higher conflict rate than HAMT,
because an update that triggers a rebalancing of the tree close to the root can easily invalidate
all concurrent transactions operating on the same branch. However, for TREAP the overheads of
PSTM on the commit rate is negligible, as most updating transactions only operate on nodes of one
specific branch not limiting concurrent transactions on other branches. The successful commit rate
of TREAP shows a maximum decrease of 4.4% on DRAM (7.6% on PM) for PSTM in comparison
with STM. For HAMT, the prolonged commit phase has a higher impact on the number of conflicts.
HAMT’s compact representation leads to less independent paths for concurrent transactions. A
node is an array of other nodes that is rewritten when a new node should be added. This invalidates
all transactions involving any paths that share this node. This array is generated in the volatile
heap for STM and a single pointer update is sufficient, while PSTM needs to replicate the arrays at
commit time, blocking a whole set of paths for a longer time. The conflict rate therefore increases
by 18% (39% on PM) for HAMT when using PSTM.
   We additionally evaluated the overheads of PSTM to STM for RBTREE and HT. PSTM has shown
slowdowns between 1.3𝑥 and 1.9𝑥 on DRAM (1.5𝑥 and 2.7𝑥 on PM) for RBTREE and 1.5𝑥 and 2.8𝑥
on DRAM (2.2𝑥 and 4.4𝑥 on PM) for HT, respectively. The scaling behavior of PSTM resembled
that of STM.

5.5     Persistent Memoization
Haskell’s laziness is used in many libraries and is also supported by PSTM. A possible application
of PSTM is persistent lazy memoization. Memoization provides a trade-off between computation
time and memory consumption of a program. Previously computed results are stored and returned
for the same input argument to a function instead of recomputing it. Memoization in Haskell is
most commonly implemented using MemoTrie [Elliot 2008] and MemoCombinators [Palmer 2013].
MemoTrie uses a trie to store previous results. Depending on the function type, MemoCombinators
can select from multiple data structures also including a trie for memoization.

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                               63:25

    fib ::    ( Int → Integer) → Int → Integer
    fib f    0 =0
    fib f    1 =1
    fib f    n = f ( n − 1) + f ( n − 2)

    memofib :: IO ( Int → Integer)
    memofib = do tv <− getRoot $ fix ( memo . fib)
               f <− atomically $ readPTVar tv
                              Listing 6. Example of persistent lazy memoization.

Listing 6 shows a memoization example using MemoTrie. The higher order function 𝑚𝑒𝑚𝑜 takes
a function and returns the memoizing equivalent. By assigning the memoizing function to a
persistent variable, the function as well as its data structure holding the memoized results are
persisted. Fibonacci heavily benefits from the reuse of previously computed results as each number
recursively depends on all smaller Fibonacci numbers. Here, the fix function is used to additionally
benefit from previously computed values in the same call.
   Table 1 shows the wall-clock time required for computing the 150,000th Fibonacci number using
MemoTrie and MemoCombinators. All recursive calls map to the same memoized function, thus all
computed Fibonacci numbers are shared in the same call to the function. As the libraries heavily
rely on lazy evaluation, the measurements serve as an estimation for the cost of thunk evaluation
in PM. Persistent memoization shows slowdowns of 2.9𝑥 (4.5𝑥 on PM) for MemoTrie and 3.5𝑥 (4.8𝑥
on PM) for MemoCombinators when compared to the volatile counterpart. This cost amortizes
quickly when the memoized values are reused across a few runs instead of recomputing them.
   The high number of average bytes copied per thunk update is a result of the large Fibonacci
numbers (fib(150,000) has 31,348 decimal digits). The higher runtime of MemoCombinators is
explained by the higher number of individual thunk updates, each requiring the evacuation of
results from the volatile to the PM heap. Individual updates are costly because they need flushing
and memory fencing. While 150,000 thunk updates account for memoizing the Fibonacci numbers,
the remaining updates result from lazy evaluation of the data structures used for memoization and
only copy few bytes. Also, the garbage collection time for MemoCombinators is slightly increased.

Table 1. Comparison of the required runtime (left) and the copy effort (right) of computing the 150,000th
Fibonacci number for the memoization libraries MemoTrie (MTrie) and MemoCombinators (MComb).

               Persistent Heap      Volatile Heap                           thunk       copy        average bytes /
               DRAM       PM           DRAM                                 updates     size        thunk update
 MTrie         2,42 s    3,73 s         0,83 s                 MTrie        384 467     967 MB      2636,2 B
 MComb         3,28 s    4,48 s         0,93 s                 MComb        899 990     982 MB      1144,1 B

Persistent memory enables the direct modification of persistent data structures without serialization
overhead. Providing transactions in persistent memory is challenging due to the small granularity
of atomic updates (8 bytes). Although several transaction libraries are available, they require
significant effort from programmers to use them. In this paper, we have demonstrated that a PSTM
abstraction to interact with PM at the language level can be offered by extending Haskell’s STM
mechanism. Its simple API preserves the performance benefits of PM while hiding most of its
details. PSTM is fully compatible with Haskell’s lazy evaluation semantics, thus allowing easy
adaptation of existing code. The evaluation of STM libraries transformed to support PSTM on real

                                  Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:26            Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

hardware has shown that PSTM offers higher or the same performance compared to low-level
library approaches. We further added persistence to two volatile memoization libraries. PSTM
currently only supports single binary execution as raw function pointers are stored in PM. Future
versions could be adapted to allow multiple binaries to interact with the same persistent memory
root and support for multiple persistent roots could be introduced to further segment the persistence
domain. Additionally, garbage collector pause times could be significantly reduced by allowing
blocked threads to cooperate in nonmoving collection cycles, further increasing the throughput of
the transaction mechanism.

We would like to thank Simon Peyton-Jones and Jeremy Gibbons for the discussions that helped to
develop the ideas on which our research is based. Thanks also to the community of the GHC IRC
channel, which frequently provided very helpful support for compiler-related questions. Addition-
ally, we are grateful for all the feedback and advice from the reviewers that helped to improve our
work. Finally, we would like to thank the artifact reviewers for taking the time to reproduce our
results and therefore to ensure the correctness and usability of our implementation.

Martín Abadi, Luca Cardelli, Benjamin C. Pierce, and Gordon D. Plotkin. 1991. Dynamic Typing in a Statically Typed
   Language. ACM Transactions on Programming Languages and Systems (TOPLAS) 13, 2 (1991), 237ś268.
Malcolm P. Atkinson, Peter J. Bailey, Kenneth Chisholm, W. Paul Cockshott, and Ronald Morrison. 1983. An Approach to
   Persistent Programming. Comput. J. 26, 4 (1983), 360ś365.
Malcolm P. Atkinson, Kenneth Chisholm, and W. Paul Cockshott. 1982. PS-algol: an Algol with a persistent heap. ACM
   SIGPLAN Notices 17, 7 (1982), 24ś31.
Malcolm P. Atkinson and Ronald Morrison. 1995. Orthogonally Persistent Object Systems. VLDB J. 4, 3 (1995), 319ś401.
Ömer S. Ağacan. 2020. The problem with adding functions to compact regions. Retrieved February 24, 2021 from https:
Jost Berthold. 2010. Orthogonal Serialisation for Haskell. In Implementation and Application of Functional Languages - 22nd
   International Symposium, IFL 2010, Alphen aan den Rijn, The Netherlands, September 1-3, 2010, Revised Selected Papers
   (Lecture Notes in Computer Science, Vol. 6647). Springer, 38ś53.
Luc Bläser. 2007. Persistent Oberon: A Programming Language with Integrated Persistence. In 5th Asian Symposium on
   Programming Languages and Systems (APLAS), Singapore, November 29-December 1, Vol. 4807. 71ś85.
Peter T. Breuer. 2003. A Formal Model for the Block Device Subsystem of the Linux Kernel. In Formal Methods and Software
   Engineering, Proceedings of the 5th International Conference on Formal Engineering Methods (ICFEM), Singapore, November
   5-7. 599ś619.
Trevor Brown and Hillel Avni. 2016. PHyTM: Persistent Hybrid Transactional Memory. Proceedings of the VLDB Endowment
   10, 4 (2016), 409ś420.
James Cheney and Ralf Hinze. 2002. A Lightweight Implementation of Generics and Dynamics. In Proceedings of the
   2002 ACM SIGPLAN Workshop on Haskell. Association for Computing Machinery, New York, NY, USA, 90ś104. https:
Paul Chiusano. 2020. How Unison reduces ecosystem churn. Retrieved February 24, 2021 from
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011.
   NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the
   16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS),
   Newport Beach, CA, USA, March 5-11. 105ś118.
Richard C. H. Connor. 1991. Types and polymorphism in persistent programming systems. Ph.D. Dissertation. University of St
Alberto G. Corona. 2017. TCache: A Transactional cache with user-defined persistence. Retrieved February 24, 2021 from

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                                    63:27

Andreia Correia, Pascal Felber, and Pedro Ramalhete. 2018. Romulus: Efficient Algorithms for Persistent Transactional
   Memory. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures (SPAA), Vienna, Austria,
   July 16-18, 2018. ACM, 271ś282.
Rene De La Briandais. 1959. File Searching Using Variable Length Keys. In Papers Presented at the the March 3-5, 1959,
   Western Joint Computer Conference (IRE-AIEE-ACM ’59 (Western)). ACM, San Francisco, California, 295ś298. https:
Subramanya Rao Dulloor, Sanjay Kumar, Anil S. Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff
   Jackson. 2014. System software for persistent memory. In 9th Eurosys Conference, Amsterdam, The Netherlands, April
   13-16. 15:1ś15:15.
Conal Elliot. 2008. Elegant memoization with functional memo tries. Retrieved February 24, 2021 from
Pascal Felber, Christof Fetzer, and Torvald Riegel. 2008. Dynamic performance tuning of word-based software transactional
   memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP),
   Salt Lake City, UT, USA, February 20-23. 237ś246.
Richard F. Freitas and Winfried W. Wilcke. 2008. Storage-class memory: The next storage system technology. IBM Journal
   of Research & Development 52, 4-5 (2008), 439ś448.
Ben Gamari and Ömer Sinan Ağacan. 2019. COMPACT_NFDATA support for the nonmoving collector. Retrieved February 24,
   2021 from
Ben Gamari and Laura Dietz. 2020. Alligator collector: a latency-optimized garbage collector for functional programming
   languages. In ACM SIGPLAN International Symposium on Memory Management (ISMM), June 16, 2020. 87ś99. https:
GHC Team. 2020a. The Block Allocator. Retrieved February 24, 2021 from
GHC Team. 2020b. GHC Commentary: The Layout of Heap Objects. Retrieved February 24, 2021 from https://gitlab.haskell.
Justin E. Gottschlich and Daniel A. Connors. 2007. DracoSTM: A Practical C++ Approach to Software Transactional
   Memory. In Proceedings of the 2007 Symposium on Library-Centric Software Design, Montreal, Canada, October 21. 52ś66.
Frank T. Hady, Annie P. Foong, Bryan Veal, and Dan Williams. 2017. Platform Storage Performance With 3D XPoint
   Technology. Proc. IEEE 105, 9 (2017), 1822ś1833.
Theo Haerder and Andreas Reuter. 1983. Principles of Transaction-Oriented Database Recovery. ACM Comput. Surv. 15, 4
   (1983), 287ś317.
Cordelia V. Hall, Kevin Hammond, Will Partain, Simon L. Peyton Jones, and Philip Wadler. 1992. The Glasgow Haskell
   Compiler: A Retrospective. In Proceedings of the Glasgow Workshop on Functional Programming, Ayr, Scotland, UK, 6-8
   July. 62ś71.
Robert Harper. 1985. Modules and Persistence in Standard ML. In Data Types and Persistence. Edited Papers from the
   Proceedings of the First Workshop on Persistent Objects, Appin, Scotland, UK, August 1985 (Topics in Information Systems),
   Malcolm P. Atkinson, Peter Buneman, and Ronald Morrison (Eds.). Springer, 21ś30.
Tim Harris, Simon Marlow, Simon L. Peyton Jones, and Maurice Herlihy. 2005. Composable memory transactions. In
   Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), June 15-17, 2005,
   Chicago, IL, USA. 48ś60.
Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. 2003. Software transactional memory for
   dynamic-sized data structures. In Proceedings of the 22nd ACM Symposium on Principles of Distributed Computing (PODC),
   Boston, Massachusetts, USA, July 13-16. 92ś101.
Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. In
   Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA), San Diego, CA, USA, May 1993.
Antony L. Hosking and Jiawan Chen. 1999. PM3: An Orthogonal Persistent Systems Programming Language - Design,
   Implementation, Performance. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB), September
   7-10, Edinburgh, Scotland, UK. 587ś598.
Paul Hudak, Simon L. Peyton Jones, Philip Wadler, Brian Boutel, Jon Fairbairn, Joseph H. Fasel, María M. Guzmán, Kevin
   Hammond, John Hughes, Thomas Johnsson, Richard B. Kieburtz, Rishiyur S. Nikhil, Will Partain, and John Peterson.
   1992. Report on the Programming Language Haskell, A Non-strict, Purely Functional Language. ACM SIGPLAN Notices
   27, 5 (1992).
Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu,
   Subramanya R. Dulloor, Jishen Zhao, and Steven Swanson. 2019. Basic Performance Measurements of the Intel Optane
   DC Persistent Memory Module. CoRR abs/1903.05714 (2019).

                                   Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
63:28             Nicolas Krauter, Patrick Raaf, Peter Braam, Reza Salkhordeh, Sebastian Erdweg, and André Brinkmann

Alfons Kemper and Donald Kossmann. 1993. Adaptable Pointer Swizzling Strategies in Object Bases. In Proceedings of the
   Ninth International Conference on Data Engineering, April 19-23, 1993, Vienna, Austria. IEEE Computer Society, 155ś162.
Emre Kultursay, Mahmut T. Kandemir, Anand Sivasubramaniam, and Onur Mutlu. 2013. Evaluating STT-RAM as an energy-
   efficient main memory alternative. In 2012 IEEE International Symposium on Performance Analysis of Systems & Software,
   Austin, TX, USA, 21-23 April, 2013. IEEE Computer Society, 256ś267.
Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei Wu, Weimin Zheng, and Jinglei Ren. 2017. DudeTM:
   Building Durable Transactions with Decoupling for Persistent Memory. In Proceedings of the 22nd International Conference
   on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Xi’an, China, April 8-12. 329ś343.
Simon Marlow, Tim Harris, Roshan P. James, and Simon L. Peyton Jones. 2008. Parallel generational-copying garbage
   collection with a block-structured heap. In Proceedings of the 7th International Symposium on Memory Management
   (ISMM), Tucson, AZ, USA, June 7-8. 11ś20.
Alonso Marquez, Stephen Blackburn, Gavin Mercer, and John N. Zigman. 2000. Implementing Orthogonally Persistent Java.
   In Persistent Object Systems, 9th International Workshop, POS-9, Lillehammer, Norway, September 6-8, 2000, Revised Papers
   (Lecture Notes in Computer Science, Vol. 2135), Graham N. C. Kirby, Alan Dearle, and Dag I. K. Sjùberg (Eds.). Springer,
Markus Mäsker, Tim Süß, Lars Nagel, Lingfang Zeng, and André Brinkmann. 2019. Hyperion: Building the Largest In-
   memory Search Tree. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD), Amsterdam,
   The Netherlands, June 30 - July 5. 1207ś1222.
David J. McNally and Antony J. T. Davie. 1991. Two Models For Integrating Persistence and Lazy Functional Languages.
   ACM SIGPLAN Notices 26, 5 (1991), 43ś52.
Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. 2008. STAMP: Stanford Transactional Applications
   for Multi-Processing. In 4th International Symposium on Workload Characterization (IISWC), Seattle, Washington, USA,
   September 14-16, 2008. IEEE Computer Society, 35ś46.
Ron Morrison, Richard Connor, Graham Kirby, David Munro, Malcolm Atkinson, Quintin Cutts, Alfred Brown, and Alan
   Dearie. 2000. The Napier88 Persistent Programming Language and Environment. 98ś154.
Nafiseh Moti, Frederic Schimmelpfennig, Reza Salkhordeh, David Klopp, Toni Cortes, Ulrich Rückert, and André Brinkmann.
   2021. Simurgh: A Fully Decentralized and Secure NVMM User Space File System. In Proceedings of the International
   Conference for High Performance Computing, Networking, Storage and Analysis (SC) (accepted for publication), St. Louis,
   MO, USA, November 14-19, 2021.
Sanketh Nalli. 2018. GCC port of TM system Mnemosyne. Retrieved February 24, 2021 from
Luke Palmer. 2013. Data.MemoCombinators: Combinators for building memo tables. Retrieved February 24, 2021 from
Stuart S. P. Parkin, Masamitsu Hayashi, and Luc Thomas. 2008. Magnetic Domain-Wall Racetrack Memory. Science 320,
   5873 (2008), 190ś194.
Simon L. Peyton Jones. 1992. Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-Machine.
   Journal of Functional Programming 2, 2 (1992), 127ś202.
Simon L. Peyton Jones, Andrew D. Gordon, and Sigbjùrn Finne. 1996. Concurrent Haskell. In 23rd ACM SIGPLAN-SIGACT
   Symposium on Principles of Programming Languages (POPL), St. Petersburg Beach, Florida, USA, January 21-24. 295ś308.
Juan J. Quintela and Juan J. Sánchez. 2001. Persistent Haskell. In Computer Aided Systems Theory - EUROCAST 2001, Las
   Palmas de Gran Canaria, Spain, February 19-23, 2001, Revised Papers (Lecture Notes in Computer Science, Vol. 2178), Roberto
   Moreno-Díaz, Bruno Buchberger, and José Luis Freire (Eds.). Springer, 657ś667.
Pedro Ramalhete, Andreia Correia, Pascal Felber, and Nachshon Cohen. 2019. OneFile: A Wait-Free Persistent Transactional
   Memory. In 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Portland, OR,
   USA, June 24-27, 2019. IEEE, 151ś163.
Dulloor Subramanya Rao, Sanjay Kumar, Anil S. Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff
   Jackson. 2014. System software for persistent memory. In Ninth Eurosys Conference 2014, EuroSys 2014, Amsterdam, The
   Netherlands, April 13-16, 2014, Dick C. A. Bulterman, Herbert Bos, Antony I. T. Rowstron, and Peter Druschel (Eds.).
   ACM, 15:1ś15:15.
Simone Raoux, Geoffrey W. Burr, Matthew J. Breitwisch, Charles T. Rettner, Yi-Chou Chen, Robert M. Shelby, Martin Salinga,
   Daniel Krebs, Shih-Hung Chen, Hsiang-Lan Lung, and Chung Hon Lam. 2008. Phase-change random access memory: A
   scalable technology. IBM Journal of Research and Development 52, 4-5 (2008), 465ś480.

Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.
Persistent Software Transactional Memory in Haskell                                                                  63:29

Peter Robinson. 2019. concurrent-hashtable: Thread-safe hash tables for multi-cores! Retrieved February 24, 2021 from
Andy Rudoff. 2017. Persistent Memory Programming. login Usenix Mag. 42, 2 (2017).
Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Ben Hertzberg. 2006. McRT-STM: a high
   performance software transactional memory system for a multi-core runtime. In Proceedings of the ACM SIGPLAN
   Symposium on Principles and Practice of Parallel Programming (PPoPP), New York, New York, USA, March 29-31. 187ś197.
Thomas Shull, Jian Huang, and Josep Torrellas. 2019. AutoPersist: an easy-to-use Java NVM framework based on reachability.
   In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Phoenix,
   AZ, USA, June 22-26. 316ś332.
Christopher Strachey. 2000. Fundamental Concepts in Programming Languages. High. Order Symb. Comput. 13, 1/2 (2000),
Robert D. Tennent. 1977. Language Design Methods Based on Semantic Principles. Acta Informatica 8 (1977), 97ś112.
Katsuhiro Ueno and Atsushi Ohori. 2016. A fully concurrent garbage collector for functional programs on multicore
   processors. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming (ICFP), Nara,
   Japan, September 18-22. 421ś433.
Haris Volos, Andres Jaan Tack, and Michael M. Swift. 2011. Mnemosyne: lightweight persistent memory. In Proceedings of
   the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS),
   Newport Beach, CA, USA, March 5-11. 91ś104.
Michèle Weiland, Holger Brunst, Tiago Quintino, Nick Johnson, Olivier Iffrig, Simon D. Smart, Christian Herold, Antonino
   Bonanni, Adrian Jackson, and Mark Parsons. 2019. An early evaluation of Intel’s optane DC persistent memory module
   and its impact on high-performance scientific applications. In Proceedings of the International Conference for High
   Performance Computing, Networking, Storage and Analysis (SC), Denver, Colorado, USA, November 17-19. 76:1ś76:19.
Mingyu Wu, Haibo Chen, Hao Zhu, Binyu Zang, and Haibing Guan. 2020. GCPersist: an efficient GC-assisted lazy persistency
   framework for resilient Java applications on NVM. In VEE ’20: 16th ACM SIGPLAN/SIGOPS International Conference on
   Virtual Execution Environments, virtual event, Lausanne, Switzerland, March 17. 1ś14.
Mingyu Wu, Ziming Zhao, Haoyu Li, Heting Li, Haibo Chen, Binyu Zang, and Haibing Guan. 2018. Espresso: Brewing Java
   For More Non-Volatility with Non-volatile Memory. In Proceedings of the 23rd International Conference on Architectural
   Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 24-28. 70ś83.
Jian Xu and Steven Swanson. 2016. NOVA: A Log-Structured File System for Hybrid Volatile/Non-Volatile Main Memories.
   login Usenix Mag. 41, 3 (2016).
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson,
   and Andy Rudoff. 2017. NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System. In Proceedings of the
   26th Symposium on Operating Systems Principles (SOSP), Shanghai, China, October 28-31. 478ś496.
Edward Z. Yang, Giovanni Campagna, Ömer S. Agacan, Ahmed El-Hassany, Abhishek Kulkarni, and Ryan R. Newton.
   2015. Efficient communication and collection with compact normal forms. In Proceedings of the 20th ACM SIGPLAN
   International Conference on Functional Programming, ICFP 2015, Vancouver, BC, Canada, September 1-3, 2015. ACM, 362ś374.
Ryan Yates and Michael L. Scott. 2017. Improving STM performance with transactional structs. In Proceedings of the
   10th ACM SIGPLAN International Symposium on Haskell, Oxford, United Kingdom, September 7-8, 2017. ACM, 186ś196.
Ryan Yates and Michael L. Scott. 2019. Leveraging hardware TM in Haskell. In Proceedings of the 24th ACM SIGPLAN
   Symposium on Principles and Practice of Parallel Programming (PPoPP), Washington, DC, USA, February 16-20, 2019. ACM,

                                   Proc. ACM Program. Lang., Vol. 5, No. ICFP, Article 63. Publication date: August 2021.