Authors: Nikhil Bansal, R. Ryan Williams
License: CC-BY-3.0
Theory of Computing, Volume 8 (2012), pp. 69–94
www.theoryofcomputing.org
Special Issue in Honor of Rajeev Motwani

Regularity Lemmas and Combinatorial Algorithms

Nikhil Bansal    Ryan Williams

Received: May 25, 2010; published: April 1, 2012.
Abstract: We present new combinatorial algorithms for Boolean matrix multiplication (BMM) and preprocessing a graph to answer independent set queries. We give the first asymptotic improvements on combinatorial algorithms for dense BMM in many years, improving on the “Four Russians” O(n³/(w log n)) bound for machine models with wordsize w. (For a pointer machine, we can set w = log n.) The algorithms utilize notions from Regularity Lemmas for graphs in a novel way.
• We give two randomized combinatorial algorithms for BMM. The first algorithm is essentially a reduction from BMM to the Triangle Removal Lemma. The best known bounds for the Triangle Removal Lemma only imply an O((n³ log β)/(βw log n)) time algorithm for BMM where β = (log* n)^δ for some δ > 0, but improvements on the Triangle Removal Lemma would yield corresponding runtime improvements. The second algorithm applies the Weak Regularity Lemma of Frieze and Kannan along with several information compression ideas, running in O(n³(log log n)²/(log n)^{9/4}) time with probability exponentially close to 1. When w ≥ log n, it can be implemented in O(n³(log log n)/(w(log n)^{7/6})) time. Our results immediately imply improved combinatorial methods for CFG parsing, detecting triangle-freeness, and transitive closure.
ACM Classification: F.2.2
AMS Classification: 68Q25
Key words and phrases: Boolean matrix multiplication, regularity lemma, combinatorial algorithm,
independent set queries
© 2012 Nikhil Bansal and Ryan Williams
Licensed under a Creative Commons Attribution License. DOI: 10.4086/toc.2012.v008a004
• Using Weak Regularity, we also give an algorithm for answering queries of the form “is S ⊆ V an independent set?” in a graph. Improving on prior work, we show how to randomly preprocess a graph in O(n^{2+ε}) time (for all ε > 0) so that with high probability, all subsequent batches of log n independent set queries can be answered deterministically in O(n²(log log n)²/(log n)^{5/4}) time. When w ≥ log n, w queries can be answered in O(n²(log log n)²/(log n)^{7/6}) time. In addition to its several applications, this problem is interesting in that it is not known how to do better than O(n²) using “algebraic” methods.
1 Introduction
Szemerédi’s Regularity Lemma is one of the most remarkable results of graph theory, having many diverse
uses and applications. In computer science, regularity notions have been used extensively in property
and parameter testing [4, 5, 15, 48, 12], approximation algorithms [28, 29, 20], and communication
complexity [34]. In this paper we show how regularity can lead to faster combinatorial algorithms for
basic problems.
Boolean matrix multiplication (BMM) is among the most fundamental problems in computer science.
It is a key subroutine in the solution of many other problems such as transitive closure [25], context-free
grammar parsing [59], all-pairs path problems [21, 31, 53, 55], and triangle detection [35].
There have been essentially two lines of theoretical research on BMM. Algebraic algorithms, beginning with Strassen’s Õ(n^{log₂ 7}) algorithm [56] and ending (so far) with Vassilevska Williams’ Õ(n^{2.373}) algorithm [60], reduce the Boolean problem to ring matrix multiplication and give ingenious methods for the ring version by utilizing cancellations. In particular, multiplication-efficient algorithms are found for multiplying small matrices over an arbitrary ring, and these algorithms are applied recursively. There have been huge developments in this direction over the years, with many novel ideas (cf. [45] for an overview of early work, and [19, 18] for a more recent and promising approach). However, these algorithms (including Strassen’s) have properties (lack of locality, extra space usage, and leading constants) that may make them less desirable in practice.¹
The second line of work on Boolean matrix multiplication has studied so-called combinatorial
algorithms, the subject of the present paper. Combinatorial algorithms for matrix multiplication exploit
redundancies that arise from construing matrices as graphs, often invoking word parallelism, lookup
tables, and Ramsey-theoretic arguments. Although the term combinatorial algorithm has been used in
many of the references cited, there is no general criterion for what is “combinatorial” about them: the term
is mainly just a way of distinguishing those approaches which are different from the algebraic approach
originating with Strassen. Hence for the purposes of this paper, we simply think of a combinatorial
algorithm as one that does not call an oracle for ring matrix multiplication. These algorithms are
considered to be more practical, but fewer advances have been made. All algorithms for the dense
case [43, 8, 51, 9, 50, 10, 62] are loosely based on the “Four Russians” approach of Arlazarov, Dinic,
Kronrod, and Faradzhev [8] from 1970, which runs in O(n³/(w log n)) time on modern computational models,
¹ For this reason, some practical implementations of Strassen’s algorithm switch to standard (or “Four Russians”) multiplication when the submatrices are sufficiently small. For more discussion on the (im)practicality of Strassen’s algorithm and variants, cf. [40, 17, 2].
where w is the maximum of log n and the wordsize.² Given its importance, we shall briefly describe the approach here. The algorithm partitions the first matrix into n × (ε log n) submatrices, and the second matrix into (ε log n) × n submatrices. For each n × (ε log n) submatrix, we compute a table with 2^{ε log n} = n^ε entries, where each entry is an n × 1 vector corresponding to the union of some subset of columns of the submatrix. With this table one can multiply each n × (ε log n) and (ε log n) × n submatrix pair together in O(n²) time. An additional w-factor can be saved by storing the n-bit entries in the table as a collection of n/w words, or a log-factor is saved by storing the outputs as a collection of n/log n pointers to nodes encoding (log n)-bit strings in a graph, cf. [50, 10, 62]. To date, this is still the fastest known combinatorial algorithm for dense matrices.
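To make the table-lookup idea concrete, here is a minimal Python sketch of one common formulation of the Four Russians method, with rows stored as integer bitmasks so that a bitwise OR plays the role of a w-bit word operation. The function name is ours, and building the tables from strips of the second matrix’s rows is the dual of the column-table description above; both are equivalent.

```python
def four_russians_bmm(A, B, n, b):
    """Boolean product of two n x n matrices, each given as a list of n
    integer row-bitmasks (bit k of A[i] encodes A[i][k]).  Block size b
    plays the role of eps*log n.  A sketch, not an optimized kernel."""
    C = [0] * n
    for s in range(0, n, b):
        width = min(b, n - s)
        # table[mask] = OR of the rows B[s + j] over all j with bit j set
        table = [0] * (1 << width)
        for mask in range(1, 1 << width):
            low = mask & -mask                       # lowest set bit
            table[mask] = table[mask ^ low] | B[s + low.bit_length() - 1]
        # one table lookup (an O(n/w)-word OR) per row per strip
        for i in range(n):
            C[i] |= table[(A[i] >> s) & ((1 << width) - 1)]
    return C
```

With b ≈ log₂ n this performs n/b strips of n lookups each, matching the O(n³/(w log n)) word-operation count when rows fit in machine words.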
Many works (including [1, 21, 40, 52, 47, 41, 16]) have commented on the dearth of better combinatorial algorithms for BMM. As combinatorial algorithms can often be generalized in ways that the algebraic ones cannot (e.g., to work over interesting semirings), the lack of progress does seem to be a bottleneck, even for problems that appear to be more difficult. For instance, the best known algorithm for the general all-pairs shortest paths problem [16] is combinatorial and runs in O(n³ · poly(log log n)/log² n) time, essentially the same time as Four Russians. Some progress on special cases of BMM has been made: for instance, in the sparse case where one matrix has m ≪ n² nonzeros, there is an O(mn log(n²/m)/(w log n)) time algorithm [24, 13]. See [44, 52, 41] for a sampling of other partial results. The search for practical and fast Boolean matrix multiplication is still ongoing.
2 Our results
In this paper we present what are arguably the first concrete improvements on combinatorial algorithms
for dense BMM since the 70’s. Our approach opens a new line of attack on the problem by connecting
the complexity of BMM to modern topics in graph theory, such as the Weak Regularity Lemma and the
efficiency of Triangle Removal Lemmas.
2.1 Triangle Removal Lemmas and BMM
A Triangle Removal Lemma [49, 33] states that there is a function f satisfying lim_{x→0} f(x) = 0 such that for every graph with at most εn³ triangles, we can efficiently find f(ε)n² edges that hit all triangles. This lemma is one of the many deep consequences of Szemerédi’s Regularity Lemma [57]. We prove that good removal lemmas imply faster Boolean matrix multiplication. Let w be the wordsize (typically w = Θ(log n)).
Theorem 2.1. Suppose there is an O(T(n, ε)) time algorithm that, for every graph G = (V, E) with at most εn³ triangles, returns a set S ⊆ E with |S| ≤ f(ε)n² such that G′ = (V, E \ S) is triangle-free. Then there is a randomized algorithm for Boolean matrix multiplication that returns the correct answer with high probability and runs in time

O( T(n, ε) + f(ε)n³ log(1/f(ε))/(w log n) + (n²/ε) · log n + εn³ ).
² Historical Note: The algorithm in [8] was originally stated to run in O(n³/log n) time. Similar work of Moon and Moser [43] from 1966 shows that inverting a matrix over GF(2) needs exactly Θ(n²/log n) row operations on n-bit vectors, providing an upper and lower bound. On a RAM, their algorithm runs in O(n³/(w log n)) time.
Unfortunately the best known upper bound for f is f(ε) = O(1/(log* 1/ε)^δ) for some δ > 0 (cf. Section 3.1). For ε = 1/√n, we obtain a very modest runtime improvement over Four Russians. However no major impediment is known (like that proven by Gowers for the full Regularity Lemma [32]) for obtaining a much better f for triangle removal. The best known lower bound on f(ε) is only 2^{−O(√(log(1/ε)))}, due to Ruzsa and Szemerédi [49]. Given a set S ⊆ [n] with no arithmetic progression of length three, Ruzsa and Szemerédi construct a graph G₀ with O(n) nodes and O(n|S|) edges whose edge set can be partitioned into n|S| edge-disjoint triangles (and there are no other triangles). The best known constructions of such arithmetic-progression-free sets, due to Behrend [11] and Elkin [23], have |S| ≈ n^{1−Θ(1/√(log n))}. So, in the case of G₀ we have

ε = |S|/n² = 1/(n·2^{Θ(√(log n))}) and f(ε) = |S|/n = 1/2^{Θ(√(log n))} ≥ 2^{−Θ(√(log(1/ε)))}.

Hence, with improved Triangle Removal Lemmas, Theorem 2.1 could still imply an n³/exp(Θ(√(log n))) time bound for BMM.
Recently, Jacob Fox [26] has given a new proof of the Triangle Removal Lemma which improves the
function f . However, the algorithm implicit in his proof requires detecting if a given graph partition is
“superregular,” which we do not know how to do in sub-cubic time (which is necessary for Theorem 2.1
to apply). Turning his proof into an efficient algorithm is an interesting open problem.
2.2 Weak Regularity and BMM
Our second algorithm for BMM gives a more concrete improvement, relying on the Weak Regularity
Lemma of Frieze and Kannan [28, 29] along with several other combinatorial ideas.
Theorem 2.2. There is a randomized combinatorial algorithm for Boolean matrix multiplication that runs in Ô(n³/log^{2.25} n) (worst-case) time on a pointer machine, and computes the product with high probability.³ More precisely, for any n × n Boolean matrices A and B, the algorithm computes their Boolean product with probability 1 − exp(−n), and takes time O(n³(log log n)²/log^{2.25} n). On a RAM with wordsize w ≥ log n, the algorithm can be implemented in O(n³(log log n)/(w log^{7/6} n)) time.
These new algorithms are interesting not so much for their quantitative improvements, but because they show that some further improvement is possible. Some researchers believed that O(n³/(w log n)) would be the end of the line for algorithms not based on algebraic methods. This belief was quantified by Angluin [7] and Savage [51], who proved in the mid-70’s that, for a straight-line program model which includes Four Russians, Ω(n³/(w log n)) operations are indeed required.⁴
2.3 Preprocessing for fast independent set queries
Finally, we show how our approach can improve the solution of problems that seem beyond the reach of
algebraic methods, and give a partial derandomization of some applications of BMM. In the independent
³ The Ô notation suppresses poly(log log n) factors.
⁴ More precisely, they proved that Boolean matrix multiplication requires Θ(n²/log n) bitwise OR operations on n-bit vectors, in a straight-line program model where each line is a bitwise OR of some subset of vectors in the matrices and a subset of previous lines in the program, and each row of the matrix product appears as the result of some line of the program.
set query problem, we wish to maintain a data structure (with polynomial preprocessing time and space) that can quickly answer whether a subset S ⊆ V is independent. It is not known how to solve this problem faster than O(n²) using “Strassenesque” methods. Previously it was known that one could answer one independent set query in O(n²/log² n) time [62] (or O(n²/(w log n)) with wordsize w). We show the following result.
Theorem 2.3. For all ε ∈ (0, 1/2), we can preprocess a graph G in O(n^{2+ε}) time such that with high probability, all subsequent batches of log n independent set queries on G can be answered deterministically in O(n²(log log n)²/(ε(log n)^{5/4})) time. On the word RAM with w ≥ log n, we can answer w independent set queries in O(n²(log log n)/(ε(log n)^{7/6})) time.
That is, the O(n^{2+ε}) preprocessing is randomized, but the algorithm which answers batches of queries is deterministic, and with high probability these answers are always correct. Recent work has shown that the preprocessing can be made deterministic as well (cf. the conclusion of this paper). The independent set query problem of Theorem 2.3 has several interesting applications; the last three were communicated to us by Avrim Blum [14].
1. Triangle Detection in Graphs. The query algorithm immediately implies a triangle detection algorithm that runs in O(n³(log log n)/(log n)^{9/4}) time, or O(n³(log log n)/(w(log n)^{7/6})) time. (A graph is triangle-free if and only if all vertex neighborhoods are independent sets.)
2. Partial Match Retrieval. The query problem can also model a special case of partial match retrieval. Let Σ = {σ₁, . . . , σ_k}, and let ? ∉ Σ. Imagine we are given a collection of m vectors v₁, . . . , v_m of length n over Σ ∪ {?} such that every v_j has only two components from Σ (the rest of the components are all ?’s). A series of vectors q ∈ Σⁿ arrive one at a time, and we want to determine if q “matches” some v_j, i. e., there is a j such that for all i = 1, . . . , n, v_j[i] = q[i] whenever v_j[i] ≠ ?. To formulate this problem as an independent set query problem, make a graph with kn nodes in equal-sized parts V₁, . . . , V_k. Put the edge (i, i′) ∈ V_a × V_b iff there is a vector v_ℓ in the collection such that v_ℓ[i] = σ_a and v_ℓ[i′] = σ_b. A query vector q corresponds to asking if S_q = ⋃_{ℓ=1}^{k} {i ∈ V_ℓ | q[i] = σ_ℓ} is an independent set in the graph.
3. Preprocessing 2-CNF Formulas. We can also preprocess a 2-CNF formula F on n variables, in order to quickly evaluate F on arbitrary assignments. Make a graph with 2n nodes, one for each possible literal in F. For each clause (ℓ_i ∨ ℓ_j) in F, put an edge between nodes ¬ℓ_i and ¬ℓ_j in the graph. Now given a variable assignment A : {x₁, . . . , x_n} → {0, 1}, observe that the set S_A = {x | A(x) = 1} ∪ {¬x | A(x) = 0} is independent if and only if A satisfies F.
4. Answering 3-SUM Queries. Independent set queries can solve a query version of the well-known
3-SUM problem [30]. The 3-SUM problem asks: given two sets A and B of n elements each, are
there two elements in A that add up to some element in B? The assumption that 3-SUM cannot be solved much faster than the trivial O(n²) bound has been used to show hardness for many computational geometry problems [30], as well as lower bounds on data structures [46].
A natural query version of the problem is: given two sets A and B of n integers each, preprocess
them so that for any query set S ⊆ A, one can quickly answer whether two elements in S sum to an
element in B. Make a graph with a node for each integer in A, and an edge between two integers in
A if their sum is an element in B: this gives exactly the independent set query problem.
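As an illustration of the reductions above, here is a minimal Python sketch of the 2-CNF application (item 3): the clause-graph construction and the independence test that evaluates the formula. All names are hypothetical, and a real implementation would answer the independence query with the preprocessed data structure of Theorem 2.3 rather than by scanning edges.

```python
def build_clause_graph(clauses):
    """Literals are (var, polarity) pairs: x is (v, True), NOT x is (v, False).
    For each clause (l1 OR l2), connect the negations of l1 and l2; the
    clause fails exactly when both negated literals are made true."""
    negate = lambda l: (l[0], not l[1])
    return {frozenset((negate(l1), negate(l2))) for (l1, l2) in clauses}

def evaluate(n, edges, assignment):
    """The set S_A of literals made true by the assignment is independent
    in the clause graph iff the assignment satisfies every clause."""
    true_literals = {(v, assignment[v]) for v in range(n)}
    # an edge lying inside true_literals witnesses a falsified clause
    return all(not edge <= true_literals for edge in edges)
```

Here `evaluate` performs the independence check by brute force; the point of Theorem 2.3 is that a batch of such checks can be answered in subquadratic amortized time after preprocessing.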
3 Preliminaries
The Boolean semiring is the semiring on {0, 1} with OR as addition and AND as multiplication. For Boolean matrices A and B, A ∨ B is the componentwise OR of A and B, A ∧ B is the componentwise AND, and A ⋆ B is the matrix product over the Boolean semiring. When it is clear from the context, sometimes we omit the ⋆ and write AB for the product.
Since the running times of our algorithms involve polylogarithmic terms, we must make the computational model precise. Unless otherwise specified, we assume a standard word RAM with wordsize w. That is, accessing a memory location takes O(1) time, and we can perform simple operations (such as addition, componentwise AND and XOR, but not multiplication) on w-bit numbers in O(1) time. Typically, speedups in combinatorial algorithms come either from exploiting some combinatorial substructure, from preprocessing and table lookups, or from “word tricks” which utilize the bit-level parallelism of the machine model. In our results, we explicitly state the dependence on the word size, denoted by w. The reader may assume w = Θ(log n) for convenience. In fact all algorithms in this paper can be implemented on a pointer machine under this constraint.
We now describe some of the tools we need.
3.1 Regularity
Let G = (V, E) be a graph and let S, T ⊆ V be disjoint. Define e(S, T) = |{(u, v) ∈ E | u ∈ S, v ∈ T}|. The density of (S, T) is d(S, T) = e(S, T)/(|S||T|). Thus d(S, T) is the probability that a random pair of vertices, one from S and one from T, has an edge between them. For ε > 0, the pair (S, T) is ε-regular if over all S′ ⊆ S and T′ ⊆ T with |S′| ≥ ε|S| and |T′| ≥ ε|T|, we have |d(S′, T′) − d(S, T)| ≤ ε. That is, the density of all sufficiently large subpairs of (S, T) is approximately d(S, T).
Definition 3.1. A partition {V₁, . . . , V_k} of V is an ε-regular partition of G if
• for all i, |V_i| ≤ ε|V|,
• for all i, j, ||V_i| − |V_j|| ≤ 1, and
• all but at most εk² of the pairs (V_i, V_j) are ε-regular.
Szemerédi’s celebrated theorem [57] states that for every ε > 0, every sufficiently large graph has an ε-regular partition.
Lemma 3.2 (Regularity Lemma). For all ε > 0, there is a K(ε) such that every G has an ε-regular
partition where the number of parts k is at most K(ε).
We need to compute such a partition in less than cubic time, in order to perform faster matrix
multiplication. There exist several polynomial time constructions of ε-regular partitions [3, 27, 29, 37].
The fastest deterministic algorithm runs in O(K′(ε)n²) time (for some K′(ε) related to K(ε)) and is due to Kohayakawa, Rödl, and Thoma [37].⁵

Theorem 3.3 (Kohayakawa-Rödl-Thoma [37]). There is an algorithm that, on input ε > 0 and a graph G on n nodes, outputs an ε-regular partition with K′(ε) parts and runs in O((20/(ε′)⁵)(n² + K′(ε)n)) time. Here K′(ε) is a tower of at most 20/(ε′)⁵ twos, where ε′ = ε²⁰/1024.
Let us give a few more details on how the above algorithm is obtained. The above theorem is essentially Corollary 1.6 in Section 3.2 of [37]; however, we have explicitly spelled out the dependency between ε′, K′, and ε. Theorem 1.5 in [37] shows that in O(n²) time, we can either verify ε-regularity or obtain a witness for ε′-irregularity (with ε′ as above). Here, a witness is simply a pair of subsets of vertices for which the ε′-regularity condition fails to hold. Lemma 3.6 in Section 3.2 of [37] shows how to take witnesses of ε′-irregularity for a partition and refine the partition in linear time, so that the index (a quantity that always lies in [0, 1]) of the partition increases by (ε′)⁵/20. Thus, in at most 20/(ε′)⁵ iterations of partition refinement (each refinement taking O(K′(ε)n) time) we can arrive at an ε-regular partition.
We also need the Triangle Removal Lemma, first stated by Ruzsa and Szemerédi [49]. In one formulation, the lemma says there is a function f such that f(ε) → 0 as ε → 0, and for every graph with at most εn³ triangles, at most f(ε)n² edges need to be removed to make the graph triangle-free. We use a version stated by Green ([33], Proposition 1.3); for completeness, we give a full proof.

Lemma 3.4 (Triangle Removal Lemma). Suppose G has at most δn³ triangles. Let k = K(ε) be the number of parts in some ε-regular partition of G, where 4εk⁻³ > δ and ε is sufficiently small. Then there is a set of at most 4ε^{1/3}n² edges such that their removal makes G triangle-free.⁶
In particular, let {V₁, . . . , V_k} be an ε-regular partition of G. By removing all edges inside the parts (the pairs (V_i, V_i)), all edges in pairs (V_i, V_j) with density less than 2ε^{1/3}, and all edges in non-regular pairs, G becomes triangle-free.
Proof. Let G′ = (V, E′) be the graph obtained by removing all edges from the pairs (V_i, V_j) that have density less than 2ε^{1/3} or are non-regular, and removing all edges from the pairs (V_i, V_i). As each low-density pair has at most 2ε^{1/3}(n/k)² edges, there are at most εk² non-regular pairs, and the number of edges with both vertices in the same part is at most k · (n/k)², the number of edges removed is at most

k² · 2ε^{1/3}(n/k)² + εk² · (n/k)² + n²/k ≤ 4ε^{1/3}n²,

for sufficiently small ε > 0 (note that k ≥ poly(1/ε)).
We now show that G′ is triangle-free. Suppose there is a triangle (u, v, w) with u ∈ V_i, v ∈ V_j, and w ∈ V_k for distinct indices i, j, k. By construction, |V_i|, |V_j|, and |V_k| are all at least n/k, the density of each of the three pairs among them is at least 2ε^{1/3}, and all three pairs are ε-regular. We claim that the number of triangles between the parts V_i, V_j, and V_k is more than δn³, contradicting our hypothesis on G.
⁵ The paper [37] claims that [28, 29] give an algorithm for constructing a regular partition that runs in linear time, but we are unsure of this claim. The algorithm given in Frieze-Kannan [29] seems to require that we can verify regularity in linear time, but does not give an algorithm for this verification.
⁶ In our later algorithms, we will choose ε to depend on n.
WLOG we may assume |V_i| = |V_j| = |V_k|; let their common cardinality be t ≥ n/k. Let b = 2ε^{1/3} − ε. First we claim there are less than εt nodes in V_i which have less than bt neighbors in V_j. If not, then there would be a V_i′ ⊆ V_i with |V_i′| = εt such that

d(V_i′, V_j) < εbt²/(εt · t) = b = 2ε^{1/3} − ε,

yet d(V_i, V_j) ≥ 2ε^{1/3}, contradicting ε-regularity. Analogously, there are less than εt nodes in V_i with less than bt neighbors in V_k. Hence there are at least (1 − 2ε)t nodes in V_i that have at least bt neighbors in V_j and bt neighbors in V_k. Let v ∈ V_i be such a node and let S_j and S_k be the neighbors of v in V_j and V_k, respectively. By ε-regularity, the number of edges between S_j and S_k is at least (2ε^{1/3} − ε)(bt)² = b³t². So at least (1 − 2ε)t nodes in V_i each participate in at least b³t² triangles, hence the number of triangles among V_i, V_j, V_k is at least T = (1 − 2ε)t · b³t². For small enough ε > 0, we have 2ε^{1/3} − ε > (5/3)ε^{1/3} and 4.5ε − 9ε² > 4ε, so

T = (1 − 2ε)b³t³ > (1 − 2ε)(5/3)³εt³ > 4.5εt³ − 9ε²t³ > 4εt³ ≥ 4ε(n/k)³ > δn³,

contradicting our assumption that there were at most δn³ triangles.
Notice that the lemma gives an efficient way of discovering which edges to remove, when combined with an algorithmic Regularity Lemma. However, the above proof yields only a very weak bound on f(ε), of the form c/(log* 1/ε)^δ for some constants c > 1 and δ > 0. It is of great interest to prove a triangle removal lemma with a much smaller f(ε). A step in this direction is the result of Fox [26] mentioned earlier.
There are also other (weaker) notions of regularity that suffice for certain applications, where the
dependence on ε is much better. We discuss below a variant due to Frieze and Kannan [29]. There are
also other variants known, for example [36, 4, 22]. We refer the reader to the survey [38]. Frieze and
Kannan defined the following notion of a pseudoregular partition.
Definition 3.5 (ε-pseudoregular partition). Let P = {V₁, . . . , V_k} be a partition of V, and let d_ij be the density of (V_i, V_j). For a subset S ⊆ V and i = 1, . . . , k, let S_i = S ∩ V_i. The partition P is ε-pseudoregular if the following relation holds for all disjoint subsets S, T of V:

| e(S, T) − ∑_{i,j=1}^{k} d_ij |S_i||T_j| | ≤ εn².

A partition is equitable if for all i, j, ||V_i| − |V_j|| ≤ 1.
Theorem 3.6 (Frieze-Kannan [29], Theorem 2 and Section 5.1). For all ε > 0, an equitable ε-pseudoregular partition of an n node graph with at most min{n, 2^{4⌈64/(3ε²)⌉}} parts can be constructed in

O( 2^{O(1/ε²)} · n²/(ε²δ³) )

time with a randomized algorithm that succeeds with probability at least 1 − δ.

The runtime bound above is a little tighter than what Frieze and Kannan claim, but an inspection of their algorithm shows that this bound is achieved. Note that Lovász and Szegedy [42] have proven that for any ε-pseudoregular partition, the number of parts must be at least (1/4) · 2^{1/(8ε)}.
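To make Definition 3.5 concrete, the following Python sketch computes, by brute force, the quantity that an ε-pseudoregular partition controls, namely |e(S,T) − ∑_{i,j} d_ij|S_i||T_j||. The function names are ours; densities count ordered pairs (u, v) with v adjacent to u, matching e(S, T) above.

```python
def density_matrix(adj, parts):
    """d[i][j] = density of the pair (parts[i], parts[j]): the number of
    ordered pairs (u, v) with v in adj[u], divided by |Vi||Vj|."""
    k = len(parts)
    d = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            e = sum(1 for u in parts[i] for v in parts[j] if v in adj[u])
            d[i][j] = e / (len(parts[i]) * len(parts[j]))
    return d

def pseudoregular_error(adj, parts, S, T):
    """|e(S,T) - sum_ij d_ij |S_i||T_j||: an eps-pseudoregular partition
    guarantees this is at most eps * n^2 for all disjoint S, T."""
    d = density_matrix(adj, parts)
    e_ST = sum(1 for u in S for v in T if v in adj[u])
    k = len(parts)
    est = sum(d[i][j] * len(S & parts[i]) * len(T & parts[j])
              for i in range(k) for j in range(k))
    return abs(e_ST - est)
```

For instance, partitioning a complete bipartite graph along its bipartition gives error 0 on every query, while the trivial one-part partition can err by a constant fraction of n².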
3.2 Preprocessing Boolean matrices for sparse operations
Our algorithms exploit regularity to reduce dense BMM to a collection of somewhat sparse matrix
multiplications. To this end, we need results on preprocessing matrices to speed up computations on
sparse inputs. The first deals with multiplication of an arbitrary matrix with a sparse vector, and the
second deals with multiplication of a sparse matrix with another (arbitrary) matrix.
Theorem 3.7 (Blelloch-Vassilevska-Williams [13]). Let B be an n × n Boolean matrix and let w be the wordsize. Let κ ≥ 1 and ℓ > κ be integer parameters. There is a data structure that can be constructed with O((n²κ/ℓ) · ∑_{b=1}^{κ} (ℓ choose b)) preprocessing time, so that for any Boolean vector v, the product B ⋆ v can be computed in

O( n log n + n²/(ℓw) + nt/(κw) )

time, where t is the number of nonzeros in v.
This result is typically applied as follows. Fix a value of t to be the number of nonzeros we expect in a typical vector v. Choose ℓ and κ such that n/ℓ ≈ t/κ and ∑_{b=1}^{κ} (ℓ choose b) = n^δ for some δ > 0. One such choice is κ = δ ln(n)/ln(en/t) and ℓ = κ · en/t, in which case we obtain:

Theorem 3.8. Let B be an n × n Boolean matrix. There is a data structure that can be constructed with Õ(n^{2+δ}) preprocessing time, so that for any Boolean vector v, the product B ⋆ v can be computed in

O( n log n + nt ln(en/t)/(δ w ln n) )

time, where t is the number of nonzeros in v.
We should remark that we do not explicitly apply the above theorem, but the idea (of preprocessing
for sparse vectors) is used liberally in this paper.
The following result is useful for multiplying a sparse matrix with another arbitrary matrix.
Theorem 3.9. There is an O(mn log(n²/m)/(w log n)) time algorithm for computing A ⋆ B, for every n × n A and B, where A has m nonzeros and B is arbitrary.
This result follows in a straightforward manner by combining the two lemmas below. The first is a
graph compression method due to Feder and Motwani.
Lemma 3.10 (From Feder-Motwani [24], Theorem 3.3). Let δ ∈ (0, 1) be constant. We can write any n × n Boolean matrix A with m nonzeros as A = (C ⋆ D) ∨ E, where C is n × m/n^{1−δ} and D is m/n^{1−δ} × n, both with at most m log(n²/m)/(δ log n) nonzeros, and E is n × n with at most n^{2−δ} nonzeros. Furthermore, finding C, D, E takes O(mn^δ log² n) time.
Since the lemma is not stated explicitly in [24], let us sketch the proof for completeness. Using algorithmic Ramsey-theoretic arguments (i. e., finding large bipartite cliques in a dense enough graph), Feder and Motwani show that for every bipartite graph G on 2n nodes (with n nodes each on left and right) and m > n^{2−δ} edges, its edge set can be decomposed into m/n^{1−δ} edge-disjoint bipartite cliques, where the total sum of vertices over all bipartite cliques (a vertex appearing in K cliques is counted K times) is at most m log(n²/m)/(δ log n). Every A can be written in the form (C ⋆ D) ∨ E by having the columns of C (and rows of D) correspond to the bipartite cliques. Set C[i, k] = 1 iff the ith node of the LHS of G is in the kth bipartite clique, and similarly set D for the nodes on the RHS of G. Note that E is provided just in case A turns out to be sparse.
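The identity A = (C ⋆ D) ∨ E can be checked on a toy clique cover. The Python sketch below is emphatically not the Ramsey-based construction of Feder and Motwani: it naively makes each distinct nonzero row pattern of A its own bipartite clique (so E = 0), which suffices to illustrate the representation. All names are ours.

```python
def clique_cover_product(A):
    """Represent Boolean matrix A (a list of row bitmasks) as C * D, with
    one bipartite clique per distinct nonzero row pattern: rows sharing a
    pattern x the columns of that pattern.  Naive choice; E = 0 here."""
    patterns = {}                      # row pattern -> clique index
    for row in A:
        if row and row not in patterns:
            patterns[row] = len(patterns)
    D = list(patterns)                 # K x n: clique k covers columns of its pattern
    C = [(1 << patterns[row]) if row else 0 for row in A]   # n x K memberships
    return C, D

def bool_product(C, D):
    """Boolean product, with each row of C given as a bitmask over K cliques."""
    out = []
    for c in C:
        row, k = 0, 0
        while c:
            if c & 1:
                row |= D[k]            # OR in the columns covered by clique k
            c >>= 1
            k += 1
        out.append(row)
    return out
```

The point of the actual lemma is that a cleverer clique choice makes C and D far sparser than A when A is dense.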
We also need the following simple folklore result. It is stated in terms of wordsize w, but it can easily
be implemented on other models such as pointer machines with w = log n.
Lemma 3.11 (Folklore). There is an O(mn/w + pq + pn) time algorithm for computing A ⋆ B, for every p × q matrix A and q × n matrix B, where A has m nonzeros and B is arbitrary.
Proof. We assume the nonzeros of A are stored in a list structure; if not, we construct this in O(pq) time. Let B_j be the jth row of B and C_i be the ith row of C in the following. We start with an output matrix C that is initially zero. For each nonzero entry (i, j) of A, update C_i to be the OR of B_j and C_i. Each update takes only O(n/w) time. It is easy to verify that the resulting C is the matrix product.
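The proof of Lemma 3.11 translates almost directly into code. In the Python sketch below (names are ours), rows of B and C are integer bitmasks, so each OR is the O(n/w) word-parallel update from the proof.

```python
def sparse_times_dense(nonzeros, B, p):
    """Compute C = A * B over the Boolean semiring, where the p x q matrix
    A is given by its list of nonzero positions (i, j) and B by its rows
    as integer bitmasks.  One bitmask OR per nonzero of A."""
    C = [0] * p
    for (i, j) in nonzeros:
        C[i] |= B[j]          # OR row j of B into row i of C
    return C
```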
4 Combinatorial Boolean matrix multiplication via triangle removal
In this section, we prove Theorem 2.1. That is, we show that a more efficient Triangle Removal Lemma
implies more efficient Boolean matrix multiplication. Let A and B be the matrices whose product D
we wish to compute. The key idea is to split the task into two cases. First, we use simple random
sampling to determine the entries in the product that have many witnesses (where k is a witness for (i, j) if
A[i, k] = B[k, j] = 1). To compute the entries with few witnesses, we set up a tripartite graph corresponding
to the remaining undetermined entries of the matrix product, and argue that it has few triangles. (Each
triangle corresponds to a specific witness for a specific entry in D that is still undetermined.) By a Triangle Removal Lemma, a small set of edges hits all the triangles in this graph.⁷ Using three carefully designed sparse matrix products (which only require one of the matrices to be sparse), we can recover all those entries D[i, j] = 1 which have few witnesses.
We now describe our algorithm for BMM.
Algorithm: Let A and B be n × n matrices. We wish to compute D = A ⋆ B, i. e.,

D[i, j] = ⋁_{k=1}^{n} ( A[i, k] ∧ B[k, j] ).
Random sampling for pairs with many witnesses. First, we detect the pairs (i, j) with at least εn witnesses. Construct an n × n matrix C as follows. Pick a sample R of (6 log n)/ε elements from [n]. For each (i, j), 1 ≤ i, j ≤ n, check if there is a k ∈ R that is a witness for (i, j) in the product. If yes, set C[i, j] = 1, otherwise C[i, j] = 0. Clearly, this takes at most O((n² log n)/ε) time. Note that C is dominated by the desired D, in that C[i, j] ≤ D[i, j] for all i, j. If (i, j) has at least εn witnesses, then some witness lies in R with probability at least 1 − 1/n⁶. Thus with probability at least 1 − 1/n⁴, C[i, j] = D[i, j] = 1 for every (i, j) with at least εn witnesses.
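The sampling step can be sketched as follows in Python. The function name and parameters are illustrative, and a fixed seed is used only to make the sketch reproducible.

```python
import math
import random

def sample_heavy_entries(A, B, n, eps, seed=0):
    """Sampling step: C[i][j] = 1 iff some sampled index k is a witness
    for (i, j), i.e., A[i][k] = B[k][j] = 1.  With |R| ~ (6 ln n)/eps,
    each pair with >= eps*n witnesses is caught with prob. 1 - 1/n^6."""
    rng = random.Random(seed)
    r = max(1, int(round(6 * math.log(max(n, 2)) / eps)))
    R = [rng.randrange(n) for _ in range(r)]
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if any(A[i][k] and B[k][j] for k in R):
                C[i][j] = 1
    return C
```

By construction C[i, j] ≤ D[i, j] always holds: a sampled witness certifies D[i, j] = 1, so the sampling never produces false positives.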
⁷ Note that the triangle removal lemma may also return edges that do not lie in any triangle.
Triangle removal for pairs with few witnesses. It suffices to determine those (i, j) such that C[i, j] = 0
and D[i, j] = 1. We shall exploit the fact that such pairs do not have many witnesses. Make a tripartite
graph H with vertex sets V1 , V2 , V3 , each with n nodes indexed by 1, . . . , n. Define edges as follows:
• Put an edge (i, k) ∈ (V1 ,V2 ) if and only if A[i, k] = 1.
• Put an edge (k, j) ∈ (V2 ,V3 ) if and only if B[k, j] = 1.
• Put an edge (i, j) ∈ (V1 ,V3 ) if and only if C[i, j] = 0.
That is, edges from V1 to V3 are given by ¬C, the complement of C. Observe that (i, k, j) ∈ (V1, V2, V3)
is a triangle if and only if k is a witness for (i, j) and C[i, j] = 0. Thus our goal is to find the pairs
(i, j) ∈ (V1, V3) that are in triangles of H.
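Concretely, the construction and the triangle–witness correspondence can be sketched like this (illustrative Python, with the three vertex classes kept as index sets):

```python
def build_h(A, B, C):
    """Tripartite graph H: (V1,V2) edges from A, (V2,V3) edges from B,
    and (V1,V3) edges from the complement of C."""
    n = len(A)
    E12 = {(i, k) for i in range(n) for k in range(n) if A[i][k]}
    E23 = {(k, j) for k in range(n) for j in range(n) if B[k][j]}
    E13 = {(i, j) for i in range(n) for j in range(n) if not C[i][j]}
    return E12, E23, E13

def h_triangles(A, B, C):
    """(i, k, j) is a triangle iff k witnesses (i, j) and C[i][j] = 0."""
    E12, E23, E13 = build_h(A, B, C)
    return {(i, k, j) for (i, k) in E12 for (k2, j) in E23
            if k2 == k and (i, j) in E13}
```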
Since every (i, j) ∈ (V1, V3) has at most εn witnesses, there are at most εn^3 triangles in H. Applying
the promised Triangle Removal Lemma (in Theorem 2.1), we can find in time O(T(n)) a set of edges F
where |F| ≤ f(ε)n^2 and each triangle must use an edge in F. Hence it suffices to compute those edges
(i, j) ∈ (V1, V3) that participate in a triangle with an edge in F.
Define A_F[i, j] = 1 if and only if A[i, j] = 1 and (i, j) ∈ F. Similarly define B_F and C_F. Every triangle
of H passes through at least one edge from one of these three matrices. Let T_A (resp. T_B and T_C) denote
the set of triangles with an edge in A_F (resp. B_F and C_F). Note that we do not know these triangles.
We can determine the edges (i, j) ∈ (V1, V3) that are in some triangle in T_A or T_B directly by computing
C1 = A_F ⋆ B and C2 = A ⋆ B_F, respectively. As A_F and B_F are sparse, by Theorem 3.9, these products can
be computed in O(|F| n log(n^2/|F|)/(w log n)) time. The 1-entries of ¬C ∧ C1 (resp. ¬C ∧ C2) participate in a
triangle in T_A (resp. T_B). This determines the edges in (V1, V3) participating in triangles from T_A ∪ T_B.
Set C = C ∨ (C1 ∧ ¬C) ∨ (C2 ∧ ¬C), and update ¬C and the edges in (V1, V3) accordingly. The only
remaining edges in (V1, V3) that could be involved in a triangle are those corresponding to 1-entries in C_F.
We now need to determine which of these actually lie in a triangle.
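The two sparse products can be mimicked directly (an illustrative sketch; F is split into its A-side and B-side edge sets FA and FB, and each loop touches only the edges of the sparse factor):

```python
def resolve_via_removal_edges(A, B, C, FA, FB):
    """Find entries (i, j) with C[i][j] = 0 whose triangles pass through an
    A-side or B-side removal edge.  Mimics C1 = A_F * B and C2 = A * B_F:
    each product iterates only over the sparse factor's edges."""
    n = len(A)
    found = set()
    for (i, k) in FA:                      # C1 = A_F * B
        if A[i][k]:
            for j in range(n):
                if B[k][j] and not C[i][j]:
                    found.add((i, j))
    for (k, j) in FB:                      # C2 = A * B_F
        if B[k][j]:
            for i in range(n):
                if A[i][k] and not C[i][j]:
                    found.add((i, j))
    return found
```

If F hits every triangle, the entries not found here are exactly those whose triangles all pass through C-side edges of F; those are handled by Theorem 4.1.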
Our remaining problem is the following: we have a tripartite graph on vertex set (V1, V2, V3) with
at most f(ε)n^2 edges between V1 and V3, and each such edge lies in at most εn triangles. We wish to
determine the edges in (V1, V3) that participate in triangles. This problem is solved by the following
theorem.
Theorem 4.1 (Reporting Edges in Triangles). Let G be a tripartite graph on vertex set (V1, V2, V3) such
that there are at most δn^2 edges in (V1, V3), and every edge of (V1, V3) is in at most t triangles. Then the
set of edges in (V1, V3) that participate in triangles can be computed in O(δn^3 log(1/δ)/(w log n) + n^2 t)
time.
Setting δ = f (ε) and t = εn, Theorem 4.1 implies the desired time bound in Theorem 2.1. The idea
of the proof of Theorem 4.1 is to work with a new tripartite graph where the vertices have asymptotically
smaller degrees, at the cost of adding slightly more nodes. This is achieved by having some nodes in our
new graph correspond to small subsets of nodes in the original tripartite graph.
Proof of Theorem 4.1. We first describe how to do the computation on a pointer machine with w = log n,
then describe how to modify it to work for the word RAM.
Graph Construction. We start by defining a new tripartite graph G' on vertex set (V1, V2', V3'). Let
γ < 1/2. V2' is obtained by partitioning the nodes of V2 into n/(γ log n) groups of size γ log n each. For
each group, we replace it by 2^{γ log n} = n^γ nodes, one corresponding to each subset of nodes in that group.
Thus V2' has n^{1+γ}/(γ log n) nodes.
V3' is also constructed out of subsets of nodes. We form n/ℓ groups each consisting of ℓ nodes in
V3, where ℓ = γ(log n)/(δ log(e/δ)). For each group, we replace it by ∑_{b≤κ} (ℓ choose b) ≤ (eℓ/κ)^κ ≤ n^γ nodes, one
corresponding to each subset of size up to κ = γ(log n)/log(e/δ). So V3' has O(n^{1+γ}/ℓ) nodes.
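To check the n^γ bound on this blow-up, note that ℓ/κ = 1/δ under the stated parameter choices; a short calculation (a sketch of the standard binomial estimate, logs base 2) gives:

```latex
\binom{\ell}{\le \kappa} \;\le\; \left(\frac{e\ell}{\kappa}\right)^{\kappa}
\;=\; \left(\frac{e}{\delta}\right)^{\gamma \log n / \log(e/\delta)}
\;=\; 2^{\gamma \log n} \;=\; n^{\gamma}.
```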
Edges in (V2', V3'): Put an edge between u in V2' and x in V3' if there is an edge (i, j) in (V2, V3) such that i
lies in the set corresponding to u, and j lies in the set corresponding to x. For each such edge (u, x), we
make a list of all edges (i, j) ∈ (V2, V3) corresponding to it. Observe the list for a single edge has size at
most γ log n · ℓ = O(log^2 n).
Edges in (V1, V2'): The edges from v ∈ V1 to V2' are defined as follows. For each group in V2, consider the
neighbors of v in that group. Put an edge from v to the node in V2' corresponding to this subset. Each v
has at most n/(γ log n) edges to nodes in V2'.
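This Four-Russians-style grouping can be sketched as follows (illustrative Python; each node of V2' is identified by a (group, bitmask) pair, so v gets one edge per group rather than one per neighbor):

```python
def edges_to_v2prime(adj_rows, g):
    """For each vertex v, encode its neighborhood inside each group of g
    consecutive V2-vertices as one bitmask; (group, mask) names a V2' node."""
    n = len(adj_rows[0])
    edges = []
    for v, row in enumerate(adj_rows):
        for start in range(0, n, g):
            mask = 0
            for off in range(start, min(start + g, n)):
                if row[off]:
                    mask |= 1 << (off - start)
            edges.append((v, start // g, mask))
    return edges
```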
Edges in (V1, V3'): Let v ∈ V1. For each group g of ℓ nodes in V3, let N_{v,g} be the set of neighbors of v in g.
Let d_{v,g} = |N_{v,g}|. Partition N_{v,g} arbitrarily into t = ⌈d_{v,g}/κ⌉ subsets s_1, . . . , s_t each of size at most κ. Put
edges from v to s_1, . . . , s_t in V3'. The number of these edges from v is at most ∑_g ⌈d_{v,g}/κ⌉ ≤ n/ℓ + d_v/κ,
where d_v is the number of edges from v to V3. Since ∑_v d_v ≤ δn^2, the total number of edges from V1 to V3'
is O(δ log(1/δ) n^2/(γ log n)).
Final Algorithm. For each vertex v ∈ V1, iterate over each pair of v's neighbors u ∈ V2' and x ∈ V3'. If
(u, x) is an edge in G', output the list of edges (i, j) in (V2, V3) corresponding to (u, x); otherwise continue
to the next pair. From these outputs we can easily determine the edges (v, j) in (V1, V3) that are in triangles:
(v, j) is in a triangle if and only if node j in V3 is output as an endpoint of some edge (i, j) ∈ (V2, V3)
during the loop for v in V1.
Running Time: The graph construction takes at most O(n^{2+2γ}) time. In the final algorithm, the total number of
pairs (u, x) in (V2', V3') that are examined is at most

    (n/log n) · O(δ n^2 log(1/δ)/log n) ≤ O(δ log(1/δ) n^3/log^2 n).
We claim that the time used to output the lists of edges is at most O(n^2 t). A node j from V3 is
on an output list during the loop for v in V1 only if (v, j) is an edge in a triangle with some node i of V2
that lies in the relevant subset-node of V2'. Since each edge from (V1, V3) in a triangle is guaranteed to have at
most t witnesses in V2, the node j is output at most t times over the loop for v in V1. Hence the length of
all lists output during the loop for v is at most nt, and the total time for output is at most O(n^2 t).
Modification for w-word RAM: We now show how to replace a log-speedup by a w-speedup with wordsize
w. We form V3' as above and consider the graph on (V1, V2, V3') (note the V2, instead of the V2' used previously).
Recall that a node x ∈ V3' corresponds to a subset V_x ⊆ V3 of at most κ vertices (from some group of size
ℓ). An edge (v, x) ∈ (V1, V3') implies that v is adjacent to every vertex in V_x. For each v ∈ V1, let S_v be
the n-bit indicator vector corresponding to the neighbors of v in V2. For each x ∈ V3', let T_x denote the n-bit
indicator vector corresponding to the union of the neighborhoods of the vertices V_x in V2, i.e., T_x[i] = 1 iff some v ∈ V_x
is adjacent to i ∈ V2. The vectors S_v and T_x are stored as n/w words. With each bit T_x[i] of T_x such that
T_x[i] = 1, we store a pointer to a list of the vertices v ∈ V_x that are adjacent to i ∈ V2. It is easily checked that
the overall space usage is bounded by O(n · |V3'| · κ) = Õ(n^{2+γ}).
The algorithm works as follows: for every v in V1 and every neighbor x ∈ V3' of v, for
i = 1, . . . , n/w, we look up the ith word q from S_v and the ith word q' from T_x, and compute q ∧ q'. If this
is nonzero, then each bit location b where q ∧ q' has a 1 means that the node corresponding to b forms a
triangle with v and some vertex in V_x. For each such bit (say at location i), the list associated to it consists
of precisely the nodes in V_x forming a triangle with i ∈ V2 and v.
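With wordsize w the inner loop is just bitwise ANDs; in Python, arbitrary-precision integers can stand in for the word array (an illustrative sketch):

```python
def report_word_hits(Sv, Tx_list):
    """For one v in V1: AND its V2-neighborhood bitset Sv against the
    bitset T_x of each V3'-neighbor x.  Each set bit b of Sv & T_x is a
    V2-vertex closing a triangle with v and some vertex of V_x."""
    hits = []
    for x, Tx in enumerate(Tx_list):
        common = Sv & Tx
        while common:
            low = common & -common            # isolate lowest set bit
            hits.append((x, low.bit_length() - 1))
            common ^= low                     # clear that bit
    return hits
```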
The running time follows as there are O(δ n^2 log(1/δ)/log n) edges in (V1, V3'), and hence we
look up at most O((δ n^2 log(1/δ)/log n) · (n/w)) words q, q'. Since the total number of triangles is at most n^2 t,
the total time spent in processing whenever q ∧ q' ≠ 0 is O(n^2 t).
Remark 4.2. Note that we only use randomness in the BMM algorithm to determine the pairs (i, j) that
have many witnesses. Moreover, by choosing a larger sample R in the random sampling step (notice we
have a lot of slack in the running time of the random sampling step), the probability of failure can be
made exponentially small.
Using the best known bounds for triangle removal, we obtain the following corollary to Theorem 2.1:
Corollary 4.3. There is a δ > 0 and a randomized algorithm for Boolean matrix multiplication that
works with high probability and runs in

    O( n^3 log(log* n) / (w (log n)(log* n)^δ) )

time.
Proof. Let ε = 1/√n. By the usual proof of the triangle removal lemma (via the Regularity Lemma), it
suffices to set f(ε) = 1/(log*(1/ε))^δ in Theorem 2.1 for a constant δ > 0.
It is our hope that further work on triangle removal may improve the dependency of f . In the next
section, we show how to combine the Weak Regularity Lemma along with the above ideas to construct a
faster algorithm for BMM.
5 Faster Boolean matrix multiplication via Weak Regularity
We first state a useful lemma for preprocessing a Boolean matrix B to compute the product u^T B v quickly, given
any Boolean vectors u and v. Viewing B as the incidence matrix of a bipartite graph, this corresponds to
determining whether there is some edge e ∈ (U, V), where U and V are the sets with characteristic
vectors u and v. This lemma is inspired by Theorem 3.7 and uses a similar technique to our algorithm for
reporting the edges that appear in triangles (Theorem 4.1).
Theorem 5.1 (Preprocessing for Bilinear Forms). Let B be an n × n Boolean matrix. Let κ ≥ 1 and ℓ ≥ κ
be integer parameters. For the pointer machine, there is a data structure that can be built in

    O( (n^2/ℓ^2) · ( ∑_{b=1}^{κ} (ℓ choose b) )^2 )

time, so that for any u, v ∈ {0,1}^n, the product u^T B v over the Boolean semiring can be computed in

    O( nℓ + (n/ℓ + t_u/κ)(n/ℓ + t_v/κ) )

time, where t_u and t_v are the number of nonzeros in u and v, respectively. Moreover, the data structure
can output the list of pairs (i, j) such that u_i B[i, j] v_j = 1 in O(p) additional time, where p is the number
of such pairs.
On the word RAM with w ≥ log n, the same can be achieved in

    O( nℓ + (n/w) · (n/ℓ + min(t_u, t_v)/κ) )

time.
For our applications, we shall set ℓ = log^2 n and κ = log n/(5 log log n). Then the preprocessing is
n^{3−Ω(1)}, u^T B v can be computed in time

    O( (n/log^2 n + t_u log log n/log n) · (n/log^2 n + t_v log log n/log n) )        (5.1)

on a pointer machine, and it can be computed on RAMs with large wordsize w in time

    O( n^2/(w log^2 n) + n min(t_u, t_v) log log n/(w log n) ).        (5.2)
Proof of Theorem 5.1. As in the proof of Theorem 4.1, we first describe how to implement the algorithm
on a pointer machine, then show how it may be adapted. We view B as a bipartite graph G = (U, V, E)
in the natural way, where U = V = [n] and (i, j) ∈ E iff B[i, j] = 1. We group the vertices in U and V into
⌈n/ℓ⌉ groups, each of size at most ℓ. For each group g, we introduce a new vertex for every subset of up
to κ vertices in that group. Let U' and V' be the sets of vertices obtained. We view the nodes of U' and V' also
as vectors of length ℓ with up to κ non-zeros. Clearly

    |U'| = |V'| = O( (n/ℓ) ∑_{b=1}^{κ} (ℓ choose b) ).
For every vertex u' ∈ U', we store a table T_{u'} of size |V'|. The v'-th entry of T_{u'} is 1 iff there is an
i ∈ U in the set corresponding to u', and a j ∈ V in the set corresponding to v', such that B[i, j] = 1. Each
such (i, j) is said to be a witness to T_{u'}[v'] = 1. In the output version of the data structure, we associate a list
L_{v'} with every nonzero entry v' in the table T_{u'}, containing those (i, j) pairs which are witnesses to
T_{u'}[v'] = 1. Note that |L_{v'}| ≤ O(κ^2).
Given query vectors u and v, we compute u^T B v and those (i, j) satisfying u_i B[i, j] v_j = 1 as follows.
Let u_g be the restriction of the vector u to group g of U. Note |u_g| ≤ ℓ. Let t(u, g) denote the number of
non-zeros in u_g. Express u_g as a Boolean sum of at most ⌈t(u, g)/κ⌉ vectors (nodes) from U'; this can
be done since each vector in U' has up to κ non-zeros. Do this over all groups g of U. Now u can be
represented as a Boolean sum of at most n/ℓ + t_u/κ vectors from U'. We repeat a similar procedure for v
over all groups g of V, obtaining a representation of v as a sum of at most n/ℓ + t_v/κ vectors from V'.
These representations can be determined in O(nℓ) time.
Let S_u ⊆ U' be the subset of vectors representing u, and S_v ⊆ V' be the vectors for v. For all u' ∈ S_u
and v' ∈ S_v, look up T_{u'}[v']; if it is 1, output the list L_{v'}. Observe that u^T B v = 1 iff some T_{u'}[v']
equals 1. It is easily seen that this procedure satisfies the desired running time bounds.
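The whole pointer-machine data structure fits in a short sketch (illustrative Python; subsets are tuples of in-group indices, `T` is the table, and κ-sized chunks of a group's support play the role of the vectors from U' and V'):

```python
from itertools import combinations

def build_table(B, ell, kappa):
    """T contains (u', v') iff some witness B[i][j] = 1 has i in u', j in v'."""
    n = len(B)
    groups = [range(s, min(s + ell, n)) for s in range(0, n, ell)]
    nodes = [(gi, sub) for gi, g in enumerate(groups)
             for size in range(1, kappa + 1)
             for sub in combinations(g, size)]
    T = {(u, v) for u in nodes for v in nodes
         if any(B[i][j] for i in u[1] for j in v[1])}
    return groups, T

def decompose(x, groups, kappa):
    """Group by group, split x's support into chunks of at most kappa."""
    reps = []
    for gi, g in enumerate(groups):
        nz = tuple(i for i in g if x[i])
        reps += [(gi, nz[s:s + kappa]) for s in range(0, len(nz), kappa)]
    return reps

def bilinear(u, v, groups, T, kappa):
    """u^T B v over the Boolean semiring, via table lookups only."""
    Su, Sv = decompose(u, groups, kappa), decompose(v, groups, kappa)
    return int(any((a, b) in T for a in Su for b in Sv))
```

Each query costs one lookup per pair of representative chunks, i.e., about (n/ℓ + t_u/κ)(n/ℓ + t_v/κ) lookups.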
Finally, we consider how to implement the above on the word RAM model. We shall have two
(analogous) data structures depending on whether t_u ≤ t_v or not.
Suppose t_u ≤ t_v (the other situation is analogous). As previously in Theorem 4.1, we form the set
U' with vertices corresponding to subsets of up to κ nonzeros within a vector of size ℓ. With each such
vertex u' ∈ U' we associate an n-bit vector T_{u'} (which is stored as an n/w-word vector), obtained by taking
the union of the rows of B corresponding to u'. Now, since v can also be stored as an n/w-word vector, the
product T_{u'} · v can be performed in n/w time. For a given u there are at most n/ℓ + t_u/κ relevant vectors
T_{u'}, and hence the product u^T B v can be computed in time O((n/ℓ + t_u/κ)(n/w)).
Theorem 5.2. There is a combinatorial algorithm that, given any two Boolean n × n matrices A and B,
computes A ⋆ B correctly with probability exponentially close to 1, in O(n^3 (log log n)^2/log^{2.25} n) time
on a pointer machine, and O(n^3 (log log n)/(w log^{7/6} n)) time on a word RAM.
Proof. The algorithm builds on the ideas in Theorem 2.1 (the BMM algorithm using triangle removal),
while applying the bilinear form preprocessing of Theorem 5.1, the algorithm for reporting edges in
triangles (Theorem 4.1), and Weak Regularity. We first describe the algorithm for pointer machines.
Algorithm. As in Theorem 2.1, by taking a random sample of √n indices from [n], we can determine
those pairs (i, j) such that (A ⋆ B)[i, j] = 1 where there are at least n^{3/4} witnesses to this fact. This takes
O(n^{2.5}) time and succeeds with probability 1 − exp(−n^{Ω(1)}).
Next we construct a tripartite graph G = (V1 ,V2 ,V3 , E) exactly as in Theorem 2.1, and just as before
our goal is to determine all edges (i, j) ∈ (V1 ,V3 ) that form at least one triangle with some vertex in V2 .
Compute an ε-pseudoregular partition {W_1, . . . , W_k} of the bipartite subgraph (V1, V3), with
ε = 1/(α √(log n)) for an α > 0. By Theorem 3.6 this partition can be found in 2^{O(α^2 log n)} time. Set α to make
the runtime O(n^{2.5}). Recall that d_{ij} is the density of the pair (W_i, W_j). The preprocessing stores two data
structures, one for pairs with “low” density and one for pairs with “high” density.
1. (Low Density Pairs) Let F be the set of all edges in (V1, V3) that lie in some pair (W_i, W_j) where
d_{ij} ≤ √ε. Note |F| ≤ √ε n^2. Apply the algorithm of Theorem 4.1 to determine the subset of edges
in F that participate in triangles. Remove the edges of F from G.
2. (High Density Pairs) For all pairs (W_i, W_j) with d_{ij} > √ε, build the data structure for computing
bilinear forms (Theorem 5.1) for the submatrix A_{ij} corresponding to the graph induced by (W_i, W_j),
with ℓ = log^2 n and κ = log n/(5 log log n).
Then for each vertex v ∈ V2, let S_i(v) = N(v) ∩ W_i and T_j(v) = N(v) ∩ W_j. Whenever (i, j) is a high
density pair, compute all pairs of nodes in S_i(v) × T_j(v) that form a triangle with v, using the bilinear
form query algorithm of Theorem 5.1.
Analysis. Clearly, the random sampling step takes O(n^{2.75}) time. Consider the low density pairs step.
Recall |F| ≤ √ε n^2 and every edge in (V1, V3) is in at most n^{3/4} triangles. Moreover, the function
f(δ) = δ log(1/δ) is increasing for small δ (e.g., over [0, 1/4]). Hence the algorithm that reports all
edges appearing in triangles (from Theorem 4.1) takes at most

    O(√ε n^3 log(1/ε)/log^2 n) ≤ O(n^3 log log n/log^{2.25} n)
time.
Now we bound the runtime of the high density pairs step. First note that the preprocessing for bilinear
forms (Theorem 5.1) takes only

    O( (n^2/log^4 n) · (log^2 n choose log n/(5 log log n))^2 ) ≤ O( (n^2/log^4 n) · (log^2 n)^{2 log n/(5 log log n)} ) = O(n^{2+4/5})

time overall.
Let e(S, T) denote the number of edges between subsets S and T. Since there are O(n^{2.75}) triangles,
we have that

    ∑_{v∈V2} e(N(v) ∩ V1, N(v) ∩ V3) ≤ n^{2.75}.        (5.3)
Since {W_i} is an ε-pseudoregular partition of (V1, V3), by Definition 3.5, for any vertex v ∈ V2 we have that

    | ∑_{i,j} d_{ij} |S_i(v)||T_j(v)| − e(N(v) ∩ V1, N(v) ∩ V3) | ≤ εn^2.
Summing up over all vertices v, together with (5.3), implies that

    ∑_{v∈V2} ∑_{i,j} d_{ij} |S_i(v)||T_j(v)| ≤ εn^3 + O(n^{2.75}) ≤ 2εn^3

for large n. Summing over densities d_{ij} ≥ √ε, we obtain

    ∑_{v∈V2} ∑_{i,j: d_{ij}≥√ε} |S_i(v)||T_j(v)| ≤ 2√ε n^3 ≤ 2n^3/log^{0.25} n.        (5.4)
Applying expression (5.1), the time taken by all queries on the data structure for bilinear forms
(Theorem 5.1) for a fixed pair (W_i, W_j) is at most

    ∑_{v∈V2} ( (n/k)/log^2(n/k) + |S_i(v)| log log(n/k)/log(n/k) ) · ( (n/k)/log^2(n/k) + |T_j(v)| log log(n/k)/log(n/k) ).
Expanding the products and summing up over the pairs (i, j) with d_{ij} ≥ √ε, the total running time is
bounded by

    ∑_{v∈V2} ∑_{i,j: d_{ij}≥√ε} |S_i(v)||T_j(v)| (log log n)^2/log^2(n/k)
plus other terms with a total contribution of at most O(n^3 log log(n/k)/log^3(n/k)). Thus by (5.4), the
total runtime is upper bounded by O(n^3 (log log n)^2/log^{2.25} n). Finally, the random sampling step ensures
that the number of witnesses is at most n^{0.75} for every edge, so the output cost in the algorithm is at most
O(n^{2.75}).
Modification for the word RAM. To exploit a model with a larger wordsize, we apply the same algorithm
as above, except we run the low density pairs step for pairs (W_i, W_j) with density d_{ij} ≤ ε^{1/3} (instead of
√ε). For the pairs (W_i, W_j) with d_{ij} > ε^{1/3}, construct the data structure for bilinear forms (Theorem 5.1)
for the word RAM.
First we consider pairs (i, j) for which d_{ij} ≤ ε^{1/3}. For these pairs, as above, we apply the processing
step in Theorem 4.1 for reporting the edges appearing in triangles. This has a running time

    O(ε^{1/3} n^3 log(1/ε)/(w log n)) ≤ O(n^3 log log n/(w log^{7/6} n)).
For pairs (i, j) with d_{ij} > ε^{1/3}, we use the data structure of Theorem 5.1 to answer the bilinear
queries. By (5.2), the total running time for such pairs is

    ∑_{v∈V2} ∑_{i,j: d_{ij}>ε^{1/3}} ( (n/k)^2/(w log^2(n/k)) + (n/k) min(|S_i(v)|, |T_j(v)|) log log(n/k)/(w log(n/k)) )

    ≤ n^3/(w log^2(n/k)) + ∑_{v∈V2} ∑_{i,j: d_{ij}≥ε^{1/3}} (n/k) min(|S_i(v)|, |T_j(v)|) log log(n/k)/(w log(n/k)).        (5.5)
To bound the second term above, observe that

    ∑_{v∈V2} ∑_{i,j: d_{ij}≥ε^{1/3}} min(|S_i(v)|, |T_j(v)|)
    ≤ ∑_{v∈V2} ∑_{i,j: d_{ij}≥ε^{1/3}} (|S_i(v)| · |T_j(v)|)^{1/2}
    ≤ k √n · ( ∑_{v∈V2} ∑_{i,j: d_{ij}≥ε^{1/3}} |S_i(v)||T_j(v)| )^{1/2}
    ≤ 2kn^2/log^{1/6} n.        (5.6)
Here the second inequality follows by Cauchy-Schwarz and the last inequality follows as
    ∑_{v∈V2} ∑_{i,j: d_{ij}≥ε^{1/3}} |S_i(v)||T_j(v)| ≤ 2ε^{2/3} n^3 ≤ 2n^3/log^{1/3} n
using an argument identical to that used to obtain (5.4).
By (5.6) the total running time in (5.5) is O(n3 log log n/(w log7/6 n)) as desired.
6 Independent set queries via Weak Regularity
We consider the following independent set query problem. We want to preprocess an n-node graph in
polynomial time and space, so that given any S_1, . . . , S_w ⊆ V, we can determine in n^2/f(n) time which of
S_1, . . . , S_w are independent sets. Using such a subroutine, we can easily determine in n^3/(w f(n)) time if a
graph has a triangle (provided the preprocessing itself can be done in O(n^3/(w f(n))) time), by executing
the subroutine on collections of sets corresponding to the neighborhoods of each vertex.
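The reduction in this paragraph is simple enough to state in code (illustrative Python; the `is_independent` call stands in for one query of the hypothetical preprocessed structure):

```python
def has_triangle_via_queries(adj):
    """A graph has a triangle iff the neighborhood N(v) of some vertex v is
    NOT an independent set; so n independent-set queries on neighborhoods
    decide triangle-freeness."""
    n = len(adj)
    def is_independent(S):            # stand-in for one preprocessed query
        return not any(adj[a][b] for a in S for b in S if a < b)
    return any(not is_independent([u for u in range(n) if adj[v][u]])
               for v in range(n))
```

Batching w such neighborhood queries per call is exactly what yields the n^3/(w f(n)) triangle-detection bound.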
The independent set query problem is equivalent to: preprocess a Boolean matrix A so that w
queries of the form “v_j^T A v_j = 0?” can be computed in n^2/f(n) time, where the products are over the
Boolean semiring. We shall solve a more general problem: preprocess A to answer w queries of the form
“u^T A v = 0?”, for arbitrary u, v ∈ {0,1}^n.
Our method employs weak regularity along with other combinatorial ideas seen earlier in the paper.
Theorem 6.1. For all δ ∈ (0, 1/2), every n × n Boolean matrix A can be preprocessed in O(n^{2+δ}) time
such that, given arbitrary Boolean vectors u_1, . . . , u_{log n} and v_1, . . . , v_{log n}, we can determine if u_p^T A v_p = 0,
for all p = 1, . . . , log n, in

    O( n^2 (log log n)^2/(δ (log n)^{5/4}) )

time on a pointer machine.
On the word RAM we can determine if u_p^T A v_p = 0, for all p = 1, . . . , w, in time

    O( n^2 (log log n)/(δ (log n)^{7/6}) )

where w is the wordsize.
Proof of Theorem 6.1. We describe the algorithm on the pointer machine; it can be extended to the word
RAM by a modification identical to that in Theorem 5.2. We start with the preprocessing.
Preprocessing. Interpret A as a bipartite graph in the natural way. Compute an ε-pseudoregular partition
of the bipartite A = (V, W, E) with ε = Θ(1/√(log n)), using Theorem 3.6. (Note that this is the only
randomized part of the algorithm.) Let V_1, V_2, . . . , V_k be the parts of V and let W_1, . . . , W_k be the parts of
W, where k ≤ 2^{O(1/ε^2)}.
Let A_{ij} be the submatrix of A corresponding to the subgraph induced by the pair (V_i, W_j). Let d_{ij} be
the density of (V_i, W_j). Let ∆ = √ε.
For each of the k^2 submatrices A_{ij}, do the following:
1. If d_{ij} ≤ ∆, apply the graph compression of Theorem 3.10 to preprocess A_{ij} in time m n^δ log^2 n, so
that (using Lemma 3.11) the submatrix A_{ij} can be multiplied by any n/k × log n matrix B in time

    O( m log((n/k)^2/m)/log(n/k) ),

where m is the number of nonzeros in A_{ij}. (Note that m ≤ ∆(n/k)^2.)
2. If d_{ij} > ∆, apply the bilinear form preprocessing of Theorem 5.1 to A_{ij} with ℓ = log^2 n and
κ = δ log n/(5 log log n).
Query Algorithm. Given Boolean vectors u_p and v_p for p = 1, . . . , log n, let S^p ⊆ [n] be the subset
corresponding to u_p and T^p ⊆ [n] be the subset corresponding to v_p. For 1 ≤ i, j ≤ k, let S_i^p = S^p ∩ V_i and
T_j^p = T^p ∩ W_j.
1. Compute the estimate Q^p = ∑_{i,j=1}^{k} d_{ij} |S_i^p||T_j^p| for all p = 1, . . . , log n. If Q^p > εn^2, then output
u_p^T A v_p = 1.
2. Let I = {p : Q^p ≤ εn^2}. Note that |I| ≤ log n. We determine u_p^T A v_p for each p ∈ I as follows:
• For all (i, j) with d_{ij} > ∆, apply the bilinear form algorithm of Theorem 5.1 to compute
e_{ij}^p = (S_i^p)^T A_{ij} T_j^p for each p ∈ I.
• For all (i, j) with d_{ij} ≤ ∆, form an (n/k) × |I| matrix B_j with columns T_j^p over all p ∈ I. Compute
C_{ij} = A_{ij} ⋆ B_j using the A_{ij} from preprocessing step 1. For each p ∈ I, compute the (Boolean)
dot product e_{ij}^p = (S_i^p)^T · C_{ij}^p, where C_{ij}^p is the p-th column of C_{ij}.
• For each p ∈ I, return u_p^T A v_p = ⋁_{i,j} e_{ij}^p.
Analysis. We first consider the preprocessing time. By Theorem 3.6, we can choose ε so that the
ε-pseudoregular partition is constructed in O(n^{2+δ}) time. By Theorems 3.9 and 5.1, the preprocessing
for the matrices A_{ij} takes at most O(k^2 (n/k)^{2+δ}) time for some δ < 1/2. Thus, the total time is at most
O(n^{2+δ}).
We now analyze the query algorithm. Notice that step 1 of the query algorithm works due to ε-
pseudoregularity: if Q^p > εn^2 then the number of edges between S^p and T^p in A is greater than 0.
Computing all Q^p takes time at most O(k^2 n log n).
Consider the second step. As ∑_{i,j} d_{ij} |S_i^p||T_j^p| ≤ εn^2 for each p ∈ I, we have

    ∑_{i,j: d_{ij}≥∆} |S_i^p||T_j^p| ≤ εn^2/∆ = √ε n^2.        (6.1)
Analogously to Theorem 5.2, the total runtime over all p ∈ I and pairs (i, j) with d_{ij} > ∆ is at most

    ∑_{p∈I} ∑_{i,j: d_{ij}>∆} ( (n/k)/log^2(n/k) + |S_i^p| log log(n/k)/log(n/k) ) · ( (n/k)/log^2(n/k) + |T_j^p| log log(n/k)/log(n/k) )

    ≤ O( n^2 log log n/log^3 n + ∑_{p∈I} ∑_{i,j: d_{ij}>∆} |S_i^p||T_j^p| (log log n)^2/log^2 n ).        (6.2)

The inequality (6.1), the fact that |I| ≤ log n, and our choice of ε imply that (6.2) is at most

    O(n^2 (log log n)^2/log^{5/4} n).
Now we consider the pairs (i, j) with d_{ij} ≤ ∆. By Theorem 3.9, computing the product C_{ij} = A_{ij} B_j
for all p ∈ I (at once) takes

    O( ∆ (n/k)^2 log(1/∆)/log(n/k) )
time. Summing over all relevant pairs (i, j) (there are at most k^2 of them), this is O(n^2 (log log n)/log^{5/4} n) by our
choice of ∆.
The extension to the word RAM model follows by a modification identical to that in Theorem 5.2.
7 Conclusion
We have shown how regularity concepts can be applied to yield faster combinatorial algorithms for
fundamental graph problems. These results hint at an alternative line of research on Boolean matrix
multiplication that has been unexplored. It is likely that the connections are deeper than we know; let us
give a few reasons why we believe this.
First, we applied generic tools that are probably stronger than necessary, so it should be profitable to
search for regularity concepts that are designed with matrix multiplication in mind (see the last paragraph
for more details). Secondly, Trevisan [58] has promoted the question of whether or not the Triangle
Removal Lemma requires the full Regularity Lemma. Our work gives a rather new motivation for this
question, and opens up the possibility that BMM may be related to other combinatorial problems as well.
As mentioned earlier, Jacob Fox [26] has recently proved a sharper Triangle Removal Lemma, but it
remains to be seen if his argument can be applied to solve BMM faster.
Another interesting direction is to explore different algebraic structures. There may be similar
algorithms for matrix products over finite fields or the (min, +)-semiring. These algorithms would pre-
sumably apply different removal lemmas, but such lemmas are known to exist. For instance, Shapira [54]
(and independently, Král, Serra, and Vena [39]) recently proved the following, generalizing a result of
Green [33]. Let Ax = b be a set of linear equations over a finite field F, with n variables and m equations.
If S ⊆ F has the property that there are only o(|F|^{n−m}) solutions in S^n to Ax = b, then o(|F|) elements
can be removed from S so that the resulting S^n has no solutions to Ax = b. In light of our work, results
such as this are possible tools for finite field linear algebra with combinatorial algorithms.
Finally, Vassilevska Williams and Williams [61] have recently shown how to derandomize the
algorithms of this paper. By exploiting the structure of BMM itself, one can show that any polynomial
time algorithm for computing a Weak Regularity partition can be applied to solve BMM faster; hence
the deterministic algorithm of Alon and Naor [6] suffices. In fact, the paper [61] proves that an n^{3−ε}
time algorithm for BMM follows from an n^{3−3ε} time algorithm for triangle detection in n-node graphs,
and stronger relations hold for n^3/poly(log n) runtimes. It is likely that faster triangle detection does not
require a partitioning as “strong” as Weak Regularity. We should not have to preprocess a graph so that
all independent set queries are fast; we only have to preprocess so that a given collection of n independent
set queries are fast. That is, triangle detection only requires that we query the n neighborhoods of the n
vertices in the graph. We believe that further progress on alternative algorithms for BMM can be made by
continuing to study the problem in a graph-theoretic way.
Acknowledgements
We thank Avrim Blum for suggesting the independent set query problem, which led us to this work. We
also thank the anonymous referees and the program committee for helpful comments. This work was
conducted while N.B. worked at IBM Watson, and R.W. worked at the Institute for Advanced Study and
IBM Almaden. R.W. was supported by NSF Grant CCF-0832797 (Expeditions in Computing) at the
Institute for Advanced Study, Princeton, NJ, and the Josef Raviv Memorial Fellowship at IBM.
References
[1] Donald Aingworth, Chandra Chekuri, Piotr Indyk, and Rajeev Motwani: Fast estimation of diameter and shortest paths (without matrix multiplication). SIAM J. Comput., 28(4):1167–1181, 1999. Preliminary version in SODA'96. [doi:10.1137/S0097539796303421]
[2] Martin Albrecht, Gregory Bard, and William Hart: Algorithm 898: Efficient multiplication of dense matrices over GF(2). ACM Trans. Math. Softw., 37(1), 2010. [doi:10.1145/1644001.1644010]
[3] Noga Alon, Richard A. Duke, Hanno Lefmann, Vojtech Rödl, and Raphael Yuster: The algorithmic aspects of the regularity lemma. J. Algorithms, 16(1):80–109, 1994. Preliminary version in FOCS'92. [doi:10.1006/jagm.1994.1005]
[4] Noga Alon, Eldar Fischer, Michael Krivelevich, and Mario Szegedy: Efficient testing of large graphs. Combinatorica, 20(4):451–476, 2000. Preliminary version in FOCS'99. [doi:10.1007/s004930070001]
[5] Noga Alon, Eldar Fischer, Ilan Newman, and Asaf Shapira: A combinatorial characterization of the testable graph properties: It's all about regularity. SIAM J. Comput., 39(1):143–167, 2009. Preliminary version in STOC'06. [doi:10.1137/060667177]
[6] Noga Alon and Assaf Naor: Approximating the cut-norm via Grothendieck's inequality. SIAM J. Comput., 35(4):787–803, 2006. Preliminary version in STOC'04. [doi:10.1137/S0097539704441629]
[7] Dana Angluin: The four Russians' algorithm for Boolean matrix multiplication is optimal for its class. SIGACT News, 1:29–33, 1976. [doi:10.1145/1008591.1008593]
[8] V. Z. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzhev: On economical construction of the transitive closure of a directed graph. Soviet Mathematics Doklady, 11(5):1209–1210, 1970.
[9] Michael D. Atkinson and Nicola Santoro: A practical algorithm for Boolean matrix multiplication. Inf. Process. Lett., 29:37–38, 1988. [doi:10.1016/0020-0190(88)90130-5]
[10] Julien Basch, Sanjeev Khanna, and Rajeev Motwani: On diameter verification and Boolean matrix multiplication, 1995. Technical Report No. STAN-CS-95-1544, Department of Computer Science, Stanford University.
[11] F. A. Behrend: On sets of integers which contain no three terms in arithmetic progression. Proc. Nat. Acad. Sci., 32(12):331–332, 1946.
[12] Arnab Bhattacharyya, Victor Chen, Madhu Sudan, and Ning Xie: Testing linear-invariant non-linear properties. Theory of Computing, 7:75–99, 2011. Preliminary version in STACS'06. [doi:10.4086/toc.2011.v007a006]
[13] Guy E. Blelloch, Virginia Vassilevska, and Ryan Williams: A new combinatorial approach for sparse graph problems. In Proc. 35th Internat. Colloq. on Automata, Languages and Programming (ICALP'08), pp. 108–120, 2008. [doi:10.1007/978-3-540-70575-8_10]
[14] Avrim Blum: 2009. Personal communication.
[15] Christian Borgs, Jennifer T. Chayes, László Lovász, Vera T. Sós, Balázs Szegedy, and Katalin Vesztergombi: Graph limits and parameter testing. In Proc. 38th STOC, pp. 261–270. ACM Press, 2006. [doi:10.1145/1132516.1132556]
[16] Timothy M. Chan: More algorithms for all-pairs shortest paths in weighted graphs. In Proc. 39th STOC, pp. 590–598. ACM Press, 2007. [doi:10.1145/1250790.1250877]
[17] Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, and Mithuna Thottethodi: Recursive array layouts and fast matrix multiplication. IEEE Trans. on Parallel and Distributed Systems, 13:1105–1123, 2002. [doi:10.1109/TPDS.2002.1058095]
[18] Henry Cohn, Robert D. Kleinberg, Balázs Szegedy, and Christopher Umans: Group-theoretic algorithms for matrix multiplication. In Proc. 46th FOCS, pp. 379–388. IEEE Comp. Soc. Press, 2005. [doi:10.1109/SFCS.2005.39]
[19] Henry Cohn and Christopher Umans: A group-theoretic approach to fast matrix multiplication. In Proc. 44th FOCS, pp. 438–449. IEEE Comp. Soc. Press, 2003. [doi:10.1109/SFCS.2003.1238217]
[20] Amin Coja-Oghlan, Colin Cooper, and Alan M. Frieze: An efficient sparse regularity concept. SIAM J. Discrete Math., 23(4):2000–2034, 2010. [doi:10.1137/080730160]
[21] Dorit Dor, Shay Halperin, and Uri Zwick: All-pairs almost shortest paths. SIAM J. Comput., 29(5):1740–1759, 2000. Preliminary version in FOCS'96. [doi:10.1137/S0097539797327908]
[22] Richard A. Duke, Hanno Lefmann, and Vojtech Rödl: A fast approximation algorithm for computing the frequencies of subgraphs in a given graph. SIAM J. Comput., 24(3):598–620, 1995. [doi:10.1137/S0097539793247634]
[23] Michael Elkin: An improved construction of progression-free sets. In Proc. 21st Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA'10), pp. 886–905, 2010.
[24] Tomás Feder and Rajeev Motwani: Clique partitions, graph compression and speeding-up algorithms. J. Comput. System Sci., 51(2):261–272, 1995. Preliminary version in STOC'91. [doi:10.1006/jcss.1995.1065]
[25] Michael J. Fischer and Albert R. Meyer: Boolean matrix multiplication and transitive closure. In Proc. 12th FOCS (SWAT'71), pp. 129–131. IEEE Comp. Soc. Press, 1971. [doi:10.1109/SWAT.1971.4]

[26] Jacob Fox: A new proof of the graph removal lemma. Ann. of Math., 174(1):561–579, 2011. [doi:10.4007/annals.2011.174.1.17]

[27] Alan Frieze and Ravi Kannan: A simple algorithm for constructing Szemerédi's regularity partition. Electr. J. Comb., 6, 1999.

[28] Alan M. Frieze and Ravi Kannan: The Regularity Lemma and approximation schemes for dense problems. In Proc. 37th FOCS, pp. 12–20. IEEE Comp. Soc. Press, 1996. [doi:10.1109/SFCS.1996.548459]

[29] Alan M. Frieze and Ravi Kannan: Quick approximation to matrices and applications. Combinatorica, 19(2):175–220, 1999. [doi:10.1007/s004930050052]

[30] Anka Gajentaan and Mark H. Overmars: On a class of O(n^2) problems in computational geometry. Computational Geometry, 5:165–185, 1995. [doi:10.1016/0925-7721(95)00022-2]

[31] Zvi Galil and Oded Margalit: All pairs shortest distances for graphs with small integer length edges. Inform. and Comput., 134:103–139, 1997. [doi:10.1006/inco.1997.2620]

[32] W. T. Gowers: Lower bounds of tower type for Szemerédi's uniformity lemma. Geom. and Funct. Anal., 7:322–337, 1997. [doi:10.1007/PL00001621]

[33] Ben Green: A Szemerédi-type regularity lemma in abelian groups. Geom. and Funct. Anal., 15:340–376, 2005. [doi:10.1007/s00039-005-0509-8]

[34] András Hajnal, Wolfgang Maass, and György Turán: On the communication complexity of graph properties. In Proc. 20th STOC, pp. 186–191. ACM Press, 1988. [doi:10.1145/62212.62228]

[35] Alon Itai and Michael Rodeh: Finding a minimum circuit in a graph. SIAM J. Comput., 7(4):413–423, 1978. [doi:10.1137/0207033]

[36] Y. Kohayakawa: Szemerédi's regularity lemma for sparse graphs. In F. Cucker and M. Shub, editors, Foundations of Computational Mathematics, pp. 216–230. Springer, 1997.

[37] Yoshiharu Kohayakawa, Vojtech Rödl, and Lubos Thoma: An optimal algorithm for checking regularity. SIAM J. Comput., 32(5):1210–1235, 2003. [doi:10.1137/S0097539702408223]

[38] János Komlós and Miklós Simonovits: Szemerédi's Regularity Lemma and its applications in graph theory. In Combinatorics, Paul Erdős is Eighty (D. Miklós et al., eds.), Bolyai Society Mathematical Studies, volume 2, pp. 295–352, 1996.
[39] Daniel Král, Oriol Serra, and Lluís Vena: A removal lemma for systems of linear equations over finite fields. Israel J. Math., 187(1):193–207, 2012. [doi:10.1007/s11856-011-0080-y]

[40] Lillian Lee: Fast context-free grammar parsing requires fast Boolean matrix multiplication. J. ACM, 49:1–15, 2002. [doi:10.1145/505241.505242]

[41] Andrzej Lingas: A geometric approach to Boolean matrix multiplication. In Proc. 13th Int. Symp. on Algorithms and Computation (ISAAC'02), pp. 501–510, 2002. [doi:10.1007/3-540-36136-7_44]

[42] László Lovász and Balázs Szegedy: Szemerédi's theorem for the analyst. Geom. and Funct. Anal., 17(1):252–270, 2007. [doi:10.1007/s00039-007-0599-6]

[43] J. W. Moon and L. Moser: A matrix reduction problem. Mathematics of Computation, 20:328–330, 1966. [doi:10.1090/S0025-5718-66-99935-2]

[44] Patrick E. O'Neil and Elizabeth J. O'Neil: A fast expected time algorithm for Boolean matrix multiplication and transitive closure matrices. Inf. Control, 22:132–138, 1973. [doi:10.1016/S0019-9958(73)90228-3]

[45] Victor Y. Pan: How to Multiply Matrices Faster. Volume 179 of Lecture Notes in Computer Science. Springer, 1984.

[46] Mihai Patrascu: Towards polynomial lower bounds for dynamic problems. In Proc. 42nd STOC, pp. 603–610. ACM Press, 2010. [doi:10.1145/1806689.1806772]

[47] Liam Roditty and Uri Zwick: On dynamic shortest paths problems. Algorithmica, 61(2):389–401, 2010. Preliminary version in 12th Europ. Symp. Algor. (ESA'04). [doi:10.1007/s00453-010-9401-5]

[48] Vojtěch Rödl and Mathias Schacht: Property testing in hypergraphs and the removal lemma. In Proc. 39th STOC, pp. 488–495. ACM Press, 2007. [doi:10.1145/1250790.1250862]

[49] I. Z. Ruzsa and E. Szemerédi: Triple systems with no six points carrying three triangles. Colloquia Mathematica Societatis János Bolyai, 18:939–945, 1978.

[50] Wojciech Rytter: Fast recognition of pushdown automaton and context-free languages. Inf. Control, 67:12–22, 1985. Preliminary version in MFCS'84. [doi:10.1016/S0019-9958(85)80024-3]

[51] John E. Savage: An algorithm for the computation of linear forms. SIAM J. Comput., 3:150–158, 1974. [doi:10.1137/0203011]

[52] C.-P. Schnorr and C. R. Subramanian: Almost optimal (on the average) combinatorial algorithms for Boolean matrix product witnesses, computing the diameter. In Proc. 2nd Intern. Workshop Random. and Approx. Tech. Comput. Sci. (RANDOM'98), pp. 218–231, 1998. [doi:10.1007/3-540-49543-6_18]
[53] Raimund Seidel: On the all-pairs-shortest-path problem in unweighted undirected graphs. J. Comput. System Sci., 51(3):400–403, 1995. [doi:10.1006/jcss.1995.1078]

[54] Asaf Shapira: Green's conjecture and testing linear-invariant properties. In Proc. 41st STOC, pp. 159–166. ACM Press, 2009. [doi:10.1145/1536414.1536438]

[55] Avi Shoshan and Uri Zwick: All pairs shortest paths in undirected graphs with integer weights. In Proc. 40th FOCS, pp. 605–614. IEEE Comp. Soc. Press, 1999. [doi:10.1109/SFFCS.1999.814635]

[56] Volker Strassen: Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969. [doi:10.1007/BF02165411]

[57] Endre Szemerédi: Regular partitions of graphs. In Proc. Colloque Inter. CNRS (J. C. Bermond, J. C. Fournier, M. Las Vergnas, and D. Sotteau, eds.), pp. 399–401, 1978.

[58] Luca Trevisan: Additive combinatorics and theoretical computer science. In SIGACT News Complexity Column 63, 2009.

[59] Leslie G. Valiant: General context-free recognition in less than cubic time. J. Comput. System Sci., 10(2):308–315, 1975. [doi:10.1016/S0022-0000(75)80046-8]

[60] Virginia Vassilevska Williams: Multiplying matrices faster than Coppersmith-Winograd. In Proc. 44th STOC, 2012. To appear.

[61] Virginia Vassilevska Williams and Ryan Williams: Subcubic equivalences between path, matrix, and triangle problems. In Proc. 51st FOCS, pp. 645–654. IEEE Comp. Soc. Press, 2010. [doi:10.1109/FOCS.2010.67]

[62] Ryan Williams: Matrix-vector multiplication in sub-quadratic time (some preprocessing required). In Proc. 18th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA'07), pp. 995–1001. ACM Press, 2007.
AUTHORS
Nikhil Bansal
Eindhoven University of Technology
Eindhoven, Netherlands
n.bansal tue nl
Ryan Williams
Stanford University
Stanford, CA USA
rrwilliams gmail com
ABOUT THE AUTHORS

Nikhil Bansal graduated from CMU in 2003. His advisor was Avrim Blum. His research has focused on approximation and online algorithms for scheduling and other optimization problems.

Ryan Williams graduated from CMU in 2007. His advisor was Manuel Blum, who tried for years to get Ryan to help him formulate a nontrivial theory of machine consciousness. Failing miserably, Ryan turned to easier problems. Ryan is from rural Alabama, where squirrel meat is a delicacy and farming is not an exotic profession.