
Lecture Notes in Computer Science 3624
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
New York University, NY, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany

Chandra Chekuri Klaus Jansen
José D.P. Rolim Luca Trevisan (Eds.)
Approximation,
Randomization
and Combinatorial
Optimization
Algorithms and Techniques
8th International Workshop on Approximation Algorithms
for Combinatorial Optimization Problems, APPROX 2005
and 9th International Workshop on Randomization
and Computation, RANDOM 2005
Berkeley, CA, USA, August 22-24, 2005
Proceedings

Volume Editors
Chandra Chekuri
Lucent Bell Labs
600 Mountain Avenue, Murray Hill, NJ 07974, USA
E-mail: chekuri@research.bell-labs.com
Klaus Jansen
University of Kiel, Institute for Computer Science
Olshausenstr. 40, 24098 Kiel, Germany
E-mail: kj@informatik.uni-kiel.de
José D.P. Rolim
Université de Genève, Centre Universitaire d’Informatique
24, Rue Général Dufour, 1211 Genève 4, Suisse
E-mail: jose.rolim@cui.unige.ch
Luca Trevisan
University of California, Computer Science Department
679 Soda Hall, Berkeley, CA 94720-1776, USA
E-mail: luca@cs.berkeley.edu
Library of Congress Control Number: 2005930720
CR Subject Classification (1998): F.2, G.2, G.1
ISSN 0302-9743
ISBN-10 3-540-28239-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-28239-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik
Printed on acid-free paper SPIN: 11538462 06/3142 5 4 3 2 1 0

Preface
This volume contains the papers presented at the 8th International Workshop
on Approximation Algorithms for Combinatorial Optimization Problems
(APPROX 2005) and the 9th International Workshop on Randomization and
Computation (RANDOM 2005), which took place concurrently at the University
of California in Berkeley, on August 22–24, 2005. APPROX focuses on algorithmic
and complexity issues surrounding the development of efficient approximate
solutions to computationally hard problems, and APPROX 2005 was the eighth
in the series after Aalborg (1998), Berkeley (1999), Saarbrücken (2000), Berke-
ley (2001), Rome (2002), Princeton (2003), and Cambridge (2004). RANDOM is
concerned with applications of randomness to computational and combinatorial
problems, and RANDOM 2005 was the ninth workshop in the series follow-
ing Bologna (1997), Barcelona (1998), Berkeley (1999), Geneva (2000), Berkeley
(2001), Harvard (2002), Princeton (2003), and Cambridge (2004).
Topics of interest for APPROX and RANDOM are: design and analysis of
approximation algorithms, hardness of approximation, small space and data
streaming algorithms, sub-linear time algorithms, embeddings and metric space
methods, mathematical programming methods, coloring and partitioning, cuts
and connectivity, geometric problems, game theory and applications, network
design and routing, packing and covering, scheduling, design and analysis of ran-
domized algorithms, randomized complexity theory, pseudorandomness and de-
randomization, random combinatorial structures, random walks/Markov chains,
expander graphs and randomness extractors, probabilistic proof systems, ran-
dom projections and embeddings, error-correcting codes, average-case analysis,
property testing, computational learning theory, and other applications of ap-
proximation and randomness.
The volume contains 20 contributed papers selected by the APPROX Pro-
gram Committee out of 50 submissions, and 21 contributed papers selected by
the RANDOM Program Committee out of 51 submissions.
We would like to thank all of the authors who submitted papers, the members
of the program committees

APPROX 2005
Matthew Andrews, Lucent Bell Labs
Avrim Blum, CMU
Moses Charikar, Princeton University
Chandra Chekuri, Lucent Bell Labs (Chair)
Julia Chuzhoy, MIT
Uriel Feige, Microsoft Research and Weizmann Institute
Naveen Garg, IIT Delhi
Howard Karloff, AT&T Labs – Research
Stavros Kolliopoulos, University of Athens
Adam Meyerson, UCLA
Seffi Naor, Technion
Santosh Vempala, MIT
RANDOM 2005
Dorit Aharonov, Hebrew University
Boaz Barak, IAS and Princeton University
Funda Ergun, Simon Fraser University
Johan Håstad, KTH Stockholm
Chi-Jen Lu, Academia Sinica
Milena Mihail, Georgia Institute of Technology
Robert Krauthgamer, IBM Almaden
Dana Randall, Georgia Institute of Technology
Amin Shokrollahi, EPF Lausanne
Angelika Steger, ETH Zurich
Luca Trevisan, UC Berkeley (Chair)
and the external subreferees Scott Aaronson, Dimitris Achlioptas, Mansoor
Alicherry, Andris Ambainis, Aaron Archer, Nikhil Bansal, Tugkan Batu, Gerard
Ben Arous, Michael Ben-Or, Eli Ben-Sasson, Petra Berenbrink, Randeep Bhatia,
Nayantara Bhatnagar, Niv Buchbinder, Shuchi Chawla, Joseph Cheriyan, Roee
Engelberg, Lance Fortnow, Tom Friedetzky, Mikael Goldmann, Daniel
Gottesman, Sam Greenberg, Anupam Gupta, Venkat Guruswami, Tom Hayes,
Monika Henzinger, Danny Hermelin, Nicole Immorlica, Piotr Indyk, Adam Kalai,
Julia Kempe, Claire Kenyon, Jordan Kerenidis, Sanjeev Khanna, Amit
Kumar, Ravi Kumar, Nissan Lev-Tov, Liane Lewin, Laszlo Lovasz, Elitza
Maneva, Michael Mitzenmacher, Cris Moore, Michele Mosca, Kamesh Munagala,
Noam Nisan, Ryan O’Donnell, Martin Pal, Vinayaka Pandit, David Peleg, Yuval
Rabani, Dror Rawitz, Danny Raz, Adi Rosen, Ronitt Rubinfeld, Cenk Sahinalp,
Alex Samorodnitsky, Gabi Scalosub, Leonard Schulman, Roy Schwartz, Pranab
Sen, Mehrdad Shahshahani, Amir Shpilka, Anastasios Sidiropoulos, Greg Sorkin,
Adam Smith, Ronen Shaltiel, Maxim Sviridenko, Amnon-Ta-Shma, Emre
Telatar, Alex Vardy, Eric Vigoda, Da-Wei Wang, Ronald de Wolf, David
Woodruff, Hsin-Lung Wu, and Lisa Zhang.

We gratefully acknowledge the support from the Lucent Bell Labs of Murray
Hill, the Computer Science Division of the University of California at Berkeley,
the Institute of Computer Science of the Christian-Albrechts-Universität zu Kiel
and the Department of Computer Science of the University of Geneva. We also
thank Ute Iaquinto and Parvaneh Karimi Massouleh for their help.
August 2005 Chandra Chekuri and Luca Trevisan, Program Chairs
Klaus Jansen and José D.P. Rolim, Workshop Chairs

Table of Contents
Contributed Talks of APPROX
The Network as a Storage Device: Dynamic Routing with Bounded Buffers
Stanislav Angelov, Sanjeev Khanna, and Keshav Kunal
Rounding Two and Three Dimensional Solutions of the SDP Relaxation of MAX CUT
Adi Avidor and Uri Zwick
What Would Edmonds Do? Augmenting Paths and Witnesses for Degree-Bounded MSTs
Kamalika Chaudhuri, Satish Rao, Samantha Riesenfeld, and Kunal Talwar
A Rounding Algorithm for Approximating Minimum Manhattan Networks
Victor Chepoi, Karim Nouioua, and Yann Vaxès
Packing Element-Disjoint Steiner Trees
Joseph Cheriyan and Mohammad R. Salavatipour
Approximating the Bandwidth of Caterpillars
Uriel Feige and Kunal Talwar
Where’s the Winner? Max-Finding and Sorting with Metric Costs
Anupam Gupta and Amit Kumar
What About Wednesday? Approximation Algorithms for Multistage Stochastic Optimization
Anupam Gupta, Martin Pál, Ramamoorthi Ravi, and Amitabh Sinha
The Complexity of Making Unique Choices: Approximating 1-in-k SAT
Venkatesan Guruswami and Luca Trevisan
Approximating the Distortion
Alexander Hall and Christos Papadimitriou
Approximating the Best-Fit Tree Under Lp Norms
Boulos Harb, Sampath Kannan, and Andrew McGregor
Beating a Random Assignment
Gustav Hast

Scheduling on Unrelated Machines Under Tree-Like Precedence Constraints
V.S. Anil Kumar, Madhav V. Marathe, Srinivasan Parthasarathy, and Aravind Srinivasan
Approximation Algorithms for Network Design and Facility Location with Service Capacities
Jens Maßberg and Jens Vygen
Finding Graph Matchings in Data Streams
Andrew McGregor
A Primal-Dual Approximation Algorithm for Partial Vertex Cover: Making Educated Guesses
Julián Mestre
Efficient Approximation of Convex Recolorings
Shlomo Moran and Sagi Snir
Approximation Algorithms for Requirement Cut on Graphs
Viswanath Nagarajan and Ramamoorthi Ravi
Approximation Schemes for Node-Weighted Geometric Steiner Tree Problems
Jan Remy and Angelika Steger
Towards Optimal Integrality Gaps for Hypergraph Vertex Cover in the Lovász-Schrijver Hierarchy
Iannis Tourlakis
Contributed Talks of RANDOM
Bounds for Error Reduction with Few Quantum Queries
Sourav Chakraborty, Jaikumar Radhakrishnan, and Nandakumar Raghunathan
Sampling Bounds for Stochastic Optimization
Moses Charikar, Chandra Chekuri, and Martin Pál
An Improved Analysis of Mergers
Zeev Dvir and Amir Shpilka
Finding a Maximum Independent Set in a Sparse Random Graph
Uriel Feige and Eran Ofek
On the Error Parameter of Dispersers
Ronen Gradwohl, Guy Kindler, Omer Reingold, and Amnon Ta-Shma
Tolerant Locally Testable Codes
Venkatesan Guruswami and Atri Rudra

A Lower Bound on List Size for List Decoding
Venkatesan Guruswami and Salil Vadhan
A Lower Bound for Distribution-Free Monotonicity Testing
Shirley Halevy and Eyal Kushilevitz
On Learning Random DNF Formulas Under the Uniform Distribution
Jeffrey C. Jackson and Rocco A. Servedio
Derandomized Constructions of k-Wise (Almost) Independent Permutations
Eyal Kaplan, Moni Naor, and Omer Reingold
Testing Periodicity
Oded Lachish and Ilan Newman
The Parity Problem in the Presence of Noise, Decoding Random Linear Codes, and the Subset Sum Problem
Vadim Lyubashevsky
The Online Clique Avoidance Game on Random Graphs
Martin Marciniszyn, Reto Spöhel, and Angelika Steger
A Generating Function Method for the Average-Case Analysis of DPLL
Rémi Monasson
A Continuous-Discontinuous Second-Order Transition in the Satisfiability of Random Horn-SAT Formulas
Cristopher Moore, Gabriel Istrate, Demetrios Demopoulos, and Moshe Y. Vardi
Mixing Points on a Circle
Dana Randall and Peter Winkler
Derandomized Squaring of Graphs
Eyal Rozenman and Salil Vadhan
Tight Bounds for String Reconstruction Using Substring Queries
Dekel Tsur
Reconstructive Dispersers and Hitting Set Generators
Christopher Umans
The Tensor Product of Two Codes Is Not Necessarily Robustly Testable
Paul Valiant
Fractional Decompositions of Dense Hypergraphs
Raphael Yuster
Author Index

The Network as a Storage Device:
Dynamic Routing with Bounded Buffers
Stanislav Angelov, Sanjeev Khanna, and Keshav Kunal
University of Pennsylvania, Philadelphia, PA 19104, USA
{angelov,sanjeev,kkunal}@cis.upenn.edu
Abstract. We study dynamic routing in store-and-forward packet net-
works where each network link has bounded buffer capacity for receiving
incoming packets and is capable of transmitting a fixed number of pack-
ets per unit of time. At any moment in time, packets are injected at
various network nodes with each packet specifying its destination node.
The goal is to maximize the throughput, defined as the number of packets
delivered to their destinations.
In this paper, we make some progress in understanding what is achievable
on various network topologies. For line networks, Nearest-to-Go (NTG),
a natural greedy algorithm, was shown to be $O(n^{2/3})$-competitive by
Aiello et al [1]. We show that NTG is $\tilde{O}(\sqrt{n})$-competitive, essentially
matching an identical lower bound known on the performance of any
greedy algorithm shown in [1]. We show that if we allow the online routing
algorithm to make centralized decisions, there is indeed a randomized
polylog(n)-competitive algorithm for line networks as well as rooted tree
networks, where each packet is destined for the root of the tree. For grid
graphs, we show that NTG has a performance ratio of $\tilde{\Theta}(n^{2/3})$ while no
greedy algorithm can achieve a ratio better than $\Omega(\sqrt{n})$. Finally, for an
arbitrary network with m edges, we show that NTG is Θ̃(m)-competitive,
improving upon an earlier bound of O(mn) [1].
1 Introduction
The problem of dynamically routing packets is central to traffic management
on large scale data networks. Packet data networks typically employ store-and-
forward routing where network links (or routers) store incoming packets and
schedule them to be forwarded in a suitable manner. Internet is a well-known
example of such a network. In this paper, we consider routing algorithms for
store-and-forward networks in a dynamic setting where packets continuously ar-
rive in an arbitrary manner over a period of time and each packet specifies its

Supported in part by NSF Career Award CCR-0093117, NSF Award ITR 0205456
and NIGMS Award 1-P20-GM-6912-1.

Supported in part by an NSF Career Award CCR-0093117, NSF Award CCF-
0429836, and a US-Israel Binational Science Foundation Grant.

Supported in part by an NSF Career Award CCR-0093117 and NSF Award CCF-
0429836.
C. Chekuri et al. (Eds.): APPROX and RANDOM 2005, LNCS 3624, pp. 1–13, 2005.
© Springer-Verlag Berlin Heidelberg 2005

destination. The goal is to maximize the throughput, defined as the number of
packets that are delivered to their destinations. This model, known as the Com-
petitive Network Throughput (CNT) model, was introduced by Aiello et al [1].
The model allows for both adaptive and non-adaptive routing of packets, where
in the latter case each packet also specifies a path to its destination node. Note
that there is no distinction between the two settings on line and tree networks.
Aiello et al analyzed greedy algorithms, which are characterized as ones that
accept and forward packets whenever possible. For the non-adaptive setting,
they showed that Nearest-to-Go (NTG), a natural greedy algorithm that favors
packets whose remaining distance is the shortest, is $O(mn)$-competitive¹ on any
network with n nodes and m edges. On line networks, NTG was shown to be
$O(n^{2/3})$-competitive provided each link has buffer capacity at least 2.
Our Results. Designing routing algorithms in the CNT model is challenging
since buffer space is limited at any individual node and a good algorithm should
use all nodes to buffer packets and not just nodes where packets are injected.
It is like viewing the entire network as a storage device where new information
is injected at locations chosen by the adversary. Regions in the network with
high injection rate need to continuously move packets to intermediate network
regions with low activity. However, if many nodes simultaneously send packets to
the same region, the resulting congestion would lead to buffer overflows. Worse
still, the adversary may suddenly raise the packet injection rate in the region of
low activity. In general, an adversary can employ different attack strategies in
different parts of the network. How should a routing algorithm coordinate the
movement of packets in presence of buffer constraints? This turns out to be a
difficult question even on simple networks such as lines and trees.
In this paper, we continue the thread of research started in [1]. Throughout
the remainder of this paper, we will assume that all network links have a uniform
buffer size B and bandwidth 1. We obtain the following results:
Line Networks: For line networks with B > 1, we show that NTG is $\tilde{O}(\sqrt{n})$-competitive,
essentially matching an $\Omega(\sqrt{n})$ lower bound of [1]. A natural question
is whether non-greedy algorithms can perform better. For B = 1, we show
that any deterministic algorithm is $\Omega(n)$-competitive, strengthening a result
of [1], and give a randomized $\tilde{O}(\sqrt{n})$-competitive algorithm. For B > 1, however,
no super-constant lower bounds are known, leaving a large gap between
known lower bounds and what is achievable by greedy algorithms. We show
that centralized decisions can improve performance exponentially by designing a
randomized $O(\log^3 n)$-competitive algorithm, referred to as Merge and Deliver.
Tree Networks: For tree networks of height h, it was shown in [1] that any
greedy algorithm has a competitive ratio of $\Omega(n/h)$. Building on the ideas used
in the line algorithm, we give an $O(\log^2 n)$-competitive algorithm when all packets
are destined for the root. Such tree networks are motivated by the work on
buffer overflows of merging streams of packets by Kesselman et al [2] and the
work on information gathering by Kothapalli and Scheideler [3] and Azar and
Zachut [4]. The result extends to a randomized $O(h \log^2 n)$-competitive algorithm
when packet destinations are arbitrary.

¹ The exact bound shown in [1] is O(mD), where D is the maximum length of any
given path in the network. We state here the worst-case bound with D = Ω(n).

Table 1. Summary of results; the bounds obtained in this paper are in boldface.

Algorithm                Line                      Tree               Grid                        General
Greedy                   $\Omega(\sqrt{n})$ [1]    $\Omega(n)$ [1]    $\Omega(\sqrt{n})$          $\Omega(m)$
NTG                      $\tilde{O}(\sqrt{n})$     $\tilde{O}(n)$     $\tilde{\Theta}(n^{2/3})$   $\tilde{O}(m)$
Previous bounds          $O(n^{2/3})$ [1]          $O(nh)$ [1]        -                           $O(mn)$ [1]
Merge and Deliver (MD)   $O(\log^3 n)$             $O(h \log^2 n)$    -                           -
Grid Networks: For the adaptive setting in grid networks, we show that NTG with
one-bend routing is $\tilde{\Theta}(n^{2/3})$-competitive. We establish a lower bound of $\Omega(\sqrt{n})$
on the competitive ratio for greedy algorithms with shortest path routing.
General Networks: Finally, for arbitrary network topologies, we show that any
greedy algorithm is Ω(m)-competitive and prove that NTG is, in fact, Θ̃(m)-
competitive. These results hold for both adaptive and non-adaptive settings,
where NTG routes on a shortest path in the adaptive case.
Related Work. Dynamic store-and-forward routing networks have been stud-
ied extensively. An excellent survey of packet drop policies in communication
networks can be found in [5]. Much of the earlier work has focused on the is-
sue of stability with packets being injected in either probabilistic or adversarial
manner. In stability analysis the goal is to understand how the buffer size at
each link needs to grow as a function of the packet injection rate so that packets
are never dropped. A stable protocol is one where the maximum buffer size does
not grow with time. For work in the probabilistic setting, see [6–10]. Adversarial
queuing theory introduced by Borodin et al [11] has also been used to study
the stability of protocols and it has been shown in [11, 12] that certain greedy
algorithms are unstable, that is, require unbounded buffer sizes. In particular,
NTG is unstable.
The idea of using the entire network to store packets effectively has been
used in [13] but in their model, packets can not be buffered while in transit and
the performance measure is the time required to deliver all packets. Moreover,
packets never get dropped because there is no limit on the number of packets
which can be stored at their respective source nodes.
Throughput competitiveness was highlighted as a network performance measure
by Awerbuch et al in [14]. They used adversarial traffic to analyze store-and-forward
routing algorithms in [15], but they compared their throughput to
an adversary restricted to a certain class of strategies and with smaller buffers.
Kesselman et al [2] study the throughput competitiveness of work-conserving
algorithms on line and tree networks when the adversary is a work-conserving
algorithm too. Work-conserving algorithms always forward a packet if possible
but unlike greedy, they need not accept packets when there is space in the
buffer. However, they do not make assumptions about uniform link bandwidths
or buffer sizes as in [1] and our model. They also consider the case when packets
have different weights. Non-preemptive policies for packets with different values
but with unit sized buffers have been analyzed in [16, 17].
In a recent parallel work, Azar and Zachut [4] also obtain centralized algorithms
with polylog(n) competitive ratios on lines. They first obtain an $O(\log n)$-competitive
deterministic algorithm for the special case when all packets have
the same destination (which is termed information gathering) and then show that
it can be extended to a randomized algorithm with an $O(\log^2 n)$ competitive ratio
for the general case. Their result can be extended to get an $O(h \log n)$ competitive
ratio on trees. The information gathering problem is similar to our notion of balanced
instances (see Section 2.3), though their techniques are very different from ours:
they construct an online reduction from the fractional buffers packet routing
with bounded delay problem to fractional information gathering, and the former
is solved by an extension of the work of Awerbuch et al [14] for the discrete
version. The fractional algorithm for information gathering is then transformed
to a discrete one.
2 Preliminaries
2.1 Model and Problem Statement
We model the network as a directed graph on n nodes and m links. A node has
at most I traffic ports where new packets can be injected, at most one at each
port. Each link has an output port at its tail with a capacity B > 0 buffer, and
an input port at its head that can store 1 packet. We assume uniform buffer size
B at each link and bandwidth of each link to be 1.
Time is synchronous and each time step consists of forwarding and switching
sub-steps. During the forwarding phase, each link selects at most one packet
from its output buffer according to an output scheduling policy and forwards it
to its input buffer. During the switching phase, a node clears all packets from
its traffic ports and input ports at its incoming link(s). It delivers packets if the
node is their destination or assigns them to the output port of the outgoing link
on their respective paths. When more than B packets are assigned to a link’s
output buffer, packets are discarded based on a contention-resolution policy.
We consider preemptive contention-resolution policies that can replace packets
already stored at the buffer with new packets. A routing protocol specifies the
output scheduling and contention-resolution policy. We are interested in online
policies which make decisions with no knowledge of future packet arrivals.
Each injected packet comes with a specified destination. We assume that
the destination is different from the source, otherwise the packet is routed optimally
by any algorithm and does not interfere with other packets. The goal of
the routing algorithm is to maximize throughput, that is, the total number of
packets delivered by it. We distinguish between two types of algorithms, namely
centralized and distributed. A centralized algorithm makes coordinated decisions
at each node taking into account the state of the entire network while a dis-
tributed algorithm requires that each node make its decisions based on local
information only. Distributed algorithms are of great practical interest for large
networks. Centralized algorithms, on the other hand, give us insight into the
inherent complexity of the problem due to the online nature of the input.
2.2 Useful Background Results and Definitions
The following lemma which gives an upper bound on the packets that can be
absorbed over a time interval, will be a recurrent idea while analyzing algorithms.
Lemma 1. [1] In a network with m links, the number of packets that can be
delivered and buffered in a time interval of length T units by any algorithm is
O(mT/d + mB), where d > 0 is a lower bound on the number of links in the
shortest path to the destination for each injected packet.
The proof bounds the available bandwidth on all links and compares it
against the minimum bandwidth required to deliver a packet injected during the
time interval. The O(mB) term accounts for packets buffered at the beginning
and the end of the interval, which can be arbitrarily close to their destination.
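The counting argument behind Lemma 1 can be written out as the sketch below. The symbols $D_{\mathrm{new}}$ (packets injected and delivered inside the interval), $S_0$ and $S_T$ (packets held in buffers at the start and at the end of the interval) are our own notation rather than that of [1], and no attempt is made to optimize constants.

```latex
% m links of bandwidth 1 provide at most mT link traversals in T time units,
% and each packet injected and delivered inside the interval uses at least d of them.
\[
  d \cdot D_{\mathrm{new}} \le mT
  \quad\Longrightarrow\quad
  D_{\mathrm{new}} \le \frac{mT}{d}.
\]
% At any moment the m buffers of capacity B hold at most mB packets in total.
\[
  S_0 \le mB, \qquad S_T \le mB.
\]
% Every packet delivered or buffered in the interval is counted by one of the three terms.
\[
  \#\{\text{delivered or buffered}\}
  \le D_{\mathrm{new}} + S_0 + S_T
  \le \frac{mT}{d} + 2mB
  = O\!\left(\frac{mT}{d} + mB\right).
\]
```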
An important class of distributed algorithms is greedy algorithms where each
link always accepts an incoming packet if there is available buffer space and
always forwards a packet if its buffer is non-empty. Based on how contention
is resolved when receiving/forwarding packets, we obtain different algorithms.
Nearest-To-Go (NTG) is a natural greedy algorithm which always selects a short-
est path to route a packet and prefers a packet that has shortest remaining
distance to travel, in both choosing packets to accept or forward.
Line and tree networks are of special interest to us, where there is a unique
path between every source and destination pair. A line network on n nodes is a
directed path with nodes labeled 1, 2, . . . , n and a link from node i to i + 1, for
i ∈ [1, n). A tree network is a rooted tree with links directed towards the root.
Note that there is a one-to-one correspondence between links and nodes (for all
but one node) on lines and trees. For simplicity, whenever we refer to a node’s
buffer, we mean the output buffer of its unique outgoing link.
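As an illustration, the following is a minimal Python sketch of Nearest-To-Go on a line network under the model of Section 2.1. It is not code from [1] or from this paper. The function name `ntg_line`, the input format for injections, and the choice to represent each packet only by its destination are assumptions made for the example.

```python
from collections import defaultdict


def ntg_line(n, B, injections, steps):
    """Simulate Nearest-To-Go on a line with nodes 1..n and links i -> i+1.

    injections maps a time step to a list of (source, destination) pairs
    with 1 <= source < destination <= n (hypothetical input format).
    Returns the number of packets delivered within `steps` time steps.
    """
    # buffers[v] holds the destinations of the packets waiting at node v,
    # i.e. in the output buffer of the link v -> v+1.
    buffers = {v: [] for v in range(1, n)}
    delivered = 0
    for t in range(steps):
        arrivals = defaultdict(list)
        # Forwarding sub-step: every non-empty buffer sends one packet,
        # the one with the nearest destination (shortest remaining distance).
        for v in range(1, n):
            if buffers[v]:
                nearest = min(buffers[v])
                buffers[v].remove(nearest)
                arrivals[v + 1].append(nearest)
        # Packets injected at time t join the arrivals at their source node.
        for src, dst in injections.get(t, []):
            arrivals[src].append(dst)
        # Switching sub-step: deliver, or store and resolve contention.
        for v, packets in arrivals.items():
            for dst in packets:
                if dst == v:
                    delivered += 1
                else:
                    buffers[v].append(dst)
            if v < n:
                buffers[v].sort()
                del buffers[v][B:]  # preemptively keep the B nearest-to-go
    return delivered
```

For example, with n = 3 and B = 1, injecting packets (1, 3) and (1, 2) at time 0 makes node 1 keep the packet destined for node 2 and drop the other, so only one packet is delivered.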
A simple useful property of any greedy algorithm is as follows:
Lemma 2. [1] If at some time t, a greedy algorithm on a line network has
buffered k ≤ nB packets, then it delivers at least k packets by time t + (n − 1)B.
At times it will be useful to geometrically group packets into classes in the
following manner. A packet belongs to class j if the length of the path on which
it is routed by the optimal algorithm is in the range $[2^j, 2^{j+1})$. In the case when
paths are unique or specified the class of a packet can be determined exactly.
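On a line network, where paths are unique, the class of a packet depends only on its path length. A small sketch of the grouping follows; the helper names `packet_class` and `group_by_class` are hypothetical and chosen for the example.

```python
def packet_class(path_length):
    """Return j such that 2**j <= path_length < 2**(j + 1)."""
    if path_length < 1:
        raise ValueError("path length must be at least 1")
    # For a positive integer k, k.bit_length() equals floor(log2(k)) + 1.
    return path_length.bit_length() - 1


def group_by_class(packets):
    """Group (source, destination) pairs on a line network by class."""
    classes = {}
    for src, dst in packets:
        classes.setdefault(packet_class(dst - src), []).append((src, dst))
    return classes
```

Using integer bit lengths rather than floating-point logarithms keeps the class boundaries exact at powers of two.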

2.3 Balanced Instances for Line Networks
At each time step, a node makes two decisions – which packets to accept (and
which to drop if buffer overflows) and which packets to forward. When all packets
have the same destination, the choice essentially comes down to whether a node
should forward a packet or not. Motivated by this, we define the notion of a
balanced instance, so that an algorithm can focus only on deciding when to
forward packets. In a balanced instance, each released packet travels a distance
of Θ(n). We establish a theorem that shows that with a logarithmic loss in
competitive ratio, any algorithm on balanced instances can be converted to
one on arbitrary instances. Azar and Zachut [4] prove a similar theorem that it
suffices to do so for information gathering.
Theorem 1. An α-competitive algorithm for balanced instances gives an
O(α log n)-competitive randomized algorithm for any instance on a line network.
Proof. It suffices to randomly partition the line into disjoint networks of length
$\ell$ each s.t. (i) the online algorithm only accepts packets whose origin and destination
are contained inside one of the smaller networks, (ii) each accepted packet travels
a distance of $\Omega(\ell)$, and (iii) an $\Omega(1/\log n)$-fraction of the packets routed by the
optimal algorithm is expected to satisfy conditions (i) and (ii).
We first pick a random integer i ∈ [0, log n). We next look at three distinct
ways to partition the line network into intervals of length 3d each, starting from
node with index 1, d + 1, or 2d + 1, where $d = 2^i$. The online algorithm chooses
one of the three partitionings at random and only routes class i packets whose
end-points lie completely inside an interval. It is not hard to see that the process
decomposes the problem into line routing instances satisfying (iii).
The above partitioning can be done by releasing O(1) packets initially as a
part of the network setup process. It can be used by distributed algorithms too.
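The random partitioning used in the proof can be sketched as follows. This is an illustration under stated assumptions, not the authors' construction in full. An interval is taken to be a block of 3d consecutive nodes. Nodes that precede the chosen starting node form a shorter leading block. The function names are hypothetical.

```python
import random


def choose_partition(n, rng):
    """Pick a class i in [0, log n) and one of the three partitionings."""
    num_classes = max(1, (n - 1).bit_length())  # ceil(log2 n) for n >= 2
    i = rng.randrange(num_classes)
    d = 2 ** i
    start = rng.choice([1, d + 1, 2 * d + 1])
    return i, d, start


def interval_index(node, d, start):
    """Index of the block of 3d consecutive nodes that contains `node`.

    Blocks begin at start, start + 3d, ...; nodes before `start` fall
    into a shorter leading block with index -1 (an assumption of this sketch).
    """
    return (node - start) // (3 * d)


def accepts(src, dst, d, start):
    """Accept only class-i packets whose end-points share one interval."""
    in_class = d <= dst - src < 2 * d
    same_interval = interval_index(src, d, start) == interval_index(dst, d, start)
    return in_class and same_interval
```

In this sketch the three partitionings place their boundaries d apart, so a packet of length less than 2d is cut by at most two of them. Once i is fixed, every class i packet is therefore accepted with probability at least 1/3.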
3 Nearest-to-Go Protocol
In this section we prove asymptotically tight bounds (up to logarithmic factors)
on the performance of Nearest-To-Go (NTG) on lines, show their extension to
grids and then prove bounds on general graphs.
For a sequence σ of packets, let $P_j$ denote packets of class j. Also, let OPT
be the set of packets delivered by an optimal algorithm. Let class i be the
class from which optimal algorithm delivers most packets, that is, $|P_i \cap OPT| =
\Omega(|OPT| / \log n)$. We will compare the performance of NTG to an optimal algorithm
on $P_i$, which we denote by $OPT_i$. It is easy to see that if NTG is
α-competitive against $OPT_i$ then it is $O(\alpha \log n)$-competitive against OPT.
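The two steps of this reduction can be spelled out as below. This is a sketch for a line network with n nodes. The symbol $L$ for the number of classes and $|NTG|$ for the number of packets delivered by NTG are our own shorthand.

```latex
% Path lengths lie in [1, n-1], so there are at most L = \lceil \log_2 n \rceil classes,
% and the classes partition the packets delivered by OPT.
\[
  |OPT| = \sum_{j=0}^{L-1} |P_j \cap OPT|
  \quad\Longrightarrow\quad
  |P_i \cap OPT| = \max_{j} |P_j \cap OPT| \ge \frac{|OPT|}{L}.
\]
% OPT_i is optimal on P_i, so it delivers at least as many class-i packets as OPT does.
\[
  |OPT_i| \ge |P_i \cap OPT|.
\]
% If NTG is \alpha-competitive against OPT_i, the two bounds combine to
\[
  |NTG| \ge \frac{|OPT_i|}{\alpha}
  \ge \frac{|OPT|}{\alpha L}
  = \Omega\!\left(\frac{|OPT|}{\alpha \log n}\right).
\]
```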
3.1 Line Networks
We show that NTG is $O(\sqrt{n} \log n)$-competitive on line networks with B > 1.
This supplements the lower bound of $\Omega(\sqrt{n})$ on the competitive ratio of any greedy
algorithm and improves the $O(n^{2/3})$ competitive ratio of NTG obtained in [1].

We prove that NTG maintains competitiveness w.r.t. OPTi on a suitable
collection of instances, referred to as virtual instances.
Virtual Instances. Let class i be defined as above. Consider the three ways of
partitioning defined in Section 2.3 with i as the chosen class. Instead of randomly
choosing a partitioning, we pick one containing at least |Pi ∩ OPT|/3 packets.
For each interval in this optimal partitioning, we create a new virtual instance
consisting of packets (of all classes) injected within the interval as well as any
additional packets that travel some portion of the interval during NTG execution
on the original sequence σ. Let us fix any such instance. W.l.o.g. we relabel the
nodes by integers 1 through 3d. We also add two special nodes of index 0 (to
model packets incoming from earlier instances) and 3d + 1 (to model packets
destined beyond the current instance). We divide the packets traveling through
the instance as good and bad where a packet is good iff its destination is not
the node labeled 3d + 1. Due to the NTG property, bad packets never delay
good packets or cause them to be dropped. Hence while analyzing our virtual
instance, we will only consider good packets. We also let n = 3d + 2.
Analysis. For each virtual instance, we compare the number of good packets
delivered by NTG to the number of class i packets contained within the interval
that OPTi delivered. The analysis (and not the algorithm) is in rounds where
each round has 2 phases. Phase 1 terminates when OPTi absorbs nB packets
for the first time, and Phase 2 has a fixed duration of (n − 1)B time units.
Packets are given a unique index and credits to account for dropped packets.
We swap packets (meaning swap their index and credits) when either of the two
events occurs: (i) a new packet causes another to be dropped, or (ii) a moving
packet is delayed by another packet. The rules respectively ensure that a packet
is (virtually) never dropped after being accepted and is (virtually) never delayed
at a node once it starts moving. Recall that NTG drops packets at a node only
when the node has B packets. For every node, mark the first time it overflows
and give a charge of 2 to each of its packets. We do not charge that node for
the next B time steps. Note that charge of 2 is enough since OPTi could have
moved away at most B packets from that node during that period and have B
packets in its buffers. Repeat the process till the end of Phase 1.
Lemma 3. NTG absorbs Ω(√n · B) packets by the end of Phase 1.
Proof. Let 2c be the maximum charge accumulated by a packet in Phase 1. NTG
absorbed ≥ nB/(2c) packets since OPTi absorbed at least nB packets. Now consider
the rightmost packet with 2c charge. Split its charge into two parts, 2s – charge
accumulated while static at its source node and 2m – charge accumulated while
moving. Since the packet was charged 2s, it stayed idle for at least B(s − 1)
time units and hence at least that many packets crossed over it. It is never in
the same buffer with these packets again. On the other hand, while the packet
was moving, it left behind m(B − 1) packets which never catch up with it again.
Therefore, NTG absorbs at least max{nB/(2c), (c − 1)(B − 1)} = Ω(√n · B) packets.
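The final bound comes from balancing the two terms of the maximum. A short worked version is given below; the case split on c and the use of B − 1 ≥ B/2 (valid for B ≥ 2) are our own filling-in of the step and are not spelled out in the source.

```latex
\[
\max\Bigl\{\tfrac{nB}{2c},\,(c-1)(B-1)\Bigr\} \;\ge\;
\begin{cases}
\dfrac{nB}{2c} \;\ge\; \dfrac{\sqrt{n}\,B}{2}, & \text{if } c \le \sqrt{n},\\[2ex]
(c-1)(B-1) \;\ge\; \dfrac{(\sqrt{n}-1)\,B}{2}, & \text{if } c > \sqrt{n},
\end{cases}
\]
% In both cases the bound is \Omega(\sqrt{n}\,B).
```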

By the end of the round, OPTi buffered and delivered O(nB) class i packets
since it absorbed at most 2nB packets in Phase 1 and by Lemma 1, OPTi
absorbed O(nB) packets during Phase 2. By Lemmas 2 and 3, NTG delivers
Ω(√n · B) packets by the end of Phase 2. Thus we obtain an O(√n) competitive
ratio. Note we can clear the charge of packets in NTG buffers and consider OPTi
buffers to be empty at the end of the round.
Theorem 2. For B > 1, NTG is an O(√n log n)-competitive algorithm on line
networks.
Remark 1. We note here that one can show that allowing greedy a buffer space
that is α times larger cannot improve the competitive ratio to o(√n/α).
3.2 Grid Networks
We consider 2-dimensional regular grids on n nodes, that is, with √n rows and
columns, where nodes are connected by directed links to the neighbors in the
same row and column. Since it is no longer true that there is a unique shortest
path between a source-destination pair on a grid, we restrict NTG to one-bend
routing – every packet first moves along the row it is injected to its destination’s
column and then moves along that column. It is not hard to see that, since the
distance between any two nodes is O(√n), any greedy algorithm that routes
packets on shortest paths is Ω(√n)-competitive. An adversary releases O(nB)
packets at all links, all with the same destination. By time O(√n · B), the greedy
algorithm will deliver O(√n · B) packets and drop the remaining ones, while an optimal
algorithm can deliver all packets. The theorem below gives tight bounds on the
performance of NTG with one-bend routing on grid networks.
Theorem 3. For B > 1, NTG with one-bend routing is a Θ̃(n^{2/3})-competitive
algorithm on grid networks.
3.3 General Networks
The following theorem improves the O(mn) upper bound on the performance of
NTG on general graphs obtained in [1].
Theorem 4. NTG is Θ̃(m)-competitive on any network for both adaptive and
non-adaptive settings. Moreover, every greedy algorithm is Ω(m) competitive.
4 The Merge and Deliver Algorithm
In this section, we present a randomized, centralized polylog(n)-competitive al-
gorithm, referred to as Merge and Deliver (MD), for line and tree networks.

4.1 Line Networks
By Theorem 1, it suffices to obtain a polylog(n)-competitive algorithm on bal-
anced instances. A balanced instance allows us to treat all packets as equal and
focus on the aspect of deciding if some buffered packet should be forwarded or
not, which is where we feel the real complexity of the problem lies. Centralized
algorithms make it possible to take a global view of the network and spread
packets all around to overcome some of the challenges discussed in Section 1.
The MD algorithm never drops accepted packets and forwards them in
such a manner that it creates a small number of contiguous sequences of nodes
with full buffers, which we call segments. A key observation is that once the
adversary has filled buffers in a segment, it can absorb only one extra packet
per time unit over the entire segment! This follows from the fact that there is a
unit bandwidth out of the segment. The algorithm tries to decrease the number
of such segments by merging segments of similar length. After a few rounds of
merges, MD accumulates enough packets without dropping too many. It then
delivers them greedily knowing (from Lemma 1) that the adversary cannot get
much leverage during that time.
We divide the execution of Merge and Deliver into rounds where each round
has 2 phases and starts with empty buffers. Assume w.l.o.g. that OPT buffers
are also empty, otherwise we view the stored packets as delivered.
1. Phase 1 is the receiving phase. The phase ends when the number of packets
absorbed by MD exceeds nB/p(n) for a suitable p(n) = polylog(n). We
show that OPT absorbs O(nB) packets in the phase.
2. Phase 2 is the delivery phase and consists of (n − 1)B time steps. In the
phase, MD greedily forwards packets without accepting any new packets.
By Proposition 2 it delivers all packets that are buffered by the end of Phase
1. We show that OPT absorbs O(nB) packets in this phase.
For clarity of exposition, we first describe and analyze the algorithm for
buffers of size 2. We then briefly sketch the extension to any B > 2.
Algorithm for Receiving Phase. At a given time step, a node can be empty,
single, or double, where the terms denote the number of packets in its buffer. We
define a segment to be a sequence (not necessarily maximal) of double nodes.
Segments are assigned to classes, and a new double node (implying it received at
least one new packet) is assigned to class 0. We ensure that a segment progresses
to a higher class after some duration of time which depends on its current class.
It is instructive to think of the packets as forming two layers on the nodes.
We call the bottom layer the transport layer. Now, consider a pair of segments
s.t. there are only single nodes in between. It is possible to merge them, that is
make the top layer of both segments adjacent to each other, using the transport
layer, in a time proportional to the length of the left segment. Since segments
might be separated by a large distance, it is not possible to physically move all
those packets. Instead, we are only interested that the profile of packets changes

as if the segments had actually become adjacent. MD accomplishes this by
forwarding at each time step a packet from the right end of the left segment and
all the intermediate single nodes. We call this process teleporting and a teleport
from node u to v is essentially forwarding a packet at each node in [u, v − 1].
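The effect of a teleport on the buffer profile can be illustrated with a small sketch. It assumes a 0-indexed list of buffer occupancies and a hypothetical function name; it models only the profile change of one time step, not the full MD algorithm.

```python
def teleport(buffers, u, v):
    """Forward one packet at every node in [u, v - 1] during a single time step.

    `buffers[k]` is the number of packets stored at node k (0-indexed here).
    Every forwarding node must hold at least one packet. The net effect is
    that node u loses a packet, node v gains one, and the nodes strictly in
    between keep their occupancy. Returns a new list.
    """
    if not 0 <= u <= v < len(buffers):
        raise ValueError("need 0 <= u <= v < len(buffers)")
    if any(buffers[k] == 0 for k in range(u, v)):
        raise ValueError("every node in [u, v - 1] needs a packet to forward")
    profile = list(buffers)
    for k in range(u, v):
        profile[k] -= 1
        profile[k + 1] += 1
    return profile
```

For example, with double nodes at positions 0-1 and 5-6 and single nodes in between, `teleport([2, 2, 1, 1, 1, 2, 2], 1, 4)` yields `[2, 1, 1, 1, 2, 2, 2]`: the left segment shrinks by one node and the right segment grows by one node towards it.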
We will only merge segments of the same class. The merged segment is then
assigned to the next higher class. We ensure that segments of the same class have
similar lengths or at least have as many credits as other segments of the same class.
Thus we will be able to account for dropped packets in each class separately and
will show that the class of a segment is bounded during the phase.
Note, that there could be merging segments of other classes in between two
segments of a given class. To overcome this, we divide time into blocks of log n
size. At the ith time unit of each block, only class i segments are active. Also,
each class i starts its own clock at the beginning of Phase 1, with period 2^i.
This clock ticks only during the time units of class i. Fix a class i, and consider
its period and time steps, and the class i segments. At the beginning of the
period, segments are numbered from left to right starting from 1. MD merges
each odd-number segment with the segment that follows it. The (at most one)
unpaired segment is merged into the transport layer. At the end of the period, a
merged pair becomes one segment of class i+1 and waits idle until the next class
i + 1 period begins. We now formally define the segment merging operation.
Segment Merging. Consider the pair of segments such that the first segment ends
at node e and the second segment starts at s + 1 at the time step. We teleport
a packet from e to s. We distinguish the following special cases:
(i) If e = s or one of the segments has length 0, the merge is completed.
(ii) If s > e is already a double node we teleport the packet to s′, where s′ ≥ e
is the minimum index s.t. nodes [s′, s] are double. The profile changes as if
we moved all segments in [s′, s], which are of different class and therefore
inactive, backward by one index and teleported the packet to s.
(iii) If the two segments are not linked by a transport layer, that is, there is
an empty node to the right of the odd segment or the odd segment is the
last segment of class i, we teleport the packet to the first empty node to
its right.
Analysis. We first note that online can accept one packet (and drop at most
one packet) at a single node. Hence the number of packets which online receives
plus the number of packets dropped at double nodes is an upper bound on the
number of packets dropped by online. We need to count only the packets dropped
at double nodes to compare the packets in optimum to that in online. The
movement strategy in Phase 1 ensures that there is never a net outflow of packets
from the transport layer to the upper layer. Therefore we can treat packets in
transport layer as good as delivered and devise a charging scheme which only
draws credits from packets in top layer to account for dropped packets.

Since packets could join the transport layer during merging, we cannot ensure
that segment lengths grow geometrically, but we show an upper bound on their
lengths in this lemma which follows easily from induction.
Lemma 4. At the start of class i period, the length of a class i segment is ≤ 2^i.
We note that each merge operation decreases the length of the left segment
of a merging segment pair by 1. The next lemma follows.
Lemma 5. The merge of two class i segments needs at most 2^i ticks and can
be accomplished in a class i period.
Corollary 1. If a class i segment is present at the beginning of class i period,
by the end of the period, its packets either merge with another class i segment to
form a class i + 1 segment, or join the transport layer.
We assign O(log^2 n) credits to each packet which are absorbed as follows:
– log^2 n + 3 log n units of merging credits, which are distributed equally over
all classes, that is a credit of log n + 3 in each class.
– log^2 n + 2 log n units of idling credits, which are distributed equally over all
classes, that is a credit of log n + 2 in each class.
When a new segment of class 0 is created it gets the credit of the new packet
injected in the top layer. When two segments of class i are merged the unused
credits of the segments are transferred to the new segment of class i + 1. If
a packet gets delivered or joins the transport layer, it gives its credit to the
segment it was trying to merge into. The following lemma establishes that a
class i segment has enough credit irrespective of its length.
Lemma 6. A class i segment has 2^i (log n + 3) merging credits and 2^i (log n + 2)
idling credits. The class of a segment is bounded by log n.
Proof. The first part follows by induction. If a class i segment has 2^i (log n + 3)
merging credits, there must have been 2^i class 0 segments to begin with. Since
we stop the phase right after n/p(n) packets are absorbed, i ≤ log n.
We now analyze packets dropped by segments in each class. Consider a segment
of length l ≤ 2^i in class i. It could have been promoted to class i in the
middle of a class i period but since the period spans 2^i blocks of time, it stays
idle for ≤ 2^i log n time units. The number of packets dropped is bounded by B = 2
times the segment length plus the total bandwidth out of the segment, which
is ≤ 2 · 2^i + 2^i log n. This can be paid by the idling credit of the segment. At
the beginning of a new class i period, it starts merging with another segment
of length, say l′ ≤ 2^i, to its right. The total number of packets dropped in both
segments is bounded by 2(l + l + l′) + 2 · 2^i log n ≤ 6 · 2^i + 2 · 2^i log n, using a
similar argument. This can be paid by the merging credit of the segments.
We let p(n) = 2 log^2 n + 5 log n and note that the number of packets dropped
by online during Phase 1 can be bounded by the credit assigned to all absorbed
packets. Using Lemma 1 we show that OPT can only deliver and buffer O(n)
packets during Phase 2. Since MD delivers at least n/p(n) packets in the round,
a competitive ratio of O(p(n)) = O(log^2 n) is established.
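The choice of p(n) is simply the total credit carried by one packet. As a worked summary (the per-round packet counts for OPT and MD are shorthand of ours for the quantities named in the paragraph above):

```latex
\[
p(n) \;=\; \underbrace{(\log^2 n + 3\log n)}_{\text{merging credits}}
      \;+\; \underbrace{(\log^2 n + 2\log n)}_{\text{idling credits}}
      \;=\; 2\log^2 n + 5\log n,
\]
\[
\frac{\text{packets of OPT per round}}{\text{packets of MD per round}}
\;\le\; \frac{O(n)}{n/p(n)} \;=\; O(p(n)) \;=\; O(\log^2 n).
\]
```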

Generalization to B > 2. We can generalize MD to B > 2 by redefining the
concept of a double node and segment. Also, we increase the credits assigned to
a packet by a factor of B/(B − 1) ≤ 2, scaling p(n) accordingly.
Theorem 5. For B > 1, Merge and Deliver is a randomized O(log^3 n)-competitive
algorithm on a line network.
4.2 Tree Networks
The Merge and Deliver algorithm extends naturally to tree networks. We con-
sider rooted tree networks where each edge is directed towards the root node.
It is known that on trees with n nodes and height h, a greedy algorithm is
Ω(n/h)-competitive [1]. We show the following:
Theorem 6. For B > 1, there is an O(log^2 n)-competitive centralized deterministic
algorithm for trees when packets have the same destination. When destinations
are arbitrary, there is a randomized O(h log^2 n)-competitive algorithm.
5 Line Networks with Unit Buffers
For unit buffer sizes, Aiello et al. [1] give a lower bound of Ω(n) on the competitive
ratio of any greedy algorithm. We show that the lower bound holds for any
deterministic algorithm. However, with randomization, we can beat this bound.
We show that the following simple strategy is Õ(√n)-competitive. The online
algorithm chooses uniformly at random one of the two strategies, NTG and
modified NTG, call it NTG’, that forwards packets at every other time step.
The intuition is that if in NTG there is a packet that collides with c other
packets then Ω(c) packets are absorbed by NTG’.
Theorem 7. For B = 1 any deterministic online algorithm is Ω(n)-competitive
on line networks. There is a randomized Õ(√n)-competitive algorithm for the
problem.
6 Conclusion
Our results establish a strong separation between the performance of natural
greedy algorithms and centralized routing algorithms even on the simple line
topology. An interesting open question is to prove or disprove the existence of
distributed polylog(n)-competitive algorithms for line networks.
References
1. Aiello, W., Ostrovsky, R., Kushilevitz, E., Rosén, A.: Dynamic routing on networks
with fixed-size buffers. In: Proceedings of the 14th SODA. (2003) 771–780

2. Kesselman, A., Mansour, Y., Lotker, Z., Patt-Shamir, B.: Buffer overflows of merg-
ing streams. In: Proceedings of the 15th SPAA. (2003) 244–245
3. Kothapalli, K., Scheideler, C.: Information gathering in adversarial systems: lines
and cycles. In: SPAA ’03: Proceedings of the fifteenth annual ACM symposium on
Parallel algorithms and architectures. (2003) 333–342
4. Azar, Y., Zachut, R.: Packet routing and information gathering in lines, rings and
trees. In: Proceedings of 15th ESA. (2005)
5. Labrador, M., Banerjee, S.: Packet dropping policies for ATM and IP networks.
IEEE Communications Surveys 2 (1999)
6. Broder, A.Z., Frieze, A.M., Upfal, E.: A general approach to dynamic packet rout-
ing with bounded buffers. In: Proceedings of the 37th FOCS. (1996) 390
7. Broder, A., Upfal, E.: Dynamic deflection routing on arrays. In: Proceedings of the
28th annual ACM Symposium on Theory of Computing. (1996) 348–355
8. Mihail, M.: Conductance and convergence of markov chains–a combinatorial treat-
ment of expanders. In: Proceedings of the 30th FOCS. (1989) 526–531
9. Mitzenmacher, M.: Bounds on the greedy routing algorithm for array networks.
Journal of Computer System Sciences 53 (1996) 317–327
10. Stamoulis, G., Tsitsiklis, J.: The efficiency of greedy routing in hypercubes and
butterflies. IEEE Transactions on Communications 42 (1994) 3051–208
11. Borodin, A., Kleinberg, J., Raghavan, P., Sudan, M., Williamson, D.P.: Adversarial
queueing theory. In: Proceedings of the 28th STOC. (1996) 376–385
12. Andrews, M., Mansour, Y., Fernández, A., Kleinberg, J., Leighton, T., Liu, Z.:
Universal stability results for greedy contention-resolution protocols. In: Proceedings
of the 37th FOCS. (1996) 380–389
13. Busch, C., Magdon-Ismail, M., Mavronicolas, M., Spirakis, P.G.: Direct routing:
Algorithms and complexity. In: ESA. (2004) 134–145
14. Awerbuch, B., Azar, Y., Plotkin, S.: Throughput competitive on-line routing. In:
Proceedings of 34th FOCS. (1993) 32–40
15. Scheideler, C., Vöcking, B.: Universal continuous routing strategies. In: Proceed-
ings of the 8th SPAA. (1996) 142–151
16. Kesselman, A., Mansour, Y.: Harmonic buffer management policy for shared mem-
ory switches. Theor. Comput. Sci. 324 (2004) 161–182
17. Lotker, Z., Patt-Shamir, B.: Nearly optimal FIFO buffer management for DiffServ.
In: Proceedings of the 21st PODC. (2002) 134–143

Rounding Two and Three Dimensional Solutions
of the SDP Relaxation of MAX CUT
Adi Avidor
and Uri Zwick
School of Computer Science
Tel-Aviv University, Tel-Aviv 69978, Israel
{adi,zwick}@tau.ac.il
Abstract. Goemans and Williamson obtained an approximation algorithm for
the MAX CUT problem with a performance ratio of αGW ≈ 0.87856. Their
algorithm starts by solving a standard SDP relaxation of MAX CUT and then
rounds the optimal solution obtained using a random hyperplane. In some cases,
the optimal solution of the SDP relaxation happens to lie in a low dimensional
space. Can an improved performance ratio be obtained for such instances? We
show that the answer is yes in dimensions two and three and conjecture that
this is also the case in any higher fixed dimension. In two dimensions an optimal
32/(25+5√5)-approximation algorithm was already obtained by Goemans. (Note
that 32/(25+5√5) ≈ 0.884458.) We obtain an alternative derivation of this result
using Gegenbauer polynomials. Our main result is an improved rounding procedure
for SDP solutions that lie in R^3 with a performance ratio of about 0.8818.
The rounding procedure uses an interesting yin-yang coloring of the three
dimensional sphere. The improved performance ratio obtained resolves, in the
negative, an open problem posed by Feige and Schechtman [STOC’01]. They asked
whether there are MAX CUT instances with integrality ratios arbitrarily close to
αGW ≈ 0.87856 that have optimal embedding, i.e., optimal solutions of their
SDP relaxations, that lie in R^3.
1 Introduction
An instance of the MAX CUT problem is an undirected graph G = (V, E) with
non-negative weights wij on the edges (i, j) ∈ E. The goal is to find a subset of vertices
S ⊆ V that maximizes the weight of the edges of the cut (S, S̄). MAX CUT is one of
Karp’s original NP-complete problems, and has been extensively studied during the last
few decades. Goemans and Williamson [GW95] used semidefinite programming and
the random hyperplane rounding technique to obtain an approximation algorithm for the
MAX CUT problem with a performance guarantee of αGW ≈ 0.87856. Håstad [Hås01]
showed that MAX CUT does not have a (16/17 + ε)-approximation algorithm, for any
ε > 0, unless P = NP. Recently, Khot et al. [KKMO04] showed that a certain
plausible conjecture implies that MAX CUT does not have an (αGW + ε)-approximation,
for any ε > 0.
The algorithm of Goemans and Williamson starts by solving a standard SDP
relaxation of the MAX CUT instance. The value of the optimal solution obtained is an
upper bound on the size of the maximum cut. The ratio between these two quantities is
called the integrality ratio of the instance. The infimum of the integrality ratios of all
the instances is the integrality ratio of the relaxation. The study of the integrality ratio
of the standard SDP relaxation was started by Delorme and Poljak [DP93a, DP93b].
The worst instance they found was the 5-cycle C5 which has an integrality ratio of
32/(25+5√5) ≈ 0.884458. The optimal solution of the SDP relaxation of C5, given in
Figure 1, happens to lie in R^2. Feige and Schechtman [FS01] showed that the
integrality ratio of the standard SDP relaxation of MAX CUT is exactly αGW. More
specifically, they showed that for every ε > 0 there exists a graph with integrality ratio of
at most αGW + ε. The optimal solution of the SDP relaxation of this graph lies in
R^Θ(√1/ε log 1/ε). An interesting open question, raised by Feige and Schechtman, was
whether it is possible to find instances of MAX CUT with integrality ratio arbitrarily
close to αGW ≈ 0.87856 that have optimal embedding, i.e., optimal solutions of their
SDP relaxations, that lie in R^3. We supply a negative answer to this question.
Fig. 1. An optimal solution of the SDP relaxation of C5. Graph nodes lie on the unit circle
This research was supported by the ISRAEL SCIENCE FOUNDATION (grant no. 246/01).
Improved approximation algorithms for instances of MAX CUT that do not have
large cuts were obtained by Zwick [Zwi99], Feige and Langberg [FL01] and Charikar
and Wirth [CW04]. These improvements are obtained by modifying the random hyper-
plane rounding procedure of Goemans and Williamson [GW95]. Zwick [Zwi99] uses
outward rotations. Feige and Langberg [FL01] introduced the RPR^2 rounding
technique. Other rounding procedures, developed for use in MAX 2-SAT and MAX
DI-CUT algorithms, were suggested by Feige and Goemans [FG95], Matuura and
Matsui [MM01a, MM01b] and Lewin et al. [LLZ02].
In this paper we study the approximation of MAX CUT instances that have a
low dimensional optimal solution of their SDP relaxation. For instances with a
solution that lies in the plane, Goemans (unpublished) achieved a performance ratio of
32/(25+5√5) = 4/(5 sin^2(2π/5)) ≈ 0.884458. We describe an alternative way of getting such a
performance ratio using a random procedure that uses Gegenbauer polynomials. This
result is optimal in the sense that it matches the integrality ratio obtained by Delorme
and Poljak [DP93a, DP93b].
Unfortunately, the use of the Gegenbauer polynomials does not yield improved
approximation algorithms in dimensions higher than 2. To obtain an improved ratio in
dimension three we use an interesting yin-yang like coloring of the unit sphere in R^3.
The question whether a similar coloring can be used in four or more dimensions
remains an interesting open problem.
The rest of this extended abstract is organized as follows. In the next section we
describe our notations and review the approximation algorithm of Goemans and
Williamson. In Section 3 we introduce the Gegenbauer polynomials rounding
technique. In Section 4 we give our Gegenbauer polynomials based algorithm for the plane
and compare it to the algorithm of Goemans. In Section 5 we introduce the 2-coloring
rounding. Finally, in Section 6 we present our 2-coloring based algorithm for the three
dimensional case.
2 Preliminaries
Let G = (V, E) be an undirected graph with non-negative weights attached to its edges.
We let wij be the weight of an edge (i, j) ∈ E. We may assume, w.l.o.g., that E =
V × V . (If an edge is missing, we can add it and give it a weight of 0.) We also assume
V = {1, . . ., n}. We denote by S^{d−1} the unit sphere in R^d. We define:
\[ \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{otherwise} \end{cases} \]
The Goemans and Williamson algorithm embeds the vertices of the input graph on
the unit sphere S^{n−1} by solving a semidefinite program, and then rounds the vectors to
integral values using random hyperplane rounding.
MAX CUT Semidefinite Program: MAX CUT may be formulated as the following
integer quadratic program:
\[ \max \sum_{i,j} w_{ij}\,\frac{1 - x_i \cdot x_j}{2} \quad \text{s.t. } x_i \in \{-1, 1\} \]
Here, each vertex i is associated with a variable xi ∈ {−1, 1}, where −1 and 1 may
be viewed as the two sides of the cut. For each edge (i, j) ∈ E the expression (1 − xi xj)/2
is 0 if the two endpoints of the edge are on the same side of the cut and 1 if the two
endpoints are on different sides.
The latter program may be relaxed by replacing each variable xi with a unit vector
vi in R^n. The product between the two variables xi and xj is replaced by an inner
product between the vectors vi and vj. The following semidefinite programming relaxation
of MAX CUT is obtained:
\[ \max \sum_{i,j} w_{ij}\,\frac{1 - v_i \cdot v_j}{2} \quad \text{s.t. } v_i \in S^{n-1} \]
This semidefinite program can be solved in polynomial time up to any specified
precision. The value of the program is an upper bound on the value of the maximal cut
in the graph. Intuitively, the semidefinite programming embeds the input graph G on
the unit sphere S^{n−1}, such that adjacent vertices in G are far apart in S^{n−1}.
Random Hyperplane Rounding: The random hyperplane rounding procedure chooses
a uniformly distributed unit vector r on the sphere S^{n−1}. Then, xi is set to sgn(vi · r).
We will use the following simple Lemma, taken from [GW95]:
Lemma 1. If u, v ∈ S^{n−1}, then Pr_{r∈S^{n−1}}[sgn(u · r) ≠ sgn(v · r)] = arccos(u · v)/π.
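Lemma 1 can be checked empirically with a small Monte Carlo experiment. The sketch below uses only the standard library; the helper names are ours, and sampling a uniform unit vector by normalizing a Gaussian vector is a standard technique that the text does not prescribe.

```python
import math
import random


def sgn(x):
    """Sign function as defined in the Preliminaries: +1 for x >= 0, else -1."""
    return 1 if x >= 0 else -1


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_unit_vector(dim, rng):
    """Uniform point on the unit sphere, obtained by normalizing a Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(g, g))
        if norm > 0.0:
            return [x / norm for x in g]


def separation_frequency(u, v, trials, rng):
    """Fraction of random hyperplanes (normal r) that put u and v on different sides."""
    hits = 0
    for _ in range(trials):
        r = random_unit_vector(len(u), rng)
        if sgn(dot(u, r)) != sgn(dot(v, r)):
            hits += 1
    return hits / trials
```

For two unit vectors at angle 2 radians, the observed frequency should be close to 2/π ≈ 0.637, up to sampling noise.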

3 Gegenbauer Polynomials Rounding
For any m ≥ 2, and k ≥ 0, the Gegenbauer (or ultraspherical) polynomials G_k^(m)(x)
are defined by the following recurrence relations:
\[ G_0^{(m)}(x) = 1, \qquad G_1^{(m)}(x) = x, \]
\[ G_k^{(m)}(x) = \frac{1}{k+m-3}\Bigl[(2k+m-4)\,x\,G_{k-1}^{(m)}(x) - (k-1)\,G_{k-2}^{(m)}(x)\Bigr] \quad \text{for } k \ge 2. \]
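The recurrence translates directly into a short evaluation routine. The sketch below (function name ours) evaluates the polynomial at a point and can be used to sanity-check special cases such as the m = 2 family, which satisfies G(cos θ) = cos(kθ).

```python
import math


def gegenbauer(m, k, x):
    """Evaluate the normalized Gegenbauer polynomial of dimension m and degree k at x.

    Uses the recurrence from the text:
        G_0 = 1, G_1 = x,
        G_j = ((2j + m - 4) * x * G_{j-1} - (j - 1) * G_{j-2}) / (j + m - 3), j >= 2.
    """
    if m < 2 or k < 0:
        raise ValueError("need m >= 2 and k >= 0")
    if k == 0:
        return 1.0
    prev, cur = 1.0, float(x)
    for j in range(2, k + 1):
        prev, cur = cur, ((2 * j + m - 4) * x * cur - (j - 1) * prev) / (j + m - 3)
    return cur
```

With this normalization every polynomial takes the value 1 at x = 1, which is what makes convex combinations of them map unit vectors to unit vectors.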
G_k^(2) are Chebyshev polynomials of the first kind, G_k^(3) are Legendre polynomials of
the first kind, and G_k^(4) are Chebyshev polynomials of the second kind. The following
interesting Theorem is due to Schoenberg [Sch42]:
Theorem 1. A polynomial f is a sum of Gegenbauer polynomials G_k^(m), for k ≥ 0,
with non-negative coefficients, i.e., f(x) = Σ_{k≥0} αk G_k^(m)(x), αk ≥ 0, if and only if
for any v1, . . . , vn ∈ S^{m−1} the matrix (f(vi · vj))ij is positive semidefinite.
Note that G_k^(m)(1) = 1 for all m ≥ 2 and k ≥ 0. Let v1, . . . , vn ∈ S^{m−1} and let
f be a convex sum of G_k^(m) with respect to k, i.e., f(x) = Σ_{k≥0} αk G_k^(m)(x), where
αk ≥ 0, Σ_k αk = 1. Then, there exist unit vectors v′1, . . . , v′n ∈ S^{n−1} such that
v′i · v′j = f(vi · vj), for all 1 ≤ i, j ≤ n.
In this setting, we define the Gegenbauer polynomials rounding technique with
parameter f as:
1. Find vectors v′i satisfying v′i · v′j = f(vi · vj), where 1 ≤ i, j ≤ n (This can be
done by solving a semidefinite program)
2. Let r be a vector uniformly distributed on the unit sphere S^{n−1}
3. Set xi = sgn(v′i · r), for 1 ≤ i ≤ n
Note that the vectors v′1, . . . , v′n do not necessarily lie on the unit sphere S^{m−1}.
Therefore, the vector r is chosen from S^{n−1}. Note also that the random hyperplane
rounding of Goemans and Williamson is a special case of the Gegenbauer polynomials
rounding technique as f can be taken to be f = G_1^(m). The next corollary is an
immediate result of Lemma 1:
Corollary 1. If v1, . . . , vn are rounded using Gegenbauer polynomials rounding with
polynomial f, then Pr[xi ≠ xj] = arccos(f(vi · vj))/π, where 1 ≤ i, j ≤ n.
Alon and Naor [AN04] also studied a rounding technique which applies a polyno-
mial on the elements of an inner products matrix. In their technique the polynomial may
be arbitrary. However, the matrix must be of the form (vi · uj)ij, where vi’s and uj’s
are distinct sets of vectors.
4 An Optimal Algorithm in the Plane
In this section we use Gegenbauer polynomials to round solutions of the semidefinite
programming relaxation of MAX CUT that happen to lie in the plane. Note that as the
solution lies in R^2, the suitable polynomials to apply are only convex combinations of
G_k^(2)’s.
Algorithm [MaxCut – Gege]:
1. Solve the MAX CUT semidefinite program
2. Run random hyperplane rounding with probability β and Gegenbauer polynomials
rounding with the polynomial G_4^(2)(x) = 8x^4 − 8x^2 + 1 with probability 1 − β
We choose β = (4/5)((π/5) cot(2π/5) + 1) ≈ 0.963322. The next Theorem together with
the fact that the value of the MAX CUT semidefinite program is an upper bound on
the value of the optimal cut, imply that our algorithm is a 32/(25+5√5)-approximation
algorithm.
Theorem 2. If the solution of the program lies in the plane, then the expected cut size
produced by the algorithm is at least a fraction 32/(25+5√5) of the value of the MAX CUT
semidefinite program.
As the expected size of the cut produced by the algorithm equals Σ_{i,j} wij Pr[xi ≠ xj],
we may derive Theorem 2 from the next lemma:
Lemma 2. Pr[xi ≠ xj] ≥ (32/(25+5√5)) · (1 − vi · vj)/2, for all 1 ≤ i, j ≤ n.
Proof. Let θij = arccos(vi · vj). By using Lemma 1, and Corollary 1 we get that
\[ \Pr[x_i \ne x_j] = \beta\,\frac{\theta_{ij}}{\pi} + (1-\beta)\,\frac{\arccos\bigl(G_4^{(2)}(v_i \cdot v_j)\bigr)}{\pi}. \]
As G_k^(2)(x) are Chebyshev polynomials of the first kind, G_k^(2)(cos θ) = cos(kθ). Therefore,
\[ \Pr[x_i \ne x_j] = \beta\,\frac{\theta_{ij}}{\pi} + (1-\beta)\,\frac{\arccos(\cos(4\theta_{ij}))}{\pi}. \]
By using Lemma 3 in Appendix A, the latter value is greater than or equal to
(32/(25+5√5)) · (1 − cos(θij))/2, and equality holds if and only if θij = 4π/5 or θij = 0.
The last argument in the proof of Lemma 2 shows that the worst instance in our
framework is when θij = 4π/5 for all (i, j) ∈ E. The 5-cycle C5, with an optimal solution
vk = (cos(4πk/5), sin(4πk/5)), for 1 ≤ k ≤ 5, is such an instance. (See Figure 1.)
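The numbers for C5 are easy to reproduce. The sketch below (names ours, unit edge weights and 0-indexed vertices assumed) evaluates the SDP objective at the planar embedding given above and finds the maximum cut by brute force. It does not solve the SDP; it relies on the statement in the text that this embedding is optimal.

```python
import math
from itertools import product


def c5_instance():
    """Edges of the 5-cycle and the planar embedding v_k = (cos(4*pi*k/5), sin(4*pi*k/5))."""
    n = 5
    edges = [(k, (k + 1) % n) for k in range(n)]
    vectors = [(math.cos(4 * math.pi * k / 5), math.sin(4 * math.pi * k / 5))
               for k in range(n)]
    return edges, vectors


def embedding_value(edges, vectors):
    """Sum over edges of (1 - v_i . v_j) / 2, each edge counted once with weight 1."""
    total = 0.0
    for i, j in edges:
        inner = vectors[i][0] * vectors[j][0] + vectors[i][1] * vectors[j][1]
        total += (1 - inner) / 2
    return total


def max_cut(n, edges):
    """Brute-force maximum cut over all sign assignments x in {-1, 1}^n."""
    best = 0
    for signs in product((-1, 1), repeat=n):
        best = max(best, sum((1 - signs[i] * signs[j]) // 2 for i, j in edges))
    return best
```

The maximum cut of C5 has 4 edges, the embedding has value (25+5√5)/8 ≈ 4.5225, and their quotient is 32/(25+5√5) ≈ 0.884458.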
In our framework, Gegenbauer polynomials rounding performs better than the
Goemans-Williamson algorithm whenever the Gegenbauer polynomials transformation
increases the expected angle between adjacent vertices. Denote the ‘worst’ angle of the
Goemans-Williamson algorithm by θ0 ≈ 2.331122. In order to achieve an improved
approximation, the Gegenbauer polynomial transformation should increase the angle
between adjacent vertices with inner product cos θ0. It can be shown that for any
dimension m ≥ 3, and any k ≥ 2, G_k^(m)(cos θ0) > cos θ0. Hence, in higher dimensions
other techniques are needed in order to attain an improved approximation algorithm.

Fig. 2. WINDMILL4 2-coloring
5 Rounding Using 2-Coloring of the Sphere
Definition 1. A 2-coloring $C$ of the unit sphere $S^{m-1}$ is a function
$C : S^{m-1} \to \{-1, +1\}$.
Let $v_1, \ldots, v_n \in S^{m-1}$ and let $C$ be a 2-coloring of the unit sphere $S^{m-1}$. In the
sphere 2-coloring rounding method with respect to a 2-coloring $C$, the 2-colored unit
sphere is randomly rotated and then each variable $x_i$ is set to $C(v_i)$. We give below a
more formal definition of the 2-coloring rounding with parameter $C$. In the definition, it
is more convenient to rotate $v_1, \ldots, v_n$ than to rotate the 2-colored sphere.
1. Choose a uniformly distributed rotation (orthonormal) matrix R
2. Set xi = C(Rvi), for 1 ≤ i ≤ n
The matrix $R$ in the first step may be chosen using the following procedure, which
employs the Gram-Schmidt process:
1. Choose independently $m$ vectors $r_1, \ldots, r_m \in \mathbb{R}^m$ according to the
$m$-dimensional standard normal distribution
2. Iteratively, for $i = 1, \ldots, m$, set $r_i \leftarrow r_i - \sum_{j=1}^{i-1} (r_i \cdot r_j) r_j$, and then $r_i \leftarrow r_i / \|r_i\|$
3. Let $R$ be the matrix whose columns are the vectors $r_i$
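The two procedures above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the coloring is passed in as an arbitrary callable on points of the sphere, and the projections are subtracted one at a time (the "modified" Gram-Schmidt order, which gives the same vectors in exact arithmetic).

```python
import math
import random

def random_rotation(m, rng=random):
    """Return orthonormal vectors r_1..r_m, the columns of R."""
    cols = []
    for _ in range(m):
        # Step 1: a standard normal vector in R^m.
        r = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Step 2: remove the components along the earlier r_j, then normalize.
        for q in cols:
            dot = sum(a * b for a, b in zip(r, q))
            r = [a - dot * b for a, b in zip(r, q)]
        norm = math.sqrt(sum(a * a for a in r))
        cols.append([a / norm for a in r])
    return cols

def apply_rotation(cols, v):
    # R v = sum_i v[i] * r_i, because the r_i are the columns of R.
    m = len(v)
    return [sum(cols[i][row] * v[i] for i in range(m)) for row in range(m)]

def two_coloring_rounding(vectors, coloring, rng=random):
    """Set x_i = C(R v_i) for one random R shared by all vectors."""
    R = random_rotation(len(vectors[0]), rng)
    return [coloring(apply_rotation(R, v)) for v in vectors]
```

For example, `two_coloring_rounding(vs, lambda p: 1 if p[0] >= 0 else -1)` colors the sphere by a half-space, so with this particular coloring the method behaves like random hyperplane rounding.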
Note that when 2-coloring rounding is used, the probability $\Pr[x_i \neq x_j]$ depends
only on the angle between $v_i$ and $v_j$.
2-coloring rounding relates to Gegenbauer polynomials rounding by the following
argument: Let WINDMILL$_k$ be the 2-coloring of $S^1$ in which a point
$(\cos\theta, \sin\theta) \in S^1$ is colored black ($-1$) if
$\theta \in [\frac{2\ell\pi}{k}, \frac{(2\ell+1)\pi}{k})$, and white ($+1$) otherwise
(where $0 \le \ell < k$). The 2-coloring WINDMILL$_4$, for example, is shown in
Figure 2. Denote the angle between $v_i$ and $v_j$ by $\theta_{ij}$. Also denote by
$\Pr_{WM}[x_i \neq x_j]$ the probability that $x_i \neq x_j$ if 2-coloring rounding with
WINDMILL$_k$ is used, and by $\Pr_{GEGE}[x_i \neq x_j]$ the probability that $x_i \neq x_j$ if
Gegenbauer polynomials rounding with the polynomial $G_k^{(2)}$ is used. Then,
$$\Pr_{WM}[x_i \neq x_j] = \begin{cases}
\dfrac{\theta_{ij} - 2\ell\pi/k}{\pi/k} & \text{if } \theta_{ij} \in [\frac{2\ell\pi}{k}, \frac{(2\ell+1)\pi}{k}) \text{ where } 0 \le \ell < k \\[2ex]
\dfrac{(2\ell+2)\pi/k - \theta_{ij}}{\pi/k} & \text{if } \theta_{ij} \in [\frac{(2\ell+1)\pi}{k}, \frac{(2\ell+2)\pi}{k}) \text{ where } 0 \le \ell < k
\end{cases}$$

Fig. 3. YINYAN 2-coloring
In other words, $\Pr_{WM}[x_i \neq x_j] = \frac{\arccos(\cos(k\theta_{ij}))}{\pi}$. On the other hand, by Corollary 1,
$\Pr_{GEGE}[x_i \neq x_j] = \frac{\arccos\left(G_k^{(2)}(\cos\theta_{ij})\right)}{\pi}$. As the $G_k^{(2)}$ are Chebyshev polynomials of the
first kind, $G_k^{(2)}(\cos\theta_{ij}) = \cos(k\theta_{ij})$. Hence, $\Pr_{WM}[x_i \neq x_j] = \Pr_{GEGE}[x_i \neq x_j]$.
Goemans' unpublished result was obtained in this way, with $k = 4$.
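The equivalence can be checked numerically in the plane. In the sketch below (hypothetical function names, not from the paper), `pr_wm_numeric` averages over a fine grid of rotation angles instead of sampling them, and `g4` is the polynomial $8x^4 - 8x^2 + 1$ used earlier.

```python
import math

def windmill(k, theta):
    """Color of (cos theta, sin theta): -1 on [2l*pi/k, (2l+1)*pi/k), else +1."""
    sector = math.floor((theta % (2 * math.pi)) / (math.pi / k))
    return -1 if sector % 2 == 0 else +1

def pr_wm(k, theta):
    # Closed form from the text: arccos(cos(k * theta)) / pi.
    return math.acos(math.cos(k * theta)) / math.pi

def g4(x):
    # Chebyshev polynomial of the first kind of degree 4.
    return 8 * x ** 4 - 8 * x ** 2 + 1

def pr_wm_numeric(k, theta, steps=20000):
    # Fraction of rotation angles a for which the points at angles a and
    # a + theta receive different colors (midpoint grid over [0, 2*pi)).
    separated = 0
    for s in range(steps):
        a = 2 * math.pi * (s + 0.5) / steps
        if windmill(k, a) != windmill(k, a + theta):
            separated += 1
    return separated / steps

for theta in (0.3, 2.0, 2.331122):
    print(theta, pr_wm(4, theta), pr_wm_numeric(4, theta),
          math.acos(g4(math.cos(theta))) / math.pi)
```

Up to discretization error, the three printed probabilities coincide for each angle.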
6 An Algorithm for Three Dimensions
Our improved approximation algorithm uses 2-coloring rounding, combined with ran-
dom hyperplane rounding. A ‘good’ 2-coloring would be one that helps the random
hyperplane rounding procedure with the ratio of the so-called ‘bad’ angles.
In our approximation algorithm for three dimensions, we use a simple 2-coloring
we call YINYAN. We give a description of YINYAN in the spherical coordinates
$(\theta, \varphi)$, where $0 \le \theta \le 2\pi$ and $-\frac{\pi}{2} \le \varphi \le \frac{\pi}{2}$ (in our notation, the Cartesian
representation of $(\theta, \varphi)$ is $(\cos\theta\cos\varphi, \sin\theta\cos\varphi, \sin\varphi)$).
$$\text{YINYAN}(\theta, \varphi) = \begin{cases}
+1 & \text{if } 4\varphi > \pm\sqrt{\sin(4\theta)} \\
-1 & \text{otherwise}
\end{cases}$$
Here $\pm\sqrt{x} = \mathrm{sgn}(x)\sqrt{|x|}$. Figure 3 shows the YINYAN 2-coloring. Note that the
coloring of the equator in YINYAN is identical to the coloring WINDMILL$_4$,
which was used by Goemans in the plane.
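As a small illustration, the coloring can be written as a function of the spherical coordinates. The names `signed_sqrt`, `yinyan` and `yinyan_xyz` are hypothetical, and the comparison is taken to be $4\varphi > \pm\sqrt{\sin 4\theta}$, the direction under which the equator agrees with WINDMILL$_4$ (black on $[0, \pi/4)$); reversing it only exchanges the two colors.

```python
import math

def signed_sqrt(x):
    # The "plus-minus square root" of the text: sgn(x) * sqrt(|x|).
    return math.copysign(math.sqrt(abs(x)), x)

def yinyan(theta, phi):
    """YINYAN color (+1 or -1) of the point with azimuth theta, elevation phi."""
    return +1 if 4 * phi > signed_sqrt(math.sin(4 * theta)) else -1

def yinyan_xyz(x, y, z):
    # Convert a nonzero Cartesian point to (theta, phi) as defined in the text.
    theta = math.atan2(y, x)
    phi = math.atan2(z, math.hypot(x, y))
    return yinyan(theta, phi)

# On the equator (phi = 0) the colors alternate every pi/4, like WINDMILL_4.
print([yinyan(math.pi / 8 + j * math.pi / 4, 0.0) for j in range(8)])
```

The azimuth returned by `atan2` lies in $(-\pi, \pi]$ rather than $[0, 2\pi]$; this makes no difference because the coloring depends on $\theta$ only through $\sin(4\theta)$.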
Algorithm [MaxCut – YINYAN ]:
1. Solve the MAX CUT semideﬁnite program
2. Let CUT1 be a cut obtained by running random hyperplane rounding
3. Let CUT2 be a cut obtained by running 2-coloring rounding with YINYAN 2-
coloring
4. Return the largest of CUT1 and CUT2

We denote the weight of a cut CUT by wt(CUT), the size of the optimal cut
by OPT, and the expected weight of the cut produced by the algorithm by ALG. In
addition, we denote by $\Pr_{GW}[x_i \neq x_j]$ the probability that $x_i \neq x_j$ if the random
hyperplane rounding of Goemans and Williamson is used, and by $\Pr_{YY}[x_i \neq x_j]$ the
probability that $x_i \neq x_j$ if 2-coloring rounding with the 2-coloring YINYAN is used.
If the angle between $v_i$ and $v_j$ is $\theta_{ij}$, we write the latter also as $\Pr_{YY}[\theta_{ij}]$. In this
notation, for any $0 \le \beta \le 1$,
$$\begin{aligned}
ALG &= E[\max\{wt(CUT_1), wt(CUT_2)\}] \\
&\ge \beta E[wt(CUT_1)] + (1 - \beta) E[wt(CUT_2)] \\
&= \beta \sum_{i,j} w_{ij} \Pr_{GW}[x_i \neq x_j] + (1 - \beta) \sum_{i,j} w_{ij} \Pr_{YY}[x_i \neq x_j]
\end{aligned}$$
Thus, we have the following lower bound on the performance ratio of the algorithm, for
any $0 \le \beta \le 1$:
$$\frac{ALG}{OPT} \ge \frac{\beta \sum_{i,j} w_{ij} \Pr_{GW}[x_i \neq x_j] + (1 - \beta) \sum_{i,j} w_{ij} \Pr_{YY}[x_i \neq x_j]}{\sum_{i,j} w_{ij} \frac{1 - v_i \cdot v_j}{2}}
\ge \inf_{0 < \theta \le \pi} \frac{\beta \frac{\theta}{\pi} + (1 - \beta) \Pr_{YY}[\theta]}{\frac{1 - \cos\theta}{2}}$$
The probability $\Pr_{YY}[\theta]$ may be written as a three-dimensional integral. However,
this integral does not seem to have an analytical representation. We numerically
calculated a lower bound on this integral. A lower bound of 0.8818 on the ratio may
be obtained by choosing $\beta = 0.46$. It is possible to obtain a rigorous proof for the
approximation ratio using a tool such as REALSEARCH (see [Zwi02]). However, this
would require a tremendous amount of work. An easy way for the reader to validate that
using YINYAN yields a better than $\alpha_{GW}$ approximation is to compute $\Pr_{YY}[\theta]$
for angles around the 'bad' angle $\theta_0 \approx 2.331122$. This can be done with the simple,
easy-to-check MATLAB code given in Appendix B.
An improved algorithm is also obtained using, say, the coloring
$$\text{SINCOL}(\theta, \varphi) = \begin{cases}
+1 & \text{if } 2\varphi > \sin(4\theta) \\
-1 & \text{otherwise}
\end{cases}$$
However, a better ratio is obtained by taking YINYAN instead of SINCOL or any
version of SINCOL with adjusted constants. There is no reason to believe that the
current function is optimal. We were mainly interested in showing that a ratio strictly
larger than the ratio of Goemans and Williamson can be obtained.

7 Concluding Remarks
We presented two new techniques for rounding solutions of semidefinite programming
relaxations. We used the first technique to give an approximation algorithm that matches
the integrality gap of MAX CUT for instances embedded in the plane. We used the second
method to obtain an approximation algorithm which is better than the
Goemans-Williamson approximation algorithm for instances that have a solution in $\mathbb{R}^3$.
The latter approximation algorithm rules out the existence of a graph with integrality
ratio arbitrarily close to $\alpha_{GW}$ and an optimal solution in $\mathbb{R}^3$, thus resolving an open
problem raised by Feige and Schechtman [FS01]. We conjecture that the 2-coloring rounding
method can be used to obtain an approximation algorithm with a performance ratio
strictly larger than $\alpha_{GW}$ for any fixed dimension.
It can be shown that adding “triangle constraints” to the MAX CUT semidefinite
programming relaxation rules out a non-integral solution in the plane. It would be inter-
esting to derive similar approximation ratios for three (or higher) dimensional solutions
that satisfy the “triangle constraints”.
Acknowledgments
We would like to thank Olga Sorkine for her help in rendering Figure 3.
References
[AN04] N. Alon and A. Naor. Approximating the cut-norm via Grothendieck’s inequality. In
Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago,
Illinois, pages 72–80, 2004.
[CW04] M. Charikar and A. Wirth. Maximizing Quadratic Programs: Extending
Grothendieck’s Inequality. In Proceedings of the 45th Annual IEEE Symposium on
Foundations of Computer Science, Rome, Italy, pages 54–60, 2004.
[DP93a] C. Delorme and S. Poljak. Combinatorial properties and the complexity of a max-cut
approximation. European Journal of Combinatorics, 14(4):313–333, 1993.
[DP93b] C. Delorme and S. Poljak. Laplacian eigenvalues and the maximum cut problem.
Mathematical Programming, 62(3, Ser. A):557–574, 1993.
[FG95] U. Feige and M. X. Goemans. Approximating the value of two prover proof sys-
tems, with applications to MAX-2SAT and MAX-DICUT. In Proceedings of the
3rd Israel Symposium on Theory and Computing Systems, Tel Aviv, Israel, pages
182–189, 1995.
[FL01] U. Feige and M. Langberg. The RPR^2 rounding technique for semidefinite
programs. In Proceedings of the 28th Int. Coll. on Automata, Languages and
Programming, Crete, Greece, pages 213–224, 2001.
[FS01] U. Feige and G. Schechtman. On the integrality ratio of semidefinite relaxations
of MAX CUT. In Proceedings of the 33rd Annual ACM Symposium on Theory of
Computing, Crete, Greece, pages 433–442, 2001.
[GW95] M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for
Maximum Cut and Satisfiability Problems Using Semidefinite Programming. Jour-
nal of the ACM, 42:1115–1145, 1995.

[Hås01] J. Håstad. Some optimal inapproximability results. Journal of the ACM, 48(4):798–
859, 2001.
[KKMO04] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability
results for MAX-CUT and other 2-variable CSPs? In Proceedings of the 45th Annual
IEEE Symposium on Foundations of Computer Science, Rome, Italy, pages 146–154,
2004.
[LLZ02] M. Lewin, D. Livnat, and U. Zwick. Improved rounding techniques for the MAX
2-SAT and MAX DI-CUT problems. In Proceedings of the 9th IPCO, Cambridge,
Massachusetts, pages 67–82, 2002.
[MM01a] S. Matuura and T. Matsui. 0.863-approximation algorithm for MAX DICUT. In
Approximation, Randomization and Combinatorial Optimization: Algorithms and
Techniques, Proceedings of APPROX-RANDOM’01, Berkeley, California, pages
138–146, 2001.
[MM01b] S. Matuura and T. Matsui. 0.935-approximation randomized algorithm for MAX
2SAT and its derandomization. Technical Report METR 2001-03, Department of
Mathematical Engineering and Information Physics, the University of Tokyo, Japan,
September 2001.
[Sch42] I. J. Schoenberg. Positive definite functions on spheres. Duke Math J., 9:96–107,
1942.
[Zwi99] U. Zwick. Outward rotations: a tool for rounding solutions of semidefinite
programming relaxations, with applications to MAX CUT and other problems. In
Proceedings of the 31st Annual ACM Symposium on Theory of Computing, Atlanta,
Georgia, pages 679–687, 1999.
[Zwi02] U. Zwick. Computer assisted proof of optimal approximability results. In Proceed-
ings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, San Fran-
cisco, California, pages 496–505, 2002.
Appendix A
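Lemma 3. Let $\beta = \frac{4}{5}\left(\frac{\pi}{5}\cot\frac{2\pi}{5} + 1\right)$ and
$\alpha_{GEGE} = \frac{4}{5\sin^2(2\pi/5)} = \frac{32}{25+5\sqrt{5}}$. Then, for every $\theta \in [0, \pi]$,
$$\beta \frac{\theta}{\pi} + (1 - \beta) \frac{\arccos(\cos 4\theta)}{\pi} \ge \alpha_{GEGE} \frac{1 - \cos\theta}{2}$$
and equality holds if and only if $\theta = 0$ or $\theta = \frac{4\pi}{5}$.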
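Proof. Define $h(\theta) = \beta \frac{\theta}{\pi} + (1 - \beta) \frac{\arccos(\cos 4\theta)}{\pi} - \alpha_{GEGE} \frac{1 - \cos\theta}{2}$.
We show that the minimum of $h(\theta)$ is zero and is attained only at $\theta = 0$ and $\theta = \frac{4\pi}{5}$.
Note that
$$\arccos(\cos 4\theta) = \begin{cases}
4\theta & \text{if } 0 \le \theta < \frac{\pi}{4} \\
2\pi - 4\theta & \text{if } \frac{\pi}{4} \le \theta < \frac{\pi}{2} \\
4\theta - 2\pi & \text{if } \frac{\pi}{2} \le \theta < \frac{3\pi}{4} \\
4\pi - 4\theta & \text{if } \frac{3\pi}{4} \le \theta \le \pi
\end{cases}$$
Hence, for $\theta \in [0, \pi] \setminus \{0, \frac{\pi}{4}, \frac{\pi}{2}, \frac{3\pi}{4}, \pi\}$,
$$h'(\theta) = \begin{cases}
\frac{4 - 3\beta}{\pi} - \frac{1}{2}\alpha_{GEGE}\sin\theta & \text{if } 0 < \theta < \frac{\pi}{4} \text{ or } \frac{\pi}{2} < \theta < \frac{3\pi}{4} \\
\frac{5\beta - 4}{\pi} - \frac{1}{2}\alpha_{GEGE}\sin\theta & \text{if } \frac{\pi}{4} < \theta < \frac{\pi}{2} \text{ or } \frac{3\pi}{4} < \theta < \pi
\end{cases}$$
and $h''(\theta) = -\frac{1}{2}\alpha_{GEGE}\cos\theta$.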
There are three cases:
Case I: $0 \le \theta \le \pi/2$
In this case $h''(\theta) < 0$ for $\theta \notin \{0, \pi/4, \pi/2\}$. Therefore, the minimum of $h(\theta)$ in the
interval $0 \le \theta \le \pi/2$ may be attained only at $\theta = 0$, $\theta = \pi/4$ or $\theta = \pi/2$. Indeed, $h(0) = 0$
and the minimum is attained there. In addition,
$h(\pi/4) = 1 - \frac{3}{4}\beta - \alpha_{GEGE}(1 - \sqrt{2}/2)/2 = 0.14798\ldots > 0$, and
$h(\pi/2) = (\beta - \alpha_{GEGE})/2 = 0.03943\ldots > 0$.
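Case II: $\pi/2 \le \theta \le 3\pi/4$
As for all $\theta \in (\frac{\pi}{2}, \frac{3\pi}{4})$ the function $h''(\theta)$ is defined and positive, the minimum may be
attained when $h'(\theta) = 0$, $\theta = \pi/2$ or $\theta = 3\pi/4$. The unique solution $\theta_0 \in (\frac{\pi}{2}, \frac{3\pi}{4})$ of
the equation $0 = h'(\theta_0) = \frac{4 - 3\beta}{\pi} - \frac{1}{2}\alpha_{GEGE}\sin\theta_0$ is
$\theta_0 = \pi - \arcsin\left(\frac{4 - 3\beta}{\pi} \cdot \frac{2}{\alpha_{GEGE}}\right)$,
that is, the branch of the inverse sine lying in $(\frac{\pi}{2}, \frac{3\pi}{4})$.
However, for this local minimum, $h(\theta_0) = 0.00146\ldots > 0$. In addition, as
mentioned before, $h(\frac{\pi}{2}) > 0$. Moreover,
$h(\frac{3\pi}{4}) = \frac{3}{4}\beta + (1 - \beta) - \frac{1}{2}\alpha_{GEGE}(1 + \frac{\sqrt{2}}{2}) = 0.00423\ldots > 0$.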
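Case III: $3\pi/4 \le \theta \le \pi$
Again, for all $\theta \in (\frac{3\pi}{4}, \pi)$ the function $h''(\theta)$ is defined and positive. Hence, the
minimum may be attained when $h'(\theta) = 0$, $\theta = 3\pi/4$ or $\theta = \pi$. The unique solution
$\theta_0 \in (\frac{3\pi}{4}, \pi)$ of the equation $0 = h'(\theta_0) = \frac{5\beta - 4}{\pi} - \frac{1}{2}\alpha_{GEGE}\sin\theta_0$ satisfies
$$\sin\theta_0 = \frac{5\beta - 4}{\pi} \cdot \frac{2}{\alpha_{GEGE}}
= \frac{4}{5}\cot\left(\frac{2\pi}{5}\right) \cdot \frac{2 \cdot 5\sin^2(\frac{2\pi}{5})}{4}
= \sin\left(\frac{4\pi}{5}\right),$$
so $\theta_0 = \frac{4\pi}{5}$. For this local minimum, $h(\theta_0) = 0$. In addition, as mentioned before,
$h(\frac{3\pi}{4}) > 0$. Moreover, $h(\pi) = \beta - \alpha_{GEGE} > 0$.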
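The case analysis can be sanity-checked numerically. The following sketch (hypothetical names, not part of the proof) evaluates $h$ on a fine grid of $[0, \pi]$ that contains $4\pi/5$ and compares a few values with the constants quoted above; it is a plausibility check only and does not replace the argument.

```python
import math

BETA = 0.8 * (math.pi / 5 / math.tan(2 * math.pi / 5) + 1)
ALPHA_GEGE = 32 / (25 + 5 * math.sqrt(5))

def h(theta):
    # h(theta) = beta*theta/pi + (1-beta)*arccos(cos 4theta)/pi
    #            - alpha_GEGE*(1 - cos theta)/2
    c = max(-1.0, min(1.0, math.cos(4 * theta)))  # guard against rounding
    return (BETA * theta / math.pi
            + (1 - BETA) * math.acos(c) / math.pi
            - ALPHA_GEGE * (1 - math.cos(theta)) / 2)

# Grid of 100001 points; index 80000 is exactly 4*pi/5 up to rounding.
grid = [math.pi * i / 100000 for i in range(100001)]
min_h = min(h(t) for t in grid)

print(BETA, ALPHA_GEGE)
print(min_h, h(4 * math.pi / 5), h(math.pi / 4), h(3 * math.pi / 4))
```

On the grid the minimum is zero up to floating-point error, attained at $\theta = 0$ and $\theta = 4\pi/5$.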

Appendix B
A short MATLAB routine to compute the probability that two vectors on $S^2$ with a
given angle between them are separated by rounding with the YINYAN 2-coloring. The
routine uses Monte Carlo with ntry trials.
For example, for the angle 2.331122, by running yinyan(2.331122, 10000000) we
get that the probability is 0.7472. This is strictly larger than the random hyperplane
rounding separating probability of $\frac{2.331122}{\pi} \approx 0.7420$.
function p = yinyan(angle,ntry)
c = cos(angle); s = sin(angle);
cnt = 0;
for i=1:ntry
  % Choose two random vectors with the given angle.
  u = randn(1,3); u = u/sqrt(u*u');
  v = randn(1,3); v = v - (u*v')*u; v = v/sqrt(v*v');
  v = c*u + s*v;
  % Convert to spherical coordinates (azimuth tet, elevation phi)
  [tet1,phi1] = cart2sph(u(1),u(2),u(3));
  [tet2,phi2] = cart2sph(v(1),v(2),v(3));
  % Check whether they are separated, i.e. receive different colors
  % ('>' as in the definition of YINYAN; using '<' gives the same count)
  cnt = cnt + ((4*phi1>f(4*tet1))~=(4*phi2>f(4*tet2)));
end
p = cnt/ntry;
function y = f(x)
% Signed square root of sin(x)
s = sin(x);
y = sign(s)*sqrt(abs(s));
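For readers without MATLAB, the same Monte Carlo experiment can be sketched in Python using only the standard library. This is a hedged port, not the authors' code: the function names and the `seed` parameter are additions, and the comparison `>` is assumed as in the definition of YINYAN (the estimate does not depend on its direction).

```python
import math
import random

def signed_sqrt(x):
    return math.copysign(math.sqrt(abs(x)), x)

def color(p):
    # YINYAN color of a nonzero point p = (x, y, z), returned as a boolean.
    x, y, z = p
    theta = math.atan2(y, x)               # azimuth
    phi = math.atan2(z, math.hypot(x, y))  # elevation
    return 4 * phi > signed_sqrt(math.sin(4 * theta))

def unit(p):
    n = math.sqrt(sum(a * a for a in p))
    return [a / n for a in p]

def yinyan_probability(angle, ntry, seed=0):
    """Estimate the probability that two unit vectors at the given angle
    receive different YINYAN colors after a uniformly random rotation."""
    rng = random.Random(seed)
    c, s = math.cos(angle), math.sin(angle)
    cnt = 0
    for _ in range(ntry):
        # u is uniform on the sphere; w is a random unit vector orthogonal to u.
        u = unit([rng.gauss(0, 1) for _ in range(3)])
        w = [rng.gauss(0, 1) for _ in range(3)]
        d = sum(a * b for a, b in zip(u, w))
        w = unit([a - d * b for a, b in zip(w, u)])
        v = [c * a + s * b for a, b in zip(u, w)]
        cnt += color(u) != color(v)
    return cnt / ntry

print(yinyan_probability(2.331122, 100000))
```

With a modest number of trials the estimate fluctuates in the third decimal place, so a much larger `ntry`, as in the MATLAB example above, is needed to resolve the gap between 0.7472 and 0.7420.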

What Would Edmonds Do? Augmenting Paths
and Witnesses for Degree-Bounded MSTs
Kamalika Chaudhuri1, Satish Rao1, Samantha Riesenfeld1, and Kunal Talwar2,†
1 UC Berkeley
{kamalika,satishr,samr}@cs.berkeley.edu
2 Microsoft Research, Redmond, WA
kunal@microsoft.com
Abstract. Given a graph and degree upper bounds on vertices, the
BDMST problem requires us to find the minimum cost spanning tree
respecting the given degree bounds. Könemann and Ravi [10, 11] give
bicriteria approximation algorithms for the problem using local search
techniques of Fischer [5]. For a graph with a cost C, degree B spanning
tree, and parameters $b, w > 1$, their algorithm produces a tree whose cost
is at most $wC$ and whose degree is at most $\frac{w}{w-1} bB + \log_b n$. We give a
polynomial-time algorithm that finds a tree of optimal cost and with
maximum degree at most $bB + 2(b+1)\log_b n$. We also give a quasi-polynomial
algorithm which produces a tree of optimal cost C and maximum degree
bounded by $B + O(\log n / \log\log n)$. Our algorithms work when there are
upper as well as lower bounds on the degrees of the vertices.
1 Introduction
We study the problem of finding a minimum cost spanning tree in a graph G =
(V, E) such that the tree meets degree bounds on the vertices in V. A solution
to this problem, called a degree-bounded minimum spanning tree (BDMST),
can be used to find a minimum cost travelling salesman path (TSPP) between
two endpoints by finding the minimum cost spanning tree where the endpoints
have degree bounds of one and the internal nodes have degree bounds of two.
Since the Hamiltonian path problem can be solved given even a polynomial
factor approximation algorithm for TSPP, TSPP is NP-hard to approximate
to within any computable factor, and so we expect that it is necessary
to relax the degree bounds to get any reasonable approximation algorithm. (Note
the implicit assumption that the edge costs of the input graph do not define a
metric.)

Research partially supported by the Berkeley Fellowship and NSF Grant CCR-0121555
Research partially supported by NSF Grant 0013128000
Research partially supported by NSF Graduate Fellowship
† Part of this research was performed when the author was a student at UC Berkeley
and was supported by the NSF via grants CCR-0121555 and CCR-0105533
The BDMST problem [14] also has several applications in efficient network
routing protocols. For example, in multicast networks, one wishes to build a
spanning tree that minimizes the total cost of the network as well as the
maximum work done by (that is, the degree of) any vertex. The BDMST problem is
a natural formulation of these dueling constraints.
For the sake of simplicity, we focus on a version of the problem in which each
node has the same degree bound B, though our results generalize to the case of
nonuniform degree bounds. We refer to this case as the uniform version of the
problem. Bicriteria approximation algorithms for this problem were first given
by Ravi et al. [14, 15]. The best previous results are by Könemann and Ravi
[10, 11]. They give an algorithm that, for a graph with a BDMST of maximum
degree B and cost C, and for any $w > 1$, finds a spanning tree with maximum
degree $\frac{w}{w-1} bB + \log_b n$, for any constant b, and cost $wC$. While this expression
yields a variety of tradeoffs, one can observe that the Könemann-Ravi algorithm
always either violates the degree constraints or exceeds the optimal cost by a
factor of two.
We give a polynomial-time algorithm that removes the error in the cost
completely and also a quasi-polynomial-time algorithm that both obtains a solution
of optimal cost and significantly improves the error in the degree bounds.
Specifically, we give a polynomial-time algorithm that, given a graph with a BDMST of
optimal cost C and maximum degree B, finds a spanning tree of cost exactly C
and maximum degree $bB + 2(b+1)\log_b n$ for any constant b. The main technical
contribution here is the use of a subroutine that finds minimum cost spanning
trees while enforcing both lower and upper bounds on degrees. This allows for
more natural and better performing cost-bounding techniques, which we discuss
below. We also give a quasi-polynomial-time algorithm that finds a spanning
tree of maximum degree $B + O(\frac{\log n}{\log\log n})$ and optimal cost C. If the degree is
$\omega(\log n / \log\log n)$, our algorithm gives a $1 + o(1)$ approximation in the degree
while achieving optimal cost. The main technical contribution here is the use of
augmenting paths to find low-degree minimum spanning trees, where Könemann
and Ravi [10] used a local optimization procedure to solve this subproblem. We
discuss this technique further below as well.
In a recent paper, Könemann and Ravi [11] improve their algorithms both
in terms of implementation and applicability by providing an algorithm that
can deal with nonuniform degree bounds and does not rely on explicitly
solving a linear program. Like theirs, our polynomial-time algorithm works for
nonuniform bounds, but it also achieves cost optimality. Our quasi-polynomial-time
algorithm extends easily to nonuniform degree bounds since our error is additive,
rather than multiplicative, for this version of the algorithm. The algorithms that
we give for enforcing lower-bound degree constraints require new ideas and may
also be of independent interest. Our results extend to a more general version of
the BDMST problem, in which upper bounds and lower bounds on the degree of
vertices are given.
Previous Techniques
Fürer and Raghavachari [6] give a beautiful algorithm for unweighted graphs
that finds a spanning tree of degree Δ + 1 in any graph that has a spanning tree
of optimal degree Δ. This is optimal, in the sense that an exact algorithm would
give an algorithm for Hamiltonian path. Their algorithm takes any spanning
tree and relieves the degree of a high-degree node by an alternating sequence
of edge additions and deletions. While there is a node in the current tree of
degree at least Δ + 1, their algorithm either finds such a way to lower its degree
or certifies that the maximum degree of any spanning tree must be at least Δ.
The certificate consists of a set of nodes S whose removal leaves at least Δ|S|
connected components in the graph. This implies that the average degree of the
set S must be at least Δ. This algorithm is reminiscent of Edmonds’ algorithm
for unweighted matching [3].
For the weighted version of the BDMST problem, we consult a different
algorithm of Edmonds[2] that computes weighted matchings. This algorithm can
be viewed as first finding a solution to the dual of the matching problem: it finds
an assignment of penalties to nodes that lengthens adjacent edges and then finds
a maximal packing of balls around subsets of nodes of the graph. Once the dual
is known, one can ignore the values of the weights on the edges and rely solely
on an un-weighted matching algorithm.
Könemann and Ravi begin along Edmonds' path by examining a linear
programming relaxation for the BDMST problem and its dual^1. Unfortunately, in
this case, the combinatorial subroutine must produce a low-degree minimum cost
spanning tree in the graph where the edge costs have been modified by the dual
potentials. To find an MST in the modified graph that has (approximately)
minimum degree over all MSTs, Könemann and Ravi use an algorithm of Fischer.
For a graph with an MST of degree Δ, Fischer's algorithm provides an MST of
degree at most $b\Delta + \log_b n$ for any $b > 1$. They show how to get a good solution
by solving a linear program with a relaxed (by a factor of $w/(w-1)$) degree
constraint and then applying Fischer's algorithm to the dual.
Our Results and Techniques
As mentioned above, Könemann and Ravi’s methods introduce a tradeoff be-
tween degree and cost that always creates a constant-factor error somewhere.
In this paper, we remedy this problem by following along the lines of Edmonds’
approach to weighted matchings. We show that any MST in the modified graph
in which the nodes with dual penalties have (relatively) high degree has low cost
in the original graph. Thus, instead of just finding an MST of low maximum
degree, we also wish to enforce that the nodes with penalties have high degree.
This entails developing algorithms and certificates for enforcing degree lower
bounds, in addition to our algorithms for enforcing degree upper bounds. More-
over, we have to combine the algorithms to simultaneously enforce the high- and
low-degree constraints.
^1 As in the case of Edmonds' non-bipartite matching algorithm, the linear program
and its dual are of exponential size, though their solutions can be found in polynomial
time.

In addition to improving the cost of the solution, we improve the degree
bounds at the cost of a quasi-polynomial running time. Our approach relies on
our analogy with bipartite matching. As a guide to our approach, consider the
problem on bipartite graphs of assigning each node on the “left” to a node on the
“right” while minimizing the maximum degree of any node on the right. This
problem can naturally be solved using matching algorithms. Consider instead
finding a locally optimal solution where for each node on the left, all of its
neighbors have degree within 1 of each other. One can then prove that no node
on the right has degree higher than bΔ + logb n where Δ is the optimum value.
The proof uses Hall’s lower bound to show that any breadth first search in the
graph of matched edges is shallow, and the depth of such a tree bounds the
maximum degree of this graph. This is the framework of Fischer’s algorithm for
finding a minimum cost spanning tree of degree bΔ + logb n.
In the case of the bipartite graph problem above, an augmenting or alternating
path algorithm for matching gives a much better solution than the locally
greedy algorithm. Based on this insight, we use augmenting paths to address
Fischer's problem of finding an MST of minimum maximum degree. By doing so,
we are able to find a minimum cost tree of degree at most $\Delta + O(\frac{\log n}{\log\log n})$. In
contrast to Fischer's local (or single-swap) approach, our algorithm finds a
sequence of moves to decrease the degree of one high-degree node. This introduces
significant complications since changing the structure of the tree also changes
which swaps are legal.
We also remark that our augmenting path algorithms, though based on the
most powerful methods in matching, fall short of the Δ + 1 that Fürer and
Raghavachari obtain. While we make some progress by removing all the leading
constants, and by bounding the cost using degree lower bounds, Δ + 1 is a
tantalizing target.
2 Bounded Degree Minimum Spanning Trees
BDMST Problem: Given a weighted graph $G = (V, E, c)$, $c : E \to \mathbb{R}^+$,
and a positive integer $B \ge 2$, find the minimum cost spanning tree in the set
$\{T : \forall v \in V, \deg_T(v) \le B\}$.
Könemann and Ravi [11] show that an MST of minimum degree for a cer-
tain cost function is also a BDMST with analogous guarantees on degree and
with cost within a constant factor of the optimal. We show in this section that
an algorithm which obtains an MST with guarantees on both maximum and
minimum degrees can be used to produce a BDMST with optimal cost and
analogous degree bounds. We present an algorithm in Section 6 that obtains
such guarantees.
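Linear Programs for BDMST: An integer linear program for the BDMST
problem is given by:
$$\begin{aligned}
\mathrm{opt}_B = \min\ & \sum_{e \in E} c_e x_e \\
\text{s.t.}\ & \sum_{e \in \delta(v)} x_e \le B \quad \forall v \in V; \qquad x \in sp_G; \qquad x_e \in \{0, 1\}
\end{aligned} \tag{1}$$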

where $\delta(v)$ is the set of edges of E that are incident to v, and $sp_G$ is the convex
hull of edge-incidence vectors of spanning trees of G. A tree defined by a vector
$x \in sp_G$, the entries of which are not necessarily all integer, is called a fractional
spanning tree. It can be written as a convex combination of spanning trees of
G. For a fractional tree T with edge incidence vector $x \in sp_G$, let
$\deg_T(v) = \sum_{e \in \delta(v)} x_e$. We can write the linear program relaxation of (1) as
$\min\{c(T) : T \in sp_G, \forall v \in V\ \deg_T(v) \le B\}$. The approach used by Könemann and
Ravi [11] is to take the Lagrangean dual of this LP, given by
$\max_{\lambda \ge 0} \min_{T \in sp_G} \{c(T) + \sum_{v \in V} \lambda_v (\deg_T(v) - B)\}$.
We can think of the optimal solution to the dual as a vector $\lambda^B$ of Lagrangean
multipliers on the nodes and a set $O^B$ of optimal trees, such that every tree
$T^B \in O^B$ minimizes $c(T^B) + \sum_{v \in V} \lambda^B_v (\deg_{T^B}(v) - B)$. The
multipliers $\lambda^B$ and the set $O^B$ can be computed in polynomial time. The optimal
value $\mathrm{opt}_{LD(B)}$ of this dual program is a lower bound on $\mathrm{opt}_B$ and a tight lower
bound on the optimal value of the LP relaxation.
Following the analysis of [11], let us define a new cost function $c^{\lambda^B}$, where
$c^{\lambda^B}(e) = c_e + \lambda^B_u + \lambda^B_v$ for an edge $e = (u, v)$. Since $B \sum_{v \in V} \lambda^B_v$ is constant for
a fixed choice of $\lambda^B$, every tree in $O^B$ is an MST of G under the cost function
$c^{\lambda^B}$. The optimal solution to the linear program is a fractional MST
$T^f_B = \sum_{T \in O^B} \alpha_T T$. Note that the degree of v in $T^f_B$ is
$\deg_{T^f_B}(v) = \sum_T \alpha_T \deg_T(v)$.
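The reason every tree in the optimal set is an MST under the modified cost is the identity $c^{\lambda}(T) = c(T) + \sum_v \lambda_v(\deg_T(v) - B) + B\sum_v \lambda_v$, which holds for every spanning tree T. The sketch below checks it on a toy instance; the graph, the multipliers and all function names are made up for illustration and are not taken from the paper.

```python
def modified_cost(cost, lam):
    """c^lambda(e) = c_e + lambda_u + lambda_v for every edge e = (u, v)."""
    return {(u, v): c + lam[u] + lam[v] for (u, v), c in cost.items()}

def degrees(tree_edges):
    deg = {}
    for u, v in tree_edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def lagrangian(tree_edges, cost, lam, B):
    """c(T) + sum_v lambda_v * (deg_T(v) - B)."""
    deg = degrees(tree_edges)
    return (sum(cost[e] for e in tree_edges)
            + sum(l * (deg.get(v, 0) - B) for v, l in lam.items()))

# Toy instance (hypothetical): a star and a path on the vertices a, b, c, d.
cost = {("a", "b"): 1.0, ("a", "c"): 1.0, ("a", "d"): 1.0,
        ("b", "c"): 2.0, ("c", "d"): 2.0}
lam = {"a": 0.5, "b": 0.0, "c": 0.0, "d": 0.0}
B = 2
star = [("a", "b"), ("a", "c"), ("a", "d")]
path = [("a", "b"), ("b", "c"), ("c", "d")]

c_lam = modified_cost(cost, lam)
for tree in (star, path):
    lhs = sum(c_lam[e] for e in tree)
    rhs = lagrangian(tree, cost, lam, B) + B * sum(lam.values())
    print(lhs, rhs)
```

Because the two sides differ only by the constant $B\sum_v \lambda_v$, minimizing the Lagrangean objective over trees is the same as finding an MST under $c^{\lambda}$.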
Let $L_B = \{v : \lambda^B_v > 0\}$ be the set of nodes with positive dual variables in
the optimal solution. Complementary slackness conditions applied to the optimal
solutions of the linear program and its dual imply the following claim.
Lemma 1 For all $v \in V$, $\deg_{T^f_B}(v) \le B$, and for all $v \in L_B$, $\deg_{T^f_B}(v) = B$.
So we are given the existence of the fractional MST $T^f$ (under the cost
function $c^{\lambda}$) that meets both upper and lower degree bounds (i.e. no node has
degree more than B and every node in $L_B$ has degree exactly B). Our approach to
solving the BDMST problem is to find an integral MST, under a slightly different
cost function, that meets both the upper and lower degree bounds approximately.
Then we use the dual program to show that this tree does not cost too much
more than $T^f$.
Let $B^* = B + \omega$, for some $\omega > 0$ to be specified later. Let $T^{B^*} \in O^{B^*}$, so
$T^{B^*}$ is an MST under the cost function $c^{\lambda^{B^*}}$. Since $\lambda^{B^*}$ is a feasible solution for
the dual LP, it is clear that $c(T^{B^*}) + \sum_{v \in V} \lambda^{B^*}_v (\deg_{T^{B^*}}(v) - B) \le \mathrm{opt}_{LD(B)}$.
Further, if $T^{B^*}$ has the property that for every node $v \in L_{B^*}$, $\deg_{T^{B^*}}(v)$ is at
least $B = B^* - \omega$, the second term in the above expression is non-negative and
hence $c(T^{B^*})$ is at most $\mathrm{opt}_{LD(B)}$.
Lemma 2 Let T be an MST of G under the cost function $c^{\lambda^{B^*}}$ such that for
every $v \in L_{B^*}$, $\deg_T(v) \ge B$. Then $c(T) \le \mathrm{opt}_{LD(B)}$.
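The argument behind Lemma 2 can be written as one chain of (in)equalities. The derivation below is a sketch assembled from the definitions above; it uses only that $\lambda^{B^*} \ge 0$ is feasible for the dual with bound B, that a linear function over $sp_G$ is minimized at a spanning tree, and that all MSTs under $c^{\lambda^{B^*}}$ have the same modified cost.

```latex
\begin{align*}
\mathrm{opt}_{LD(B)}
  &\ge \min_{T' \in sp_G} \Big\{ c(T') + \sum_{v \in V} \lambda^{B^*}_v \big(\deg_{T'}(v) - B\big) \Big\}
     && \text{(feasibility of } \lambda^{B^*}\text{)} \\
  &= \min_{T' \in sp_G} c^{\lambda^{B^*}}(T') \;-\; B \sum_{v \in V} \lambda^{B^*}_v \\
  &= c^{\lambda^{B^*}}(T) \;-\; B \sum_{v \in V} \lambda^{B^*}_v
     && \text{(} T \text{ is an MST under } c^{\lambda^{B^*}}\text{)} \\
  &= c(T) + \sum_{v \in L_{B^*}} \lambda^{B^*}_v \big(\deg_T(v) - B\big)
     && \text{(} \lambda^{B^*}_v = 0 \text{ outside } L_{B^*}\text{)} \\
  &\ge c(T)
     && \text{(} \deg_T(v) \ge B \text{ on } L_{B^*}\text{)}.
\end{align*}
```

In particular the bound holds for every MST under the modified cost that meets the degree lower bounds, not only for the trees in $O^{B^*}$.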
MSTs with Maximum and Minimum Degree Bounds
In Section 2, we observed that the problem of finding a BDMST can be reduced to
the problem of finding an actual MST, under the cost function $c^{\lambda}$, which satisfies
the upper and lower bounds on the degrees. Our approach is to start with an
arbitrary MST $T \in O^{B^*}$ and try to decrease its maximum degree. So that we
can bound the cost of T with respect to the optimal BDMST, we simultaneously
increase to a relatively high degree of at least B the degree of each node in
$L_{B^*} = \{v \in V : \lambda^{B^*}_v > 0\}$. We first consider meeting these dueling constraints
one at a time. In Section 4 we give an algorithm for finding an MST in which
the maximum degree bounds are approximately met. Our algorithm produces
a witness W that certifies that the tree we find has near-optimal maximum
degree. In Section 5, we show how to find an MST in which a specified subset L
of the nodes meet minimum degree bounds approximately. Again, the algorithm
produces a witness $W_L$ that certifies near-optimality. Finally, in Section 6 we
give an algorithm for achieving these goals simultaneously and an analogous
witness.
3 Minimum Spanning Trees with Degree Bounds
MSTDB Problem: Given a graph G = (V, E) with costs on edges, a degree
upper bound BH, a set of vertices L and a degree lower bound BL, find, if one
exists, a minimum spanning tree of G such that no vertex has degree more than
BH and all vertices in L have degree at least BL.
Note that an MST with these degree bounds may not exist in the graph. If
this happens, the algorithm is expected to provide a combinatorial proof of non-
existence.
We present two algorithms for this problem. The first one, which is based
on Fischer's algorithm for finding a minimum degree MST of a graph, runs in
polynomial time and finds an MST such that (a) every vertex has degree at most
$bB_H + \log_b n$ and (b) all vertices in L have degree at least $B_L/b - \log_b n$. If it fails,
it finds a combinatorial witness to show that there exists no MST of the graph
in which all vertices have degree at most $B_H$ and the vertices in L have degree
at least $B_L$. The second algorithm we present finds an MST with two properties:
(a) every vertex has degree $B_H + O(\frac{\log n}{\log\log n})$ or less and (b) all vertices in L have
degree $B_L - O(\frac{\log n}{\log\log n})$ or more. If it fails, it provides a combinatorial witness to
show that there exists no MST of G in which all vertices have degree at most
$B_H$ and the vertices in L have degree at least $B_L$. Our second algorithm runs in
quasi-polynomial time.
Recall from Section 2 that our BDMST algorithm runs MSTDB with the
following inputs: the cost function is $c'_{uv} = c_{uv} + \lambda_u + \lambda_v$, the degree upper
bound $B_H$ is B, the degree lower bound $B_L$ is B, and L, the set of vertices on
which the degree lower bounds should be enforced, is the set of vertices for which
$\lambda_v > 0$. Lemma 1 guarantees that there always exists a fractional MST for this
cost function in which the maximum degree of any vertex is B and in which all
vertices in L have degree exactly B. The combinatorial witnesses produced by
the algorithm certify non-existence of fractional trees, as well as integral trees,
with the same degree bounds. It follows that the output of the first algorithm is
an MST of maximum degree at most $bB + \log_b n$, where $b > 1$ is a constant, in which the
minimum degree of L is at least $B/b - \log_b n$, and the output of the second algorithm
is an MST with maximum degree $B + O(\frac{\log n}{\log\log n})$ in which the minimum degree
of L is $B - O(\frac{\log n}{\log\log n})$. Otherwise the witnesses produced by the algorithms
provide a contradiction to Lemma 1.
Starting with an arbitrary MST, both of our MSTDB algorithms proceed
in phases of improvement. Each phase is essentially a combination of two al-
gorithms: MAXDMST and MINDMST. MAXDMST is essentially a subroutine
that either decreases the number of “high degree” vertices or produces a witness
to show that the maximum degree of the current tree is close to optimal. Simi-
larly, MINDMST decreases the number of nodes in L which are of “low degree”.
If it fails, it finds a witness to guarantee that the minimum degree of L in the
current tree is close to optimal.
In our first MSTDB algorithm, MAXDMST is essentially Fischer’s algorithm,
and MINDMST is a symmetric version of Fischer’s algorithm that finds an MST
in which a certain set of nodes have maximum minimum degree. Since these
algorithms are largely based on Fischer’s algorithm, we do not provide any more
details on them in this paper. Instead we focus for the next two sections on our
second MSTDB algorithm which uses an approach based on augmenting paths.
4 Minimum Spanning Trees with Degree Upper Bounds
Let $d_{max}(T)$ denote the maximum degree over all vertices in the current MST
T, and let $d^L_{min}(T)$ denote the minimum degree over all vertices in L in T. In
this section, we describe the MAXDMST algorithm, which solves the following
problem.
Min Max Degree MST: Given a graph G and a degree bound $d > 0$, find an
MST T of G such that $d_{max}(T) \le d$.
The main idea behind MAXDMST is to find a series of edge swaps that decreases
in a controlled manner the degree of some high-degree vertex in the tree. We
say that an edge e in tree T can be swapped for edge e' if the following hold:
(a) e' ∉ T, and (b) the unique cycle in T ∪ e' contains e. Performing the swap
(e, e') involves removing e from T and adding e' to produce another tree T'.
A swap (e, e') is called feasible if the edges e and e' have the same weight. Note
that if T is an MST, then the tree T' produced after a feasible swap on T is an
MST as well. MAXDMST attempts to find an augmenting path of swaps that decreases
the degree of a high-degree vertex by one without creating any more vertices of
the same or higher degree.
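The swap conditions above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes the tree is given as a set of undirected edges (each a `frozenset` of two vertices) and that `weight` maps edges to weights. The helper names `tree_path`, `can_swap`, `is_feasible_swap` and `apply_swap` are hypothetical.

```python
from collections import defaultdict


def tree_path(tree_edges, src, dst):
    """Edges on the unique src-dst path of a spanning tree (edges are frozensets)."""
    adj = defaultdict(list)
    for edge in tree_edges:
        a, b = tuple(edge)
        adj[a].append(b)
        adj[b].append(a)
    # Depth-first search from src, recording each vertex's parent.
    parent = {src: None}
    stack = [src]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                stack.append(nxt)
    # Walk back from dst to src along parent pointers.
    path = []
    node = dst
    while parent[node] is not None:
        path.append(frozenset((node, parent[node])))
        node = parent[node]
    return path


def can_swap(tree_edges, e_out, e_in):
    """Conditions (a) and (b): e_in is a non-tree edge and the unique
    cycle of T + e_in contains the tree edge e_out."""
    if e_in in tree_edges or e_out not in tree_edges:
        return False
    u, v = tuple(e_in)
    return e_out in tree_path(tree_edges, u, v)


def is_feasible_swap(tree_edges, weight, e_out, e_in):
    """A swap is feasible if it is a valid swap between equal-weight edges."""
    return can_swap(tree_edges, e_out, e_in) and weight[e_out] == weight[e_in]


def apply_swap(tree_edges, e_out, e_in):
    """Return the edge set of the tree obtained by performing the swap."""
    return (set(tree_edges) - {e_out}) | {e_in}
```

Because a feasible swap exchanges two edges of equal weight, the total weight is unchanged, which is why the resulting tree is again an MST.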
An important part of our algorithm is that when it terminates, it produces
a proof that the resulting tree T has close-to-optimal degree. We call such a
combinatorial proof a witness. The basic underlying structure of the witness
produced by MAXDMST is a partition of the nodes in G into a center set W
and k other sets C1, C2, . . . , Ck called clusters. The partition has the property
that there are no tree edges between clusters. The use of this partition as a
witness is based on the following idea: if there is no feasible swap involving an
inter-cluster edge and an edge in T adjacent to a node in W, then any MST
must connect the clusters C1, C2, . . . , Ck using edges adjacent to W. If it is also
true that k is large, then the average degree of W in any MST must be high.
Our version of the proof of the following lemma appears in the full version of
the paper.
Lemma 3 ([6]) Let V be partitioned into sets of nodes W, C1, C2, . . . , Ck. If for
any edge e between two clusters Ci and Cj, there is no MST on G containing e,
then for any MST T of G, $d_{\max}(T) \ge \left\lceil \frac{|W| + k - 1}{|W|} \right\rceil$.
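As a small numeric illustration of the bound in Lemma 3: connecting W and the k clusters requires at least |W| + k − 1 edges, all of which touch W when no inter-cluster edge can be used, so some node of W has degree at least the average. Rounding up is safe because degrees are integers. The function name `lemma3_bound` is hypothetical.

```python
def lemma3_bound(center_size, num_clusters):
    """Lower bound ceil((|W| + k - 1) / |W|) on the maximum degree,
    computed with integer arithmetic to avoid floating-point rounding."""
    total = center_size + num_clusters - 1
    return -(-total // center_size)


# A star with one center and five leaves: W = {center}, each leaf is a cluster.
print(lemma3_bound(1, 5))
```

For the star, the bound equals the degree of the center, so the witness is tight in that case.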
Definitions and Notation
Before describing our algorithm formally, we introduce some definitions and
notation. Let T be the MST of G at the beginning of the current phase of the
algorithm. We use W to refer generally to the center set, and Wi to denote the
center set at some specific iteration i of this phase. For a node u ∈ W, the
clusters connected to u by tree edges are called the children clusters of u. For
any d, we denote by S≥d the set of vertices that have degree d or more in T .
Let τ ⊆ T be a Steiner tree on the nodes of W0 (the Steiner nodes are the
nodes in V − W0). We say that we freeze the edges of τ incident on W, meaning
that the edges of τ that are incident on W are not allowed to be removed by
any swap. Freezing these edges ensures that the edge swaps made at the end of
the algorithm result in a tree (and not a graph containing a cycle).
Given T and a partition of the nodes into W and a set C of clusters, we call a
non-tree inter-cluster edge e' feasible with respect to W if there exists a feasible
swap (e, e') such that e is incident on W and e is not frozen. A feasible swap
(e = (u, v), e' = (u', v')) is called good if (a) u ∈ W, (b) v ∉ W, (c) e is not frozen,
and (d) e' is an inter-cluster edge. The algorithm uses good swaps to decrease
the degrees of high-degree vertices in W.
The MAXDMST Algorithm
Given G and the current MST T on G, we begin each phase of the algorithm by
choosing a degree d such that $|S_{\ge d-1}| \le \frac{\log n}{\log\log n}\,|S_{\ge d}|$.
Note that this inequality must be true for some d lying between $d_{\max}(T)$ and
$d_{\max}(T) - \frac{\log n}{\log\log n}$. For the duration of this phase, we maintain a center set W
of vertices, and a set C of clusters. The initial center set W0 is $S_{\ge d-1}$, and the
initial clusters are the connected components created by deleting W0 from T. In
general, though, a cluster is not necessarily internally connected. Consider the
Steiner tree τ ⊆ T on W0, and freeze the edges of τ incident on W0. In each
iteration i, we find a good swap (e = (u, v), e' = (u', v')) where u ∈ Wi and
v ∉ Wi. We then take the following steps:
1. Remove u from Wi.
2. Form a new cluster Cu by merging u with the cluster containing u', the
cluster containing v', and all the children clusters of u.
3. Consider each feasible edge (u, w) such that w is a node that has been re-
moved from the witness at some prior stage, and merge Cu with the cluster
currently containing w.

Step 3 ensures that we never try to use a feasible edge between two nodes which
initially have degree d − 1 in T. This process is repeated until we either remove a
vertex u* with degree d or higher from W, or we run out of good swaps. Either
event marks the end of the current phase.
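The choice of the degree threshold d at the start of a phase can be sketched as a scan downward from the current maximum degree. This is only an illustration under stated assumptions: `degrees` is a hypothetical map from vertices to their degree in the current tree, natural logarithms are used (the text does not fix the base), and n ≥ 3 so that log log n is positive.

```python
from math import log


def count_at_least(degrees, d):
    """|S_{>=d}|: number of vertices with degree d or more in the tree."""
    return sum(1 for deg in degrees.values() if deg >= d)


def choose_phase_degree(degrees):
    """Largest d <= d_max with |S_{>=d-1}| <= (log n / log log n) * |S_{>=d}|."""
    n = len(degrees)
    ratio = log(n) / log(log(n))  # assumes n >= 3
    d = max(degrees.values())
    while d > 1:
        if count_at_least(degrees, d - 1) <= ratio * count_at_least(degrees, d):
            return d
        d -= 1
    # At d = 1 both sets contain every vertex of a tree, so the test holds.
    return d
```

Scanning from the top means the returned d is as close to the maximum degree as the inequality allows; the initial center set is then the set of vertices of degree at least d − 1.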
Lemmas 4 and 5 show that in the first case, we can find a sequence of edge
swaps to get a tree T' from T such that the degree of u* decreases by one and
no vertex has its degree raised to more than d − 1. In the case that we run out
of good swaps, Lemma 6 and Theorem 9 show that we can interpret Wi as a
witness to the fact that the maximum degree of T is close to optimal.
Lemma 4 Suppose we remove a vertex v from the center set Wi in some iter-
ation i. Then there is a sequence of feasible edge swaps which (a) decreases the
degree of v by one and (b) does not increase the degree of any other node in T
to d.
Note that since each of ((ui, vi), (u'i, v'i)) is a feasible swap when discovered,
performing any one of them does not introduce a cycle in the tree. Yet it is not
obvious that we can perform all of them together without disrupting the tree
structure. Lemma 5 shows that the graph produced after performing all these
edge swaps is still a tree. The frozen edges of the Steiner tree play a crucial
role in its proof. Proofs of Lemmas 4 and 5 are omitted from this extended
abstract.
Lemma 5 The graph produced after performing all the edge swaps in Lemma 4
is a tree.
The MAXDMST Witness
The actual witness $\mathcal{W}$ produced by the algorithm is slightly more complex than
the one described at the beginning of Section 4. A witness $\mathcal{W}$ actually consists
of a partition of the nodes into W and a set of clusters C1, C2, . . . , Ck, and a
set of edges R such that at least one endpoint of each edge in R is in W. The
witness $\mathcal{W}$ must have the property that any MST that contains all of the edges
of R cannot contain an inter-cluster edge. Now if the size of R is small and k is
large, then any MST must use a large number of edges incident to W to connect
the clusters. The proof of Lemma 6 is straightforward and is omitted from this
extended abstract.
Lemma 6 (Witness to high optimal degree) A high-degree witness
$\mathcal{W} = (W, R, \{C_1, C_2, \ldots, C_k\})$ certifies that any MST of G has maximum degree at
least $\left\lceil \frac{|W| + k - 2|R| - 1}{|W|} \right\rceil$.
Now we interpret the structure produced by running the MAXDMST algorithm
as a witness. Let $R^*$ be the subset of frozen edges that have at least one
endpoint in $W^*$, where $W^*$ is the center set when the algorithm terminates.
Let $C^*_1, C^*_2, \ldots, C^*_k$ be the clusters and $T^*$ the tree when the algorithm
terminates.
Lemma 7 $\mathcal{W}^* = (W^*, R^*, \{C^*_1, C^*_2, \ldots, C^*_k\})$ is a high-degree witness.

Lemma 8 At most 2|W0| edges are frozen by the algorithm, where W0 is our
initial witness.
Theorem 9 If the MAXDMST algorithm halts with an MST of maximum degree
d, then $\Delta_{\mathrm{opt}} \ge d - 2 - O(\frac{\log n}{\log\log n})$.
The proofs of Lemmas 7 and 8 and Theorem 9 are omitted from this extended
abstract.
5 Minimum Spanning Trees with Degree Lower Bounds
Max Min Degree MST: Given a subset L of the nodes of G and a minimum
degree bound d > 0, find an MST T of G such that $d^L_{\min}(T) \ge d$.
In this section, we consider the problem defined above. Note that this is an NP-
hard decision problem, and our algorithm solves a gap version of it. Much like the
algorithm of the previous section, our algorithm either finds a tree which respects
the degree bounds up to an additive error of $\frac{\log n}{\log\log n}$ or gives a combinatorial
proof showing that the decision problem is a no instance, i.e., there is no MST in
G satisfying the degree constraints. While the algorithm and the witness here are
somewhat analogous to those in the previous section, they are not symmetric,
and new ideas are required for this case.
We first introduce the kind of witness we use for proving upper bounds on
the minimum degree in L. A witness $\mathcal{W}_L$ consists of a set W of nodes in L, a
set of clusters C1, C2, . . . , Ck, D, that form a partition of $V \setminus W$, and a set R of
inter-cluster edges. In any MST T of G, for all i, 1 ≤ i ≤ k, the cluster Ci must
be internally connected. For any MST T of G containing the edges of R, it must
also be true that for all v ∈ D, there is some i, 1 ≤ i ≤ k, such that there is a
path from v to Ci in the forest created by deleting W from T.
Lemma 10 (Witness to low optimal degree) A witness
$\mathcal{W}_L = (W, R, \{C_1, \ldots, C_k\}, D)$ certifies that any MST of G has a node in L with
degree at most $\left\lfloor \frac{2|W| + |R| + k - 2}{|W|} \right\rfloor$.
The proof of Lemma 10 is omitted from this extended abstract. Since it
bounds the average degree of L, Corollary 11 follows.
Corollary 11 Given a witness $\mathcal{W}_L$ as in the previous lemma, it holds that for
any fractional MST $T^f$ with edge incidence vector x, the minimum fractional
degree in L, i.e. $\min_{v \in L} \sum_{e \in \delta(v)} x_e$, is at most $\frac{2|W| + |R| + k - 2}{|W|}$.
Recall that $d^L_{\min}(T) = \min_{v \in L} \deg_T(v)$ is the minimum degree over all nodes
in L in tree T, and let $\Delta^L_{\mathrm{opt}}$ be the maximum possible value of $d^L_{\min}(T)$
attainable by any MST; the algorithm we describe in this section finds an MST in
which $d^L_{\min}(T)$ is at least $\Delta^L_{\mathrm{opt}} - O(\frac{\log n}{\log\log n})$.
We now introduce the following notation: For a tree T, we denote by $S^L_{\le d}$
the subset of nodes in L which have degree d or less in T.

Definition 1 We call an edge e dotted with respect to an MST T and a set
W, if e is a non-tree edge incident to a node in W and it is feasible with respect
to some edge e' in T such that e' is not incident to a node in W.
If the set W has low degree, we are interested in dotted edges with respect
to W, since swapping in these edges, and removing the edges they are feasible to,
increases the degree of W.
The MINDMST Algorithm
Like the algorithm in Section 4, at the beginning of each phase, we pick a degree
d such that in the current tree, $|S_{\le d+1}| \le \frac{\log n}{\log\log n}\,|S_{\le d}|$.
Note that this inequality must be true for some d lying between $d^L_{\min}(T)$ and
$d^L_{\min}(T) + \frac{\log n}{\log\log n}$. The aim is to improve the degree of some vertex of degree
d or less without introducing more vertices of the same or lower degree. We
begin with the center set $W_0 = S^L_{\le d+1}$. Our initial clusters are the components
obtained when we remove W0 from T. Note that every node in L outside W0
belongs to a cluster and has degree at least d + 2.
We maintain a set of special nodes N which is also initialized to $S^L_{\le d+1}$.
Consider the Steiner tree induced on N by the given tree T. We freeze the Steiner
tree edges adjacent to N (that is, we disallow the algorithm from swapping out
these edges) and put them in F. (Note that freezing edges reduces the set of
dotted edges that can be used.)
In each iteration i, we look for a dotted edge e = (u, v) adjacent to a node u
in the witness Wi, and an intra-cluster tree edge e' = (u', v') with which it can
be swapped. If we find such a pair, we take the following steps:
1. Remove u from Wi.
2. Form a new cluster Cu by merging u along with all the children clusters of
u, except for those which would involve a merge along an edge between u
and a (d + 1)-degree node.
3. Split the cluster containing (u', v') along e' into two clusters $C_{u'}$ and $C_{v'}$.
4. Let u' be the endpoint of e' that has a tree path to u. Let w be the first node
in N on the u'-u path. Add u' to N. Freeze the edges on this path that are
incident on u' and on w, and add the new frozen edges to F. The u'-v path
in T is called the tail of this swap.
Step 1 allows us to delete an edge incident on u at a later step. Steps 2 and
3 are needed to ensure that we never swap out more than one edge incident on
any node. Finally, step 4 ensures that the sequence of swaps found by the end of
the phase can be executed concurrently without resulting in non-tree structures.
We repeat the above process until we either remove a node of degree d or
less from W, or there are no useful intra-cluster edges left. Either event ends
the current phase. In the former case, Lemma 12 shows that we can find an
improvement. Once again the frozen edges play a crucial role in the proof, which
appears in the full version of the paper.
In the latter case, Theorem 13 shows that we can turn the resulting structure
into a witness $\mathcal{W}^*_L$ certifying that the tree has close to the optimal degree.

Lemma 12 If we remove a vertex u from W, we can find a sequence of edge
swaps which increases the degree of u by 1. Moreover, performing all these swaps
results in a tree without introducing any new nodes of degree d or less.
Theorem 13 If the MINDMST algorithm halts with a tree with minimum degree
d for a node in L, then $\Delta^L_{\mathrm{opt}} \le d + 4 + O(\frac{\log n}{\log\log n})$.
Proof. We begin by showing how to convert the structure produced by the
MINDMST algorithm into a witness of the type in Lemma 10. Suppose the algorithm
begins a phase with an MST T, chooses a degree d, sets $W_0 = S^L_{\le d+1}$, but
then cannot improve the degree of any vertex in L of degree d or lower. It terminates
after t > 0 iterations with $W_t \supseteq S^L_{\le d}$ as the center set and $C_1, C_2, \ldots, C_{k'}$
as the clusters.
Let $W^* = W_t$, and let I be the set of inter-cluster edges with respect to the
clusters $C_1, \ldots, C_{k'}$. Recall that F is the set of edges frozen by the algorithm.
Let $R^* = F \cup I$. We now describe how to modify the above clusters to get our
witness. Let $\mathcal{C}^*$ be a collection of clusters, and let $D^*$ be a collection of nodes. We
begin with $\mathcal{C}^*$ and $D^*$ empty. For $1 \le i \le k'$, let $C = C_i$ and do the following:
If cluster C is internally connected in every MST, then add C to $\mathcal{C}^*$. If not,
let T' be an MST of G in which C is not internally connected, and let (u, v) be
an edge of $T|_C \setminus T'|_C$.
Consider the clusters $C_u$ and $C_v$ that would result from splitting C about the
edge (u, v). If both $C_u$ and $C_v$ have edges in T to $W_t$, then split C and recurse
on each cluster $C_u$ and $C_v$. If only one of the two, say $C_u$, has a tree edge to $W_t$,
then split C, recurse on $C_u$, and add the vertices in $C_v$ to $D^*$.
The process terminates when every node outside $W^*$ either belongs to $D^*$
or to a cluster C that has been put in $\mathcal{C}^*$. Let us label the clusters in $\mathcal{C}^*$ by
$C^*_1, C^*_2, \ldots, C^*_k$, $k \ge k'$. Note that by construction, every vertex in $D^*$ has a path
in $T - W^*$ to a cluster in $\mathcal{C}^*$.
Lemma 14 $\mathcal{W}^*_L = (W^*, R^*, \mathcal{C}^* = \{C^*_1, \ldots, C^*_k\}, D^*)$ is a low-degree witness.
The proof of Lemma 14 is omitted due to space constraints. To finish the
proof of Theorem 13, we compute the bound that we get by using Lemma 10.
First we count the number of clusters created by the algorithm. The number of
clusters is the number of components formed by $W_t$ from T, of which there are
at most $(d + 1)|W_t|$, plus the number of splits. There are at most $2|W_0 - W_t|$
additional clusters created by the splits during the algorithm. To see this, note
that there are a total of $|W_0 - W_t|$ splits caused by finding a usable dotted edge,
and each split creates one new cluster by splitting along the intra-cluster edge
(u', v') involved in the swap. There are also some clusters split by edges adjacent
to two (d + 1)-degree nodes, but there can be at most $|W_0 - W_t|$ of these.
The number of clusters after the algorithm stops is thus at most
$(d + 1)|W_t| + 2|W_0 - W_t|$. In the process of breaking apart clusters with more than
one edge to $W_t$, at most $|W_t|$ more splits may happen. Thus the total number k of
clusters in $\mathcal{C}^*$ is at most $(d + 2)|W_t| + 2|W_0 - W_t|$.
There are at most $2|W_0| + 2|W_0 - W_t|$ edges in F, and at most $2|W_0 - W_t|$
inter-cluster edges with respect to the clusters $C_1, \ldots, C_{k'}$ when the algorithm
terminates. Therefore $|R^*| \le 2|W_0| + 4|W_0 - W_t|$. According to Lemma 10, the
witness $\mathcal{W}^*_L$ gives an upper bound of
$$\frac{2|W^*| + |R^*| + k - 2}{|W^*|} \le \frac{(d + 4)|W_t| + 6|W_0 - W_t| + 2|W_0| - 2}{|W_t|}$$
on the minimum degree of at least one node in L. Using the fact that
$|S_{\le d+1}| \le \frac{\log n}{\log\log n}\,|S_{\le d}|$, we conclude that
$\Delta^L_{\mathrm{opt}} \le d + 4 + O(\frac{\log n}{\log\log n})$.
6 Combining Upper and Lower Bounds
Now we describe our MSTDB algorithm more formally. Recall from Section 3
that our MSTDB algorithm works in phases. Each phase employs algorithms
MAXDMST and MINDMST to improve either a high degree vertex or a low
degree vertex in L; when both improvements fail, their failure is justified by two
combinatorial witnesses.
Let us now look into a phase of the algorithm in more detail. Each phase of
MSTDB begins by picking a d such that
$$|S_{\le B_L - d + 1}| \le \frac{\log n}{\log\log n}\,|S_{\le B_L - d}| \quad\text{and}\quad |S_{\ge B_H + d - 1}| \le \frac{\log n}{\log\log n}\,|S_{\ge B_H + d}|.$$
It is easy to show that one can always find such a d between 0 and $\frac{2\log n}{\log\log n}$.
For the rest of the phase, vertices with degree $B_H + d$ or more are considered
“high degree” vertices and those in L with degree $B_L - d$ or less are considered “low
degree”. We employ MAXDMST and MINDMST to reduce the degree of a high
degree vertex and increase the degree of a low degree vertex respectively.
Before calling MAXDMST and MINDMST we need to ensure that the im-
provements they perform do not interfere. For this purpose, we look at all edges
e such that one end point of e lies in S≤BL−d+1 and the other in S≥BH +d−1.
Note that e might be a tree edge or a non-tree edge. We freeze all such edges,
which means that we neither add any of the non-tree edges nor delete any of
the tree edges while executing algorithms MAXDMST or MINDMST. Lemma
15 guarantees that this is enough to make the algorithm terminate. The proof
of Lemma 15 is omitted from the extended abstract. Theorem 16 shows that
when both MAXDMST and MINDMST fail with two combinatorial witnesses,
at least one of the witnesses is good. The proof will appear in the full version of
the paper.
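The freezing step that keeps the two subroutines from interfering can be sketched as a simple filter over all edges of the graph. This is an illustration only: the function name, the argument names, and the reading of the low set as restricted to L are assumptions, and edges are taken to be pairs `(u, v)` covering both tree and non-tree edges.

```python
def cross_frozen_edges(edges, degree, lower_set, b_low, b_high, d):
    """Edges with one endpoint of degree <= B_L - d + 1 (within L) and the
    other of degree >= B_H + d - 1; these are neither added nor deleted."""
    low = {v for v in lower_set if degree[v] <= b_low - d + 1}
    high = {v for v, deg in degree.items() if deg >= b_high + d - 1}
    frozen = set()
    for u, v in edges:
        if (u in low and v in high) or (u in high and v in low):
            frozen.add((u, v))
    return frozen
```

MAXDMST and MINDMST would then be run with these edges excluded from every swap, in addition to the edges each subroutine freezes on its own.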
Lemma 15 Algorithm MSTDB terminates.
Theorem 16 If MSTDB fails, then in every MST T of the graph, either:
(1) The maximum degree of T is more than BH, or
(2) Some node in L has degree less than BL.
References
1. T. M. Chan. Euclidean bounded-degree spanning tree ratios. In Proceedings of
the nineteenth annual symposium on Computational geometry, pages 11–19. ACM
Press, 2003.

2. J. Edmonds. Maximum matching and a polyhedron with 0–1 vertices. Journal of
Research National Bureau of Standards, 69B:125–130, 1965.
3. J. Edmonds. Paths, trees, and flowers. Canadian Journal of Mathematics, 17:449–
467, 1965.
4. S. Even and R. E. Tarjan. Network flow and testing graph connectivity. SIAM
Journal on Computing, 4(4):507–518, Dec. 1975.
5. T. Fischer. Optimizing the degree of minimum weight spanning trees. Technical
Report 14853, Dept of Computer Science, Cornell University, Ithaca, NY, 1993.
6. M. Fürer and B. Raghavachari. Approximating the minimum-degree Steiner tree
to within one of optimal. Journal of Algorithms, 17(3):409–423, Nov. 1994.
7. J. Hopcroft and R. Karp. An $n^{5/2}$ algorithm for maximum matching in bipartite
graphs. SIAM Journal on Computing, 2:225–231, 1973.
8. R. Jothi and B. Raghavachari. Degree-bounded minimum spanning trees. In Proc.
16th Canadian Conf. on Computational Geometry (CCCG), 2004.
9. S. Khuller, B. Raghavachari, and N. Young. Low-degree spanning trees of small
weight. SIAM J. Comput., 25(2):355–368, 1996.
10. J. Könemann and R. Ravi. A matter of degree: improved approximation algorithms
for degree-bounded minimum spanning trees. In Proceedings of ACM STOC, 2000.
11. J. Könemann and R. Ravi. Primal-dual meets local search: approximating MST’s
with nonuniform degree bounds. In Proceedings of ACM STOC, 2003.
12. R. Krishnan and B. Raghavachari. The directed minimum degree spanning tree
problem. In FSTTCS, pages 232–243, 2001.
13. C. H. Papadimitriou and U. Vazirani. On two geometric problems related to the
traveling salesman problem. J. Algorithms, 5:231–246, 1984.
14. R. Ravi, M. V. Marathe, S. S. Ravi, D. J. Rosenkrantz, and H. B. Hunt, III. Many
birds with one stone: multi-objective approximation algorithms. In Proceedings of
ACM STOC, 1993.
15. R. Ravi, M. V. Marathe, S. S. Ravi, D. J. Rosenkrantz, and H. B. Hunt, III.
Approximation algorithms for degree-constrained minimum-cost network-design
problems. Algorithmica, 31, 2001.

A Rounding Algorithm for Approximating
Minimum Manhattan Networks
(Extended Abstract)
Victor Chepoi, Karim Nouioua, and Yann Vaxès
Laboratoire d’Informatique Fondamentale de Marseille
Faculté des Sciences de Luminy, Université de la Méditerranée
F-13288 Marseille Cedex 9, France
{chepoi,nouioua,vaxes}@lif.univ-mrs.fr
Abstract. For a set T of n points (terminals) in the plane, a Man-
hattan network on T is a network N(T) = (V, E) with the property
that its edges are horizontal or vertical segments connecting points in
V ⊇ T and for every pair of terminals, the network N(T) contains a
shortest l1-path between them. A minimum Manhattan network on T is
a Manhattan network of minimum possible length. The problem of find-
ing minimum Manhattan networks has been introduced by Gudmunds-
son, Levcopoulos, and Narasimhan (APPROX’99) and it is not known
whether this problem is in P or not. Several approximation algorithms
(with factors 8, 4, and 3) have been proposed; recently Kato, Imai, and
Asano (ISAAC’02) have given a factor 2 approximation algorithm; however,
their correctness proof is incomplete. In this note, we propose a
rounding 2-approximation algorithm based on an LP-formulation of the
minimum Manhattan network problem.
1 Introduction
A rectilinear path P between two points p, q of the plane $\mathbb{R}^2$ is a path connecting p
and q and consisting of only horizontal and vertical line segments. More generally,
a rectilinear network N = (V, E) consists of a finite set V of points of $\mathbb{R}^2$ (the
vertices of N) and of a finite set of horizontal and vertical segments connecting
pairs of points of V (the edges of N). The length l(P) (or l(N)) of a rectilinear
path P (or of a rectilinear network N) is the sum of lengths of its edges. The
l1-distance between two points $p = (p^x, p^y)$ and $q = (q^x, q^y)$ in the plane $\mathbb{R}^2$ is
$d(p, q) := \|p - q\|_1 = |p^x - q^x| + |p^y - q^y|$. An l1-path between two points $p, q \in \mathbb{R}^2$
is a rectilinear path connecting p, q and having length d(p, q).
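These definitions translate directly into a small check. The following sketch assumes a path is given as a list of `(x, y)` corner points with consecutive points joined by straight segments; the function names are hypothetical.

```python
def l1_distance(p, q):
    """d(p, q) = |p^x - q^x| + |p^y - q^y|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def is_rectilinear(path):
    """Every segment between consecutive points is horizontal or vertical."""
    return all(a[0] == b[0] or a[1] == b[1] for a, b in zip(path, path[1:]))


def path_length(path):
    """Sum of segment lengths (valid for rectilinear paths)."""
    return sum(l1_distance(a, b) for a, b in zip(path, path[1:]))


def is_l1_path(path):
    """A rectilinear path whose length equals the l1-distance of its endpoints."""
    return is_rectilinear(path) and path_length(path) == l1_distance(path[0], path[-1])
```

For instance, the path (0, 0), (2, 0), (2, 3) has length 5 = d((0, 0), (2, 3)) and is an l1-path, whereas any rectilinear path that first moves away from the target is strictly longer and is not.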
Given a set T = {t1, . . . , tn} of n points (terminals) in the plane, a Manhattan
network [4] on T is a rectilinear network N(T ) = (V, E) such that T ⊆ V and for
every pair of points in T, the network N(T ) contains an l1-path between them. A
minimum Manhattan network on T is a Manhattan network of minimum possible
length and the Minimum Manhattan Network problem (MMN problem) is to find
such a network.
Fig. 1. A minimum Manhattan network
The minimum Manhattan network problem has been introduced by Gudmundsson,
Levcopoulos, and Narasimhan [4]. It is not known whether this problem
is in P or not. Gudmundsson et al. [4] proposed factor 4 and factor 8
approximation algorithms with different time complexities. They also conjectured
that there exists a 2-approximation algorithm for this problem. Kato, Imai, and
Asano [5] presented a factor 2 approximation algorithm, however, their cor-
rectness proof is incomplete (this fact was independently noticed by Benkert,
Shirabe, and Wolf [1]). The algorithm in [5] proceeds in two steps. In the first
step, a subnetwork of length at most the optimum is constructed. It connects by
l1-paths only certain pairs of terminals. In the second step, this partial network
is augmented so that the remaining unconnected pairs are satisfied. It is claimed
on page 355 of [5] that the length of this augmentation is again bounded from
above by the length of a minimum Manhattan network. However, the details
of how to perform this augmentation and the proof of its correctness are not
provided. Following [5], Benkert et al. [1] outlined a factor 3 approximation al-
gorithm and presented a mixed-integer programming formulation of the MMN
problem. Notice that all four mentioned algorithms are geometric and some of
them employ results from computational geometry. Nouioua [7] presented an-
other factor 3 approximation algorithm based on the primal-dual method from
linear programming. In this paper we present a rounding method applied to the
optimal solution of the flow based linear program described in [1, 7] which leads
to a factor 2 approximation algorithm for the minimum Manhattan network
problem. For this, we define two subsets of pairs of terminals, called strips and
staircases, and for each of them, we describe a specific rounding procedure. Each
rounded up edge is paid by a group of parallel edges which together support at
least one-half unit of fractional flow. Finally, we prove that a rectilinear network
containing l1-paths between all the pairs belonging to strips and staircases is a
Manhattan network, and thus we end up with an integer feasible solution whose
cost is at most twice the fractional optimum.
Gudmundsson et al. [4] introduced the minimum Manhattan networks in
connection with the construction of sparse geometric spanners. Given a set T of
n points in the plane endowed with a norm $\|\cdot\|$, and a real number t ≥ 1, a
geometric network N is a t-spanner for T if for each pair of points p, q ∈ T, there
exists a pq-path in N of length at most t times the distance $\|p - q\|$ between p and q.

In the Euclidean plane (and more generally, for lp-planes with p ≥ 2), the linear
segment is the unique shortest path between two endpoints, and therefore the
unique 1-spanner of T is the trivial complete graph on T. On the other hand, if
the unit ball of the norm is a polygon (in particular, for l1 and l∞), the points
are connected by several shortest paths, therefore the problem of finding the
sparsest 1-spanner becomes nontrivial. In this connection, minimum Manhattan
networks are precisely the optimal 1-spanners for the l1 (or l∞) plane. Sparse
geometric spanners have applications in VLSI circuit design, network design,
distributed algorithms and other areas; see for example the survey of [3]. Finally,
Lam, Alexandersson, and Pachter [6] suggested applying minimum Manhattan
networks to design efficient search spaces for pair hidden Markov model (PHMM)
alignment algorithms.
2 Properties and LP-Formulation
In this section, we present several properties of minimum Manhattan networks.
First, we define some notation. Denote by [p, q] the linear segment having p and
q as end-points. The set of all points of $\mathbb{R}^2$ lying on l1-paths between p and q
constitutes the smallest axis-parallel rectangle R(p, q) containing the points p, q.
For two terminals ti, tj ∈ T, set Rij := R(ti, tj). (This rectangle is degenerated if
ti and tj have the same x- or y-coordinate.) We say that Rij is an empty rectangle
if Rij ∩ T = {ti, tj}. The complete grid is obtained by drawing in the smallest
axis-parallel rectangle containing the set T a horizontal segment and a vertical
segment through every terminal. Using standard methods for establishing Hanan
grid-type results [10], it can be shown that the complete grid contains at least
one minimum Manhattan network [4].
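The complete grid is easy to build explicitly. A minimal sketch, assuming distinct terminals given as `(x, y)` tuples and representing each grid edge as a pair of adjacent grid points; the function name `complete_grid` is hypothetical.

```python
def complete_grid(terminals):
    """Vertices and edges of the grid obtained by drawing a horizontal and a
    vertical line through every terminal inside their bounding rectangle."""
    xs = sorted({t[0] for t in terminals})
    ys = sorted({t[1] for t in terminals})
    vertices = {(x, y) for x in xs for y in ys}
    edges = set()
    # Horizontal edges join horizontally consecutive grid points.
    for y in ys:
        for x0, x1 in zip(xs, xs[1:]):
            edges.add(((x0, y), (x1, y)))
    # Vertical edges join vertically consecutive grid points.
    for x in xs:
        for y0, y1 in zip(ys, ys[1:]):
            edges.add(((x, y0), (x, y1)))
    return vertices, edges
```

With n terminals in general position the grid has n² vertices and 2n(n − 1) edges, so restricting attention to it keeps the search space polynomial.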
A point $p \in \mathbb{R}^2$ is said to be an efficient point of T [2, 9] if there does not
exist any other point $q \in \mathbb{R}^2$ such that d(q, ti) ≤ d(p, ti) for all ti ∈ T and
d(q, tj) < d(p, tj) for at least one tj ∈ T. Denote the set of all efficient points
by P, called the Pareto envelope of T. An optimal O(n log n) time algorithm
to compute the Pareto envelope of n points in the l1-plane is presented in [2]
(for properties of P and an $O(n^2)$ time algorithm see also [9]). In particular,
it is known that P is ortho-convex, i.e. the intersection of P with any vertical
or horizontal line is convex, and that every two points of P can be joined in
P by an l1-path. P, being ortho-convex, is a union of ortho-convex (possibly
degenerated) rectilinear polygons (called blocks) glued together along vertices
(they become cut points of P); Fig. 2 presents two generic forms of the Pareto
envelope of four points.
Lemma 1. The Pareto envelope P contains at least one minimum Manhattan
network on T .
By this result, whose proof is omitted in this extended abstract, in order
to solve the MMN problem on T it suffices to complete the set of terminals by
adding to T the cut points of P and to solve an MMN problem on each block of P
with respect to the new and old terminals located inside or on its boundary. Due
to this decomposition of the MMN problem into smaller subproblems, we can
further assume without loss of generality that P consists of a single block with
at least 3 terminals; denote by ∂P the boundary of this ortho-convex rectilinear
polygon. Then every convex vertex of P is a terminal. Since the sub-path of ∂P
between two consecutive convex vertices of ∂P is the unique l1-path connecting
these vertices inside P and ∂P is covered by such l1-paths, from Lemma 1
we conclude that the edges of ∂P belong to any minimum Manhattan network
inside P.
Fig. 2. Pareto envelope of four points (panels (a) and (b))
From Lemma 1 and the result of [4] mentioned above we conclude that the
part Γ = (V, E) of the complete grid contained in P hosts at least one minimum
Manhattan network. Two edges of Γ are called twins if they are opposite edges
of a rectangular face of the grid Γ. Two edges e, f of Γ are called parallel if there
exists a sequence e = e1, e2, . . . , em+1 = f of edges such that for i = 1, . . . , m
the edges ei, ei+1 are twins. By definition, any edge e is parallel to itself and
all edges parallel to e have the same length. Notice also that exactly two edges
parallel to a given edge e belong to ∂P.
We continue with the notion of generating set introduced in [5] and used in
approximation algorithms from [1, 7]. A generating set is a subset F of pairs
of terminals (or, more compactly, of their indices) with the property that a
rectilinear network containing l1-paths for all pairs in F is a Manhattan network
on T. For example, F∅ consisting of all pairs ij with Rij empty is a generating
set. In the next section, we will describe a sparse generating set contained in F∅.
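The generating set F∅ of all pairs with an empty rectangle can be enumerated by brute force. This is a cubic-time sketch for illustration, assuming distinct terminals given as `(x, y)` tuples and pairs reported as index tuples `(i, j)` with i < j; the function names are hypothetical.

```python
from itertools import combinations


def in_rectangle(p, a, b):
    """True if p lies in the closed axis-parallel rectangle R(a, b)."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def is_empty_rectangle(terminals, i, j):
    """R_ij is empty if it contains no terminal other than t_i and t_j."""
    a, b = terminals[i], terminals[j]
    return not any(in_rectangle(t, a, b)
                   for k, t in enumerate(terminals) if k != i and k != j)


def empty_pairs(terminals):
    """All index pairs ij with R_ij empty."""
    return [(i, j) for i, j in combinations(range(len(terminals)), 2)
            if is_empty_rectangle(terminals, i, j)]
```

For three terminals on a diagonal, such as (0, 0), (1, 1), (2, 2), the pair of outer terminals is excluded because the middle terminal lies in their rectangle; an l1-path between the outer two is obtained by concatenating the paths for the two remaining pairs.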
To give an LP-formulation of the minimum Manhattan network problem, let
F be an arbitrary generating set; for each pair ij ∈ F, let $\Gamma_{ij} := \Gamma \cap R(t_i, t_j)$ and
set $\Gamma_{ij} = (V_{ij}, E_{ij})$. We formulate the MMN problem as a cut covering problem
using an exponential number of constraints, which we further convert into an
equivalent formulation that employs only a polynomial number of variables and
constraints. In both formulations, $l_e$ will denote the length of an edge e of the
network Γ = (V, E) and $x_e$ will be a 0-1 decision variable associated with e. A
subset of edges C of $E_{ij}$ is called a $(t_i, t_j)$-cut if every l1-path between $t_i$ and $t_j$ in
$\Gamma_{ij}$ meets C. Let $\mathcal{C}_{ij}$ denote the collection of all $(t_i, t_j)$-cuts and set $\mathcal{C} := \bigcup_{ij \in F} \mathcal{C}_{ij}$.
Then the minimum Manhattan networks can be viewed as the optimal solutions
of the following integer linear program (the dual of the relaxation of this program
is a packing problem of the cuts from $\mathcal{C}$):
\begin{align*}
\text{minimize} \quad & \sum_{e \in E} l_e x_e \tag{1} \\
\text{subject to} \quad & \sum_{e \in C} x_e \ge 1, && C \in \mathcal{C} \\
& x_e \in \{0, 1\}, && e \in E
\end{align*}
Fig. 3. Integrality gap (integer optimum = 28, fractional optimum = 27.5)
Indeed, every Manhattan network is a feasible solution of (1). Conversely, let
$x_e$, e ∈ E, be a feasible solution for (1). Considering the $x_e$’s as capacities of the
edges e of Γ, and applying the covering constraints and the Ford-Fulkerson
theorem to each network $\Gamma_{ij}$, ij ∈ F, oriented as described below, we conclude
the existence in $\Gamma_{ij}$ of an integer $(t_i, t_j)$-flow of value 1, i.e., of an l1-path between
$t_i$ and $t_j$. As a consequence, we obtain a Manhattan network of the same cost.
This observation leads to the second integer programming formulation for the
MMN problem (but this time, having a polynomial size). For each pair ij ∈ F
and each edge $e \in E_{ij}$ introduce a (flow) variable $f^{ij}_e$. Orient the edges of $\Gamma_{ij}$
so that the oriented paths connecting $t_i$ and $t_j$ are exactly the l1-paths between
those terminals. For a vertex $v \in V_{ij} \setminus \{t_i, t_j\}$ denote by $\Gamma^+_{ij}(v)$ the oriented edges
of $\Gamma_{ij}$ entering v and by $\Gamma^-_{ij}(v)$ the oriented edges of $\Gamma_{ij}$ out of v. We are led
to the following integer program:
minimize

e∈E
lexe (2)
subject to

e∈Γ+
ij (v)
fij
e =

e∈Γ−
ij (v)
fij
e , ij ∈ F, v ∈ Vij {ti, tj}

e∈Γ−
ij (ti)
fij
e = 1, ij ∈ F
0 ≤ fij
e ≤ xe, ij ∈ F, ∀e ∈ Eij
xe ∈ {0, 1}, e ∈ E

Denote by (1′) and (2′) the LP-relaxations of (1) and (2) obtained by replacing
the boolean constraints xe ∈ {0, 1} by the linear constraints xe ≥ 0. Since (2′)
contains a polynomial number of variables and inequalities, it can be solved
in strongly polynomial time using the algorithm of Tardos [8]. The x-part of
any optimal solution of (2′) is an optimal solution of (1′). Notice also that there
exist instances of the MMN problem for which the cost of an optimal (fractional)
solution of (1′) or (2′) is smaller than the cost of an optimal (integer) solution
of (1) or (2). Fig. 3 shows such an example (xe = 1 for bolded edges and xe = 1/2
for dashed edges). Finally, observe that in any feasible solution of (1′) and (2′),
xe = 1 holds for every edge e ∈ ∂P.
3 Strips and Staircases
A degenerated empty rectangle Rij is called a degenerated vertical or horizontal
strip. A non-degenerated empty rectangle Rij is called a vertical strip if the x-
coordinates of ti and tj take consecutive values in the sorted list of x-coordinates
of the terminals and the intersection of Rij with degenerated vertical strips is
either empty or one of the points ti or tj. Analogously, a non-degenerated empty
rectangle Rij is called a horizontal strip if the y-coordinates of ti and tj take
consecutive values in the sorted list of y-coordinates of the terminals and the
intersection of horizontal sides of Rij with degenerated horizontal strips is either
empty or one of the points ti or tj; see Fig. 4. The sides of a vertical (resp.,
horizontal) strip Rij are the vertical (resp., horizontal) sides of Rij. We say
that the strips Rii′ and Rjj′ (degenerated or not) form a crossing configuration
if they intersect and the Pareto envelope of the points ti, ti′, tj, tj′ is of type
(a); see Fig. 2. The importance of such configurations resides in the following
property whose proof is straightforward:
Lemma 2. If the strips Rii′ and Rjj′ form a crossing configuration as in Fig.
5, then from the l1-paths between ti and ti′ and between tj and tj′ one can derive
an l1-path connecting ti and tj′ and an l1-path connecting ti′ and tj.
[Fig. 4. Horizontal and vertical strips, panels (a) and (b)]
For a crossing configuration Rii′, Rjj′, denote by o and o′ the cut points of
the rectangular block of the Pareto envelope of ti, ti′, tj, tj′, and assume that the
four tips of this envelope connect o with ti, tj and o′ with ti′, tj′. Additionally,
suppose without loss of generality that ti and tj belong to the first quadrant Q1
with respect to the origin o (the remaining quadrants are labelled Q2, Q3, and
Q4). Then ti′ and tj′ belong to the third quadrant with respect to the origin
o′. Denote by Tij the set of all terminals tk ∈ (T \ {ti, tj}) ∩ Q1 such that (i)
R(tk, o) ∩ T = {tk} and (ii) the region {q ∈ Q2 : q^y ≤ t_k^y} ∪ {q ∈ Q4 : q^x ≤ t_k^x}
does not contain any terminal of T. If Tij is nonempty, then all its terminals
belong to the rectangle Rij; more precisely, they are all located on a common
shortest rectilinear path between ti and tj. Denote by Sij|i′j′ the non-degenerated
block of the Pareto envelope of the set Tij ∪ {o, ti, tj} and call this rectilinear
polygon a staircase; see Fig. 5 for an illustration. The point o is called the
origin of this staircase. Analogously one can define the set Ti′j′ and the staircase
Si′j′|ij with origin o′. Two other types of staircases will be defined if ti, tj′ belong
to the second quadrant and ti′, tj belong to the fourth quadrant. In order to
simplify the presentation, we will further assume that after a suitable geometric
transformation every staircase is located in the first quadrant. (Notice that our
staircases are different from the staircase polygons occurring in the algorithms
from [4].)
[Fig. 5. Staircase Sij|i′j′]
Let α be the leftmost highest point of the staircase Sij|i′j′ and let β be the
rightmost lowest point of this staircase. Denote by Mij the monotone boundary
path of Sij|i′j′ between α and β and passing via the terminals of Tij. By
definition, Sij|i′j′ ∩ T = Tij. By the choice of Tij, there are no terminals of T located
in the regions Q+2 := {q ∈ Q2 : q^y ≤ α^y} and Q+4 := {q ∈ Q4 : q^x ≤ β^x}. In
particular, no strip traverses a staircase. It follows immediately from the definition
of a staircase that two staircases either are disjoint or their intersection is a
subset of terminals; in particular, every edge of the grid Γ belongs to at most
one staircase.
[Fig. 6. To the proof of Lemma 3]
Let F′ be the set of all pairs ij such that Rij is a strip. Let F″ be the set of
all pairs i′k such that there exists a staircase Sij|i′j′ such that tk belongs to the
set Tij.
Lemma 3. F := F′ ∪ F″ ⊆ F∅ is a generating set.
Proof. Let N be a rectilinear network containing l1-paths for all pairs in F.
To prove that N is a Manhattan network on T, it suffices to establish that for
an arbitrary pair kk′ ∈ F∅ \ F, the terminals tk and tk′ can be joined in N
by an l1-path. Assume without loss of generality that t_k^x ≤ t_k′^x and t_k^y ≤ t_k′^y.
The vertical and horizontal lines through the points tk and tk′ partition the
plane into the rectangle Rkk′, four open quadrants and four closed unbounded
half-bands labelled counterclockwise B1, B2, B3, and B4. Since tk ∈ B1 ∩ T and
tk′ ∈ B3 ∩ T, there should exist at least one vertical strip between a terminal
from B1 and a terminal from B3. Denote by Ri1i′1 the leftmost strip and by
Ri2i′2 the rightmost strip traversing the rectangle Rkk′. These two strips may
coincide (and one or both of them may be degenerated), however they are both
different from Rkk′ because kk′ ∉ F. Suppose without loss of generality that
ti1, ti2 ∈ B1 and ti′1, ti′2 ∈ B3. Analogously define the lowest horizontal strip
Rj1j′1 and the highest horizontal strip Rj2j′2 traversing Rkk′. Again these strips
may coincide and/or may be degenerated but they must be different from Rkk′.
Let tj1, tj2 ∈ B4 and tj′1, tj′2 ∈ B2; see Fig. 6. From the choice of the strips in
question, we conclude that each of the four combinations of horizontal and vertical
strips constitutes a crossing configuration. Moreover, Si2j2|i′2j′2 and Si′1j′1|i1j1 must
be staircases. The vertex tk either belongs to Ti2j2 or coincides with one of the
vertices ti2, tj2. Analogously, tk′ ∈ Ti′1j′1 ∪ {ti′1, tj′1}. By Lemma 2 there is an
l1-path connecting each of the terminals ti1, ti2, tj1, tj2 to each of the terminals
ti′1, ti′2, tj′1, tj′2. Also there exist l1-paths between tk and ti′2, tj′2 and between tk′
and ti1, tj1. Combining certain pieces of these l1-paths we will produce an l1-path
connecting tk and tk′.
4 The Rounding Algorithm
Let (x, f) = ((xe)e∈E, (f^{ij}_e)e∈E, ij∈F) be an optimal solution of the linear program
(2′) (in general, this solution is not half-integral). The algorithm rounds up the
solution (x, f) in three phases. In Phase 0, we insert all edges of ∂P in the
integer solution. In Phase 1, the rounding is performed inside every strip Rii′,
in order to ensure the existence of an l1-path Pii′ between the terminals ti and
ti′. In Phase 2, an iterative rounding procedure is applied to each staircase.
Let Rii′ be a strip. If Rii′ is degenerated, then [ti, ti′] is the unique l1-path
between ti and ti′, yielding xe = f^{ii′}_e = 1 for any edge e ∈ [ti, ti′]. If Rii′ is not
degenerated, then any l1-path in Γ between ti and ti′ has a simple form: it goes
along the side of Rii′ containing ti, then it makes a turn by following an edge
of Γ traversing Rii′ (called further a switch of Rii′), and continues its way on
the side containing ti′ until it reaches ti′. Although it may happen that several
such l1-paths have been used by the fractional flow f^{ii′} between ti and ti′, the
cut condition ensures that xe + xe′ ≥ f^{ii′}_e + f^{ii′}_{e′} ≥ 1 for any pair e, e′ of twins on
opposite sides of the strip Rii′, yielding max{xe, xe′} ≥ 1/2.
Let p be the vertex furthest from ti on the side of Rii′ containing ti such that
xe ≥ 1/2 for every edge e of the segment [ti, p]. Let pp′ be the edge of Γ incident
to p that traverses the strip Rii′. By the choice of p we have xe ≥ 1/2 for all edges
e of the segment [p′, ti′].
Phase 1 (procedure RoundStrip). For each strip Rii′, if Rii′ is degenerated,
then take in the integer solution all edges of [ti, ti′], otherwise round up the
edges of [ti, p] and [p′, ti′] and take the edge pp′ as a switch of Rii′; in both cases,
denote by Pii′ the resulting l1-path between ti and ti′.
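The choice of p and of the switch in Phase 1 can be sketched for a single non-degenerated strip as follows. This is only an illustration under an assumed encoding that is not part of the paper: the edges on the side containing ti are indexed 0, 1, ... starting at ti, `x_side[k]` is the fractional value of the k-th such edge and `x_twin[k]` is the value of its twin on the opposite side. The function name `round_strip` is hypothetical.

```python
def round_strip(x_side, x_twin):
    """Pick the switch position p for one non-degenerated strip.

    Assumed encoding: x_side[k] is the LP value of the k-th edge (counted from
    the starting terminal) on the side containing that terminal, and x_twin[k]
    is the value of its twin edge on the opposite side.

    Returns (p, side_edges, twin_edges): the switch leaves the first side after
    p edges; side_edges and twin_edges are the indices that are rounded up.
    """
    p = 0
    # walk along the side while the edges carry at least one half
    while p < len(x_side) and x_side[p] >= 0.5:
        p += 1
    side_edges = list(range(p))                 # edges of the segment up to p
    twin_edges = list(range(p, len(x_twin)))    # edges from the switch to the far terminal
    return p, side_edges, twin_edges
```

For example, with `x_side = [1.0, 0.6, 0.3, 0.2]` and `x_twin = [0.0, 0.4, 0.7, 0.8]` the switch is placed after two edges, and the twin edges that are rounded up have values 0.7 and 0.8, in line with the property stated just before Phase 1.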
Let Sii′|jj′ be a staircase. Denote by φ the common point of the l1-paths
Pii′ and Pjj′ closest to ti (this point is a corner of the rectangular face of Γ
containing the vertices o and o′). Let P+ii′ and P+jj′ be the sub-paths of Pii′ and
Pjj′ comprised between φ and the terminals ti and tj, respectively. Now we
slightly expand the staircase Sii′|jj′ by considering as Sii′|jj′ the region bounded
by the paths P+ii′, P+jj′, and Mij (P+ii′ and P+jj′ are not included in the staircase
but Mij and the terminals from the set Tij are). Inside Sii′|jj′, any flow f^{ki′}
(or f^{kj′}), k ∈ Tij, may be as fractional as possible: it may happen that several
l1-paths between tk and ti′ carry flow f^{ki′}. Any such l1-path intersects one of
the paths P+ii′ or P+jj′, therefore the total f^{ki′}-flow arriving at P+ii′ ∪ P+jj′ is equal
to 1. (This flow can be redirected to φ via the paths P+ii′ and P+jj′, and further,
along the path Pii′, to the terminal ti′.) Therefore it remains to decide how to
round up the flow f^{ki′} inside the expanded staircase Sii′|jj′. For this, notice that
either the total f^{ki′}-flow carried by the l1-paths that arrive at P+ii′ is at least 1/2
or the total f^{ki′}-flow on the l1-paths that arrive at P+jj′ is at least 1/2.
Phase 2 (procedure RoundStaircase). For a staircase Sii′|jj′ defined by the
l1-paths P+ii′ and P+jj′ and the monotone path Mij, find the lowest terminal tm ∈ Tij
such that the f^{mi′}-flow on l1-paths between tm and ti′ that arrive first at P+ii′ is
≥ 1/2 (we may suppose without loss of generality that this terminal exists). Let
ts be the terminal of Tij immediately below tm (this terminal may not exist).
By the choice of tm, the f^{si′}-flow on paths which arrive at Pjj′ is ≥ 1/2. Denote
by φ′ the intersection of the horizontal line passing via the terminal tm with the
path P+ii′. Analogously, let φ″ denote the intersection of the vertical line passing
via ts with the path P+jj′. Round up all edges of the horizontal segment [tm, φ′]
and all edges of the vertical segment [ts, φ″]. If Tij contains terminals located
above the horizontal line (tm, φ′), then recursively call RoundStaircase on the
expanded staircase defined by [tm, φ′], the sub-path of P+ii′ comprised between
φ′ and ti, and the sub-path Mim of the monotone path Mij between tm and
α. Analogously, if Tij contains terminals located to the right of the vertical line
(ts, φ″), then recursively call RoundStaircase on the expanded staircase defined
by [ts, φ″], the sub-path of P+jj′ comprised between φ″ and tj, and the sub-path
Msj of the monotone path Mij between ts and β; see Fig. 7 for an illustration.
[Fig. 7. Procedure RoundStaircase]

Let E0 denote the edges of Γ which belong to the boundary of the Pareto
envelope of T. Let E1 be the set of all edges picked by the procedure RoundStrip
and which do not belong to E0, and let E2 be the set of all edges picked by the
recursive procedure RoundStaircase and which do not belong to E0 ∪ E1. Denote
by N∗ = (V∗, E0 ∪ E1 ∪ E2) the resulting rectilinear network. From Lemma 3
and the rounding procedures presented above we infer that N∗ is a Manhattan
network. Let x∗ be the integer solution of (1) associated with N∗, i.e., x∗e = 1 if
e ∈ E0 ∪ E1 ∪ E2 and x∗e = 0 otherwise.
5 Analysis
In this section, we will show that the length of the Manhattan network N∗ is at
most twice the cost of the optimal fractional solution of (1′), i.e., that

$$
\mathrm{cost}(x^*) = \sum_{e \in E} l_e x^*_e \le 2 \sum_{e \in E} l_e x_e = 2\,\mathrm{cost}(x). \tag{3}
$$

Recall that xe = x∗e = 1 holds for every edge e ∈ E0. To establish the inequality
(3), to every edge e ∈ E1 ∪ E2 we will assign a set Ee of edges parallel to e such
that (i) Σ_{e′∈Ee} xe′ ≥ 1/2 and (ii) Ee ∩ Ef = ∅ for any two edges e, f ∈ E1 ∪ E2.
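The way conditions (i) and (ii) are meant to be used can be spelled out as a short charging computation. The sketch below covers only the edges of E1 ∪ E2 and relies on an assumption that is not stated explicitly in the text, namely that every edge e′ ∈ Ee has the same length as e (parallel edges of Γ lying between the same two consecutive grid lines); the edges of E0, for which xe = 1, have to be accounted for separately.

```latex
\sum_{e \in E_1 \cup E_2} l_e
  \;\le\; \sum_{e \in E_1 \cup E_2} l_e \cdot 2 \sum_{e' \in E_e} x_{e'}
      % by (i): the inner sum is at least 1/2
  \;=\; 2 \sum_{e \in E_1 \cup E_2} \; \sum_{e' \in E_e} l_{e'} x_{e'}
      % assumed: l_{e'} = l_e for every e' in E_e
  \;=\; 2 \sum_{e' \in \bigcup_{e} E_e} l_{e'} x_{e'}
      % by (ii): the sets E_e are pairwise disjoint
```

Since the right-hand side sums over a subset of E, it is bounded by 2 cost(x) for this part of the network.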
First pick an edge e ∈ E1, say e ∈ Pii′ for a strip Rii′. If e belongs to a side
of this strip, then xe ≥ 1/2, and in this case we can set Ee := {e}. Now, if e is the
switch of Rii′, then Ee consists of any one of the two edges of ∂P parallel to e.
From the definition of strips one concludes that no other switch can be parallel
to these edges of ∂P. Therefore each pair of parallel edges of ∂P may appear in
at most one set Ee for a switch e.
Finally suppose that e ∈ E2, say e belongs to the expanded staircase Sii′|jj′.
If e belongs to the segment [tm, φ′], then Ee consists of e and all edges of
Sii′|jj′ parallel to e and located below e; see Fig. 7. Since every l1-path between tm
and ti′ intersecting the path P+ii′ contains an edge of Ee, we infer that the value
of the f^{mi′}-flow traversing the set Ee is at least 1/2, therefore Σ_{e′∈Ee} xe′ ≥ 1/2,
thus establishing (i). Analogously, if f is an edge of the vertical segment [ts, φ″],
then Ef consists of f and all edges of Sii′|jj′ parallel to f and located to the left
of f. Obviously, Ee ∩ Ef = ∅. Since Ee and Ef belong to the region of Sii′|jj′
delimited by the segments [tm, φ′] and [ts, φ″] and the recursive calls of the
procedure RoundStaircase concern the staircases disjoint from this region, we
deduce that Ee and Ef are disjoint from the sets Ee′ for all edges e′ picked by
the recursive calls of RoundStaircase on the staircase Sii′|jj′. Every edge of Γ
belongs to at most one staircase, therefore Ee ∩ Ef = ∅ if the edges e, f ∈ E
belong to different staircases. Finally, since there are no terminals of T located
below or to the left of the staircase Sii′|jj′, no strip traverses this staircase (a
strip intersecting Sii′|jj′ either coincides with Rii′ or Rjj′, or intersects the
staircase along segments of the boundary path Mij). Therefore, no edge from
E1 can be assigned to a set Ee for some e ∈ E2 ∩ Sii′|jj′, thus establishing (ii)
and the desired inequality (3). We are now in a position to formulate the main
result of this note:

Theorem 1. The rounding algorithm described in Section 4 achieves an
approximation guarantee of 2 for the minimum Manhattan network problem.
6 Conclusion
In this paper, we presented a simple rounding algorithm for the minimum
Manhattan network problem and we established that the length of the Manhattan
network returned by this algorithm is at most twice the cost of the optimal
fractional solution of the MMN problem. Nevertheless, experiments show that the
ratio between the costs of the solution returned by our algorithm and the
optimal solution of the linear programs (1′) and (2′) is much better than 2. We do
not know the worst integrality gap of (1) (the worst gap obtained by computer
experiments is about 1.087). In particular, is this gap smaller than or equal to 1.5?
Does there exist a gap in the case when the terminals are the origin and the corners
of a staircase?
References
1. M. Benkert, T. Shirabe, and A. Wolff, The minimum Manhattan network problem
- approximations and exact solutions, 20th EWCG, March 25-26, Seville (Spain),
2004.
2. G. Chalmet, L. Francis, and A. Kolen, Finding efficient solutions for rectilinear
distance location problems efficiently, European J. Operations Research 6 (1981)
117–124.
3. D. Eppstein, Spanning trees and spanners, Handbook of Computational Geometry,
J.-R. Sack and J. Urrutia, eds., Elsevier, 1999, pp. 425–461.
4. J. Gudmundsson, C. Levcopoulos, and G. Narasimhan, Approximating a mini-
mum Manhattan network, Nordic J. Computing 8 (2001) 219–232 and Proc. AP-
PROX’99, 1999, pp. 28–37.
5. R. Kato, K. Imai, and T. Asano, An improved algorithm for the minimum Man-
hattan network problem, ISAAC’02, Lecture Notes Computer Science, vol. 2518,
2002, pp. 344-356.
6. F. Lam, M. Alexanderson, and L. Pachter, Picking alignements from (Steiner)
trees, J. Computational Biology 10 (2003) 509–520.
7. K. Nouioua, Une approche primale-duale pour le problème du réseau de Manhattan
minimal, RAIRO Operations Research (submitted).
8. E. Tardos, A strongly polynomial algorithm to solve combinatorial linear programs,
Operations Research 34 (1986) 250–256.
9. R.E. Wendell, A.P. Hurter, and T.J. Lowe, Efficient points in location theory, AIEE
Transactions 9 (1973) 314-321.
10. M. Zachariasen, A catalog of Hanan grid problems, Networks 38 (2001) 76–83.

Packing Element-Disjoint Steiner Trees
Joseph Cheriyan¹,* and Mohammad R. Salavatipour²,**
¹ Department of Combinatorics and Optimization, University of Waterloo
Waterloo, Ontario N2L3G1, Canada
jcheriyan@math.uwaterloo.ca
² Department of Computing Science, University of Alberta
Edmonton, Alberta T6G2E8, Canada
mreza@cs.ualberta.ca
Abstract. Given an undirected graph G(V, E) with terminal set T ⊆ V
the problem of packing element-disjoint Steiner trees is to find the maxi-
mum number of Steiner trees that are disjoint on the nonterminal nodes
and on the edges. The problem is known to be NP-hard to approximate
within a factor of Ω(log n), where n denotes |V |. We present a random-
ized O(log n)-approximation algorithm for this problem, thus matching
the hardness lower bound. Moreover, we show a tight upper bound of
O(log n) on the integrality ratio of a natural linear programming relax-
ation.
1 Introduction
Throughout we assume that G = (V, E), with n = |V |, is a simple graph and
T ⊆ V is a specified set of nodes (although we do not allow multi-edges, these
can be handled by inserting new nodes into the edges). The nodes in T are
called terminal nodes or black nodes, and the nodes in V − T are called Steiner
nodes or white nodes. Following the (now standard) notation on approximation
algorithms for graph connectivity problems (e.g. see [16]), by an element we mean
either an edge or a Steiner node. A Steiner tree is a connected, acyclic subgraph
that contains all the terminal nodes (Steiner nodes are optional). The problem
of packing element-disjoint Steiner trees is to find a maximum-cardinality set of
element-disjoint Steiner trees. In other words, the goal is to find the maximum
number of Steiner trees such that each edge and each white node is in at most
one of these trees. We denote this problem by IUV. Here, I denotes identical
terminal sets for different trees in the packing, U denotes an undirected graph,
and V denotes disjointness for white nodes and edges.
By bipartite IUV we mean the special case where G is a bipartite graph with
node partition V = T ∪ (V − T ), that is, one of the sets of the vertex bipartition
consists of all of the terminal nodes. We will also consider the problem of packing
Steiner trees fractionally (or fractional IUV for short), with constraints on the
nodes which corresponds to a natural linear programming relaxation of IUV
(explained later in this section).

* Supported by NSERC grant No. OGP0138432.
** Supported by an NSERC postdoctoral fellowship, the Department of Combinatorics
and Optimization at the University of Waterloo, and a university start-up grant at
the University of Alberta.
IUV captures some of the fundamental problems of combinatorial optimiza-
tion and graph theory. First, suppose that T consists of just two nodes s and t.
Then the problem is to find a maximum-cardinality set of element-disjoint s, t-
paths. This problem is addressed by one of the cornerstone theorems in graph
theory, namely Menger’s theorem [4, Theorem 3.3.1], which states that the max-
imum number of openly-disjoint s, t-paths equals the minimum number of white
nodes whose deletion leaves no s, t-path. The algorithmic problem of finding an
optimal set of s, t-paths can be solved efficiently via any efficient maximum s, t-
flow algorithm. Another key special case of IUV occurs for T = V , that is, all
the nodes are terminals. Then the problem is to find a maximum-cardinality set
of edge-disjoint spanning trees. This problem is addressed by another classical
min-max theorem, namely the Tutte/Nash-Williams theorem [4, Theorem 3.5.1].
The algorithmic problem of finding an optimal set of edge-disjoint spanning trees
can be solved efficiently via the matroid intersection algorithm. In contrast, the
problem IUV is known to be NP-hard [3, 7], and the optimal value cannot
be approximated within a factor of Ω(log n) modulo the P≠NP conjecture [3].
Moreover, this hardness result applies also to bipartite IUV and even to the
problem of packing Steiner trees fractionally for the bipartite case (see (1) below
for more details) via [15, Theorem 4.1]. That is, the optimal value of this linear
programming relaxation of bipartite IUV cannot be approximated within a
factor of Ω(log n) modulo the P≠NP conjecture. This is discussed in more detail
later. For related results, see [2].
One variant of IUV has attracted increasing research interest over the last
few years, namely, the problem of packing edge-disjoint Steiner trees (find a
maximum-cardinality set of edge-disjoint Steiner trees); we denote this problem
by IUE. This problem in its full generality has applications in VLSI circuit
design (e.g., see [12, 21]). Other applications include multicasting in wireless
networks (see [6]) and broadcasting large data streams, such as videos, over
the Internet (see [15]). Almost a decade ago, Grötschel et al., motivated by the
importance of IUE in applications and in theory, studied the problem using
methods from mathematical programming, in particular polyhedral theory and
cutting-plane algorithms, see [8–12]. Moreover, there is significant motivation
from the areas of graph theory and combinatorial optimization, partly based
on the relation to the classical results mentioned above, and partly fueled by
an exciting conjecture of Kriesell [18] (the conjecture states that the maximum
number of edge-disjoint Steiner trees is at least half of an obvious upper bound,
namely, the minimum number of edges in a cut that separates some pair of
terminals). If this conjecture is settled by a constructive proof, then it may
give a 2-approximation algorithm for IUE. Recently, Lau [19] made a major
advance on this conjecture by presenting a 26-approximation algorithm for IUE
using new combinatorial ideas. Lau’s construction is based on an earlier result
of Frank, Kiraly, and Kriesell [7] that gives a 3-approximation for a special case
of bipartite IUV. (To the best of our knowledge, no other method for IUE gives
an O(1)-approximation guarantee, or even a o(|T |)-approximation guarantee).

Here is a summary of the previous results in the area. Frank et al. [7] studied
bipartite IUV, and focusing on the restricted case where the degree of every
white node is ≤ Δ they presented a Δ-approximation algorithm (via the matroid
intersection theorem and algorithm). Recently, we [3] showed that (i) IUV is
hard to approximate within a factor of Ω(log n), even for bipartite IUV and even
for the fractional version of bipartite IUV, (ii) IUV is APX-hard even if |T|
is a small constant, and (iii) there is an O(√n log n)-approximation algorithm
for a generalization of IUV. For IUE, Jain et al. [15] proved that the problem
is APX-hard, and (as mentioned above) Lau [19] presented a 26-approximation
algorithm, based on the results of Frank et al. for bipartite IUV (see footnote 1). Another
related topic pertains to the domatic number of a graph and computing near-
optimal domatic partitions. Feige et al. [5] presented approximation algorithms
and hardness results for these problems. One of our key results is inspired by
this work.
Although IUE seems to be more natural compared to IUV, and although
there are many more papers (applied, computational, and theoretical) on IUE,
the only known O(1)-approximation guarantee for IUE is based on solving bi-
partite IUV. This shows that IUV is a fundamental problem in this area. Our
main contribution is to settle (up to constant factors) IUV and bipartite IUV
from the perspective of approximation algorithms. Moreover, our result extends
to the capacitated version of IUV, where each white (Steiner) node v has a
nonnegative integer capacity cv, and the goal is to find a maximum collection of
Steiner trees (allowing multiple copies of any Steiner tree) such that each white
node v appears in at most cv Steiner trees; there is no capacity constraint on
the edges, i.e., each edge has infinite capacity. The capacitated version of IUV
(which contains IUV as a special case) may be formulated as an integer program
(IP) that has an exponential number of variables. Let F denote the collection of
all Steiner trees in G. We have a binary variable xF for each Steiner tree F ∈ F.
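$$
\begin{aligned}
\text{maximize}\quad & \sum_{F \in \mathcal{F}} x_F \\
\text{subject to}\quad & \sum_{F : v \in F} x_F \le c_v, && \forall v \in V - T, \\
& x_F \in \{0, 1\}, && \forall F \in \mathcal{F}.
\end{aligned}
\tag{1}
$$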
Note that in uncapacitated IUV we have cv = 1, ∀v ∈ V − T . The fractional
IUV (mentioned earlier) corresponds to the linear programming relaxation of
this IP which is obtained by relaxing the integrality condition on xF ’s to 0 ≤
xF ≤ 1.
Our main result is the following:
Theorem 1. (a) There is a polynomial time probabilistic approximation
algorithm with a guarantee of O(log n) and a failure probability of O(1)/log n for
(uncapacitated) IUV. The algorithm finds a solution that is within a factor O(log n)
of the optimal solution to fractional IUV.
(b) The same approximation guarantee holds for capacitated IUV.
Footnote 1. Although not relevant to this paper, we mention that the directed version
of IUV has been studied [3], and the known approximation guarantees and hardness
lower bounds are within the same “ballpark” according to the classification of Arora
and Lund [1].

We call an edge white if both its end-nodes are white, otherwise, the edge is
called black (then at least one end-node is a terminal). For our purposes, any
edge can be subdivided by inserting a white node. In particular, any edge with
both end-nodes black can be subdivided by inserting a white node. Thus, the
problem of packing element-disjoint Steiner trees can be transformed into the
problem of packing Steiner trees that are disjoint on the set of white nodes. We
prefer the formulation in terms of element-disjoint Steiner trees; for example,
this formulation immediately shows that IUV captures the problem of packing
edge-disjoint spanning trees; of course, the two formulations are equivalent.
For two nodes s, t, let κ(s, t) denote the maximum number of element-disjoint
s, t-paths (an s, t-path means a path with end-nodes s and t); in other words,
κ(s, t) denotes the maximum number of s, t-paths such that each edge and each
white node is in at most one of these paths. The graph is said to be k-element
connected if κ(s, t) ≥ k, ∀s, t ∈ T, s ≠ t, i.e., there are ≥ k element-disjoint
paths between every pair of terminals. For a graph G = (V, E) and edge e ∈ E,
G − e denotes the graph obtained from G by deleting e, and G/e denotes the
graph obtained from G by contracting e; see [4, Chapter 1] for more details.
As mentioned above, bipartite IUV means the special case of IUV where every
edge is black. We call the graph bipartite if every edge is black.
Here is a sketch of our algorithm and proof for Theorem 1(a). Let k be the
maximum number such that the input graph G is k-element connected. Clearly,
the maximum number of element-disjoint Steiner trees is ≤ k (informally, each
Steiner tree in a family of element-disjoint Steiner trees contributes one to the
element connectivity). Note that this upper bound also holds for the optimal
fractional solution. We delete or contract white edges in G, while preserving the
element connectivity, to obtain a bipartite graph G∗; thus, G∗ too is k-element
connected (details in Section 2). Then we apply our key result (Theorem 3 in
Section 3) to G∗ to obtain O(k/ log n) element-disjoint Steiner trees; this is
achieved via a simple algorithm that assigns a random colour to each Steiner
node – it turns out that for each colour, the union of T and the set of nodes with
that colour induces a connected subgraph, and hence this subgraph contains a
Steiner tree. Finally, we uncontract some of the white nodes to obtain the same
number of element-disjoint Steiner trees of G. Note that uncontracting white
nodes in a set of element-disjoint Steiner trees preserves the Steiner trees (up to
the deletion of redundant edges) and preserves the element-disjointness of the
Steiner trees.
2 Reducing IUV to Bipartite IUV
To prove our main result, we first show that the problem can be reduced to
bipartite IUV while preserving the approximation guarantee. The next result
is due to Hind and Oellermann [14, Lemma 4.2]. We had found the result
independently (before discovering the earlier works), and have included a proof for
the sake of completeness.

Theorem 2. Given a graph G = (V, E) with terminal set T that is k-element
connected (and has no edge with both end-nodes black), there is a poly-time
algorithm to obtain a bipartite graph G∗ from G such that G∗ has the same
terminal set and is k-element connected, by repeatedly deleting or contracting
white edges.
Proof. Consider any white edge e = pq. We prove that either deleting or con-
tracting e preserves the k-element connectivity of G.
Suppose that G − e is not k-element connected. Then by Menger’s theorem
G−e has a set D of k −1 white nodes whose deletion “separates” two terminals.
That is, every terminal is in one of two components of G − D − e and each of
these components has at least one terminal; call these two components Cp and
Cq. Let s be a terminal in Cp and let t be a terminal in Cq. Let P(s, t) denote any
set of k element-disjoint s, t-paths in G, and observe that one of these s, t-paths,
say P1, contains e (since the k-set D ∪ {e} “covers” P(s, t)).
By way of contradiction, suppose that the graph G′ = G/e, obtained from
G by contracting e, is not k-element connected. Then focus on G and note that,
again by Menger’s theorem, it has a set R of k white nodes, R ⊇ {p, q}, whose
deletion “separates” two terminals. That is, there are two terminals that are in
different components of G − R (R is obtained by taking a “cut” of k − 1 white
nodes in G′ and uncontracting one node). This gives a contradiction because:
(1) for s, t as above, the s, t-path P1 in P(s, t) contains both nodes p, q ∈ R; since
|R| = k and P(s, t) has k element-disjoint paths, by the Pigeonhole Principle
another one of the s, t-paths in P(s, t), say Pk, is disjoint from R; hence, G − R
has an s, t-path, and (2) for terminals v, w that are both in, say, Cp (or both in
Cq), G − R has a v, t-path (arguing as in (1)) and also it has a w, t-path (as in
(1)), thus G − R has a v, w-path.
It is easy to complete the proof: we repeatedly choose any white edge e and
either delete e or contract e, while preserving the k-element connectivity, until
no white edges are left; we take G∗ to be the resulting k-element connected
bipartite graph.
Clearly, this procedure can be implemented in polynomial time. In more
detail, we choose any white edge e (if there exists one) and delete it. Then we
compute whether or not the new graph is k-element connected by finding whether
κ(s, t) ≥ k in the new graph for every pair of terminals s, t; this computation
takes $O(k|T|^2|E|)$ time. If the new graph is k-element connected, then we proceed
to the next white edge, otherwise, we identify the two end nodes of e (this has
the effect of contracting e in the old graph). Thus each iteration decreases the
number of white edges (which is O(|E|)), hence, the overall running time is
$O(k|T|^2|E|^2)$.
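The delete-or-contract loop described above can be sketched as follows. This is only an illustrative sketch: the adjacency-dictionary representation and the names `find_white_edge` and `reduce_to_bipartite` are assumptions, and the k-element connectivity test is passed in as a caller-supplied predicate (`is_k_element_connected`, hypothetical) rather than implemented here.

```python
from typing import Callable, Dict, Hashable, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def find_white_edge(adj: Graph, terminals: Set[Hashable]) -> Optional[Tuple[Hashable, Hashable]]:
    """Return an edge whose two end nodes are both Steiner (white) nodes, if any."""
    for u, nbrs in adj.items():
        if u in terminals:
            continue
        for v in nbrs:
            if v not in terminals:
                return u, v
    return None


def reduce_to_bipartite(adj: Graph, terminals: Set[Hashable],
                        is_k_element_connected: Callable[[Graph, Set[Hashable]], bool]) -> Graph:
    """Delete or contract white edges until none is left.

    `is_k_element_connected` is a hypothetical oracle supplied by the caller;
    in the text it checks kappa(s, t) >= k for every pair of terminals.
    """
    g = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    while True:
        edge = find_white_edge(g, terminals)
        if edge is None:
            return g
        p, q = edge
        # First try deleting the white edge e = pq.
        g[p].discard(q)
        g[q].discard(p)
        if is_k_element_connected(g, terminals):
            continue
        # Deletion breaks connectivity, so contract e: identify q with p.
        for w in g.pop(q):
            g[w].discard(q)
            g[w].add(p)
            g[p].add(w)
```

Each pass either removes one white edge or removes one node, so the loop stops after at most |E| + |V| passes.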

3 Bipartite IUV
This section has the key result of the paper, namely, a randomized O(log n)-
approximation algorithm for bipartite IUV.

Theorem 3. Given an instance of bipartite IUV such that the graph is k-element
connected, there is a randomized poly-time algorithm that with probability
$1 - \frac{1}{\log n}$ finds a set of $\Omega(\frac{k}{\log n})$ element-disjoint Steiner trees.
Proof. Without loss of generality, assume that the graph is connected, and there
is no edge between any two terminals (if there exists any, then subdivide each
such edge by inserting a Steiner node).
For ease of exposition, assume that n is a power of two and k is an integer
multiple of R = 6 log n; here, R is a parameter of the algorithm. The algorithm
is simple: we color each Steiner node u.r. (uniformly at random) with one of k/R
super-colors i = 1, . . . , k/R. For each i = 1, . . . , k/R, let $D^i$ denote the set of
nodes that get the super-color i. We claim that for each i, the subgraph induced
by $D^i \cup T$ is connected with high probability, and hence this subgraph contains
a Steiner tree. If the claim holds, then we are done, since we get a set of k/R
element-disjoint Steiner trees.
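A minimal sketch of this coloring step, assuming the graph is given as an adjacency dictionary. The function names and the `seed` parameter are illustrative assumptions, not part of the paper. Each Steiner node receives one of `num_classes` super-colors uniformly at random, and a class is kept only if it induces a connected subgraph together with the terminals.

```python
import random
from collections import deque


def color_steiner_nodes(steiner, num_classes, rng):
    """Assign each Steiner node one of num_classes super-colors uniformly at random."""
    classes = [set() for _ in range(num_classes)]
    for v in steiner:
        classes[rng.randrange(num_classes)].add(v)
    return classes


def induces_connected(adj, nodes):
    """Breadth-first search restricted to `nodes`; True if the induced subgraph is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in nodes and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == nodes


def pack_steiner_trees(adj, terminals, num_classes, seed=0):
    """Return the color classes D^i for which D^i together with T is connected."""
    rng = random.Random(seed)
    steiner = [v for v in adj if v not in terminals]
    classes = color_steiner_nodes(steiner, num_classes, rng)
    return [d for d in classes if induces_connected(adj, d | set(terminals))]
```

With `num_classes` set to k/R, every returned class contains a Steiner tree (for example, any spanning tree of the induced subgraph), and the classes are element-disjoint by construction.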
For the purpose of analysis, it is easier to present the algorithm in an
equivalent form that has two phases. In phase one, we color every Steiner node u.r.
with one of k colors i = 1, . . . , k and we denote the set of nodes that get the
color i by $C^i$ (i = 1, . . . , k). In phase two, we partition the color classes into k/R
super-classes where each super-class $D^j$ (j = 1, . . . , k/R) consists of R consecutive
color classes $C^{(j-1)R+1}, C^{(j-1)R+2}, \ldots, C^{jR}$. We do this in R rounds, where
in round $1 \le \ell \le R$ we have
$$D^j_\ell = \bigcup_{i=(j-1)R+1}^{(j-1)R+\ell} C^i;$$
thus we have $D^j = D^j_R$. Consider an arbitrary super-class, say the first one $D^1$.
For an arbitrary $1 \le \ell < R$, focus on the graph $H_\ell$ induced by $D^1_\ell \cup T$.
Let $G_1, \ldots, G_{d_\ell}$ be the connected components of $H_\ell$; note that $d_\ell \ge 1$
denotes the number of components of $H_\ell$.
Suppose that $H_\ell$ is not connected, i.e. $d_\ell > 1$.
Lemma 1. Consider any connected component of $H_\ell$, say G1. There is a set
U ⊆ V − T − V (G1) (of white nodes) with |U| ≥ k such that each node in U is
adjacent to a terminal in G1 and to a terminal in G − V (G1).
Proof. Let U ⊆ V − V (G1) be a maximum-size set of Steiner nodes such that
each node in U has a neighbour in each of G1 and G − V (G1); note that none
of the nodes in U is in G1. By way of contradiction, assume that |U| < k.
Consider G − U. An important observation is that every edge of G between G1
and G − V (G1) is between a terminal of G1 and a Steiner node of G − V (G1);
this holds because G is bipartite and G1 is a subgraph induced by T and some
set of white nodes. From this, and by definition of U, there is no edge between
G1 and G − U − V (G1), i.e., G − U is disconnected (note that there is at least
one terminal in G1 and one terminal in G − U − V (G1)). This contradicts the
assumption that G is k element-connected.

Consider a set U as in the above lemma. If a vertex s ∈ U has the color
$\ell + 1$, then when we add $C^{\ell+1}$ to $D^1_\ell$, we see that s connects $G_1$ and another
connected component of $H_\ell$, because s is adjacent to a terminal in $G_1$ and to a
terminal in $G - V(G_1)$. For every node s ∈ U we have $\Pr[s \in C^{\ell+1}] = \frac{1}{k}$. Thus,
the probability that none of the vertices in U has been colored $\ell + 1$ is at most:
$$\left(1 - \frac{1}{k}\right)^{|U|} \le \left(1 - \frac{1}{k}\right)^{k} \le e^{-1}. \qquad (2)$$
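For completeness, the two steps of bound (2) can be spelled out as follows. The first uses |U| ≥ k from Lemma 1 together with 0 ≤ 1 − 1/k < 1, and the second uses the standard inequality 1 − x ≤ e^{−x} with x = 1/k (this inequality is a well-known fact, not something stated in the text):

```latex
\left(1-\frac{1}{k}\right)^{|U|}
  \;\le\; \left(1-\frac{1}{k}\right)^{k}
  \;\le\; \left(e^{-1/k}\right)^{k}
  \;=\; e^{-1}.
```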
This is an upper bound on the probability that when we add $C^{\ell+1}$ to $D^1_\ell$,
component $G_1$ does not become connected to another connected component $G_a$,
for some $2 \le a \le d_\ell$. If every connected component $G_i$, $1 \le i \le d_\ell$, becomes
connected to another component, then the number of connected components of
$H_\ell$ decreases to at most $d_\ell/2$ in round $\ell + 1$. If in every round and for every
super-class, the number of connected components decreases by a constant factor then,
after O(log n) rounds, every $D^i \cup T$ forms a connected graph. We show that this
happens with sufficiently high probability.
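By (2), in round $\ell$, any fixed connected component of $H_\ell$ becomes connected
to another component with probability at least $1 - e^{-1}$. So the expected number
of connected components of $H_\ell$ that become connected to another component is
at least $(1 - e^{-1}) \cdot d_\ell$. Thus, if $d_\ell \ge 2$ then defining $\sigma = \frac{1 + e^{-1}}{2}$ we have:
$$E[d_{\ell+1} \mid d_\ell] \le \sigma \cdot d_\ell. \qquad (3)$$
Define $X_\ell = d_\ell - 1$. Therefore, $X_1, X_2, \ldots, X_\ell, \ldots$ is a sequence of integer
random variables that starts with $X_1 = d_1 - 1$. Moreover, for every $\ell \ge 1$, we
have $X_\ell \ge 0$, and if $X_\ell = 0$ then $E[X_{\ell+1}] = 0$ and if $X_\ell \ge 1$ then
$$\begin{aligned}
E[X_{\ell+1} \mid X_\ell] &= E[d_{\ell+1} - 1 \mid d_\ell - 1 \ge 1] \\
&= E[d_{\ell+1} \mid d_\ell \ge 2] - 1 \\
&\le \sigma d_\ell - 1 \quad \text{by (3)} \\
&= \sigma X_\ell + \sigma - 1 \\
&\le \sigma X_\ell.
\end{aligned}$$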
An easy induction shows that $E[X_{\ell+1}] \le \sigma^{\ell} X_1$. Since $X_1 \le n - 1$ and $\sigma < \frac{3}{4}$,
we have $E[X_R] \le \frac{1}{n}$ (recall that R = 6 log n). Therefore, Markov’s inequality
implies that $\Pr[X_R \ge 1] \le \frac{1}{n}$. This implies that $\Pr[d_R \ge 2] \le \frac{1}{n}$, i.e., the
probability that $H_R = D^1 \cup T$ is not connected is at most $\frac{1}{n}$. As there are $\frac{k}{R}$
super-classes, a simple union-bound shows that the probability that there is at
least one $D^j$ ($1 \le j \le \frac{k}{R}$) such that $D^j \cup T$ is not connected is at most
$\frac{k}{Rn} \le \frac{1}{\log n}$. Thus, with probability at least $1 - \frac{1}{\log n}$, every super-class $D^j$ (together
with T) induces a connected graph, and hence, the randomized algorithm finds
Ω(k/ log n) element-disjoint Steiner trees.
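The arithmetic behind the choice R = 6 log n can be checked numerically. The sketch below assumes logarithms to base 2 (the text does not fix the base) and uses illustrative function names. It evaluates the bound σ^{R−1}(n − 1) on E[X_R] and the union bound k/(Rn).

```python
import math

# Contraction factor sigma = (1 + 1/e) / 2 from inequality (3); roughly 0.684.
SIGMA = (1 + math.exp(-1)) / 2


def rounds(n: int) -> float:
    """R = 6 log n, with log taken to base 2 (an assumption)."""
    return 6 * math.log2(n)


def expected_excess_components(n: int) -> float:
    """Upper bound sigma^(R-1) * (n - 1) on E[X_R], using X_1 <= n - 1."""
    return SIGMA ** (rounds(n) - 1) * (n - 1)


def union_bound(n: int, k: int) -> float:
    """Upper bound k / (R n) on the probability that some super-class fails."""
    return (k / rounds(n)) / n
```

For every power of two n ≥ 2 the first quantity is at most 1/n, and since k ≤ n the second is at most 1/log n, matching the two claims in the proof.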

4 IUV and Capacitated IUV
Now we complete the proof of Theorem 1 using Theorems 2 and 3.
First, we prove part (a). Let k be the maximum number such that the input
graph G is k-element connected. Clearly, the maximum number of element-disjoint
Steiner trees is at most k. Apply Theorem 2 to obtain a bipartite graph
G∗ that is k-element connected. Apply Theorem 3 to find $\Omega(\frac{k}{\log n})$ element-disjoint
Steiner trees in G∗. Then uncontract white nodes to obtain the same
number of element-disjoint Steiner trees of G. Moreover, it can be seen that the
optimal value of the LP relaxation is at most k (because there exists a set of
k white nodes whose deletion leaves no path between some pair of terminals).
Thus our integral solution is within a factor O(log n) of the optimal fractional
solution.
Now, we prove part (b) of Theorem 1. Our proof uses ideas from [3, 15, 19].
Consider the IP formulation (1) of capacitated IUV. The fractional packing
vertex capacitated Steiner tree problem is the linear program (LP) obtained by
relaxing the integrality condition in the IP to xF ≥ 0. As we said earlier, this
LP has exponentially many variables, however, we can solve it approximately.
Then we show that either rounding the approximate LP solution will result in
an O(log n)-approximation or we can reduce the problem to the uncapacitated
version of IUV and use Theorem 1,(a).
Note that the separation oracle for the dual of the LP is the problem of finding
a minimum node-weighted Steiner tree. Using this fact, the proof of Theorem
4.1 in [15] may be adapted to prove the following:
Lemma 2. There is an α-approximation algorithm for fractional IUV if and
only if there is an α-approximation algorithm for the minimum node-weighted
Steiner tree problem.
Klein and Ravi [17] (see also Guha and Khuller [13]) give an O(log n)-
approximation algorithm for the problem of computing a minimum node-
weighted Steiner tree. Their result, together with Lemma 2 implies that:
Lemma 3. There is a polynomial-time O(log n)-approximation algorithm for
fractional IUV.
Define $\varphi$ and $\varphi_f$ to be the optimal (objective) values for capacitated IUV
and for fractional capacitated IUV, respectively. Consider an approximately
optimal solution to fractional capacitated IUV obtained by Lemma 3. Let $\varphi^*$
denote the approximately optimal (objective) value, and let $Y = \{x_1, \ldots, x_d\}$
denote the set of primal variables that have positive values. One of the features
of the algorithm of Lemma 3 (which is also a feature of the algorithm of [15]) is
that d (the number of fractional Steiner trees computed) is polynomial in n (even
though the LP has an exponential number of variables). If
$\sum_{i=1}^{d} \lfloor x_i \rfloor \ge \frac{1}{2} \sum_{i=1}^{d} x_i$
then $Y' = \{\lfloor x_1 \rfloor, \ldots, \lfloor x_d \rfloor\}$ is an integral solution (i.e., a solution for capacitated
IUV) with value at least $\frac{\varphi^*}{2}$, which is at least $\Omega(\frac{\varphi_f}{\log n})$, and this in turn is at
least $\Omega(\frac{\varphi}{\log n})$. In this case the algorithm returns the Steiner trees corresponding
to the variables in $Y'$ and stops. This is within an O(log n) factor of the optimal
solution. Otherwise, if $\sum_{i=1}^{d} \lfloor x_i \rfloor < \frac{1}{2} \sum_{i=1}^{d} x_i$ then
$$\varphi^* = \sum_{i=1}^{d} x_i = \sum_{i=1}^{d} \lfloor x_i \rfloor + \sum_{i=1}^{d} (x_i - \lfloor x_i \rfloor) < \frac{\varphi^*}{2} + d.$$

Therefore $\varphi^* < 2d$. This implies that for every Steiner node v, at most a value
of $\min\{c_v, O(d \log n)\}$ of the capacity of v is used in any optimal (fractional or
integral) solution. So we can decrease the capacity $c_v$ of every Steiner node
v ∈ V − T to $\min\{c_v, O(d \log n)\}$. Note that this value is upper bounded by a
polynomial in n. Let this new graph be G′. We are going to modify this graph
to another graph G″ which will be an instance of uncapacitated IUV. For every
Steiner node v ∈ G′ with capacity $c_v$ we replace v with $c_v$ copies of it called
$v_1, \ldots, v_{c_v}$, each having unit capacity. The set of terminal nodes stays the same
in G′ and G″. Then for every edge uv ∈ G′ we create a complete bipartite graph
on the copies of v (as one part) and the copies of u (the other part) in G″.
This new graph G″ will be the instance of (uncapacitated) IUV. It follows that
the size of G″ is polynomial in the size of G. Also, it is straightforward to verify that G″
has α element-disjoint Steiner trees if and only if there are α Steiner trees in
G satisfying the capacity constraints of the Steiner nodes. Finally, we apply the
algorithm of Theorem 1,(a) to graph G″.
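The node-splitting construction in this reduction can be sketched as below. The adjacency-dictionary input, the `(v, i)` naming of copies, and the function name `expand_capacities` are assumptions made for illustration. Terminals are kept as single nodes, each Steiner node v becomes c_v unit-capacity copies, and every original edge becomes a complete bipartite graph between the copies of its end nodes.

```python
from typing import Dict, Hashable, List, Set


def expand_capacities(adj: Dict[Hashable, Set[Hashable]],
                      terminals: Set[Hashable],
                      capacity: Dict[Hashable, int]) -> Dict[Hashable, Set[Hashable]]:
    """Build the uncapacitated instance from a capacitated one.

    `capacity[v]` is the (already truncated) capacity of Steiner node v.
    Copies of v are named (v, 1), ..., (v, capacity[v]); this naming is a
    convention of this sketch only.
    """
    def copies(v: Hashable) -> List[Hashable]:
        if v in terminals:
            return [v]
        return [(v, i) for i in range(1, capacity[v] + 1)]

    new_adj: Dict[Hashable, Set[Hashable]] = {c: set() for v in adj for c in copies(v)}
    for u, nbrs in adj.items():
        for v in nbrs:
            # Complete bipartite graph between copies of u and copies of v.
            for cu in copies(u):
                for cv in copies(v):
                    new_adj[cu].add(cv)
    return new_adj
```

Because capacities are first reduced to a polynomial in n, the expanded graph has polynomially many nodes and edges.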
We presented a simple combinatorial algorithm which finds an integral solution
that is within a factor O(log n) of the optimal integral (and in fact optimal
fractional) solution. Recently, Lau [20] has given a combinatorial O(1)-
approximation algorithm for computing a maximum collection of edge-disjoint
Steiner forests in a given graph. His result again relies on the result of Frank et
al. [7] for solving (a special case of) bipartite IUV. It would be interesting to
study the corresponding problem of packing element-disjoint Steiner forests.
Acknowledgments
The authors thank David Kempe, whose comments led to an improved analysis
in Theorem 3, and Lap chi Lau who brought reference [14] to our attention.
References
1. S.Arora and C.Lund, Hardness of approximations, in Approximation Algorithms
for NP-hard Problems, Dorit Hochbaum Ed., PWS Publishing, 1996.
2. J.Bang-Jensen and S.Thomassé, Highly connected hypergraphs containing no two
edge-disjoint spanning connected subhypergraphs, Discrete Applied Mathematics
131(2):555-559, 2003.
3. J. Cheriyan and M. Salavatipour, Hardness and approximation results for packing
Steiner trees, invited to a special issue of Algorithmica. Preliminary version in
Proc. ESA 2004, Springer LNCS, Vol 3221, pp 180-191.
4. R.Diestel, Graph Theory, Springer, New York, NY, 2000.
5. U. Feige, M. Halldorsson, G. Kortsarz, and A. Srinivasan, Approximating the do-
matic number, SIAM J.Computing 32(1):172-195, 2002. Earlier version in STOC
2000.

6. P. Floréen, P. Kaski, J. Kohonen, and P. Orponen, Multicast time maximization in
energy constrained wireless networks, in Proc. 2003 Joint Workshop on Foundations
of Mobile Computing, DIALM-POMC 2003.
7. A.Frank, T.Király, M.Kriesell, On decomposing a hypergraph into k connected sub-
hypergraphs, Discrete Applied Mathematics 131(2):373-383, 2003.
8. M.Grötschel, A.Martin, and R.Weismantel, Packing Steiner trees: polyhedral in-
vestigations, Math. Prog. A 72(2):101-123, 1996.
9. ——, Packing Steiner trees: a cutting plane algorithm, Math. Prog. A 72(2):125-
145, 1996.
10. ——, Packing Steiner trees: separation algorithms, SIAM J. Disc. Math. 9:233-257,
1996.
11. ——, Packing Steiner trees: further facets, European J. Combinatorics 17(1):39-52,
1996.
12. ——, The Steiner tree packing problem in VLSI design, Mathematical Program-
ming 78:265-281, 1997.
13. S. Guha and S. Khuller, Improved methods for approximating node weighted Steiner
trees and connected dominating sets, Information and Computation 150:57-74,
1999. Preliminary version in FSTTCS 1998.
14. H. R. Hind and O. Oellermann, Menger-type results for three or more vertices,
Congressus Numerantium 113:179–204, 1996.
15. K. Jain, M. Mahdian, M.R. Salavatipour, Packing Steiner trees, in Proc. ACM-
SIAM SODA 2003.
16. K. Jain, I. Mandoiu, V. Vazirani and D. Williamson, A primal-dual schema based
approximation algorithm for the element connectivity problem, in Proc. ACM-SIAM
SODA 1999, 99–106.
17. P.Klein and R.Ravi, A nearly best-possible approximation algorithm for node-
weighted Steiner trees, Journal of Algorithms 19:104-115 (1995).
18. M.Kriesell, Edge-disjoint trees containing some given vertices in a graph, J. Com-
binatorial Theory (B) 88:53-65, 2003.
19. L. Lau, An approximate max-Steiner-tree-packing min-Steiner-cut theorem, In
Proc. IEEE FOCS 2004.
20. L. Lau, Packing Steiner forests, to appear in Proc. IPCO 2005.
21. A.Martin and R.Weismantel, Packing paths and Steiner trees: Routing of electronic
circuits, CWI Quarterly 6:185-204, 1993.

Approximating the Bandwidth of Caterpillars
Uriel Feige¹ and Kunal Talwar²
¹ Weizmann Institute and Microsoft Research, urifeige@microsoft.com
² Microsoft Research, kunal@microsoft.com
Abstract. A caterpillar is a tree in which all vertices of degree three or
more lie on one path, called the backbone. We present a polynomial time
algorithm that produces a linear arrangement of the vertices of a cater-
pillar with bandwidth at most O(log n/ log log n) times the local density
of the caterpillar, where the local density is a well known lower bound
on the bandwidth. This result is best possible in the sense that there
are caterpillars whose bandwidth is larger than their local density by a
factor of Ω(log n/ log log n). The previous best approximation ratio for
the bandwidth of caterpillars was O(log n). We show that any further
improvement in the approximation ratio would require using linear ar-
rangements that do not respect the order of the vertices of the backbone.
We also show how to obtain a $(1 + \epsilon)$ approximation for the bandwidth
of caterpillars in time $2^{\tilde{O}(\sqrt{n/\epsilon})}$. This result generalizes to trees, planar
graphs, and any family of graphs with treewidth $\tilde{O}(\sqrt{n})$.
1 Introduction
To set the ground for presenting our results, let us first define the main terms
that we use.
Definition 1. A linear arrangement of a graph is a numbering of its vertices
from 1 to n. The bandwidth of a linear arrangement is the largest difference
between the two numbers given to endpoints of the same edge. The bandwidth of
a graph is the minimum bandwidth over all linear arrangements of the graph.
Definition 2. A caterpillar is a tree composed of a path, called the backbone,
and other paths, called strands, connected to the backbone. The connection point
of a strand to the backbone is called the root of the strand. Hence all vertices of
degree more than two are roots.
Definition 3. The unfolded bandwidth of a caterpillar is the minimum band-
width in a linear arrangement that respects the order of the vertices in the back-
bone.
Definition 4. Let B(v, r) denote the set {u ∈ V : d(u, v) ≤ r} where d(·, ·)
denotes the usual shortest path distance in G. The local density of a graph ρG
is defined as
$$\rho_G = \max_{v \in V} \max_{r} \frac{|B(v, r)|}{2r}$$
It is easy to check that for any graph G, the local density ρG gives a lower bound
on the bandwidth of G.
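As an illustration of Definition 4, the following sketch computes the local density of a small connected graph by breadth-first search from every vertex. The adjacency-dictionary input, the function names, and the restriction to integer radii r ≥ 1 are assumptions of this sketch.

```python
from collections import deque
from fractions import Fraction


def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def local_density(adj):
    """Return max over v and integer r >= 1 of |B(v, r)| / (2r), as an exact fraction."""
    best = Fraction(0)
    for v in adj:
        dist = bfs_distances(adj, v)
        farthest = max(dist.values())
        for r in range(1, max(farthest, 1) + 1):
            ball = sum(1 for d in dist.values() if d <= r)
            best = max(best, Fraction(ball, 2 * r))
    return best
```

For a star with four leaves, the ball of radius 1 around the center has five vertices, giving local density 5/2; a path on five vertices has local density 3/2.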
Given an algorithm A for approximating the bandwidth of caterpillars, for
every caterpillar G we consider the following four quantities: its local density ρG;
its bandwidth bG; its unfolded bandwidth uG; and the bandwidth of the linear
arrangement found by algorithm A, denoted by AG.
Clearly, ρG ≤ bG ≤ uG ≤ AG. Previously, it was known [3] that bG can
be as large as Ω(ρG log n/ log log n). Also, a simple approximation algorithm [9]
obtained AG ≤ O(ρG log n). We present an approximation algorithm for which
AG = O(ρG log n/ log log n), which is best possible (with respect to ρG). We
also show that the gap of Ω(log n/ log log n) may be present between every two
adjacent quantities in the above list. The gap of Ω(log n/ log log n) between
bG and uG is especially important, because we show that uG can in fact be
approximated within a constant factor. Without this gap between uG and bG,
this would improve the approximation ratio for the bandwidth of caterpillars
beyond log n/ log log n.
In a result of a somewhat different nature, we present an algorithm that
achieves a $(1 + \epsilon)$ approximation ratio for the bandwidth of trees (and hence
also caterpillars) in time $2^{\tilde{O}(\sqrt{n/\epsilon})}$. This result generalizes to families of graphs
that have recursive decomposition with separators of size $\tilde{O}(\sqrt{n})$, including for
example planar graphs. These results show in particular that in any reduction
from 3SAT showing the hardness of approximating the bandwidth of caterpillars
or trees, the number of vertices in the resulting graph will be at least quadratic
in the size of the 3CNF formula, unless 3SAT has subexponential algorithms.
1.1 Related Work
Chinn et al. [2] showed that trees with constant local density can have bandwidth
as large as Ω(log n). Chung and Seymour [3] exhibited a family of caterpillars
with constant local density and bandwidth $\Omega(\frac{\log n}{\log\log n})$.
Monien [11] showed that the bandwidth problem on caterpillars is NP-hard.
We note here that the reduction crucially uses a gadget that forces the bandwidth
to be folded. Blache, Karpinski and Jürgen [1] showed that the bandwidth of
trees is hard to approximate within some constant factor. Unger [14] claimed
(without proof) that the bandwidth of caterpillars is hard to approximate within
any constant factor.
Haralambides et al. [9] showed that for caterpillars, folding strands to one
side is an O(log n)-approximation with respect to the local density. For general
graphs, Feige [5] gave an $O(\log^{3.5} n \sqrt{\log\log n})$ algorithm with respect to
the local density lower bound (slightly improved in [10]). A somewhat improved
approximation ratio of $O(\log^{3} n \sqrt{\log\log n})$ with respect to a semidefinite
programming lower bound was given by Dunagan and Vempala [4]. Gupta [8]
gave an $O(\log^{2.5} n)$-approximation algorithm on trees, that on caterpillars gives
an O(log n) approximation. Filmus [7] extended this O(log n)-approximation to
graphs formed by many caterpillars sharing a backbone vertex.
An exact algorithm for general graphs, running in time approximately $n^B$,
was given by Saxe [13], where B is the optimal bandwidth. Feige and Killian [6]
gave a $2^{O(n)}$ time exact algorithm.
2 New Results
2.1 More Definitions
Definition 5. A bucket arrangement of a graph is a placement of its vertices
into consecutive buckets, such that the endpoints of an edge are either in the same
bucket or in adjacent buckets. The bucketwidth is the number of vertices in the
most loaded bucket. The bucketwidth of a graph is the minimum bucketwidth of
all bucket arrangements of the graph.
The following lemma is well known.
Lemma 1. For every graph, its bandwidth and bucketwidth differ by at most a
factor of 2.
Proof. Given a linear arrangement of bandwidth b, make every b consecutive
vertices into a bucket. Given a bucket arrangement with bucketwidth b, create
a linear arrangement bucket by bucket, where vertices in the same bucket are
numbered in arbitrary order.
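Both directions of this proof are short enough to sketch in code. The representation (a linear arrangement as a left-to-right list of vertices, a bucket arrangement as a list of lists) and all function names are assumptions made for this illustration.

```python
def bandwidth(edges, order):
    """Bandwidth of a linear arrangement given as a left-to-right list of vertices."""
    position = {v: i for i, v in enumerate(order)}
    return max(abs(position[u] - position[v]) for u, v in edges)


def buckets_from_linear(order, b):
    """Make every b consecutive vertices of the arrangement into one bucket."""
    return [order[i:i + b] for i in range(0, len(order), b)]


def linear_from_buckets(buckets):
    """Number vertices bucket by bucket, in arbitrary order inside each bucket."""
    return [v for bucket in buckets for v in bucket]


def is_bucket_arrangement(edges, buckets):
    """Check that end points of every edge lie in the same or adjacent buckets."""
    index = {v: i for i, bucket in enumerate(buckets) for v in bucket}
    return all(abs(index[u] - index[v]) <= 1 for u, v in edges)


def bucketwidth(buckets):
    """Number of vertices in the most loaded bucket."""
    return max(len(bucket) for bucket in buckets)
```

If the arrangement has bandwidth b, chunking into buckets of size b yields bucketwidth at most b; conversely, buckets of width b give a linear arrangement of bandwidth at most 2b − 1.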

As we shall be considering approximation ratios which are much worse than 2,
we will consider bandwidth and bucketwidth interchangeably.
Definition 6. The unfolded bucketwidth of a caterpillar is the minimum buck-
etwidth in a bucket arrangement in which every backbone vertex lies in a different
bucket. Hence backbone vertices are placed in order.
Lemma 2. For every caterpillar, its unfolded bandwidth and unfolded buck-
etwidth differ by at most a factor of 2.
Proof. Given an unfolded linear arrangement of bandwidth b, let every backbone
vertex start a new bucket, called a backbone bucket. In regions with no backbone
vertex (the leftmost and rightmost regions of the linear arrangement), make every
b consecutive vertices into a bucket. It may happen that strands (say, k of them)
jump over a backbone bucket in this bucket arrangement, but then this backbone
bucket has load at most b − k. For each such strand, shift nodes so as to move
one vertex from the outermost bucket of the strand to the separating backbone
bucket. We omit the details from this extended abstract.
Given an unfolded bucket arrangement with bucketwidth b, create a linear
arrangement bucket by bucket, where vertices in the same bucket are numbered
in arbitrary order.

2.2 Approximating Unfolded Bandwidth
Theorem 1. The unfolded bucketwidth of a caterpillar can be approximated
within a constant factor.
Proof. Up to a factor of two in the resulting bandwidth, we may assume that
a strand is folded either to the right, or to the left (but not partly to the right
and partly to the left), i.e. all buckets containing a node from a strand H lie
on one side of the bucket containing its root. We call such an unfolded bucket
arrangement locally consistent. We now formulate the problem of finding a locally
consistent unfolded bucket arrangement of minimum bucketwidth as an integer
program.
The variables of the integer program are of the form xik, where i specifies a
vertex and k specifies a bucket number. Our intention is that xik = 1 if vertex
i is in bucket k, and xik = 0 otherwise. Hence we shall have the integrality
constraints:
xik ∈ {0, 1} (1)
For concreteness, we assume that vertices of the caterpillar are named in the
following order. First, the backbone vertices are named from left to right as 1 up
to $\ell$ (where $\ell$ is the number of backbone vertices), and then the strands of the
caterpillar are named one by one, where within a strand, vertices are named in
order of increasing distance from the root of the strand. (This naming should not
be confused with a linear arrangement. It is just a convention used in formulating
the integer program.) We now place the vertices of the backbone in consecutive
buckets. This gives the backbone constraints:
$$x_{ii} = 1 \quad \text{for all } 1 \le i \le \ell \qquad (2)$$
Since vertices of strands might be placed in buckets to the left and to the
right of the endpoints of the backbone, the parameter k in $x_{ik}$ specifying the
bucket number need not be limited to the range 1 up to $\ell$, but will be allowed
to range between −n and n.
Now we are ready to present the most important set of constraints. They
are more complicated than may appear necessary for the integer program, but
this will become useful once we relax the integer program to a polynomial time
solvable linear program.
Consider two vertices i, i + 1 on the same strand, rooted at the backbone
vertex r. Then we have the root constraints:
x(i+1)r ≤ xir (3)
Moreover, let k be a bucket to the right of bucket r, namely, k > r. (Later we
will present analogous constraints for k < r.) For every t ≥ k we have the right
consistency constraints:
$$\sum_{l=k}^{t} x_{(i+1)l} \le \sum_{l=k-1}^{t} x_{il} \quad \text{and} \quad \sum_{l=k}^{t} x_{il} \le \sum_{l=k}^{t+1} x_{(i+1)l} \qquad (4)$$

The consistency constraints indicate that (i+1) appears in a bucket to the right
of r only if i preceded it, and that i appears only if (i + 1) follows it.
Similar constraints are written for buckets to the left of the root. Namely,
for k < r and t ≤ k we have the left consistency constraints:
$$\sum_{l=t}^{k} x_{(i+1)l} \le \sum_{l=t}^{k+1} x_{il} \quad \text{and} \quad \sum_{l=t}^{k} x_{il} \le \sum_{l=t-1}^{k} x_{(i+1)l} \qquad (5)$$
Finally, we introduce a constraint specifying that the locally consistent
unfolded bucketwidth is at most b. Namely, for every bucket k there is the
bucketwidth constraint:
$$\sum_{i} x_{ik} \le b \qquad (6)$$
The above integer program is feasible if and only if there is a locally consistent
bucket arrangement of bucketwidth at most b. We now relax the integer program
to a linear program by replacing the integrality constraints by nonnegativity
constraints xik ≥ 0 and choice constraints, namely, for every vertex i on a strand
of length t rooted at r we have:
$$\sum_{k=r-t}^{r+t} x_{ik} = 1 \qquad (7)$$
As the linear program is a relaxation of the integer program, it is feasible
whenever the integer program is. As linear programs can be solved in polynomial
time, we can obtain a feasible solution to the linear program of bucketwidth
at most $b^*$, where $b^*$ is the minimum locally consistent unfolded bucketwidth.
The solution to the linear program is fractional, in the sense that a vertex may
fractionally belong to several buckets. We now show how to round the fractional
solution to an integer solution, losing only a constant factor in the bucketwidth.
Consider an arbitrary solution $x_{ik}$ to the linear program. Consider a strand
S rooted at r. Consider the smallest k > r such that $\sum_{i \in S} x_{ik} \le 1/4$. We claim
that $\sum_{i \in S,\, k' \ge k} x_{ik'} \le |S|/4$. This claim follows from the following inequality:
$$\sum_{k' > k} x_{ik'} \le \sum_{j \in S;\, j < i} x_{jk}$$
This inequality can be proved by induction on i. For the first vertex on the
strand, this inequality is true because then $\sum_{k' > k} x_{ik'} = 0$. Assume now that
the inequality was proved for vertex i and then the right consistency constraints
and the inductive hypothesis imply for vertex i + 1 that:
$$\sum_{k' > k} x_{(i+1)k'} \le \sum_{k' \ge k} x_{ik'} \le x_{ik} + \sum_{j \in S;\, j < i} x_{jk} = \sum_{j \in S;\, j < i+1} x_{jk}$$
Similar to the above, consider the largest k < r with $\sum_{i \in S} x_{ik} \le 1/4$. It can be
shown that $\sum_{i \in S,\, k' \le k} x_{ik'} \le |S|/4$.

It follows that there is some k (w.l.o.g. we will assume here that k > r) such
that $\sum_{i \in S;\, r \le k' < k} x_{ik'} \ge |S|/4$, and $\sum_{i \in S} x_{ik'} \ge 1/4$ for all $r \le k' < k$. Now place
the vertices of S one by one in the buckets starting at r and continuing to the
right, putting $\lceil 4 \sum_{i \in S} x_{ik'} \rceil$ vertices in bucket $k'$. The whole strand can be put
in these buckets before reaching bucket k. Each bucket suffered a multiplicative
factor of at most 8 (rounding up to the nearest integer contributes a factor of at
most 2, because $4 \sum_{i \in S} x_{ik'} \ge 1$).
Finally, there is the following issue that we ignored in the proof above, and
this is the fact that for the root bucket r, it might not be the case that
$\sum_{i \in S} x_{ir} \ge 1/4$. In that case, the integer solution might put a vertex of S in bucket r
that we cannot charge against the fractional values of S that were in bucket r.
However, this cannot harm the approximation ratio by more than a constant
factor, because the number of vertices added by this effect to bucket r cannot
be more than twice the local density at r.

2.3 Upper Bound
In this section, we present an algorithm for the bandwidth problem on
caterpillars and show that it is an $O(\frac{\log n}{\log\log n})$ approximation.
We shall give a bucket arrangement of caterpillar T . More precisely, we shall
define maps x : V → [n] and y : V → [n] such that
– The function f : V → [n] × [n] defined as f(v) = (x(v), y(v)) is one-one.
– For every edge (u, v) ∈ E, |x(u) − x(v)| ≤ 1.
In other words, we shall place the vertices of the graph on the integer grid
so that adjacent vertices land either in the same column or in adjacent columns.
The goal would be to minimize the maximum height of any column.
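The two conditions on the maps x and y can be checked mechanically. The sketch below is illustrative only: it assumes x and y are dictionaries from vertices to positive integers, and it takes the height of a column to be the largest y-value used in it (the names `is_valid_placement` and `max_column_height` are not from the paper).

```python
def is_valid_placement(edges, x, y):
    """Check that f(v) = (x(v), y(v)) is one-one and that adjacent vertices
    land in the same column or in adjacent columns."""
    cells = [(x[v], y[v]) for v in x]
    one_to_one = len(set(cells)) == len(cells)
    columns_ok = all(abs(x[u] - x[v]) <= 1 for u, v in edges)
    return one_to_one and columns_ok


def max_column_height(x, y):
    """Largest row index used in any column; the quantity to be minimized."""
    heights = {}
    for v, col in x.items():
        heights[col] = max(heights.get(col, 0), y[v])
    return max(heights.values())
```

For example, the path a, b, c, d placed with b and c stacked in column 2 is a valid placement of maximum column height 2.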
Let T be a caterpillar with backbone B = {1, . . . , p} and strands
$\mathcal{H} = \{H_1, \ldots, H_q\}$, where each $H_i$ can be specified by its root $r_i \in B$ and length
$l_i$. Let $k = \frac{2 \log n}{\log\log n}$ so that $k^k > n$. For an assignment f and for a strand
$H_i$ let the height $\alpha_i$ of $H_i$ be $\max_{v \in H_i} y(v)$ and let the range $\beta_i$ of $H_i$ be
$\max_{v \in H_i} |x(v) - x(r_i)|$.
The algorithm proceeds as follows. Let $\mathcal{H}_s = \{H_i \in \mathcal{H} : k^{s-1} \le l_i < k^s\}$
(note that $\mathcal{H}_k$ is empty). We start out by setting f(j) = (j, 1) for j ∈ B. The
algorithm has k phases. In phase s, we consider the strands $H_i \in \mathcal{H}_s$ in increasing
order of $r_i$ (breaking ties arbitrarily). We fold each strand to its right so as to
minimize the maximum height of any column, and subject to that, minimize
its range. Amongst the arrangements minimizing the height and the range, we
break ties to the left, i.e. fill up columns from left to right.
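The phase structure (which strands are handled in which phase, and in what order) can be sketched as follows. This covers only the grouping by length class, not the folding step itself. Logarithms are assumed to be base 2, strands are assumed to be given as `(root, length)` pairs, and all function names are illustrative.

```python
import math


def phase_parameter(n):
    """k = 2 log n / log log n, with logs to base 2 (an assumption); needs n >= 3."""
    return 2 * math.log2(n) / math.log2(math.log2(n))


def length_class(length, k):
    """Return the phase s with k^(s-1) <= length < k^s, for length >= 1."""
    s = 1
    while k ** s <= length:
        s += 1
    return s


def group_by_phase(strands, k):
    """Group (root, length) pairs by phase, each group sorted by increasing root."""
    phases = {}
    for root, length in strands:
        phases.setdefault(length_class(length, k), []).append((root, length))
    return {s: sorted(group, key=lambda strand: strand[0])
            for s, group in sorted(phases.items())}
```

Processing the groups in increasing order of s, and each group from left to right, reproduces the order in which the algorithm considers strands.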
We now show that the mapping f satisfies $bw_f \le O(\frac{\log n}{\log\log n} \rho_G)$. We denote
by $f^{-1}(X, Y) = \{v : f(v) = (x, y), x \in X, y \in Y\}$ the set of nodes assigned to
columns X and rows Y.
Let $I_t$ denote the interval $[4(k + t - 1)\rho_G + 1, 4(k + t)\rho_G]$. For a column j, let
$\mathcal{H}^j_t = \{H_i \in \mathcal{H}_t : H_i \cap f^{-1}(\{j\}, I_t) \ne \emptyset\}$ be the set of strands in $\mathcal{H}_t$ that contribute
to $f^{-1}(\{j\}, I_t)$.
The tie breaking rule ensures the following.

Observation 2. Just after $H_i$ is assigned, if $H_i$ contributes to columns j and
j′, j < j′, then the height of j′ is no larger than that of j.
Next we show a few simple lemmas.
Lemma 3. Every strand $H_i \in \mathcal{H}^j_t$ has range $\beta_i$ at most $k^{t-1}$.
Proof. For every $H_i \in \mathcal{H}^j_t$, it must be the case that for all j′ with $r_i \le j' \le j$,
the column j′ has height at least $4k\rho_G$ after $H_i$ was considered. Consider the
set of vertices $f^{-1}([r_i, j], [1, 4k\rho_G])$ assigned to columns $r_i$ through j and rows
1 through $4k\rho_G$ in this partial assignment. Because of the order in which we
consider strands, each of these vertices comes from a strand of length less than
$k^t$ and hence the root of every such strand must lie in $[r_i - k^t, j]$. Thus every
such vertex lies in $B(r_i, 2k^t)$ and there are at least $2k\rho_G |j - r_i| = 2k\rho_G \beta_i$ such
vertices. Since $|B(r_i, 2k^t)| \le 4k^t \rho_G$, it follows that $\beta_i \le k^{t-1}$ for every $H_i \in \mathcal{H}^j_t$.

Corollary 1. For every j, t, $|\mathcal{H}^j_t| < 4\rho_G$.
Proof. Every strand $H_i \in \mathcal{H}^j_t$ has its root in $[j - k^{t-1}, j]$. Since each has length
at least $k^{t-1}$, $B(j, 2k^{t-1})$ has size at least $(1 + |\mathcal{H}^j_t| k^{t-1})$. Hence, $|\mathcal{H}^j_t| < 4\rho_G$.

Lemma 4. At the end of phase s of the algorithm, the height of (partial) as-
signment is at most 4(k + s)ρG.
Proof. We show the claim by induction on s. In the base case s = 0, each column
has one node and since ρG is at least 1, the claim holds.
Suppose the claim is true at end of phase (t − 1). We wish to show that it is
also true at the end of phase t.
Consider column j and let $\mathcal{H}^j_t$ be as above. We first argue that each strand
in $\mathcal{H}^j_t$ contributes at most one vertex to column j. Assume the contrary and let
$H_i$ be the first strand in $\mathcal{H}^j_t$ that contributes at least 2 vertices to column j.
Further, let j be the first such column. Then there is a column j′ > j such that
just before $H_i$ was assigned, j′ had height at least one more than that of j. Since
we look at strands in each class from left to right, every strand contributing to
$f^{-1}(\{j'\}, I_t)$ at this point has its root to the left of j. Thus by Observation 2, j′
has height no larger than j, contradicting the assumption. Hence the claim.
By Corollary 1, $\mathcal{H}^j_t$ has fewer than $4\rho_G$ strands and hence the induction holds.
The claim follows.

We remark that we have actually shown an O(log h/ log log h)-approximation,
where h is the longest strand length. Note that our algorithm outputs an unfolded
arrangement. We show in the next section that it is not possible to beat the
$O(\frac{\log n}{\log\log n})$ barrier when outputting an unfolded arrangement.
Finally, we note that while this algorithm never folds the backbone, it could be
far from optimal for the unfolded bandwidth problem.
Proposition 1. The algorithm of the above theorem may output a linear ar-
rangement with bandwidth Ω(log n/ log log n) times the unfolded bandwidth.

Proof. The instance has a backbone of length $2^{k+1} - 1$ and a strand of length
$i 2^i$ at vertex $2^i$ and vertex $2^{k+1} - 2^i$. Folding (one vertex per bucket) the first
half of the strands to the right and the second half of the strands to the left
gives an arrangement with unfolded bucketwidth O(log k). On the other hand,
our algorithm (in fact any algorithm folding all strands to the right) gives a
bucketwidth of k. Since k is O(log n), the claim follows.

2.4 Gap Between Bandwidth and Unfolded Bandwidth
We construct a sequence of caterpillars C1, C2, . . . such that Ck has bucketwidth
O(k) and unfolded bucketwidth $\Omega(k^2)$.
C1 is a caterpillar with a backbone of length three and p strands of length
1 attached to the middle node. The first and last nodes on the backbone are
designated s1 and t1 respectively (see figure 1).
Ck is constructed by joining in series, a path P1 of length (lk + wk), a copy
T1 of Ck−1, a path P2 of length 2wk, another copy T2 of Ck−1 and finally a path
P3 of length (lk + wk). Moreover, p strands each of length lk are attached at
the first and last vertices of P2 (see figure 1). The first vertex of P1 and the last
vertex of P3 are named sk and tk respectively. We refer to the backbone vertices
in T1, P2 and T2 as the core of Ck. The strands attached to P2 are referred to
as central strands.
Here $w_k$ and $l_k$ are parameters that we define as follows: $l_1 = 1$, $w_1 = 0$,
$w_k = 2l_{k-1} + 2w_{k-1}$, $l_k = 6k\,l_{k-1}$. Note that the length of the backbone satisfies
the relation $B_k = 4w_k + 2l_k + 2B_{k-1}$. It follows that
Observation 3. The length $B_k$ of the backbone of $C_k$ is at most $4l_k$.
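The recurrences above can be checked numerically. The following is a minimal sketch, not part of the paper: the function name is hypothetical, and the base value B_1 = 3 is an assumption that reads "backbone of length three" as three backbone nodes.

```python
def caterpillar_parameters(k_max):
    """Tabulate w_k, l_k and B_k for k = 1..k_max from the recurrences in the text."""
    w = {1: 0}
    l = {1: 1}
    B = {1: 3}  # assumption: C_1 has a backbone of length three
    for k in range(2, k_max + 1):
        w[k] = 2 * l[k - 1] + 2 * w[k - 1]
        l[k] = 6 * k * l[k - 1]
        B[k] = 4 * w[k] + 2 * l[k] + 2 * B[k - 1]
    return w, l, B


w, l, B = caterpillar_parameters(8)
# Observation 3: the backbone length never exceeds 4 * l_k.
violations = [k for k in B if B[k] > 4 * l[k]]
```

For the first few values this gives w_2 = 2, l_2 = 12, B_2 = 38 and w_3 = 28, l_3 = 216, B_3 = 620, and the list of violations stays empty.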
We first show that there is an arrangement with small bucketwidth.
Lemma 5. There is a valid bucket arrangement of Ck with bucketwidth p + 2k
into at most wk+1 = 2lk +2wk buckets with sk and tk in the first and last buckets
respectively.
Proof. The proof uses induction. The base case is immediate.
Suppose that there is such a bucket arrangement for Ck−1. We use it to
construct a bucket arrangement for Ck (see figure 2). We place the first path
P1 into buckets 1 through (lk + wk). The recursive arrangement of T1 is placed
in buckets (lk + wk + 1) through (lk + 2wk). We place the path P2 in buckets
(lk + 2wk) to (lk + 1). T2 is placed similarly in buckets (lk + 1) to (lk + wk) and
we place P3 starting at (lk + wk + 1) and ending at bucket 2(lk + wk). Each
central strand rooted at the endpoint of P2 in bucket (lk + 2wk) spans buckets
(lk +2wk +1) through 2(lk +wk) and each central strand rooted at the endpoint
in bucket lk + 1 goes into buckets lk through 1. It is easy to verify that this
is a legal bucket arrangement. Moreover, the maximum height of any bucket
increases by at most 2, and hence the induction holds.

We now show that the unfolded bucketwidth of Ck is large. Consider an
unfolded bucket arrangement f. Without loss of generality, sk falls to the left of
tk. A bucket is called a core bucket if it contains a vertex from the core. Other
buckets are referred to as peripheral. Note that the core buckets fall between the
left set and the right set of peripheral buckets.
Lemma 6. In any unfolded bucket arrangement of $C_k$, some core bucket has
load more than $pk/3$.
Proof. We prove this by induction on $k$.
In the base case $k = 1$, we simply use a local density argument. There are
$p + 3$ nodes and since the diameter is 2, they all fall in at most 3 buckets. The
claim follows.
Suppose the claim holds for $C_{k-1}$ and let $f$ be an unfolded bucket arrangement
for $C_k$. Suppose that $f$ has bucketwidth less than $pk/3$. Consider the $2p$
central strands of $C_k$.
Proposition 2. In any arrangement with bucketwidth less than $pk/3$, at least
$2p/3$ central strands extend to peripheral buckets.
Proof. Suppose not. Then at least $4p/3$ central strands are wholly contained in the
core buckets. The number of core buckets is at most $2(w_k + B_{k-1})$ and each has
at most $pk/3$ nodes in the arrangement $f$. However, since each strand has length
$l_k$ and $(4p/3)\,l_k > 2(w_k + B_{k-1})(kp/3)$, we get a contradiction.

Thus $2p/3$ strands extend to peripheral buckets. Without loss of generality, at
least $p/3$ of these strands extend to the left set of peripheral buckets. Each of
these strands contributes at least 1 to the buckets containing the backbone of
$T_1$. Since $f$ also induces an unfolded bucket arrangement of $T_1$, the induction
hypothesis implies that one of the core buckets of $T_1$ must have load at least
$p(k-1)/3$ from vertices in $T_1$. Adding the load due to the $p/3$ crossing strands, we
get an overall load of at least $pk/3$. The claim follows.

Taking $p$ to be $k$, we get a gap of $\Omega(k)$. Since $l_k \le 6^k k!$, the number of
vertices in $C_k$ is $2^{O(k \log k)}$. Hence we get a gap of $\Omega(\log n/\log\log n)$. Since (unfolded)
bucketwidth and (unfolded) bandwidth are within a constant factor of each other,
the gap between unfolded bandwidth and bandwidth is also $\Omega(\log n/\log\log n)$.
3 An Approximation Scheme
In this section, we give an approximation scheme for the bandwidth problem
on trees that gives a $(1+\epsilon)$-approximation in time $2^{\tilde{O}(\sqrt{n})}$. We note that the
algorithm described here is more complicated than needed; the extra work here
enables us to extend the algorithm to a larger class of graphs without any
modifications.
We first guess the bandwidth $B$. If $B \le \sqrt{n}$, the dynamic programming
algorithm of Saxe [13] finds the optimal bandwidth arrangement in time $n^{O(B)}$,
which is $2^{\tilde{O}(\sqrt{n})}$ for $B$ as above.

Fig. 1. Construction of caterpillar Ck
Fig. 2. Low bucketwidth arrangement of caterpillar Ck
In case $B \ge \sqrt{n}$, we run a different algorithm. We shall construct an
assignment $f$ of $V$ into $K$ buckets $1, \ldots, K$ such that
– For any bucket $i$, the number of nodes assigned to it, $|f^{-1}(i)|$, is at most $\epsilon B$.
– For any $u, v \in V$ such that $(u, v) \in E$, $f(u)$ and $f(v)$ differ by at most $1/\epsilon$.
It is easy to see that if $G$ has bandwidth $B$, such an assignment always exists
for $K = \lceil n/(\epsilon B) \rceil$. Moreover, given such an assignment, we can find a bandwidth
$(1+\epsilon)B$ arrangement of $G$ by picking any ordering that respects the ordering
defined by $f$.
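The last step, turning a bucket assignment into a linear arrangement, can be sketched as follows. If every bucket holds at most c vertices and the endpoints of every edge lie at most g buckets apart, then any ordering that lists the buckets from left to right has bandwidth at most (g + 1)c − 1, since the two endpoints of an edge sit among at most (g + 1)c consecutively placed vertices. The function names and the toy path instance are illustrative assumptions, not taken from the paper.

```python
def ordering_from_buckets(f):
    """List vertices bucket by bucket; ties inside a bucket are broken by label."""
    return sorted(f, key=lambda v: (f[v], v))


def bandwidth(order, edges):
    """Largest distance in `order` between the endpoints of an edge."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[u] - pos[v]) for u, v in edges)


def capacity_and_gap(f, edges):
    """Return (largest bucket load c, largest bucket distance g across an edge)."""
    loads = {}
    for bucket in f.values():
        loads[bucket] = loads.get(bucket, 0) + 1
    c = max(loads.values())
    g = max(abs(f[u] - f[v]) for u, v in edges)
    return c, g


# Toy instance (hypothetical): a path on six vertices placed into three buckets.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
f = {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3}
c, g = capacity_and_gap(f, edges)
order = ordering_from_buckets(f)
bw = bandwidth(order, edges)
```

On this instance c = 2 and g = 1, so the bound is 3, while the ordering actually achieves bandwidth 1.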
Note that any tree $T$ has a vertex $v_T$ such that deleting $v_T$ from $T$ breaks it
up into components of size at most $2n/3$; we call such a vertex a balanced separator of $T$.
We recurse on each component until each component is a leaf. This sequence of
decompositions can be represented in the form of a rooted decomposition tree τ
with the following properties:
– Each node x of the decomposition tree τ corresponds to a subtree Tx and a
balanced separator vx of Tx.
– The children of an internal node x correspond exactly to the components
resulting from deleting vx from the tree Tx.
– The root node r corresponds to T and the leaves correspond to singletons.
– The depth of the tree is O(log n).
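The decomposition tree described above can be sketched as follows. This is a simple quadratic-time illustration under stated assumptions, not the paper's implementation: the separator is chosen by brute force as the vertex minimizing the largest remaining component (a centroid, which leaves components of size at most half), and all function names are hypothetical.

```python
def components_without(adj, nodes, v):
    """Connected components of the tree induced on `nodes` after deleting v."""
    remaining = set(nodes) - {v}
    comps = []
    while remaining:
        start = next(iter(remaining))
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w in remaining and w not in comp:
                    comp.add(w)
                    stack.append(w)
        remaining -= comp
        comps.append(comp)
    return comps


def balanced_separator(adj, nodes):
    """Vertex whose removal minimizes the size of the largest component."""
    def largest_component(v):
        return max((len(c) for c in components_without(adj, nodes, v)), default=0)
    return min(nodes, key=largest_component)


def decompose(adj, nodes):
    """Return the decomposition tree as nested (separator, children) pairs."""
    v = balanced_separator(adj, nodes)
    return (v, [decompose(adj, c) for c in components_without(adj, nodes, v)])


def depth(tree):
    """Number of levels of a decomposition tree."""
    _, children = tree
    return 1 + max((depth(c) for c in children), default=0)


# Toy instance (hypothetical): a path on seven vertices 0..6.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
tree = decompose(adj, set(range(7)))
```

For the seven-vertex path the root separator is the middle vertex 3, and the decomposition tree has three levels.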
A partial assignment is a partial function $f$ that maps a subset $D_f \subseteq V$ into
$K$ buckets $1, \ldots, K$. We say $f$ is feasible if for every $u, v \in D_f$ with $(u, v) \in E$, $f(u)$
and $f(v)$ differ by at most $1/\epsilon$. We say $g$ extends $f$ to $D'$ if $D_g = D_f \cup D'$ and $f$
agrees with $g$ on $D_f$. Given a partial assignment $f$, its profile $p_f$ is defined by
the $K$-tuple $(|f^{-1}(1)|, |f^{-1}(2)|, \ldots, |f^{-1}(K)|)$. For the purposes of our algorithm
two partial assignments are considered equivalent if they have the same profile.
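As a small illustration of these definitions, a partial assignment can be stored as a dictionary from vertices to bucket indices. The helper names and the parameter `max_gap` (the allowed bucket distance across an edge) are assumptions of this sketch.

```python
def is_feasible(f, edges, max_gap):
    """Check every edge with both endpoints in the domain of f."""
    return all(abs(f[u] - f[v]) <= max_gap
               for u, v in edges if u in f and v in f)


def profile(f, K):
    """K-tuple of bucket loads; buckets are numbered 1..K."""
    counts = [0] * K
    for bucket in f.values():
        counts[bucket - 1] += 1
    return tuple(counts)


# Two different partial assignments with the same profile are treated as equivalent.
f1 = {"a": 1, "b": 1, "c": 3}
f2 = {"a": 3, "b": 1, "c": 1}
same_profile = profile(f1, 3) == profile(f2, 3)
```

Here both assignments have profile (2, 0, 1), so the dynamic program would keep only one representative.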
Our algorithm does dynamic programming on the decomposition tree $\tau$ of $T$.
Let $x$ be an internal node of $\tau$ with children $y_1, \ldots, y_k$. Let $f^1_x, \ldots, f^{n_x}_x$ be the
set of all feasible partial assignments with domain consisting of all the separator
nodes on the $r$-$x$ path in $\tau$, i.e. $f^a_x$ has domain $P_x = \{v_z : z \text{ is on the } r\text{-}x \text{ path in } \tau\}$.
Given $x$, $a \le n_x$ and $j \le k$, we compute the list $L^a_{x,j}$ of all possible profiles of a
feasible extension $g$ of $f^a_x$ to $\cup_{i \le j} T_{y_i}$. We use the notation $L^a_x$ for $L^a_{x,k}$ if $x$ has $k$
children in $\tau$.
We populate the dynamic programming table from the leaves moving up the
tree as follows. We start with the obvious setting of $L^a_x$ for each leaf $x$ of $\tau$,
for each $a \in [n_x]$. For an internal node $x$, clearly $L^a_{x,0}$ is just the profile of $f^a_x$.
Given $L^a_{x,(j-1)}$ and $L^b_{y_j}$ for every feasible extension $f^b_{y_j}$ of $f^a_x$ to $v_{y_j}$, we show
how to compute $L^a_{x,j}$. The crucial fact here, ensured by our construction of the
decomposition tree, is that all the edges leaving $T_{y_j}$ are incident on $P_x$. Thus
given a feasible extension $g_1$ of $f^a_x$ to $\cup_{i<j} T_{y_i}$ and another feasible extension $g_2$ of
$f^a_x$ to $T_{y_j}$, they can be combined to give a feasible extension $g_3$ of $f^a_x$ to $\cup_{i \le j} T_{y_i}$.
Thus for every profile $p_1$ in $L^a_{x,(j-1)}$ and for every profile $p_2$ in any of the lists
$L^b_{y_j}$ where $f^b_{y_j}$ is a feasible extension of $f^a_x$, we get a profile $p_3$ to be placed in
$L^a_{x,j}$.
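The combination step can be sketched on profiles alone. The text does not spell out how $p_3$ is obtained from $p_1$ and $p_2$; the sketch below assumes that both profiles count the separator nodes of $P_x$, so the profile of $f^a_x$ (here `base_profile`) is subtracted once to avoid double counting. The function name is hypothetical.

```python
def combine_profile_lists(list_prev, list_child, base_profile):
    """All profiles p3 = p1 + p2 - base, for p1 in list_prev and p2 in list_child.

    Assumption: every profile in both lists already includes the nodes counted
    by base_profile, so they would otherwise be counted twice.
    """
    combined = set()
    for p1 in list_prev:
        for p2 in list_child:
            combined.add(tuple(a + b - c for a, b, c in zip(p1, p2, base_profile)))
    return combined


# Toy example with K = 2 buckets and a single separator node in bucket 1.
base = (1, 0)
prev_profiles = {(2, 0), (1, 1)}
child_profiles = {(1, 1)}
merged = combine_profile_lists(prev_profiles, child_profiles, base)
```

Storing the result as a set keeps one copy of each profile, which is what bounds the size of each list by the number of possible profiles.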
Finally, if $T$ has bandwidth $B$, at least one of the lists $L^a_r$ contains a profile
$p_f$ with $|f^{-1}(i)| \le \epsilon B$ for each $i$.
It remains to bound the running time of our algorithm. Since the depth of
the tree is $O(\log n)$, for every node $x$ in $\tau$, the number of partial assignments $n_x$
is at most $K^{O(\log n)}$. The number of nodes in $\tau$ is clearly at most $n$ and hence
the algorithm only needs to compute $nK^{O(\log n)}$ table entries. The size of each list
$L^a_{x,j}$ is bounded by the total number of possible profiles, which is no more than
$n^K$. Each entry of the table can thus be computed in time $O(n^{2K})$. Thus the
overall running time of the algorithm is $2^{O(K \log n)}$. Since $K = n/(\epsilon B)$ and $B \ge \sqrt{n}$,
the running time of our algorithm is $2^{\tilde{O}(\sqrt{n})}$. Thus we have shown that
Theorem 4. The algorithm described above computes a $(1+\epsilon)$-approximation
to the bandwidth of a tree in time $2^{\tilde{O}(\sqrt{n})}$.
We note that it is easy to modify the algorithm to also give an assignment
having such a profile.
Moreover, note that the only property of the tree we have used is the existence
of small separators. The algorithm can be naturally modified to work with
any family of graphs with (recursive) small separators; the number of partial
assignments to be considered for a table entry now goes up to $K^{O(t \log n)}$, where
$t$ is an upper bound on the size of the separator. This gives a $2^{\tilde{O}(t+\sqrt{n})}$ time
$(1+\epsilon)$-approximation algorithm for graphs with tree-width $t$. Recall that planar
graphs and other excluded minor families of graphs have separators of size
$O(\sqrt{n})$. Thus the running time of our algorithm for such graphs is $2^{\tilde{O}(\sqrt{n})}$.

Acknowledgements
We thank Moses Charikar, Marek Karpinski and Ryan Williams.
References
1. G. Blache, M. Karpinski, and J. Wirtgen. On approximation intractability of the
bandwidth problem. Technical report, University of Bonn, 1997.
2. P. Chinn, J. Chvatálová, A. Dewdney, and N. Gibbs. The bandwidth problem for
graphs and matrices - survey. Journal of Graph Theory, 6(3):223–254, 1982.
3. F. R. Chung and P. D. Seymour. Graphs with small bandwidth and cutwidth.
Discrete Mathematics, 75:113–119, 1989.
4. J. Dunagan and S. Vempala. On euclidean embeddings and bandwidth minimiza-
tion. In APPROX ’01/RANDOM ’01, pages 229–240, 2001.
5. U. Feige. Approximating the bandwidth via volume respecting embeddings. J.
Comput. Syst. Sci., 60(3):510–539, 2000.
6. U. Feige. Coping with the NP-hardness of the graph bandwidth problem. In SWAT,
pages 10–19, 2000.
7. Y. Filmus. Master’s thesis, Weizmann Institute, 2003.
8. A. Gupta. Improved bandwidth approximation for trees. In Proceedings of the
eleventh annual ACM-SIAM symposium on Discrete algorithms, pages 788–793,
2000.
9. J. Haralambides, F. Makedon, and B. Monien. Bandwidth minimization: An ap-
proximation algorithm for caterpillars. Mathematical Systems Theory, pages 169–
177, 1991.
10. R. Krauthgamer, J. Lee, M. Mendel, and A. Naor. Measured descent: A new
embedding method for finite metrics. In FOCS, pages 434–443, 2004.
11. B. Monien. The bandwidth minimization problem for caterpillars with hair length
3 is NP-complete. SIAM J. Algebraic Discrete Methods, 7(4):505–512, 1986.
12. C. Papadimitriou. The NP-completeness of the bandwidth minimization problem.
Computing, 16:263–270, 1976.
13. J. Saxe. Dynamic-programming algorithms for recognizing small-bandwidth graphs
in polynomial time. SIAM J. Algebraic Discrete Methods, 1:363–369, 1980.
14. W. Unger. The complexity of the approximation of the bandwidth problem. In
FOCS ’98: Proceedings of the 39th Annual Symposium on Foundations of Computer
Science, pages 82–91, 1998.

Where’s the Winner?
Max-Finding and Sorting with Metric Costs
Anupam Gupta1,
and Amit Kumar2
1
Dept. of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213
anupamg@cs.cmu.edu
2
Dept. of Computer Science Engineering, Indian Institute of Technology, Hauz Khas
New Delhi, India - 110016
amitk@cse.iitd.ernet.in
Abstract. Traditionally, a fundamental assumption in evaluating the performance
of algorithms for sorting and selection has been that comparing any two elements
costs one unit (of time, work, etc.); the goal of an algorithm is to minimize the
total cost incurred. However, a body of recent work has attempted to find ways to
weaken this assumption – in particular, new algorithms have been given for these
basic problems of searching, sorting and selection, when comparisons between
different pairs of elements have different associated costs.
In this paper, we further these investigations, and address the questions of max-
finding and sorting when the comparison costs form a metric; i.e., the comparison
costs cuv respect the triangle inequality cuv + cvw ≥ cuw for all input elements
u, v and w. We give the first results for these problems – specifically, we present
– An O(log n)-competitive algorithm for max-finding on general metrics, and
we improve on this result to obtain an O(1)-competitive algorithm for the
max-finding problem in constant dimensional spaces.
– An $O(\log^2 n)$-competitive algorithm for sorting in general metric spaces.
Our main technique for max-finding is to run two copies of a simple natural online
algorithm (that costs too much when run by itself) in parallel. By judiciously
exchanging information between the two copies, we can bound the cost incurred
by the algorithm; we believe that this technique may have other applications to
online algorithms.
1 Introduction
The questions of optimal searching, sorting, and selection lie at the very basis of the
field of algorithms, with a vast literature on algorithms for these fundamental prob-
lems [1]. Traditionally the fundamental assumption in evaluating the performance of
these algorithms has been the unit-cost comparison model – comparing any two ele-
ments costs one unit, and one of the goals is to devise algorithms that minimize the
total cost of comparisons performed. Recently, Charikar et al. [2] posed the following
problem: given a set V of n elements, where the cost of comparing u and v is cuv, how
should we design algorithms for sorting and selection so as to minimize the total cost of

Research partly supported by NSF CAREER award CCF-0448095, and by an Alfred P. Sloan
Fellowship.
all the comparisons performed? Note that these costs cuv are known to the algorithm,
which can use them to decide on the sequence of comparisons. The case where all com-
parison costs are identical is just the unit-cost model. To measure the performance of
algorithms in this model, Charikar et al. [2] used the framework of competitive analy-
sis: they compared the cost incurred by the algorithm to the cost incurred by an optimal
algorithm to prove the output correct.
The paper of Charikar et al. [2], and subsequent work by the authors [3] and Kannan
and Khanna [4] considered sorting, searching, and selection for special cost functions
(which are described in the discussion on related work). However, there seems to be
no work on the case where the comparison costs form a metric space, i.e., where costs
respect the triangle inequality cuv + cvw ≥ cuw for all u, v, w ∈ V . Such situations
may arise if the elements reside at different places in a communication network, and the
communication cost of comparing two elements is proportional to the distance between
them. An equivalent, and perhaps more natural way of enforcing the metric constraint
on the costs is to say that the vertices in V lie in an ambient metric space (X, d) (where
V ⊆ X), where the cost cij of comparing two vertices i and j is the distance d(i, j)
between them.
Our Results. In this paper, we initiate the study of these problems: in particular, we
consider problems of max-finding and sorting with metric comparison costs. For the
max-finding problem, our results show that the lower bound of Ω(n) on the competitive
ratio arises only for arguably pathological scenarios, and substantially better guarantees
can be given for metric costs.
Theorem 1. Max-finding with metric costs has an O(log n)-competitive algorithm.
We improve the result for the special case when the points are located in d-dimensional
Euclidean space:
Theorem 2. There is an $O(d^3)$-competitive randomized algorithm for max-finding
when the nodes lie on the d-dimensional grid and the distance between points is given
by the $\ell_\infty$ metric; this yields an $O(d^{3(1+1/p)})$-competitive algorithm for d-dimensional
$\ell_p$ space.
For the problem of sorting n elements in a metric, we give an O(log n)-competitive
algorithm for the case of hierarchically well-separated trees (HSTs). We then use stan-
dard results of Bartal [5], and Fakcharoenphol et al. [6] from the theory of metric em-
beddings to extend our results to general metrics. Our main theorem is the following:
Theorem 3. There is an $O(\log^2 n)$-competitive randomized algorithm for sorting with
metric costs.
It can be seen that any algorithm for sorting with metric costs must be Ω(log n)-
competitive even when the points lie on a star or a line – indeed, one can model unit-cost
sorting and searching sorted lists in these cases. The question of closing the logarithmic
gap between the upper and lower bounds remains an intriguing one.
Our Techniques. For the max-finding algorithm for general metrics, we use a simple
algorithm that uses O(log n) rounds, eliminating half the elements in each round while

paying at most OPT. Getting better results turns out to be non-trivial even for very
simple metrics: an illuminating example is the case where the comparison costs for
elements $V = \{v_1, v_2, \ldots, v_n\}$ are given by $c_{v_i v_j} = |i - j|$; i.e., when the metric is
generated by a path. (Note that this path has no relationship to the total order on V: it
merely specifies the costs.)
Indeed, an O(1)-competitive algorithm for the path requires some work: one natural
idea is to divide the line into two pieces, recursively find the maximum in each of these
pieces, and then compare the two maxima to compute the overall maximum. However,
a closer look indicates that this algorithm also gives us a competitive ratio of Ω(log n).
To fix this problem and reduce the expected cost to O(OPT ), we make a simple yet
important change: we run two copies of the above algorithm in parallel, transferring a
small amount of information between the two copies after every round of comparisons.
Remarkably, this subtle change in the algorithm gives us the claimed O(1)-competitive
ratio for the line. In fact, this idea extends to the d-dimensional grid to give us O(d)-
competitive algorithms – while the algorithm remains virtually unchanged, the proof
becomes quite non-trivial for d-dimensions.
For the results on sorting, we first develop an algorithm for the case when the met-
ric is a k-HST (which is a tree where the edges from each vertex to its children are
k times shorter than the edge to its parent). For these HSTs, we show how to implement
a “bottom-up mergesort” to get an (existentially optimal) O(log n)-competitive
algorithm; this is then combined with standard techniques to get the
$O(\log^2 n)$-competitiveness for general metrics.
Previous Work. The study of the arbitrary cost model for sorting and selection was
initiated by Charikar et al. [2]. They showed an O(n)-competitive algorithm for finding
the maximum for general metrics. Tighter upper and matching lower bounds (up to
constants) for finding the maximum were shown by Hartline et al. [7] and independently
by the authors [3].
In the latter paper [3], the authors considered the special case of structured costs,
where each element vi is assumed to have an inherent size si, and the cost cvivj of
comparing two elements vi and vj is of the form f(si, sj) for some function f; as ex-
pected, better results could be proved if the function f was “well-behaved”. Indeed, for
monotone functions f, they gave O(log n)-competitive algorithms for sorting, O(1)-
competitive algorithms for max-finding, and O(1)-competitive algorithms for selection
for the special cases of f being addition and multiplication. Subsequently, Kannan and
Khanna [4] gave an $O(\log^2 n)$-competitive algorithm for selection with monotone
functions f, and an O(1)-competitive algorithm when f was the min function.
Formal Problem Definition. The input to our problems is a complete graph G =
(V, E), with |V | = n vertices. These vertices represent the elements of the total or-
der, and hence each vertex v ∈ V has a distinct key value denoted by key(v). (We use
x ≺ y to denote that key(x) ≤ key(y).) Each edge e = (u, v) has non-negative length
or cost ce = cuv, which is the cost of comparing the elements u and v. We assume that
these edge lengths satisfy the triangle inequality, and hence form a metric space.
In this paper, we consider the problems of finding the element in V with the max-
imum key value, and the problem of sorting the elements in V according to their key
values. We work in the framework of competitive analysis, and compare the cost of the
comparisons performed by our algorithm to the cost incurred by the optimal algorithm
which knows the results of all pairwise comparisons and just has to prove that the
solution produced by it is correct. We shall denote the optimal solution by OPT ⊆ E.
Given a set of edges $E' \subseteq E$, we will let $c(E') = \sum_{e \in E'} c_e$, and hence c(OPT) is
the optimal cost. Note that a proof for max-finding is a rooted spanning tree of G with
the maximum element at the root, and where the key values of vertices monotonically
increase when moving from any leaf to the root; for sorting, a proof is the Hamilton
path where key values monotonically increase from one end to the other.
2 Max-Finding in Arbitrary Metrics
For arbitrary metrics, we give an algorithm for finding the maximum element vmax of
the nodes in V ; the algorithm incurs cost at most O(log n) × c(OPT). Our algorithm
proceeds in stages. In stage i, we have a subgraph Gi = (Vi, Ei) such that Vi contains
vmax; here Vi ⊆ V , and Ei = Vi × Vi with the same costs as in G. We start with
G0 = G; in stage i, if Gi has a single node, then it must be vmax, else we do the
following steps.
1. Find a minimum cost almost-perfect matching Mi in Gi. (I.e., at most one node remains
unmatched.)
2. For every edge e = (u, v) ∈ Mi, compare the end-points of e. (If u is greater than v, then u
“wins” and v “loses”.)
3. Delete nodes which lost in the comparisons above to get the new set Vi+1 from Vi.
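The three steps above can be sketched as follows. The matching routine here is a swapped-in exponential-time dynamic program over subsets, used only to keep the example self-contained on small inputs; the text does not say which matching algorithm is intended. The function names and the dictionary `keys` are hypothetical.

```python
from functools import lru_cache


def min_cost_almost_perfect_matching(nodes, cost):
    """Exact min-cost matching leaving at most one node unmatched (small inputs only)."""
    nodes = list(nodes)
    n = len(nodes)

    @lru_cache(maxsize=None)
    def best(mask, may_skip):
        # mask marks the indices that are still unmatched
        if mask == 0:
            return 0, ()
        i = (mask & -mask).bit_length() - 1          # lowest unmatched index
        rest = mask & ~(1 << i)
        options = []
        if may_skip:                                  # leave node i unmatched
            options.append(best(rest, False))
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                c, pairs = best(rest & ~(1 << j), may_skip)
                options.append((c + cost(nodes[i], nodes[j]),
                                pairs + ((nodes[i], nodes[j]),)))
        return min(options, key=lambda option: option[0])

    return best((1 << n) - 1, n % 2 == 1)


def find_max(keys, cost):
    """Run matching rounds until one node survives; return (winner, total cost)."""
    alive = list(keys)
    total = 0
    while len(alive) > 1:
        _, pairs = min_cost_almost_perfect_matching(alive, cost)
        losers = set()
        for u, v in pairs:
            total += cost(u, v)                       # one comparison per matched edge
            losers.add(u if keys[u] < keys[v] else v)
        alive = [v for v in alive if v not in losers]
    return alive[0], total
```

For example, with four points at positions 1 to 4 on a line, cost |u − v| and keys increasing with position, `find_max({1: 10, 2: 20, 3: 30, 4: 40}, lambda u, v: abs(u - v))` returns the winner 4 at total cost 4, while the optimal proof costs 3.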
It is clear that the above algorithm correctly finds vmax; there are O(log n) rounds
since $|V_i| = \lceil n/2^i \rceil$, and hence the following lemma immediately implies that the cost
incurred is O(log n) × c(OPT), thus proving Theorem 1.
Lemma 1. The cost of edges in Mi is at most c(OPT).
Proof. Given any set of 2k vertices in a tree T , one can find a pairing of these vertices
into k pairs so that the paths in T between these pairs are edge disjoint (see, e.g., [8,
Lemma 2.4]). We use this to pair off the vertices of Vi in the tree OPT; the total cost
of the paths between them is an upper bound on a min-cost almost-perfect matching of
Vi. Finally, since the paths between the pairs are edge disjoint, their total cost is at most
c(OPT), and hence the cost of Mi is at most c(OPT) as well.

3 Finding the Maximum on a Line
To improve on the results of the previous section, let us consider the case of the line
metric; i.e., where the vertices V = {1, 2, . . ., n} lie on a line, and the cost of comparing
two elements in V is just the distance between them in this line. Let us assume that the
line is unweighted, and consecutive points in V are at unit distance from each other,
and hence cij = |i − j|; we will indicate how to remove this simplifying assumption at
the end of this section. We also assume that n is a power of 2. For an element x ∈ V
which is not the maximum, let g(x) be a nearest element to x which has a key greater
than key(x), and let d(x) be the distance between x and g(x). Observe that in OPT,
the parent of x must be at distance d(x) from x, and hence $c(\mathrm{OPT}) = \sum_{x \ne v_{\max}} d(x)$.

Let us first look at a naïve scheme: we start off with a division $D_1$ of the line
into two-node segments {[1, 2], [3, 4], . . ., [n − 1, n]}. In the next division $D_2$, we pair off
segments of $D_1$ to get n/4 segments {[1, 2, 3, 4], . . ., [n−3, n−2, n−1, n]}; similarly,
$D_i$ has $n/2^i$ segments, each with $2^i$ nodes. We maintain the invariant that we know
the maximum key element in each segment of $D_i$; when merging two segments, we
compute the maximum by comparing the maxima of the two segments. However, this is
just the algorithm of Section 2, and if we have 1 ≺ 2 ≺ · · · ≺ n, then c(OPT) = n − 1
but our scheme costs Ω(n log n).
An Algorithm Which Almost Works. A natural next attempt is to introduce randomization:
to form the division $D_2$ from $D_1$, we toss an unbiased coin: if the result is
“heads”, we merge [1, 2], [3, 4] into one segment (which we denote using the notation
[1-4]), merge [5, 6], [7, 8] into the segment [5-8], and so on. If the coin comes up
“tails”, we shift over by one: [1, 2] forms a segment by itself, and from then on, we
merge every two consecutive segments of $D_1$. Hence, with probability 1/2, the segments
in $D_2$ are {[1-4], [5-8], . . .}, and with probability 1/2, they are {[1-2], [3-6], [7-10], . . .}.
To get division $D_{i+1}$ from $D_i$, we flip an unbiased coin and either merge every pair of
consecutive segments of $D_i$ beginning with the first segment, or merge every pair of
consecutive segments starting at the second one. It is easy to see that all segments in
$D_i$, except perhaps the first and last ones, have $2^i$ nodes. Again, the natural randomized
algorithm is to maintain the maximum element in each segment of $D_i$: when combining
segments of $D_i$ to form segments of $D_{i+1}$, we compare the two maxima to find the
maximum of the newly formed segment. (We use stage i to refer to the comparisons
performed whilst forming $D_i$; note that stages begin at 1, and there are no comparisons
in the first stage.)
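A minimal simulation of this single-copy randomized scheme is sketched below. Positions are 0, ..., n − 1 with cost |a − b|. Two details are assumptions of the sketch rather than statements of the paper: the comparisons needed to find the maxima of the two-node segments of $D_1$ are charged at cost 1 each, and merging continues until a single segment remains. The function name is hypothetical.

```python
import random


def random_shift_max(keys, rng):
    """Randomly shifted segment merging; returns (index of the maximum, total cost)."""
    n = len(keys)
    total = 0
    maxima = []                           # maximum of each segment, left to right
    for i in range(0, n, 2):              # D_1: fixed segments [0,1], [2,3], ...
        if i + 1 < n:
            total += 1
            maxima.append(i if keys[i] > keys[i + 1] else i + 1)
        else:
            maxima.append(i)
    while len(maxima) > 1:
        shift = rng.randrange(2)          # 0 = "heads", 1 = "tails" (first segment stays alone)
        merged = maxima[:shift]
        for j in range(shift, len(maxima), 2):
            pair = maxima[j:j + 2]
            if len(pair) == 2:
                a, b = pair
                total += abs(a - b)       # compare the maxima of the two segments
                merged.append(a if keys[a] > keys[b] else b)
            else:
                merged.append(pair[0])    # unpaired last segment is carried over
        maxima = merged
    return maxima[0], total


winner, cost = random_shift_max([5, 1, 9, 3, 7, 2, 8, 6], random.Random(0))
```

Whatever the coin tosses, the returned index is that of the maximum key (index 2 in this example); only the cost depends on the random shifts.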
The correctness of the procedure is immediate; to calculate the expected cost
incurred, we charge the cost of a comparison to the loser – note that each node except
vmax pays for exactly one comparison. We would like to show that the expected cost
paid by x ∈ V in our algorithm is O(d(x)). Let $S_i(x)$ denote the segment of $D_i$
containing x; the size $|S_i(x)| \le 2^i$, and the length of $S_i(x)$ is $|S_i(x)| - 1$. Note that if
$2^k \le d(x) < 2^{k+1}$, then x definitely wins (and hence does not pay) in stages 1 through
k; the following lemma bounds the probability that it loses in any stage t ≥ k + 1.
(Recall that depending on the coin tosses, x may or may not lose to g(x).)
Lemma 2. Let $2^k \le d(x) < 2^{k+1}$. Then $\Pr[x \text{ loses in stage } t] \le 2^{-(t-k-2)}$.
Proof. Note that the lemma is vacuously true for $t \le k + 2$. Since $d(x) < 2^{k+1}$, nodes
x and g(x) must lie in the same or consecutive segments in stage k + 1. Now for x
to lose in stage t, it must not have lost in stages {k + 2, k + 3, . . . , t − 1}, and hence
the segments containing x and g(x) must not have merged in these stages. Since we
make independent decisions at each stage, the probability of this event happening is
$(1/2)^{(t-1)-(k+2)+1} = 2^{-(t-k-2)}$.

Since x may have to pay as much as $2^{t+1}$ if it loses in stage t, the expected cost for x is
$\sum_{t \ge k} \Theta(2^k)$, which may be as large as $\Theta(2^k \cdot (\log n - k))$. Before we indicate how to fix
the problem, let us note that our analysis is tight: for the example with 1 ≺ 2 ≺ · · · ≺ n,
the randomized algorithm incurs a cost Ω(n log n).

Two Copies Help: The Double-Random Algorithm. Let us modify the above algorithm
to maintain two independent copies of the line, which we call $L$ and $L'$. The
partitions in $L$ will be denoted by $D_1, D_2, \ldots$, while those in $L'$ will be called $D'_1, D'_2, \ldots$.
These partitions in the two lines are chosen independently of each other. Again, we
maintain the maximum element of each segment in $D_i$ and $D'_i$, but also exchange some
information between the lines. Consider the step of merging segments $S_1$ and $S_2$ to get
a segment $S$ in $D_{i+1}$, and let $x_i$ be the maximum element of $S_i$. Before we compare
$x_1$ and $x_2$, we check if $x_1$ has lost to an element $y \in S_2$ in some previous stage in $L'$:
in this case, we know that $x_1 \prec y \prec x_2$, and hence can avoid comparing $x_1$ and $x_2$.
Similarly, if $x_2$ has previously lost in $L'$ to some element $z \in S_1$, we can declare $x_1$
to be the maximum element of $S$. Only if neither of these fortuitous events occurs do we
compare $x_1$ and $x_2$. (The process for $L'$ is similar, and uses the information from $L$'s
previous rounds.) The correctness of the algorithm follows immediately.
Notice that each element x now loses exactly twice, once in each line, but the second
loss may be implicit (without an actual comparison being performed). As before, we say
that a node x loses in stage i of $L$ (or $L'$) if this is the first stage in which x loses in $L$
(or $L'$). The node x pays in stage i of $L$ (or $L'$) if x loses in stage i of $L$ (or $L'$) and an
actual comparison was made. While x loses twice, it may possibly pay only once.
Lemma 3. If x, y ∈ V are at distance d(x, y), then the probability (in either line) that
x and y lie in different segments in $D_i$ is at most $d(x, y)/2^{i-1}$.
Proof. Let $2^k \le d(x, y) < 2^{k+1}$; the statement is vacuously true for $i - 1 \le k$. In
stage $i - 1 > k$, the nodes x and y must lie in either the same or consecutive segments
in $D_{i-1}$. Now, if they were in different segments in $D_{i-1}$ (which inductively happens
with probability at most $d(x, y)/2^{i-2}$), the chance that these segments do not merge in
stage i is exactly 1/2, giving us the bound of $d(x, y)/2^{i-2} \times 1/2$.

Let the node g(x) lie to the left of x; the other case is symmetric, and proved identically.
Let the distance d(x) = d(x, g(x)) satisfy $2^k \le d(x) < 2^{k+1}$. Let h(x) be the nearest
point to the right of x such that x ≺ h(x), and let $2^m \le d(x, h(x)) < 2^{m+1}$. Note that if
x pays in stage t of $L$, then t ≤ m+3. Indeed, if the segment $S_{m+3}(x)$ is the leftmost or
the rightmost segment, then it either contains g(x) or h(x), so it must have paid by then.
Else, the length of $S_{m+3}(x)$ is $2^{m+3} - 1$, and since $d(g(x), h(x)) < 2^{m+1} + 2^{k+1} \le 2^{m+2}$,
the segment $S_{m+3}(x)$ must contain one of g(x) or h(x), so t ≤ m+3. Moreover,
since $S_t(x)$ must contain either g(x) or h(x), it follows that t > k. The following key
lemma shows us that the probability of paying in stage t ∈ [k + 1, m] is small. (An
identical lemma holds for $L'$.)
Lemma 4. For t ∈ [k + 1, m], $\Pr[x \text{ pays in stage } t \text{ of } L] \le 2^{-2(t-k)+5}$.
Proof (Lemma 4). Note that if x pays in stage t ≤ m of $L$, then x must have lost to
some element to its left in $L$, since $d(x, h(x)) \ge 2^m$. Depending on whether x loses in
$L'$ before stage t or not, there are two cases.
Case I: x has not lost in $D'_{t-1}$. This implies that x and g(x) lie in different segments
in $L'$, which by Lemma 3 has probability $\le d(x, g(x))/2^{t-2} \le 2^{-(t-k-3)}$. Now the
chance that x loses in $L$ in stage t is $2^{-(t-k-2)}$ (by Lemma 2). Since the partitions are
independently chosen, the two events are independent, which proves the lemma.

Case II: x has lost in stage $l \le t - 1$ in $L'$. Since $l \le m$, x must have lost to some
element y to its left in $L'$; this y is either g(x), or lies to the left of g(x). Consider
stage t − 1 in $L$: since the distance $d(y, x) < 2^l \le 2^{t-1}$, the three elements x, g(x)
and y lie in the union of two adjacent segments in $D_t$. Furthermore, x must lie in a
different segment from y and g(x), otherwise x would have already lost in $L$ in stage
t − 1. Recall that if x loses in stage t in $L$, it must lose to a node to its left – since
t ≤ m, h(x) is too far to the right. But this implies that $S_{t-1}(x)$ must merge in $L$ with
$S_{t-1}(y) = S_{t-1}(g(x))$; in this case, no comparisons would be performed since x had
already lost to y in $L'$.

Note that this lemma implies that the expected payment of x for stages k+1 through
m is at most $\sum_{t=k+1}^{m} 2^{-2(t-k)+5} \times 2^t = O(2^k)$. The expected payment in stages m+1
to m + 3 is at most $3 \cdot O(2^m) = O(2^k)$ by Lemma 2, which proves:
Theorem 4. The Double-Random algorithm is an O(1)-competitive algorithm for
max-finding on the line.
To end, note that the assumption of unit length edges can be removed: by scaling and
translation, all distances can be made integers. We can add dummy vertices at all
integers i that do not correspond to a vertex in V, where all these vertices have key values
less than those of all the non-dummy vertices. Running Double-Random on this
augmented line allows us to run the algorithm without increasing the cost. (We can even
reduce the space and time overhead by maintaining only segments containing at least one
non-dummy node.)
4 Max-Finding for Euclidean Metrics
In this section, we extend our algorithm for the line metric to arbitrary Euclidean metrics:
the basic idea of running two copies of the algorithm and judiciously exchanging
information will be used again, but the proof becomes substantially more involved. We
give the proof for the 2-d case; the proof for the general case is deferred to the final
version of the paper.
The General Double-Random Algorithm. As in the case of the line, we begin with the
simplifying assumption that the nodes in V form a subset of the unit-weight n × n grid;
we refer to this underlying grid as M = {1, 2, ..., n} × {1, 2, ..., n}. (This assumption
can be easily discharged, as for the case of the line; we omit the details here.) Hence
each point v ∈ V corresponds to a point $(v_x, v_y) \in M$, with $1 \le v_x, v_y \le n$. In fact, if
$P^x$ denotes the path along the x-axis from 1 to n, and $P^y$ denotes a similar path along
the y-axis, then we can identify the grid M with the cartesian product $P^x \times P^y$. To
construct partitions $D_1, D_2, \ldots$ of the grid, we build stage-i partitions $D^x_i$ and $D^y_i$ for
the paths $P^x$ and $P^y$: the rectangles in M's partition correspond to the products of the
segments in $D^x_i$ and $D^y_i$, and hence a square in $D_{i+1}$ is formed by merging at most four
squares in $D_i$.¹ The random partitioning schemes for $P^x$ and $P^y$ evolve independently
of each other.
¹ We abuse notation and say “squares” even though the pieces may be rectangles.
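The product structure of the grid partition can be sketched as follows. This is only an illustration under stated assumptions: each path partition is represented as a sorted list of segment left end-points (a hypothetical encoding), the coarse partition is a coarsening of the fine one, and the random scheme that generates the segments is the one used for the line and is not reproduced here. All function names are illustrative.

```python
from bisect import bisect_right

def segment_index(starts, coord):
    """Index of the segment containing coord; starts = sorted left end-points."""
    return bisect_right(starts, coord) - 1

def cell_of(point, starts_x, starts_y):
    """A grid cell is identified by (x-segment index, y-segment index)."""
    vx, vy = point
    return (segment_index(starts_x, vx), segment_index(starts_y, vy))

def fine_cells_inside(coarse_cell, fine_x, coarse_x, fine_y, coarse_y):
    """Fine cells whose product rectangle lies inside the given coarse cell."""
    cx, cy = coarse_cell
    xs = [i for i, s in enumerate(fine_x) if segment_index(coarse_x, s) == cx]
    ys = [j for j, s in enumerate(fine_y) if segment_index(coarse_y, s) == cy]
    return [(i, j) for i in xs for j in ys]
```

If each coarse segment is the union of at most two fine segments on each axis, `fine_cells_inside` returns at most four cells, which matches the "at most four squares" remark above.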

Again, we maintain the invariant that we know the maximum element in each square
of the partition $D_i$, and as in the case of the line, we do not want to perform three
comparisons when merging squares in $D_i$ to get $D_{i+1}$. Hence we maintain two independent
copies $M$ and $M'$ of the grid, with $D_i$ and $D'_i$ being the partitions in the two grids at
stage i. Suppose $\{x_1, x_2, x_3, x_4\}$ are the four maxima of four squares $S_i$ being merged
into a new square S in M: for each i ∈ [1, 4], we check if $x_i$ has lost to some y ∈ S in
a previous stage in $M'$, and if so, we remove $x_i$ from consideration; we finally compare
the $x_i$'s that remain. The correctness of the algorithm is immediate, and we just have to
bound the costs incurred.
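A minimal sketch of one merge step in M, consulting the record of losses from the other copy. The data layout (a dictionary of losses recorded in the second copy, a key dictionary, a distance callback) and the function name are assumptions made for illustration; comparing the surviving maxima sequentially is also an assumption, since the text only says that the remaining maxima are compared. Keys are assumed distinct.

```python
def merge_squares(maxima, square_points, lost_to_in_other, key, dist):
    """Merge sub-squares of one copy, using losses recorded in the other copy.

    maxima           -- current maximum of each sub-square being merged
    square_points    -- set of all points in the merged square S
    lost_to_in_other -- dict: point -> set of points it already lost to in the other copy
    key              -- dict: point -> key value (assumed distinct)
    dist             -- function (u, v) -> cost of comparing u and v
    Returns (winner, charges), where charges lists (loser, cost) pairs.
    """
    # Discard any maximum already known to be beaten by some y in S.
    survivors = [x for x in maxima
                 if not (lost_to_in_other.get(x, set()) & square_points)]
    # The true maximum of S is among `maxima` and never loses to a point of S,
    # so `survivors` is non-empty.
    winner, charges = survivors[0], []
    for x in survivors[1:]:
        cost = dist(x, winner)
        if key[x] > key[winner]:
            charges.append((winner, cost))  # the old winner loses and is charged
            winner = x
        else:
            charges.append((x, cost))       # x loses and is charged
    return winner, charges
```

In the example used for testing, two of four maxima are pruned by the other copy's record, so a single comparison is performed instead of three.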
4.1 Cost of the Double-Random Algorithm in Two Dimensions
We charge the cost of each comparison to the node that loses in that comparison, and
wish to upper bound the cost charged to any node p ∈ M. Let G(p) = {q | p ≺ q} be
the set of nodes with keys greater than p; fix a vertex g(p) ∈ G(p) closest to p, and let
d(p) be the distance between p and g(p), with $2^\ell \le d(p) < 2^{\ell+1}$. Since we focus on the
node p for the entire argument, we shift our coordinate system to be centered at p: we
renumber the vertices on the paths $P^x$ and $P^y$ so that the vertex p lies at the “origin” of
the 2-d grid. Formally, we label the nodes $p_x \in P^x$ and $p_y \in P^y$ as 0; the other vertices
on the paths are labeled accordingly. This naturally defines four quadrants as well. Let
$D^x_i$ be the projection of the partition $D_i$ on the line $P^x$, and $D^y_i$ be its projection on $P^y$.
(${D'}^x_i$ and ${D'}^y_i$ are defined similarly for the grid $M'$.)
Let us note an easy lemma, bounding the chance that p and g(p) are separated in the
partition $D_i$ in M. (Such a lemma holds for partition $D'_i$ of $M'$, of course.)
Lemma 5. Pr[p and g(p) lie in different squares of $D_i$] $\le 2^{-(i-\ell-3)}$.
Proof. Let the distance from p to g(p) along the two axes be $d_x = d(p_x, g(p)_x)$ and
$d_y = d(p_y, g(p)_y)$ with $\max\{d_x, d_y\} = d(p)$. By Lemma 3, the projections $p_x$ and
$g(p)_x$ do not lie in the same stage-i interval of $P^x$ with probability at most $d_x/2^{i-1}$.
Similarly, $p_y$ and $g(p)_y$ do not lie in the same stage-i interval of $P^y$ with probability
at most $d_y/2^{i-1}$; a trivial union bound implies that the probability that p and g(p) lie
in different squares of $D_i$ is at most $2d(p)/2^{i-1} < 2^{-(i-\ell-3)}$.
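Spelled out, the union bound in the proof of Lemma 5 uses only $d_x, d_y \le d(p)$ and $d(p) < 2^{\ell+1}$, where $\ell$ is the integer with $2^\ell \le d(p) < 2^{\ell+1}$:

```latex
\Pr[\,p \text{ and } g(p) \text{ lie in different squares of } D_i\,]
  \;\le\; \frac{d_x}{2^{i-1}} + \frac{d_y}{2^{i-1}}
  \;\le\; \frac{2\,d(p)}{2^{i-1}}
  \;<\; \frac{2\cdot 2^{\ell+1}}{2^{i-1}}
  \;=\; 2^{-(i-\ell-3)}.
```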

We now prove that the expected charge to a node p is $O(2^\ell)$, where the distance
between p and a closest point g(p) in the set G(p) = {q ∈ V | p ≺ q} lies in the
interval $[2^\ell, 2^{\ell+1})$. Let H(p) = G(p) − {g(p)}. Let $S_i(p)$ and $S'_i(p)$ be the squares in
$D_i$ and $D'_i$ respectively that contain p. We will consider two events of interest:
1. Let $A_i$ be the event that p pays in M in stage i, and the square $S_i(p)$ contains at
least one point from H(p), but does not contain g(p).
2. Let $B_i$ be the event that p pays in M in stage i, and $S_i(p)$ contains g(p).
Note that $A_i \cap B_i = \emptyset$; also, if p pays in stage i in M, then either $A_i$ or $B_i$ must occur.
Also, $\Pr[A_i \cup B_i] > 0$ only when p and some element of G(p) lie in the same square
in stage i in M: since any two points in such a square are at $\ell_\infty$ distance $\le 2^i - 1$ from
each other, and each element of G(p) has $\ell_\infty$ distance at least $2^\ell$ from p, it suffices to
consider the case $i > \ell$. Theorems 5 and 6 will show that
$\sum_i \Pr[A_i] \times 2^i + \sum_i \Pr[B_i] \times 2^i = O(2^\ell)$. This shows that p pays only $O(2^\ell)$ in M; a
similar bound holds for $M'$, which proves the claim that Double-Random is
O(1)-competitive in the case of two-dimensional grids.
Theorem 5. $\sum_i \Pr[A_i] \times 2^i = O(2^\ell)$.
Proof. Let us define two events $E_x$ and $E_y$. Let $E_x$ be the event that $p_x$ and $g(p)_x$ lie in
different segments in $D^x_i$, and $E_y$ be the event that $p_y$ and $g(p)_y$ lie in different segments
in $D^y_i$. Note that $A_i \setminus (E_x \cup E_y) = \emptyset$, and hence
$\Pr[A_i] \le \Pr[A_i \cap E_x] + \Pr[A_i \cap E_y]$    (4.1)
Let us now estimate $\Pr[A_i \cap E_x]$; the argument for the other term is similar. Assume
(w.l.o.g.) that $g(p)_x$ lies to the left of $p_x$, and let the points between $g(p)_x$ and $p_x$
(including $p_x$, but not including $g(p)_x$) in $P^x$ be labeled $p^1_x, p^2_x, \ldots, p^k_x = p_x$ from left
to right. Define $F_j$ as the event that the segment $S^x_i(p)$ in $D^x_i$ containing $p_x$ has $p^j_x$ as its
left end-point. Note that the events $F_j$ are disjoint, and $E_x = \cup_{j=1}^{k} F_j$. Thus it follows
that
$\Pr[A_i \cap E_x] = \sum_j \Pr[A_i \cap F_j] = \sum_j \Pr[A_i \mid F_j] \Pr[F_j]$.    (4.2)
If $F_j$ occurs then the end-points of the edge connecting $p^j_x$ and $p^{j-1}_x$ (where $p^0_x = g(p)_x$)
lie in different segments of $D^x_i$. Lemma 3 implies that this can happen with
probability at most $1/2^{i-1}$. Thus, we get
$\Pr[A_i \cap E_x] \le 2^{-(i-1)} \times \sum_j \Pr[A_i \mid F_j]$.    (4.3)
Define $I^j_i$ as the segment of length $2^i$ in $P^x$ containing $p_x$ and having $p^j_x$ as its left
end-point. Let q(i, j) ∈ H(p) be such that $q(i, j)_x \in I^j_i$ and $|q(i, j)_y - p_y|$ is minimum; in
other words, the point closest to the x-axis whose projection lies in $I^j_i$. If no such point
exists, then we say that q(i, j) is undefined. Let $\delta(i, j) = |q(i, j)_y - p_y|$ if q(i, j) is
defined, and ∞ otherwise. Notice that for a fixed j, δ(i, j) is a decreasing function of i.
Assume $F_j$ occurs for some fixed j: for $A_i \ne \emptyset$, $S_i(p)$ must contain a point in
H(p), and hence $\delta(i, j) \le 2^i$; let i(j) be the smallest value of i for which $\delta(i, j) \le 2^i$.
Due to δ(i, j) being a decreasing function in i, $\delta(i, j) > 2^i$ for all i < i(j), and for all
i ≥ i(j), $\delta(i, j) \le 2^i$. Now suppose i > i(j); note the strict inequality, which ensures
that q(i − 1, j) exists. Again assume that $F_j$ occurs: now for $A_i$ to occur, the square
$S_{i-1}(p)$ cannot contain any point of H(p). In particular, it cannot contain q(i − 1, j).
Lemma 6. If $F_j$ occurs and $i > \ell + 1$, then $p_x$ and $q(i-1, j)_x$ lie in the same segment
of $D^x_{i-1}$.
Proof. It will suffice to show the claim that the segment containing $p_x$ in $D^x_{i-1}$ also has
$p^j_x$ as the left end-point; since $q(i-1, j)_x$ also lies in this segment, the lemma follows.
To prove the claim, note that the distance
$|p^j_x - p_x| \le |g(p)_x - p_x| - 1 \le (2^{\ell+1} - 1) - 1 = 2^{\ell+1} - 2$. Since $i > \ell$, it follows that
$p_x$ and $p^j_x$ must lie in the same or in adjacent segments of $D^x_{i-1}$; we claim that the former
is true. Indeed, suppose they were in different segments: since the segment of $D^x_{i-1}$
containing $p^j_x$ must have width $2^{i-1} - 1 \ge 2^{\ell+1} - 1$, which is greater than $|p^j_x - p_x|$, it must
happen that $p^j_x$ lies in the interior of this segment, and hence $F_j$ could not occur.

Note that the fact that the projections of p and q(i − 1, j) on the x-axis lie in the same
segment implies that the projections $p_y$ and $q(i-1, j)_y$ on the y-axis must lie in different
segments of $D^y_{i-1}$. Since this event is independent of $F_j$, we can use Lemma 3 to bound
the probability: indeed, we get that for i > i(j),
$\Pr[A_i \mid F_j] \le \delta(i-1, j) / 2^{i-1}$.    (4.4)
We are now ready to prove the theorem.
$\sum_i 2^i \cdot \Pr[A_i \cap E_x] \le 2 \sum_i \sum_j \Pr[A_i \mid F_j]$    (from (4.3))
$= 2 \sum_j \sum_{i \ge i(j)} \Pr[A_i \mid F_j]$
$\le 2 \sum_j \left( 1 + \sum_{i > i(j)} \frac{\delta(i-1, j)}{2^{i-1}} \right)$    (from (4.4))
$\le 2 \sum_j 3 \le 6 \cdot 2^\ell$,
where the penultimate inequality follows from the fact that δ(i, j) is a decreasing function
of i, and hence $\sum_{i > i(j)} \delta(i-1, j)/2^{i-1}$ is dominated by a geometric sum. A similar
calculation proves that $\sum_i 2^i \cdot \Pr[A_i \cap E_y]$ is $O(2^\ell)$, which in turn completes the proof of
Theorem 5.
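The geometric domination used in the penultimate inequality can be made explicit. For i > i(j) we have $\delta(i-1, j) \le \delta(i(j), j) \le 2^{i(j)}$, because δ(·, j) is decreasing in its first argument and by the choice of i(j); the term for i = i(j) is bounded trivially by 1. A sketch of the resulting bound:

```latex
\sum_{i > i(j)} \frac{\delta(i-1,j)}{2^{\,i-1}}
  \;\le\; \sum_{i > i(j)} \frac{2^{\,i(j)}}{2^{\,i-1}}
  \;=\; \sum_{s \ge 0} 2^{-s}
  \;=\; 2,
\qquad\text{so}\qquad
1 + \sum_{i > i(j)} \frac{\delta(i-1,j)}{2^{\,i-1}} \;\le\; 3 .
```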

Now that we have bounded the charge to p due to the events $A_i$ by $O(2^\ell)$, we turn
our attention to the events $B_i$, and claim a similar result for this case.
Theorem 6. $\sum_i 2^i \cdot \Pr[B_i] \le O(2^\ell)$.
Proof (Theorem 6). Recall that if p loses in stage i, then $i > \ell$: hence we define a set
of events $E_{\ell+1}, \ldots, E_{i-3}$, where $E_j$ occurs if p loses in stage j of $M'$. Also, define the
event $E_0$ to occur if p does not lose in $M'$ till stage i − 3. Note that exactly one of these
events can occur, and hence
$\Pr[B_i] = \Pr[B_i \mid E_0]\Pr[E_0] + \sum_{j=\ell+1}^{i-3} \Pr[B_i \mid E_j]\Pr[E_j]$.    (4.5)
The next two lemmas give us bounds on the probability of each of the terms in the
summation.
Lemma 7. If $i > \ell + 1$, then $\Pr[B_i \mid E_0]\Pr[E_0] \le 2^{-(2i-2\ell-10)}$.
Proof. Lemma 5 implies that $\Pr[E_0] \le 2^{-((i-3)-\ell-3)}$. Now given $E_0$, p must not lose
till stage i − 1 in M for $B_i$ to occur. But this event is independent of $E_0$, and hence
Lemma 5 implies that $\Pr[B_i \mid E_0]$ is at most $2^{-((i-1)-\ell-3)}$. Multiplying the two
completes the proof.

Lemma 8. $\Pr[B_i \mid E_j]\Pr[E_j] \le 2^{-(2i-2\ell-9)}$.
Proof. For the event $E_j$ to occur, p does not lose till stage j − 1 in $M'$; now applying
Lemma 5 gives us that $\Pr[E_j] \le 2^{-((j-1)-\ell-3)}$. Also, note that j ≤ i − 3 for us to
be in this case.
Now let us condition on $E_j$ occurring: let p lose to some q in stage j of $M'$, and hence
$|p_x - q_x|, |p_y - q_y| < 2^j$. Now consider stage i−1 of M. We claim that $p_x$, $q_x$ and $g(p)_x$
do not all lie in the same segment of $D^x_{i-1}$. Indeed, since the distance
$|p_y - g(p)_y| < 2^{\ell+1} \le 2^{i-2}$, the triangle inequality ensures that
$|q_y - g(p)_y| \le |p_y - g(p)_y| + |p_y - q_y| \le 2^{i-1}$, and hence the distance between any two
points in the set $\{p_y, q_y, g(p)_y\}$ is at most $2^{i-1}$. Thus two of these points must lie in the
same segment in $D^y_{i-1}$ in M. If all three lay in the same segment of $D^x_{i-1}$, two of these
points would lie in the same square in $D_{i-1}$. Now if p was one of these points, then p
would lose before stage i and $B_i$ would not occur. If g(p) and q would lie in the same
square of $D_{i-1}$, then p and q would be in the same square in $D_i$, and then p would not
pay. Therefore, all three of $p_x$, $q_x$ and $g(p)_x$ cannot lie in the same segment of $D^x_{i-1}$;
similarly, $p_y$, $q_y$ and $g(p)_y$ cannot lie in the same segment of $D^y_{i-1}$.
Hence one of the following two events must happen: either (1) $p_x, g(p)_x$ lie in different
segments of $D^x_{i-1}$ and $p_y, q_y$ lie in different segments of $D^y_{i-1}$, or (2) $p_x, q_x$ lie in
different segments of $D^x_{i-1}$ and $p_y, g(p)_y$ lie in different segments of $D^y_{i-1}$. Lemma 3
implies that the probability of either of these events is at most $2^{-(i-\ell-2)} \cdot 2^{-(i-j-2)}$, and
hence $\Pr[B_i \mid E_j] \le 2^{-(2i-\ell-j-5)}$. Finally, multiplying this with $\Pr[E_j] \le 2^{-((j-1)-\ell-3)}$
completes the proof.
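Now combining (4.5) with Lemmas 7 and 8, we see that if $i > \ell + 1$, then
$\Pr[B_i] \le O\left( \frac{i-\ell}{2^{2i-2\ell}} \right)$. Thus,
$\sum_{i \ge \ell} 2^i \cdot \Pr[B_i] \le O\left( 2^\ell \cdot \sum_i \frac{i-\ell}{2^{i-\ell}} \right) = O(2^\ell)$.    (4.6)
This completes the proof of Theorem 6.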
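For completeness, here is why the sum in (4.6) is a constant multiple of $2^\ell$, where $\ell$ is the integer with $2^\ell \le d(p) < 2^{\ell+1}$. Multiplying the bound on $\Pr[B_i]$ by $2^i$ and substituting $s = i - \ell$ gives a standard arithmetic-geometric series:

```latex
2^{i}\cdot\frac{i-\ell}{2^{\,2i-2\ell}}
  \;=\; 2^{\ell}\cdot\frac{i-\ell}{2^{\,i-\ell}},
\qquad
\sum_{i \ge \ell} \frac{i-\ell}{2^{\,i-\ell}}
  \;=\; \sum_{s \ge 0} \frac{s}{2^{s}}
  \;=\; 2 .
```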

5 Sorting with Metric Comparison Costs
We now consider the problem of sorting the points in V according to their key values.
Let OPT be the set of n − 1 edges going between consecutive nodes in sorted order.
A rooted tree T is called a 2-HST if the lengths of all edges at any level of T are the
same, and the lengths of consecutive edges on any root-leaf path decrease by a factor
of exactly 2. We assume that each internal node of T has at least 2 children. Indeed,
if a node has exactly one child, we can contract this edge – this will change distances
between leaves up to a constant factor only. Let us denote the set of leaves of the 2-HST
tree T by V , and let |V | = n. The following theorem is the main technical result of this
section.
Theorem 7. Given n elements, and the metric generated by the leaves of a 2-HST, there
is an algorithm to sort the elements with a cost of O(log n) × c(OPT).
Using standard results on approximating arbitrary metrics by probability distributions
on metrics generated by HSTs [5, 6], the above theorem immediately implies Theo-
rem 3.
Proof (Theorem 7). For any rooted subtree H of T , let OPT(H) denote the optimal
set of comparisons to sort the leaves in H, and let c(OPT(H)) be their cost. Let h be
the root of H, and h’s children be h1, . . . , hr; let the subtree rooted at hi be Hi. Con-
sider OPT(H), and let a segment of OPT(H) be a maximal sequence of consecutive

vertices in OPT(H) belonging to the same sub-tree Hi for some i. Clearly, we can
divide OPT(H) uniquely into node-disjoint segments – let segs(H) denote the num-
ber of these disjoint segments. Let d(H) denote the cost of an edge joining h to one
of its children; recall that all these edges have the same cost. We omit the proof of the
following simple lemma.
Lemma 9. $c(\mathrm{OPT}(H)) \ge \sum_{i=1}^{r} c(\mathrm{OPT}(H_i)) + (\mathrm{segs}(H) - 1) \cdot d(H)$.
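To make the quantity segs(H) concrete, here is a small sketch that counts segments. It assumes the leaves of H are listed in sorted key order, each tagged with the index of the child subtree containing it; the function names and this input encoding are illustrative and not taken from the paper.

```python
from itertools import groupby

def count_segments(child_labels_in_sorted_order):
    """Number of maximal runs of consecutive leaves lying in the same child subtree."""
    return sum(1 for _ in groupby(child_labels_in_sorted_order))

def crossing_term(child_labels_in_sorted_order, d_H):
    """The (segs(H) - 1) * d(H) term appearing in Lemma 9."""
    return (count_segments(child_labels_in_sorted_order) - 1) * d_H
```

For example, the label sequence 1, 1, 2, 1, 3, 3 has four segments, so any sorting certificate must cross between child subtrees at least three times.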
Our algorithm sorts the leaves of T in a bottom-up manner, by sorting the leaves of
various subtrees, and then merging the results. For subtrees which just consist of a leaf,
there is nothing to do. Now let H, h, Hi, hi be as above, and assume we have sorted the
leaves of Hi for all i: we want to merge these sorted lists to get the sorted list for the
leaves of H. The following lemma, whose proof we omit, shows that we can do this
without paying too much.
Lemma 10. There is an algorithm to merge the sorted lists for Hi while incurring a
cost of O(segs(H) · log n · d(H)).
We now complete the proof of Theorem 7. If cost(H) is the cost incurred to sort the
subtree H, we claim cost(H) ≤ α · log n · c(OPT(H)) for some constant α. The
proof is by induction on the height of the tree: the base case is when H is a leaf, and
cost(H) = c(OPT(H)) = 0. If H, $H_i$ are as above, and if our claim is true for $H_i$,
then Lemma 10 implies that
$\mathrm{cost}(H) \le \sum_i \mathrm{cost}(H_i) + O(\mathrm{segs}(H) \cdot \log n \cdot d(H))$
$\le \sum_i \alpha \cdot \log n \cdot c(\mathrm{OPT}(H_i)) + O(\mathrm{segs}(H) \cdot \log n \cdot d(H))$
$\le \alpha \cdot \log n \left[ \sum_i c(\mathrm{OPT}(H_i)) + (\mathrm{segs}(H) - 1) \cdot d(H) \right]$    (5.7)
provided α is large enough. (The last inequality used the fact that since segs(H) ≥ 2,
segs(H) = O(segs(H) − 1).) But (5.7) is at most α · log n · c(OPT(H)), by Lemma 9,
which proves Theorem 7.

References
1. Knuth, D.E.: The art of computer programming. Volume 3: Sorting and searching. Addison-
Wesley Publishing Co., Reading, Mass. (1973)
2. Charikar, M., Fagin, R., Guruswami, V., Kleinberg, J., Raghavan, P., Sahai, A.: Query strate-
gies for priced information. In: Proc. 32nd ACM STOC. (2000) 582–591
3. Gupta, A., Kumar, A.: Sorting and selection with structured costs. In: Proc. 42nd IEEE FOCS
(2001) 416–425
4. Kannan, S., Khanna, S.: Selection with monotone comparison costs. In: Proc. 14th ACM-
SIAM SODA (2003) 10–17
5. Bartal, Y.: Probabilistic approximations of metric spaces and its algorithmic applications. In:
Proc. 37th IEEE FOCS. (1996) 184–193
6. Fakcharoenphol, J., Rao, S., Talwar, K.: A tight bound on approximating arbitrary metrics by
tree metrics. In: Proc. 35th ACM STOC (2003) 448–455
7. (Hartline, J., Hong, E., Mohr, A., Rocke, E., Yasuhara, K.) As reported in [3].
8. Kleinberg, J.: Detecting a network failure. Internet Math. 1 (2003) 37–55

What About Wednesday?
Approximation Algorithms
for Multistage Stochastic Optimization
Anupam Gupta¹, Martin Pál², Ramamoorthi Ravi³,†, and Amitabh Sinha⁴
¹ Dept. of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213
anupamg@cs.cmu.edu
² DIMACS Center, Rutgers University, Piscataway, NJ
mpal@acm.org
³ Tepper School of Business, Carnegie Mellon University, Pittsburgh PA 15213
ravi@cmu.edu
⁴ Ross School of Business, University of Michigan, Ann Arbor MI 48109
amitabh@umich.edu
Abstract. We study the problem of multi-stage stochastic optimization
with recourse, and provide approximation algorithms using cost-sharing
functions for such problems. Our algorithms use and extend the Boosted
Sampling framework of [6]. We also show how the framework can be
adapted to give approximation algorithms even when the inﬂation pa-
rameters are correlated with the scenarios.
1 Introduction
Many problems in planning involve making decisions under uncertainty that unravels
in several stages. Demand for new services such as cable television and
broadband Internet access originates over time, leading to interesting questions in
the general area of installing infrastructure to support demand that evolves over
time. For instance, a communications company may want to solve the problem
of constructing a network to serve demands as they arise but also keep potential
future growth in hot-spots in mind while making investments in costly optic ﬁber
cables. While traditional ways to solve such problems involve casting them as
network loading and network expansion problems in each stage and solving for
them sequentially, advances in forecasting methods have made a more integrated
approach feasible: with the availability of forecasts about how future demands
evolve, it is now preferable to use the framework of multistage stochastic opti-
mization with recourse to model such problems.

Full paper available at
http://www.tepper.cmu.edu/andrew/ravi/public/pubs.html.

Supported in part by NSF CAREER award CCF-0448095 and an Alfred P. Sloan
Fellowship.

Supported by NSF grant EIA 02-05116.
† Supported in part by NSF grant CCR-0105548 and ITR grant CCR-0122581.
Before we talk about the multistage optimization, let us describe the basic
ideas via the example of the two-stage stochastic Steiner tree problem: Given
a metric space with a root node, the Steiner tree problem is to find a mini-
mum length tree that connects the root to a set of specified terminals S. In the
two-stage stochastic version considered recently by several authors [6, 7, 9, 16],
information about the set of terminals S is revealed only in the second stage
while the probability distribution π from which this set is drawn is known in the
first stage (either explicitly [7, 9, 16], or as a black box from which one can sam-
ple efficiently [6]). One can now purchase some edges F1 in the first stage – and
once the set of terminals S ⊆ V is revealed, one can buy some more edges F2(S)
so that F1 ∪ F2(S) contains a rooted tree spanning S. The goal is to minimize
the cost of F1 plus the expected cost of the edges F2. Note however that the
edges purchased in the second stage F2 are costlier by a factor of σ > 1; it is
this that motivates the purchase of some anticipatory edges F1 in an optimal
solution. In this paper, we will consider the following k-stage problem, and will
give a 2k-approximation for this problem.
The k-Stage Problem. In the k-stage problem, we are allowed to buy a set of
edges in each stage to augment our current solution in reaction to the updated
information received in that stage in the form of a signal; however, the cost of
buying any edge increases with each stage. Let us use $\sigma_i$ to denote the inflation
factor of stage i; i.e., how much more expensive each edge is in comparison to
stage i−1. Assume $\sigma_1 = 1$; hence purchasing an edge e in stage i costs $c_e \prod_{j=1}^{i} \sigma_j$.
We assume that costs are non-decreasing, which corresponds to $\sigma_i \ge 1$.
– At the beginning of the i-th stage (where 1 ≤ i ≤ k − 1), we receive a signal
$s_i$ that represents the information gained about future terminals that will
arise. After this observation $s_i$, we know that future signals, as well as the
set S of eventual terminals, will come from a revised distribution conditioned
on seeing this signal. After observing the signal $s_i$, we can purchase some
more edges $F_i$ at cost $\left( \prod_{j=1}^{i} \sigma_j \right) c(F_i)$.
– Finally, in the k-th stage we observe the realization of the random variable
$\mathbf{S} = S$ of terminals, and have to buy the final set $F_k$ so that $\cup_{i=1}^{k} F_i$ is a
Steiner tree spanning S. Our goal is to minimize the expected cost incurred
in all stages together, namely $E\left[ \sum_{i=1}^{k} \left( \prod_{j \le i} \sigma_j \right) c(F_i) \right]$.
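The cost model above can be sketched in a few lines. The function names are illustrative; `sigmas[0]` plays the role of σ1 = 1, and the expectation is computed over an explicitly listed set of scenarios, an assumption made purely for the example (the paper also allows black-box access to the distribution).

```python
from itertools import accumulate
from operator import mul

def stage_multipliers(sigmas):
    """Cumulative inflation: entry i-1 holds the product of sigma_1 .. sigma_i."""
    return list(accumulate(sigmas, mul))

def total_cost(sigmas, stage_costs):
    """Sum over stages of (prod_{j<=i} sigma_j) * c(F_i) for one purchase sequence."""
    return sum(m * c for m, c in zip(stage_multipliers(sigmas), stage_costs))

def expected_cost(sigmas, scenarios):
    """scenarios: list of (probability, [c(F_1), ..., c(F_k)]) pairs."""
    return sum(p * total_cost(sigmas, costs) for p, costs in scenarios)
```

With inflation factors 1, 2, 3, an edge bought in the third stage costs six times its first-stage price, which is what motivates buying anticipatory edges early.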
This multistage framework can be naturally extended to model problems
like Vertex Cover and Facility Location as well; we give the formal definitions
in Section 2. We then extend the Boosted Sampling framework from [6] to the
multistage situation, and use this to give approximation algorithms for Stochastic
Steiner Tree, Facility Location, and Vertex Cover.
1.1 Informal Description of Results
In this paper we extend the Boosted Sampling framework for two-stage stochastic
optimization problems with recourse. The framework and terminology are defined
in Section 2. This framework had been proposed in our earlier paper [6]; given an
inflation parameter σ, and a distribution π over second-stage scenarios, the
following procedure was used to translate optimization algorithms for deterministic
problems to their two-stage stochastic variants:
1. Boosted Sampling: Sample σ times from the distribution π to get sets of clients
D1, . . . , Dσ.
2. Building First Stage Solution: Build an α-approx. solution for the clients D = ∪iDi.
3. Building Recourse: When actual future in the form of a set S of clients appears
(with probability π(S)), augment the sol’n. of Step 2 to a feasible solution for S.
Fig. 1. Algorithm Boost-and-Sample(Π)
In [6], we proved that given an α-approximation algorithm for the deter-
ministic version Det(Π) of any problem Π, boosted sampling would yield an
approximation algorithm for the two-stage stochastic version Stoc(Π), as long
as the deterministic algorithm satisfied certain technical cost-sharing conditions;
moreover, the sampling framework worked even if the probability distribution π
over the second-stage client sets was specified using a black-box from which one
could draw samples efficiently.
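A minimal sketch of Boost-and-Sample as outlined in Fig. 1, written against three caller-supplied callbacks. The names `sample_scenario`, `approx_solve`, and `augment` are hypothetical placeholders for the black-box sampler for π, the α-approximation algorithm for Det(Π), and the recourse step; rounding σ down to an integer number of samples is an assumption of this sketch, not something stated in the text.

```python
import math

def boost_and_sample(sigma, sample_scenario, approx_solve, augment, realized):
    """Two-stage Boost-and-Sample with hypothetical callbacks.

    sample_scenario() -> set of clients drawn from pi
    approx_solve(D)   -> first-stage solution (set of elements) for client set D
    augment(F, S)     -> extra elements making F feasible for the realized set S
    """
    # Step 1: draw floor(sigma) independent scenarios (assumed rounding).
    samples = [sample_scenario() for _ in range(math.floor(sigma))]
    # Step 2: build a first-stage solution for the union of the sampled sets.
    union = set().union(*samples)
    first_stage = approx_solve(union)
    # Step 3: once the actual client set is revealed, extend to feasibility.
    second_stage = augment(first_stage, realized)
    return first_stage, second_stage
```

The toy callbacks in the test treat each client as its own element, which is enough to exercise the control flow but says nothing about approximation quality.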
Multistage Results. We first show (in Section 3) how to extend the boosted
sampling framework to handle k-stage stochastic variants of problems (for all
k ≥ 2), and give the technical conditions under which effective approximations
can be obtained. (See Theorems 1 and 2 for the precise statements). As in [6],
these technical conditions are phrased in terms of the existence of certain “good”
cost-sharing functions related to the approximation algorithms. In particular, we
want the cost-shares to satisfy both strictness and cross-monotonicity; details
and definitions appear in Section 2.
Our results, together with the “good” cost-shares for some problems, give
us constant-factor approximation algorithms when the number of stages is a
constant. In particular, such results are obtainable for the Steiner tree problem
(where we can get a 2k-approximation for the k-stage problem), Facility Location
(an approximation of $3 \cdot 2^k$), and Vertex Cover (at most $4^k$). A more precise
summary of our results for the k-stage versions of these problems is in the last
column of Figure 2.
Correlated Inflation. As outlined in Figure 1, boosted sampling assumes a
fixed deterministic inflation parameter σ. If this inflation parameter $\boldsymbol{\sigma}$ is random
but independent of the distribution π over scenarios, one can just use $E[\boldsymbol{\sigma}]$ in
the place of σ. This independence assumption is somewhat restrictive, as the
equipment prices often correlate with demands. Here, using the expected value
can lead to very poor approximations.

Problem            Approximation ratio α   c-Strictness   Is ξ      k-Stage Stochastic
                   w.r.t. ξ                of ξ           X-mono?   Approx.
Steiner Tree       2                       1              Yes       2k
Facility Location  3                       2              Yes       3(2^k − 1)
Vertex Cover       2                       2              No        (2/3)(4^k − 1)
Fig. 2. A summary of the results obtainable from this work

In Section 4, we give a simple way to extend the Boosted Sampling framework
to the case when $\boldsymbol{\sigma}$ is arbitrarily correlated with the distribution π without losing
anything in the approximation ratios.
1.2 Related Work
There is a huge body of work in the Operations Research community on multi-
stage stochastic optimization with recourse; the study of stochastic optimiza-
tion [2, 12] dates back to the work of Dantzig [3] and Beale [1] in 1955. Stochastic
linear programming was defined in these papers, and has been very widely studied
since, with gradient-based and decomposition-based approaches being known
for some versions of stochastic linear programming. On the other hand, only
moderate progress has been reported for stochastic integer (and mixed-integer)
programming in both theoretical and computational domains; see [13, 17] for
details.
The study of stochastic versions of NP-hard problems has received some at-
tention lately in the theoretical computer science community, and approximation
algorithms for two-stage stochastic programming versions of a variety of com-
binatorial optimization problems have been devised actively in several recent
papers [4, 6, 7, 9, 16, 18].
Shmoys and Swamy [19] have recently shown that for a broad class of multi-
stage stochastic linear programs, a (1 + ε)-approximate solution can be found in
polynomial time using a Sampled-Average-Approximation approach. They show
that for several problems (including Facility Location, Set Cover and Vertex
Cover), the LP solution for each stage can be rounded to an integer solution in-
dependently of other stages. In contrast, our technique requires strict cost shares,
but does not depend on the existence of a suitable LP relaxation, or the ability
to round each stage independently.
Independently of our work, Hayrapetyan et al. [8] have also devised approxi-
mation algorithms for the multistage version of the Stochastic Steiner tree prob-
lem that we consider, using a reduction to a variant of an information network
gathering problem which they address in their paper. They also provide an O(k)-
approximation algorithm for multistage Stochastic Steiner tree (our approxima-
tion ratio is 2k). However, their techniques do not seem to extend to the other
covering problems that we address in this paper.
2 Basic Model and Notation
Let us define an abstract combinatorial optimization problem Π that we will
adapt to a stochastic setting. The optimization problem Π is defined by U, the
universe of clients (or demands), and the set X of elements we can purchase. For
a subset F ⊆ X of elements, let $c(F) = \sum_{e \in F} c_e$ denote the cost of F. Given a
set S of clients, a solution F that satisfies each client j ∈ S is labeled feasible
for S. The definition of satisfaction naturally depends on the problem; e.g., in
the (rooted) Steiner tree problem on a graph G = (V, E), the universe of clients
is the set of possible terminals (U = V ), the element set X is the set of edges
E, and a terminal j ∈ S is satisfied by F ⊆ X if F contains a path from j to
the root vertex r. The cost of a set of edges F ⊆ X is $c(F) = \sum_{e \in F} c_e$.
Given a set S ⊆ U of clients, we let $\mathrm{Sols}(S) \subseteq 2^X$ be the set of feasible
solutions for S. Given a client set S ⊆ U, the deterministic version Det(Π) of Π
asks us to find a solution F ∈ Sols(S) of minimum cost. We denote by OPT(S)
the cost of this minimum cost solution.
Definition 1. A problem Π is sub-additive if for any S and $S'$ being two sets
of clients with solutions F ∈ Sols(S) and $F' \in \mathrm{Sols}(S')$, we have that (i) $S \cup S'$
is a legal set of clients for Π, and (ii) $F \cup F' \in \mathrm{Sols}(S \cup S')$.
As in previous papers which give approximation algorithms for two-stage
stochastic optimization problems [6, 9, 16], we restrict our attention to sub-
additive problems. (Note that the sub-additivity in the rooted Steiner tree prob-
lem is ensured by the presence of the root r.)
Given any problem Π, we study the variant when the set of clients (or re-
quirements) is not known in advance, but is revealed gradually. We proceed to
build the solution in stages; in each stage, we gain a more precise estimate of
the requirements of clients, and then can buy or extend a partial solution (at
gradually increasing cost) in response to this updated information. Ultimately,
we learn the entire set S of clients or requirements, and then must complete the
existing partial solution to a feasible solution F ∈ Sols(S).
Multi-stage Stochastic Optimization Problems. We can now define stochastic
variants Stoc(Π) of the problem Π. In this model, we obtain increasingly precise
forecasts about user demands over several stages as in the Steiner tree example.
In the k-stage problem, we are allowed to buy a set of elements in each stage
to augment our current solution in reaction to the updated information received
in that stage; however, the cost of buying an element e ∈ X is increasing with
each stage. Extending the existing terminology, we use σi to denote the inflation
factor of stage i; i.e., how much more expensive each element is in comparison
to stage i − 1. For completeness, we define $\sigma_1 = 1$. Hence purchasing an element
e ∈ X in stage i costs $c_e \prod_{j=1}^{i} \sigma_j$. We assume that costs are non-decreasing,
which corresponds to $\sigma_i \ge 1$.
– At the beginning of the i-th stage (where 1 ≤ i ≤ k − 1), we receive a signal
$\mathbf{s}_i$ that represents the information gained that we can use to correct our
anticipation of the demands. Formally, the signal $\mathbf{s}_i$ is a random variable
correlated with $\mathbf{S}$, and in stage i we observe $s_i$, a realization of $\mathbf{s}_i$. (Note that
the signal $\mathbf{s}_1$ is a dummy signal, but we use it to simplify notation.) After this
observation $s_i$, we know that future signals, as well as the set $\mathbf{S}$ of demands,
will come from the conditional distribution
$[\pi \mid \mathbf{s}_1 = s_1, \mathbf{s}_2 = s_2, \ldots, \mathbf{s}_i = s_i]$.
After observing the signal $s_i$, we can purchase some more elements $F_i \subseteq X$
at cost $\left( \prod_{j=1}^{i} \sigma_j \right) c(F_i)$.
– Finally, in the k-th stage we observe the realization of the random variable
$\mathbf{S} = S$, and have to buy the final set $F_k$ so that $\cup_{i=1}^{k} F_i \in \mathrm{Sols}(S)$. Again, our
goal is to minimize the expected cost incurred in all stages together, that is,
$Z = E\left[ \sum_{i=1}^{k} \left( \prod_{j \le i} \sigma_j \right) c(F_i) \right]$.
Note that each of the partial solutions Fi = Fi(s1, s2, . . . , si) may depend on
signals up to stage i, but not on signals observed in the subsequent stages.
2.1 Cost Sharing Functions
We now define the notion of cost shares that we will crucially use to analyze our
approximation algorithms. Loosely, a cost-sharing function ξ divides the cost of
a solution F ∈ Sols(S) among the clients in S. Cost-sharing functions have long
been used in game-theory (see, e.g., [11, 14, 15, 20]).
Cost-shares with a closer algorithmic connection were recently defined by
Gupta et al. [5] to analyze a randomized algorithm for the Rent-or-Buy network
design problem. In this paper, we have to redefine strict cost-sharing functions
slightly: in contrast to previously used definitions, our cost-shares are defined
by, and relative to, an approximation algorithm A for the problem Π.
A cost-sharing algorithm A for a problem Π takes an instance (X, S) of Π
and outputs (a) a solution F ⊆ X with F ∈ Sols(S), and (b) a real value
ξ(X, S, j) ≥ 0 for each client j ∈ S. This value ξ(X, S, j) is called the cost-share
of client j, and the function ξ(·, ·, ·) computed by A is the cost sharing function
associated with A. A cost-sharing function ξ is cross-monotone if for every pair
of client sets S ⊆ T and client j ∈ S, we have ξ(X, T, j) ≤ ξ(X, S, j).
We also require all cost-sharing functions to provide a lower bound on the
cost of the optimal solution, in order to use the cost-sharing function to provide
bounds on the cost of our solution. A cost-sharing function ξ is competitive if
for every client set S, it holds that $\sum_{j \in S} \xi(X, S, j) \le \mathrm{OPT}(X, S)$. For a subset
of clients $S' \subseteq S$, let $\xi(X, S, S')$ denote the sum $\sum_{j \in S'} \xi(X, S, j)$; thus
competitiveness is the property that ξ(X, S, S) ≤ OPT(X, S). In this paper, we will
focus solely on competitive ξ.
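Cross-monotonicity and competitiveness can be checked mechanically on small instances. The sketch below assumes the cost-sharing function is supplied as a Python callable `xi(S, j)` for a fixed element set X, and that `opt(S)` returns the optimal cost; both are stand-ins for illustration, as are the function names. It enumerates all client subsets, so it is only meant for toy universes.

```python
from itertools import combinations

def subsets(universe):
    """Yield every subset of the universe as a frozenset."""
    items = sorted(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_cross_monotone(xi, universe):
    """Check xi(T, j) <= xi(S, j) whenever S is a subset of T and j is in S."""
    all_sets = list(subsets(universe))
    return all(xi(T, j) <= xi(S, j)
               for S in all_sets for T in all_sets if S <= T
               for j in S)

def is_competitive(xi, opt, universe):
    """Check sum_{j in S} xi(S, j) <= OPT(S) for every client set S."""
    return all(sum(xi(S, j) for j in S) <= opt(S) for S in subsets(universe))
```

As a toy instance, a single shared resource of cost 6 split equally among the clients present passes both checks, while a share that grows with the number of clients fails both.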
Crucial to our proofs is the notion of strictness [6] that relates the cost
of extending a solution on S so as to serve more clients T to the cost shares
of T . Formally, given a set X of elements and S of clients, let F = A(X, S)
denote the solution found by algorithm A. We create a new reduced instance
of the problem Π by zeroing out the cost of all elements in F; this instance is
denoted by X/F. A cost-sharing algorithm A is β-strict if for any sets of clients
S, T there exists a solution FT ⊆ X constructible in polynomial time such that
A(X, S) ∪ FT ∈ Sols(T ) and c(FT ) ≤ β × ξ(X, S ∪ T, T ).
A subtly different notion of strictness is c-strictness; here the “c” is supposed
to emphasize the enhanced role of the cost-shares. Let S, T ⊆ U be sets of clients,
and let X be an instance of Π. The cost-sharing function ξ given by an algorithm
A is β-c-strict if ξ(X/A(X, S), T, T ) ≤ β × ξ(X, S ∪ T, T ).
In other words, the total cost shares for the set T of clients in the reduced
instance X/A(X, S) is at most β times the cost-shares for T if the clients in S
were present as well. Note that any competitive β-strict cost sharing function ξ
is also β-c-strict.
Recall that A is an α-approximation algorithm if c(A(X, S)) ≤ α OPT(X, S).
In this paper, we will need the following stronger guarantee (which goes hand-
in-hand with c-strictness): An algorithm A (with cost-sharing function ξ) is an
α-approximation algorithm with respect to ξ if c(A(X, S)) ≤ α ξ(X, S, S).
If ξ is competitive, then ξ(X, S, S) ≤ c(OPT(S)), and thus an α-approxima-
tion algorithm w.r.t. ξ is also simply an α-approximation algorithm. This, to-
gether with c-strictness, implies that an α-approximation algorithm w.r.t. β-c-
strict ξ is (αβ)-strict in the sense of the first definition of strictness.
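The last implication can be spelled out as a two-step chain. This is a sketch under one assumption not stated explicitly in the text: the augmenting solution $F_T$ is taken to be the output of A on the reduced instance $X/F$ with client set T, where $F = A(X, S)$, and its cost is measured in that reduced instance.

```latex
\begin{align*}
c(F_T) = c\bigl(A(X/F,\, T)\bigr)
  &\le \alpha \,\xi(X/F,\, T,\, T)
  && \text{($A$ is an $\alpha$-approximation w.r.t.\ $\xi$)} \\
  &\le \alpha\beta \,\xi(X,\, S \cup T,\, T)
  && \text{($\xi$ is $\beta$-c-strict)}
\end{align*}
```

Since $F \cup F_T \in \mathrm{Sols}(T)$, this matches the first definition of strictness with parameter αβ.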
2.2 New Results on c-Strictness and Multi-stage Approximations
Having laid down the crucial definitions, we can finally state the main theorems
for multi-stage stochastic covering problems.
Theorem 1. There is an $\alpha \cdot \sum_{i=0}^{k-1} \beta^i$-approximation algorithm for the k-stage
stochastic problem $\mathrm{Stoc}_k(\Pi)$ if the corresponding problem Π has an α-approximation
algorithm A with respect to a β-c-strict cost-sharing function ξ, and this
ξ is cross-monotone.
A slightly weaker version of the above theorem can be proved without cross-
monotone cost-shares: this is useful for problems like Vertex Cover for which
good cross-monotone cost-shares do not exist [10].
Theorem 2. There is an $\alpha \cdot \sum_{i=0}^{k-1} \beta_1 \beta_2^{\,i}$-approximation algorithm for the
k-stage stochastic problem $\mathrm{Stoc}_k(\Pi)$ if the corresponding problem Π has an
α-approximation algorithm A with respect to a $\beta_1$-c-strict cost-sharing function ξ,
and A is also $\beta_2$-strict with respect to this ξ.
Finally, we improve the previous results on strictness in [6], and adapt them
to the new notion of c-strictness. Due to space constraints, proofs of Theorems
2 and 3 only appear in the full version of the paper. Figure 2 summarizes the
results.
Theorem 3. 1. There is a 2-approximation algorithm for the Minimum Steiner
Tree problem w.r.t. a 1-c-strict cost sharing function ξ. Furthermore, this ξ
is also cross-monotone.
2. The Uncapacitated Facility Location problem admits a 3-approximation al-
gorithm w.r.t. a 2-c-strict ξ. This cost-sharing function ξ is also cross-
monotone.
3. The Vertex Cover problem has a 2-approximation algorithm w.r.t. an associated
2-strict (and hence 2-c-strict) cost-sharing function ξ.
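To see what these parameters mean for k stages, the bounds of Theorems 1 and 2 can be evaluated directly. The sketch below only plugs the constants of Theorem 3 into those two formulas; the function names are ours, and pairing Vertex Cover with $\beta_1 = \beta_2 = 2$ is our reading of part 3 of Theorem 3.

```python
def ratio_cross_monotone(alpha, beta, k):
    """Bound of Theorem 1: alpha * sum_{i=0}^{k-1} beta**i."""
    return alpha * sum(beta ** i for i in range(k))

def ratio_general(alpha, beta1, beta2, k):
    """Bound of Theorem 2: alpha * sum_{i=0}^{k-1} beta1 * beta2**i."""
    return alpha * sum(beta1 * beta2 ** i for i in range(k))

def summary(k):
    """Bounds obtained by combining Theorem 3 with Theorems 1 and 2."""
    return {
        "steiner_tree": ratio_cross_monotone(2, 1, k),        # equals 2k
        "facility_location": ratio_cross_monotone(3, 2, k),   # equals 3(2^k - 1)
        "vertex_cover": ratio_general(2, 2, 2, k),            # equals 4(2^k - 1)
    }
```

With β = 1 the geometric sum collapses to k, which is why the Steiner tree bound grows only linearly in the number of stages, while any β > 1 gives exponential growth in k.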
3 Multiple Stage Stochastic Optimization
For ease of exposition, let us state the algorithm assuming that the inflation
factors σi are deterministically known in advance. With some extra work as in
Section 4, randomly varying inflation factors can be handled; the details are
deferred to a full version of this paper.
Let us first outline the key idea of the algorithm. In the two-stage Boosted
Sampling framework with inflation being σ (see Figure 1), the first stage involved
simulating σ independent runs of the second stage and building a solution that
satisfied the union of these simulations. We use the same basic idea for the
k-stage problem: in each stage i, we simulate σi+1 “copies” of the remaining
(k − i)-stage stochastic process; each such “copy” provides us with a (random)
set of clients to satisfy, and we build a solution that satisfies the union of all
these clients. The “base case” of this recursive idea is the final stage, where we
get a set S of clients, and just build the required solution for S.
1. (Base case.) If i = k, draw one sample set of clients $S_k$ from the conditional
distribution $[\pi \mid s_1, \ldots, s_k]$. Return the set $S_k$.
2. (New samples.) If i < k, draw $\sigma_{i+1}$ samples of the signal $s_{i+1}$ from the conditional
distribution $[\pi \mid s_1, \ldots, s_i]$. Let $s^1, \ldots, s^n$ be the sampled signals (where $n = \sigma_{i+1}$).
3. (Recursive calls.) For each sample signal $s^j$, recursively call
Recur-Sample(Π, i + 1, $s_1, \ldots, s_i, s^j$) to obtain a sample set of clients $S^j$.
Return the set $S_i = \bigcup_{j=1}^{n} S^j$.
Fig. 3. Procedure Recur-Sample(Π, stage i, s1, . . . , si)
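The recursion of Figure 3 can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the two callbacks `draw_signal` and `draw_clients` are hypothetical stand-ins for black-box access to the conditional distributions, and the inflation factors are assumed to be integers stored in a dictionary `sigma` indexed by stage.

```python
def recur_sample(i, signals, k, sigma, draw_signal, draw_clients):
    """Sketch of Recur-Sample(Pi, stage i, s_1..s_i).

    signals      -- list [s_1, ..., s_i] of signals fixed so far
    sigma[t]     -- integer inflation factor of stage t (assumption)
    draw_signal  -- hypothetical callback sampling s_{i+1} from [pi | s_1..s_i]
    draw_clients -- hypothetical callback sampling a client set from [pi | s_1..s_k]
    """
    if i == k:
        # Base case: one sample of the final client set.
        return set(draw_clients(signals))
    clients = set()
    for _ in range(sigma[i + 1]):
        # New sample of the next signal, followed by a recursive call.
        s_next = draw_signal(signals)
        clients |= recur_sample(i + 1, signals + [s_next], k, sigma,
                                draw_signal, draw_clients)
    return clients
```

A call at stage i reaches the base case once per path of the recursion tree, that is, σ_{i+1} · σ_{i+2} · … · σ_k times, which is where the exponential dependence on k in the running time comes from.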
Our algorithm, in each stage i (except the final, k-th stage), uses a very
natural recursive sampling procedure that emulates σi+1 executions of itself on
the remaining (k − i) stages. (This sampler is specified in Figure 3.) Having
obtained a collection of sampled sets of clients, it then augments the current
partial solution to a feasible solution for these sampled sets. Finally, in the k-th
stage it performs the ultimate augmentation to obtain a feasible solution for the
revealed set of demands. The expected number of calls to the black box required
in stage i will be $\prod_{j=i+1}^{k} \sigma_j$.
The definition of the sampling routine is given in Figure 3, and the procedure
Multi-Boost-and-Sample(Π, i) to be executed in round i is specified in Figure 4.
Note that if we set k = 2, we get back precisely the Boosted Sampling framework
from our previous paper [6].
3.1 The Analysis
We will now show that this extended framework can be used to translate an
approximation algorithm A for the deterministic version Det(Π) of a problem Π
to its k-stage stochastic version $\mathrm{Stoc}_k(\Pi)$. The quality of this translation depends
on the approximation guarantee of A with respect to some c-strict cost-sharing
function. The main result of this section is the following:
Theorem 4. Given a problem Π, if A is an α-approximation algorithm w.r.t. a
β-c-strict cost-sharing function ξ, and if ξ is cross-monotone, then
Multi-Boost-and-Sample(Π) is an $\alpha \cdot \sum_{i=0}^{k-1} \beta^i$-approximation algorithm for the k-stage
stochastic problem $\mathrm{Stoc}_k(\Pi)$.
1. (External signal.) If the current stage i < k, observe the signal $s_i$. If this is the
final stage i = k, observe the required set of clients S instead.
2. (Sample.) If i = k, let $D_k := S$. Else i < k, and then use procedure
Recur-Sample(Π, i, s1, . . . , si) to obtain a sample set of clients $D_i$.
3. (Augment solution.) Let $B_i = \bigcup_{j=1}^{i-1} F_j$ be the elements that were bought in earlier
rounds. Set the costs of elements $e \in B_i$ to zero. Using algorithm A, find a set of
elements $F_i \subseteq X \setminus B_i$ to buy so that $(F_i \cup B_i) \in \mathrm{Sols}(D_i)$.
Fig. 4. Algorithm Multi-Boost-and-Sample(Π, i)
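The "Augment solution" step of Multi-Boost-and-Sample (Figure 4) amounts to running A on an instance in which already-bought elements are free. The sketch below shows only that step. The solver `cheapest_cover` is a toy stand-in for A (each client is served by any one element from a given candidate list), introduced purely so the example can run; it is not one of the algorithms analyzed in the paper.

```python
def augment_stage(costs, bought, demand, solve):
    """One augmentation step: zero out bought elements, run A, keep the new elements.

    costs  -- dict element -> cost in the original instance X
    bought -- set B_i of elements bought in earlier stages
    demand -- client set D_i to be satisfied in this stage
    solve  -- stand-in for algorithm A: (costs, demand) -> set of elements
    """
    reduced = {e: (0 if e in bought else c) for e, c in costs.items()}
    chosen = solve(reduced, demand)          # A on the reduced instance X/B_i
    return set(chosen) - set(bought)         # F_i, a subset of X \ B_i

def cheapest_cover(serves):
    """Toy solver (assumption): client j is served by any element of serves[j]."""
    def solve(costs, demand):
        return {min(serves[j], key=lambda e: (costs[e], e)) for j in demand}
    return solve
```

For example, with costs `{"a": 3, "b": 5, "c": 1}` and candidate lists `{1: ["a", "b"], 2: ["b", "c"]}`, a first stage with demand `{1}` buys `a`; a later stage with demand `{1, 2}` then reuses `a` at zero cost and buys only `c`.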
Before we prove Theorem 4, we set the stage for the proof by providing a
brief overview of the proof technique, and proving a couple of lemmas which
provide useful bounds. A naïve attempt to prove this result along the lines of
our previous paper [6] does not succeed, since we have to move between the
cost-shares and the cost of the solutions Fi, which causes us to lose factors of ≈ α at
each step. Instead, we bound all costs incurred in terms of the ξ's: we first argue
that the expected sum of cost-shares paid in the first stage is no more than the
optimum total expected cost $Z^*$, and then bound the sum of cost-shares in each
consecutive stage in terms of the expected cost shares from the previous stage
(with a loss of a factor of β at each stage). Finally, we bound the actual cost of
the partial solution constructed at stage i by α times the expected cost shares
for that stage, which gives us the geometric sum claimed in the theorem.
Let $F^*$ be an optimal solution to the given instance of $\mathrm{Stoc}_k(\Pi)$. We denote
by $F^*_i$ the partial solution built in stage i; recall that $F^*_i = F^*_i(s_1, s_2, \ldots, s_i)$
is a function of the set of all possible i-tuples of signals that could be observed
before stage i. The expected cost of this solution can be expressed as
$Z^* = \sigma_1 E[c(F^*_1(s_1))] + \sigma_1\sigma_2 E[c(F^*_2(s_1, s_2))] + \cdots + \sigma_1 \cdots \sigma_k E[c(F^*_k(s_1, s_2, \ldots, s_k))]$.
Lemma 1. The expected cost share $E[\xi(X, D_1, D_1)]$ is at most the total optimum
cost $Z^*$.
Proof. Consider $D_1$, the sample set of clients returned by Recur-Sample(Π, 1, s1).
We claim there is a solution $\hat{F}(D_1)$ such that $E[c(\hat{F}(D_1))] \le Z^*$ (the expectation
is over the execution of the procedure Recur-Sample). To construct the
solution $\hat{F}(D_1)$, we consider the tree of recursive calls of the procedure
Recur-Sample. For each recursive call Recur-Sample(Π, i, s1, . . . , si), we add the set of
elements $F^*_i(s_1, \ldots, s_i)$ to $\hat{F}(D_1)$. It is relatively straightforward to establish
that (1) $\hat{F}(D_1)$ is a feasible solution for the set $D_1$ and (2) the expected cost satisfies
$E[c(\hat{F}(D_1))] \le E\bigl[\sum_{i=1}^{k} \bigl(\prod_{j \le i} \sigma_j\bigr)\, c(F^*_i)\bigr] = Z^*$.
Since the expected cost of a feasible solution for $D_1$ is bounded above by $Z^*$,
the competitiveness of ξ implies that this bound must hold for the sum of cost
shares as well.
Lemma 2. Let $\hat{F} = F_1 \cup \cdots \cup F_{i-1}$ be the solution constructed in a particular
execution of the first i − 1 stages, and let $s_i$ be the signal observed in stage i.
Let $D_i$ and $D_{i+1}$ be the random variables denoting the samples returned by the
procedure Recur-Sample in Stages i and i+1, and let $F_i$ be the (random) solution
constructed by A for the set of clients $D_i$. Then,
$E[\xi(X/(\hat{F} \cup F_i), D_{i+1}, D_{i+1})] \le \frac{\beta}{\sigma_{i+1}} \cdot E[\xi(X/\hat{F}, D_i, D_i)]$.
Proof. Recall that the sampling procedure Recur-Sample(Π, i, s1, . . . , si) gets
$n = \sigma_{i+1}$ independent samples $s^1, s^2, \ldots, s^n$ of the signal $s_{i+1}$ from the
distribution π conditioned on $s_1, \ldots, s_i$, and then for each sampled signal calls itself
recursively to obtain the n sets $S^1, \ldots, S^n$. Note that the set $D_i = \bigcup_{j=1}^{n} S^j$ is
simply the union of these n sets. On the other hand, the set $D_{i+1}$ is obtained by
observing the signal $s_{i+1}$ (which is assumed to come from the same distribution
$[\pi \mid s_1, \ldots, s_i]$), and then calling Recur-Sample with the observed value of $s_{i+1}$.
We now consider an alternate, probabilistically equivalent view of this process.
First take n+1 samples $s^1, \ldots, s^{n+1}$ of the signal $s_{i+1}$ from the distribution
$[\pi \mid s_1, \ldots, s_i]$. Call the procedure Recur-Sample(Π, i + 1, $s_1, \ldots, s_i, s^j$) for each $s^j$ to
obtain sets $S^1, \ldots, S^{n+1}$. Pick an index j uniformly at random from the set of
integers 1, . . . , n + 1. Let $D_{i+1} = S^j$, and let $D_i$ be the union of the remaining n
sets. This process of randomly constructing the pair of sets $(D_i, D_{i+1})$ is clearly
equivalent to the original process. Note that $D_i \cup D_{i+1} = \bigcup_{l=1}^{n+1} S^l$.
To simplify notation, let us denote $X/\hat{F}$ by $\hat{X}$. By the definition of
β-c-strictness, we first get $\xi(\hat{X}/F_i, D_{i+1}, D_{i+1}) \le \beta \cdot \xi(\hat{X}, D_i \cup D_{i+1}, D_{i+1})$.
By the relation that $D_{i+1}$ is equivalent to $S^j$ sampled uniformly from n + 1
alternatives in the equivalent process above and that $D_i$ is the union of the
remaining n sets, we have
$E[\xi(\hat{X}, D_i \cup D_{i+1}, S^j)] \le E\bigl[\tfrac{1}{n} \times \xi(\hat{X}, D_i \cup D_{i+1}, D_i)\bigr]$.
Now we use cross-monotonicity of the cost shares (which says that the cost
shares of $D_i$ should not increase when the elements of $D_{i+1} \setminus D_i$ join the fray),
and finally get $E[\xi(\hat{X}, D_i \cup D_{i+1}, D_i)] \le E[\xi(\hat{X}, D_i, D_i)]$.
Chaining the above inequalities proves the lemma.
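Written out in one line, the chain used in this proof reads as follows (a restatement of the three inequalities above, with $\hat{X}$ denoting the instance reduced by the solution of the first i − 1 stages and $n = \sigma_{i+1}$):

```latex
\begin{align*}
E\bigl[\xi(\hat{X}/F_i,\, D_{i+1},\, D_{i+1})\bigr]
  &\le \beta \, E\bigl[\xi(\hat{X},\, D_i \cup D_{i+1},\, D_{i+1})\bigr]
  && \text{($\beta$-c-strictness)} \\
  &\le \frac{\beta}{n} \, E\bigl[\xi(\hat{X},\, D_i \cup D_{i+1},\, D_i)\bigr]
  && \text{(symmetry of the $n+1$ samples)} \\
  &\le \frac{\beta}{\sigma_{i+1}} \, E\bigl[\xi(\hat{X},\, D_i,\, D_i)\bigr]
  && \text{(cross-monotonicity)}
\end{align*}
```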
Proof. (of Theorem 4) Recall that the expected cost of the solution given by
Algorithm Multi-Boost-and-Sample(Π) is
$E[Z] = E\bigl[\sum_{i=1}^{k} \bigl(\prod_{j=1}^{i} \sigma_j\bigr)\, c(F_i)\bigr]$.
Using cross-monotonicity and the fact that A is an α-approximation algorithm
with respect to the β-c-strict cost-shares ξ, we obtain the following:
$E[Z] \le \alpha\, E\bigl[\sum_{i=1}^{k} \bigl(\prod_{j=1}^{i} \sigma_j\bigr)\, \xi(X/B_i, F_i, F_i)\bigr]$.
Using Lemma 2 inductively on $\xi(X/B_i, F_i, F_i)$, we find that
$\xi(X/B_i, F_i, F_i) \le \frac{\beta^i}{\prod_{j=1}^{i} \sigma_j}\, \xi(X, D_1, D_1)$.
Using this inequality in the bound for E[Z] above:
$E[Z] \le \alpha\, E\bigl[\sum_{i=1}^{k} \beta^i\, \xi(X, D_1, D_1)\bigr]$.
Lemma 1 bounds $\xi(X, D_1, D_1)$ from above by $Z^*$. Using this bound in the
inequality above completes the proof of Theorem 4.

4 Correlated Inflation Factors
In this section, we show how to extend the basic Boosted Sampling framework
to work in the case where the inflation factor σ is a random variable arbitrarily
correlated with the random scenarios. For brevity, we describe the idea only
for the two-stage setting, though the same idea can be used for the multi-stage
framework of Section 3.
Formally, let us assume that we have access to a distribution π′ over $\mathbb{R}_{\ge 1} \times 2^U$,
where π′(σ, S) is the probability that the set S arrives and the inflation factor
is σ. We assume that we know an integer M ∈ Z which is an upper bound on
the value of the inflation parameter σ; i.e., with probability 1, it should be the
case that σ ≤ M holds. (Note that choosing a pessimistic value of M will only
increase the running time, but not degrade the approximation guarantee of the
framework.)
1. Boosted Sampling: Draw M independent samples from the joint distribution π′ of
(σ, S). Let $(\sigma_1, S_1), (\sigma_2, S_2), \ldots, (\sigma_M, S_M)$ denote this collection of samples.
2. Rejection Stage: For i = 1, . . . , M, accept the sample $S_i$ with probability $\sigma_i/M$.
Let $S_{i_1}, S_{i_2}, \ldots, S_{i_k}$ be the accepted samples, and let $S = \bigcup_j S_{i_j}$.
3. First-stage Solution: Using the algorithm A, construct an α-approximate first-stage
solution F1 ∈ Sols(S).
4. Second-stage Recourse: Let T be the set of clients realized in the second stage.
Use an augmenting algorithm to compute the second-stage solution F2 such that
F1 ∪ F2 ∈ Sols(T).
Fig. 5. Algorithm General-Boost-and-Sample(Π)
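Steps 1 and 2 of Figure 5 are a plain rejection-sampling loop. The sketch below covers only those two steps; `draw` is a hypothetical callback returning one sample (σ, S) from the joint distribution, and `rng` is any object with a `random()` method returning a number in [0, 1), which defaults to Python's `random` module.

```python
import random

def first_stage_clients(draw, M, rng=random):
    """Boosted sampling with rejection (steps 1 and 2 of General-Boost-and-Sample).

    draw -- hypothetical callback returning one sample (sigma, S)
    M    -- integer upper bound on the inflation factor
    Returns the union of the accepted client sets.
    """
    accepted = set()
    for _ in range(M):
        sigma, clients = draw()
        # Accept this sample with probability sigma / M.
        if rng.random() < sigma / M:
            accepted |= set(clients)
    return accepted
```

When every sample has σ = M the acceptance test always succeeds, so the loop reduces to taking the union of M samples, as in the fixed-inflation framework.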
The algorithm General-Boost-and-Sample(Π) is given in Figure 5. Note that
if σ is a constant, then we would behave identically to the original Boosted
Sampling framework. To get some intuition for the new steps, note that if the
sampled inflation factor σi is large, this indicates that we want to handle the
associated Si in the first stage; on the other hand, if the σi is small, we can
afford to wait until the second-stage to handle the associated Si – and this is
indeed what the algorithm does, albeit in a probabilistic way. The following is
an extension of the main structural theorem proved in our earlier paper [6]:
Theorem 5. Consider a sub-additive combinatorial optimization problem Π,
and let A be an α-approximation algorithm for its deterministic version Det(Π).
If A admits a β-strict cost sharing function, then General-Boost-and-Sample(Π)
is an (α + β)-approximation algorithm for Stoc(Π).
Proof. Let us transform the "random inflation" stochastic problem instance
(X, π′) to one with a fixed inflation factor thus: the distribution
$\hat{\pi}(\sigma, S) = \pi'(\sigma, S) \times (\sigma/M)$; note that this ensures that
$\sum_{\sigma,S} \hat{\pi}(\sigma, S) \le 1$, and hence we can increase the probability
$\hat{\pi}(1, \emptyset)$ so that the sum becomes exactly 1 and $\hat{\pi}$ is
a well-defined probability distribution. The inflation factor for this new instance
is set to M, and hence the σ output by $\hat{\pi}$ is only for expositional ease.
Now the objective for this new problem is to minimize the expected cost
under this new distribution, which is
$c(F_0) + \sum_{\sigma,S} \hat{\pi}(\sigma, S)\, M\, c(F_S) = c(F_0) + \sum_{\sigma,S} \pi'(\sigma, S)\, (\sigma/M)\, M\, c(F_S)$,
which is the same as the original objective function; hence the two problems are
identical, and running Boost-and-Sample on this new distribution $\hat{\pi}$ with
inflation parameter M would give us an (α + β)-approximation.
Finally, note that one can implement $\hat{\pi}$ given black-box access to π′ by just
accepting any sample (σ, S) with probability σ/M (and rejecting it otherwise).
Including this implementation within Boost-and-Sample gives us precisely the
above General-Boost-and-Sample, which completes the proof of the theorem.
This immediately implies a 3.55-approximation for Stochastic Steiner tree,
a 4-approximation for Stochastic Vertex Cover, and a 5.45-approximation for
Stochastic Facility Location even when the second-stage inflation factors are
drawn from a distribution that may be arbitrarily correlated to the set of clients
materializing in the second stage.
A naïve attempt to extend the Boosted Sampling algorithm of [6] might
proceed by obtaining E[σ] samples and invoking the algorithm. Unfortunately,
this approach is doomed to fail, as the following example shows. Consider the
case where with probability 1/2, the inflation σ = M ≫ 1 but S = ∅, and with
probability 1/2, the inflation σ ≈ 1 but S ≠ ∅; the average value of σ is ≈ M/2,
but in fact we should be ignoring the high σ's.
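The gap in this example can be quantified with a short calculation. The sketch below takes the low inflation value to be exactly 1 (the text says ≈ 1) and compares the expected number of non-empty client sets that reach the first stage under the naïve scheme and under the accept-with-probability-σ/M rule of Figure 5; the function name is ours.

```python
from fractions import Fraction

def expected_nonempty_samples(M):
    """Example: w.p. 1/2 sigma = M and S is empty; w.p. 1/2 sigma = 1 and S is non-empty.

    Returns (E[sigma], naive, general) where
      naive   = expected non-empty sets among E[sigma] unfiltered samples,
      general = expected non-empty sets accepted among M samples,
                each accepted with probability sigma / M.
    """
    half = Fraction(1, 2)
    mean_sigma = half * M + half * 1
    naive = mean_sigma * half
    general = M * half * Fraction(1, M)
    return mean_sigma, naive, general
```

For M = 100 the naïve scheme brings about 25 non-empty scenarios into the first stage, whereas the rejection rule brings in 1/2 in expectation, independent of M.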
While the algorithms described in this paper can provide approximation al-
gorithms with performance guarantee linear in the number of stages for some
problems (Stochastic Steiner Tree), this requires 1-c-strict cost-shares, which
may not always exist. The question of which k-stage stochastic optimization
problems can be approximated within ratios linear in k and which require expo-
nential dependence on k is an intriguing one. We also note that the running time
of our algorithm is exponential in k due to the recursive sampling procedure; we
leave open the question whether a sub-exponential collection of samples can be
used to construct the partial solutions in earlier stages while still resulting in an
algorithm with a provably good approximation ratio.
References
1. E. M. L. Beale. On minimizing a convex function subject to linear inequalities. J.
Roy. Statist. Soc. Ser. B., 17:173–184; discussion, 194–203, 1955. (Symposium on
linear programming.).
2. John R. Birge and François Louveaux. Introduction to stochastic programming.
Springer Series in Operations Research. Springer-Verlag, New York, 1997.

3. George B. Dantzig. Linear programming under uncertainty. Management Sci.,
1:197–206, 1955.
4. Kedar Dhamdhere, R. Ravi, and Mohit Singh. On two-stage stochastic minimum
spanning trees. In Proceedings of the 11th Integer Programming and Combinatorial
Optimization Conference, pages 321–334, 2005.
5. Anupam Gupta, Amit Kumar, Martin Pál, and Tim Roughgarden. Approxima-
tions via cost-sharing. In Proceedings of the 44th Annual IEEE Symposium on
Foundations of Computer Science, pages 606–615, 2003.
6. Anupam Gupta, Martin Pál, R. Ravi, and Amitabh Sinha. Boosted sampling:
Approximation algorithms for stochastic optimization problems. In Proceedings of
the 36th ACM Symposium on the Theory of Computing (STOC), pages 417–426,
2004.
7. Anupam Gupta, R. Ravi, and Amitabh Sinha. An edge in time saves nine: LP
rounding approximation algorithms for stochastic network design. In Proceedings
of the 45th Symposium on the Foundations of Computer Science (FOCS), pages
218–227, 2004.
8. A. Hayrapetyan, C. Swamy, and E. Tardos. Network design for information net-
works. In Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 933–942, 2005.
9. Nicole Immorlica, David Karger, Maria Minkoff, and Vahab Mirrokni. On the costs
and benefits of procrastination: Approximation algorithms for stochastic combi-
natorial optimization problems. In Proceedings of the 15th Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 691–700, 2004.
10. Nicole Immorlica, Mohammad Mahdian, and Vahab Mirrokni. Limitations of cross-
monotonic cost-sharing schemes. In Proceedings of the 16th ACM-SIAM Sympo-
sium on Discrete Algorithms (SODA), 2005.
11. Kamal Jain and Vijay Vazirani. Applications of approximation algorithms to coop-
erative games. In Proceedings of the 33rd Annual ACM Symposium on the Theory
of Computing (STOC), pages 364–372, 2001.
12. Peter Kall and Stein W. Wallace. Stochastic programming. Wiley-Interscience
Series in Systems and Optimization. John Wiley & Sons Ltd., Chichester, 1994.
13. Willem K. Klein Haneveld and Maarten H. van der Vlerk. Stochastic Programming.
Department of Econometrics and OR, University of Groningen, Netherlands, 2003.
14. Hervé Moulin and Scott Shenker. Strategyproof sharing of submodular costs: bud-
get balance versus efficiency. Econom. Theory, 18(3):511–533, 2001.
15. Martin Pál and Éva Tardos. Group strategyproof mechanisms via primal-dual al-
gorithms. In Proceedings of the 44th Annual IEEE Symposium on Foundations of
Computer Science, pages 584–593, 2003.
16. R. Ravi and Amitabh Sinha. Hedging uncertainty: Approximation algorithms for
stochastic optimization problems. In Proceedings of the 10th Integer Programming
and Combinatorial Optimization Conference, pages 101–115, 2004.
17. R. Schultz, L. Stougie, and M. H. van der Vlerk. Two-stage stochastic integer
programming: a survey. Statist. Neerlandica, 50(3):404–416, 1996.
18. David Shmoys and Chaitanya Swamy. Stochastic optimization is (almost) as easy
as deterministic optimization. In Proceedings of the 45th Symposium on the Foun-
dations of Computer Science (FOCS), pages 228–237, 2004.
19. David Shmoys and Chaitanya Swamy. Sampling-based approximation algorithms
for multi-stage stochastic optimization. Manuscript, 2005.
20. H. P. Young. Cost allocation. In R. J. Aumann and S. Hart, editors, Handbook of
Game Theory, volume 2, chapter 34, pages 1193–1235. North-Holland, 1994.

The Complexity of Making Unique Choices:
Approximating 1-in-k SAT
Venkatesan Guruswami¹ and Luca Trevisan²
¹ Dept. of Computer Science & Engineering
University of Washington, Seattle, WA 98195
venkat@cs.washington.edu
² Computer Science Division
University of California at Berkeley, Berkeley, CA 94720
luca@cs.berkeley.edu
Abstract. We study the approximability of 1-in-kSAT, the variant of
Max kSAT where a clause is deemed satisfied when precisely one of its
literals is satisfied. We also investigate different special cases of the prob-
lem, including those obtained by restricting the literals to be unnegated
and/or all clauses to have size exactly k. Our results show that the 1-
in-kSAT problem exhibits some rather peculiar phenomena in the realm
of constraint satisfaction problems. Specifically, the problem becomes
substantially easier to approximate with perfect completeness as well as
when negations of literals are not allowed.
1 Introduction
Boolean constraint satisfaction problems (CSP) arise in a variety of contexts and
their study has generated a lot of algorithmic and complexity-theoretic research.
An instance of a Boolean CSP is given by a set of variables and a collection of
Boolean constraints, each on a certain subset of variables, and the objective is to
find an assignment to the variables that satisfies as many constraints as possible.
The most fundamental problem in this framework is of course Max SAT where
the constraints are disjunctions of a subset of literals (i.e., variables and their
negations), and thus are satisfied if at least one of the literals in the subset is
set to 1. When the constraints are disjunctions of at most k literals, we get the
Max kSAT problem.
In this work we consider constraint satisfaction problems where each con-
straint requires that exactly one literal (from the subset of literals constrained
by it) is set to 1. This is a natural variant of SAT which is also NP-hard. We
study the version of this problem when all constraints contain at most k literals
(for constant k ≥ 3). We call this problem 1-in-kSAT. For k = 3, this problem
has often found use as a more convenient starting point compared to 3SAT for
NP-completeness reductions. For each k ≥ 3, determining if all constraints can
be satisfied is NP-hard [9]. We study the approximability of this problem and
some of its variants.

Supported in part by NSF Career Award CCF-0343672.
In addition to being a natural satisfiability problem that merits study for its
own sake, the underlying problem of making unique choices from certain specified
subsets is salient to a natural pricing problem. We describe this connection at
the end of the introduction.
We now formally define the problems we consider. Let {x1, x2, . . . , xn} be
a set of Boolean variables. For a set S ⊆ {xi, x̄i | 1 ≤ i ≤ n} of literals, the
constraint ONE(S) is satisfied by an assignment to {x1, . . . , xn} if exactly one
of the literals in S is set to 1 and the rest are set to 0.
Definition 1 (1-in-kSAT and 1-in-EkSAT). Let k ≥ 2. An instance of 1-in-
kSAT consists of a set {x1, x2, . . . , xn} of Boolean variables, and a collection of
constraints ONE(Si), 1 ≤ i ≤ m for some subsets S1, S2, . . . , Sm of {xi, x̄i | 1 ≤
i ≤ n}, where |Si| ≤ k for each i = 1, 2, . . . , m. When each Si has size exactly
k, we get an instance of Exact 1-in-k Satisfiability, denoted 1-in-EkSAT.
Definition 2 (1-in-kHS and 1-in-EkHS). Let k ≥ 2. An instance of 1-in-kHS
consists of a set {x1, x2, . . . , xn} of Boolean variables, and a collection of con-
straints ONE(Si), 1 ≤ i ≤ m for some subsets S1, S2, . . . , Sm of {xi | 1 ≤ i ≤ n},
where |Si| ≤ k for each i = 1, 2, . . . , m. When each Si has size exactly k, we get
an instance of Exact 1-in-k Hitting Set, denoted 1-in-EkHS.
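The constraint ONE(S) and the objective of both definitions are easy to state in code. In this sketch a literal is encoded as a pair `(variable, negated)` and an assignment as a dictionary from variables to booleans; this encoding is our choice for illustration and is not part of the definitions. A 1-in-kHS instance is simply one in which every literal has `negated = False`.

```python
def one_satisfied(clause, assignment):
    """ONE(S): exactly one literal of the clause evaluates to true.

    clause     -- iterable of (variable, negated) pairs
    assignment -- dict mapping each variable to True or False
    """
    true_literals = sum(1 for var, negated in clause
                        if assignment[var] != negated)
    return true_literals == 1

def count_satisfied(clauses, assignment):
    """Number of ONE constraints satisfied by the assignment."""
    return sum(1 for clause in clauses if one_satisfied(clause, assignment))
```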
Note that 1-in-kHS (resp. 1-in-EkHS) is a special case of 1-in-kSAT (resp. 1-in-
EkSAT) where no negations are allowed. Also, 1-in-kHS is simply the variant of
the Hitting Set problem where we are given a family of sets each of size at most k
from a universe, and the goal is to pick a subset of the universe which intersects
a maximum number of sets from the input family in exactly one element. The 1-
in-EkHS problem corresponds to the case when each set in the family has exactly
k elements. For every k ≥ 3, 1-in-EkHS is NP-hard [9].
Clearly, the following are both orderings of the problems in decreasing order
of their generality: (i) 1-in-kSAT, 1-in-EkSAT, 1-in-EkHS; and (ii) 1-in-kSAT,
1-in-kHS, 1-in-EkHS.
The 1-in-3SAT problem was considered in Schaefer’s work on complexity of
satisfiability problems [9]. An inapproximability factor of 6/5 − ε was shown for
1-in-E3SAT in [6]. We are unaware of any comprehensive prior investigation into
the complexity of approximating 1-in-kSAT and its variants for larger k.
Our Results. For a maximization problem (such as a maximum constraint
satisfaction problem), we define an α-approximation algorithm to be one which
always delivers a solution whose objective value is at least a fraction 1/α of the
optimum. A random assignment clearly satisfies at least a $k\,2^{-k}$ fraction
of the constraints in any 1-in-kSAT instance. This gives a $2^k/k$-approximation
algorithm, and no better approximation algorithm appears to be known for general
k. We prove that, for sufficiently large k, it is NP-hard to approximate
1-in-EkSAT (and hence also the more general 1-in-kSAT) within a $2^{k-O(\sqrt{k})}$ factor.
The result uses a gadget-style reduction from the general Max kCSP problem,
but the analysis of the reduction uses a "random perturbation" technique which
is a bit different from the standard way of analyzing gadget-based reductions.
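The $k\,2^{-k}$ figure for a uniformly random assignment can be confirmed by enumeration. The sketch below assumes a clause on k distinct variables; negating a literal only relabels the two values of its variable, so unnegated literals suffice for the count.

```python
from fractions import Fraction
from itertools import product

def one_probability(k):
    """Probability that exactly one of k independent fair bits equals 1."""
    hits = sum(1 for bits in product((0, 1), repeat=k) if sum(bits) == 1)
    return Fraction(hits, 2 ** k)
```

For clauses with fewer than k literals the same count gives t/2^t for a clause of size t, which is at least k/2^k because t/2^t does not increase with t for t ≥ 1.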

The easiest of the problems we consider, namely 1-in-EkHS, has a simple
e-approximation algorithm. We prove that this is the best possible: obtaining an
(e−ε)-approximation is NP-hard, for every ε > 0 (for large enough k). Using the
algorithm for 1-in-EkHS, one can give an O(log k)-approximation algorithm for
1-in-kHS. (Recently, in [4], a hardness result for approximating 1-in-kHS within
a factor $\log^{\sigma} n$ has been shown. Here n is the size of the universe, k is allowed
to grow with the input, and σ > 0 is an absolute constant.)
The $2^{k-O(\sqrt{k})}$ inapproximability factor for general 1-in-kSAT says that
algorithms that are substantially better than picking a random assignment do not
exist, assuming P ≠ NP. For satisfiable instances of 1-in-kSAT, however, we are
able to give an e-approximation algorithm. Specifically, when given a 1-in-kSAT
instance for which an assignment that satisfies every clause exists, the algorithm
finds an assignment that satisfies at least a fraction 1/e of the constraints. This
is again the best possible, since our (e − ε)-inapproximability result holds for
satisfiable instances of 1-in-EkHS.
Our results highlight the following peculiar behavior of 1-in-kSAT which is
unusual for constraint satisfaction problems. The 1-in-kSAT problem becomes
much easier to approximate when negations are not allowed (the 1-in-kHS
variant, which has an O(log k)-approximation algorithm), or when restricted to
satisfiable instances (which has an e-approximation algorithm).
A Pricing Problem. The 1-in-kHS problem is related to a natural optimization
problem concerning pricing that was recently considered in [7]. Consider the
following pricing question: there is a universe of n items, each in unlimited supply
with the seller. There are m customers, and each customer is single-minded and
wants to buy a precise subset of at most k items (and this subset is known
to the seller). Each customer values his/her subset at one dollar, and thus will
buy the subset if and only if it costs at most one dollar in total. The goal is
to set prices on the items so as to maximize the total revenue. If the prices are all
either 0 or 1 dollars, then this problem is exactly 1-in-kHS. But the seller could
price items in cents, thereby possibly generating more revenue. However, it
can be shown that the optimum with fractional prices is at most e times that
with 0, 1-prices (this follows from Lemma 6). Therefore, this problem admits a
constant factor approximation (with approximation ratio independent of k) if
and only if 1-in-kHS admits such an algorithm.
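The correspondence between 0/1 prices and 1-in-kHS can be checked mechanically. In the sketch below, `revenue` implements the pricing rule as described (a customer pays the total price of the bundle when it is at most one dollar, and nothing otherwise), and `hit_exactly_once` counts the bundles containing exactly one item priced at 1; both names are ours. Prices are handled as exact fractions to avoid rounding issues at the one-dollar threshold.

```python
from fractions import Fraction

def revenue(bundles, price):
    """Total revenue: each customer buys the bundle iff its total price is at most 1."""
    total = Fraction(0)
    for bundle in bundles:
        cost = sum((Fraction(price[item]) for item in bundle), Fraction(0))
        if cost <= 1:
            total += cost
    return total

def hit_exactly_once(bundles, price):
    """For 0/1 prices: number of bundles containing exactly one item of price 1."""
    return sum(1 for bundle in bundles
               if sum(price[item] for item in bundle) == 1)
```

With 0/1 prices a bundle contributes one dollar exactly when a single one of its items is priced at 1, so the two functions agree; fractional prices can do better, as in the bundles `{1, 2}`, `{2, 3}`, `{1, 3}`, where pricing every item at 1/2 earns 3 while no 0/1 pricing earns more than 2.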
The Problem of Constructing "Ad-Hoc Selective Families." Another
application of the 1-in-kHS problem is to the computation of ad-hoc selective
families. An (n, h)-selective family is a combinatorial object defined and studied
in [2] to deal with a broadcast problem in a radio network of unknown topology.
An (n, h)-selective family is a collection $\mathcal{S}$ of subsets of [n] such that for every
set F ⊆ [n] with |F| ≤ h there is a set $S \in \mathcal{S}$ such that |F ∩ S| = 1. More
generally, a collection $\mathcal{S}$ of subsets of [n] is ad-hoc selective for a collection $\mathcal{F}$ of
subsets of [n] if for every set $F \in \mathcal{F}$ there is a set $S \in \mathcal{S}$ such that |S ∩ F| = 1
(in such a case, we say that S selects F). The notion of ad-hoc selective family,
introduced in [3], has applications to the broadcast problem in radio networks of
known topology. Here the computational problem of interest is, given a family
$\mathcal{F}$ (that is related to the topology of the radio network), to find a family $\mathcal{S}$ of
smallest size that is ad-hoc selective for $\mathcal{F}$: the family $\mathcal{S}$ determines a schedule for
the broadcast on the radio network and the time needed to realize the broadcast
depends on the number of sets in the family $\mathcal{S}$. Clementi et al. [3] observe that
given an approximation algorithm for 1-in-kHS one can get an approximation
algorithm for the ad-hoc selective family problem as follows. Think of $\mathcal{F}$ as an
instance of 1-in-kHS, where the correspondence is that the instance of 1-in-kHS
has a variable for every element of the universe [n] of $\mathcal{F}$, and a constraint for
every set of $\mathcal{F}$. An approximation algorithm for 1-in-kHS will find a set S that
"selects" a large number of sets in $\mathcal{F}$: then one deletes those sets from $\mathcal{F}$ and
repeats the above process. In order to optimize this process, Clementi et al. [3]
introduce the idea of dividing $\mathcal{F}$ into sub-families of sets having approximately
the same size, and our O(log k) approximation algorithm for 1-in-kHS is based
on a similar idea.
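The delete-and-repeat reduction described above can be sketched as a short loop. The argument `select` stands for any 1-in-kHS approximation algorithm; `best_singleton` is a deliberately crude placeholder selector of our own (it is not the algorithm of [3] or of this paper), included only so that the loop can be run. The sketch assumes every set of the input family is non-empty.

```python
def greedy_selective_family(family, select):
    """Build an ad-hoc selective family by repeated calls to a 1-in-kHS solver.

    family -- iterable of non-empty sets (the family to be selected)
    select -- callback: list of frozensets -> a set S hitting many of them exactly once
    """
    remaining = [frozenset(F) for F in family]
    schedule = []
    while remaining:
        S = frozenset(select(remaining))
        if not any(len(F & S) == 1 for F in remaining):
            raise ValueError("selector made no progress")
        schedule.append(S)
        # Delete the sets that S selects and repeat on the rest.
        remaining = [F for F in remaining if len(F & S) != 1]
    return schedule

def best_singleton(sets):
    """Placeholder selector (assumption): the one element contained in the most sets."""
    elements = sorted(set().union(*sets))
    return {max(elements, key=lambda x: sum(1 for F in sets if x in F))}
```

The number of rounds, and hence the size of the resulting family, depends on how many sets each call to `select` manages to hit exactly once.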
2 Approximation Algorithms
In this section we present a randomized algorithm that delivers solutions within
factor 1/e of the optimum solution for 1-in-EkHS. We also present a randomized
algorithm that given an instance of 1-in-kSAT that is satisfiable, finds an assign-
ment that satisfies an expected fraction 1/e of the clauses. Both these algorithms
are the best possible in terms of approximation ratio, for large k, as we show in
Section 3.
2.1 Approximation Algorithm for 1-in-EkHS
For the simplest variant 1-in-EkHS, there is a trivial e-approximation algorithm.
Theorem 3. For every integer k ≥ 2, there is a polynomial time e-approxi-
mation algorithm for 1-in-EkHS. The claim holds also when k is not an absolute
constant but an arbitrary function of the universe size.
Proof. Set each variable to 1 independently with probability 1/k. The probability
that a clause of k variables has exactly one variable set to 1 equals
$k \cdot \frac{1}{k}(1 - 1/k)^{k-1} = (1 - 1/k)^{k-1} \ge 1/e$. Therefore the expected fraction of clauses satisfied
by such a random assignment is at least 1/e. The algorithm can be derandomized
using the method of conditional expectations. Note that we did not use the exact
value of k in the above argument, only that all sets had size k.
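As an illustration of the argument above, the following Python sketch samples the random assignment and evaluates the probability $(1 - 1/k)^{k-1}$ that a fixed set of size k is hit exactly once. The function names and the representation of sets as Python sets are our own assumptions and are not taken from the paper; the derandomization step is not shown.

```python
import math
import random


def random_assignment(universe, k, rng):
    """Put each element in the chosen set independently with probability 1/k."""
    return {v for v in universe if rng.random() < 1.0 / k}


def count_hit_exactly_once(sets, chosen):
    """Number of sets that contain exactly one chosen element."""
    return sum(1 for s in sets if len(s & chosen) == 1)


def hit_once_probability(k):
    """Probability that a fixed set of size k is hit exactly once."""
    return k * (1.0 / k) * (1.0 - 1.0 / k) ** (k - 1)
```

Averaging `count_hit_exactly_once` over many samples of `random_assignment` should approach `hit_once_probability(k)` times the number of sets when every set has size exactly k.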
Algorithm for 1-in-kHS. We now consider the case when not all sets have
exactly k elements. For this case we do not know any way to approximate within
a factor that is independent of k.
Theorem 4. There exists c > 0 such that for every integer k ≥ 2, there is a
polynomial time c log k-approximation algorithm for 1-in-kHS. The claim holds
even when k is not a constant but grows with the universe size.

Proof. Let m be the number of sets in the 1-in-kHS instance. We partition the
sets in the 1-in-kHS instance according to their size, placing the sets of size in
the range $[2^{i-1}, 2^i)$ in collection $F_i$, for $i = 1, 2, \ldots, \log k$. Pick the
collection that has a maximum number of sets, say $F_j$, breaking ties arbitrarily.
Clearly, $|F_j| \ge \frac{m}{\log k}$. Set each element to 1 with probability $1/2^j$. Fix a set
in $F_j$ that has x elements, $2^{j-1} \le x < 2^j$. The probability that it has exactly
one variable set to 1 equals $\frac{x}{2^j}\left(1 - \frac{1}{2^j}\right)^{x-1}$, which is easily seen to be at least $\frac{1}{2e}$.
Thus, the expected fraction of sets in $F_j$ that have exactly one element set to 1 is at
least $\frac{1}{2e}$. Therefore, we satisfy at least $\frac{m}{2e \log k}$ sets. The algorithm can again be
derandomized using conditional expectations, and the argument holds for every
k in the range 1 ≤ k ≤ n, where n is the universe size.
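The bucketing step can be sketched as follows. This is a minimal illustration under our own naming; it assumes sets are given as non-empty Python sets and uses the fact that a set of size x lies in the class j with $2^{j-1} \le x < 2^j$ exactly when j is the bit length of x. The probability formula uses the exponent x − 1, which is the exact probability that one of x independent elements is chosen.

```python
import random
from collections import defaultdict


def largest_size_class(sets):
    """Group sets by j with 2^(j-1) <= |s| < 2^j and return (j, largest class)."""
    classes = defaultdict(list)
    for s in sets:
        classes[len(s).bit_length()].append(s)
    j = max(classes, key=lambda i: len(classes[i]))
    return j, classes[j]


def hit_once_probability(x, j):
    """Probability that a set of size x is hit exactly once at rate 1/2^j."""
    return (x / 2 ** j) * (1 - 1 / 2 ** j) ** (x - 1)


def sample_hitting_set(sets, rng):
    """Choose each element of the universe independently with probability 1/2^j."""
    j, _ = largest_size_class(sets)
    universe = set().union(*sets)
    return {v for v in universe if rng.random() < 1 / 2 ** j}
```

For every x in the class of index j the value of `hit_once_probability(x, j)` stays above 1/(2e), which is the bound used in the proof.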
Remark: Note that in the above algorithm, the upper bound on optimum we
used was the total number of sets. With this upper bound, the best approxi-
mation factor we can hope for is O(log k). This is because there are instances
of 1-in-kHS with m sets whose optimum is at most $O\left(\frac{m}{\log k}\right)$. In fact, it can be
shown that an instance obtained by picking an equal number of sets in each of
the buckets at random will have this property with high probability.
2.2 Approximation Algorithm for 1-in-kSAT with Perfect Completeness
So far we saw algorithms for the case when negations were not allowed. We now
give an approximation algorithm for 1-in-kSAT for the case when the instance
is in fact satisfiable. Later on, in Section 3, we will show that without this
restriction, a strong inapproximability result for 1-in-kSAT holds. We will also
show that the factor e is the best possible, even with this restriction.
Theorem 5. For every k ≥ 2, there is an e-approximation algorithm for sat-
isfiable instances of 1-in-kSAT. The claim holds even when k is not an absolute
constant but an arbitrary function of the number of variables.
Proof. The approach is to use a linear programming relaxation, and apply ran-
domized rounding to it to obtain a Boolean assignment to the variables. Let
(V, C) be an instance of 1-in-kSAT, where V = {x1, x2, . . . , xn} is the set of vari-
ables and C = {C1, C2, . . . , Cm} is the set of clauses. For j = 1, 2, . . ., m, define
pos(Cj) ⊆ V to be those variables that appear positively (i.e., unnegated) in
Cj, and neg(Cj) to be those variables that appear negated in Cj. Consider the
linear program P with the following constraints in variables p1, p2, . . . , pn:
$0 \le p_i \le 1$ for $i = 1, 2, \ldots, n$,
$\sum_{i : x_i \in \mathrm{pos}(C_j)} p_i + \sum_{i : x_i \in \mathrm{neg}(C_j)} (1 - p_i) = 1$ for $j = 1, 2, \ldots, m$.
The above program has a feasible solution. Indeed, let a : V → {0, 1} be an as-
signment that satisfies all clauses in C. Then clearly pi = a(xi) for i = 1, 2, . . ., n
satisfies all the above constraints.

Solve the linear program (P) to find a feasible solution $p_i^*$, $1 \le i \le n$, in
polynomial time. We need to convert this solution into an assignment $a^* : V \to \{0, 1\}$.
We do this using randomized rounding. That is, for each $x_i$ independently,
we set
$$a^*(x_i) = \begin{cases} 1 & \text{with probability } p_i^* \\ 0 & \text{with probability } 1 - p_i^* \end{cases}$$
Now consider the probability that a particular clause, say C1, is satisfied. Let
C1 depend on r variables, $x_{i_1}, x_{i_2}, \ldots, x_{i_r}$. For $1 \le \ell \le r$, define $q_\ell = p^*_{i_\ell}$
if $x_{i_\ell}$ appears unnegated in C1, and $q_\ell = 1 - p^*_{i_\ell}$ if $x_{i_\ell}$ appears negated in C1. Then
we have $q_1 + q_2 + \cdots + q_r = 1$. The probability that C1 is satisfied equals
$$q_1(1 - q_2)(1 - q_3) \cdots (1 - q_r) + (1 - q_1)q_2(1 - q_3) \cdots (1 - q_r) + \cdots + (1 - q_1)(1 - q_2) \cdots (1 - q_{r-1})q_r .$$
By Lemma 6, this quantity is minimized when $q_1 = q_2 = \cdots = q_r = 1/r$, and
thus is at least $(1 - 1/r)^{r-1} \ge 1/e$. Therefore the expected fraction of clauses
satisfied by the randomized rounding is at least 1/e, proving the theorem.
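The rounding step can be sketched in Python as follows. Solving the linear program itself is omitted; the sketch assumes a feasible vector p is already available. The encoding of a clause as a list of signed 1-based integers (+i for $x_i$, -i for its negation) is our own assumption for illustration.

```python
import random


def clause_lp_value(clause, p):
    """Left-hand side of the LP constraint of one clause under the fractional p."""
    return sum(p[abs(l) - 1] if l > 0 else 1.0 - p[abs(l) - 1] for l in clause)


def randomized_round(p, rng):
    """Set x_i to 1 independently with probability p[i]."""
    return [1 if rng.random() < pi else 0 for pi in p]


def satisfies_one_in_k(clause, a):
    """True if exactly one literal of the clause is true under assignment a."""
    true_literals = sum(
        a[abs(l) - 1] if l > 0 else 1 - a[abs(l) - 1] for l in clause
    )
    return true_literals == 1
```

When p is integral the rounding returns p itself, so a satisfying assignment used as the LP solution is reproduced exactly.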
We now state the inequality that was used in the above proof. An elegant
proof of this inequality was shown to us by Chris Chang. For reasons of space,
we omit the proof here.
Lemma 6. Let r ≥ 2 and let $q_1, q_2, \ldots, q_r$ be non-negative real numbers that sum up
to 1. Then the quantity
$$q_1(1 - q_2) \cdots (1 - q_r) + (1 - q_1)q_2(1 - q_3) \cdots (1 - q_r) + \cdots + (1 - q_1) \cdots (1 - q_{r-1})q_r$$
attains its minimum value when $q_1 = q_2 = \cdots = q_r = 1/r$. In particular, the
quantity is at least $(1 - 1/r)^{r-1}$.
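Since the proof of Lemma 6 is omitted, a numerical sanity check may be useful. The Python sketch below evaluates the quantity of the lemma at random points of the simplex and compares it with $(1 - 1/r)^{r-1}$. It is only an experiment under our own naming, not a proof.

```python
import random


def exactly_one_probability(q):
    """Probability that exactly one of independent events with probabilities q occurs."""
    total = 0.0
    for i, qi in enumerate(q):
        term = qi
        for j, qj in enumerate(q):
            if j != i:
                term *= 1.0 - qj
        total += term
    return total


def random_simplex_point(r, rng):
    """A random non-negative vector of length r whose entries sum to 1."""
    weights = [rng.random() + 1e-12 for _ in range(r)]
    scale = sum(weights)
    return [w / scale for w in weights]
```

At the uniform point `[1/r] * r` the function returns $(1 - 1/r)^{r-1}$ up to rounding, which is the claimed minimum.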
3 Inapproximability Results
3.1 Factor $2^{\Omega(k)}$ Hardness for 1-in-EkSAT
We now prove that, if P ≠ NP, then, for sufficiently large k, 1-in-EkSAT cannot
be approximated within a factor of $2^{k - O(\sqrt{k})}$. The result uses a gadget-style
reduction from the constraint satisfaction problem Max EkAND, but the analysis
of the reduction uses a “random perturbation” technique which is a bit different
from the standard way of analyzing gadget-based reductions.
Preliminaries. We first define the Max EkAND problem. An instance of Max
EkAND consists of a set of Boolean variables, and a collection of AND constraints
of the form l1 ∧ l2 ∧ · · · ∧ lk where each lj is a literal. The goal is to find an
assignment that satisfies a maximum number of the AND constraints.

Theorem 7 ([8]). If P ≠ NP, then for every ε > 0 and for all integers q ≥ 1 and
k such that $2q + 1 \le k \le 2q + q^2$, there is no $(2^{k-2q} - \varepsilon)$-approximation algorithm
for Max EkAND. In particular, for every k ≥ 7 and ε > 0, there is no $(2^{k - 2\sqrt{k}} - \varepsilon)$-approximate
algorithm for Max kAND. Furthermore, if ZPP ≠ NP, then for
every ε there is a constant c such that there is no $n^{1-\varepsilon}$-approximate algorithm
for Max (c log n)AND.
The furthermore part follows from the result of [8] plus the use of a random-
ized reduction described in [1].
Inapproximability Result. We describe a reduction from Max EkAND to
1-in-EkSAT.
Lemma 8. Suppose that there is a polynomial time β-approximation algorithm
for 1-in-EkSAT, for some k ≥ 3. Then there is a 2ekβ-approximate algorithm
for Max EkAND.
Proof. We describe how to map an instance ϕAND of Max EkAND into an instance
ϕoik of 1-in-EkSAT. Let l1 ∧ · · · ∧ lk be a clause of the Max EkAND instance ϕAND.
We introduce the constraints ONE(l1, ¬l2, . . . , ¬lk), ONE(¬l1, l2, . . . , ¬lk),
. . . , ONE(¬l1, ¬l2, . . . , lk), where ¬l denotes the negation of the literal l.
If originally we had m AND constraints, now we have km constraints of 1-in-EkSAT.
Claim 1. If there is an assignment that satisfies t constraints in ϕAND, then the
same assignment satisfies at least tk constraints in ϕoik.
Proof of Claim. For each clause l1 ∧ · · · ∧ lk that is true, the k corresponding
ONE constraints are satisfied.
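The gadget and Claim 1 can be illustrated with a short Python sketch. Literals are encoded as non-zero integers (a negative sign denotes negation) and assignments as dictionaries from variable index to 0 or 1; this encoding and the function names are our own assumptions.

```python
def and_to_one_constraints(and_clause):
    """Map l1 AND ... AND lk to the k ONE constraints of the reduction."""
    return [
        [lit if j == i else -lit for j, lit in enumerate(and_clause)]
        for i in range(len(and_clause))
    ]


def literal_value(lit, a):
    """Value (0 or 1) of a literal under assignment a."""
    value = a[abs(lit)]
    return value if lit > 0 else 1 - value


def and_satisfied(clause, a):
    """True if every literal of the AND clause is true."""
    return all(literal_value(l, a) == 1 for l in clause)


def one_satisfied(constraint, a):
    """True if exactly one literal of the ONE constraint is true."""
    return sum(literal_value(l, a) for l in constraint) == 1
```

For example, with the clause `[1, -2, 3]` and the assignment `{1: 1, 2: 0, 3: 1}` the AND clause is true and all three ONE constraints produced by `and_to_one_constraints` are satisfied, as Claim 1 states.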
Claim 2. If there is an assignment that satisfies t′ constraints in ϕoik, then there
is an assignment that satisfies at least $t'/(2ek^2)$ constraints in ϕAND.
Proof of Claim. Take the assignment a that satisfies t′ constraints in ϕoik and
then randomly perturb it in the following way: for each variable independently,
flip its value with probability 1/k and leave the value unchanged with probability
1−1/k; we will estimate the expected number of clauses of ϕAND that are satisfied
by this random assignment R.
Let us look at a clause l1 ∧ · · · ∧ lk of ϕAND and the k corresponding clauses
of ϕoik. We consider different cases, depending upon how many of the literals
l1, l2, . . . , lk are set to 1 by the assignment a.
– If a satisfies all literals, then it satisfies all k clauses in ϕoik and the random
assignment satisfies l1 ∧ · · · ∧ lk with probability at least $(1 - 1/k)^k \ge 8/27$,
for k ≥ 3.
– If a satisfies k − 2 literals, then it satisfies 2 clauses in ϕoik, and the random
assignment satisfies l1 ∧ · · · ∧ lk with probability at least $(1 - 1/k)^{k-2} \cdot k^{-2} \ge 1/(ek^2)$.
– In all other cases, a satisfies none of the clauses corresponding to l1 ∧ · · · ∧ lk
in ϕoik.
In each case, we have that the probability that the random assignment satisfies l1 ∧ · · · ∧ lk
is at least $1/(2ek^2)$ times the number of constraints corresponding to l1 ∧ · · · ∧ lk
in ϕoik satisfied by a. If a satisfies t′ constraints, it follows that the random
assignment satisfies, on average, at least $t'/(2ek^2)$ clauses of ϕAND.
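The case analysis can be checked mechanically for small k. The sketch below, under our own naming, assumes that the k literals of a clause are over distinct variables and represents the assignment a only through the truth values of these literals. It counts the satisfied ONE constraints and computes the exact probability that the perturbed assignment makes all literals true.

```python
import math


def one_constraints_satisfied(truth):
    """truth[i] is 1 if literal i is true under a; count satisfied ONE constraints."""
    count = 0
    for i in range(len(truth)):
        # Constraint i keeps literal i and negates all the other literals.
        true_literals = sum(t if j == i else 1 - t for j, t in enumerate(truth))
        if true_literals == 1:
            count += 1
    return count


def and_probability_after_flips(truth):
    """Probability that flipping each variable with probability 1/k makes all literals true."""
    k = len(truth)
    prob = 1.0
    for t in truth:
        prob *= (1.0 - 1.0 / k) if t else (1.0 / k)
    return prob
```

For every pattern of true and false literals with k ≥ 3, the probability returned by the second function is at least $1/(2ek^2)$ times the count returned by the first, which is the inequality used in the proof of Claim 2.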
Suppose now that we have a β-approximate algorithm for 1-in-EkSAT and
that we are given as input an instance ϕAND of Max EkAND. Suppose that the
optimum solution for ϕAND has cost opt. Then we construct an instance ϕoik
of 1-in-EkSAT as described above, and give it to the approximation algorithm.
By Claim 1, the optimum of ϕoik is at least k · opt, and so the algorithm will
return a solution of cost at least k · opt/β. By Claim 2, we get a distribution
over assignments for ϕAND that, on average, satisfies at least opt/(2eβk) con-
straints. An assignment that satisfies at least as many constraints can be found
deterministically using the method of conditional expectations.
Theorem 9. If P ≠ NP, for every sufficiently large k and for every ε > 0,
there is no polynomial time $(2^{k - 2\sqrt{k}}/(2ek) - \varepsilon)$-approximation algorithm for
1-in-EkSAT. Furthermore, if ZPP ≠ NP, for every ε there is a c such that there
is no polynomial time $n^{1-\varepsilon}$-approximation algorithm for 1-in-(c log n)SAT.
Proof. Follows from Theorem 7 and from the reduction of Lemma 8. For the
“furthermore” part one should observe that the reduction does not increase the
size of the input by more than a logarithmic factor.
3.2 Factor e − ε Hardness for 1-in-EkHS
In this section, we prove the following hardness result, which shows that the
results of Theorems 3 and 5 are tight in terms of the approximation ratio.
Theorem 10. For every ε > 0, for sufficiently large k, there is no polynomial
time (e − ε)-approximation algorithm for 1-in-EkHS, unless P = NP. Further-
more, the result holds even when the instance of 1-in-EkHS is satisfiable.
Multiprover Systems. Our proof will use the approach behind Feige’s hard-
ness result for set cover [5]. We will give a reduction from the multiprover proof
system of Feige, which we state in a form convenient to us below. In what follows,
we use [m] to denote the set {1, 2, . . ., m}.
Definition 11 (p-prover game). For every integer p ≥ 2 and a parameter u
that is an even integer, an instance I of the p-prover game of size n is defined
as follows.
– The instance consists of a p-uniform p-partite hypergraph H with the follow-
ing properties:

• [Vertices] The vertex set of H is given by W = Q1 ∪ Q2 ∪ · · · ∪ Qp, where Qi
is the set of vertices in the i’th part (or prover), and $|Q_i| = Q = n^{u/2}(5n/3)^{u/2}$
for i = 1, 2, . . . , p.
• [Hyperedges] There are $R = (5n)^u$ hyperedges in H, labeled by r ∈ [R].
Denote the r’th hyperedge, for r ∈ [R], by $(q_{r,1}, q_{r,2}, \ldots, q_{r,p})$, where $q_{r,i} \in Q_i$
for i ∈ [p].
• [Regularity] Each vertex in W belongs to precisely R/Q hyperedges.
– Define $B = 4^u$ and $A = 2^u$. For each r ∈ [R], and i ∈ [p], the instance
consists of projections $\pi_{r,i} : [B] \to [A]$ each of which is (B/A)-to-1.
The goal is to find a labeling a : W → [B] that “satisfies” as many hyperedges of
H as possible, where we define the notion of when a hyperedge is satisfied below.
– [Strongly satisfied hyperedges] We say that a labeling a : W → [B] strongly
satisfies a hyperedge r ∈ [R] if
$$\pi_{r,1}(a(q_{r,1})) = \pi_{r,2}(a(q_{r,2})) = \cdots = \pi_{r,p}(a(q_{r,p})) .$$
– [Weakly satisfied hyperedges] We say that a labeling a : W → [B] weakly
satisfies a hyperedge r ∈ [R] if at least two elements of the tuple
$$\pi_{r,1}(a(q_{r,1})),\ \pi_{r,2}(a(q_{r,2})),\ \ldots,\ \pi_{r,p}(a(q_{r,p}))$$
are equal.
Feige’s result on the above p-prover games can be stated as follows.
Theorem 12. There exists a constant 0 < c < 1 such that for every p ≥ 2 and
all large enough u, given an instance I of the p-prover game with parameter u,
it is NP-hard to distinguish between the following two cases, when it is promised
that one of them holds:
– Yes Instances: There is a labeling that strongly satisfies all hyperedges.
– No Instances: No labeling weakly satisfies more than a fraction $p^2 c^u$ of the
hyperedges.
Note that the difference between Yes and No instances is not just in the fraction of
satisfied hyperedges, but also in how the hyperedge is satisfied (strong vs. weak).
Reduction to 1-in-EkHS. The result of Theorem 10 clearly follows from The-
orem 12 and the reduction guaranteed by the following lemma.
Lemma 13. For every ε > 0, there exists a positive integer p such that for
all large enough even u the following holds. Let $k = 4^u$. There is a polynomial
time reduction from p-prover games with parameter u to 1-in-EkHS that has the
following properties:
– [Completeness]: If the original instance of the p-prover game is a Yes in-
stance (in the sense of Theorem 12), then the instance of 1-in-EkHS produced
by the reduction is satisfiable.

– [Soundness]: If the original instance of the p-prover game is a No instance
(in the sense of Theorem 12), then no assignment satisfies more than a
fraction (1/e+ε) of the constraints in the instance of 1-in-EkHS produced by
the reduction.
Proof. We begin with describing the reduction. Suppose we are given an instance
I of the p-prover game with parameter u. In the following, we will use the
notation and terminology from Definition 11. We define an instance of 1-in-EkHS
on the universe
$$U \stackrel{\mathrm{def}}{=} \{(i, q, a) \mid i \in [p],\ q \in Q_i,\ a \in [B]\} .$$
Thus the universe simply corresponds to all possible (vertex, label) pairs. The
collection of sets in the instance is given by $\{S_{r,x}\}$ as r ranges over [R] and x
over $\{1, 2, \ldots, p\}^A$, where
$$S_{r,x} \stackrel{\mathrm{def}}{=} \{(i, q_{r,i}, a) \in U \mid x_{\pi_{r,i}(a)} = i\} .$$
Note that the size of each $S_{r,x}$ equals $B = 4^u = k$. This is because
$S_{r,x} = \bigcup_{j \in [A]} \{(i, q_{r,i}, a) \mid i = x_j,\ \pi_{r,i}(a) = j\}$, and for each j ∈ [A], there are precisely
B/A elements a ∈ [B] such that $\pi_{r,i}(a) = j$ (since the projections are (B/A)-to-1).
Let us first argue the completeness (this will also help elucidate the rationale
for the choice of the sets Sr,x).
Claim 3 (Completeness). Let a : W → [B] be an assignment that strongly
satisfies all hyperedges of I. Consider the subset C = {(i, q, a(q)) | i ∈ [p], q ∈
Qi}. Then $|C \cap S_{r,x}| = 1$ for every r ∈ [R] and $x \in [p]^A$.
Proof of Claim. Since a strongly satisfies every hyperedge r ∈ [R], we have
$\pi_{r,1}(a(q_{r,1})) = \pi_{r,2}(a(q_{r,2})) = \cdots = \pi_{r,p}(a(q_{r,p}))$; let $j_r \in [A]$ denote
this common value. Also let $i_r = x_{j_r} \in [p]$. Then it is not hard to check that
$C \cap S_{r,x} = \{(i_r, q_{r,i_r}, a(q_{r,i_r}))\}$.
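To make the construction concrete, the following Python sketch builds the sets $S_{r,x}$ for a single hyperedge r of a toy instance and checks both the size claim and Claim 3. Everything is 0-indexed, an element $(i, q_{r,i}, a)$ is abbreviated to the pair (i, a) because the vertex is determined by r and i, and the toy projections are our own choice rather than part of the paper.

```python
from itertools import product


def build_set(p, B, proj, x):
    """S_{r,x} for one hyperedge: proj[i][a] plays the role of pi_{r,i}(a)."""
    return {(i, a) for i in range(p) for a in range(B) if x[proj[i][a]] == i}


def all_sets(p, A, B, proj):
    """All sets S_{r,x} as x ranges over vectors in [p]^A."""
    return {x: build_set(p, B, proj, x) for x in product(range(p), repeat=A)}
```

With p = 3, A = 2, B = 4 and every projection equal to `a % 2` (which is 2-to-1), each set has size B, and a labeling whose labels all project to the same value meets every set in exactly one element.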
Claim 4 (Soundness). Suppose that some C ⊆ U satisfies $|C \cap S_{r,x}| = 1$ for
at least a fraction 1/e + ε of the sets $S_{r,x}$. Then, provided p ≥ 1 + 3/(eε) and
$c^u < \varepsilon/(2p^8)$, I is not a No instance.
Proof of Claim. For each i ∈ [p] and each vertex q ∈ Qi, define $A_q = \{a \in [B] \mid (i, q, a) \in C\}$,
i.e., $A_q$ consists of those labels for q that the subset C “picked”.
We will later use the sets $A_q$, q ∈ W, to prove that a good labeling exists for I.
Call r ∈ [R] nice if at least a fraction (1/e + ε/2) of the sets $S_{r,x}$, as
x ranges over $[p]^A$, satisfy $|C \cap S_{r,x}| = 1$. By an averaging argument, at least a
fraction ε/2 of r ∈ [R] are nice.
Let us now focus on a specific r that is nice. Define
Dr = {(i, b) | i ∈ [p], b ∈ [A], (i, qr,i, a) ∈ C for some a s.t. πr,i(a) = b} .

That is, $D_r$ consists of the projections of the assignments in $A_{q_{r,i}}$ for all vertices
$q_{r,i}$ belonging to hyperedge r. Let $|D_r| = M$ and let $(i_1, b_1), (i_2, b_2), \ldots, (i_M, b_M)$ be
the elements of $D_r$.
Now if $|C \cap S_{r,x}| = 1$, then exactly one of the events $x_{b_j} = i_j$ must hold as j
ranges over [M]. Since r is nice we know that the fraction of such $x \in [p]^A$ is at
least (1/e + ε/2). We consider the following cases.
Case A: $M > p^3$. Then there are at least $M' \stackrel{\mathrm{def}}{=} M/p > p^2$ distinct values
among $b_1, b_2, \ldots, b_M$. For definiteness, assume that $b_1, b_2, \ldots, b_{M'}$ are distinct. If
exactly one of the events $x_{b_j} = i_j$ holds as j ranges over [M], then certainly at
most one j in the range $1 \le j \le M'$ satisfies $x_{b_j} = i_j$. The fraction of $x \in [p]^A$
for which $\left|\{j \mid j \in [M'],\ x_{b_j} = i_j\}\right| \le 1$ is at most
$$\left(1 - \frac{1}{p}\right)^{M'} + M'\left(1 - \frac{1}{p}\right)^{M'-1} \le e^{-(M'-1)/p}(M' + 1) \le e^{-p}(p^2 + 2) < 1/e$$
for p ≥ 10. This contradicts the fact that r is nice. Therefore this case cannot
occur and we must have $M \le p^3$.
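Case B: $M \le p^3$ and all the $b_j$’s are distinct. Clearly, the fraction of $x \in [p]^A$
for which $x_{b_j} = i_j$ holds for exactly one choice of j ∈ [M] is precisely $\frac{M}{p}\left(1 - 1/p\right)^{M-1}$. Now
we bound this quantity as follows:
$$\begin{aligned}
\frac{M}{p}\left(1 - 1/p\right)^{M-1} &= \frac{p}{p-1} \cdot \frac{M}{p}\left(1 - 1/p\right)^{M} \\
&\le \frac{p}{p-1} \cdot \frac{M}{p}\, e^{-M/p} && \text{(using } 1 - x \le e^{-x} \text{ for } x \ge 0\text{)} \\
&\le \frac{p}{p-1} \cdot \frac{1}{e} && \text{(using } x e^{-x} \le 1/e \text{ for } x \ge 0\text{)} \\
&\le \frac{1}{e} + \frac{\varepsilon}{3}
\end{aligned}$$
provided p ≥ 1 + 3/(eε). Again, this contradicts the fact that r is nice. Therefore
this case cannot occur either.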
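The chain of inequalities in Case B can be checked numerically. The short Python sketch below, with function names of our own choosing, evaluates the exact fraction and the smallest integer p allowed by the condition p ≥ 1 + 3/(eε).

```python
import math


def exactly_one_fraction(M, p):
    """Fraction of x for which exactly one of M independent events of probability 1/p holds."""
    return (M / p) * (1 - 1 / p) ** (M - 1)


def smallest_admissible_p(eps):
    """Smallest integer p with p >= 1 + 3 / (e * eps)."""
    return math.ceil(1 + 3 / (math.e * eps))
```

For instance, `smallest_admissible_p(0.1)` is 13, and for p = 13 the fraction stays below 1/e + 0.1/3 for every M up to $p^3$.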
Therefore we can conclude the following: if r is nice, then $|D_r| \le p^3$ and there
exist $i_{r,1} \ne i_{r,2} \in [p]$ such that for some $b_r \in [A]$, $\{(i_{r,1}, b_r), (i_{r,2}, b_r)\} \subseteq D_r$.
Consider the following labeling a of W. For each q ∈ W, set a(q) to be a
random, uniformly chosen, element of $A_q$ (if $A_q$ is empty we set a(q) arbitrarily).
Consider a nice r. With probability at least $1/p^6$, we have
$$\pi_{r,i_{r,1}}\left(a(q_{r,i_{r,1}})\right) = \pi_{r,i_{r,2}}\left(a(q_{r,i_{r,2}})\right) = b_r$$
and thus hyperedge r is weakly satisfied by the labeling a.
In particular, there exists a labeling that weakly satisfies at least a fraction
$1/p^6$ of the nice r, and hence at least a fraction $\frac{\varepsilon}{2p^6}$ of all r ∈ [R]. If u is large
enough so that $p^2 c^u < \frac{\varepsilon}{2p^6}$ (where c is the constant from Theorem 12), we know
that I is not a No instance.
The completeness and soundness claims together yield the lemma.

4 Conclusions
The 1-in-EkSAT problem, while hard to approximate within a $2^{\Omega(k)}$ factor, becomes
substantially easier and admits an e-approximation in polynomial time
with either one of two restrictions: (i) do not allow negations (which is the 1-in-
EkHS problem), (ii) consider satisfiable instances. Such a drastic change in ap-
proximability under such restrictions is quite unusual for natural constraint satis-
faction problems (discounting problems which become polynomial-time tractable
under these restrictions).
We conclude with two open questions:
– Does 1-in-kHS admit a polynomial time o(log k)-approximation algorithm?¹
– Does the $2^{\Omega(k)}$ hardness for 1-in-EkSAT hold for near-satisfiable instances
(for which a fraction (1 − ε) of the constraints can be satisfied by some
assignment)?
Acknowledgments
We would like to thank Chris Chang for his proof of Lemma 6 and Ziv Bar-Yossef
for useful discussions.
References
1. M. Bellare, O. Goldreich, and M. Sudan. Free bits, PCP’s and non-approximability
– towards tight results. SIAM Journal on Computing, 27(3):804–915, 1998.
2. B. S. Chlebus, L. Gasieniec, A. Gibbons, A. Pelc, and W. Rytter. Deterministic
broadcasting in unknown radio networks. In Proceedings of the 11th ACM-SIAM
Symposium on Discrete Algorithms, pages 861–870, 2000.
3. A. Clementi, P. Crescenzi, A. Monti, P. Penna, and R. Silvestri. On computing
ad-hoc selective families. In Proceedings of RANDOM’01, 2001.
4. E. Demaine, U. Feige, M. Hajiaghayi, and M. Salavatipour. Combination can be
hard: approximability of the unique coverage problem. Manuscript, April 2005.
5. U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM,
45(4):634–652, 1998.
6. V. Guruswami. Query-efficient checking of proofs and improved PCP characteriza-
tions of NP. Master’s thesis, MIT, 1999.
7. V. Guruswami, J. Hartline, A. Karlin, D. Kempe, C. Kenyon, and F. McSherry. On
profit-maximizing envy-free pricing. In Proceedings of the 16th ACM-SIAM Sympo-
sium on Discrete Algorithms (SODA), January 2005.
8. A. Samorodnitsky and L. Trevisan. A PCP characterization of NP with optimal
amortized query complexity. In Proceedings of the 32nd ACM Symposium on Theory
of Computing, 2000.
9. T. J. Schaefer. The complexity of satisfiability problems. In Proceedings of the 10th
ACM Symposium on Theory of Computing, pages 216–226, 1978.
¹ A recent work [4] presents some evidence that perhaps such an algorithm does not
exist.

Approximating the Distortion
Alexander Hall¹ and Christos Papadimitriou²
¹ ETH Zürich, Switzerland
alex.hall@gmail.com
² UC Berkeley, USA
christos@cs.berkeley.edu
Abstract. Kenyon et al. (STOC 04) compute the distortion between
one-dimensional finite point sets when the distortion is small; Papadimitriou
and Safra (SODA 05) show that the problem is NP-hard to approximate
within a factor of 3, albeit in 3 dimensions. We solve an open
problem in these two papers by demonstrating that, when the distortion
is large, it is hard to approximate within large factors, even for
1-dimensional point sets. We also introduce additive distortion, and show
that it can be easily approximated within a factor of two.
1 Introduction
The distortion problem is the following: Given two d-dimensional finite point
sets $S, T \subseteq \mathbb{R}^d$ with |S| = |T| and a real number $\delta \in \mathbb{R}$ (the distortion), is there
a bijection f : S → T such that $\mathrm{expansion}(f) \cdot \mathrm{expansion}(f^{-1}) \le \delta$,
with $\mathrm{expansion}(f) := \max_{x,y \in S} \left(d(f(x), f(y))/d(x, y)\right)$? Here d(x, y) denotes the
Euclidean distance between two points $x, y \in \mathbb{R}^d$.
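For very small 1-dimensional instances the definition can be evaluated directly. The Python sketch below is our own illustration and not an algorithm from the paper: a bijection is represented by two equally long lists, where the point `src[i]` is mapped to `dst[i]`, all points are assumed to be distinct, and the minimum distortion is found by trying all bijections.

```python
from itertools import combinations, permutations


def expansion(src, dst):
    """Largest ratio d(f(x), f(y)) / d(x, y) for the map src[i] -> dst[i]."""
    return max(
        abs(dst[i] - dst[j]) / abs(src[i] - src[j])
        for i, j in combinations(range(len(src)), 2)
    )


def distortion(src, dst):
    """expansion(f) * expansion(f^-1) for the bijection src[i] -> dst[i]."""
    return expansion(src, dst) * expansion(dst, src)


def min_distortion_bruteforce(S, T):
    """Minimum distortion over all bijections; exponential time, tiny inputs only."""
    return min(distortion(S, list(perm)) for perm in permutations(T))
```

For S = [0, 1, 3] and T = [0, 2, 3] the identity-order map has distortion 4, while the order-reversing map preserves all distances and has distortion 1.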
The distortion problem was introduced by Kenyon et al. [1], who gave an optimal
algorithm for 1-dimensional point sets that are known to have distortion
less than $3 + 2\sqrt{2}$. Their elaborate dynamic programming algorithm crucially
relies on the small distortion guarantee to establish and exploit certain restric-
tions on the bijection between the two point sets. Papadimitriou and Safra [2]
present NP-hardness and inapproximability results, which hold for both small
and large distortions – albeit for the 3 dimensional case. In both papers the
question of whether the distortion of 1-dimensional point sets can be computed
or approximated if it is not known to be small was proposed as an open problem.
In this paper we resolve this question by establishing several strong NP-hardness
and inapproximability results. In Section 2.1 we show that the distortion
problem is NP-hard in the 1-dimensional case when the distortion is
at least $|S|^{\varepsilon}$, for any ε > 0. The proof is a surprisingly simple reduction from
the (strongly NP-complete) 3-partition problem. By appropriately modifying the
proof we show that, in the same range, even the logarithm of the distortion can-
not be approximated within a factor better than 2 (Theorem 2). This answers

Work partially supported by European Commission – Fet Open project DELIS IST-
001907 Dynamically Evolving Large Scale Information Systems.
an open question posed in [1]. For larger distortions (growing faster than |S|) we
show a further inapproximability result: The distortion cannot be approximated
within a ratio better than $L_T^{1-\varepsilon}$, where $L_T$ is the ratio of the largest to the
smallest distance in point set T; we point out that an approximation ratio of $L_T^2$
is always trivial, in any metric space. We make more precise the inapproximability
bounds given in [2] for 3 dimensions and arbitrary distortion, by making
explicit the dependency of the inapproximability ratio on the magnitude of the
distortion. An overview of our inapproximability results:
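Dimension      Distortion                                  Inapproximable within                            Reference
d ≥ 1          $|S| \ge \delta \ge |S|^{\varepsilon}$      $\delta^{1-\varepsilon}$                         Thm. 2
d ≥ 1          $\delta \ge |S|$                            $\sqrt{\delta} \cdot |S|^{\frac{1}{2}-\varepsilon}$   Thm. 2
d ≥ 1          $\delta \ge |S|^{1+\varepsilon}$            $L_T^{1-\varepsilon}$                            Thm. 2
d ≥ 3          $\delta > 1$                                $9 - 8/\delta^2 - \varepsilon$                   Thm. 3
unbounded d    $\delta > 1$                                $\delta - \varepsilon$                           Thm. 4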
Motivated by these strong inapproximability results, we introduce a novel
variant of the problem that we call the additive distortion problem: Given two
finite point sets $S, T \subseteq \mathbb{R}^d$
with |S| = |T|, find the smallest $\Delta \in \mathbb{R}$ (the
additive distortion) such that there is a bijection f : S → T with d(x, y) −
Δ ≤ d(f(x), f(y)) ≤ d(x, y) + Δ, for all x, y ∈ S. By a modification of the
Papadimitriou-Safra construction, it is not hard to see that the additive dis-
tortion is NP-hard to approximate by a factor better than 3 in 3 dimensions.
In Section 3 we present a 2-approximation algorithm for this problem in the
1-dimensional case and a 5-approximation algorithm for the more general case
of embedding an arbitrary metric space onto a 1-dimensional point set. Finally,
we conclude by pointing out several open questions raised by this work.
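The additive distortion of a fixed bijection is simply the largest absolute change of a pairwise distance, which the following Python sketch computes for 1-dimensional points. It is a brute-force illustration under our own naming, not the 2-approximation algorithm of Section 3; a bijection is again given by mapping `src[i]` to `dst[i]`.

```python
from itertools import combinations, permutations


def additive_distortion(src, dst):
    """Smallest Delta with d(x,y) - Delta <= d(f(x),f(y)) <= d(x,y) + Delta for all pairs."""
    return max(
        abs(abs(dst[i] - dst[j]) - abs(src[i] - src[j]))
        for i, j in combinations(range(len(src)), 2)
    )


def min_additive_distortion_bruteforce(S, T):
    """Minimum additive distortion over all bijections; tiny inputs only."""
    return min(additive_distortion(S, list(perm)) for perm in permutations(T))
```

For S = [0, 1, 3] and T = [0, 2, 3] the identity-order map has additive distortion 1, while the optimum over all bijections is 0.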
Remark. The first three and the last inapproximability results can be strengthened
by a power of 2, if we impose the stronger restriction of $\mathrm{expansion}(f) \le \sqrt{\delta}$
and $\mathrm{expansion}(f^{-1}) \le \sqrt{\delta}$ (which implies a distortion of ≤ δ). In particular the
third bound becomes $L_T^{2-\varepsilon}$, which is near optimal since a ratio of $L_T^2$ is trivial
also in this more restricted setting.
Related Work. Especially in view of the drastic increase in the number of pub-
lications concerning embeddings of metric spaces, it is astonishing that the low
distortion problem was only introduced very recently [1]. Most Computer Sci-
ence related work in this area focuses on the setting where a given finite metric
space is to be embedded into an infinite host space, usually a low dimensional
Euclidean space. Although methods from this area do not seem to apply to our
setting of embedding a finite metric onto another finite metric, we give a brief
overview of related work.
From a theoretical point of view there has been a large interest in finding
worst case bounds for the distortion of embedding a class of metrics, e.g. see
the surveys [3–5]. The problem of finding a good embedding for a given metric
(“good” compared to an optimal embedding of this, same metric) is practically

more relevant and consequently the vast majority of research in this area – also
referred to as multi-dimensional scaling – has been on devising good heuris-
tics. See the web-page of the working group on multi-dimensional scaling [6] for
an overview and an extensive list of references. An important theoretical result
is Linial et al.’s [7] adaptation of Bourgain’s construction [8]. They present an
approximation algorithm based on semidefinite programming for finding a min-
imum distortion embedding of a given finite metric. Kleinberg et al. [9] consider
approximate embeddings of metrics for which only a small subset of the distances
are known. Slivkins [10] recently improved on these results. Also recently, Bădoiu
et al. [11] gave several approximation algorithms for low distortion embeddings
of metrics into $\mathbb{R}^1$ and $\mathbb{R}^2$. The notion of additive distortion has also been
considered for the case where a finite metric is to be embedded into an infinite host
space. Håstad et al. [12] give a 2-approximation for the case of embedding into R
and prove that the problem cannot be approximated within 4/3, unless P = NP.
Later Bădoiu [13] and Bădoiu et al. [14] gave an approximation algorithm and
a weakly quasi-polynomial time algorithm, respectively, for 2 dimensions.
Other research loosely related to the distortion problem is on the minimum
bandwidth problem (see e.g. [15]) and the maximum similarity problem [16].
2 The Inapproximability of Distortion
2.1 NP-Hardness
Theorem 1. The distortion problem is NP-hard for 1 dimension and any fixed
$\delta \ge |S|^{\varepsilon}$, for any constant ε > 0.
Proof. We reduce the well known 3-partition problem to it. In this problem we
are given a set A of 3n items A = {1, . . . , 3n} with associated sizes $a_1, \ldots, a_{3n} \in \mathbb{N}$
and a bound $B \in \mathbb{N}$, with $B/4 < a_i < B/2$ for each i, and $\sum_{i=1}^{3n} a_i = nB$,
and we must decide whether A can be partitioned into n disjoint sets $I_0, \ldots, I_{n-1}$
such that $\sum_{i \in I_j} a_i = B$, for j = 0, . . . , n − 1. Note that due to the bounds for
the item sizes $a_i$, all sets $I_j$ must have cardinality 3.
We now describe how to construct the point sets S and T on the line (see the
left part of Figure 1). The point set S consists of 3n blobs of points $S_1, \ldots, S_{3n}$,
where blob $S_i$ has $a_i$ points. Points in a blob are distributed regularly along
the line with distance $x := 1/\sqrt{\delta \cdot B}$ from one point to the next. The blobs
themselves are also distributed regularly along the line with distance 1 from
one to the next, i.e. two neighboring points in different blobs have distance 1.
The point set T is very similar: it consists of n blobs $T_1, \ldots, T_n$ of size B. Here
the distance of neighboring points within a blob is $1/\sqrt{B}$ and if they are in
different blobs it is again 1. Finally, we add two points to both S and T, far
away from the blobs. Their distance in T is 1 and their distance in S is $\sqrt{\delta}$
(clearly the points can be added such that they need to be mapped onto each
other in order to obtain distortion ≤ δ). This ensures that
$\mathrm{expansion}(f^{-1}) = \max_{x,y \in S} \frac{d(x,y)}{d(f(x),f(y))} \ge \sqrt{\delta}$.
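The construction can be written down as a short Python sketch. The function names are ours, and the position of the two extra points is controlled by a hypothetical parameter `far_offset`, since the text only requires them to be "far away" from the blobs.

```python
import math


def blob_points(start, count, gap):
    """count points on the line, starting at start, with the given spacing."""
    return [start + t * gap for t in range(count)]


def build_point_sets(sizes, B, delta, far_offset=1000.0):
    """Point sets S and T for a 3-partition instance with item sizes and bound B."""
    n = len(sizes) // 3
    S, pos = [], 0.0
    for a in sizes:
        blob = blob_points(pos, a, 1.0 / math.sqrt(delta * B))
        S.extend(blob)
        pos = blob[-1] + 1.0  # neighboring blobs are at distance 1
    T, pos = [], 0.0
    for _ in range(n):
        blob = blob_points(pos, B, 1.0 / math.sqrt(B))
        T.extend(blob)
        pos = blob[-1] + 1.0
    # Two extra points: distance sqrt(delta) in S and distance 1 in T.
    far = max(S[-1], T[-1]) + far_offset  # "far away" (assumed offset)
    S += [far, far + math.sqrt(delta)]
    T += [far, far + 1.0]
    return S, T
```

For the instance with sizes 3, 3, 4, 3, 3, 4 and B = 10, both point sets contain nB + 2 = 22 points.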

[Figure 1. S: blobs $S_1, S_2, \ldots, S_{3n}$ with $a_1, a_2, \ldots, a_{3n}$ points at spacing $x := 1/\sqrt{\delta B}$, neighboring blobs at distance 1. T: blobs $T_1, T_2, \ldots, T_n$ with B points each at spacing $1/\sqrt{B}$, neighboring blobs at distance 1. Right panel: blobs $S_{j_1}, S_{j_2}, S_{j_3}$ mapped to blob $T_j$.]
Fig. 1. Left: the point sets S and T on a line constructed from a given instance of
3-partition. Right: the mapping of three blobs in S to one blob in T.
Our aim is to show how to derive a low distortion mapping of points from S
to T given a solution of the 3-partition instance and vice versa, assuming that
$\delta \ge c \cdot n^2 \cdot B$, for an appropriately chosen constant c > 1. We start with the
forward direction. The mapping is straightforward: we map blobs $S_{j_1}, S_{j_2}, S_{j_3}$
to blob $T_j$ (simply one blob next to the other, as shown in Figure 1 on the
right), if $I_j = \{j_1, j_2, j_3\}$. In order to see that the distortion δ is not violated
we check the largest possible changes in distance (relatively speaking) and show
that $\mathrm{expansion}(f) \le \sqrt{\delta}$ and $\mathrm{expansion}(f^{-1}) \le \sqrt{\delta}$ hold, leading to three cases:
1. Two neighboring blobs $S_a$ and $S_b$ are spread as far apart as possible. The
distance between two points hereby increases from 1 to less than
$$n \cdot \left(B \cdot \frac{1}{\sqrt{B}} + 1\right) = n \cdot (\sqrt{B} + 1) \le \sqrt{\delta},$$
under the assumption $\delta \ge c \cdot n^2 \cdot B$ with some constant c, which is chosen
large enough.
2. The leftmost and the rightmost blob in S, i.e. $S_1$ and $S_{3n}$, are mapped next
to each other into the same blob in T. The distance decreases from less than
$n \cdot (B/\sqrt{\delta \cdot B} + 3)$ to $1/\sqrt{B}$. This leads to an upper bound for the relative
distance change of
$$\frac{n \cdot \left(\sqrt{B/\delta} + 3\right)}{1/\sqrt{B}} \le n\sqrt{B}\left(\frac{1}{\sqrt{c} \cdot n} + 3\right) \le \sqrt{\delta},$$
again with the assumption $\delta \ge c \cdot n^2 \cdot B$ and some appropriately chosen
constant c (which clearly exists).
3. What about the distance increase for two points in the same blob? All distances
of points within a blob increase by exactly $(1/\sqrt{B})/(1/\sqrt{\delta \cdot B}) = \sqrt{\delta}$.

For the other direction we only need to check that a blob Si cannot be split
up and mapped to two or more different blobs in T .
4. Two neighboring points in a blob in S cannot be mapped to two different
blobs in T. The relative increase in distance would be $1 / \frac{1}{\sqrt{\delta \cdot B}} > \sqrt{\delta}$. Together
with the two “extra” points that force $\mathrm{expansion}(f^{-1}) \ge \sqrt{\delta}$ this would yield
a distortion > δ.
This ensures that in a low distortion mapping from S to T always exactly
three blobs from S will be mapped to one blob from T , as wanted.
In order to obtain the nice lower bound for δ we now add a huge blob to
both of them which is far away from all other blobs and whose points are very
close to each other. Clearly this can be done in such a manner that all points
in this huge blob in S will be mapped directly to the corresponding huge blob
in T ; without interfering with the mapping of all other points. We started with
|S| = n · B points and increase the number of points in the huge blob until
$|S|^{\varepsilon} > c \cdot n^2 \cdot B$. Since ε is a constant the resulting input size will still be
bounded by a polynomial. ♦
2.2 Inapproximability
We now describe how to modify the proof in order to obtain the strong inapproximability
results for large δ. Let $L_T := \frac{\max_{x,y \in T} d(x,y)}{\min_{x,y \in T} d(x,y)}$ be the ratio of maximum
to minimum distance in T.
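Theorem 2. For any $\varepsilon, \varepsilon' > 0$, the 1-dimensional distortion problem for
$|S| \ge \delta \ge |S|^{\varepsilon}$ is inapproximable within a factor of $\delta^{1-\varepsilon'}$, for $\delta \ge |S|$ is
inapproximable within a factor of $\sqrt{\delta} \cdot |S|^{\frac{1}{2}-\varepsilon'}$, and for $\delta \ge |S|^{1+\varepsilon}$ is
inapproximable within a factor of $L_T^{1-\varepsilon'}$, unless P = NP.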
Proof. First of all we introduce a new distance g to be defined presently and
replace the inter-blob distances (the distance of 1 from one blob to the next)
in T by this distance g.
From the list of cases in the proof of Theorem 1, case 2. and case 3. remain
untouched by this change. For case 1. (where the “1” in the expression is now
a “g”) we choose $c$ in our assumption $\delta \ge c \cdot n^2 \cdot B$ such that $\sqrt{\delta}/2 \ge n\sqrt{B}$ and
we choose $g$ such that $\sqrt{\delta}/2 \ge n \cdot g$. Choosing $g := \sqrt{\delta}/(2n)$ will be fine.
For case 4. the relative increase in distance between two neighboring points in
a blob in $S$ would be at least $y := g / \frac{1}{\sqrt{\delta \cdot B}}$ if they were split onto two blobs in $T$.
This would amount to a distortion of $y \cdot \sqrt{\delta}$ (note that $\mathrm{expansion}(f^{-1}) \ge \sqrt{\delta}$).
The optimum solution is only “allowed” a distortion of $\delta$. Thus, unless P = NP,
there can be no approximation algorithm with ratio better than
\[ \frac{y \cdot \sqrt{\delta}}{\delta} = \frac{g \cdot \sqrt{\delta \cdot B} \cdot \sqrt{\delta}}{\delta} = \frac{\sqrt{\delta} \cdot \sqrt{B}}{2n} \ge \sqrt{\delta} \cdot B^{\frac{1}{2}(1-\varepsilon')}, \tag{1} \]

with our choice of $g := \sqrt{\delta}/(2n)$ above. For the inequality we make the assumption
of $B \ge n^{4/\varepsilon'}$, which yields $\frac{B}{4n^2} \ge \frac{B}{4B^{\varepsilon'/2}} = \frac{1}{4} B^{1-\varepsilon'/2} \ge B^{1-\varepsilon'}$. Making this
assumption poses no problem, since $B$ can easily be increased (i.e. by “blowing
it up” similarly to what we do with $S$), if this should not hold. Let us consider
the first statement in the theorem.
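$\delta \ge |S|^{\varepsilon}$, $\delta \le |S|$. We blow up $S$ and $T$ again, as in the proof of Theorem 1, but
we proceed more carefully. Before adding any points to the huge blob, we know
that $\delta \le |S| \le c \cdot n^2 \cdot B$ (the first by assumption, the second since $|S| = n \cdot B$ holds
in the beginning). Now we increase the blob and thereby $\delta$ (since by assumption
$\delta \ge |S|^{\varepsilon}$) until we have equality $\delta = c' \cdot n^2 \cdot B$ for some appropriately chosen
constant $c' \ge c$. We obtain $\delta \le 2 \cdot c' \cdot B^{\varepsilon'/2} \cdot B \le B^{1+\varepsilon'}$ and with (1) thus
\[ \frac{y \cdot \sqrt{\delta}}{\delta} \ge \sqrt{\delta} \cdot B^{\frac{1}{2}(1-\varepsilon')} \ge \sqrt{\delta} \cdot \delta^{\frac{1}{2} \cdot \frac{1-\varepsilon'}{1+\varepsilon'}} \ge \delta^{1-\varepsilon'}. \]
This gives the first bound for the approximation ratio.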
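$\delta \ge |S|$. For the second bound in the theorem we lower bound $B$ in terms
of $|S|$. To obtain a good bound we will blow up the extra blob of $S$ and $T$
only slightly: we increase the blob until $|S| = B^{1+\varepsilon'}$, which is enough to ensure
$\delta \ge B^{1+\varepsilon'} \ge c \cdot B^{1+\varepsilon'/2} \ge c \cdot n^2 \cdot B$ as needed. Note that we assumed $B \ge n^{4/\varepsilon'}$,
as before. We insert $|S| = B^{1+\varepsilon'}$ into (1):
\[ \frac{y \cdot \sqrt{\delta}}{\delta} \ge \sqrt{\delta} \cdot |S|^{\frac{1}{2} \cdot \frac{1-\varepsilon'}{1+\varepsilon'}} \ge \sqrt{\delta} \cdot |S|^{\frac{1}{2}-\varepsilon'}. \]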
$\delta \ge |S|^{1+\varepsilon}$. For the third statement in the theorem we will again search for a
lower bound of $B$ in (1), but now by an expression in $L_T$. In this case we will not
blow up $S$ and $T$ with an extra blob, but instead stick to the original
construction. The two “extra” points which were added to ensure $\mathrm{expansion}(f^{-1}) \ge \sqrt{\delta}$
can be added at distance 1 from the other blobs in $S$ and at distance $g$ from
the other blobs in $T$. Clearly, if a blob $S_i$ is split apart and partly mapped onto
these two points, this again yields the distortion given in (1).
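For the maximum ratio of distances we have for the set $T$:
\[ L_T \le \frac{1 + g + n \cdot \left( \frac{B}{\sqrt{B}} + g \right)}{\frac{1}{\sqrt{B}}} = \left(1 + \frac{\sqrt{\delta}}{2n}\right) \sqrt{B} + n \cdot B + \frac{\sqrt{\delta \cdot B}}{2} \le c'' \cdot \sqrt{\delta \cdot B} \cdot n \]
holds, with $g := \sqrt{\delta}/(2n)$, our assumption $\delta \ge c \cdot n^2 \cdot B$, and an appropriate
constant $c''$. Note that given $L_T$ we need to adjust the minimum size of the input
($|S|$) accordingly.
We replace the assumption made above for $B$’s size by $B \ge n^{\max\{8/\varepsilon',\, 1/\varepsilon\}}$.
The second term in the “max” expression ensures that
$c \cdot n^2 \cdot B \le c \cdot n \cdot B^{\varepsilon} \cdot B \le (n \cdot B)^{1+\varepsilon} \le |S|^{1+\varepsilon}$. Since $\delta \ge |S|^{1+\varepsilon}$ holds, we obtain
$\delta \ge c \cdot n^2 \cdot B$ as needed. With help of the first term in the “max” expression we
obtain for (1): $\frac{y \cdot \sqrt{\delta}}{\delta} \ge \sqrt{\delta} \cdot B^{\frac{1}{2}(1-\varepsilon'/2)}$. We plug in
$\frac{L_T}{\sqrt{\delta}} \le c'' \cdot \sqrt{B} \cdot n \le c'' \cdot \sqrt{B} \cdot B^{\varepsilon'/8} \le B^{\frac{1}{2}(1+\varepsilon'/2)}$:
\[ \sqrt{\delta} \cdot B^{\frac{1}{2}(1-\varepsilon'/2)} \ge \sqrt{\delta} \cdot \left( \frac{L_T}{\sqrt{\delta}} \right)^{\frac{1-\varepsilon'/2}{1+\varepsilon'/2}} \ge (L_T)^{\frac{1-\varepsilon'/2}{1+\varepsilon'/2}} \ge L_T^{1-\varepsilon'}. \]
This concludes the proof. ♦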
In connection to the last bound, it is interesting to note that any embedding
$f : S \to T$ has a distortion of at most $L_T^2 \cdot \delta$, where $\delta$ is the optimal distortion,
even if $(S, d_S)$ and $(T, d_T)$ are arbitrary metric spaces. A short proof can be
found in the full version of the paper.
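One possible argument for this bound is sketched below; it is not necessarily the proof from the full version. It assumes the usual definition $\mathrm{distortion}(f) = \mathrm{expansion}(f) \cdot \mathrm{expansion}(f^{-1})$ with $\mathrm{expansion}(f) = \max_{x \ne y} d_T(f(x), f(y)) / d_S(x, y)$, and $L_S$ is notation introduced here for the ratio of maximum to minimum distance in $S$.

```latex
% Sketch under the stated assumptions; f is an arbitrary bijection,
% f^* an optimal one with distortion \delta.
\begin{align*}
\mathrm{expansion}(f) &\le \frac{\max d_T}{\min d_S}, \qquad
\mathrm{expansion}(f^{-1}) \le \frac{\max d_S}{\min d_T}, \\
\mathrm{distortion}(f) &\le \frac{\max d_T}{\min d_T} \cdot \frac{\max d_S}{\min d_S} = L_T \cdot L_S, \\
L_S &\le \mathrm{expansion}(f^{*}) \cdot \mathrm{expansion}(f^{*-1}) \cdot L_T = \delta \cdot L_T, \\
\mathrm{distortion}(f) &\le L_T^2 \cdot \delta .
\end{align*}
```

The third line follows by applying $f^*$ to the pairs attaining the maximum and the minimum distance in $S$.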
2.3 Higher Dimensions
For the 3-dimensional problem we can show the following explicit dependence of
the inapproximability ratio on the distortion.
Theorem 3. For any fixed $\delta > 1$ it is NP-hard to distinguish whether two given
3-dimensional point sets $S$ and $T$ have distortion $\le \delta$ or $\ge \sqrt{9\delta^2 - 8}$.
Proof. By a more detailed analysis of a slight modification of the construction
in [2] which we omit here. ♦
Notice that, in view of the previous theorem, this result is relevant when
the distortion is small. Finally, when the dimension is unbounded (that is, for
general finite metrics), the reduction of the previous subsection can be adapted
to establish that the distortion is even harder to approximate:
Theorem 4. For any fixed $\delta > 1$ it is NP-hard to distinguish whether two given
finite metrics $S$ and $T$ have distortion $\le \delta$ or $\ge \delta^2$.
Proof. (Sketch.) Repeat the construction with all points in the same blob having
distance of 1 from each other, while points belonging to different blobs are at
distance δ. This holds for both S and T . If a 3-partition exists, then distances
of δ are shrunk to 1, but no distances are dilated, and so the distortion is δ. If
a 3-partition does not exist, then certain distances are both shrunk and dilated
by $\delta$, and so the distortion is $\delta^2$. ♦
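The two outcomes in the sketch can be checked on a toy instance. The code below is only an illustration, assuming that distortion is defined as expansion(f) · expansion(f⁻¹); the helper names `blob_metric` and `distortion` are hypothetical and not from the paper.

```python
from itertools import combinations

def blob_metric(blob_of, delta):
    """Distance 1 inside a blob, delta between different blobs."""
    return lambda a, b: 1.0 if blob_of[a] == blob_of[b] else float(delta)

def distortion(points, d_S, d_T, f):
    """expansion(f) * expansion(f^{-1}) for a bijection f given as a dict."""
    pairs = list(combinations(points, 2))
    expansion = max(d_T(f[a], f[b]) / d_S(a, b) for a, b in pairs)
    expansion_inv = max(d_S(a, b) / d_T(f[a], f[b]) for a, b in pairs)
    return expansion * expansion_inv

# S has blobs {0}, {1}, {2, 3}; T has blobs {0, 1}, {2, 3}; delta = 3.
d_S = blob_metric([0, 1, 2, 2], 3)
d_T = blob_metric([0, 0, 1, 1], 3)
whole_blobs = {0: 0, 1: 1, 2: 2, 3: 3}   # every S-blob stays inside one T-blob
split_blob = {0: 0, 1: 2, 2: 1, 3: 3}    # the S-blob {2, 3} is torn apart
```

Mapping whole blobs only shrinks distances, whereas tearing a blob apart both dilates and shrinks, which squares the distortion.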
3 Additive Distortion
For the 1-dimensional additive distortion problem we will show that the simplest
possible strategy already yields a 2-approximation. The Sweep-or-Flip
algorithm either maps the points in $S := \{s_1, \dots, s_m\} \subseteq \mathbb{R}$ from left to right onto
$T := \{t_1, \dots, t_m\} \subseteq \mathbb{R}$, or flips the point set and maps them from right to left.
In other words, we check the bijections $f(s_i) = t_i$ and $f(s_i) = t_{m-i+1}$ and keep
the better one. It is easy to see (Figure 2) that this is not optimal. In fact, by
choosing $a + \mu = \Delta/2$ in the figure we get a gap of $\frac{a+\Delta-\mu}{\Delta} = 1.5 - 2\mu/\Delta$ when
comparing the optimal to the Sweep-or-Flip embedding, for arbitrarily small
$\mu > 0$.

Fig. 2. An example showing that the Sweep-or-Flip strategy does not necessarily
yield an optimal distortion. The embedding has additive distortion $\Delta$ (note that
$2(a + \mu) \le \Delta$). The “left-to-right” embedding has a larger distortion of $a + \Delta - \mu$.
Clearly the other direction has a larger distortion as well.
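The Sweep-or-Flip strategy can be sketched in a few lines. This is a minimal illustration, assuming the additive distortion of a bijection is the largest absolute difference between a distance in S and the corresponding distance in T; the function names are hypothetical.

```python
from itertools import combinations

def additive_distortion(src, dst):
    """Largest | |src_i - src_j| - |dst_i - dst_j| |, where src[i] is mapped to dst[i]."""
    return max(
        (abs(abs(src[i] - src[j]) - abs(dst[i] - dst[j]))
         for i, j in combinations(range(len(src)), 2)),
        default=0,
    )

def sweep_or_flip(S, T):
    """Return (distortion, mapping) for the better of the two monotone bijections."""
    s = sorted(S)
    t = sorted(T)
    left_to_right = t          # f(s_i) = t_i
    right_to_left = t[::-1]    # f(s_i) = t_{m-i+1}
    best = min((left_to_right, right_to_left),
               key=lambda image: additive_distortion(s, image))
    return additive_distortion(s, best), list(zip(s, best))
```

For S = [0, 1, 3] and T = [10, 12, 13] the flipped mapping reproduces all distances exactly, while the left-to-right mapping has additive distortion 1.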
For the setting where we are given an arbitrary metric space (S, dS) and want
to embed onto points T := {t1, . . . , tm} ⊆ R we will present a straightforward
5 + ε-approximation algorithm.
3.1 A 2-Approximation
Before proving that Sweep-or-Flip is a 2-approximation, we give two defini-
tions:
Crossing Points. Consider two points $x, y \in S$ with $x < y$ and for which their
counterparts are $f(x) > f(y)$. We say $x$ and $y$ cross in the mapping $f$.
Relative Movement. For fixed $S$ and $T$, define the relative movement of the $i$-th
point to be $\mu_i := t_i - s_i$.
Theorem 5. The Sweep-or-Flip algorithm yields an additive distortion at
most 2 · Δ, where Δ is the optimum additive distortion.
Proof. The proof idea is to fix an optimal embedding $f^*$ and to consider different
cases for the relative movements $\mu_i$. We then either show that a distortion of $2 \cdot \Delta$
can be obtained by a “left-to-right” or a “right-to-left” embedding, or arrive at
a contradiction by showing that $f^*$ has distortion $> \Delta$. With each of the four
steps of the following case analysis we narrow down the situations we need to
consider, in terms of the actual relative movements, and also in terms of the
mapping of the first and last point in $f^*$. We start with the latter:
$f^*(s_1) > f^*(s_m)$, i.e. $s_1$ and $s_m$ cross: If this is the case, we can completely
flip $T$, e.g. by negating all elements of the set, without affecting the
performance of Sweep-or-Flip. Thus we can assume $f^*(s_1) < f^*(s_m)$.

$|\mu_i| \le \Delta$, for all $i \in \{1, \dots, m\}$: Clearly, if we embed left-to-right, $f(s_i) = t_i$,
the obtained additive distortion is bounded by $2 \cdot \Delta$. To see this take
any two points $s_i, s_j \in S$ and note that due to the bounded movement the
distance of $t_i$ and $t_j$ can differ by at most $2 \cdot \Delta$ from the distance of $s_i$ and $s_j$.
$\forall i, j \in \{1, \dots, m\} : |\mu_i - \mu_j| \le 2 \cdot \Delta$: Let $\mu_i$ be the largest relative
movement, and by the previous case assume $\mu_i > \Delta$. Then translate all points
in $T$ to the left, until $\mu_i = \Delta$. For all $j \ne i$ and the new relative movements
we still have $|\mu_i - \mu_j| \le 2 \cdot \Delta$. Thus we have $\mu_j \ge -\Delta$ and $\mu_j \le \mu_i = \Delta$, the
former since $\mu_i = \Delta$ and the latter since $\mu_i$ is the largest relative movement.
In other words, we modified the instance such that the previous case holds. If
in the beginning the smallest relative movement is less than $-\Delta$, we proceed
analogously translating all points in $T$ to the right.
$\exists i, j \in \{1, \dots, m\} : |\mu_i - \mu_j| > 2 \cdot \Delta$: Assume $i < j$ and $\mu_i > \mu_j$,
otherwise exchange the roles of $S$ and $T$ (the relative movements are negated).
Translate all points in $T$ in order to have $\mu_i = \Delta$ and $\mu_j < -\Delta$. Due to a
simple counting argument, there must be $k \le i$ with $f^*(s_k) \ge t_i$ and thus
$f^*(s_k) - s_k \ge \mu_i$. Analogously there must be an $l \ge j$ with $f^*(s_l) \le t_j$ and
thus $f^*(s_l) - s_l \le \mu_j$. We distinguish the following cases:
$s_k$ and $s_l$ do not cross: See the top picture of Figure 3 for an example.
We have $d(s_k, s_l) = s_l - s_k \ge s_j - s_i$ and $d(f^*(s_k), f^*(s_l)) \le t_j - t_i$. This
gives a contradiction: $d(s_k, s_l) - d(f^*(s_k), f^*(s_l)) \ge \mu_i - \mu_j > 2 \cdot \Delta$.
$s_k$ and $s_l$ cross while $s_m$ and $s_k$ do not: See the middle part of
Figure 3. Since $f^*$ projects $s_k$ by at least $\mu_i = \Delta$ to the right, we must
have $f^*(s_m) \ge s_m$, otherwise the distance $d(f^*(s_k), f^*(s_m))$ would be
less than $d(s_k, s_m) - \Delta$. Similarly, since $f^*$ projects $s_l$ by more than $\Delta$
to the left ($\mu_j < -\Delta$), we must have $f^*(s_m) < s_m$, which gives the contradiction.
sk and sl cross while s1 and sl do not: Same as the previous case.
$k > 1$ and $l < m$, $s_k$ and $s_l$ cross, $s_1$ crosses $s_l$, $s_m$ crosses $s_k$: See
the bottom part of Figure 3. Let us start by making sure that this is the
only case left. Due to the very first case we know that $s_1$ and $s_m$ do not
cross. Since $s_k$ and $s_l$ do cross, either $k > 1$ or $l < m$ must hold. Assume
the former, then since $s_1$ crosses $s_l$ (which it must due to the previous
case) we also have the latter (and again due to the last but one case, $s_m$
must cross $s_k$).
Now we consider the distances $a$, $b$, $c$, $d$, $e$, and $f$ as given in Figure 3,
bottom. Since we have an additive distortion of $\Delta$,
\[ d + e + f \le b + \Delta \tag{2} \]
must hold. Similarly we have $f \ge b + c - \Delta$, $d \ge a + b - \Delta$, and
$e \ge a + b + c - \Delta$, which together gives
\[ d + e + f \ge 2a + 3b + 2c - 3\Delta. \]
Subtracting (2) we obtain the contradiction
\[ 0 \ge 2a + 2b + 2c - 4\Delta \ge 2b - 4\Delta > 0. \]
The last inequality holds since $b \ge \mu_i - \mu_j > 2 \cdot \Delta$. We conclude that
there cannot be $i, j \in \{1, \dots, m\}$ with $|\mu_i - \mu_j| > 2 \cdot \Delta$.
This gives the stated result. ♦
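For the case in which all relative movements are bounded by $\Delta$, the $2 \cdot \Delta$ bound of the left-to-right embedding can be spelled out as follows. This is a worked step under the assumption that both $S$ and $T$ are indexed in increasing order, so that for $i < j$ the two distances are $s_j - s_i$ and $t_j - t_i$.

```latex
\begin{align*}
\bigl| (t_j - t_i) - (s_j - s_i) \bigr|
  = \bigl| (t_j - s_j) - (t_i - s_i) \bigr|
  = |\mu_j - \mu_i|
  \le |\mu_j| + |\mu_i|
  \le 2 \cdot \Delta .
\end{align*}
```

The same identity shows why only the differences $\mu_i - \mu_j$ matter: translating $T$ shifts every $\mu_i$ by the same amount and leaves the distortion unchanged.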
Fig. 3. Three examples for the cases concerning that $i, j \in \{1, \dots, m\}$ exist such that
$\mu_i = \Delta$ and $\mu_j < -\Delta$. Top: $s_k$ and $s_l$ do not cross. Middle: $s_k$ and $s_l$ cross and
$s_m$ does not cross with $s_k$. Bottom: $k > 1$ and $l < m$, $s_k$ and $s_l$ cross, $s_1$ crosses $s_l$,
$s_m$ crosses $s_k$ (with the distances $a$, $b$, $c$, $d$, $e$, $f$ marked).
3.2 Embedding an Arbitrary Metric Space to 1D
We are given an arbitrary finite metric space (S, dS) and T ⊆ R. The algorithm
Intervals below finds a mapping of the points in S to the points in T within
a factor of 5 of the optimum additive distortion. For ease of exposition we start
by assuming that we know the optimum distortion Δ. Below we note why the
algorithm also works for the case where the distortion is not given. We also
assume that we know the point x ∈ S mapped to t1 – in fact, we iterate over all
x ∈ S.
Feasible Intervals. For $y \in S \setminus \{x\}$ we then define its feasible interval as:
$I_y := [t_1 + d_S(x, y) - \Delta,\ t_1 + d_S(x, y) + \Delta]$. The additive distortion for the pair $x, y$
is $\le \Delta$ if and only if $f(y) \in I_y$.

Algorithm: Intervals
1. Given $\Delta > 0$ and $x \in S$, compute the feasible intervals $I_y$ for $y \ne x$, and
map the remaining nodes of $S$ as follows:
2. Process the feasible intervals by increasing left boundary. For each $I_y$ map
$y$ greedily to the leftmost point $z$ in $I_y$ that has not yet been mapped to.
Theorem 6. The Intervals algorithm yields an additive distortion of at most $5 \cdot \Delta$,
where $\Delta$ is the optimum additive distortion.
Proof. Consider the mapping $f^*$ that achieves $\Delta$, and assume $f^*(x) = t_1$. Since
there is a bijection that maps each $y \ne x$ to a point $z \in I_y$ ($f^*$ is an example
of such a bijection), and since all intervals have the same length, it is clear that
the algorithm will find such a bijection, call it $f$. Compared to $f^*$, the mapping $f$
moves each point by at most $2 \cdot \Delta$, the width of its interval, and therefore changes
the distance between any point pair by at most $4 \cdot \Delta$. Since $f^*$ has additive
distortion $\Delta$, the additive distortion of $f$ is at most $5 \cdot \Delta$, i.e. within a factor of 5
of the optimum additive distortion. ♦
If Intervals succeeds in finding a mapping, it will simply do a left to right
mapping of the remaining points $S \setminus \{x\}$ in step 2. Therefore, by iterating over
all $x \in S$ and checking for each $x$ the left to right mapping of the rest, we find
the same mapping as Intervals without knowing $\Delta$ in advance.
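The greedy step of Intervals, for one fixed anchor x and one guessed value of Δ, might look as follows. This is a sketch only: the function name, the matrix representation of d_S, and the use of a plain list scan for the leftmost free target are illustrative assumptions.

```python
def intervals_mapping(d_S, T, x, delta):
    """Greedy Intervals step for a fixed anchor x and a guessed distortion delta.

    d_S: symmetric distance matrix (list of lists) on S = {0, ..., m-1}.
    T:   list of m reals. Returns a dict point -> target, or None on failure.
    """
    t = sorted(T)
    t1 = t[0]
    mapping = {x: t1}                 # x is mapped to the leftmost target t_1
    free = t[1:]                      # unused targets, kept in sorted order
    others = [y for y in range(len(d_S)) if y != x]
    # Process feasible intervals I_y = [t1 + d(x,y) - delta, t1 + d(x,y) + delta]
    # by increasing left boundary.
    others.sort(key=lambda y: t1 + d_S[x][y] - delta)
    for y in others:
        lo = t1 + d_S[x][y] - delta
        hi = t1 + d_S[x][y] + delta
        pick = next((z for z in free if lo <= z <= hi), None)  # leftmost free target in I_y
        if pick is None:
            return None               # no feasible target left for y
        mapping[y] = pick
        free.remove(pick)
    return mapping
```

A caller would loop over all x in S, as described above, and keep the best resulting mapping.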
4 Open Problems
We have made significant progress towards understanding the complexity of com-
puting the distortion of bijections between point sets. But many open problems
remain:
– What is the complexity of the distortion problem on the line for large
constant values of distortion? The dynamic programming approach seems to
exhaust itself after $3 + 2\sqrt{2}$, yet NP-completeness also seems very difficult.
– In view of our results, it seems that, for large distortions, the right quantity
to approximate is not δ but log δ. By Theorem 2 we know that it cannot be
approximated by a factor better than 2. Is a constant factor possible? Or is
there a generalization of our proof (by some kind of hierarchical 3-partition
problem) that shows impossibility?
– In connection to the last open problem, we may want to define the following
relaxation of the distortion problem: We are given, besides $S$, $T$, and $\delta$, an
$\varepsilon > 0$, and we are asked whether there is a bijection between all but an $\varepsilon$
fraction of $S$ and $T$ such that the distortion of this partial map is $\delta$ or less. We
conjecture that for any $\varepsilon$ there is a polynomial algorithm that approximates
$\log \delta$ by a factor of 2.
– Are there better approximation algorithms for the 1-dimensional additive
distortion problem? And can one prove the 1-dimensional problem to be
NP-complete?

References
1. Kenyon, C., Rabani, Y., Sinclair, A.: Low distortion maps between point sets. In:
Proceedings of the 36th STOC. (2004) 272–280
2. Papadimitriou, C., Safra, S.: The complexity of low-distortion embeddings between
point sets. In: Proceedings of the 16th SODA. (2005) 112–118
3. Indyk, P.: Algorithmic applications of low-distortion geometric embeddings. In:
Tutorial at the 42nd FOCS. (2001) 10–33
4. Linial, N.: Finite metric spaces – combinatorics, geometry and algorithms. In:
Proceedings of the International Congress of Mathematicians III, Beijing (2002)
573–586
5. Matousek, J.: Lectures on Discrete Geometry. Springer-Verlag, Graduate Texts in
Mathematics, Vol. 212 (2002)
6. Web-page of the working group on multi-dimensional scaling.
http://dimacs.rutgers.edu/SpecialYears/2001_Data/Algorithms/AlgorithmsMS.htm (2005)
7. Linial, N., London, E., Rabinovich, Y.: The geometry of graphs and some of its
algorithmic applications. Combinatorica 15 (1995) 215–245
8. Bourgain, J.: On Lipschitz embedding of finite metric spaces into Hilbert space.
Israel Journal of Mathematics 52 (1985) 46–52
9. Kleinberg, J., Slivkins, A., Wexler, T.: Triangulation and embedding using small
sets of beacons. In: Proceedings of the 45th FOCS. (2004) 444–453
10. Slivkins, A.: Distributed approaches to triangulation and embedding. In: Proceed-
ings of the 16th SODA. (2005) 640–649
11. Bădoiu, M., Dhamdhere, K., Gupta, A., Rabinovich, Y., Räcke, H., Ravi, R.,
Sidiropoulos, A.: Approximation algorithms for low-distortion embeddings into
low-dimensional spaces. In: Proceedings of the 16th SODA. (2005) 119–128
12. Håstad, J., Ivansson, L., Lagergren, J.: Fitting points on the real line and its
application to RH mapping. Lecture Notes in Computer Science 1461 (1998) 465–
467
13. Bădoiu, M.: Approximation algorithm for embedding metrics into a two-
dimensional space. In: Proceedings of the 14th SODA. (2003) 434–443
14. Bădoiu, M., Indyk, P., Rabinovich, Y.: Approximate algorithms for embedding
metrics into low-dimensional spaces. Manuscript (2003)
15. Feige, U.: Approximating the bandwidth via volume respecting embeddings. Jour-
nal of Computer and System Sciences 60 (2000) 510–539
16. Akutsu, T., Kanaya, K., Ohyama, A., Fujiyama, A.: Point matching under non-
uniform distortions. Discrete Applied Mathematics 127 (2003) 5–21

Approximating the Best-Fit Tree Under $L_p$ Norms
Boulos Harb, Sampath Kannan, and Andrew McGregor
Department of Computer and Information Science, University of Pennsylvania
Philadelphia, PA 19104, USA
{boulos,andrewm,kannan}@cis.upenn.edu
Abstract. We consider the problem of fitting an $n \times n$ distance matrix
$M$ by a tree metric $T$. We give a factor $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$
approximation algorithm for finding the closest ultrametric $T$ under the
$L_p$ norm, i.e. $T$ minimizes $\|T, M\|_p$. Here, $k$ is the number of distinct
distances in $M$. Combined with the results of [1], our algorithms imply
the same factor approximation for finding the closest tree metric under
the same norm. In [1], Agarwala et al. present the first approximation
algorithm for this problem under $L_\infty$. Ma et al. [2] present approximation
algorithms under the $L_p$ norm when the original distances are not
allowed to contract and the output is an ultrametric. This paper presents
the first algorithms with performance guarantees under $L_p$ ($p < \infty$) in
the general setting.
We also consider the problem of finding an ultrametric $T$ that minimizes
$L_{relative}$: the sum of the factors by which each input distance is stretched.
For the latter problem, we give a factor $O(\log^2 n)$ approximation.
1 Introduction
An evolutionary tree for a species set S is a rooted tree in which the leaves
represent the species in S, and the internal nodes represent ancestors. The goal
of reconstructing the evolutionary tree is of fundamental scientific importance.
Given the increasing availability of molecular sequence data for a diverse set of
organisms and our understanding of evolution as a stochastic process, the nat-
ural formulation of the tree reconstruction problem is as a maximum likelihood
problem – estimate parameters of the evolutionary process that are most likely
to have generated the observed sequence data. Here, the parameters include not
only rates of mutation on each branch of the tree, but also the topology of the
tree itself. It is assumed (although this assumption is not always easy to meet)
that the sequences observed at the leaves have been multiply aligned so that each
position in a sequence has corresponding positions in the other sequences. It is
also assumed for tractability, that each position evolves according to an inde-
pendent identically distributed process. Even with these assumptions, estimating
the most likely tree is a computationally difficult problem.

This work was supported by NIH Training Grant T32HG00 46.

This work was supported by NSF CCR98-20885 and NSF CCR01-05337.

This work was supported by NSF ITR 0205456.
Recently, approximately most likely trees have been found for simple stochas-
tic processes using distance-based methods as subroutines [3, 4].
For a distance-based method the input is an $n \times n$ distance matrix $M$ where
$M[i, j]$ is the observed distance between species $i$ and $j$. Given such a matrix,
the objective is to find an edge-weighted tree $T$ with leaves labeled 1 through $n$
which minimizes the $L_p$ distance from $M$ where various choices of $p$ correspond
to various norms. The tree $T$ is said to fit $M$. When it is possible to define
$T$ so that $\|T, M\|_p = 0$, then the distance matrix is said to be additive. An
$O(n^2)$ time algorithm for reconstructing trees from additive distances was given
by Waterman et al. [5], who proved in addition that at most one tree can exist.
However, real data is rarely additive and we need to solve the norm minimization
problem above to find the best tree. Day [6] showed that the problem is NP-hard
for $p = 1, 2$.
For the case of $p = \infty$, referred to as the $L_\infty$ norm, [7] showed how the optimal
ultrametric tree could be found efficiently and [1] showed how this could be used
to find a tree $T$ (not necessarily ultrametric) such that $\|T, M\|_p \le 3\|T_{opt}, M\|_p$
where $T_{opt}$ is the optimal tree. The algorithm of [1] is the one that is used in
[3] and [4] for approximate maximum likelihood reconstruction.
In this paper we explore approximation algorithms under other norms such
as L1 and L2. We also consider a variant, Lrelative, of the best-fit objective
mentioned above where we seek to minimize the sum of the factors by which
each input distance is stretched. The study of L1 and L2 norms is motivated
by the fact that these are often better measures of fit than L∞ and the idea
that using these methods as subroutines may yield better maximum likelihood
algorithms.
1.1 Our Results
We prove the following results:
- We can find an ultrametric tree whose $L_p$-error is within a factor of
$O(\min\{n^{1/p}, (k \log n)^{1/p}\})$ of the optimum, where $k$ is the number of distinct
distances in the input matrix.
- We can find an ultrametric tree $T$ whose $L_{relative}$-error is within a factor of
$O(\log^2 n)$ of the optimum.
Our algorithms also solve the problem of finding non-contracting ultrametrics,
i.e. when $T[i, j]$ is required to be at least $M[i, j]$ for all $i, j$. More generally,
we can require that each output distance is lower bounded by some arbitrary
positive value. This generalization allows us to also find additive metrics whose
$L_p$-error is within a factor of $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$ of the optimum by
appealing to work in [1].
1.2 Related Work
Aside from the aforementioned $L_\infty$ result given in [1], Ma et al. [2] present an
$O(n^{1/p})$ approximation algorithm for finding non-contracting ultrametrics under
$L_{p<\infty}$. Prior to our results, however, no algorithms with provable approximation
guarantees existed for fitting distances by additive metrics under $L_{p<\infty}$ in the
general setting.
Some of our results rely on the recent approximation algorithms for the prob-
lem of correlation clustering and related problems [8–11]. One of our algorithms
can be viewed as performing a hierarchical version of correlation clustering.
Finally, we should mention some recent work that addresses special cases of our
problem. In [12] an algorithm is given that finds a line-embedding of a metric
whose L1-error is O(log n) away from optimal. If the embedding is further re-
stricted to be a non-contracting line-embedding, then [13] presents an algorithm
whose approximation factor is constant.
2 Preliminaries
An ultrametric T on a set [n] is a metric that satisfies the following three-point
condition:
∀x, y, z ∈ [n] T [x, y] ≤ max{T [x, z], T [z, y]} .
That is, in an ultrametric, triangles are isosceles with the equal sides being
longest. An ultrametric is a special kind of tree metric where the distance from
the root to all points in [n] (the leaves) is the same. Recall that a tree metric
(equivalently an additive metric) A on [n] is a metric that satisfies the four-point
condition:
∀w, x, y, z ∈ [n] A[w, x] + A[y, z] ≤ max{A[w, y] + A[x, z], A[w, z] + A[x, y]} .
Given an $n \times n$ distance matrix $M$ where $M[i, j]$ is the observed distance
between objects $i$ and $j$, our initial objective is to find an edge-weighted
ultrametric $T$ with leaves labeled 1 through $n$ which minimizes the $L_p$ distance from
$M$, i.e. $T$ minimizes
\[ \|T, M\|_p = \sqrt[p]{\sum_{i,j} |T[i, j] - M[i, j]|^p} . \tag{1} \]
We will also look at finding an edge-weighted ultrametric $T$ which minimizes
the average stretch of the distances in $M$, i.e. $T$ minimizes
\[ \|T, M\|_{relative} = \sum_{i,j} \max\left\{ \frac{T[i, j]}{M[i, j]}, \frac{M[i, j]}{T[i, j]} \right\} . \tag{2} \]
The entry T [i, j] is the distance between the leaves i and j, which is defined
to be the sum of the edge weights on the path between i and j in T . We will also
refer to the splitting distance of an internal node v of T as the distance between
two leaves whose least common ancestor is v. Because T is an ultrametric, the
splitting distance of v is simply twice the height of v.
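The definitions of this section translate directly into code. The sketch below assumes that matrices are given as lists of lists and that the sums in (1) and (2) range over unordered pairs i < j (summing over ordered pairs would only change the values by a constant factor); the function names are illustrative.

```python
from itertools import combinations

def is_ultrametric(T):
    """Three-point condition: T[x][y] <= max(T[x][z], T[z][y]) for all x, y, z."""
    n = len(T)
    return all(
        T[x][y] <= max(T[x][z], T[z][y])
        for x in range(n) for y in range(n) for z in range(n)
    )

def lp_error(T, M, p):
    """(sum over pairs i < j of |T[i][j] - M[i][j]|^p)^(1/p), as in (1)."""
    total = sum(abs(T[i][j] - M[i][j]) ** p
                for i, j in combinations(range(len(M)), 2))
    return total ** (1.0 / p)

def relative_error(T, M):
    """Sum over pairs i < j of max(T/M, M/T), as in (2)."""
    return sum(max(T[i][j] / M[i][j], M[i][j] / T[i][j])
               for i, j in combinations(range(len(M)), 2))
```

For example, M = [[0,1,3],[1,0,2],[3,2,0]] violates the three-point condition, while raising the entry 2 to 3 yields an ultrametric at L1 distance 1 from M.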
We will assume that the input distances in $M$ are non-negative integers such
that
– $M[x, y] = M[y, x]$; and,
– $M[x, y] = 0 \iff x = y$.
That is, we will not assume that the distances in $M$ satisfy the triangle
inequality. We denote the distinct distances in $M$ by
$d_k > d_{k-1} > \dots > d_2 > d_1$.
Relationship to Correlation Clustering. The problem of finding an optimal
ultrametric $T$ minimizing $\|T, M\|_1$ is closely related to the problem of correlation
clustering introduced in [10]. We are interested in the minimization version of
correlation clustering which is defined as follows: given a graph G whose edges are
labeled “+” (similar) or “–” (dissimilar), cluster the vertices so as to minimize
the number of pairs incorrectly classified with respect to the input labeling.
That is, minimize the number of “–” edges within clusters plus the number of
“+” edges between clusters. We will simply refer to this problem as correlation
clustering. Note that the number of clusters is not specified in the input.
In fact, when G is complete, correlation clustering is equivalent to the prob-
lem of finding an optimal ultrametric under the L1 norm when the input dis-
tances in M are restricted to 1 and 2. An edge (i, j) in the graph labeled “+”
(resp. “–”) is equivalent to the entry M[i, j] being 1 (resp. 2). It is clear that an
optimal ultrametric is an optimal clustering, and vice versa. Hence, the APX-
hardness of finding an optimal ultrametric under the L1 norm follows directly
from [11, Theorem 11].
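The equivalence can be checked mechanically on small instances. The sketch below assumes a complete graph given as a dict from pairs (i, j), i < j, to the labels '+' or '-', and a clustering given as a list of cluster ids; both representations and the function names are illustrative choices.

```python
def clustering_disagreements(labels, cluster_of):
    """Number of '-' edges inside clusters plus '+' edges between clusters."""
    cost = 0
    for (i, j), sign in labels.items():
        same = cluster_of[i] == cluster_of[j]
        if (sign == '-' and same) or (sign == '+' and not same):
            cost += 1
    return cost

def l1_error_of_clustering(labels, cluster_of):
    """L1 error of the ultrametric with distance 1 inside clusters and 2 across,
    measured against M[i][j] = 1 for '+' edges and 2 for '-' edges."""
    err = 0
    for (i, j), sign in labels.items():
        m = 1 if sign == '+' else 2
        t = 1 if cluster_of[i] == cluster_of[j] else 2
        err += abs(t - m)
    return err
```

Each misclassified pair contributes exactly |1 − 2| = 1 to the L1 error, so the two functions agree on every clustering.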
In [11], Charikar, Guruswami and Wirth give a factor O(log n) approximation
to correlation clustering on general weighted graphs using linear programming.
In an instance of correlation clustering that is weighted, each edge $e$ has a weight
$w_e$ which can be either positive or negative. The objective is then to minimize
\[ \sum_{e : w_e > 0} (|w_e| \text{ if } e \text{ is split}) + \sum_{e : w_e < 0} (|w_e| \text{ if } e \text{ is not split}) . \]
The bound for the LP relaxation is established via an application of the region
growing procedure of Garg, Vazirani and Yannakakis [14]. We will state their
theorem below for reference as our algorithm in section 3.1 uses their algorithm
as a sub-procedure.
Theorem 1 ([11, Theorem 1]). There is a polynomial time algorithm that
achieves an O(log n) approximation for correlation clustering on general weighted
graphs.
3 Main Results
Both our algorithms take as input a set of splitting distances we call $S$ that
depends on the error norm. The distances in the constructed ultrametrics will
be a subset of the given set $S$. The following lemma quantifies the effect of
restricting the output distances to certain sets.

Lemma 1. (a) There exists an ultrametric $T$ with $T[i, j] \in \{d_1, d_2, \dots, d_k\}$ for
all $i, j$ that is optimal under the $L_1$ norm.
(b) There exists an ultrametric $T$ with $T[i, j] \in \{d_1, d_2, \dots, d_k\}$ for all $i, j$ such
that
\[ \|T, M\|_p \le 2\|T_{opt}, M\|_p , \]
for $p \ge 2$.
(c) Assuming $d_k = O(poly(n))$, there exists an ultrametric $T$ that uses
$O(\log_{1+\varepsilon} n)$ distances such that
\[ \|T, M\|_{relative} \le (1 + \varepsilon)\|T_{opt}, M\|_{relative} , \]
where $\varepsilon > 0$.
Proof. (a) Say an internal node $v$ is undesirable if its distance $h(v)$ to any of
its leaves satisfies $2h(v) \notin \{d_1, d_2, \dots, d_k\}$. Suppose $T_{opt}$ is an optimal
ultrametric with undesirable nodes. We will modify $T_{opt}$ so that it has one
less undesirable node. Let $v$ be the lowest undesirable node in $T_{opt}$ and let
$d = 2h(v) \in (d_\ell, d_{\ell+1})$ for some $1 \le \ell \le k - 1$. Define the following two
multisets:
\[ D_\ell = \{M[a, b] : a, b \text{ are in different subtrees of } v \text{ and } M[a, b] \le d_\ell\} , \]
\[ D_{\ell+1} = \{M[a, b] : a, b \text{ are in different subtrees of } v \text{ and } M[a, b] \ge d_{\ell+1}\} . \]
Then the contribution of the distances in $D_\ell \cup D_{\ell+1}$ to $\|T_{opt}, M\|_1$ is
\[ \sum_{\alpha \in D_\ell} (d - \alpha) + \sum_{\beta \in D_{\ell+1}} (\beta - d) . \]
The expression above is linear in $d$. If its slope $\ge 0$ then set $h(v) = d_\ell/2$, and
if the slope $< 0$ then set $h(v) = \min\{d_{\ell+1}/2, h(v')\}$ where $v'$ is the parent of
$v$. Such a change can only improve the cost of the tree.
(b) For $p \ge 2$, let $T_{opt}$ be an optimal ultrametric with undesirable nodes. We
will transform $T_{opt}$ to an ultrametric $T$ with no undesirable nodes such
that $\|T, M\|_p \le 2\|T_{opt}, M\|_p$. Let
\[ \|T_{opt}, M\|_p^p = \sum_u g_u(2h(u)) , \]
where the sum is over the internal nodes of $T_{opt}$ and $g_u(x)$ is the cost of
setting the splitting distance of node $u$ to $x$. Again, let $v$ be the lowest
undesirable node and define $D_\ell$ and $D_{\ell+1}$ as above. Fix $d = 2h(v) \in (d_\ell, d_{\ell+1})$.
We claim that $\min\{g_v(d_\ell), g_v(d_{\ell+1})\} \le 2^p g_v(d)$.
If $d \le (d_\ell + d_{\ell+1})/2$, then we can set $h(v) = d_\ell/2$ since for all $\alpha \in D_\ell$,
$d_\ell - \alpha \le d - \alpha$ and for all $\beta \in D_{\ell+1}$, $\beta - d_\ell \le 2(\beta - d)$. Otherwise, we
set $h(v) = d_{\ell+1}/2$. We are assuming w.l.o.g. that $v$ has no parent in the
region $(d_\ell, d_{\ell+1})$ since if such a parent $v'$ exists, $h(v')$ will also be set to $d_{\ell+1}/2$.

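(c) Let $D(T_{opt})$ be the set of distances in an optimal ultrametric that minimizes
$\|T, M\|_{relative}$. Group the distances in $D(T_{opt})$ geometrically, i.e. for some
$\varepsilon > 0$, group the distances into the following buckets:
\[ [1, 1 + \varepsilon] ,\ \left(1 + \varepsilon, (1 + \varepsilon)^2\right] ,\ \dots ,\ \left((1 + \varepsilon)^{s-1}, (1 + \varepsilon)^s\right] . \]
Let $t$ be the largest distance in $D(T_{opt})$. Clearly, $t \le d_k = O(poly(n))$.
Hence, the number of buckets $s = \log_{1+\varepsilon} t = O(\log_{1+\varepsilon} n)$. Now consider an
ultrametric $T'$ that sets $T'[i, j] = (1 + \varepsilon)^\ell$ if the optimal
$T[i, j] \in ((1 + \varepsilon)^{\ell-1}, (1 + \varepsilon)^\ell]$.
\begin{align*}
\|T', M\|_{relative} &= \sum_{i,j} \max\left\{ \frac{T'[i, j]}{M[i, j]}, \frac{M[i, j]}{T'[i, j]} \right\} \\
&\le \sum_{i,j} \max\left\{ \frac{(1 + \varepsilon)T[i, j]}{M[i, j]}, \frac{M[i, j]}{T[i, j]} \right\} \\
&\le (1 + \varepsilon)\|T_{opt}, M\|_{relative} .
\end{align*}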
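The rounding used in part (c) can be sketched as follows. This is an illustration under the assumptions that all off-diagonal distances are at least 1 and that repeated multiplication is an acceptable way to find the next power of 1 + ε; the function names are hypothetical. Unlike the first bucket above, this sketch leaves a distance of exactly 1 unchanged.

```python
def round_up_to_power(t, eps):
    """Smallest power (1 + eps)^l with l >= 0 that is >= t (assumes t >= 1)."""
    value = 1.0
    while value < t:
        value *= 1.0 + eps
    return value

def round_ultrametric(T, eps):
    """Round every off-diagonal entry of T up to the next power of 1 + eps."""
    n = len(T)
    return [[0 if i == j else round_up_to_power(T[i][j], eps) for j in range(n)]
            for i in range(n)]
```

Because the rounding is monotone, equal entries stay equal and larger entries stay at least as large, so the three-point condition is preserved, and every entry grows by a factor of at most 1 + ε.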
For ease of notation, we adopt the following conventions. Let $G = (V, E)$ be
the graph representing $M$ in the natural way. For an edge $e = (i, j)$ denote its
input distance $M[i, j]$ by $m_e$ and its output distance $T[i, j]$ by $t_e$. As described
in section 2, $w_e$ will code for the label and the weight $|w_e|$ on the edge passed to
the correlation clustering algorithm. The lower bound on $e$, $\lambda_e$, is the minimum
value to which $e$ can contract, i.e. $t_e \ge \lambda_e$.
Supplying our algorithm with an edge lower bounds matrix Λ allows us, for
example, to solve non-contracting versions of the objective functions we seek
to minimize where for all e, te ≥ me by simply setting Λ = M. We will also
use these lower bounds in section 4 when constructing general additive metrics
under Lp norms.
In the following two subsections we present algorithms for our problem. The
first algorithm is suitable if the number of distinct distances, $k$, in $M$ is small.
Otherwise, the second algorithm is more suitable.
3.1 Algorithm 1
Our algorithm takes as input a set of splitting distances $S$. Each distance in the
constructed tree will belong to this set. Let $|S| = \kappa$ and number the splitting
distances in ascending order $s_1 < s_2 < \dots < s_\kappa$. The algorithm considers the
splitting distances in descending order, and when considering $s_l$ it may set some
distances $T[i, j] = s_l$. If a distance of the tree is not set at this point, it will
later be set to $\le s_{l-1}$. The decision of which distances to set to $s_l$ and which
distances to set to $\le s_{l-1}$ will be made using correlation clustering. See Fig. 1
for the description of the algorithm.
Theorem 2. Algorithm 1 can be used to ﬁnd an ultrametric T such that any
one of the following holds:

Algorithm Correlation-Clustering-Splitting(G, S, Λ)
(∗ Uses correlation clustering to decide how to split ∗)
1. Let all edges be “unset”
2. for l = κ to 1:
3. do Do correlation clustering on the graph induced by the unset edges with
weights:
-If me ≥ sl and λe sl then,
we = −(f(me, sl−1) − f(me, sl))
-If λe = sl then we = −∞
-If me = si sl then we = f(si, sl)
4. for For each unset edge e split between diﬀerent clusters:
5. do te ← sl and mark e as “set”
Fig. 1. Algorithm 1 (The function f is deﬁned in Thm. 2)
1. $\|T, M\|_p \le O((k \log n)^{1/p}) \, \|T_{opt}, M\|_p$ if $S = \{d_1, \dots, d_k\}$ and $f(m_e, t_e) = |m_e - t_e|^p$.
2. $\|T, M\|_{relative} \le O(\log^2 n) \, \|T_{opt}, M\|_{relative}$ if $S = \{(1+\epsilon)^i : 0 \le i \le \log_{1+\epsilon} d_k\}$ and $f(m_e, t_e) = \max\{t_e/m_e, \, m_e/t_e\}$.
Proof. Our algorithm produces an ultrametric T where the splitting distance of
each node is restricted to be from the set S, i.e. $t_e \in S$ for all e. The proof below
shows that the algorithm gives an $O(|S| \log n)$-approximation to $\sum_e f(m_e, t^*_e)$,
where $T^*$ is the optimal ultrametric satisfying $t^*_e \in S$ for all e. The results in the
theorem will then follow by appealing to Lemma 1.
Consider the correlation clustering instance performed in iteration l of the
algorithm. Let costopt(l) be the optimal value for this instance and let cost(l)
be the cost of our solution.
Claim 1: $\sum_{1 \le l \le \kappa} cost(l) = \sum_e f(m_e, t_e)$.
Consider each edge e in turn. Let $t_e = s_l$. If $s_l > m_e$, then in the l-th iteration
we pay $f(m_e, s_l)$ for this edge. If $s_l < m_e = s_{l'}$, then in each iteration i, $l' \ge i > l$,
we pay $f(s_{l'}, s_{i-1}) - f(s_{l'}, s_i)$; hence, in total we pay $f(s_{l'}, s_l) = f(m_e, t_e)$.
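The telescoping step in Claim 1 can be checked numerically. The following is a minimal sketch, assuming the $L_p$ cost $f(m_e, t_e) = |m_e - t_e|^p$ from Theorem 2 (so that $f(x, x) = 0$); the function names and the sample distances are hypothetical and use 0-based indices into the ascending list of splitting distances.

```python
def f_lp(m_e, t_e, p=2):
    """Per-edge cost f(m_e, t_e) = |m_e - t_e|^p (Theorem 2, case 1)."""
    return abs(m_e - t_e) ** p


def total_paid(s, l, l_prime, f):
    """Sum of per-iteration payments for an edge with m_e = s[l_prime], t_e = s[l].

    s is the ascending list of splitting distances (0-based, an assumption of
    this sketch) and l < l_prime. In each iteration i with l < i <= l_prime
    the edge pays f(m_e, s[i-1]) - f(m_e, s[i]).
    """
    m_e = s[l_prime]
    return sum(f(m_e, s[i - 1]) - f(m_e, s[i]) for i in range(l + 1, l_prime + 1))


# Hypothetical splitting distances: m_e = 11 and t_e = 2.
s = [1, 2, 4, 7, 11]
paid = total_paid(s, 1, 4, f_lp)
direct = f_lp(s[4], s[1])
```

Both `paid` and `direct` evaluate the same quantity, the first as a sum over iterations and the second as the single term $f(m_e, t_e)$.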
Claim 2: $cost_{opt}(l) \le \sum_e f(m_e, t^*_e)$.
Consider the following solution to the correlation clustering problem at iteration
l induced by $T^*$: for all unset edges e, if $t^*_e \ge s_l$ we split e and if $t^*_e < s_l$ we
don't split e. We claim that the cost of this solution for the correlation clustering
problem is at most $\sum_e f(m_e, t^*_e)$. Consider each edge e in turn.
– $t^*_e < s_l$ and $m_e < s_l$: Not splitting this edge contributes nothing to the
correlation clustering objective.
– $t^*_e \ge s_l$ and $m_e < s_l$: Splitting this edge contributes $f(s_l, m_e)$ to the correlation
clustering objective but contributes $f(t^*_e, m_e) \ge f(s_l, m_e)$ to $\sum_e f(m_e, t^*_e)$.
– $t^*_e < s_l$ and $m_e \ge s_l$: Not splitting this edge contributes $f(m_e, s_{l-1}) - f(m_e, s_l)$
to the correlation clustering objective but contributes
$f(m_e, t^*_e) \ge f(m_e, s_{l-1}) \ge f(m_e, s_{l-1}) - f(m_e, s_l)$ to $\sum_e f(m_e, t^*_e)$.
– $t^*_e \ge s_l$ and $m_e \ge s_l$: Splitting this edge contributes nothing to the correlation
clustering objective.

Algorithm Min-Cut-Splitting(G, S, $s_{l^*}$, Λ)
(∗ Uses min cuts to work out splits ∗)
1. $l \leftarrow l^* + 1$
2. Min-Split-Cost ← ∞
3. repeat
4.   $l \leftarrow l - 1$
5.   Push-Down-Cost ← $\sum_e \left[ (\max\{0, m_e - s_l\})^p - (\max\{0, m_e - s_{l^*}\})^p \right]$
6.   if there exists an edge e = (s, t) such that $\lambda_e = s_l$
7.     then Find min-(s, t) cut C in G with edge weights
       $w_e = (\max\{0, s_l - m_e\})^p$
8.     else Find min-cut C in G with edge weights
       $w_e = (\max\{0, s_l - m_e\})^p$
9.   Cut-Cost ← the cost of the cut
10.  if Cut-Cost + Push-Down-Cost ≤ Min-Split-Cost
11.    then Best-Cut ← C
12.      Best-Splitting-Point ← $s_l$
13.      Min-Split-Cost ← Cut-Cost + Push-Down-Cost
14. until l = 0 or there exists an edge e with $\lambda_e = s_l$
15. for all edges e in Best-Cut:
16.   do $t_e \leftarrow$ Best-Splitting-Point
17. for each connected component $G' \in (V, E \setminus \text{Best-Cut})$:
18.   do Min-Cut-Splitting(G', S, Best-Splitting-Point, Λ)
Fig. 2. Algorithm 2
Summing the contributions to both objective functions over all edges gives the
second claim.
Combining the above claims with Thm. 1, the tree we construct has the
following property:
\[ \sum_e f(t_e, m_e) = \sum_{1 \le l \le \kappa} cost(l) \le O(\kappa \log n) \sum_e f(m_e, t^*_e) . \]
The theorem follows.
3.2 Algorithm 2
Our second algorithm also takes as input a set of splitting distances S and,
as before, each distance in the constructed tree belongs to this set. However
while the approximation guarantee of the first algorithm depended on |S|, the
approximation guarantee of the second algorithm depends only on n. At each
step the first algorithm decided whether or not to place internal nodes at height
$s_l$, and, if it did, how to partition the nodes below. In our second algorithm, at
each step we instead decide the height at which we should place the next internal
node and its partition. See Fig. 2 for the description of the algorithm. The first
call to the algorithm sets $s_{l^*} = s_\kappa$.
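The two per-iteration quantities used by Algorithm 2 (lines 5, 7 and 8 of Fig. 2) can be sketched as follows. This is only an illustration of the cost formulas, not of the min-cut computation or the recursion; the function names and the representation of the edges as a plain list of input distances are assumptions of this sketch.

```python
def push_down_cost(m, s_l, s_lstar, p):
    """Push-Down-Cost of Fig. 2, line 5.

    m is a list with the input distance m_e of every edge in the current
    graph, s_l is the candidate splitting distance and s_lstar the splitting
    distance passed to the current call.
    """
    return sum(max(0, m_e - s_l) ** p - max(0, m_e - s_lstar) ** p for m_e in m)


def cut_weights(m, s_l, p):
    """Edge weights w_e = max(0, s_l - m_e)^p used for the min-cut (lines 7-8)."""
    return [max(0, s_l - m_e) ** p for m_e in m]


# Hypothetical data: three edges, candidate height 2, current height 4, p = 2.
m = [1, 3, 5]
example_push_down = push_down_cost(m, 2, 4, 2)
example_weights = cut_weights(m, 2, 2)
```

Edges whose input distance lies above the candidate height pay through the Push-Down-Cost, while edges below it pay only if they are cut, which is why the two formulas clip at zero in opposite directions.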
Theorem 3. Algorithm 2 can be used to find an ultrametric T such that any
one of the following holds:

1. $\|T, M\|_1 \le n \, \|T_{opt}, M\|_1$ if $S = \{d_1, \dots, d_k\}$.
2. For $p \ge 2$, $\|T, M\|_p \le 2 n^{1/p} \, \|T_{opt}, M\|_p$ if $S = \{d_1, \dots, d_k\}$.
Proof. Our algorithm produces an ultrametric T where the splitting distance of
each node is restricted to be from the set S, i.e. $t_e \in S$ for all e. The proof below
shows that the algorithm gives an n-approximation to $\|T^*, M\|_p^p$, where $T^*$ is the
optimal ultrametric satisfying $t^*_e \in S$ for all e. The results in the theorem will
then follow by appealing to Lemma 1.
Claim 1: The sum of Min-Split-Cost over all recursive calls of Min-Cut-Splitting
equals $\|T, M\|_p^p$.
Consider an edge e = (i, j) and let v be the lowest common ancestor of i and
j in T. If $m_e \le t_e$ then we paid $(t_e - m_e)^p$ for this edge in the Cut-Cost when
splitting at v. If $m_e > t_e$, consider the internal nodes on the path from the root to
v that have splitting distances $\le m_e$, say $m_e \ge s_{i_1} > s_{i_2} > \dots > s_{i_j} = t_e$. We paid a
total of
\[ (m_e - s_{i_2})^p + \left[ (m_e - s_{i_3})^p - (m_e - s_{i_2})^p \right] + \dots + \left[ (m_e - s_{i_j})^p - (m_e - s_{i_{j-1}})^p \right] = (m_e - t_e)^p \]
for this edge as Push-Down-Costs.
Claim 2: The Min-Split-Cost of each call is at most $\|T^*, M\|_p^p$.
Consider a call Min-Cut-Splitting($\hat{G} = (\hat{V}, \hat{E})$, ·, $s_l$, ·). If there exists an $e \in \hat{E}$
such that $t^*_e \ge s_l$, then $\{e \in \hat{E} : t^*_e \ge s_l\}$ contains at least one cut; let
C be such a cut of minimum weight. For edges $e \in C$ the cost of cutting e is
$(\max\{0, s_l - m_e\})^p \le |t^*_e - m_e|^p$. Hence the Cut-Cost is $\le \|T^*, M\|_p^p$. The
Push-Down-Cost is 0 since we are cutting in the first iteration of the loop; therefore,
Min-Split-Cost $\le \|T^*, M\|_p^p$.
If all $e \in \hat{E}$ satisfy $t^*_e < s_l$ then let the splitting point be $s_{l'} = \max_{e \in \hat{E}} \{t^*_e\}$. The
Push-Down-Cost is then at most
\[ \sum_{e \in \hat{E}} (\max\{0, m_e - s_{l'}\})^p \le \sum_{e \in \hat{E} : m_e > t^*_e} (m_e - t^*_e)^p . \]
Now the set of edges $\{e \in \hat{E} : t^*_e = s_{l'}\}$ contains at least one cut and, as
before, choosing the minimum weight cut, call it C, results in the Cut-Cost
being equal to $\sum_{e \in C} (\max\{0, s_{l'} - m_e\})^p = \sum_{e \in C : t^*_e > m_e} (t^*_e - m_e)^p$. Hence,
\[ \text{Min-Split-Cost} \le \sum_{e \in \hat{E} : m_e > t^*_e} (m_e - t^*_e)^p + \sum_{e \in C : t^*_e > m_e} (t^*_e - m_e)^p \le \|T^*, M\|_p^p . \]
The number of recursive calls of Min-Cut-Splitting is n − 1 because each call
fixes an internal node of the tree being constructed and the tree has n leaves.
Therefore, $\|T, M\|_p^p \le (n - 1) \|T^*, M\|_p^p$ and the theorem follows. Note that
while a slightly better analysis gives that $\|T, M\|_p^p \le D \, \|T^*, M\|_p^p$, where D is the
depth of the recursion tree, D can be as much as n − 1.

4 Extension to Additive Trees
In this section, we will generalize our results to approximating the input matrix
M by general additive metrics under any Lp norm. Our generalization depends
on the following theorem from [1],
Theorem 4 (see [1, Theorem 6.2]). If G(M) is an algorithm which achieves
an α-approximation to the optimal a-restricted ultrametric under the Lp norm,
then there is an algorithm F(M) which achieves a 3α-approximation to the op-
timal additive metric under the same norm.
We will show how our algorithms from section 3 can be used to produce a-
restricted ultrametrics. We start with the definition of an a-restricted ultrametric
from [1].
Definition 1. For a point a, an ultrametric $T^a$ is a-restricted with respect to a
distance matrix M if
(1) $T^a[a, i] = 2\mu_a$ for all $i \ne a$,
(2) $2\mu_a \ge T^a[i, j] \ge 2(\mu_a - \min\{M[a, i], M[a, j]\})$ for all i, j,
where $\mu_a = \max_i M[a, i]$.
The definition of an a-restricted ultrametric immediately implies a procedure
for approximating the distance $\|T^a_{opt}, M\|_p$ between an optimal $T^a_{opt}$ and M.
For a point a, let $M^a$ be the matrix M with row a and column a deleted, and
let $\Lambda^a$ be the $(n-1) \times (n-1)$ edge lower bounds matrix where
\[ \Lambda^a[i, j] = 2(\mu_a - \min\{M[a, i], M[a, j]\}) \]
for all $i, j \in [n] \setminus \{a\}$, $i \ne j$. Given $G^a$, the graph representing $M^a$, and $\Lambda^a$, our
algorithms now find an a-restricted ultrametric $T^a$ such that
\[ \|T^a, M\|_p \le O(\min\{n^{1/p}, (k \log n)^{1/p}\}) \, \|T^a_{opt}, M\|_p . \]
Appealing to Thm. 4, we have an $O(\min\{n^{1/p}, (k \log n)^{1/p}\})$-approximation to the
optimal additive metric under $L_p$.
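The construction of the lower bounds matrix for a fixed point a can be sketched as below. This is a minimal illustration assuming M is given as a symmetric list of lists with 0-based indices; the function name is hypothetical, and the diagonal entries, which the text leaves undefined, are set to 0 here purely as a convention of the sketch.

```python
def restricted_lower_bounds(M, a):
    """Build the (n-1) x (n-1) matrix with entries 2 * (mu_a - min(M[a][i], M[a][j])).

    Rows and columns follow the points other than a, in increasing index order.
    Diagonal entries are set to 0 (an assumption; the definition only covers i != j).
    """
    n = len(M)
    mu_a = max(M[a][i] for i in range(n))
    others = [i for i in range(n) if i != a]
    return [
        [0 if i == j else 2 * (mu_a - min(M[a][i], M[a][j])) for j in others]
        for i in others
    ]


# Hypothetical 3-point distance matrix, with a = 0 so that mu_a = 3.
M = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
lower = restricted_lower_bounds(M, 0)
```

Passing this matrix as Λ to either algorithm of section 3 enforces condition (2) of the definition from below; the upper bound $2\mu_a$ is a separate constraint that the sketch does not address.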
5 Conclusions and Further Work
In this paper we have looked at embedding metrics into additive trees and ul-
trametrics. We have presented two algorithms, one suitable when the number
of distinct distances in the metric is small, and one suitable when the number
of distinct distances is large. Both algorithms are intrinsically greedy; they con-
struct trees in a top-down fashion, establishing each internal node in turn by
considering the immediate cost of the split it defines. Using these algorithms we
provide the first approximation guarantees for this problem; however, there is
scope for improving those guarantees.
Addendum: We recently learned that, independent of our work, Ailon and
Charikar [15] have obtained improved results. They use ideas similar to those in
our work.

References
1. Agarwala, R., Bafna, V., Farach, M., Paterson, M., Thorup, M.: On the approxima-
bility of numerical taxonomy (fitting distances by tree metrics). SIAM J. Comput.
28 (1999) 1073–1085
2. Ma, B., Wang, L., Zhang, L.: Fitting distances by tree metrics with increment
error. J. Comb. Optim. 3 (1999) 213–225
3. Farach, M., Kannan, S.: Efficient algorithms for inverting evolution. Journal of the
ACM 46 (1999) 437–450
4. Cryan, M., Goldberg, L., Goldberg, P.: Evolutionary trees can be learned in poly-
nomial time in the two state general markov model. SIAM J. Comput 31 (2001)
375 – 397
5. Waterman, M., Smith, T., Singh, M., Beyer, W.: Additive evolutionary trees. J.
Theoretical Biology 64 (1977) 199–213
6. Day, W.: Computational complexity of inferring phylogenies from dissimilarity
matrices. Bulletin of Mathematical Biology 49 (1987) 461–467
7. Farach, M., Kannan, S., Warnow, T.: A robust model for finding optimal evolu-
tionary trees. Algorithmica 13 (1995) 155–179
8. Emanuel, D., Fiat, A.: Correlation clustering - minimizing disagreements on arbi-
trary weighted graphs. In Battista, G.D., Zwick, U., eds.: ESA. Volume 2832 of
Lecture Notes in Computer Science., Springer (2003) 208–220
9. Demaine, E.D., Immorlica, N.: Correlation clustering with partial information. In
Arora, S., Jansen, K., Rolim, J.D.P., Sahai, A., eds.: RANDOM-APPROX. Volume
2764 of Lecture Notes in Computer Science., Springer (2003) 1–13
10. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: Proc. of the 43rd
IEEE Annual Symposium on Foundations of Computer Science. (2002) 238
11. Charikar, M., Guruswami, V., Wirth, A.: Clustering with qualitative information.
In: Proc. of the 44th IEEE Annual Symposium on Foundations of Computer Sci-
ence. (2003) 524
12. Dhamdhere, K.: Approximating additive distortion of embeddings into line metrics.
In Jansen, K., Khanna, S., Rolim, J.D.P., Ron, D., eds.: APPROX-RANDOM.
Volume 3122 of Lecture Notes in Computer Science., Springer (2004) 96–104
13. Dhamdhere, K., Gupta, A., Ravi, R.: Approximation algorithms for minimizing
average distortion. In Diekert, V., Habib, M., eds.: STACS. Volume 2996 of Lecture
Notes in Computer Science., Springer (2004) 234–245
14. Garg, N., Vazirani, V.V., Yannakakis, M.: Approximate max-flow min-(multi)cut
theorems and their applications. SIAM J. Comput. 25 (1996) 235–251
15. Ailon, N., Charikar, M.: Personal communication (2005)

Beating a Random Assignment
Gustav Hast
Department of Numerical Analysis and Computer Science
Royal Institute of Technology, 100 44 Stockholm, Sweden
ghast@nada.kth.se
Abstract. Max CSP(P) is the problem of maximizing the weight of
satisfied constraints, where each constraint acts over a k-tuple of literals
and is evaluated using the predicate P. The approximation ratio of a
random assignment is equal to the fraction of satisfying inputs to P. If
it is NP-hard to achieve a better approximation ratio for Max CSP(P),
then we say that P is approximation resistant. Our goal is to characterize
which predicates have this property.
A general approximation algorithm for Max CSP(P) is introduced. For
a multitude of different P, it is shown that the algorithm beats the ran-
dom assignment algorithm, thus implying that P is not approximation
resistant. In particular, over 2/3 of the predicates on four binary inputs
are proved not to be approximation resistant, as well as all predicates on
2s binary inputs that have at most 2s + 1 accepting inputs.
We also prove a large number of predicates to be approximation resistant.
In particular, all predicates of arity $2s + s^2$ with fewer than $2^{s^2}$ non-accepting
inputs are proved to be approximation resistant, as well as
almost 1/5 of the predicates on four binary inputs.
1 Introduction
For a constraint satisfaction problem, CSP, the objective is to satisfy a collection
of constraints by finding a good assignment. In the maximization version, Max
CSP, each constraint has a weight and the objective is to maximize the weight
of satisfied constraints. By restricting which types of constraints are allowed,
different types of problems can be defined. Two well-known examples are Max
CUT and Max E3SAT.
A constraint in a Max CUT instance consists of two binary variables, $x_i$ and
$x_j$, and it is satisfied iff $x_i \ne x_j$. Constraints of Max E3SAT are disjunctions
over three literals, where a literal is a variable or a negated variable. Both these
problems are NP-complete. A naive algorithm for these problems, as well as all
Max CSPs, is simply assigning all variables a random value, without looking
at the constraints. It is easy to see that such a random assignment on average
satisfies half of the constraints in a Max CUT instance and 7/8 of the constraints
in a Max E3SAT instance.
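The two fractions quoted above can be reproduced by enumerating all inputs of a predicate. The following is a small sketch; the function name and the encoding of inputs as Python booleans are assumptions of this illustration, and it treats the literals of a constraint as distinct, independent, uniformly random bits.

```python
from itertools import product


def accepting_fraction(predicate, k):
    """Fraction of the 2^k Boolean inputs accepted by a k-ary predicate.

    For a constraint over k distinct variables this is the probability that a
    uniformly random assignment satisfies it.
    """
    inputs = list(product([False, True], repeat=k))
    return sum(1 for y in inputs if predicate(*y)) / len(inputs)


# Max CUT: a constraint is satisfied iff its two variables differ.
cut_fraction = accepting_fraction(lambda a, b: a != b, 2)

# Max E3SAT: a constraint is a disjunction of three literals.
e3sat_fraction = accepting_fraction(lambda a, b, c: a or b or c, 3)
```

By linearity of expectation the same fraction applies to the total weight of satisfied constraints, regardless of how the constraints overlap.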
An algorithm is said to α-approximate a problem if the value of a produced
solution is at least α · OPT , where OPT is the value of an optimal solution.
For randomized algorithms, we allow this value to be an expected value over the
random choices made by the algorithm. A major breakthrough in approximating
various Max CSPs was made when Goemans and Williamson used a semidefi-
nite relaxation approach in order to 0.878-approximate Max CUT. Before this,
there was no known polynomial time $(1/2 + \epsilon)$-approximation for any constant
$\epsilon > 0$, thus no known efficient algorithms outperformed a random assignment
in approximating Max CUT. Since then, the semidefinite relaxation technique
has been used extensively in order to achieve better approximation ratios for a
large number of Max CSP problems, e.g. in [16, 18, 22, 23].
But the use of semidefinite relaxation cannot be applied to all CSPs with
success. Based on PCP techniques, a different line of research produced results
showing NP-hardness of approximating various problems [1, 2, 6, 8, 19, 20].
Håstad [13] introduced a technique that enabled very efficient analysis of PCPs
and could show that there is no hope to efficiently approximate Max E3SAT
better than 7/8. Thus, short of proving P = NP, the best possible efficient
approximation algorithm for Max E3SAT, up to low order terms, is simply to
pick a random assignment.
Max CSP(P) is the problem of maximizing the weight of satisfied con-
straints, where each constraint acts over a k-tuple of literals and is evaluated
using the predicate P. As Creignou [4] and Khanna et al. [17] showed, Max
CSP(P) is Max SNP-hard for all predicates depending on at least two vari-
ables. Thus, only for the unary predicate the problem can be solved optimally.
An interesting distinction is whether a predicate allows a nontrivial approxi-
mation, i.e., can be approximated better than using a random assignment. If
it is NP-hard to outperform a random assignment on Max CSP(P), then we
say that P is approximation resistant. There are many reasons for studying ap-
proximation resistance. Approximation resistance is a fundamental property of a
predicate which determines if anything nontrivial can be done in approximating
the Max CSP. It is thus an important structural question. We believe that un-
derstanding this concept is a key to comprehending what it is that makes some
problems hard to approximate. Approximation resistant predicates also play a
fundamental role in the design of efficient PCPs.
There are no predicates of arity two that are approximation resistant [10],
even if the input variables are non-binary [14]. Håstad [13] showed that if P de-
pends on three binary inputs and accepts all odd parity inputs or all even parity
inputs, then P is approximation resistant. Zwick [22] completed the character-
ization for predicates of arity three by producing approximation algorithms for
Max CSP(P), for all other predicates P, that outperform a random assignment.
Samorodnitsky and Trevisan [21] constructed approximation resistant predicates
of arity $2s + s^2$ with only $2^{2s}$ accepting assignments. A predicate P implies P'
iff P(y) ⇒ P'(y) for all y. In this paper we show that all predicates implied
by a Samorodnitsky-Trevisan predicate are approximation resistant as well. As
a consequence, all predicates of arity $2s + s^2$ with fewer than $2^{s^2}$ non-accepting
inputs are also approximation resistant. On the other hand, we also show that
if a predicate of arity 2s has at most 2s + 1 accepting inputs, then the predicate
is not approximation resistant. Thus, there is a tendency that the more inputs
a predicate accepts, the harder it is to beat a random assignment for the corresponding
Max CSP. It could be tempting to assume that if a predicate P is
approximation resistant and P implies P', then P' is approximation resistant as
well. As Zwick showed, this is true for all predicates of arity three and it is also
true if P is a Samorodnitsky-Trevisan predicate. However, this is not true in
general. While studying predicates of arity four, we observed that the predicate
$(x_1 \wedge (x_2 \not\equiv x_3)) \vee (\bar{x}_1 \wedge (x_2 \not\equiv x_4))$, which has been shown to be approximation
resistant [11], implies that $x_2$, $x_3$ and $x_4$ cannot all be equal. Zwick [22] showed
that this latter predicate is not approximation resistant.
1.1 Our Contribution
Among other contributions, we introduce a general technique for approximat-
ing Max CSP instances. It works by ﬁnding a solution that gives a positive
contribution on the low-degree terms of the Fourier spectrum of the objective
function, while ensuring that the higher order terms give a smaller positive or
negative contribution. By using a global analysis of the instance, we prove that
the method beats a random assignment on Max CSP(P) for a multitude of
diﬀerent predicates P, thus showing them not to be approximation resistant.
In addition, we give non-trivial conditions that optimal solutions to hard Max
CSP instances have to adhere to. We also show a large number of predicates to
be approximation resistant.
Method. The objective function of a Max CSP(P) instance, I, is the weighted
sum of the indicator variables of the constraints. In Section 2 we show that each
such term can be written as a multilinear expression. The whole sum can thus
also be written as a multilinear expression
\[ I(x_1, \dots, x_n) = \sum_{i=1}^{m} w_i P(z_{i1}, \dots, z_{ik}) = \sum_{S \subseteq [n], |S| \le k} c_S \prod_{i \in S} x_i . \quad (1) \]
In fact this is the Fourier spectrum of the objective value function. Given a random
assignment, where each binary variable $x_i$ is assigned −1 or 1 randomly and
independently, all terms of the above expression that include at least one variable
$x_i$ have an expected value of zero. The only term that contributes
to the expectation is then the constant term, $c_\emptyset$. Thus, the expected value of
a random assignment is equal to $c_\emptyset$. In order to beat a random assignment, we
have to produce an assignment with an expected weight of at least $c_\emptyset + \gamma W$, for
a constant $\gamma > 0$, where W is the total weight of the instance.
If there is large weight on the linear terms of the Fourier spectrum, i.e.,
$\sum_{i=1}^{n} |c_{\{i\}}| = \delta W$ for a constant $\delta > 0$, then it is easy to assign values to
the variables such that the linear terms give a contribution of δW. The problem is
that there is no control over the higher order terms, which could give a
negative contribution that cancels out the contribution from the linear terms. In
order to reduce the contribution of the higher order terms, we therefore only
give a small bias towards the solution that maximizes the linear terms. With
probability ε, depending on k and δ, we assign $x_i$ to $sgn(c_{\{i\}})$, and otherwise
assign $x_i$ a random value. The linear terms then give a positive contribution of
εδW, while the absolute contribution of each term of order i is
decreased by a factor of $\epsilon^i$. The total absolute contribution of the higher order
terms can thereby be made to be at most 1/3 of the contribution of the linear
terms by choosing ε small enough. Thus, the algorithm finds an assignment
with expected value of at least $c_\emptyset + 2\epsilon\delta W/3$ and thereby it beats a random
assignment.
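The effect of the bias on each term can be made concrete with a short computation. In the sketch below, which is only an illustration, the objective is represented as a dictionary from index sets to coefficients; the function name, this representation and the sample coefficients are all hypothetical. Since the variables are independent and each has expectation ε times its target value, a term of order i is scaled by a factor of magnitude $\epsilon^i$.

```python
from math import prod


def biased_expectation(coeffs, target, eps):
    """Expected value of a multilinear objective under an eps-biased assignment.

    Each x_i equals target[i] with probability eps and is uniform on {-1, +1}
    otherwise, independently, so E[x_i] = eps * target[i] and
    E[prod_{i in S} x_i] = eps^{|S|} * prod_{i in S} target[i].
    coeffs maps a frozenset S of variable indices to the coefficient c_S.
    """
    return sum(c * prod(eps * target[i] for i in S) for S, c in coeffs.items())


# Hypothetical coefficients: constant 0.5, linear terms +0.25 and -0.25,
# and one bi-linear term that works against the linear optimum.
coeffs = {
    frozenset(): 0.5,
    frozenset({0}): 0.25,
    frozenset({1}): -0.25,
    frozenset({0, 1}): 0.25,
}
target = [1, -1]  # sgn of the linear coefficients

unbiased = biased_expectation(coeffs, target, 0.0)
half_biased = biased_expectation(coeffs, target, 0.5)
```

With ε = 0.5 the two linear terms each contribute 0.125 while the bi-linear term contributes only −0.0625, so the expectation exceeds the constant term.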
We can apply the same method to the bi-linear terms, if there exists an
assignment that gives a non-negligible value on those terms. In that case,
we use Charikar and Wirth’s approximation algorithm for quadratic programs
[3], which is based on semidefinite programming, in order to find an assignment
that gives a non-negligible value on the bi-linear terms. We use that algorithm,
because even if the optimal value is small it is guaranteed to outperform the
expected value of a random assignment. We note that we also could use one of
the Max CUT algorithms in [9, 23].
In order to show that a predicate is not approximation resistant, it is enough
to create an approximation algorithm that beats a random assignment on sat-
isfiable and almost satisfiable instances. This is because a random assignment
achieves the same expected objective value irrespective of the satisfiability of the
instance. Thus, satisfiable and almost satisfiable instances are the ones where a
random assignment achieves the worst approximation ratio. By combining the
algorithm for almost satisfiable instances with a random assignment, we then get
a better approximation ratio than using a random assignment alone. For some
predicates P it is possible to show that an assignment that almost satisfies an
instance of Max CSP(P) has to give a non-negligible positive value on either
the linear or bi-linear terms of the Fourier spectrum. This satisfies the condition
for our algorithm to outperform a random assignment, which in turn shows that
P is not approximation resistant.
Application 1. The predicates P and P' are of the same type if Max CSP(P)
and Max CSP(P') are essentially the same problem. There are exactly 400
different non-trivial predicate types of arity four. By using the above technique
we show that 270 of these are not approximation resistant. In particular, all
predicates with at most six accepting inputs are not approximation resistant.
Extending methods in [13] we derive that 70 predicate types are approximation
resistant, thus all but 60 predicate types of arity four are characterized.
Application 2. We show that if P is a predicate of arity 2s and has at most 2s+1
accepting inputs, then it is not approximation resistant. This shows that the
approximation resistant predicate of Engebretsen and Holmerin [5], with arity
six and eight accepting inputs, is optimal in the sense of minimizing the number
of accepting inputs. The method also gives lower bounds on the soundness for
a natural class of probabilistically checkable proofs. These PCPs express NP
and have perfect or almost perfect completeness and soundness negligibly higher
than the acceptance probability of a random proof.

Application 3. Our method enables us to characterize optimal solutions to hard
instances of some Max CSPs. Similar insights, in particular for constraints of
arity three, were achieved in [7].
Approximation Resistant Predicates. A Samorodnitsky-Trevisan predicate $P_{ST_{s,t}}$
consists of st different parity checks, each acting on three bits of the input,
\[ P_{ST_{s,t}}(x_1, \dots, x_s, x'_1, \dots, x'_t, x''_1, \dots, x''_{st}) = \bigwedge_{1 \le i \le s,\ 1 \le j \le t} x_i \oplus x'_j \oplus x''_{(i-1)t+j} . \]
Samorodnitsky and Trevisan [21] showed that such a predicate is approximation
resistant. In this paper we revisit the proof and show that all predicates implied
by a Samorodnitsky-Trevisan predicate are approximation resistant as well. A
consequence of this is that all predicates of arity s + t + st with fewer than $2^{st}$ non-accepting
inputs are approximation resistant. We also show that predicates of
arity four and five, having at most two and five non-accepting inputs respectively,
are approximation resistant. This is optimal in the sense that there are predicates
with one more non-accepting input that we show are not approximation resistant.
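The structure of a Samorodnitsky-Trevisan predicate can be explored by brute force for small s and t. The sketch below is an illustration only: it encodes bits as 0/1, uses 0-based indices, and assumes that a parity check is passed when the XOR of its three bits equals 1 (the opposite convention would give the same counts). The function names are hypothetical.

```python
from itertools import product


def st_predicate(x, xp, xpp, s, t):
    """True iff all s*t parity checks x_i XOR x'_j XOR x''_{(i-1)t+j} pass.

    x has s bits, xp has t bits and xpp has s*t bits. With 0-based i and j
    the third bit of check (i, j) sits at position i * t + j.
    """
    return all(
        x[i] ^ xp[j] ^ xpp[i * t + j] == 1 for i in range(s) for j in range(t)
    )


def count_accepting(s, t):
    """Number of accepting inputs among all 2^(s + t + s*t) inputs."""
    k = s + t + s * t
    count = 0
    for bits in product([0, 1], repeat=k):
        x, xp, xpp = bits[:s], bits[s:s + t], bits[s + t:]
        count += st_predicate(x, xp, xpp, s, t)
    return count


accepting_1_2 = count_accepting(1, 2)  # arity 5
accepting_2_2 = count_accepting(2, 2)  # arity 8
```

Every choice of the first s + t bits forces the remaining st bits, so the count is $2^{s+t}$; for s = t this matches the $2^{2s}$ accepting assignments at arity $2s + s^2$ mentioned in the introduction.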
1.2 Organization of the Paper
Section 2 gives some notation and definitions used in the paper. In Section 3
we describe the method of exploiting the existence of a good solution on low-
degree terms in order to beat a random assignment. In the following section
we use this method in order to show that predicates with few accepting inputs
are not approximation resistant. We also show that predicates with few non-
accepting inputs are approximation resistant. In Section 5 we take a closer look
at predicates of arity four. The subsequent section shows that approximation
resistance is a non-monotone property. Section 7 is devoted to characterizing
optimal solutions to hard instances of Max CSP(P). All proofs are omitted
due to space constraints. They can be found in [12].
2 Preliminaries
A predicate P maps elements from $\{\pm 1\}^k$ to {0, 1}. For notational convenience
we let input bits have value −1, denoting true, and 1, denoting false. If P accepts
an input y, then P(y) = 1, otherwise P(y) = 0. Thus, the set of accepting inputs
to P is denoted by $P^{-1}(1)$.
Max CSP(P) is the Max CSP problem where all constraints are evaluated
using a predicate $P : \{\pm 1\}^k \to \{0, 1\}$. More formally, an instance of Max
CSP(P) consists of m weighted constraints, each one acting over a k-tuple of
literals, $(z_{i1}, \dots, z_{ik})$, taken from the set $\{x_1, \dots, x_n, \bar{x}_1, \dots, \bar{x}_n\}$. All variables in
such a tuple are assumed to be distinct. A constraint is satisfied if P maps its
k-tuple of literals onto one. The objective is to maximize $\sum_{i=1}^{m} w_i P(z_{i1}, \dots, z_{ik})$,
where $w_i$ is the (non-negative) weight of the i-th constraint. The weight of
an input $(y_1, \dots, y_k)$ for an assignment a to an instance of Max CSP(P) is
$\sum_{i : (z_{i1}, \dots, z_{ik}) = y} w_i$, where each constraint tuple is evaluated on a. Two predicates,
P and P', are of the same type iff there is a permutation π on [k] and
$a \in \{\pm 1\}^k$ such that $P(x_1, \dots, x_k) = P'(a_1 x_{\pi(1)}, \dots, a_k x_{\pi(k)})$ for all $x \in \{\pm 1\}^k$.
A Max CSP(P) instance can be expressed as a Max CSP(P') instance by
permuting and applying a bitmask to the constraint tuples, so clearly they are
equivalent problems.
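For small arity, the same-type relation can be decided by trying every permutation and every sign mask. The following brute-force sketch is an illustration under the assumption that predicates are given as Python functions from a tuple in {−1, 1}^k to 0 or 1; the function name and the example predicates are hypothetical.

```python
from itertools import permutations, product


def same_type(P, Q, k):
    """True iff P(x) = Q(a_1 x_{pi(1)}, ..., a_k x_{pi(k)}) for some pi and a.

    Tries all k! permutations pi of the positions and all 2^k sign masks a,
    comparing the two predicates on every input in {-1, 1}^k.
    """
    inputs = list(product([-1, 1], repeat=k))
    for pi in permutations(range(k)):
        for a in product([-1, 1], repeat=k):
            if all(
                P(x) == Q(tuple(a[i] * x[pi[i]] for i in range(k))) for x in inputs
            ):
                return True
    return False


# Hypothetical arity-two predicates.
only_false_false = lambda x: int(x == (1, 1))   # accepts only (1, 1)
only_true_true = lambda x: int(x == (-1, -1))   # accepts only (-1, -1)
equal_bits = lambda x: int(x[0] == x[1])        # accepts two inputs

related = same_type(only_false_false, only_true_true, 2)
unrelated = same_type(only_false_false, equal_bits, 2)
```

The search is exponential in k, which is acceptable for the arities four and five discussed in this paper but not beyond small constants.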
The following definition formally describes an approximation resistant pred-
icate.
Definition 1. A predicate P is approximation resistant if, for every constant
$\epsilon > 0$, it is NP-hard to approximate Max CSP(P) within $2^{-k} |P^{-1}(1)| + \epsilon$.
A predicate can be seen as a sum of conjunctions, each conjunction corresponding
to an accepting input to P. If $x, y \in \{\pm 1\}^k$ then $\sum_{S \subseteq [k]} \prod_{i \in S} x_i y_i$
equals $2^k$ if x = y and otherwise the sum is zero. Thus, a conjunction
$\alpha_1 x_1 \wedge \dots \wedge \alpha_k x_k$, where $\alpha \in \{\pm 1\}^k$, can be arithmetized as
\[ \psi_\alpha(x_1, \dots, x_k) = 2^{-k} \sum_{S \subseteq [k]} \prod_{i \in S} \alpha_i x_i = \begin{cases} 1 & \text{if } \alpha = (x_1, \dots, x_k) \\ 0 & \text{otherwise.} \end{cases} \]
A predicate can thus be expressed as a multilinear expression
\[ P(x_1, \dots, x_k) = \sum_{\alpha \in P^{-1}(1)} \psi_\alpha(x_1, \dots, x_k) . \quad (2) \]
We let $P^{(i)}$ denote the sum of the i-degree terms of the multilinear expression
P, and $P^{(\ge i)}$ denotes the sum of the terms of at least degree i in P. Thus,
\[ P^{(i)}(x_1, \dots, x_k) = 2^{-k} \sum_{\alpha \in P^{-1}(1)} \ \sum_{S \subseteq [k], |S| = i} \ \prod_{j \in S} \alpha_j x_j . \]
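Expression (2) gives a direct recipe for the coefficients of the multilinear form: the coefficient of $\prod_{i \in S} x_i$ is $2^{-k} \sum_{\alpha \in P^{-1}(1)} \prod_{i \in S} \alpha_i$. The sketch below computes them for a small predicate; the function names, the dictionary representation and the choice of example predicate are assumptions of this illustration.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(accepting, k):
    """Coefficient of prod_{i in S} x_i in P, for every subset S of positions.

    accepting is the list of accepting inputs of P, each a tuple in {-1, 1}^k.
    Subsets are represented as sorted tuples of 0-based positions.
    """
    coeffs = {}
    for size in range(k + 1):
        for S in combinations(range(k), size):
            coeffs[S] = sum(prod(alpha[i] for i in S) for alpha in accepting) / 2 ** k
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the multilinear expression at a point x in {-1, 1}^k."""
    return sum(c * prod(x[i] for i in S) for S, c in coeffs.items())


# Example: the disjunction of three bits, with -1 denoting true,
# which accepts every input except (1, 1, 1).
all_inputs = list(product([-1, 1], repeat=3))
accepting = [y for y in all_inputs if y != (1, 1, 1)]
coeffs = fourier_coefficients(accepting, 3)
```

The constant coefficient `coeffs[()]` equals the fraction of accepting inputs, which is the expected value of the predicate under a random assignment, and `evaluate` reproduces the 0/1 value of the predicate on every input.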
3 General Technique
As mentioned in the introduction, the objective function of a Max CSP(P)
instance can be described as a multilinear expression, see (1). In this section we
show that if there is an assignment that gives a non-negligible positive value to
the linear or bi-linear terms of this expression, then we can do better than just
picking a random assignment.
Consider the linear terms of the objective value function, $I^{(1)}(x_1, \dots, x_n) = \sum_{i=1}^{n} c_{\{i\}} x_i$.
It is easy to see that the linear terms are maximized by assigning $x_i$
to $sgn(c_{\{i\}})$, giving the value $\sum_i |c_{\{i\}}|$. However, in order to control the expected
value of higher order terms, Algorithm Lin picks an assignment that is ε-biased
towards this assignment. The expected value of the terms of order i
decreases by a factor of $\epsilon^i$. By choosing a small enough bias we ensure that
the expected value of the non-constant terms is dominated by the linear terms.
The following theorem quantifies the performance of Algorithm Lin.

For each i := 1, ..., n do:
– with probability ε: assign $x_i := 1$ if $c_{\{i\}} \ge 0$, and $x_i := -1$ otherwise,
– or else assign $x_i$ according to an unbiased coin flip.
Fig. 1. Algorithm Lin - for approximating Max CSP(P).
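The sampling step of Fig. 1 translates almost line by line into code. This is a minimal sketch, assuming the linear coefficients $c_{\{i\}}$ have already been extracted from the instance and are passed in as a list; the function name and the optional `rng` parameter are hypothetical.

```python
import random


def algorithm_lin(c_linear, eps, rng=random):
    """Sample one assignment in {-1, 1}^n following Fig. 1.

    c_linear[i] is the linear coefficient of x_i. With probability eps the
    variable is set to 1 if the coefficient is non-negative and to -1
    otherwise; with the remaining probability it gets an unbiased coin flip.
    """
    assignment = []
    for c in c_linear:
        if rng.random() < eps:
            assignment.append(1 if c >= 0 else -1)
        else:
            assignment.append(rng.choice([-1, 1]))
    return assignment


# With eps = 1 every variable follows the sign of its linear coefficient.
fully_biased = algorithm_lin([0.3, -0.2, 0.0], 1.0)
```

The guarantee in the paper concerns the expected weight over the random choices, so a single sample may do worse than the bound; the sketch does not include any derandomization or repetition.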
Theorem 2. Let $I(x_1, \dots, x_n) = \sum_{S \subseteq [n], |S| \le k} c_S \prod_{i \in S} x_i$ be the objective value
function for an instance of Max CSP(P), $P : \{\pm 1\}^k \to \{0, 1\}$. Denote the
sum of all constraints' weights W. If $\sum_i |c_{\{i\}}| \ge \delta W$ for a constant $\delta > 0$, then
Algorithm Lin, with $\epsilon = \delta/(2k)$, produces a solution of expected weight at least
$c_\emptyset + \frac{\delta^2}{3k} W$.
The method for the linear terms can also be made to work for the bi-linear
terms. If there exists an assignment that makes the sum of the bi-linear terms
non-negligibly positive, then Algorithm BiLin outperforms a random assign-
ment.
Input: A Max CSP(P) instance I
1. Approximate the quadratic program that consists of the bi-linear terms, $I^{(2)}$,
using the Charikar-Wirth algorithm [3]. Let $a_1, \dots, a_n$ be the produced solution.
2. If $I^{(1)}(a_1, \dots, a_n) < 0$ then do: $a_i := -a_i$ for i := 1, ..., n.
3. Set $\alpha = I^{(2)}(a_1, \dots, a_n)/W$, where W is the total weight in I. Set
$\epsilon = \max(\alpha/2k^{3/2}, 0)$, where k is the arity of P.
4. For i := 1, ..., n do:
– With probability ε: assign $x_i := a_i$,
– or else assign $x_i$ according to an unbiased coin flip.
Fig. 2. Algorithm BiLin - for approximating Max CSP(P).
Theorem 3. Let $I(x_1, \dots, x_n) = \sum_{S \subseteq [n], |S| \le k} c_S \prod_{i \in S} x_i$ be the objective value
function for an instance of Max CSP(P), $P : \{\pm 1\}^k \to \{0, 1\}$. Denote the sum
of all constraints' weights W. If there exists, for a constant $\delta > 0$, an assignment
such that $\sum_{1 \le i < j \le n} c_{\{i,j\}} x_i x_j \ge \delta W$, then Algorithm BiLin produces a solution
of expected weight at least $c_\emptyset + \kappa W$, where $\kappa > 0$ depends only on the constants
δ and k.
From Theorems 2 and 3 we derive Lemma 4 and its generalization, Lemma 5.
These lemmas are used in order to show that predicates are not approximation
resistant. A crucial observation in their proofs is that in order to achieve a
better approximation algorithm than a random assignment, we only have to
outperform a random assignment on satisfiable or almost satisfiable instances.
Those instances are the ones where a random assignment achieves the worst
approximation ratio.

Lemma 4. Let $P : \{\pm 1\}^k \to \{0, 1\}$ be a predicate. If $P^{(1)}(y) + P^{(2)}(y) > 0$ for
all $y \in P^{-1}(1)$, then P is not approximation resistant.
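The hypothesis of Lemma 4 is a finite condition and can be checked mechanically for a given predicate. The sketch below evaluates $P^{(1)}$ and $P^{(2)}$ straight from their definition in Section 2; the function names are hypothetical, and the two example predicates are chosen only to show one case where the condition holds and one where it fails.

```python
from itertools import combinations, product
from math import prod


def degree_part(accepting, k, y, degree):
    """P^{(degree)}(y): the sum of the terms of the given degree, evaluated at y.

    accepting is the list of accepting inputs of P, each a tuple in {-1, 1}^k.
    """
    return sum(
        prod(alpha[i] * y[i] for i in S)
        for alpha in accepting
        for S in combinations(range(k), degree)
    ) / 2 ** k


def satisfies_lemma4(accepting, k):
    """True iff P^{(1)}(y) + P^{(2)}(y) > 0 for every accepting input y."""
    return all(
        degree_part(accepting, k, y, 1) + degree_part(accepting, k, y, 2) > 0
        for y in accepting
    )


# A predicate on three bits with a single accepting input.
single = [(-1, -1, -1)]

# Parity on three bits: accept exactly the inputs whose product is -1.
parity = [y for y in product([-1, 1], repeat=3) if prod(y) == -1]

single_ok = satisfies_lemma4(single, 3)
parity_ok = satisfies_lemma4(parity, 3)
```

For the parity predicate all linear and bi-linear terms vanish, so the condition fails, which is consistent with the approximation resistance of parity predicates on three inputs cited in the introduction. A failed check does not by itself prove resistance, since Lemma 4 is only a sufficient condition.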
Lemma 5. Let $P : \{\pm 1\}^k \to \{0, 1\}$ be a predicate. An instance of Max CSP(P)
has the objective value function $I(x_1, \dots, x_n) = \sum_{i=1}^{m} w_i P(z_{i1}, \dots, z_{ik})$. For an
assignment to $x_1, \dots, x_n$, define the weight on each input y, $u_y = \sum_{i : (z_{i1}, \dots, z_{ik}) = y} w_i$.
For constants C and $\delta > 0$, there exist constants $\epsilon > 0$ and $\kappa > 0$ such that if
$\sum_{y \in P^{-1}(1)} u_y (C \cdot P^{(1)}(y) + P^{(2)}(y)) \ge \delta W$, where $W = \sum_{i=1}^{m} w_i$, then either the
value of the assignment is less than $(1 - \epsilon) W$, or Algorithm Lin or Algorithm
BiLin will achieve an approximation ratio of at least $2^{-k} |P^{-1}(1)| + \kappa$.
By using a gadget construction to Max 2SAT and applying Zwick’s algo-
rithm for almost satisfiable Max 2SAT instances, we show the following theo-
rem.
Theorem 6. Let $P : \{\pm 1\}^k \to \{0, 1\}$ be a predicate. If for some constant C,
$C \cdot P^{(1)}(y) + P^{(2)}(y)$ is zero for at most two $y \in P^{-1}(1)$ and strictly positive for
all other $y \in P^{-1}(1)$, then P is not approximation resistant.
4 Predicates on k Variables
If a predicate has few accepting inputs, then we can show by using Lemma 4
that the predicate is not approximation resistant.
Theorem 7. Let $P : \{\pm 1\}^k \to \{0, 1\}$, $k \ge 3$. If P has at most $2\lfloor k/2 \rfloor + 1$
accepting inputs, then P is not approximation resistant.
Remark 8. Engebretsen and Holmerin [5] designed an approximation resistant
predicate $P : \{\pm 1\}^6 \to \{0, 1\}$ with only eight accepting inputs. The above
theorem is thus tight for the case k = 6 as it stipulates that if a predicate has
at most seven accepting inputs, then it is not approximation resistant.
In the remaining part of this section we describe sufficient conditions on
predicates to be approximation resistant. The following theorem shows that all
predicates implied by a Samorodnitsky-Trevisan predicate are approximation
resistant. It turns out that a large part of the proof is essentially a reiteration
of the proof of Håstad and Wigderson [15], which is a simpler proof of the
Samorodnitsky-Trevisan PCP than the original one.
Theorem 9. Let P : {±1}^{s+t+st} → {0, 1} be a predicate that is implied by
P_{ST_{s,t}}, i.e., P_{ST_{s,t}}^{-1}(1) ⊆ P^{-1}(1). Then P is approximation resistant.
If a predicate P has enough accepting inputs, then there is at least one
predicate of the same type as P, such that it is implied by a Samorodnitsky-
Trevisan predicate. Using Theorem 9, we then conclude that P is approximation
resistant.
Theorem 10. Let k ≥ s + t + st and P : {±1}^k → {0, 1} be a predicate with at
most 2^{st} − 1 non-accepting inputs, then P is approximation resistant.

142 Gustav Hast
Now we consider predicates of arity five. Theorem 10, with s = 1 and t = 2,
lets us conclude that if |P^{-1}(0)| ≤ 3 then P is approximation resistant. By
taking more care in the analysis we are able to show the following theorem.
Theorem 11. Let P be a predicate on 5 binary variables with at most 5 non-
accepting inputs, then P is approximation resistant.
Remark 12. The next section shows that there are predicates of arity four, hav-
ing exactly three non-accepting inputs, which are not approximation resistant.
This implies that there are also predicates of arity five, having exactly six non-
accepting inputs, which are not approximation resistant. Thus, Theorem 11 is
optimal in this sense.
5 Predicates on Four Variables
In this section we take a closer look at predicates of arity four. Theorem 7
stipulates that if a predicate P : {±1}^4 → {0, 1} has at most five accepting
inputs then it is not approximation resistant. That was already known as a
result of the Max 4CSP 0.33-approximation algorithm of Guruswami et al.
[11], because 5/2^4 < 0.33. By applying Theorem 6 we show that no predicate
that accepts exactly six inputs is approximation resistant. We also show many
predicates on four variables to be approximation resistant by extending methods
from [13]. Out of exactly 400 different non-trivial predicate types of arity four
we show that 79 are approximation resistant and 275 are not approximation
resistant. For every number of accepting inputs, the number of predicate types
that have been characterized as non-trivially approximable and approximation
resistant, respectively, are shown in Table 1. For a detailed description on how
these results were obtained, see [12].
Table 1. Approximability of Max CSP(P) for predicates of arity four.
# accepting inputs         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  all
# non-trivially approx     1   4   6  19  27  50  50  52  27  26   9   3   1   0   0  275
# approx resistant         0   0   0   0   0   0   0  16   6  22  11  15   4   4   1   79
# unknown                  0   0   0   0   0   0   6   6  23   2   7   1   1   0   0   46
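The row totals of Table 1 and the count of 400 predicate types quoted in the text can be cross-checked with a few lines of arithmetic. The snippet below is only a consistency check on the numbers printed in the table; the variable names are illustrative and do not come from the paper.

```python
# Rows of Table 1, indexed by the number of accepting inputs 1..15.
non_trivially_approx = [1, 4, 6, 19, 27, 50, 50, 52, 27, 26, 9, 3, 1, 0, 0]
approx_resistant = [0, 0, 0, 0, 0, 0, 0, 16, 6, 22, 11, 15, 4, 4, 1]
unknown = [0, 0, 0, 0, 0, 0, 6, 6, 23, 2, 7, 1, 1, 0, 0]

# Row totals, to be compared with the "all" column.
totals = (sum(non_trivially_approx), sum(approx_resistant), sum(unknown))
print(totals, sum(totals))  # (275, 79, 46) 400

# Number of predicate types for each number of accepting inputs.
types_per_count = [
    a + r + u for a, r, u in zip(non_trivially_approx, approx_resistant, unknown)
]
print(types_per_count)
```

The three totals agree with the "all" column, and together they account for all 400 non-trivial predicate types mentioned above.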
6 Approximation Resistance Is Non-monotone
There is one predicate of arity four that we know to be approximation resistant
but is not implied by parity,
GLST(x1, x2, x3, x4) = (x1 ∧ (x2 ≡ x3)) ∨ (¬x1 ∧ (x2 ≡ x4)) . (3)
It was shown to be approximation resistant in [11]. Zwick [22] showed that the
not all equal predicate of arity three, 3NAE(x1, x2, x3) = (x1 ⊕ x2) ∨ (x1 ⊕ x3),
is not approximation resistant. For all approximation resistant predicates P
of arity three, it is true that predicates implied by P are approximation
resistant as well. This is true for all Samorodnitsky-Trevisan predicates as well.
A natural guess could be that it is true in general. However, the observation
that GLST(x1, x2, x3, x4) implies 3NAE(x2, x3, x4) gives the following theorem
which falsifies that guess.
Theorem 13. It is not generally true that if P is approximation resistant and
P′ is implied by P, then P′ is approximation resistant as well.
7 Characterization of Hard Instances
A hard Max CSP instance is an instance that is satisfiable or almost satisfiable
and for which it is NP-hard to approximate it significantly better than picking
a random assignment. In this section we characterize optimal and near-optimal
solutions to such instances by bounding the weight of particular inputs for such
solutions. Our main tool for this task is Lemma 5.
Zwick characterized all predicates of arity three [22]. From this work we know
that approximation resistant predicates of arity three are of the same type as
either 3XOR which accepts if and only if the parity of the input bits is odd,
3NTW which accepts if and only if not exactly two of the input bits are true,
3OXR(x1, x2, x3) = x1 ∨ (x2 ⊕ x3), or 3OR(x1, x2, x3) = x1 ∨ x2 ∨ x3.
Theorem 14. Let P be 3NTW, 3OXR or 3OR, and let I be an instance of
Max CSP(P) of total weight W. For every γ > 0, there exist ε > 0 and κ > 0
such that if there exists a (1 − ε)-satisfying assignment with at least weight γW
on even parity inputs, then I can be (s · 2^{-3} + κ)-approximated in probabilistic
polynomial time, where s is the number of accepting inputs to P.
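The quantity s · 2^{-3} in Theorem 14 is the fraction of constraints that a uniformly random assignment satisfies in expectation. As a small illustration, the sketch below enumerates the accepting inputs of the four arity-three predicate types named above. It uses 0/1 truth values rather than the ±1 encoding, and the function names are illustrative only.

```python
from itertools import product

# Truth-table versions of the four arity-three predicate types named above.
def xor3(x1, x2, x3):
    return (x1 + x2 + x3) % 2 == 1      # accepts iff the parity is odd

def ntw3(x1, x2, x3):
    return (x1 + x2 + x3) != 2          # accepts iff not exactly two bits are true

def oxr3(x1, x2, x3):
    return bool(x1 or (x2 != x3))       # x1 OR (x2 XOR x3)

def or3(x1, x2, x3):
    return bool(x1 or x2 or x3)

def accepting_inputs(pred, k=3):
    """All 0/1 inputs of length k accepted by pred."""
    return [bits for bits in product((0, 1), repeat=k) if pred(*bits)]

for name, pred in [("3XOR", xor3), ("3NTW", ntw3), ("3OXR", oxr3), ("3OR", or3)]:
    s = len(accepting_inputs(pred))
    # s / 2^3 is the expected fraction satisfied by a random assignment.
    print(name, s, s / 8)
```

The counts are 4, 5, 6 and 7, so the random-assignment baselines are 1/2, 5/8, 3/4 and 7/8 respectively.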
Remark 15. In [7], Feige studied the problem of distinguishing between random
Max CSP(3OR) instances and almost satisfiable instances. He shows that the
only type of almost satisfiable instances that cannot be distinguished from a ran-
dom instance is if an optimal assignment makes almost all constraints have either
one or three literals true. For all other almost satisfiable instances a certificate
of non-randomness can be acquired. By applying Algorithms Lin and BiLin, it
can be shown that the certificate implies that we can find an assignment that
beats a random assignment. From this Theorem 14 follows.
Using the same method on the GLST predicate, we show that optimal so-
lutions to hard instances of Max CSP(GLST) have almost all weight on only
four of the eight accepting inputs to GLST.
Theorem 16. Let I be an instance of Max CSP(GLST). For every γ > 0,
there exist ε > 0 and κ > 0 such that if there exists a (1 − ε)-satisfying assignment
with total weight γ · w_tot(I) on inputs
{(1, 1, 1, 1), (1, −1, −1, −1), (−1, 1, 1, 1), (−1, −1, −1, −1)},
then I can be (1/2 + κ)-approximated in probabilistic polynomial time.
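The counts used around Theorem 16 can be reproduced by brute force. The sketch below assumes that the second disjunct of definition (3) uses the negation of x1, which is the reading under which GLST has eight accepting inputs, as the text states. Booleans stand in for the ±1 encoding, and the names are illustrative.

```python
from itertools import product

def glst(x1, x2, x3, x4):
    # Assumed reading of (3): (x1 AND x2 == x3) OR (NOT x1 AND x2 == x4).
    return (x1 and x2 == x3) or ((not x1) and x2 == x4)

accepting = [x for x in product((False, True), repeat=4) if glst(*x)]

# Accepting inputs with x2 = x3 = x4; these correspond to the four inputs
# listed in Theorem 16.
all_equal_tail = [x for x in accepting if x[1] == x[2] == x[3]]

print(len(accepting), len(all_equal_tail), len(accepting) / 16)  # 8 4 0.5
```

Under this reading a random assignment satisfies half of the constraints in expectation, which matches the 1/2 baseline in the (1/2 + κ) guarantee.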

Acknowledgments
I am very grateful to Johan Håstad for much appreciated help. I also thank a
referee for bringing [7] to my attention.
References
1. Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy.
Proof verification and hardness of approximation problems. Journal of the ACM,
45(3):501–555, 1998. Preliminary version appeared in FOCS 1992.
2. Mihir Bellare, Oded Goldreich, and Madhu Sudan. Free bits, PCPs, and nonap-
proximability - towards tight results. SIAM Journal on Computing, 27(3):804–915,
1998.
3. Moses Charikar and Anthony Wirth. Maximizing quadratic programs: Extending
Grothendieck’s inequality. In Proceedings of the 45th Annual IEEE Symposium on
Foundations of Computer Science, pages 54–60, 2004.
4. Nadia Creignou. A dichotomy theorem for maximum generalized satisfiability
problems. Journal of Computer and System Sciences, 51(3):511–522, 1995.
5. Lars Engebretsen and Jonas Holmerin. More efficient queries in PCPs for NP and
improved approximation hardness of maximum CSP. In Proceedings of STACS
2005, Lecture Notes in Computer Science 3404, pages 194–205, 2005.
6. Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM,
45(4):634–652, 1998.
7. Uriel Feige. Relations between average case complexity and approximation com-
plexity. In Proceedings of the 34th Annual ACM Symposium on Theory of Com-
puting, pages 534–543, 2002.
8. Uriel Feige, Shafi Goldwasser, László Lovász, Shmuel Safra, and Mario Szegedy.
Interactive proofs and the hardness of approximating cliques. Journal of the ACM,
43(2):268–292, 1996. Preliminary version appeared in FOCS 1991.
9. Uriel Feige and Michael Langberg. The RPR^2 rounding technique for semidefinite
programs. In Proceedings of ICALP, pages 213–224, 2001.
10. Michel X. Goemans and David P. Williamson. Improved Approximation Algo-
rithms for Maximum Cut and Satisfiability Problems Using Semidefinite Program-
ming. Journal of the ACM, 42:1115–1145, 1995.
11. Venkatesan Guruswami, Daniel Lewin, Madhu Sudan, and Luca Trevisan. A tight
characterization of NP with 3 query PCPs. In Proceedings of the 39th Annual IEEE
Symposium on Foundations of Computer Science, pages 8–17, 1998.
12. Gustav Hast. Beating a Random Assignment. PhD thesis, Royal Institute of Tech-
nology, 2005.
13. Johan Håstad. Some optimal inapproximability results. Journal of the ACM,
48(4):798–859, 2001. Preliminary version appeared in STOC 1997.
14. Johan Håstad. Every 2-CSP allows nontrivial approximation. In Proceedings of the
37th Annual ACM Symposium on Theory of Computing, pages 740–746, 2005.
15. Johan Håstad and Avi Wigderson. Simple analysis of graph tests for linearity and
PCP. Random Structures and Algorithms, 22(2):139–160, 2003.
16. Howard Karloff and Uri Zwick. A 7/8-approximation algorithm for MAX 3SAT?
In Proceedings of the 38th Annual IEEE Symposium on Foundations of Computer
Science, pages 406–415, 1997.

17. Sanjeev Khanna, Madhu Sudan, Luca Trevisan, and David P. Williamson. The
approximability of constraint satisfaction problems. SIAM Journal on Computing,
30(6):1863–1920, 2000.
18. Michael Lewin, Dror Livnat, and Uri Zwick. Improved rounding techniques for the
MAX 2-SAT and MAX DI-CUT problems. In Proceedings of 9th IPCO, Lecture
Notes in Computer Science 2337, pages 67–82, 2002.
19. Carsten Lund and Mihalis Yannakakis. On the hardness of approximating mini-
mization problems. Journal of the ACM, 41(5):960–981, 1994. Preliminary version
appeared in STOC 1993.
20. Christos H. Papadimitriou and Mihalis Yannakakis. Optimization, approximation,
and complexity classes. Journal of Computer and System Sciences, 43(3):425–440,
1991.
21. Alex Samorodnitsky and Luca Trevisan. A PCP characterization of NP with op-
timal amortized query complexity. In Proceedings of the 32nd Annual ACM Sym-
posium on Theory of Computing, pages 191–199, 2000.
22. Uri Zwick. Approximation algorithms for constraint satisfaction problems involving
at most three variables per constraint. In Proceedings of the 9th Annual ACM-
SIAM Symposium on Discrete Algorithms, pages 201–210, 1998.
23. Uri Zwick. Outward rotations: a tool for rounding solutions of semidefinite
programming relaxations, with applications to MAX CUT and other problems. In
Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages
679–687, 1999.

Scheduling on Unrelated Machines
Under Tree-Like Precedence Constraints
V.S. Anil Kumar¹, Madhav V. Marathe², Srinivasan Parthasarathy³, and Aravind Srinivasan³
¹ Basic and Applied Simulation Science (CCS-DSS)
Los Alamos National Laboratory, MS M997
P.O. Box 1663, Los Alamos, NM 87545
anil@lanl.gov
² Virginia Bio-informatics Institute
and Department of Computer Science, Virginia Tech, Blacksburg 24061
mmarathe@vbi.vt.edu
³ Department of Computer Science, University of Maryland
College Park, MD 20742
{sri,srin}@cs.umd.edu
Abstract. We present polylogarithmic approximations for the
R|prec|Cmax and R|prec|Σ_j w_j C_j problems, when the precedence
constraints are “treelike”, i.e., when the undirected graph underlying the
precedences is a forest. We also obtain improved bounds for the weighted
completion time and flow time for the case of chains with restricted as-
signment – this generalizes the job shop problem to these objective func-
tions. We use the same lower bound of “congestion+dilation”, as in other
job shop scheduling approaches. The first step in our algorithm for the
R|prec|Cmax problem with treelike precedences involves using the algo-
rithm of Lenstra, Shmoys and Tardos to obtain a processor assignment
with the congestion + dilation value within a constant factor of the op-
timal. We then show how to generalize the random delays technique of
Leighton, Maggs and Rao to the case of trees. For the weighted comple-
tion time, we show a certain type of reduction to the makespan problem,
which dovetails well with the lower bound we employ for the makespan
problem. For the special case of chains, we show a dependent rounding
technique which leads to improved bounds on the weighted completion
time and new bicriteria bounds for the flow time.
1 Introduction
The most general scheduling problem involves unrelated parallel machines and
precedence constraints, i.e., we are given: (i) a set of n jobs with precedence
constraints that induce a partial order on the jobs; (ii) a set of m machines, each
of which can process at most one job at any time, and (iii) an arbitrary set of

Research supported by the Department of Energy under Contract W-7405-ENG-36.

Research supported in part by NSF Award CCR-0208005 and NSF ITR Award
CNS-0426683.
integer values {p_{i,j}}, where p_{i,j} denotes the time to process job j on machine i.
Let C_j denote the completion time of job j. Subject to the above constraints,
two commonly studied versions are (i) minimize the makespan, or the maximum
time any job takes, i.e., max_j{C_j}; this is denoted by R|prec|Cmax, and (ii)
minimize the weighted completion time; this is denoted by R|prec|Σ_j w_j C_j.
Numerous other variants, involving release dates or other objectives have been
studied (see e.g. Hall [7]).
Almost optimal upper and lower bounds are known for the versions of the
above problems without precedence constraints (i.e., the R||Cmax and
R||Σ_j w_j C_j problems) [3, 13, 23], but very little is known in the presence of
precedence constraints. The only case of the general R|prec|Cmax problem for
which non-trivial approximations are known is the case where the precedence
constraints are chains – this is the job shop scheduling problem (see Shmoys et
al. [22]), which itself has a long history. The first result for job shop scheduling
was the breakthrough work of Leighton et al. [14, 15] for packet scheduling,
which implied an O(log n) approximation for the case of unit processing costs.
Leighton et al. [14, 15] introduced the “random delays” technique, and almost
all the results on the job shop scheduling problem [5, 6, 22] are based on vari-
ants of this technique. Shmoys et al. [22] also generalize job-shop scheduling to
DAG-shop scheduling, where the operations of each job form a DAG, instead of
a chain, with the additional constraint that the operations within a job can be
done only one at a time.
The only results known for the case of arbitrary number of processors with
more general precedence constraints are for identical parallel machines (de-
noted by P|prec|Cmax; Hall [7]), or for related parallel machines (denoted by
Q|prec|Cmax) [2, 4]. The weighted completion time objective has also been stud-
ied for these variants [3, 8]. When the number of machines is constant, polyno-
mial time approximation schemes are known [9, 11]. Note that all of the above
discussion relates to non-preemptive schedules, i.e., once the processing of a job
is started, it cannot be stopped until it is completely processed; preemptive vari-
ants of these problems have also been well studied (see e.g. Schulz and Skutella
[20]).
Far less is known for the weighted completion time objective in the same
setting, instead of the makespan. The known approximations are either for the
case of no precedence constraints [23], or for precedence constraints with identi-
cal/related processors [8, 19]. To the best of our knowledge, no non-trivial bound
is known on the weighted completion time on unrelated machines, in the presence
of precedence constraints of any kind.
Here, motivated by applications such as evaluating large expression-trees and
tree-shaped parallel processes, we consider the special case of the R|prec|Cmax
and R|prec|Σ_j w_j C_j problems, where the precedences form a forest, i.e., the
undirected graph underlying the precedences is a forest. Thus, this naturally
generalizes the job shop scheduling problem, where the precedence constraints
form a collection of disjoint directed chains.

Summary of Results. We present the first polynomial time approximation
algorithms for the R|prec|Cmax and R|prec|Σ_j w_j C_j problems, under “treelike”
precedences. As mentioned earlier, these are the first non-trivial generalizations
of the job shop scheduling problems to precedence constraints which are not
chains. Since most of our results hold in the cases where the precedences form
a forest (i.e., the undirected graph underlying the DAG is a forest), we will
denote the problems by R|forest|Cmax and R|forest|Σ_j w_j C_j, respectively,
to simplify the description; this generalizes the notation used by Jansen and
Solis-Oba [10] for the case of chains.
1. The R|forest|Cmax Problem. We obtain a polylogarithmic approximation
for this problem in Section 2. We employ the same lower bound used in [5, 6,
14, 22]: LB := max{Pmax, Πmax}, where Pmax is the maximum processing time
along any directed path and Πmax is the maximum processing time needed by
any machine, for a fixed assignment of jobs to machines. Let pmax = max_{i,j} p_{i,j}
be the maximum processing time of any job on any machine. We obtain an
O((log^2 n / log log n) · ⌈log min(pmax, n) / log log n⌉) approximation to the
R|forest|Cmax problem. When the forests are out-trees or in-trees, we show that
this factor can be improved to O(log n · ⌈log(min{pmax, n}) / log log n⌉); for the
special case of unit processing times, this actually becomes O(log n). We also
show that the lower bound LB cannot be put to much better use: even in the case
of trees with unit processing costs, we show instances whose optimal schedule is
Ω(LB · log n).
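To make the lower bound concrete, the following minimal sketch computes Pmax, Πmax and LB for a fixed assignment of jobs to machines. The dictionary-based input format and the function name are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict

def lower_bound(machine, ptime, succ):
    """Return (LB, P_max, Pi_max) for a fixed assignment of jobs to machines.

    machine: job -> machine it is assigned to
    ptime:   job -> processing time on that machine
    succ:    job -> list of immediate successors in the precedence DAG
    """
    # Pi_max: largest total processing time placed on a single machine.
    load = defaultdict(int)
    for v, i in machine.items():
        load[i] += ptime[v]
    pi_max = max(load.values())

    # P_max: largest total processing time along a directed path (memoized DFS).
    longest = {}

    def path_from(v):
        if v not in longest:
            tail = max((path_from(w) for w in succ.get(v, ())), default=0)
            longest[v] = ptime[v] + tail
        return longest[v]

    p_max = max(path_from(v) for v in machine)
    return max(p_max, pi_max), p_max, pi_max

# Small out-tree: 0 -> 1, 1 -> 2, 1 -> 3.
succ = {0: [1], 1: [2, 3]}
ptime = {0: 2, 1: 1, 2: 3, 3: 2}
machine = {0: "A", 1: "B", 2: "A", 3: "A"}
print(lower_bound(machine, ptime, succ))  # (7, 6, 7)
```

In this example the longest path 0, 1, 2 has length 6, while machine A carries a load of 7, so the congestion term determines LB.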
Our algorithm for solving R|forest|Cmax follows the overall approach used to
solve the job shop scheduling problem (see, e.g. Shmoys et al. [22]) and involves
two steps: (1) We show how to compute a processor assignment within a
((3 + √5)/2)-factor of LB, by extending the approach of Lenstra et al. [13], and (2) we
design a poly-logarithmic approximation algorithm for the resulting variant of
the R|prec|Cmax problem with pre-specified processor assignment, and forest
shaped precedences.
2. The R|forest|Σ_j w_j C_j Problem. In Section 3, we show a reduction from
R|prec|Σ_j w_j C_j to R|prec|Cmax of the following form: if there is a schedule of
makespan (Pmax + Πmax) · ρ for the latter, then there is an O(ρ)-approximation
algorithm for the former. We exploit this, along with the fact that our approximation
guarantee for R|forest|Cmax is of the form “(Pmax + Πmax) times polylog”,
to get a polylogarithmic approximation for the R|forest|Σ_j w_j C_j problem. Our
reduction is similar in spirit to that of Queyranne and Sviridenko [19]: both
reductions employ geometric time windows and linear constraints on completion
times for bounding congestion and dilation. However, the reduction in [19] is
meant for identical parallel machines while our reduction works for unrelated
machines. Further, [19] works for job-shop and dag-shop problems under the
assumption that no two operations from the same DAG can be executed concur-
rently, although no precedence relation might exist between the two operations;
in contrast, we do not impose this restriction and allow concurrent processing
subject to precedence and assignment constraints being satisfied.

3. Minimizing Weighted Completion Time and Flow Time on Chains.
For a variant of the R|forest|Σ_j w_j C_j problem where (i) the forest is a
collection of chains (i.e., the weighted completion time variant of the job shop
scheduling problem), and (ii) for each machine i and operation v, p_{i,v} ∈ {p_v, ∞}
(i.e., the restricted-assignment variant), we show a better approximation of
O(log n / log log n) to the weighted completion time in Section 4. Our result
ensures that (i) the precedence constraints are satisfied with probability 1, and (ii)
for any (v, t), the probability of scheduling v at time t equals its fractional (LP)
value x_{v,t}. This result also leads to a bicriteria (1 + o(1))-approximation for
the weighted flow time variant of this problem, using O(log n / log log n) copies of
each machine.
Due to space limitations, several proofs and algorithm details are omitted
here and are deferred to the full version of this paper.
2 The R|forest|Cmax Problem
Consider a (fractional) assignment x of jobs to machines, where x_{i,j} is the
fraction of job j assigned to machine i. In the description below, we will use the
terms “node” and “job” interchangeably; we will not use the term “operation”
to refer to nodes of a DAG, because we do not have the job shop or dag shop
constraints that at most one node in a DAG can be processed at a time. As
before, Pmax denotes the maximum processing time along any directed path,
i.e., Pmax = max_{path P} {Σ_{j∈P} Σ_i x_{i,j} p_{i,j}}. Also, Πmax denotes the maximum
load on any machine, i.e., Πmax = max_i {Σ_j x_{i,j} p_{i,j}}. Our algorithm for the
R|forest|Cmax problem involves the following two steps:
Step 1: We first construct a processor assignment for which the value of
max{Pmax, Πmax} is within a constant factor ((3 + √5)/2) of the smallest
possible. This is described in Section 2.1.
Step 2: Solve the GDSS problem we get from the previous step to get a schedule
of length polylogarithmically more than max{Pmax, Πmax}. This is described in
Section 2.2.
2.1 Step 1: A Processor Assignment Within a Constant Factor
of max{Pmax, Πmax}
We now describe the algorithm for processor assignment, using some of the
ideas from Lenstra et al. [13]. Let T be our “guess” for the optimal value of
LB = max{Pmax, Πmax}. Define S_T = {(i, j) | p_ij ≤ T}. Let J and M denote the
set of jobs and machines, respectively. We now define a family of linear programs
LP(T), one for each value of T ∈ Z^+, as follows:
(A1) ∀j ∈ J: Σ_i x_ij = 1,
(A2) ∀i ∈ M: Σ_j x_ij p_ij ≤ T,
(A3) ∀j ∈ J: z_j = Σ_i p_ij x_ij,
(A4) ∀(j′ ≺ j): c_j ≥ c_j′ + z_j,
(A5) ∀j ∈ J: c_j ≤ T.
The constraints (A1) ensure that each job is assigned
a machine, constraints (A2) ensure that the maximum fractional load on any
machine (Πmax) is at most T. Constraints (A3) define the fractional processing
time z_j for a job j and (A4) capture the precedence constraints amongst jobs
(c_j denotes the fractional completion time of job j). We note that max_j c_j is
the fractional Pmax. Constraints (A5) state that the fractional Pmax value is at
most T.
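As an illustration of what LP(T) asks for, the sketch below checks whether a given candidate solution satisfies constraints (A1) to (A5). It is a feasibility checker, not an LP solver. It tests only the constraints listed above, and the list-based data layout and function name are assumptions made for this example.

```python
def satisfies_lp(T, p, x, c, prec, tol=1e-9):
    """Check constraints (A1)-(A5) for a candidate fractional solution.

    p[i][j]: processing time of job j on machine i
    x[i][j]: fraction of job j assigned to machine i
    c[j]:    fractional completion time of job j
    prec:    list of pairs (j_prev, j) meaning that j_prev precedes j
    """
    m, n = len(p), len(p[0])
    # (A3): the fractional processing time z_j is determined by x.
    z = [sum(p[i][j] * x[i][j] for i in range(m)) for j in range(n)]
    # (A1): every job is fully assigned.
    a1 = all(abs(sum(x[i][j] for i in range(m)) - 1) <= tol for j in range(n))
    # (A2): the fractional load of every machine is at most T.
    a2 = all(sum(x[i][j] * p[i][j] for j in range(n)) <= T + tol for i in range(m))
    # (A4): a job completes at least z_j after each of its predecessors.
    a4 = all(c[j] >= c[jp] + z[j] - tol for jp, j in prec)
    # (A5): every completion time is at most T.
    a5 = all(c[j] <= T + tol for j in range(n))
    return a1 and a2 and a4 and a5

# Two machines, two jobs, job 0 precedes job 1.
p = [[2, 4], [4, 2]]
x = [[1, 0], [0, 1]]          # job 0 on machine 0, job 1 on machine 1
c = [2, 4]
print(satisfies_lp(4, p, x, c, [(0, 1)]))  # True
print(satisfies_lp(3, p, x, c, [(0, 1)]))  # False: c[1] = 4 exceeds T = 3
```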
Let T* be the smallest value of T for which LP(T) has a feasible solution. It
is easy to see that T* is a lower bound on LB. We now present a rounding scheme
which rounds a feasible fractional solution of LP(T*) to an integral solution. Let
X_ij denote the indicator variable which denotes if job j was assigned to machine
i in the integral solution, and let C_j be the integer analog of c_j and z_j. We first
modify the x_ij values using filtering (Lin and Vitter [16]). Let μ = (3 + √5)/2. For
any (i, j), if p_ij > μ z_j, then set x_ij to zero. This step could result in a situation
where, for a job j, the fractional assignment Σ_i x_ij drops to a value r such
that r ∈ [1 − 1/μ, 1). So, we scale the (modified) values of x_ij by a factor of at
most γ = μ/(μ − 1). Let A denote this fractional solution. Crucially, we note that
any rounding of A, which ensures that only non-zero variables in A are set to
non-zero values in the integral solution, has an integral Pmax value which is at
most μT*. This follows from the fact that if X_ij = 1 in the rounded solution,
then p_ij ≤ μ z_j. Hence, it is easy to see that by induction, for any job j, C_j is at
most μ c_j ≤ μT*.
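The filtering and rescaling step can be sketched as follows. This is an illustrative reading of the description above, with an assumed list-based layout for p and x. It covers only the filtering step, and not the subsequent rounding of Lenstra et al. [13].

```python
import math

MU = (3 + math.sqrt(5)) / 2   # the value of mu chosen in the text

def filter_and_rescale(p, x, mu=MU):
    """Zero out x[i][j] whenever p[i][j] > mu * z_j, then rescale each job
    so that its fractional assignment sums to 1 again."""
    m, n = len(p), len(p[0])
    y = [row[:] for row in x]
    for j in range(n):
        z_j = sum(p[i][j] * x[i][j] for i in range(m))
        for i in range(m):
            if p[i][j] > mu * z_j:
                y[i][j] = 0.0
        # By Markov's inequality, less than a 1/mu fraction is removed,
        # so r > 0 and the scaling factor 1/r is at most mu / (mu - 1).
        r = sum(y[i][j] for i in range(m))
        for i in range(m):
            y[i][j] /= r
    return y

# One job, three machines: z = 0.5*1 + 0.4*2 + 0.1*10 = 2.3, and mu * z is about 6.02,
# so the assignment to the machine with processing time 10 is filtered out.
p = [[1], [2], [10]]
x = [[0.5], [0.4], [0.1]]
y = filter_and_rescale(p, x)
print([round(row[0], 4) for row in y])
```

After filtering, every remaining assignment of a job uses a machine whose processing time is at most μ times the job's fractional processing time, which is the property the argument above relies on.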
We now show how to round A. Recall that Lenstra et al. [13] present a
rounding algorithm for unrelated parallel machines scheduling without precedence
constraints with the following guarantee: if the input fractional solution
has a fractional Πmax value of x, then the output integral solution has an
integral Πmax value of at most x + max_{x_ij > 0} p_ij. We use A as the input instance for
the rounding algorithm of Lenstra et al. [13]. Note that A has a fractional Πmax
value of at most γT*. Further, max_{x_ij > 0} p_ij ≤ T*. This results in an integral
solution I whose Pmax value is at most μT*, and whose Πmax value is at most
(γ + 1)T*. Observe that setting μ = (3 + √5)/2 results in μ = γ + 1. Finally, we note
that the optimal value of T can be arrived at by a bisection search in the range
[0, n·pmax], where n = |J| and pmax = max_{i,j} p_ij. Since T* is a lower bound on
LB, we have the following result.
Theorem 1. The above algorithm computes a processor assignment for each job
such that the value of max{Pmax, Πmax} for the resulting assignment is within a
((3 + √5)/2)-factor of the optimal.
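The constant in Theorem 1 comes from balancing the two bounds, μT* on Pmax and (γ + 1)T* on Πmax, with γ = μ/(μ − 1). The short derivation below uses only these relations and is included as a worked check.

```latex
\begin{align*}
\mu = \gamma + 1 = \frac{\mu}{\mu - 1} + 1 = \frac{2\mu - 1}{\mu - 1}
  &\iff \mu(\mu - 1) = 2\mu - 1 \\
  &\iff \mu^2 - 3\mu + 1 = 0 \\
  &\iff \mu = \frac{3 \pm \sqrt{5}}{2}.
\end{align*}
```

Since the scaling factor γ = μ/(μ − 1) must be positive, μ > 1 is required, which selects the larger root μ = (3 + √5)/2 ≈ 2.618 and gives γ = (1 + √5)/2 ≈ 1.618. Both Pmax and Πmax are then at most μT*.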
2.2 Step 2: Solving the GDSS Problem Under Treelike Precedences
We first consider the case when the precedences are a collection of directed in-
trees or out-trees. We then extend this to the case where the precedences form
an arbitrary forest (i.e., the underlying undirected graph is a forest). Since the
processor assignment is already specified in the GDSS problem, we will use the
notation m(v) to denote the machine to which node v is assigned. Also, since
the machine is already fixed, the processing time for node v is also fixed, and is
denoted by pv.

GDSS on Out-/In-Arborescences. An out-tree is a tree rooted at some
node, say r, with all edges directed away from r; an in-tree is a tree obtained by
reversing all the directions in an out-tree. In the discussion below, we only focus
on out-trees; the same results can be obtained for in-trees. The algorithm for out-
trees requires a careful partitioning of the tree into blocks of chains, and giving
random delays at the start of each chain in each of the blocks – thus the delays are
spread all over the tree. The head of the chain waits for all its ancestors to finish
running, after which it waits for an amount of time equal to its random delay.
After this, the entire chain is allowed to run without interruption. Of course, this
may result in an infeasible schedule where multiple jobs simultaneously contend
for the same machine (at the same time). We show that this contention is low
and can be resolved by expanding the infeasible schedule produced above.
Chain Decomposition. We define the notions of chain decomposition of a
graph and its chain width; the decomposition for an out-directed arborescence
is illustrated in Figure 1. Given a DAG G(V, E), let din(u) and dout(u) denote
the in-degree and out-degree, respectively, of u in G. A chain decomposition of
G(V, E) is a partition of its vertex set into subsets B1, . . . , Bλ (called blocks)
such that the following properties hold: (i) The subgraph induced by each block
Bi is a collection of vertex-disjoint directed chains, (ii) For any u, v ∈ V , let
u ∈ Bi be an ancestor of v ∈ Bj. Then, either i < j, or i = j and u and v
belong to the same directed chain of Bi, (iii) If dout(u) > 1, then none of u’s
out-neighbors are in the same block as u. The chain-width of a DAG is the
minimum value λ such that there is a chain decomposition of the DAG into λ
blocks.
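The three defining properties can be checked mechanically. The sketch below validates a proposed assignment of vertices to blocks, reading property (ii) as requiring an ancestor to lie in a block of smaller or equal index. The input format (a successor dictionary with an entry for every vertex, and a vertex-to-block-index dictionary) and the function name are assumptions made for illustration.

```python
def is_chain_decomposition(succ, block):
    """Check properties (i)-(iii) for a DAG given as succ[u] = list of
    out-neighbors (every vertex must be a key) and block[u] = block index."""
    nodes = list(succ)
    in_same = {v: 0 for v in nodes}
    chain_of = {v: v for v in nodes}      # union-find over same-block edges

    def find(v):
        while chain_of[v] != v:
            chain_of[v] = chain_of[chain_of[v]]
            v = chain_of[v]
        return v

    for u in nodes:
        same = [v for v in succ[u] if block[v] == block[u]]
        # (i) inside a block, every vertex has at most one out-neighbor ...
        if len(same) > 1:
            return False
        # (iii) a branching vertex has no out-neighbor in its own block.
        if len(succ[u]) > 1 and same:
            return False
        for v in same:
            # ... and at most one in-neighbor, so each block is a set of chains.
            in_same[v] += 1
            if in_same[v] > 1:
                return False
            chain_of[find(u)] = find(v)

    # (ii) every descendant v of u is in a later block, or on u's own chain.
    for u in nodes:
        stack, seen = list(succ[u]), set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(succ[v])
            if block[u] > block[v]:
                return False
            if block[u] == block[v] and find(u) != find(v):
                return False
    return True

# Out-tree 0 -> 1, 1 -> 2, 1 -> 3: the chain (0, 1) forms block 1, the leaves block 2.
succ = {0: [1], 1: [2, 3], 2: [], 3: []}
print(is_chain_decomposition(succ, {0: 1, 1: 1, 2: 2, 3: 2}))  # True
print(is_chain_decomposition(succ, {0: 1, 1: 1, 2: 1, 3: 2}))  # False
```

The second call fails because vertex 1 has two out-neighbors and one of them shares its block, which violates property (iii).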
Fig. 1. Chain decomposition of an out-directed arborescence. The vertices enclosed
within the boxes in figures 1(a) and 1(b) are in blocks B3 and B2 respectively while
the remaining vertices are in block B1.
Well Structured Schedules. We now state some definitions motivated by
those in Goldberg et al. [6]. Given a GDSS instance with a DAG G(V, E) and
given a chain decomposition of G into λ blocks, we construct a B-delayed schedule
for it as follows; B is an integer that will be chosen later. Each job v which is the
head of a chain in a block is assigned a delay d(v) in {0, 1, . . ., B−1}. Let v be the

head of chain Ci. Job v waits for d(v) amount of time after all its predecessors
have finished running, after which the jobs of Ci are scheduled consecutively (of
course, the resulting schedule might be infeasible). A random B-delayed schedule
is a B-delayed schedule in which all the delays have been chosen independently
and uniformly at random from {0, 1, . . ., B − 1}. For a B-delayed schedule S,
the contention C(Mi, t) is the number of jobs scheduled on machine Mi in the
time interval [t, t + 1). As in [6, 22], we assume w.l.o.g. that all job lengths are
powers of two. This can be achieved by multiplying each job length by at most a
factor of two (which affects our approximation ratios only by a constant factor).
A delayed schedule S is well-structured if for each k, all jobs with length 2^k
begin in S at a time instant that is an integral multiple of 2^k. Such schedules can
be constructed from randomly delayed schedules as follows. First create a new
GDSS instance by replacing each job v = (m(v), p_v) by the job v̂ = (m(v), 2p_v).
Let S be a random B-delayed schedule for this modified instance, for some B;
we call S a padded random B-delayed schedule. From S, we can construct a
well-structured delayed schedule, S′, for the original GDSS instance as follows: insert
v with the correct boundary in the slot assigned to v̂ by S. S′ will be called a
well-structured random B-delayed schedule for the original GDSS instance.
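The two ingredients of this construction, rounding job lengths up to powers of two and placing a job on the correct boundary inside its padded slot, can be sketched as follows. The helper names are hypothetical, and integer times are assumed.

```python
def round_up_to_power_of_two(p):
    """Smallest power of two that is >= p, for a positive integer p.
    The result is less than 2p, so lengths grow by a factor below two."""
    return 1 << (p - 1).bit_length()

def aligned_start(slot_start, p):
    """Earliest multiple of p that is >= slot_start.

    If the padded job occupies [slot_start, slot_start + 2p), then the
    returned time t satisfies t < slot_start + p, so the original job
    [t, t + p) fits inside the padded slot.
    """
    return -(-slot_start // p) * p    # ceiling division, then scale back

print(round_up_to_power_of_two(5))   # 8
print(aligned_start(13, 4))          # 16, and [16, 20) lies inside [13, 21)
```

This is why doubling each processing time suffices: a window of length 2p always contains a full interval of length p that starts at a multiple of p.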
Our Algorithm. We now describe our algorithm; for the sake of clarity, we
occasionally omit floor and ceiling symbols (e.g., “B = ⌈2Πmax / log(npmax)⌉” is
written as “B = 2Πmax / log(npmax)”). As before let pmax = max_v p_v.
1. Construct a chain decomposition of the DAG G(V, E) and let λ be its chain
width.
2. Let B = 2Πmax / log(npmax). Construct a padded random B-delayed
schedule S by first increasing the processing time of each job v by a factor of
2 (as described above), and then choosing a delay d(v) ∈ {0, ..., B − 1}
independently and uniformly at random for each v.
3. Construct a well-structured random B-delayed schedule S′ as described
above.
4. Construct a valid schedule S″ using the technique from Goldberg et al. [6]
as follows:
(a) Let the makespan of S′ be L.
(b) Partition the schedule S′ into frames of length pmax; i.e., into the set of
time-intervals {[i·pmax, (i + 1)·pmax), i = 0, 1, ..., L/pmax − 1}.
(c) For each frame, use the frame-scheduling technique from [6] to produce a
feasible schedule for that frame. Concatenate the schedules of all frames
to obtain the final schedule.
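The random-delay idea behind steps 1 and 2 can be sketched for an out-tree as follows. This is a simplified illustration under stated assumptions: it omits the padding and alignment of steps 2 and 3 and the frame scheduling of step 4, it expects the chains in an order where each head's parent has already been scheduled, and all names are hypothetical.

```python
import random
from collections import Counter

def random_delayed_schedule(chains, ptime, parent, B, rng=random):
    """Start times of a random B-delayed schedule on an out-tree.

    chains: list of chains (lists of jobs), ordered so that the parent of
            every chain head belongs to an earlier chain
    ptime:  job -> processing time p_v
    parent: job -> its unique predecessor (roots are absent)
    """
    start = {}
    for chain in chains:
        head = chain[0]
        ready = 0
        if head in parent:
            ready = start[parent[head]] + ptime[parent[head]]
        t = ready + rng.randrange(B)   # delay d(v), uniform in {0, ..., B-1}
        for v in chain:                # the chain then runs without interruption
            start[v] = t
            t += ptime[v]
    return start

def max_contention(start, machine, ptime):
    """Largest contention C(M_i, t) over all machines and unit intervals."""
    load = Counter()
    for v, s in start.items():
        for t in range(s, s + ptime[v]):
            load[(machine[v], t)] += 1
    return max(load.values())

# Out-tree 0 -> 1, 1 -> 2, 1 -> 3 with unit jobs, decomposed into three chains.
chains = [[0, 1], [2], [3]]
ptime = {0: 1, 1: 1, 2: 1, 3: 1}
parent = {1: 0, 2: 1, 3: 1}
machine = {0: "A", 1: "B", 2: "A", 3: "A"}

s = random_delayed_schedule(chains, ptime, parent, B=1)   # B = 1 forces zero delays
print(s, max_contention(s, machine, ptime))
```

With B = 1 all delays are zero, so jobs 2 and 3 collide on machine A at time 2. Larger values of B spread such collisions out, at the cost of a longer (still possibly infeasible) schedule that step 4 then repairs.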
The following theorem shows the performance guarantee of the above algo-
rithm, when given a chain decomposition.
Theorem 2. Let pmax = max_{i,j} p_ij. Given an instance of treelike GDSS and a
chain decomposition of its DAG G(V, E) into λ blocks, the schedule S″ produced
by the above algorithm has makespan O(ρ · (Pmax + Πmax)) with high probability,
where ρ = max{λ, log n} · ⌈log(min{pmax, n}) / log log n⌉. Furthermore, the
algorithm can be derandomized.

The proof of Theorem 3 demonstrates a chain decomposition of width
O(log n) for any out-tree: this completes the algorithm for an out-tree. An iden-
tical argument would work for the case of a directed in-tree. We note that the no-
tions of chain decomposition and chain-width for the out-directed arborescences
are similar to those of caterpillar decomposition and caterpillar dimension for
trees (see Linial et al. [17]). However, in general, a caterpillar decomposition for
an arborescence need not be a chain-decomposition and vice-versa.
Theorem 3. There is a deterministic polynomial-time approximation algorithm
for solving the GDSS problem when the underlying DAG is restricted to be an
in/out tree. The algorithm computes a schedule with makespan O((Pmax + Πmax) ·
ρ), where ρ = log n · ⌈log(min{pmax, n}) / log log n⌉. In particular, we get an
O(log n)-approximation in the case of unit-length jobs.
GDSS on Arbitrary Forest-Shaped DAGs. We now consider the case where
the undirected graph underlying the DAG is a forest. The chain decomposition
algorithm described for in/out-trees does not work for arbitrary forests: instead
of following the approach for in/out-trees, we observe that once we have a chain
decomposition, the problem restricted to a block of chains is precisely the job
shop scheduling problem. This allows us to reduce the R|forest|Cmax problem
to a set of job shop problems, for which we use the algorithm of Goldberg et
al. [6]. While this is simpler than the algorithm for in/out-trees, we incur another
logarithmic factor in the approximation guarantee.
The following lemma and theorem show that a good decomposition can be
computed for forests which can be exploited to yield a good approximation ratio.
Lemma 1. Every DAG T whose underlying undirected graph is a forest, has a
chain decomposition into γ blocks, where γ ≤ 2(lg n + 1).
Theorem 4. Given a GDSS instance and a chain decomposition of its DAG
G(V, E) into γ blocks, there is a deterministic polynomial-time algorithm which
delivers a schedule of makespan O((Pmax + Πmax) · ρ), where
$\rho = \gamma \log n / \log\log n$. Thus, Lemma 1 implies that
$\rho = O(\log^2 n / \log\log n)$ is achievable.
3 The $R|\mathrm{forest}|\sum_j w_j C_j$ Problem
We now consider the objective of minimizing weighted completion time, where
the given weight for each job $j$ is $w_j \ge 0$. Given an instance of
$R|\mathrm{prec}|\sum_j w_j C_j$ where the jobs have not been assigned their
processors, we now reduce it to instances of $R|\mathrm{prec}|C_{\max}$ with
processor assignment. More precisely, we show the following: let $P_{\max}$ and
$\Pi_{\max}$ denote the “dilation” and “congestion” as usual; if there exists a
schedule of makespan $\rho \cdot (P_{\max} + \Pi_{\max})$ for the latter, then
there is an $O(\rho)$-approximation algorithm for the former. Let the machines
and jobs be indexed by $i$ and $j$; $p_{i,j}$ is the (integral) time for
processing job $j$ on machine $i$, if we choose to process $j$ on $i$. We now
present an LP formulation for $R|\mathrm{prec}|\sum_j w_j C_j$ which has the
following variables: for $\ell = 0, 1, \ldots$, variable $x_{i,j,\ell}$ is the
indicator variable which denotes if “job $j$ is processed on machine $i$, and
completes in the time interval $(2^{\ell-1}, 2^{\ell}]$”; for job $j$, $C_j$ is
its completion time, and $z_j$ is the time spent on processing it. The objective
is to minimize $\sum_j w_j C_j$ subject to:
(1) $\forall j$: $\sum_{i,\ell} x_{i,j,\ell} = 1$,
(2) $\forall j$: $z_j = \sum_i p_{i,j} \sum_{\ell} x_{i,j,\ell}$,
(3) $\forall (j \prec k)$: $C_k \ge C_j + z_j$,
(4) $\forall j$: $\sum_{i,\ell} 2^{\ell-1} x_{i,j,\ell} < C_j \le \sum_{i,\ell} 2^{\ell} x_{i,j,\ell}$,
(5) $\forall (i, \ell)$: $\sum_j p_{i,j} \sum_{t \le \ell} x_{i,j,t} \le 2^{\ell}$,
(6) $\forall \ell$, $\forall$ maximal chains $P$: $\sum_{j \in P} \sum_i p_{i,j} \sum_{t \le \ell} x_{i,j,t} \le 2^{\ell}$,
(7) $\forall (i, j, \ell)$: $(p_{i,j} > 2^{\ell}) \Rightarrow x_{i,j,\ell} = 0$,
(8) $\forall (i, j, \ell)$: $x_{i,j,\ell} \ge 0$.
Note that (5) and (6) are “congestion” and “dilation” constraints respectively.
Our reduction proceeds as follows. Solve the LP, and let the optimal
fractional solution be denoted by variables $x^*_{i,j,\ell}$, $C^*_j$, and
$z^*_j$. We do the following filtering, followed by an assignment of jobs to
(machine, time-frame) pairs.
Filtering: For each job $j$, note from the first inequality in (4) that the total
“mass” (sum of $x_{i,j,\ell}$ values) for the values $2^{\ell} \ge 4C^*_j$ is at
most 1/2. We first set $x_{i,j,\ell} = 0$ if $2^{\ell} \ge 4C^*_j$, and scale each
$x_{i,j,\ell}$ to
$x_{i,j,\ell} / (1 - \sum_{\ell' : 2^{\ell'} \ge 4C^*_j} \sum_i x_{i,j,\ell'})$
if $\ell$ is such that $2^{\ell} < 4C^*_j$; this ensures that equation (1) still
holds. After the filtering, each non-zero variable increases by at most a factor
of 2. Additionally, for any fixed $j$, the following property is satisfied:
consider the largest value of $\ell$ such that $x_{i,j,\ell}$ is non-zero; let
this value be $\ell'$; then $2^{\ell'} = O(C^*_j)$. The right-hand sides of (5)
and (6) become at most $2^{\ell+1}$ in the process and the $C_j$ values
increase by at most a factor of two.
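The filtering step for a single job can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout (keys `(machine, level)`) and the name `filter_job` are hypothetical and do not come from the paper, and exact rationals are used so that the rescaled values still sum to exactly 1.

```python
from fractions import Fraction

def filter_job(x, c_star):
    """Filter and rescale the LP values of ONE job (hypothetical helper).

    x maps (machine, level) -> fractional value; level l stands for
    completion in the interval (2**(l-1), 2**l]. c_star is C*_j.
    """
    # Mass sitting on levels with 2**l >= 4*C*_j; by constraint (4) it is < 1/2.
    dropped = sum(val for (_, l), val in x.items() if 2 ** l >= 4 * c_star)
    scale = 1 - dropped
    # Zero out the late levels and rescale the rest so constraint (1) still holds.
    return {
        (i, l): (val / scale if 2 ** l < 4 * c_star else 0)
        for (i, l), val in x.items()
    }

# Example: one machine, mass 3/4 on level 1 and 1/4 on level 4, with C*_j = 3.
example = filter_job({(0, 1): Fraction(3, 4), (0, 4): Fraction(1, 4)}, 3)
```

Because the dropped mass is below 1/2, the scaling factor `1 / scale` is below 2, which matches the claim that no variable more than doubles.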
Assigning Jobs to Machines and Frames. For each $j$, set $F(j)$ to be the
frame $(2^{\ell-1}, 2^{\ell}]$, where $\ell$ is the index such that
$4C^*_j \in F(j)$. Let $G[\ell]$ denote the sub-problem which is restricted to
the jobs in this frame. Let $P_{\max}(\ell)$ and $\Pi_{\max}(\ell)$ be the
fractional dilation and congestion, respectively, for the sub-problem
restricted to $G[\ell]$. From constraints (5) and (6), and due to our filtering
step, which at most doubles any non-zero variable, it follows that both
$P_{\max}(\ell)$ and $\Pi_{\max}(\ell)$ are $O(2^{\ell})$. We now perform a
processor assignment as follows: for each $G[\ell]$, we use the processor
assignment scheme in Section 2.1 to assign processors to jobs. This ensures
that the integral $P_{\max}(\ell)$ and $\Pi_{\max}(\ell)$ values are at most a
constant times their fractional values.
Scheduling: First schedule all jobs in $G[1]$; then schedule all jobs in $G[2]$,
and so on. We can use any approximation algorithm for makespan-minimization, for
each of these scheduling steps. It is easy to see that we get a feasible solution:
for any two jobs $j_1, j_2$, if $j_1 \prec j_2$, then $C^*_{j_1} \le C^*_{j_2}$
and frame $F(j_1)$ occurs before $F(j_2)$ and hence gets scheduled first.
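As an illustration of the frame assignment, the following sketch groups jobs by the index $\ell$ with $4C^*_j \in (2^{\ell-1}, 2^{\ell}]$ and returns the groups in the order in which they would be scheduled. The function names and the input format (a mapping from job to $C^*_j$) are assumptions made for this example; it also assumes $4C^*_j > 1/2$, which holds when processing times are positive integers.

```python
def frame_index(c_star):
    """Smallest l with 4*C*_j <= 2**l, i.e. 4*C*_j lies in (2**(l-1), 2**l]."""
    target = 4 * c_star
    level = 0
    while 2 ** level < target:
        level += 1
    return level

def group_by_frame(c_stars):
    """Map each frame index l to the list of jobs forming the sub-problem G[l]."""
    groups = {}
    for job, c_star in c_stars.items():
        groups.setdefault(frame_index(c_star), []).append(job)
    # Sub-problems are scheduled in increasing order of l.
    return dict(sorted(groups.items()))

frames = group_by_frame({"a": 1, "b": 3, "c": 1.5})
```

Since $C^*$ is non-decreasing along precedence constraints, a predecessor never lands in a later frame than its successor under this grouping.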
Theorem 5. If there exists an approximation algorithm which yields a schedule
whose makespan is $O((P_{\max} + \Pi_{\max}) \cdot \rho)$, then there is also an
$O(\rho)$-approximation algorithm for minimizing weighted completion time. Thus,
Theorem 4 implies that $\rho = O(\log^2 n / \log\log n)$ is achievable.

Proof. Consider any job $j$ which belongs to $G[\ell]$; since both
$P_{\max}(\ell)$ and $\Pi_{\max}(\ell)$ are $O(2^{\ell})$, the final completion
time of job $j$ is $O(\rho 2^{\ell})$. Since $2^{\ell} = O(C^*_j)$, the
theorem follows.
4 Weighted Completion Time and Flow Time on Chains
We now consider the case of $R|\mathrm{prec}|\sum_j w_j C_j$, where the processor-assignment is
not prespecified. We consider the restricted-assignment variant of this problem
[13, 21], where for every job v, there is a value pv such that for all machines
i, pi,v ∈ {pv, ∞}. Let S(v) denote the set of machines such that pi,v = pv. We
focus on the case where the precedence DAG is a disjoint union of chains with
all pu being polynomially-bounded positive integers in the input-size N.
We now present our approximation algorithms for weighted completion time.
The algorithm and proof techniques for flow time are similar to those for weighted
completion time and are omitted from this version.
Weighted Completion Time. Recall that $N = \max_v\{n, m, p_v\}$ denotes the
“input size”. In this section, we obtain an $O(\log N / \log\log N)$-approximation
algorithm for the minimum weighted completion time problem. We first describe
an LP relaxation. Let $T = \sum_v p_v$. In the following LP, for ease of exposition,
we assume that all the pv values are equal to one. Our algorithm easily gener-
alizes to the case where the pv values are arbitrary positive integers, and the
LP is polynomial-sized if all pv values are polynomial in N. Let ≺ denote the
immediate predecessor relation, i.e., if u ≺ v, then they both belong to the
same chain and u is an immediate predecessor of v in this chain. Note that if
v is the first job in its chain, then it has no predecessor. In the time-indexed
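LP formulation below, the variable $x_{v,i,t}$ denotes the fractional amount of job
$v$ that is processed on machine $i$ at time $t$. The objective is
$\min \sum_v w(v) C(v)$, subject to:
(B1) $\forall v$: $\sum_{i \in S(v)} \sum_{t \in [1,\ldots,T]} x_{v,i,t} = 1$,
(B2) $\forall i \in [1,\ldots,m]$, $\forall t \in [1,\ldots,T]$: $\sum_v x_{v,i,t} \le 1$,
(B3) $\forall v$, $\forall t \in [1,\ldots,T]$: $z_{v,t} = \sum_{i \in [1,\ldots,m]} x_{v,i,t}$,
(B4) $\forall u \prec v$, $\forall t \in [1,\ldots,T]$: $\sum_{t' \in [1,\ldots,t]} z_{v,t'} \le \sum_{t' \in [1,\ldots,t-1]} z_{u,t'}$,
(B5) $\forall v$: $C(v) = \sum_{t \in [1,\ldots,T]} t \cdot z_{v,t}$,
(B6) $\forall v$, $\forall i \in S(v)$, $\forall t \in [1,\ldots,T]$: $x_{v,i,t} \ge 0$.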
The constraints (B1) ensure that all jobs are processed completely, and (B2)
ensure that at most one job is (fractionally) assigned to any machine at any
time. The variable $z_{v,t}$ denotes the fractional amount of job $v$ that has been
processed on all machines at time $t$. Constraints (B3) and (B4) are the precedence
constraints and (B5) defines the completion time $C(v)$ for job $v$.
Our algorithm proceeds as follows. We first solve the above LP optimally.
Let OPT be the optimal value of the LP and let x and z denote the optimal
solution values in the rest of the discussion. We define a rounding procedure for
each chain such that the following hold:
1. Let Zv,t be the indicator random variable which denotes if v is executed at
time t in the rounded solution. Let Xv,i,t be the indicator random variable which
denotes if v is executed at time t on machine i in the rounded solution. Then
E[Zv,t] = zv,t and E[Xv,i,t] = xv,i,t.

2. All precedence constraints are satisfied in the rounded solution.
3. Jobs in different chains are rounded independently.
After the Zv,t values have been determined, we do the machine assignment
as follows: if Zv,t = 1, then job v is assigned to machine i with probability
(xv,i,t/zv,t). In general, this assignment strategy might result in jobs from dif-
ferent chains executing on the same machine at the same time, and hence an
infeasible schedule. Let C1 denote the cost of this infeasible solution. Property
1 above ensures that E[C1] = OPT . Let Y be the random variable which de-
notes the maximum contention of any machine at any time. We obtain a feasible
solution by “expanding” each time slot by a factor of Y .
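For unit-length jobs, the expansion step can be pictured with the following sketch. It is an assumption-laden illustration rather than the authors' procedure: the input is a hypothetical mapping from job to its (machine, time slot) pair in the infeasible solution, every original slot $t$ is replaced by $Y$ unit sub-slots, and colliding jobs are spread over those sub-slots in arbitrary order.

```python
from collections import Counter

def max_contention(assignment):
    """Y: the largest number of jobs sharing one (machine, time slot) pair."""
    return max(Counter(assignment.values()).values())

def expand(assignment):
    """Stretch every time slot by a factor Y and place colliding jobs one after another."""
    y = max_contention(assignment)
    used = Counter()
    schedule = {}
    # Process jobs by original time slot; sorted() is stable, so ties keep input order.
    for job, (machine, t) in sorted(assignment.items(), key=lambda kv: kv[1][1]):
        k = used[(machine, t)]          # jobs already placed in this original slot
        used[(machine, t)] += 1
        # Original slot t becomes the sub-slots (t-1)*y + 1, ..., t*y.
        schedule[job] = (machine, (t - 1) * y + k + 1)
    return schedule

feasible = expand({"a": (0, 1), "b": (0, 1), "c": (0, 2)})
```

A job originally in slot $t$ ends no later than time $tY$, and a job in a later slot starts after time $tY$, so precedence constraints that held before the expansion still hold afterwards.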
We now show our rounding procedure for the jobs of a specific chain such
that properties 1 and 2 hold; different chains are handled independently as
follows: for each chain $\Gamma$, we choose a value $r(\Gamma) \in [0, 1]$
uniformly and independently at random. For each job $v$ belonging to chain
$\Gamma$, $Z_{v,t} = 1$ iff
$\sum_{t'=1}^{t-1} z_{v,t'} < r(\Gamma) \le \sum_{t'=1}^{t} z_{v,t'}$.
Bertsimas et al. [1] show other applications for
such rounding techniques. A moment’s reflection shows that property 1 holds
due to the randomized rounding and property 2 holds due to Equation (B4).
A straightforward application of the Chernoff-type bound from [18] yields the
following lemma:
Lemma 2. Let $E$ denote the event that $Y \le \alpha \log N / \log\log N$, where
$\alpha > 0$ is a suitably large constant. Event $E$ occurs after the randomized
machine assignment with high probability: this probability can be made at least
$1 - 1/N^{\beta}$ for any desired constant $\beta > 0$, by letting the constant
$\alpha$ be suitably large.
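The per-chain rounding rule (one threshold $r(\Gamma)$ shared by all jobs of a chain) can be sketched as below. The data layout and the name `round_chain` are assumptions for illustration; the threshold is passed in explicitly so the example is deterministic, whereas the algorithm would draw it uniformly from $[0, 1]$. Exact rationals avoid floating-point prefix sums that fall marginally short of 1.

```python
from fractions import Fraction

def round_chain(z_by_job, r):
    """Return job -> time slot t (1-based) with Z_{v,t} = 1 for one chain.

    z_by_job maps each job v of the chain to the list [z_{v,1}, ..., z_{v,T}],
    whose entries sum to 1. r is the chain's threshold, assumed to be > 0.
    """
    slots = {}
    for job, z in z_by_job.items():
        prefix = 0
        for t, amount in enumerate(z, start=1):
            prefix += amount
            if r <= prefix:   # first t with sum_{t' < t} z < r <= sum_{t' <= t} z
                slots[job] = t
                break
    return slots

half = Fraction(1, 2)
chain = {"u": [half, half, 0, 0], "v": [0, half, half, 0]}   # u precedes v; (B4) holds
late = round_chain(chain, Fraction(3, 4))
early = round_chain(chain, Fraction(1, 4))
```

In both runs job `u` is placed strictly before job `v`, which is what constraint (B4) guarantees for any threshold.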
Finally, we note that we expand an infeasible schedule only if the event E
occurs. Otherwise, we can repeat the randomized machine assignment until event
E occurs and expand the resultant infeasible schedule. Let the final cost of our
solution be C. We now have an O(log N/ log log N)-approximation as follows:
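$$E[C \mid E] \le E\left[ O\left(\frac{\log N}{\log\log N}\right) \cdot C_1 \,\middle|\, E \right] \le O\left(\frac{\log N}{\log\log N}\right) \cdot \frac{E[C_1]}{\Pr[E]} \le O\left(\frac{\log N}{\log\log N}\right) \cdot OPT.$$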
Acknowledgments
We are thankful to David Shmoys and the anonymous APPROX 2005 referees
for valuable comments.
References
1. D. Bertsimas, C.-P. Teo and R. Vohra. On Dependent Randomized Rounding Al-
gorithms. Operations Research Letters, 24(3):105–114, 1999.
2. C. Chekuri and M. Bender. An Efficient Approximation Algorithm for Minimiz-
ing Makespan on Uniformly Related Machines. Journal of Algorithms, 41:212–224,
2001.
3. C. Chekuri and S. Khanna. Approximation algorithms for minimizing weighted
completion time. Handbook of Scheduling, 2004.

4. F. A. Chudak and D. B. Shmoys. Approximation algorithms for precedence-
constrained scheduling problems on parallel machines that run at different speeds.
Journal of Algorithms, 30(2):323–343, 1999.
5. U. Feige and C. Scheideler. Improved bounds for acyclic job shop scheduling. Com-
binatorica, 22:361–399, 2002.
6. L.A. Goldberg, M. Paterson, A. Srinivasan and E. Sweedyk. Better approximation
guarantees for job-shop scheduling. SIAM Journal on Discrete Mathematics, Vol.
14, 67–92, 2001.
7. L. Hall. Approximation Algorithms for Scheduling. in Approximation Algorithms
for NP-Hard Problems, Edited by D. S. Hochbaum. PWS Press, 1997.
8. L. Hall, A. Schulz, D.B. Shmoys, and J. Wein. Scheduling to minimize average com-
pletion time: Offline and online algorithms. Mathematics of Operations Research,
22:513–544, 1997.
9. K. Jansen and L. Porkolab, Improved Approximation Schemes for Scheduling Unre-
lated Parallel Machines. Proc. ACM Symposium on Theory of Computing (STOC),
pp. 408–417, 1999.
10. K. Jansen and R. Solis-oba. Scheduling jobs with chain precedence constraints.
Parallel Processing and Applied Mathematics, PPAM, LNCS 3019, pp. 105–112,
2003.
11. K. Jansen, R. Solis-Oba and M. Sviridenko. Makespan Minimization in Job Shops:
A Polynomial Time Approximation Scheme. Proc. ACM Symposium on Theory of
Computing (STOC), pp. 394–399, 1999.
12. S. Leonardi and D. Raz. Approximating total flow time on parallel machines. In
Proc. ACM Symposium on Theory of Computing, 110–119, 1997.
13. J. K. Lenstra, D. B. Shmoys and É. Tardos. Approximation algorithms for schedul-
ing unrelated parallel machines. Mathematical Programming, Vol. 46, 259–271,
1990.
14. F.T. Leighton, B. Maggs and S. Rao. Packet routing and jobshop scheduling in
O(congestion + dilation) Steps, Combinatorica, Vol. 14, 167–186, 1994.
15. F. T. Leighton, B. Maggs, and A. Richa, Fast algorithms for finding O(congestion
+ dilation) packet routing schedules. Combinatorica, Vol. 19, 375–401, 1999.
16. J. H. Lin and J. S. Vitter. ε-approximations with minimum packing constraint
violation. In Proceedings of the ACM Symposium on Theory of Computing, 1992,
pp. 771–782.
17. N. Linial, A. Magen, and M.E. Saks. Trees and Euclidean Metrics. In Proceedings
of the ACM Symposium on Theory of Computing, 169–175, 1998.
18. A. Panconesi and A. Srinivasan. Randomized distributed edge coloring via an ex-
tension of the Chernoff-Hoeffding bounds, SIAM Journal on Computing, Vol. 26,
350–368, 1997.
19. M. Queyranne and M. Sviridenko. Approximation algorithms for shop scheduling
problems with minsum objective, Journal of Scheduling, Vol. 5, 287–305, 2002.
20. A. Schulz and M. Skutella. The power of α-points in preemptive single machine
scheduling. Journal of Scheduling 5(2): 121–133, 2002.
21. P. Schuurman and G. J. Woeginger. Polynomial time approximation algorithms for
machine scheduling: Ten open problems. Journal of Scheduling 2:203–213, 1999.
22. D.B. Shmoys, C. Stein and J. Wein. Improved approximation algorithms for shop
scheduling problems, SIAM Journal on Computing, Vol. 23, 617–632, 1994.
23. M. Skutella. Convex quadratic and semidefinite relaxations in scheduling. Journal
of the ACM, 46(2):206–242, 2001.

Approximation Algorithms for Network Design
and Facility Location with Service Capacities
Jens Maßberg and Jens Vygen
Research Institute for Discrete Mathematics, University of Bonn
Lennéstr. 2, 53113 Bonn, Germany
Abstract. We present the first constant-factor approximation algorithms
for the following problem: Given a metric space (V, c), a set D ⊆ V
of terminals/customers with demands d : D → R+, a facility opening
cost f ∈ R+ and a capacity u ∈ R+, find a partition
$D = D_1 \,\dot\cup\, \cdots \,\dot\cup\, D_k$ and Steiner trees $T_i$ for $D_i$
($i = 1, \ldots, k$) with $c(E(T_i)) + d(D_i) \le u$ for $i = 1, \ldots, k$ such that
$\sum_{i=1}^{k} c(E(T_i)) + kf$ is minimum.
This problem arises in VLSI design. It generalizes the bin-packing prob-
lem and the Steiner tree problem. In contrast to other network design and
facility location problems, it has the additional feature of upper bounds
on the service cost that each facility can handle.
Among other results, we obtain a 4.1-approximation in polynomial time,
a 4.5-approximation in cubic time and a 5-approximation as fast as com-
puting a minimum spanning tree on (D, c).
1 Introduction
Facility location and network design problems have received much attention in
the last decade. Approximation algorithms have been designed for many variants.
In most cases, a set of terminals/customers, possibly with demands, has to be
served by one or more facilities. A facility serves its customers either by direct
links (a star centered at the facility) or by an arbitrary connected network (a
Steiner tree). Facilities (and sometimes edges) often have capacities, i.e. they
can serve customers of limited total demand only.
In contrast to all previously considered models, our facilities have a capacity
bounding total customer demand plus service cost. This is an essential constraint
in VLSI design, but it may well occur in other applications.
1.1 Problem Statement
We use standard notation (see, e.g., [13]) and consider the following problem:
Given a metric space (V, c), a set D ⊆ V of terminals/customers with demands
d : D → R+, a facility opening cost f ∈ R+ and a capacity u ∈ R+, find a
partition $D = D_1 \,\dot\cup\, \cdots \,\dot\cup\, D_k$ and Steiner trees $T_i$ for $D_i$ ($i = 1, \ldots, k$) with
$$c(E(T_i)) + d(D_i) \le u \quad \text{for } i = 1, \ldots, k \qquad (1)$$
such that
$$\sum_{i=1}^{k} c(E(T_i)) + kf \qquad (2)$$
is minimum.
So the objective function is the sum of service cost (total length of Steiner
trees) and facility cost (f times the number of Steiner trees).
1.2 Results
We describe three approximation algorithms for this problem. Their running
times are dominated by the first step, in which algorithm A computes a minimum
spanning tree on (D, c), algorithm B computes an approximate Steiner tree for
D in $(V, c')$, and algorithm C computes an approximate tour in $(D, c'')$. Here
$c'$ and $c''$ are new metric cost functions.
Let α be the Steiner ratio (supremum of length of a minimum spanning tree
over length of an optimum Steiner tree). Let β and γ be the performance ratios
of Steiner tree and TSP approximation algorithms, respectively.
Then algorithm A has performance guarantee 1+2α, algorithm B has perfor-
mance guarantee max{3.705, 3β}, and algorithm C has performance guarantee
max{4, 3γ}.
This means that we have a 4.5-approximation for general metrics, using
Christofides’ algorithm [2], where the most time-consuming step is computing a
minimum weight perfect matching on a subset of customers.
Moreover, we have a 5-approximation for general metrics, whose running
time is dominated by finding a minimum spanning tree in (D, c). Finally, if
we run algorithm B with the Robins-Zelikovsky algorithm, we even get a
4.099-approximation algorithm, though only with an enormous (albeit polynomial)
running time.
1.3 Motivation: VLSI Design
The problem arises in VLSI design when designing a clock network. The storage
elements (flip-flops, latches) of a chip (in the following: customers) need to get a
periodic clock signal. Often they have to receive it directly from special modules
(clock splitters, local clock buffers), in the following: facilities.
The task is to place a set of facilities on the chip and connect each customer
to a facility via a Steiner tree consisting of horizontal and vertical wires. As each
facility can drive only a limited electrical capacitance, this means that there is
an upper bound on the weighted sum of the wire length and the total input
capacitance of the customers served by each facility.
At this stage the main goal is to minimize power consumption. Wires (pro-
portional to their length) as well as facilities (proportional to their number)
contribute to the power consumption. As we can always place a facility some-
where on a Steiner tree connecting the customers, we arrive at our problem
formulation. In this case, the metric is $(\mathbb{R}^2, \ell_1)$.

Fig. 1. Solution with 3675 customers and 161 facilities.
A state-of-the-art industrial clocktree design tool is described in [9]. Of
course, this has many other components. For example, the facilities have to
be served by a clock signal, too. More importantly, each customer is associ-
ated with a time interval, and the time intervals of customers served by the
same facility must have nonzero intersection. We do not know whether there is
an approximation algorithm for the resulting generalization. However, for many
clock networks we can allow zero skew, i.e. the intersection of all time intervals
is nonempty. In this case the problem considered in this paper is an excellent
model of the practical problem. Experimental results are shown in section 3.5.
Real life clocktree instances can have more than 100 000 terminals. Figure 1
shows a solution for a practical instance with 3675 customers (inst1 of section
3.5).
1.4 Related Work
To our knowledge, our problem has not been considered yet in the literature.
However, it generalizes several classical NP-hard combinatorial optimization
problems.

It includes the Steiner tree problem: simply set u and f large enough. The
currently best known approximation algorithm for Steiner trees, by Robins and
Zelikovsky [16], achieves an approximation ratio of 1.55.
It also includes the soft-capacitated facility location problem in the case
where all elements of V are identical potential facilities. The best known ap-
proximation ratio for the soft-capacitated facility location problem is 3 [3], and
2 in the case of uniform demands [14].
Our problem reduces to the bin-packing problem if there are no service costs,
i.e. c ≡ 0. The best approximation ratio for bin-packing is 1.5 (unless P=NP)
[17], although there is a fully polynomial asymptotic approximation scheme [11,
13].
Finally, our problem is loosely related to several clustering problems like the
k-tree cover problem (e.g. [5]), the vehicle routing problem (e.g. [18]), and other
network design problems such as connected facility location or the rent-or-buy
problem.
1.5 Complexity
It is obvious that the problem is strongly NP-hard and MAXSNP-hard as it
contains the Steiner tree problem and the bin-packing problem. Moreover we
can show the following:
Proposition 1. There is no $(2 - \varepsilon)$-approximation algorithm (for any
$\varepsilon > 0$) for any class of metrics where the Steiner tree problem cannot
be solved exactly in polynomial time.
Proof. Otherwise the Steiner tree decision problem which is NP-complete (see
[7]) can be solved in polynomial time:
Assume we have a $(2 - \varepsilon)$-approximation algorithm for some $\varepsilon > 0$, and
$S = \{s_1, \ldots, s_n\}$, $k \in \mathbb{R}_+$, is an instance for the Steiner tree decision problem
“Is there a Steiner tree with terminals $S$ and length $\le k$?”. We construct
an instance for our problem by taking $S$ as the set of terminals, setting
$l(s) = 0 \ \forall s \in S$, $u = k$ and $f = k \, \frac{2 - \varepsilon}{\varepsilon}$.
Then the $(2 - \varepsilon)$-approximation algorithm
computes a solution consisting of one facility if and only if there is a Steiner tree
of length $\le k$.
For example, this applies to the class of finite metric spaces [12] as well as
the Euclidean plane $(\mathbb{R}^2, \ell_2)$ [6] and the rectilinear plane
$(\mathbb{R}^2, \ell_1)$ [7].
Regardless of the metric, there is no $\frac{3}{2}$-approximation algorithm unless P =
NP, as follows by transformation from the NP-complete Partition problem.
1.6 Geometric Instances
For geometric instances (in particular $(V, c) = (\mathbb{R}^2, \ell_p)$ with
$p \in \{1, 2\}$) of the Steiner tree problem, the TSP and other problems,
Arora [1] developed an approximation scheme. Similar techniques can also be
applied to our problem to get a 2-approximation. In fact, we get
$(1 + \varepsilon)$ times the service cost of an optimum solution (for any fixed
positive $\varepsilon$), but, due to violated capacity constraints, up to
twice its facility cost.
By Proposition 1, this 2-approximation is best possible. However, Arora’s
algorithm is too slow for practical purposes, and we do not even know how to
obtain a 3-approximation with a practically fast algorithm. Note that in our
VLSI application we have the rectilinear plane $(\mathbb{R}^2, \ell_1)$, and
better algorithms for this case (than the factor 4 that we will present) would be
particularly interesting.
2 A Lower Bound
Recall that the Steiner ratio α for a given metric is the minimum α ∈ R so that
for all finite sets of points S the length of a minimum spanning tree on S is at
most α times the length of a minimum Steiner tree on S. For general metrics we
have α = 2. In the metric space $(\mathbb{R}^2, \ell_1)$ it has been shown that
$\alpha = \frac{3}{2}$ ([10]).
Let (V, c, D, u, f) be an instance of our problem. We define a k-spanning (k-
Steiner) forest to be a forest F with V (F) = D (D ⊆ V (F) ⊆ V , respectively)
containing exactly k connected components.
Proposition 2. Let α be the Steiner ratio of the given metric. For $k \in \{1, \ldots, n\}$
let $F^k_{\mathrm{Steiner}}$ be the edge set of a minimum cost k-Steiner forest and
$F^k_{\mathrm{spann}}$ the edge set of a minimum cost k-spanning forest. Then
$$c(F^k_{\mathrm{Steiner}}) \le c(F^k_{\mathrm{spann}}) \le \alpha \, c(F^k_{\mathrm{Steiner}}).$$
Proof. Every k-spanning forest is a k-Steiner forest, so the first inequality holds.
If we replace every component $T$ of $F^k_{\mathrm{Steiner}}$ by a minimal spanning tree on
$T \cap D$, the result is a k-spanning forest of cost at most
$\alpha \cdot c(F^k_{\mathrm{Steiner}})$. The cost
of a minimal k-spanning forest cannot be greater.
Now we construct a sequence $F_1, \ldots, F_n$, where $(D, F_i)$ is an i-spanning forest
of minimum cost. We start by choosing $(D, F_1)$ as a minimum spanning tree.
Let $e_1, \ldots, e_{n-1}$ be the edges of $F_1$ so that $c(e_1) \ge \ldots \ge c(e_{n-1})$.
For $i = 2$ to $n$ set $F_i = F_{i-1} \setminus \{e_{i-1}\}$.
Proposition 3. (D, Fi) is an i-spanning forest of minimum cost for each i.
Proof. By induction on $i$. $(D, F_1)$ is a minimum cost 1-spanning forest by
construction. Removing an edge increases the number of components by one, so by
induction $F_i$ is an i-spanning forest.
In order to show the optimality of $F_i$ we recall the fact that for two forests
$(D, W_1)$, $(D, W_2)$ with $|W_1| < |W_2|$ there is an edge $e \in W_2 \setminus W_1$
so that $(D, W_1 \cup \{e\})$ is still a forest (matroid property, see e.g. [13, p. 281]).
Assume that $F_{i-1}$ is an $(i-1)$-spanning forest of minimum cost, and consider
$F_i$. Let $F$ be an i-spanning forest of minimum cost and $e \in F_{i-1} \setminus F$ so that
$F \cup \{e\}$ is still a forest. Recalling that $e_{i-1}$ is the most expensive edge of $F_{i-1}$
and $F \cup \{e\}$ is an $(i-1)$-spanning forest we get
$$c(F_i) + c(e_{i-1}) = c(F_{i-1}) \le c(F \cup \{e\}) = c(F) + c(e) \le c(F) + c(e_{i-1}),$$
so $(D, F_i)$ has minimum cost.
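The construction of the sequence $F_1, \ldots, F_n$ can be sketched with Kruskal's algorithm, which already produces the tree edges in non-decreasing order of cost. The function name, the use of integer vertex indices, and the `cost` callback are assumptions made for this illustration.

```python
def spanning_forest_sequence(points, cost):
    """Return [F_1, ..., F_n]; F_i is the edge list of a minimum cost i-spanning forest."""
    n = len(points)
    parent = list(range(n))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Kruskal: scan all candidate edges by increasing cost.
    candidates = sorted(
        (cost(points[a], points[b]), a, b)
        for a in range(n) for b in range(a + 1, n)
    )
    tree = []
    for w, a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((w, a, b))
    # Reverse so that tree[0] = e_1 is the most expensive edge.
    tree.reverse()
    # F_{i+1} is F_1 without its i most expensive edges.
    return [tree[i:] for i in range(n)]

forests = spanning_forest_sequence([0, 1, 3, 7], lambda p, q: abs(p - q))
forest_costs = [sum(w for w, _, _ in forest) for forest in forests]
```

For the four points on a line used here, the minimum spanning tree has edge costs 1, 2 and 4, so the forest costs drop from 7 to 3, 1 and 0 as components are added.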
We have shown:
Corollary 1. The cost of any k-Steiner forest is at least $\frac{1}{\alpha} c(F_k)$.
Proof. This follows from Proposition 2 and Proposition 3.
The next step is to compute a lower bound on the number of facilities (Steiner
trees) of a feasible solution. A feasible k-Steiner forest is a k-Steiner forest where
inequality (1) holds for each of the components T1, . . . , Tk.
Proposition 4. If $\frac{1}{\alpha} c(F_t) + d(D) > t \cdot u$, then there is no feasible
t-Steiner forest.
Proof. Let $F$ be a feasible t-Steiner forest with components $T_1, \ldots, T_t$. Summing
up all $t$ inequalities (1) we get
$$c(E(F)) + d(D) = \sum_{i=1}^{t} \bigl( c(E(T_i)) + d(D \cap V(T_i)) \bigr) \le t \cdot u.$$
By Corollary 1 we know that $c(E(F)) \ge \frac{1}{\alpha} c(F_t)$, so
$\frac{1}{\alpha} c(F_t) + d(D) \le t \cdot u$.
Let $t'$ be the smallest positive integer so that
$$\frac{1}{\alpha} c(F_{t'}) + d(D) \le t' \cdot u.$$
Thus $t'$ is a lower bound for the number of facilities (Steiner trees) of a feasible
solution. We conclude:
Theorem 1. $\min_{t \ge t'} \left( \frac{1}{\alpha} c(F_t) + t \cdot f \right)$ is a lower
bound for the cost of an optimal solution.
We denote by $t''$ the smallest $t$ for which the minimum is obtained,
$L_r := \frac{1}{\alpha} c(F_{t''})$, and $L_f := t'' \cdot f$. Then $L_r + L_f$ is a
lower bound on the cost of an optimum solution, and
$$L_r + d(D) \le L_f \frac{u}{f}. \qquad (3)$$
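Given the costs $c(F_1), \ldots, c(F_n)$, the two indices and the lower bound can be computed directly. The sketch below is an illustration under assumptions: the function name and argument layout are hypothetical, and it assumes that every single demand is at most $u$, so that a feasible index exists.

```python
def lower_bound(forest_costs, total_demand, u, f, alpha):
    """Return (t1, t2, L_r, L_f), where forest_costs[t-1] = c(F_t) for t = 1..n.

    t1 is the smallest t with c(F_t)/alpha + d(D) <= t*u, and t2 is the
    smallest minimiser of c(F_t)/alpha + t*f over all t >= t1.
    """
    n = len(forest_costs)
    t1 = next(
        t for t in range(1, n + 1)
        if forest_costs[t - 1] / alpha + total_demand <= t * u
    )
    # min() returns the first minimiser in iteration order, i.e. the smallest t.
    t2 = min(range(t1, n + 1), key=lambda t: forest_costs[t - 1] / alpha + t * f)
    return t1, t2, forest_costs[t2 - 1] / alpha, t2 * f

bound = lower_bound([7, 3, 1, 0], total_demand=2, u=4, f=1, alpha=2)
```

With these numbers one facility is ruled out (3.5 + 2 > 4), two facilities pass the test, and the lower bound is $L_r + L_f = 1.5 + 2 = 3.5$.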

3 The Algorithms
We propose three algorithms. All of them start by constructing a forest with
relatively small total cost. Then they split connected components violating (1).
For $D' \subseteq D$ and a tree $T'$ with $D' \subseteq V(T') \subseteq V$ we define
$\mathrm{load}(T', D') := c(E(T')) + d(D')$. The pair $(T', D')$ is called overloaded if
$\mathrm{load}(T', D') > u$.
Overloaded components will be split.
3.1 Algorithm A
Algorithm A first computes $F_{t''}$. If $(T', D \cap V(T'))$ is not overloaded for any
component $T'$ of $F_{t''}$, we have a feasible solution costing at most α times the
optimum.
Otherwise let $T'$ be a connected component of $F_{t''}$ where $(T', D \cap V(T'))$ is
overloaded. We will split off subtrees and reduce the load by at least $\frac{u}{2}$ for each
additional component.
Choose an arbitrary vertex $r \in V$ and orient $T'$ as an arborescence rooted
at $r$. For a vertex $v \in V$ we denote by $T_v$ the subtree of $T'$ rooted at $v$.
Now let $v \in V$ be a vertex of maximum distance (number of edges) from $r$
with $\mathrm{load}(T_v, D \cap V(T_v)) > u$. If there is a successor $w$ of $v$ with
$\mathrm{load}(T_w, D \cap V(T_w)) + c(v, w) \ge \frac{u}{2}$, then we split off $T_w$ by
removing the edge $(v, w)$.
Let $v_1, \ldots, v_q$ be the successors of $v$ with
$x_i := \mathrm{load}(T_{v_i}, D \cap V(T_{v_i})) + c(v, v_i) \le u$ ($i = 1, \ldots, q$).
Regard $x_1, \ldots, x_q$ and $x_{q+1} := d(v)$ as an instance of
the bin-packing problem with bin capacity $u$, and apply the next-fit algorithm for
bin packing. The result is a partition
$\{1, \ldots, q+1\} = b_1 \,\dot\cup\, \cdots \,\dot\cup\, b_p$ with
$\sum_{i \in b_j} x_i \le u$ for $j = 1, \ldots, p$ and
$\sum_{i \in b_j \cup b_{j+1}} x_i > u$ for $j = 1, \ldots, p - 1$.
We split off the subtrees corresponding to the first $p' := 2 \lfloor \frac{p}{2} \rfloor$
bins, and thus reduce the load by more than $\frac{u}{2} p'$.
More precisely, let $V_i := V(T_{v_i})$ ($i = 1, \ldots, q$) and $V_{q+1} := \{v\}$. We
replace $T'$ by $p' + 1$ trees: For $j = 1, \ldots, p'$ we have the pair
$(T'[\bigcup_{i \in b_j} V_i \cup \{v\}], \; D \cap \bigcup_{i \in b_j} V_i)$
(which is not overloaded; note that $v$ just serves as a Steiner point and does not
contribute to the load unless $q + 1 \in b_j = b_p$), and finally we have
$T'' := T' - \bigcup_{i \in b_1 \cup \cdots \cup b_{p'}} V_i$. Note that
$\mathrm{load}(T'', D \cap V(T'')) \le \mathrm{load}(T', D \cap V(T')) - \frac{u}{2} p'$.
Splitting continues until no overloaded components are left.
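The bin-packing part of the splitting step can be sketched as follows. The function names are hypothetical; the sketch only covers the next-fit partition of the values $x_1, \ldots, x_{q+1}$ and the choice of the first $2\lfloor p/2 \rfloor$ bins, not the tree surgery itself, and it assumes every item fits into a bin on its own.

```python
def next_fit(items, capacity):
    """Next-fit bin packing; returns the bins as lists of item indices."""
    bins, current, load = [], [], 0
    for idx, size in enumerate(items):
        # Close the current bin as soon as the next item no longer fits.
        if current and load + size > capacity:
            bins.append(current)
            current, load = [], 0
        current.append(idx)
        load += size
    if current:
        bins.append(current)
    return bins

def bins_to_split_off(items, capacity):
    """Keep the first 2 * floor(p / 2) bins, i.e. an even number of bins."""
    bins = next_fit(items, capacity)
    return bins[: 2 * (len(bins) // 2)]

packed = next_fit([3, 3, 3, 3, 3], 7)
chosen = bins_to_split_off([3, 3, 3, 3, 3], 7)
```

Any two consecutive next-fit bins together exceed the capacity, because the first item of the later bin did not fit into the earlier one; pairing up an even number of bins therefore removes more than $u/2$ of load per bin.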
Fig. 2. Splitting an overloaded component. In this example two new trees have been
split off. The new component including $v_1$, $v_2$ and $v_3$ has got an additional Steiner
point at the position of $v$.

The number of new components generated for tree $T$ is at most $\frac{2}{u} \mathrm{load}(T)$.
Thus the cost of the solution computed by algorithm A is at most
$$c(F_{t''}) + t'' f + \frac{2}{u} \bigl( c(F_{t''}) + d(D) \bigr) f = \alpha L_r + L_f + \frac{2f}{u} \bigl( \alpha L_r + d(D) \bigr).$$
Using (3) we get:
Lemma 1. Algorithm A computes a solution of cost at most
$\alpha L_r + 3 L_f + \frac{2f}{u} (\alpha - 1) L_r$.
As $\frac{f}{u} L_r \le L_f$ by (3), we get:
Corollary 2. Algorithm A computes a solution of cost at most $(2\alpha + 1)$ times the
optimum.
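For completeness, the step from Lemma 1 to Corollary 2 can be written out as below. It uses only $\frac{f}{u} L_r \le L_f$, the fact that $\alpha \ge 1$, and that $L_r + L_f$ is a lower bound on the optimum.

```latex
\begin{align*}
\alpha L_r + 3 L_f + \frac{2f}{u}(\alpha - 1) L_r
  &\le \alpha L_r + 3 L_f + 2(\alpha - 1) L_f \\
  &= \alpha L_r + (2\alpha + 1) L_f \\
  &\le (2\alpha + 1)(L_r + L_f).
\end{align*}
```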
For small values of $\frac{f}{u}$, the performance ratio is better, as can be seen by
analysing Lemma 1:
Corollary 3. For instances with $\frac{f}{u} \le \varphi$, algorithm A computes a solution of
cost at most $\max\{3, \frac{\alpha + 2\alpha\varphi + \varphi}{1 + \varphi}\}$ times the optimum.
3.2 Algorithm B
Let $L^*_r$ and $L^*_f$ be service and facility cost of an optimum solution, respectively.
Clearly,
$$L^*_r + d(D) \le L^*_f \frac{u}{f}. \qquad (4)$$
Let $c'$ be the metric defined by $c'(v, w) := \min\{c(v, w), \frac{u}{2}\}$ for all $v, w \in V$.
Algorithm B computes a Steiner tree $F$ for $D$ in $(V, c')$ with some approximation
algorithm. Then it deletes the edges of length $\frac{u}{2}$ from $F$. If the resulting forest
contains overloaded components, they are split as in algorithm A (but with $c'$
instead of $c$). Then the total number of components is at most
$1 + \frac{2}{u} (c'(F) + d(D))$.
We have a total cost of at most
$$c'(F) + f + \frac{2f}{u} \bigl( c'(F) + d(D) \bigr). \qquad (5)$$
Note that an optimum solution can be extended to a Steiner tree for $D$ by adding
$\frac{L^*_f}{f} - 1$ edges. Hence there is a Steiner tree for $D$ in $(V, c')$ of length
at most $L^*_r + (\frac{L^*_f}{f} - 1) \frac{u}{2}$.
$F$ is at most β times more expensive. Hence we can bound the total cost by
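$$\begin{aligned}
c'(F) \left( 1 + \frac{2f}{u} \right) + f + \frac{2f}{u} d(D)
&\le \beta \left( L^*_r + \frac{u L^*_f}{2f} - \frac{u}{2} \right) \left( 1 + \frac{2f}{u} \right) + f + \frac{2f}{u} d(D) \\
&\le \beta L^*_r + \frac{2 \beta f L^*_r}{u} + \frac{\beta L^*_f u}{2f} + \beta L^*_f + \frac{2f}{u} d(D) \\
&\le \beta L^*_r + \frac{2 \beta f L^*_r}{u} + \frac{\beta L^*_f u}{2f} + \beta L^*_f + 2 L^*_f - 2 \frac{f}{u} L^*_r,
\end{aligned}$$
where we used (4) in the last inequality.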

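The right-hand side is a convex function in $\frac{f}{u}$, and it is equal to
$\frac{3}{2} \beta L^*_r + 3 \beta L^*_f$ for
$\frac{f}{u} \in \{ \frac{L^*_f}{L^*_r}, \frac{\beta}{4(\beta - 1)} \}$, so it is at most
$\frac{3}{2} \beta L^*_r + 3 \beta L^*_f$ for all values in between. As
$\frac{f}{u} \le \frac{L^*_f}{L^*_r}$ by (4), we have:
Lemma 2. Algorithm B yields a 3β-approximation unless
$\frac{f}{u} < \frac{\beta}{4(\beta - 1)}$.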
A simple calculation shows:
Theorem 2. Running algorithm A and algorithm B and taking the better solution
yields a performance ratio of
$\max\{3\beta, \frac{13 + 6\alpha + \sqrt{169 - 84\alpha + 36\alpha^2}}{10}\}$. In particular
we have a performance ratio of max{3β, 3.705} for any metric.
In particular, using the Robins-Zelikovsky [16] algorithm we get a
4.648-approximation in polynomial time.
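The constants quoted here can be checked numerically. The snippet below simply evaluates the expression from Theorem 2; the function name is an assumption, and the value $1 + \frac{\ln 3}{2}$ is used as an approximation of the Robins-Zelikovsky ratio for the second evaluation.

```python
import math

def theorem2_ratio(alpha, beta):
    """Evaluate max{3*beta, (13 + 6a + sqrt(169 - 84a + 36a^2)) / 10}."""
    combined = (13 + 6 * alpha + math.sqrt(169 - 84 * alpha + 36 * alpha ** 2)) / 10
    return max(3 * beta, combined)

# General metrics have Steiner ratio alpha = 2; beta = 1 isolates the second term.
second_term = theorem2_ratio(2, 1)
# With beta close to 1 + ln(3)/2 the first term 3*beta dominates.
with_rz = theorem2_ratio(2, 1 + math.log(3) / 2)
```

For $\alpha = 2$ the second term evaluates to roughly 3.7042, consistent with the stated bound of 3.705, and $3\beta$ is roughly 4.648.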
However, we can do better by a closer look at this algorithm. Recall that
the Robins-Zelikovsky algorithm works with a parameter $k$ and analyzes all
k-restricted full components; its running time is $O(n^{4k})$ and the length of the
Steiner tree that it computes is at most
$$c'(E(Y^*)) + c'(L^*) \ln \left( 1 + \frac{c'(F_1) - c'(E(Y^*))}{c'(L^*)} \right),$$
where $(D, F_1)$ is a minimum spanning tree, $Y^*$ is any k-restricted Steiner tree
for $D$ in $(V, c')$, and $L^*$ is a loss of $Y^*$, i.e. a minimum cost subset of $E(Y^*)$
connecting each Steiner point of degree at least three in $Y^*$ to a terminal.
Let $\varepsilon > 0$ be fixed. For $k \ge 2^{1/\varepsilon}$, the optimum k-restricted
Steiner tree is at most a factor $1 + \varepsilon$ longer than an optimum Steiner tree [4].
Thus, taking an optimum k-restricted Steiner tree for each component of our optimum
solution and adding edges to make the graph connected, we get a k-restricted Steiner
tree $Y^*$ with $c'(E(Y^*)) \le (1 + \varepsilon) L^*_r + (\frac{L^*_f}{f} - 1) \frac{u}{2}$
and with loss $L^*$ of length $c'(L^*) \le \frac{1 + \varepsilon}{2} L^*_r$. Moreover,
$c'(F_1) \le \alpha L^*_r + (\frac{L^*_f}{f} - 1) \frac{u}{2}$.
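We conclude
$$\begin{aligned}
c'(F) &\le c'(E(Y^*)) + c'(L^*) \ln \left( 1 + \frac{(\alpha - 1 - \varepsilon) L^*_r}{c'(L^*)} \right) \\
&\le c'(E(Y^*)) + (\alpha - 1) L^*_r \, \frac{c'(L^*)}{(\alpha - 1) L^*_r} \ln \left( 1 + \frac{(\alpha - 1) L^*_r}{c'(L^*)} \right) \\
&\le c'(E(Y^*)) + L^*_r \, \frac{1 + \varepsilon}{2} \ln (1 + 2(\alpha - 1)),
\end{aligned}$$
as $\max\{ x \ln(1 + \frac{1}{x}) \mid 0 < x \le \frac{1 + \varepsilon}{2(\alpha - 1)} \} = \frac{1 + \varepsilon}{2(\alpha - 1)} \ln(1 + \frac{2(\alpha - 1)}{1 + \varepsilon})$.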
For α = 2 we get
$$c'(F) \le (1+\varepsilon)\left(1 + \frac{\ln 3}{2}\right) L_r^* + \left(\frac{L_f}{f} - 1\right)\frac{u}{2}.$$

Writing $\beta := (1+\varepsilon)\left(1 + \frac{\ln 3}{2}\right)$ (note that β < 1.5495 for $\varepsilon \le 10^{-4}$), and plugging
this into (5) we get that algorithm B computes a solution of total cost at most
$$c'(F)\left(1 + \frac{2f}{u}\right) + f + \frac{2f}{u} d(D) \le \beta L_r^* + \frac{2(\beta - 1) f L_r^*}{u} + \frac{L_f^* u}{2f} + 3L_f^*,$$
which is at most $(\beta + \frac{1}{2}) L_r^* + (2\beta + 1) L_f^*$ unless $\frac{f}{u} < \frac{1}{4(\beta-1)} < \frac{1}{2}$. In this case we
can run algorithm A to get a 3-approximation:
Theorem 3. Running algorithm A if 2f < u, and otherwise running algorithm
B with the Robins-Zelikovsky algorithm with parameter $k = 2^{10000}$, is a
4.099-approximation algorithm.
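The constants in Theorem 3 can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `beta` and `bound_b` are hypothetical helper names, x stands for the ratio f/u, and the optimum costs are normalised to L_r* = L_f* = 1 purely for illustration.

```python
import math


def beta(eps: float) -> float:
    """beta = (1 + eps) * (1 + ln(3)/2), as defined in the text."""
    return (1 + eps) * (1 + math.log(3) / 2)


def bound_b(x: float, b: float, l_r: float, l_f: float) -> float:
    """Right-hand side of the cost bound for algorithm B, with x = f/u."""
    return b * l_r + 2 * (b - 1) * x * l_r + l_f / (2 * x) + 3 * l_f


b = beta(1e-4)
x0 = 1 / (4 * (b - 1))       # threshold value of f/u
print(b)                     # below 1.5495
print(x0)                    # below 1/2, so algorithm A takes over when 2f < u
print(2 * b + 1)             # below 4.099
print(bound_b(x0, b, 1.0, 1.0), (b + 0.5) + (2 * b + 1))  # the two agree
```

At the threshold f/u = 1/(4(β−1)) the bound equals (β + 1/2) L_r* + (2β + 1) L_f* exactly, and the coefficient 2β + 1 gives the stated 4.099.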
3.3 Algorithm C
Let c' be the metric defined by c'(v, w) := min{c(v, w), u} for all v, w ∈ V.
Algorithm C computes a tour F for D in (V, c') using some approximation
algorithm for the TSP.
We achieve the final Steiner forest by deleting one of the longest edges of
F and splitting the resulting path into paths $P_1, \ldots, P_{k'}$ such that
$\mathrm{load}(P_i, D \cap V(P_i)) \le u$ and $\mathrm{load}(P'_i, D \cap V(P'_i)) > u$ for $i = 1, \ldots, k' - 1$,
where $P'_i$ is $P_i$ plus the edge connecting $P_i$ and $P_{i+1}$.
Due to the page limit, we only claim without proofs that
Theorem 4. Algorithm C yields a 3γ-approximation unless $\frac{f}{u} < \frac{\gamma}{2(\gamma-1)}$.
Running algorithm A and algorithm C and taking the better solution yields a
performance ratio of max{3γ, 4} for any metric.
In particular, using Christofides’ algorithm [2] we have a 4.5-approximation
algorithm. Details will be included in the full paper.
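The splitting step of algorithm C can be sketched as a greedy scan along the path. This is an illustration only, not the authors' implementation: the function name is hypothetical, the load of a piece is assumed to be its total edge length plus the total demand of its vertices, the extended piece is assumed to include the next vertex, and every single demand is assumed to be at most u.

```python
from typing import List, Sequence


def split_path(edge_len: Sequence[float], demand: Sequence[float],
               u: float) -> List[List[int]]:
    """Cut the path 0-1-...-(n-1) into consecutive pieces of load <= u.

    edge_len[i] is the length of the edge between vertices i and i+1.
    A piece is closed as soon as adding the next edge and vertex would
    push its (assumed) load above u, so every piece but the last is maximal.
    """
    pieces: List[List[int]] = []
    current = [0]
    load = demand[0]
    for v in range(1, len(demand)):
        extended = load + edge_len[v - 1] + demand[v]
        if extended <= u:
            current.append(v)
            load = extended
        else:
            pieces.append(current)   # piece is full; start a new one at v
            current = [v]
            load = demand[v]
    pieces.append(current)
    return pieces


print(split_path([1, 1, 1, 1], [1, 1, 1, 1, 1], u=4))
```

Each resulting piece then becomes one component of the Steiner forest, served by its own facility.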
3.4 Running Times and Tight Examples
In algorithm A, the time to determine the initial minimum spanning tree (D, F1)
dominates the running time.
In algorithms B and C, the running times of the Steiner tree and TSP
approximation algorithms, respectively, dominate the overall running time.
We can construct examples showing that the performance guarantee of
algorithm A is not better than five, but we don’t know whether the bounds for
algorithms B and C are tight.
3.5 Experimental Results
As mentioned in Section 1.3, the problem considered in this paper arises in VLSI
design. We implemented algorithm A for the $(\mathbb{R}^2, \ell_1)$ metric and ran it on several
instances from industrial chips.

Table 1. Experimental results.
                               inst1    inst2    inst3    inst4    inst5    inst6
number of terminals             3675    17140    45606    54831   109224   119461
total demand of terminals       9.09    38.54    97.62   121.87   250.24   258.80
min demand of terminals       0.0025   0.0022   0.0021   0.0021   0.0021   0.0021
max demand of terminals       0.0025   0.0087   0.0022   0.0086   0.0093   0.0023
facility opening cost f        0.127    0.127    0.127    0.127    0.127    0.127
capacity u                     0.148    0.110    0.110    0.110    0.124    0.110
length min. spanning tree      13.72    60.35   134.24   183.37   260.36   314.48
lower bound cost               23.07   112.70   251.06   363.28   531.05   689.19
lower bound facilities           117      638     1475     2051     3116     3998
lower bound service cost        8.21    31.68    63.73   102.80   135.32   181.45
final solution cost            32.52   174.50   377.29   531.03   762.15   981.61
final number of facilities       161      947     2171     2922     4156     5525
final service cost             12.08    54.23   101.57   159.93   234.34   279.93
lower bound factor              1.41     1.55     1.59     1.46     1.44     1.42
To improve the results and to take the special metric into account, we changed
the algorithm in three points.
By Lemma 1 the cost of the final solution is at most
$\alpha L_r + 3L_f + \frac{2f}{u}(\alpha - 1)L_f$. If we don’t remove one of the edges e when
constructing the initial forest, the number of facilities of the forest is reduced
by one and by this the cost is decreased by f while the service cost is increased
by c(e). Moreover the additional service cost has to be driven by facilities, which
increases the upper bound by $\frac{2f}{u} c(e)$. So the approximation factor is not increased
if we don’t remove edges e with $c(e) + \frac{2f}{u} c(e) \le f$. As the initial forest then has
fewer components we probably get a partition with fewer facilities.
After constructing the initial forest, we recompute each Steiner tree using
special $\ell_1$ Steiner tree heuristics (Hanan’s algorithm [8] for small instances,
Matsuyama and Takahashi’s algorithm [15] for all other instances) and take this tree
if it is shorter than the spanning tree. So we reduce the service cost.
When splitting a component at a vertex we compute all feasible bin-packings
and choose one with minimum number of bins that minimizes the load of the
remaining tree. As w.l.o.g. each vertex of an $\ell_1$-Steiner tree has degree at most
4, this doesn’t increase the computation time significantly.
Table 1 shows the computational results of six instances coming from real
clocktrees. The solution of inst1 is shown in figure 1. The computation time for
each instance was less than 3 minutes.
By Corollary 3 each computed solution costs at most 3 times the optimum.
In practice we get a factor of about 1.5 to the lower bound.
References
1. S. Arora. Polynomial time approximation schemes for euclidean tsp and other
geometric problems. In Journal of the ACM 45, pages 753–782, 1998.

2. N. Christofides. Worst-case analysis of a new heuristic for the traveling salesman
problem. Technical report, CS-93-13, G.S.I.A., Carnegie Mellon University,
Pittsburgh, 1976.
3. F. A. Chudak and D. B. Shmoys. Improved approximation algorithms for a ca-
pacitated facility location problem. In Proceedings of the 10th annual ACM-SIAM
symposium on Discrete algorithms (SODA’99), pages 875–876, 1999.
4. D.-Z. Du, Y. Zhang, and Q. Feng. On better heuristic for Euclidean Steiner min-
imum trees. In 32nd Annual IEEE Symposium on Foundations of Computer Sci-
ence, pages 431–439, 1991.
5. G. Even, N. Garg, J. Könemann, R. Ravi, and A. Sinha. Covering graphs using
trees and stars. In Proceedings of 6th International Workshop on Approximation
Algorithms for Combinatorial Optimization Problems and 7th International
Workshop on Randomization and Approximation Techniques in Computer Science
(APPROX-RANDOM), pages 24–35, 2003.
6. M. R. Garey, R. L. Graham, and D. S. Johnson. The complexity of computing
steiner minimal trees. In SIAM Journal on Applied Mathematics 32, pages 835–
859, 1977.
7. M.R. Garey and D.S. Johnson. The rectilinear steiner problem is np-complete. In
SIAM Journal on Applied Mathematics 32, pages 826–834, 1977.
8. M. Hanan. On steiner’s problem with rectilinear distance. SIAM Journal on Ap-
plied Mathematics 14(2), pages 255–265, 1966.
9. S. Held, B. Korte, J. Maßberg, M. Ringe, and J. Vygen. Clock scheduling and
clocktree construction for high performance asics. In Proceedings of the 2003
IEEE/ACM international conference on Computer-aided design (ICCAD ’03),
pages 232–240. IEEE Computer Society, 2003.
10. F.K. Hwang. On steiner minimal trees with rectilinear distance. In SIAM Journal
on Applied Mathematics 30, pages 104–114, 1976.
11. N. Karmarkar and R. M. Karp. An efficient approximation scheme for the
one-dimensional bin packing problem. In 23rd Annual IEEE Symposium on
Foundations of Computer Science, pages 312–320, 1982.
12. R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W.
Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum
Press, New York, 1972.
13. B. Korte and J. Vygen. Combinatorial Optimization, Theory and Algorithms, third
edition. Springer, 2005.
14. M. Mahdian, Y. Ye, and J. Zhang. A 2-approximation algorithm for the soft-
capacitated facility location problem. In Proceedings of 6th International Workshop
on Approximation Algorithms for Combinatorial Optimization (APPROX), pages
129–140, 2003.
15. A. Matsuyama and H. Takahashi. An approximate solution for the steiner problem
in graphs. Mathematica Japonica 24(6), pages 573–577, 1980.
16. Gabriel Robins and Alexander Zelikovsky. Improved steiner tree approximation in
graphs. In Symposium on Discrete Algorithms, pages 770–779, 2000.
17. D. Simchi-Levi. New worst-case results for the bin-packing problem. In Naval Re-
search Logistics, 41, pages 579–585, 1994.
18. P. Toth and D. Vigo, editors. The Vehicle Routing Problem. SIAM monographs on
discrete mathematics and applications. 2002.

Finding Graph Matchings in Data Streams
Andrew McGregor
Department of Computer and Information Science, University of Pennsylvania
Philadelphia, PA 19104, USA
andrewm@cis.upenn.edu
Abstract. We present algorithms for finding large graph matchings in
the streaming model. In this model, applicable when dealing with massive
graphs, edges are streamed in in some arbitrary order rather than
residing in randomly accessible memory. For ε > 0, we achieve a $\frac{1}{1+\varepsilon}$
approximation for maximum cardinality matching and a $\frac{1}{2+\varepsilon}$
approximation to maximum weighted matching. Both algorithms use a constant
number of passes and Õ(|V|) space.
1 Introduction
Given a graph G = (V, E), the Maximum Cardinality Matching (MCM) problem
is to find the largest set of edges such that no two adjacent edges are selected.
More generally, for an edge-weighted graph, the Maximum Weighted Matching
(MWM) problem is to find the set of edges whose total weight is maximized
subject to the condition that no two adjacent edges are selected. Both problems
are well studied and exact polynomial solutions are known [1–4]. The fastest
of these algorithms solves MWM with running time $O(nm + n^2 \log n)$ where
n = |V| and m = |E|.
However, for massive graphs in real world applications, the above algorithms
can still be prohibitively computationally expensive. Examples include the vir-
tual screening of protein databases. (See [5] for other examples.) Consequently
there has been much interest in faster algorithms, typically of O(m) complexity,
that find good approximate solutions to the above problems. For MCM, a linear
time approximation scheme was given by Kalantari and Shokoufandeh [6]. The
first linear time approximation algorithm for MWM was introduced by Preis [7].
This algorithm achieved a 1/2 approximation ratio. This was later improved
upon by the (2/3 − ε) linear time¹ approximation algorithm given by Drake and
Hougardy [5].
In addition to concerns about time complexity, when computing with mas-
sive graphs it is no longer reasonable to assume that we can store the entire
input graph in random access memory. In this case the above algorithms are not
applicable as they require random access to the input. With this in mind, we
consider the graph stream model discussed in [8–10]. This is a particular
formulation of the data stream model introduced in [11–14]. In this model the
edges of the graph arrive in the stream in some arbitrary order. That is, for a
graph G = (V, E) with vertex set $V = \{v_1, v_2, \ldots, v_n\}$ and edge set
$E = \{e_1, e_2, \ldots, e_m\}$, a graph stream is the sequence
$e_{i_1}, e_{i_2}, \ldots, e_{i_m}$, where $e_{i_j} \in E$ and $i_1, i_2, \ldots, i_m$ is an
arbitrary permutation of {1, 2, . . ., m}.

This work was supported by NSF ITR 0205456.
¹ Note that here, and throughout this paper, we assume that ε is an arbitrarily small
constant.
The main computational restriction of the model is that we have limited space
and therefore we cannot store the entire graph (this would require Õ(m) space).
In this paper we restrict our attention to algorithms that use Õ(n) space².
This restriction was identified as a “sweet-spot” for graph streaming in a summary
article by Muthukrishnan [8] and subsequently shown to be necessary for the
verification of even the most primitive of graph properties such as connectivity
[10]. We may however have multiple passes of the graph stream (as was assumed
in [10, 11]). To motivate this assumption one can consider external memory
systems in which seek times are typically the bottleneck when accessing data.
In this paper, we will assume that we can only have a constant number of passes.
Lastly, it is important to note that while most streaming algorithms use polylog(m)
space on a stream of length m, this is not always the case. Examples include the
streaming-clustering algorithm of Guha et al. [15] that uses $m^{\varepsilon}$ space and the
“streaming”³ algorithm of Drineas and Kannan [16] that uses space $\sqrt{m}$.
Our Results: In this paper we present algorithms that achieve the following
approximation ratios:
1. For ε > 0, a $\frac{1}{1+\varepsilon}$ approximation to maximum cardinality matching.
2. For ε > 0, a $\frac{1}{2+\varepsilon}$ approximation to maximum weighted matching.
MCM and MWM have previously been studied under similar assumptions by
Feigenbaum et al. [9]. The best previously attained results were a $\frac{1}{6}$
approximation to MWM and, for ε > 0, a $\frac{2}{3+\varepsilon}$ approximation to MCM on the
assumption that the graph is a bipartite graph. Also in the course of this paper we
tweak the $\frac{1}{6}$ approximation to MWM to give a $\frac{1}{3+2\sqrt{2}}$ approximation to MWM that
uses only one pass of the graph stream.
2 Unweighted Matchings
2.1 Preliminaries
In this section we describe a streaming algorithm that, for ε > 0, computes a
1/(1 + ε) approximation to the MCM of the streamed graph. The algorithm
will use a constant number of passes. We start by giving some basic definitions
common to many matching algorithms.

² Sometimes known as the semi-streaming space restriction.
³ Instead of “streaming,” the authors of [16] use the term “pass-efficient” algorithms.

Definition 1 (Basic Matching Theory Definitions). Given a matching M
in a graph G = (V, E), we call a vertex free if it does not appear as the end point
of any edge in M. A length 2i+1 augmenting path is a path $u_1 u_2 \ldots u_{2i+2}$ where
$u_1$ and $u_{2i+2}$ are free vertices and $(u_j, u_{j+1}) \in M$ for even j and
$(u_j, u_{j+1}) \in E \setminus M$ for odd j.
Note that if M is a matching and P is an augmenting path then $M \triangle P$
(the symmetric difference of M and P) is a matching of size strictly greater
than M. Our algorithm will start by finding a maximal matching and then, by
finding short augmenting paths, increase the size of the matching by making local
changes. Note that finding a maximal matching is easily achieved in one pass
– we select an edge iff we have not already selected an adjacent edge. Finding
maximal matchings in this way will be at the core of our algorithm and we will
make repeated use of the fact that the maximum matching has cardinality at
most twice that of any maximal matching.
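The one-pass maximal matching rule described above (select an edge iff no adjacent edge has been selected) can be sketched as follows. The function name and the representation of the stream as an iterable of vertex pairs are assumptions made for illustration.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def greedy_maximal_matching(stream: Iterable[Edge]) -> List[Edge]:
    """Single pass over the edge stream, keeping only O(n) state."""
    matched: Set[Hashable] = set()   # endpoints of edges selected so far
    matching: List[Edge] = []
    for u, v in stream:
        # Select the edge iff neither endpoint is already covered.
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


# On the path a-b-c-d the result depends on the arrival order of the edges.
print(greedy_maximal_matching([("b", "c"), ("a", "b"), ("c", "d")]))
print(greedy_maximal_matching([("a", "b"), ("b", "c"), ("c", "d")]))
```

The first order yields a single edge while the maximum matching has two, which illustrates the factor-two gap between maximal and maximum matchings.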
The following lemma establishes that, when there are few short augmenting
paths, the size of the matching found can be lower-bounded in terms of the size of
the maximum cardinality matching Opt.
Lemma 1. Let M be a maximal matching and Opt be a matching of maximum
cardinality. Consider the connected components in $\mathrm{Opt} \triangle M$. Ignoring connected
components with the same number of edges from M as from Opt, let $\alpha_i |M|$ be
the number of connected components with i edges from M. Then
$$\max_{1 \le i \le k} \alpha_i \le \frac{1}{2k^2(k+1)} \;\Rightarrow\; |M| \ge \frac{|\mathrm{Opt}|}{1 + 1/k}.$$
Proof. In each connected component with i edges from M there are either i or i+1
edges from Opt. Therefore,
$$|\mathrm{Opt}| \le \sum_{1 \le i \le k} \alpha_i \frac{i+1}{i} |M| + \frac{k+2}{k+1}\Bigl(1 - \sum_{1 \le i \le k} \alpha_i\Bigr)|M|.$$
By assumption
$$\sum_{1 \le i \le k} \alpha_i \frac{i+1}{i} + \frac{k+2}{k+1}\Bigl(1 - \sum_{1 \le i \le k} \alpha_i\Bigr) \le \frac{1}{k(k+1)} + \frac{k+2}{k+1} = 1 + 1/k.$$
The result follows.
So, if there are $\alpha_i |M|$ components in $\mathrm{Opt} \triangle M$ with i + 1 edges from Opt and
i edges from M, then there are at least $\alpha_i |M|$ length 2i + 1 augmenting paths for
M. Finding an augmenting path allows us to increase the size of M. Hence, if
$\max_{1 \le i \le k} \alpha_i$ is small we already have a good approximation to Opt whereas, if
$\max_{1 \le i \le k} \alpha_i$ is large then there exists 1 ≤ i ≤ k such that there are many length
2i + 1 augmenting paths.
2.2 Description of the Algorithm
Now that we have defined the basic notion of augmenting paths, we are in a position
to give an overview of our algorithm. We have just reasoned that, if our matching
is not already large, then there exist augmenting paths of some length no greater
than 2k + 1. Our algorithm looks for augmenting paths of each of the k different
lengths separately. Consider searching for augmenting paths of length 2i+1. See
Fig. 1 for a schematic of this process when i = 4. In this figure, (a) depicts the
graph G with heavy solid lines denoting edges in the current matching. To find
length 2i + 1 augmenting paths we randomly “project” the graph (and current
matching) into a set of graphs, $\mathcal{L}_i$, which we now define.

Fig. 1. A schematic of the procedure for finding length 9 augmenting paths. Explained
in the text.

Definition 2. Consider a graph whose n nodes are partitioned into i + 2 layers
$L_{i+1}, \ldots, L_0$ and whose edge set is a subset of $\bigcup_{1 \le j \le i+1} \{(u, v) : u \in L_j, v \in L_{j-1}\}$.
We call the family of such graphs $\mathcal{L}_i$. We call a path $u_l, u_{l-1}, \ldots, u_0$ such that
$u_j \in L_j$ an l-path.
The random projection is performed as follows. The image of a free node in
G = (V, E) is a node in either $L_{i+1}$ or $L_0$. The image of a matched edge e = (v, v')
is a node, either $u_{(v,v')}$ or $u_{(v',v)}$, in one of $L_i, \ldots, L_1$ chosen at random. The edges
in the projected graph G' are those in G that are “consistent” with the mapping
of the free nodes and the matched edges, i.e. there is an edge between a node
$u_{(v_1,v_2)} \in L_j$ and $u_{(v_3,v_4)} \in L_{j-1}$ if there is an edge $(v_2, v_3) \in E$. Now note that
an (i+1)-path in G' corresponds to a 2i+1 augmenting path in G. Unfortunately
the converse is not true: there may be 2i + 1 augmenting paths in G that do
not correspond to (i+1)-paths in G' because we only consider consistent edges.
However, we will show later that a constant fraction of augmenting paths exist
(with high probability) as (i+1)-paths in G'. In Figure 1, (b) depicts G', the
layered graph formed by randomly projecting G into $\mathcal{L}_i$.
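The random projection can be sketched in a few lines. This is a simplified illustration under assumptions, not the paper's code: the function name and data layout are hypothetical, G' is represented by labels on the original vertices rather than by contracted nodes, and the sides "a" and "b" mark the endpoint through which a matched edge is entered from the higher layer and left towards the lower layer.

```python
import random
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Vertex = Hashable
Edge = Tuple[Vertex, Vertex]
Label = Tuple[int, Optional[str]]   # (layer, side); side is None for free vertices


def create_layer_graph(edges: Iterable[Edge], matching: Iterable[Edge], i: int,
                       rng: random.Random) -> Tuple[Dict[Vertex, Label], List[Edge]]:
    """Randomly label vertices and keep the edges consistent with the labels."""
    edges = list(edges)
    matching = list(matching)
    matched = {x for e in matching for x in e}
    label: Dict[Vertex, Label] = {}
    for e in edges:
        for x in e:
            if x not in matched and x not in label:
                label[x] = (rng.choice((0, i + 1)), None)   # free: top or bottom layer
    for u, v in matching:
        j = rng.randint(1, i)                               # layer of the matched edge
        if rng.random() < 0.5:
            u, v = v, u                                     # random orientation
        label[u] = (j, "a")
        label[v] = (j, "b")
    layered: List[Edge] = []
    for x, y in edges:
        for p, q in ((x, y), (y, x)):
            (lp, sp), (lq, sq) = label[p], label[q]
            leaves_p = sp == "b" or (sp is None and lp == i + 1)
            enters_q = sq == "a" or (sq is None and lq == 0)
            if leaves_p and enters_q and lp == lq + 1:      # one layer down only
                layered.append((p, q))
    return label, layered


# Path f0 - a - b - f1 with matched edge (a, b): one length-3 augmenting path.
label, layered = create_layer_graph(
    [("f0", "a"), ("a", "b"), ("b", "f1")], [("a", "b")], 1, random.Random(0))
print(label, layered)
```

In this small example the augmenting path survives the projection only when both kept edges appear, which happens for a constant fraction of the random choices.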
We now concern ourselves with finding a nearly maximal set of node disjoint
(i+1)-paths in a graph G'. See algorithm Find-Layer-Paths in Figure 2. The
algorithm finds node disjoint (i+1)-paths by doing something akin to a depth first
search. Finding a maximal set of node disjoint (i+1)-paths can easily be achieved
in the RAM model by actually doing a DFS, deleting nodes of found (i+1)-paths
and deleting edges when backtracking. Unfortunately this would necessitate
too many passes in the streaming model as each backtrack potentially requires
another pass of the data. Our algorithm in essence blends a DFS and BFS in
such a way that we can substantially reduce the number of backtracks required.
This will come at the price of possibly stopping prematurely, i.e. when there may
still exist some (i+1)-paths that we have not located.
The algorithm first finds a maximal matching between $L_{i+1}$ and $L_i$. Let S'
be the subset of nodes of $L_i$ involved in this first matching. It then finds a maximal
matching between S' and $L_{i-1}$. We continue in this fashion, finding a matching
between S'' = {u ∈ $L_{i-1}$ : u matched to some u' ∈ $L_i$} and $L_{i-2}$. One can think
of the algorithm as growing node disjoint paths from left to right. (Fig. 1 (c)
tries to capture this idea. Here, the solid lines represent matchings between the
layers.) If the size of the maximal matching between some subset S of a level
$L_j$ and $L_{j-1}$ falls below a threshold we declare all vertices in S to be dead-ends
and conceptually remove them from the graph (in the sense that we never
again use these nodes while trying to find (i+1)-paths). At this point we start
back-tracking. It is the use of this threshold that ensures a limit on the amount of
back-tracking performed by the algorithm. However, because of the threshold,
it is possible that a vertex may be falsely declared to be a dead-end, i.e. there
may still be a node disjoint path that uses this vertex. With this in mind we
want the threshold to be low such that this does not happen often and we can
hope to find all but a few of a maximal set of node disjoint (i+1)-paths. When
we grow some node disjoint paths all the way to $L_0$, we remove these paths and
recurse on the remaining graph. For each node v, the algorithm maintains a tag
indicating if it is a “Dead End” or, if we have found an (i+1)-path involving v,
the next node in the path.
It is worth reiterating that in each pass of the stream we simply find a maxi-
mal matching between some set of nodes. The above algorithm simply determines
within which set of nodes we find a maximal matching.
Our algorithm is presented in detail in Fig. 2. Here we use the notation
$s \in_R S$ to denote choosing an element s uniformly at random from a set S. Also,
for a matching M, $\Gamma_M(u) = v$ if (u, v) ∈ M and ∅ otherwise.
2.3 Correctness and Running Time Analysis
We first argue that the use of thresholds in Find-Layer-Paths ensures that we
find all but a small number of a maximal set of (i+1)-paths.
Lemma 2 (Running Time and Correctness of Find-Layer-Paths). Given
$G' \in \mathcal{L}_i$, the Find-Layer-Paths algorithm finds at least (γ − δ)|M| of the (i+1)-paths
where γ|M| is the size of some maximal set of (i+1)-paths. Furthermore the
algorithm takes a constant number of passes.
Proof. First note that Find-Layer-Paths(·, ·, ·, l) is called with argument $\delta^{2^{i+1-l}}$.
During the running of Find-Layer-Paths(·, ·, ·, l) when we run line 15, the number
of (i+1)-paths we rule out is at most $2\delta^{2^{i+1-l}} |L_{l-1}|$ (the factor 2 comes from the
fact that a maximal matching is at least half the size of a maximum matching).
Let $E_l$ be the number of times Find-Layer-Paths(·, ·, ·, l) is called: $E_{i+1} = 1$,
$E_l \le E_{l+1}/\delta^{2^{i+1-l}}$ and therefore $E_l \le \delta^{-\sum_{0 \le j \le i-l} 2^j} = \delta^{-2^{i-l+1}+1}$. Hence, we
Algorithm Find-Matching(G, ε)
(∗ Finds a matching ∗)
Output: A matching
1. Find a maximal matching M
2. k ← ⌈1/ε⌉ + 1
3. r ← 4k^2 (8k + 10)(k − 1)(2k)^k
4. for j = 1 to r:
5.     for i = 1 to k:
6.         do M_i ← Find-Aug-Paths(G, M, i)
7.     M ← argmax_{M_i} |M_i|
8. return M

Algorithm Find-Aug-Paths(G, M, i)
(∗ Finds length 2i + 1 augmenting paths for a matching M in G ∗)
1. G' ← Create-Layer-Graph(G, M, i)
2. P = Find-Layer-Paths(G', L_{i+1}, 1/(r(2k+2)), i + 1)
3. return M △ P

Algorithm Create-Layer-Graph(G, M, i)
(∗ Randomly constructs G' ∈ L_i from a graph G and matching M ∗)
1. if v is a free vertex then l(v) ∈_R {0, i + 1}
2. if e = (u, v) ∈ M then j ∈_R [i] and l(e) ← j, l(u) ← ja and l(v) ← jb or vice versa.
3. E_i ← {(u, v) ∈ E : l(u) = i + 1, l(v) = ia}, E_0 ← {(u, v) ∈ E : l(u) = 1b, l(v) = 0}
4. for j = 0 to i + 1
5.     do L_j ← l^{−1}(j)
6. for j = 1 to i − 1
7.     do E_j ← {(u, v) ∈ E : l(u) = (j + 1)b, l(v) = ja}
8. return G' = (L_{i+1} ∪ L_i ∪ . . . ∪ L_0, E_i ∪ E_{i−1} ∪ . . . ∪ E_0)

Algorithm Find-Layer-Paths(G', S, δ, j)
(∗ Finds many j-paths from S ⊂ L_j ∗)
1. Find maximal matching M' between S and untagged vertices in L_{j−1}
2. S' ← {v ∈ L_{j−1} : ∃u, (u, v) ∈ M'}
3. if j = 1
4.     then if u ∈ Γ_{M'}(L_{j−1}) then t(u) ← Γ_{M'}(u), t(Γ_{M'}(u)) ← Γ_{M'}(u)
5.          if u ∈ S \ Γ_{M'}(L_{j−1}) then t(u) ← “Dead End”
6.          return
7. repeat
8.     Find-Layer-Paths(G', S', δ^2, j − 1)
9.     for v ∈ S' such that t(v) ≠ “Dead End”
10.        do t(Γ_{M'}(v)) ← v
11.    Find maximal matching M' between untagged vertices in S and L_{j−1}.
12.    S' ← {v ∈ L_{j−1} : ∃u, (u, v) ∈ M'}
13. until |S'| ≤ δ|L_{j−1}|
14. for v ∈ S untagged
15.     do t(v) ← “Dead End”.
16. return

Fig. 2. An Algorithm for Finding Large Cardinality Matchings. (See text for an
informal description.)
remove at most $2 E_l \delta^{2^{i+1-l}} |L_l| \le 2\delta |L_l|$. Note that when nodes are labeled as
dead-ends in a call to Find-Layer-Paths(·, ·, ·, 1), they really are dead-ends and
declaring them such rules out no remaining (i+1)-paths. Hence the total number of
paths not found is at most $2\delta \sum_{1 \le j \le i} |L_j| \le 2\delta |M|$. The number of invocations
of the recursive algorithm is
$$\sum_{1 \le l \le i+1} E_l \le \sum_{1 \le l \le i+1} \delta^{-2^{i+1-l}+1} \le \delta^{-2^{i+1}},$$
i.e. O(1), and each invocation requires one pass of the data stream to find a
maximal matching.
When looking for length 2i+1 augmenting paths for a matching M in graph
G, we randomly create a layered graph $G' \in \mathcal{L}_i$ using Create-Layer-Graph
such that (i+1)-paths in G' correspond to length 2i + 1 augmenting paths. We
now need to argue that a) many of the 2i + 1 augmenting paths in G exist in
G' as (i+1)-paths and b) that finding a maximal, rather than a maximum, set of
(i+1)-paths in G' is sufficient for our purposes.
Theorem 1. If G has $\alpha_i |M|$ length 2i + 1 augmenting paths, then the number of
(i+1)-paths found in G' is at least
$$(b_i \beta_i - \delta)|M|,$$
where $b_i = \frac{1}{2i+2}$ and $\beta_i |M|$ is a random variable distributed as $\mathrm{Bin}(\alpha_i |M|, \frac{1}{2(2i)^i})$.
Proof. Consider a length 2i + 1 augmenting path $P = u_0 u_1 \ldots u_{2i+1}$ in G. The
probability that P appears as an (i+1)-path in G' is at least
$$2\, P(l(u_0) = 0)\, P(l(u_{2i+1}) = i+1) \prod_{j \in [i]} P(l(u_{2j}) = ja \text{ and } l(u_{2j-1}) = jb) = \frac{1}{2(2i)^i}.$$
Given that the probability of each augmenting path existing as an (i+1)-path
in G' is independent, the number of (i+1)-paths in G' is distributed as
$\mathrm{Bin}(\alpha_i |M|, \frac{1}{2(2i)^i})$. The size of a maximal set of node disjoint (i+1)-paths is at
least a $\frac{1}{2i+2}$ fraction of the maximum size of a node-disjoint set of (i+1)-paths.
Combining this with Lemma 2 gives the result.
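The survival probability used in the proof can be recomputed factor by factor with exact rational arithmetic. The helper below is a hypothetical illustration: each free endpoint lands in the required layer with probability 1/2, each of the i matched edges gets the required layer with probability 1/i and the required orientation with probability 1/2, and the leading factor 2 accounts for the two directions in which the path can be laid out.

```python
from fractions import Fraction


def path_survival_probability(i: int) -> Fraction:
    """Probability that a fixed length 2i+1 augmenting path becomes an (i+1)-path."""
    free_end = Fraction(1, 2)                       # layer 0 or layer i+1
    matched_edge = Fraction(1, i) * Fraction(1, 2)  # right layer, right orientation
    return 2 * free_end * free_end * matched_edge ** i


for i in range(1, 5):
    print(i, path_survival_probability(i))
```

The values decay quickly in i, which is one source of the strong dependence of the number of passes on the approximation parameter.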
Finally, we argue that we only need to try to augment our initial matching
a constant number of times.
Theorem 2 (Correctness). With probability 1 − f, by running $O(\log \frac{1}{f})$ copies
of the algorithm Find-Matching in parallel we find a 1 − ε approximation to the
matching of maximum cardinality.
Proof. We show that the probability that a given run of Find-Matching does
not find a (1 + ε) approximation is bounded above by $e^{-1}$.
Define a phase of the algorithm to be one iteration of the loop started at
line 4 of Find-Matching. At the start of phase p of the algorithm, let $M_p$ be the
current matching. In the course of phase p of the algorithm we augment $M_p$ by
at least $|M_p|(\max_{1 \le i \le k}(b_i \beta_{i,p}) - \delta)$ edges where $\beta_{i,p}|M_p| \sim \mathrm{Bin}(\alpha_{i,p}|M_p|, \frac{1}{2(2i)^i})$.
Let $A_p$ be the value of $|M_p| \max_i (b_i \beta_{i,p})$ in the pth phase of the algorithm.
Assume that for each of the r phases of the algorithm $\max_i \alpha_{i,p} \ge \alpha^* := \frac{1}{2k^2(k-1)}$.
(By Lemma 1, if this is ever not the case, we already have a sufficiently sized
matching.) Therefore, $A_p$ dominates $b_k \mathrm{Bin}(\alpha^* |M_1|, \frac{1}{2(2k)^k})$. Let $(X_p)_{1 \le p \le r}$ be
independent random variables, each distributed as $b_k \mathrm{Bin}(\alpha^* |M_1|, \frac{1}{2(2k)^k})$.
Therefore,
$$P\Bigl(|M_1| \prod_{1 \le p \le r} \bigl(1 + \max\{0, \max_{1 \le i \le k}(b_i \beta_{i,p}) - \delta\}\bigr) \ge 2|M_1|\Bigr)$$
$$\ge P\Bigl(\sum_{1 \le p \le r} \max_{1 \le i \le k} b_i \beta_{i,p} \ge 2 + r\delta\Bigr)$$
$$\ge P\Bigl(\sum_{1 \le p \le r} X_p \ge |M_1| \frac{2 + r\delta}{b_k}\Bigr)$$
$$= P(Z \ge |M_1|(4k + 5)),$$
for $\delta = b_k/r$ where $Z = \mathrm{Bin}(\alpha^* |M_1| r, \frac{1}{2(2k)^k})$. Finally, by an application of the
Chernoff bound,
$$P(Z \ge |M_1|(4k + 5)) = 1 - P(Z < E(Z)/2) > 1 - e^{-2(8k+10)|M_1|} \ge 1 - e^{-1},$$
for $r = 2(2k)^k (8k + 10)/\alpha^*$. Of course, since $M_1$ is already at least half the size
of the maximum matching, the matching cannot more than double in size. This
implies that with high probability, at some point during the r phases our
assumption that $\max_i \alpha_{i,p} \ge \alpha^*$ became invalid and at
this point we had a sufficiently large matching.
3 Weighted Matching
We now turn our attention to finding maximum weighted matchings. Here each
edge e ∈ E of our graph G has a weight w(e) (wlog. w(e) > 0). For a set of edges
S let $w(S) = \sum_{e \in S} w(e)$. We seek to maximize w(S) subject to the constraint
that S contains no two adjacent edges.
Consider the algorithms given in Fig. 3. The algorithm Find-Weighted-Matching
can be viewed as a parameterization of the one pass algorithm given
in [9] in which γ was implicitly equal to 1. The algorithm greedily collects edges
as they stream past. The algorithm maintains a matching M at all points. On
seeing an edge e, if w(e) > (1 + γ) w({e' | e' ∈ M, e' and e share an end point})
then the algorithm removes any edges in M sharing an end point with e and
adds e to M. The algorithm Find-Weighted-Matching-Multipass generalizes this
to a multi-pass algorithm that, in effect, repeats the one pass algorithm until the
improvement yielded falls below some threshold. We start by recapping some
notation introduced in [9]. While unfortunately macabre, this notation is
nevertheless helpful for developing intuition.
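The one-pass replacement rule just described can be sketched as below. This is a minimal illustration rather than the paper's code: the function name, the (u, v, w) edge tuples and the dictionary from each covered vertex to its current matching edge are all assumptions.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

WEdge = Tuple[Hashable, Hashable, float]


def find_weighted_matching(stream: Iterable[WEdge], gamma: float) -> Set[WEdge]:
    """One pass: an edge replaces its conflicting edges if it is heavy enough."""
    at: Dict[Hashable, WEdge] = {}   # covered vertex -> matching edge covering it
    for u, v, w in stream:
        if u == v:
            continue                 # ignore self-loops
        conflicts = {at[x] for x in (u, v) if x in at}
        if w > (1 + gamma) * sum(c[2] for c in conflicts):
            for a, b, _ in conflicts:
                del at[a]            # the conflicting edges are removed
                del at[b]
            at[u] = at[v] = (u, v, w)
    return set(at.values())


stream = [("a", "b", 1), ("b", "c", 3), ("c", "d", 4)]
print(find_weighted_matching(stream, gamma=1.0))
print(find_weighted_matching(stream, gamma=0.25))
```

A larger γ makes replacements rarer, so less weight is lost to chains of removals, while a smaller γ lets the matching track heavier edges more closely; the analysis that follows balances the two effects.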
Definition 3. In a given pass of the graph stream, we say that an edge e is born
if e ∈ M at some point during the execution of the algorithm. We say that an
edge is killed if it was born but subsequently removed from M by a newer heavier
edge. This new edge murdered the killed edge. We say an edge is a survivor if it
is born and never killed. For each survivor e, let the Trail of the Dead be the set
of edges $T(e) = C_1 \cup C_2 \cup \ldots$, where $C_0 = \{e\}$, $C_1 = \{\text{the edges murdered by } e\}$,
and $C_i = \bigcup_{e' \in C_{i-1}} \{\text{the edges murdered by } e'\}$.
Lemma 3. For a given pass let the set of survivors be S. The weight of the
matching found at the end of that pass is therefore w(S).
1. w(T(S)) ≤ w(S)/γ
2. Opt ≤ (1 + γ)(w(T(S)) + 2w(S))
Proof. 1. For each murdering edge e, w(e) is at least (1 + γ) times the total weight
of the edges it murdered, and an edge has at most one murderer. Hence, for all i,
$w(C_i) \ge (1+\gamma) w(C_{i+1})$ and therefore
$(1+\gamma) w(T(e)) = \sum_{i \ge 1} (1+\gamma) w(C_i) \le \sum_{i \ge 0} w(C_i) = w(T(e)) + w(e)$.
The first point follows.
2. We can charge the costs of edges in Opt to the edges of S ∪ T(S) such that each
edge e ∈ T(S) is charged at most (1 + γ)w(e) and each edge e ∈ S is charged at
most 2(1 + γ)w(e). See [9] for details.
Hence in the one pass algorithm we get a $\frac{1}{\frac{1}{\gamma} + 3 + 2\gamma}$ approximation ratio since
$$\mathrm{Opt} \le (1 + \gamma)(w(T(S)) + 2w(S)) \le \Bigl(3 + \frac{1}{\gamma} + 2\gamma\Bigr) w(S).$$
The maximum of this function is achieved for $\gamma = \frac{1}{\sqrt{2}}$, giving approximation ratio
$\frac{1}{3 + 2\sqrt{2}}$. This represents only a slight improvement over the 1/6 ratio attained
previously. However, a much more significant improvement is realized in the
multi-pass algorithm Find-Weighted-Matching-Multipass.
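The choice of γ can be verified with a one-line calculus argument; the worked step below only uses the bound $1/\gamma + 3 + 2\gamma$ from the text, with $g$ introduced here as a shorthand.

```latex
\begin{align*}
g(\gamma) &= \frac{1}{\gamma} + 3 + 2\gamma, &
g'(\gamma) &= -\frac{1}{\gamma^2} + 2 = 0 \iff \gamma = \frac{1}{\sqrt{2}},\\
g\!\left(\tfrac{1}{\sqrt{2}}\right) &= \sqrt{2} + 3 + \sqrt{2} = 3 + 2\sqrt{2} \approx 5.83, &
g(1) &= 1 + 3 + 2 = 6 .
\end{align*}
```

Since $g$ is convex for $\gamma > 0$, the stationary point is the minimum of $g$ and hence the maximum of the ratio $1/g$; the value at $\gamma = 1$ recovers the earlier 1/6 ratio.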
Theorem 3. The algorithm Find-Weighted-Matching-Multipass finds a $\frac{1}{2(1+\varepsilon)}$
approximation to the maximum weighted matching. Furthermore, the number of
passes required is at most
$$\frac{\log(3/2 + \sqrt{2})}{\log\Bigl(1 + \frac{(2\varepsilon/3)^3}{(1 + 2\varepsilon/3)^2 - (2\varepsilon/3)^3}\Bigr)} + 1.$$
Proof. First we prove that the number of passes is as claimed. We increase the
weight of our solution by a factor 1 + κ each time we do a pass and we start with a
$1/(3 + 2\sqrt{2})$ approximation. Hence, if we take $\log_{1+\kappa}(3/2 + \sqrt{2})$ passes we have
already found a maximum weighted matching. Substituting in $\kappa = \frac{\gamma^3}{(1+\gamma)^2 - \gamma^3}$
establishes the bound on the number of passes.
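To get a feeling for how the pass bound grows as the approximation parameter shrinks, the quantities from the proof can be evaluated directly. This sketch assumes the bound $\log_{1+\kappa}(3/2 + \sqrt{2}) + 1$ from the proof, with γ set to two thirds of the approximation parameter as in Fig. 3; the function name is hypothetical.

```python
import math
from typing import Tuple


def multipass_parameters(eps: float) -> Tuple[float, float, float]:
    """Return (gamma, kappa, pass bound) for a given approximation parameter."""
    gamma = 2 * eps / 3
    kappa = gamma ** 3 / ((1 + gamma) ** 2 - gamma ** 3)
    # Each non-final pass gains a factor 1 + kappa, starting from a
    # 1/(3 + 2*sqrt(2)) approximation.
    passes = math.log(1.5 + math.sqrt(2)) / math.log(1 + kappa) + 1
    return gamma, kappa, passes


for eps in (0.3, 0.1):
    gamma, kappa, passes = multipass_parameters(eps)
    print(eps, gamma, kappa, math.ceil(passes))
```

The bound is independent of the graph size but grows roughly like the inverse cube of the approximation parameter, since κ is of order γ³.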

Algorithm Find-Weighted-Matching(G, γ)
(∗ Finds Large Weighted Matchings in One Pass ∗)
1. M ← ∅
2. for each edge e ∈ G
3.     do if w(e) > (1 + γ) w({e' | e' ∈ M, e' and e share an end point})
4.         then M ← M ∪ {e} \ {e' | e' ∈ M, e' and e share an end point}
5. return M

Algorithm Find-Weighted-Matching-Multipass(G, ε)
(∗ Finds Large Weighted Matchings ∗)
1. γ ← 2ε/3
2. κ ← γ^3 / ((1 + γ)^2 − γ^3)
3. Find a 1/(3 + 2√2) weighted matching, M
4. repeat
5.     S ← w(M)
6.     for each edge e ∈ G
7.         do if w(e) > (1 + γ) w({e' | e' ∈ M, e' and e share an end point})
8.             then M ← M ∪ {e} \ {e' | e' ∈ M, e' and e share an end point}
9. until w(M)/S ≤ 1 + κ
10. return M

Fig. 3. Algorithms for Finding Large Weighted Matchings.
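Let $M_i$ be the matching constructed after the i-th pass. Let $B_i = M_i \cap M_{i-1}$.
Now, $(1 + \gamma)(w(M_{i-1}) - w(B_i)) \le w(M_i) - w(B_i)$ and so,
$$\frac{w(M_i)}{w(M_{i-1})} = \frac{w(M_i)}{w(M_{i-1}) - w(B_i) + w(B_i)} \ge \frac{(1 + \gamma) w(M_i)}{w(M_i) + \gamma w(B_i)}.$$
If $\frac{w(M_i)}{w(M_{i-1})} < (1 + \kappa)$, then we deduce that $w(B_i) \ge \frac{\gamma - \kappa}{\gamma + \gamma\kappa} w(M_i)$. Appealing
to Lemma 3, this means that, for all i,
$$\mathrm{OPT} \le (1/\gamma + 3 + 2\gamma)(w(M_i) - w(B_i)) + 2(1 + \gamma) w(B_i),$$
since edges in $B_i$ have empty trails of the dead. So if $w(B_i) \ge \frac{\gamma - \kappa}{\gamma + \gamma\kappa} w(M_i)$ we
get that,
$$\mathrm{OPT} \le (1/\gamma + 3 + 2\gamma)(w(M_i) - w(B_i)) + 2(1 + \gamma) w(B_i)$$
$$\le \Bigl(1/\gamma + 3 + 2\gamma - (1/\gamma + 1)\frac{\gamma - \kappa}{\gamma + \gamma\kappa}\Bigr) w(M_i)$$
$$\le (2 + 3\gamma) w(M_i).$$
Since $\gamma = \frac{2\varepsilon}{3}$ the claimed approximation ratio follows.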
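The last step of the chain can be checked by direct substitution of the value of κ set in Fig. 3, namely $\kappa = \gamma^3 / ((1+\gamma)^2 - \gamma^3)$. A worked version, which shows that the final inequality in fact holds with equality for this κ:

```latex
\begin{align*}
1 + \kappa &= \frac{(1+\gamma)^2}{(1+\gamma)^2 - \gamma^3}, \qquad
\gamma - \kappa = \frac{\gamma\,(1 + 2\gamma - \gamma^3)}{(1+\gamma)^2 - \gamma^3},
\qquad\text{so}\qquad
\frac{\gamma - \kappa}{\gamma + \gamma\kappa} = \frac{1 + 2\gamma - \gamma^3}{(1+\gamma)^2},\\[4pt]
\frac{1}{\gamma} + 3 + 2\gamma - \Bigl(\frac{1}{\gamma} + 1\Bigr)\frac{1 + 2\gamma - \gamma^3}{(1+\gamma)^2}
&= \frac{(1+\gamma)(1 + 3\gamma + 2\gamma^2) - (1 + 2\gamma - \gamma^3)}{\gamma(1+\gamma)}
 = \frac{2\gamma + 5\gamma^2 + 3\gamma^3}{\gamma(1+\gamma)}
 = 2 + 3\gamma .
\end{align*}
```

With γ equal to two thirds of the approximation parameter, $2 + 3\gamma$ is exactly twice one plus that parameter, which matches the ratio claimed in Theorem 3.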

4 Conclusions and Open Questions
New constant pass streaming algorithms, using Õ(n) space, have been presented
for the MCM and MWM problems. The MCM algorithm uses a novel randomized
technique that allows us to find augmenting paths and thereby find a
matching of size Opt/(1 + ε). The MWM algorithm builds upon previous work
to find a matching whose weight is at least Opt/(2 + ε).
It is worth asking if there exists a streaming algorithm that achieves a 1/(1+ε)
approximation for the MWM problem. It is possible, although non-trivial, to
extend some of the ideas for the MCM problem to deal with the case when edges
are weighted. The main problem lies in the fact that, in the weighted case, there
are “augmenting cycles” in addition to (weight-)augmenting paths (naturally
defined). Unfortunately, finding cycles in the streaming model seems to be
inherently difficult. In particular, this author fears that lower bounds concerning
finding the girth of a graph [10] and finding common neighborhoods [17] may
suggest a negative result.
Finally, the careful reader may have noticed that the number of passes
required by the MCM algorithm has a rather strong dependence on ε in the sense
that, as ε becomes small, the number of passes necessary grows very quickly.
This paper was rather cavalier with the dependence on ε as we were primarily
concerned with ensuring that the number of passes was independent of n.
However, for the sake of practical applications, a weaker dependence would be
desirable.
References
1. Edmonds, J.: Maximum matching and a polyhedron with 0,1-vertices. J. Res. Nat.
Bur. Standards 69 (1965) 125–130
2. Gabow, H.N.: Data structures for weighted matching and nearest common ances-
tors with linking. In: Proc. ACM-SIAM Symposium on Discrete Algorithms. (1990)
434–443
3. Hopcroft, J.E., Karp, R.M.: An $n^{5/2}$ algorithm for maximum matchings in bipartite
graphs. SIAM J. Comput. 2 (1973) 225–231
4. Micali, S., Vazirani, V.: An $O(\sqrt{V}E)$ algorithm for finding maximum matching
in general graphs. In: Proc. 21st Annual IEEE Symposium on Foundations of
Computer Science. (1980)
5. Drake, D.E., Hougardy, S.: Improved linear time approximation algorithms for
weighted matchings. In Arora, S., Jansen, K., Rolim, J.D.P., Sahai, A., eds.:
RANDOM-APPROX. Volume 2764 of Lecture Notes in Computer Science.,
Springer (2003) 14–23
6. Kalantari, B., Shokoufandeh, A.: Approximation schemes for maximum cardinal-
ity matching. Technical Report LCSR–TR–248, Laboratory for Computer Science
Research, Department of Computer Science. Rutgers University (1995)
7. Preis, R.: Linear time 1/2-approximation algorithm for maximum weighted match-
ing in general graphs. In Meinel, C., Tison, S., eds.: STACS. Volume 1563 of Lecture
Notes in Computer Science., Springer (1999) 259–269
8. Muthukrishnan, S.: Data streams: Algorithms and applications. (2003) Available
at http://athos.rutgers.edu/~muthu/stream-1-1.ps.

9. Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., Zhang, J.: On graph problems
in a semi-streaming model. In: Proc. 31st International Colloquium on Automata,
Languages and Programming. (2004) 531–543
10. Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., Zhang, J.: Graph distances in
the streaming model: The value of space. Proc. 16th ACM-SIAM Symposium on
Discrete Algorithms (2005)
11. Munro, J., Paterson, M.: Selection and sorting with limited storage. Theoretical
Computer Science 12 (1980) 315–323
12. Henzinger, M.R., Raghavan, P., Rajagopalan, S.: Computing on data streams.
Technical Report 1998-001, DEC Systems Research Center (1998)
13. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the
frequency moments. Journal of Computer and System Sciences 58 (1999) 137–147
14. Feigenbaum, J., Kannan, S., Strauss, M., Viswanathan, M.: An approximate L1
difference algorithm for massive data streams. SIAM Journal on Computing 32
(2002) 131–151
15. Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams. In:
Proc. 41st IEEE Symposium on Foundations of Computer Science. (2000) 359–366
16. Drineas, P., Kannan, R.: Pass efficient algorithms for approximating large matrices.
In: Proc. 14th ACM-SIAM Symposium on Discrete Algorithms. (2003) 223–232
17. Buchsbaum, A.L., Giancarlo, R., Westbrook, J.: On finding common neighbor-
hoods in massive graphs. Theor. Comput. Sci. 1-3 (2003) 707–718

A Primal-Dual Approximation Algorithm for
Partial Vertex Cover: Making Educated Guesses
Julián Mestre
Department of Computer Science
University of Maryland, College Park, MD 20742
jmestre@cs.umd.edu
Abstract. We study the partial vertex cover problem, a generalization
of the well-known vertex cover problem. Given a graph
G = (V, E) and an integer s, the goal is to cover all but s edges by
picking a set of vertices with minimum weight. The problem is clearly
NP-hard as it generalizes the vertex cover problem. We provide a primal-dual
2-approximation algorithm which runs in O(V log V + E) time. This
represents an improvement in running time over the fastest previously
known algorithm.
Our technique can also be applied to a more general version of the problem.
In the partial capacitated vertex cover problem each vertex
u comes with a capacity $k_u$ and a weight $w_u$. A solution consists of a
function $x : V \to \mathbb{N}_0$ and an orientation of all but s edges, such that
the number of edges oriented toward any vertex u is at most $x_u k_u$. The
cost of the cover is given by $\sum_{v \in V} x_v w_v$. Our objective is to find a cover
with minimum cost. We provide an algorithm with the same performance
guarantee as for regular partial vertex cover. No algorithm was previously
known for this problem.
1 Introduction
The vertex cover problem has been widely studied [12, 16]. In the simplest
version, we are given a graph G = (V, E) and are required to pick a minimum
subset of vertices to cover all the edges of the graph. An edge is said to be covered
if at least one of its incident vertices is picked. The problem is NP-hard and was
one of the first problems shown to be NP-hard in Karp’s seminal paper [15].
However, several different approximation algorithms have been developed for
it [12]. These algorithms produce solutions of cost within twice the optimum
in polynomial time.
Recently several generalizations of the vertex cover problem have been considered.
Bshouty and Burroughs [4] were the first to study the partial vertex
cover problem. In this generalization we are not required to cover all edges of
the graph: any s edges may be left uncovered. By allowing some fixed number
of edges to be left uncovered we hope to obtain a cover with substantially lower
cost. It can be regarded as a trade-off between cost and coverage.

Research supported by NSF Awards CCR 0113192 and CCF 0430650
The paper by Bshouty and Burroughs [4] was the first to give a factor 2
approximation using LP-rounding. Subsequently, combinatorial algorithms with
the same approximation guarantee were proposed. Hochbaum [13] presented an
$O(EV \log(V^2/E) \log V)$ time algorithm; Bar-Yehuda [1] developed an $O(V^2)$ time
algorithm using the local-ratio technique; finally, a primal-dual algorithm which
runs in $O(V(E + V \log V))$ time was given by Gandhi et al. [8].
The main result of this paper is a primal-dual 2-approximation algorithm for
partial vertex cover which runs in $O(V \log V + E)$ time. This represents an
improvement in running time over the previously known best solution for the problem.
One additional benefit of our method is that it can be used to solve a more
general version of the problem.
Motivated by a problem in computational biology, the capacitated vertex
cover problem was proposed by Guha et al. [9]. For each vertex $v \in V$ we are
given a capacity $k_v$ and a weight $w_v$. A capacitated vertex cover is a function
$x : V \to \mathbb{N}_0$ such that there exists an orientation of all edges in which the number
of edges directed into $v \in V$ is at most $k_v x_v$. When e is oriented toward v we
say e is covered by or assigned to v. The weight of the cover is $\sum_{v \in V} w_v x_v$, and we
want to find a cover with minimum weight. Factor 2 approximation algorithms
have been developed [7, 9] for this problem.
By allowing s edges to be left unassigned we get a new problem: the partial
capacitated vertex cover problem. Thus we bring together two generalizations
of the original vertex cover problem that have only been studied separately.
Using the primal-dual algorithm of [9] for regular capacitated vertex cover
coupled with our method we show a 2-approximation for the more general problem.
For vertex cover, the best (known) approximation factor for general graphs is
2 − o(1) [3, 10] but the best constant factor remains 2. Some algorithms [2, 6, 11]
that achieve this latter bound rank among the first approximation algorithms and
can be interpreted using the primal-dual method, where a maximal dual solution
is found and the cost of the cover is charged to the dual variables. It is interesting
to see how far the same primal-dual method can be taken to tackle our much
more general version of vertex cover.
Finally we mention that the techniques used in this paper are not restricted to
vertex-cover-like problems. For instance, the primal-dual algorithm of Jain and
Vazirani [14] for the facility location problem can be used in conjunction
with our method to design a faster approximation algorithm for the partial
version of the problem which was studied by Charikar et al. [5].
2 LP Formulation
Let us describe an integer linear program formulation for the partial vertex cover
problem. Each vertex $v \in V$ has an associated variable $x_v \in \{0, 1\}$ which indicates
if v is picked for the cover. For every edge $e \in E$ either one of its endpoints is
picked or the edge is left uncovered; the variable $p_e \in \{0, 1\}$ indicates the latter.
The number of uncovered edges cannot exceed s. Figure 1 shows the LP relaxation
of the IP just described.

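\[
\begin{array}{lll}
\min & \sum_{v \in V} w_v x_v & \\
\text{subject to:} & p_e + x_u + x_v \ge 1 & \forall\, e = \{u, v\} \in E \\
 & \sum_{e \in E} p_e \le s & \\
 & x_u,\ p_e \ge 0 &
\end{array}
\]
Fig. 1. LP for Partial Vertex Cover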
As it stands, the LP exhibits an unbounded integrality gap. Consider the
following example: a star where the center node has degree d, all leaves have
unit weight and the center node has weight 10. Suppose we must cover two
edges, that is, s = d − 2. The optimal integral solution (OPT) must pick two
leaves, so its cost is 2. On the other hand, a fractional
solution can pick 2/d of the center node; every edge is then covered by this amount,
which adds up to 2 over the d edges. The cost is 20/d, and choosing d big enough
makes the cost as small as desired.
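The star example can be checked numerically. The snippet below is only an illustrative sketch: the function name `star_gap` and its parameters are hypothetical, and it evaluates the specific fractional solution described above rather than solving the LP.

```python
from fractions import Fraction


def star_gap(d, center_weight=10):
    """Return (integral optimum, cost of the fractional solution) for the star.

    The star has d unit-weight leaves, a center of weight center_weight,
    and s = d - 2, so two edges must be covered.
    """
    # Integral: two edges need either two leaves (cost 2) or the center.
    integral_opt = min(2, center_weight)
    # Fractional: x_center = 2/d and p_e = 1 - 2/d on every edge, so the
    # p_e sum to d - 2 = s and every constraint p_e + x_u + x_v >= 1 holds.
    x_center = Fraction(2, d)
    fractional_cost = center_weight * x_center
    return integral_opt, fractional_cost


for d in (10, 100, 1000):
    opt, frac = star_gap(d)
    print(d, opt, frac, opt / frac)   # the ratio opt / frac grows linearly in d
```

The ratio printed in the last column equals d/10, which is the sense in which the gap is unbounded.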
Suppose we knew that $h \in V$ is the most expensive vertex in OPT. We
modify the problem slightly by forcing the solution to pick h and disallowing
picking more expensive vertices. This modification does not increase the cost
of the integral optimal solution, but it does narrow the integrality gap. In the
star example above, once we know that a leaf is the most expensive vertex
in OPT, the center can no longer be picked and the modification effectively
closes the integrality gap. Provided
we know which is the most expensive vertex in OPT, Gandhi et al. [8] made use
of the above modification to develop a primal-dual 2-approximation algorithm
for partial vertex cover. Unfortunately we do not know it, so we must guess. Their
procedure is run on each possible choice of h, and the best cover produced is
returned. Each run takes $O(V \log V + E)$ time and therefore the whole procedure
runs in $O(V(V \log V + E))$ time.
Our approach is along these lines, but instead of doing exhaustive guessing
we run our algorithm only once and make different guesses along the way. One
may say that the algorithm makes educated guesses; this allows us to bring down
the running time to $O(V \log V + E)$.
3 Primal-Dual Algorithm
The dual program for partial vertex cover is given in Figure 2. By δ(v) we refer
to the edges incident to vertex v.
Initially all dual variables are set to 0, and all edges are unassigned. The
algorithm works in iterations, each consisting of a pruning step followed by a
dual update step. As the algorithm progresses we build a cover C ⊆ V , which
initially is empty. Along the way we may disallow picking some vertices, we keep
track of these with the set R, which also starts off empty.
\[
\begin{array}{lll}
\max & \sum_{e \in E} y_e - sz & \\
\text{subject to:} & \sum_{e \in \delta(v)} y_e \le w_v & \forall\, v \in V \\
 & y_e \le z & \forall\, e \in E \\
 & y_e,\ z \ge 0 &
\end{array}
\]
Fig. 2. The dual for Partial Vertex Cover
In the pruning step we check every vertex $v \in V \setminus (C \cup R)$. If C + v is a
feasible cover, we guess C + v as a possible cover and then we disallow picking
v, i.e., we set $w_v \leftarrow \infty$ and add it to R. Notice that many vertices may be
guessed/disallowed in a single pruning step.
If $|E(R)| > s$, where E(R) denotes the edges with both endpoints in R, we must
stop as there is no hope of finding a feasible cover
anymore, even if all vertices in $V \setminus R$ are picked. We thus return, among the
guessed covers, the one with the least weight.
In the dual update step we uniformly raise z and the $y_e$ variables of unassigned
edges until some vertex u becomes tight; notice that $u \in V \setminus R$. If multiple
vertices become tight then arbitrarily pick one to process. Vertex u is added to
C, free edges in δ(u) are assigned to u and their dual variables are frozen. After
this is done we proceed to the next iteration. Observe that after adding u the
cover is still not feasible, otherwise u would have been guessed and disallowed
in the pruning step.
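The pruning and dual update steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation analyzed in the paper: it uses plain linear scans instead of heaps (so it does not achieve the stated running time), it assumes there are no self-loops, it takes E(R) to be the edges with both endpoints in R, and the name `partial_vertex_cover` and its argument layout are hypothetical.

```python
from fractions import Fraction


def partial_vertex_cover(n, weights, edges, s):
    """Return (weight, cover) of the cheapest guessed cover.

    Vertices are 0..n-1, edges is a list of pairs (no self-loops assumed),
    and at most s edges may stay uncovered.
    """
    if len(edges) <= s:
        return 0, set()                       # the empty cover is feasible
    incident = [[] for _ in range(n)]         # edge indices around each vertex
    for i, (a, b) in enumerate(edges):
        incident[a].append(i)
        incident[b].append(i)
    free = [True] * len(edges)                # True while an edge is unassigned
    free_deg = [len(incident[v]) for v in range(n)]
    frozen = [Fraction(0)] * n                # sum of frozen y_e around a vertex
    free_count = len(edges)
    C, R, guesses = set(), set(), []
    while True:
        # Pruning step: if C + v is feasible, record the guess and disallow v.
        for v in range(n):
            if v not in C and v not in R and free_count - free_deg[v] <= s:
                weight = sum(weights[x] for x in C) + weights[v]
                guesses.append((weight, C | {v}))
                R.add(v)
        # Stop once more than s edges have both endpoints in R.
        if sum(1 for a, b in edges if a in R and b in R) > s:
            return min(guesses, key=lambda g: g[0])
        # Dual update step: v becomes tight when frozen[v] + free_deg[v] * z
        # reaches weights[v]; open the vertex for which this happens first.
        candidates = [v for v in range(n)
                      if v not in C and v not in R and free_deg[v] > 0]
        u = min(candidates, key=lambda v: (weights[v] - frozen[v]) / free_deg[v])
        z = (weights[u] - frozen[u]) / free_deg[u]
        C.add(u)
        for i in incident[u]:
            if free[i]:                       # assign the edge to u, freeze y_e = z
                free[i] = False
                a, b = edges[i]
                other = b if a == u else a
                frozen[other] += z
                free_deg[other] -= 1
                free_count -= 1
        free_deg[u] = 0
```

For instance, on the star with four unit-weight leaves, a center of weight 10 and s = 2, calling `partial_vertex_cover(5, [10, 1, 1, 1, 1], [(0, 1), (0, 2), (0, 3), (0, 4)], 2)` first guesses the center alone, then opens one leaf and guesses that leaf together with each remaining leaf, and finally returns a two-leaf cover.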
The algorithm terminates when $|E(R)| > s$; at this point we know OPT must
use at least one vertex in R. Let h be the first such vertex to be added to R,
and C the cover at the moment h was guessed. By definition C + h is a feasible
solution; suppose its cost is at most twice that of OPT. The algorithm returns
the guessed cover with minimum weight, thus it constitutes a 2-approximation
for partial vertex cover. Now it all boils down to proving the following lemma.
Lemma 1. Let h be the first vertex in OPT to be added to R, and C the cover
constructed by the algorithm when h was guessed. Then w(C + h) ≤ 2 OPT.
As a warm-up we first show that $w(C + h) \le 2\,\mathrm{OPT} + w_h$; since h belongs
to OPT we have $w_h \le \mathrm{OPT}$, so this implies a
3-factor approximation. To do so we modify the problem slightly by disallowing
picking vertices added to R before h; this does not change the cost of the integral
optimal solution. In fact the algorithm does exactly this by changing the weight
of those vertices to ∞. We denote by $\mathrm{OPT}'$, $\mathrm{LP}'$ and $\mathrm{DL}'$ the cost of the optimal
integral, fractional primal and dual solutions for the new problem. Note that
$\mathrm{DL}' = \mathrm{LP}' \le \mathrm{OPT}' = \mathrm{OPT}$.
The dual solution (y, z) constructed by the algorithm at the moment h was
guessed constitutes a feasible solution for the dual of the modified problem. Every
vertex $v \in C$ is tight ($w_v = \sum_{e \in \delta(v)} y_e$), thus charging once the dual variables of
edges incident to v is enough to pay for v. Assigned edges are charged at most
twice, therefore $w(C) \le 2 \sum_{e \text{ assigned}} y_e$. Since the set C is not a feasible solution
more than s edges remain unassigned. These edges have $y_e = z$ and are not
charged, therefore $w(C) \le 2(\sum_{e \in E} y_e - sz) \le 2\,\mathrm{OPT}$, and the bound follows.

Proof (of Lemma 1). To obtain the desired bound we must modify the problem
even further: in addition to disallowing picking vertices added to R before h we
force every solution to pick h. Like before this should not change the cost of the
integral optimal solution and so the cost of a feasible dual solution for the new
problem is a lower bound on the cost of OPT.
These modifications to the problem bring some changes to the LP formulation.
The weights of the vertices that came to R before h are set to ∞. Since we
are forced to pick h, we set $x_h = 1$. As a result a constant $w_h$ term appears in
the objective function of both the primal and the dual and the dual constraint
for $x_h$ disappears. Every edge $e \in \delta(h)$ is covered by h, so the primal constraint
for e and the variable $p_e$ go away; in the dual this means variable $y_e$ and the
constraint $y_e \le z$ disappear. The dual solution (y, z) constructed by the algorithm
is not feasible as variables $y_e$ for $e \in \delta(h)$ have non-zero value. We construct
another solution $y'$ by letting $y'_e = 0$ for $e \in \delta(h)$ and $y'_e = y_e$ otherwise.$^1$ The
cost of the new solution $(y', z)$ is $\sum_{e \in E} y'_e - sz + w_h$.
If we proceed as before to account for the cost of vertices $v \in C$ we run into
two problems. The first comes from neighbors of h for which
$w_v = \sum_{e \in \delta(v)} y'_e + y_{vh}$. This means we cannot fully pay for v just using
$\sum_{e \in \delta(v)} y'_e$, and so we
consider the term $y_{vh}$ to be overcharge. Thus the total overcharge from neighbors
of h is at most $\sum_{e \text{ assigned},\, e \in \delta(h)} y_e$.
The second problem arises when trying to pay for l, the last vertex added to
C. Suppose that just before l became tight there were f free edges; note that
these edges have $y_e = z$. When l is added to C, $t_l$ of these edges are assigned to
it; since C is not feasible we have $f - t_l > s$. Assume that of the remaining free
edges, $t_h$ are incident to h. Because C + h is feasible we have $f - (t_l + t_h) \le s$.
After we switch from $y$ to $y'$ only $f - t_h$ edges remain with $y'_e = z$; notice that
$f - t_h > s$, otherwise h would have been guessed earlier. Among these edges, s
should be left uncharged as the term $-sz$ in the objective function cancels their
contribution. Vertex l needs to charge $t_l$ of these edges, but only $f - t_h - s \le t_l$
are available. The shortfall is $t_l - (f - t_h - s) < t_h$; again we consider this as
overcharge, which is bounded by $t_h z = \sum_{e \text{ unassigned},\, e \in \delta(h)} y_e$.
The total overcharge is therefore $\sum_{e \in \delta(h)} y_e \le w_h$ and so the lemma follows:
\[
w(C + h) \le 2\Bigl(\sum_{e \in E} y'_e - sz\Bigr) + w_h + w_h
\le 2\Bigl(\sum_{e \in E} y'_e - sz + w_h\Bigr) \le 2\,\mathrm{OPT}
\]

For the implementation we keep two heaps for the vertices. One heap tells
us when a vertex will become tight, which is easily determined by looking at its
weight, the current value of the dual variables of edges incident to it and the
number of those edges that are free. The second heap tells us how many free
edges are incident to the vertex.
$^1$ Although dual variables for $e \in \delta(h)$ really disappear, the analysis is cleaner if we
keep them but force them to be 0.

In the dual update step we take out the vertex which will become tight the
soonest and update the heap key for its neighbors. In the pruning step we look
at the second heap and fetch vertices with the largest number of free incident
edges. Checking if the addition of such a vertex to C makes the solution feasible
can be done in constant time by keeping track of how many edges C covers.
All in all at most 2|V| remove and at most 2|E| increase/decrease key operations
are performed. Using Fibonacci heaps this takes $O(V \log V + E)$ time. This
finishes the proof of the main result of the paper.
Theorem 1. There is a 2-approximation algorithm for Partial Vertex Cover
which runs in $O(V \log V + E)$ time.
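The key stored in the first heap described above can be sketched as follows. A vertex with weight w, frozen dual sum F on its assigned edges and d free edges becomes tight when F + d·z = w. The code is an illustrative sketch only: `tight_level` and `LazyMinHeap` are hypothetical names, and Python's binary heap with lazy invalidation is used as a stand-in for a Fibonacci heap, so it gives O((V + E) log V) rather than the bound of Theorem 1.

```python
import heapq


def tight_level(weight, frozen_sum, free_degree):
    """Value of z at which a vertex becomes tight, or None if it has no free edges."""
    if free_degree == 0:
        return None
    return (weight - frozen_sum) / free_degree


class LazyMinHeap:
    """Binary heap with lazy key updates (a stand-in for a Fibonacci heap)."""

    def __init__(self):
        self._heap = []    # (key, item) pairs, possibly stale
        self._key = {}     # current key of every live item

    def update(self, item, key):
        """Insert item or change its key; old entries become stale."""
        self._key[item] = key
        heapq.heappush(self._heap, (key, item))

    def remove(self, item):
        """Forget item, e.g. when a vertex is disallowed."""
        self._key.pop(item, None)

    def pop_min(self):
        """Return (key, item) with the smallest current key, or None if empty."""
        while self._heap:
            key, item = heapq.heappop(self._heap)
            if self._key.get(item) == key:   # skip stale entries
                del self._key[item]
                return key, item
        return None


heap = LazyMinHeap()
heap.update("a", tight_level(6, 0, 3))    # tight at z = 2
heap.update("b", tight_level(10, 0, 2))   # tight at z = 5
# A neighbor of b is opened at z = 2: one edge freezes with y_e = 2.
heap.update("b", tight_level(10, 2, 1))   # now tight at z = 8
print(heap.pop_min())
```

Freezing an incident edge never lowers the key of a vertex, so in this setting every update is an increase-key operation.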
4 Generalizations
Our technique can be used to solve a more general version of the vertex cover
problem. In the partial capacitated vertex cover problem every vertex $u \in V$
has a capacity $k_u$ and a weight $w_u$. A single copy of u can cover at most $k_u$
edges incident to it; the number of copies picked is given by $x_u \in \mathbb{N}_0$. Every
edge $e = \{u, v\} \in E$ is either left uncovered ($p_e = 1$) or is assigned to one of its
endpoints; variables $y_{eu}$ and $y_{ev}$ indicate this. The number of edges assigned to
a vertex u cannot exceed $k_u x_u$, while the ones left unassigned cannot be more
than s. Figure 3 shows the LP relaxation of the program just described. An
additional constraint $x_v \ge y_{ev}$ is needed to fix the integrality gap.
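\[
\begin{array}{lll}
\min & \sum_{v \in V} w_v x_v & \\
\text{subject to:} & p_e + y_{eu} + y_{ev} \ge 1 & \forall\, e = \{u, v\} \in E \\
 & k_v x_v \ge \sum_{e \in \delta(v)} y_{ev} & \forall\, v \in V \\
 & x_v \ge y_{ev} & \forall\, v \in e \in E \\
 & \sum_{e \in E} p_e \le s & \\
 & p_e,\ y_{ev},\ x_v \ge 0 &
\end{array}
\]
Fig. 3. LP for partial capacitated vertex cover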
Like before our algorithm works in iterations, each having a pruning and a
dual update step. For the latter we follow the algorithm in [9] which is described
here for completeness. We will use the dual program of Figure 4.
Initially all variables are set to 0 and all edges are unassigned. A vertex u
is said to be high degree if more than ku unassigned edges are incident to u,
otherwise u is low degree. As we will see a vertex may start off as high degree
and later become low degree as edges get assigned to its neighbors. We define
Lu to be the set of unassigned edges incident to u at the moment u becomes low
degree. For vertices that start off as low degree we define Lu = δ(u).
\[
\begin{array}{lll}
\max & \sum_{e \in E} \alpha_e - sz & \\
\text{subject to:} & k_v q_v + \sum_{e \in \delta(v)} l_{ev} \le w_v & \forall\, v \in V \\
 & \alpha_e \le q_v + l_{ev} & \forall\, v \in e \in E \\
 & \alpha_e \le z & \forall\, e \in E \\
 & q_v,\ l_{ev},\ \alpha_e,\ z \ge 0 &
\end{array}
\]
Fig. 4. The dual
In the dual update step we raise $\alpha_e$ for all unassigned edges. Because of the
constraint $\alpha_e \le q_v + l_{ev}$ one of the terms on the right hand side must be raised by
the same amount. Which variable is increased depends on the nature of v: if v is
high degree we raise $q_v$, otherwise we raise $l_{ev}$. When some vertex u becomes tight
we open it: if u is high degree we assign to it all the free edges in δ(u), otherwise
u is low degree and all edges in $L_u$ are assigned to it. Notice that in the latter
case some of the edges in $L_u$ might have already been assigned. In particular this
means that some of the edges originally assigned to a high degree vertex when
it was opened may later be taken away by the opening of low degree neighbors.
Therefore the decision of how many copies of a vertex to pick is deferred until
the end of the algorithm. In what follows when we talk about the current solution
we mean the current assignment of edges.
In the pruning step when processing a vertex u, although in principle we are
allowed to pick multiple copies of a vertex, we check if adding just one copy of
u to the current solution makes it feasible. If so, we guess the solution plus u and
then disallow picking u, i.e., we add it to R and set $w_u \leftarrow \infty$.
A consequence of the pruning step is that if in the dual update step the
vertex u to become tight is low degree then opening u cannot make the solution
feasible. On the other hand when u is high degree opening u may make the
solution feasible, in which case the algorithm ends prematurely: we assign just
enough edges to make the solution feasible, i.e., s edges are left unassigned, and
we return the best solution among the previously guessed covers and the current
solution. If on the other hand this does not happen and R grows to the point
where $|E(R)| > s$ we stop and return the best guessed solution.
Let us first consider the case where the algorithm ends prematurely. We will
show that we can construct a solution with cost at most $2(\sum_{e \in E} \alpha_e - sz)$. Constructing
the solution is simple: we pick enough copies of every vertex to cover the edges
assigned to it.
Let u be one of the opened vertices, and suppose u was low degree
when opened; this means only one copy is needed. If u started off as low degree
then $w_u = \sum_{e \in L_u} l_{eu} = \sum_{e \in L_u} \alpha_e$ as $q_u = 0$. On the other hand if u became low
degree afterwards then $|L_u| = k_u$ and only edges in $L_u$ have nonzero $l_{eu}$, thus:
\[
w_u = q_u k_u + \sum_{e \in L_u} l_{eu} = \sum_{e \in L_u} (q_u + l_{eu}) = \sum_{e \in L_u} \alpha_e
\]
In either case we can pay for u by charging once the edges in $L_u$.
Now let us consider the case when u is high degree. By definition more than
$k_u$ edges were assigned to u when it became tight; note that some of these edges
may later be taken away by low degree neighbors. Let $R_u$ be the edges assigned
to u at the end of the algorithm. There are two cases to consider. If $|R_u| < k_u$
we only need one copy of u, and charging once the edges originally assigned to u
(which are more than $k_u$ and have $\alpha_e = q_u$) is enough to pay for $w_u = k_u q_u$.
Otherwise $\lceil |R_u| / k_u \rceil$ copies of u are needed. Unfortunately we cannot pay for them
using $\sum_{e \in R_u} \alpha_e = |R_u| q_u = \frac{|R_u|}{k_u} w_u$, which is enough to pay for the first
$\lfloor |R_u| / k_u \rfloor$ copies but may be less than $\lceil |R_u| / k_u \rceil w_u$. The pathological case
happens when $|R_u| = k_u + 1$: two copies are needed but edges in $R_u$ only have a
dual cost of $w_u (1 + \frac{1}{k_u})$. The key observation is that edges in $R_u$ will not be
charged from the other side, therefore we can charge any $k_u$ edges twice to pay
for the extra copy.
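The arithmetic behind the double charging can be checked on small numbers. The sketch below is illustrative only: the function `copies_and_payment` is a hypothetical helper, and it assumes a high degree vertex that ends with at least as many assigned edges as its capacity and whose edges all carry a dual value equal to its weight divided by its capacity.

```python
from fractions import Fraction
from math import ceil


def copies_and_payment(k_u, w_u, r_u):
    """Return (copies, cost, single_charge, double_charge) for a high degree vertex.

    k_u is the capacity, w_u the weight and r_u >= k_u the number of edges
    assigned to the vertex at the end.
    """
    q_u = Fraction(w_u, k_u)                  # tightness: w_u = k_u * q_u
    copies = ceil(Fraction(r_u, k_u))         # copies needed to hold r_u edges
    cost = copies * w_u
    single_charge = r_u * q_u                 # every edge charged once
    double_charge = single_charge + k_u * q_u # any k_u edges charged a second time
    return copies, cost, single_charge, double_charge


# Pathological case: capacity 3, weight 6 and k_u + 1 = 4 edges assigned.
copies, cost, single, double = copies_and_payment(3, 6, 4)
print(copies, cost, single, double)
```

In this instance two copies cost 12, charging each edge once only collects 8, and charging three of the edges a second time collects 14, which is enough.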
How many times can a single edge be charged? At most twice, either from a
single endpoint or once from each endpoint. We are leaving s edges uncharged
with $\alpha_e = z$, therefore the solution can be paid with $2(\sum_{e \in E} \alpha_e - sz)$.
Before ending prematurely the algorithm may have disallowed some vertices.
If none of them is used by OPT, modifying the problem by disallowing picking
them does not increase the cost of the integral optimal solution and therefore
the cost of the dual solution at the end is a lower bound on OPT. It follows from
the above analysis that the cover found has cost at most 2 OPT. Otherwise at
least one of the guesses was correct. Likewise if the algorithm terminates because
$|E(R)| > s$ there must be at least one correct guess. In either case let h be the
first vertex added to R which is used by OPT.
We define a new problem where we are not allowed to pick vertices added
to R before h and we are forced to pick at least one copy of h. Notice that we
place no restriction on which edges h covers and that extra copies may be picked,
therefore the cost of the integral optimal solution for the new problem does not
change. The primal changes consist in replacing $x_h$ with $x_h + 1$. As a result the
dual objective function becomes $\sum_{e \in E} \alpha_e - sz - k_h q_h - \sum_{e \in \delta(h)} l_{eh} + w_h$.
The dual solution $(\alpha, q, l, z)$ constructed by the algorithm when h was guessed
is a feasible solution for the dual of the new problem. As usual the cost of this
solution is a lower bound on OPT.
Consider the edge assignment when h was guessed. Before constructing the
cover we need to modify the assignment. If h is low degree then assign edges in
$L_h$ to it, otherwise assign any $k_h$ free edges incident to h. After this is done fewer
than s edges may be left unassigned, so we need to free some of the edges assigned
to the last vertex to become tight so as to leave exactly s edges unassigned with
$\alpha_e = z$. Now build the cover by picking as many copies of each vertex as needed
to cover the edges assigned to it.
By switching to the modified problem we gain a $w_h$ term in the objective function
of the dual, while some cost is lost, namely
$k_h q_h + \sum_{e \in \delta(h)} l_{eh} = \sum_{e \in D} \alpha_e$
where D is the set of edges assigned to h. We can regard the contribution of edges
in D to the dual cost as disappearing. In turn, this causes some overcharge when
paying for the rest of the solution. An argument similar to the one used for regular
partial vertex cover shows that the overcharge is at most $\sum_{e \in D} \alpha_e \le w_h$. Add
the cost of h and the whole solution can be paid with twice the cost of the dual,
which is at most 2 OPT.

Returning the solution with minimum weight among the ones guessed guarantees
producing a cover with cost at most twice that of the optimal solution.
The algorithm can be implemented using Fibonacci heaps to find out which
vertex will become tight next and to do the pruning; the total running time is
$O(V \log V + E)$. The section is thus summarized in the following theorem.
Theorem 2. There exists a 2-approximation algorithm for Partial Capacitated
Vertex Cover which runs in $O(V \log V + E)$ time.
5 Conclusion
We have developed two algorithms for partial vertex cover problems. Our
pruning/dual update technique is quite general and may be applied to other covering
problems where a primal-dual algorithm already exists for the special case of s = 0. For
instance a similar algorithm may be developed for facility location using Jain
and Vazirani’s algorithm [14] for the dual update step.
Acknowledgment
The author would like to thank Samir Khuller for suggesting developing an
algorithm for partial vertex cover that does not use exhaustive guessing.
References
1. R. Bar-Yehuda. Using homogenous weights for approximating the partial cover
problem. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA’99), pages 71–75, 1999.
2. R. Bar-Yehuda and S. Even. A linear time approximation algorithm for approxi-
mating the weighted vertex cover. Journal of Algorithms, 2:198–203, 1981.
3. R. Bar-Yehuda and S. Even. A local-ratio theorem for approximating the weighted
vertex cover problem. Annals of Discrete Mathematics, 25:27–46, 1985.
4. N. Bshouty and L. Burroughs. Massaging a linear programming solution to give a
2-approximation for a generalization of the vertex cover problem. In Proceedings
of the 15th Annual Symposium on the Theoretical Aspects of Computer Science
(STACS’98), pages 298–308, 1998.
5. M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility
location problems with outliers. In Proceedings of the 12th Annual ACM-SIAM
Symposium on Discrete Algorithms (SODA’01), pages 642–651, 2001.
6. K. L. Clarkson. A modification of the greedy algorithm for vertex cover. Information
Processing Letters, 16(1):23–25, 1983.
7. R. Gandhi, S. Khuller, S. Parthasarathy, and A. Srinivasan. Dependent round-
ing in bipartite graphs. In Proceedings of the 43rd Annual IEEE Symposium on
Foundations of Computer Science (FOCS’02), pages 323–332, 2002.
8. R. Gandhi, S. Khuller, and A. Srinivasan. Approximation algorithms for partial
covering problems. In Proceedings of the 11th International Colloquium on Au-
tomata, Languages, and Programming (ICALP’01), pages 225–236, 2001.

9. S. Guha, R. Hassin, S. Khuller, and E. Or. Capacitated vertex covering with ap-
plications. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA’02), pages 858–865, 2002.
10. E. Halperin. Improved approximation algorithms for the vertex cover problem in
graphs and hypergraphs. In Proceedings of the 11th Annual ACM-SIAM Sympo-
sium on Discrete Algorithms (SODA’00), pages 329–337, 2000.
11. D. S. Hochbaum. Approximation algorithms for the set covering and vertex cover
problems. SIAM Journal on Computing, 11:385–393, 1982.
12. D. S. Hochbaum, editor. Approximation Algorithms for NP–hard Problems. PWS
Publishing Company, 1997.
13. D. S. Hochbaum. The t-vertex cover problem: Extending the half integrality
framework with budget constraints. In Proceedings of the 1st International Work-
shop on Approximation Algorithms for Combinatorial Optimization Problems (AP-
PROX’98), pages 111–122, 1998.
14. K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location
and k-median problems using the primal-dual schema and lagrangian relaxation.
Journal of the ACM, 48(2):274–296, 2001.
15. R. M. Karp. Reducibility among combinatorial problems. In Complexity of Com-
puter Computations, pages 85–103. Plenum Press, 1972.
16. V. V. Vazirani. Approximation Algorithms. Springer-Verlag, 2001.

Efficient Approximation of Convex Recolorings
Shlomo Moran$^1$ and Sagi Snir$^2$
$^1$ Computer Science dept., Technion, Haifa 32000, Israel
moran@cs.technion.ac.il
$^2$ Mathematics dept., University of California, Berkeley, CA 94720, USA
ssagi@math.berkeley.edu
Abstract. A coloring of a tree is convex if the vertices that pertain to
any color induce a connected subtree; a partial coloring (which assigns
colors to some of the vertices) is convex if it can be completed to a
convex (total) coloring. Convex coloring of trees arises in areas such as
phylogenetics and linguistics. For example, a perfect phylogenetic tree is one in
which the states of each character induce a convex coloring of the tree.
Research on perfect phylogeny is usually focused on finding a tree so that
a few predetermined partial colorings of its vertices are convex.
When a coloring of a tree is not convex, it is desirable to know “how far”
it is from a convex one. In [MS05], a natural measure for this distance,
called the recoloring distance, was defined: the minimal number of color
changes at the vertices needed to make the coloring convex. This can be
viewed as minimizing the number of “exceptional vertices” w.r.t. a
closest convex coloring. The problem was proved to be NP-hard even for
colored strings.
In this paper we continue the work of [MS05], and present a 2-approximation
algorithm for convex recoloring of strings whose running time is
$O(cn)$, where c is the number of colors and n is the size of the input, and
an $O(cn^2)$ 3-approximation algorithm for convex recoloring of trees.
1 Introduction
A phylogenetic tree is a tree which represents the course of evolution for a given
set of species. The leaves of the tree are labeled with the given species. Internal
vertices correspond to hypothesized, extinct species. A character is a biologi-
cal attribute shared among all the species under consideration, although every
species may exhibit a different character state. Mathematically, if X is the set of
species under consideration, a character on X is a function C from X into a set C
of character states. A character on a set of species can be viewed as a coloring of
the species, where each color represents one of the character’s states. A natural
biological constraint is that the reconstructed phylogeny have the property that
each of the characters could have evolved without reverse or convergent
transitions: In a reverse transition some species regains a character state of some old
ancestor whilst its direct ancestor has lost this state. A convergent transition
occurs if two species possess the same character state, while their least common
ancestor possesses a different state.

A preliminary version of the results in this paper appeared in [MS03].
This research was supported by the Technion VPR-fund and by the Bernard Elkin
Chair in Computer Science.

In graph theoretic terms, the lack of reverse and convergent transitions means
that the character is convex on the tree: for each state of this character, all
species (extant and extinct) possessing that state induce a single block, which is
a maximal monochromatic subtree. Thus, the above discussion implies that in a
phylogenetic tree, each character is likely to be convex or “almost convex”. This
makes convexity a fundamental property in the context of phylogenetic trees
to which a lot of research has been dedicated throughout the years. The Per-
fect Phylogeny (PP) problem, whose complexity was extensively studied (e.g.
[Gus91, KW94, AFB96, KW97, BFW92, Ste92]), seeks for a phylogenetic tree
that is simultaneously convex on each of the input characters. Maximum par-
simony (MP) [Fit81, San75] is a very popular tree reconstruction method that
seeks for a tree which minimizes the parsimony score defined as the number of
mutated edges summed over all characters (therefore, PP is a special case of
MP). [GGP+96] introduce another criterion to estimate the distance of a phylogeny
from convexity. They define the phylogenetic number as the maximum
number of connected components a single state induces on the given phylogeny
(obviously, phylogenetic number one corresponds to a perfect phylogeny). Con-
vexity is a desired property in other areas of classification, beside phylogenetics.
For instance, in [Be00, BDFY01] a method called TNoM is used to classify genes,
based on data from gene expression extracted from two types of tumor tissues.
The method finds a separator on a binary vector, which minimizes the number
of “1” in one side and “0” in the other, and thus defines a convex vector of minimum
Hamming distance to the given binary vector. In [HTD+04], distance from
convexity is used (although not explicitly) to show strong connection between
strains of Tuberculosis and their human carriers.
In a previous work [MS05], we defined and studied a natural distance from
a given coloring to a convex one: the recoloring distance. In the simplest, un-
weighted model, this distance is the minimum number of color changes at the
vertices needed to make the given coloring convex (for strings this reduces to
Hamming distance from a closest convex coloring). This model was extended
to a weighted model, where changing the color of a vertex v costs a nonnega-
tive weight w(v). The most general model studied in [MS05] is the non-uniform
model, where the cost of coloring vertex v by a color d is an arbitrary nonnegative
number cost(v, d).
It was shown in [MS05] that finding the recoloring distance in the unweighted
model is NP-hard even for strings (trees with two leaves), and a few dynamic
programming algorithms for exact solutions of a few variants of the problem were
presented.
In this work we present two polynomial time, constant ratio approximation
algorithms, one for strings and one for trees. Both algorithms are for the weighted
(uniform) model. The algorithm for strings is based on a lower bound technique
which assigns penalties to colored trees. The penalties can be computed in linear

194 Shlomo Moran and Sagi Snir
time, and once a penalty is computed, a recoloring whose cost is smaller than
the penalty is computed in linear time. The 2-approximation follows by showing
that for a string, the penalty is at most twice the cost of an optimal convex
recoloring. This last result does not hold for trees, where a different technique
is used. The algorithm for trees is based on a recursive construction that uses a
variant of the local ratio technique [BY00, BYE85], which allows adjustments of
the underlying tree topology during the recursive process.
The rest of the paper is organized as follows. In the next section we present
the notations and define the models used. In Section 3 we define the notion of
penalty which provides lower bounds on the optimal cost of convex recoloring
of any tree. In Section 4, we present the 2-approximation algorithm for the
string. In Section 5 we briefly explain the local ratio technique, and present
the 3-approximation algorithm for the tree. We conclude and point out future
research directions in Section 6.
2 Preliminaries
A colored tree is a pair (T, C) where T = (V, E) is a tree with vertex set V =
{v1, . . . , vn}, and C is a coloring of T, i.e., a function from V onto a set of colors
𝒞. For a set U ⊆ V, C|U denotes the restriction of C to the vertices of U, and
C(U) denotes the set {C(u) : u ∈ U}. For a subtree T′ = (V(T′), E(T′)) of T,
C(T′) denotes the set C(V(T′)). A block in a colored tree is a maximal set of
vertices which induces a monochromatic subtree. A d-block is a block of color d.
The number of d-blocks is denoted by nb(C, d), or nb(d) when C is clear from
the context. A coloring C is said to be convex if nb(C, d) = 1 for every color
d ∈ 𝒞. The number of d-violations in the coloring C is nb(C, d) − 1, and the total
number of violations of C is $\sum_{d \in \mathcal{C}} (nb(C, d) - 1)$. Thus a coloring C is convex
iff the total number of violations of C is zero (in [FBL03] the above sum, taken
over all characters, is used as a measure of the distance of a given phylogenetic
tree from perfect phylogeny).
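As a small illustration of these definitions, the following sketch counts d-blocks and violations for a tree given as an edge list. The function names `count_blocks` and `total_violations` and the edge-list/dictionary encoding are assumptions made for this example, not notation from the paper. The sketch relies on the fact that the vertices of one color induce a forest, so the number of d-blocks equals the number of d-vertices minus the number of edges whose two endpoints are both colored d.

```python
from collections import Counter

def count_blocks(edges, coloring):
    """Return {d: nb(C, d)} for a total coloring of a tree.

    edges    -- list of (u, v) pairs forming a tree (hypothetical encoding)
    coloring -- dict mapping every vertex to its color
    """
    blocks = Counter(coloring.values())      # start with one block per vertex
    for u, v in edges:
        if coloring[u] == coloring[v]:       # a monochromatic edge merges two blocks
            blocks[coloring[u]] -= 1
    return dict(blocks)

def total_violations(edges, coloring):
    """Sum over all colors of nb(C, d) - 1; zero iff the coloring is convex."""
    return sum(nb - 1 for nb in count_blocks(edges, coloring).values())
```

For the path 0 - 1 - 2 colored a, b, a, color a has two blocks and the coloring has one violation.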
The definition of convex coloring is extended to partially colored trees, in
which the coloring C assigns colors to some subset of vertices U ⊆ V, which
is denoted by Domain(C). A partial coloring is said to be convex if it can be
extended to a total convex coloring (see [SS03]). Convexity of partial and total
colorings has a simple characterization by the concept of carriers: For a subset
U of V, carrier(U) is the minimal subtree that contains U. For a colored tree
(T, C) and a color d ∈ 𝒞, carrier_T(C, d) (or carrier(C, d) when T is clear) is
the carrier of C^{-1}(d). We say that C has the disjointness property if for each
pair of distinct colors {d, d′} it holds that carrier(C, d) ∩ carrier(C, d′) = ∅. It is easy to
see that a total or partial coloring C is convex iff it has the disjointness property
(in [DS92] convexity is actually defined by the disjointness property).
When some (total or partial) input coloring (T, C) is given, any other coloring
C′ of T is viewed as a recoloring of the input coloring C. We say that a recoloring
C′ of C retains (the color of) a vertex v if C(v) = C′(v), otherwise C′ overwrites
v. Specifically, a recoloring C′ of C overwrites a vertex v either by changing the
color of v, or just by uncoloring v. We say that C′ retains (overwrites) a set of
vertices U if it retains (overwrites resp.) every vertex in U. For a recoloring C′ of
an input coloring C, X_C(C′) (or just X(C′)) is the set of the vertices overwritten
by C′, i.e.
\[ X_C(C') = \{ v \in V : [v \in \mathrm{Domain}(C)] \wedge [(v \notin \mathrm{Domain}(C')) \vee (C(v) \neq C'(v))] \}. \]
With each recoloring C′ of C we associate a cost, denoted as cost_C(C′) (or
cost(C′) when C is understood), which is the number of vertices overwritten by
C′, i.e. cost_C(C′) = |X_C(C′)|. A coloring C* is an optimal convex recoloring of C,
or in short an optimal recoloring of C, and cost_C(C*) is denoted by OPT(T, C),
if C* is a convex coloring of T, and cost_C(C*) ≤ cost_C(C′) for any other convex
coloring C′ of T.
The above cost function naturally generalizes to the weighted version: the
input is a triplet (T, C, w), where w : V → R^+ ∪ {0} is a weight function which
assigns to each vertex v a nonnegative weight w(v). For a set of vertices X,
w(X) = $\sum_{v \in X} w(v)$. The cost of a convex recoloring C′ of C is cost_C(C′) =
w(X(C′)), and C′ is an optimal convex recoloring if it minimizes this cost.
The above unweighted and weighted cost models are uniform, in the sense that
the cost of a recoloring is determined by the set of overwritten vertices, regardless
of the specific colors involved. [MS05] also defines a more subtle non-uniform model,
which is not studied in this paper.
Let AL be an algorithm which receives as an input a weighted colored tree
(T, C, w) and outputs a convex recoloring of (T, C, w), and let AL(T, C, w) be
the cost of the convex recoloring output by AL. We say that AL is an r-
approximation algorithm for the convex tree recoloring problem if for all inputs
(T, C, w) it holds that AL(T, C, w)/OPT (T, C, w) ≤ r [GJ79, Hoc97, Vaz01].
We complete this section with a definition and a simple observation which
will be useful in the sequel. Let (T, C) be a colored tree. A coloring C* is an
expanding recoloring of C if in each block of C* at least one vertex v is retained
(i.e., C(v) = C*(v)).
Observation 1 Let (T = (V, E), C, w) be a weighted colored tree, where w(V) > 0.
Then there exists an expanding optimal convex recoloring of C.
Proof. Let C′ be an optimal recoloring of C which uses a minimum number
of colors (i.e. |C′(V)| is minimized). We shall prove that C′ is an expanding
recoloring of C.
Since w(V) > 0, the claim is trivial if C′ uses just one color. So assume for
contradiction that C′ uses at least two colors, and that for some color d used
by C′, there is no vertex v s.t. C(v) = C′(v) = d. Then there must be an edge
(u, v) such that C′(u) = d but C′(v) = d′ ≠ d. Therefore, in the uniform cost
model, the coloring C″ which is identical to C′ except that all vertices colored d
are now colored by d′ is an optimal recoloring of C which uses a smaller number
of colors, a contradiction.
In view of Observation 1 above, we assume in the sequel (sometimes implic-
itly) that the given optimal convex recolorings are expanding.

Fig. 1. C′ is a convex recoloring for C which defines the following penalties:
p_green(C′) = 1, p_red(C′) = 2, p_blue(C′) = 3.
3 Lower Bounds via Penalties
In this section we present a general lower bound on the recoloring distance of
weighted colored trees. Although for a general tree this bound can be fairly poor,
in the next section we present an algorithm for convex recoloring of strings, which
always finds a convex recoloring whose cost is at most twice this lower bound,
and hence it is a 2-approximation algorithm for strings.
Let (T, C, w) be a weighted colored tree. For a color d and U ⊆ V(T) let:
\[ \mathrm{penalty}_{C,d}(U) = w\bigl(U \cap \overline{C^{-1}(d)}\bigr) + w\bigl(\overline{U} \cap C^{-1}(d)\bigr) \]
Informally, when the vertices in U induce a subtree, penalty_{C,d}(U) is the total
weight of the vertices which must be overwritten to make U the unique d-block
in the coloring: a vertex v must be overwritten either if v ∈ U and C(v) ≠ d, or
if v ∉ U and C(v) = d.
penalty_C(C′), the penalty of a convex recoloring C′ of C, is the sum of the
penalties of all the colors, with respect to the color blocks of C′:
\[ \mathrm{penalty}_C(C') = \sum_{d \in \mathcal{C}} \mathrm{penalty}_{C,d}\bigl(C'^{-1}(d)\bigr) \]
Figure 1 depicts the calculation of a penalty associated with a convex recoloring
C′ of C.
In the sequel we assume that the input colored tree (T, C) is fixed, and omit
it from the notations.
Claim. penalty(C′) = 2cost(C′)
Proof. From the definitions we have
\[ \mathrm{penalty}(C') = \sum_{d \in \mathcal{C}} w\bigl(\{v \in V : C'(v) = d \text{ and } C(v) \neq d\} \cup \{v \in V : C'(v) \neq d \text{ and } C(v) = d\}\bigr) = 2w(\{v \in V : C'(v) \neq C(v)\}) = 2\,\mathrm{cost}(C') \]
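The identity penalty(C′) = 2cost(C′) can be checked numerically on small inputs. The sketch below does so for total colorings of a string stored as Python lists; the function names and the list encoding are assumptions made for this example. Each recolored vertex v is counted once for the color C′(v) (it lies in that block but had another color) and once for the color C(v) (it had that color but lies outside its block).

```python
def cost(C, C_new, w):
    """Total weight of the vertices whose color differs between C and C_new."""
    return sum(w[v] for v in range(len(C)) if C[v] != C_new[v])

def penalty(C, C_new, w):
    """Sum over colors d of penalty_{C,d}(U_d), where U_d = C_new^{-1}(d).

    The sum ranges over every color used by C or C_new (an assumption that
    keeps the identity valid even if C_new introduces a new color).
    """
    total = 0
    for d in set(C) | set(C_new):
        for v in range(len(C)):
            in_block = C_new[v] == d         # v belongs to U_d
            has_color = C[v] == d            # v is a d-vertex of the input
            if in_block != has_color:        # exactly one holds: v is penalized for d
                total += w[v]
    return total
```

For example, recoloring (r, b, r, g) with weights (1, 2, 1, 1) to (r, r, r, g) costs 2 and has penalty 4.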

Fig. 2. A convex recoloring must overwrite at least one of the large lateral blocks (a
triangle or a rectangle).
As can be seen in Figure 1, penalty(C′) = 6 while cost(C′) = 3.
For each color d, p*_d is the penalty of a block which minimizes the penalty
for d:
\[ p^*_d = \min\{ \mathrm{penalty}_d(V(T')) : T' \text{ is a subtree of } T \} \]
Corollary 1. For any recoloring C′ of C,
\[ \sum_{d \in \mathcal{C}} p^*_d \le \sum_{d \in \mathcal{C}} \mathrm{penalty}_d\bigl(C'^{-1}(d)\bigr) = 2\,\mathrm{cost}(C'). \]
Proof. The inequality follows from the definition of p*_d, and the equality from
the Claim above.
Corollary 1 above provides a lower bound on the cost of convex recoloring
of trees. It can be shown that this lower bound can be quite poor for trees,
that is: OPT(T, C) can be considerably larger than $(\sum_{d \in \mathcal{C}} p^*_d)/2$. For example,
any convex recoloring of the tree in Figure 2 must overwrite at least one of the
(large) lateral blocks in the tree, while $(\sum_{d \in \mathcal{C}} p^*_d)/2$ in that tree is the weight of
the (small) central vertex (the circle). However, in the next section we show that
this bound can be used to obtain a polynomial time 2-approximation for convex
recoloring of strings.
4 A 2-Approximation Algorithm for Strings
Let a weighted colored string (S, C, w), where S = (v_1, . . . , v_n), be given. For
1 ≤ i ≤ j ≤ n, S[i, j] is the substring (v_i, v_{i+1}, . . . , v_j) of S. The algorithm starts
by finding for each d a substring B_d = S[i_d, j_d] for which penalty_d(S[i_d, j_d]) = p*_d.
It is not hard to verify that B_d consists of a subsequence of consecutive vertices
in which the difference between the total weight of d-vertices and the total weight
of other vertices (i.e. w(B_d ∩ C^{-1}(d)) − w(B_d \ C^{-1}(d))) is maximized, and
thus B_d can be found in linear time. We say that a vertex v is covered by color
d if it belongs to B_d. v is covered if it is covered by some color d, and it is free
otherwise.
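One way to find B_d in linear time is a maximum-sum substring scan (Kadane's algorithm) over signed weights: +w(v) for d-vertices and −w(v) for all others. The paper does not name a specific procedure, so the sketch below is one possible realization; the function name `best_block`, the 0-based inclusive indices and the tie-breaking rule are assumptions made for this example. It uses the fact that penalty_d(B) = w(C^{-1}(d)) − (w(B ∩ C^{-1}(d)) − w(B \ C^{-1}(d))), so maximizing the signed sum minimizes the penalty.

```python
def best_block(colors, weights, d):
    """Return ((i, j), p) where S[i..j] is a non-empty substring minimizing
    penalty_d and p is that minimum penalty (0-based, inclusive indices)."""
    best_sum, best = None, None
    cur_sum, cur_start = 0, 0
    for i, (c, w) in enumerate(zip(colors, weights)):
        gain = w if c == d else -w           # signed weight of vertex i
        if cur_sum <= 0:                     # a non-positive prefix never helps: restart
            cur_sum, cur_start = gain, i
        else:
            cur_sum += gain
        if best_sum is None or cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i)
    total_d = sum(w for c, w in zip(colors, weights) if c == d)
    return best, total_d - best_sum
```

On the string aabaa with unit weights, the best block for a is the whole string with penalty 1 (the single b must be overwritten), and the best block for b is the single b-vertex with penalty 0.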

Fig. 3. The upper part of the figure shows the optimal blocks on the string and the
lower part shows the coloring returned by the algorithm.
We describe below a linear time algorithm which, given the blocks B_d, defines
a convex coloring Ĉ so that cost(Ĉ) ≤ $\sum_d p^*_d$, which by Corollary 1 is a 2-approximation
to a minimal convex recoloring of C.
Ĉ is constructed by performing one scan of S from left to right. The scan
consists of at most c stages, where stage j defines the j-th block of Ĉ, to be
denoted F_j, and its color, d_j, as follows.
Let d_1 be a color that covers the leftmost covered vertex (note that v_1 is either free
or covered by d_1). d_1 is taken to be the color of the first (leftmost) block of
Ĉ, F_1, and Ĉ(v_1) is set to d_1. For i > 1, Ĉ(v_i) is determined as follows: Let
Ĉ(v_{i−1}) = d_j. Then if v_i ∈ B_{d_j} or v_i is free, then Ĉ(v_i) is also set to d_j. Else, v_i
must be a covered vertex. Let d_{j+1} be one of the colors that cover v_i. Ĉ(v_i) is
set to d_{j+1} (and v_i is the first vertex in F_{j+1}).
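The scan described above can be sketched as follows. The function name `scan_recolor`, the dictionary encoding of the blocks (color to an inclusive 0-based index pair) and the choice of the first covering color in dictionary order are assumptions made for this example; the paper allows any covering color to be chosen. The sketch assumes at least one block is given. Looking up the covering colors costs O(c) per vertex, so the whole scan takes O(cn) time.

```python
def scan_recolor(n, blocks):
    """Build the coloring C-hat of a string of length n from the blocks B_d.

    blocks -- dict mapping each color d to (i_d, j_d), inclusive 0-based
              indices of B_d (hypothetical encoding)
    """
    def covering(i):
        return [d for d, (lo, hi) in blocks.items() if lo <= i <= hi]

    # d_1: a color covering the leftmost covered vertex
    first_covered = next(i for i in range(n) if covering(i))
    current = covering(first_covered)[0]
    result = []
    for i in range(n):
        cov = covering(i)
        # keep the current color while v_i is free or lies in its block;
        # otherwise v_i is covered by some other color, which starts a new block
        if cov and current not in cov:
            current = cov[0]
        result.append(current)
    return result
```

With blocks a = [1, 2] and b = [2, 4] on a string of length 6, the free vertex 0 takes color a, vertex 2 stays in the a-block because it belongs to B_a, and the b-block starts at vertex 3.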
Observation 2 Ĉ is a convex coloring of S.
Proof. Let d_j be the color of the j-th block of Ĉ, F_j, as described above. The
convexity of Ĉ follows from the following invariant, which is easily proved
by induction: For all j ≥ 1, $\bigcup_{k=1}^{j} F_k \supseteq \bigcup_{k=1}^{j} B_{d_k}$. This means that, for all j, no
vertex to the right of F_j is covered by d_j, and hence no such vertex is colored
by d_j.
Thus it remains to prove
Lemma 1. cost(Ĉ) ≤ $\sum_{d \in \mathcal{C}} p^*_d$.
Proof. cost(Ĉ) = $\sum \{ w(v_i) : C(v_i) \neq \hat{C}(v_i) \}$. Thus w(v_i) is added to cost(Ĉ)
only if C(v_i) = d and Ĉ(v_i) = d′ for some distinct d′, d. By the algorithm,
Ĉ(v_i) = d′ only if v_i ∈ B_{d′} or v_i is free. In the first case w(v_i) is accounted for
in p*_{d′}. In the second case it is accounted for in p*_d. In both cases w(v_i) is added
to the sum on the right-hand side, which proves the inequality.

5 A 3-Approximation Algorithm for Tree
In this section we present a polynomial time algorithm which approximates the
minimal convex coloring of a weighted tree by factor three. The input is a triplet
(T, C, w), where w is a nonnegative weight function and C is a (possibly partial)
coloring whose domain is the set support(w) = {v ∈ V : w(v) > 0}.
We first introduce the notion of covers w.r.t. colored trees. A set of vertices
X is a convex cover (or just a cover) for a colored tree (T, C) if the (partial)
coloring C_X = C|[V \ X] is convex (i.e., C can be transformed to a convex coloring
by overwriting only vertices in X). Thus, if C′ is a convex recoloring of (T, C),
then X_C(C′), the set of vertices overwritten by C′, is a cover for (T, C), and the cost
of C′ is w(X(C′)). Moreover, deciding whether a subset X ⊆ V is a cover for
(T, C), and constructing a total convex recoloring C′ of C such that X(C′) ⊆ X
in case it is, can be done in O(n · n_c) time. Therefore, finding an optimal convex
total recoloring of C is polynomially equivalent to finding an optimal cover X,
or equivalently a partial convex recoloring C′ of C so that w(X(C′)) = w(X) is
minimized.
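The cover test can be sketched directly from the disjointness property of Section 2: X is a cover iff, after uncoloring the vertices of X, the carriers of the remaining colors are pairwise disjoint. The code below is a minimal sketch under that characterization; the adjacency-dictionary encoding and the names `carrier`, `is_convex` and `is_cover` are assumptions made for this example, and the sketch only decides the question without constructing the total recoloring. It runs one breadth-first search per color.

```python
from collections import deque

def carrier(adj, U):
    """Vertex set of the minimal subtree of the tree `adj` containing U."""
    U = list(U)
    root = U[0]
    parent = {root: None}
    queue = deque([root])
    while queue:                              # BFS from one vertex of U
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    result = set()
    for u in U:                               # union of the paths from root to each u
        while u is not None and u not in result:
            result.add(u)
            u = parent[u]
    return result

def is_convex(adj, coloring):
    """coloring: dict vertex -> color, listing colored vertices only."""
    by_color = {}
    for v, d in coloring.items():
        by_color.setdefault(d, []).append(v)
    used = set()
    for U in by_color.values():
        car = carrier(adj, U)
        if car & used:                        # two carriers share a vertex
            return False
        used |= car
    return True

def is_cover(adj, coloring, X):
    """True iff uncoloring the vertices of X leaves a convex partial coloring."""
    return is_convex(adj, {v: d for v, d in coloring.items() if v not in X})
```

On the path 0 - 1 - 2 - 3 colored a, b, a, b, the set {1} is a cover while {0} is not.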
Our approximation algorithm makes use of the local ratio technique, which
is useful for approximating optimization covering problems such as vertex cover,
dominating set, minimum spanning tree, feedback vertex set and more
[BYE85, BBF99, BY00]. We hereafter describe it briefly:
The input to the problem is a triplet (V, Σ ⊆ 2^V, w : V → R^+), and the goal is to
find a subset X ∈ Σ such that w(X) is minimized, i.e. w(X) = OPT(V, Σ, w) =
min_{Y∈Σ} w(Y) (in our context V is the set of vertices, and Σ is the set of covers).
The local ratio principle is based on the following observation (see e.g. [BY00]):
Observation 3 For every two weight functions w1, w2:
OPT (V, Σ, w1) + OPT (V, Σ, w2) ≤ OPT (V, Σ, w1 + w2)
Now, given our initial weight function w, we select w1, w2 s.t. w1 + w2 = w
and |support(w1)| < |support(w)|. We first apply the algorithm to find an r-approximation
to (V, Σ, w1) (in particular, if V \ support(w1) is a cover, then it
is an optimal cover to (V, Σ, w1)). Let X be the solution returned for (V, Σ, w1),
and assume that w1(X) ≤ r · OPT(V, Σ, w1). If we could also guarantee that
w2(X) ≤ r · OPT(V, Σ, w2) then by Observation 3 we are guaranteed that X is
also an r-approximation for (V, Σ, w1 + w2 = w). The original property, introduced
in [BYE85], which was used to guarantee that w2(X) ≤ r · OPT(V, Σ, w2)
is that w2 is r-effective, that is: for every X ∈ Σ it holds that w2(X) ≤
r · OPT(V, Σ, w2) (note that if V ∈ Σ, the above is equivalent to requiring
that w2(V) ≤ r · OPT(V, Σ, w2)).
Theorem 4. [BYE85] Given X ∈ Σ s.t. w1(X) ≤ r · OPT (V, Σ, w1). If w2 is
r-effective, then w(X) = w1(X) + w2(X) ≤ r · OPT (V, Σ, w).
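For completeness, the chain of inequalities behind Theorem 4 can be written out as follows. The first inequality uses the hypothesis on w1 and the r-effectiveness of w2, and the second uses Observation 3 with w = w1 + w2.

```latex
\begin{align*}
w(X) &= w_1(X) + w_2(X) \\
     &\le r \cdot OPT(V, \Sigma, w_1) + r \cdot OPT(V, \Sigma, w_2) \\
     &\le r \cdot OPT(V, \Sigma, w_1 + w_2) = r \cdot OPT(V, \Sigma, w).
\end{align*}
```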
We start by presenting two applications of Theorem 4 to obtain a 3-approx-
imation algorithm for convex recoloring of strings and a 4-approximation algo-
rithm for convex recoloring of trees.

3-string-APPROX:
Given an instance of the convex weighted string problem (S, C, w):
1. If V \ support(w) is a cover then X ← V \ support(w). Else:
2. Find 3 vertices x, y, z ∈ support(w) s.t. C(x) = C(z) ≠ C(y) and y lies
between x and z.
(a) ε ← min{w(x), w(y), w(z)}
(b) w2(v) = ε if v ∈ {x, y, z}, and w2(v) = 0 otherwise.
(c) w1 ← w − w2
(d) X ← 3-string-APPROX(S, C|support(w1), w1)
Note that a (partial) coloring of a string is not convex iff the condition in 2
holds. It is also easy to see that w2 is 3-effective, since any cover Y must contain
at least one vertex from any triplet described in condition 2, hence w2(Y ) ≥ ε
while w2(V ) = 3ε.
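The recursion of 3-string-APPROX can be unrolled into a loop, as in the sketch below. The names `find_triple` and `three_string_approx`, the list encoding of the string and the particular rule for locating a violating triple are assumptions made for this example; the paper only requires that some triple satisfying the condition in step 2 be found. Vertices of weight zero are treated as uncolored, following the convention that the domain of C is support(w).

```python
def find_triple(colors, w):
    """Return indices (x, y, z), x < y < z, of positive weight with
    colors[x] == colors[z] != colors[y], or None if no such triple exists."""
    last = {}                                 # color -> last support vertex seen with it
    prev = None                               # previous support vertex
    for i, c in enumerate(colors):
        if w[i] <= 0:
            continue                          # not in support(w): treated as uncolored
        if c in last and colors[prev] != c:
            return (last[c], prev, i)         # prev lies strictly between last[c] and i
        last[c] = i
        prev = i
    return None

def three_string_approx(colors, weights):
    """Return a cover X (a set of indices) of weight at most 3 * OPT."""
    w = list(weights)
    while True:
        triple = find_triple(colors, w)
        if triple is None:                    # V \ support(w) is now a cover
            return {i for i, wi in enumerate(w) if wi <= 0}
        eps = min(w[i] for i in triple)
        for i in triple:                      # w1 = w - w2, where w2 = eps on the triple
            w[i] -= eps
```

On abab with unit weights the sketch returns {0, 1, 2}, of weight 3, while the optimal cover {1} has weight 1, so the factor 3 is attained on this input. With weights (5, 1, 5, 1) it returns the optimal cover {1}.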
The above algorithm cannot serve for approximating convex tree coloring
since in a tree the condition in 2 might not hold even if V \ support(w) is not
a cover. In the following algorithm we generalize this condition to one which
must hold in any non-convex coloring of a tree, at the price of increasing the
approximation ratio from 3 to 4.
4-tree-APPROX:
Given an instance of the convex weighted tree problem (T, C, w):
1. If V \ support(w) is a cover then X ← V \ support(w). Else:
2. Find two pairs of (not necessarily distinct) vertices (x1, x2) and
(y1, y2) in support(w) s.t. C(x1) = C(x2) ≠ C(y1) = C(y2), and
carrier({x1, x2}) ∩ carrier({y1, y2}) ≠ ∅:
(a) ε ← min{w(x_i), w(y_i) : i = 1, 2}
(b) w2(v) = ε if v ∈ {x1, x2, y1, y2}, and w2(v) = 0 otherwise.
(c) w1 ← w − w2
(d) X ← 4-tree-APPROX(T, C|support(w1), w1)
The algorithm is correct since if there are no two pairs as described in step 2,
then V \ support(w) is a cover. Also, it is easy to see that w2 is 4-effective. Hence
the above algorithm returns a cover with weight at most 4 · OPT(T, C, w).
We now describe algorithm 3-tree-APPROX. Informally, the algorithm uses
an iterative method, in the spirit of the local ratio technique, which approximates
the solution of the input (T, C, w) by reducing it to (T′, C′, w1) where
|support(w1)| < |support(w)|. Depending on the given input, this reduction is
either of the local ratio type (via an appropriate 3-effective weight function)
or the input graph is replaced by a smaller one which preserves the optimal
solutions.

Fig. 4. Case 2: a vertex v is contained in 3 different carriers.
3-tree-APPROX(T, C, w)
On input (T, C, w) of a weighted colored tree, do the following:
1. If V \ support(w) is a cover then X ← V \ support(w). Else:
2. (T′, C′, w1) ← REDUCE(T, C, w). The function REDUCE guarantees
that |support(w1)| < |support(w)|.
(a) X′ ← 3-tree-APPROX(T′, C′, w1).
(b) X ← UPDATE(X′, T). The function UPDATE guarantees that if
X′ is a 3-approximation to (T′, C′, w1), then X is a 3-approximation
to (T, C, w).
Next we describe the functions REDUCE and UPDATE, by considering
a few cases. In the first two cases we employ the local ratio technique.
Case 1: support(w) contains three vertices x, y, z such that y lies on the path
from x to z and C(x) = C(z) ≠ C(y).
In this case we use the same reduction as in 3-string-APPROX:
Let
ε = min{w(x), w(y), w(z)} > 0.
Then
REDUCE(T, C, w) = (T, C|support(w1), w1),
where w1(v) = w(v) if v ∉ {x, y, z}, else w1(v) = w(v) − ε. The same arguments
which imply the correctness of 3-string-APPROX imply that if X′ is a
3-approximation for (T′, C′, w1), then it is also a 3-approximation for (T, C, w),
thus we set
UPDATE(X′, T) = X′.
Case 2: Not Case 1, and T contains a vertex v such that v ∈ ∩_{i=1}^{3} carrier(d_i, C)
for three distinct colors d1, d2 and d3 (see Figure 4).
In this case we must have that w(v) = 0 (else Case 1 would hold), and there
are three designated pairs of vertices {x1, x2}, {y1, y2} and {z1, z2} such that
C(x_i) = d1, C(y_i) = d2, C(z_i) = d3 (i = 1, 2), and v lies on each of the three
paths connecting these three pairs (see Figure 4). We set
REDUCE(T, C, w) = (T, C|support(w1), w1),
where w1 is defined as follows.
Let
ε = min{w(x_i), w(y_i), w(z_i) : i = 1, 2}.
Then w1(v) = w(v) if v is not in one of the designated pairs, else w1(v) = w(v) − ε.
Finally, any cover for (T, C) must contain at least two vertices from the set
{x_i, y_i, z_i : i = 1, 2}, so w2 = w − w1 satisfies w2(Y) ≥ 2ε for every cover Y
while w2(V) = 6ε. Hence w2 is 3-effective, and by the local ratio
theorem we can set
UPDATE(X′, T) = X′.
Case 3: Not Cases 1 and 2.
Root T at some vertex r and for each color d let r_d be the root of the subtree
carrier(d, C). Let d0 be a color for which the root r_{d0} is farthest from r. Let
T̄ be the subtree of T rooted at r_{d0}, and let T̂ = T \ T̄ (see Figure 5). By the
definition of r_{d0}, no vertex in T̂ is colored by d0, and since Case 2 does not hold,
there is a color d′ so that {d0} ⊆ C(V(T̄)) ⊆ {d0, d′}.
Fig. 5. Case 3: Not case 1 nor 2. T̄ is the subtree rooted at r_{d0} and T̂ = T \ T̄.
Subcase 3a: C(V(T̄)) = {d0} (see Figure 6).
In this case, carrier(d0, C) ∩ carrier(d, C) = ∅ for each color d ≠ d0, and for
each optimal solution X it holds that X ∩ V(T̄) = ∅. We set
REDUCE(T, C, w) ← (T̂, C|V(T̂), w|V(T̂)).
Fig. 6. Case 3a: No vertices of T̂ are colored by d′.
Fig. 7. Case 3b: r_{d0} ∈ T_{d0} ∩ carrier(d′).
The 3-approximation X′ to (T′, C′, w1) is also a 3-approximation to (T, C, w),
thus
UPDATE(X′, T) = X′.
We are left with the last case, depicted in Figure 7.
Subcase 3b: C(V(T̄)) = {d0, d′}.
In this case we have that r_{d0} ∈ carrier(d0, C) ∩ carrier(d′, C) and w(r_{d0}) = 0
(since Case 1 does not hold).
Informally, REDUCE(T, C, w) modifies the tree T by replacing the subtree
T̄ by a smaller subtree T̄_0, which contains only two vertices, and which encodes
three possible recolorings of T̄, one of which must be used in an optimal recoloring
of T. The tree T′, resulting from replacing T̄ by T̄_0 in the tree T, has smaller
support than T, since |support(w) ∩ V(T̄)| ≥ 3. This last inequality holds since,
by the fact that r_{d0} lies between two vertices colored d0 and between two vertices
colored d′, V(T̄) must contain at least two vertices colored d0 and at least one
vertex colored d′.
Observation 5 There is an optimal convex coloring C′ which satisfies the following:
C′(v) ≠ d0 for any v ∈ V(T̂), and C′(v) ∈ {d0, d′} for any v ∈ V(T̄).
Proof. Let Ĉ be an expanding optimal convex recoloring of (T, C). We will show
that there is an optimal coloring C′ satisfying the observation such that cost(C′) ≤
cost(Ĉ). Since Ĉ is expanding and optimal, at least one vertex in T̄ is colored
either by d0 or by d′. Let U be a set of vertices in T̄ so that carrier(U) is a
maximal subtree all of whose vertices are colored by colors not in {d0, d′}. Then
carrier(U) must have a neighbor u in T̄ s.t. Ĉ(u) ∈ {d0, d′}. Changing the colors
of the vertices in U to Ĉ(u) does not increase the cost of the recoloring. This
procedure can be repeated until all the vertices of T̄ are colored by d0 or by d′.
A similar procedure can be used to change the color of all the vertices in T̂ to
be different from d0. It is easy to see that the resulting coloring C′ is convex and
cost(C′) ≤ cost(Ĉ).
The function REDUCE in Subcase 3b is based on the following observation:
Let C′ be any optimal recoloring of T satisfying Observation 5, and let s be the
parent of r_{d0} in T. Then C′|V(T̄), the restriction of the coloring C′ to the vertices
of T̄, depends only on whether carrier(d′, C′) intersects V(T̂), and in this case
on whether it contains the vertex s. Specifically, C′|V(T̄) must be one of the three colorings
of V(T̄), C_high, C_medium and C_min, according to the following three scenarios:
1. carrier(d′, C′) ∩ V(T̂) ≠ ∅ and s ∉ carrier(d′, C′). Then it must be the case
that C′ colors all the vertices in V(T̄) by d0. This coloring of T̄ is denoted
as C_high.
2. carrier(d′, C′) ∩ V(T̂) ≠ ∅ and s ∈ carrier(d′, C′). Then C′|T̄ is a coloring of
minimal possible cost of T̄ which either equals C_high (i.e. colors all vertices
by d0), or otherwise colors r_{d0} by d′. This coloring of T̄ is called C_medium.
3. carrier(d′, C′) ∩ V(T̂) = ∅. Then C′|T̄ must be an optimal convex recoloring
of T̄ by the two colors d0, d′. This coloring of T̄ is called C_min.
We will show soon that the colorings C_high, C_medium and C_min above can be
computed in linear time. The function REDUCE in Subcase 3b modifies the
tree T by replacing T̄ by a subtree T̄_0 with only 2 vertices, r_{d0} and v0, which
encodes the three colorings C_high, C_medium, C_min. Specifically,
REDUCE(T, C, w) = (T′, C′, w1)
where (see Figure 8):
– T′ is obtained from T by replacing the subtree T̄ by the subtree T̄_0 which
contains two vertices: a root r_{d0} with a single descendant v0.
– w1(v) = w(v) for each v ∈ V(T̂). For r_{d0} and v0, w1 is defined as follows:
w1(r_{d0}) = cost(C_medium) − cost(C_min) and w1(v0) = cost(C_high) −
cost(C_min).
– C′(v) = C(v) for each v ∈ V(T̂); if w1(r_{d0}) > 0 then C′(r_{d0}) = d0 and if
w1(v0) > 0 then C′(v0) = d′. (If w1(u) = 0 for u ∈ {r_{d0}, v0}, then C′(u) is
undefined).
Figure 8 illustrates REDUCE for case 3b. In the figure, C_high requires overwriting
all d′ vertices and therefore costs 3, C_medium requires overwriting one
d0 vertex and costs 2 and C_min is the optimal coloring for T̄ with cost 1. The
new subtree T̄_0 reflects these weights with w1(r_{d0}) = cost(C_medium) − cost(C_min) = 1 and
w1(v0) = cost(C_high) − cost(C_min) = 2.
Fig. 8. REDUCE of case 3b: T̄ is replaced with T̄_0 where w1(r_{d0}) = cost(C_medium) − cost(C_min) =
1 and w1(v0) = cost(C_high) − cost(C_min) = 2.
Claim. OPT(T′, C′, w1) = OPT(T, C, w) − cost(C_min).
Proof. We first show that OPT(T′, C′, w1) ≤ OPT(T, C, w) − cost(C_min). Let
C* be an optimal recoloring of C satisfying Observation 5, and let X* = X(C*).
By the discussion above, we may assume that C*|V(T̄) has one of the forms
C_high, C_medium or C_min. Thus, X* ∩ V(T̄) is either X(C_high), X(C_medium) or
X(C_min). We map C* to a coloring C′ of T′ as follows: for v ∈ V(T̂), C′(v) =
C*(v). C′ on r_{d0} and v0 is defined as follows:
– If C*|V(T̄) = C_high then C′(r_{d0}) = C′(v0) = d0, and cost(C′|V(T̄_0)) = w1(v0);
– If C*|V(T̄) = C_medium then C′(r_{d0}) = C′(v0) = d′, and cost(C′|V(T̄_0)) =
w1(r_{d0});
– If C*|V(T̄) = C_min then C′(r_{d0}) = d0, C′(v0) = d′, and cost(C′|V(T̄_0)) = 0.
Note that in all three cases, cost(C′) = cost(C*) − cost(C_min).
The proof of the opposite inequality
OPT(T, C, w) − cost(C_min) ≤ OPT(T′, C′, w1)
is similar.
Corollary 2. C* is an optimal recoloring of (T, C, w) iff C′ is an optimal recoloring
of (T′, C′, w1).
We now can define the UPDATE function for Subcase 3b: Let X′ = 3-tree-APPROX(T′, C′, w1).
Then X′ is a disjoint union of the sets X̂ = X′ ∩ V(T̂) and
X̄′_0 = X′ ∩ V(T̄_0). Moreover, X̄′_0 ∈ {{r_{d0}}, {v0}, ∅}. Then X ←
UPDATE(X′) = X̂ ∪ X̄, where X̄ is X(C_medium) if X̄′_0 = {r_{d0}}, is X(C_high)
if X̄′_0 = {v0}, and is X(C_min) if X̄′_0 = ∅ (this matches the weights w1(r_{d0}) =
cost(C_medium) − cost(C_min) and w1(v0) = cost(C_high) − cost(C_min)). Note that
w(X) = w1(X′) + cost(C_min). The following inequalities show that if w1(X′) is a 3-approximation
to OPT(T′, C′, w1), then w(X) is a 3-approximation to OPT(T, C, w):
\[ w(X) = w_1(X') + \mathrm{cost}(C_{min}) \le 3\,OPT(T', C', w_1) + \mathrm{cost}(C_{min}) \le 3\bigl(OPT(T', C', w_1) + \mathrm{cost}(C_{min})\bigr) = 3\,OPT(T, C, w) \]
5.1 A Linear Time Algorithm for Subcase 3b
In Subcase 3b we need to compute C_high, C_medium and C_min. The computation
of C_high is immediate. C_medium and C_min can be computed by the following
simple, linear time algorithm that finds a minimal cost convex recoloring of a
bi-colored tree, under the constraint that the color of a given vertex r is predetermined
to one of the two colors.
Let the weighted colored tree (T, C, w) and the vertex r be given, and let
{d1, d2} = C(T). For i ∈ {1, 2}, let C_i be the minimal cost convex recoloring which
sets the color of r to d_i (note that a coloring with minimum cost in {C_1, C_2}
is an optimal convex recoloring of (T, C)). We illustrate the computation of C_1
(the computation of C_2 is similar):
Compute for every edge e = (u → v) a cost defined by
\[ \mathrm{cost}(e) = w(\{v' : v' \in T(v) \text{ and } C(v') = d_1\}) + w(\{v' : v' \in [T \setminus T(v)] \text{ and } C(v') = d_2\}) \]
where T(v) is the subtree rooted at v. This can be done by one post order
traversal of the tree. Then, select the edge e* = (u0 → v0) which minimizes this
cost, and set C_1(x) = d2 for each x ∈ T(v0), and C_1(x) = d1 otherwise.
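The edge-cost computation can be sketched as follows for a tree rooted at r and stored as a child-list dictionary. The function name `constrained_bicolor` and the input encoding are assumptions made for this example. The sketch additionally considers the option of cutting no edge (coloring every vertex d1), which the description above does not list; it is included here as a safeguard for inputs where every cut is more expensive.

```python
def constrained_bicolor(children, root, color, weight, d1, d2):
    """Return (cost, coloring) of a cheapest convex recoloring with colors
    d1, d2 in which `root` receives d1.

    children -- dict: vertex -> list of its children (hypothetical encoding)
    color, weight -- dicts giving the input color and weight of every vertex
    """
    order, stack = [], [root]
    while stack:                              # preorder: parents before descendants
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    sub_d1, sub_d2 = {}, {}                   # weight of d1- / d2-vertices in T(v)
    for v in reversed(order):                 # reversed preorder: children first
        a = weight[v] if color[v] == d1 else 0
        b = weight[v] if color[v] == d2 else 0
        for ch in children.get(v, []):
            a += sub_d1[ch]
            b += sub_d2[ch]
        sub_d1[v], sub_d2[v] = a, b
    total_d2 = sub_d2[root]
    best_cost, best_v = total_d2, None        # added option: no cut, everything d1
    for v in order:
        if v == root:
            continue
        # cost of the edge entering v: d1-weight inside T(v) + d2-weight outside it
        c = sub_d1[v] + (total_d2 - sub_d2[v])
        if c < best_cost:
            best_cost, best_v = c, v
    result = {v: d1 for v in order}
    if best_v is not None:                    # color T(v0) by d2
        stack = [best_v]
        while stack:
            v = stack.pop()
            result[v] = d2
            stack.extend(children.get(v, []))
    return best_cost, result
```

Running the sketch twice, once with each of the two colors as d1, and keeping the cheaper result gives an optimal convex recoloring of the bi-colored tree, as noted above.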
5.2 Correctness and Complexity
We now summarize the discussion of the previous section to show that the algorithm
terminates and returns a cover X which is a 3-approximation for (T, C, w).
Let (T = (V, E), C, w) be an input to 3-tree-APPROX. If V \ support(w)
is a cover then the returned solution is optimal. Else, in each of the cases,
REDUCE(T, C, w) reduces the input to (T′, C′, w1) such that |support(w1)| <
|support(w)|, hence the algorithm terminates within at most n = |V| iterations.
Also, as detailed in the previous subsections, the function UPDATE guarantees
that if X′ is a 3-approximation for (T′, C′, w1) then X is a 3-approximation to
(T, C, w). Thus after at most n iterations the algorithm provides a 3-approximation
to the original input.
Checking whether Case 1, Case 2, Subcase 3a or Subcase 3b holds at each
stage requires O(cn) time for each of the cases, and computing the function
REDUCE after the relevant case is identified requires linear time in all cases.
Since there are at most n iterations, the overall complexity is O(cn^2). Thus we
have
Theorem 6. Algorithm 3-tree-APPROX is a polynomial time 3-approximation
algorithm for the minimum convex recoloring problem.
6 Discussion and Future Work
In this work we showed two approximation algorithms for colored strings and
trees, respectively. The 2-approximation algorithm relies on the technique of pe-
nalizing a colored string and the 3-approximation algorithm for the tree extends
the local ratio technique by allowing dynamic changes in the underlying graph.
A few interesting research directions which suggest themselves are:
– Can our approximation ratios for strings or trees be improved?
– This is a more focused variant of the previous item. A problem has a polynomial
approximation scheme [GJ79, Hoc97], or is fully approximable [PM81],
if for each ε it can be ε-approximated in p_ε(n) time for some polynomial
p_ε. Are the problems of optimal convex recoloring of trees or strings fully
approximable (or equivalently, do they have a polynomial approximation scheme)?
– Alternatively, can any of the variants be shown to be APX-hard [Vaz01]?
– The algorithms presented here apply only to uniform models. The non-uniform
model, motivated by weighted maximum parsimony [San75], assumes
that the cost of assigning color d to vertex v is given by an arbitrary nonnegative
number cost(v, d) (note that, formally, no initial coloring C is assumed
in this cost model). In this model cost(C′) is defined only for a total recoloring
C′, and is given by the sum $\sum_{v \in V} cost(v, C'(v))$. Finding non-trivial
approximation results for this model is challenging.
Acknowledgments
We would like to thank Reuven Bar Yehuda, Arie Freund and Dror Rawitz for
very helpful discussions.
References
[AFB96] R. Agarwala and D. Fernández-Baca. Simple algorithms for perfect phylogeny
and triangulating colored graphs. International Journal of Foundations
of Computer Science, 7(1):11–21, 1996.
[BBF99] V. Bafna, P. Berman, and T. Fujito. A 2-approximation algorithm for the
undirected feedback vertex set problem. SIAM J. on Discrete Mathematics,
12:289–297, 1999.
[BDFY01] A. Ben-Dor, N. Friedman, and Z. Yakhini. Class discovery in gene expres-
sion data. In RECOMB, pages 31–38, 2001.

[Be00] M. Bittner et al. Molecular classification of cutaneous malignant
melanoma by gene expression profiling. Nature, 406(6795):536–40, 2000.
[BFW92] H.L. Bodlaender, M.R. Fellows, and T. Warnow. Two strikes against perfect
phylogeny. In ICALP, pages 273–283, 1992.
[BY00] R. Bar-Yehuda. One for the price of two: A unified approach for approxi-
mating covering problems. Algorithmica, 27:131–144, 2000.
[BYE85] R. Bar-Yehuda and S. Even. A local-ratio theorem for approximating the
weighted vertex cover problem. Annals of Discrete Mathematics, 25:27–46,
1985.
[DS92] A. Dress and M.A. Steel. Convex tree realizations of partitions. Applied
Mathematics Letters, 5(3):3–6, 1992.
[FBL03] D. Fernández-Baca and J. Lagergren. A polynomial-time algorithm for
near-perfect phylogeny. SIAM Journal on Computing, 32(5):1115–1127,
2003.
[Fit81] W. M. Fitch. A non-sequential method for constructing trees and hierar-
chical classifications. Journal of Molecular Evolution, 18(1):30–37, 1981.
[GGP+96] L.A. Goldberg, P.W. Goldberg, C.A. Phillips, Z. Sweedyk, and T. Warnow.
Minimizing phylogenetic number to find good evolutionary trees. Discrete
Applied Mathematics, 71:111–136, 1996.
[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide to
the Theory of NP-Completeness. W.H. Freeman and Company, 1979.
[Gus91] D. Gusfield. Efficient algorithms for inferring evolutionary history. Net-
works, 21:19–28, 1991.
[Hoc97] D. S. Hochbaum, editor. Approximation Algorithms for NP-Hard Problems.
PWS Publishing Company, 1997.
[HTD+04] A. Hirsh, A. Tsolaki, K. DeRiemer, M. Feldman, and P. Small. From the
cover: Stable association between strains of Mycobacterium tuberculosis
and their human host populations. PNAS, 101:4871–4876, 2004.
[KW94] S. Kannan and T. Warnow. Inferring evolutionary history from DNA se-
quences. SIAM J. Computing, 23(3):713–737, 1994.
[KW97] S. Kannan and T. Warnow. A fast algorithm for the computation and enu-
meration of perfect phylogenies when the number of character states is
fixed. SIAM J. Computing, 26(6):1749–1763, 1997.
[MS03] S. Moran and S. Snir. Convex recoloring of strings and trees. Technical
Report CS-2003-13, Technion, November 2003.
[MS05] S. Moran and S. Snir. Convex recoloring of strings and trees: Definitions,
hardness results and algorithms. In WADS, 2005.
[PM81] A. Paz and S. Moran. Non deterministic polynomial optimization problems
and their approximability. Theoretical Computer Science, 15:251–277, 1981.
Abridged version: Proc. of the 4th ICALP conference, 1977.
[San75] D. Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied
Mathematics, 28:35–42, 1975.
[SS03] C. Semple and M.A. Steel. Phylogenetics. Oxford University Press, 2003.
[Ste92] M. Steel. The complexity of reconstructing trees from qualitative characters
and subtrees. Journal of Classification, 9(1):91–116, 1992.
[Vaz01] V. Vazirani. Approximation Algorithms. Springer, Berlin, Germany, 2001.

Approximation Algorithms
for Requirement Cut on Graphs
Viswanath Nagarajan
and Ramamoorthi Ravi
Tepper School of Business, Carnegie Mellon University, Pittsburgh PA 15213
{viswa,ravi}@cmu.edu
Abstract. In this paper, we unify several graph partitioning problems
including multicut, multiway cut, and k-cut, into a single problem. The
input to a requirement cut problem is an undirected edge-weighted graph
G = (V, E), and g groups of vertices X1, · · · , Xg ⊆ V , each with a re-
quirement ri between 0 and |Xi|. The goal is to find a minimum cost set
of edges whose removal separates each group Xi into at least ri discon-
nected components.
We give an O(log n log(gR)) approximation algorithm for the require-
ment cut problem, where n is the total number of vertices, g is the
number of groups, and R is the maximum requirement. We also show
that the integrality gap of a natural LP relaxation for this problem is
bounded by O(log n log(gR)). On trees, we obtain an improved guaran-
tee of O(log(gR)). There is a natural Ω(log g) hardness of approximation
for the requirement cut problem.
1 Introduction
Graph partitioning problems form a fundamental area of the study of approx-
imation algorithms. The simplest graph partitioning problem is probably the
well known s-t minimum cut problem [1]. Here we have an edge weighted graph
containing two specified terminals s and t, and the goal is to find a minimum
weight set of edges to disconnect s and t. The classical result of Ford and Fulk-
erson [1] proved a max-flow min-cut duality which related the maximum flow
and minimum cut problems.
Multicuts: In the seminal work of Leighton and Rao [2], the authors gave an
approximate max-flow min-cut theorem for a generalization of s − t maximum
flow to multicommodity flow. The flow problem considered here is concurrent
multicommodity flow, and the corresponding cut is the sparsest cut. Here we have
an undirected graph G with edge capacities. There are several commodities, each
having a source, sink, and a specified demand. A maximum concurrent flow is
one that maximizes the fraction of all demands that can be feasibly routed.
A sparsest cut is one that minimizes the ratio of edge capacity of the cut to

Supported by NSF ITR grant CCR-0122581 (The ALADDIN project)

Supported in part by NSF grants CCR-0105548, CCF-0430751 and ITR grant CCR-
0122581 (The ALADDIN project).
the demand separated by the cut. The original paper [2] considered uniform
instances where there is a commodity for every pair of vertices, and one unit of
demand for each commodity. The results in [2] showed a max-flow min-cut ratio
of O(log n) where n is the number of vertices in the graph. This was extended
to the case of arbitrary demands in a series of papers [3–7]. The best max-
flow min-cut ratio is Θ(log k) (k is the number of commodities). Garg, Vazirani,
and Yannakakis [4] also considered the related maximum multicommodity flow.
Here the objective is to maximize the total flow routed over all commodities.
The corresponding cut problem is multicut - given a set of source-sink pairs
{(s1, t1), · · · , (sk, tk)}, a multicut is a set of edges whose removal separates each
si from ti. In [4], the authors proved that the ratio of the minimum multicut
to the maximum multicommodity flow is O(log k), and they gave an O(log k)
approximation algorithm for the minimum multicut problem.
Multiway Cut: In this problem, we have a set X of terminals, and the goal is
to remove a minimum cost set of edges so that no two terminals are in the same
connected component. Karger et al. [8] gave an approximation algorithm with
guarantee 1.3438.
Multi-multiway Cut: Recently, Avidor and Langberg [9] extended multiway
cut and multicut to a multi-multiway cut problem. Here we are given g sets of
vertices X1, · · · , Xg, and we wish to find a minimum cost set of edges whose
removal completely disconnects each of the sets X1, · · · , Xg. For this problem,
the authors [9] presented an O(log g) approximation algorithm.
Steiner Multicut: Another interesting graph partitioning problem is the Steiner
multicut problem [10]. In this problem, we are given g sets of vertices X1, · · · , Xg.
The goal is to find a minimum cost set of edges that separates each of X1, · · · , Xg.
We say that a set of vertices S is separated, if S is not contained in a single
connected component. Klein et al. [10] presented an $O(\log^3(gt))$ approximation
algorithm, where $t = \max_{i=1}^{g} |X_i|$ is the maximum size of a set of vertices.
k-Cuts: Another well studied graph partitioning problem is the k-cut problem
[11]. Given an undirected graph with edge costs, the goal is to find a minimum
cost set of edges that separates the graph into at least k connected components.
Saran and Vazirani [11] gave the first approximation algorithm for this problem
which achieves a guarantee of 2. More recently, Chekuri et al. [12] considered
the Steiner k-cut problem. This is a generalization of the k-cut problem, where
in addition to the graph, a subset X of vertices is specified as terminals. The
objective is to find a minimum cost set of edges whose removal results in at least k
disconnected components, each containing a terminal. Chekuri et al. [12] showed
that the greedy algorithm of [11] can be modified to get a 2-approximation for
this problem. They also showed how to round a natural LP relaxation to achieve
the same bound.
Our goal in this work is to unify all the graph partitioning problems mentioned
above into a single problem that contains all these as subproblems (see
Figure 1). The input to a requirement cut problem is an undirected graph
G = (V, E) with non-negative costs c(e) on its edges. There are g groups of
vertices X1, X2, · · · , Xg ⊆ V ; for each group i we are given a requirement ri
between 0 and |Xi|. The aim is to find a minimum cost set of edges whose removal
separates each group i into at least ri disconnected components (each of which
contains at least one member from the group).

Fig. 1. Containment of cut problems (diagram relating k-cut, Steiner k-cut,
multiway cut, multicut, multi-multiway cut, Steiner multicut, and requirement cut)

Below, we summarize how the requirement cut problem generalizes many of the
graph partitioning problems.
Multicut : |Xi| = ri = 2 for all i = 1, · · · , g.
Steiner k-cut : g = 1, X1 = X the set of terminals, and r1 = k.
Multi-multiway cut : ri = |Xi| for all i.
Steiner multicut : ri = 2 for all i.
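To make the problem definition concrete, the following is a minimal Python sketch of a feasibility check: given a candidate edge set, it tests whether every group is split into at least its required number of components. The function names (`components_after_cut`, `is_requirement_cut`) and the input encoding are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def components_after_cut(vertices, edges, cut):
    """Label every vertex with a representative of its component in (V, E minus cut)."""
    removed = {frozenset(e) for e in cut}
    adj = defaultdict(list)
    for u, v in edges:
        if frozenset((u, v)) not in removed:
            adj[u].append(v)
            adj[v].append(u)
    label = {}
    for s in vertices:
        if s in label:
            continue
        label[s] = s
        stack = [s]
        while stack:  # depth-first search from s
            x = stack.pop()
            for y in adj[x]:
                if y not in label:
                    label[y] = s
                    stack.append(y)
    return label

def is_requirement_cut(vertices, edges, cut, groups, requirements):
    """True iff each group X_i meets at least r_i distinct components after removing `cut`."""
    label = components_after_cut(vertices, edges, cut)
    return all(len({label[x] for x in X}) >= r
               for X, r in zip(groups, requirements))
```

With `groups` of size 2 and all requirements equal to 2 this is a multicut check; with a single group and requirement k it is a Steiner k-cut check, matching the special cases listed above.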
1.1 Our Results
We obtain an O(log n log(gR)) approximation algorithm for the requirement cut
problem on general graphs, where n is the number of vertices in the graph, g
is the number of groups and R is the maximum requirement of any group. We
present two algorithms achieving this guarantee. The first (and more interest-
ing) algorithm is via rounding a suitable LP relaxation. We also show that when
restricted to trees, the approximation ratio can be improved to O(log(gR)). On
the other hand, we know that this problem is at least Ω(log g) hard to approx-
imate, even on a star (Section 2.2). The LP rounding procedure is described in
Section 2. The second algorithm is based on the greedy heuristic for set cover.
Here we compute a minimum cost effective cut (defined later) at each stage and
add it to the solution. This greedy step is an interesting problem in itself, and
has been studied in Klein et al. [10] as the Steiner ratio cut problem. We provide
an improved approximation algorithm for this problem, and show that it yields
an O(log n log(gR)) approximation algorithm for the requirement cut problem.
The greedy algorithm is presented in Section 3.
The LP rounding algorithm on trees generalizes the randomized rounding
for set cover, whereas the second method extends the greedy algorithm. While
the first algorithm relies on the tree structure for rounding and hence incurs

a log-squared overhead, the second method leaves some hope for improvement.
The running time of the first algorithm is better, as it involves solving only one
linear program.
As noted in the introduction, the Steiner multicut problem of Klein et al.
is a special case of the requirement cut problem; so our algorithm implies an
approximation of O(log n log g) for Steiner multicut. However, we note that using
a similar idea even the algorithm in [10] achieves this improved guarantee.
2 LP Based Algorithm for Requirement Cut
In this section, we present an O(log n · log(gR)) approximation algorithm for
requirement cut via LP rounding. We first formulate requirement cut as an
integer program, and obtain its linear relaxation. Then we consider the case
when the input graph is a tree, and show that randomized rounding gives an
O(log(gR)) approximation. Finally we show how requirement cut on a graph can
be reduced to requirement cut on a tree, to obtain an approximation algorithm
for the general case.
2.1 IP Formulation and a Linear Relaxation
Here we look at an integer program for the requirement cut problem, and a linear
relaxation for it. This formulation is a generalization of that used in Chekuri et
al. [12] for the Steiner k-cut problem. We also prove a property of the optimal
LP solution that is useful in the rounding process. We may assume that the
input graph G = (V, E) is complete, by adding edges of zero cost. Our IP has a
0-1 variable de for each edge e ∈ E, which represents whether or not this edge
is cut.
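\[
\begin{array}{lll}
\text{(Steiner-IP)} \quad \min & \displaystyle\sum_{e \in E} c_e d_e & \\
\text{s.t.} & \displaystyle\sum_{e \in T_i} d_e \ge r_i - 1 & \forall\, T_i : \text{Steiner tree on } X_i,\ \forall\, i = 1, \dots, g \\
& d_e \in \{0, 1\} & \forall\, e \in E
\end{array}
\]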
Note that this IP is an exact formulation of the requirement cut problem. Unfor-
tunately, the LP relaxation of this IP does not have a polynomial time separation
oracle, and we don’t know how to solve it efficiently. So we consider a relaxation
of the Steiner tree constraints, by requiring that all spanning trees on the induced
graph G[Xi] have length at least ri −1. Since the minimum spanning tree can be
efficiently computed, this LP has a polynomial time separation oracle and can
be solved in polynomial time using the ellipsoid algorithm.
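\[
\begin{array}{lll}
\text{(LP)} \quad \min & \displaystyle\sum_{e \in E} c_e d_e & \\
\text{s.t.} & \displaystyle\sum_{e \in T_i} d_e \ge r_i - 1 & \forall\, T_i : \text{spanning tree in } G[X_i],\ \forall\, i = 1, \dots, g \\
& d : \text{metric} & \\
& d_e \in [0, 1] & \forall\, e \in E
\end{array}
\]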
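The spanning tree constraints of (LP) can be checked by computing one minimum spanning tree per group. The Python sketch below illustrates that part of a separation oracle under some assumptions: the fractional solution is given as a dictionary `d` keyed by unordered vertex pairs, the names `mst_length` and `violated_groups` are hypothetical, and the metric and box constraints (which are polynomially many and can be listed explicitly) are not checked here.

```python
def mst_length(nodes, dist):
    """Length of a minimum spanning tree on the complete graph over `nodes` (Prim)."""
    nodes = list(nodes)
    if len(nodes) <= 1:
        return 0.0
    # best[v] = cheapest connection from v to the tree built so far
    best = {v: dist(nodes[0], v) for v in nodes[1:]}
    total = 0.0
    while best:
        v = min(best, key=best.get)
        total += best.pop(v)
        for w in best:
            best[w] = min(best[w], dist(v, w))
    return total

def violated_groups(d, groups, requirements, tol=1e-9):
    """Indices i for which the MST on G[X_i] under d is shorter than r_i - 1."""
    def dist(u, v):
        return d[frozenset((u, v))]
    return [i for i, (X, r) in enumerate(zip(groups, requirements))
            if mst_length(X, dist) < r - 1 - tol]
```

If the list returned is empty, every spanning tree on every group has length at least r_i − 1, since the minimum spanning tree is the shortest one.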

Let $d^*$ be an optimal solution to this LP, and $\mathrm{OPT}_f$ its cost. Clearly the cost
of the optimum requirement cut $\mathrm{OPT} \ge \mathrm{OPT}_f$. Let $d$ be defined by
$d_e = \min\{2 \cdot d^*_e, 1\}$. It is easy to check that $d$ is a metric, and all edge lengths are in
[0, 1]. The following claim is easy to see using the MST heuristic for Steiner tree.
Claim 1 For any group $X_i$ ($i = 1, \dots, g$), the minimum Steiner tree on $X_i$ w.r.t.
$d$ has length at least $r_i - 1$.
2.2 LP Rounding for Tree Requirement Cut
We consider here the special case that the input graph is a tree T = (V, E).
We show how to round the above mentioned LP relaxation within a factor of
O(log gR) to obtain an approximation algorithm for requirement cut on trees.
We note that even this restriction is at least as hard as set-cover. Consider a star
with edges corresponding to sets in the set-cover instance. For each element j, we
create a group that contains the root, and all the leaves of edges that correspond
to sets containing j. Further, we set all requirements to 2, and all edge costs equal
to 1. Clearly any feasible solution in this requirement cut instance is equivalent
to a solution in the set cover instance. This shows that requirement cut on trees
is at least Ω(log g) hard to approximate [13].
The algorithm begins by solving (LP) optimally and obtaining the solution d
defined in the previous section. We now describe how d is rounded to an integral
solution. Our algorithm proceeds in phases, and augments the (partial) solution
in each phase. Let $C^k \subseteq E$ denote the partial solution at the start of phase
$k$ ($C^1 = \emptyset$). $F^k = T \setminus C^k$ denotes the forest in round $k$, with the edges in $C^k$
removed. The rounding in phase $k$ flips a coin with probability of success $d_e$ for
each edge $e \in F^k$, independently, and adds all edges that were successes to the
partial solution to get $C^{k+1}$. It is clear that the expected cost in this phase is
\[
\sum_{e \in F^k} c_e d_e \;\le\; 2 \sum_{e \in E} c_e d^*_e \;=\; 2 \cdot \mathrm{OPT}_f .
\]
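The phase structure just described can be sketched in Python as follows. This is an illustration only: the function names, the encoding of tree edges as tuples, and the `max_phases` safety cap are assumptions of the sketch, and the fractional values `d` are taken as given rather than obtained from an LP solver.

```python
import random

def group_components(vertices, edges, cut, group):
    """Number of components of (V, edges minus cut) that contain a vertex of `group`."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for e in edges:
        if e not in cut:
            u, v = e
            parent[find(u)] = find(v)
    return len({find(x) for x in group})

def round_on_tree(vertices, edges, d, groups, reqs, rng, max_phases=10_000):
    """Repeat independent rounding phases until every residual requirement is 0."""
    cut, phases = set(), 0
    while any(group_components(vertices, edges, cut, X) < r
              for X, r in zip(groups, reqs)):
        if phases == max_phases:
            raise RuntimeError("requirements still unmet after max_phases")
        phases += 1
        for e in edges:
            # each surviving edge is cut independently with probability d[e]
            if e not in cut and rng.random() < d[e]:
                cut.add(e)
    return cut, phases
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible; the returned phase count is the quantity bounded in expectation by the analysis that follows.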
We define the residual requirement $r^k_i$ of a group $X_i$ at the start of phase
$k$ to be the difference between $r_i$ (requirement of group $i$) and the number of
connected components of $F^k$ that contain vertices of $X_i$. The rounding procedure ends when
the residual requirement of each group is 0. We will bound the expected number
of phases by O(log gR) which gives us the desired approximation guarantee. For
this, we show that in each phase, the total residual requirement (summed over
all groups) goes down by a constant factor in expectation. This technique was
used in the paper of Konjevod et al. [14] on the covering Steiner problem.
Rounding in a Single Phase. The analysis here is for a single phase, and we
drop the superscript k for ease of notation. For a group i, define Fi to be the
sub-forest induced by Xi in F. Let Hi be the forest obtained from Fi by short
cutting over all degree 2 Steiner (non Xi) vertices. So all Steiner vertices in Hi
have degree at least 3. Such a forest is useful because of the following lemma,
which is used in bounding the residual requirement at the end of this phase. The
proof of this lemma is a simple counting argument, and we omit it.

Lemma 1 In any Steiner forest with each non-terminal having degree at least
3, removal of any $m$ edges results in at least $\frac{m+1}{2}$ more components containing
terminals.
Let $r'_i$ denote the residual requirement of group $i$ at the start of the current
phase. For a subgraph $F'$ of $F$, by the length $d(F')$ of $F'$ we mean the total length
of edges in $F'$, $d(F') = \sum_{e \in F'} d_e$. Let $p_{u,v}$ denote the probability that vertices
$u$ and $v$ are disconnected in this phase of rounding. Then we have the following
lemma.
Lemma 2 The total probability weight on $H_i$, $\sum_{e \in H_i} p_e$, is at least $(1 - \frac{1}{e})\, d(H_i)$.
Proof: For those edges $e$ of $H_i$ which are also edges in $F$, it is clear that $p_e = d_e$.
Now consider an edge $(u, v) \in H_i$ that is obtained by short cutting a path $P$ in $F_i$.
We claim that $p_{u,v} \ge (1 - \frac{1}{e})\, d_{u,v}$. Since each edge is rounded independently, and $u$
and $v$ are separated if any of the edges in $P$ is removed,
\[
p_{u,v} = 1 - \prod_{e \in P} (1 - d_e) \;\ge\; 1 - \exp(-d(P)).
\]
We consider the following 2 cases:
– $d(P) \le 1$. Note that $1 - e^{-y} \ge (1 - 1/e)\, y$ for $y \in [0, 1]$. So $p_{u,v} \ge (1 - 1/e)\, d(P) \ge (1 - 1/e)\, d_{u,v}$, since $d$ is a metric.
– $d(P) \ge 1$. In this case $p_{u,v} \ge 1 - 1/e \ge (1 - 1/e)\, d_{u,v}$, since $d_{u,v} \in [0, 1]$.
Thus, summing $p_{u,v}$ over all edges $(u, v) \in H_i$ we get the lemma.
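The two elementary inequalities used in this proof are stated without justification. A short derivation, added here for completeness and using only standard calculus facts, is:

```latex
% Step 1: 1 - x <= exp(-x) for every real x, and each factor 1 - d_e is
% nonnegative because d_e lies in [0, 1]; multiply the factors along P.
\[
  \prod_{e \in P} (1 - d_e) \;\le\; \prod_{e \in P} \exp(-d_e) \;=\; \exp\bigl(-d(P)\bigr).
\]
% Step 2: f(y) = 1 - exp(-y) is concave (f''(y) = -exp(-y) < 0), so on [0, 1]
% it lies on or above the chord through (0, f(0)) = (0, 0) and (1, f(1)).
\[
  1 - \exp(-y) \;\ge\; (1 - y)\, f(0) + y\, f(1) \;=\; \Bigl(1 - \tfrac{1}{e}\Bigr)\, y,
  \qquad y \in [0, 1].
\]
```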
Now, consider adding edges to forest Hi to make it a Steiner tree on Xi. If
Xi appears in ci connected components in F, we need to add ci − 1 edges. Since
every edge has d-length at most 1, adding ci − 1 edges increases the length of
Hi by at most ci − 1. But from Claim 1, every Steiner tree on Xi has length
at least $r_i - 1$. So we get $d(H_i) \ge (r_i - 1) - (c_i - 1) = r'_i$. Lemma 2 then gives
$\sum_{e \in H_i} p_e \ge (1 - 1/e)\, r'_i \ge \frac{r'_i}{2}$.
Consider a 0/1 random variable $Z^i_e$ for each edge $e = (u, v) \in H_i$ which is 1
iff the connection between $u$ and $v$ is broken. Note that $\Pr[Z^i_e = 1] = p_e$. Let
$Y_i = \sum_{e \in H_i} Z^i_e$, which is the number of edges cut in forest $H_i$. Note that although
$H_i$ is not a subgraph of $F$, separating vertices in $H_i$ is equivalent to separating
them in $F$. Now, $E[Y_i] = \sum_{e \in H_i} E[Z^i_e] = \sum_{e \in H_i} p_e \ge \frac{r'_i}{2}$. Let $N_i$ be the increase
in the number of components of group $i$ in this phase. Since $H_i$ is a forest that
satisfies the conditions of Lemma 1, we have $N_i \ge Y_i/2$. Thus $E[N_i] \ge \frac{r'_i}{4}$.
Bounding the Number of Phases. Let random variable $R^k_i$ denote the residual
requirement of group $i$ at the start of phase $k$, and $N^k_i$ the increase in the
number of components of group $i$ in phase $k$. So, $N^k_i = R^k_i - R^{k+1}_i$. Our analysis
so far holds for any round $k$. So we have
\begin{align*}
E[R^{k+1}_i \mid R^k_i = r'_i] &= E[R^k_i - N^k_i \mid R^k_i = r'_i] \\
&= r'_i - E[N^k_i \mid R^k_i = r'_i] \\
&\le r'_i - \frac{r'_i}{4} = \frac{3 r'_i}{4}
\end{align*}

So unconditionally, $E[R^{k+1}_i] \le \frac{3}{4} E[R^k_i]$. Now let $R^k = \sum_{i=1}^{g} R^k_i$ be the total
residual requirement at the start of round $k$. By linearity of expectation,
$E[R^{k+1}] \le \frac{3}{4} \sum_{i=1}^{g} E[R^k_i] = \frac{3}{4} E[R^k]$. Thus, we have by induction that after
$k$ rounds, $E[R^k] \le (3/4)^k E[R^0] \le (3/4)^k gR$. So choosing $k = 4 \ln(gR)$,
$E[R^k] \le 1/2$, and by Markov inequality, $\Pr[R^k \ge 1] \le 1/2$. Thus, we have
proved the following.
Theorem 1 There is a randomized rounding procedure which computes a solu-
tion to the requirement cut on trees, of cost O(log(gR))·OPTf and succeeds with
probability at least half.
2.3 LP Rounding for General Graphs
In this section we show how the LP relaxation (LP) yields an O(log n log(gR))
approximation algorithm for requirement cut on general graphs. This will also
show that the integrality gap of (LP) is at most O(log n log(gR)). The integrality
gap of the linear program for set cover gives an Ω(log g) integrality gap for LP.
Let $LP_G$ denote the LP relaxation of the requirement cut on $G = (V(G), E(G))$,
and $\mathrm{OPT}_f$ its optimal cost. Our rounding procedure uses the optimal solution
$d^*$ of $LP_G$, and embeds it into a distribution of tree metrics. Then we use the
rounding procedure for trees to solve requirement cut on the resulting tree.
We use the embedding results of Fakcharoenphol, Rao and Talwar (FRT) [15] to embed
metric $(d^*, G)$ into a distribution $\mathcal{T}$ of dominating trees, with a logarithmic
overhead. In [15], it is shown that this can be done in polynomial time. Let
$(\kappa, T) \in_R \mathcal{T}$ be a random instance from this distribution, where
$T = (V(T), E(T))$ is a Steiner tree on $V(G)$, and $\kappa$ is the tree metric obtained by
assigning some distances to $E(T)$. For an edge $e \in E(T)$, we denote by $\mathrm{sep}_e$ the set
of edges of $G$ that are separated by $e$ in the tree $T$, i.e.
$\mathrm{sep}_e = \{(i, j) \mid i, j \in V(G),\ e \text{ lies on the } i\text{-}j \text{ path in } T\}$. We
define costs on the edges of $T$ as $c'_e = \sum_{(i,j) \in \mathrm{sep}_e} c_{i,j}$, and on non-tree edges as
0. Consider a tree requirement cut instance on $T$ with the cost function $c'$. The
groups X1, · · · , Xg and corresponding requirements r1, · · · , rg are the same as
in the original requirement cut on G. It is easy to see that a feasible (integral)
solution to the requirement cut on T of cost C, corresponds to a feasible integral
solution on G, of cost at most C.
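The induced tree costs can be computed directly from the definition of sep_e. The following Python sketch is an illustration with assumed names (`tree_edge_costs`, dictionaries keyed by vertex pairs); it walks the tree path of every graph edge and charges that edge's cost to each tree edge on the path.

```python
from collections import defaultdict

def tree_edge_costs(tree_edges, graph_costs):
    """Return c'_e for every tree edge e: the total cost of graph edges (i, j)
    whose i-j path in the tree uses e, i.e. the cost of sep_e."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)

    def path_edges(s, t):
        # Search from s recording parents, then walk back from t to s.
        parent = {s: None}
        stack = [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in parent:
                    parent[y] = x
                    stack.append(y)
        out = []
        while t != s:
            out.append(frozenset((t, parent[t])))
            t = parent[t]
        return out

    cost = {frozenset(e): 0.0 for e in tree_edges}
    for (i, j), c in graph_costs.items():
        for e in path_edges(i, j):
            cost[e] += c
    return cost
```

Removing a tree edge e and then deleting all graph edges in sep_e from G costs exactly c'_e, which is why a tree solution of cost C maps to a graph solution of cost at most C. The sketch recomputes a search per graph edge for clarity rather than speed.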
Now consider the linear relaxation $LP_T$ of this requirement cut on tree $T$.
Define $\kappa'_{u,v} = \min\{\kappa_{u,v}, 1\}$ for all $u, v \in V(T)$. Note that $\kappa'$ is a metric in [0, 1].
Since $d^*$ is a metric with values only in [0, 1], and $\kappa$ restricted to $V(G)$ dominates
$d^*$, so does $\kappa'$. So for any spanning tree $T_i$ on a group $X_i$, its length under $\kappa'$,
$\kappa'(T_i)$, is at least $d^*(T_i)$, its length under $d^*$. Since $d^*$ is a feasible solution for $LP_G$, we
get $\kappa'(T_i) \ge r_i - 1$. Thus $\kappa'$ is a feasible solution for $LP_T$ on tree $T$.
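The cost of this fractional solution is
\[
C = \sum_{u,v \in V(T)} c'_{u,v} \cdot \kappa'_{u,v}
  = \sum_{e \in E(T)} \kappa'_e \sum_{(i,j) \in \mathrm{sep}_e} c_{i,j}
  = \sum_{i,j \in V(G)} c_{i,j} \cdot \kappa'_{i,j}
\]
Since $\kappa$ dominates $\kappa'$, we get $C \le \sum_{i,j \in V(G)} c_{i,j} \cdot \kappa_{i,j}$. Now from [15] we have
the following theorem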

Theorem 2 ([15]): For all $i, j \in V(G)$, $\kappa_{i,j} \ge d^*_{i,j}$ and the expected value
$E[\kappa_{i,j}] \le \rho \cdot d^*_{i,j}$, where $\rho = O(\log n)$.
Using this, and linearity of expectation, we get $E[C] \le \rho \cdot \sum_{i,j \in V(G)} c_{i,j} d^*_{i,j} = \rho \cdot \mathrm{OPT}_f$.
Now using the rounding for trees (Theorem 1), we get the following
approximation for general graphs.
Theorem 3 There is a randomized rounding procedure which computes a so-
lution to the requirement cut on graphs, of cost O(log n · log(gR)) · OPTf and
succeeds with probability at least half.
Remark: Improved approximation for small sized groups. We note that
there is an alternate algorithm that gives an O(t log g) approximation ratio,
where t is the size of the largest group. This follows from the algorithm for
multi-multiway cut [9]. We first solve the LP relaxation (LP) to get a metric $d^*$.
The following argument holds for any group $i$. The MST on $X_i$, $T_i$, has length
at least $r_i - 1$. Since each edge has length at most 1, $T_i$ has at least $r_i - 1$
edges of length $\ge \frac{1}{t}$. Consider removing the $r_i - 1$ longest edges in $T_i$ to get
$r_i$ connected components. Pick one vertex from each component to obtain set
$S_i = \{s_1, s_2, \dots, s_{r_i}\} \subseteq X_i$. Since $T_i$ is an MST on $X_i$, each pairwise distance
in $S_i$ is at least $\frac{1}{t}$.
We define metric $d' = \min\{t \cdot d^*, 1\}$. In metric $d'$, all pairwise distances in
$S_i$ are 1 ($i = 1, \dots, g$). We define a multi-multiway cut instance on the sets
$S_1, S_2, \dots, S_g$. Now $d'$ is a feasible fractional solution to this instance. So using
the region growing procedure described in [9], we get a cut that separates each
group Xi into the required number of pieces and has cost at most O(t · log g) ·
OPTf . This implies an improvement when the largest group size is a constant.
3 Deterministic Greedy Algorithm
In this section we present an alternate greedy algorithm that achieves the same
guarantee of O(log(gR) log n) for the requirement cut problem. As before, we
assume that the graph G is complete. Our algorithm works in phases, and we
maintain a partial solution in every phase of the algorithm. Each phase is a
greedy step and augments this partial solution. The algorithm ends when all
requirements have been satisfied. A partial solution is just a set of edges A ⊆ E.
Recall the definitions of residual graph, and residual requirement, w.r.t. a partial
solution A (Section 2.2). At the start of our algorithm, A = φ and the residual
requirement of group i (i = 1, · · · , g) is ri − 1. We call a group i active if it has
positive residual requirement in the current residual graph.
Consider a residual graph $G' = (V, E')$. A bond in $G'$ is an edge cut of the
form $\partial S = \{(i, j) \in E' \mid i \in S,\ j \notin S\}$, where $S \subseteq V$ is a set of vertices. We define
$\mathrm{cov}(\partial S)$, the coverage of $\partial S$, to be the number of active groups separated by $\partial S$
in $G'$. The cost-effectiveness of bond $\partial S$ in $G'$ is $\mathrm{eff}(\partial S) = \frac{c(\partial S)}{\mathrm{cov}(\partial S)}$, the ratio of
the cost of the bond to its coverage.
The greedy approach for the requirement cut problem is to remove a bond
of minimum cost effectiveness in each phase. But the problem of finding the

minimum cost effective bond generalizes sparsest cut (See next section), and
we do not know of an exact algorithm. However, we can obtain an approximate
solution for this problem, and work with this as the greedy choice. For a phase $k$,
let $G_k$ denote the residual graph at the start of this phase, $\phi_k$ the total residual
requirement in $G_k$, and $M_k$ the minimum cost effective bond in $G_k$. Below $\mathrm{OPT}^*$
is the cost of the optimal requirement cut on $G$.
Lemma 3 $\mathrm{eff}(M_k) \le \frac{2\,\mathrm{OPT}^*}{\phi_k}$. Without loss of generality, $M_k$ is contained in
some connected component of $G_k$.
Proof: Consider the connected components $S_1, S_2, \dots, S_l$ in the graph $G$ after
removing the edges of the optimal solution. It is clear that each edge that
is cut is incident on exactly 2 of these connected components. So we have
$\sum_{j=1}^{l} c(\partial S_j) = 2\,\mathrm{OPT}^*$. Let $S'_1, S'_2, \dots, S'_p$ be the restrictions of $S_1, S_2, \dots, S_l$
to the connected components of $G_k$. Note that the total additional requirement
satisfied on removing $\partial S'_j$ in $G_k$ is just $\mathrm{cov}(\partial S'_j)$, for any $j = 1, \dots, p$.
Now since the optimal solution must satisfy all the residual requirements in $G_k$,
we have $\sum_{j=1}^{p} \mathrm{cov}(\partial S'_j) \ge \phi_k$. Let $c'(\partial S)$ denote the cost of bond $\partial S$ in graph
$G_k$ (this may only be smaller than the cost of bond $\partial S$ in $G$). It is clear that
$\sum_{j=1}^{p} c'(\partial S'_j) = \sum_{j=1}^{l} c'(\partial S_j) \le 2\,\mathrm{OPT}^*$. We now have
\[
\min_{j=1}^{p} \frac{c'(\partial S'_j)}{\mathrm{cov}(\partial S'_j)}
\;\le\; \frac{\sum_{j=1}^{p} c'(\partial S'_j)}{\sum_{j=1}^{p} \mathrm{cov}(\partial S'_j)}
\;\le\; \frac{2\,\mathrm{OPT}^*}{\phi_k}
\]
So the minimum cost effective bond $M_k$ has $\mathrm{eff}(M_k) \le \frac{2\,\mathrm{OPT}^*}{\phi_k}$.
By a similar averaging argument, we may assume that $M_k$ is in some connected
component of $G_k$.
3.1 Approximating the Minimum Cost Effective Bond
In making the greedy choice, we wish to compute a bond of minimum cost
effectiveness. Formally, we are given a connected graph H = (VH , EH) with
costs on edges, and g groups of vertices X1, · · · , Xg ⊆ VH. We want a bond that
minimizes the ratio of the cost of the cut to the number of groups it separates.
We observe that this is the minimum ratio Steiner cut problem that is solved
in Klein et al. [10]. In [10] they give an $O(\log^2(gt))$ approximation algorithm for
this problem, where $t$ is the maximum size of a group. Here, using the results
of Fakcharoenphol et al. [15], we improve the approximation ratio to $O(\log n)$.
Note that when all group sizes and requirements are 2, the Steiner ratio problem
reduces to sparsest cut [2].
The following is an LP relaxation for the Steiner ratio problem. The MST
algorithm gives us a polynomial time separation oracle for this LP. We may
assume that the input graph H is complete, by adding edges of zero cost.

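\[
\begin{array}{lll}
(LP_H) \quad \min & \displaystyle\sum_{e \in H} c_e l_e & \\
\text{s.t.} & \displaystyle\sum_{i=1}^{g} \sum_{e \in T_i} l_e \ge 1 & \forall\, (T_1, \dots, T_g) \text{ where } T_i : \text{spanning tree on } H[X_i],\ i = 1, \dots, g \\
& l_e \in [0, 1] & \\
& l : \text{metric} &
\end{array}
\]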
To observe that this is indeed a relaxation of the Steiner ratio problem, consider
the 0-1 metric $l'$ corresponding to the optimal solution $B$ (which is a bond):
$l'(u, v) = 0$ iff $u$ and $v$ are in the same connected component in $H \setminus \partial B$, and 1
otherwise. Let $\sigma$ be the number of groups separated by $\partial B$. Clearly, the sum of
the lengths of the MSTs in $H[X_i]$ ($i = 1, \dots, g$) under $l'$ is also $\sigma$. Now $l = \frac{1}{\sigma} l'$ is a feasible solution to
$(LP_H)$, and has cost $\frac{c(\partial B)}{\sigma} = \mathrm{eff}(\partial B)$.
Let $l$ be the optimal solution of $(LP_H)$, and $\mathrm{OPT}' = \sum_{e \in H} c_e \cdot l_e$ its cost.
We now describe how to round $l$ to get an approximate integral cut. We use the
FRT-embedding to embed metric $(l, H)$ to a distribution $\mathcal{T}$ of dominating tree
metrics. Let $(T, \kappa) \in_R \mathcal{T}$ be a random sample from $\mathcal{T}$. Then restating Theorem
2, we have the following: For all $i, j \in V_H$, $\kappa_{i,j} \ge l_{i,j}$ and the expected value
$E[\kappa_{i,j}] \le O(\log n) \cdot l_{i,j}$.
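Each edge $f$ in $T$ corresponds to a bond $\mathrm{sep}_f$ (defined in Section 2.3) in
the original graph $H$. We say edge $f \in T$ separates group $i$ (denoted $f|X_i$)
if removing edge $f$ from $T$ disconnects $X_i$. We denote by $T[X_i]$ the smallest
subtree of $T$ containing $X_i$. We have,
\begin{align*}
\min_{f \in T} \frac{c(\mathrm{sep}_f)}{\mathrm{cov}(\mathrm{sep}_f)}
&= \min_{f \in T} \frac{\kappa_f\, c(\mathrm{sep}_f)}{\kappa_f \sum_{i=1}^{g} 1_{f|X_i}}
 \;\le\; \frac{\sum_{f \in T} \kappa_f \sum_{(i,j) \in \mathrm{sep}_f} c_{i,j}}{\sum_{f \in T} \kappa_f \sum_{i=1}^{g} 1_{f|X_i}} \\
&= \frac{\sum_{i,j} c_{i,j} \sum_{f : (i,j) \in \mathrm{sep}_f} \kappa_f}{\sum_{f \in T} \kappa_f \sum_{i=1}^{g} 1_{f|X_i}}
 = \frac{\sum_{i,j} c_{i,j} \kappa_{i,j}}{\sum_{i=1}^{g} \sum_{f : f|X_i} \kappa_f}
 = \frac{\sum_{i,j} c_{i,j} \kappa_{i,j}}{\sum_{i=1}^{g} \kappa(T[X_i])} \\
&\le 2 \cdot \frac{\sum_{i,j} c_{i,j} \kappa_{i,j}}{\sum_{i=1}^{g} \kappa(\mathrm{MST}_i)} \tag{1} \\
&\le 2 \cdot \frac{\sum_{i,j} c_{i,j} \kappa_{i,j}}{\sum_{i=1}^{g} l(\mathrm{MST}_i)} \tag{2} \\
&\le 2 \cdot \sum_{i,j} c_{i,j} \kappa_{i,j} \tag{3}
\end{align*}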
where the only non-trivial inequalities are the last three. Since $T[X_i]$ is a Steiner
tree on $X_i$, the length of an MST on $X_i$ is at most 2 times the length of $T[X_i]$,
so (1) follows. For (2) note that $\kappa$ (restricted to $V_H$) dominates $l$, and (3) follows
from the feasibility of $l$ in $(LP_H)$. In [15], they describe how their tree embedding
algorithm can be derandomized to find a single tree with small weighted average
stretch. Here if we think of $c_{i,j}$ as the weights, we can find a single tree metric
$(T, \kappa)$ s.t. $\sum_{i,j} c_{i,j} \kappa_{i,j} \le \rho \cdot \sum_{i,j} c_{i,j} l_{i,j}$, where $\rho = O(\log n)$. The minimum cost
effective bond $C$ corresponding to an edge of $T$ is output as the approximate
solution. From the above argument, $\mathrm{eff}(C) \le 2\rho \cdot \mathrm{OPT}'$, and we get an $O(\log n)$
approximation.
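The final selection step, choosing the tree edge whose bond has the best ratio, can be sketched in Python as below. The embedding itself is not implemented: the tree is assumed to be given (for instance from an FRT-style construction), and the function name `best_tree_bond` and the input encoding are hypothetical.

```python
from collections import defaultdict

def best_tree_bond(tree_edges, graph_costs, groups):
    """For each tree edge f, evaluate the bond sep_f in the original graph and
    return (ratio, f, side) minimising cost(sep_f) / cov(sep_f), or None."""
    adj = defaultdict(set)
    for u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)

    def side_of(u, v):
        # Tree vertices reachable from u once edge (u, v) is removed.
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and {x, y} != {u, v}:
                    seen.add(y)
                    stack.append(y)
        return seen

    best = None
    for u, v in tree_edges:
        S = side_of(u, v)  # may include Steiner vertices of the tree
        cost = sum(c for (i, j), c in graph_costs.items() if (i in S) != (j in S))
        cov = sum(1 for X in groups
                  if any(x in S for x in X) and any(x not in S for x in X))
        if cov > 0 and (best is None or cost / cov < best[0]):
            best = (cost / cov, (u, v), S)
    return best
```

Only the groups passed in are counted, so in the greedy algorithm one would pass the currently active groups of the component being processed.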
The greedy step in phase $k$ of the requirement cut algorithm is as follows. Run
the Steiner ratio algorithm for each connected component of the residual graph
$G_k$. Pick the bond $C_k$ of minimum cost effectiveness among those obtained. Let
$h_k$ be the number of active groups separated by $C_k$ (in $G_k$). Then Lemma 3, with
the Steiner ratio approximation, gives the following.
Lemma 4 The cost effectiveness of the cut picked in the $k$-th iteration of the
greedy algorithm is $\mathrm{eff}(C_k) = \frac{c(C_k)}{h_k} \le 2\rho \cdot \mathrm{eff}(M_k) \le 4\rho \cdot \frac{\mathrm{OPT}^*}{\phi_k}$.
Our analysis to bound the cost incurred by the greedy algorithm is similar
to that used in other set cover based algorithms [16]. The decrease in residual
requirement is $\phi_k - \phi_{k+1} = h_k$, the coverage of the greedy step. Using Lemma 4,
we get $\phi_{k+1} \le \left(1 - \frac{c(C_k)}{4\rho\,\mathrm{OPT}^*}\right) \phi_k$. Using this recursively, we get
\begin{align*}
\phi_m &\le \prod_{k=1}^{m-1} \left(1 - \frac{c(C_k)}{4\rho\,\mathrm{OPT}^*}\right) \cdot \phi_1 \\
&\le \exp\left(-\sum_{k=1}^{m-1} \frac{c(C_k)}{4\rho\,\mathrm{OPT}^*}\right) \cdot \phi_1
\end{align*}
where $m$ is the number of phases in the greedy algorithm. So we get
$\sum_{k=1}^{m-1} c(C_k) \le 4\rho \cdot \ln\left(\frac{\phi_1}{\phi_m}\right) \mathrm{OPT}^* \le 4\rho \ln(gR)\, \mathrm{OPT}^*$, since $\phi_m \ge 1$ and $\phi_1 \le gR$. Observe that
the coverage in the last phase is $h_m = \phi_m$. Now from Lemma 4, $c(C_m) \le 4\rho\,\mathrm{OPT}^*$.
So the cost of the greedy solution is at most $4\rho(\ln(gR) + 1)\,\mathrm{OPT}^*$. Also, the
number of iterations is at most $gR$, which gives a polynomial time algorithm.
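The individual steps of this calculation are compressed in the text. Written out, under no assumptions beyond Lemma 4 and the inequality 1 − x ≤ e^{−x}, they read:

```latex
% Rearranging Lemma 4, c(C_k)/h_k <= 4 rho OPT^* / phi_k, gives a lower bound
% on the coverage h_k, and hence the one-step recurrence.
\[
  h_k \;\ge\; \frac{c(C_k)}{4\rho\,\mathrm{OPT}^*}\,\phi_k
  \quad\Longrightarrow\quad
  \phi_{k+1} = \phi_k - h_k \;\le\; \Bigl(1 - \frac{c(C_k)}{4\rho\,\mathrm{OPT}^*}\Bigr)\phi_k .
\]
% Each factor is nonnegative and at most exp(-c(C_k) / (4 rho OPT^*));
% taking logarithms of the telescoped bound isolates the total cost.
\[
  \ln\frac{\phi_m}{\phi_1} \;\le\; -\sum_{k=1}^{m-1} \frac{c(C_k)}{4\rho\,\mathrm{OPT}^*}
  \quad\Longrightarrow\quad
  \sum_{k=1}^{m-1} c(C_k) \;\le\; 4\rho\,\mathrm{OPT}^* \ln\frac{\phi_1}{\phi_m} .
\]
% Last phase: the coverage equals the whole remaining requirement, h_m = phi_m.
\[
  c(C_m) \;=\; h_m \cdot \mathrm{eff}(C_m)
  \;\le\; \phi_m \cdot \frac{4\rho\,\mathrm{OPT}^*}{\phi_m} \;=\; 4\rho\,\mathrm{OPT}^* .
\]
```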
Theorem 4 The (approximate) greedy algorithm is an O(log n log(gR)) approx-
imation algorithm for the requirement cut problem on graphs.
As a consequence of this theorem, we see that when restricted to trees, the greedy
step is trivial : bonds are just edges, and we can enumerate all of them. So the
greedy algorithm is an O(log(gR)) approximation for trees.
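On a tree the greedy step can be carried out exactly by trying every remaining edge, as the following Python sketch illustrates. The names (`greedy_tree_cut`, `group_components`) and the input encoding are assumptions of this sketch; coverage is recomputed from scratch for each candidate edge, which is simple rather than efficient.

```python
def group_components(vertices, edges, cut, group):
    """Number of components of (V, edges minus cut) that contain a vertex of `group`."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for e in edges:
        if e not in cut:
            u, v = e
            parent[find(u)] = find(v)
    return len({find(x) for x in group})

def greedy_tree_cut(vertices, edges, cost, groups, reqs):
    """Repeatedly remove the tree edge with the smallest cost per newly
    separated active group, until all requirements are met."""
    cut = set()
    while True:
        counts = [group_components(vertices, edges, cut, X) for X in groups]
        active = [i for i, r in enumerate(reqs) if counts[i] < r]
        if not active:
            return cut
        best = None
        for e in edges:
            if e in cut:
                continue
            # coverage: active groups that gain a component when e is removed
            cov = sum(1 for i in active
                      if group_components(vertices, edges, cut | {e}, groups[i]) > counts[i])
            if cov and (best is None or cost[e] / cov < best[0]):
                best = (cost[e] / cov, e)
        if best is None:
            raise ValueError("requirements cannot be met on this tree")
        cut.add(best[1])
```

On a star whose groups each contain the root and requirements are all 2, this reduces to the usual greedy rule for set cover, in line with the hardness reduction of Section 2.2.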
4 Conclusions and Open Problems
In this paper, we introduced the requirement cut problem, which contains several
graph partitioning problems as special cases. For general graphs, we have two
O(log n log(gR)) approximation algorithms: one based on LP rounding, and the
other a greedy algorithm. When restricted to trees, both of these algorithms have a
better performance guarantee of O(log(gR)).
One obvious open question is whether the approximation ratio can be im-
proved. One approach is to try improving the approximation guarantee for the
Steiner ratio problem. The recent techniques of Arora, Rao, Vazirani [17], do
not seem to apply directly here. The rounding procedure in Section 2 shows that
the integrality gap of (LP) is at most O(log n log(gR)). There is a lower bound
of Ω(log g) due to set cover. It will be interesting to know if there is a worse
example. Even in the case of a tree there is a gap between the upper bound of
O(log(gR)) and lower bound of Ω(log g).
References
1. Ford, L.R., Jr., Fulkerson, D.R.: Maximal flow through a network. Canadian Journal
of Math. 8 (1956)
2. Leighton, T., Rao, S.: An approximate max-flow min-cut theorem for uniform
multicommodity flow problems, with applications to approximation algorithms. In
Proc. 29th IEEE Annual Symposium on Foundations of Computer Science (1988)
422–431
3. Klein, P., Rao, S., Agarwal, A., Ravi, R.: An approximate max-flow min-cut
relation for multicommodity flow, with applications. Proc. 31st IEEE Annual
Symposium on Foundations of Computer Science (1990) 726–737
4. Garg, N., Vazirani, V.V., Yannakakis, M.: Approximate max-flow min-(multi)cut
theorems and their applications. ACM Symposium on Theory of Computing (1993)
698–707
5. Plotkin, S.A., Tardos, E.: Improved bounds on the max-flow min-cut ratio for
multicommodity flows. ACM Symposium on Theory of Computing (1993) 691–
697
6. Aumann, Y., Rabani, Y.: An O(log k) approximate min-cut max-flow theorem and
approximation algorithm. SIAM J. Comput. 27 (1998) 291–301
7. Linial, N., London, E., Rabinovich, Y.: The geometry of graphs and some of its
algorithmic applications. FOCS 35 (1994) 577–591
8. Karger, D.R., Klein, P.N., Stein, C., Thorup, M., Young, N.E.: Rounding algorithms
for a geometric embedding for minimum multiway cut. ACM Symposium on Theory
of Computing (1999) 668–678
9. Avidor, A., Langberg, M.: The multi-multiway cut problem. Proceedings of SWAT
(2004) 273–284
10. Klein, P.N., Plotkin, S.A., Rao, S., Tardos, E.: Approximation algorithms for
steiner and directed multicuts. J. Algorithms 22 (1997) 241–269
11. Saran, H., Vazirani, V.V.: Finding k cuts within twice the optimal. SIAM J.
Comput. 24 (1995)
12. Chekuri, C., Guha, S., Naor, J.S.: The steiner k-cut problem. Proc. of ICALP
(2003)
13. Feige, U.: A threshold of ln n for approximating set cover. J. ACM 45 (1998)
634–652
14. Konjevod, G., Ravi, R., Srinivasan, A.: Approximation algorithms for the covering
steiner problem. Random Struct. Algorithms 20 (2002) 465–482
15. Fakcharoenphol, J., Rao, S., Talwar, K.: A tight bound on approximating arbitrary
metrics by tree metrics. Proceedings of the 35th Annual ACM Symposium on
Theory of Computing. (2003)
16. Klein, P.N., Ravi, R.: A nearly best-possible approximation algorithm for node-
weighted steiner trees. J. Algorithms 19 (1995) 104–115
17. Arora, S., Rao, S., Vazirani, U.: Expander flows, geometric embeddings and graph
partitioning. STOC (2004) 222–231

Approximation Schemes for Node-Weighted
Geometric Steiner Tree Problems
Jan Remy and Angelika Steger
Institut für Theoretische Informatik, ETH Zürich, Switzerland
{jremy,steger}@inf.ethz.ch
Abstract. In this paper, we consider the following variant of the ge-
ometric Steiner tree problem. Every point u which is not included in
the tree costs a penalty of π(u) units. Furthermore, every Steiner point
we use costs cS units. The goal is to minimize the total length of the
tree plus the penalties. We prove that the problem admits a polynomial
time approximation scheme, if the points lie in the plane. Our PTAS
uses a new technique which allows us to bypass major requirements of
Arora’s framework for approximation schemes for geometric optimization
problems [1]. It may thus open new possibilities to find approximation
schemes for geometric optimization problems that have a complicated
topology. Furthermore the techniques we use provide a more general
framework which can be applied to geometric optimization problems
with more complex objective functions.
1 Introduction
The Model. In this paper we consider a geometric optimization problem which
we call Node-Weighted Geometric Steiner Tree or NWGST for short.
An instance of this problem consists of a set of points $P$ in the plane, a penalty
function $\pi : P \to \mathbb{Q}^+$ and a cost $c_S \in \mathbb{Q}^+$. A solution $ST$ is a (geometric) spanning
tree on $V(ST) \cup S(ST)$, where $V(ST) \subseteq P$ and $S(ST) \subset \mathbb{R}^d$. The points in
$S(ST)$ are called Steiner points. Let $E(ST)$ denote the set of line segments or
edges used by $ST$. We minimize
$$val(ST) = \sum_{\{u,v\} \in E(ST)} \ell_q(\{u,v\}) + \sum_{u \in P - V(ST)} \pi(u) + |S(ST)|\,c_S, \qquad (1)$$
where $\ell_q(\{u,v\}) = d_q(u,v)$ measures the length of the edge $\{u,v\}$ in the $L_q$-metric.
Since it is easy to compute a spanning tree of minimum length, the main
problem is in fact to choose $V(ST)$ and $S(ST)$ optimally.
The objective function can be intuitively understood as follows. We pay for
the total length of the tree plus a penalty of cS units for every Steiner point we
use. Furthermore, we are charged π(u) units for every point u ∈ P which is not
included in the tree. Throughout the paper $ST^*$ denotes the optimal solution
for the input set P.
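To make the three terms of objective (1) concrete, here is a minimal sketch that evaluates it for an explicitly given tree. The function names and the data layout (coordinate tuples, a dictionary of penalties) are illustrative assumptions, not notation from the paper.

```python
def lq_distance(u, v, q=2.0):
    """Distance between two points in the L_q metric (q >= 1)."""
    return sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)

def val(edges, connected, penalties, steiner_points, c_s, q=2.0):
    """Objective (1): tree length + penalties of skipped points + Steiner cost.

    edges          -- list of (u, v) pairs of coordinate tuples, E(ST)
    connected      -- set of input points spanned by the tree, V(ST)
    penalties      -- dict mapping every input point in P to pi(u)
    steiner_points -- set of coordinate tuples used as Steiner points, S(ST)
    c_s            -- cost charged per Steiner point
    """
    length = sum(lq_distance(u, v, q) for u, v in edges)
    skipped = sum(pi for p, pi in penalties.items() if p not in connected)
    return length + skipped + len(steiner_points) * c_s
```

For three input points, skipping a cheap-to-skip point and connecting the other two directly can be compared against a star through one Steiner point simply by calling `val` on both candidate trees.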
If we choose cS = 0 and π(u) = ∞ for all u ∈ P, then we obtain the well-known geometric
Steiner tree problem [7, 10]. Furthermore, by choosing just cS = 0, we obtain the
prize-collecting variant of the geometric Steiner tree problem. In this paper we
will prove
Theorem 1. Let d = 2, that is, $P \subset \mathbb{R}^2$. Then for every fixed ε > 0, there is a
polynomial time algorithm which computes a (1+ε)-approximation to NWGST.
This means that NWGST and thus also the Prize-Collecting Geomet-
ric Steiner Tree admit a polynomial time approximation scheme (PTAS).
We would like to mention that our PTAS can also be applied to further variants
of node-weighted Steiner problems and other tree and tour problems. We will
give examples in Section 5. There, we will also show that the problem admits
a quasi-polynomial time approximation scheme (QPTAS) in higher dimensions.
It is well known that the existence of a QPTAS implies that the problem is not
APX-hard, provided $SAT \notin DTIME[n^{\mathrm{polylog}(n)}]$.
For the sake of brevity we will use the following notations in this paper.
Unless stated otherwise, log n always denotes the logarithm of n to base 2. For
every solution ST, $\ell_q(ST)$ denotes the total length of ST. Furthermore,
π(ST) and $c_S(ST)$ denote the total amount of penalties we pay for unconnected
points and the use of Steiner points, respectively.
Previous Work. The input of the Steiner tree problem in networks is a weighted
graph G = (V, E) and a set K ⊆ V . The goal is to find a tree of minimum
weight in G which includes all vertices in K. This problem is known to be APX-
complete [6] and it does therefore not admit a PTAS unless P = NP. However,
there are many constant-factor approximation algorithms for this problem. The
currently best algorithm is by Robins and Zelikovsky [15]. It yields a $(1 + \ln(3)/2)$-
approximation. There is also a node-weighted version of this problem [10] which
is similar to our model. It was shown by Klein and Ravi [11] that there is
an approximation algorithm with performance ratio 2 ln|K|. Furthermore, they
proved that it is not possible to achieve a ratio better than logarithmic, unless
$SAT \in DTIME[n^{\mathrm{polylog}(n)}]$. A similar setup was considered by Moss and Rabani
[13]. In essence they combined the unweighted Steiner tree problem with packing
problems. A 2-approximation algorithm for the prize-collecting variant of the
Steiner tree problem in networks is due to Goemans and Williamson [9].
It was shown by Arora [1, 2] and independently by Mitchell [12] that the
geometric Steiner tree problem admits a PTAS. This is also the best we can
hope for, since the geometric variant is strongly NP-hard [8]. Furthermore, Arora
claimed in [2] that there is a QPTAS for the prize-collecting variant. Talwar [16]
shows that Arora’s method extends to metrics that satisfy certain properties but
the complexity of his approximation schemes is quasi-polynomial. The Steiner
tree problem is extensively discussed in the textbooks of Hwang, Richards and
Winter [10] and of Prömel and Steger [14].
Motivation. NWGST is a very natural generalization of the geometric Steiner
tree problem and it has interesting applications. In practice it usually costs to
include an additional node into a network. For example, additional switches
could reduce the cable length of the computer network but they are expensive
to install.

We cannot simply adapt Arora’s PTAS for the geometric Steiner tree problem.
One reason is that the so-called patching lemma seems to be no longer true
if one introduces costs for the Steiner points. The patching lemma states that one
can reduce the number of crossings of some Steiner tree ST with a line segment
S of length x to 1 and that the required modifications to ST increase its length
only slightly. Arora’s method extends to many geometric optimization problems
provided one can find an equivalent for the patching lemma. In very few cases
this lemma is not required, like for the k-median problem [4]. On the other hand,
there are some open geometrical problems which include complicated topology
for which no equivalent of the patching lemma is known. Examples are
triangulation problems or the degree-restricted spanning tree problem [3]. Thus, for
the design of approximation schemes for geometrical problems, new techniques
avoiding the patching lemma are of independent interest.
Outline of the Algorithm. We briefly review the main ideas of Arora’s ap-
proximation scheme for the geometric Steiner tree problem. Arora subdivides
the smallest rectangle enclosing all points recursively by using a quadtree such
that the rectangles of the leaves of the quadtree contain at most one point. The
structure of a Steiner tree ST inside a rectangle of this quadtree can be de-
scribed by specifying i) the locations where edges of ST cross the boundary
of the rectangle and ii) how these locations are connected inside the rectangle.
Both parameters define the configuration of this rectangle. Given a rectangle
and a configuration, Arora’s algorithm optimizes locally as follows. It computes
the best solution for this configuration by enumerating all configurations of the
children of the rectangle and combining their optimum solutions. As the opti-
mum solutions for rectangles containing at most one point are easy to compute,
the optimum solutions of all rectangles can be computed bottom-up by dynamic
programming. In other words a look-up table is maintained which contains for
each configuration-rectangle pair the corresponding optimum. The complexity
of this dynamic program depends on the size of this table, i.e., the size of the
quadtree and the number of different configurations per rectangle. We shall see
later that the quadtree has logarithmic height and therefore contains polynomially
many rectangles. In order to reduce the number of configurations per rectangle,
one has to do some rounding. This is achieved as follows. Firstly, edges are only
allowed to cross the boundary of a rectangle at one out of O (log n) prespeci-
fied locations. Secondly, the tree may cross the boundary of a rectangle at most
constantly many times. In this way one can show that there exist only polynomi-
ally many different configurations for each rectangle and the dynamic program
therefore terminates in polynomial time. On the other hand, by reducing the
number of configurations the dynamic program optimizes only over a subclass of
all trees. One therefore also has to show that this subclass contains a tree whose
length differs only slightly from those of an optimum tree. By using the patching
lemma and introducing randomness into the quadtree one can show that this is
indeed the case.
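The recursive subdivision underlying this outline can be sketched as follows. This is only an illustration of a plain (unshifted) quadtree with at most one point per leaf; it assumes distinct input points and contains neither the prespecified crossing locations nor the dynamic program, and the dictionary layout is an assumption made for this example.

```python
def build_quadtree(points, x0, y0, side):
    """Split the square [x0, x0+side) x [y0, y0+side) recursively until every
    leaf holds at most one point. Assumes the points are pairwise distinct."""
    inside = [(x, y) for x, y in points
              if x0 <= x < x0 + side and y0 <= y < y0 + side]
    node = {"x0": x0, "y0": y0, "side": side, "points": inside, "children": []}
    if len(inside) > 1:
        half = side / 2
        for dx in (0, half):
            for dy in (0, half):
                node["children"].append(
                    build_quadtree(inside, x0 + dx, y0 + dy, half))
    return node

def leaves(node):
    """Return all leaf nodes below (and including) the given node."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]
```

A lookup table indexed by (rectangle, configuration) would then be filled from the leaves upwards, which is the bottom-up order described above.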
Unfortunately, our problem NWGST seems not to fit in this framework.
Although one can use a similar setup, the fact that additional Steiner points
add additional costs to the tree makes it hard to imagine that an optimum
tree can always be changed in such a way that it crosses the boundary of all
rectangles only at constantly many positions. We therefore use a completely
different approach to define configurations. Unlike Arora we do not specify the
locations where the tree edges cross the boundary of a rectangle. Instead we
aim at specifying the structure of the tree within a rectangle and at specifying
at which locations it should be connected to points outside of the rectangle.
Basically, this is achieved by subdividing the rectangle into O (log n) many cells.
A configuration specifies which cells contain an endpoint of an edge crossing the
boundary and specifies to which component inside of the rectangle those points
belong. In this way we restrict the number of different locations of end points of
edges crossing the boundary of a rectangle but not the number of such edges.
In the remainder of this paper we explain the details of our algorithm. Firstly,
we need a slightly different definition of a quadtree; this is described in Section 2.
Secondly, we need an appropriate subdivision of a rectangle into O (log n) many
cells. The structure we use is defined in Section 3.1. Finally, we have to define a
class of standardized trees that can be represented by a configuration. Such trees
are discussed in Section 3.2. The dynamic program is described in Section 4. It
is not possible to compute the optimal standardized tree but we can show that
our algorithm computes a (1 + ε)-approximation to it. Due to space limitations
all proofs are omitted. For details the reader is referred to a preprint of the full
paper which is available at http://www.ti.inf.ethz.ch/as/people/remy.
2 Preliminaries
An instance of the NWGST problem is well-rounded if all points and Steiner
points have odd integral coordinates, and the side length of the bounding box
is $L = O(n^2)$, i.e., $P \subset \{1, 3, 5, \dots, L - 1\}^2$. Note that this definition slightly
differs from the one Arora uses for his approximation scheme [1].
Lemma 1. If there is a PTAS for well-rounded instances of NWGST then there
is also a PTAS for arbitrary instances.
Furthermore, we adapt the concept of shifted quadtrees or shifted dissections
[1, 2] to our purposes. We require that L is a power of 2. If this is not the
case, we simply enlarge the bounding box appropriately. We choose two integers
a and b with a, b ∈ {0, 2, ..., L − 2}. The vertical line with x-coordinate a and
the horizontal line with y-coordinate b split the bounding box into four smaller
rectangles. We enlarge those rectangles such that their side length is L. By S0
we denote the area covered by the four enlarged rectangles. Note that S0 is a
square of side length L0 = 2L. We now subdivide S0 using an ordinary quad
tree until we obtain squares of size 2 × 2. We denote the quad tree generated in
this fashion by QTa,b to emphasize that its structure depends on the choice of a
and b.
Throughout we denote vertical dissection lines with even x-coordinates by
V and horizontal ones with even y-coordinates by H. A vertical line V with
x-coordinate x has level l with respect to a if there is an odd (potentially negative)
integer i such that $x = a + i \cdot L/2^l$. We write lev(V, a) = l. Similarly we define
lev(H, b). The level of a square S in the quadtree $QT_{a,b}$ is defined as follows:
The squares in the very first subdivision have level 0 and a square S at level
l is subdivided by a horizontal line H and a vertical line V with lev(H, b) =
lev(V, a) = l+1 into four squares at level l+1. The level of a square S is denoted
by lev(S). Let us briefly summarize salient properties of shifted quadtrees.
Observation 2 For a shifted quadtree $QT_{a,b}$ the following is true:
1. The subdivision uses only horizontal and vertical lines that have even
coordinates.
2. The depth of $QT_{a,b}$ is $\log L =: t$.
3. The side length of a square S at level l is $\ell_S = L/2^l$.
4. The number of vertical (horizontal) lines in $QT_{a,b}$ at level l is at most $2^l$.
Note that t is integral by choice of L. This property will turn out to be useful
in Section 3.1.
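The level of a dissection line follows directly from the definition $x = a + i \cdot L/2^l$ with i odd. The sketch below is an illustration under the stated assumptions (L a power of two, x and a even integers); the function name is not from the paper.

```python
def line_level(x, a, L):
    """Level l of the vertical line at coordinate x with respect to the shift a,
    i.e. the l with x = a + i * L / 2**l for some odd integer i.

    Assumes L is a power of two and x, a are even integers. Returns None
    for x == a (no odd i exists) and for lines farther than L from a,
    which lie outside the enlarged square S0.
    """
    diff = abs(x - a)
    if diff == 0 or diff > L:
        return None
    step = diff & -diff                  # largest power of two dividing diff
    return (L // step).bit_length() - 1  # log2(L / step)
```

With L = 16 and a = 4, the lines at x = 0, 8, -8 and 16 all have level 2, matching the count of at most $2^l$ lines per level in Observation 2.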
3 (s, t)-Maps and Standardized Solutions
3.1 (s, t)-Maps
From now on, we only consider well-rounded instances of NWGST. As men-
tioned in the introduction we need some subdivision of S ∈ QTa,b into cells.
A natural idea would be to use a regular (d × d)-grid. Unfortunately, it will
turn out that we have to choose d = Ω(st) = Ω(log n) to obtain a
(1 + 1/s)-approximation, where s = O(1/ε). Therefore this grid has $\Omega(\log^2 n)$ many cells.
This is too much, since we later store some bits per cell and since we require
that we have polynomially many states per rectangle. Indeed, such grids already
appear in the quasi-polynomial approximation schemes of [4, 16].
This problem is solved by using so-called (s, t)-maps which have only O (log n)
cells. The intuition behind (s, t)-maps is very simple. In the PTAS special care
has to be taken for edges that cross boundaries of squares. Due to complexity,
we do not explicitly specify the endpoints of such edges. Instead, we will argue
that it suffices to describe them as "edges" between cells. This has the effect
that our PTAS can only reconstruct that the endpoints are somewhere within
the cells. Thus, it may choose any point inside the cell. However, the length of
the edge varies by at most twice the side lengths (L1-metric) of the cells. The
definition of (s, t)-maps comes directly from the following simple observation.
We can estimate the endpoints of long edges more roughly than those of short
edges, since we have only to assure that the absolute error is at most 1/s of the
edge’s optimal length. Edges that reach deep into S are long and thus we can
make the cells inside S larger than those which are close to the boundary. The
definition of an (s, t)-map follows exactly this idea. We will assure that the side
length of a cell C is at most
$$\max\left\{ \frac{1}{st}\cdot(\text{side length of } S),\ \frac{1}{s}\cdot(\text{distance between } C \text{ and the boundary of } S) \right\}$$
and that the map has $O(s^2 t)$ many cells. In essence, an (s, t)-map can be
regarded as an (st × st)-grid with cell sizes growing towards the interior.
Now we define (s, t)-maps more formally. For arbitrary s ∈ N, we define a
function β : N0 → N0 as
$$\beta(j) = \begin{cases} \lfloor j/(4s) \rfloor, & j < 8s \\ 2, & 8s \le j \le 9s - 1 \\ \lfloor (j - s)/(4s) \rfloor, & j \ge 9s. \end{cases}$$
Let $\ell_S$ be the side length of S and let $d = d(s) = 8s \cdot 2^{\tau}$. Here τ is the smallest
integer such that $2^{\tau} \ge t$, where t is the depth of the shifted quadtree. Note that
this implies $\tau < 1 + \log t$. Therefore, we have $8st \le d < 16st$.
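As a small illustration of these definitions, the following sketch evaluates β and the resolution d for given s and t. The function names are illustrative; the floor brackets are implemented with integer division.

```python
def beta(j, s):
    """Exponent beta(j): ring j of the map has width 2**beta(j) * l_S / d."""
    if j < 8 * s:
        return j // (4 * s)
    if j <= 9 * s - 1:
        return 2
    return (j - s) // (4 * s)

def grid_resolution(s, t):
    """Return (tau, d) with tau minimal such that 2**tau >= t and d = 8*s*2**tau."""
    tau = 0
    while 2 ** tau < t:
        tau += 1
    return tau, 8 * s * 2 ** tau
```

For s = 1 the exponents of rings 0 to 13 are 0,0,0,0,1,1,1,1,2,2,2,2,2,3, so the group with exponent 2 is the only one with five rings.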
Fig. 1. The upper-left part of an (s, t)-map with s = 1 (a). The cell size doubles every
four rings, except at the 12th ring. In (b) the same region is covered by an (s, t)-map
with multiplicity s = 3
We subdivide S in rings as illustrated in Figure 1 for s = 1 and s = 3. The
jth ring has width $2^{\beta(j)}\ell_S/d$ and consists of congruent axis-parallel squares of
the same size. We call this subdivision the (s, t)-map and write M[S]. We have
groups of 4s rings of the same size. The only exception is the third group, which
contains 5s rings. Note that the boundaries of the cells overlap. Since we later
require that the cells partition the rectangle, we assume that the upper and left
boundary always belong to the corresponding neighboring cell if there is any.
Let γ count the number of groups. Since the width of a ring in the ith group
is $2^{i-1}\ell_S/d$, γ must satisfy
$$(5s - 4s)\,2^2\,\frac{\ell_S}{d} + 4s\sum_{i=1}^{\gamma} 2^{i-1}\,\frac{\ell_S}{d} \overset{!}{=} \frac{\ell_S}{2},$$
and thus $\gamma = \log(d/8s) \le 1 + \log t$. Note that γ is integral by definition of d. In the remainder
of this section we state some properties of (s, t)-maps. In some sense, (s, t)-maps
and grids share salient properties. However, the number of cells in a d × d grid
is $d^2 = \Theta(s^2 t^2)$ while an (s, t)-map has $O(s \cdot d)$ cells.
Lemma 3. M[S] contains $O(s^2 t)$ cells.
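The ring structure can be enumerated explicitly, which gives a quick numerical check that the rings end exactly at the centre of S and that the number of cells is linear in s·d. This is a sketch in units of $\ell_S/d$; the function `rings` and its return format are assumptions made for illustration, and the proof of the lemma itself is not reproduced here.

```python
def beta(j, s):
    # Exponent of the cell width in ring j, as defined earlier in this section.
    if j < 8 * s:
        return j // (4 * s)
    return 2 if j <= 9 * s - 1 else (j - s) // (4 * s)

def rings(s, tau):
    """List (offset, width, cells) for every ring of the map, in units of l_S/d,
    where d = 8*s*2**tau and offset is the ring's distance to the boundary of S."""
    d = 8 * s * 2 ** tau
    result, offset, j = [], 0, 0
    while 2 * offset < d:
        width = 2 ** beta(j, s)
        assert (d - 2 * offset) % width == 0   # cells tile the ring exactly
        per_side = (d - 2 * offset) // width   # cells along one outer side
        cells = 4 * (per_side - 1)             # square annulus, one cell thick
        result.append((offset, width, cells))
        offset += width
        j += 1
    assert 2 * offset == d                     # rings end exactly at the centre
    return d, result
```

In the same units, every ring with cell width at least 2 starts at distance at least 2s times its width from the boundary, which is the quantity bounded in the lemma that follows.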

Lemma 4. Let C ∈ M[S] have side length $\ell_C \ge 2\ell_S/d$ and let p ∈ P be a point
which is contained in C. Then the distance of p to the boundary of S in any
$L_q$-metric is at least $2s\ell_C$.
Let S be a square at level l and let M[S] be an (s, t)-map on S. Furthermore,
let S' be a child of S in $QT_{a,b}$ and let C' be a cell of M[S']. A cell C in M[S] is
a parent of C' if $C' \cap C \ne \emptyset$, i.e. if the cells overlap. Analogously, we write child[C]
for the child set of C.
Lemma 5. Every cell has a unique parent.
3.2 Standardized Solutions
In this section, we will consider trees that enjoy certain structural properties.
Roughly speaking, our PTAS optimizes over all such trees. The main result of this
section is Theorem 2 which states that there exists an almost optimal tree having
this structure. Let a and b be fixed and let QTa,b denote the corresponding shifted
quadtree with (s, t)-maps on all its squares. Furthermore, let ST be an arbitrary
Steiner tree. An internal component of S ∈ QTa,b is a connected component in
the induced Steiner tree ST [X ∩S], where X is the set of points spanned by ST .
Let k(S) count the number of internal components of S. An edge {u, v} ∈ ST
is an external edge of S, if exactly one endpoint of {u, v} is contained in S. Let
E (S, ST ) denote the set of endpoints (within S) of external edges. An edge
{u, v} ∈ ST has level l or appears at level l, if l is the least level, where {u, v} is
external. The level of {u, v} is denoted by lev({u, v}). By removing all external
edges, we obtain a set of connected components within S. These are exactly the
internal components of S.
Definition 1. Let r ∈ N, r ≥ 2. A Steiner tree ST is (r, s)-standardized with
respect to a and b if every square S ∈ QTa,b is (r, s)-standardized, i.e., S
satisfies
(S1) k(S) ≤ r.
(S2) For all C ∈ M[S], all points in E (S, ST ) ∩ C belong to the same internal
component of S.
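Conditions (S1) and (S2) can be checked for a single square with a union-find pass over the tree edges. The sketch below is illustrative only: `inside` and `cell_of` are hypothetical helpers standing in for the geometry of S and of the map M[S], which are not implemented here.

```python
def is_standardized(edges, inside, cell_of, r):
    """Check (S1) and (S2) of Definition 1 for a single square S.

    edges   -- list of (u, v) tree edges over hashable points
    inside  -- predicate: inside(p) is True iff point p lies in S
    cell_of -- hypothetical helper mapping a point in S to the id of its cell in M[S]
    r       -- maximum number of internal components allowed by (S1)
    """
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]    # path halving
            p = parent[p]
        return p

    external_endpoints = []
    for u, v in edges:
        if inside(u) and inside(v):
            parent[find(u)] = find(v)        # internal edge: merge components
        elif inside(u) or inside(v):
            p = u if inside(u) else v
            find(p)                          # register p even if isolated in S
            external_endpoints.append(p)

    if len({find(p) for p in parent}) > r:   # (S1): at most r internal components
        return False
    component_of_cell = {}
    for p in external_endpoints:             # (S2): one component per cell
        if component_of_cell.setdefault(cell_of(p), find(p)) != find(p):
            return False
    return True
```

Only points inside S are ever registered in the union-find structure, so the number of distinct roots is k(S).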
The following definition is essential for the proof of Theorem 3. Since that
proof is omitted in this paper, the reader can skip this paragraph.
Let $\mathcal{S}^*$ denote the set of all Steiner trees which have the same point set as $ST^*$,
i.e., $ST \in \mathcal{S}^*$ if and only if $V(ST) = V(ST^*)$ and $S(ST) = S(ST^*)$. For fixed
a and b, one can define a transitive relation $\preceq$ on $\mathcal{S}^*$. For $ST, ST' \in \mathcal{S}^*$ we have
$ST \preceq ST'$ if and only if there exists a bijection $\gamma : E(ST) \to E(ST')$ such that
$lev(e) \le lev(\gamma(e))$ for all $e \in E(ST)$.
Theorem 2. For all a, b ∈ {0, 2, 4, . . ., L − 2} there exists an (r, s)-standardized
Steiner tree $ST^*_{a,b} \in \mathcal{S}^*$ with $ST^* \preceq ST^*_{a,b}$ such that if a and b are chosen u.a.r.
then
$$E\left[val(ST^*_{a,b}) - val(ST^*)\right] \le O\left(1/s + 1/\sqrt{r}\right) val(ST^*).$$

As mentioned earlier, our PTAS tries to find the optimal standardized tree.
Unfortunately it is not possible to compute the optimal one, thus we will compute
a (good) approximation. We later bound the difference between the cost of our
solution and $val(ST^*_{a,b})$. There we need the relation $\preceq$ to compare this difference
with the optimum (cf. Theorem 3).
4 The Approximation Scheme
Although (r, s)-standardized trees have nice properties, we do not see an
approach to compute the optimal one in polynomial time. Instead, our PTAS
computes an almost optimal (r, s)-standardized tree. For fixed a and b, consider the
optimal standardized Steiner tree $ST^*_{a,b}$ and two squares $S_u$ and $S_v$ at level l.
Assume that the two squares have a common parent S. Let e = {u, v} be an
edge that is contained within S but has one endpoint in $S_u$ and the other in $S_v$,
i.e., e appears at level l. There are cells $C_u$ and $C_v$ containing the endpoints of e
in $S_u$ and $S_v$, respectively. The tree $ST^*_{a,b}$ has the following structural property.
All edges which appear at level l and have an endpoint in $C_u$ connect to the same
internal component of $S_u$. The points in $C_u$ which belong to this component are
equivalent from a structural point of view, since instead of connecting v to u
we can connect v to any other of those points without adding a cycle or losing
connectivity. Although the length of such an edge may be greater than $\ell_q(e)$,
the absolute error is at most twice the side length of $C_u$. Of course, the same is
true for $C_v$. Therefore, the distance between the centers of $C_u$ and $C_v$ is a good
estimate of the length of e.
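One way to see error bounds of this kind is via the triangle inequality. The following is a sketch under the assumption that the cells are axis-parallel squares and that q ≥ 1, so that $d_q \le d_1$; the symbols u', $c_u$ and $c_v$ (an arbitrary point of $C_u$ and the two cell centers) are introduced here for illustration, and the paper's own proofs are omitted.

```latex
% Assumption: u, u' lie in the same square cell C_u of side \ell_{C_u};
% c_u and c_v denote the centers of C_u and C_v, and v lies in C_v.
\begin{align*}
  d_1(u, u') &\le 2\,\ell_{C_u}, \qquad d_1(u, c_u) \le \ell_{C_u},\\
  \bigl| d_q(u', v) - d_q(u, v) \bigr| &\le d_q(u, u') \le d_1(u, u') \le 2\,\ell_{C_u},\\
  \bigl| d_q(c_u, c_v) - d_q(u, v) \bigr| &\le d_q(u, c_u) + d_q(v, c_v) \le \ell_{C_u} + \ell_{C_v}.
\end{align*}
```

The last line says that replacing an edge by the segment between the two cell centers changes its length by at most the sum of the two side lengths.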
If we are willing to accept this error, it is obvious that the following infor-
mation is sufficient to describe a standardized Steiner tree within S. Firstly, we
have to specify in which cells we have endpoints of external edges. Intuitively
speaking, such cells, later called portals, represent the component to which all
external edges should be connected. Secondly, it is necessary to encode which
cells represent the same internal component of S. This motivates the following
strategy for our PTAS. In a first phase we compute the "optimal" structure of
our solution. This structure is described by edges between centers of cells and
we minimize the total length of the cell-to-cell edges we use. In a second phase,
those cell-to-cell edges are replaced by "real" edges.
4.1 Square Configuration
We describe the encoding in more detail. Let S be a square in QTa,b. The
configuration C(S) of S is with respect to the irregular map M[S] and is given
by the following parameters.
1. For every cell C we store a bit ρ(C) which indicates whether C is a portal
cell.
2. For every cell C we store a bit α(C) which indicates whether C is an anchor
cell. We require, that every anchor is also a portal, i.e., α(C) ⇒ ρ(C).
3. A partition Z of the portals of S into at most r sets.

The anchor bit is not required to encode the interior of a square but necessary
to guarantee that our PTAS returns a connected tree. This has the following
reason. Intuitively speaking, ρ(C) indicates an obligation to connect to C on
some larger square. In this context the anchor bit indicates that this obligation is
handed over to the next larger square. This means that we have to connect at the
current level to every portal which is not an anchor. Since d = O(st) = O(log n),
we obtain
Lemma 6. The number of configurations for S is at most $2^{O(s^2 t \log r)}$.
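A counting argument of the following kind yields a bound of this form. It is a sketch, not the paper's proof: it bounds each of the three parameters of a configuration separately and writes m for the number of cells of M[S].

```latex
% m := |M[S]| = O(s^2 t) is the number of cells (Lemma 3).
\begin{align*}
  \#\{(\rho, \alpha)\text{-assignments}\} &\le 3^{m}
     && \text{(per cell: } (0,0),\ (1,0),\ (1,1)\text{, since } \alpha(C) \Rightarrow \rho(C)\text{)}\\
  \#\{\text{partitions of the portals into at most } r \text{ sets}\} &\le r^{m}
     && \text{(label every portal with one of } r \text{ classes)}\\
  \#\{\text{configurations of } S\} &\le (3r)^{m} = 2^{m \log(3r)} = 2^{O(s^2 t \log r)}
     && \text{(using } r \ge 2\text{)}
\end{align*}
```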
The configuration where no bit is set is called the empty configuration. We
write $C(S) = \varnothing$. The configuration of $QT_{a,b}$, denoted by $C(QT_{a,b})$, is a set which
contains tuples $(S, C(S))$ representing the configurations of all the squares in
$QT_{a,b}$ including $S_0$.
4.2 The Algorithm: First Phase
For every square S and every configuration C(S), we compute a value T[S, C(S)]
which can be seen as an almost sharp upper bound to the cost of the optimal
subtree within S which has the structural properties specified by C(S). Then,
$T[S_0, \varnothing]$ is an upper bound to $val(ST^*_{a,b})$. Since $val(ST^*_{a,b})$ is a good upper
bound to $val(ST^*)$, it suffices to quantify the gap between $val(ST^*_{a,b})$ and
$T[S_0, \varnothing]$. This will be done in Section 4.3.
If S is a leaf of QTa,b then we have exactly one cell C0 which contains a point
with odd integral coordinates. This point need not be contained in P. First we
check whether ρ(C) = 0 for all C ∈ M[S] − {C0}. If this is not the case, we store
T [S, C(S)] = ∞. Otherwise, we have to determine the penalty we have to pay
in S. We have three cases. If ρ(C0) = 1 and P ∩ C0 = ∅ then we place a Steiner
point into C0 and set T [S, C(S)] = cS. If ρ(C0) = 0 and P ∩ C0 = {u} then we
store T [S, C(S)] = π(u). In all other cases, T [S, C(S)] = 0.
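The base case of the table can be written down directly. In this minimal sketch, the representation of a configuration as a set of portal cell ids and the parameter names are assumptions made for illustration.

```python
import math

def leaf_value(portal_cells, c0, point_in_c0, penalty, c_s):
    """T[S, C(S)] for a leaf square S.

    portal_cells -- set of cell ids C with rho(C) = 1 in the configuration
    c0           -- id of the unique cell containing an odd lattice point
    point_in_c0  -- the input point u if P ∩ C0 = {u}, or None if P ∩ C0 is empty
    penalty      -- dict mapping input points to pi(u)
    c_s          -- cost of one Steiner point
    """
    if portal_cells - {c0}:
        return math.inf                 # a portal in a cell that cannot hold a point
    if c0 in portal_cells and point_in_c0 is None:
        return c_s                      # place a Steiner point into C0
    if c0 not in portal_cells and point_in_c0 is not None:
        return penalty[point_in_c0]     # the input point stays unconnected
    return 0.0
```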
If S is an internal node of $QT_{a,b}$ we proceed as follows. Assume that S has
level l and let $S_1, \dots, S_4$ denote its children. We enumerate all combinations of
configurations for $S_1, \dots, S_4$. Let $C(S_1), \dots, C(S_4)$ be such configurations. For
every square $S_i$, $C(S_i)$ defines a partition $Z(S_i)$ of the portals in $S_i$ into at
most r classes. We now consider a (not necessarily unique) forest $\hat{Z}(S_i)$ with
the portals of $S_i$ as vertices. $\hat{Z}(S_i)$ must have the property that two portals $C_1$
and $C_2$ belong to the same connected component of $\hat{Z}(S_i)$ if and only if they are
in the same class of the partition $Z(S_i)$. For the sake of brevity, we use
$Z' = \bigcup_{i=1}^{4} \hat{Z}(S_i)$. Let C be a portal and let $C' \in child[C]$. If $\alpha(C') = 1$ then we
say that $C'$ is an anchor hook of C. Recall that $Z'$ is a graph on the portals of the
children of S. A choice of configurations $C(S_1), \dots, C(S_4)$ is admissible if each of
the following conditions holds.
(C1) Every portal in S has at least one anchor hook.
(C2) The parent in S of every anchor in some $S_i$ is a portal, unless $S = S_0$.
(C3) All anchors of a portal in S belong to the same connected component
of $Z'$.
Furthermore, there must exist a graph D on the portals of $\bigcup_i S_i$ which has the
following properties.
(C4) Every portal in $S_i$, 1 ≤ i ≤ 4, which is not an anchor is incident to some
edge of D.
(C5) Every connected component of $D \cup Z'$ contains at least one anchor, unless
$S = S_0$.
(C6) Two portals $C_1$ and $C_2$ of S belong to the same partition class of Z(S)
if and only if their anchors belong to the same connected component of
$D \cup Z'$.
(C7) D uses only edges which cross the dissection lines that divide S into its
children.
The relaxations of (C2) and (C5) are necessary to compute $T[S_0, \varnothing]$. Since
the configuration $\varnothing$ does not contain any portal at all, there would otherwise be
no admissible configuration of the children.
The length of an edge in D is the distance of the centers of the portals
which it connects in the $L_q$-metric. If there is no combination which is
admissible, we store ∞ in T[S, C(S)]. Otherwise, we choose admissible configurations
$C(S_1), \dots, C(S_4)$ and a graph D such that
$$\sum_{i=1}^{4} T[S_i, C(S_i)] + \ell_q(D) + \sum_{\{c_1, c_2\} \in D} \left[\ell_{C(c_1)} + \ell_{C(c_2)}\right] \qquad (2)$$
is minimal. Here $\ell_{C(c_1)}$ and $\ell_{C(c_2)}$ denote the side lengths of the portals which
have centers $c_1$ and $c_2$, respectively. In essence, the third part (the sum over all
edges in D) of the formula above is necessary to bound the deviation in length
from the optimal standardized solution. In the sequel, this will be explained
in more detail. For each choice of configurations we have to find the optimal
graph D. This can be achieved using some generalization of Kruskal’s algorithm.
Details are omitted due to space constraints. Then it is clear that we can find
the configurations $C^*(S_1), \dots, C^*(S_4)$ that minimize (2) by exhaustive search.
Let $D^*_S$ denote the optimal graph D corresponding to $C^*(\cdot)$.
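The paper omits its generalization of Kruskal's algorithm, so the following is only a sketch of plain Kruskal using the per-edge cost from (2): center distance plus the side lengths of both portals. It merely joins pre-existing components and does not enforce conditions (C4) to (C7); the function name and data layout are assumptions made for illustration.

```python
import itertools

def cheapest_connecting_edges(portals, component, q=2.0):
    """Plain Kruskal over portal centers with the per-edge cost from (2).

    portals   -- dict: portal id -> (center, side_length)
    component -- dict: portal id -> id of the component it already belongs to
                 (for example its component in the forests of the children)
    Returns (edges, cost) for a cheapest edge set joining all components.
    """
    def cost(p1, p2):
        (c1, l1), (c2, l2) = portals[p1], portals[p2]
        dist = sum(abs(x - y) ** q for x, y in zip(c1, c2)) ** (1.0 / q)
        return dist + l1 + l2

    parent = {c: c for c in component.values()}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]    # path halving
            c = parent[c]
        return c

    candidates = sorted(itertools.combinations(portals, 2),
                        key=lambda e: cost(*e))
    chosen, total = [], 0.0
    for p1, p2 in candidates:
        r1, r2 = find(component[p1]), find(component[p2])
        if r1 != r2:                         # edge joins two different components
            parent[r1] = r2
            chosen.append((p1, p2))
            total += cost(p1, p2)
    return chosen, total
```

Because the side lengths enter the edge cost, a slightly longer edge between two small portals can be preferred over a shorter edge that touches a large interior cell.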
We compute $T[S_0, \varnothing]$ by using a dynamic programming approach. If we
proceed bottom-up, it suffices to read the $T[S_i, C(S_i)]$ from the lookup table.
We obtain $T[S_0, \varnothing]$ which corresponds to some configuration $C(QT_{a,b})$ that we
have computed implicitly. One can easily check that the running time of this
dynamic program is $n^{O(s^2 \log r)}$, using that a map has $O(s^2 t) = O(s^2 \log n)$ cells.
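The running time can be traced back to the size of the lookup table as follows. This is a sketch that assumes the optimal graph D for a fixed combination of child configurations is found in time polynomial in the number of portals.

```latex
\begin{align*}
  t &= \log L = O(\log n) && \text{(since } L = O(n^2)\text{)}\\
  \#\{\text{configurations per square}\} &\le 2^{O(s^2 t \log r)} = n^{O(s^2 \log r)} && \text{(Lemma 6)}\\
  \#\{\text{squares in } QT_{a,b}\} &\le \sum_{l \ge 0} 4^{l} = O(L^2) = O(n^4)
     && \text{(levels down to the } 2 \times 2 \text{ squares)}\\
  \#\{\text{child combinations per table entry}\} &\le \bigl(n^{O(s^2 \log r)}\bigr)^{4} = n^{O(s^2 \log r)}
\end{align*}
```

Multiplying the number of table entries by the work per entry again gives $n^{O(s^2 \log r)}$.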
4.3 The Algorithm: Second Phase
In the first phase we obtain a configuration for QTa,b corresponding to T [S0, (].
This configuration contains a collection of graphs D(P) := {D∗
S : S ∈ QTa,b}.
In some sense, this collection specifies which of the components are connected in
which square. Even more, the edges to be used are roughly described by edges
between centers of cells. During the second phase of our approximation scheme,

we construct a Steiner tree ST from D(P). This time, we proceed in top-down
fashion. Let ST = ∅. For some square S at level l, we consider the graph D∗
S.
Let c be some edge in D∗
S which connects (the centers) of two cells C1 and C2.
By recursively following the anchors of both C1 and C2 we determine points u1
and u2 which serve as endpoints. In other words, starting at portal C1 we choose
one of its anchors, then we choose an anchor hook of this anchor, and so on.
Thus, we go to lower and lower levels in QTa,b until we reach a leaf. In a leaf,
we find by construction a point in P or Steiner point which we choose as u1.
We proceed similarly for C2, add the edge {u1, u2} to ST and mark c as done.
As u1 and u2 are within C1 and C2, (2) implies
Lemma 7. q (ST ) ≤ T [S0, (].
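The endpoint selection of the second phase can be sketched as follows. This is only an illustration under assumptions: the dictionaries `anchors` (cell to list of anchors) and `leaf_point` (leaf to the point of P or Steiner point it contains) are hypothetical representations that do not come from the paper, and always taking the first anchor is an arbitrary choice.

```python
def follow_anchors(cell, anchors, leaf_point):
    """Descend from `cell` through anchors until a leaf is reached.

    `anchors` and `leaf_point` are hypothetical containers used only
    for this sketch; the paper does not prescribe a data structure.
    """
    current = cell
    while current not in leaf_point:
        # Any anchor may be chosen; we take the first one.
        current = anchors[current][0]
    return leaf_point[current]


# Toy instance: cell C1 reaches a leaf after two steps, C2 after one.
anchors = {"C1": ["C1a", "C1b"], "C1a": ["leaf1"], "C2": ["leaf2"]}
leaf_point = {"leaf1": (0.0, 1.0), "leaf2": (3.0, 1.0)}

u1 = follow_anchors("C1", anchors, leaf_point)
u2 = follow_anchors("C2", anchors, leaf_point)
steiner_tree_edges = {(u1, u2)}  # the edge {u1, u2} added to ST
```

The loop terminates as long as every chain of anchors ends in a leaf, which the text guarantees by construction.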
In addition, we have to prove that ST is a Steiner tree on V (ST ) ∪ S(ST ),
i.e., ST is cycle free and connected and spans all points in V (ST )∪S(ST ). This
can be checked using (C1) - (C7). Details are omitted due to space constraints.
Now, one can show for a, b chosen uniformly at random that
$E\bigl[\,T[S, (] - \mathrm{val}(ST^*_{a,b})\,\bigr] = O\bigl(1/s + 1/\sqrt{r}\bigr)\,\mathrm{val}(ST^*)$.
We can even show that there are a, b such that both T [S, (] and $\mathrm{val}(ST^*_{a,b})$
are good upper bounds. Then one can prove the following statement using
Theorem 2 and Lemma 7.
Theorem 3. There are a, b ∈ {0, 2, 4, . . . , L − 2} such that
$q(ST) \le O\!\left(1 + \frac{1}{s} + \frac{1}{\sqrt{r}}\right) \mathrm{val}(ST^*)$.
Choosing s = O(1/ε) and $r = O(1/\varepsilon^2)$ proves Theorem 1.
5 Generalizations
A natural question is whether there is also a PTAS for $P \subset \mathbb{R}^d$, for fixed d ≥
3. In this case the number of cells in an (s, t)-map is $O(t^{d-1})$ and thus the
running time of our algorithm is no longer polynomial. However, we can construct
a quasi-polynomial time approximation scheme. Instead of using (s, t)-maps it
is more convenient to place a d-dimensional grid of granularity O(Rε/t) on a
square of side length S. The remaining parts of our proofs have straightforward
equivalents.
Our approach also extends to other tree problems. For example, assume that
the input contains an additional set S0 which restricts locations for Steiner
points, i.e., every solution should satisfy S(ST ) ⊆ S0. This problem has a simple
PTAS if Steiner points have zero cost. It is then sufficient to add S0 to P, to
set the corresponding penalties to 0 and to choose cS = ∞. If we assign (potentially
different) costs to the locations in S0, the problem still admits a PTAS which
can be obtained by slightly modifying our algorithm. This is surprising insofar
as the corresponding network problem is not likely to be in APX [11].
A third example is the Vehicle Routing-Allocation Problem (VRAP) [5] for
which our method also yields a PTAS. In VRAP not all customers need to be
visited by the vehicle on its salesman tour. However, customers not visited either
have to be allocated to some customer on one of the vehicle tours or left isolated.
As in the NWGST there are penalties for allocated or isolated customers.
References
1. S. Arora. Polynomial time approximation schemes for euclidean traveling salesman
and other geometric problems. Journal of the ACM, 45(5):753–782, 1998.
2. S. Arora. Approximation schemes for NP-hard geometric optimization problems:
A survey. Mathematical Programming, Series B, 97:43–69, 2003.
3. S. Arora and K. Chang. Approximation schemes for degree-restricted MST and
red-blue separation problems. Algorithmica, 40(3):189–210, 2004.
4. S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-
medians and related problems. In Proceedings of the 30th Annual ACM Symposium
on Theory of Computing, pages 106–113, 1998.
5. J. E. Beasley and E. Nascimento. The vehicle routing allocation problem: a unifying
framework. TOP, 4:65–86, 1996.
6. M. Bern and P. Plassmann. The Steiner tree problem with edge length 1 and 2.
Information Processing Letters, 32:171–176, 1989.
7. R. Courant and H. Robbins. What is Mathematics? Oxford University Press, 1941.
8. M. Garey, R. Graham, and D. Johnson. The complexity of computing Steiner
minimal trees. SIAM Journal on Applied Mathematics, 32:835–859, 1977.
9. M. X. Goemans and D. P. Williamson. A general approximation technique for
constrained forest problems. SIAM Journal on Computing, 24(2):296–317, 1995.
10. F. K. Hwang, D. Richards, and P. Winter. The Steiner Tree Problem, volume 53
of Annals of Discrete Mathematics. North-Holland, 1995.
11. P. Klein and R. Ravi. A nearly best-possible approximation algorithm for node-
weighted Steiner trees. Journal of Algorithms, 19:104–115, 1995.
12. J. S. B. Mitchell. Guillotine subdivisions approximate polygonal subdivisions:
A simple polynomial-time approximation scheme for geometric TSP, k-MST and
related problems. SIAM Journal on Computing, 28(4):1298–1309, 1999.
13. A. Moss and Y. Rabani. Approximation algorithms for constrained node weighted
Steiner tree problems. In Proceedings of the 33rd Annual ACM Symposium on
Theory of Computing, pages 373–382, 2001.
14. H. J. Prömel and A. Steger. The Steiner Tree Problem — A Tour through Graphs,
Algorithms, and Complexity. Advanced Lectures in Mathematics. Vieweg Verlag,
2002.
15. G. Robins and A. Zelikovsky. Improved Steiner tree approximation in graphs.
In Proceedings of the 11th annual ACM-SIAM symposium on Discrete algorithms,
pages 770–779, 2000.
16. K. Talwar. Bypassing the embedding: algorithms for low dimensional metrics. In
Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 2004.

Towards Optimal Integrality Gaps
for Hypergraph Vertex Cover
in the Lovász-Schrijver Hierarchy
Iannis Tourlakis
Princeton University, Department of Computer Science, Princeton, NJ 08544
itourlak@cs.princeton.edu
Abstract. “Lift-and-project” procedures, which tighten linear relax-
ations over many rounds, yield many of the celebrated approximation
algorithms of the past decade or so, even after only a constant number
of rounds (e.g., for max-cut, max-3sat and sparsest-cut). Thus prov-
ing super-constant round lowerbounds on such procedures may provide
evidence about the inapproximability of a problem.
We prove an integrality gap of k − ε for linear relaxations obtained from
the trivial linear relaxation for k-uniform hypergraph vertex cover
by applying even Ω(log log n) rounds of Lovász and Schrijver’s LS lift-and-project
procedure. In contrast, known PCP-based results only rule
out k − 1 − ε approximations. Our gaps are tight since the trivial linear
relaxation gives a k-approximation.
1 Introduction
Several papers [1, 2, 5, 9] have appeared over the past few years studying the
quality of linear programming (LP) and semidefinite programming (SDP) relax-
ations derived by so-called “lift-and-project” procedures. These procedures (for
examples, see Lovász and Schrijver [12] and Sherali and Adams [13]) start with
an initial relaxation and add, in a controlled manner, more and more inequalities
satisfied by all integral solutions in a bid to obtain tighter and hence, hopefully,
better relaxations.
One motivation for studying lift-and-project procedures is the large gap that
remains for some problems (such as vertex cover) between approximation
ratios achieved by known algorithms and ratios ruled out by PCPs. On the one
hand, lift-and-project methods may lead to improved approximation algorithms
for such problems. For example, the celebrated SDP-based algorithms of [11]
for max-cut and of [4] for sparsest cut are efficiently derived by lift-and-
project methods. On the other hand, ruling out good approximation algorithms
based on lift-and-project methods might give evidence about the problem’s true
inapproximability.
In this paper we concentrate on proving integrality gaps for relaxations ob-
tained by the LS lift-and-project technique defined by Lovász and Schrijver [12]
(definitions in Section 2). The problem we study is vertex cover on k-uniform
hypergraphs. Progress on graph-related problems is sometimes made by consider-
ing hypergraph analogues, motivating our study of hypergraph vertex cover.
As with graph vertex cover, there is a gap between the factors achieved
by known approximation algorithms and those ruled out by PCPs: While PCP-based
hardness results for k-uniform hypergraph vertex cover rule out k − 1 − ε
polynomial-time approximations [7], only k − o(1) approximation algorithms are
known. Our main result is that for all ε > 0 and all sufficiently large n there
exist k-uniform hypergraphs on n vertices for which Ω(log log n) rounds of LS
are necessary to obtain a relaxation for vertex cover with integrality gap less
than k − ε.
1.1 Related Work
Alekhnovich et al. [1] also considered integrality gaps of LS tightenings for
k-uniform hypergraph vertex cover, proving that a gap of k − 1 − ε remains
even after a linear number of LS+ tightenings (LS+, defined in Section 2, is a
stronger version of LS where positive semidefinite constraints are also added).
However, the work most closely related to ours is the paper of Arora et al. [2].
They show that an integrality gap of 2 − ε remains for graph vertex cover
even after $\Omega(\sqrt{\log n})$ rounds of LS. However, obtaining an analogous integrality
gap of k − ε for k-uniform hypergraph vertex cover for a non-constant number
of rounds remained open since their analysis was especially tailored for graphs.
The main technical contribution of this paper is adapting the proof framework
of [2] to reason about hypergraphs. We discuss this further after some necessary
technical definitions in Section 2 below.
Other recent papers studying Lovász-Schrijver tightenings are [1, 5, 6, 9,
10, 14]. Stephen and Tunçel [14] show that $\Omega(\sqrt{n})$ rounds of even the stronger
LS+ procedure are required to derive some simple inequalities for the matching
polytope. Cook and Dash [6] and Goemans and Tunçel [10] both show how for
some relaxations n rounds of LS+ are required to derive some simple inequalities.
Feige and Krauthgamer [9] show that a large gap remains for independent set
after Ω(log n) rounds of LS+, while Buresh-Oppenheim et al. [5] show that the
gap for max-ksat, k ≥ 5, remains $(2^k - 1)/2^k - \varepsilon$ even after Ω(n) rounds of LS+.
In addition to their results for hypergraph vertex cover mentioned above,
Alekhnovich et al. [1] extend the results of [5] to max-3sat as well as proving
an integrality gap of (1 − ε) ln n after Ω(n) rounds of LS+ for set cover.
2 Lovász-Schrijver Liftings and Our Methodology
We will use the vertex cover problem to explain relaxations and how to
tighten them using LS liftings. The integer program (IP) characterization for a
k-uniform hypergraph G = (V, E) is: minimize $\sum_{i \in V} v_i$ such that $v_{j_1} + \dots + v_{j_k} \ge 1$
for all hyperedges {j1, . . . , jk}, where vi ∈ {0, 1}. The integer hull I is the convex
hull of all solutions to this program. The standard LP relaxation is to allow
0 ≤ vi ≤ 1. Clearly the value of the LP is at most that of the IP. Let P be the

convex hull of all solutions to the LP in $[0, 1]^n$. A linear relaxation is tightened
by adding constraints that also hold for the integral hull; in general this gives
some polytope P′ such that I ⊆ P′ ⊆ P. The quality of a linear relaxation is
measured by the ratio (optimum value over I) / (optimum value over P), called its integrality gap.
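As a small illustration of the integrality gap, the following sketch brute-forces the integral optimum on the complete 3-uniform hypergraph on 6 vertices and compares it with the value of the feasible fractional point that is 1/k everywhere. The instance and the helper names are assumptions chosen for this example, not taken from the paper; since the fractional point is only feasible (not necessarily optimal) for the LP, the computed ratio is a lower bound on the gap.

```python
from fractions import Fraction
from itertools import combinations


def min_vertex_cover_size(n, edges):
    """Brute force: size of a smallest vertex set hitting every hyperedge."""
    for size in range(n + 1):
        for cover in combinations(range(n), size):
            chosen = set(cover)
            if all(chosen & set(e) for e in edges):
                return size
    return n


def lp_feasible(x, edges):
    """Check 0 <= x_i <= 1 and the edge constraints of the standard LP."""
    in_box = all(0 <= xi <= 1 for xi in x)
    covers = all(sum(x[j] for j in e) >= 1 for e in edges)
    return in_box and covers


n, k = 6, 3
edges = list(combinations(range(n), k))  # complete 3-uniform hypergraph
ip_opt = min_vertex_cover_size(n, edges)

x = [Fraction(1, k)] * n                 # the all-1/k fractional point
assert lp_feasible(x, edges)

# LP optimum <= sum(x), so this ratio bounds the integrality gap from below.
gap_lower_bound = Fraction(ip_opt) / sum(x)
```

On this instance every independent set has at most 2 vertices, so any cover needs 4 vertices while the fractional point costs 2.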
Lovász and Schrijver [12] present several “lift-and-project” techniques for deriving
tighter and tighter relaxations of a 0-1 integer program. These procedures
take an n-dimensional relaxed polytope, “lift” it to $n^2$ dimensions, add new constraints
to the high-dimensional polytope, and then project back to the original
space. Three progressively stronger such techniques, LS0, LS and LS+, are given
in [12]; we will focus on LS, but also define LS+ below for completeness.
The notation uses homogenized inequalities. Let G be a k-uniform hypergraph
and let VC(G) be the cone in $\mathbb{R}^{n+1}$ that contains (x0, x1, . . . , xn) iff it
satisfies the edge constraints $x_{i_1} + \dots + x_{i_k} \ge x_0$ for each edge {i1, . . . , ik} ∈ G.
All cones in what follows will be in $\mathbb{R}^{n+1}$ and we will be interested in the slice cut
by the hyperplane x0 = 1. Denote by $N^r(VC(G))$ and $N^r_+(VC(G))$ the feasible
cones of all inequalities obtained from r rounds of the LS and LS+ lifting procedures,
respectively. The following lemma characterizes the effect of one round
of the N and N+ operators. (In the remainder of the paper we index columns
and rows starting from 0 rather than 1, and ei denotes the ith unit vector.)
Lemma 1 ([12]). Let Q be a cone in $\mathbb{R}^{n+1}$. Then $y \in \mathbb{R}^{n+1}$ is in N(Q) iff there
exists a symmetric matrix $Y \in \mathbb{R}^{(n+1) \times (n+1)}$ such that:
1. Y e0 = diag(Y ) = y,
2. Y ei, Y (e0 − ei) ∈ Q, for all i = 1, . . . , n.
Moreover, y ∈ N+(Q) iff Y is in addition positive semidefinite.
To prove that $y \in N^{r+1}(Q)$, we have to construct a specific matrix Y and prove
that the 2n vectors defined in Lemma 1 are in $N^r(Q)$. Such a matrix Y is called
a protection matrix for y since it “protects” y for one round. The points y we
protect will always have y0 = 1 in which case condition 2 above is equivalent to
the following condition (below P is the projection of Q along x0 = 1):
2′. For each i such that xi = 0, Y ei = 0; for each i such that xi = 1, Y ei = y;
otherwise Y ei/xi and Y (e0 − ei)/(1 − xi) are both in P.
The simplest possible Y has Yij = yiyj except along the diagonal where Yii = yi.
Then Y ei/xi and Y (e0−ei)/(1−xi) are identical to y except in the ith coordinate
which are now 1 and 0, respectively. This matrix was used in [5, 10].
However, such a protection matrix does not work for us: To prove gaps of k − ε
on k-uniform hypergraphs we will protect the point $y_\gamma = (1, \frac{1}{k} + \gamma, \dots, \frac{1}{k} + \gamma)$
where γ > 0 is an arbitrarily small constant. For yγ, the vector Y (e0−ei)/(1−xi)
given by the simple protection matrix is not guaranteed to be in VC(G): if
coordinate i is changed to 0 in yγ, we violate all edge constraints involving the
ith vertex. So more sophisticated protection matrices are needed.
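The failure of the simple protection matrix can be checked on a toy instance. The sketch below assumes a single hyperedge {1, 2, 3} with k = 3 and γ = 1/100 (both chosen only for illustration) and builds the matrix with Yij = yiyj off the diagonal and Yii = yi; the helper names are ours.

```python
from fractions import Fraction


def simple_protection_matrix(y):
    """Y_ij = y_i * y_j off the diagonal, Y_ii = y_i on the diagonal."""
    size = len(y)
    Y = [[y[i] * y[j] for j in range(size)] for i in range(size)]
    for i in range(size):
        Y[i][i] = y[i]
    return Y


def column(Y, i):
    return [row[i] for row in Y]


def satisfies_edges(x, edges):
    """Homogenized edge constraints: x_{i1} + ... + x_{ik} >= x_0."""
    return all(sum(x[j] for j in e) >= x[0] for e in edges)


k = 3
gamma = Fraction(1, 100)
edges = [(1, 2, 3)]                      # one hyperedge on vertices 1, 2, 3
y = [Fraction(1)] + [Fraction(1, k) + gamma] * 3

Y = simple_protection_matrix(y)
i = 1
col_0, col_i = column(Y, 0), column(Y, i)

up = [v / y[i] for v in col_i]                               # Y e_i / y_i
down = [(a - b) / (1 - y[i]) for a, b in zip(col_0, col_i)]  # Y (e_0 - e_i) / (1 - y_i)
```

Here `up` equals y with coordinate i raised to 1 and stays feasible, whereas `down` equals y with coordinate i set to 0, so the edge sums to 2/3 + 2γ < 1 and the constraint is violated.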
For the case when k = 2 (i.e., for graphs), Arora et al. [2] used LP duality to
show that appropriate protection matrices exist for at least $\Omega(\sqrt{\log n})$ rounds. In
particular, they used a special combinatorial form of Farkas’s lemma specifically
applicable to the constraints involved in defining protection matrices for graphs.
We will also reduce the problem of showing that our protection matrices exist to
the feasibility of linear programs; however, we will need a much more complicated
form of Farkas’s lemma than in the graph case. As such, we will only be able to
carry out our arguments for O(log log n) rounds.
3 Lowerbounds for Hypergraph Vertex Cover
Theorem 1. For all k ≥ 2, ε > 0 there exist constants n0(k, ε), δ(k, ε) > 0
s.t. for every n ≥ n0(k, ε) there exists a k-uniform hypergraph G on n vertices for
which the integrality gap of $N^r(VC(G))$ is at least k − ε for all r ≤ δ(k, ε) log log n.
Theorem 1 will follow from the following two theorems.
Theorem 2. For all k ≥ 2 and any ε > 0 there exists an n0(k, ε) such that for
every n ≥ n0(k, ε) there exist k-uniform hypergraphs with n vertices and O(n)
hyperedges having Ω(log n) girth but no independent set of size greater than εn.
Theorem 3. Fix γ > 0 and let $y_\gamma = (1, \frac{1}{k} + \gamma, \frac{1}{k} + \gamma, \dots, \frac{1}{k} + \gamma)$. Let G be a
k-uniform hypergraph such that $\mathrm{girth}(G) \ge \frac{20}{\gamma} r 5^r$. Then $y_\gamma \in N^r(VC(G))$.
Theorem 2 is an easy extension to hypergraphs of a result proved by Erdős [8].
The remainder of this section will be devoted to proving Theorem 3.
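The step from Theorems 2 and 3 to Theorem 1 is not spelled out in the text. A sketch of the standard calculation follows, assuming Theorem 2 is applied with a parameter ε′ (our notation) and that δ(k, ε) is small enough for the Ω(log n) girth to meet the requirement of Theorem 3 whenever r ≤ δ(k, ε) log log n:

```latex
\begin{align*}
\min_{x \in I} \sum_{i=1}^{n} x_i &\ge (1 - \varepsilon')\, n
  && \text{(a vertex cover is the complement of an independent set)} \\
\sum_{i=1}^{n} (y_\gamma)_i &= n \left( \tfrac{1}{k} + \gamma \right)
  && \text{(and } y_\gamma \in N^r(VC(G)) \text{ by Theorem 3)} \\
\text{integrality gap of } N^r(VC(G)) &\ge \frac{(1 - \varepsilon')\, n}{n \left( \tfrac{1}{k} + \gamma \right)}
  = \frac{k (1 - \varepsilon')}{1 + k \gamma}
\end{align*}
```

The last expression is at least k − ε once ε′ and γ are chosen small enough in terms of k and ε.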
3.1 Proof of Theorem 3
As in previous LS lowerbounds, the theorem will be proved by induction where
the inductive hypothesis will require some set of vectors to be in $N^m(VC(G))$
for m ≤ r. These vectors will be essentially all-$(\frac{1}{k} + \gamma)$ except possibly for a few
small neighbourhoods where they can take arbitrary nonnegative values so long
as the edge constraints for G are satisfied. The exact characterization is given
by the following definition similar to an analogous definition in [2]:
Definition 1. Let S ⊆ [n], R be a positive integer, and γ > 0. A nonnegative
vector (α0, α1, . . . , αn) with α0 = 1 is an (S, R, γ)-vector if the entries satisfy the
edge constraints and if $\alpha_j = \frac{1}{k} + \gamma$ for each $j \notin \bigcup_{w \in S} \mathrm{Ball}(w, R)$. Here Ball(w, R)
denotes the set of vertices within distance R of w in the graph.
Let $R_r = 0$ and let $R_m = 5R_{m+1} + 1/\gamma$ for 0 ≤ m < r. Note $R_m \le \mathrm{girth}(G)/20$.
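The recursion for the radii and the bound Rm ≤ girth(G)/20 can be checked numerically. The sketch below assumes the girth is exactly at the threshold (20/γ)·r·5^r of Theorem 3 and uses sample values γ = 1/10 and r = 4; the closed form (5^r − 1)/(4γ) for R0 is our own unrolling of the recursion, not a formula stated in the paper.

```python
from fractions import Fraction


def radii(r, gamma):
    """Return [R_0, ..., R_r] with R_r = 0 and R_m = 5 * R_{m+1} + 1/gamma."""
    R = [Fraction(0)] * (r + 1)
    for m in range(r - 1, -1, -1):
        R[m] = 5 * R[m + 1] + 1 / gamma
    return R


gamma = Fraction(1, 10)   # sample value, chosen for illustration
r = 4                     # sample number of rounds

R = radii(r, gamma)
girth_bound = Fraction(20) / gamma * r * 5 ** r     # threshold in Theorem 3
closed_form = (Fraction(5) ** r - 1) / (4 * gamma)  # unrolled recursion for R_0

radii_ok = all(R_m <= girth_bound / 20 for R_m in R)
```

Since R0 is the largest radius and grows like 5^r/(4γ), the factor r·5^r in the girth requirement leaves ample slack.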
To prove Theorem 3 we prove the following inductive claim. The theorem is
a subcase of m = r:
Inductive Claim for $N^m(VC(G))$: For every set S of r − m vertices, every
(S, Rm, γ)-vector is in $N^m(VC(G))$.
The base case m = 0 is trivial since (S, Rm, γ)-vectors satisfy the edge con-
straints for G. In the remainder of this section we prove the Inductive Claim for
m + 1 assuming truth for m.

To that end, let α be an (S, Rm+1, γ)-vector where |S| = r − m − 1. By
induction, every (S ∪ {i}, Rm, γ)-vector is in $N^m(VC(G))$. So to prove that
$\alpha \in N^{m+1}(VC(G))$ it suffices by Lemma 1 to exhibit an (n + 1) × (n + 1) symmetric
matrix Y such that:
A. Y e0 = diag(Y ) = α,
B. For each i such that αi = 0, Y ei = 0; for each i such that αi = 1, Y e0 = Y ei;
otherwise, Y ei/αi and Y (e0 − ei)/(1 − αi) are (S ∪ {i}, Rm, γ)-vectors.
As was done in [2] and [12], we will write these conditions as a linear program
and show that the program is feasible, proving the existence of Y . Our notation
will assume symmetry, namely Yij represents Y{i,j}.
Condition A requires:
Ykk = αk ∀k ∈ {1, . . . , n} . (1)
Condition B requires first of all that Y ej/αj and Y (e0 − ej)/(1 − αj) satisfy
the edge constraints: For all i ∈ {1, . . . , n} and all {j1, . . . , jk} ∈ E:
$\alpha_i \le Y_{ij_1} + \dots + Y_{ij_k} \le \alpha_i + (\alpha_{j_1} + \dots + \alpha_{j_k} - 1)$. (2)
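Vertices i, t are called a distant pair if $t \notin \bigcup_{w \in S \cup \{i\}} \mathrm{Ball}(w, 5R_{m+1} + 1/\gamma)$.
(Note then that $\alpha_t = \frac{1}{k} + \gamma$.) Condition B requires that for such a pair, the tth
coordinates of Y ei/αi and Y (e0 − ei)/(1 − αi) are $\frac{1}{k} + \gamma$. In particular,
$Y_{it} = \alpha_i \alpha_t = \alpha_i \left( \frac{1}{k} + \gamma \right)$. (3)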
Note that distant pairs have the property that every path in G that connects
them contains at least 4Rm + 1/γ − 1 hyperedges each of which α oversatisfies
by kγ, and at most Rm hyperedges which α does not oversatisfy by kγ.
Finally, condition B requires that Y ei/αi, Y (e0 − ei)/(1 − αi) are in $[0, 1]^{n+1}$:
$0 \le Y_{ij} \le \alpha_i$, ∀i, j ∈ {1, . . . , n}, i ≠ j (4)
$-Y_{ij} \le 1 - \alpha_i - \alpha_j$, ∀i, j ∈ {1, . . . , n}, i ≠ j (5)
The above constraints are equivalent to the following four constraint families:
$Y_{ij} \le \beta(i, j)$, ∀i, j ∈ {1, . . . , n} (6)
$-Y_{ij} \le \delta(i, j)$, ∀i, j ∈ {1, . . . , n} (7)
$Y_{ij_1} + \dots + Y_{ij_k} \le a(i, j_1, \dots, j_k)$, ∀ {j1, . . . , jk} ∈ E (8)
$-Y_{ij_1} - \dots - Y_{ij_k} \le b(i, j_1, \dots, j_k)$, ∀ {j1, . . . , jk} ∈ E (9)
Here (1) β(i, j) = αiαj if i, j is a distant pair and β(i, j) = min(αi, αj) otherwise;
(2) δ(i, j) = −αi if i = j; δ(i, j) = −αiαj if i, j is a distant pair; and δ(i, j) =
1 − αi − αj otherwise; (3) $a(i, j_1, \dots, j_k) = \alpha_i + (\alpha_{j_1} + \dots + \alpha_{j_k} - 1)$; and (4)
$b(i, j_1, \dots, j_k) = -\alpha_i$. Note that β(i, j) + δ(i, j) ≥ 0 always, since $\alpha \in [0, 1]^{n+1}$.
To prove the consistency of constraints (6)–(9), we will use a special combi-
natorial version of Farkas’s Lemma similar in spirit to that used in [12] and [2].
Before giving the exact form, we require some definitions.

Definition 2. A tiling (W, P, N) for G is a connected k-uniform hypergraph
H = (W, P, N) on vertices W and two disjoint k-edge sets P and N such that:
1. Each vertex in W is labelled by Yij, i, j ∈ {1, . . . , n}. Note that distinct
vertices need not have different labels.
2. Each vertex belongs to at most one edge in P and at most one edge in N (in
particular, all edges in P are mutually disjoint, as are all edges in N).
3. All edges in P ∪ N are of the form $\{Y_{ij_1}, \dots, Y_{ij_k}\}$ where {j1, . . . , jk} ∈ E.
The edges of a tiling are called tiles. Vertices in W not incident to any tile in
P are called unmatched negative vertices; vertices in W not incident to any tile
in N are called unmatched positive vertices. Let UN and UP denote the sets
of unmatched negative and unmatched positive vertices, respectively. A vertex
labelled Yii is called diagonal. Denote the set of unmatched positive diagonal
vertices in W by UPD. A vertex labelled Yij is called a distant pair if {i, j} are
a distant pair. Given a tile $\{Y_{ij_1}, \dots, Y_{ij_k}\}$ in P or N, call {j1, . . . , jk} ∈ E
its bracing edge and i its bracing node. An edge {j1, . . . , jk} in G is called
overloaded if $\alpha_{j_1} = \dots = \alpha_{j_k} = \frac{1}{k} + \gamma$. A tile $\{Y_{ij_1}, \dots, Y_{ij_k}\}$ in H is overloaded
if its bracing edge is overloaded.
Given a tiling H = (W, P, N) for G, define the following sums:
$S_1^H = \sum_{\{Y_{ij_1}, \dots, Y_{ij_k}\} \in P} a(i, j_1, \dots, j_k) + \sum_{\{Y_{ij_1}, \dots, Y_{ij_k}\} \in N} b(i, j_1, \dots, j_k)$, (10)
$S_2^H = \sum_{Y_{ij} \in \mathrm{UP}} \delta(i, j) + \sum_{Y_{ij} \in \mathrm{UN}} \beta(i, j)$. (11)
Finally, let $S^H = S_1^H + S_2^H$.
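To make the bookkeeping concrete, the following sketch evaluates the sum for two tiny tilings with k = 2. It rests on simplifying assumptions that are ours: tiles are given as (bracing node, bracing edge) pairs, every label Yij occurs on at most one vertex of W (so vertices can be identified with unordered index pairs), and distant pairs are passed in explicitly as a set rather than derived from balls in G.

```python
from fractions import Fraction


def beta(alpha, i, j, distant):
    if frozenset((i, j)) in distant:
        return alpha[i] * alpha[j]
    return min(alpha[i], alpha[j])


def delta(alpha, i, j, distant):
    if i == j:
        return -alpha[i]
    if frozenset((i, j)) in distant:
        return -alpha[i] * alpha[j]
    return 1 - alpha[i] - alpha[j]


def a(alpha, i, edge):
    return alpha[i] + sum(alpha[j] for j in edge) - 1


def b(alpha, i, edge):
    return -alpha[i]


def tile_vertices(tile):
    i, edge = tile
    return {frozenset((i, j)) for j in edge}   # vertex Y_ij as an unordered pair


def indices(vertex):
    t = tuple(vertex)
    return (t[0], t[0]) if len(t) == 1 else t  # diagonal vertex Y_ii has one index


def tiling_value(alpha, P, N, distant=frozenset()):
    """Return S_1 + S_2 for tiles given as (bracing node, bracing edge)."""
    in_P = set().union(*(tile_vertices(t) for t in P))
    in_N = set().union(*(tile_vertices(t) for t in N))
    W = in_P | in_N
    UP, UN = W - in_N, W - in_P                # unmatched positive / negative
    s1 = sum(a(alpha, i, e) for i, e in P) + sum(b(alpha, i, e) for i, e in N)
    s2 = (sum(delta(alpha, *indices(v), distant) for v in UP)
          + sum(beta(alpha, *indices(v), distant) for v in UN))
    return s1 + s2


# k = 2, alpha_j = 1/2 + 1/10 on vertices 1, 2, 3 (index 0 is alpha_0 = 1).
alpha = [Fraction(1)] + [Fraction(3, 5)] * 3
single_tile = tiling_value(alpha, P=[(1, (1, 2))], N=[])
two_tiles = tiling_value(alpha, P=[(1, (1, 2))], N=[(1, (2, 3))])
```

In the second tiling the vertex Y12 is shared by the positive and the negative tile, so only Y11 (unmatched positive) and Y13 (unmatched negative) contribute to the second sum.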
Lemma 2 (Special Case of Farkas’s Lemma). Constraints (6)–(9) are unsatisfiable
iff there exists a tiling H = (W, P, N) for G such that $S^H < 0$.
Proof. Note first that by the general form of Farkas’s lemma constraints (6)–(9)
are unsatisfiable iff there exists a positive rational linear combination of them
where the LHS is 0 and the RHS is negative.
Now suppose that there exists a tiling H = (W, P, N) such that $S^H < 0$.
Consider the following linear integer combination of the constraints: (1) For
each tile $e = \{Y_{ij_1}, \dots, Y_{ij_k}\} \in H$, if e ∈ P, add the constraint
$Y_{ij_1} + \dots + Y_{ij_k} \le a(i, j_1, \dots, j_k)$; if e ∈ N, add the constraint
$-Y_{ij_1} - \dots - Y_{ij_k} \le b(i, j_1, \dots, j_k)$;
(2) For each v ∈ UN labelled Yij, add the constraint Yij ≤ β(i, j); (3) For
each v ∈ UP labelled Yij, add the constraint −Yij ≤ δ(i, j). But then, for this
combination of constraints the LHS equals 0 while the RHS equals $S^H < 0$. So
by Farkas’s lemma the constraints are unsatisfiable.
Now assume on the other hand that the constraints are unsatisfiable. So there
exists a positive rational linear combination of the constraints such that the LHS
is 0 and the RHS is negative. By clearing out denominators, we can assume that
the linear combination has integer coefficients. Hence, as β(i, j) + δ(i, j) ≥ 0
always, our combination must contain constraints of type (8) and (9). Moreover,
since the LHS is 0, for each Yij appearing in the integer combination there
must be a corresponding occurrence of −Yij. But then, it is easy to see that the
constraints in the integer linear combination can be grouped into a set of tilings
{Hi = (Wi, Pi, Ni)} such that the RHS of the linear combination equals $\sum_i S^{H_i}$.
Since the RHS is negative, it must be that at least one of the tilings H in the
set is such that $S^H < 0$. The lemma follows.
So to show that the constraints for the matrix Y are consistent and thus
complete the proof of the Inductive Claim for m + 1 (and complete the proof of
Theorem 3), we will show that $S^H \ge 0$ for any tiling H = (W, P, N) for G. To
that end, fix a tiling H = (W, P, N) for G. Our analysis divides into three cases
depending on the size of UPD, the set of unmatched positive diagonal vertices
in H. In the first (and easiest) case, |UPD| = 0; in the second, |UPD| ≥ 2; and
in the final, |UPD| = 1. We will show that $S^H \ge 0$ in all these cases. To reduce
clutter, we drop the superscript H from $S_1^H$, $S_2^H$ and $S^H$. In what follows, let C
be the subgraph of G induced by the bracing edges of all tiles in H.
We first note two easy facts about H used below:
Proposition 1. 1. Suppose H contains a diagonal vertex. Then C is connected.
Moreover, for any vertex labelled Yij in H, there exists a path p
in H such that the bracing edges corresponding to the tiles in p form a path
p′ from i to j in C.
2. The distance between any two diagonal vertices in H is at least girth(G)/2.
Proof. Part (2) is sketched in the Appendix. We leave part (1) as an easy exercise.
Case 1: No Unmatched Positive Diagonal Vertices
Consider the following sum:
$S'_2 = \sum_{Y_{ij} \in \mathrm{UN}} \alpha_i \alpha_j - \sum_{Y_{ij} \in \mathrm{UP}} \alpha_i \alpha_j$. (12)
Note that since $\alpha \in [0, 1]^{n+1}$, it follows that −αiαj ≤ 1 − αi − αj and αiαj ≤ αi.
So since there are no unmatched positive diagonal vertices in the tiling, αiαj ≤
β(i, j) and −αiαj ≤ δ(i, j) for all unmatched vertices labelled Yij in the tiling,
and hence, $S_2 \ge S'_2$. So to show S ≥ 0 in this case, it suffices to show $S_1 + S'_2 \ge 0$.
To that end, consider the following sum:
$\sum_{\{Y_{ij_1}, \dots, Y_{ij_k}\} \in P} (-\alpha_i \alpha_{j_1} - \dots - \alpha_i \alpha_{j_k}) + \sum_{\{Y_{ij_1}, \dots, Y_{ij_k}\} \in N} (\alpha_i \alpha_{j_1} + \dots + \alpha_i \alpha_{j_k})$. (13)
By properties (2) and (3) of a tiling, it follows that (13) telescopes and equals
$S'_2$. But then, to show that S ≥ 0 in this case it suffices to show that for each
$\{Y_{ij_1}, \dots, Y_{ij_k}\} \in P$,
$a(i, j_1, \dots, j_k) - [\alpha_i (\alpha_{j_1} + \dots + \alpha_{j_k})] = (1 - \alpha_i)(\alpha_{j_1} + \dots + \alpha_{j_k} - 1)$, (14)
is nonnegative (the first term comes from the term in S1 for the tile, and the
second from the term for the tile in (13)), and for each $\{Y_{ij_1}, \dots, Y_{ij_k}\} \in N$,
$b(i, j_1, \dots, j_k) + [\alpha_i (\alpha_{j_1} + \dots + \alpha_{j_k})] = \alpha_i (\alpha_{j_1} + \dots + \alpha_{j_k} - 1)$, (15)
is nonnegative (again, the first term comes from S1 and the second from (13)).
But (14) and (15) are both nonnegative since the bracing edges for all tiles are
in G and α satisfies the edge constraints for G. Hence, S ≥ 0 in this case.
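For completeness, identities (14) and (15) follow by direct expansion from the definitions of a and b. In the sketch below, σ abbreviates the sum of α over the bracing edge; this shorthand is ours and does not appear in the paper.

```latex
\begin{align*}
\sigma &:= \alpha_{j_1} + \dots + \alpha_{j_k} \\
a(i, j_1, \dots, j_k) - \alpha_i \sigma
  &= \alpha_i + \sigma - 1 - \alpha_i \sigma
   = (1 - \alpha_i)(\sigma - 1) \\
b(i, j_1, \dots, j_k) + \alpha_i \sigma
  &= -\alpha_i + \alpha_i \sigma
   = \alpha_i (\sigma - 1)
\end{align*}
```

Both right-hand sides are nonnegative because 0 ≤ αi ≤ 1 and the edge constraint gives σ ≥ 1.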
Case 2: At Least 2 Unmatched Positive Diagonal Vertices
We first define some notation to refer to quantities (14) and (15) which will
be used again in this case: For a tile $e = \{Y_{ij_1}, \dots, Y_{ij_k}\}$, define ζ(e) to be
$(1 - \alpha_i)(\alpha_{j_1} + \dots + \alpha_{j_k} - 1)$ if e ∈ P and $\alpha_i (\alpha_{j_1} + \dots + \alpha_{j_k} - 1)$ if e ∈ N.
In Case 1 we showed that S ≥ 0 by (1) defining a sum $S'_2$ such that
$S \ge S_1 + S'_2$, then (2) noting that $S_1 + S'_2 = \sum_{e \in H} \zeta(e)$, and finally (3) showing that
ζ(e) ≥ 0 for all tiles e in H. Unfortunately, in the current case, since H contains
unmatched positive diagonal vertices, it is no longer true that $S \ge S_1 + S'_2$.
However, it is easy to see that $S \ge (S_1 + S'_2) - \sum_{Y_{ii} \in \mathrm{UPD}} (\alpha_i - \alpha_i^2)$. In particular,
$S \ge \sum_{e \in H} \zeta(e) - \sum_{Y_{ii} \in \mathrm{UPD}} (\alpha_i - \alpha_i^2)$. So since ζ(e) ≥ 0 for all tiles, to show that
S ≥ 0 for the current case, it suffices to show that for “many” tiles e in H, ζ(e)
is sufficiently large so that $\sum_{e \in H} \zeta(e) \ge \sum_{Y_{ii} \in \mathrm{UPD}} (\alpha_i - \alpha_i^2)$.
We require the following lemma:
Lemma 3. If there are ≥ 2 diagonal vertices in H, then there exist disjoint
paths of length girth(G)/4 − 2 in H which do not involve any diagonal vertices.
Proof. Let Q be a tree in H that spans all the diagonal vertices in H (such a
tree exists since H is connected). For each diagonal vertex u ∈ H let B(u) be
the tiles in Q with distance at most girth(G)/4 −1 from u. Then for all diagonal
vertices u, v ∈ H, the balls B(u) and B(v) must be disjoint: otherwise, the
distance between u and v would be less than girth(G)/2, contradicting part (2)
of Proposition 1. Since Q is a spanning tree, for each diagonal vertex u ∈ H,
there exists a path of length at least girth(G)/4 − 2 in B(u).
By the lemma, for each vertex v ∈ UPD there exists a path pv in H of length
girth(G)/4 − 2 such that for distinct u, v ∈ UPD the paths pu and pv share no
tiles. We will show (provided that n is sufficiently large) that for each v ∈ UPD,
$\sum_{e \in p_v} \zeta(e) \ge 1$. Since $\alpha_i - \alpha_i^2 \le \frac{1}{4}$, it will follow that S ≥ 0, completing the
proof for this case.
So fix v ∈ UPD. Since $R_m \le \mathrm{girth}(G)/20$, pv contains at least girth(G)/20
disjoint pairs of adjacent overloaded tiles (i.e., $\alpha_j = \frac{1}{k} + \gamma$ for all vertices j in
the bracing edges of the two tiles in the pair). Let $e_1 = \{Y_{qr_1}, \dots, Y_{qr_k}\} \in P$
and $e_2 = \{Y_{st_1}, \dots, Y_{st_k}\} \in N$ be such a pair. Note that $\zeta(e_1) = (1 - \alpha_q) k\gamma$
and $\zeta(e_2) = \alpha_s k\gamma$. Now, either $s = r_\ell$ for some ℓ ∈ [k], or s = q. If $s = r_\ell$,
then $\alpha_s = \frac{1}{k} + \gamma$ and $\zeta(e_1) + \zeta(e_2) \ge k\gamma (\frac{1}{k} + \gamma)$. If instead s = q, then
$\zeta(e_1) + \zeta(e_2) \ge k\gamma$. In either case, $\zeta(e_1) + \zeta(e_2) \ge k\gamma (\frac{1}{k} + \gamma)$. Hence, summing
over all girth(G)/20 disjoint pairs of adjacent overloaded tiles, it follows that
$\sum_{e \in p_v} \zeta(e) \ge k\gamma (\frac{1}{k} + \gamma)\, \mathrm{girth}(G)/20$. As desired, the latter is indeed greater
than 1 for large n since girth(G) = Ω(log n).
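The two cases for an adjacent overloaded pair, and the resulting girth threshold, can be written out as follows. This is a sketch that uses only the definition of ζ and the fact that overloaded bracing edges are oversatisfied by exactly kγ; the explicit threshold in the last line is our rearrangement and is not stated in the paper.

```latex
\begin{align*}
s = q: \quad & \zeta(e_1) + \zeta(e_2) = (1 - \alpha_q) k\gamma + \alpha_q k\gamma = k\gamma \\
s = r_\ell: \quad & \zeta(e_1) + \zeta(e_2) \ge 0 + \left( \tfrac{1}{k} + \gamma \right) k\gamma \\
\sum_{e \in p_v} \zeta(e) \ge{} & \frac{\mathrm{girth}(G)}{20}\, k\gamma \left( \tfrac{1}{k} + \gamma \right)
  = \frac{\mathrm{girth}(G)}{20}\, \gamma (1 + k\gamma) \ge 1
  \quad \text{whenever } \mathrm{girth}(G) \ge \frac{20}{\gamma (1 + k\gamma)}
\end{align*}
```

Since each vertex of UPD contributes at most 1/4 to the subtracted sum, a lower bound of 1 per path is more than enough.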
Case 3: Exactly 1 Unmatched Positive Diagonal Vertex
The argument in Case 2 actually rules out the possibility that the tiling contains
two or more diagonal vertices, unmatched or not. Hence, we will assume that
our tiling contains exactly one diagonal vertex labelled without loss of generality
by Y11. We consider two subcases.
Subcase 1: At least one of the unmatched vertices is a distant pair:
Arguing as in Case 2, we have in this subcase that $S = [\sum_{e \in H} \zeta(e)] - (\alpha_1 - \alpha_1^2)$
where, moreover, ζ(e) ≥ 0 for all tiles e ∈ H. Hence, to show that S ≥ 0 in
this subcase, it suffices to show that there exists a subset H′ ⊆ H such that
$\sum_{e \in H'} \zeta(e) \ge \frac{1}{4} \ge \alpha_1 - \alpha_1^2$.
To that end, let v be an unmatched vertex in H labelled Yij where {i, j} is
a distant pair. By part (1) of Proposition 1 there is a path p in H such that
the bracing edges corresponding to the tiles in p form a path q in C connecting
vertices i and j. Since i, j are distant, q must contain, by definition, at least
$4R_{m+1} + 1/\gamma$ overloaded edges. Moreover, it is not hard to see that there exists
a sub-path p′ of p of length at most $5R_{m+1} + 1/\gamma$ such that the bracing edges of
the tiles in p′ include all these overloaded edges. Hence p′ must contain at least
1/γ disjoint pairs of overloaded tiles. But then, arguing as in Case 2, it follows
that $\sum_{e \in p'} \zeta(e) \ge (1/\gamma)(k\gamma (\frac{1}{k} + \gamma)) \ge \frac{1}{4}$, and hence that S ≥ 0 in this subcase.
Subcase 2: None of the unmatched vertices is a distant pair:
Note first that we can assume that H contains no cycles: If it does, then it
is easy to see that C must also contain a cycle, and hence, we can use the ideas
from Case 2 to show that S ≥ 0.
So assume H has no cycles and define a tree T as follows: There is a node
in T for each tile in H. The root root(T ) corresponds to the tile containing Y11.
There is an edge between two nodes in T iff the tiles corresponding to the nodes
share a vertex. Note that T is a tree since H is acyclic. For a node v ∈ T , let
Tile(v) denote its corresponding tile in H. Finally, for a node v ∈ T we abuse
notation and say v ∈ P (resp., v ∈ N) if Tile(v) ∈ P (resp., Tile(v) ∈ N).
Recursively define a function t on the nodes of T as follows: Let v be a node
in T and suppose $\mathrm{Tile}(v) = \{Y_{ij_1}, \dots, Y_{ij_k}\}$. If v ∈ P, then
$t(v) = a(i, j_1, \dots, j_k) + \sum_{v' \in \mathrm{Child}(v)} t(v') + \sum_{\text{unmatched } Y_{ij} \text{ in } \mathrm{Tile}(v)} \delta(i, j)$. (16)
If instead v ∈ N, then
$t(v) = b(i, j_1, \dots, j_k) + \sum_{v' \in \mathrm{Child}(v)} t(v') + \sum_{\text{unmatched } Y_{ij} \text{ in } \mathrm{Tile}(v)} \beta(i, j)$. (17)

A simple induction now shows that t(root(T )) = S (recall that root(T ) corre-
sponds to the tile containing Y11). Hence, to show that S ≥ 0 in this subcase it
suffices to show that t(root(T )) ≥ 0. This will follow from the following lemma:
Lemma 4. Given v ∈ T , let $e = \mathrm{Tile}(v) = \{Y_{ij_1}, \dots, Y_{ij_k}\}$. Assume, without
loss of generality, that $Y_{ij_1}, \dots, Y_{ij_\ell}$ (where 0 ≤ ℓ ≤ k) are the labels of the
vertices in e shared with the tile corresponding to v’s parent (where ℓ = 0 iff
v = root(T )). Then,
1. If v = root(T ), then t(v) = t(root(T )) ≥ 0; otherwise,
2. If v ∈ P, then $t(v) \ge \min(\alpha_i, \sum_{r=1}^{\ell} \alpha_{j_r})$;
3. If instead v ∈ N, then $t(v) \ge \min(0, 1 - \alpha_i - \sum_{r=1}^{\ell} \alpha_{j_r})$.
Proof. The proof is by induction on the size of the subtree at v. For the base
case, let v be a leaf. Hence, the vertices labelled by $Y_{ij_{\ell+1}}, \dots, Y_{ij_k}$ in e are all
unmatched non-distant pair vertices (since H contains no unmatched distant
pair vertices). Suppose v ∈ P. By equation (16),
$t(v) = \alpha_i + \sum_{r=1}^{k} \alpha_{j_r} - 1 + \sum_{r=\ell+1}^{k} \delta(i, j_r)$. (18)
There are two subcases to consider. In the first subcase, $j_r \ne i$ for all r =
ℓ + 1, . . . , k. Then $\delta(i, j_r) = 1 - \alpha_i - \alpha_{j_r}$ for all r = ℓ + 1, . . . , k and it follows
that $t(v) = \sum_{r=1}^{\ell} \alpha_{j_r} + (k - \ell - 1)(1 - \alpha_i) \ge \sum_{r=1}^{\ell} \alpha_{j_r}$. In the second subcase,
$j_r = i$ for some r, say r = k. This can only happen for one r since H only
contains one diagonal vertex, namely Y11. In particular, v must be root(T ).
Then $\delta(i, j_k) = -\alpha_i$ and $\delta(i, j_r) = 1 - \alpha_i - \alpha_{j_r}$ for all r = ℓ + 1, . . . , k − 1. Since
α satisfies the edge constraints for G, it follows that t(v) = t(root(T )) ≥ 0.
Suppose instead that v ∈ N. By equation (17), $t(v) = -\alpha_i + \sum_{r=\ell+1}^{k} \beta(i, j_r)$.
Now, $\beta(i, j_r) \ge \min(\alpha_i, \alpha_{j_r})$ for all r = ℓ + 1, . . . , k. If $\beta(i, j_r) = \alpha_i$ for some
$j_r$, then t(v) ≥ 0 as desired. So assume $\beta(i, j_r) = \alpha_{j_r}$ for all r = ℓ + 1, . . . , k.
Since α satisfies the edge constraints for G, $\sum_{r=1}^{k} \alpha_{j_r} \ge 1$. But then,
$t(v) \ge 1 - \alpha_i - \sum_{r=1}^{\ell} \alpha_{j_r}$, completing the proof for the base case.
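The arithmetic behind the two subcases for a leaf v ∈ P can be expanded as follows. This is a sketch obtained by substituting the values of δ into (18); the closed form in the second subcase is our own simplification and is not stated in the paper.

```latex
\begin{align*}
\text{First subcase } (j_r \ne i): \quad
t(v) &= \alpha_i + \sum_{r=1}^{k} \alpha_{j_r} - 1 + \sum_{r=\ell+1}^{k} (1 - \alpha_i - \alpha_{j_r}) \\
     &= \alpha_i + \sum_{r=1}^{\ell} \alpha_{j_r} - 1 + (k - \ell)(1 - \alpha_i) \\
     &= \sum_{r=1}^{\ell} \alpha_{j_r} + (k - \ell - 1)(1 - \alpha_i) \\
\text{Second subcase } (j_k = i,\ \ell = 0): \quad
t(v) &= \alpha_i + \sum_{r=1}^{k} \alpha_{j_r} - 1 - \alpha_i + \sum_{r=1}^{k-1} (1 - \alpha_i - \alpha_{j_r}) \\
     &= \alpha_{j_k} - 1 + (k - 1)(1 - \alpha_i)
      = (k - 2)(1 - \alpha_i)
\end{align*}
```

The first expression is at least the sum over the shared vertices whenever ℓ ≤ k − 1, and the second is nonnegative for every k ≥ 2.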
For the inductive step, let v be any node in T and assume the lemma holds for
all children of v. Assume first that v ∈ P and hence, all children of v are in N (by
definition of a tiling). By the induction hypothesis, there are two possibilities:
either (a) t(v′) ≥ 0 for all children v′ of v, and there are no unmatched vertices
in Tile(v), or (b) there exists some unmatched vertex in Tile(v) or
$t(v') \ge 1 - \alpha_i - \sum_{s=1}^{t} \alpha_{j_{r_s}}$ for some child v′ of v. In case (a), it follows from (16) and
from the fact that α satisfies the edge constraints for G that t(v) ≥ αi. In
case (b), using the arguments from the base case, it follows that t(v) ≥ 0 when
v = root(T ), and $t(v) \ge \sum_{r=1}^{\ell} \alpha_{j_r}$ when v ≠ root(T ).
Now assume that v ∈ N. The arguments from the above case when v ∈ P,
as well as the arguments from the base case, can now be adapted to show that
$t(v) \ge \min(0, 1 - \alpha_i - \sum_{r=1}^{\ell} \alpha_{j_r})$. Note that in this case v ≠ root(T ) since
root(T ) ∈ P. The inductive step, and hence the lemma, now follow.
So S ≥ 0 in Case 3 also, and the Inductive Claim now follows for m + 1.

4 Discussion
The integrality gap of $2 - \epsilon$ obtained in [2] for graph vertex cover holds for $\Omega(\sqrt{\log n})$ rounds of LS (improved to $\Omega(\log n)$ rounds in [3]). We conjecture that our integrality gaps in the hypergraph case should also hold for at least $\Omega(\log n)$ rounds. Indeed, examining the proof of Theorem 3, it can be shown that if the recursive definition of $R_m$ is changed to $R_m = R_{m+1} + 1/\gamma$, then all cases considered in the proof except Subcase 1 of Case 3 can be argued for $\Omega(\log n)$ rounds. While it can be argued that $S \ge 0$ for $\Omega(\sqrt{\log n})$ rounds for graphs in this subcase (and in fact for $\Omega(\log n)$ rounds if the definition of $(S, R, \gamma)$-vectors is further refined as in [3]), a proof for hypergraphs eludes us.
The integrality gaps of $k - 1 - \epsilon$ given in [1] for k-uniform hypergraph vertex
cover held not only for LS but also for LS+ liftings, the stronger semidefinite
version of Lovász-Schrijver liftings. As mentioned in the introduction, obtaining
optimal integrality gaps for both graph and hypergraph vertex cover for LS+
remains a difficult open question.
Finally, can our techniques be applied to other problems? For instance, could
our techniques be used to show that, say, even after log log n rounds of LS the
integrality gap of the linear relaxation for max-cut remains larger than the ap-
proximation factor attained by the celebrated SDP-based Goemans-Williamson
algorithm [11]?
Acknowledgements
I would like to thank Sanjeev Arora and Elad Hazan for many helpful discussions.
I would also like to thank the anonymous referees for helpful comments.
References
1. M. Alekhnovich, S. Arora, and I. Tourlakis. Towards strong nonapproximabil-
ity results in the Lovász-Schrijver hierarchy. In Proceedings of the 37th Annual
Symposium on the Theory of Computing, New York, May 2005. ACM Press.
2. S. Arora, B. Bollobás, and L. Lovász. Proving integrality gaps without knowing the
linear program. In Proceedings of the 43rd Symposium on Foundations of Computer
Science (FOCS-02), pages 313–322, Los Alamitos, Nov. 16–19 2002.
3. S. Arora, B. Bollobás, L. Lovász, and I. Tourlakis. Proving integrality gaps without
knowing the linear program. Manuscript, 2005.
4. S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and
graph partitioning. In Proceedings of the thirty-sixth annual ACM Symposium on
Theory of Computing (STOC-04), pages 222–231, New York, June 13–15 2004.
ACM Press.
5. J. Buresh-Oppenheim, N. Galesi, S. Hoory, A. Magen, and T. Pitassi. Rank bounds
and integrality gaps for cutting planes procedures. In FOCS: IEEE Symposium on
Foundations of Computer Science (FOCS), 2003.
6. W. Cook and S. Dash. On the matrix-cut rank of polyhedra. Mathematics of
Operations Research, 26(1):19–30, 2001.

7. I. Dinur, V. Guruswami, S. Khot, and O. Regev. A new multilayered PCP and the
hardness of hypergraph vertex cover. In ACM, editor, Proceedings of the Thirty-
Fifth ACM Symposium on Theory of Computing, San Diego, CA, USA, June 9–11,
2003, pages 595–601, New York, NY, USA, 2003. ACM Press.
8. P. Erdős. Graph theory and probability. Canadian Journal of Mathematics, 11:34–
38, 1959.
9. U. Feige and R. Krauthgamer. The probable value of the Lovász–Schrijver relax-
ations for maximum independent set. SIAM Journal on Computing, 32(2):345–370,
Apr. 2003.
10. M. X. Goemans and L. Tunçel. When does the positive semidefiniteness constraint
help in lifting procedures. Mathematics of Operations Research, 26:796–815, 2001.
11. M. X. Goemans and D. P. Williamson. .878-approximation algorithms for MAX
CUT and MAX 2SAT. In Proceedings of the Twenty-Sixth Annual ACM Sym-
posium on the Theory of Computing, pages 422–431, Montréal, Québec, Canada,
23–25 May 1994.
12. L. Lovász and A. Schrijver. Cones of matrices and set-functions and 0-1 optimiza-
tion. SIAM Journal on Optimization, 1(2):166–190, May 1991.
13. H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous
and convex hull representations for zero-one programming problems. SIAM Journal
on Discrete Mathematics, 3(3):411–430, Aug. 1990.
14. T. Stephen and L. Tunçel. On a representation of the matching polytope via
semidefinite liftings. Mathematics of Operations Research, 24(1):1–7, 1999.
A Proof Sketch of Part (2) of Proposition 1
Since $H$ is connected, there exists a path between any two diagonal vertices in $H$. We will show that the subgraph of $G$ induced by the bracing edges of the tiles in such a path must contain a cycle. Part (2) will then follow.
To that end, let $q$ be an arbitrary path in $H$ comprised, in order, of tiles $e_1, \ldots, e_r$. Consider these tiles in order beginning with $e_1$. As long as the bracing node in successive tiles does not change, the bracing edges of these tiles form a path $q'$ in $G$. If the bracing node changes at some tile, say $e_i$, then the bracing edge for $e_i$ starts a new path $q''$ in $G$. Moreover, some vertex $w \in G$ from the last edge visited in $q'$ becomes the (new) bracing node for $e_i$. The bracing edges of the tiles following $e_i$ now extend $q''$ until an edge $e_j$ is encountered with yet a different bracing node. But then, the bracing edge for $e_j$ must contain $w$. Hence the bracing edge for $e_j$ must extend $q'$. Continuing this argument, we see that each time the bracing node switches, the tiles switch back and forth between contributing to $q'$ and to $q''$.
Each time the current tile $e$ contains some diagonal vertex labelled $Y_{ii}$, the bracing node $i$ for $e$ is also contained in the bracing edge for $e$. But then, since the bracing node belongs to, say, $q'$ while the bracing edge belongs to $q''$, it follows that $q'$ and $q''$ must intersect at vertex $i$ in $G$. Hence, if a tile path $q$ contains two diagonal vertices, $q'$ and $q''$ must intersect twice. Part (2) follows.

Bounds for Error Reduction with Few Quantum Queries
Sourav Chakraborty¹, Jaikumar Radhakrishnan²,³, and Nandakumar Raghunathan¹
¹ University of Chicago, Chicago, IL 60637, USA
{sourav,nanda}@cs.uchicago.edu
² Toyota Technological Institute at Chicago, IL 60637, USA
jaikumar@tti-c.org
³ School of Technology and Computer Science,
Tata Institute of Fundamental Research, Mumbai 400005, India
Abstract. We consider the quantum database search problem, where we are given a function $f : [N] \to \{0, 1\}$, and are required to return an $x \in [N]$ (a target address) such that $f(x) = 1$. Recently, Grover [G05] showed that there is an algorithm that, after making one quantum query to the database, returns an $X \in [N]$ (a random variable) such that
\[ \Pr[f(X) = 0] = \epsilon^3, \]
where $\epsilon = |f^{-1}(0)|/N$. Using the same idea, Grover derived a t-query quantum algorithm (for infinitely many t) that errs with probability only $\epsilon^{2t+1}$. Subsequently, Tulsi, Grover and Patel [TGP05] showed, using a different algorithm, that such a reduction can be achieved for all t. This method can be placed in a more general framework, where given any algorithm that produces a target state for some database f with probability of error $\epsilon$, one can obtain another that makes t queries to f, and errs with probability $\epsilon^{2t+1}$. For this method to work, we do not require prior knowledge of $\epsilon$. Note that no classical randomized algorithm can reduce the error probability to significantly below $\epsilon^{t+1}$, even if $\epsilon$ is known. In this paper, we obtain lower bounds that show that the amplification achieved by these quantum algorithms is essentially optimal. We also present simple alternative algorithms that achieve the same bound as those in Grover [G05], and have some other desirable properties. We then study the best reduction in error that can be achieved by a t-query quantum algorithm, when the initial error is known to lie in an interval of the form $[\ell, u]$. We generalize our basic algorithms and lower bounds, and obtain nearly tight bounds in this setting.
1 Introduction
In this paper, we consider the problem of reducing the error in quantum search algo-
rithms by making a small number of queries to the database. Error reduction in the
form of amplitude amplification is one of the central tools in the design of efficient
quantum search algorithms [G98a, G98b, BH+02]. In fact, Grover’s database search al-
gorithm [G96, G97] can be thought of as amplitude amplification applied to the trivial
algorithm that queries the database at a random location and succeeds with probability at least $1/N$. The key feature of quantum amplitude amplification is that it can boost the success probability from a small quantity $\delta$ to a constant in $O(1/\sqrt{\delta})$ steps, whereas, in general, a classical algorithm for this would require $\Omega(1/\delta)$ steps. This basic algorithm has been refined, taking into account the number of solutions and the desired final success probability $1 - \epsilon$. For example, Buhrman, Cleve, de Wolf and Zalka [BC+99]
obtained the following:
Theorem [BC+99]: Fix $\eta \in (0, 1)$, and let $N > 0$, $\epsilon \ge 2^{-N}$, and $t \le \eta N$. Let $T$ be the optimal number of queries a quantum computer needs to search with error $\le \epsilon$ through an unordered list of $N$ items containing at least $t$ solutions. Then $\log 1/\epsilon \in \Theta(T^2/N + T\sqrt{t/N})$. (Note that the constant implicit in the $\Theta$ notation can depend on $\eta$.)
Recently, Grover [G05] considered error reduction for algorithms that err with small probability. The results were subsequently refined and extended by Tulsi, Grover and Patel [TGP05]. Let us describe their results in the setting of the database search problem, where, given a database $f : [N] \to \{0, 1\}$, we are asked to determine an $x \in f^{-1}(1)$. If $|f^{-1}(0)| = \epsilon N$, then choosing $x$ uniformly at random will meet the requirements with probability at least $1 - \epsilon$. This method makes no queries to the database. If one is allowed one classical query, the error can be reduced to $\epsilon^2$ and, in general, with t classical queries one can reduce the probability of error to $\epsilon^{t+1}$. It can be shown that no classical t-query randomized algorithm for the problem can reduce the probability of error significantly below $\epsilon^{t+1}$, even if the value of $\epsilon$ is known in advance. Grover [G05] presented an interesting algorithm that makes one quantum query and returns an $x$ that is in $f^{-1}(1)$ with probability $1 - \epsilon^3$. Tulsi, Grover and Patel [TGP05] showed an iteration where one makes just one query to the database and performs a measurement, so that after t iterations of this operator the error is reduced to $\epsilon^{2t+1}$. This algorithm works for all $\epsilon$ and is not based on knowing $\epsilon$ in advance. Thus this iteration can be said to exhibit a "fixed point" behavior [G05, TGP05], in that the state approaches the target state (or subspace) closer with each iteration, just as it does in randomized classical search. The iteration used in the usual Grover search algorithm [G98a, G98b] does not have this property. Note, however, that if the initial success probability is $1/N$, these new algorithms make $\Omega(N)$ queries to the database, whereas the original algorithm makes just $O(\sqrt{N})$ queries.
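As a rough numerical illustration of the error exponents quoted in this paragraph, the following sketch compares how many queries are needed to push the error below a target when a classical t-query algorithm reaches $\epsilon^{t+1}$ and a quantum t-query algorithm reaches $\epsilon^{2t+1}$. The function name `min_queries` and the sample values are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

def min_queries(eps, target, quantum):
    """Smallest t whose error bound is at most `target`.

    Uses the exponents quoted in the text: eps**(t + 1) for a classical
    t-query algorithm and eps**(2*t + 1) for a quantum one.
    """
    eps, target = Fraction(eps), Fraction(target)
    if not (0 < eps < 1) or target <= 0:
        raise ValueError("need 0 < eps < 1 and target > 0")
    t = 0
    while eps ** (2 * t + 1 if quantum else t + 1) > target:
        t += 1
    return t

# Illustrative values only: initial error 1/2, desired error at most 1/1000.
classical_t = min_queries(Fraction(1, 2), Fraction(1, 1000), quantum=False)
quantum_t = min_queries(Fraction(1, 2), Fraction(1, 1000), quantum=True)
print(classical_t, quantum_t)
```

In this instance the quantum query count is roughly half the classical one, which matches the factor of two between the exponents.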
In [G05], the database is assumed to be presented by means of an oracle of the form $|x\rangle \mapsto \exp(f(x)\pi i/3)|x\rangle$. The standard oracle for a function f used in earlier works on quantum search is $|x\rangle|b\rangle \mapsto |x\rangle|b \oplus f(x)\rangle$, where $\oplus$ is addition modulo two. It can be shown that the oracle assumed in [G05] cannot be implemented by using just one query to the standard oracle. In Tulsi, Grover and Patel [TGP05] the basic iteration uses the controlled version of the oracle, namely $|x\rangle|b\rangle|c\rangle \mapsto |x\rangle|b \oplus c \cdot f(x)\rangle|c\rangle$.
In this paper, we present a version of the algorithm that achieves the same reduction in error as in [G05], but uses the standard oracle. In fact, our basic one-query algorithm has the following natural interpretation. First, we note that for $\delta \le 3/4$, there is a one-query quantum algorithm $A_\delta$ that makes no error if $|f^{-1}(0)| = \delta N$. Then, using a simple computation, one can show that the one-query algorithm corresponding to $\delta = 1/N$ errs with probability less than $\epsilon^3$ when $|f^{-1}(0)| = \epsilon N$. One can place these algorithms

in a more general framework, just as later works due to Grover [G98a, G98b] and Brassard, Hoyer, Mosca and Tapp [BH+02] placed Grover's original database search algorithm [G96, G97] in the general amplitude amplification framework. The framework is as follows: Suppose there is an algorithm $G$ that guesses a solution to a problem along with a witness, which can be checked by another algorithm $T$. If the guess returned by $G$ is correct with probability $1 - \epsilon$, then there is another algorithm that uses $G$, $G^{-1}$, makes t queries to $T$, and guesses correctly with probability $1 - \epsilon^{2t+1}$.
These algorithms show that, in general, a t-query quantum algorithm can match the error reduction obtainable by any 2t-query randomized algorithm. Can one do even better? The main contribution of this paper is a set of lower bounds on the error probability of t-query algorithms. We show that the amplification achieved by these algorithms is essentially optimal (see Section 2.1 for the precise statement). Our result does not follow immediately from the result of Buhrman, Cleve, de Wolf and Zalka [BC+99] cited above because of the constants implicit in the $\Theta$ notation, but with a slight modification of their proofs one can derive a result similar to ours (see Section 2.1). Our lower bound result uses the polynomial method of Beals, Cleve, Buhrman, Mosca and de Wolf [BB+95] combined with an elementary analysis based on the roots of low degree polynomials, but unlike previous proofs using this method, we do not rely on any special tools for bounding the rate of growth of low degree polynomials.
2 Background, Definitions and Results
We first review the standard framework for quantum search. We assume that the reader is familiar with the basics of quantum circuits, especially the quantum database search algorithm of Grover [G96, G97] (see, for example, Nielsen and Chuang [NC, Chapter 6]). The database is modelled as a function $f : [N] \to S$, where $[N] \triangleq \{0, 1, 2, \ldots, N-1\}$ is the set of addresses and $S$ is the set of possible items to be stored in the database. For our purposes, we can take $S$ to be $\{0, 1\}$. When thinking of bits we identify $[N]$ with $\{0, 1\}^n$. Elements of $[N]$ will be called addresses, and addresses in $f^{-1}(1)$ will be referred to as targets. In the quantum model, the database is provided to us by means of an oracle unitary transformation $T_f$, which acts on an $(n + 1)$-qubit space by sending the basis vector $|x\rangle|b\rangle$ to $|x\rangle|b \oplus f(x)\rangle$. For a quantum circuit $A$ that makes queries to a database oracle in order to determine a target, we denote by $A(f)$ the random variable (taking values in $[N]$) returned by $A$ when the database oracle is $T_f$.
Definition 1. Let $A$ be a quantum circuit for searching databases of size $N$. For a database $f$ of size $N$, let $\mathrm{err}_A(f) = \Pr[A(f) \text{ is not a target state}]$. When $\epsilon N$ is an integer in $\{0, 1, 2, \ldots, N\}$, let
\[ \mathrm{err}_A(\epsilon) = \max_{f : |f^{-1}(0)| = \epsilon N} \mathrm{err}_A(f). \]
Using this notation, we can state Grover's result as follows.
Theorem 1 (Grover [G05]). For all $N$, there is a one-query algorithm $A$, such that for all $\epsilon$ (assuming $\epsilon N$ is an integer), $\mathrm{err}_A(\epsilon) = \epsilon^3$.
This error reduction works in a more general setting. Let [N] represent the set of
possible solutions to some problem, and let f : [N] → {0, 1} be the function that checks

that the solution is correct; as before we will assume that we are provided access to this function via the oracle $T_f$. Let $G$ be a unitary transform that guesses a solution in $[N]$ that is correct with probability $1 - \epsilon$. Our goal is to devise another guessing algorithm $B$ that, using $T_f$, $G$ and $G^{-1}$, produces a guess that is correct with significantly better probability. Let $B(T_f, G)$ be the answer returned by $B$ when the checker is $T_f$ and the guesser is $G$.
Theorem 2 (Grover [G05]). There is an algorithm $B$ that uses $T_f$ once, $G$ twice and $G^{-1}$ once, such that $\Pr[f(B(T_f, G)) = 0] = \epsilon^3$, where $\epsilon$ is the probability of error of the guessing algorithm $G$.
Note that Theorem 1 follows from Theorem 2 by taking $G$ to be the Hadamard transformation, which produces the uniform superposition on all $N$ states when applied to the state $|0\rangle$.
Theorem 3 ([G05, TGP05]). For all $t \ge 0$ and all $N$, there is a t-query quantum database search algorithm such that, for all $\epsilon$ ($\epsilon N$ is an integer), $\mathrm{err}_A(\epsilon) = \epsilon^{2t+1}$.
In Grover [G05], this result was obtained by recursive application of Theorem 2, and worked only for infinitely many t. Tulsi, Grover and Patel [TGP05] rederived Theorems 1 and 2 using a different one-query algorithm, which could be applied iteratively to get Theorem 3.
From now on when we consider error reduction for searching a database $f$ and use the notation $\mathrm{err}_A(\epsilon)$, $\epsilon$ will refer to $|f^{-1}(0)|/N$; in particular, we assume that $\epsilon N \in \{0, 1, \ldots, N - 1\}$. However, for the general framework, $\epsilon$ can be any real number in $[0, 1]$.
2.1 Our Contributions
As stated earlier, in order to derive the above results, Grover [G05] and Tulsi, Grover and Patel [TGP05] assume that access to the database is available using certain special types of oracles. In the next section, we describe alternative algorithms that establish Theorem 1 while using only the standard oracle $T_f : |x\rangle|b\rangle \mapsto |x\rangle|b \oplus f(x)\rangle$. The same idea can be used to obtain results analogous to Theorem 2. By recursively applying this algorithm we can derive a version of Theorem 3 for t of the form $(3^i - 1)/2$, where $i$ is the number of recursive applications. Our algorithms and those in Tulsi, Grover and Patel [TGP05] use similar ideas, but were obtained independently of each other.
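The form $t = (3^i - 1)/2$ can be checked with a short bookkeeping sketch. It assumes, following Theorem 2, that one application of the construction uses the checker once and the guesser (or its inverse) three times, and that at each level of the recursion the guesser is replaced by the algorithm from the previous level. The helper name `recursive_cost` is hypothetical.

```python
def recursive_cost(levels):
    """Return (queries, error_exponent) after `levels` nested applications.

    Assumption: one application turns a guesser with error eps into one with
    error eps**3, using one new query plus three runs of the inner guesser.
    """
    queries, exponent = 0, 1  # level 0: the bare guesser, no queries, error eps
    for _ in range(levels):
        queries = 3 * queries + 1  # three inner runs plus one new query
        exponent = 3 * exponent    # eps -> eps**3
    return queries, exponent

for i in range(6):
    t, exponent = recursive_cost(i)
    assert t == (3 ** i - 1) // 2  # query count has the stated form
    assert exponent == 2 * t + 1   # error is eps**(2t + 1), as in Theorem 3
    print(i, t, exponent)
```

Under these assumptions the query count and the error exponent stay linked by exponent = 2t + 1 at every level, which is why the recursion only reaches the special values of t listed above.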
We also consider error reduction when we are given a lower bound on the error probability $\epsilon$, and obtain analogs of Theorems 1 and 2 in this setting.
Theorem 4 (Upper bound result).
(a) For all $N$ and $\delta \in [0, 3/4]$, there is a one-query algorithm $A_\delta$ such that for all $\epsilon \ge \delta$,
\[ \mathrm{err}_{A_\delta}(f) \le \epsilon \left( \frac{\epsilon - \delta}{1 - \delta} \right)^2. \]
(b) For all $\delta \in [0, 3/4]$, there is an algorithm $B_\delta$ that uses $T_f$ once, $G$ twice and $G^{-1}$ once, such that
\[ \mathrm{err}_{B_\delta}(T_f, G) \le \epsilon \left( \frac{\epsilon - \delta}{1 - \delta} \right)^2. \]
The case $\epsilon = \delta$ corresponds to the fact that one can determine the target state with certainty if $|f^{-1}(0)|$ is known exactly and is at most $3N/4$. Furthermore, Theorems 1 and 2 can be thought of as special cases of Theorem 4 corresponding to $\delta = 0$. In fact, by taking $\delta = 1/N$ in Theorem 4, we obtain the following slight improvement over Theorem 1.
Corollary 1. For all $N$, there is a one-query database search algorithm $A$ such that for all $\epsilon$ (where $\epsilon N \in \{0, 1, \ldots, N\}$), we have
\[ \mathrm{err}_A(\epsilon) \le \epsilon \left( \frac{\epsilon - \frac{1}{N}}{1 - \frac{1}{N}} \right)^2. \]
Lower bounds: The main contribution of this work is our lower bound results. We show that the reduction in error obtained in Theorems 1 and 2 is essentially optimal.
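Theorem 5 (Lower bound result). Let $0 < \ell \le u < 1$ be such that $\ell N$ and $uN$ are integers.
(a) For all one-query database search algorithms $A$, for either $\epsilon = \ell$ or $\epsilon = u$,
\[ \mathrm{err}_A(\epsilon) \ge \epsilon^3 \left( \frac{u - \ell}{u + \ell - 2\ell u} \right)^2. \]
(b) For all t-query database search algorithms $A$, there is an $\epsilon \in [\ell, u]$ such that $\epsilon N$ is an integer, and
\[ \mathrm{err}_A(\epsilon) \ge \epsilon^{2t+1} \left( \frac{b - 1}{b + 1} - \frac{1}{\ell N(b + 1)} \right)^{2t}, \]
where $b = (u/\ell)^{1/(t+1)}$, and we assume that $\ell N(b - 1) > 1$.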
In particular, this result shows that to achieve the same reduction in error, a quantum algorithm needs to make roughly at least half as many queries as a classical randomized algorithm. A similar result can be obtained by modifying the proof in Buhrman, Cleve, de Wolf and Zalka [BC+99]: there is a constant $c > 0$, such that for all $u > 0$, all large enough $N$ and all t-query quantum search algorithms for databases of size $N$, there is an $\epsilon \in (0, u]$ ($\epsilon N$ is an integer) such that $\mathrm{err}_A(\epsilon) \ge (cu)^{2t+1}$.
3 Upper Bounds: Quantum Algorithms
In this section, we present algorithms that justify Theorems 1, 2 and 3, but by using the
standard database oracle. We then modify these algorithms to generalize and slightly
improve these theorems.

3.1 Alternative Algorithms Using the Standard Oracle
We first describe an alternative algorithm $A_0$ to justify Theorem 1. This simple algorithm (see Figure 1) illustrates the main idea used in all our upper bounds. We will work with n qubits corresponding to addresses in $[N]$ and one ancilla qubit. Although we do not simulate Grover's oracle directly, using the ancilla, we can reproduce the effect of the complex amplitudes used there with real amplitudes. As we will see, the resulting algorithm has an intuitive explanation and also some additional properties not enjoyed by the original algorithm.
[Figure 1 (not reproduced): amplitudes of target and non-target states of the form $|x\rangle|0\rangle$ and $|x\rangle|1\rangle$ after Step 1, Step 2 and Step 3.]
Fig. 1. The one-query algorithm
Step 1: We start with the uniform superposition on $[N]$ with the ancilla bit in the state $|0\rangle$.
Step 2: For targets $x$, transform $|x\rangle|0\rangle$ to $\frac{1}{2}|x\rangle|0\rangle + \sqrt{\frac{3}{4}}\,|x\rangle|1\rangle$. The basis states $|x\rangle|0\rangle$ for $x \in f^{-1}(0)$ are not affected by this transformation.
Step 3: Perform an inversion about the average controlled on the ancilla bit being $|0\rangle$, and then measure the address registers.
Step 1 is straightforward to implement using the n-qubit Hadamard transform $H_n$. For Step 2, using one query to $T_f$, we implement a unitary transformation $U_f$, which maps $|x\rangle|0\rangle$ to $|x\rangle|0\rangle$ if $f(x) = 0$ and to $|x\rangle\left(\frac{1}{2}|0\rangle + \sqrt{\frac{3}{4}}\,|1\rangle\right)$ if $f(x) = 1$. One such transformation is $U_f = (I_n \otimes R^{-1})\,T_f\,(I_n \otimes R)$, where $I_n$ is the n-qubit identity operator and $R$ is the one-qubit gate for rotation by $\pi/12$ (that is, $|0\rangle \mapsto \cos(\pi/12)|0\rangle + \sin(\pi/12)|1\rangle$ and $|1\rangle \mapsto \cos(\pi/12)|1\rangle - \sin(\pi/12)|0\rangle$). The inversion about the average is usually implemented as $A_n = -H_n(I_n - 2|0\rangle\langle 0|)H_n$. The controlled version we need is then given by
\[ A_{n,0} = A_n \otimes |0\rangle\langle 0| + I_n \otimes |1\rangle\langle 1|. \]
Let $H' = H_n \otimes I$. The final state is $|\phi_f\rangle = A_{n,0}\,U_f\,H'\,|0\rangle|0\rangle$.
To see that the algorithm works as claimed, consider the state just before the operator $A_{n,0}$ is applied. This state is
\[ \frac{1}{\sqrt{N}} \left[ \sum_{x \in f^{-1}(1)} \frac{1}{2}\,|x\rangle|0\rangle + \sum_{x \in f^{-1}(0)} |x\rangle|0\rangle \right] + \frac{1}{\sqrt{N}} \sum_{x \in f^{-1}(1)} \sqrt{\frac{3}{4}}\,|x\rangle|1\rangle. \]
Suppose $|f^{-1}(0)| = \epsilon N$. The "inversion about the average" is performed only on the first term, so the non-target states receive no amplitude from the second term. The average amplitude of the states in the first term is $\frac{1}{2\sqrt{N}}(1 + \epsilon)$ and the amplitude of the states $|x\rangle|0\rangle$ for $x \in f^{-1}(0)$ is $\frac{1}{\sqrt{N}}$. Thus, after the inversion about the average the amplitude of $|x\rangle|0\rangle$ for $x \in f^{-1}(0)$ is $\frac{\epsilon}{\sqrt{N}}$. It follows that if we measure the address registers in the state $|\phi_f\rangle$, the probability of observing a non-target state is exactly
\[ |f^{-1}(0)| \cdot \frac{\epsilon^2}{N} = \epsilon^3. \]
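The three steps can be traced numerically with a small state-vector sketch. It is only an illustration under simplifying assumptions: amplitudes are stored as two plain lists (one per ancilla value), the rotation $R$, the standard oracle $T_f$ and the controlled inversion about the average are applied directly to those lists, and the function name `one_query_error` is hypothetical.

```python
import math

def one_query_error(f, theta=math.pi / 12):
    """Simulate the one-query algorithm on database f (a list of 0/1 bits).

    Returns (probability of measuring a non-target address,
             probability that the ancilla is measured as 1).
    """
    n_items = len(f)
    # Step 1: uniform superposition over addresses, ancilla in |0>.
    amp0 = [1 / math.sqrt(n_items)] * n_items  # amplitudes of |x>|0>
    amp1 = [0.0] * n_items                     # amplitudes of |x>|1>

    # Step 2: U_f = (I (x) R^-1) T_f (I (x) R), acting on the ancilla per address.
    c, s = math.cos(theta), math.sin(theta)
    for x in range(n_items):
        a0, a1 = amp0[x], amp1[x]
        a0, a1 = c * a0 - s * a1, s * a0 + c * a1   # R
        if f[x] == 1:
            a0, a1 = a1, a0                         # T_f flips the ancilla on targets
        a0, a1 = c * a0 + s * a1, -s * a0 + c * a1  # R^-1
        amp0[x], amp1[x] = a0, a1

    # Step 3: inversion about the average, only on the ancilla-|0> part.
    mean = sum(amp0) / n_items
    amp0 = [2 * mean - a for a in amp0]

    error = sum(amp0[x] ** 2 + amp1[x] ** 2 for x in range(n_items) if f[x] == 0)
    ancilla_one = sum(a ** 2 for a in amp1)
    return error, ancilla_one

# Illustrative database: N = 16 with 4 non-targets, so eps = 1/4.
error, ancilla_one = one_query_error([0] * 4 + [1] * 12)
print(error, ancilla_one)
```

For this instance the computed error agrees with $\epsilon^3 = 1/64$ up to floating-point rounding, and the ancilla is found in state 1 with probability $\frac{3}{4}(1 - \epsilon)$.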
Remark: Note that this algorithm actually achieves more. Suppose we measure the ancilla bit in $|\phi_f\rangle$, and find a 1. Then, we are assured that we will find a target on measuring the address registers. Furthermore, the probability of the ancilla bit being 1 is exactly $\frac{3}{4}(1 - \epsilon)$. One should compare this with the randomized one-query algorithm that with probability $1 - \epsilon$ provides a guarantee that the solution it returns is correct. The algorithm in [G05] has no such guarantee associated with its solution. However, the algorithm obtained by Tulsi, Grover and Patel [TGP05] gives a guarantee with probability $\frac{1}{2}(1 - \epsilon)$.
The general algorithm $B_0$ needed to justify Theorem 2 is similar. We use $G$ instead of $H_n$; that is, we let $H' = G \otimes I$, $A_n = G(2|0\rangle\langle 0| - I_n)G^{-1}$, and, as before,
\[ A_{n,0} = A_n \otimes |0\rangle\langle 0| + I_n \otimes |1\rangle\langle 1|. \]
The final state is obtained in the same way as before: $|\phi_f\rangle = A_{n,0}\,U_f\,H'\,|0\rangle|0\rangle$.
Remark: As stated, we require the controlled version of $G$ and $G^{-1}$ to implement $A_{n,0}$. However, we can implement $A_{n,0}$ with the uncontrolled versions themselves from the following alternative expression for $A_{n,0}$:
\[ A_{n,0} = (G \otimes I)\,[(2|0\rangle\langle 0| - I_n) \otimes |0\rangle\langle 0| + I_n \otimes |1\rangle\langle 1|]\,(G^{-1} \otimes I). \]

We can estimate the error probability of this algorithm using the following standard calculation. Suppose the probability of obtaining a non-target state on measuring the address registers in the state $G|0\rangle$ is $\epsilon$. Let us formally verify that the probability of obtaining a non-target state on measuring the address registers in the state $|\phi_f\rangle$ is $\epsilon^3$. We write
\[ G|0\rangle = \alpha|t\rangle + \beta|t'\rangle, \]
where $|t\rangle$ is a unit vector in the "target space" spanned by $\{|x\rangle : f(x) = 1\}$, and $|t'\rangle$ is a unit vector in the orthogonal complement of the target space. By scaling $|t\rangle$ and $|t'\rangle$ by suitable phase factors, we can assume that $\alpha$ and $\beta$ are real numbers. Furthermore $\beta^2 = \epsilon$. The state after the application of $U_f$ is then given by
\[ \left( \frac{\alpha}{2}|t\rangle + \beta|t'\rangle \right)|0\rangle + \sqrt{\frac{3}{4}}\,\alpha|t\rangle|1\rangle. \tag{1} \]
Now, the second term is not affected by $A_{n,0}$, so the amplitude of states in the subspace of non-target states is derived entirely from the first term, which we denote by $|u\rangle$. To analyze this contribution we write $|u\rangle$ using the basis
\[ |v\rangle = \alpha|t\rangle + \beta|t'\rangle; \tag{2} \]
\[ |v'\rangle = \beta|t\rangle - \alpha|t'\rangle. \tag{3} \]
That is,
\[ |u\rangle = \left( \frac{\alpha^2}{2} + \beta^2 \right)|v\rangle|0\rangle - \frac{\alpha\beta}{2}\,|v'\rangle|0\rangle. \]
Since $A_{n,0}|v\rangle = |v\rangle$ and $A_{n,0}|v'\rangle = -|v'\rangle$, we have
\[ A_{n,0}|u\rangle = \left( \frac{\alpha^2}{2} + \beta^2 \right)|v\rangle|0\rangle + \frac{\alpha\beta}{2}\,|v'\rangle|0\rangle. \]
Returning to the basis $|t\rangle$ and $|t'\rangle$ (using (2) and (3)), we see that the amplitude associated with $|t'\rangle$ in this state is $\beta^3$. Thus, the probability that the final measurement fails to deliver a target address is exactly $\beta^6 = \epsilon^3$.
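The change of basis in this calculation is easy to check numerically. The sketch below works only in the two-dimensional space spanned by $|t\rangle$ and $|t'\rangle$ (ancilla in $|0\rangle$), representing vectors as coordinate pairs; the function name `b0_failure_probability` is an illustrative assumption.

```python
import math

def b0_failure_probability(eps):
    """Failure probability of the general one-query algorithm when G errs with eps.

    Vectors are pairs (coefficient of |t>, coefficient of |t'>).
    """
    beta = math.sqrt(eps)
    alpha = math.sqrt(1 - eps)
    u = (alpha / 2, beta)      # ancilla-|0> part of the state after U_f, see (1)
    v = (alpha, beta)          # |v>  from (2)
    v_perp = (beta, -alpha)    # |v'> from (3)

    # Coordinates of |u> in the orthonormal basis {|v>, |v'>}.
    c_v = u[0] * v[0] + u[1] * v[1]
    c_perp = u[0] * v_perp[0] + u[1] * v_perp[1]

    # A_{n,0} fixes |v> and negates |v'> (on the ancilla-|0> part).
    out = tuple(c_v * v[k] - c_perp * v_perp[k] for k in range(2))

    # Non-target amplitude is the |t'> coordinate; it should equal beta**3.
    return out[1] ** 2

print(b0_failure_probability(0.3))
```

The returned value matches $\epsilon^3$ up to floating-point rounding for any $\epsilon \in [0, 1]$ tried, in line with the derivation.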
Remark: The algorithm $B_0$ can be used recursively to get a t-query algorithm that achieves the bound of Theorem 3. Just as in the one-query algorithm, by measuring the ancilla bits we can obtain a guarantee; this time the solution is accompanied by a guarantee with probability at least $1 - \frac{1}{t} - \frac{6 \log t}{t\,(\log \frac{1}{\epsilon})^{\log_3 4}}$. The t-query algorithm obtained by Tulsi, Grover and Patel [TGP05] has significantly better guarantees: it certifies that its answer is correct with probability at least $1 - \epsilon^{2t}$.
3.2 Algorithms with Restrictions on $\epsilon$
As stated above, for each $\delta \in [0, 3/4]$, there is a one-query quantum algorithm $A_\delta$ that makes no error if $|f^{-1}(0)| = \delta N$ (or, in the general setting, if $G$ is known to err with probability exactly $\delta$). Let us explicitly obtain such an algorithm $A_\delta$ by slightly modifying the algorithm above. The idea is to ensure that the inversion about the average performed in Step 3 reduces the amplitude of the non-target states to zero. For this, we only need to replace $U_f$ by $U_{f,\delta}$, which maps $|x\rangle|0\rangle$ to $|x\rangle|0\rangle$ if $f(x) = 0$ and to $|x\rangle(\alpha|0\rangle + \beta|1\rangle)$ if $f(x) = 1$, where
\[ \alpha = \frac{1 - 2\delta}{2(1 - \delta)} \quad \text{and} \quad \beta = \sqrt{1 - \alpha^2}. \]
Also, one can modify the implementation of $U_f$ above, replacing $\pi/12$ by $\sin^{-1}(\alpha)/2$ (note that $\delta \le 3/4$ implies that $|\alpha| \le 1$), and implement $U_{f,\delta}$ using just one query to $T_f$.
Proposition 1. Let $|f^{-1}(0)| = \delta N$ with $\delta \le 3/4$. Then, $\mathrm{err}_{A_\delta}(f) = 0$.
An analogous modification for the general search gives us an algorithm $B_\delta(T, G)$ that has no error when $G$ produces a target state for $T$ with probability exactly $1 - \delta$. We next observe that the algorithms $A_\delta$ and $B_\delta$ perform well not only when the original error probability is known to be $\delta$ but also if the original error probability is $\epsilon \ge \delta$. This justifies Theorem 4 claimed above.
Proof of Theorem 4: We will only sketch the calculations for part (a). The average amplitude of all the states of the form $|x\rangle|0\rangle$ is $\frac{1}{\sqrt{N}} \cdot \frac{1 - 2\delta + \epsilon}{2(1 - \delta)}$. From this it follows that the amplitude of a non-target state after the inversion about the average is $\frac{1}{\sqrt{N}} \cdot \frac{\epsilon - \delta}{1 - \delta}$. Our claim follows from this by observing that there are exactly $\epsilon N$ non-target states.
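The amplitude bookkeeping in this sketch of the proof can be reproduced exactly with rational arithmetic. The following is a minimal check, assuming amplitudes are measured in units of $1/\sqrt{N}$ so that $N$ drops out; the function name `error_of_a_delta` is hypothetical.

```python
from fractions import Fraction

def error_of_a_delta(delta, eps):
    """Exact error of A_delta on a database with |f^-1(0)| = eps * N.

    Amplitudes are tracked in units of 1/sqrt(N).
    """
    delta, eps = Fraction(delta), Fraction(eps)
    alpha = (1 - 2 * delta) / (2 * (1 - delta))  # |0>-amplitude left on targets
    # Average over all |x>|0> states: targets carry alpha, non-targets carry 1.
    average = (1 - eps) * alpha + eps
    non_target = 2 * average - 1                 # inversion about the average
    return eps * non_target ** 2                 # eps*N states, each non_target^2 / N

delta, eps = Fraction(1, 4), Fraction(1, 2)
assert error_of_a_delta(delta, eps) == eps * ((eps - delta) / (1 - delta)) ** 2
assert error_of_a_delta(delta, delta) == 0          # no error when eps = delta
assert error_of_a_delta(0, eps) == eps ** 3         # delta = 0 recovers Theorem 1
print(error_of_a_delta(delta, eps))
```

The three assertions mirror the statement of Theorem 4(a), Proposition 1 and Theorem 1 respectively for the chosen sample values.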

4 Lower Bounds
In this section, we show that the algorithms in the previous section are essentially optimal. For the rest of this section, we fix a t-query quantum search algorithm $A$ to search a database of size $N$. Using the polynomial method we will show that no such algorithm can have error probability significantly less than $\epsilon^{2t+1}$, for a large range of $\epsilon$.
The proof has two parts. First, using standard arguments we observe that $\mathrm{err}_A(\epsilon)$ is bounded below by a polynomial of degree at most $2t + 1$ in $\epsilon$.
Lemma 1. Let $A$ be a t-query quantum search algorithm for databases of size $N$. Then, there is a univariate polynomial $r(Z)$ with real coefficients and degree at most $2t + 1$, such that for all $\epsilon$,
\[ \mathrm{err}_A(\epsilon) \ge r(\epsilon). \]
Furthermore, $r(x) \ge 0$ for all $x \in [0, 1]$.
In the second part, we analyze such low degree polynomials to obtain our lower bounds. We present this analysis first, and return to the proof of Lemma 1 after that.
4.1 Analysis of Low Degree Polynomials
Definition 2 (Error polynomial). We say that a univariate polynomial r(Z) is an error
polynomial if (a) r(z) ≥ 0 for all z ∈ [0, 1], (b) r(0) = 0, and (c) r(1) = 1.
Our goal is to show that an error polynomial of degree at most $2t + 1$ cannot evaluate to significantly less than $\epsilon^{2t+1}$ for many values of $\epsilon$. For our calculations, it will be convenient to ensure that all the roots of such a polynomial are in the interval $[0, 1)$.

Lemma 2. Let $r(Z)$ be an error polynomial of degree $2t + 1$ with $k < 2t + 1$ roots in the interval $[0, 1)$. Then, there is another error polynomial $q(Z)$ of degree at most $2t + 1$ such that $q(z) \le r(z)$ for all $z \in [0, 1]$, and $q(Z)$ has at least $k + 1$ roots in the interval $[0, 1)$.
Proof. Let $\alpha_1, \alpha_2, \ldots, \alpha_k$ be the roots of $r(Z)$ in the interval $[0, 1)$. Hence we can write
\[ r(Z) = \prod_{i=1}^{k} (Z - \alpha_i)\; r'(Z), \]
where $r'(Z)$ does not have any roots in $[0, 1)$. Now, by substituting $Z = 1$, we conclude that $r'(1) \ge 1$. Since $r'(Z)$ does not have any roots in $[0, 1)$, it follows that $r'(z) > 0$ for all $z \in [0, 1)$.
The idea now is to subtract a suitable multiple of the polynomial $1 - Z$ from $r'(Z)$ and obtain another polynomial $r''(Z)$ which has a root in $[0, 1)$. Since $1 - Z$ is positive in $[0, 1)$, $r''(Z)$ is at most $r'(Z)$ in this interval. The polynomial $q(Z)$ will be defined by $q(Z) = \prod_{i=1}^{k} (Z - \alpha_i)\; r''(Z)$. To determine the multiple of $1 - Z$ we need to subtract, consider $\lambda(c) = \min_{z \in [0,1)} \bigl[ r'(z) - c(1 - z) \bigr]$. Since $\lambda(c)$ is continuous, $\lambda(0) > 0$ and $\lambda(c) < 0$ for large enough $c$, it follows that $\lambda(c_0) = 0$ for some $c_0 > 0$. Now, let
\[ r''(Z) = r'(Z) - c_0(1 - Z). \]

By repeatedly applying Lemma 2 we obtain the following.
Lemma 3. Let r(Z) be an error polynomial of degree at most 2t + 1. Then, there is an
error polynomial q(Z) of degree exactly 2t + 1 such that q(z) ≤ r(z) for all z ∈ [0, 1],
and q(Z) has 2t + 1 roots in the interval [0, 1).
We can now present the proof of Theorem 5, our main lower bound result.
Proof of Theorem 5: Consider the case $t = 1$. By Lemma 1, it is enough to show that an error polynomial $r(Z)$ of degree at most three is bounded below as claimed. By Lemma 3, we may assume that all three roots of $r(Z)$ lie in $[0, 1)$. Since $r(0) = 0$ and $r(z) \ge 0$ in $[0, 1)$, we may write $r(Z) = aZ(Z - \alpha)^2$ for some $\alpha \in [0, 1)$ and some positive $a$; since $r(1) = 1$, we conclude that $a = \frac{1}{(1 - \alpha)^2}$. Thus, we need to determine the value of $\alpha$ so that $t(\alpha) = \max_{x \in \{\ell, u\}} \frac{r(x)}{x^3}$ is as small as possible. Consider the function
\[ t_x(\alpha) = \frac{r(x)}{x^3} = \left( \frac{x - \alpha}{(1 - \alpha)x} \right)^2. \]
Note that for all $x$, $t_x(\alpha)$ is monotonically increasing in $|x - \alpha|$. It follows that $t(\alpha)$ is minimum for some $\alpha \in [\ell, u]$. For $\alpha$ in this interval $t_\ell(\alpha)$ is an increasing function of $\alpha$ and $t_u(\alpha)$ is a decreasing function of $\alpha$. So $t(\alpha)$ is minimum when $t_\ell(\alpha) = t_u(\alpha)$. It can be checked by direct computation that when $\alpha = \frac{2\ell u}{\ell + u}$,
\[ t_\ell(\alpha) = t_u(\alpha) = \left( \frac{u - \ell}{u + \ell - 2\ell u} \right)^2. \]
This establishes part (a) of Theorem 5.
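The "direct computation" can be confirmed with exact rational arithmetic. The sketch below uses sample values $\ell = 1/8$ and $u = 1/2$ (chosen only for illustration) and hypothetical helper names; it checks that $\alpha = 2\ell u/(\ell + u)$ equalizes $t_\ell$ and $t_u$ at the value stated in part (a), and that no other root position on a coarse grid does better.

```python
from fractions import Fraction

def t_x(x, alpha):
    """r(x) / x**3 for r(Z) = Z (Z - alpha)^2 / (1 - alpha)^2."""
    return ((x - alpha) / ((1 - alpha) * x)) ** 2

def balanced_root(low, high):
    """The root position alpha = 2*l*u / (l + u) used in the proof."""
    return 2 * low * high / (low + high)

def one_query_bound(low, high):
    """The factor ((u - l) / (u + l - 2*l*u))^2 from Theorem 5(a)."""
    return ((high - low) / (high + low - 2 * low * high)) ** 2

low, high = Fraction(1, 8), Fraction(1, 2)
alpha = balanced_root(low, high)

# At the balanced root both ratios coincide with the claimed bound.
assert t_x(low, alpha) == t_x(high, alpha) == one_query_bound(low, high)

# Any other root position in [0, 1) makes the larger of the two ratios no smaller.
grid = [Fraction(k, 100) for k in range(100)]
assert all(max(t_x(low, a), t_x(high, a)) >= one_query_bound(low, high) for a in grid)
print(alpha, one_query_bound(low, high))
```

The grid check is only a sanity test of the monotonicity argument, not a substitute for it.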
To establish part (b), we show that an error polynomial of degree at most $2t + 1$ satisfies the claim. As before, by Lemma 3, we may assume that $r(Z)$ has all its roots in $[0, 1)$. Furthermore, since $r(Z) \ge 0$, we conclude that all roots in $(0, 1)$ have even multiplicity. Thus we may write
\[ r(Z) = \frac{Z(Z - \alpha_1)^2 (Z - \alpha_2)^2 \cdots (Z - \alpha_t)^2}{(1 - \alpha_1)^2 (1 - \alpha_2)^2 \cdots (1 - \alpha_t)^2}. \]
Now, let $b = (u/\ell)^{1/(t+1)}$. Consider subintervals $\{(\ell b^j, \ell b^{j+1}] : j = 0, 1, \ldots, t\}$. One of these intervals, say $(\ell b^{j_0}, \ell b^{j_0+1}]$, has no roots at all. Let $\epsilon$ be the midpoint of the interval, that is, $\epsilon = (\ell b^{j_0} + \ell b^{j_0+1})/2$. Then, we have $(\epsilon - \alpha_j)^2 \ge \left( \frac{\ell b^{j_0+1} - \ell b^{j_0}}{2} \right)^2$, and since $(1 - \alpha_j)^2 \le 1$, we have
\[ \frac{r(\epsilon)}{\epsilon^{2t+1}} \ge \left( \frac{b - 1}{b + 1} \right)^{2t}. \]
This establishes part (b). The term $-\frac{1}{\ell N(b+1)}$ appears in the statement of Theorem 5 because we need to ensure that $\epsilon N$ is an integer.

4.2 Proof of Lemma 1
We will use the following notation. Let $p(X_1, X_2, \ldots, X_N)$ be a polynomial in N
variables $X_1, X_2, \ldots, X_N$ with real coefficients. For a database $f : [N] \to \{0, 1\}$,
let
\[ p(f) \triangleq p(f(1), f(2), \ldots, f(N)). \]
Also, in the following X denotes the sequence of variables $X_1, X_2, \ldots, X_N$.
The key fact we need is the following.
Theorem 6 ([BB+95]). Let A be a t-query quantum database search algorithm. Then,
for i = 1, 2, . . ., N, there is a multilinear polynomial $p_i(X)$ of degree at most 2t, such
that for all f,
\[ \Pr[A(f) = i] = p_i(f). \]
Furthermore, $p_i(x) \ge 0$ for all $x \in [0, 1]^N$.
Lemma 4. Let A be a t-query quantum database search algorithm. Then, there is a
multilinear polynomial $p_A(X)$ of degree at most 2t + 1 such that for all f,
\[ \mathrm{err}_A(f) = p_A(f). \]
Proof. Using the polynomials $p_i(X)$ from Theorem 6, define
\[ p_A(X) = \sum_{i=1}^{N} (1 - X_i)\, p_i(X). \]
Clearly,
\[ p_A(f) = \sum_{i=1}^{N} (1 - f(i))\, p_i(f) = \sum_{i \in f^{-1}(0)} \Pr[A(f) = i] = \mathrm{err}_A(f). \]

We can now prove Lemma 1. For a permutation σ of [N] and $f : [N] \to \{0, 1\}$, let
σf be the function defined by σf(i) = f(σ(i)).
Note that $|f^{-1}(0)| = |(\sigma f)^{-1}(0)|$. Now,
\[ \frac{1}{N!} \sum_{\sigma} p_A(\sigma f) = \mathop{\mathrm{E}}_{\sigma}[\mathrm{err}_A(\sigma f)] \le \max_{\sigma} \mathrm{err}_A(\sigma f) \le \mathrm{err}_A(\epsilon), \tag{4} \]
where $|f^{-1}(0)| = \epsilon N$.
Let σX be the sequence $X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(N)}$, and let
\[ p_A^{\mathrm{sym}}(X) = \frac{1}{N!} \sum_{\sigma} p_A(\sigma X). \]
Then, by (4), we have
\[ p_A^{\mathrm{sym}}(f) = \frac{1}{N!} \sum_{\sigma} p_A(\sigma f) \le \mathrm{err}_A(\epsilon). \]
Now, $p_A^{\mathrm{sym}}(X)$ is a symmetric multilinear polynomial in N variables of degree at
most 2t + 1. For any such polynomial, there is a univariate polynomial q(Z) of degree
at most 2t + 1 such that if we let $\hat{p}(X) = q\bigl( (\sum_{i=1}^{N} X_i)/N \bigr)$, then for all f,
\[ \hat{p}(f) = p_A^{\mathrm{sym}}(f) \le \mathrm{err}_A(\epsilon). \]
(See Minsky and Papert [MP].) Now, $\hat{p}(f) = q((f(1) + f(2) + \cdots + f(N))/N) = q(1 - \epsilon)$.
To complete the proof, we take r(Z) = q(1 − Z).
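The symmetrization step can be illustrated on a toy example. The sketch below (an illustration under assumed inputs, not the construction from [MP]) averages a small multilinear polynomial over all permutations of its variables and checks that the result, restricted to 0/1 inputs, depends only on the number of ones. The polynomial `p` and the helper `symmetrize` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial

def symmetrize(p, n):
    """Return the average of p over all permutations of its n inputs."""
    perms = list(permutations(range(n)))

    def p_sym(x):
        total = sum(p(tuple(x[s[i]] for i in range(n))) for s in perms)
        return total / factorial(n)

    return p_sym

# Hypothetical multilinear polynomial in three variables (exact arithmetic).
def p(x):
    return x[0] * x[1] + Fraction(1, 2) * x[2]

n = 3
p_sym = symmetrize(p, n)

# Group the values of p_sym on {0,1}^n by Hamming weight.
by_weight = {}
for x in product((0, 1), repeat=n):
    by_weight.setdefault(sum(x), set()).add(p_sym(x))
print(by_weight)
```

Each Hamming weight maps to a single value, which is what allows the symmetric polynomial to be replaced by a univariate polynomial in the normalized weight.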

Acknowledgements
We thank the referees for their helpful comments.
References
[BB+95] R. Beals, H. Buhrman, R. Cleve, M. Mosca, R. de Wolf. Quantum lower bounds by
polynomials, FOCS (1998): 352–361. quant-ph/9802049.
[BC+99] H. Buhrman, R. Cleve, R. de Wolf, C. Zalka. Bounds for Small-Error and Zero-Error
Quantum Algorithms, FOCS (1999): 358–368.
[BH+02] G. Brassard, P. Hoyer, M. Mosca, A. Tapp: Quantum amplitude ampliﬁcation and
estimation. In S.J. Lomonaco and H.E. Brandt, editors, Quantum Computation and
Information, AMS Contemporary mathematics Series (305), pp. 53–74, 2002. quant-
ph/0005055.
[G96] L.K. Grover: A fast quantum mechanical algorithm for database search. STOC (1996):
212-219. quant-ph/9605043.
[G97] L.K. Grover: Quantum Mechanics helps in searching for a needle in a haystack. Phys-
ical Review Letters 79(2), pp. 325-328, July 14, 1997. quant-ph/9706033.
[G98a] L.K. Grover: A framework for fast quantum mechanical algorithms, Proc. 30th ACM
Symposium on Theory of Computing (STOC), 1998, 53–63.
[G98b] L.K. Grover. Quantum computers can search rapidly by using almost any transforma-
tion, Phy. Rev. Letters, 80(19), 1998, 4329–4332. quant-ph/9711043.
[G05] L.K. Grover. A different kind of quantum search. March 2005. quant-ph/0503205.
[MP] M.L. Minsky and S.A. Papert. Perceptrons: An Introduction to Computational Geom-
etry. MIT Press (1968).
[NC] M.A. Nielsen and I.L. Chuang: Quantum Computation and Quantum Information.
Cambridge University Press (2000).
[TGP05] T. Tulsi, L.K. Grover, A. Patel. A new algorithm for directed quantum search. May
2005. quant-ph/0505007.

Sampling Bounds for Stochastic Optimization
Moses Charikar¹, Chandra Chekuri², and Martin Pál²
¹ Computer Science Dept., Princeton University, Princeton, NJ 08544
moses@cs.princeton.edu
² Lucent Bell Labs, 600 Mountain Avenue, Murray Hill, NJ 07974
chekuri@research.bell-labs.com, mpal@acm.org
Abstract. A large class of stochastic optimization problems can be
modeled as minimizing an objective function f that depends on a choice
of a vector x ∈ X, as well as on a random external parameter ω ∈ Ω
given by a probability distribution π. The value of the objective function
is a random variable and often the goal is to find an x ∈ X to minimize
the expected cost Eω[fω(x)]. Each ω is referred to as a scenario. We con-
sider the case when Ω is large or infinite and we are allowed to sample
from π in a black-box fashion. A common method, known as the SAA
method (sample average approximation), is to pick sufficiently many in-
dependent samples from π and use them to approximate π and corre-
spondingly Eω[fω(x)]. This is one of several scenario reduction methods
used in practice.
There has been substantial recent interest in two-stage stochastic ver-
sions of combinatorial optimization problems which can be modeled by
the framework described above. In particular, we are interested in the
model where a parameter λ bounds the relative factor by which costs
increase if decisions are delayed to the second stage. Although the SAA
method has been widely analyzed, the known bounds on the number of
samples required for a (1 + ε) approximation depend on the variance of
π even when λ is assumed to be a fixed constant. Shmoys and Swamy
[13, 14] proved that a polynomial number of samples suffice when f can
be modeled as a linear or convex program. They used modifications to
the ellipsoid method to prove this.
In this paper we give a different proof, based on earlier methods of Kley-
wegt, Shapiro, Homem-De-Mello [6] and others, that a polynomial num-
ber of samples suffice for the SAA method. Our proof is not based on
computational properties of f and hence also applies to integer pro-
grams. We further show that small variations of the SAA method suffice
to obtain a bound on the sample size even when we have only an ap-
proximation algorithm to solve the sampled problem. We are thus able
to extend a number of algorithms designed for the case when π is given
explicitly to the case when π is given as a black-box sampling oracle.

Supported by NSF ITR grant CCR-0205594, DOE Early Career Principal Investi-
gator award DE-FG02-02ER25540, NSF CAREER award CCR-0237113, an Alfred
P. Sloan Fellowship and a Howard B. Wentz Jr. Junior Faculty Award.

Supported in part by an ONR basic research grant MA14681000 to Lucent Bell
Labs.

Work done while at DIMACS, supported by NSF grant EIA 02-05116.

1 Introduction
Uncertainty in data is a common feature in a number of real world problems.
Stochastic optimization models uncertain data using probability distributions.
In this paper we consider problems that are modeled by the two-stage stochastic
minimization program
\[ \min_{x \in X} f(x) = c(x) + \mathrm{E}_{\omega}[q(x, \omega)]. \tag{1} \]
An important context in which the problem (1) arises is two-stage stochastic
optimization with recourse. In this model, a first-stage decision x ∈ X has to be
made while having only probabilistic information about the future, represented
by the probability distribution π on Ω. Then, after a particular future scenario
ω ∈ Ω is realized, a recourse action r ∈ R may be taken to ensure that the
requirements of the scenario ω are satisfied. In the two-stage model, c(x) denotes
the cost of taking the first-stage action x. The cost of the second stage in a
particular scenario ω, given a first-stage action x, is usually given as the optimum
of the second-stage minimization problem
\[ q(x, \omega) = \min_{r \in R} \{ \mathrm{cost}_{\omega}(x, r) \mid (x, r) \text{ is a feasible solution for scenario } \omega \}. \]
We give an example to illustrate some of the concepts. Consider the following
facility location problem. We are given a finite metric space in the form of a
graph that represents the distances in some underlying transportation network.
A company wants to build service centers at a number of locations to best serve
the demand for its goods. The objective is to minimize the cost of building
the service centers subject to the constraint that each demand point is within
a distance B from its nearest center. However, at the time that the company
plans to build the service centers, there could be uncertainty in the demand
locations. One way to deal with this uncertainty is to make decisions in two or
more stages. In the first stage certain service centers are built, and in the second
stage, when there is a clearer picture of the demand, additional service centers
might be built, and so on. How should the company minimize the overall cost
of building service centers? Clearly, if building centers in the second stage is no
more expensive than building them in the first stage, then the company will
build all its centers in the second stage. However, very often companies cannot
wait to make decisions. It takes time to build centers and there are other costs
such as inflation which make it advantageous to build some first stage centers.
We can assume that building a center in the second stage costs at most some
λ ≥ 1 times more than building it in the first stage. This tradeoff is captured
by (1) as follows. X is the set of all n-dimensional binary vectors where n is the
number of potential locations for the service centers. A binary vector x indicates
which centers are built in the first stage and c(x) is the cost of building them.
The uncertainty in demand locations is modeled by the probability distribution
on Ω. A scenario ω ∈ Ω is characterized by the set of demand locations in ω. Thus,
given ω and the first stage decision x, the recourse action is to build additional
centers so that all demands in ω are within a distance B of some center. The cost
of building these additional centers is given by q(x, ω). Note that q(x, ω) is itself
an optimization problem very closely related to the original k-center problem.
How does one solve problems modeled by (1)? One key issue is how the prob-
ability distribution π is specified. In some cases the number of scenarios in Ω is
small and π is explicitly known. In such cases the problem can be solved as a
deterministic problem using whatever mathematical programming method (lin-
ear, integer, non-linear) is applicable for minimizing c and q. There are however
situations in which Ω is too large or infinite and it is infeasible to solve the prob-
lem by explicitly listing all the scenarios. In such cases, a natural approach is to
take some number, N, of independent samples ω1, . . . , ωN from the distribution
π, and approximate the function f by the sample average function
\[ \hat{f}(x) = c(x) + \frac{1}{N} \sum_{i=1}^{N} q(x, \omega_i). \tag{2} \]
If the number of samples N is not too large, finding an $\hat{x}$ that minimizes $\hat{f}(\hat{x})$
may be easier than the task of minimizing f. One might then hope that for a
suitably chosen sample size N, a good solution $\hat{x}$ to the sample average problem
would be a good solution to the problem (1). This approach is called the sample
average approximation (SAA) method. The SAA method is an example of a
scenario reduction technique, in that it replaces a complex distribution π over a
large (or even infinite) number of scenarios by a simpler, empirical distribution
π′ over N observed scenarios. Since the function (2) is now deterministic, we can
use tools and algorithms from deterministic optimization to attempt to find its
exact or approximate optimum.
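A minimal sketch of the SAA method for a finite decision set is given below. It is meant only to make the sample average function (2) concrete; the toy instance (advance purchase against random demand), the function names, and all numbers are assumptions introduced for illustration, and brute-force search stands in for whatever deterministic solver is available.

```python
import random

def saa_minimize(X, c, q, sample_scenario, N, rng):
    """Minimize f_hat(x) = c(x) + (1/N) * sum_i q(x, w_i) over a finite set X."""
    scenarios = [sample_scenario(rng) for _ in range(N)]

    def f_hat(x):
        return c(x) + sum(q(x, w) for w in scenarios) / N

    return min(X, key=f_hat), f_hat

# Toy instance (illustrative): x units are bought in advance at unit price 1;
# a scenario is a demand d, and any shortfall is covered in the second stage
# at the inflated price LAM per unit.
LAM = 3.0

def c(x):
    return float(x)

def q(x, d):
    return LAM * max(d - x, 0)

def sample_demand(rng):
    return rng.choice([0, 1, 2, 5])

x_hat, f_hat = saa_minimize(range(6), c, q, sample_demand, N=2000, rng=random.Random(0))
print(x_hat, f_hat(x_hat))
```

In this toy instance the true objective is minimized at x = 2 with value 4.25, so the sampled minimizer can be compared against a known answer.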
The SAA method is well known and falls under the broader area of Monte
Carlo sampling. It is used in practice and has been extensively studied and ana-
lyzed in the stochastic programming literature. See [9, 10] for numerous pointers.
In a number of settings, in particular for convex and integer programs, it is known
that the SAA method converges to the true optimum as N → ∞. The number
of samples required to obtain an additive ε approximation with a probability
(1 − δ) has been analyzed [6]; it is known to be polynomial in the dimension of
X, 1/ε, log 1/δ and the quantity V = maxx∈X V (x) where V (x) is the variance
of the random variable q(x, ω). This factor V need not be polynomial in the
input size even when π is given explicitly.
The two-stage model with recourse has gained recent interest in the theo-
retical computer science community following the work of Immorlica et al. [5]
and Ravi and Sinha [11]. Several subsequent works have explored this topic
[1–4, 7, 13, 14]. The emphasis in these papers is primarily on combinatorial op-
timization problems such as shortest paths, spanning trees, Steiner trees, set
cover, facility location, and so on. Most of these problems are NP-hard even
when the underlying single stage problem is polynomial time solvable, for ex-
ample the spanning tree problem. Thus the focus has been on approximation
algorithms and in particular on relative error guarantees. Further, for technical
and pragmatic reasons, an additional parameter, the inflation factor has been

introduced. Roughly speaking, the inflation factor, denoted by λ, upper bounds
the relative factor by which the second stage decisions are more expensive when
compared to the first stage. It is reasonable to expect that λ will be a small
constant, say under 10, in many practical situations.
In this new model, for a large and interesting class of problems modeled
by linear programs, Shmoys and Swamy [13] showed that a relative (1 + ε)
approximation can be obtained with probability at least (1 − δ) using a number
of samples that is polynomial in the input size, λ, log 1/δ and 1/ε. They also
established that a polynomial dependence on λ is necessary. Thus the dependence
on V is eliminated. Their first result [13] does not establish the guarantee for the
SAA method but in subsequent work [14], they established similar bounds for the
SAA method. Their proof is based on the ellipsoid method where the samples are
used to compute approximate sub-gradients for the separation oracle. We note
two important differences between these results when compared to earlier results
of [6]. The first is that the new bounds obtained in [13, 14] guarantee that the
optimum solution $\bar{x}$ to the sampled problem $\hat{f}$ satisfies the property that
$f(\bar{x}) \le (1 + \varepsilon) f(x^*)$ with sufficiently high probability, where $x^*$ is an optimum
solution to f. However, they do not guarantee that the value $f(x^*)$ can be estimated
to within a (1 + ε) factor. It can be shown that estimating the expected value
Eω[q(x, ω)] for any x requires the sample size to depend on V . Second, the
new bounds, by relying on the ellipsoid method, limit the applicability to when
X is a continuous space while the earlier methods applied even when X is a
discrete set and hence could capture integer programming problems. In fact, in
[6], the SAA method is analyzed for the discrete case, and the continuous case
is analyzed by discretizing the space X using a fine grid. The discrete case is of
particular interest in approximation algorithms for combinatorial optimization.
In this context we mention that the boosted sampling algorithm of [3] uses
O(λ) samples to obtain approximation algorithms for a class of network design
problems. Several recent results [4, 5, 7] have obtained algorithms that have
provably good approximation ratios when π is given explicitly. An important
and useful question is whether these results can be carried over to the case when
π can only be sampled from. We answer this question in the positive. Details of
our results follow.
1.1 Results
In this paper we show that the results of Shmoys and Swamy [13, 14] can also
be derived using a modification to the basic analysis framework in the methods
of [6, 10]: this yields a simple proof relying only on Chernoff bounds. Similar
to earlier work [6, 10], the proof shows that the SAA method works because
of statistical properties of X and f, and not on computational properties of
optimizing over X. This allows us to prove a result for the discrete case and
hence we obtain bounds for integer programs. The sample size that we guarantee
is strongly polynomial in the input size: in the discrete setting it depends on
log |X| and in the continuous setting it depends on the dimension of the space
containing X. We also extend the ideas to approximation algorithms. In this

case, the plain SAA method achieves a guarantee of (1 + ε) times the guarantee of
the underlying approximation algorithm only with O(ε) probability. To obtain a
probability of 1 − δ for any given δ, we analyze two minor variants of SAA. The
first variant repeats SAA $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$ times independently and picks the solution
with the smallest sample value. The second variant rejects a small fraction of
high cost samples before running the approximation algorithm on the samples.
This latter result was also obtained independently by Ravi and Singh [12].
2 Preliminaries
In the following, we consider the stochastic optimization problem in its general
form (1). We consider stochastic two stage problems that satisfy the following
properties.
(A1) Non-negativity. The functions c(x) and q(x, ω) are non-negative for every
first stage action x and every scenario ω.
(A2) Empty First Stage. We assume that there is an empty first-stage action,
0 ∈ X. The empty action incurs no first-stage cost, i.e. c(0) = 0, but is least
helpful in the second stage. That is, for every x ∈ X and every scenario ω,
q(x, ω) ≤ q(0, ω).
(A3) Bounded Inflation Factor. The inflation factor λ determines the relative
cost of information. It compares the difference in cost of the “wait and see”
solution q(0, ω) and the cost of the best solution with hindsight, q(x, ω),
relative to the first stage cost of x. Formally, the inflation factor λ ≥ 1 is
the least number such that for every scenario ω ∈ Ω and every x ∈ X, we
have
q(0, ω) − q(x, ω) ≤ λc(x). (3)
We note that the above assumptions capture both the discrete and continuous
case problems considered in recent work [1, 3, 5, 11, 13, 14].
We work with exact and approximate minimizers of the sampled problem.
We make the notion precise.
Definition 1. An $x^* \in X$ is said to be an exact minimizer of the function f(·)
if for all x ∈ X it holds that $f(x^*) \le f(x)$. An $\bar{x} \in X$ is an α-approximate
minimizer of the function f(·), if for all x ∈ X it holds that $f(\bar{x}) \le \alpha f(x)$.
The main tool we will be using is the Chernoff bound. We will be using the
following version of the bound (see e.g. [8, page 98]).
Lemma 1 (Chernoff bound). Let $X_1, \ldots, X_N$ be independent random variables
with $X_i \in [0, 1]$ and let $X = \sum_{i=1}^{N} X_i$. Then, for any ε ≥ 0, we have
\[ \Pr\bigl[\, |X - \mathrm{E}[X]| > \varepsilon N \,\bigr] \le 2 \exp(-\varepsilon^2 N). \]
Throughout the rest of the paper when we refer to an event happening with
probability β, it is with respect to the randomness in sampling from the distri-
bution π over Ω.

3 Discrete Case
We start by discussing the case when the first stage decision x ranges over a
finite set of choices X. In the following, let $x^*$ denote an optimal solution to the
true problem (1), and $Z^*$ its value $f(x^*)$. In the following we assume that ε is
small, say ε < 0.1.
Theorem 1. Any exact minimizer $\bar{x}$ of the function $\hat{f}(\cdot)$ constructed with
$\Theta(\lambda^2 \frac{1}{\varepsilon^4} \log |X| \log \frac{1}{\delta})$ samples is, with probability 1 − 2δ, a (1 + O(ε))-approximate
minimizer of the function f(·).
In a natural attempt to prove Theorem 1, one might want to show that if N
is large enough, the function $\hat{f}$ will be close to f, in that with high probability
$|f(x) - \hat{f}(x)| \le \varepsilon f(x)$. Unfortunately this may not be the case, as for any
particular x, the random variable q(x, ω) may have very high variance. However,
intuitively, the high variance of q(x, ω) can only be caused by a few “disaster”
scenarios of very high cost but low probability, whose cost is not very sensitive
to the particular choice of the first stage action x. Hence these high cost sce-
narios do not affect the choice of the optimum x̄ significantly. We formalize this
intuition below.
For the purposes of the analysis, we divide the scenarios into two classes. We
call a scenario ω high, if its second stage “wait and see” cost q(0, ω) exceeds a
threshold M, and low otherwise. We set the threshold M to be $\lambda Z^*/\varepsilon$.
We approximate the function f by taking N independent samples $\omega_1, \ldots, \omega_N$.
We define the following two functions to account for the contributions of low and
high scenarios respectively.
\[ \hat{f}_l(x) = \frac{1}{N} \sum_{i : \omega_i \text{ low}} q(x, \omega_i) \quad \text{and} \quad \hat{f}_h(x) = \frac{1}{N} \sum_{i : \omega_i \text{ high}} q(x, \omega_i). \]
Note that $\hat{f}(x) = c(x) + \hat{f}_l(x) + \hat{f}_h(x)$. We make a similar definition for the
function f(·). Let $p = \Pr_{\omega}[\omega \text{ is a high scenario}]$, and let
\[ f_l(x) = \mathrm{E}[q(x, \omega) \mid \omega \text{ is low}] \cdot (1 - p) \quad \text{and} \quad f_h(x) = \mathrm{E}[q(x, \omega) \mid \omega \text{ is high}] \cdot p \]
so that $f(x) = c(x) + f_l(x) + f_h(x)$.
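The split of the sample average into low and high contributions can be sketched as follows. This is an illustration only: in the analysis the threshold M depends on the unknown optimum value, so here `M` is simply passed in as an assumed parameter, and the recourse cost `q` and the demand list are hypothetical.

```python
def split_estimates(x, zero, q, scenarios, M):
    """Return the low and high parts of the sample average of q(x, .).

    A scenario w counts as high when its wait-and-see cost q(zero, w) exceeds M.
    Both parts are normalised by the total number of samples N.
    """
    N = len(scenarios)
    low = sum(q(x, w) for w in scenarios if q(zero, w) <= M) / N
    high = sum(q(x, w) for w in scenarios if q(zero, w) > M) / N
    return low, high

# Illustrative recourse cost: cover a shortfall of demand d at price 3 per unit.
def q(x, d):
    return 3.0 * max(d - x, 0)

demands = [0, 1, 2, 5, 40]          # one rare "disaster" scenario
low, high = split_estimates(x=2, zero=0, q=q, scenarios=demands, M=20.0)
print(low, high)
```

In this toy run the single expensive scenario dominates the high part, while the low part stays bounded by M, which is the property the Chernoff argument relies on.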
We need the following bound on p.
Lemma 2. The probability mass p of high scenarios is at most $\frac{\varepsilon}{(1-\varepsilon)\lambda}$.
Proof. Recall that $x^*$ is a minimizer of f, and hence $Z^* = f(x^*)$. We have
\[ Z^* \ge f_h(x^*) = p \cdot \mathrm{E}[q(x^*, \omega) \mid \omega \text{ is high}] \ge p \cdot [M - \lambda c(x^*)], \]
where the inequality follows from the fact that for a high scenario ω, q(0, ω) ≥ M
and by Axiom (A3), q(x, ω) ≥ q(0, ω) − λc(x). Substituting $M = \lambda Z^*/\varepsilon$, and
using that $c(x^*) \le Z^*$ we obtain
\[ Z^* \ge Z^* \left( \frac{\lambda}{\varepsilon} - \lambda \right) p. \]
Solving for p proves the claim.
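To make the last step explicit, here is the rearrangement written out (a short worked step, assuming $Z^* > 0$ and $0 < \varepsilon < 1$):
\[ Z^* \ge Z^* \Bigl( \frac{\lambda}{\varepsilon} - \lambda \Bigr) p = Z^* \, \frac{\lambda (1 - \varepsilon)}{\varepsilon} \, p \quad \Longrightarrow \quad p \le \frac{\varepsilon}{(1 - \varepsilon) \lambda}. \]
For instance, with the illustrative values λ = 2 and ε = 0.1, high scenarios carry probability mass at most 0.1/(0.9 · 2) ≈ 0.056.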

To prove Theorem 1, we show that each of the following properties holds with
probability at least 1 − δ (in fact, the last property holds with probability 1).
(P1) For every x ∈ X it holds that $|f_l(x) - \hat{f}_l(x)| \le \varepsilon Z^*$.
(P2) For every x ∈ X it holds that $\hat{f}_h(0) - \hat{f}_h(x) \le 2\varepsilon c(x)$.
(P3) For every x ∈ X it holds that $f_h(0) - f_h(x) \le 2\varepsilon c(x)$.
Proving that these properties hold is not difficult, and we will get to it shortly;
but let us first show how they imply Theorem 1.
Proof of Theorem 1. With probability 1 − 2δ, we can assume that all three
properties (P1–P3) hold. For any x ∈ X we have
\begin{align*}
f_l(x) &\le \hat{f}_l(x) + \varepsilon Z^* && \text{by (P1)} \\
f_h(x) &\le f_h(0) && \text{by (A2)} \\
0 &\le \hat{f}_h(x) + 2\varepsilon c(x) - \hat{f}_h(0) && \text{by (P2)}
\end{align*}
Adding the above inequalities and using the definitions of functions f and $\hat{f}$ we
obtain
\[ f(x) - \hat{f}(x) \le \varepsilon Z^* + 2\varepsilon c(x) + f_h(0) - \hat{f}_h(0). \tag{4} \]
By a similar reasoning we get the opposite inequality
\[ \hat{f}(x) - f(x) \le \varepsilon Z^* + 2\varepsilon c(x) + \hat{f}_h(0) - f_h(0). \tag{5} \]
Now, let $x^*$ and $\bar{x}$ be minimizers of the functions f(·) and $\hat{f}(\cdot)$ respectively.
Hence we have $\hat{f}(\bar{x}) \le \hat{f}(x^*)$. We now use (4) with $x = \bar{x}$ and (5) with $x = x^*$.
Adding them up, together with the fact that $\hat{f}(\bar{x}) \le \hat{f}(x^*)$, we get
\[ f(\bar{x}) - 2\varepsilon c(\bar{x}) \le f(x^*) + 2\varepsilon c(x^*) + 2\varepsilon Z^*. \]
Noting that c(x) ≤ f(x) holds for any x and that $Z^* = f(x^*)$, we get that
$(1 - 2\varepsilon) f(\bar{x}) \le (1 + 4\varepsilon) f(x^*)$. Hence, with probability (1 − 2δ), we have that
$f(\bar{x}) \le (1 + O(\varepsilon)) f(x^*)$.
Now we are ready to prove properties (P1–P3). We will make repeated use
of the Chernoff bound stated in Lemma 1. Properties (P2) and (P3) are an easy
corollary of Axiom (A3) once we realize that the probability of drawing a high
sample from the distribution π is small; and that the fraction of high samples
we draw will be small as well with high probability. Let Nh denote the number
of high samples in ω1, . . . , ωN .
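Lemma 3. With probability 1 − δ, $N_h/N \le 2\varepsilon/\lambda$.
Proof. Let $X_i$ be an indicator variable that is equal to 1 if the sample $\omega_i$ is
high and 0 otherwise. Then $N_h = \sum_{i=1}^{N} X_i$ is a sum of i.i.d. 0-1 variables, and
$\mathrm{E}[N_h] = pN$. From Lemma 2, $p \le \frac{\varepsilon}{(1-\varepsilon)\lambda}$.
Using Chernoff bounds,
\[ \Pr\left[ N_h - Np > \frac{\varepsilon}{\lambda} N \left( 2 - \frac{1}{1 - \varepsilon} \right) \right] \le \exp\left( -\frac{\varepsilon^2}{\lambda^2} \frac{(1 - 2\varepsilon)^2}{(1 - \varepsilon)^2} N \right). \]
With ε < 1/3 and N chosen as in Theorem 1, this probability is at most δ.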
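For completeness, here is why bounding this event suffices (a worked check using only Lemma 2 and ε < 1/3). Whenever the event inside the probability does not occur,
\[ N_h \le Np + \frac{\varepsilon}{\lambda} N \Bigl( 2 - \frac{1}{1 - \varepsilon} \Bigr) \le \frac{\varepsilon}{\lambda} N \Bigl( \frac{1}{1 - \varepsilon} + 2 - \frac{1}{1 - \varepsilon} \Bigr) = \frac{2\varepsilon}{\lambda} N. \]
Moreover, $2 - \frac{1}{1-\varepsilon} = \frac{1 - 2\varepsilon}{1 - \varepsilon} > \frac{1}{2}$ for ε < 1/3, so the stated failure probability is at most $\exp(-\varepsilon^2 N / (4\lambda^2))$, which is at most δ as soon as $N \ge 4 \lambda^2 \varepsilon^{-2} \ln(1/\delta)$.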

Corollary 1. Property (P2) holds with probability 1 − δ, and property (P3) holds
with probability 1.
Proof. By Lemma 3, we can assume that the number of high samples satisfies
$N_h \le 2N\varepsilon/\lambda$ for ε < 1/3. Then,
\[ \hat{f}_h(0) - \hat{f}_h(x) = \frac{1}{N} \sum_{i : \omega_i \text{ high}} \bigl( q(0, \omega_i) - q(x, \omega_i) \bigr) \le \frac{N_h}{N} \lambda c(x). \]
The inequality comes from Axiom (A3). Since $N_h/N \le 2\varepsilon/\lambda$, the right hand side
is at most 2εc(x), and thus Property (P2) holds with probability 1 − δ.
Using Lemma 2, by following the same reasoning (replacing sum by expectation)
we obtain that Property (P3) holds with probability 1.
Now we are ready to prove that property (P1) holds with probability 1 − δ.
This is done in Lemma 4. The proof of this lemma is the only place where we
use the fact that X is finite. In Section 5 we show property (P1) to hold when
$X \subseteq \mathbb{R}^n$, under an assumption that the function c(·) is linear and that q(·, ·)
satisfies a certain Lipschitz-type property.
Lemma 4. With probability 1 − δ it holds that for all x ∈ X,
\[ |f_l(x) - \hat{f}_l(x)| \le \varepsilon Z^*. \]
Proof. First, consider a fixed first stage action x ∈ X. Note that we can view
$\hat{f}_l(x)$ as the arithmetic mean of N independent copies $Q_1, \ldots, Q_N$ of the random
variable Q distributed as
\[ Q = \begin{cases} q(x, \omega) & \text{if } \omega \text{ is low} \\ 0 & \text{if } \omega \text{ is high} \end{cases} \]
Observe that $f_l(x) = \mathrm{E}[Q]$. Let $Y_i$ be the variable $Q_i/M$ and let $Y = \sum_{i=1}^{N} Y_i$.
Note that $Y_i \in [0, 1]$ and $\mathrm{E}[Y] = \frac{N}{M} f_l(x)$. We apply the Chernoff bound from
Lemma 1 to obtain the following.
\[ \Pr\left[ \left| Y - \frac{N}{M} f_l(x) \right| > \frac{\varepsilon^2}{\lambda} N \right] \le 2 \exp\left( -\frac{\varepsilon^4}{\lambda^2} N \right). \]
With N as in Theorem 1, this probability is at most δ/|X|. Now, taking the
union bound over all x ∈ X, we obtain the desired claim.
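The union-bound requirement in this proof can be turned into a concrete sample size. The sketch below solves 2 exp(−ε⁴N/λ²) ≤ δ/|X| for N; it only illustrates the arithmetic of this one step (it does not account for the other properties), and the function name and example parameters are assumptions.

```python
import math

def lemma4_sample_size(lam, eps, size_X, delta):
    """Smallest integer N with 2 * exp(-eps**4 * N / lam**2) <= delta / size_X."""
    return math.ceil((lam ** 2 / eps ** 4) * math.log(2 * size_X / delta))

# Hypothetical parameters: inflation 2, accuracy 0.1, 2**20 first-stage choices.
N = lemma4_sample_size(lam=2.0, eps=0.1, size_X=2 ** 20, delta=0.01)
print(N)
```

Note that the dependence on the number of choices is only logarithmic, while the dependence on 1/ε is of fourth order.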
4 Approximation Algorithms and SAA
In many cases of interest, finding an exact minimizer of the function $\hat{f}$
is computationally hard. However, we may have an algorithm that can find
approximate minimizers of functions $\hat{f}$.
First, we explore the performance of the plain SAA method used with an
α-approximation algorithm for minimizing the function $\hat{f}$. The following lemma
is an adaptation of Theorem 1 to approximation algorithms.
Lemma 5. Let $\bar{x}$ be an α-approximate minimizer for $\hat{f}$. Then, with probability
(1 − 2δ),
\[ f(\bar{x})(1 - 2\varepsilon) \le (1 + 6\varepsilon) \alpha f(x^*) + (\alpha - 1)\bigl( \hat{f}_h(0) - f_h(0) \bigr). \]
Proof. Again, with probability 1 − 2δ, we can assume that Properties (P1–P3)
hold. Following the proof of Theorem 1, the Lemma follows from Inequalities
(4–5) and the fact that $\hat{f}(\bar{x}) \le \alpha \hat{f}(x^*)$.
From the above lemma, we see that $\bar{x}$ is a good approximation to $x^*$ if
$\hat{f}_h(0) - f_h(0)$ is small. Since $\hat{f}_h(x^*)$ is an unbiased estimator of $f_h(x^*)$, by Markov’s
inequality we have that $\hat{f}_h(x^*) \le (1 + 1/k) f_h(x^*)$ holds with probability at least $\frac{1}{k+1}$.
Thus, if we want to achieve multiplicative error (1 + ε), we must be content with
probability of success only proportional to ε. It is not difficult to construct
distributions π where the Markov bound is tight.
There are various ways to improve the success probability of the SAA method
used in conjunction with an approximation algorithm. We propose two of them in
the following two sections. Our first option is to boost the success probability by
repetition: in Section 4.1 we show that by repeating the SAA method $\varepsilon^{-1} \log \delta^{-1}$
times independently, we can achieve success probability 1 − δ. An alternate and
perhaps more elegant method is to reject the high cost samples; this does not
significantly affect the quality of any solution, while significantly reducing the
variance in evaluating the objective function. We discuss this in Section 4.2.
4.1 Approximation Algorithms and Repeating SAA
As we saw in Lemma 5, there is a reasonable probability of success with SAA
even with an approximation algorithm for the sampled problem. To boost the
probability of success we independently repeat the SAA some number of times
and we pick the solution of lowest cost. The precise parameters are formalized
in the theorem below.
Theorem 2. Consider a collection of k functions $\hat{f}^1, \hat{f}^2, \ldots, \hat{f}^k$, such that
$k = \Theta(\varepsilon^{-1} \log \delta^{-1})$ and the $\hat{f}^i$ are independent sample average approximations
of the function f, using $N = \Theta(\lambda^2 \varepsilon^{-4} \cdot k \cdot \log |X| \log \delta^{-1})$ samples each. For
i = 1, . . . , k, let $\bar{x}^i$ be an α-approximate minimizer of the function $\hat{f}^i$. Let
$i = \arg\min_j \hat{f}^j(\bar{x}^j)$. Then, with probability 1 − 3δ, $\bar{x}^i$ is a (1 + O(ε))α-approximate
minimizer of the function f(·).
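The repetition scheme of this theorem can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the callback `approx_minimize` stands in for any α-approximation algorithm (here exact search, so α = 1), and the toy instance and parameter values are assumptions.

```python
import random

def repeated_saa(X, c, q, sample_scenario, approx_minimize, N, k, rng):
    """Run SAA k times on fresh samples; keep the candidate with the smallest sample value."""
    best_x, best_val = None, float("inf")
    for _ in range(k):
        scenarios = [sample_scenario(rng) for _ in range(N)]

        def f_hat(x, s=scenarios):
            return c(x) + sum(q(x, w) for w in s) / N

        x_bar = approx_minimize(X, f_hat)   # approximate minimizer of this run's sample average
        val = f_hat(x_bar)
        if val < best_val:
            best_x, best_val = x_bar, val
    return best_x, best_val

# Toy instance (illustrative): advance purchase x, random demand d, inflation 3.
best_x, best_val = repeated_saa(
    X=range(6),
    c=lambda x: float(x),
    q=lambda x, d: 3.0 * max(d - x, 0),
    sample_scenario=lambda rng: rng.choice([0, 1, 2, 5]),
    approx_minimize=lambda X, f: min(X, key=f),   # exact search, i.e. alpha = 1
    N=1000, k=4, rng=random.Random(1),
)
print(best_x, best_val)
```

Each candidate is scored on its own sample average, matching the selection rule in the theorem; no additional samples are drawn for the comparison.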
Proof. We call the i-th function $\hat{f}^i$ good if it satisfies Properties (P1) and (P2).
The number of samples N has been picked so that each $\hat{f}^i$ is good with probability
at least 1 − 2δ/k, and hence all of the functions are good with probability 1 − 2δ.
By Markov's inequality, the probability that $\hat{f}^i_h(0) \le (1 + \varepsilon) f_h(0)$ is at least
$\varepsilon/(1 + \varepsilon)$. Hence the probability that none of these events happens is at most
$(1 - \varepsilon/(1 + \varepsilon))^k \le \delta$. Thus, with probability 1 − δ we can assume that there is an
index j for which $\hat{f}^j_h(0) \le (1 + \varepsilon) f_h(0)$. As $f_h(0) \le (1 + 2\varepsilon) Z^*$ easily follows from
Property (P3), with probability 1 − δ we have
\[ \hat{f}^j_h(0) \le (1 + 4\varepsilon) Z^*. \tag{6} \]
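Let $i = \arg\min_{\ell} \hat{f}^{\ell}(\bar{x}^{\ell})$. Using the fact that $\hat{f}^i(\bar{x}^i) \le \hat{f}^j(\bar{x}^j)$ and that $\bar{x}^i$ and
$\bar{x}^j$ are both α-approximate minimizers of their respective functions, we get
\[ \hat{f}^i(\bar{x}^i) \le \frac{1}{\alpha} \hat{f}^i(\bar{x}^i) + \frac{\alpha - 1}{\alpha} \hat{f}^j(\bar{x}^j) \le \hat{f}^i(x^*) + (\alpha - 1) \hat{f}^j(x^*). \tag{7} \]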
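Substituting $\bar{x}^i$ and $x^*$ for x in Inequalities (4) and (5) respectively, applied to
the function $\hat{f}^i$, we obtain
\begin{align}
f(\bar{x}^i) &\le \hat{f}^i(\bar{x}^i) + 2\varepsilon Z^* + 2\varepsilon c(\bar{x}^i) + f_h(0) - \hat{f}^i_h(0) \tag{8} \\
\hat{f}^i(x^*) &\le f(x^*) + 2\varepsilon Z^* + 2\varepsilon c(x^*) + \hat{f}^i_h(0) - f_h(0). \tag{9}
\end{align}
Adding inequalities (7), (8), and (9), we get
\[ f(\bar{x}^i) - 2\varepsilon c(\bar{x}^i) \le (\alpha - 1) \hat{f}^j(x^*) + f(x^*) + 4\varepsilon Z^* + 2\varepsilon c(x^*). \tag{10} \]
Using Equation (6) with Lemma 5 to bound $\hat{f}^j(x^*)$ finishes the proof.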
4.2 Approximation Algorithms and Sampling with Rejection
Instead of repeating the SAA method multiple times to get a good approxima-
tion algorithm, we can use it only once, but ignore the high cost samples. The
following lemma makes this statement precise.
Lemma 6. Let $g : X \to \mathbb{R}$ be a function satisfying $|f_l(x) + c(x) - g(x)| = O(\varepsilon) Z^*$
for every x ∈ X. Then any α-approximate minimizer $\bar{x}$ of the function g(·) is
also an α(1 + O(ε))-approximate minimizer of the function f(·).
Proof. Let $\bar{x}$ be an α-approximate minimizer of g. We have
\begin{align*}
f(\bar{x}) &\le g(\bar{x}) + O(\varepsilon Z^*) + f_h(\bar{x}) \\
g(\bar{x}) &\le \alpha \bigl( c(x^*) + f_l(x^*) + O(\varepsilon Z^*) \bigr)
\end{align*}
By Axiom (A2) and Property (P3) we can replace $f_h(\bar{x})$ in the first inequality
by $f_h(x^*) + 2\varepsilon c(x^*)$. Adding up, we obtain
\[ f(\bar{x}) \le \alpha \bigl( c(x^*) + f_l(x^*) \bigr) + \alpha O(\varepsilon Z^*) + 2\varepsilon c(x^*) + f_h(x^*) \le (1 + O(\varepsilon)) \alpha Z^*. \]
According to Lemma 6, a good candidate for the function g would be
$g(x) = c(x) + \hat{f}_l(x)$, since by Lemma 4 we know that $|f_l(x) - \hat{f}_l(x)| \le \varepsilon Z^*$ holds with
probability 1 − δ. However, in order to evaluate $\hat{f}_l(x)$, we need to know the value
$Z^*$ to be able to classify samples as high or low. If $Z^*$ is not known, we can
approximate $f_l$ by the function
\[ \bar{f}_l(x) = \frac{1}{N} \sum_{i=1}^{N - 2\varepsilon N/\lambda} q(x, \omega_i) \]
where we assume that the samples were reordered so that $q(0, \omega_1) \le q(0, \omega_2) \le \cdots \le q(0, \omega_N)$.
In other words, we throw out the 2εN/λ samples with the highest recourse cost.
Lemma 7. With probability 1 − δ, for all x ∈ X it holds that $|\bar{f}_l(x) - f_l(x)| \le 3\varepsilon Z^*$.
Proof. By Lemma 3, with probability 1 − δ we can assume that $\bar{f}_l$ does not
contain any high samples, and hence $\bar{f}_l(x) \le \hat{f}_l(x)$. Since there can be at most
2εN/λ samples that contribute to $\hat{f}_l$ but not to $\bar{f}_l$, and all of them are low, we
get $\hat{f}_l(x) - \bar{f}_l(x) \le M \cdot 2\varepsilon/\lambda = 2\varepsilon Z^*$, and hence $|\bar{f}_l(x) - \hat{f}_l(x)| \le 2\varepsilon Z^*$. Finally,
by Lemma 4 we have that $|\hat{f}_l(x) - f_l(x)| \le \varepsilon Z^*$.
Theorem 3. Let $\omega_1, \omega_2, \ldots, \omega_N$ be independent samples from the distribution π
on Ω with $N = \Theta(\lambda^2 \varepsilon^{-4} \cdot \log |X| \log \delta^{-1})$. Let $\omega'_1, \omega'_2, \ldots, \omega'_N$ be a reordering of
the samples such that $q(0, \omega'_1) \le q(0, \omega'_2) \le \cdots \le q(0, \omega'_N)$. Then any α-approximate
minimizer $\bar{x}$ of the function $\bar{f}(x) = c(x) + \frac{1}{N} \sum_{i=1}^{N'} q(x, \omega'_i)$ with $N' = (1 - 2\varepsilon/\lambda) N$
is a (1 + O(ε))α-approximate minimizer of f(·).
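The rejection variant can be sketched in a few lines. The code below is an illustration under assumed inputs: it sorts the samples by their wait-and-see cost, discards the most expensive ones, and minimizes the truncated average, which is still normalised by the full sample count. The rounding of the number of discarded samples, the helper names, and the toy data are assumptions.

```python
import math

def saa_with_rejection(X, zero, c, q, scenarios, lam, eps, approx_minimize):
    """Drop the ceil(2*eps*N/lam) samples with largest q(zero, w), then minimize the rest."""
    N = len(scenarios)
    n_kept = N - math.ceil(2 * eps * N / lam)          # (1 - 2*eps/lam) * N, rounded down
    kept = sorted(scenarios, key=lambda w: q(zero, w))[:n_kept]

    def f_bar(x):
        return c(x) + sum(q(x, w) for w in kept) / N   # normalised by N, not by n_kept

    return approx_minimize(X, f_bar), kept

demands = [0, 1, 2, 5] * 25                            # 100 hypothetical samples
x_bar, kept = saa_with_rejection(
    X=range(6), zero=0,
    c=lambda x: float(x),
    q=lambda x, d: 3.0 * max(d - x, 0),
    scenarios=demands, lam=3.0, eps=0.1,
    approx_minimize=lambda X, f: min(X, key=f),        # exact search stands in for the approximation algorithm
)
print(x_bar, len(kept))
```

Unlike the high/low classification used in the analysis, this procedure needs no knowledge of the optimum value, only the ability to rank samples by q(0, ·).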
In many situations, computing q(x, ω) (or even q(0, ω)) requires us to solve
an NP-hard problem. Hence we cannot order the samples as we require in the
above theorem. However, if we have an approximation algorithm with ratio β
for computing q(·, ·), we can use it to order the samples instead. For the above
theorem to be applicable with such an approximation algorithm, the number of
samples, N, needs to increase by a factor of $\beta^2$ and $N'$ needs to be $(1 - 2\varepsilon/(\beta\lambda)) N$.
5 From the Discrete to the Continuous
So far we have assumed that X, the set of first-stage decisions, is a finite set.
In this section we demonstrate that this assumption is not crucial, and extend
Theorems 1, 2 and 3 to the case when $X \subseteq \mathbb{R}^n$, under reasonable Lipschitz type
assumptions on the functions c(·) and q(·, ·).
Since all our theorems depend only on the validity of the three properties
(P1–P3), they continue to hold in all settings where (P1–P3) can be shown to
hold. Properties (P2) and (P3) are a simple consequence of Axiom (A3) and
hold irrespective of the underlying action space X. Hence, to extend our results
to a continuous setting, we only need to check the validity of Property (P1). In
the rest of this section, we show (P1) to hold for $X \subseteq \mathbb{R}^n_+$, assuming that the
first-stage cost is linear, i.e. $c(x) = c^t \cdot x$ for some real vector $c \ge 0$, and that
the recourse function satisfies the following property.
Definition 2. We say that the recourse function $q(\cdot, \cdot)$ is $(\lambda, c)$-Lipschitz, if the
following inequality holds for every scenario $\omega$:
$$|q(x, \omega) - q(x', \omega)| \le \lambda \sum_{i=1}^{n} c_i |x_i - x'_i|.$$
Note that the Lipschitz property implies Axiom (A3) in that any $(\lambda, c)$-
Lipschitz recourse function $q$ satisfies $q(0, \omega) - q(x, \omega) \le \lambda c(x)$.¹
We use a standard meshing argument: if two functions $\hat{f}$ and $f$ do not differ
by much on a dense enough finite mesh, then, because of bounded gradient, they must
approximately agree in the whole region covered by the mesh. This idea is by
no means original; it has been used in the context of stochastic optimization by
various authors (among others, [6, 14]). We give the argument for the sake of
completeness.
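Our mesh is an $n$-dimensional grid of points with spacing $\varepsilon Z^*/(n\lambda c_i)$ in
dimension $i$, for $1 \le i \le n$.
Since the $i$-th coordinate $\bar{x}_i$ of any $\alpha$-approximate minimizer $\bar{x}$ cannot be
larger than $\alpha Z^*/c_i$ (as otherwise the first stage cost would exceed $\alpha Z^*$), we can
assume that the feasible region lies within the bounding box $0 \le x_i \le \alpha Z^*/c_i$
for $1 \le i \le n$. Thus, the set $X'$ of mesh points can be written as
$$X' = \left\{ \left( i_1 \frac{\varepsilon Z^*}{n\lambda c_1},\; i_2 \frac{\varepsilon Z^*}{n\lambda c_2},\; \ldots,\; i_n \frac{\varepsilon Z^*}{n\lambda c_n} \right) \;\middle|\; (i_1, i_2, \ldots, i_n) \in \{0, 1, \ldots, n\alpha\lambda/\varepsilon\}^n \right\}.$$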
We claim the following analog of Lemma 4.
Lemma 8. If $N \ge \Theta(\lambda^2 \varepsilon^{-4} n \log(n\lambda/\varepsilon) \log \delta^{-1})$, then with probability
$1 - \delta$ we have that $|\hat{f}_l(x) - f_l(x)| \le 3\varepsilon Z^*$ holds for every $x \in X$.
Proof. The size of $X'$ is $(1 + n\alpha\lambda/\varepsilon)^n$, and hence $\log |X'| = O(n \log(n\lambda/\varepsilon))$.
Hence Lemma 4 guarantees that with probability $1 - \delta$,
$|\hat{f}_l(x') - f_l(x')| \le \varepsilon Z^*$ holds for every $x' \in X'$.
For a general point $x \in X$, there must be a nearby mesh point $x' \in X'$ such that
$\sum_{i=1}^{n} c_i |x_i - x'_i| \le \varepsilon Z^*/\lambda$. By Lipschitz continuity of $q$ we have that
$|f_l(x) - f_l(x')| \le \varepsilon Z^*$ and $|\hat{f}_l(x) - \hat{f}_l(x')| \le \varepsilon Z^*$. By the triangle inequality,
$$|\hat{f}_l(x) - f_l(x)| \le |\hat{f}_l(x) - \hat{f}_l(x')| + |\hat{f}_l(x') - f_l(x')| + |f_l(x') - f_l(x)| \le 3\varepsilon Z^*.$$
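The meshing step can be checked numerically. The sketch below assumes the mesh spacing $\varepsilon Z^*/(n\lambda c_i)$ read off the definition of the mesh point set, and snaps a point down to the mesh so that it stays inside the bounding box; the helper names `mesh_spacing`, `snap_to_mesh` and `weighted_distance` are hypothetical.

```python
import math
from typing import List, Sequence


def mesh_spacing(eps: float, z_star: float, lam: float,
                 costs: Sequence[float]) -> List[float]:
    """Grid spacing eps * Z* / (n * lam * c_i) in each dimension i."""
    n = len(costs)
    return [eps * z_star / (n * lam * ci) for ci in costs]


def snap_to_mesh(x: Sequence[float], spacing: Sequence[float]) -> List[float]:
    """Round every coordinate down to the nearest multiple of its spacing."""
    return [math.floor(xi / hi) * hi for xi, hi in zip(x, spacing)]


def weighted_distance(x: Sequence[float], y: Sequence[float],
                      costs: Sequence[float]) -> float:
    """The cost-weighted L1 distance sum_i c_i * |x_i - y_i|."""
    return sum(ci * abs(a - b) for ci, a, b in zip(costs, x, y))
```

Each coordinate moves by less than its spacing, so it contributes less than $\varepsilon Z^*/(n\lambda)$ to the weighted distance, and the n coordinates together stay within the bound $\varepsilon Z^*/\lambda$ used in the proof.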
In some recent work, Shmoys and Swamy [15] extended their work on two-stage
problems to multi-stage problems. Their new work is not based on the ellipsoid
method but still relies on the notion of a sub-gradient and thus requires X to
be a continuous set. We believe that our analysis in this paper can be extended
to the multi-stage setting even when X is a discrete set and we plan to explore
this in future work.
Acknowledgments
We thank Retsef Levi, R. Ravi, David Shmoys, Mohit Singh and Chaitanya
Swamy for useful discussions.
1
This is not necessarily true for non-linear c(·).

References
1. K. Dhamdhere, R. Ravi, and M. Singh. On two-stage stochastic minimum spanning
trees. Proc. of IPCO, 2005.
2. A. Flaxman, A. Frieze, and M. Krivelevich. On the random 2-stage minimum spanning
tree. Proc. of SODA, 2005.
3. A. Gupta, M. Pál, R. Ravi, and A. Sinha. Boosted sampling: Approximation
algorithms for stochastic optimization. In Proceedings of the 36th Annual ACM
Symposium on Theory of Computing, 2004.
4. A. Gupta, R. Ravi, and A. Sinha. An edge in time saves nine: LP rounding
approximation algorithms. In Proceedings of the 45th Annual IEEE Symposium on
Foundations of Computer Science, 2004.
5. N. Immorlica, D. Karger, M. Minkoff, and V. Mirrokni. On the costs and benefits
of procrastination: Approximation algorithms for stochastic combinatorial
optimization problems. In Proceedings of the 15th Annual ACM-SIAM Symposium on
Discrete Algorithms, 2004.
6. A. J. Kleywegt, A. Shapiro, and T. Homem-De-Mello. The sample average
approximation method for stochastic discrete optimization. SIAM J. on Optimization,
12:479-502, 2001.
7. M. Mahdian. Facility Location and the Analysis of Algorithms through
Factor-Revealing Programs. Ph.D. Thesis, MIT, June 2004.
8. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University
Press, 1995.
9. Stochastic Programming. A. Ruszczynski and A. Shapiro, editors. Vol 10 of
Handbook in Operations Research and Management Science, Elsevier 2003.
10. A. Shapiro. Monte Carlo sampling methods. In A. Ruszczynski and A. Shapiro,
editors, Stochastic Programming, Vol 10 of Handbook in Operations Research and
Management Science, Elsevier 2003.
11. R. Ravi and A. Sinha. Hedging uncertainty: Approximation algorithms for
stochastic optimization problems. In Proceedings of the 10th International Conference on
Integer Programming and Combinatorial Optimization (IPCO), 2004.
12. R. Ravi and M. Singh. Personal communication, February 2005.
13. D. Shmoys and C. Swamy. Stochastic optimization is (almost) as easy as
deterministic optimization. Proc. of FOCS, 2004.
14. D. Shmoys and C. Swamy. The Sample Average Approximation Method for 2-stage
Stochastic Optimization. Manuscript, November 2004.
15. D. Shmoys and C. Swamy. Sampling-based Approximation Algorithms for
Multi-stage Stochastic Optimization. Manuscript, 2005.

An Improved Analysis of Mergers
Zeev Dvir
and Amir Shpilka
Weizmann Institute of Science
Rehovot, Israel
{zeev.dvir,amir.shpilka}@weizmann.ac.il
Abstract. Mergers are functions that transform k (possibly dependent)
random sources into a single random source, in a way that ensures that
if one of the input sources has min-entropy rate δ then the output has
min-entropy rate close to δ. Mergers have proven to be a very useful
tool in explicit constructions of extractors and condensers, and are also
interesting objects in their own right. In this work we present a new
analysis of the merger construction of [6]. Our analysis shows that the
min-entropy rate of this merger’s output is actually 0.52 · δ instead of
0.5·δ, where δ is the min-entropy rate of one of the inputs. To obtain this
result we deviate from the usual linear algebra methods that were used
by [6] and introduce a new technique that involves results from additive
number theory.
1 Introduction
Mergers are functions that take as input k samples, taken from k (possibly
dependent) random sources, each of length n bits. It is assumed that one of
these random sources, whose index is unknown, is sufficiently random, in the
sense that it has min-entropy¹ ≥ δn. We want the merger to output an $n'$-bit
string ($n'$ could be smaller than n) that will be close to having min-entropy
at least $\delta' n'$, where $\delta'$ is not considerably smaller than δ. To achieve this, the
merger is allowed to use an additional small number of truly random bits called
a seed. The goals in merger constructions are a) to minimize the seed length, b)
to maximize the min-entropy of the output and c) to minimize the error (that is,
the statistical distance between the merger’s output and some high min-entropy
source).
The notion of merger was first introduced by Ta-Shma [11], in the context
of explicit constructions of extractors². Recently, Lu, Reingold, Vadhan and

Research supported by Israel Science Foundation (ISF) grant.

Research supported by the Koshland fellowship.
1
A source has min-entropy ≥ b if none of its values is obtained with probability larger
than $2^{-b}$.
2
An extractor is a function that transforms a source with min-entropy b into a source
which is close to uniform, with the aid of an additional random seed. For a more
detailed discussion of extractors see [9].
Wigderson [6] gave a very simple and beautiful construction of mergers based on
Locally-Decodable-Codes. This construction was used in [6] as a building block
in an explicit construction of extractors with nearly optimal parameters. More
recently, [7] generalized the construction of [6], and showed how this construction
(when combined with other techniques) can be used to construct condensers³
with constant seed length. The analysis of the merger constructed in [7] was
subsequently refined in [4].
The merger constructed by [6] takes as input k strings of length n, one
of which has min-entropy b, and outputs a string of length n that is close to
having min-entropy at least 0.5 · b. Loosely speaking, the output of the merger
is computed as follows: treat each input block as a vector in the vector space
$F^m$, where F is some small finite field, and output a uniformly chosen linear
combination of these k vectors. The analysis of this construction is based on the
following simple idea: In every set of linear combinations whose density is at least
γ (where γ is determined by the size of F), there exist two linear combinations
that, when put together, determine the 'good' source (that is, the 'good' source
can be computed from both of them deterministically). Therefore, one of these
linear combinations must have at least half the entropy of the 'good' source (this
reasoning extends also to min-entropy). As a result we get that for most seed
values (linear combinations) the output has high min-entropy, and the result
follows. This is of course an over-simplified explanation, but it gives the general
idea behind the proof.
In this paper we present an alternative analysis to the one just described.
Our analysis relies on two results from the field of additive-number theory. The
first is Szemerédi's theorem ([10],[5]) on arithmetic progressions of length three
(also known as Roth's theorem [8]). This theorem states that in every subset of
{1, . . . , N} whose density is at least δ(N), there exists an arithmetic progression
of length three. For our purposes we use a quantitative version of this theorem as
proven by Bourgain [3]. The second result that we rely on is a lemma of Bourgain
[2] that deals with sum-sets and difference-sets of integers. Roughly speaking,
the lemma says that if the sum-set of two sets of integers is very small, then their
difference-set cannot be very large (for a precise formulation see Section 3). It
is interesting to note that this is not the first time that results from additive
number-theory are used in the context of randomness extraction. A recent result
of Barak, Impagliazzo and Wigderson [1] uses results from this field to construct
multi-source extractors.
Using these two results we are able to show that the min-entropy of the output
of the aforementioned merger is 0.52 · b and not 0.5 · b as was previously known.
One drawback of our analysis is that the length of the seed is required to be
$O(k \cdot \gamma^{-2})$ in order for the output error to be γ, whereas in the conventional
analysis the seed length can be as short as $O(k \cdot \log(\gamma^{-1}))$. This however does
not present a problem in many of the current applications of mergers, where the
3
A condenser is a function that transforms a source with min-entropy rate δ into a
source which is close to having min-entropy rate $\delta' > \delta$, with the aid of an additional
random seed.

error parameter and the number of input sources are both constants and the seed
length is also required to be a constant. One place where our analysis can be
used in order to simplify an existing construction is in the extractor construction
of [7]. There, the output of the merger is used as an input to an extractor that
requires the min-entropy rate of its input to be larger than one-half. In [7] this
problem is addressed by a more complicated merger construction whose output
length is shorter than n. Our analysis shows that the simpler construction
of [6] could be used instead, since its output min-entropy rate is larger than
one-half.
1.1 Somewhere-Random-Sources
An n-bit random source is a random variable X that takes values in $\{0, 1\}^n$.
We denote by $\mathrm{supp}(X) \subset \{0, 1\}^n$ the support of X (i.e. the set of values on
which X has non-zero probability). For two n-bit sources X and Y, we define
the statistical distance between X and Y to be
$$\Delta(X, Y) \triangleq \frac{1}{2} \sum_{a \in \{0,1\}^n} \left| \Pr[X = a] - \Pr[Y = a] \right|.$$
We say that a random source X (of length n bits) has min-entropy ≥ b if for
every $x \in \{0, 1\}^n$ the probability for X = x is at most $2^{-b}$.
Definition 1.1 (Min-entropy) Let X be a random variable distributed over
$\{0, 1\}^n$. The min-entropy of X is defined as⁴
$$H^{\infty}(X) \triangleq \min_{x \in \mathrm{supp}(X)} \log\left( \frac{1}{\Pr[X = x]} \right).$$
Definition 1.2 ((n, b)-Source) We say that X is an (n, b)-source, if X is an
n-bit random source, and $H^{\infty}(X) \ge b$.
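For small explicit distributions, the two quantities just defined can be computed directly. The following is a minimal sketch that assumes a distribution is given as a dictionary from outcomes to probabilities; the function names are illustrative.

```python
import math
from typing import Dict, Hashable


def statistical_distance(p: Dict[Hashable, float],
                         q: Dict[Hashable, float]) -> float:
    """Half the L1 distance between two distributions on the same space."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


def min_entropy(p: Dict[Hashable, float]) -> float:
    """Base-2 min-entropy: minus log of the largest point probability."""
    return -math.log2(max(pr for pr in p.values() if pr > 0))


def is_source(p: Dict[Hashable, float], b: float) -> bool:
    """True if the distribution has min-entropy at least b."""
    return min_entropy(p) >= b
```

For example, a distribution on 2-bit strings that puts probability 1/2 on one string has min-entropy 1, regardless of how the remaining mass is spread.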
A somewhere-(n, b)-source is a source comprised of several blocks, such that
at least one of the blocks is an (n, b)-source. Note that we allow the other source
blocks to depend arbitrarily on the (n, b)-source, and on each other.
Definition 1.3 ($(n, b)^{1:k}$-Source) A k-places-somewhere-(n, b)-source, or an
$(n, b)^{1:k}$-source, is a random variable $X = (X_1, \ldots, X_k)$, such that every $X_i$ is
of length n bits, and at least one $X_i$ is of min-entropy ≥ b. Note that $X_1, \ldots, X_k$
are not necessarily independent⁵.
4
All logarithms in this paper are taken base 2.
5
It is possible to define somewhere-(n, b)-sources in a more general way to also include
convex combinations of sources of the type described by Definition 1.3. However, it
suffices to consider sources of this simpler type for the task of merger constructions.

1.2 Mergers
A merger is a function transforming an $(n, b)^{1:k}$-source into a source which is
γ-close (i.e. it has statistical distance ≤ γ) to an $(m, b')$-source. Naturally, we
want $b'/m$ to be as large as possible, and γ to be as small as possible. We allow
the merger to use an additional small number of truly random bits, called a seed.
A merger is strong if for almost all possible assignments to the seed, the output
is close to being an $(m, b')$-source. A merger is explicit if it can be computed in
polynomial time.
Definition 1.4 (Merger) A function $M : \{0, 1\}^d \times \{0, 1\}^{n \cdot k} \to \{0, 1\}^m$ is a
$[d, (n, b)^{1:k} \mapsto (m, b') \sim \gamma]$-merger if for every $(n, b)^{1:k}$-source X, and for an
independent random variable Z uniformly distributed over $\{0, 1\}^d$, the distribution
M(Z, X) is γ-close to a distribution of an $(m, b')$-source. We say that M
is strong if the average over $z \in \{0, 1\}^d$ of the minimal distance between the
distribution of M(z, X) and a distribution of an $(m, b')$-source is ≤ γ.
We now give a formal definition of the merger we wish to analyze. To simplify
the analysis we will assume in several places that certain quantities are integers.
This, however, will not affect our results in any significant way.
Construction 1.5 ([6]) Let n, k be integers, p a prime number, and let
$l = n/\log(p)$. We define a function $M : \{0, 1\}^d \times \{0, 1\}^{n \cdot k} \to \{0, 1\}^n$, with
$d = \log(p) \cdot k$, in the following way: Let F denote the field GF(p). Given
$z \in \{0, 1\}^d$, we think of z as a vector $(z_1, \ldots, z_k) \in F^k$. Given
$x = (x_1, \ldots, x_k) \in \{0, 1\}^{n \cdot k}$, we think of each $x_i \in \{0, 1\}^n$ as a vector in $F^l$.
The function M is now defined as
$$M(z, x) = \sum_{i=1}^{k} z_i \cdot x_i \in F^l,$$
where the operations are performed in the vector space $F^l$. Intuitively, one can
think of M as $M : F^k \times (F^l)^k \to F^l$.
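The construction amounts to a single linear combination over GF(p). The sketch below assumes that the seed and the source blocks have already been parsed into field elements (integers in the range 0 to p − 1); the bit-level encoding is not specified here and is left out, and the name `merge` is illustrative.

```python
from typing import List, Sequence


def merge(z: Sequence[int], x: Sequence[Sequence[int]], p: int) -> List[int]:
    """Return sum_i z_i * x_i over GF(p), computed coordinate by coordinate.

    z: the seed, viewed as k elements of GF(p).
    x: the k source blocks, each viewed as a vector of l elements of GF(p).
    """
    if len(z) != len(x):
        raise ValueError("seed and source must have the same number of blocks")
    length = len(x[0])
    return [sum(zi * xi[j] for zi, xi in zip(z, x)) % p for j in range(length)]
```

For instance, with p = 5, seed (1, 2) and blocks (1, 2, 3) and (4, 0, 1), the output is (4, 2, 0).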
1.3 Our Results
We prove the following theorem:
Theorem 1 Let $0 < \gamma < 1$ be any constant, $k > 0$ a constant integer, and let p
be a prime larger than⁶ $\exp(\gamma^{-2})$. Let
$$M : \{0, 1\}^d \times \{0, 1\}^{n \cdot k} \to \{0, 1\}^n$$
be as in Construction 1.5, where $d = \log(p) \cdot k$ (and underlying field GF(p)). Then
for any constant $\alpha > 0$ there exists a constant $b_0$ such that for all $n \ge b \ge b_0$,
M is a $[d, (n, b)^{1:k} \mapsto (n, b') \sim \gamma]$-strong merger with
$$b' = (0.52 - \alpha) \cdot b.$$
6
In writing $\exp(f)$ we mean $2^{O(f)}$.

From Theorem 1 we see that in order to get a merger with error γ we need
to choose the underlying field to be of size at least $\exp(\gamma^{-2})$. It is well known
that for every integer n, there is a prime between n and 2n. Therefore we can
take p to be $O(\exp(\gamma^{-2}))$ and have that the length of the random seed is
$d = \log(p) \cdot k = O(k \cdot \gamma^{-2})$ bits long. Hence, for constant γ and k the length of the
random seed used by the merger is constant. Notice that finding the prime p
that answers our demands can be done by testing all integers in the range of
interest for primality (if γ is constant then this will take a constant amount of
time).
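The exhaustive search for p described above can be sketched as follows. Since $\exp(\gamma^{-2})$ hides an unspecified constant in the exponent, the lower bound is passed in as a plain integer; the function names, the trial-division primality test, and rounding $\log(p)$ up to a whole number of bits are assumptions made for illustration.

```python
import math


def is_prime(m: int) -> bool:
    """Trial division; adequate when the range of interest has constant size."""
    if m < 2:
        return False
    return all(m % d != 0 for d in range(2, math.isqrt(m) + 1))


def first_prime_at_least(lower: int) -> int:
    """Smallest prime p with p >= lower (it is below 2 * lower for lower >= 2)."""
    m = max(2, lower)
    while not is_prime(m):
        m += 1
    return m


def seed_length_bits(p: int, k: int) -> int:
    """Seed length d = log(p) * k, with log(p) rounded up to whole bits."""
    return k * math.ceil(math.log2(p))
```

With a lower bound of 90 the search stops at p = 97, and for k = 3 blocks the seed is 21 bits long.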
1.4 Organization
In Section 2 we give our analysis of Construction 1.5 and prove Theorem 1. The
analysis presented in Section 2 relies on two central claims that we prove in
Section 3.
2 Analysis of Construction 1.5
In this section we present our improved analysis of Construction 1.5, and prove
Theorem 1. The analysis will go along the same lines as in [4] and will differ from
it in two claims that we will prove in Section 3. We begin with some notations
that will be used throughout the paper.
Let $X = (X_1, \ldots, X_k) \in \{0, 1\}^{n \cdot k}$ be a somewhere-(n, b)-source, and let
us assume w.l.o.g. that $H^{\infty}(X_1) \ge b$. Let $0 < \gamma < 1$ be any constant, and
let $p \ge \exp(\gamma^{-2})$ be a prime number. Let $M : F^k \times (F^l)^k \to F^l$ be as in
Construction 1.5, where F = GF(p), $d = \log(p) \cdot k$ and $n = \log(p) \cdot l$. Our goal is
to analyze the min-entropy of M(Z, X) where Z will denote a random variable
uniformly distributed over $F^k$. In particular, we would like to show that the
random variable M(Z, X) is γ-close to having min-entropy $\ge (0.52 - \alpha) \cdot b$ for
every constant $\alpha > 0$.
For every $z \in F^k$ we denote by $Y_z \triangleq M(z, X)$ the random variable given by the
output of M on the fixed seed value z (recall that, in Construction 1.5, every
seed value corresponds to a specific linear combination of the source blocks).
Let $u \triangleq 2^d = p^k$ be the number of different seed values, so we can treat the set
$\{0, 1\}^d$ as the set⁷ $[u]$. We can now define $Y \triangleq (Y_1, \ldots, Y_u) \in (F^l)^u$. The random
variable Y is a function of X, and is comprised of u blocks, each one of length
$n = \log(p) \cdot l$ bits, representing the output of the merger on all possible seed
values. We will first analyze the distribution of Y as a whole, and then use this
analysis to describe the output of M on a uniformly chosen seed.
Definition 2.1 Let $D(\Omega)$ denote the set of all probability distributions over a
finite set Ω. Let $P \subset D(\Omega)$ be some property. We say that $\mu \in D(\Omega)$ is
γ-close to a convex combination of distributions with property P, if there exist
constants $\alpha_1, \ldots, \alpha_t, \gamma > 0$, and distributions $\mu_1, \ldots, \mu_t, \mu' \in D(\Omega)$ such that the
following three conditions hold⁸:
(1) $\mu = \sum_{i=1}^{t} \alpha_i \mu_i + \gamma \mu'$.
(2) $\sum_{i=1}^{t} \alpha_i + \gamma = 1$.
(3) $\forall i \in [t],\ \mu_i \in P$.
7
For an integer n, we write $[n] \triangleq \{1, 2, \ldots, n\}$.
Let Y be the random variable defined above, and let $\mu : (F^l)^u \to [0, 1]$ be
the probability distribution of Y (i.e. $\mu(y) = \Pr[Y = y]$). We would like to show
that μ is exponentially (in b) close to a convex combination of distributions, each
having a certain property which will be defined shortly.
Given a probability distribution μ on $(F^l)^u$ we define for each $z \in [u]$ the
distribution $\mu_z : F^l \to [0, 1]$ to be the restriction of μ to the z-th block. More
formally, we define
$$\mu_z(y) \triangleq \sum_{y_1, \ldots, y_{z-1}, y_{z+1}, \ldots, y_u \in F^l} \mu(y_1, \ldots, y_{z-1}, y, y_{z+1}, \ldots, y_u).$$
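The restriction to a single block is an ordinary marginal. A minimal sketch, assuming μ is stored as a dictionary from u-tuples of block values to probabilities and that z is 1-indexed as in the text (the name `block_marginal` is illustrative):

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def block_marginal(mu: Dict[Tuple[Hashable, ...], float],
                   z: int) -> Dict[Hashable, float]:
    """Marginal distribution of the z-th block (1-indexed) of a joint law mu."""
    marginal: Dict[Hashable, float] = defaultdict(float)
    for y, prob in mu.items():
        # Sum mu over every coordinate other than the z-th one.
        marginal[y[z - 1]] += prob
    return dict(marginal)
```

Combined with a min-entropy routine, this is enough to test on toy examples whether a given joint distribution has high min-entropy in most of its blocks.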
Next, let $\alpha > 0$. We say that a distribution $\mu : (F^l)^u \to [0, 1]$ is α-good if for
at least $(1 - \gamma/2) \cdot u$ values of $z \in [u]$, $\mu_z$ has min-entropy at least $(0.52 - \alpha) \cdot b$.
The statement that we would like to prove is that the distribution of Y is close
to a convex combination of α-good distributions (see Definition 2.1). As we will
see later, this is good enough for us to be able to prove Theorem 1. The following
lemma states this claim in a more precise form.
Lemma 2.2 (Main Lemma) Let $Y = (Y_1, \ldots, Y_u)$ be the random variable defined
above, and let μ be its probability distribution. Then, for any constant
$\alpha > 0$, μ is $2^{-\Omega(b)}$-close to a convex combination of α-good distributions.
We prove Lemma 2.2 in subsection 2.1. The proof of Theorem 1, which follows
quite easily from Lemma 2.2, is essentially the same as in [4] (with modified
parameters). Due to space constraints we omit the proof of Theorem 1 (the
proof appears in the full version of the paper).
2.1 Proof of Lemma 2.2
In order to prove Lemma 2.2 we prove the following slightly stronger lemma.
Lemma 2.3 Let $X = (X_1, \ldots, X_k)$ be an $(n, b)^{1:k}$-source, and let
$Y = (Y_1, \ldots, Y_u)$ and μ be as in Lemma 2.2. Then for any constant $\alpha > 0$ there exists an
integer $t \ge 1$, and a partition of $\{0, 1\}^{n \cdot k}$ into t + 1 sets $W_1, \ldots, W_t, W'$, such
that:
1. $\Pr_X[X \in W'] \le 2^{-\Omega(b)}$.
2. For every $i \in [t]$ the probability distribution of $Y \mid X \in W_i$ (that is, of Y
conditioned on the event $X \in W_i$) is α-good. In other words: for every $i \in [t]$
there exist at least $(1 - \gamma/2) \cdot u$ values of $z \in [u]$ for which
$$H^{\infty}(Y_z \mid X \in W_i) \ge (0.52 - \alpha) \cdot b.$$
8
In condition 1, we require that the convex combination of the $\mu_i$'s will be strictly
smaller than μ. This is not the most general case, but it will be convenient for us
to use this definition.
Before proving Lemma 2.3 we give the proof of Lemma 2.2 which follows easily
from Lemma 2.3.
Proof of Lemma 2.2: The lemma follows immediately from Lemma 2.3 and from
the following equality, which holds for every partition $W_1, \ldots, W_t, W'$, and for
every y:
$$\Pr[Y = y] = \sum_{i=1}^{t} \Pr[X \in W_i] \cdot \Pr[Y = y \mid X \in W_i] + \Pr[X \in W'] \cdot \Pr[Y = y \mid X \in W'].$$
If the partition $W_1, \ldots, W_t, W'$ satisfies the two conditions of Lemma 2.3
then from Definition 2.1 it is clear that Y is exponentially (in b) close to a
convex combination of α-good distributions.
Proof of Lemma 2.3: Every random variable $Y_z$ is a function of X, and so it
partitions $\{0, 1\}^{n \cdot k}$ in the following way:
$$\{0, 1\}^{n \cdot k} = \bigcup_{y \in \{0,1\}^n} (Y_z)^{-1}(y),$$
where $(Y_z)^{-1}(y) \triangleq \{ x \in \{0, 1\}^{n \cdot k} \mid Y_z(x) = y \}$. For each $z \in [u]$ we define the set
$$B_z \triangleq \bigcup_{\{y \,\mid\, \Pr[Y_z = y] > 2^{-(0.52 - \alpha/2) \cdot b}\}} (Y_z)^{-1}(y) = \left\{ x' \in \{0, 1\}^{n \cdot k} \;\middle|\; \Pr_X[Y_z(X) = Y_z(x')] > 2^{-(0.52 - \alpha/2) \cdot b} \right\}.$$
Intuitively, $B_z$ contains all values of x that are "bad" for $Y_z$, where by "bad"
we mean that $Y_z(x)$ is obtained with relatively high probability in the
distribution $Y_z(X)$.
Definition 2.4 (Good Triplets) Let $(z_1, z_2, z_3) \in [u]^3$ be a triplet of seed
values. Since each seed value is actually a vector in $F^k$ we can write each $z_i$
(i = 1, 2, 3) as a vector $(z_{i1}, \ldots, z_{ik})$, where each $z_{ij}$ is an integer in the range
0, . . . , p − 1. We say that the triplet $(z_1, z_2, z_3)$ is good if the following two
conditions hold:
1. For all $2 \le j \le k$, $z_{1j} = z_{2j} = z_{3j}$.
2. There exists an integer $0 < a < p$ such that $z_{21} = z_{11} + a$ and
$z_{31} = z_{11} + 2a$, where the equalities are over F = GF(p).
That is, the vectors $z_1, z_2, z_3$ are identical in all coordinates other than the first,
and their first coordinates form an arithmetic progression of length three in F =
GF(p).
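The two conditions of the definition translate directly into a membership test. The sketch below assumes each seed is given as a tuple of k integers in the range 0 to p − 1; the name `is_good_triplet` is illustrative.

```python
from typing import Sequence


def is_good_triplet(z1: Sequence[int], z2: Sequence[int],
                    z3: Sequence[int], p: int) -> bool:
    """Check the two conditions for a good triplet over GF(p)."""
    # Condition 1: identical in every coordinate except the first.
    if not (list(z1[1:]) == list(z2[1:]) == list(z3[1:])):
        return False
    # Condition 2: first coordinates are z11, z11 + a, z11 + 2a with 0 < a < p.
    a = (z2[0] - z1[0]) % p
    return a != 0 and (z3[0] - z1[0]) % p == (2 * a) % p
```

Note that the progression is taken modulo p, so for p = 7 the first coordinates 5, 1, 4 (common difference 3) also qualify.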
The next two claims are the place where our analysis differs from that of [6]
and [4]. We devote Section 3 to the proofs of these two claims. The first claim
shows that the intersection of the "bad" sets $B_{z_1}, B_{z_2}, B_{z_3}$ for a good triplet
$(z_1, z_2, z_3)$ is small:
Claim 2.5 For every good triplet $(z_1, z_2, z_3)$ it holds that
$$\Pr_X[X \in B_{z_1} \cap B_{z_2} \cap B_{z_3}] \le C_\alpha \cdot 2^{-(\alpha/4) \cdot b},$$
where $C_\alpha$ is a constant depending only on α.
The second claim shows that every set of seed values whose density is larger
than γ/2 contains a good triplet.
Claim 2.6 Let $T \subset [u]$ be such that $|T| > (\gamma/2) \cdot u$. Then T contains a good
triplet.
The proof of Lemma 2.3 continues along the same lines as in [4]. We define
for each $x \in \{0, 1\}^{n \cdot k}$ a vector $\pi(x) \in \{0, 1\}^u$ in the following way:
$$\forall z \in [u], \quad \pi(x)_z = 1 \iff x \in B_z.$$
For a vector $\pi \in \{0, 1\}^u$, let $w(\pi)$ denote the weight of π (i.e. the number of 1's
in π). Since the weight of $\pi(x)$ denotes the number of seed values for which x is
"bad", we would like to somehow show that for most x's $w(\pi(x))$ is small. This
can be proven by combining Claim 2.5 with Claim 2.6, as shown by the following
claim.
Claim 2.7
$$\Pr_X[w(\pi(X)) > (\gamma/2) \cdot u] \le u^3 \cdot C_\alpha \cdot 2^{-(\alpha/4) \cdot b}.$$
Proof. If x is such that $w(\pi(x)) > (\gamma/2) \cdot u$ then, by Claim 2.6, we know that
there exists a good triplet $(z_1, z_2, z_3)$ such that $x \in B_{z_1} \cap B_{z_2} \cap B_{z_3}$. Therefore
we have
$$\Pr_X[w(\pi(X)) > (\gamma/2) \cdot u] \le \Pr_X[\exists \text{ a good triplet } (z_1, z_2, z_3) \text{ s.t. } X \in B_{z_1} \cap B_{z_2} \cap B_{z_3}].$$
Now, using the union bound and Claim 2.5 we can bound this
probability by $u^3 \cdot C_\alpha \cdot 2^{-(\alpha/4) \cdot b}$.
From Claim 2.7 we see that every x (except for an exponentially small set)
is contained in at most $(\gamma/2) \cdot u$ sets $B_z$. The idea is now to partition the space
$\{0, 1\}^{n \cdot k}$ into sets of x's that have the same $\pi(x)$. If we condition the random
variable Y on the event $\pi(X) = \pi_0$, where $\pi_0$ is of small weight, we will get an
α-good distribution. We now explain this idea in more detail. We define the
following sets
$$\mathrm{BAD}_1 \triangleq \{ \pi' \in \{0, 1\}^u \mid w(\pi') > (\gamma/2) \cdot u \},$$
$$\mathrm{BAD}_2 \triangleq \left\{ \pi' \in \{0, 1\}^u \;\middle|\; \Pr_X[\pi(X) = \pi'] < 2^{-(\alpha/2) \cdot b} \right\},$$
$$\mathrm{BAD} \triangleq \mathrm{BAD}_1 \cup \mathrm{BAD}_2.$$
The set $\mathrm{BAD} \subset \{0, 1\}^u$ contains values $\pi' \in \{0, 1\}^u$ that cannot be used in
the partitioning process described in the last paragraph. There are two reasons
why a specific value $\pi' \in \{0, 1\}^u$ is included in BAD. The first reason is that
the weight of $\pi'$ is too large (i.e. larger than $(\gamma/2) \cdot u$); these values of $\pi'$ are
included in the set $\mathrm{BAD}_1$. The second, less obvious reason for $\pi'$ to be excluded
from the partitioning is that the set of x's for which $\pi(x) = \pi'$ is of extremely
small probability. These values of $\pi'$ are bad because we can say nothing about
the min-entropy of Y when conditioned on the event⁹ $\pi(X) = \pi'$.
Having defined the set BAD, we are now ready to define the partition required
by Lemma 2.3. Let $\{\pi^1, \ldots, \pi^t\} = \{0, 1\}^u \setminus \mathrm{BAD}$. We define the sets
$W_1, \ldots, W_t, W' \subset \{0, 1\}^{n \cdot k}$ as follows:
– $W' = \{x \mid \pi(x) \in \mathrm{BAD}\}$.
– $\forall i \in [t],\ W_i = \{x \mid \pi(x) = \pi^i\}$.
Clearly, the sets $W_1, \ldots, W_t, W'$ form a partition of $\{0, 1\}^{n \cdot k}$. We will now
show that this partition satisfies the two conditions required by Lemma 2.3. To
prove the first part of the lemma note that the probability of $W'$ can be bounded
by (using Claim 2.7 and the union bound)
$$\Pr_X[X \in W'] \le \Pr_X[\pi(X) \in \mathrm{BAD}_1] + \Pr_X[\pi(X) \in \mathrm{BAD}_2] \le u^3 \cdot C_\alpha \cdot 2^{-(\alpha/4) \cdot b} + 2^u \cdot 2^{-(\alpha/2) \cdot b} = 2^{-\Omega(b)}.$$
We now prove that $W_1, \ldots, W_t$ satisfy the second part of the lemma. Let
$i \in [t]$, and let $z \in [u]$ be such that $(\pi^i)_z = 0$ (there are at least $(1 - \gamma/2) \cdot u$
such values of z). Let $y \in \{0, 1\}^n$ be any value. If $\Pr[Y_z = y] > 2^{-(0.52 - \alpha/2) \cdot b}$
then $\Pr[Y_z = y \mid X \in W_i] = 0$ (this follows from the way we defined the sets $B_z$
and $W_i$). If on the other hand $\Pr[Y_z = y] \le 2^{-(0.52 - \alpha/2) \cdot b}$ then
$$\Pr[Y_z = y \mid X \in W_i] \le \frac{\Pr[Y_z = y]}{\Pr[X \in W_i]} \le \frac{2^{-(0.52 - \alpha/2) \cdot b}}{2^{-(\alpha/2) \cdot b}} = 2^{-(0.52 - \alpha) \cdot b}.$$
Hence, for all values of y we have $\Pr[Y_z = y \mid X \in W_i] \le 2^{-(0.52 - \alpha) \cdot b}$. We can
therefore conclude that for all $i \in [t]$, $H^{\infty}(Y_z \mid X \in W_i) \ge (0.52 - \alpha) \cdot b$. This
completes the proof of Lemma 2.3.

3 Proving Claim 2.5 and Claim 2.6
Using Results from Additive Number Theory
In this section we prove Claim 2.5 and Claim 2.6. These two claims are the only
place in which our analysis differs from that of [6] and [4]. In the proofs we use
two results from additive number theory. The first is a quantitative version of
Roth's theorem [8] given by Bourgain [3]. The second is a lemma of Bourgain [2]
that deals with sum-sets and difference-sets.
9
Consider the extreme case where there is only one $x_0 \in \{0, 1\}^{n \cdot k}$ with $\pi(x_0) = \pi'$.
In this case the min-entropy of Y, when conditioned on the event $X \in \{x_0\}$, is zero,
even if the weight of $\pi(x_0)$ is small.
3.1 Proof of Claim 2.5
The proof of the claim relies on the following result from additive number theory
due to Bourgain [2]¹⁰.
Lemma 3.1 ([2]) For every $\varepsilon > 0$ there exists a constant $C_\varepsilon$ such that the
following holds: Let A, B be subsets of an abelian group G. Let $\Gamma \subset A \times B$, and
define
$$S \triangleq \{a + b \mid (a, b) \in \Gamma\}, \qquad D \triangleq \{a - b \mid (a, b) \in \Gamma\}.$$
Suppose that there exists $K > 0$ such that $|A|, |B|, |S| \le K$; then
$$|D| < C_\varepsilon \cdot K^{(1/0.52) + \varepsilon}.$$
Before we can apply Lemma 3.1 we need some notation. Let
$U \triangleq B_{z_1} \cap B_{z_2} \cap B_{z_3}$. We define for every i = 1, 2, 3 the set
$V_i \triangleq \{Y_{z_i}(x) \mid x \in U\}$. Next, we define a subset $\Gamma \subset V_1 \times V_3$ as follows
$$\Gamma \triangleq \{(v_1, v_3) \mid \exists x \in U \text{ s.t. } Y_{z_1}(x) = v_1 \text{ and } Y_{z_3}(x) = v_3\}.$$
We now define the sets S and D as in Lemma 3.1, where the roles of A and B
are taken by $V_1$ and $V_3$.
$$S \triangleq \{v_1 + v_3 \mid (v_1, v_3) \in \Gamma\}, \qquad D \triangleq \{v_1 - v_3 \mid (v_1, v_3) \in \Gamma\}.$$
We also define
$$K \triangleq 2^{(0.52 - \alpha/2) \cdot b},$$
and
$$\hat{U} \triangleq \{x_1 \in \{0, 1\}^n \mid \exists x_2, \ldots, x_k \in \{0, 1\}^n \text{ s.t. } (x_1, \ldots, x_k) \in U\}.$$
The following claim states several facts that, when combined, will enable us
to use Lemma 3.1 on the sets we have defined.
Claim 3.2 The following is true:
1. $|V_1|, |V_2|, |V_3| \le K$.
2. $|S| \le |V_2| \le K$.
3. $|\hat{U}| \le |D|$.
10
This result appears in [2] with respect to subsets of a torsion-free group. However,
it is easily seen from the proof that it holds for subsets of any abelian group.

Proof. 1. Follows directly from the definition of the sets $B_{z_i}$ and $V_i$. Each value
$v \in V_i$ is a "heavy element" of the random variable $Y_{z_i}$. That is, the probability
that $Y_{z_i} = v$ is at least $2^{-(0.52 - \alpha/2) \cdot b} = K^{-1}$, and so there can be at
most K such values.
2. What we will show is that the set S is contained in the set
$2V_2 \triangleq \{2 \cdot v \mid v \in V_2\}$ (these two sets are actually equal, but we will not need
this fact). To see this, recall that from the definition of a good triplet we have that
for every $x \in \{0, 1\}^{n \cdot k}$
$$Y_{z_1}(x) + Y_{z_3}(x) = 2 \cdot Y_{z_2}(x). \tag{1}$$
Let $v \in S$. From the definition of S (and of Γ) we know that there exist
$x \in U$ and $v_1 \in V_1$, $v_3 \in V_3$ such that $Y_{z_1}(x) = v_1$, $Y_{z_3}(x) = v_3$ and
$v = v_1 + v_3$. From Eq. 1 we now see that $v = 2 \cdot Y_{z_2}(x)$, and therefore $v \in 2V_2$.
The inequality now follows from the fact that $|V_2| = |2V_2|$.
3. This follows in a similar manner to 2. We will show that the set $\hat{U}$ is contained
in the set $c \cdot D \triangleq \{c \cdot v \mid v \in D\}$, for some $0 < c < p$ (again, the two sets
are actually equal, but we will not use this fact). From the definition of
a good triplet we know that there exists $0 < c < p$ such that for every
$x = (x_1, \ldots, x_k) \in \{0, 1\}^{n \cdot k}$
$$c \cdot (Y_{z_1}(x) - Y_{z_3}(x)) = x_1. \tag{2}$$
Let $x_1 \in \hat{U}$. From the definition of $\hat{U}$ we know that there exist
$x_2, \ldots, x_k \in \{0, 1\}^n$ such that $x = (x_1, \ldots, x_k) \in U$. From Eq. 2 it follows that
$x_1 \in c \cdot D$, since $Y_{z_1}(x) - Y_{z_3}(x) \in D$ by definition.
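Identities (1) and (2) are easy to confirm numerically. In the sketch below the constant c is taken to be the inverse of −2a modulo p, which follows from $Y_{z_1}(x) - Y_{z_3}(x) = (z_{11} - z_{31}) \cdot x_1 = -2a \cdot x_1$ and assumes p is an odd prime; the helper names are hypothetical, and the three-argument `pow` with exponent −1 requires Python 3.8 or later.

```python
from typing import List, Sequence, Tuple


def linear_output(z: Sequence[int], x: Sequence[Sequence[int]],
                  p: int) -> List[int]:
    """Y_z(x) = sum_i z_i * x_i over GF(p), coordinate by coordinate."""
    return [sum(zi * xi[j] for zi, xi in zip(z, x)) % p
            for j in range(len(x[0]))]


def check_identities(z1: Sequence[int], a: int,
                     x: Sequence[Sequence[int]], p: int) -> Tuple[bool, bool]:
    """Verify Eq. (1) and Eq. (2) for the good triplet generated by z1 and a."""
    tail = list(z1[1:])
    z2 = [(z1[0] + a) % p] + tail
    z3 = [(z1[0] + 2 * a) % p] + tail
    y1, y2, y3 = (linear_output(z, x, p) for z in (z1, z2, z3))
    # c is the inverse of -2a in GF(p); it exists when p is an odd prime.
    c = pow((-2 * a) % p, -1, p)
    eq1 = all((u + w) % p == (2 * v) % p for u, v, w in zip(y1, y2, y3))
    eq2 = all((c * (u - w)) % p == x1j % p
              for u, w, x1j in zip(y1, y3, x[0]))
    return eq1, eq2
```

For p = 7, $z_1 = (1, 3)$ and a = 2 the constant is c = 5, and both identities hold for every choice of the source blocks.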
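Let $\varepsilon \triangleq \frac{\alpha}{4} \cdot \frac{1}{(0.52 - \alpha/2)}$. From the first two parts of Claim 3.2 we see that we can
apply Lemma 3.1 with $A = V_1$ and $B = V_3$ to get that $|D| < C_\varepsilon \cdot K^{(1/0.52) + \varepsilon}$
(where $C_\varepsilon$ depends only on ε, which in turn depends only on α). Substituting
the values of ε and K we see that (we can write $C_\alpha$ instead of $C_\varepsilon$)
$$|D| < C_\alpha \cdot 2^{b \cdot (0.52 - \alpha/2) \cdot (1/0.52 + \varepsilon)} = C_\alpha \cdot 2^{b \cdot \left(1 + (0.52 - \alpha/2)\varepsilon - \frac{\alpha}{2} \cdot \frac{1}{0.52}\right)} = C_\alpha \cdot 2^{b \cdot \left(1 + \frac{\alpha}{4} - \frac{\alpha}{2} \cdot \frac{1}{0.52}\right)} \le C_\alpha \cdot 2^{b \cdot \left(1 - \frac{\alpha}{4}\right)}, \tag{3}$$
where $C_\alpha$ depends only on α.
Using the third part of Claim 3.2 and Eq. 3 we conclude that
$|\hat{U}| \le |D| \le C_\alpha \cdot 2^{b \cdot (1 - \frac{\alpha}{4})}$. We can therefore bound the probability of U by
$$\Pr_X[X \in U] \le \Pr_{X_1}[X_1 \in \hat{U}] \le 2^{-b} \cdot |\hat{U}| \le 2^{-b} \cdot \left( C_\alpha \cdot 2^{b \cdot (1 - \frac{\alpha}{4})} \right) = C_\alpha \cdot 2^{-(\alpha/4) \cdot b}$$
(the second inequality follows from the fact that the min-entropy of $X_1$ is at least b). This
completes the proof of Claim 2.5.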

3.2 Proof of Claim 2.6
The claim follows from Roth's theorem [8] on arithmetic progressions of length
three. For our purposes we require the quantitative version of this theorem as
proven by Bourgain [3].
Theorem 3.3 ([3]) Let $\delta > 0$, let $N \ge \exp(\delta^{-2})$ and let $A \subset \{1, \ldots, N\}$ be
a set of size at least δN. Then A contains an arithmetic progression of length
three.
Each element in T is a vector in $F^k$ (recall that F = GF(p)). A simple counting
argument shows that T must contain a subset $T'$ such that (a) $|T'| > (\gamma/2) \cdot p$.
(b) All vectors in $T'$ are identical in all coordinates other than the first. Using
Theorem 3.3 and using the fact that p was chosen to be greater than $\exp(\gamma^{-2})$,
we conclude that there exists a triplet in $T'$ such that the first coordinates of
this triplet form an arithmetic progression. This is a good triplet, since in $T'$ the
vectors are identical in all coordinates other than the first.
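The counting argument can be mirrored by a brute-force search on small instances: group the seeds by all coordinates other than the first, then look for a three-term progression among the first coordinates of one group. This is an illustrative sketch only; it searches for progressions modulo p (a superset of the integer progressions guaranteed by Theorem 3.3), it makes no attempt at efficiency, and the name `find_good_triplet` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Seed = Tuple[int, ...]


def find_good_triplet(T: Iterable[Seed],
                      p: int) -> Optional[Tuple[Seed, Seed, Seed]]:
    """Return a good triplet contained in T, or None if there is none."""
    # Bucket seeds by their coordinates 2..k; keep the first coordinates.
    groups: Dict[Seed, Set[int]] = defaultdict(set)
    for z in T:
        groups[tuple(z[1:])].add(z[0] % p)
    for tail, firsts in groups.items():
        for start in sorted(firsts):
            for a in range(1, p):
                mid, end = (start + a) % p, (start + 2 * a) % p
                if mid in firsts and end in firsts:
                    return ((start,) + tail, (mid,) + tail, (end,) + tail)
    return None
```

By pigeonhole, a set T of density above γ/2 must have some bucket whose first coordinates have density above γ/2 in GF(p), which is the subset to which the quantitative Roth theorem is applied.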

Acknowledgements
The authors would like to thank Ran Raz, Omer Reingold and Avi Wigderson for
helpful conversations. A.S. would also like to thank Oded Goldreich for helpful
discussions on related problems.
References
1. Boaz Barak, Russell Impagliazzo, and Avi Wigderson. Extracting randomness
using few independent sources. In 45th Symposium on Foundations of Computer
Science (FOCS 2004), pages 384–393, 2004.
2. Jean Bourgain. On the dimension of Kakeya sets and related maximal inequalities.
Geom. Funct. Anal., (9):256–282, 1999.
3. Jean Bourgain. On triples in arithmetic progression. Geom. Funct. Anal., (9):968–
984, 1999.
4. Zeev Dvir and Ran Raz. Analyzing linear mergers. Electronic Colloquium on
Computational Complexity (ECCC), (025), 2005.
5. Timothy Gowers. A new proof of Szemerédi’s theorem. Geom. Funct. Anal.,
(11):465–588, 2001.
6. Chi-Jen Lu, Omer Reingold, Salil Vadhan, and Avi Wigderson. Extractors: optimal
up to constant factors. In STOC ’03: Proceedings of the thirty-fifth annual ACM
symposium on Theory of computing, pages 602–611. ACM Press, 2003.
7. Ran Raz. Extractors with weak random seeds. STOC 2005 (to appear), 2005.
8. Klaus F Roth. On certain sets of integers. J. Lond. Math. Soc., (28):104–109, 1953.
9. Ronen Shaltiel. Recent developments in extractors. Bulletin of the European
Association for Theoretical Computer Science, 77:67–95, 2002.
10. Endre Szemerédi. On sets of integers containing no k elements in arithmetic progression. Acta Arith., (27):299–345, 1975.
11. Amnon Ta-Shma. On extracting randomness from weak random sources (extended
abstract). In STOC ’96: Proceedings of the twenty-eighth annual ACM symposium
on Theory of computing, pages 276–285. ACM Press, 1996.

Finding a Maximum Independent Set
in a Sparse Random Graph
Uriel Feige and Eran Ofek
Weizmann Institute of Science,
Department of Computer Science and Applied Mathematics,
Rehovot 76100, Israel
{uriel.feige,eran.ofek}@weizmann.ac.il
Abstract. We consider the problem of finding a maximum independent set in a random graph. The random graph $G$ is modelled as follows. Every edge is included independently with probability $d/n$, where $d$ is some sufficiently large constant. Thereafter, for some constant $\alpha$, a subset $I$ of $\alpha n$ vertices is chosen at random, and all edges within this subset are removed. In this model, the planted independent set $I$ is a good approximation for the maximum independent set $I_{max}$, but both $I \setminus I_{max}$ and $I_{max} \setminus I$ are likely to be nonempty. We present a polynomial time algorithm that with high probability (over the random choice of the random graph $G$, and without being given the planted independent set $I$) finds a maximum independent set in $G$ when $\alpha \ge \sqrt{c_0 \log d/d}$, where $c_0$ is some sufficiently large constant independent of $d$.
1 Introduction
Let $G = (V, E)$ be a graph. An independent set $I$ is a subset of vertices which contains no edges. The problem of finding a maximum size independent set in a graph is NP-hard. Moreover, for any $\epsilon > 0$ there is no $n^{1-\epsilon}$ (polynomial time) approximation algorithm for it unless NP=ZPP [12]. The best approximation ratio currently known for maximum independent set [6] is $O(n(\log\log n)^2/(\log n)^3)$.
In light of the above mentioned negative results, one may try to design a heuristic
which performs well on typical instances. Karp [14] proposed trying to find a maximum
independent set in a random graph. However, even this problem appears to be beyond
the capabilities of current algorithms. For example, let $G_{n,1/2}$ denote the random graph on $n$ vertices obtained by choosing randomly and independently each possible edge with probability 1/2. A random $G_{n,1/2}$ graph almost surely has a maximum independent set of size $2(1 + o(1)) \log_2 n$. A simple greedy algorithm almost surely finds an independent set of size $\log_2 n$ [11]. However, there is no known polynomial time algorithm that almost surely finds an independent set of size $(1 + \epsilon) \log_2 n$ (for any $\epsilon > 0$).
To further simplify the problem, Jerrum [13] and Kučera [15] proposed a planted model in which a random graph $G_{n,1/2}$ is chosen and then a clique of size $k$ is randomly placed in the graph. (A clique in a graph $G$ is an independent set in the edge complement of $G$, hence all algorithmic results that apply to one of the problems apply to the other.) Alon, Krivelevich and Sudakov [2] gave an algorithm based on spectral techniques that almost surely finds the planted clique for $k = \Omega(\sqrt{n})$. More generally, one may extend the range of parameters of the above model by planting an independent set in $G_{n,p}$, where $p$ need not be 1/2, and may also depend on $n$. The $G_{n,p,\alpha}$ model is as follows: $n$ vertices are partitioned at random into two sets of vertices, $I$ of size $\alpha n$ and $C$ of size $(1 - \alpha)n$. No edges are placed within the set $I$, thus making it an independent set. Every other possible edge (with at least one endpoint not in $I$) is added independently at random with probability $p$. The goal of the algorithm, given the input $G$ (without being given the partition into $I$ and $C$), is to find a maximum independent set. Intuitively, as $\alpha$ becomes smaller the size of the planted independent set is closer to the probable size of the maximum independent set in $G_{n,p}$ and the problem becomes harder.

This work was supported in part by a grant from the G.I.F., the German-Israeli Foundation for Scientific Research and Development. Part of this work was done while the authors were visiting Microsoft Research in Redmond, Washington.
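The $G_{n,p,\alpha}$ model described above can be sampled directly. The following Python sketch is only an illustration under stated assumptions: the function name `sample_planted`, the adjacency-dictionary representation and the rounding of $\alpha n$ to an integer are choices made here, not part of the paper, and the quadratic loop is meant for small experiments only.

```python
import random

def sample_planted(n, p, alpha, seed=None):
    """Sample a graph from the planted model G_{n,p,alpha} (illustrative sketch).

    Returns (adj, planted): adj maps each vertex to the set of its neighbors,
    planted is the planted independent set I.
    """
    rng = random.Random(seed)
    vertices = list(range(n))
    rng.shuffle(vertices)
    # Assumption: |I| is alpha * n rounded down.
    planted = set(vertices[: int(alpha * n)])
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if u in planted and v in planted:
                continue  # no edges are placed inside I
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj, planted
```

For the sparse regime studied here one would call it with `p = d / n` for a constant `d`.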
We consider values of p as small as d/n where d is a large enough constant. A
difficulty which arises in this sparse regime (e.g. when d is constant) is that the planted
independent set I is not likely to be a maximum independent set. Moreover, with high
probability $I$ is not contained in any maximum independent set of $G$. For example, there are expected to be $e^{-d}n$ vertices in $C$ of degree one. It is very likely that two (or more)
such vertices v, w ∈ C will have the same neighbor, and that it will be some vertex
u ∈ I. This implies that every maximum independent set will contain v, w and not u,
and thus I contains vertices that are not contained in any maximum independent set.
Fig. 1. The vertex $u \in I$ is not contained in any maximum independent set because no other edges touch $v, w$.
A similar argument shows that there are expected to be $e^{-\Omega(d)}n$ isolated edges. This implies that there will be an exponential number of maximum independent sets.
1.1 Our Result
We give a polynomial time algorithm that searches for a maximum independent set of $G$. Given a random instance of $G_{n,d/n,\alpha}$, the algorithm almost surely succeeds when $d > d_0$ and $\alpha \ge \sqrt{c_0 \log d/d}$ ($d_0, c_0$ are some universal constants). The parameter $d$ can also be an arbitrary increasing function of $n$.

Remark: A significantly more complicated version of our algorithm works for a wider range of parameters, namely, for $\alpha \ge \sqrt{c_0/d}$ rather than $\alpha \ge \sqrt{c_0 \log d/d}$. The improved algorithm will appear in the full version of this paper (see footnote 1).
1.2 Related Work
For $p = 1/2$, Alon, Krivelevich and Sudakov [2] gave an efficient spectral algorithm which almost surely finds the planted independent set when $\alpha = \Omega(1/\sqrt{n})$. For the above mentioned parameters, the planted independent set is almost surely the unique maximum independent set.
A few papers deal with semi-random models which extend the planted model by enabling a mixture of random and adversarial decisions. Feige and Kilian [7] considered the following model: a random $G_{n,p,\alpha}$ graph is chosen, then an adversary may add arbitrarily many edges between $I$ and $C$, and make arbitrary changes (adding or removing edges) inside $C$. For any constant $\alpha > 0$ they give a heuristic that almost surely outputs a list of independent sets containing the planted independent set, whenever $p > (1 + \epsilon) \ln n/(\alpha n)$ (for any $\epsilon > 0$). The planted independent set may not be the only independent set of size $\alpha n$ since the adversary has full control over the edges inside $C$. Possibly, this makes the task of finding the planted independent set harder.
In [8] Feige and Krauthgamer considered a less adversarial semi-random model in which an adversary is allowed to add edges to a random $G_{n,\frac{1}{2},\frac{1}{\sqrt{n}}}$ graph. Their algorithm almost surely extracts the planted independent set and certifies its optimality.
Heuristics for optimization problems different than max independent set will be
discussed in the following section.
Technique and Outline of the Algorithm. Our algorithm builds on ideas from the algorithm of Alon and Kahale [1], which was used for recovering a planted 3-coloring in a random graph. The algorithm we propose has the following 3 phases:
1. Get an approximation of $I, C$ denoted by $I', C'$, OUT, where $I'$ is an independent set. The error term $|C \triangle C'| + |I \triangle I'|$ should be at most $e^{-c \log d}n$ where $c$ is a large enough universal constant (this phase is analogous to the first two phases of [1]).
2. Move to OUT vertices of $I', C'$ which have non-typical degrees. We stop when $I', C'$ become promising: every vertex of $C'$ has at least 4 edges to $I'$ and no vertex of $I'$ has edges to OUT. At this point we have a promising partial solution $I', C'$ and the error term (with respect to $I, C$) is still small. Using the fact that (almost surely) random graphs have no small dense sets, it can be shown that $I'$ is extendable: $I' \subseteq I_{max}$ for some optimal solution $I_{max}$.
3. Extend the independent set $I'$ optimally using the vertices of OUT. This is done by finding a maximum independent set in the graph induced on OUT and adding it to $I'$. With high probability the structure of OUT will be easy enough so that a maximum independent set can be efficiently found. OUT is a random graph of size $n/\mathrm{poly}(d)$. Notice, however, that the set OUT depends on the graph itself, thus we cannot argue that it is a random $G_{n/\mathrm{poly}(d),\, d/n}$ graph.
1 Can be found at http://wisdom.weizmann.ac.il/~erano

The technique of [1] was implemented successfully on various problems in the
planted model: hypergraph coloring, 3-SAT, 4-NAE, min-bisection (by Chen and Frieze
[3] , Flaxman [9] , Goerdt and Lanka [10], Coja-Oghlan [4] respectively).
Perhaps the work closest in nature to the work in the current paper is that of Amin
Coja-Oghlan [4] on finding a bisection in a sparse random graph. Both in our work and
in that of [4], one is dealing with an optimization problem, and the density of the input
graph is such that the planted solution is not an optimal solution. The algorithm for
bisection in [4] is based on spectral techniques, and has the advantage that it provides
a certificate showing that the solution that it finds is indeed optimal. Our algorithm
for maximum independent set does not use spectral techniques and does not provide a
certificate for optimality.
An important difference between planted models for independent set and those for
other problems such as 3-coloring and min-bisection is that in our case the planted
classes I, C are not symmetric. The lack of symmetry between I and C makes some
of the ideas used for the more symmetric problems insufficient. In the approach of [1],
a vertex is removed from its current color class and placed in OUT if its degree into
some other current color class is very different than what one would typically expect to
see between the two color classes. This procedure is shown to “clean” every color class
C from all vertices that should have been from a different color class, but were wrongly
assigned to class C in phase 1 of the algorithm. (The argument proving this goes as
follows. Every vertex remaining in the wrong color class by the end of phase 2 must
have many neighbors that are wrongly assigned themselves. Thus the set of wrongly
assigned vertices induces a small subgraph with large edge density. But G itself does
not have any such subgraphs, and hence by the end of phase 2 it must be the case that
all wrongly assigned vertices were moved into OUT.) It turns out that this approach works well when classes are of similar nature (such as color classes, or two sides of a bisection), but does not seem to suffice in our case where $I'$ is supposed to be an independent set whereas $C'$ is not. Specifically, the set $I'$ might still contain wrongly assigned vertices, and might not be a subset of a maximum independent set in the graph. Under these circumstances, phase 3 will not result in a maximum independent set. Our solution to this problem involves the following aspects, not present in previous work. In phase 2 we remove from $I'$ every vertex that has even one edge connecting it to OUT.
This adds more vertices to OUT and may possibly create large connected components
in OUT. Indeed, we do not show that OUT has no large connected components, which
is a key ingredient in previous approaches. Instead, we analyze the 2-core of OUT
and show that the 2-core has no large components. Then, in phase 3, we use dynamic
programming to find a maximum independent set in OUT, and use the special structure
of OUT to show that the algorithm runs in polynomial time.
1.3 Notation and Terminology
Let $G = (V, E)$ and let $U \subset V$. The subgraph of $G$ induced by the vertices of $U$ is denoted by $G[U]$. We denote by $\deg_E(v)_U$ the number of edges from $E$ that connect a vertex $v$ to $U \subset V$; when $E$ is clear from the context we will use $\deg(v)_U$. We use $\Gamma(U)$ to denote the vertex neighborhood of $U \subset V$ (excluding $U$). The parameter $d$ (specifying the expected degree in the random graph $G$) is assumed to be sufficiently large, and some of the inequalities that we shall derive implicitly use this assumption, without stating it explicitly. The term with high probability (w.h.p.) is used to denote a sequence of probabilities that tends to 1 as $n$ tends to infinity.
2 The Algorithm
FindIS(G)
1. (a) Set: $I_1 = \{v : \deg(v) < d - \alpha d/2\}$,
$C_1 = \{v : \deg(v) \ge d - \alpha d/2\}$,
$OUT_1 = \emptyset$.
(b) For every edge $(u, v)$ such that both $u, v$ are in $I_1$, move $u, v$ to $OUT_1$.
2. Set $I_2 = I_1$, $C_2 = C_1$, $OUT_2 = OUT_1$.
A vertex $v \in C_2$ is removable if $\deg(v)_{I_2} < 4$.
Iteratively: find a removable vertex $v$. Move $v$ and $\Gamma(v) \cap I_2$ to $OUT_2$.
3. Output the union of $I_2$ and a maximum independent set of $G[OUT_2]$. We will explain later how this is done efficiently.
Fig. 2. After step 2 of the algorithm, $I_2$ is an independent set, there are no edges between $I_2$ and $OUT_2$, and every vertex $v \in C_2$ has at least four neighbors in $I_2$.
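Steps 1 and 2 of FindIS can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the graph is assumed to be given as a dictionary `adj` mapping each vertex to the set of its neighbors, the name `find_is_phases` is hypothetical, and step 1(b) is read as acting simultaneously on all edges inside the initial $I_1$ (an interpretation chosen here because it leaves no edges between $I_1$ and $OUT_1$). Step 3 is not included.

```python
def find_is_phases(adj, d, alpha):
    """Steps 1 and 2 of FindIS; returns the partition (I2, C2, OUT2).

    adj: dict mapping each vertex to the set of its neighbors (assumed format).
    """
    threshold = d - alpha * d / 2
    # Step 1(a): split the vertices by degree.
    I = {v for v in adj if len(adj[v]) < threshold}
    C = set(adj) - I
    # Step 1(b): endpoints of edges inside I1 go to OUT (assumed simultaneous).
    OUT = {v for v in I if adj[v] & I}
    I -= OUT
    # Step 2: a vertex of C with fewer than 4 neighbors in I is removable.
    deg_into_I = {v: len(adj[v] & I) for v in C}
    pending = [v for v in C if deg_into_I[v] < 4]
    while pending:
        v = pending.pop()
        if v not in C:
            continue  # already moved to OUT
        C.discard(v)
        OUT.add(v)
        for u in adj[v] & I:  # the I-neighbors of v are moved as well
            I.discard(u)
            OUT.add(u)
            for w in adj[u]:
                if w in C:
                    deg_into_I[w] -= 1
                    if deg_into_I[w] < 4:
                        pending.append(w)
    return I, C, OUT
```

On any input graph the returned sets satisfy the three properties listed in the caption of Fig. 2; whether $OUT_2$ is small depends on the random model and is the subject of the analysis.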
3 Correctness
When $d > n^{32/c_0}$ a simple argument (using the union and the Chernoff bounds), which we omit, shows that after step 1a of the algorithm it holds that $I_1 = I$, $C_1 = C$. In fact, when $d > n^{32/c_0}$ the planted independent set is the unique maximum independent set and this case was already covered in [5]. From now on we will assume that $d < n^{32/c_0}$.
Let Imax be a maximum independent set of G. We establish two theorems. Theorem
1 guarantees the algorithm correctness and Theorem 2 guarantees its efficient running
time. Here we present these two theorems, and their proofs are deferred to later sections.
Theorem 1. W.h.p. there exists Imax such that I2 ⊆ Imax, C2 ∩ Imax = ∅.
Definition 1. The 2-core of a graph G is the maximal subgraph in which the minimal
degree is at least 2.

Clearly the 2-core is unique and can be found by iteratively removing vertices whose
degree is smaller than 2.
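The iterative removal just described can be written as a short peeling procedure. The sketch below assumes the graph is a dictionary `adj` from vertices to neighbor sets; the function name `two_core` is illustrative.

```python
def two_core(adj):
    """Return the vertex set of the 2-core of a graph.

    adj: dict mapping each vertex to the set of its neighbors (assumed format).
    """
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    alive = set(adj)
    stack = [v for v in adj if degree[v] < 2]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # v was already peeled off
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < 2:
                    stack.append(u)
    return alive
```

Each vertex is removed at most once and each edge is inspected a constant number of times, so the procedure runs in time linear in the size of the graph.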
Theorem 2. W.h.p. the largest connected component in the 2-core of G[OUT2] has
cardinality of at most 2 log n.
Let G be any graph. Those vertices of G that do not belong to the 2-core form trees.
Each such tree is either disconnected from the 2-core or it is connected by exactly one
edge to the 2-core. To find a maximum independent set of G[OUT2] we need to find a
maximum independent set in each connected component of G[OUT2] separately. For
each connected component Di of G[OUT2] we find the maximum independent set as
follows: let $C_i$ be the intersection of $D_i$ with the 2-core of $G[OUT_2]$. We enumerate all possible
independent sets in $C_i$ (there are at most $2^{|C_i|}$ possibilities); each one of them can be
optimally extended to an independent set of $D_i$ by solving (separately) a maximum
independent set problem on each of the trees connected to Ci. For some trees we may
have to exclude the tree vertex which is connected to Ci if it is connected to a vertex
of the independent set that we try to extend. On each tree the problem can be solved by
dynamic programming.
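The enumeration over the core combined with dynamic programming on the attached trees can be sketched as follows. This is an illustration under assumptions, not the authors' code: `adj` maps vertices to neighbor sets, `core` is the 2-core part of the component and is supplied by the caller, and the names `forest_mis` and `mis_with_core` are hypothetical. The sketch relies on the fact stated above that the vertices outside the 2-core induce a forest; tree vertices adjacent to a chosen core vertex are simply excluded before the tree problem is solved.

```python
from itertools import combinations

def forest_mis(adj, vertices):
    """Maximum independent set of the forest induced on `vertices` (tree DP)."""
    vertices = set(vertices)
    seen, chosen = set(), set()
    for root in vertices:
        if root in seen:
            continue
        # Iterative DFS: record a parent map and a parent-before-child order.
        parent, order, stack = {root: None}, [], [root]
        seen.add(root)
        while stack:
            v = stack.pop()
            order.append(v)
            for u in adj[v]:
                if u in vertices and u not in seen:
                    seen.add(u)
                    parent[u] = v
                    stack.append(u)
        # take[v] / skip[v]: best size in the subtree of v with v in / out.
        take = {v: 1 for v in order}
        skip = {v: 0 for v in order}
        for v in reversed(order):
            p = parent[v]
            if p is not None:
                take[p] += skip[v]
                skip[p] += max(take[v], skip[v])
        # Walk back down to recover one optimal set.
        picked = {}
        for v in order:
            p = parent[v]
            parent_in = p is not None and picked[p]
            picked[v] = (not parent_in) and take[v] >= skip[v]
            if picked[v]:
                chosen.add(v)
    return chosen

def mis_with_core(adj, core):
    """Maximum independent set when the vertices outside `core` induce a forest."""
    core = list(core)
    rest = set(adj) - set(core)
    best = set()
    for r in range(len(core) + 1):
        for subset in combinations(core, r):
            s = set(subset)
            if any(adj[v] & s for v in s):
                continue  # the chosen core vertices are not independent
            blocked = set().union(*(adj[v] for v in s))
            candidate = s | forest_mis(adj, rest - blocked)
            if len(candidate) > len(best):
                best = candidate
    return best
```

The running time is exponential only in the size of the core, which is why the bound of Theorem 2 on the components of the 2-core matters.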
Corollary 1. A maximum independent set of G[OUT2] can be found efficiently.
3.1 Dense Sets and Degree Deviations
In proving the correctness of the algorithm, we will use structural properties of the
random graph G. In particular, such a random graph most likely has no small dense
sets (small sets of vertices that induce many edges). This fact will be used on several
occasions to derive a proof by contradiction. Namely, certain undesirable outcomes of
the algorithm cannot occur, as otherwise they lead to a discovery of a small dense set.
The lemmas relating to these properties are rather standard and their proofs are omitted
due to lack of space.
Lemma 1. Let $G$ be a random graph taken from $G_{n,p,\alpha}$ ($p = d/n$, $d < n^{32/c_0}$). The following holds:
1. W.h.p. for every set $U \subset V$ of cardinality smaller than $2n/d^5$ the number of edges inside $U$ is bounded by $\frac{4}{3}|U|$.
2. Let $c \ge 3$. With probability of at least $1 - n^{-0.9(c-1)}$, for every set of vertices $U$ of size smaller than $n/d^2$ the number of edges inside $U$ is less than $c|U|$.
3. W.h.p. there is no $C' \subseteq C$ such that $\frac{n}{2d^5} \le |C'| \le \frac{2n \log d}{d}$ and $|\Gamma(C') \cap I| \le |C'|$.
Corollary 2. Let $G$ be a graph which has the property from Lemma 1 part 1. Let $A, B$ be any two disjoint sets of vertices each of size smaller than $n/d^5$. If every vertex of $B$ has at least 2 edges going into $A$, then $|A| \ge |B|/2$.
Lemma 2. Let $d < n^{32/c_0}$. The following hold with probability $1 - e^{-n^{0.1}}$:
1. The number of vertices from $I$ which are not in $I_1$ is at most $e^{-\alpha^2 d/64}n$.
2. The number of vertices from $C$ whose degree into $I$ is less than $\alpha d/2$ is at most $e^{-\alpha^2 d/64}n$.
3. The number of vertices from $C$ which are not in $C_1$ is at most $e^{-\alpha^2 d/64}n$.
4. The number of edges that contain a vertex with degree at least $3d$ is at most $3e^{-d}dn$.

Fig. 3. A vertex $v \in (I' \cap C)$ which has exactly one edge into $I \cap C'$.
We would have liked to prove that with high probability I2 ⊆ I ⊆ I2∪OUT2. However,
this is incorrect when d is a constant.
Lemma 3. Let $I$ be any independent set of $G$ and let $C \triangleq V \setminus I$. Let $I', C', OUT'$ be an arbitrary partition of $V$ for which $I'$ is an independent set. If the following hold:
1. $|(I' \cap C) \cup (I \cap C')| < n/d^5$.
2. Every vertex of $C'$ has at least 4 neighbors in $I'$. There are no edges between $I'$ and $OUT'$.
3. The graph $G$ has no small dense subsets as described in Lemma 1 part 1.
then there exists an independent set $I_{new}$ (and $C_{new} \triangleq V \setminus I_{new}$) such that $I' \subseteq I_{new}$, $C' \subseteq C_{new}$ and $|I_{new}| \ge |I|$.
Proof. If we could show that on average a vertex of $U = (I' \cap C) \cup (I \cap C')$ contributes at least 4/3 internal edges to $U$, then $U$ would form a small dense set that contradicts Lemma 1. This would imply that $U = (I' \cap C) \cup (I \cap C')$ is the empty set, and we could take $I_{new} = I$ in the proof of Lemma 3. The proof below extends this approach to cases where we cannot take $I_{new} = I$.
Every vertex $v \in C'$ has at least 4 edges into vertices of $I'$. Since $I$ is an independent set it follows that every vertex of $I \cap C'$ has at least 4 edges into $I' \cap C$. To complete the argument we would like to show that every vertex of $I' \cap C$ has at least 2 edges into $I \cap C'$. However, some vertices $v \in I' \cap C$ might have fewer than two neighbors in $I \cap C'$. In this case, we will modify $I$ to get an independent set $I_{new}$ (and $C_{new} \triangleq V \setminus I_{new}$) at least as large as $I$, for which every vertex of $I' \cap C_{new}$ has at least 2 neighbors in $I_{new} \cap C'$. This is done iteratively; after each iteration we set $I = I_{new}$, $C = C_{new}$. Consider a vertex $v \in (I' \cap C)$ which has strictly fewer than 2 edges into $I \cap C'$:
– If $v$ has no neighbors in $I \cap C'$, then define $I_{new} = I \cup \{v\}$. $I_{new}$ is an independent set because $v$ (being in $I'$) has no neighbors in $I'$ nor in $OUT'$.
– If $v$ has only one edge into $w \in (I \cap C')$ then define $I_{new} = (I \setminus \{w\}) \cup \{v\}$. $I_{new}$ is an independent set because $v$ (being in $I'$) has no neighbors in $I'$ nor in $OUT'$. The only neighbor of $v$ in $I \cap C'$ is $w$.
The three properties (from the statement of this lemma) are maintained also with respect to $I_{new}, C_{new}$ (replacing $I, C$): properties 2, 3 do not depend on the sets $I, C$, and property 1 is maintained since after each iteration it holds that $|(I' \cap C_{new}) \cup (I_{new} \cap C')| < |(I' \cap C) \cup (I \cap C')|$.
When the process ends, let $U$ denote $(I' \cap C) \cup (I \cap C')$. Each vertex of $I' \cap C$ has at least 2 edges into $I \cap C'$, thus $|I \cap C'| \ge \frac{1}{2}|I' \cap C|$ (see Corollary 2). Each vertex of $I \cap C'$ has at least 4 edges into $I' \cap C$, so the number of edges in $U$ is at least $4|I \cap C'| \ge 4|U|/3$ and also $|U| < n/d^5$, which implies that $U$ is empty (by Lemma 1 part 1).
To use Lemma 3 with $I = I_{max}$, $I' = I_2$, we need to show that condition 1 is satisfied, i.e. $|I_{max} \triangle I_2| < n/d^5$.
Lemma 4. With high probability $|I_{max} \triangle I_2| < n/d^5$.
Proof. $|I_{max} \triangle I_2| \le |I_{max} \triangle I| + |I \triangle I_2|$. By Lemma 2 the number of wrongly assigned vertices (in step 1a) is less than $e^{-\alpha^2 d/64}n$. By Lemma 5, $|OUT_2| < e^{-\alpha^2 d/70}n$. It follows that $|I \triangle I_2| < n/d^5$. It remains to bound $|I_{max} \triangle I|$:
$$|I_{max} \triangle I| = |I_{max} \setminus I| + |I \setminus I_{max}| \le 2|I_{max} \setminus I| = 2|I_{max} \cap C|.$$
$I_{max} = (I_{max} \cap I) \cup (I_{max} \cap C)$. One can always replace $I_{max} \cap C$ with $\Gamma(I_{max} \cap C) \cap I$ to get an independent set $(I_{max} \cap I) \cup (\Gamma(I_{max} \cap C) \cap I)$. The maximum independent set in $G[C]$ is w.h.p. of cardinality at most $\frac{2n \log d}{d}$ (the proof is standard using the first moment); this upper bounds $|I_{max} \cap C|$. From Lemma 1 part 3, if $|I_{max} \cap C| > n/(2d^5)$ then $|\Gamma(C \cap I_{max}) \cap I| > |I_{max} \cap C|$, which contradicts the maximality of $I_{max}$.
It follows that I2 is contained in some maximum independent set Imax. It remains
to show that the 2-core of OUT2 has no large connected components.
Lemma 5. With probability of at least $1 - e^{-c_1 \alpha^2 d \log n}$ the cardinality of $OUT_2$ is at most $e^{-\alpha^2 d/70}n$ (where $c_1$ is a universal constant independent of $c_0$).
Proof. We will use the assumption $d < n^{32/c_0}$. If $u, v$ were moved to $OUT_1$ in step 1b then at least one of them belongs to $C \cap I_1$. Thus:
$$|OUT_1| \le |C \cap I_1| + |\Gamma(C \cap I_1)| < e^{-\alpha^2 d/64}n + 3d\,e^{-\alpha^2 d/64}n + 3e^{-d}dn < e^{-\alpha^2 d/64 + \log d + O(1)}n$$
with probability $1 - e^{-n^{0.1}}$ (we use Lemma 2 parts 3, 4).
Let $C' = \{v \in C : \deg(v)_I < \alpha d/2\}$. The cardinality of $C'$ is at most $e^{-\alpha^2 d/64}n$ with probability at least $1 - e^{-n^{0.1}}$ (using Lemma 2 part 2). We will now show that $|OUT_2| \le 2(|OUT_1| + |C'|)$. Let $U = OUT_1 \cup C'$. Start adding to $U$ vertices of $OUT_2 \setminus U$ by the order in which the algorithm moved them to $OUT_2$. In each step we add at most 4 vertices to $U$. Assume that at some point $|U|$ becomes larger than $2(|OUT_1| + |C'|)$. The number of edges inside $U$ is at least:
$$\frac{1}{4} \cdot \frac{1}{2}\,|U|\,\alpha d/2 = \frac{\alpha d}{16}|U|. \qquad (1)$$

At this point $|U| \le e^{-\alpha^2 d/64 + \log d + O(1)}n + 4 < n/d^5$ and $U$ contains $\frac{\alpha d}{16}|U|$ edges. By Lemma 1 part 2 the probability that $G$ contains such a dense set $U$ is at most $e^{-0.9 \log n (\alpha d/16 - 1)} \le e^{-c_1 \alpha^2 d \log n}$ (we use here $\alpha d > \alpha^2 d$).
Having established that OUT2 is small, we would now like to establish that its
structure is simple enough to allow one to ﬁnd a maximum independent set of G[OUT2]
in polynomial time. Establishing such a structure would have been easy if the vertices
of OUT2 were chosen independently at random, because a small random subgraph
of a random graph G is likely to decompose into connected components no larger than
O(log n). However, OUT2 is extracted from G using some deterministic algorithm, and
hence might have more complicated structure. For this reason, we shall now consider
the 2-core of G[OUT2] and bound the size of its connected components.
Let A denote the 2-core of G[OUT2]. In order to show that A has no large connected
component, it is enough to show that A has no large tree. We were unable to show such
a result for a general tree. Instead, we prove that A has no large balanced trees, that is
trees in which at least 1/3 fraction of the vertices belong to C. Fortunately, this turns
out to be enough. Any set of vertices $U \subset V$ is called balanced if it contains at least $|U|/3$ vertices from $C$. We use the following reasoning: any maximal connected component of $A$ is balanced (see Lemma 6 below). Furthermore, any balanced connected component of size at least $2 \log n$ (in vertices) must contain a balanced tree whose size is in $[\log n, 2 \log n - 1]$ (see Lemma 7). We then complete the argument by showing that $OUT_2$ does not contain a balanced tree with size in $[\log n, 2 \log n]$.
Lemma 6. Every maximal connected component of the 2-core of OUT2 is balanced.
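Proof. Let $A_i$ be such a maximal connected component. Every vertex of $A_i$ has degree of at least 2 in $A_i$ because $A_i$ is a maximal connected component of a 2-core. $|A_i| \le |OUT_2| < n/d^5$. If $|A_i \cap I|/|A_i|$ is more than $\frac{2}{3}$, then the number of internal edges in $A_i$ is more than $2 \cdot \frac{2}{3}|A_i| = \frac{4}{3}|A_i|$ (each vertex of $A_i \cap I$ has at least 2 edges in $A_i$, and since $I$ is independent no edge is counted twice), which contradicts Lemma 1.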
Lemma 7. Let $G$ be a connected graph whose vertices are partitioned into two sets: $C$ and $I$. Let $\frac{1}{k}$ be a lower bound on the fraction of $C$ vertices, where $k$ is an integer. For any $1 \le t \le |V(G)|/2$ there exists a tree whose size is in $[t, 2t - 1]$ and at least a $\frac{1}{k}$ fraction of its vertices are in $C$.
Proof. We use the following well-known fact: any tree $T$ contains a center vertex $v$ such that each subtree hanging from $v$ contains at most half of the vertices of $T$.
Let $T$ be an arbitrary spanning tree of $G$, with center $v$. We proceed by induction on the size of $T$. Consider the subtrees $T_1, \dots, T_m$ hanging from $v$. If there exists a subtree $T_j$ with at least $t$ vertices then $T \setminus T_j$ also has at least $t$ vertices. In at least one of $T_j$, $T \setminus T_j$ the fraction of $C$ vertices is at least $\frac{1}{k}$ and the lemma follows by induction on it. Consider now the case in which all the subtrees have fewer than $t$ vertices. If in some subtree $T_j$ the fraction of $C$ vertices is at most $\frac{1}{k}$, then we remove it and apply induction to $T \setminus T_j$. The remaining case is that in all the subtrees the fraction of $C$ vertices is strictly more than $\frac{1}{k}$. In this case we start adding subtrees to the root $v$ until for the first time the number of vertices is at least $t$. At this point we have a tree with at most $2t - 1$ vertices and the fraction of $C$ vertices is at least $\frac{1}{k}$. To see that the fraction of $C$ vertices is at least $\frac{1}{k}$, we only need to prove that the tree formed by $v$ and the first subtree has a $\frac{1}{k}$ fraction of $C$ vertices. Let $r$ be the number of $C$ vertices in the first subtree and let $b$ be the number of vertices in it. Since $k$ is an integer we have: $\frac{r}{b} > \frac{1}{k} \implies \frac{r}{b+1} \ge \frac{1}{k}$.
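The center vertex used in the proof of Lemma 7 is what is usually called a centroid of the tree: a vertex whose removal leaves only subtrees with at most half of the vertices. The following sketch finds one by computing subtree sizes; the function name `tree_center` and the adjacency-dictionary input are assumptions made for this illustration, and the input is assumed to be a nonempty tree.

```python
def tree_center(adj):
    """Return a vertex v of a tree such that every subtree hanging from v
    has at most half of the vertices.

    adj: dict mapping each vertex to an iterable of its neighbors (a tree).
    """
    root = next(iter(adj))
    parent, order, stack = {root: None}, [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:  # in a tree the only visited neighbor is the parent
                parent[u] = v
                stack.append(u)
    # Subtree sizes with respect to the chosen root, children before parents.
    size = {v: 1 for v in order}
    for v in reversed(order):
        if parent[v] is not None:
            size[parent[v]] += size[v]
    n = len(order)
    for v in order:
        parts = [size[u] for u in adj[v] if u != parent[v]]
        if parent[v] is not None:
            parts.append(n - size[v])  # the part containing the root
        if all(2 * part <= n for part in parts):
            return v
```

A centroid always exists in a tree, so the final loop returns for every valid input.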
We shall now prove that $OUT_2$ contains no balanced tree of size in $[\log n, 2 \log n]$. To simplify the following computations, we will assume that $\alpha < 1/2$ (the proof can be modified to work for any $\alpha$; we omit the details). Fix $t$ to be some value in $[\log n, 2 \log n]$. We will use the fact that the number of spanning trees of a $t$-clique is $t^{t-2}$ (Cayley’s formula). The probability that $OUT_2$ contains a balanced tree of size $t$ is at most:
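$$\sum_{T \text{ balanced},\ |T| = t} \Pr[T \subseteq E] \cdot \Pr[V(T) \subseteq OUT_2 \mid T \subseteq E] \le \qquad (2)$$
$$\binom{n}{t}\, t^{t-1} \left(\frac{d}{n}\right)^{t-1} \max_{T \text{ balanced},\ |T| = t} \{\Pr[V(T) \subseteq OUT_2 \mid T \subseteq E]\} \le \qquad (3)$$
$$n\,(ed)^{t} \max_{T \text{ balanced},\ |T| = t} \{\Pr[V(T) \subseteq OUT_2 \mid T \subseteq E]\} \le \qquad (4)$$
$$e^{t(\log d + 1) + \log n} \max_{T \text{ balanced},\ |T| = t} \{\Pr[V(T) \subseteq OUT_2 \mid T \subseteq E]\} \qquad (5)$$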
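The passage from (3) to (5) can be checked with the standard estimate $\binom{n}{t} \le (en/t)^t$. The text does not say which estimate is used, so the following is one possible route rather than the authors' own calculation; it only needs $d \ge 1$ and $t \ge 1$:

```latex
\binom{n}{t}\, t^{t-1} \left(\frac{d}{n}\right)^{t-1}
  \le \left(\frac{en}{t}\right)^{t} t^{t-1} \left(\frac{d}{n}\right)^{t-1}
  = \frac{n\, e^{t} d^{t-1}}{t}
  \le n\,(ed)^{t}
  = e^{t(\log d + 1) + \log n}.
```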
To upper bound the above expression by $o(1/\log n)$ (so we can use a union bound over all choices of $t$), it is enough to prove that for some universal constant $c_1$ and any fixed balanced tree $T$ of size $t$ it holds that:
$$\Pr[V(T) \subseteq OUT_2 \mid T \subseteq E] \le e^{-c_1 \alpha^2 d t}.$$
The last term is bounded by $e^{-c_1 c_0 t \log d}$ because $\alpha^2 d \ge c_0 \log d$. By choosing $c_0 > 3/c_1$ we derive the required bound. We will use the following equality
$$\Pr_E[V(T) \subseteq OUT_2 \mid T \subseteq E] = \Pr_E[V(T) \subseteq OUT_2(E \cup T)], \qquad (6)$$
which is true because the distribution of $E$ given that $T \subseteq E$ is exactly the distribution of $E \cup T$. We have to show that for any balanced tree $T$ of size $t$, $\Pr[V(T) \subseteq OUT_2(E \cup T)] \le e^{-c_1 \alpha^2 d t}$. We use the technique introduced in [1] and modify it to our setting. To simplify the notation we will use $F$ to denote the algorithm FindIS defined in Section 2. Given a fixed balanced tree $T$ of size $\le 2 \log n$, we define an intermediate algorithm $F'$ that knows the partition $I, C$ and also $I(T)$ (which is $I \cap V(T)$). $F'$ has no knowledge of which vertices are in $C(T)$ (in fact they can be chosen after the algorithm is run). $F'$ is identical to $F$ except for the following difference: after step 1(a), it uses a step 1'(a) that puts all vertices of $I(T)$ and all vertices of $C \cap I_1$ in $OUT_1$; it then continues to step 1(b), after which it also moves from $I_1$ (to $OUT_1$) all the vertices connected to $OUT_1$ (step 1(c)).
We use $OUT'_2(E)$ to denote $OUT_2$ in the outcome of $F'(E)$. As we shall see, $F'$ dominates $F$ in the following sense: $OUT'_2(E)$ contains $OUT_2(E \cup T)$. Nevertheless, since $F'$ has no knowledge of $C(T)$, the set $C(T)$ is as likely to be in $OUT'_2(E)$ as any other subset of $C$ with the same size (where the random variable is $E$). An argument similar to the one in Lemma 5 (which we omit) shows that $OUT'_2(E)$ is likely to be small: $|OUT'_2(E)| < e^{-\alpha^2 d/70}n$ with probability $1 - e^{-c_1 \alpha^2 d \log n}$. We get:
$$\Pr_E[C(T) \subseteq OUT_2(E \cup T)] \le \Pr_E[C(T) \subseteq OUT'_2(E)] \le \Pr_E\left[C(T) \subseteq OUT'_2(E) \;\middle|\; |OUT'_2(E) \cap C| < e^{-\alpha^2 d/70}n\right] + e^{-c_1 \alpha^2 d \log n}.$$
Given that $|OUT'_2(E) \cap C| = m$, the set $OUT'_2(E) \cap C$ is just a random subset of $C$ of size $m$. It then follows that $\Pr[C(T) \subseteq OUT'_2(E) \mid |OUT'_2(E) \cap C| < e^{-\alpha^2 d/70}n]$ is bounded by the probability that a binomial random variable $X \sim \mathrm{Bin}(m, p = \frac{|C(T)|}{|C| - m})$ has $|C(T)|$ successes. Since $m \le e^{-\alpha^2 d/70}n$ and $|C(T)| \ge t/3$ this probability is bounded by:

m
t/3

pt/6
≤

me
t/3
·
t/3
|C| − m
t/3
≤
1
e−α2
d/70+1
n
n/3
2t/3
≤ e−c1α2
dt
.
In the second inequality we use $(1 - \alpha)n - m > n/3$ which is true for $\alpha < 1/2$. (If we let $(1 - \alpha)$ be too small we get a small factor in the denominator which becomes significant. Nevertheless, a more careful estimation in equation (3) yields a factor that cancels it out.) It follows that $\Pr_E[V(T) \subseteq OUT_2(E \cup T)] < e^{-c_1 \alpha^2 d t}$ as needed. It remains to show that $OUT_2(E \cup T) \subseteq OUT'_2(E)$. This is done in Lemma 8.
Lemma 8. $OUT_2(E \cup T) \subseteq OUT'_2(E \cup T) \subseteq OUT'_2(E)$.
Proof. There are three executions that we consider:
(i) $F(E \cup T)$ which produces $I_1^{(i)}, C_1^{(i)}$ in step 1 and $I_2^{(i)}, C_2^{(i)}$ in step 2,
(ii) $F'(E \cup T)$ which produces $I_1^{(ii)}, C_1^{(ii)}$ in step 1 and $I_2^{(ii)}, C_2^{(ii)}$ in step 2,
(iii) $F'(E)$ which produces $I_1^{(iii)}, C_1^{(iii)}$ in step 1 and $I_2^{(iii)}, C_2^{(iii)}$ in step 2.
First we analyse step 1 and show that: $I_1^{(i)} \supseteq I_1^{(ii)} \supseteq I_1^{(iii)}$ and $C_1^{(i)} \supseteq C_1^{(ii)} \supseteq C_1^{(iii)}$. The inclusions $I_1^{(i)} \supseteq I_1^{(ii)}$, $C_1^{(i)} \supseteq C_1^{(ii)}$ are easy, as after step 1(a) they are in fact equalities, and in steps 1'(a), 1(b), 1(c) $F'$ removes more vertices to $OUT_1$ than $F$ removes in 1(b). We now prove the inclusions $I_1^{(ii)} \supseteq I_1^{(iii)}$, $C_1^{(ii)} \supseteq C_1^{(iii)}$. The only difference (due to $T$ edges) in step 1(a) is that some vertices of $I_1^{(iii)} \cap V(T)$ will be put in $C_1^{(ii)}$ (instead of being in $I_1^{(ii)}$). This does not pose a problem since anyway all the vertices of $I_1^{(iii)} \cap V(T)$ are moved to $OUT_1$ by $F'$ at step 1'(a) (because $I_1^{(iii)} \cap V(T)$ is contained in $I(T) \cup (C \cap I_1^{(iii)})$). Any two vertices which are removed in step 1(b) from $I_1^{(ii)}$ will be removed from $I_2^{(iii)}$ either in step 1(b) or in step 1(c).
Given the inclusions of step 1 we are ready to prove the inclusions of step 2: $I_2^{(i)} \supseteq I_2^{(ii)} \supseteq I_2^{(iii)}$ and $C_2^{(i)} \supseteq C_2^{(ii)} \supseteq C_2^{(iii)}$. Notice that these inclusions imply the lemma. We begin with $I_2^{(i)} \supseteq I_2^{(ii)}$, $C_2^{(i)} \supseteq C_2^{(ii)}$. Consider the execution of step 2 of $F(E \cup T)$. We show a parallel execution of step 2 of $F'(E \cup T)$, for which the invariant $I_2^{(i)} \supseteq I_2^{(ii)}$, $C_2^{(i)} \supseteq C_2^{(ii)}$ is kept. It is enough to show one possible execution of step 2 of $F'(E \cup T)$, because the order by which vertices are removed does not affect the final outcome (once a vertex becomes removable it will be removed). Initially, the required inclusions are true due to the inclusions after step 1. Whenever a vertex $v \in C_2^{(i)}$ is removed, it becomes removable from $C_2^{(ii)}$ and we remove it (if it had already been removed then also its neighbors from $I_2^{(ii)}$ had been removed because $F'$ ensures there are no edges between $I_2$ and $OUT_2$). Proving the inclusions $I_2^{(ii)} \supseteq I_2^{(iii)}$, $C_2^{(ii)} \supseteq C_2^{(iii)}$ is done in a similar way, only that now we have the edges of $T$ which influence the execution of $F'(E \cup T)$, but do not exist in the execution of $F'(E)$. Again, whenever a vertex $v \in C_2^{(ii)}$ is removed, it becomes removable from $C_2^{(iii)}$ and we remove it (if it was not already moved to $OUT_2$). Let $u$ be a neighbor of $v$ from $I_2^{(ii)}$ that is moved together with $v$ to $OUT_2$. If $(u, v) \in E$ then $u$ cannot stay in $I_2^{(iii)}$ as there are no edges between $I_2$ and $OUT_2$ in $F'$. If $(u, v) \in T$ then $u$ is either in $I_2^{(iii)} \cap C(T)$ or in $I_2^{(iii)} \cap I(T)$. In both cases $u$ belonged to either $I_1^{(iii)} \cap C$ or $I(T)$ and was already removed in step 1'(a).
References
1. N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM
Journal on Computing, 26(6):1733–1748, 1997.
2. N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph.
Random Structures and Algorithms, 13(3-4):457–466, 1998.
3. H. Chen and A. Frieze. Coloring bipartite hypergraphs. IPCO 1996, 345–358.
4. A. Coja-Oghlan. A spectral heuristic for bisecting random graphs. SODA 2005, 850–859.
5. A. Coja-Oghlan. Finding Large Independent Sets in Polynomial Expected Time. STACS
2003, 511–522.
6. U. Feige. Approximating maximum clique by removing subgraphs. Siam J. on Discrete
Math., 18(2):219–225, 2004.
7. U. Feige and J. Kilian. Heuristics for semirandom graph problems. Journal of Computer
and System Sciences, 63(4):639–671, 2001.
8. U. Feige and R. Krauthgamer. Finding and certifying a large hidden clique in a semirandom
graph. Random Structures and Algorithms, 16(2):195–208, 2000.
9. A. Flaxman. A spectral technique for random satisfiable 3cnf formulas. SODA 2003, 357–
363.
10. A. Goerdt and A. Lanka. On the hardness and easiness of random 4-sat formulas. ISAAC
2004, 470–483.
11. G. Grimmett and C. McDiarmid. On colouring random graphs. Math. Proc. Cam. Phil. Soc.,
77:313–324, 1975.
12. J. Håstad. Clique is hard to approximate within $n^{1-\epsilon}$. Acta Mathematica, 182(1):105–142,
1999.
13. M. Jerrum. Large cliques elude the Metropolis process. Random Structures and Algorithms,
3(4):347–359, 1992.
14. R. M. Karp. The probabilistic analysis of some combinatorial search algorithms. In J. F.
Traub, editor, Algorithms and Complexity: New Directions and Recent Results, pages 1–19.
Academic Press, New York, 1976.
15. L. Kučera. Expected complexity of graph partitioning problems. Discrete Appl. Math.,
57(2-3):193–212, 1995.

On the Error Parameter of Dispersers
Ronen Gradwohl¹, Guy Kindler², Omer Reingold¹,*, and Amnon Ta-Shma³
¹ Department of Computer Science and Applied Math, Weizmann Institute of Science
{ronen.gradwohl,omer.reingold}@weizmann.ac.il
² Institute for Advanced Study, Princeton
gkindler@ias.edu
³ School of Computer Science, Tel-Aviv University
amnon@tau.ac.il
Abstract. Optimal dispersers have better dependence on the error than optimal
extractors. In this paper we give explicit disperser constructions that beat the best
possible extractors in some parameters. Our constructions are not strong, but we
show that having such explicit strong constructions implies a solution to the Ram-
sey graph construction problem.
1 Introduction
Extractors [8, 18] and dispersers [13] are combinatorial structures with many random-like
properties¹. Extractors are functions that take two inputs, a string that is not uniformly
distributed but has some randomness, and a shorter string that is completely random, and
output a string that is close to being uniformly distributed. Dispersers can be seen as a
weakening of extractors: they take the same input, but output a string whose distribution is
only guaranteed to have a large support. Both objects have found many applications, including
simulation with weak sources, deterministic amplification, construction of depth-two
super-concentrators, hardness of approximating clique, and much more [7]. For nearly all
applications, explicit constructions are required.
Extractors and dispersers have several parameters: the longer string is called the input
string, and its length is denoted by $n$, whereas the shorter random string is called the
seed, and its length is denoted by $d$. Additional parameters are the output length $m$, the
amount of required entropy in the source $k$, and the error parameter $\epsilon$. The two
parameters that are of central interest for this paper are the seed length $d$ and the entropy
loss $\Lambda = k + d - m$, a measure of how much randomness is lost by an application of
the disperser/extractor.
The best extractor construction (which is known to exist by nonconstructive probabilistic
arguments) has seed length $d = \log(n-k) + 2\log\frac{1}{\epsilon} + \Theta(1)$ and entropy loss
$\Lambda = \log\frac{1}{\epsilon} + \Theta(1)$, and this was shown to be tight by [10]. However, the situation for
optimal dispersers is different. It can be shown non-explicitly that there exist dispersers
with seed length $d = \log(n-k) + \log\frac{1}{\epsilon} + \Theta(1)$ and entropy loss
$\Lambda = \log\log\frac{1}{\epsilon} + \Theta(1)$, and, again, this was shown to be tight by [10]. In particular we
see that optimal dispersers can have shorter seed length and significantly smaller entropy
loss than the best possible extractors. Up to this work, however, all such explicit
constructions also incur a significant cost in other parameters.
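To get a feel for the gap between the two sets of optimal bounds, the following minimal sketch evaluates the formulas quoted above for concrete numbers. It is only an illustration: the $\Theta(1)$ terms are dropped (an assumption made here for simplicity), and the function names are hypothetical.

```python
import math

def optimal_extractor_params(n, k, eps):
    """Seed length and entropy loss as quoted above, ignoring the Theta(1) terms."""
    seed = math.log2(n - k) + 2 * math.log2(1 / eps)
    loss = math.log2(1 / eps)
    return seed, loss

def optimal_disperser_params(n, k, eps):
    """Seed length and entropy loss as quoted above, ignoring the Theta(1) terms."""
    seed = math.log2(n - k) + math.log2(1 / eps)
    loss = math.log2(math.log2(1 / eps))
    return seed, loss

if __name__ == "__main__":
    n, k, eps = 1024, 512, 2 ** -20
    print("extractor:", optimal_extractor_params(n, k, eps))
    print("disperser:", optimal_disperser_params(n, k, eps))
```

For $n = 1024$, $k = 512$ and $\epsilon = 2^{-20}$, the disperser formulas give a seed that is shorter by $\log\frac{1}{\epsilon} = 20$ bits and an entropy loss of $\log 20$ bits instead of 20.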

* Research supported by US-Israel Binational Science Foundation Grant 2002246.
¹ For formal definitions, see Sect. 2.
1.1 Our Results
Ta-Shma and Zuckerman [17], and Reingold, Vadhan, and Wigderson [12] gave a way
to construct dispersers with small dependence on $\epsilon$ via an error reduction technique².
Their approach requires a disperser for high min-entropy sources, and therefore our
first result is a good disperser for such sources. Next, we show a connection between
dispersers with optimal dependence on $\epsilon$ and constructions of bipartite Ramsey graphs³.
This connection holds for dispersers that work well for low min-entropies, and so our
second construction focuses on this range.
A Good Disperser for High Entropies. Our goal is to construct a disperser for very
high min-entropies ($k = n - 1$) but with very small error $\epsilon$, as well as the correct entropy
loss of $\log\log\frac{1}{\epsilon}$ and seed length of $\log\frac{1}{\epsilon}$.
One high min-entropy disperser associates the input string with a vertex of an expander
graph (see Definition 2.8), and the output with the vertex reached after taking
one step on the graph [4]. This disperser has the optimal $d = \log\frac{1}{\epsilon}$, but its entropy loss
$\Lambda = d = \log\frac{1}{\epsilon}$ is exponentially larger than the required $\log\log\frac{1}{\epsilon}$.
A better disperser construction, in terms of the entropy loss, was given by Reingold,
Vadhan, and Wigderson [12], who constructed high min-entropy dispersers in which
the degree and entropy loss are nearly optimal. Their constructions are based on the
extractors obtained by the Zig-Zag graph product, but the evaluation time includes factors
of $\mathrm{poly}(\frac{1}{\epsilon})$ or even $2^{1/\epsilon}$, so they are inefficient for super-polynomially small error. The
reason these constructions incur this cost is that they view their extractors as bipartite
graphs, and then define their dispersers as the same graph, but with the roles of the
left-hand-side and right-hand-side reversed. This inversion of an extractor is nontrivial, and
thus requires much computation.
Our constructions are a new twist on the old theme of using random walks on expander
graphs to reduce the error. Formally, we get,
Theorem 1.1. For every $\epsilon = \epsilon(n) > 0$, there exists an efficiently constructible
$(n-1, \epsilon)$-disperser $\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = (2 + o(1))\log\frac{1}{\epsilon}$ and
entropy loss $\Lambda = (1 + o(1))\log\log\frac{1}{\epsilon}$.
As a corollary of one of the lemmas used in this construction, Lemma 3.1, we also
get the following construction:
Corollary 1.1. There exists an efficiently constructible $(n-1, \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = O(\log\frac{1}{\epsilon})$ and entropy loss
$\Lambda = \log\log\frac{1}{\epsilon} + O(1)$.
Error Reduction for Dispersers. An error reduction takes a disperser with, say, constant
error, and converts it to a disperser with the desired (small) error $\epsilon$. One way for
achieving error reduction for dispersers was suggested by Ta-Shma and Zuckerman
[17], and Reingold, Vadhan, and Wigderson [12], and is obtained by first applying the
constant error disperser, and then applying a disperser with error only $\epsilon$ that works well
for sources of very high min-entropy (such as $k = n - 1$). Using the disperser of
Theorem 1.1 we get:
² See Sect. 1.1 for details.
³ For a formal definition, see Sect. 4.
Theorem 1.2 (error reduction for dispersers). Suppose there exists an efficiently
constructible $(k, \frac{1}{2})$-disperser $\mathrm{DSP}_1 : \{0,1\}^n \times \{0,1\}^{d_1} \to \{0,1\}^{m_1}$ with entropy loss
$\Lambda_1$. Then for every $\epsilon = \epsilon(n) > 0$ there exists an efficiently constructible $(k, \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with entropy loss $\Lambda = \Lambda_1 + (1 + o(1))\log\log\frac{1}{\epsilon}$
and $d = d_1 + (2 + o(1))\log\frac{1}{\epsilon}$.
If we take the best explicit disperser construction for constant error (stated in
Theorem 2.2), and apply an error reduction based on Corollary 1.1, we get:
Corollary 1.2. There exists an efficiently constructible $(k, \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with seed length $d = O(\log\frac{n}{\epsilon})$ and entropy loss
$\Lambda = O(\log n) + \log\log\frac{1}{\epsilon}$.
A Disperser for Low Entropies. Next, we construct a disperser for low min-entropies.
The seed length of this disperser is $d = O(k) + (1 + o(1))\log\frac{1}{\epsilon}$ and the entropy loss is
$\Lambda = O(k + \log\log\frac{1}{\epsilon})$. Formally,
Theorem 1.3. There exists an efficiently constructible $(k, \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = O(k) + (1 + o(1))\log\frac{1}{\epsilon}$ and entropy loss
$\Lambda = O(k + \log\log\frac{1}{\epsilon})$.
Notice that when $\epsilon$ is small (say $2^{-n^{2/3}}$) and $k$ is small (say $O(\log n)$) the entropy
loss is optimal up to constant factors, and the seed length is optimal with respect to $\epsilon$. At
first glance this may seem like only a small improvement over the previous construction,
which was obtained using the error reduction. However, we now briefly argue that the
reduction of the seed length from $O(\log n) + 2\log\frac{1}{\epsilon}$ to $O(\log n) + 1 \cdot \log\frac{1}{\epsilon}$ is
significant.
First we note that for some applications, such as the Ramsey graph construction
we discuss below, the constant coefficient is a crucial parameter. A seed length of
$O(\log n) + 2\log\frac{1}{\epsilon}$ does not yield any improvement over known constructions of Ramsey
graphs, whereas a seed length of $O(\log n) + 1 \cdot \log\frac{1}{\epsilon}$ implies a vast improvement.
Second, we argue that in some sense the improvement in the seed length from
$2\log\frac{1}{\epsilon}$ to $1 \cdot \log\frac{1}{\epsilon}$ is equivalent to the exponential improvement of $\log\frac{1}{\epsilon}$ to
$O(\log\log\frac{1}{\epsilon})$ in the entropy loss. Say $\mathrm{DSP} : \{0,1\}^{n_1} \times \{0,1\}^{d_1} \to \{0,1\}^{n_2}$ is a
$(k_1, \epsilon_1)$-disperser. We can view DSP as a bipartite graph with $N_1 = 2^{n_1}$ vertices on the left,
$N_2 = 2^{n_2}$ vertices on the right, with an edge $(v_1, v_2)$ in the bipartite graph if and only
if $\mathrm{DSP}(v_1, y) = v_2$ for some $y \in \{0,1\}^{d_1}$. The disperser property then translates to
the property that any set $A_1 \subseteq \{0,1\}^{n_1}$ of size $K_1 = 2^{k_1}$, and any set $A_2 \subseteq \{0,1\}^{n_2}$
of size more than $\epsilon_1 N_2$, have an edge between them. This property is symmetric, and so every
disperser DSP from $[N_1]$ to $[N_2]$ can equivalently be thought of as a disperser from $[N_2]$
to $[N_1]$, which we call the inverse disperser.
It then turns out that for certain settings of the parameters, if a disperser has seed
length dependence of $1 \cdot \log\frac{1}{\epsilon}$ on the error, then the inverse disperser has entropy loss
$O(\log\log\frac{1}{\epsilon})$⁴. However, as noted earlier, inverting a disperser is often computationally
difficult, which is the reason we require explicit constructions of dispersers in both
directions.
The Connection to Ramsey Graphs. Our final result is a connection between dispersers
and bipartite Ramsey graphs. Ramsey graphs have the property that for every
subset of vertices on the left-hand-side and every subset of vertices on the right-hand-side,
there exists both an edge and a non-edge between the subsets. We show that bipartite
Ramsey graphs can be attained by constructing a disperser with $O(\log\log\frac{1}{\epsilon})$ entropy
loss and $1 \cdot \log\frac{1}{\epsilon}$ seed length, as well as the additional property of being strong⁵.
This connection is formalized in Theorem 4.1.
2 Preliminaries
2.1 Dispersers and Strong Dispersers
Dispersers are formally defined as follows:
Definition 2.1 (disperser). A bipartite graph $G = (L, R, E)$ is a $(k, \epsilon)$-disperser if for
every $S \subset L$ with $|S| \ge 2^k$, $|\Gamma(S)| \ge (1 - \epsilon)|R|$, where $\Gamma(S)$ denotes the set of
neighbors of the vertices in $S$.
As noted earlier, we will denote $|L| = 2^n$ and $|R| = M = 2^m$. Also, the left-degree
of the disperser will be denoted by $D = 2^d$.
We can describe dispersers as functions $\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$. DSP
is a disperser if, when choosing $x$ uniformly at random from $S$ and $r$ uniformly at
random from $\{0,1\}^d$, the distribution of $\mathrm{DSP}(x, r)$ has support of size at least $(1 - \epsilon)M$.
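For very small parameters, Definition 2.1 can be checked by brute force. The sketch below is only an illustration of the definition, not part of the constructions in this paper; the helper names `is_disperser`, `xor_dsp` and `project_dsp` are hypothetical, and the running time is exponential in $2^n$.

```python
from itertools import combinations, product

def is_disperser(dsp, n, d, m, k, eps):
    """Brute-force check that dsp: {0,1}^n x {0,1}^d -> {0,1}^m is a (k, eps)-disperser."""
    inputs = list(product((0, 1), repeat=n))
    seeds = list(product((0, 1), repeat=d))
    # Neighbourhood of every left vertex x in the bipartite-graph view.
    nbrs = {x: {dsp(x, r) for r in seeds} for x in inputs}
    threshold = (1 - eps) * 2 ** m
    # Sets of size exactly 2^k suffice: a superset can only have a larger neighbourhood.
    for S in combinations(inputs, 2 ** k):
        image = set().union(*(nbrs[x] for x in S))
        if len(image) < threshold:
            return False
    return True

def xor_dsp(x, r):
    """Toy example: output x XOR r, so every single input already reaches every output."""
    return tuple(a ^ b for a, b in zip(x, r))

def project_dsp(x, r):
    """Toy example that ignores its seed, so a set S reaches only |S| outputs."""
    return x

if __name__ == "__main__":
    print(is_disperser(xor_dsp, n=2, d=2, m=2, k=0, eps=0.0))
    print(is_disperser(project_dsp, n=2, d=1, m=2, k=1, eps=0.25))
```

The XOR map is a disperser even for $k = 0$ and $\epsilon = 0$, whereas the seed-ignoring map reaches only $2^k$ outputs from a set of size $2^k$.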
We will also be interested in strong dispersers. Loosely speaking, this means that
they have a large support for most seeds. Equivalently, if we were to concatenate the
seed to the output of the disperser, then this extended output would have a large support.
Here we actually give a slightly more general definition, to include the case of almost-
strong dispersers, which have a large support when “most” of the seed is concatenated
to the output.
Definition 2.2 (strong disperser). Denote by $x_{[0..t-1]}$ the first $t$ bits of a string $x$. A
$(k, \epsilon)$-disperser $\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ is strong in $t$ bits if for every
$S \subseteq L$ with $|S| \ge 2^k$,
$$|\{(\mathrm{DSP}(x, r), r_{[0..t-1]}) : x \in S, r \in \{0,1\}^d\}| \ge (1 - \epsilon)2^{m+t}.$$
DSP is a strong disperser if it is strong in all $d$ bits.
We also need the following proposition, which is implicit in [12].
⁴ We remark that the entropy loss lower bound in [10] is obtained by proving a seed length lower
bound, and then using this equivalence.
⁵ See Definition 2.2.

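Proposition 2.1 ([12]). Let
– $\mathrm{DSP}_1 : \{0,1\}^{n_1} \times \{0,1\}^{d_1} \to \{0,1\}^{m_1}$ be a $(k, \epsilon_1)$-disperser with entropy loss
$\Lambda_1$, and let
– $\mathrm{DSP}_2 : \{0,1\}^{m_1} \times \{0,1\}^{d_2} \to \{0,1\}^{m_2}$ be an $(m_1 - \log\frac{1}{1-\epsilon_1}, \epsilon_2)$-disperser with
entropy loss $\Lambda_2$.
Then $\mathrm{DSP} : \{0,1\}^{n_1} \times \{0,1\}^{d_1+d_2} \to \{0,1\}^{m_2}$, where
$$\mathrm{DSP}(x, r_1, r_2) = \mathrm{DSP}_2(\mathrm{DSP}_1(x, r_1), r_2),$$
is a $(k, \epsilon_2)$-disperser with entropy loss $\Lambda = \Lambda_1 + \Lambda_2$.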
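The composition in Proposition 2.1 is plain function composition with the seed split in two. A minimal sketch, assuming dispersers are modelled as Python callables on bit tuples (the names `compose` and `entropy_loss` are hypothetical):

```python
def compose(dsp1, dsp2, d1):
    """Return DSP(x, r1 . r2) = DSP2(DSP1(x, r1), r2), where r1 is the first d1 seed bits."""
    def dsp(x, seed):
        r1, r2 = seed[:d1], seed[d1:]
        return dsp2(dsp1(x, r1), r2)
    return dsp

def entropy_loss(k, d, m):
    """Entropy loss Lambda = k + d - m, as defined in the introduction."""
    return k + d - m

if __name__ == "__main__":
    # Two toy maps standing in for DSP1 and DSP2 (not actual disperser constructions).
    f1 = lambda x, r: tuple(a ^ b for a, b in zip(x, r))
    f2 = lambda y, r: tuple(a ^ r[0] for a in y)
    g = compose(f1, f2, d1=2)
    print(g((1, 0), (1, 1, 1)))
```

The seed of the composed function has length $d_1 + d_2$, matching the statement of the proposition.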
2.2 Extractors
To formally describe extractors, we first need a couple of other definitions:
Definition 2.3 (statistical difference). For two distributions $X$ and $Y$ over some finite
domain, denote the statistical difference between them by $\Delta(X, Y)$, where:
$$\Delta(X, Y) = \frac{1}{2} \sum_{i \in \mathrm{supp}(X \cup Y)} |\Pr[X = i] - \Pr[Y = i]|.$$
$X$ and $Y$ are $\epsilon$-close if $\Delta(X, Y) \le \epsilon$.
We also need a measure of the randomness of a distribution.
Definition 2.4 (min-entropy). The min-entropy of a distribution $X$, denoted by
$H_\infty(X)$, is defined as
$$H_\infty(X) = \min_{i \in \mathrm{supp}(X)} \log\frac{1}{\Pr[X = i]}.$$
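Both quantities are easy to compute for explicitly given finite distributions. The following sketch assumes a distribution is represented as a dictionary from outcomes to probabilities; that representation and the function names are choices made here for illustration.

```python
import math

def statistical_difference(p, q):
    """Delta(X, Y) = 1/2 * sum over the joint support of |Pr[X = i] - Pr[Y = i]|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in support)

def min_entropy(p):
    """H_inf(X) = min over the support of log2(1 / Pr[X = i])."""
    return min(math.log2(1 / pr) for pr in p.values() if pr > 0)

if __name__ == "__main__":
    uniform_bit = {0: 0.5, 1: 0.5}
    constant = {0: 1.0}
    print(statistical_difference(uniform_bit, constant))
    print(min_entropy({"a": 0.25, "b": 0.25, "c": 0.5}))
```

Note that the min-entropy is determined by the single most likely outcome, which is why a distribution with one heavy element has low min-entropy even if its support is large.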
Definition 2.5 (extractor). $\mathrm{EXT} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ is a $(k, \epsilon)$-extractor
if, for any distribution $X$ with $H_\infty(X) \ge k$, when choosing $x$ according to $X$ and $r$
uniformly at random from $\{0,1\}^d$, the distribution of $\mathrm{EXT}(x, r)$ is $\epsilon$-close to uniform.
As with dispersers, we also consider strong extractors:
Definition 2.6 (strong extractor). $\mathrm{EXT} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ is a strong
$(k, \epsilon)$-extractor if, for any distribution $X$ with $H_\infty(X) \ge k$, when choosing $x$ according to $X$
and $r$ uniformly at random from $\{0,1\}^d$, the distribution of $(\mathrm{EXT}(x, r), r)$ is $\epsilon$-close to
uniform.
Finally, for both dispersers and extractors, and for $S \subseteq \{0,1\}^n$ and $T \subseteq \{0,1\}^d$, denote
$\mathrm{EXT}(S, T) = \{\mathrm{EXT}(s, t) : s \in S, t \in T\}$ and $\mathrm{DSP}(S, T) = \{\mathrm{DSP}(s, t) : s \in S, t \in T\}$.

2.3 Previous Explicit Constructions
In this paper we are interested in explicit constructions, which we now define formally.
Definition 2.7 (explicit family). A family of functions $f_n : \{0,1\}^n \times \{0,1\}^{d_n} \to \{0,1\}^{m_n}$
is an explicit family of extractors/dispersers if for every $n$, $f_n$ is an extractor/disperser,
and if there exists an algorithm $A$ such that, given $x \in \{0,1\}^n$ and
$r \in \{0,1\}^{d_n}$, $A$ computes $f_n(x, r)$ in time polynomial in $n$.
We will refer to extractors and dispersers as explicit when we mean that there exist
explicit families of such functions.
In our construction, we will use the following extractor of Srinivasan and Zuckerman
[14] and disperser of Ta-Shma, Umans and Zuckerman [16]:
Theorem 2.1 ([14]). There exists an efficiently constructible strong $(k, \epsilon)$-extractor
$\mathrm{EXT}_{SZ} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = O(m + \log\frac{n}{\epsilon})$ and
$\Lambda = 2\log\frac{1}{\epsilon} + O(1)$.
Theorem 2.2 ([16]). For every $k < n$ and any constant $\epsilon > 0$, there exists an efficiently
constructible disperser $\mathrm{DSP}_{TUZ} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = O(\log n)$
and $m = k - 3\log n - O(1)$.
2.4 Expander Graphs
A main tool we use in our construction is an expander graph.
Definition 2.8 (expander). Let $G = (V, E)$ be a regular graph with normalized adjacency
matrix $A$ (the adjacency matrix divided by the degree), and denote by $\lambda$ the
second largest eigenvalue (in absolute value) of $A$. Then $G$ is an $(N, D, \alpha)$-expander if
$G$ is $D$-regular, $|V| = N$, and $\lambda \le \alpha$.
An expander $G$ is explicit if there exists an algorithm that, given any vertex $v \in V$
and index $i \in \{0, 1, \ldots, D-1\}$, computes the $i$'th neighbor of $v$ in time $\mathrm{poly}(\log |V|)$.
Intuitively, the smaller the value of $\alpha$, the better the expander, implying better parameters
for our construction. Thus, we wish to use graphs with the optimal $\alpha \le \frac{2}{\sqrt{D}}$, called
Ramanujan Graphs.
A minor technicality involving Ramanujan Graphs is that there are known explicit
constructions only for certain values of $N$ and $D$. However, for any given $N$ and $D$,
it is possible to find a description of the Ramanujan Graphs of Lubotzky, Phillips and
Sarnak [6] that are sufficient for our needs. These graphs satisfy all the requirements,
with $|V| \in [N, (1 + \delta) \cdot N]$ and degree $D$, and can be found in time $\mathrm{poly}(n, \frac{1}{\delta})$ [1, 4].
Since the Ramanujan Graphs of [6] are Cayley graphs, they also have the useful
property of being consistently labelled: in any graph G = (V, E), the label of an edge
(u, v) ∈ E is i if following edge i leaving u leads to v. The graph is consistently
labelled if for all vertices u, v, w ∈ V , if (u, v) ∈ E with label i and (w, v) ∈ E with
label j, then i = j. Loosely speaking, if we take two walks on a consistently labelled
graph following the same list of labels but starting at different vertices, then the walks
will not converge into one walk.

The reason expander graphs will be useful is Kahale's Expander Path Lemma [5]:
Lemma 2.1 ([5]). Let $G$ be an $(N, D, \alpha)$-expander, and let $B \subset V$ with density
$\beta = \frac{|B|}{|V|}$. Choose $X_0 \in V$ uniformly at random, and let $X_0, \ldots, X_t$ be a random walk on $G$
starting at $X_0$. Then
$$\Pr[\forall i, X_i \in B] \le (\beta + \alpha)^t.$$
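The bound of Lemma 2.1 can be sanity-checked on a small graph by computing the stay probability exactly. The sketch below is an illustration under stated assumptions: it uses the complete graph $K_N$ rather than a Ramanujan graph, because the normalized adjacency matrix of $K_N$ has eigenvalues $1$ and $-\frac{1}{N-1}$, so $K_N$ is an $(N, N-1, \frac{1}{N-1})$-expander without any numerical eigenvalue computation. The function names are hypothetical.

```python
from fractions import Fraction

def stay_probability(adj, B, t):
    """Exact Pr[X_0, ..., X_t all in B] for a uniform start and a simple random walk."""
    n = len(adj)
    B = set(B)
    # prob[v] = probability that the walk is currently at v and has never left B.
    prob = {v: Fraction(1, n) for v in B}
    for _ in range(t):
        nxt = {v: Fraction(0) for v in B}
        for v, p in prob.items():
            step = p / len(adj[v])
            for u in adj[v]:
                if u in B:
                    nxt[u] += step
        prob = nxt
    return sum(prob.values(), Fraction(0))

def complete_graph(n):
    """Adjacency lists of K_n (no self loops)."""
    return {v: [u for u in range(n) if u != v] for v in range(n)}

if __name__ == "__main__":
    N, t = 8, 5
    B = {0, 1, 2, 3}
    exact = stay_probability(complete_graph(N), B, t)
    bound = (Fraction(len(B), N) + Fraction(1, N - 1)) ** t
    print(float(exact), "<=", float(bound))
```

On $K_N$ the exact value is $\beta \cdot \left(\frac{|B|-1}{N-1}\right)^t$, which the dynamic program reproduces and which lies below $(\beta + \alpha)^t$ as the lemma predicts.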
3 A High Min-Entropy Disperser
In this section we prove Theorem 1.1. Our technique is motivated by the construction
of Cohen and Wigderson [3]. In the context of deterministic amplification, Cohen and
Wigderson [3] used a random walk on an expander graph to reduce the error from
polynomially small to exponentially small. Translating their work into the current notion of
a disperser, their construction has the following parameters: $d = (2 + \alpha)\log(1/\epsilon)$ and
$\Lambda = \log\log(1/\epsilon)$, for some constant $\alpha$. Cohen and Wigderson used this disperser as a
sampling procedure, and they were only interested in the case $k \ge n - \frac{1}{\mathrm{poly}(n)}$. In fact,
their construction does not attain the above parameters for smaller entropy thresholds,
such as $k = n - 1$.
3.1 The Basic Construction
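Lemma 3.1. For all positive $w$, $t$, and $\delta$ there exists an $(n - c, \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ computable in time $\mathrm{poly}(n, \frac{1}{\delta})$ with:
– $\epsilon \le \left(1 - \frac{1}{(1+\delta)2^c} + \frac{2}{2^{w/2}}\right)^t$
– $d = wt + \log t$
– $\Lambda \le \log t$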
Proof. Consider a Ramanujan Graph $G = (V, E)$ with $|V| \in [N, (1 + \delta)N]$ and degree
$2^w$, which can be constructed in time $\mathrm{poly}(\frac{1}{\delta})$ (see Sect. 2.4). Associate with an
arbitrary set of $N$ vertices of $G$ the strings $\{0,1\}^n$, which in turn correspond to the
vertices on the left side of the disperser. For every string $s \in \{0,1\}^d$, we think
of $s = r \circ i$, where $r \in \{0,1\}^{wt}$ and $i \in \{0,1\}^{\log t}$. The disperser is defined as
$\mathrm{DSP}(x, r \circ i) = (y, r)$, where $y \in V$ is the vertex of $G$ reached after taking a walk
consisting of the last $i$ steps encoded by $r$, starting at the vertex associated with $x$. Now
define $S \subset \{0,1\}^n$, an arbitrary set of starting vertices, such that $|S| \ge 2^{n-c}$.
For a pair $(y, r)$, denote by $\overleftarrow{r}_y$ the walk $r$ ending at $y$, backwards. By backwards we
mean that if the $i$'th step of the walk is from $v$ to $u$, then the $(t - i)$'th step of $\overleftarrow{r}_y$ has
the label of the edge from $u$ to $v$. If $(y, r)$ is bad (i.e. for no $x \in S$ and $i \in \{0,1\}^{\log t}$,
$\mathrm{DSP}(x, r \circ i) = (y, r)$), then the walk $\overleftarrow{r}_y$ starting at vertex $y$ in $G$ never hits a vertex
$x \in S$. This implies that in all $t$ steps, the walk stays in $\overline{S} = V \setminus S$. Note that because the graph
is consistently labelled, there is a bijection from pairs $(y, r)$ to pairs $(y, \overleftarrow{r}_y)$. Thus, to
bound $\epsilon$, it suffices to bound the probability that a random walk in $G$ stays in $\overline{S}$. We use
the Expander Path Lemma (Lemma 2.1) to bound this probability by
$$\epsilon \le \left(\frac{|\overline{S}|}{|V|} + \lambda\right)^t,$$
where $\lambda$ is the second largest eigenvalue (in absolute value) of $G$. Since $|S| \ge \frac{N}{2^c}$,
$$\frac{|\overline{S}|}{|V|} \le \frac{|V| - \frac{N}{2^c}}{|V|} \le \frac{(1 + \delta)N - \frac{N}{2^c}}{(1 + \delta)N} = 1 - \frac{1}{(1 + \delta)2^c}.$$
Thus,
$$\epsilon \le \left(1 - \frac{1}{(1 + \delta)2^c} + \lambda\right)^t \le \left(1 - \frac{1}{(1 + \delta)2^c} + \frac{2}{2^{w/2}}\right)^t$$
for a Ramanujan Graph. It is clear that $d = wt + \log t$. Since the length of the output is
at least $n + wt$, the entropy loss $\Lambda \le \log t$.
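The last step of the proof can be spelled out using the definition $\Lambda = k + d - m$ from the introduction, with $k = n - c$, $d = wt + \log t$ and output length $m \ge n + wt$ (the output is a pair $(y, r)$ with $y \in V$, $|V| \ge N = 2^n$, and $r \in \{0,1\}^{wt}$):

```latex
\begin{align*}
\Lambda &= k + d - m \\
        &\le (n - c) + (wt + \log t) - (n + wt) \\
        &= \log t - c \\
        &\le \log t .
\end{align*}
```

In particular the walk itself costs no entropy, since $r$ is part of the output; only the $\log t$ bits used to choose the stopping index $i$ are lost.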

3.2 Parameters
By using Lemma 3.1 with $w = 6$, $t = \frac{5}{2}\log\frac{1}{\epsilon}$, and $\delta = \frac{1}{\mathrm{poly}(n)}$ we get Corollary 1.1 of
Section 1.1.
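The parameter choice for Corollary 1.1 can be checked numerically. The sketch below simply evaluates the three bounds of Lemma 3.1; the function names are hypothetical, $c = 1$ corresponds to $k = n - 1$, and rounding $t$ up to an integer is an assumption made here.

```python
import math

def lemma_3_1_params(w, t, delta, c):
    """Evaluate the error bound, seed length and entropy-loss bound of Lemma 3.1."""
    base = 1 - 1 / ((1 + delta) * 2 ** c) + 2 / 2 ** (w / 2)
    return {
        "error_bound": base ** t,
        "seed_length": w * t + math.log2(t),
        "entropy_loss_bound": math.log2(t),
    }

def corollary_1_1_params(eps, delta):
    """Instantiate Lemma 3.1 with w = 6, t = (5/2) log(1/eps), c = 1."""
    t = math.ceil(2.5 * math.log2(1 / eps))
    return lemma_3_1_params(w=6, t=t, delta=delta, c=1)

if __name__ == "__main__":
    eps = 2 ** -20
    params = corollary_1_1_params(eps, delta=1e-3)
    print(params, "target error:", eps)
```

With $w = 6$ and small $\delta$ the base of the error bound is about $\frac{1}{2} + \frac{1}{4} = \frac{3}{4}$, and $(\frac{3}{4})^{2.5\log(1/\epsilon)} \le \epsilon$ because $2.5 \cdot \log\frac{4}{3} > 1$.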
For our high min-entropy disperser, we wish to reduce the coefficient in the seed
length. Thus, we will compose two different dispersers guaranteed by Lemma 3.1. The
effect of composing two dispersers is captured by Proposition 2.1.
The first disperser will have entropy requirement $n - 1$, and error $\epsilon_1 = \frac{2}{2^{w/2}}$. This
is chosen in order to satisfy $\epsilon_1 = \lambda$, the second eigenvalue of the expanders used in
the disperser constructions, and thus simplify the analysis. The second disperser will
take a subset of $\{0,1\}^{n_2}$ that misses at most an $\epsilon_1 = \frac{2}{2^{w/2}}$ fraction of the strings (or,
equivalently, has entropy requirement $n_2 - \log\frac{1}{1-\epsilon_1}$), and have error $\epsilon$, as desired. Both dispersers will be as guaranteed by
Lemma 3.1, with the same $w$. This means that both expander graphs will have the same
degree, and thus the same bound on $\lambda$, but their sizes will be different. Note that for the
first expander, we do not care too much about the size of $|V|$, and so we choose $\delta$ to be
some constant, to yield a graph with $|V| \in [N, \frac{3}{2}N]$. For the second expander, however,
this size is relevant, so we must choose a smaller $\delta$.
We now prove Theorem 1.1.
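Proof. Our disperser DSP will consist of the composition of two other dispersers, an
$(n - 1, \epsilon_1)$-disperser $\mathrm{DSP}_1 : \{0,1\}^n \times \{0,1\}^{d_1} \to \{0,1\}^{m_1}$ and an
$(m_1 - \log\frac{1}{1-\epsilon_1}, \epsilon)$-disperser $\mathrm{DSP}_2 : \{0,1\}^{m_1} \times \{0,1\}^{d_2} \to \{0,1\}^m$. These dispersers will be as
guaranteed by Lemma 3.1, with parameters $t_1, w_1, \delta_1 = \frac{1}{2}$ and $t_2, w_2, \delta_2$ respectively. Recall
that $t_i$ indicates the length of the walk, $w_i$ is the number of bits needed to specify a step
in the expander graph, and $\delta_i$ is related to the size of the expander graph. We will have
$$\mathrm{DSP}(x, r_1 \circ r_2 \circ i_1 \circ i_2) = \mathrm{DSP}_2(\mathrm{DSP}_1(x, r_1 \circ i_1), r_2 \circ i_2).$$
Let $w = w_1 = w_2$, a parameter we will choose later. For $\mathrm{DSP}_1$, we wish its error
to be $\epsilon_1 \le \frac{2}{2^{w/2}}$. By Lemma 3.1 (with $c = 1$ and $\delta_1 = \frac{1}{2}$), it suffices that
$$\left(1 - \frac{2}{3} \cdot \frac{1}{2} + \frac{2}{2^{w/2}}\right)^{t_1} = \left(\frac{2}{3} + \frac{2}{2^{w/2}}\right)^{t_1} \le \frac{2}{2^{w/2}}.$$
If $\lambda \le \frac{1}{12}$ (which it will be), then $\frac{2}{3} + \lambda \le \frac{3}{4}$ and it suffices to satisfy
$(\frac{3}{4})^{t_1} \le \frac{2}{2^{w/2}}$. This occurs whenever
$t_1 \ge \frac{3}{2}w$, so set $t_1 = \frac{3}{2}w$. Thus, $\mathrm{DSP}_1$ will output $m_1$ bits, with error $\epsilon_1 \le \frac{2}{2^{w/2}}$.
Furthermore, the required seed length is $d_1 = t_1 w = \frac{3}{2}w^2$.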

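We now apply $\mathrm{DSP}_2$ on these strings, with $\delta_2 = \frac{1}{1-\epsilon_1} - 1$. The construction time
is $\mathrm{poly}(\frac{1}{\delta_2}) = 2^{O(w)}$. By Lemma 3.1, the error is
$$\epsilon \le \left(1 - \frac{1 - \epsilon_1}{1 + \delta_2} + \lambda\right)^{t_2} \le (2\epsilon_1 + \lambda)^{t_2} \le \left(\frac{6}{2^{w/2}}\right)^{t_2}.$$
This implies that it suffices to take
$$t_2 = \frac{\log \epsilon}{\log\frac{6}{2^{w/2}}} = \log\frac{1}{\epsilon} \cdot \frac{1}{\log\frac{2^{w/2}}{6}} \le \log\frac{1}{\epsilon} \cdot \frac{2}{w - 6}.$$
The required seed length is
$$d_2 = t_2 w \le \frac{2w}{w - 6}\log\frac{1}{\epsilon} = \left(2 + \frac{12}{w - 6}\right)\log\frac{1}{\epsilon}.$$
We now analyze the parameters we obtained in the composition of $\mathrm{DSP}_1$ and $\mathrm{DSP}_2$.
By Proposition 2.1, the total seed length of DSP is at most $d = (2 + \frac{12}{w-6})\log\frac{1}{\epsilon} + \frac{3}{2}w^2$. If we
pick $w = \omega(1)$, but also $w = o(\log\log\frac{1}{\epsilon})$, then $d = (2 + o(1))\log\frac{1}{\epsilon}$, as claimed. The
entropy loss is $\Lambda \le \log[(2 + \frac{12}{w-6})\log\frac{1}{\epsilon}] + \log(\frac{3}{2}w^2)$, and with the above choice of $w$, we
get $\Lambda = (1 + o(1))\log\log\frac{1}{\epsilon}$.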

3.3 Error Reduction for Dispersers
The error reduction for dispersers of Theorem 1.2 is attained by composing DSP1 with
our high min-entropy disperser, as in Proposition 2.1.
4 Dispersers and Bipartite Ramsey Graphs
Our construction of low min-entropy dispersers was motivated by the connection of
such dispersers to bipartite Ramsey graphs.
Deﬁnition 4.1. A bipartite graph G = (L, R, E) with |L| = |R| = N is (s, t)-Ramsey
if for every S ⊆ L with |S| ≥ s and every T ⊆ R with |T | ≥ t, the vertices of S and
the vertices of T have at least one edge and at least one non-edge between them in G.
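For tiny graphs the Ramsey property of Definition 4.1 can be verified exhaustively. The following sketch is only an illustration of the definition; the function name `is_ramsey` and the parity graph used as an example are assumptions introduced here.

```python
from itertools import combinations

def is_ramsey(edges, n_left, n_right, s, t):
    """Brute-force check of the (s, t)-Ramsey property for a bipartite graph.

    It suffices to test subsets of size exactly s and t: any larger pair of
    subsets contains such a pair and inherits its edge and its non-edge.
    """
    E = set(edges)
    for S in combinations(range(n_left), s):
        for T in combinations(range(n_right), t):
            hits = [(x, y) in E for x in S for y in T]
            if all(hits) or not any(hits):
                return False
    return True

if __name__ == "__main__":
    # Toy 4 x 4 graph: x is adjacent to y iff x + y is even.
    parity_edges = [(x, y) for x in range(4) for y in range(4) if (x + y) % 2 == 0]
    print(is_ramsey(parity_edges, 4, 4, 3, 3))
    print(is_ramsey(parity_edges, 4, 4, 2, 2))
```

The parity graph is $(3, 3)$-Ramsey because any three vertices contain both parities, but it is not $(2, 2)$-Ramsey since two even vertices on each side span only edges.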
By the probabilistic method, it can be shown that $(2\log N, 2\log N)$-Ramsey graphs
exist. However, the best known explicit constructions are of $(N^\delta, N^\delta)$-Ramsey graphs,
for any constant $\delta > 0$ [2]. We now argue that constructions of low min-entropy dispersers
may provide new ways of constructing bipartite Ramsey graphs.
Suppose we have some strong $(k, \epsilon)$-disperser DSP that outputs 1 bit, and has seed
length $d = s_n + \log\frac{1}{\epsilon}$. Consider the bipartite graph $G = (L, R, E)$, such that there is
an edge between $x \in L$ and $y \in R$ if and only if $\mathrm{DSP}(x, y) = 1$. Then for every $S \subset L$
with $|S| \ge 2^k$, and every $T \subset R$ with $|T| > 2\epsilon|R|$, there must be an edge and a non-edge
between $S$ and $T$. To see this, suppose there was no non-edge. Then this implies
that $\mathrm{DSP}(S, t) = 1 \circ t$, for all $t \in T$, and for no $t \in T$ does $\mathrm{DSP}(S, t) = 0 \circ t$. This
means that the disperser misses more than an $\epsilon$-fraction of the outputs, a contradiction.
The case in which there is no edge is symmetric.
How good of a Ramsey graph does this yield? If we set $k = O(\log n)$ and
$\epsilon = 2^{-(n - s_n)}$, then we get a $2^n \times 2^n$ graph that is $(\mathrm{poly}(n), 2^{s_n + 1} + 1)$-Ramsey. If
$s_n = O(\log n)$, then this would yield a $(\mathrm{poly}(n), \mathrm{poly}(n))$-Ramsey graph, which is significantly
better than any current construction, even for the non-bipartite case.
Theorem 4.1 is a generalization of the above:
Theorem 4.1. Let $\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^1$ be a strong $(k, \epsilon)$-disperser
with $d = s_n + t\log\frac{1}{\epsilon}$, where $s_n$ is only a function of $n$. Let $\epsilon = 2^{-\frac{n - s_n}{t}}$, implying
that $d = n$. Define the bipartite graph $G = (L, R, E)$ with $|L| = |R| = 2^n$, such that
there is an edge between $x \in L$ and $y \in R$ if and only if $\mathrm{DSP}(x, y) = 1$. Then $G$ is
$(2^k, 2^{\frac{t-1}{t}n + \frac{s_n}{t} + 1} + 1)$-Ramsey.
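The second Ramsey parameter comes from the threshold $|T| > 2\epsilon|R|$ in the argument preceding the theorem. Substituting $\epsilon = 2^{-(n - s_n)/t}$ and $|R| = 2^n$ gives the exponent directly:

```latex
\begin{align*}
2\epsilon |R| &= 2 \cdot 2^{-\frac{n - s_n}{t}} \cdot 2^{n} \\
              &= 2^{\,n - \frac{n}{t} + \frac{s_n}{t} + 1} \\
              &= 2^{\,\frac{t-1}{t} n + \frac{s_n}{t} + 1} .
\end{align*}
```

Any $T$ with at least $2^{\frac{t-1}{t}n + \frac{s_n}{t} + 1} + 1$ vertices therefore exceeds the threshold. For $t = 1$ the term linear in $n$ vanishes and the bound is $2^{s_n + 1} + 1$, while for $t = 2$ it is already larger than $2^{n/2}$.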
Note that the coefficient $t$ plays a crucial role in the quality of the Ramsey graph. If
$t = 1$, then it is possible to get extremely good Ramsey graphs. On the other hand, if
the seed length is $d > 2\log\frac{1}{\epsilon}$, then the graph is worse than $(2^k, \sqrt{2^n})$-Ramsey.
5 A Low Min-Entropy Disperser
In this section, we construct a disperser with seed length $d = O(\log n) + (1 + o(1))\log\frac{1}{\epsilon}$
that is almost strong.
Lemma 5.1. There exists an efficiently constructible $(k, \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = O(k) + (1 + o(1))\log\frac{1}{\epsilon}$ that is strong in
$d - O(\log k + \log\log\frac{1}{\epsilon})$ bits. If the strong seed bits are concatenated to the output of the disperser,
then the entropy loss of DSP is $\Lambda = O(k + \log\log\frac{1}{\epsilon})$.
Consider the following intuition. Given some source with min-entropy $O(\log n)$, we
first apply an extractor with $m$ output bits (and think of $m$ as a constant) and $\frac{1}{2} \cdot 2^{-m}$
error. The error is small enough relative to the output length to guarantee that half of
the seeds are "good" in the sense that all possible output strings are hit, and so the error
for them is 0. The error over the seeds, however, is large, because only half of the seeds
are good. Our next step is then to use a disperser with error half, to sample a good seed.
The main advantage of this approach is that we obtain a disperser with a very low error
$\epsilon$, by constructing only objects with relatively high error.
We now formalize the above discussion, and prove Theorem 1.3.
Proof. We wish to construct a $(k, \epsilon)$-disperser $\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$.
We use the following two ingredients:
– $\mathrm{EXT} : \{0,1\}^n \times \{0,1\}^{d_E} \to \{0,1\}^m$ that is a strong $(k, 2^{-m-1})$-extractor, and,
– $\mathrm{DSP}' : \{0,1\}^{k' + \log\frac{1}{\epsilon}} \times \{0,1\}^{d'} \to \{0,1\}^{d_E}$ that is a $(k', \frac{1}{2})$-disperser, with
$d = k' + \log\frac{1}{\epsilon} + d'$.
We will specify the parameters later. Now, given $x \in \{0,1\}^n$, $r_1 \in \{0,1\}^{k' + \log\frac{1}{\epsilon}}$,
and $r_2 \in \{0,1\}^{d'}$, define
$$\mathrm{DSP}(x, r_1 \circ r_2) = \mathrm{EXT}(x, \mathrm{DSP}'(r_1, r_2)).$$
Fix $S \subseteq \{0,1\}^n$ with $|S| \ge 2^k$. We say a seed $s$ is "good" for $S$ if $|\mathrm{EXT}(S, s)| = 2^m$,
and "bad" otherwise. As EXT is a strong extractor, at least half of all seeds
$s \in \{0,1\}^{d_E}$ are good for $S$.
Claim. For all $S$ as above, and all but an $\epsilon$-fraction of the $r_i \in \{0,1\}^{k' + \log\frac{1}{\epsilon}}$,
$\mathrm{DSP}'(r_i, \{0,1\}^{d'})$ hits a seed $s$ that is good for $S$.
Proof. Suppose towards a contradiction that more than an $\epsilon$-fraction of the $r_i$ do not
hit any $s$ that is good for $S$. This implies that for at least $\epsilon \cdot 2^{k' + \log\frac{1}{\epsilon}} = 2^{k'}$ $x$'s,
$\mathrm{DSP}'(x, \{0,1\}^{d'})$ misses more than half of the outputs. Consider the set $T$ of all such
$x$'s. Then $|\mathrm{DSP}'(T, \{0,1\}^{d'})| < \frac{1}{2} \cdot 2^{d_E}$, contradicting the fact that $\mathrm{DSP}'$ is a
$(k', \frac{1}{2})$-disperser.

Recall our disperser construction $\mathrm{DSP}(x, r_1 \circ r_2) = \mathrm{EXT}(x, \mathrm{DSP}'(r_1, r_2))$, and
denote by $R$ the length of $r_1$. The claim shows that for all but an $\epsilon$-fraction of the $r_1$'s,
there is an $r_2$ for which $\mathrm{DSP}'(r_1, r_2)$ hits an $s$ that is good for $S$, and $\mathrm{EXT}(S, s)$ hits all of $\{0,1\}^m$. Thus, for
all but an $\epsilon$-fraction of the $r_1$'s, $\mathrm{EXT}(S, \mathrm{DSP}'(r_1, \{0,1\}^{d'}))$ hits all of $\{0,1\}^m$. This implies
that DSP is a $(k, \epsilon)$-disperser that is strong in $R$ bits.
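The notion of a good seed and the shape of the composed function can be illustrated on a toy example. In the sketch below the inner-product map is only a stand-in for EXT (it is not $\mathrm{EXT}_{SZ}$ and no extractor guarantee is claimed for it), `dsp_prime` is an arbitrary placeholder for $\mathrm{DSP}'$, and all function names are hypothetical.

```python
from itertools import product

def inner_product_ext(x, s):
    """Toy one-bit map <x, s> mod 2, standing in for EXT with m = 1."""
    return (sum(a & b for a, b in zip(x, s)) % 2,)

def good_seeds(ext, S, d_e, m):
    """Seeds s in {0,1}^{d_e} with |EXT(S, s)| = 2^m, i.e. seeds that are good for S."""
    good = []
    for s in product((0, 1), repeat=d_e):
        if len({ext(x, s) for x in S}) == 2 ** m:
            good.append(s)
    return good

def compose_low_entropy(ext, dsp_prime, r1_len):
    """DSP(x, r1 . r2) = EXT(x, DSP'(r1, r2)), with r1 the first r1_len seed bits."""
    def dsp(x, seed):
        r1, r2 = seed[:r1_len], seed[r1_len:]
        return ext(x, dsp_prime(r1, r2))
    return dsp

if __name__ == "__main__":
    S = [(0, 0, 0), (1, 0, 1)]
    print(len(good_seeds(inner_product_ext, S, d_e=3, m=1)), "of 8 seeds are good")
    dsp = compose_low_entropy(inner_product_ext, lambda r1, r2: r1 + r2, r1_len=2)
    print(dsp((1, 0, 1), (1, 0, 1)))
```

For this $S$ exactly half of the eight seeds are good, which is the situation the proof handles by using $\mathrm{DSP}'$ to sample a good seed.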
To complete the proof, let $\mathrm{EXT} = \mathrm{EXT}_{SZ}$ from Theorem 2.1, and let
$\mathrm{DSP}' = \mathrm{DSP}_{TUZ}$ from Theorem 2.2, with $k' = O(k)$ and $m = O(k)$. Then
$d' = O(\log(k + \log\frac{1}{\epsilon})) \le O(\log k + \log\log\frac{1}{\epsilon})$, and $d = R + d' = O(k) + (1 + o(1))\log\frac{1}{\epsilon}$. Thus,
DSP is strong in $R = d - d'$ bits, as claimed. Finally, the entropy loss is $O(k)$ from
the application of $\mathrm{EXT}_{SZ}$, and an additional $O(\log k + \log\log\frac{1}{\epsilon})$ from the seed of
$\mathrm{DSP}_{TUZ}$. Thus, the total entropy loss is $\Lambda = O(k + \log\log\frac{1}{\epsilon})$.

Plugging in k = O(log n) we get:
Corollary 5.1. There exists an efficiently constructible $(O(\log n), \epsilon)$-disperser
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ with $d = O(\log n) + (1 + o(1))\log\frac{1}{\epsilon}$ that is strong in
$d - O(\log\log n + \log\log\frac{1}{\epsilon})$ bits. The entropy loss of DSP is $\Lambda = O(\log n + \log\log\frac{1}{\epsilon})$.
6 Discussion
In this work we addressed one aspect of sub-optimality in dispersers, the dependence
on $\epsilon$, and constructed min-entropy dispersers in which the entropy loss is optimally
dependent on $\epsilon$.
Another aspect has to do with the dependence of the entropy loss on $n$. Say
$\mathrm{DSP} : \{0,1\}^n \times \{0,1\}^d \to \{0,1\}^m$ is a $(k, \epsilon)$-disperser with small error $\epsilon$ and close to optimal
entropy loss $\Lambda$. For any source $X \subseteq \{0,1\}^n$ of size $2^k$, DSP is close to being one-to-one
on $X$ (to be more precise, every image is expected to have $2^\Lambda$ pre-images, and
the disperser property guarantees we are not far from that). In particular every such
disperser is also an extractor with some worse error, and if we fix $\epsilon$ to be some constant,
this disperser would actually be an extractor with constant error.

It follows that if we could improve Corollary 1.2 to have entropy loss independent
of $n$, we would get an extractor with constant error, optimal seed length and constant
entropy loss! No current extractor construction can attain such entropy loss, regardless
of the error, without incurring a large increase in seed length (generally, $d = \log^2 n$ is
necessary). So such dispersers may provide a new means of obtaining better extractors.
A different issue is the distinction between strong and non-strong dispersers. It
seems that a possible conclusion from this work is that the property of being strong
becomes very signiﬁcant when dealing with objects that are close to being optimal.
Acknowledgements
We would like to thank Ronen Shaltiel for very useful discussions.
References
1. N. Alon, J. Bruck, J. Naor, M. Naor and R. Roth. Construction of asymptotically good, low-
rate error-correcting codes through pseudo-random graphs. In Transactions on Information
Theory, 38:509-516, 1992. IEEE.
2. Boaz Barak, Guy Kindler, Ronen Shaltiel, Benny Sudakov, and Avi Wigderson. Simulating
independence: New constructions of condensers, Ramsey graphs, dispersers, and extractors.
In 37th STOC, 2005.
3. Aviad Cohen and Avi Wigderson. Dispersers, deterministic ampliﬁcation, and weak random
sources. In Proc. of 30th FOCS, pages 14-19, 1989.
4. Oded Goldreich, Avi Wigderson. Tiny families of functions with random properties: a
quality-size trade-off for hashing. Random Structures and Algorithms 11:4, 1997.
5. Nabil Kahale. Better expansion for Ramanujan graphs. In 33rd FOCS, 1992, pages 398-404.
6. A. Lubotzky, R. Phillips, and P. Sarnak. Explicit expanders and the Ramanujan conjectures.
In 18th STOC, 1986, pages 240-246.
7. Noam Nisan and Amnon Ta-Shma. Extracting randomness: A survey and new constructions.
Journal of Computer and System Sciences, 58(1):148-173, 1999.
8. Noam Nisan and David Zuckerman. More deterministic simulation in logspace. In 25th
STOC, 1993, pages 235-244.
9. Ran Raz. Extractors with weak random seeds. In 37th STOC, 2005.
10. Jaikumar Radhakrishnan and Amnon Ta-Shma. Tight bounds for depth-two superconcentra-
tors. In 38th FOCS, pages 585-594, Miami Beach, Florida, 20-22 October 1997. IEEE.
11. Ran Raz, Omer Reingold, and Salil Vadhan. Error reduction for extractors. In 40th FOCS,
1999. IEEE.
12. Omer Reingold, Salil Vadhan, and Avi Wigderson. Entropy waves, the zig-zag graph product,
and new constant-degree expanders and extractors. In Proc. 41st FOCS, pages 3-13, 2000.
13. Michael Sipser. Expanders, randomness, or time versus space. Journal of Computer and
System Sciences, 36(3):379-383, June 1988.
14. Aravind Srinivasan and David Zuckerman. Computing with very weak random sources. In
35th FOCS, 1994, pages 264-275.
15. Amnon Ta-Shma. Almost optimal dispersers. In 30th STOC, pages 196-202, Dallas, TX,
May 1998. ACM.
16. Amnon Ta-Shma, Christopher Umans, and David Zuckerman. Loss-less condensers, unbal-
anced expanders, and extractors. In Proc. 33rd STOC, pages 143-152, 2001.
17. Amnon Ta-Shma and David Zuckerman. Personal communication.
18. David Zuckerman. General weak random sources. In 31st FOCS, 1990.

Tolerant Locally Testable Codes
Venkatesan Guruswami and Atri Rudra
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
{venkat,atri}@cs.washington.edu
Abstract. An error-correcting code is said to be locally testable if it has
an efficient spot-checking procedure that can distinguish codewords from
strings that are far from every codeword, looking at very few locations
of the input in doing so. Locally testable codes (LTCs) have generated
a lot of interest over the years, in large part due to their connection to
probabilistically checkable proofs (PCPs). The ability to correct errors
that occur during transmission is one of the big advantages of using a
code. Hence, from a coding-theoretic angle, local testing is potentially
more useful if in addition to accepting codewords, it also accepts strings
that are close to a codeword (in contrast, local testers can have arbitrary
behavior on such strings, which potentially annuls the benefits of
error-correction). This would imply that when the tester accepts, one can
follow up the testing with a (more expensive) decoding procedure to correct
the errors and recover the transmitted codeword, while if the tester
rejects, we can save the effort of running the more expensive decoding
algorithm.
In this work, we define such testers, which we call tolerant testers fol-
lowing some recent work in property testing [13]. We revisit some recent
constructions of LTCs and show how one can make them locally testable
in a tolerant sense. While we do not optimize the parameters, the main
message from our work is that there are explicit tolerant LTCs with
similar parameters to LTCs.
1 Introduction
Locally testable codes (LTCs) have been the subject of much research over the
years and there has been heightened activity and progress on them recently
[4–6, 10–12]. LTCs are error-correcting codes which have a testing procedure
with the following property: given oracle access to a string which is a
purported codeword, these testers “spot check” the string at very few
locations, accepting if the string is indeed a codeword, and rejecting with
high probability if the string is “far-enough” from every codeword. Such
spot-checkers arise in the construction of probabilistically checkable
proofs (PCPs) [1, 2] (see the recent survey by
Goldreich [10] for more details on the interplay between LTCs and PCPs). Note
that in the definition of LTCs, there is no requirement on the tester for input
strings that are very close to a codeword. This “asymmetry” in the way the tester
accepts and rejects an input reflects the way PCPs are defined, where we only
care about accepting perfectly correct proofs with high probability. However,
the crux of error-correcting codes is to tolerate and correct a few errors that
could occur during transmission of the codeword (and not just be able to detect
errors). In this context, the fact that a tester can reject received words with
few errors is not satisfactory. A more desirable (and stronger) requirement in
this scenario would be the following: we would like the tester to make a quick
decision on whether or not the purported codeword is close to any codeword.
If the tester declares that there is probably a close-by codeword, we then use a
decoding algorithm to decode the received word. If on the other hand, we can
say with high confidence that the received word is far away from all codewords
then we do not run our expensive decoding algorithm.
In this work we introduce the concept of tolerant testers. These are codeword
testers which reject (w.h.p.) received words far from every codeword (like the
current “standard” testers) and accept (w.h.p.) close-by received words (unlike
the current ones which only need to accept codewords). We will refer to codes
that admit a tolerant tester as tolerant LTCs. In the general context of property
testing, the notion of tolerant testing was introduced by Parnas et al. [13] along
with the related notion of distance approximation. Parnas et al. also give tolerant
testers for clustering. We feel that codeword-testing is a particularly natural
instance to study tolerant testing. (In fact, if LTCs were defined solely from a
coding-theoretic viewpoint, without their relevance and applications to PCPs
in mind, we feel that it is likely that the original definition itself would have
required tolerant testers.)
For any vectors $u, v \in \mathbb{F}_q^n$, the relative distance between u and
v, denoted dist(u, v), equals the fraction of positions where u and v differ.
For any subset $A \subset \mathbb{F}_q^n$,
$\mathrm{dist}(v, A) = \min_{u \in A} \mathrm{dist}(u, v)$. An $[n, k, d]_q$
linear code C is a k-dimensional subspace of $\mathbb{F}_q^n$ such that every
pair of distinct elements $x, y \in C$ differ in at least d locations, i.e.,
$\mathrm{dist}(x, y) \ge d/n$. The ratio $\frac{k}{n}$ is called the rate of
the code and d is the (minimum) distance of the code. We now formally define
a tolerant tester.
Definition 1. For any linear code C over $\mathbb{F}_q$ of block length n and
distance d, and $0 \le c_1 \le c_2 \le 1$, a $(c_1, c_2)$-tolerant tester T for
C with query complexity p(n) (or simply p when the argument is clear from the
context) is a probabilistic polynomial time oracle Turing machine such that
for every vector $v \in \mathbb{F}_q^n$:
1. If $\mathrm{dist}(v, C) \le \frac{c_1 d}{n}$, T upon oracle access to v
accepts with probability at least $\frac{2}{3}$ (tolerance),
2. If $\mathrm{dist}(v, C) > \frac{c_2 d}{n}$, T rejects with probability at
least $\frac{2}{3}$ (soundness),
3. T makes p(n) probes into the string (oracle) v.
A code is said to be $(c_1, c_2, p)$-testable if it admits a
$(c_1, c_2)$-tolerant tester of query complexity p(·).
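To make the three regimes of Definition 1 concrete, the following minimal Python sketch computes the relative distance of a word to an explicitly listed code and reports which guarantee, if any, the definition imposes on a tolerant tester. The function names and the toy repetition code are our own illustrative assumptions; listing all codewords is only feasible for tiny codes and is not how a tester would operate.

```python
from fractions import Fraction


def relative_distance(u, v):
    """Fraction of positions where u and v differ."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return Fraction(sum(a != b for a, b in zip(u, v)), len(u))


def distance_to_code(v, code):
    """dist(v, C): minimum relative distance from v to a codeword of C."""
    return min(relative_distance(u, v) for u in code)


def required_behavior(v, code, d, c1, c2):
    """Which clause of Definition 1 applies to the word v (hypothetical helper)."""
    n = len(v)
    dist = distance_to_code(v, code)
    if dist <= Fraction(c1) * d / n:
        return "accept w.p. >= 2/3"   # tolerance clause
    if dist > Fraction(c2) * d / n:
        return "reject w.p. >= 2/3"   # soundness clause
    return "no requirement"           # the gap between the two thresholds


# Toy example (an assumption for illustration): binary repetition code, n = d = 12.
code = [(0,) * 12, (1,) * 12]
one_error = (1,) + (0,) * 11
three_errors = (1,) * 3 + (0,) * 9
five_errors = (1,) * 5 + (0,) * 7
```

With $c_1 = 1/6$ and $c_2 = 1/3$, words with at most 2 errors fall under the tolerance clause, words with more than 4 errors fall under the soundness clause, and nothing is required in between.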

We will be interested in asymptotics and thus we implicitly are interested
in a family of codes with the stated properties in the above definition (and so,
the notion of the tester being a polynomial time machine, in particular, makes
sense). We usually hide this for notational simplicity.
A tester has perfect completeness if it accepts any codeword with probability 1.
As pointed out earlier, the existing literature considers only (0, c2)-tolerant
testers with perfect completeness. We will refer to these as standard testers
henceforth. Note that our definition of tolerant testers is per se not a general-
ization of standard testers since we do not require perfect completeness for the
case when the input v is a codeword. However, all our constructions will inherit
this property from the standard testers we obtain them from.
Recall one of the applications of tolerant testers mentioned earlier: a tolerant
tester is used to decide if the expensive decoding algorithm should be used. In
this scenario, one would like to set the parameters $c_1$ and $c_2$ such that
the tester is tolerant up to the decoding radius. For example, if we have a
unique decoding algorithm which can correct up to $\frac{d}{2}$ errors, a
particularly appealing setting of parameters would be $c_1 = \frac{1}{2}$ and
$c_2$ as close to $\frac{1}{2}$ as possible. However, we would not be able to
achieve such large $c_1$. In general we will aim for positive constants $c_1$
and $c_2$ with $\frac{c_2}{c_1}$ being as small as possible while minimizing
p(n).
One might hope that the existing standard testers could also be tolerant
testers. We give a simple example to illustrate the fact that this is not the case
in general. Consider the tester for the Reed-Solomon (RS) codes of dimension
k + 1: pick k + 2 points uniformly at random and check if the degree k univariate
polynomial obtained by interpolating on the first k + 1 points agrees with the
input on the last point. It is well known that this is a standard tester [16].
However, this is not a tolerant tester. Assume we have an input which differs
from a degree k polynomial in only one point. Thus, for $\binom{n-1}{k+1}$
choices of k + 2 points, the tester would reject, that is, the rejection
probability is
$$\frac{\binom{n-1}{k+1}}{\binom{n}{k+2}} = \frac{k+2}{n},$$
which is greater than $\frac{1}{3}$ for high rate RS codes.
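The rejection probability in this example can be checked by brute force on a small instance. The sketch below is only an illustration under our own assumptions: a prime field of size 13, evaluation points 0, ..., 12, an arbitrarily chosen degree-3 polynomial, and hypothetical function names. It enumerates every set of k + 2 points rather than sampling.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

P = 13  # hypothetical small prime; the evaluation points are 0..P-1, so n = P


def interpolate_at(xs, ys, x0, p=P):
    """Value at x0 of the unique polynomial of degree < len(xs) through
    the points (xs[i], ys[i]), via Lagrange interpolation modulo p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x0 - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


def rs_test_accepts(word, points, p=P):
    """The (k+2)-point test: interpolate on all but the last point and
    compare the prediction with the word at the last point."""
    *first, last = points
    prediction = interpolate_at(first, [word[x] for x in first], last, p)
    return prediction == word[last]


def rejection_probability(word, k, p=P):
    """Exact rejection probability, enumerating every set of k+2 points."""
    subsets = list(combinations(range(len(word)), k + 2))
    rejected = sum(not rs_test_accepts(word, s, p) for s in subsets)
    return Fraction(rejected, len(subsets))


k = 3
coeffs = [1, 2, 0, 5]  # an arbitrary polynomial of degree 3
codeword = [sum(c * pow(x, e, P) for e, c in enumerate(coeffs)) % P
            for x in range(P)]
corrupted = list(codeword)
corrupted[0] = (corrupted[0] + 1) % P  # introduce a single error
```

On this instance the codeword is never rejected, while the word with a single error is rejected on exactly the subsets containing the corrupted point, a (k + 2)/n = 5/13 fraction, which already exceeds 1/3.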
Another pointer towards the inherent difficulty in coming up with a tolerant
tester is the recent work of Fischer and Fortnow [8] which shows that there are
certain boolean properties which have a standard tester with constant number
of queries but for which every tolerant tester requires at least
$n^{\Omega(1)}$ queries.
In this work, we examine existing standard testers and convert some standard
testers into tolerant ones. In Section 2 we record a few general facts which will
be useful in performing this conversion. The ultimate goal, if this can be realized
at all, would be to construct tolerant LTCs of constant rate which can be tested
using O(1) queries (we remark that such a construction has not been obtained
even without the requirement of tolerance). In this work, we show that we can
achieve either a constant number of queries with slightly sub-constant rate
(Section 3) or constant rate with a sub-linear number of queries (Section 4).
That is, something non-trivial is possible in both the domains: (a) constant
rate, and (b) constant number of queries. Specifically, in Section 3 we discuss
binary codes which encode k bits into codewords of length
$n = k \cdot \exp(\log^{\varepsilon} k)$ for any $\varepsilon > 0$, and can be
tolerant tested using $O(1/\varepsilon)$ queries. In Section 4, following [5],
we study the simple construction of LTCs using products of codes; this yields
asymptotically good codes which are tolerant testable using a sub-linear number
$n^{\gamma}$ of queries for any desired $\gamma > 0$. An interesting common
feature of the codes
in Section 3 and 4 is that they can be constructed from any code that has good
distance properties and which in particular need not admit a local tester with
sub-linear query complexity. The overall message from our work is that a lot of
the work on locally testable code constructions extends fairly easily to also yield
tolerant locally testable codes. However, there does not seem to be a generic way
to “compile” a standard tester to a tolerant tester for an arbitrary code.
Due to the lack of space most of the proofs are omitted and will appear in
the full version of the paper.
2 General Observations
In this section we will fix some notations and spell out some general properties
of tolerant testers and subsequently use them to design tolerant testers for some
existing codes. We will denote the set {1, · · · , n} by [n]. All the testers we refer
to are non-adaptive testers which decide on the locations to query all at once
based only on the random choices. In the sequel, we use n to denote the block
length and d the distance of the code under consideration. The motivation for
the definition below will be clear in Section 3.
Definition 2. Let $0 < \alpha \le 1$. A tester T is
$(s_1, q_1, s_2, q_2, \alpha)$-smooth if there exists a set $A \subseteq [n]$
where $|A| = \alpha n$ with the following properties:
– T queries at most $q_1$ points in A, and for every $x \in A$, the probability
that each of these queries equals location x is at most $\frac{s_1}{|A|}$, and
– T queries at most $q_2$ points in [n] − A, and for every $x \in [n] − A$, the
probability that each of these queries equals location x is at most
$\frac{s_2}{n - |A|}$.
As a special case a (1, q, 0, 0, 1)-smooth tester makes a total of q queries
each of them distributed uniformly among the n possible probe points. The
following lemma follows easily by an application of the union bound.
Lemma 1. For any $0 < \alpha < 1$, a $(s_1, q_1, s_2, q_2, \alpha)$-smooth
$(0, c_2)$-tolerant tester T with perfect completeness is a
$(c_1, c_2)$-tolerant tester, where
$$c_1 = \frac{n\alpha(1-\alpha)}{3d \max\{q_1 s_1 (1-\alpha),\; q_2 s_2 \alpha\}}.$$
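The union bound behind Lemma 1 can be traced numerically. The sketch below, with hypothetical function names and arbitrarily chosen toy parameters, computes $c_1$ from the lemma and the union-bound estimate of the probability that some query lands on an erroneous position; a tester with perfect completeness can only reject in that event.

```python
from fractions import Fraction


def lemma1_c1(n, d, s1, q1, s2, q2, alpha):
    """The value c1 from Lemma 1 for an (s1, q1, s2, q2, alpha)-smooth tester."""
    alpha = Fraction(alpha)
    worst = max(q1 * s1 * (1 - alpha), q2 * s2 * alpha)
    return n * alpha * (1 - alpha) / (3 * d * worst)


def hit_probability_bound(errors_in_A, errors_outside_A, n, s1, q1, s2, q2, alpha):
    """Union bound on the probability that at least one query reads an
    erroneous position, given how the errors split between A and [n] - A."""
    alpha = Fraction(alpha)
    size_A = alpha * n
    return (q1 * s1 * Fraction(errors_in_A) / size_A
            + q2 * s2 * Fraction(errors_outside_A) / (n - size_A))


# Toy parameters (assumptions for illustration only).
n, d = 1200, 300
s1, q1, s2, q2, alpha = 1, 4, 2, 6, Fraction(1, 2)
c1 = lemma1_c1(n, d, s1, q1, s2, q2, alpha)
budget = int(c1 * d)  # largest whole number of errors with dist(v, C) <= c1 * d / n
worst_case = max(
    hit_probability_bound(e, budget - e, n, s1, q1, s2, q2, alpha)
    for e in range(budget + 1)
)
```

For these parameters $c_1 = 1/18$, so up to 16 errors are tolerated, and however they are split between A and its complement the bound stays at most 1/3.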
The above lemma is not useful for us unless the relative distance and the
number of queries are constants. Next we sketch how to design tolerant testers
from existing robust testers with certain properties. We first recall the definition
of robust testers from [5].
A standard tester T has two inputs: an oracle for the received word v and
a random string s. Depending on s, T generates q query positions
$i_1, \ldots, i_q$, fixes a circuit $C_s$ and then accepts if
$C_s(v_{f(s)}) = 1$ where $v_{f(s)} = \langle v_{i_1}, \ldots, v_{i_q} \rangle$.
The robustness of T on inputs v and s, denoted by $\rho^T(v, s)$, is defined to
be the minimum, over all strings y such that $C_s(y) = 1$, of
$\mathrm{dist}(v_{f(s)}, y)$. The expected robustness of T on v is the expected
value of $\rho^T(v, s)$ over the random choices of s and would be denoted by
$\rho^T(v)$.
A standard tester T is said to be c-robust for C if for every $v \in C$, the
tester accepts with probability 1, and for every $v \in \mathbb{F}_q^n$,
$\mathrm{dist}(v, C) \le c \cdot \rho^T(v)$.
The tolerant version $T'$ of the standard c-robust tester T is obtained by
accepting an oracle v on random input s, if $\rho^T(v, s) \le \tau$ for some
threshold τ. (Throughout the paper τ will denote the threshold.) We will
sometimes refer to such a tester as one with threshold τ. Recall that a
standard tester T accepts if $\rho^T(v, s) = 0$. We next show that $T'$ is
sound.
The following lemma follows from a simple averaging argument and the fact
that T is c-robust:
Lemma 2. Let $0 \le \tau \le 1$, and let $c_2 = \frac{(\tau+2)cn}{3d}$. For any
$v \in \mathbb{F}_q^n$, if $\mathrm{dist}(v, C) > \frac{c_2 d}{n}$, then the
tolerant tester $T'$ with threshold τ rejects v with probability at least
$\frac{2}{3}$.
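One way to spell out the averaging argument behind Lemma 2 is the following short derivation. It is a sketch under our reading of the definitions above, with $a$ (our notation) denoting the probability over s that the tolerant tester accepts v.

```latex
\begin{align*}
a &:= \Pr_s\!\left[\rho^T(v,s) \le \tau\right],
  \qquad 0 \le \rho^T(v,s) \le 1, \\
\rho^T(v) &= \mathbb{E}_s\!\left[\rho^T(v,s)\right] \le a\,\tau + (1-a), \\
\rho^T(v) &\ge \frac{\mathrm{dist}(v,C)}{c} > \frac{c_2\, d}{c\, n}
  = \frac{\tau+2}{3}
  && \text{(c-robustness and the choice of } c_2\text{)}, \\
a\,\tau + (1-a) &> \frac{\tau+2}{3}
  \;\Longrightarrow\; a\,(1-\tau) < \frac{1-\tau}{3}
  \;\Longrightarrow\; a < \frac{1}{3}
  && (\tau < 1).
\end{align*}
```

For τ = 1 the hypothesis reads dist(v, C) > c, which cannot hold since $\mathrm{dist}(v, C) \le c \cdot \rho^T(v) \le c$, so the statement is vacuous in that case.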
We next mention a property of the query pattern of T which would make $T'$
tolerant. Let S be the set of all possible choices for the random string s.
Further for each s, let $p^T(s)$ be the set of positions queried by T.
Definition 3. A tester T has a partitioned query pattern if there exists a
partition $S_1 \cup \cdots \cup S_m$ of the random choices of T for some m,
such that for every i,
– $\bigcup_{s \in S_i} p^T(s) = \{1, 2, \ldots, n\}$, and
– For all $s, s' \in S_i$, $p^T(s) \cap p^T(s') = \emptyset$ if $s \ne s'$.
Lemma 3. Let T have a partitioned query pattern. For any
$v \in \mathbb{F}_q^n$, if $\mathrm{dist}(v, C) \le \frac{c_1 d}{n}$, where
$c_1 = \frac{n\tau}{3d}$, then the tolerant test $T'$ with threshold τ rejects
with probability at most $\frac{1}{3}$.
Proof. Let $S_1, \ldots, S_m$ be the partition of S, the set of all random
choices of the tester T. For each j, by the properties of $S_j$,
$\frac{1}{|S_j|} \sum_{s \in S_j} \rho^T(v, s) \le \mathrm{dist}(v, C)$. By an
averaging argument and by the assumption on dist(v, C) and the value of $c_1$,
at least $\frac{2}{3}$ fraction of the choices of s in $S_j$ have
$\rho^T(v, s) \le \tau$ and thus, $T'$ accepts. Recalling that
$S_1, \ldots, S_m$ was a partition of S, for at least $\frac{2}{3}$ of the
choices of s in S, $T'$ accepts. This completes the proof.
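The counting in this proof can be illustrated on a toy instance. The sketch below makes two assumptions of ours that go beyond the text: the query sets within one part are equal-sized blocks of positions, and the local robustness of a block is bounded above by the fraction of erroneous positions inside it (the restriction of the nearest codeword is a local view that a tester with perfect completeness accepts). All names are hypothetical.

```python
from fractions import Fraction


def local_error_fractions(error_positions, blocks):
    """Fraction of erroneous positions inside each block; under the stated
    assumption this upper-bounds the local robustness of that block."""
    errors = set(error_positions)
    return [Fraction(sum(pos in errors for pos in block), len(block))
            for block in blocks]


def rejection_fraction_bound(error_positions, blocks, tau):
    """Upper bound on the fraction of blocks on which a threshold-tau
    tolerant tester could reject."""
    fractions = local_error_fractions(error_positions, blocks)
    return Fraction(sum(f > tau for f in fractions), len(blocks))


# Toy instance: n = 90 positions split into 9 disjoint blocks of 10.
n, block_size = 90, 10
blocks = [range(start, start + block_size) for start in range(0, n, block_size)]
tau = Fraction(3, 10)
# tau * n / 3 = 9 errors, packed to push as many blocks as possible above tau.
packed_errors = [0, 1, 2, 3, 10, 11, 12, 13, 20]
bound = rejection_fraction_bound(packed_errors, blocks, tau)
```

Even with the 9 permitted errors packed adversarially, only 2 of the 9 blocks exceed the threshold, in line with the 1/3 bound of Lemma 3.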
3 Tolerant Testers for Binary Codes
One of the natural goals in the study of tolerant codes is to design explicit
tolerant binary codes with constant relative distance and as large a rate as
possible. In the case of standard testers, Ben-Sasson et al. [4] give binary
locally testable codes which map k bits to
$k \cdot \exp(\log^{\varepsilon} k)$ bits for any $\varepsilon > 0$ and which
are testable with $O(1/\varepsilon)$ queries. Their construction uses objects
called PCPs of Proximity (PCPP) which they also introduce in [4]. In this
section, we show that a simple modification to their construction yields
tolerant testable binary codes which map k bits to
$k \cdot \exp(\log^{\varepsilon} k)$ bits for any $\varepsilon > 0$. We note
that a similar modification is used by Ben-Sasson et al. to give relaxed
locally decodable codes [4] but with worse parameters (specifically they give
codes with block length $k^{1+\varepsilon}$).
3.1 PCP of Proximity
We start with the definition¹ of a Probabilistically Checkable Proof of
Proximity (PCPP). A pair language is simply a language whose elements are
naturally a pair of strings, i.e., it is some collection of strings (x, y). A
notable example is
$\mathrm{CIRCUITVAL} = \{\langle C, a \rangle \mid \text{Boolean circuit } C \text{ evaluates to 1 on assignment } a\}$.
Definition 4. Fix 0 ≤ δ ≤ 1. A probabilistic verifier V is a PCPP for a pair
language L with proximity parameter δ and query complexity q(·) if the following
conditions hold:
– (Completeness) If (x, y) ∈ L then there exists a proof π such that V accepts
by accessing the oracle y ◦ π with probability 1.
– (Soundness) If y is δ-far from $L(x) = \{y \mid (x, y) \in L\}$, then for all
proofs π, V accepts by accessing the oracle y ◦ π with probability strictly
less than $\frac{1}{4}$.
– (Query complexity) For any input x and proof π, V makes at most q(|x|)
queries in y ◦ π.
Note that a PCPP differs from a standard PCP in that it has a more relaxed
soundness condition but its queries into part of the input y are also counted in
its query complexity.
Ben-Sasson et al. give constructions of PCPPs with the following guarantees:
Lemma 4. ([4]) Let $\varepsilon > 0$ be arbitrary. There exists a PCP of
proximity for the pair language CIRCUITVAL = {(C, x) | C is a boolean circuit
and C(x) = 1} whose proof length, for input circuits of size s, is at most
$s \cdot \exp(\log^{\varepsilon/2} s)$ and for
$t = \frac{2 \log\log s}{\log\log\log s}$ the verifier of proximity has query
complexity $O(\max\{\frac{1}{\delta}, \frac{1}{\varepsilon}\})$ for any
proximity parameter δ that satisfies $\delta \ge \frac{1}{t}$. Furthermore, the
queries of the verifier are non-adaptive and each of the queries which lie in
the input part x are uniformly distributed among the locations of x.
The fact that the queries to the input part are uniformly distributed follows
by an examination of the verifier construction in [4]. In fact, in the extended
version of that paper, the authors make this fact explicit and use it in their
construction of relaxed locally decodable codes (LDCs). To achieve a tolerant
LTC using the PCPP, we will need all queries of the verifier to be somewhat
uniformly or smoothly distributed. We will now proceed to make the queries of
the PCPP verifier that fall into the “proof part” π near-uniform. This will
follow a fairly general method suggested in [4] to smoothen out the query
distribution, which the authors used to obtain relaxed locally decodable codes
from the PCPP. We will obtain tolerant LTCs instead, and in fact will manage to
do so without a substantial increase in the encoding length (i.e., the encoding
length will remain $k \cdot 2^{\log^{\varepsilon} k}$). On the other hand, the
best encoding length achieved for relaxed LDCs in [4] is $k^{1+\varepsilon}$
for constant $\varepsilon > 0$. We begin with the definition of a mapping that
helps smoothen out the query distribution.
¹ The definition here is a special case of the general PCPP defined in [4]
which would be sufficient for our purposes.
Definition 5. Given any $v \in \mathbb{F}_q^n$ and
$p = \langle p_i \rangle_{i=1}^n$ with $p_i \ge 0$ for all $i \in [n]$ and
$\sum_{i=1}^n p_i = 1$, we define the mapping Repeat(·, ·) as follows:
$\mathrm{Repeat}(v, p) \in \mathbb{F}_q^{n'}$ such that $v_i$ is repeated
$\lfloor 4n p_i \rfloor$ times in Repeat(v, p) and
$n' = \sum_{i=1}^n \lfloor 4n p_i \rfloor$.
The following lemma shows why the mapping is useful. A similar fact appears
in [4].
Lemma 5. For any $v \in \mathbb{F}_q^n$ let a non-adaptive verifier T (with
oracle access to v) make q(n) queries and let $p_i$ be the probability that
each of these queries probes location $i \in [n]$. Let
$c_i = \frac{1}{2n} + \frac{p_i}{2}$ and $c = \langle c_i \rangle_{i=1}^n$.
Consider the map $\mathrm{Repeat}(v, c) : \mathbb{F}_q^n \to \mathbb{F}_q^{n'}$.
Then there exists another tester $T'$ for strings of length $n'$ such that:
1. $T'$ makes 2q(n) queries on $v' = \mathrm{Repeat}(v, c)$ each of which
probes location j, for any $j \in [n']$, with probability at most
$\frac{2}{n'}$, and
2. for every $v \in \mathbb{F}_q^n$, the decision of $T'$ on $v'$ is identical
to that of T on v.
Further, $3n < n' \le 4n$.
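The Repeat mapping and the length bound of Lemma 5 can be sketched as follows. The rounding is taken to be a floor, which is the reading consistent with the stated bounds on the new length. The redirection rule in `per_location_probability`, sending a query for location i to a uniformly random copy of $v_i$, is an assumption made for illustration and is not claimed to be the construction of $T'$; all function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def smoothed_distribution(p):
    """c_i = 1/(2n) + p_i/2, as in Lemma 5."""
    n = len(p)
    return [Fraction(1, 2 * n) + Fraction(p_i) / 2 for p_i in p]


def repeat(v, p):
    """Repeat(v, p): symbol v_i appears floor(4 n p_i) times, in order."""
    n = len(v)
    out = []
    for symbol, p_i in zip(v, p):
        out.extend([symbol] * floor(4 * n * Fraction(p_i)))
    return out


def per_location_probability(p):
    """Probability that one redirected query lands on each location of
    Repeat(v, c), if a query for i (drawn with probability p_i) is sent
    to a uniformly random copy of v_i. Illustrative assumption only."""
    n = len(p)
    probs = []
    for p_i, c_i in zip(p, smoothed_distribution(p)):
        copies = floor(4 * n * c_i)  # at least 2, since 4 n c_i >= 2
        probs.extend([Fraction(p_i) / copies] * copies)
    return probs


# Toy query distribution over n = 5 locations (an arbitrary choice).
p = [Fraction(3, 7), Fraction(2, 7), Fraction(2, 7), Fraction(0), Fraction(0)]
v = [1, 0, 1, 1, 0]
v_prime = repeat(v, smoothed_distribution(p))
n, n_prime = len(v), len(v_prime)
probs = per_location_probability(p)
```

Here the five locations receive 6, 4, 4, 2 and 2 copies, so the new length is 18, and every location of the repeated word is probed with probability at most 2/18.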
Applying the transformation of Lemma 5 to the proximity verifier and proof
of proximity of Lemma 4, we conclude the following.
Proposition 1. Let $\varepsilon > 0$ be arbitrary. There exists a PCP of
proximity for the pair language CIRCUITVAL = {(C, x) | C is a boolean circuit
and C(x) = 1} with the following properties:
1. The proof length, for input circuits of size s, is at most
$s \cdot \exp(\log^{\varepsilon/2} s)$, and
2. for $t = \frac{2 \log\log s}{\log\log\log s}$ the verifier of proximity has
query complexity $O(\max\{\frac{1}{\delta}, \frac{1}{\varepsilon}\})$ for any
proximity parameter δ that satisfies $\delta \ge \frac{1}{t}$.
Furthermore, the queries of the verifier are non-adaptive with the following
properties:
1. Each query made to one of the locations of the input x is uniformly
distributed among the locations of x, and
2. each query to one of the locations in the proof of proximity π probes each
location with probability at most 2/|π| (and thus is distributed nearly
uniformly among the locations of π).

3.2 The Code
We now outline the construction of the locally testable code from [4]. The idea
behind the construction is to make use of a PCPP to aid in checking if the
received word is a codeword or is far away from being one. Details follow.
Suppose we have a binary code $C_0 : \{0,1\}^k \to \{0,1\}^m$ of distance d
defined by a parity check matrix $H \in \{0,1\}^{(m-k) \times m}$ that is
sparse, i.e., each of whose rows has only an absolute constant number of 1’s.
Such a code is referred to as a low-density parity check code (LDPC). For the
construction below, we will use any such code which is asymptotically good
(i.e., has rate k/m and relative distance d/m both positive as m → ∞). Explicit
constructions of such codes are known using expander graphs [15]. Let V be a
verifier of a PCP of proximity for membership in $C_0$; more precisely, the
proof of proximity of an input string $w \in \{0,1\}^m$ will be a proof that
$\tilde{C}_0(w) = 1$ where $\tilde{C}_0$ is a linear-sized circuit which
performs the parity checks required by H on w (the circuit will have size
O(m) = O(k) since H is sparse and $C_0$ has positive rate). Denote by π(x) the
proof of proximity guaranteed by Proposition 1 for the claim that the input
$C_0(x)$ is a member of $C_0$ (i.e., satisfies the circuit $\tilde{C}_0$). By
Proposition 1 and the fact that the size of $\tilde{C}_0$ is O(k), the length
of π(x) can be made at most $k \exp(\log^{\varepsilon/2} k)$.
The final code is defined as $C_1(x) = (C_0(x)^t, \pi(x))$ where
$t = \frac{(\log k - 1)|\pi(x)|}{|C_0(x)|}$. The repetition of the code part
$C_0(x)$ is required in order to ensure good distance, since the length of the
proof part π(x) typically dominates and we have no guarantee on how far apart
$\pi(x_1)$ and $\pi(x_2)$ for $x_1 \ne x_2$ are.
For the rest of this section let ℓ denote the proof length. The tester $T_1$
for $C_1$ on an input $w = (w_1, \ldots, w_t, \pi) \in \{0,1\}^{tm+\ell}$ picks
$i \in [t]$ at random and runs the PCPP verifier V on $w_i \circ \pi$. It also
performs a few rounds of the following consistency checks: pick
$i_1, i_2 \in [t]$ and $j_1, j_2 \in [m]$ at random and check if
$w_{i_1}(j_1) = w_{i_2}(j_2)$. Ben-Sasson et al. in [4] show that $T_1$ is a
standard tester. However, $T_1$ need not be a tolerant tester. To see this,
note that the proof part of $C_1$ forms a $\frac{1}{\log k}$ fraction of the
total length. Now consider a received word
$w_{rec} = (w_0, \ldots, w_0, \pi')$ where $w_0 \in C_0$ but $\pi'$ is not a
correct proof for $w_0$ being a valid codeword in $C_0$. Note that $w_{rec}$ is
close to $C_1$. However, $T_1$ is not guaranteed to accept $w_{rec}$ with high
probability.
The problem with the construction above was that the proof part was too
small: a natural fix is to make the proof part a constant fraction of the code-
word. We will show that this is sufficient to make the code tolerant testable. We
also remark that a similar idea was used by Ben-Sasson et al. to give efficient
constructions for relaxed locally decodable codes [4].
Construction 1 Let $0 < \beta < 1$ be a parameter,
$C_0 : \{0,1\}^k \to \{0,1\}^m$ be a good² binary code and V be a PCP of
proximity verifier for membership in $C_0$. Finally let π(x) be the proof
corresponding to the claim that $C_0(x)$ is a codeword in $C_0$. The final code
is defined as $C_2(x) = (C_0(x)^{r_1}, \pi(x)^{r_2})$ with
$r_1 = \frac{(1-\beta) \log k \, |\pi(x)|}{m}$ and $r_2 = \beta \log k$.
² This means that m = O(k) and the encoding can be done by circuits of nearly
linear size $s_0 = \tilde{O}(k)$.

For the rest of the section the proof length |π(x)| will be denoted by ℓ.
Further the proximity parameter and the number of queries made by the PCPP
verifier V would be denoted by $\delta_p$ and $q_p$ respectively. Finally let
$\rho_0$ denote the relative distance of the code $C_0$.
The tester $T_2$ for $C_2$ is also the natural generalization of $T_1$. For a
parameter $q_r$ (to be instantiated later) and input
$w = (w_1, \ldots, w_{r_1}, \pi_1, \ldots, \pi_{r_2}) \in \{0,1\}^{r_1 m + r_2 \ell}$,
$T_2$ does the following:
1. Repeat the next two steps twice.
2. Pick $i \in [r_1]$ and $j \in [r_2]$ randomly and run V on $w_i \circ \pi_j$.
3. Do $q_r$ repetitions of the following: pick $i_1, i_2 \in [r_1]$ and
$j_1, j_2 \in [m]$ randomly and check if $w_{i_1}(j_1) = w_{i_2}(j_2)$.
The following lemma captures the properties of the code C2 and its tester T2.
Lemma 6. The code $C_2$ in Construction 1 and the tester $T_2$ (with parameters
β and $q_r$ respectively) above have the following properties:
1. The code $C_2$ has block length $n = \log k \cdot \ell$ with minimum distance
d lower bounded by $(1 - \beta)\rho_0 n$.
2. $T_2$ makes a total of $q = 2q_p + 4q_r$ queries.
3. $T_2$ is $(1, q, 2, 2q_p, 1 - \beta)$-smooth.
4. $T_2$ is a $(c_1, c_2)$-tolerant tester with
$c_1 = \frac{n\beta(1-\beta)}{6d \max\{(2q_r + q_p)\beta,\; 2(1-\beta)q_p\}}$ and
$c_2 = \frac{n}{d}\left(\delta_p + \frac{4}{q_r} + \beta\right)$.
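Fix any $0 < \delta < 1$ and let $\beta = \frac{\delta}{2}$,
$\delta_p = \frac{\delta}{6}$, $q_r = \frac{12}{\delta}$. With these settings we
get $\delta_p + \frac{4}{q_r} + \beta = \delta$ and $q_p = O(\frac{1}{\delta})$
from Proposition 1 with the choice ε = 2δ. Finally,
$q = 2q_p + 4q_r = O(\frac{1}{\delta})$. Substituting the parameters in $c_2$
and $c_1$, we get $c_2 = \frac{\delta n}{d}$ and
$$\frac{c_1 d}{n} = \frac{\delta(2-\delta)}{24 \max\{\delta(q_r + q_p/2),\; (2-\delta)q_p\}} = \Omega(\delta^2).$$
Also note that the minimum distance
$d \ge (1-\beta)\rho_0 n = (1 - \frac{\delta}{2})\rho_0 n \ge \frac{\rho_0}{2} n$.
Thus, we have the following result for tolerant testable binary codes.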
Theorem 1. There exists an absolute constant $\alpha_0 > 0$ such that for every
δ, $0 < \delta < 1$, there exists an explicit binary linear code
$C : \{0,1\}^k \to \{0,1\}^n$ where $n = k \cdot \exp(\log^{\delta} k)$ with
minimum distance $d \ge \alpha_0 n$ which admits a $(c_1, c_2)$-tolerant tester
with $c_2 = O(\delta)$, $c_1 = \Omega(\delta^2)$ and query complexity
$O(\frac{1}{\delta})$.
The claim about explicitness follows from the fact that the PCPP of Lemma 4
and hence Proposition 1 has an explicit construction. The claim about linearity
follows from the fact that the PCPP for CIRCUITVAL is a linear function of the
input when the circuit computes linear functions – this aspect of the construction
is discussed in detail in Section 8.4 of the extended version of [4].
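The parameter settings that lead to Theorem 1 can be checked with exact arithmetic. The sketch below evaluates the expressions of Lemma 6 (item 4) at β = δ/2, $\delta_p$ = δ/6 and $q_r$ = 12/δ. The function name is hypothetical, and since Proposition 1 only gives $q_p = O(1/\delta)$, the default value used for $q_p$ is a stand-in we assume purely for illustration.

```python
from fractions import Fraction
from math import ceil


def theorem1_parameters(delta, q_p=None):
    """Evaluate the Lemma 6 expressions at the settings used for Theorem 1."""
    delta = Fraction(delta)
    beta = delta / 2
    delta_p = delta / 6
    q_r = 12 / delta
    if q_p is None:
        q_p = ceil(6 / delta)  # hypothetical stand-in for the O(1/delta) bound
    # c2 * d / n from Lemma 6, item 4
    soundness = delta_p + 4 / q_r + beta
    # c1 * d / n from Lemma 6, item 4
    tolerance = beta * (1 - beta) / (
        6 * max((2 * q_r + q_p) * beta, 2 * (1 - beta) * q_p))
    return {
        "c2_d_over_n": soundness,
        "c1_d_over_n": tolerance,
        "queries": 2 * q_p + 4 * q_r,
    }


params = theorem1_parameters(Fraction(1, 4))
```

For δ = 1/4 this gives $c_2 d/n$ = 1/4 exactly, $c_1 d/n$ = 1/2304 and 240 queries under the assumed value of $q_p$, in line with the orders of magnitude stated in Theorem 1.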

4 Tolerant Testers for Tensor Products of Codes
Tensor product of codes is a simple way to construct new codes from any
existing codes such that the constructed codes have testers with sub-linear
query complexity even though the original code need not admit a sub-linear
complexity tester [5]. We first briefly define product of codes and then
outline the tester of product of codes from [5].
Given an $[n, k, d]_q$ code C, the product of C with itself, denoted by $C^2$,
is a $[n^2, k^2, d^2]_q$ code such that a codeword (viewed as a n × n matrix)
restricted to any row or column is a codeword in C. More formally, given the
n × k generator matrix M of C, $C^2$ is precisely the set of matrices in the
set $\{M \cdot X \cdot M^T \mid X \in \mathbb{F}_q^{k \times k}\}$. A very
natural test for $C^2$ is to randomly choose a row or a column and then check
if the restriction of the received word on that row or column is a codeword in
C (which can be done for example by querying all the n points in the row or
column). Unfortunately, it is not known if this test is robust in general (see
the discussion in [5]).
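As a small illustration of the product construction and of the row/column test, the following sketch builds $C^2$ for a toy binary code. The choice of the [3, 2, 2] single-parity-check code and all function names are our own assumptions for the example.

```python
from itertools import product

# Generator matrix (n x k = 3 x 2) of the binary [3, 2, 2] single-parity-check
# code; a toy choice made for illustration.
M = [[1, 0], [0, 1], [1, 1]]


def mat_mul(a, b, q=2):
    """Matrix product over the integers modulo q."""
    return [[sum(x * y for x, y in zip(row, col)) % q for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def base_codewords(M, q=2):
    """All codewords M x of the base code C, as tuples."""
    k = len(M[0])
    return {tuple(row[0] for row in mat_mul(M, [[x] for x in msg], q))
            for msg in product(range(q), repeat=k)}


def product_codeword(M, X, q=2):
    """The codeword M X M^T of C^2 for a k x k message matrix X."""
    return mat_mul(mat_mul(M, X, q), transpose(M), q)


def row_column_test(word, code, b, i):
    """Check row i (b = 0) or column i (b = 1) of an n x n word against C."""
    line = word[i] if b == 0 else [row[i] for row in word]
    return tuple(line) in code


C = base_codewords(M)
all_words = [product_codeword(M, [[a, b], [c, d]])
             for a, b, c, d in product(range(2), repeat=4)]
corrupted = [row[:] for row in product_codeword(M, [[1, 0], [0, 0]])]
corrupted[1][1] ^= 1  # a single error in row 1, column 1
```

Every row and column of every word $M X M^T$ passes the test, the minimum nonzero weight is $d^2 = 4$, and a single flipped entry is caught only by the two tests (one row, one column) that read it.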
Ben-Sasson and Sudan in [5] considered the more general product of codes $C^t$
for t ≥ 3 along with the following general tester: choose at random
$b \in \{1, \ldots, t\}$ and $i \in \{1, \ldots, n\}$ and check if the $b$th
coordinate of the received word (which is an element of $\mathbb{F}_q^{n^t}$)
when restricted³ to i is a codeword in $C^{t-1}$. It is shown in [5] that this
test is robust, in that if a received word is far from $C^t$, then many of the
tested substrings will be far from $C^{t-1}$. This tester lends itself to
recursion: the test for $C^{t-1}$ can be reduced to a test for $C^{t-2}$ and so
on till we need to check whether a word in $\mathbb{F}_q^{n^2}$ is a codeword
of $C^2$. This last check can be done by querying all the $n^2$ points, out of
the $n^t$ points in the original received word, thus leading to a sub-linear
query complexity. As shown in [5], the reduction can be done in log t stages by
the standard halving technique.
We now give a tolerant version of the test for product of codes given by
Ben-Sasson and Sudan [5]. In what follows t ≥ 4 will be a power of two. As
mentioned above the tester T for the tensor product $C^t$ reduces the test to
checking if some restriction of the given string belongs to $C^2$. For the rest
of this section, with a slight abuse of notation let
$v_f \in \mathbb{F}_q^{n^2}$ denote the final restriction being tested. In what
follows we assume that by looking at all points in any
$v \in \mathbb{F}_q^{n^2}$ one can determine if
$\mathrm{dist}(v, C^2) \le \tau$ in time polynomial in $n^2$.
The tolerant version of the test of [5] is a simple modification as mentioned
in Section 2: reduce the test on $C^t$ to $C^2$ as in [5] and then accept if
$v_f$ is τ-close to $C^2$.
First we make the following observation about the test in [5]. The test recurses
log t times to reduce the test to $C^2$. At step l, the test chooses a random
coordinate $b_l$ (this will just be a random bit) and fixes the value of the
$b_l$th coordinate of the current $C^{t/2^l}$ to an index $i_l$ (where $i_l$
takes values in the range $1 \le i_l \le n^{t/2^l}$). The key observation here
is that for each fixed choice of $b_1, \ldots, b_{\log t}$, distinct choices of
$i_1, \ldots, i_{\log t}$ correspond to querying disjoint sets of $n^2$ points
in the original $v \in \mathbb{F}_q^{n^t}$ string, which together form a
partition of all coordinates of v. In other words, T has a partitioned query
pattern, which will be useful to argue tolerance. For soundness, we use the
results in [5], which show that their tester is $C^{\log t}$-robust for
$C = 2^{32}$.
³ For the t = 2 case b signifies either row or column and i denotes the
row/column index.
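The partition property can be observed directly on index sets. The sketch below is one reading of the halving recursion, under our own assumptions: the current word of length $n^{t'}$ is viewed as a square matrix with side $n^{t'/2}$, each step keeps one row (b = 0) or one column (b = 1), and steps are applied until $n^2$ positions remain. The function names are hypothetical, and no code or field is involved since only the queried positions matter here.

```python
from itertools import product


def final_query_set(n, t, choices):
    """Positions (in a word of length n**t) read by the final restriction.
    choices holds one (b, i) pair per halving step: b = 0 keeps row i and
    b = 1 keeps column i of the current square view."""
    view = list(range(n ** t))
    cur = t
    for b, i in choices:
        side = n ** (cur // 2)  # the current view is a side x side matrix
        if b == 0:
            view = [view[i * side + c] for c in range(side)]
        else:
            view = [view[r * side + i] for r in range(side)]
        cur //= 2
    return view


def all_query_sets(n, t, bits):
    """Query sets for every choice of indices, with the bits b_l held fixed."""
    ranges = []
    cur = t
    for _ in bits:
        ranges.append(range(n ** (cur // 2)))
        cur //= 2
    return [final_query_set(n, t, list(zip(bits, idx)))
            for idx in product(*ranges)]


# Toy sizes (assumptions): n = 2, t = 8, so the word has 256 positions and
# two halving steps leave n**2 = 4 positions per query set.
n, t = 2, 8
sets_for_bits = {bits: all_query_sets(n, t, bits)
                 for bits in product((0, 1), repeat=2)}
```

For each fixed pair of bits there are 64 query sets of 4 positions each, and together they cover all 256 positions exactly once, which is the partitioned query pattern required by Lemma 3.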
Applying Lemmas 2 and 3, therefore, we have the following result:
Theorem 2. Let t ≥ 4 be a power of two and $0 < \tau \le 1$. There exist
$0 < c_1 < c_2 \le 1$ with $\frac{c_2}{c_1} = C^{\log t}(1 + 2/\tau)$ such that
the proposed tolerant tester for $C^t$ is a $(c_1, c_2)$-tolerant tester with
query complexity $N^{2/t}$ where N is the block length of $C^t$. Further, $c_1$
and $c_2$ are constants (independent of N) if t is a constant and C has
constant relative distance.
Thus, Theorem 2 achieves the goal of a simple construction of tolerant testable
codes with sub-linear query complexity, as the following corollary records:
Corollary 1 For every $\gamma > 0$, there is an explicit family of
asymptotically good binary linear codes which are tolerant testable using
$n^{\gamma}$ queries, where n is the block length of the concerned code. (The
rate, relative distance and thresholds $c_1$, $c_2$ for the tolerant testing
depend on γ.)
Obtaining non-trivial lower bounds on the block length of codes that are locally
testable with very few (even 3) queries is an extremely interesting question.
This problem has remained open and resisted even moderate progress despite
all the advancements in constructions of LTCs. The requirement of having a
tolerant local tester is a stronger requirement. While we have seen that we can
get tolerance with similar parameters to the best known LTCs, it remains an
interesting question whether the added requirement of tolerance makes the task
of proving lower bounds more tractable. This seems like a good first step in mak-
ing progress towards understanding whether asymptotically good locally testable
codes exist, a question which is arguably one of the grand challenges in this area.
For interesting work in this direction which proves that such codes, if they exist,
cannot also be cyclic, see [3].
We also investigated whether Reed-Muller codes admit tolerant testers. For
large fields, the existing results on low-degree testing of multivariate polynomials
[9, 14] immediately imply results on tolerant testing for these codes. The details
are omitted for lack of space and will appear in the full version of
the paper.
Very recently, (standard) locally testable codes with block length $n = k \cdot \mathrm{polylog}\, k$
have been constructed [6, 7]. In the full version of the paper we hope
to investigate whether these new codes admit a tolerant tester as well.
References
1. S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof verification and
the intractability of approximation problems. Journal of the ACM, 45(3):501–555,
1998.

2. S. Arora and S. Safra. Probabilistic checking of proofs: A new characterization of
NP. Journal of the ACM, 45(1):70–122, 1998.
3. L. Babai, A. Shpilka, and D. Stefankovic. Locally testable cyclic codes. In Pro-
ceedings of 44th Annual Symposium on Foundations of Computer Science (FOCS),
pages 116–125, 2003.
4. E. Ben-Sasson, O. Goldreich, P. Harsha, M. Sudan, and S. Vadhan. Robust PCPs
of proximity, shorter PCPs and application to coding. In Proceedings of the 36th
Annual ACM Symposium on Theory of Computing (STOC), pages 1–10, 2004.
5. E. Ben-Sasson and M. Sudan. Robust locally testable codes and products of codes.
In Proceedings of the 8th International Workshop on Randomization and Compu-
tation (RANDOM), pages 286–297, 2004.
6. E. Ben-Sasson and M. Sudan. Simple PCPs with poly-log rate and query complex-
ity. In Proceedings of 37th ACM Symposium on Theory of Computing (STOC),
pages 266–275, 2005.
7. I. Dinur. The PCP theorem by gap amplification. In ECCC Technical Report
TR05-046, 2005.
8. E. Fischer and L. Fortnow. Tolerant versus intolerant testing for boolean proper-
ties. In Proceedings of the 20th IEEE Conference on Computational Complexity,
2005. To appear.
9. K. Friedl and M. Sudan. Some improvements to total degree tests. In Proceedings of
the 3rd Israel Symp. on Theory and Computing Systems (ISTCS), pages 190–198,
1995.
10. O. Goldreich. Short locally testable codes and proofs (Survey). ECCC Technical
Report TR05-014, 2005.
11. O. Goldreich and M. Sudan. Locally testable codes and PCPs of almost linear
length. In Proceedings of 43rd Symposium on Foundations of Computer Science
(FOCS), pages 13–22, 2002.
12. T. Kaufman and D. Ron. Testing polynomials over general fields. In Proceedings of
the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS),
pages 413–422, 2004.
13. M. Parnas, D. Ron, and R. Rubinfeld. Tolerant property testing and distance
approximation. In ECCC Technical Report TR04-010, 2004.
14. A. Polishchuk and D. A. Spielman. Nearly-linear size holographic proofs. In
Proceedings of the 26th Annual ACM Symposium on Theory of Computing (STOC),
pages 194–203, 1994.
15. M. Sipser and D. Spielman. Expander codes. IEEE Transactions on Information
Theory, 42(6):1710–1722, 1996.
16. M. Sudan. Efficient Checking of Polynomials and Proofs and the Hardness of
Approximation Problems. ACM Distinguished Theses Series. Lecture Notes in
Computer Science, no. 1001, Springer, 1996.

A Lower Bound on List Size for List Decoding
Venkatesan Guruswami¹ and Salil Vadhan²
¹ Department of Computer Science & Engineering
University of Washington, Seattle, WA
venkat@cs.washington.edu
² Division of Engineering & Applied Sciences
Harvard University, Cambridge, MA
salil@eecs.harvard.edu
Abstract. A q-ary error-correcting code $C \subseteq \{1, 2, \dots, q\}^n$ is said to be
list decodable to radius ρ with list size L if every Hamming ball of radius
ρ contains at most L codewords of C. We prove that in order for a q-ary
code to be list-decodable up to radius $(1 - 1/q)(1 - \varepsilon)n$, we must have
$L = \Omega(1/\varepsilon^2)$. Specifically, we prove that there exists a constant $c_q > 0$
and a function $f_q$ such that for small enough $\varepsilon > 0$, if C is list-decodable
to radius $(1 - 1/q)(1 - \varepsilon)n$ with list size $c_q/\varepsilon^2$, then C has at most $f_q(\varepsilon)$
codewords, independent of n. This result is asymptotically tight (treating
q as a constant), since such codes with an exponential (in n) number of
codewords are known for list size $L = O(1/\varepsilon^2)$.
A result similar to ours is implicit in Blinovsky [Bli] for the binary (q = 2)
case. Our proof works for all alphabet sizes, and is technically and
conceptually simpler.
1 Introduction
List decoding was introduced independently by Elias [Eli1] and Wozencraft [Woz]
as a relaxation of the classical notion of error-correction by allowing the decoder
to output a list of possible answers. The decoding is considered successful as long
as the correct message is included in the list. We point the reader to the paper
by Elias [Eli2] for a good summary of the history and context of list decoding.
The basic question raised by list decoding is the following: How many er-
rors can one recover from, when constrained to output a list of small size? The
study of list decoding strives to (1) understand the combinatorics underlying
this question, (2) realize the bounds with explicit constructions of codes, and
(3) list decode those codes with efficient algorithms. This work falls in the com-
binatorial facet of list decoding. Combinatorially, an error-correcting code has
“nice” list-decodability properties if every Hamming ball of “large” radius has
a “small” number of codewords in it. In this work, we are interested in expos-
ing some combinatorial limitations on the performance of list-decodable codes.


Supported by NSF grant CCF-0133096, ONR grant N00014-04-1-0478, and a Sloan
Research Fellowship. Work done in part while a Fellow at the Radcliffe Institute for
Advanced Study.
Specifically, we seek lower bounds on the list size needed to perform decoding
up to a certain number of errors, or in other words, lower bounds on the number
of codewords that must fall inside some ball of specified radius centered at some
point. We show such a result by picking the center in a certain probabilistic
way. We now give some background definitions and terminology, followed by a
description of our main result.
1.1 Preliminaries
We denote the set {1, 2, . . ., m} by the shorthand [m]. For q ≥ 2, a q-ary code of
block length n is simply a subset of $[q]^n$. The elements of the code are referred to
as codewords. The high-level property of a code that makes it useful for error-
correction is its sparsity — the codewords must be well spread-out, so they are
unlikely to distort into one another. One way to insist on sparsity is that the
Hamming distance between every pair of distinct codewords is at least d. Note
that this is equivalent to requiring that every Hamming ball of radius (d−1)/2
has at most one codeword. Generalizing this, one can allow up to a small number,
say L, of codewords in Hamming balls of certain radius. This leads to the notion
of list decoding and a good list-decodable code. Since the expected Hamming
distance of a random string of length n from any codeword is (1 − 1/q) · n for
a q-ary code, the largest fraction of errors one can sensibly hope to correct is
(1 − 1/q). This motivates the following definition of a list-decodable code.
Definition 1. Let $q \ge 2$, $0 < \rho < 1$, and L be a positive integer. A q-ary code
C of block length n is said to be (ρ, L)-list-decodable if for every $y \in [q]^n$, the
Hamming ball of radius $\rho \cdot (1 - 1/q) \cdot n$ centered at y contains at most L codewords
of C.
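Definition 1 can be checked mechanically on toy codes. The following is a minimal brute-force sketch, assuming codewords are given as tuples over {1, ..., q}; the function names `hamming` and `is_list_decodable` and the three-word example code are illustrative choices, not taken from the paper. It enumerates all $q^n$ centers, so it is only usable for very small n.

```python
from itertools import product

def hamming(x, y):
    """Number of coordinates in which the tuples x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def is_list_decodable(code, q, rho, L):
    """Brute-force check of Definition 1 for a code over the alphabet {1, ..., q}."""
    n = len(code[0])
    radius = rho * (1 - 1 / q) * n
    for y in product(range(1, q + 1), repeat=n):  # every possible center y
        in_ball = sum(1 for c in code if hamming(c, y) <= radius)
        if in_ball > L:
            return False
    return True

# Hypothetical toy binary code of block length 4.
code = [(1, 1, 1, 1), (1, 1, 2, 2), (2, 2, 2, 2)]
# With rho = 0.5 the radius is 0.5 * (1 - 1/2) * 4 = 1.
print(is_list_decodable(code, q=2, rho=0.5, L=1))  # center (1, 1, 1, 2) sees two codewords
print(is_list_decodable(code, q=2, rho=0.5, L=2))  # no radius-1 ball holds all three
```

The first call fails because the center (1, 1, 1, 2) is within distance 1 of two codewords; allowing list size 2 is enough for this code.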
We will study (ρ, L)-list-decodable codes for ρ = 1 − ε in the limit of ε → 0.
This setting is the one where list decoding is most beneficial, and is a clean setting
to initially study the asymptotics. In particular, we will prove that, except for
trivial codes whose size does not grow with n, (1 − ε, L)-list-decodable codes
require list size $L = \Omega(1/\varepsilon^2)$ (hiding dependence on q).
1.2 Context and Related Results
Before stating our result, we describe some of the previously known results to
elucidate the broader context where our work fits. The rate of a q-ary code C of
block length n is defined to be $\frac{\log_q |C|}{n}$. For $0 \le x \le 1$, we denote by $H_q(x)$ the
q-ary entropy function, $H_q(x) = x \log_q(q - 1) - x \log_q x - (1 - x) \log_q(1 - x)$.
Using the probabilistic method, it can be shown that (ρ, L)-list-decodable
q-ary codes of rate $1 - H_q((1 - 1/q)\rho) - 1/L$ exist [Eli2, GHSZ]. In particular, in
the limit of large L, we can achieve a rate of $1 - H_q((1 - 1/q)\rho)$, which equals both
the Hamming bound and the Shannon capacity of the q-ary channel that changes
a symbol α ∈ [q] to a uniformly random element of [q] with probability ρ and
leaves α unchanged with probability $1 - (1 - 1/q)\rho$. When ρ = 1 − ε for small ε, we
have $H_q((1 - 1/q)\rho) = 1 - \Omega(q\varepsilon^2/\log q)$. Therefore, there exist $(1 - \varepsilon, L(q, \varepsilon))$-
list-decodable q-ary codes with $2^{\Omega(q\varepsilon^2 n)}$ codewords and $L(q, \varepsilon) = O\left(\frac{\log q}{q\varepsilon^2}\right)$. In
particular, for constant q, list size of $O(1/\varepsilon^2)$ suffices for non-trivial list decoding
up to radius (1 − 1/q) · (1 − ε).
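The quadratic behaviour of the entropy gap $1 - H_q((1 - 1/q)(1 - \varepsilon))$ can be observed numerically. The sketch below is only an illustration under our own choices (the helper name `Hq`, the alphabet size q = 4, and the sampled values of ε are assumptions); it prints the gap divided by $\varepsilon^2$, which should stay roughly constant if the gap is quadratic in ε.

```python
import math

def Hq(x, q):
    """q-ary entropy function H_q(x) for 0 <= x <= 1, with 0 * log 0 taken as 0."""
    def xlog(p):
        return 0.0 if p == 0 else p * math.log(p, q)
    return x * math.log(q - 1, q) - xlog(x) - xlog(1 - x)

q = 4
for eps in (0.2, 0.1, 0.05):
    gap = 1 - Hq((1 - 1 / q) * (1 - eps), q)
    # gap / eps**2 staying roughly constant is what a quadratic gap looks like
    print(eps, gap, gap / eps ** 2)
```

A second-order Taylor expansion around x = 1 − 1/q suggests the ratio approaches (q − 1)/(2 ln q), which is consistent with the $\Omega(q\varepsilon^2/\log q)$ estimate.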
We are interested in whether this quadratic dependence on 1/ε in the list size
is inherent. The quadratic bound is related to the 2 log(1/ε) − O(1) lower bound
due to [RT] for the amount of “entropy loss” in randomness extractors, which are
well-studied objects in the subject of pseudorandomness. In fact a lower bound
of $\Omega(1/\varepsilon^2)$ on list size implies such an entropy loss bound for (“strong”)
randomness extractors. However, in the other direction, the argument loses a
factor of ε in the lower bound, yielding only a lower bound of Ω(1/ε) for list size (cf.
[Vad]).
For the model of erasures, where up to a fraction (1−ε) of symbols are erased
by the channel, optimal bounds of Θ(log(1/ε)) are known for the list size required
for binary codes [Gur]. This can be compared with the log log(1/ε) − O(1) lower
bound on entropy loss for dispersers [RT].
A lower bound of $\Omega(1/\varepsilon^2)$ for list size L for (1 − ε, L)-list-decodable binary
codes follows from the work of Blinovsky [Bli]. We discuss more about his work
and how it compares to our results in Section 1.5.
1.3 Our Result
Our main result is a proof of the following fact: the smallest list size that permits
list decoding up to radius (1 − 1/q)(1 − ε) is $\Theta(\varepsilon^{-2})$ (hiding constants depending
on q in the Θ-notation). The formal statement of our main result is below.
Theorem 2 (Main). For every integer $q \ge 2$ there exist $c_q > 0$ and $d_q < \infty$
such that for all small enough $\varepsilon > 0$, the following holds. If C is a q-ary
$(1 - \varepsilon, c_q/\varepsilon^2)$-list-decodable code, then $|C| \le 2^{d_q \cdot \varepsilon^{-2} \log(1/\varepsilon)}$.
1.4 Overview of Proof
We now describe the high-level structure of our proof. Recall that our goal is to
exhibit a center z that has several (specifically $\Omega(1/\varepsilon^2)$) codewords of C with
large correlation, where we say two codewords have correlation ε if they agree
in (1/q + ε) · n locations. (The actual definition we use, given in Definition 3,
is slightly different, but this version suffices for the present discussion.) Using
the probabilistic method, it is not very difficult to prove the existence of such a
center z and $\Omega(1/\varepsilon^2)$ codewords whose average correlation with z is at least Ω(ε).
(This is the content of our Lemma 6.) This step is closely related to (and actually
follows from) the known lower bound of Radhakrishnan and Ta-Shma [RT] on the
“entropy loss” of “randomness extractors,” by applying the known connection
between randomness extractors and list-decodable error-correcting codes (see
[Tre, TZ, Vad]).
However, this large average could occur due to about 1/ε codewords having an
Ω(1) correlation with z, whereas we would like to find many more (i.e., $\Omega(1/\varepsilon^2)$)

codewords with smaller (i.e., Ω(ε)) correlation. We get around this difficulty by
working with a large subcode $C'$ of C where such a phenomenon cannot occur.
Roughly speaking, we will use the probabilistic method to prove the existence
of a large “L-pseudorandom” subcode $C'$, for which looking at any set of L
codewords of $C'$ never reveals any significant overall bias in terms of the most
popular symbol (out of [q]). More formally, for all $\ell$-tuples of codewords with $\ell \le L$,
the average “plurality” (i.e., frequency of the most frequent symbol) over all the
coordinates isn’t much higher than $\ell/q$. (This is the content of our Lemma 7.) This in turn
implies that for every center z, the sum of the correlations of z with all codewords
that have “large” correlation (say at least Dε, for a sufficiently large constant D)
is small. Together with the high average correlation bound, this means several
codewords must have “intermediate” correlation with z (between ε and Dε).
The number of such codewords is our lower bound on list size.
1.5 Comparison with Blinovsky [Bli]
As remarked earlier, a lower bound of $L = \Omega(1/\varepsilon^2)$ for binary (1 − ε, L)-list-decodable
codes follows from the work of Blinovsky [Bli]. He explores the tradeoff
between ρ, L, and the relative rate γ of a (ρ, L)-list-decodable code, when all
three of these parameters are constants and the block length n tends to infinity.
A special case of his main theorem shows that if ρ = 1 − ε and $L \le c/\varepsilon^2$ for
a certain constant $c > 0$, then the rate γ must be zero asymptotically, which
means that the code can have at most $2^{o(n)}$ codewords for block length n. A
careful inspection of his proof, however, reveals an f(ε) bound (independent of
n) on the number of codewords in any such code. This is similar in spirit to our
Theorem 2. However, our work compares favorably with [Bli] in the following
respects.
1. Our result also holds for q-ary codes for $q > 2$. The result in [Bli] applies only
to binary codes, and it is unclear whether his analysis can be generalized to
q-ary codes.
2. Our result is quantitatively stronger. The dependence f(ε) of the bound
on the size of the code in [Bli] is much worse than the $(1/\varepsilon)^{O(\varepsilon^{-2})}$ that we
obtain. In particular, f(ε) is at least an exponential tower of height $\Theta(1/\varepsilon^2)$
(and is in fact bigger than the Ackermann function of 1/ε).
3. Our proof seems significantly simpler and provides more intuition about why
and how the lower bound arises.
We now comment on the proof method in [Bli]. As with our proof, the first
step in the proof is a bound for the case when the average correlation (w.r.t every
center) for every set of L+1 codewords is small (this is Theorem 2 in [Bli]). Note
that this is a more stringent condition than requiring no set of L + 1 codewords
lie within a small ball. Our proof uses the probabilistic method to show the
existence of codewords with large average correlation in any reasonable sized
code. The proof in [Bli] is more combinatorial, and uses a counting argument to
bound the size of the code when all subsets of L +1 codewords have low average

correlation (with every center). But the underlying technical goal of the first
step in both the approaches is the same.
The second step in Blinovsky’s proof is to use this bound to obtain a bound
for list-decodable codes. The high-level idea is to pick a subcode of the list-
decodable code with certain nice properties so that the bound for average corre-
lation can be turned into one for list decoding. This is also similar in spirit to our
approach (Lemma 7). The specifics of how this is done are, however, quite differ-
ent. The approach in [Bli] is to find a large subcode which is (L +1)-equidistant,
i.e., for every k ≤ L +1, all subsets of k codewords have the same value for their
k’th order scalar product, which is defined as the sum over all coordinates of the
product of the k symbols (from {0, 1}) in that coordinate.¹ Such a subcode has
the following useful property: in each subset of L+1 codewords, all codewords in
the subset have the same agreement with the best center, i.e., the center obtained
by taking their coordinate-wise majority, and moreover this value is independent
of the choice of the subset of L + 1 codewords. This in turn enables one to get
a bound for list decoding from one for average correlation. The requirement of
being (L +1)-equidistant is a rather stringent one, and is achieved iteratively by
ensuring k-equidistance for k = 1, 2, . . . , L + 1 successively. Each stage incurs a
rather huge loss in the size of the code, and thus the bound obtained on the size
of the original code is an enormously large function of 1/ε. We make do with a
much weaker property than (L + 1)-equidistance, letting us pick a much larger
subcode with the property we need. This translates into a good upper bound on
the size of the original list-decodable code.
2 Proof of Main Result
We begin with convenient measures of closeness between strings: the agreement
and the correlation.
Definition 3 (Agreement and Correlation). For strings $x, y \in [q]^n$, define
their agreement, denoted agr(x, y), by $\mathrm{agr}(x, y) = \frac{1}{n} \cdot \#\{i : x_i = y_i\}$. Their correlation is the
value $\mathrm{corr}(x, y) \in [-1/(q - 1), 1]$ such that
$$\mathrm{agr}(x, y) = \frac{1}{q} + \left(1 - \frac{1}{q}\right) \cdot \mathrm{corr}(x, y).$$ ²
The standard notion of correlation between two strings in {1, −1}n
is simply
their dot product divided by n; the definition above is a natural generalization
to larger alphabets.
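As a concrete reading of Definition 3, here is a small sketch that computes agreement and correlation for strings represented as tuples over {1, ..., q}. The function names `agr` and `corr` mirror the paper's notation; the example strings are our own.

```python
def agr(x, y):
    """Fraction of coordinates where x and y agree (Definition 3)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def corr(x, y, q):
    """Correlation, obtained by solving agr = 1/q + (1 - 1/q) * corr for corr."""
    return (agr(x, y) - 1 / q) / (1 - 1 / q)

x = (1, 2, 3, 1, 2, 3)
print(corr(x, x, q=3))                         # full agreement: the maximum value 1
print(corr(x, (2, 3, 1, 2, 3, 1), q=3))        # no agreement: the minimum value -1/(q-1)
print(corr((1, 1, 1, 1), (1, 1, 2, 2), q=2))   # agreement 1/2 over a binary alphabet
```

For q = 2 and strings viewed in {1, −1}, this reduces to the usual normalized dot product, since agreement 1/2 gives correlation 0.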
A very useful notion for us will be the plurality of a set of codewords.
Definition 4 (Plurality). For symbols a1, . . . , ak ∈ [q], we define their plu-
rality plur(a1, . . . , ak) ∈ [q] to be the most frequent symbol among a1, . . . , ak,
breaking ties arbitrarily. We define the plurality count #plur(a1, . . . , ak) ∈ N to
be the number of times that plur(a1, . . . , ak) occurs among a1, . . . , ak.
¹ A slight relaxation of the (L + 1)-equidistance property is actually what is used in
[Bli], but this description should suffice for the discussion here.
² Note that we find it convenient to work with agreement and correlation that are
normalized by dividing by the length n.

For vectors $c_1, \dots, c_k \in [q]^n$, we define $\mathrm{plur}(c_1, \dots, c_k) \in [q]^n$ to be the
componentwise plurality, i.e. $\mathrm{plur}(c_1, \dots, c_k)_i = \mathrm{plur}(c_{1i}, \dots, c_{ki})$.
We define $\#\mathrm{plur}(c_1, \dots, c_k)$ to be the average plurality count over all coordinates;
that is, $\#\mathrm{plur}(c_1, \dots, c_k) = (1/n) \cdot \sum_{i=1}^{n} \#\mathrm{plur}(c_{1i}, \dots, c_{ki})$.
The reason pluralities will be useful to us is that they capture the maximum
average correlation any vector has with a set of codewords:
Lemma 5. For all $c_1, \dots, c_k \in [q]^n$,
$$\max_{z \in [q]^n} \sum_{i=1}^{k} \mathrm{agr}(z, c_i) = \sum_{i=1}^{k} \mathrm{agr}(\mathrm{plur}(c_1, \dots, c_k), c_i) = \#\mathrm{plur}(c_1, \dots, c_k).$$
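Lemma 5 is easy to confirm by exhaustive search on a tiny instance. The sketch below is illustrative only: the helper names `plur_vector` and `plur_count` and the three sample codewords are assumptions of ours, and ties are broken by whatever order `Counter.most_common` returns, which the definition permits.

```python
from collections import Counter
from itertools import product

def agr(x, y):
    """Fraction of coordinates where x and y agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def plur_vector(codewords):
    """Componentwise plurality, ties broken arbitrarily."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*codewords))

def plur_count(codewords):
    """Average plurality count over all coordinates."""
    n = len(codewords[0])
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*codewords)) / n

q, n = 3, 4
cs = [(1, 2, 3, 1), (1, 2, 2, 3), (2, 2, 1, 3)]
# Left side of Lemma 5: maximize total agreement over all q^n centers.
best = max(sum(agr(z, c) for c in cs) for z in product(range(1, q + 1), repeat=n))
# Middle: total agreement achieved by the plurality vector itself.
at_plur = sum(agr(plur_vector(cs), c) for c in cs)
print(best, at_plur, plur_count(cs))  # the three quantities coincide
```

The coordinates decouple: in each position the best choice for z is a most frequent symbol, which is why the plurality vector attains the maximum.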
Note that our goal of proving a lower bound on list size is the same as proving
that in every not too small code, there must be some center z that has several
(i.e. $\Omega(1/\varepsilon^2)$) close-by codewords, or in other words several codewords with
large (i.e., at least ε) correlation. We begin by showing the existence of a center
which has a large average correlation with a collection of several codewords. By
Lemma 5, this is equivalent to finding a collection of several codewords whose
total plurality count is large.
Lemma 6. For all integers $q \ge 2$ and $t \ge 37q$, and every code $C \subseteq [q]^n$ with $|C| \ge 2t$,
there exist t distinct codewords $c_1, c_2, \dots, c_t \in C$ such that
$$\#\mathrm{plur}(c_1, \dots, c_t) \ge \frac{t}{q} + \Omega\left(\sqrt{\frac{t}{q}}\right).$$
Equivalently, there exists a $z \in [q]^n$ such that
$$\sum_{i=1}^{t} \mathrm{corr}(z, c_i) \ge \Omega\left(\sqrt{\frac{t}{q}}\right). \tag{1}$$
Proof. Without loss of generality, assume $|C| = 2t$. Pick a subset $\{c_1, c_2, \dots, c_t\}$
from C, chosen uniformly at random among all t-element subsets of C. For $j = 1, \dots, n$,
define the random variable $P_j = \#\mathrm{plur}(c_{1j}, \dots, c_{tj})$ to be the plurality
count of the j’th coordinates. By definition, $\#\mathrm{plur}(c_1, \dots, c_t) = (1/n) \cdot \sum_{j=1}^{n} P_j$. Notice
that $P_j$ is always at least t/q, and we would expect the plurality count to occasionally
deviate from this lower bound. Indeed, it can be shown that for any sequence of
2t elements of [q], if we choose a random subset of half of them, the expected
plurality count is $t/q + \Omega(\sqrt{t/q})$. Thus, $E[P_j] = t/q + \Omega(\sqrt{t/q})$. So
$E[(1/n) \cdot \sum_j P_j] = t/q + \Omega(\sqrt{t/q})$, and thus the lemma follows by taking any $c_1, \dots, c_t$ that
achieve at least the expectation. The equivalent reformulation in terms of correlation
follows from Lemma 5 and the definition of correlation in terms of agreement
(Definition 3).
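The key probabilistic fact in this proof (a random half of 2t symbols has expected plurality count $t/q + \Omega(\sqrt{t/q})$) can be explored with a quick Monte Carlo experiment. This is only a sketch under our own assumptions: the function name `mean_plurality_count`, the perfectly balanced input column, and the parameters q = 3, t = 300 are illustrative, and a simulation of course does not replace the proof.

```python
import random
from collections import Counter

def mean_plurality_count(symbols, trials=2000, seed=0):
    """Estimate the expected plurality count of a uniformly random half of `symbols`."""
    rng = random.Random(seed)
    t = len(symbols) // 2
    total = 0
    for _ in range(trials):
        half = rng.sample(symbols, t)                 # random t-element subset
        total += Counter(half).most_common(1)[0][1]   # plurality count of the subset
    return total / trials

q, t = 3, 300
symbols = [1 + (i % q) for i in range(2 * t)]  # a perfectly balanced column of 2t symbols
est = mean_plurality_count(symbols)
# Observed excess over t/q, printed next to sqrt(t/q) for comparison of orders of magnitude.
print(est - t / q, (t / q) ** 0.5)
```

The balanced column is the case where the full set shows no bias at all, so any excess over t/q comes purely from the random choice of the half.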

For any $\varepsilon > 0$, the above lemma gives a center z and $t = \Omega(1/(q\varepsilon^2))$ codewords
$c_1, \dots, c_t$ such that the average correlation between z and $c_1, \dots, c_t$ is at least ε.
This implies that at least $(\varepsilon/2) \cdot t$ of the $c_i$’s have correlation at least ε/2 with
z. Thus we get a list-size lower bound of $(\varepsilon/2) \cdot t = \Omega(1/(q\varepsilon))$ for decoding from
correlation ε/2. (This argument has also appeared in [LTW].)
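The averaging step used here can be spelled out as follows. This is our own expansion of the argument, using only the fact that every correlation is at most 1; the set name S is introduced for illustration.

```latex
% Let S = { i : corr(z, c_i) >= eps/2 }. Since corr(z, c_i) <= 1 for every i,
\begin{align*}
\varepsilon t \;\le\; \sum_{i=1}^{t} \mathrm{corr}(z, c_i)
  &= \sum_{i \in S} \mathrm{corr}(z, c_i) + \sum_{i \notin S} \mathrm{corr}(z, c_i) \\
  &\le |S| \cdot 1 + (t - |S|) \cdot \frac{\varepsilon}{2}
  \;\le\; |S| + \frac{\varepsilon}{2}\, t ,
\end{align*}
% and rearranging gives |S| >= (eps/2) t.
```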
Now we would like to avoid the ε factor loss in list size in the above argument.
The reason it occurs is that the average correlation can be ε due to the
presence of ≈ εt of the $c_i$’s having extremely high correlation with z. This is
consistent with the code being list-decodable with list size $o(1/(q\varepsilon^2))$ for
correlation ε, but it means that the code has very poor list-decoding properties at some
higher correlations (e.g., having $\varepsilon t = \Omega(1/(q\varepsilon))$ codewords at correlation Ω(1),
whereas we’d expect a “good” code to have only O(1) such codewords). In our
next (and main) lemma, we show that we can pick a subcode of the code where
this difficulty does not occur. Specifically, if C has good list-decoding properties
at correlation ε, we get a subcode that has good list-decoding properties at every
correlation larger than ε.
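Lemma 7 (Main technical lemma). For all positive integers L, t, $m \ge 2t$
and $q \ge 2$, and all small enough $\varepsilon > 0$, the following holds. Let C be a (1 − ε, L)-
list-decodable q-ary code of block length n with $|C| > 2L \cdot t \cdot m!/(m - t)!$. Then
there exists a subcode $C' \subseteq C$, $|C'| \ge m$, such that for all positive integers $\ell \le t$
and every $c_1, c_2, \dots, c_\ell \in C'$,
$$\#\mathrm{plur}(c_1, \dots, c_\ell) \le \left(\frac{1}{q} + \left(1 - \frac{1}{q}\right) \cdot \varepsilon + O\left(\frac{q^{3/2}}{\sqrt{\ell}}\right)\right) \cdot \ell.$$
Equivalently, for every $z \in [q]^n$ and every $c_1, \dots, c_\ell \in C'$, we have
$$\sum_{i=1}^{\ell} \mathrm{corr}(z, c_i) \le \left(\varepsilon + O\left(\frac{q^{3/2}}{\sqrt{\ell}}\right)\right) \cdot \ell. \tag{2}$$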
Notice that the lemma implies a better upper bound on list size for correlations
much larger than ε. More precisely, for every $\delta > 0$, it implies that the number
of codewords having correlation at least ε + δ with a center z is at most $\ell = O(q^3/\delta^2)$.
In fact, any $\ell$ codewords must even have average correlation at most
ε + δ.
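The bound on the number of codewords at correlation ε + δ follows from (2) by a one-line calculation, sketched below. The symbol K, standing for the constant hidden in the O-notation of (2), is our own label, and the calculation assumes the number of such codewords is at most t so that (2) applies.

```latex
% Suppose c_1, ..., c_l in C' (with l <= t) all satisfy corr(z, c_i) >= eps + delta,
% and let K denote the constant hidden in the O(.) of inequality (2). Then
\begin{align*}
\ell(\varepsilon + \delta) \;\le\; \sum_{i=1}^{\ell} \mathrm{corr}(z, c_i)
  \;\le\; \varepsilon \ell + K q^{3/2} \sqrt{\ell}
\quad\Longrightarrow\quad
\delta \sqrt{\ell} \;\le\; K q^{3/2}
\quad\Longrightarrow\quad
\ell \;\le\; \frac{K^2 q^3}{\delta^2}.
\end{align*}
```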
Proof. We will pick a subcode $C' \subseteq C$ of size m at random from all m-element
subsets of C, and prove that $C'$ will fail to have the claimed property with
probability less than 1.
For now, however, think of the code $C'$ as being fixed, and we will reduce
proving the desired properties above to bounding some simpler quantities. Let
$(c_1, c_2, \dots, c_\ell)$ be an arbitrary $\ell$-tuple of codewords in $C'$. We will keep track
of the average plurality count $\#\mathrm{plur}(c_1, \dots, c_i)$ as we add each codeword to this
sequence. To describe how this quantity can change at each step, we need a couple
of additional definitions. We say a sequence $(a_1, \dots, a_i) \in [q]^i$ has a plurality tie if
at least two symbols occur $\#\mathrm{plur}(a_1, \dots, a_i)$ times among $a_1, \dots, a_i$. For vectors
$c_1, \dots, c_i \in [q]^n$, we define $\#\mathrm{ties}(c_1, \dots, c_i)$ to be the fraction of coordinates
$j \in [n]$ such that $(c_{1j}, \dots, c_{ij})$ has a plurality tie. Then:
Claim 8. For every $c_1, \dots, c_i \in [q]^n$,
$$\#\mathrm{plur}(c_1, \dots, c_i) \le \#\mathrm{plur}(c_1, \dots, c_{i-1}) + \mathrm{agr}(c_i, \mathrm{plur}(c_1, \dots, c_{i-1})) + \#\mathrm{ties}(c_1, \dots, c_{i-1}).$$
Proof of Claim. Consider each coordinate j ∈ [n] separately. Clearly,
#plur(c1j, . . . , cij) ≤ #plur(c1j, . . . , c(i−1)j) + 1 .
Moreover, if (c1j, . . . , c(i−1)j) does not have a plurality tie, then the plurality
increases iff cij equals the unique symbol plur(c1j, . . . , c(i−1)j) achieving the plu-
rality. Thus,
#plur(c1j, . . . , cij) ≤ #plur(c1j, . . . , c(i−1)j) + Aj + Tj,
where Tj is the indicator variable for (c1j, . . . , c(i−1)j) having a plurality tie, and
Aj for cij agreeing with plur(c1j, . . . , c(i−1)j). The claim follows by averaging
over j = 1, . . . , n.
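Claim 8 can be sanity-checked on random inputs. The sketch below is an illustration under our own assumptions: the helper names `col_stats` and `claim8_holds` are hypothetical, ties are broken in the order returned by `Counter.most_common`, and the check is restricted to sequences of at least two codewords so that the prefix $c_1, \dots, c_{i-1}$ is nonempty.

```python
import random
from collections import Counter

def col_stats(col):
    """Return (a plurality symbol, plurality count, tie flag) for one nonempty coordinate."""
    ranked = Counter(col).most_common()
    tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    return ranked[0][0], ranked[0][1], tie

def claim8_holds(cs):
    """Check #plur(c_1..c_i) <= #plur(c_1..c_{i-1}) + agr(c_i, plur(..)) + #ties(..)."""
    assert len(cs) >= 2, "this sketch needs a nonempty prefix"
    n = len(cs[0])
    prev, last = cs[:-1], cs[-1]
    lhs = rhs = 0
    for j in range(n):
        sym, cnt, tie = col_stats([c[j] for c in prev])
        lhs += col_stats([c[j] for c in cs])[1]
        rhs += cnt + (last[j] == sym) + tie  # per-coordinate terms: count + A_j + T_j
    return lhs <= rhs  # both sides are n times the averaged quantities in the claim

rng = random.Random(1)
q, n = 3, 8
ok = all(
    claim8_holds([tuple(rng.randint(1, q) for _ in range(n)) for _ in range(i)])
    for i in range(2, 7)
    for _ in range(200)
)
print(ok)
```

The per-coordinate comparison mirrors the proof: without a tie the count grows only when the new symbol equals the unique plurality symbol, and with a tie the extra indicator absorbs the possible increase.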
Therefore, our task of bounding $\#\mathrm{plur}(c_1, \dots, c_\ell)$ reduces to bounding
$\mathrm{agr}(c_i, \mathrm{plur}(c_1, \dots, c_{i-1}))$ and $\#\mathrm{ties}(c_1, \dots, c_{i-1})$ for each $i = 1, \dots, \ell$. The first
term we bound using the list-decodability of C and the random choice of the
subcode $C'$.
Claim 9. There exists a choice of the subcode $C'$ such that $|C'| = m$ and for
every $i \le t$ and every (ordered) sequence $c_1, \dots, c_i \in C'$, we have
$$\mathrm{agr}(c_i, \mathrm{plur}(c_1, \dots, c_{i-1})) \le 1/q + (1 - 1/q) \cdot \varepsilon.$$
Proof of Claim. We choose the subcode $C'$ uniformly at random from all m-subsets
of C. We view $C'$ as a sequence of m codewords selected randomly
from C without replacement. Consider any i of the codewords $c_1, \dots, c_i$ in this
sequence. By the list-decodability of C, for any $c_1, \dots, c_{i-1}$, there are at
most L choices for $c_i$ having agreement larger than $1/q + (1 - 1/q) \cdot \varepsilon$ with
$\mathrm{plur}(c_1, \dots, c_{i-1})$. Conditioned on $c_1, \dots, c_{i-1}$, $c_i$ is distributed uniformly on the
remaining $|C| - i + 1$ elements of C, so the probability of $c_i$ being one of the at most L
bad codewords is at most $L/(|C| - i + 1)$.
By a union bound, the probability that the claim fails for at least one
subsequence $c_1, \dots, c_i$ of at most t codewords in $C'$ is at most
$$\sum_{i=1}^{t} \frac{m!}{(m - i)!} \cdot \frac{L}{|C| - i + 1} \le t \cdot \frac{m!}{(m - t)!} \cdot \frac{L}{|C| - t + 1} < 1.$$
Thus, there exists a choice of subcode $C'$ satisfying the claim.
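To see the union bound in action, the following sketch evaluates the failure probability bound exactly with rational arithmetic for one small, arbitrarily chosen parameter setting. The function name `failure_bound` and the values L = 2, t = 3, m = 6 are assumptions made for illustration; the code size is set to the smallest value allowed by the hypothesis of Lemma 7.

```python
from fractions import Fraction
from math import factorial

def failure_bound(L, t, m, code_size):
    """Exact value of sum_{i=1}^t  m!/(m-i)! * L/(|C|-i+1)."""
    return sum(
        Fraction(factorial(m), factorial(m - i)) * Fraction(L, code_size - i + 1)
        for i in range(1, t + 1)
    )

L, t, m = 2, 3, 6                                          # m >= 2t, as the lemma requires
threshold = 2 * L * t * factorial(m) // factorial(m - t)   # 2L * t * m!/(m-t)!
bound = failure_bound(L, t, m, threshold + 1)              # smallest |C| exceeding the threshold
print(float(bound), bound < 1)
```

The factor 2 in the hypothesis on |C| leaves slack, so the bound is comfortably below 1 in this example.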
For the $\#\mathrm{ties}(c_1, \dots, c_{i-1})$ terms, we consider the codewords $c_1, \dots, c_\ell$ in a
random order.
Claim 10. For every sequence of $c_1, \dots, c_\ell \in [q]^n$, there exists a permutation
$\sigma : [\ell] \to [\ell]$ such that
$$\sum_{i=1}^{\ell} \#\mathrm{ties}(c_{\sigma(1)}, \dots, c_{\sigma(i)}) = O(q^{3/2} \cdot \sqrt{\ell}).$$
Proof of Claim. We choose σ uniformly at random from all permutations
$\sigma : [\ell] \to [\ell]$ and show that the expectation of the left side is at most $O(q^{3/2} \cdot \sqrt{\ell})$.
By linearity of expectations, it suffices to consider the expected number of plurality
ties occurring in each coordinate $j \in [n]$. That is, we read the symbols
$c_{1j}, \dots, c_{\ell j} \in [q]$ in a random order σ and count the number of prefixes
$c_{\sigma(1)j}, \dots, c_{\sigma(i)j}$ having a plurality tie. If this prefix were i symbols chosen
independently according to some (arbitrary) distribution, then it is fairly easy to
show that the probability of a tie is $O(1/\sqrt{i})$ (ignoring the dependence on q), and
summing this from $i = 1, \dots, \ell$ gives $O(\sqrt{\ell})$ expected ties in each coordinate.
Since they are not independently chosen, but rather i distinct symbols from
a fixed sequence of $\ell$ symbols, the analysis becomes a bit more involved, but
nevertheless the bound remains essentially the same. Specifically, by Lemma 11
stated below, the expected number of ties is at most $O(q^{3/2} \cdot \sqrt{\ell})$, yielding the
claim.
Lemma 11. Let b1, b2, . . . , bk be a sequence of elements from the universe [q].
Recall that a prefix of such a sequence has a plurality tie if there are at least
two elements of [q] that occur the same number of times in the prefix, and no
other element occurs a strictly greater number of times in the prefix. Let Y be the
random variable counting the number of prefixes with a plurality tie in a random
permutation of the $b_i$’s. Then $E[Y] = O(q^{3/2} \sqrt{k})$.
A full proof of the above lemma is omitted for reasons of space and can be
found in the full version of the paper. Our approach is as follows. For $\alpha \in [q]$
and $i \in [k]$, let $Y_{\alpha,i}$ be the indicator random variable (over the choice of the
permutation π of the sequence) for the event that the prefix $(b_{\pi(1)}, \dots, b_{\pi(i)})$ has
a plurality tie, α achieves the plurality, and $b_{\pi(i)} \ne \alpha$. Then $Y \le \sum_{\alpha,i} Y_{\alpha,i}$. (For
every prefix with a plurality tie, at least one of the two symbols achieving the
plurality must be different from the last symbol in the prefix.) Essentially what
we prove is that $E[Y_{\alpha,i}] = O(\sqrt{q/\min\{i, k - i\}})$. Then summing over all $i \in [k]$
and $\alpha \in [q]$ gives the desired bound of $O(q \cdot \sqrt{qk}) = O(q^{3/2}\sqrt{k})$.
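The quantity Y of Lemma 11 is simple to estimate by simulation. The sketch below is only an experiment under assumptions of our own (the helper names `count_tie_prefixes` and `expected_ties`, a balanced binary input sequence, and the chosen values of k); it prints the estimate of E[Y] divided by $\sqrt{k}$, which should stay bounded if E[Y] grows like $\sqrt{k}$.

```python
import random
from collections import Counter

def count_tie_prefixes(seq):
    """Number of nonempty prefixes of seq in which at least two symbols share the top count."""
    counts = Counter()
    ties = 0
    for s in seq:
        counts[s] += 1
        top = max(counts.values())
        if sum(1 for v in counts.values() if v == top) >= 2:
            ties += 1
    return ties

def expected_ties(symbols, trials=200, seed=0):
    """Monte Carlo estimate of E[Y] over uniformly random orderings of `symbols`."""
    rng = random.Random(seed)
    seq = list(symbols)
    total = 0
    for _ in range(trials):
        rng.shuffle(seq)
        total += count_tie_prefixes(seq)
    return total / trials

for k in (100, 400, 1600):
    est = expected_ties([1 + (i % 2) for i in range(k)])  # balanced binary sequence
    print(k, est, est / k ** 0.5)  # last column: estimate normalized by sqrt(k)
```

For a binary sequence a plurality tie is just a return of the random walk of count differences to zero, which is the intuition behind the $\sqrt{k}$ scaling.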
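Now to complete the proof of Lemma 7, let $C'$ be as in Claim 9, and let
$c_1, \dots, c_\ell$ be an arbitrary sequence of distinct codewords in $C'$. Let σ be the
permutation guaranteed by Claim 10. Then, by Claim 8, we have
$$\#\mathrm{plur}(c_1, \dots, c_\ell) = \#\mathrm{plur}(c_{\sigma(1)}, \dots, c_{\sigma(\ell)})$$
$$\le \sum_{i=1}^{\ell} \left[ \mathrm{agr}(c_{\sigma(i)}, \mathrm{plur}(c_{\sigma(1)}, \dots, c_{\sigma(i-1)})) + \#\mathrm{ties}(c_{\sigma(1)}, \dots, c_{\sigma(i-1)}) \right]$$
$$\le \ell \cdot (1/q + (1 - 1/q) \cdot \varepsilon) + O(q^{3/2} \cdot \sqrt{\ell}),$$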
as desired. The equivalent reformulation in terms of correlation again follows
from Lemma 5 and the definition of correlation in terms of agreement (Defini-
tion 3).
The following corollary of Lemma 7 will be useful in proving our main result.
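Corollary 12. Let L, t, m, q, ε, C, and $C'$ be as in Lemma 7 for a choice of
parameters satisfying $t \ge L$. Then for all $z \in [q]^n$ and all $D \ge 2$,
$$\sum_{\substack{c \in C' \\ \mathrm{corr}(z,c) \ge D\varepsilon}} \mathrm{corr}(z, c) \le O\left(\frac{q^3}{D\varepsilon}\right). \tag{3}$$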
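Proof. Let $c_1, c_2, \dots, c_r$ be all the codewords of $C'$ that satisfy $\mathrm{corr}(z, c) \ge D\varepsilon$.
Since C, and hence $C'$, is (1 − ε, L)-list-decodable, we have $r \le L \le t$. Using (2)
for the codewords $c_1, c_2, \dots, c_r$, we have
$$D\varepsilon \le \frac{1}{r} \sum_{i=1}^{r} \mathrm{corr}(z, c_i) \le \varepsilon + O\left(\frac{q^{3/2}}{\sqrt{r}}\right),$$
which gives $r = O(q^3/((D - 1)^2 \varepsilon^2)) = O(q^3/(D^2 \varepsilon^2))$, since $D \ge 2$. Applying (2)
again,
$$\sum_{\substack{c \in C' \\ \mathrm{corr}(z,c) \ge D\varepsilon}} \mathrm{corr}(z, c) = \sum_{i=1}^{r} \mathrm{corr}(z, c_i) \le \varepsilon r + O(q^{3/2} \sqrt{r}) \le O\left(\frac{q^3}{D^2 \varepsilon}\right) + O\left(\frac{q^3}{D\varepsilon}\right) \le O\left(\frac{q^3}{D\varepsilon}\right).$$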
We are now ready to prove our main result, Theorem 2, which we restate (in
slightly different form) below.
Theorem 13 (Main). There exist constants $c > 0$, $d < \infty$, such that for all
small enough $\varepsilon > 0$, the following holds. Suppose C is a q-ary (1 − ε, L)-list-decodable
code with $|C| > (1/(q\varepsilon^2))^{d/(q\varepsilon^2)}$. Then $L \ge c/(q^5 \varepsilon^2)$.
Proof. Let T be a large enough constant to be specified later. Let $t = \frac{1}{Tq\varepsilon^2}$. If
$L > t$, then there is nothing to prove. So assume that $t \ge L \ge 1$ and set $m = 2t$.
Then
$$2L \cdot t \cdot \frac{m!}{(m - t)!} \le 2t^2 \cdot (2t)^t = \left(\frac{1}{q\varepsilon^2}\right)^{O(1/(q\varepsilon^2))} < |C|,$$
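for a sufficiently large choice of the constant d. Let $C'$ be a subcode of C of size
$m = 2t$ guaranteed by Lemma 7.
By Lemma 6, there exist t codewords $c_i$, $1 \le i \le t$, in $C'$, and a center
$z \in [q]^n$ such that
$$\sum_{i=1}^{t} \mathrm{corr}(z, c_i) = \Omega\left(\sqrt{\frac{t}{q}}\right). \tag{4}$$
Also, for any $D \ge 2$, the sum $\sum_{i=1}^{t} \mathrm{corr}(z, c_i)$ equals
$$\sum_{i : \mathrm{corr}(z,c_i) < \varepsilon} \mathrm{corr}(z, c_i) + \sum_{i : \varepsilon \le \mathrm{corr}(z,c_i) < D\varepsilon} \mathrm{corr}(z, c_i) + \sum_{i : \mathrm{corr}(z,c_i) \ge D\varepsilon} \mathrm{corr}(z, c_i) \le \varepsilon t + D\varepsilon L + O\left(\frac{q^3}{D\varepsilon}\right), \tag{5}$$
where to bound the second part we used that $C'$ is (1 − ε, L)-list-decodable, and
to bound the third part we used the fact that $C'$ satisfies (3).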
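Putting these together, and setting $D = q^4 \cdot T$, we have
$$\Omega\left(\sqrt{\frac{t}{q}}\right) \le \varepsilon t + D\varepsilon L + O\left(\frac{q^3}{D\varepsilon}\right) \le \sqrt{\frac{t}{Tq}} + D\varepsilon L + O\left(\sqrt{\frac{t}{Tq}}\right).$$
For a sufficiently large choice of the constant T, this gives
$$L \ge \frac{1}{D\varepsilon} \cdot \Omega\left(\sqrt{\frac{t}{q}}\right) = \Omega\left(\frac{1}{T^{3/2} \cdot q^5 \cdot \varepsilon^2}\right),$$
as desired.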
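For readers who want to verify the parameter arithmetic in this last step, the substitutions $t = 1/(Tq\varepsilon^2)$ and $D = q^4 T$ work out as in the following sketch. Since T is a fixed constant, any fixed power of T is absorbed into the constant c of Theorem 13.

```latex
% With t = 1/(T q eps^2) and D = q^4 T:
\begin{align*}
\varepsilon t &= \frac{1}{Tq\varepsilon} = \sqrt{\frac{t}{Tq}}, &
\frac{q^3}{D\varepsilon} &= \frac{1}{Tq\varepsilon} = \sqrt{\frac{t}{Tq}}, &
\sqrt{\frac{t}{q}} &= \frac{1}{\sqrt{T}\, q\, \varepsilon}, \\
\frac{1}{D\varepsilon} \cdot \sqrt{\frac{t}{q}}
  &= \frac{1}{q^4 T \varepsilon} \cdot \frac{1}{\sqrt{T}\, q\, \varepsilon}
   = \frac{1}{T^{3/2}\, q^5\, \varepsilon^2}.
\end{align*}
% The two error terms are each a 1/sqrt(T) fraction of sqrt(t/q), so a large
% enough constant T leaves D * eps * L >= Omega(sqrt(t/q)).
```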
3 Open Questions
Several questions are open in the general direction of exhibiting limitations on
the performance of list-decodable codes. We mention some of them below.
– We have not attempted to optimize the dependence on the alphabet size q
in our bound on list size (i.e. the constant $c_q$ in Theorem 2), and this leaves
a gap between the upper and lower bounds. The probabilistic code construction
of [Eli2, GHSZ] achieves a nearly linear dependence on q (specifically,
list size $L = O(\log q/(q\varepsilon^2))$), whereas our lower bound (Theorem 13) has a
polynomial dependence on q (namely, it shows $L = \Omega(1/(q^5 \varepsilon^2))$).
– It should be possible to use our main result, together with an appropriate
“filtering” argument (that focuses, for example, on a subcode consisting of
all codewords of a particular Hamming weight) to obtain upper bounds on
rate of list-decodable q-ary codes. In particular, can one confirm that for

each fixed L, the maximum rate achievable for list decoding up to radius p
with list size L is strictly less than the capacity 1 − Hq(p)? Such a result is
known for binary codes [Bli]. Also, it is an interesting question whether some
of the ideas in this paper can be used to improve the rate upper bounds of
Blinovsky [Bli] for the binary case.
– Can one prove a lower bound on list size as a function of distance from
“capacity”? In particular, does one need list size Ω(1/γ) to achieve a rate
that is within γ of capacity?
– Can one prove stronger results for linear codes?
Acknowledgments
We thank Michael Mitzenmacher, Madhu Sudan, and Amnon Ta-Shma for help-
ful conversations.
References
[Bli] V. M. Blinovsky. Bounds for codes in the case of list decoding of finite volume.
Problems of Information Transmission, 22(1):7–19, 1986.
[Eli1] P. Elias. List decoding for noisy channels. Technical Report 335, Research
Laboratory of Electronics, MIT, 1957.
[Eli2] P. Elias. Error-correcting codes for list decoding. IEEE Transactions on
Information Theory, 37:5–12, 1991.
[Gur] V. Guruswami. List decoding from erasures: Bounds and code constructions.
IEEE Transactions on Information Theory, 49(11):2826–2833, 2003.
[GHSZ] V. Guruswami, J. Håstad, M. Sudan, and D. Zuckerman. Combinatorial
bounds for list decoding. IEEE Transactions on Information Theory,
48(5):1021–1035, 2002.
[LTW] C.-J. Lu, S.-C. Tsai, and H.-L. Wu. On the complexity of hardness amplifica-
tion. In Proceedings of the 20th Annual IEEE Conference on Computational
Complexity, San Jose, CA, June 2005. To appear.
[RT] J. Radhakrishnan and A. Ta-Shma. Bounds for dispersers, extractors, and
depth-two superconcentrators. SIAM Journal on Discrete Mathematics,
13(1):2–24 (electronic), 2000.
[TZ] A. Ta-Shma and D. Zuckerman. Extractor codes. IEEE Transactions on
Information Theory, 50(12):3015–3025, 2004.
[Tre] L. Trevisan. Extractors and Pseudorandom Generators. Journal of the ACM,
48(4):860–879, July 2001.
[Vad] S. P. Vadhan. Randomness Extractors and their Many Guises. Tutorial at
IEEE Symposium on Foundations of Computer Science, November 2002. Slides
available at http://eecs.harvard.edu/~salil.
[Woz] J. M. Wozencraft. List Decoding. Quarterly Progress Report, Research Labo-
ratory of Electronics, MIT, 48:90–95, 1958.

A Lower Bound
for Distribution-Free Monotonicity Testing
Shirley Halevy and Eyal Kushilevitz
Department of Computer Science, Technion, Haifa 3200, Israel
{shirleyh,eyalk}@cs.technion.ac.il
Abstract. We consider monotonicity testing of functions $f : [n]^d \to \{0, 1\}$,
in the property testing framework of Rubinfeld and Sudan [23]
and Goldreich, Goldwasser and Ron [14]. Specifically, we consider the
framework of distribution-free property testing, where the distance between
functions is measured with respect to a fixed but unknown distribution
D on the domain and the testing algorithms have oracle access
to random sampling from the domain according to this distribution D.
We show that, although in the uniform distribution case testing of boolean
functions defined over the boolean hypercube can be done using query
complexity that is polynomial in 1/ε and in the dimension d, in the
distribution-free setting such testing requires a number of queries that is
exponential in d. Therefore, in the high-dimensional case (as opposed to
the low-dimensional case), the gap between the query complexity for the
uniform and the distribution-free settings is exponential.
1 Introduction
Property testing is a relaxation of decision problems, where algorithms are
required to distinguish objects having some property P from those objects which
are at least “ε-far” from every such object. The notion of property testing was
introduced by Rubinfeld and Sudan [23] and has since attracted a considerable
amount of attention. Property testing algorithms (or property testers) were
introduced for monotonicity testing (e.g. [2, 6–8, 10, 11, 13, 18]), problems in
graph theory (e.g. [1, 4, 14–16, 20]) and other properties (e.g. [3, 5]; the reader
is referred to the excellent surveys by Ron [22], Goldreich [12], and Fischer [9] for a
presentation of some of this work, including some connections between property
testing and other topics). The main goal of property testing is to avoid “reading”
the whole object (which requires complexity at least linear in the size of its
representation); i.e., to make the decision by reading a small (possibly, selected
at random) fraction of the input (e.g., a fraction of size polynomial in 1/ε and
poly-logarithmic in the size of the representation) while still having a good (say,
at least 2/3) probability of success.
A crucial component in the definition of property testing is that of the dis-
tance between two objects. For the purpose of this definition, it is common to
think of objects as being functions over some domain X. The distance between
functions f and g is then measured by considering the set $X_{f \neq g}$ of all points
x where f(x) ≠ g(x) and comparing the size of this set $X_{f \neq g}$ to that of X;
equivalently, one may introduce a uniform distribution over X and measure the
probability of picking $x \in X_{f \neq g}$. Note that property testers access the input
function (object) via membership queries (i.e., the algorithm gives a value x and
gets f(x)).
It is natural to generalize the above definition of distance between two
functions, to deal with arbitrary probability distributions D over X, by measuring the
probability of $X_{f \neq g}$ according to D. Ideally, one would hope to get
distribution-free property testers. Such a tester, for a given property P, accesses the function
using membership queries, as above, and by randomly sampling the fixed but
unknown distribution D (this mimics similar definitions from learning theory
and is implemented via oracle access to D; see, e.g., [21]¹). As before, the
tester is required to accept the given function f, with high probability, if f
satisfies the property P, and to reject it, with high probability, if f is at least ε-far
from P with respect to the distribution D.
This notion of testing, termed distribution-free testing, has been introduced
by Goldreich, Goldwasser and Ron in [14], who also showed that it is impossi-
ble to test a variety of partition problems (for which they showed testers with
respect to the uniform distribution) in a distribution-free manner. Halevy and
Kushilevitz [17] presented the first non-trivial distribution-free testers for two
properties: monotonicity and low-degree polynomials. In [19], distribution-free
testers were presented for testing connectivity of sparse graphs.
A natural question that arises when dealing with distribution-free testing
is whether it can be done using the same query complexity as in the uniform
setting. We answer this question with respect to monotonicity testing, by showing
an exponential gap between the query complexity required for the uniform and
the distribution-free settings.
Monotonicity is one of the central problems studied in the property testing
literature (e.g. [2, 6–8, 10, 11, 13, 18]). In monotonicity testing, the domain X
is usually the d-dimensional cube $[n]^d$. A partial order is defined on this domain
in the natural way (for $y, z \in [n]^d$, we say that y ≤ z if each coordinate of y is
bounded by the corresponding coordinate of z). A function f over the domain
$[n]^d$ is monotone if whenever z ≥ y then f(z) ≥ f(y). Testers were developed
to deal with both the low-dimensional and the high-dimensional cases.
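As a concrete reading of this definition, the following Python sketch checks monotonicity over $[n]^d$ by brute force. The names `leq`, `is_monotone`, `threshold` and `parity` are illustrative and do not come from the paper; the check enumerates all pairs of points, so it is meant for tiny n and d only and is not a property tester.

```python
from itertools import product

def leq(y, z):
    """Partial order on [n]^d: y <= z iff every coordinate of y is at most that of z."""
    return all(a <= b for a, b in zip(y, z))

def is_monotone(f, n, d):
    """Brute-force check that z >= y implies f(z) >= f(y) on [n]^d (coordinates 1..n)."""
    points = list(product(range(1, n + 1), repeat=d))
    return all(f(y) <= f(z) for y in points for z in points if leq(y, z))

# Hypothetical example functions on [3]^2.
threshold = lambda x: 1 if sum(x) >= 4 else 0   # monotone: raising a coordinate raises the sum
parity = lambda x: sum(x) % 2                   # not monotone: (1,2) -> 1 but (2,2) -> 0
```

Here `is_monotone(threshold, 3, 2)` holds while `is_monotone(parity, 3, 2)` fails, because (1, 2) ≤ (2, 2) but the parity drops from 1 to 0.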
The low-dimensional case: In this case, d is considered to be small compared to
n (e.g., d may be a constant); a successful algorithm for this case is typically one
that is polynomial in 1/ε and in log n. The first paper to deal with this case in
the uniform setting is by Ergün et al. [7], which presented an $O(\frac{\log n}{\varepsilon})$ algorithm
for the line (i.e., the case d = 1), and showed that this query complexity cannot
be achieved without using membership queries. This algorithm was generalized
for any fixed d in [2]. For the case d = 1, there is a lower bound showing that
testing monotonicity (for some constant ε) indeed requires Ω(log n) queries [8]. In
[18], the authors further analyzed a tester presented in [6], and showed an upper
bound of $O(\frac{d \cdot 4^d \cdot \log n}{\varepsilon})$ on the query complexity required for the low-dimensional
case. In the distribution-free setting, [17] presented a distribution-free tester with
query complexity $O(\frac{\log^d n \cdot 2^d}{\varepsilon})$, thereby showing that in the low-dimensional case
the overhead caused by the distribution-free requirement is relatively small (both
tests have polylogarithmic query complexity and, in fact, for d = 1 they have
the same query complexity).
¹ More precisely, distribution-free property testing is the analogue of the PAC+MQ
model of learning (that was studied by the learning-theory community mainly via
the EQ+MQ model); standard property testing is the analogue of the uniform+MQ
model.
The high-dimensional case: In this case, d is considered as the main parameter
(and n might be as low as 2); a successful algorithm is typically one that is
polynomial in 1/ε and d. This case was first considered by Goldreich et al. [13],
who showed an algorithm for testing monotonicity of functions over the boolean
(n = 2) d-dimensional hyper-cube to a boolean range using $O(\frac{d}{\varepsilon})$ queries. This
result was generalized in [6] to arbitrary values of n, showing that $O(\frac{d \cdot \log^2 n}{\varepsilon})$
queries suffice for testing monotonicity of general functions over $[n]^d$. In the
distribution-free setting, the only known tester is the one presented by [17] using
$O(\frac{\log^d n \cdot 2^d}{\varepsilon})$ queries (this query complexity is non-trivial whenever n ≥ 8, but is
exponential in d). However, the question whether distribution-free testing can
be done using similar query complexity to the one used in the uniform setting
remained open.
Our Contribution: We show that, while $O(\frac{d}{\varepsilon})$ queries suffice for testing
monotonicity of boolean functions over the boolean hypercube in the uniform case [13],
in the distribution-free case it requires a number of queries that is exponential in
the dimension d (that is, there exists a constant c such that $\Omega(2^{cd})$ queries are
required for the testing). Hence, such testing is infeasible in the high-dimensional
case. Our bound can be trivially extended to monotonicity testing of boolean
functions defined over the domain $[n]^d$ for general values of n. This shows
a gap between the query complexity of uniform and distribution-free testers for
a natural testing problem.
Other related work: Lower bounds for monotonicity testing of functions
$f : \{0,1\}^d \to \{0,1\}$ were shown in [10]: an $\Omega(\sqrt{d})$ lower bound for non-adaptive,
one-sided error algorithms, and an Ω(log log d) lower bound for two-sided error
algorithms. The authors also considered graphs other than the hyper-cube, proving,
for example, that testing monotonicity with a constant (depending on 1/ε only)
number of queries is possible for certain classes of graphs.
Organization: In Section 2, we formally define the notions we are about to use
in this work. In Section 3, we give an overview of the lower-bound proof and
present some families of functions used for this proof. Then, in Section 4, we
prove the lower bound for one-sided error testing and, in Section 5, we extend
our lower bound to the two-sided error case.

2 Preliminaries
In this section, we formally define the notion of a function $f : \{0,1\}^d \to \{0,1\}$
being ε-far from monotone with respect to a given distribution D defined over
$\{0,1\}^d$, and of distribution-free monotonicity testing. Then, we define some
notation that will be used throughout this work. Denote by $P_{mon}$ the set of all
monotone functions $f : \{0,1\}^d \to \{0,1\}$.
Definition 1. Let D be a distribution defined over $\{0,1\}^d$. The distance between
functions $f, g : \{0,1\}^d \to \{0,1\}$, measured with respect to D, is defined by
$\mathrm{dist}_D(f, g) \stackrel{def}{=} \Pr_{x \sim D}\{f(x) \neq g(x)\}$.
The distance of a function f from being monotone, measured with respect to D,
is $\mathrm{dist}_D(f, P_{mon}) \stackrel{def}{=} \min_{g \in P_{mon}} \mathrm{dist}_D(f, g)$.
We say that f is ε-far from monotone with respect to D if $\mathrm{dist}_D(f, P_{mon}) \ge \varepsilon$.
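To make Definition 1 concrete, the sketch below computes both distances by exhaustive search for very small d. It is an illustration only: the function names and the two example distributions are assumptions introduced here, and enumerating all monotone functions is feasible only for tiny d.

```python
from fractions import Fraction
from itertools import product

def dist(f, g, D):
    """dist_D(f, g): total probability, under D, of the points where f and g differ.
    f and g are dicts point -> {0, 1}; D is a dict point -> probability."""
    return sum(w for x, w in D.items() if f[x] != g[x])

def monotone_functions(d):
    """Enumerate all monotone g : {0,1}^d -> {0,1} by brute force (tiny d only)."""
    points = list(product((0, 1), repeat=d))
    leq = lambda y, z: all(a <= b for a, b in zip(y, z))
    for values in product((0, 1), repeat=len(points)):
        g = dict(zip(points, values))
        if all(g[y] <= g[z] for y in points for z in points if leq(y, z)):
            yield g

def dist_to_monotone(f, D, d):
    """dist_D(f, P_mon): the minimum of dist_D(f, g) over all monotone g."""
    return min(dist(f, g, D) for g in monotone_functions(d))

# Hypothetical example on {0,1}^2: f is 1 only at the all-zeros point.
f = {x: 1 if x == (0, 0) else 0 for x in product((0, 1), repeat=2)}
uniform = {x: Fraction(1, 4) for x in f}
skewed = {(0, 0): Fraction(0), (1, 1): Fraction(0),
          (0, 1): Fraction(1, 2), (1, 0): Fraction(1, 2)}
```

Under `uniform` the distance of f from monotone is 1/4, whereas under `skewed`, which puts no weight on the single offending point, the distance is 0 even though f is not monotone. This is the dependence on D that the distribution-free setting has to cope with.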
Definition 2. A distribution-free monotonicity tester is a probabilistic oracle
machine M, which is given a distance parameter, ε > 0, and oracle access to
an arbitrary function $f : \{0,1\}^d \to \{0,1\}$ and to sampling of a fixed but unknown
distribution D over $\{0,1\}^d$, and satisfies the following two conditions:
1. If f is monotone, then $\Pr\{M^{f,D} = \text{Accept}\} \ge 2/3$.
2. If f is ε-far from monotone with respect to D, then $\Pr\{M^{f,D} = \text{Accept}\} \le 1/3$.
We will also be interested in testers that allow one-sided error; such testers
always accept any monotone function.
Hereafter, we identify any point $x \in \{0,1\}^d$ with the corresponding set x ⊆
[d]. This allows us to apply set theory operations, such as union and intersection,
to points. In addition, for any two points $p, p' \in \{0,1\}^{d/2}$, denote by p||p′ the point
$x \in \{0,1\}^d$ that is the concatenation of p and p′ (i.e., it is identical to p in its
first d/2 coordinates and to p′ in its last d/2 coordinates). Given a point $x \in \{0,1\}^d$,
we say that $p \in \{0,1\}^{d/2}$ is the prefix of x if x = p||p′ for some p′. Given two
points $x, y \in \{0,1\}^d$, denote by H(x, y) the Hamming distance between x and y.
Denote by $B^d_\lambda$ the set of all points of weight λd in $\{0,1\}^d$, that is,
$B^d_\lambda = \{x \in \{0,1\}^d : |x| = \lambda d\}$.
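The notation above translates directly into a few helper functions. The following is a minimal Python sketch with points represented as 0/1 tuples; all function names are illustrative choices made here, not notation from the paper.

```python
from itertools import combinations

def concat(p, q):
    """p || q: agrees with p on the first d/2 coordinates and with q on the last d/2."""
    return tuple(p) + tuple(q)

def prefix(x):
    """The prefix of x: its first d/2 coordinates (d is assumed to be even)."""
    return tuple(x[: len(x) // 2])

def as_set(x):
    """Identify x in {0,1}^d with the subset of [d] = {1, ..., d} on which x is 1."""
    return frozenset(i + 1 for i, bit in enumerate(x) if bit)

def hamming(x, y):
    """H(x, y): the number of coordinates on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def weight_level(d, weight):
    """All points of {0,1}^d with exactly `weight` ones, i.e. B^d_lambda with lambda * d = weight."""
    return [tuple(1 if i in ones else 0 for i in range(d))
            for ones in combinations(range(d), weight)]
```

For example, `concat((1, 0), (0, 1))` is the point (1, 0, 0, 1), whose prefix is (1, 0) and whose associated set is {1, 4}.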
3 The Two Families
To prove the lower bound on the query complexity, we show that any distribution-free
monotonicity tester with sub-exponential query complexity is unable to
distinguish monotone functions from functions that are 1/2-far from monotone.
To this aim, we consider two families F1 and F2 of pairs (f, Df), where f
is a boolean function and Df is a probability distribution corresponding to f,
both defined over $\{0,1\}^d$, such that the following holds:
1. Every function in F1 is monotone, hence every tester has to accept every
pair (f, Df) ∈ F1 with probability at least 2/3.
2. Every function f in F2 is 1/2-far from monotone with respect to Df. Therefore,
every tester has to reject every pair (f, Df) ∈ F2 with probability at least
2/3 (regardless of the choice of the distance parameter ε).
3. There exists a constant c such that no algorithm A that asks fewer than $2^{cd}$
membership queries and samples the distribution fewer than $2^{cd}$ times accepts
every pair (f, Df) in F1 with probability at least 2/3 and rejects every pair
(f, Df) in F2 with probability at least 2/3.
Hence, there exists no two-sided error distribution-free tester for monotonicity
of boolean functions defined over $\{0,1\}^d$ with sub-exponential query complexity.
(A similar approach was previously used for lower bound proofs in property
testing, e.g., [15].)
In the rest of this section, we describe the two families of functions (and
distributions) F1 and F2 defined over $\{0,1\}^d$. Let α < 1/16 be a parameter used
for the construction of both families and define $m \stackrel{def}{=} 2^{\alpha d}$. In the construction,
we use a set $M \subset B^{d/2}_{1/2-2\alpha}$ of size 2m such that every two points in M are “far
apart”. This property of M will be used in the proof of the third requirement.
We first define the exact requirements for such a set M and claim its existence.
The proof of the following lemma will appear in the full version of the paper.
Lemma 1. There exists a set $M \subset B^{d/2}_{1/2-2\alpha}$ such that: (a) |M| = 2m, and (b)
H(p, p′) > 2αd for every two points p, p′ ∈ M such that p ≠ p′.
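The proof of Lemma 1 is deferred to the full version. Purely as an illustration of what the lemma asks for, and not as the paper's argument, the sketch below builds such a set greedily for toy parameters. The function name and the parameter choice (d = 32 with α = 1/16, which ignores any stricter constraint on α) are assumptions made for this example.

```python
from itertools import combinations

def greedy_far_apart(half_dim, weight, min_dist, count):
    """Greedily pick `count` points of weight `weight` in {0,1}^half_dim, encoded as
    frozensets of coordinates, with pairwise Hamming distance strictly above `min_dist`.
    Returns None if the lexicographic scan runs out of candidates first."""
    chosen = []
    for ones in combinations(range(half_dim), weight):
        candidate = frozenset(ones)
        # The Hamming distance of two 0/1 points is the size of their symmetric difference.
        if all(len(candidate ^ p) > min_dist for p in chosen):
            chosen.append(candidate)
            if len(chosen) == count:
                return chosen
    return None

# Toy parameters (assumption): d = 32, alpha = 1/16, hence d/2 = 16, prefix weight
# (1/2 - 2*alpha) * d/2 = 6, distance threshold 2*alpha*d = 4 and 2m = 2 * 2**(alpha*d) = 8.
M = greedy_far_apart(half_dim=16, weight=6, min_dist=4, count=8)
```

A simple counting argument shows that the greedy scan cannot get stuck before 8 points here: each chosen point rules out at most 736 of the 8008 candidates.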
Using such M, we define the two families. For each pair (f, Df), we first define
the function $f : \{0,1\}^d \to \{0,1\}$ and then, based on the definition of f, define
the corresponding distribution Df over $\{0,1\}^d$.
3.1 The Family F1
Each function f is defined by first choosing (in all possible ways) two sets X1
and X2, each of size m, such that X1 ⊂ $B^d_{1/2-\alpha}$ and X2 ⊂ $B^d_{1/2+\alpha}$. The choice of
the two sets is done as follows:
– Choose m points from M, and denote the set of m selected points by M′.
– For each point p in M′, choose a point $y \in B^{d/2}_{1/2}$, and add the point x = p||y
to X1. (|x| is exactly (1/2 − α)d.)
– For each point p in M \ M′, choose a point $y \in B^{d/2}_{1/2+4\alpha}$ and add the point
x = p||y to the set X2. (|x| is exactly (1/2 + α)d.)
The function f is now defined in the following manner:
– For every point x1 ∈ X1 and every point y such that y ≥ x1, set f(y) = 1.
– For every point x2 ∈ X2 and every point y such that y ≤ x2, set f(y) = 0.
– For every point y such that f(y) was not defined above, if |y| ≤ d/2 then
f(y) = 0, otherwise f(y) = 1.
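The three rules above can be sketched in Python on a toy instance small enough for exhaustive checking. The sizes used here (d = 8 and α = 1/8, so m = 2) are an assumption chosen so that all weights are integers; they do not satisfy the constraint on α used in the lower bound, and the bitmask encoding and function names are likewise illustrative.

```python
from itertools import combinations

D_DIM, HALF = 8, 4  # toy sizes (assumption): d = 8, alpha = 1/8, so m = 2**(alpha*d) = 2

def level(half, weight):
    """All points of {0,1}^half with the given weight, encoded as bitmask integers."""
    return [sum(1 << i for i in ones) for ones in combinations(range(half), weight)]

def concat(p, y):
    """p || y, with the prefix p stored in the high-order half of the bitmask."""
    return (p << HALF) | y

def leq(y, z):
    """y <= z in the partial order, i.e. the set y is contained in the set z."""
    return y & z == y

def make_f1(X1, X2):
    """The function induced by X1 and X2 according to the three rules above."""
    def f(y):
        if any(leq(x1, y) for x1 in X1):
            return 1
        if any(leq(y, x2) for x2 in X2):
            return 0
        return 0 if bin(y).count("1") <= D_DIM // 2 else 1
    return f

def is_monotone(f):
    """Check f(y) <= f(z) along every edge of the cube, which suffices for monotonicity."""
    return all(f(y) <= f(y | (1 << i))
               for y in range(1 << D_DIM) for i in range(D_DIM) if not y & (1 << i))

M = level(HALF, 1)                  # 2m = 4 prefixes of weight (1/2 - 2*alpha) * d/2 = 1
M_prime, rest = M[:2], M[2:]        # one possible choice of M' and of M \ M'
X1 = [concat(p, level(HALF, 2)[0]) for p in M_prime]  # suffix weight d/4 = 2
X2 = [concat(p, level(HALF, 4)[0]) for p in rest]     # suffix weight (1/2 + 4*alpha) * d/2 = 4
f = make_f1(X1, X2)
```

On this instance `is_monotone(f)` holds, the points of X1 have weight 3 and value 1, and the points of X2 have weight 5 and value 0.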

[Figure 1: two sketches of the cube, one per family, each showing the points 0̄ and 1̄, the weight levels (1/2 − α)d, d/2 and (1/2 + α)d, and regions labelled 0 and 1.]
Fig. 1. An example for functions in F1 (left) and in F2 (right) for m = 2. (The bullets
represent points in X1 and X2)
See left side of Figure 1 for an example of such a function for m = 2.
The distribution Df corresponding to the function f is the uniform distribution
over the 2m points in X1 ∪ X2. That is, Df(x) = 1/(2m) for every x ∈ X1 ∪ X2
and Df(x) = 0 for any other point in $\{0,1\}^d$.
The fact that every function in F1 is monotone follows from the definition. The
fact that F1 is not empty follows from the existence of such a set M. To see that the
function f is well defined, and hence the family F1 satisfies the first requirement
of the construction, we observe the following simple lemma.
Lemma 2. {y : y ≥ x1} ∩ {y : y ≤ x2} = ∅, for every x1 ∈ X1 and x2 ∈ X2.
Proof. If there exists a point z ∈ {y : y ≥ x1} ∩ {y : y ≤ x2}, then x1 < x2.
By the construction of f, there exist two different points p1, p2 ∈ M such that
x1 = p1||y1 and x2 = p2||y2 for some y1 and y2. Thus, p1 ≤ p2, contradicting the
fact that p1 ≠ p2 and |p1| = |p2|.

3.2 The Family F2
Each function f in F2 is also defined by first choosing two sets X1 and X2, each
of size m, such that X1 ⊂ $B^d_{1/2-\alpha}$ and X2 ⊂ $B^d_{1/2+\alpha}$. The choice of the two sets
is done as follows:
– Choose m points from M, and denote the set of m selected points by M′.
– For each point p in M′, choose a pair of points (y1, y2) such that: (a) $y1 \in B^{d/2}_{1/2}$,
(b) $y2 \in B^{d/2}_{1/2+4\alpha}$, and (c) y1 < y2.
– Add the point x1 = p||y1 to X1, and the point x2 = p||y2 to X2.
– The two points x1 and x2 will be referred to as a couple in f.
The function f is now defined as follows:
– For every point x1 ∈ X1 and every point y such that y ∈ U(x1), set f(y) = 1,
where $U(x) \stackrel{def}{=} \{y : y \ge x, |y| \le d/2\}$.
– For every point x2 ∈ X2 and every point y such that y ∈ L(x2), set f(y) = 0,
where $L(x) \stackrel{def}{=} \{y : y \le x, |y| > d/2\}$.
– For every point y such that f(y) was not defined above, if |y| ≤ d/2 then
f(y) = 0, otherwise, f(y) = 1.
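A toy instance of this second construction can be sketched in the same way. As before, the sizes (d = 8, α = 1/8, m = 2), the bitmask encoding and the function names are assumptions made for illustration and are not the parameters used in the lower bound.

```python
HALF, D_DIM = 4, 8   # toy sizes (assumption): d = 8, alpha = 1/8, m = 2

def weight(y):
    """|y|: the number of ones of a bitmask-encoded point."""
    return bin(y).count("1")

def leq(y, z):
    """y <= z in the partial order (set containment on bitmasks)."""
    return y & z == y

def concat(p, y):
    """p || y, with the prefix p stored in the high-order half of the bitmask."""
    return (p << HALF) | y

def make_f2(couples):
    """The function induced by a list of couples (x1, x2) according to the rules above."""
    def f(y):
        # U(x1) = {y : y >= x1, |y| <= d/2}
        if weight(y) <= D_DIM // 2 and any(leq(x1, y) for x1, _ in couples):
            return 1
        # L(x2) = {y : y <= x2, |y| > d/2}
        if weight(y) > D_DIM // 2 and any(leq(y, x2) for _, x2 in couples):
            return 0
        return 0 if weight(y) <= D_DIM // 2 else 1
    return f

# Two prefixes of weight 1 play the role of M'; each is extended by y1 < y2
# with |y1| = d/4 = 2 and |y2| = (1/2 + 4*alpha) * d/2 = 4.
M_prime = [0b0001, 0b0010]
y1, y2 = 0b0011, 0b1111
couples = [(concat(p, y1), concat(p, y2)) for p in M_prime]
f = make_f2(couples)
```

Every couple (x1, x2) of this instance satisfies x1 < x2 with f(x1) = 1 and f(x2) = 0, so each couple is a violation of monotonicity located on the support of Df.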
See right side of Figure 1 for an example of such a function for m = 2.
The distribution Df is again the uniform distribution over X1 ∪ X2, as in the
definition of F1.
The fact that F2 is not empty follows immediately from the existence of the set M.
However, we have to argue that every function f in F2 is 1/2-far from monotone
with respect to Df.
Lemma 3. f is 1/2-far from monotone with respect to Df, for every (f, Df) ∈ F2.
Proof. For every couple (x1, x2) in f we have x1 < x2 while f(x1) = 1 and
f(x2) = 0, so to turn f into a monotone function we have to alter the value of f
either at x1 or at x2. Each such change increases the distance by 1/(2m). Since
there are m different couples, the total distance is at least 1/2.

4 Lower Bound for One-Sided Error Testing
In this section, we prove that there exists no one-sided error distribution-free
monotonicity tester with sub-exponential query complexity that accepts every
pair (f, Df ) in F1, with probability 1, and rejects every pair (f, Df ) in F2, with
high probability. To this aim, we will show that for every tester A, there exists
a pair (f, Df ) ∈ F2, such that with high probability the run of A on (f, Df ) is
also consistent with a monotone function from F1. Since A has to accept every
monotone function, then A has to accept f with high probability.
The above claim is simple if the tester is not allowed to use membership
queries, but only to sample the distribution Df. In this case, it is clear that
distinguishing between the two families requires an exponential number of queries.
However, testers differ from one another in their choice of
membership queries, and dealing with the tester’s membership queries is where
the difficulty of this proof lies.
To state our claim formally, we introduce some notation. Let A be an algorithm
with query and sample complexity n = n(d, 1/ε). Such an algorithm can
also be viewed as a mapping from a sequence $\{(p_i, v_i)\}_{i=1}^{j}$ of labelled points to
either “sample the distribution” or “query $p_{j+1}$” if j < n, and to “accept” or
“reject” if j = n. We refer to such a sequence of labelled points obtained by
A as a knowledge sequence. Given a pair (f, Df), denote by Sf the knowledge
sequence learnt by A during its run on (f, Df) (for some possible run of A on
(f, Df)). We say that a function f is consistent with a knowledge sequence S if
$f(p_i) = v_i$, for every 1 ≤ i ≤ j.
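The view of an algorithm as a mapping from knowledge sequences to actions can be sketched as follows. The encoding of actions as tuples and strings, the driver `run`, and the example algorithm `sampler` are all hypothetical choices made for this illustration; points are encoded as bitmasks and Df is taken to be uniform on a given support.

```python
import random

def run(algorithm, f, support, n, rng):
    """Run `algorithm` for n steps on (f, D_f), where D_f is uniform on `support`.
    `algorithm` maps the current knowledge sequence to ("sample",) or ("query", point)
    while fewer than n labelled points are known, and to "accept" / "reject" afterwards."""
    knowledge = []                       # the knowledge sequence [(p_1, v_1), ...]
    while len(knowledge) < n:
        action = algorithm(knowledge)
        point = rng.choice(support) if action[0] == "sample" else action[1]
        knowledge.append((point, f(point)))
    return algorithm(knowledge), knowledge

def consistent(g, knowledge):
    """g is consistent with a knowledge sequence if g(p_i) = v_i for every i."""
    return all(g(p) == v for p, v in knowledge)

def sampler(n):
    """A hypothetical algorithm that only samples and rejects iff it saw a violated pair."""
    def algorithm(knowledge):
        if len(knowledge) < n:
            return ("sample",)
        violated = any(p & q == p and v > w for p, v in knowledge for q, w in knowledge)
        return "reject" if violated else "accept"
    return algorithm
```

An algorithm of this kind never rejects unless its knowledge sequence contains an explicit violated pair, which is the behaviour a one-sided error tester is forced into.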
As stated above, we show that for some constant c, for every algorithm A
with query complexity $n < 2^{cd}$, there exists a pair (f, Df) ∈ F2 such that with
high probability, over the possible runs of A and the sampling of the domain
according to Df, the sequence Sf is consistent with some monotone function
f′ ∈ F1. Hence, with high probability Sf causes A to accept, contradicting the
requirement that A rejects (f, Df) with probability at least 2/3.

For this, we show that for every algorithm A and for every possible choice
of random coins for A, the probability that, for a randomly drawn pair (f, Df )
from F2, the sequence Sf is consistent with some monotone function from F1
is very high (where the probability is taken over the choice of (f, Df ) and the
sampling of Df ). In other words, for every algorithm A and for every choice
of random coins for A, most pairs in F2 are such that, with high probability
(over the sampling of Df ), Sf is consistent with a monotone function. Hence,
for every algorithm A, there exists a pair (f, Df ) ∈ F2 such that for most choices
of random coins for A, the probability that Sf is consistent with a monotone
function is very high.
Lemma 4. For every algorithm A, there exists a pair (f, Df) ∈ F2 such that
Pr{Sf is consistent with a monotone function} ≥ $1 - 2n^2/m$, where the probability
is over the choice of random coins for A and the samplings of Df.
Based on the above lemma, we can state our lower bound.
Theorem 1. Let c = α/3. For every one-sided tester A that asks fewer than $2^{cd}$
membership queries and samples the distribution fewer than $2^{cd}$ times, there exists
a pair (f, Df) ∈ F2 such that Pr{A accepts (f, Df)} > 1/3.
It remains to prove Lemma 4. We first state the condition that a knowledge
sequence has to satisfy in order to be consistent with some function in F1. We
observe that in order for a knowledge sequence to be extendable to a monotone
function in F1, it is not enough that Sf is consistent with some monotone function.
If for some couple (x1, x2) with prefix p, the sequence Sf contains points
z ∈ U(x1) and z′ ∈ L(x2) both with the same prefix p, then we can deduce that
f ∈ F2 regardless of whether z and z′ are comparable or not. Therefore, we have
to define a relaxed notion of a witness for non-monotone functions for our case,
such that every sequence that does not contain a witness is indeed extendable to
a function in F1. (In general, it is enough to show that there exists an arbitrary
monotone function that is consistent with the knowledge sequence, and not
necessarily a function in F1. However, we will use this fact later on when extending
the proof of the lower bound to the two-sided error case.)
Definition 3. Let Sf be a knowledge sequence learnt by A while running on the
pair (f, Df ) ∈ F2. We say that Sf contains a witness if there exist points y1, y2
in Sf such that for some couple (x1, x2) in f, y1 ∈ U(x1) and y2 ∈ L(x2).
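Definition 3 can be read as a simple predicate on knowledge sequences. The following sketch, with points encoded as bitmasks and a toy instance with d = 8, is an illustration only; the function names and the example couples are assumptions.

```python
def weight(y):
    """|y|: the number of ones of a bitmask-encoded point."""
    return bin(y).count("1")

def in_U(y, x1, d):
    """y in U(x1) = {y : y >= x1, |y| <= d/2}."""
    return y & x1 == x1 and weight(y) <= d // 2

def in_L(y, x2, d):
    """y in L(x2) = {y : y <= x2, |y| > d/2}."""
    return y & x2 == y and weight(y) > d // 2

def contains_witness(knowledge, couples, d):
    """True iff, for some couple (x1, x2), the knowledge sequence contains
    a point of U(x1) and a point of L(x2)."""
    points = [p for p, _ in knowledge]
    return any(any(in_U(y, x1, d) for y in points) and any(in_L(y, x2, d) for y in points)
               for x1, x2 in couples)

# Toy couples (assumption): prefixes 0001 and 0010, suffixes 0011 < 1111.
couples = [(0b00010011, 0b00011111), (0b00100011, 0b00101111)]
```

Note that the two points forming a witness need not be comparable: the points 01010011 (in U(x1) of the first couple) and 00011111 (in L(x2) of the same couple) form a witness although neither is below the other.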
Lemma 5. If the knowledge sequence Sf, learnt by A while running on a pair
(f, Df) ∈ F2, does not contain a witness, then there exists a function f′ ∈ F1
that is consistent with Sf.
To prove the existence of such a function f′, we show that it is possible to construct
the sets X′1 and X′2 such that f′, the function induced by X′1 and X′2, is in F1
and is consistent with Sf. The proof of the lemma is omitted.
Next, we prove that the probability that the knowledge sequence Sf, learnt
by A during its run on a randomly chosen pair (f, Df) ∈ F2, contains a witness
is very small. To do so, we prove that, for every step of A, the probability that the
knowledge sequence contains a witness after that step is very small.
We start with the following technical lemma. It shows that, for every query
that A may ask and every sample it may receive, there can be only one prefix in M
such that this query or sample belongs to U(x1) or L(x2) for some couple (x1, x2)
with that prefix. The proof of the lemma relies on the fact that the points in M
are far apart and is omitted for lack of space.
Lemma 6. Let f be a function in F2. Then, for every point
$z \in \{y \in \{0,1\}^d : (1/2 - \alpha)d \le |y| \le d/2\}$, there exists at most one prefix p ∈ M
such that z ∈ U(x) for some point $x \in B^d_{1/2-\alpha}$ with prefix p. Similarly, for every point
$z \in \{y \in \{0,1\}^d : d/2 < |y| \le (1/2 + \alpha)d\}$, there exists at most one prefix p ∈ M such that
z ∈ L(x) for some point $x \in B^d_{1/2+\alpha}$ with prefix p.
Given (f, Df) ∈ F2 and a knowledge sequence Sf, denote by Sf(i) the length-i
prefix of Sf.
Lemma 7. For every algorithm A and every possible sequence of random coins
for A, the following holds:
Pr{Sf(i + 1) contains a witness | Sf(i) does not contain a witness} ≤ (i + 2)/m
(where the probability is over the random choice of (f, Df) from F2 that is
consistent with Sf(i), and over the possible samplings of Df for the chosen pair
(f, Df)).
Proof. We distinguish between the two possible types of steps for A. The first
possibility is that in step i + 1 A asks for a sample of the distribution Df, and
the second possibility is that A chooses to query the function. In the first case,
for every possible choice of (f, Df) from F2, we show that the probability that
Sf(i + 1) contains a witness is at most i/(2m). Let (f, Df) be a pair in F2 that is
consistent with Sf(i) and let X1 and X2 be the sets used for its construction.
The sampled point can cause Sf(i) to contain a witness only if Sf(i) already
contains a point that is related to its couple. That is, if for some couple (x1, x2)
in f, the sampled point is either x1 and there are points in Sf(i) from L(x2), or
the sampled point is x2 and Sf(i) contains points from U(x1). By Lemma 6
and the fact that every two different points in X1 have different prefixes, every
point in Sf(i) can be in U(x1) for at most one point x1 ∈ X1 and similarly,
every point can be in L(x2) for at most one point x2 ∈ X2. Hence, for every
point in Sf(i), only sampling of the relevant point x1 or x2 can cause the
knowledge sequence to contain a witness. (In general, if the points in M were not
“far apart”, it is possible that the couples are very close to one another, and
hence almost every sampled point causes the knowledge sequence to contain a
witness.) The probability of this event is 1/(2m). Therefore, the probability that
the sampled point that was returned by the oracle causes Sf(i) to contain a
witness can be bounded by i/(2m).
In the second case, let $q_{i+1}$ be the query asked by A in step i + 1. If
$|q_{i+1}| < (1/2 - \alpha)d$ or $|q_{i+1}| > (1/2 + \alpha)d$, then Sf(i + 1) cannot contain a witness. Assume
w.l.o.g. that $|q_{i+1}| \le d/2$ (the proof for the case that $|q_{i+1}| > d/2$ is similar). Based
on Lemma 6, there can be at most one relevant prefix p ∈ M such that a couple
(x1, x2) with prefix p can include $q_{i+1}$ in U(x1). If there is no 0-labelled point
z in Sf(i) that can possibly be in L(x2), for a point x2 with prefix p, then $q_{i+1}$
cannot cause Sf(i) to contain a witness. Hence, assume there exists a 0-labelled
point z in Sf(i) that can possibly be in L(x2) for a point x2 with prefix p. The
sequence Sf(i + 1) will contain a witness only if $q_{i+1} \in U(x1)$ for the point x1
corresponding to x2. Note that Sf does not necessarily include x2 itself, but
only points in L(x2). We will bound the probability, over the possible choices of
a pair (f, Df) such that f is consistent with Sf(i), that indeed $q_{i+1} \in U(x1)$,
and show that this probability is at most 2/m.
The proof is by showing an upper bound on the probability, for a randomly
drawn point x1 that is consistent with Sf, that indeed $q_{i+1} \in U(x1)$. Before
proving this bound, we explain why such a bound implies a bound on the
probability that a randomly drawn pair in F2 that is consistent with Sf satisfies
$q_{i+1} \in U(x1)$.
We first observe that, based on Lemma 6, there exists a unique prefix p ∈ M
such that all the functions in F2 that are consistent with Sf contain a couple with
the prefix p. Clearly, once the prefix of a couple is determined and we consider
only functions in F2 that include a couple with that prefix, the distribution
induced over the possible choices of this couple in f is uniform over all prefix-p
couples (x1, x2) such that x1 < x2. In addition, for every such function f,
the question whether indeed $q_{i+1} \in U(x1)$ is fully determined by the choice of
the prefix-p couple (x1, x2). Hence, it is enough to bound the probability, over
the possible choices of the couple (x1, x2) in f with the prefix p such that the
couple is consistent with Sf(i), that $q_{i+1} \in U(x1)$. Following the fact that every
point x of weight (1/2 − α)d plays the role of x1 in the same number of couples (note
that the weight of x1 and x2 is identical in all the couples), we conclude that
the distribution induced by a random choice of a couple with prefix p over the
choice of x1 (or x2) is uniform. Thus, bounding the probability that a random
couple that is consistent with Sf satisfies $q_{i+1} \in U(x1)$ is done by considering the
probability that a random choice of x1 that is consistent with Sf(i) satisfies
$q_{i+1} \in U(x1)$, and showing that this probability is at most 2/m. The technical
details are omitted and will appear in the full version of the paper.

Based on the above lemma, it is possible to prove the following proposition.
Lemma 4 follows.
Proposition 1. For every algorithm A and every possible choice of random
coins for A, the following holds: Pr{Sf contains a witness} ≤ $2n^2/m$ (where the
probability is taken over the random choice of the pair (f, Df) from F2 and over
the possible sampling of Df for the chosen pair (f, Df)).
Proof. Given an algorithm A and a choice of random coins for A, the probability
that Sf contains a witness for a randomly chosen pair (f, Df) ∈ F2 can
be bounded by
$\sum_{i=1}^{n} \Pr\{S_f(i+1) \text{ contains a witness} \mid S_f(i) \text{ does not contain a witness}\}$.
By Lemma 7, this probability can be bounded by
$\sum_{i=1}^{n} \frac{i+2}{m} \le \frac{n(n+1)+2n}{m} \le \frac{2n^2}{m}$,
where the last inequality holds for n ≥ 3.
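The summation in this proof is easy to check with exact arithmetic. The sketch below, whose function name is an illustrative choice, evaluates the sum of the per-step bounds from Lemma 7 and compares it with $2n^2/m$.

```python
from fractions import Fraction

def witness_bound(n, m):
    """Sum over the steps i = 1..n of the per-step bound (i + 2)/m from Lemma 7."""
    return sum(Fraction(i + 2, m) for i in range(1, n + 1))
```

The exact value of the sum is (n(n+1)/2 + 2n)/m, which is at most $2n^2/m$ for every n ≥ 2. For n = 1 the sum equals 3/m, so the comparison is only meaningful beyond the first step.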

5 Lower Bound for Two-Sided Error Testing
In this section, we extend Theorem 1 to the two-sided error case, and prove that
there exists no two-sided error distribution-free monotonicity tester with
sub-exponential query complexity that accepts every pair (f, Df) ∈ F1, with high
probability, and rejects every pair (f, Df) ∈ F2, with high probability. To do
so, we show that there exists a constant c, such that for every algorithm A that
asks fewer than $2^{cd}$ membership queries and samples Df fewer than $2^{cd}$ times, the
distributions induced on the set of possible knowledge sequences by running A
on a randomly chosen pair from F1 and from F2 are indistinguishable.
In the previous section, we saw that for every algorithm A, the probability
that a knowledge sequence obtained by A, while running on a randomly chosen
pair from F2, contains a witness, is very small. At first glance, it may seem that
this is enough to show that the two distributions are indistinguishable. However,
this is not true. In the case of functions from F1, for every prefix p ∈ M there
exists a point either in X1 or in X2 with the prefix p (m of the prefixes in M are
used as prefixes for points in X1, while the other m are used as prefixes for points
in X2). This does not hold in the case of functions from F2, where we choose
only m out of the 2m prefixes, and select a couple (x1, x2) with each prefix.
Hence, one can suggest a testing approach that looks for unused prefixes from
M. (Such an approach is not relevant in the case of a one-sided error tester, since
the tester has to accept every function in F1 with probability 1.) Clearly, if the
tester is only allowed to sample the distribution Df , then this testing approach
requires an exponential number of queries. However, it is possible that we can
use membership queries to significantly reduce the required query complexity.
Hence, showing that with high probability the knowledge sequence Sf does not
contain a witness is not enough, and there are other undesired events we have to
eliminate. The full details of the two-sided testing lower bound proof will appear
in the full version of this paper.
Acknowledgment
This work was initiated by a discussion with Nader Bshouty. We wish to thank
Nader for sharing his ideas with us; they played a significant role in the devel-
opment of the results presented in this paper.
References
1. N. Alon, E. Fischer, M. Krivelevich, and M. Szegedy, Efficient testing of large
graphs, FOCS 1999, pages 656–666.
2. T. Batu, R. Rubinfeld, and P. White, Fast approximation PCPs for multidimensional
bin-packing problems, RANDOM-APPROX 1999, pages 246–256.
3. M. Blum, M. Luby, and R. Rubinfeld, Self testing/correcting with applications to
numerical problems, Journal of Computer and System Sciences 47:549–595, 1993.
4. A. Bogdanov, K. Obata, and L. Trevisan, A lower bound for testing 3-colorability
in bounded-degree graphs, FOCS, 2002, pages 93-102.

5. A. Czumaj and C. Sohler, Testing hypergraph coloring, ICALP 2001, pp. 493–505.
6. Y. Dodis, O. Goldreich, E. Lehman, S. Raskhodnikova, D. Ron, and A. Samorod-
nitsky, Improved testing algorithms for monotonicity, RANDOM-APPROX 1999,
pages 97–108.
7. F. Ergün, S. Kannan, R. Kumar, R. Rubinfeld, and M. Viswanathan, Spot-checkers,
Journal of Computer and System Sciences, 60:717–751, 2000 (a preliminary
version appeared in STOC 1998).
8. E. Fischer, On the strength of comparisons in property testing, manuscript (avail-
able at ECCC 8(8): (2001)).
9. E. Fischer, The art of uninformed decisions: A primer to property testing, The
Computational Complexity Column of the Bulletin of the European Association for
Theoretical Computer Science, 75:97–126, 2001.
10. E. Fischer, E. Lehman, I. Newman, S. Raskhodnikova, R. Rubinfeld, and A.
Samorodnitsky, Monotonicity testing over general poset domains, STOC 2002,
pages 474–483.
11. E. Fischer and I. Newman, Testing of matrix properties, STOC 2001, pages 286–
295.
12. O. Goldreich, Combinatorial property testing – a survey, In: Randomized Methods
in Algorithms Design, AMS-DIMACS pages 45–61, 1998.
13. O. Goldreich, S. Goldwasser, E. Lehman, D. Ron, and A. Samorodnitsky, Testing
Monotonicity, Combinatorica, 20(3):301–337, 2000 (a preliminary version appeared
in FOCS 1998).
14. O. Goldreich, S. Goldwasser, and D. Ron, Property testing and its connection to
learning and approximation, Journal of the ACM, 45(4):653–750, 1998 (a prelimi-
nary version appeared in FOCS 1996).
15. O. Goldreich and D. Ron, Property testing in bounded degree graphs, STOC 1997,
pages 406–415.
16. O. Goldreich and L. Trevisan, Three theorems regarding testing graph properties,
FOCS 2001, pages 302–317.
17. S. Halevy and E. Kushilevitz, Distribution-free property testing. In RANDOM-
APPROX 2003, pages 341–353.
18. S. Halevy and E. Kushilevitz, Testing monotonicity over graph products, ICALP
2004, pages 721–732.
19. S. Halevy and E. Kushilevitz, Distribution-free connectivity testing. In RANDOM-
APPROX 2004, pages 393–404.
20. T. Kaufman, M. Krivelevich, and D. Ron, Tight bounds for testing bipartiteness
in general graphs, RANDOM-APPROX 2003, pages 341–353.
21. M. J. Kearns and U. V. Vazirani, An Introduction to Computational Learning
Theory, MIT Press, 1994.
22. D. Ron, Property testing (a tutorial), In: Handbook of Randomized Computing
(S. Rajasekaran, P. M. Pardalos, J. H. Reif and J. D. P. Rolim eds), Kluwer Press
(2001).
23. R. Rubinfeld and M. Sudan, Robust characterization of polynomials with applications
to program testing, SIAM Journal on Computing, 25(2):252–271, 1996. (first
appeared as a technical report, Cornell University, 1993).

On Learning Random DNF Formulas
Under the Uniform Distribution
Jeffrey C. Jackson¹ and Rocco A. Servedio²
¹ Dept. of Math. and Computer Science, Duquesne University, Pittsburgh, PA 15282
jackson@mathcs.duq.edu
² Dept. of Computer Science, Columbia University, New York, NY 10027, USA
rocco@cs.columbia.edu
Abstract. We study the average-case learnability of DNF formulas in
the model of learning from uniformly distributed random examples. We
define a natural model of random monotone DNF formulas and give an
efficient algorithm which with high probability can learn, for any fixed
constant $\gamma > 0$, a random t-term monotone DNF for any $t = O(n^{2-\gamma})$.
We also define a model of random nonmonotone DNF and give an efficient
algorithm which with high probability can learn a random t-term DNF
for any $t = O(n^{3/2-\gamma})$. These are the first known algorithms that can
successfully learn a broad class of polynomial-size DNF in a reasonable
average-case model of learning from random examples.
1 Introduction
1.1 Motivation and Background
A disjunctive normal form formula, or DNF, is an OR of ANDs of Boolean literals.
A question that has been open since Valiant’s initial paper on computational
learning theory [22] is whether or not efficient algorithms exist for learning poly-
nomial size DNF formulas in various learning models. The only positive result
to date is the Harmonic Sieve [11], which is a membership-query algorithm that
efficiently learns DNF with respect to the uniform distribution (and certain re-
lated distributions). The approximating function produced by the Sieve is not
itself a DNF; thus, the Sieve is an improper learning algorithm.
There has been little progress on polynomial-time algorithms for learning
arbitrary DNF since the discovery of the Sieve. There are two obvious relax-
ations of the uniform-plus-membership model that can be pursued: learn with
respect to arbitrary distributions using membership queries, and learn with re-
spect to uniform without membership queries. Given standard cryptographic
assumptions, the first direction is essentially as difficult as showing that DNF
is learnable with respect to arbitrary distributions without membership queries
[3]. However, there are also substantial known obstacles to learning DNF in the
second model of uniform distribution without membership queries. In particu-
lar, no algorithm which can be recast as a Statistical Query algorithm can learn
arbitrary polynomial-size DNF under the uniform distribution in $n^{o(\log n)}$ time
[7]. Since nearly all non-membership learning algorithms can be recast as SQ al-
gorithms [15], a major conceptual shift seems necessary to obtain an algorithm
for efficiently learning arbitrary DNF formulas from uniform examples alone.
An apparently simpler question is whether monotone DNF formulas can be
learned efficiently. Angluin showed that monotone DNF can be properly learned
with respect to arbitrary distributions using membership queries [2]. It has also
long been known that with respect to arbitrary distributions without membership
queries, monotone DNF are no easier to learn than arbitrary DNF [16].
This leaves the following enticing question (posed in [5, 6, 14]): are monotone
DNF efficiently learnable from uniform examples alone?
In 1990, Verbeurgt [23] gave an algorithm that can properly learn any poly(n)-size
(arbitrary) DNF from uniform examples in time $n^{O(\log n)}$. More recently, the
algorithm of [21] learns any $2^{\sqrt{\log n}}$-term monotone DNF in poly(n) time. However,
despite significant interest in the problem, no algorithm faster than that of
[23] is known for learning arbitrary poly(n)-size monotone DNF from uniform
examples, and no known hardness result precludes such an algorithm (the SQ
result of [7] is at its heart a hardness result for low-degree parity functions).
Since worst-case versions of several DNF learning problems have remained
stubbornly open for a decade or more, it is natural to ask about DNF learning
from an average-case perspective, i.e., about learning random DNF formulas. In
fact, this question has been considered before: Aizenstein and Pitt [1] were the first
to ask whether random DNF formulas are efficiently learnable. They proposed a
model of random DNF in which each of the t terms is selected independently at
random from all possible terms, and gave a membership and equivalence query
algorithm which with high probability learns a random DNF generated in this
way. However, as noted in [1], a limitation of this model is that with very high
probability all terms will have length Ω(n). The learning algorithm itself becomes
quite simple in this situation. Thus, while this is a “natural” average-case DNF
model, from a learning perspective it is not a particularly interesting model. To
address this deficiency, they also proposed another natural average-case model
which is parameterized by the expected length k of each term as well as the
number of independent terms t, but left open the question of whether or not
random DNF can be efficiently learned in such a model.
1.2 Our Results
We consider an average-case DNF model very similar to the latter Aizenstein
and Pitt model, although we simplify slightly by assuming that k represents a
fixed term length rather than an expected length. We show that, in the model
of learning from uniform random examples only, random monotone DNF are
properly and efficiently learnable for many interesting values of k and t. In
particular, for $t = O(n^{2-\gamma})$ where $\gamma > 0$, and for $k = \log t$, our algorithm can
achieve any error rate $\epsilon > 0$ in $\mathrm{poly}(n, 1/\epsilon)$ time with high probability (over
both the selection of the target DNF and the selection of examples). In addition,
we obtain slightly weaker results for arbitrary DNF: our algorithm can properly
and efficiently learn random t-term DNF for t such that $t = O(n^{3/2-\gamma})$. This
algorithm cannot achieve arbitrarily small error but can achieve error $\epsilon = o(1)$
for any $t = \omega(1)$. For detailed result statements see Theorems 3 and 5.
While our results would clearly be stronger if they held for any t = poly(n)
rather than the specific polynomials given, they are a marked advance over the
previous state of affairs for DNF learning. (Recall that in the standard worst-
case model, poly(n)-time uniform-distribution learning of t(n)-term DNF for any
t(n) = ω(1) is an open problem with an associated cash prize [4].)
We note that taking k = log t is a natural choice when learning with respect
to the uniform distribution. (We actually allow a somewhat more general choice
of k.) This choice leads to target DNFs that, with respect to uniform, are roughly
balanced (0 and 1 values are equally likely). From a learning perspective balanced
functions are generally more interesting than unbalanced functions, since a con-
stant function is trivially a good approximator to a highly unbalanced function.
Our results shed some light on which cases are not hard in the worst-case
model. While “hard” cases were previously known for arbitrary DNF [4], our
findings may be particularly helpful in guiding future research on monotone
DNF. In particular, our algorithm learns any monotone DNF which (i) is near-
balanced, (ii) has every term uniquely satisfied with reasonably high probability,
(iii) has every pair of terms jointly satisfied with much smaller probability, and
(iv) has no variable appearing in significantly more than a $1/\sqrt{t}$ fraction of the t
terms (this is made precise in Lemma 6). So in order to be “hard,” a monotone
DNF must violate one or more of these criteria.
Our algorithms work in two stages: they first identify pairs of variables which
cooccur in some term of the target DNF, and then use these pairs to reconstruct
terms via a specialized clique-finding algorithm. For monotone DNF we can with
high probability determine for every pair of variables whether or not the pair
cooccurs in some term. For nonmonotone DNF, with high probability we can
identify most pairs of variables which cooccur in some term; as we show, this
enables us to learn to fairly (but not arbitrarily) high accuracy.
We give preliminaries in Section 2. Sections 3 and 4 contain our results for
monotone and nonmonotone DNF respectively. Section 5 concludes. Because of
space constraints, many proofs are omitted and will appear in the full version.
2 Preliminaries
We first describe our models of random monotone and nonmonotone DNF. Let
$M_n^{t,k}$ be the probability distribution over monotone t-term DNF induced by the
following random process: each term is independently and uniformly chosen at
random from all $\binom{n}{k}$ monotone ANDs of size exactly k over variables $v_1, \ldots, v_n$.
For nonmonotone DNF, we write $D_n^{t,k}$ to denote the following distribution over
t-term DNF: first a monotone DNF is selected from $M_n^{t,k}$, and then each occurrence
of each variable in each term is independently negated with probability
1/2. (Equivalently, a draw from $D_n^{t,k}$ is done by independently selecting t terms
from the set of all terms of length exactly k.)
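The two random processes above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0-based variable indexing, and the representation of a term (a set of indices in the monotone case, a map from index to required bit in the nonmonotone case) are assumptions made here, not notation from the paper.

```python
import random

def sample_monotone_dnf(n, t, k, rng):
    """Draw t terms independently; each is a uniformly random k-subset of {0..n-1}."""
    return [frozenset(rng.sample(range(n), k)) for _ in range(t)]

def sample_dnf(n, t, k, rng):
    """Nonmonotone model: start from a monotone draw, then negate each
    variable occurrence independently with probability 1/2.
    A term maps variable index -> required bit (1 = unnegated, 0 = negated)."""
    return [{v: rng.randrange(2) for v in term}
            for term in sample_monotone_dnf(n, t, k, rng)]

def eval_monotone(terms, x):
    # f(x) = 1 iff some term has all of its variables set to 1
    return int(any(all(x[v] for v in term) for term in terms))

def eval_dnf(terms, x):
    # f(x) = 1 iff some term has every literal agreeing with x
    return int(any(all(x[v] == b for v, b in term.items()) for term in terms))
```

Terms are drawn with replacement, so the same term may appear twice, which matches the independence of the t draws in the definition.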
Given a Boolean function $\varphi : \{0,1\}^n \to \{0,1\}$, we write Pr[φ] to denote
$\Pr_{x \sim U_n}[\varphi(x) = 1]$, where $U_n$ denotes the uniform distribution over $\{0,1\}^n$. We
write log to denote $\log_2$ and ln to denote natural log.
In the uniform distribution learning model which we consider, the learner is
given a source of labeled examples (x, f(x)) where each x is uniformly drawn
from $\{0,1\}^n$ and f is the unknown function to be learned. The goal of the learner
is to efficiently construct a hypothesis h which with high probability (over the
choice of labeled examples used for learning) has low error relative to f under
the uniform distribution, i.e. $\Pr_{x \sim U_n}[h(x) \neq f(x)] \le \epsilon$ with probability $1 - \delta$.
This model has been intensively studied in learning theory, see e.g. [9, 10, 12, 18,
19, 21, 23]. In our average case framework, the target function f will be drawn
randomly from either $M_n^{t,k}$ or $D_n^{t,k}$, and (as in [13]) our goal is to construct a low-error
hypothesis h for f with high probability over both the random examples
used for learning and the random draw of f.
3 Learning Random Monotone DNF
3.1 Interesting Parameter Settings
Consider a random draw of f from $M_n^{t,k}$. It is intuitively clear that if t is too
large relative to k then a random $f \in M_n^{t,k}$ will likely have Pr[f] ≈ 1; similarly
if t is too small relative to k then a random $f \in M_n^{t,k}$ will likely have Pr[f] ≈ 0.
Such cases are not very interesting from a learning perspective since a trivial
algorithm can learn to high accuracy. We are thus led to the following definition:
Definition 1. For $0 < \alpha \le 1/2$, a pair of values (k, t) is said to be monotone
α-interesting if $\alpha \le \mathrm{E}_{f \in M_n^{t,k}}[\Pr[f]] \le 1 - \alpha$.
Throughout the paper we will assume that $0 < \alpha \le 1/2$ is a fixed constant
independent of n and that t ≤ p(n), where p(·) is a fixed polynomial (and
we will also make assumptions about the degree of p). The following lemma
gives necessary conditions for (k, t) to be monotone α-interesting. (As Lemma 1
indicates, we may always think of k as being roughly log t.)
Lemma 1. If (k, t) is monotone α-interesting then $\alpha 2^k \le t \le 2^{k+1} \ln\frac{2}{\alpha}$.
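As a rough numerical illustration of Definition 1 and Lemma 1, the sketch below estimates the expectation of Pr[f] by Monte Carlo and computes the window of t allowed by Lemma 1. The function names are hypothetical, and the upper end of the window assumes the bound is read as 2^(k+1) · ln(2/α).

```python
import math
import random

def estimate_expected_pr(n, t, k, trials, rng):
    """Monte Carlo estimate of E_f[Pr_x[f(x) = 1]] for f drawn from M_n^{t,k}.

    Each trial draws a fresh f and a fresh uniform x, so the average of
    f(x) is an unbiased estimate of the double expectation."""
    hits = 0
    for _ in range(trials):
        terms = [rng.sample(range(n), k) for _ in range(t)]
        x = [rng.randrange(2) for _ in range(n)]
        hits += any(all(x[v] for v in term) for term in terms)
    return hits / trials

def lemma1_window(k, alpha):
    """Range of t permitted by Lemma 1 (assumed reading: ln(2/alpha))."""
    return alpha * 2**k, 2**(k + 1) * math.log(2 / alpha)
```

For example, with k = 4 and α = 1/4 the window starts at t = 4, and taking t = 16 (so that k = log t) gives an estimate comfortably between α and 1 − α.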
3.2 Properties of Random Monotone DNF
Throughout the rest of Section 3 we assume that α > 0 is fixed and (k, t) is a
monotone α-interesting pair where $t = O(n^{2-\gamma})$ for some γ > 0. In this section
we state some useful lemmas regarding $M_n^{t,k}$.
Let $f^i$ denote the projected function obtained from f by first removing term
$T_i$ from the monotone DNF for f and then restricting all of the variables which
were present in term $T_i$ to 1. For $\ell \neq i$ we write $T_\ell^i$ to denote the term obtained
by setting all variables in $T_i$ to 1 in $T_\ell$, i.e. $T_\ell^i$ is the term in $f^i$ corresponding
to $T_\ell$. Note that if $T_\ell^i \not\equiv T_\ell$ then $T_\ell^i$ is smaller than $T_\ell$.
The following lemma shows that each variable appears in a limited number
of terms; intuitively this means that not too many terms $T_\ell^i$ in $f^i$ are smaller
than their corresponding terms $T_\ell$ in f. In this and later lemmas, “n sufficiently
large” means n is larger than a constant which depends on α but not on k or t.
Lemma 2. Let $\delta_{\mathrm{many}} := n \left( \frac{e k t^{3/2} \log t}{n 2^{k-1} \alpha^2} \right)^{2^{k-1}\alpha^2/(\sqrt{t} \log t)}$. For n sufficiently large,
with probability at least $1 - \delta_{\mathrm{many}}$ over the draw of f from $M_n^{t,k}$ we have that
every variable $v_j$, 1 ≤ j ≤ n, appears in at most $2^{k-1}\alpha^2/(\sqrt{t} \log t)$ terms of f.
Note that since (k, t) is a monotone α-interesting pair and $t = O(n^{2-\gamma})$ for some
fixed γ > 0, for sufficiently large n this probability bound is non-trivial.
Using straightforward probabilistic arguments, we can show that for f drawn
from $M_n^{t,k}$, with high probability each term is “uniquely satisfied” by a noticeable
fraction of assignments. More precisely, we have:
Lemma 3. Let $\delta_{\mathrm{usat}} := tk \left( \frac{e k (t-1) \log t}{n 2^k} \right)^{2^k/\log t} + t^2 \left( \frac{k^2}{n} \right)^{\log\log t}$. For n sufficiently
large, with probability at least $1 - \delta_{\mathrm{usat}}$ over the random draw of f from $M_n^{t,k}$,
f is such that for all i = 1, . . . , t we have $\Pr_x$[$T_i$ is satisfied by x but no other
$T_j$ is satisfied by x] $\ge \alpha^3/2^{k+2}$.
On the other hand, we can upper bound the probability that two terms of a
random DNF f will be satisfied simultaneously:
Lemma 4. Let $\delta_{\mathrm{shared}} := t^2 (k^2/n)^{\log\log t}$. With probability at least $1 - \delta_{\mathrm{shared}}$ over
the random draw of f from $M_n^{t,k}$, for all $1 \le i < j \le t$ we have $\Pr[T_i \wedge T_j] \le \log t / 2^{2k}$.
3.3 Identifying Cooccurring Variables
We now show how to identify pairs of variables that cooccur in some term of f.
First, some notation. Given a monotone DNF f over variables v1, . . . , vn,
define DNF formulas g∗∗, g1∗, g∗1 and g11 over variables v3, . . . , vn as follows:
– g∗∗ is the disjunction of the terms in f that contain neither v1 nor v2;
– g1∗ is the disjunction of the terms in f that contain v1 but not v2 (but with
v1 removed from each of these terms);
– g∗1 is defined similarly as the disjunction of the terms in f that contain v2
but not v1 (but with v2 removed from each of these terms);
– g11 is the disjunction of the terms in f that contain both v1 and v2 (with
both variables removed from each term).
We thus have f = g∗∗ ∨(v1g1∗)∨(v2g∗1)∨(v1v2g11). Note that any of g∗∗, g1∗, g∗1,
g11 may be an empty disjunction which is identically false.
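The four-way split can be made concrete with a short sketch. It assumes (for illustration only) that a monotone term is a set of 0-based variable indices and that the pair (v1, v2) is given as indices (i, j); the helper names are hypothetical.

```python
from itertools import product

def split_by_pair(terms, i, j):
    """Partition monotone terms by whether they contain variables i and j.

    Keys follow the text with (v1, v2) = (i, j); i and j are removed from
    the terms placed in each group."""
    groups = {"**": [], "1*": [], "*1": [], "11": []}
    for term in terms:
        key = ("1" if i in term else "*") + ("1" if j in term else "*")
        groups[key].append(frozenset(term) - {i, j})
    return groups

def eval_or_of_ands(terms, x):
    # An empty disjunction is identically false; an empty term is identically true.
    return any(all(x[v] for v in term) for term in terms)

def check_decomposition(terms, i, j, n):
    """Exhaustively confirm f = g** OR (v_i g1*) OR (v_j g*1) OR (v_i v_j g11)."""
    g = split_by_pair(terms, i, j)
    for x in product([0, 1], repeat=n):
        rhs = (eval_or_of_ands(g["**"], x)
               or (x[i] and eval_or_of_ands(g["1*"], x))
               or (x[j] and eval_or_of_ands(g["*1"], x))
               or (x[i] and x[j] and eval_or_of_ands(g["11"], x)))
        if bool(rhs) != eval_or_of_ands(terms, x):
            return False
    return True
```

The exhaustive check is exponential in n and is meant only for small sanity checks of the identity, not as part of any learning procedure.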

We can empirically estimate each of the following using uniform random
examples (x, f(x)):
$p_{00} := \Pr_x[g_{**}] = \Pr_{x \in U_n}[f(x) = 1 \mid x_1 = x_2 = 0]$
$p_{01} := \Pr_x[g_{**} \vee g_{*1}] = \Pr_{x \in U_n}[f(x) = 1 \mid x_1 = 0, x_2 = 1]$
$p_{10} := \Pr_x[g_{**} \vee g_{1*}] = \Pr_{x \in U_n}[f(x) = 1 \mid x_1 = 1, x_2 = 0]$
$p_{11} := \Pr_x[g_{**} \vee g_{*1} \vee g_{1*} \vee g_{11}] = \Pr_{x \in U_n}[f(x) = 1 \mid x_1 = 1, x_2 = 1]$.
It is clear that g11 is nonempty if and only if v1 and v2 cooccur in some term
of f; thus we would ideally like to obtain Prx∈Un [g11]. While we cannot obtain
this probability from p00, p01, p10 and p11, the following lemma shows that we
can estimate a related quantity:
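Lemma 5. Let P denote $p_{11} - p_{10} - p_{01} + p_{00}$. Then
$P = \Pr[g_{11} \wedge \overline{g_{1*}} \wedge \overline{g_{*1}} \wedge \overline{g_{**}}] - \Pr[g_{1*} \wedge g_{*1} \wedge \overline{g_{**}}]$.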
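The identity in Lemma 5 can be checked exactly on small examples by enumerating all assignments. The sketch below does this for the pair (v1, v2), represented as indices 0 and 1; it reads the two probabilities on the right-hand side with negations on the groups that must be falsified, which is an assumption about the intended statement. All names are illustrative.

```python
from fractions import Fraction
from itertools import product

def f_val(terms, x):
    return any(all(x[v] for v in T) for T in terms)

def lemma5_sides(terms, n):
    """Return (P, rho1 - rho2) for the pair (v1, v2) = (0, 1), computed exactly."""
    g = {"**": [], "1*": [], "*1": [], "11": []}
    for T in terms:
        key = ("1" if 0 in T else "*") + ("1" if 1 in T else "*")
        g[key].append(set(T) - {0, 1})
    rest = list(product([0, 1], repeat=n - 2))  # assignments to v3..vn

    def pr(event):
        return Fraction(sum(bool(event(y)) for y in rest), len(rest))

    def val(key, y):
        # the groups never mention v1 or v2, so their values are irrelevant here
        return f_val(g[key], (0, 0) + y)

    def p(a, b):
        return pr(lambda y: f_val(terms, (a, b) + y))

    P = p(1, 1) - p(1, 0) - p(0, 1) + p(0, 0)
    rho1 = pr(lambda y: val("11", y) and not val("1*", y)
              and not val("*1", y) and not val("**", y))
    rho2 = pr(lambda y: val("1*", y) and val("*1", y) and not val("**", y))
    return P, rho1 - rho2
```

When no term contains both variables, the first probability is zero, so P is at most 0; this is the observation behind part (i) of the lemma that follows.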
More generally, let Pij be defined as P but with vi, xi, vj, and xj substituted
for v1, x1, v2, and x2, respectively, throughout the definitions of the g’s and p’s
above. The following lemma shows that, for most random choices of f, for all
1 ≤ i, j ≤ n, the value of Pij is a good indicator of whether or not vi and vj
cooccur in some term of f:
Lemma 6. For n sufficiently large and t ≥ 4, with probability at least
$1 - \delta_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}}$ over the random draw of f from $M_n^{t,k}$, we have that for
all 1 ≤ i, j ≤ n (i) if $v_i$ and $v_j$ do not cooccur in any term of f then $P_{ij} \le 0$;
(ii) if $v_i$ and $v_j$ do cooccur in some term of f then $P_{ij} \ge \frac{\alpha^4}{8t}$.
Proof. Part (i) holds for any monotone DNF by Lemma 5. For (ii), we first note
that with probability at least 1 − δusat − δshared − δmany, a randomly chosen f
has all the following properties:
1. Each term in f is uniquely satisfied with probability at least $\alpha^3/2^{k+2}$ (by
Lemma 3);
2. Each pair of terms $T_i$ and $T_j$ in f are both satisfied with probability at most
$\log t/2^{2k}$ (by Lemma 4); and
3. Each variable in f appears in at most $2^{k-1}\alpha^2/(\sqrt{t} \log t)$ terms (by Lemma 2).
We call such an f well-behaved. For the sequel, assume that f is well-behaved
and also assume without loss of generality that i = 1 and j = 2. We consider
separately the two probabilities $\rho_1 = \Pr[g_{11} \wedge \overline{g_{1*}} \wedge \overline{g_{*1}} \wedge \overline{g_{**}}]$ and
$\rho_2 = \Pr[g_{1*} \wedge g_{*1} \wedge \overline{g_{**}}]$ whose difference defines $P_{12} = P$. By property (1) above,
$\rho_1 \ge \alpha^3/2^{k+2}$, since each instance x that uniquely satisfies a term $T_j$ in f containing both $v_1$ and
$v_2$ also satisfies $g_{11}$ while falsifying all of $g_{1*}$, $g_{*1}$, and $g_{**}$. Since (k, t) is monotone
α-interesting, this implies that $\rho_1 \ge \alpha^4/(4t)$. On the other hand, clearly
$\rho_2 \le \Pr[g_{1*} \wedge g_{*1}]$. By property (2) above, for any pair of terms consisting of one term
from $g_{1*}$ and the other from $g_{*1}$, the probability that both terms are satisfied
is at most $\log t/2^{2k}$. Since each of $g_{1*}$ and $g_{*1}$ contains at most $2^{k-1}\alpha^2/(\sqrt{t} \log t)$
terms by property (3), by a union bound we have $\rho_2 \le \alpha^4/(4t \log t)$, and the
lemma follows given the assumption that t ≥ 4.
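The last step of the proof compresses a short calculation. Written out under properties (1) to (3) and the lower bound on t from Lemma 1, it reads as follows; this is a worked restatement of the argument above, not an additional claim.

```latex
\begin{align*}
\rho_1 &\ge \frac{\alpha^3}{2^{k+2}} = \frac{\alpha^3}{4 \cdot 2^k} \ge \frac{\alpha^4}{4t}
  && \text{since } \alpha 2^k \le t \text{ (Lemma 1)},\\
\rho_2 &\le \left( \frac{2^{k-1}\alpha^2}{\sqrt{t}\,\log t} \right)^{2} \cdot \frac{\log t}{2^{2k}}
  = \frac{\alpha^4}{4t \log t}
  && \text{(union bound over pairs of terms)},\\
P_{12} &= \rho_1 - \rho_2 \ge \frac{\alpha^4}{4t}\left(1 - \frac{1}{\log t}\right) \ge \frac{\alpha^4}{8t}
  && \text{since } \log t \ge 2 \text{ for } t \ge 4.
\end{align*}
```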
Thus, our algorithm for finding all of the cooccurring pairs of a randomly
chosen monotone DNF consists of estimating Pij for each of the n(n−1)/2 pairs
(i, j) so that all of our estimates are, with probability at least 1 − δ, within an
additive error of $\alpha^4/(16t)$ of their true values. The reader familiar with Boolean
Fourier analysis will readily recognize that $P_{12}$ is just $\hat{f}(110^{n-2})$ and that in general
all of the $P_{ij}$ are simply second-order Fourier coefficients. Therefore, by the
standard Hoeffding bound, a uniform random sample of size $O(t^2 \ln(n^2/\delta)/\alpha^8)$
is sufficient to estimate all of the $P_{ij}$’s to the specified tolerance with overall
probability at least 1 − δ. We thus have:
Theorem 1. For n sufficiently large and any δ > 0, with probability at least
$1 - \delta_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}} - \delta$ over the choice of f from $M_n^{t,k}$ and the choice
of random examples, our algorithm runs in $O(n^2 t^2 \log(n/\delta))$ time and identifies
exactly those pairs $(v_i, v_j)$ which cooccur in some term of f.
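A minimal sketch of the pair-finding step follows, assuming labeled examples are given as (x, y) pairs with x a 0/1 sequence. The function names and the explicit `threshold` parameter are illustrative choices; the analysis above suggests a threshold of about α⁴/(16t), but the sketch leaves it to the caller.

```python
def estimate_P(samples, i, j):
    """Estimate P_ij = p11 - p10 - p01 + p00 from labeled examples (x, f(x))."""
    count = {(a, b): [0, 0] for a in (0, 1) for b in (0, 1)}  # [positives, total]
    for x, y in samples:
        cell = count[(x[i], x[j])]
        cell[0] += y
        cell[1] += 1
    # empirical conditional probabilities; an empty cell is treated as 0
    p = {ab: (pos / tot if tot else 0.0) for ab, (pos, tot) in count.items()}
    return p[(1, 1)] - p[(1, 0)] - p[(0, 1)] + p[(0, 0)]

def cooccurring_pairs(samples, n, threshold):
    """Report the pairs (i, j), i < j, whose estimated P_ij exceeds the threshold."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if estimate_P(samples, i, j) > threshold}
```

This version rescans the sample once per pair, which mirrors the n² factor in the running time of Theorem 1; a single pass that updates all pair counters at once would be an equivalent alternative.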
3.4 Forming a Hypothesis from Pairs of Cooccurring Variables
Here we show how to construct an accurate DNF hypothesis for a random f
drawn from $M_n^{t,k}$.
Identifying all k-cliques. By Theorem 1, with high probability we have com-
plete information about which pairs of variables (vi, vj) cooccur in some term
of f. We thus may consider the graph G with vertices v1, . . . , vn and edges for
precisely those pairs of variables (vi, vj) which cooccur in some term of f. This
graph is a union of t randomly chosen k-cliques from {v1, . . . , vn} which corre-
spond to the t terms in f. We will show how to efficiently identify (with high
probability over the choice of f and random examples of f) all of the k-cliques
in G. Once these k-cliques have been identified, as we show later it is easy to
construct an accurate DNF hypothesis for f.
The following lemma (whose proof is omitted) shows that with high prob-
ability over the choice of f, each pair (vi, vj) cooccurs in at most a constant
number of terms:
Lemma 7. Fix $1 \le i < j \le n$. For any C ≥ 0 and all sufficiently large n, we
have $\Pr_{f \in M_n^{t,k}}$[some pair of variables $(v_i, v_j)$ cooccur in more than C terms of
f] $\le \delta_C := (tk^2/n^2)^C$.
By Lemma 7 we know that, for any given pair $(v_i, v_j)$ of variables, with
probability at least $1 - \delta_C$ there are at most Ck other variables $v_\ell$ such that
$(v_i, v_j, v_\ell)$ all cooccur in some term of f. Suppose that we can efficiently (with
high probability) identify the set $S_{ij}$ of all such variables $v_\ell$. Then we can perform
an exhaustive search over all (k − 2)-element subsets S′ of $S_{ij}$ in at most
$\binom{Ck}{k} \le (eC)^k = n^{O(\log C)}$ time, and can identify exactly those sets S′ such that
S′ ∪ $\{v_i, v_j\}$ is a clique of size k in G. Repeating this over all pairs of variables
$(v_i, v_j)$, we can with high probability identify all k-cliques in G.
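The exhaustive search over subsets of $S_{ij}$ might look like the sketch below. It assumes the cooccurrence graph is given as a set of two-element frozensets and that the candidate set for the pair has already been computed; both the representation and the function name are assumptions for illustration.

```python
from itertools import combinations

def k_cliques_through(edges, i, j, S_ij, k):
    """Return all k-cliques of the form S' | {i, j}, S' a (k-2)-subset of S_ij.

    `edges` is a set of frozenset({u, v}) pairs describing the graph G."""
    found = []
    for S in combinations(sorted(S_ij), k - 2):
        clique = set(S) | {i, j}
        # a clique requires every pair of its vertices to be an edge of G
        if all(frozenset(e) in edges for e in combinations(clique, 2)):
            found.append(frozenset(clique))
    return found
```

Because the candidate set has at most Ck elements with high probability, the number of subsets examined per pair stays within the bound quoted above.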
Thus, to identify all k-cliques in G it remains only to show that for every pair
of variables $v_i$ and $v_j$, we can determine the set $S_{ij}$ of those variables $v_\ell$ that
cooccur in at least one term with both $v_i$ and $v_j$. Assume that f is such that all
pairs of variables cooccur in at most C terms, and let T be a set of variables of
cardinality at most C having the following properties:
– In the projection $f_{T \leftarrow 0}$ of f in which all of the variables of T are fixed to 0,
$v_i$ and $v_j$ do not cooccur in any term; and
– For every set T′ ⊂ T such that |T′| = |T| − 1, $v_i$ and $v_j$ do cooccur in $f_{T' \leftarrow 0}$.
Then T is clearly a subset of $S_{ij}$. Furthermore, if we can identify all such sets T,
then their union will be $S_{ij}$. There are only $O(n^C)$ possible sets to consider, so
our problem now reduces to the following: given a set T of at most C variables,
determine whether $v_i$ and $v_j$ cooccur in $f_{T \leftarrow 0}$.
The proof of Lemma 6 shows that f is well-behaved with probability at least
$1 - \delta_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}}$ over the choice of f. Furthermore, if f is well-behaved
then it is easy to see that for every |T| ≤ C, $f_{T \leftarrow 0}$ is also well-behaved, since $f_{T \leftarrow 0}$
is just f with $O(\sqrt{t})$ terms removed (by Lemma 2). That is, removing terms from
f can only make it more likely that the remaining terms are uniquely satisfied,
does not change the bound on the probability of a pair of remaining terms being
satisfied, and can only decrease the bound on the number of remaining terms
in which a remaining variable can appear. Furthermore, Lemma 5 holds for any
monotone DNF f. Therefore, if f is well-behaved then the proof of Lemma 6
also shows that for every |T| ≤ C, the $P_{ij}$’s of $f_{T \leftarrow 0}$ can be used to identify
the cooccurring pairs of variables within $f_{T \leftarrow 0}$. It remains to show that we can
efficiently simulate a uniform example oracle for $f_{T \leftarrow 0}$ so that these $P_{ij}$’s can be
accurately estimated.
In fact, for a given set T, we can simulate a uniform example oracle for $f_{T \leftarrow 0}$
by filtering the examples from the uniform oracle for f so that only examples
setting the variables in T to 0 are accepted. Since |T| ≤ C, the filter accepts with
constant probability at least $1/2^C$. A Chernoff argument shows that if all $P_{ij}$’s
are estimated using a single sample of size $2^{C+10} t^2 \ln(2(C+2)n^C/\delta)/\alpha^8$ (filtered
appropriately when needed) then all of the estimates will have the desired
accuracy with probability at least 1 − δ. This gives us the following:
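The filtering idea can be sketched as a simple rejection loop. Here `oracle` is a hypothetical zero-argument function returning one uniform labeled example (x, f(x)); the paper instead filters one large pre-drawn sample, so this per-call version is only an illustration of the same idea.

```python
def filtered_examples(oracle, T, m):
    """Collect m examples of f in which every variable in T is set to 0.

    Each draw passes the filter with probability 2^(-|T|), so roughly
    m * 2^|T| calls to the oracle are needed on average."""
    out = []
    while len(out) < m:
        x, y = oracle()
        if all(x[v] == 0 for v in T):
            out.append((x, y))
    return out
```

On the accepted examples the remaining variables are still uniform, and every term of f that contains a variable of T is falsified, so the labels are exactly those of the projection with T fixed to 0.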
Theorem 2. For n sufficiently large, any δ > 0, and any fixed C ≥ 2, with
probability at least $1 - \delta_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}} - \delta_C - \delta$ over the random draw
of f from $M_n^{t,k}$ and the choice of random examples, all of the k-cliques of the
graph G can be identified in time $n^{O(C)} t^3 k^2 \log(n/\delta)$.
The main learning result for monotone DNF. We now have a list $T'_1, \ldots, T'_N$
(with $N = O(n^C)$) of length-k monotone terms which contains all t true terms
$T_1, \ldots, T_t$ of f. Now note that the target function f is simply an OR of some
subset of these N “variables” $T'_1, \ldots, T'_N$, so the standard elimination algorithm
for learning disjunctions (see e.g. Chapter 1 of [17]) can be used to PAC learn
the target function.
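A minimal sketch of the elimination step, treating each candidate term as a single "variable" of a disjunction: any candidate satisfied by a negative example cannot be a term of f and is discarded, and the hypothesis is the OR of what remains. The names and the set-of-indices representation of terms are assumptions for illustration.

```python
def eliminate(candidates, samples):
    """Keep the candidate terms that no negative example satisfies."""
    kept = list(candidates)
    for x, y in samples:
        if y == 0:
            kept = [T for T in kept if not all(x[v] for v in T)]
    return kept

def hypothesis(kept, x):
    """The learned hypothesis: the OR of the surviving candidate terms."""
    return int(any(all(x[v] for v in T) for T in kept))
```

True terms of f are never removed, since a negative example cannot satisfy them; the hypothesis therefore never errs on positive examples of f and can only err by keeping a spurious candidate.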

Call the above described entire learning algorithm A. In summary, we have
proved the following:
Theorem 3. Fix γ, α > 0 and C ≥ 2. Let (k, t) be a monotone α-interesting
pair. For any ε > 0, δ > 0, and $t = O(n^{2-\gamma})$, algorithm A will with probability at
least $1 - \delta_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}} - \delta_C - \delta$ (over the random choice of DNF from
$M_n^{t,k}$ and the randomness of the example oracle) produce a hypothesis h that
ε-approximates the target with respect to the uniform distribution. Algorithm A
runs in time polynomial in n, log(1/δ), and 1/ε.
4 Nonmonotone DNF
Because of space constraints here we only sketch our results and techniques for
nonmonotone DNF.
As with $M_n^{t,k}$, we are interested in pairs (k, t) for which $\mathrm{E}_{f \in D_n^{t,k}}[\Pr[f]]$ is
between α and 1 − α. For α > 0, we now say that the pair (k, t) is α-interesting
if $\alpha \le \mathrm{E}_{f \in D_n^{t,k}}[\Pr[f]] \le 1 - \alpha$. For any fixed $x \in \{0,1\}^n$ we have
$\Pr_{f \in D_n^{t,k}}[f(x) = 0] = (1 - \frac{1}{2^k})^t$, and thus by linearity of expectation we have
$\mathrm{E}_{f \in D_n^{t,k}}[\Pr[f]] = 1 - (1 - \frac{1}{2^k})^t$; this formula will be useful later.
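The closed form can be compared against a direct simulation of the nonmonotone model. In the sketch below the function names are hypothetical, and the simulation fixes x to the all-zeros point, which is harmless because the formula holds for any fixed x.

```python
import random

def expected_pr_nonmonotone(t, k):
    """E_f[Pr[f]] = 1 - (1 - 2^-k)^t for f drawn from D_n^{t,k}."""
    return 1 - (1 - 2.0 ** -k) ** t

def simulate(n, t, k, trials, rng):
    """Fraction of random draws of f that accept the fixed point x = 0^n."""
    x = [0] * n
    hits = 0
    for _ in range(trials):
        sat = False
        for _ in range(t):
            variables = rng.sample(range(n), k)
            # a term accepts x iff each literal's random sign agrees with x
            if all(rng.randrange(2) == x[v] for v in variables):
                sat = True
        hits += sat
    return hits / trials
```

For instance, t = 16 and k = 4 give an expectation of about 0.64, so (4, 16) is α-interesting for any α up to roughly 0.35.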
Throughout the rest of Section 4 we assume that α > 0 is fixed and (k, t) is
an α-interesting pair where $t = O(n^{3/2-\gamma})$ for some γ > 0.
4.1 Identifying (Most Pairs of) Cooccurring Variables
Recall that in Section 3.3 we partitioned the terms of our monotone DNF into
four disjoint groups depending on what subset of $\{v_1, v_2\}$ was present in each
term. In the nonmonotone case, we will partition the terms of f into nine disjoint
groups depending on whether each of $v_1$, $v_2$ is unnegated, negated, or absent:
$f = g_{**} \vee (v_1 g_{1*}) \vee (\bar{v}_1 g_{0*}) \vee (v_2 g_{*1}) \vee (v_1 v_2 g_{11}) \vee (\bar{v}_1 v_2 g_{01}) \vee (\bar{v}_2 g_{*0}) \vee (v_1 \bar{v}_2 g_{10}) \vee (\bar{v}_1 \bar{v}_2 g_{00})$
Thus $g_{**}$ contains those terms of f which contain neither $v_1$ nor $v_2$ in any form;
$g_{0*}$ contains the terms of f which contain $\bar{v}_1$ but not $v_2$ in any form (with $\bar{v}_1$
removed from each term); $g_{*1}$ contains the terms of f which contain $v_2$ but not
$v_1$ in any form (with $v_2$ removed from each term); and so on. Each $g_{\cdot,\cdot}$ is thus a
DNF (possibly empty) over literals formed from $v_3, \ldots, v_n$.
For all four possible values of $(a, b) \in \{0,1\}^2$, we can empirically estimate
$p_{ab} := \Pr_x[g_{**} \vee g_{a*} \vee g_{*b} \vee g_{ab}] = \Pr_x[f(x) = 1 \mid x_1 = a, x_2 = b]$.
It is easy to see that $\Pr[g_{11}]$ is either 0 or else at least $4/2^k$ depending on whether
g11 is empty or not. Ideally we would like to be able to accurately estimate each
of Pr[g00], Pr[g01], Pr[g10] and Pr[g11]; if we could do this then we would have
complete information about which pairs of literals involving variables v1 and
v2 cooccur in terms of f. Unfortunately, the probabilities Pr[g00], Pr[g01], Pr[g10]
and Pr[g11] cannot in general be obtained from p00, p01, p10 and p11. However, we

will show that we can efficiently obtain some partial information which enables
us to learn to fairly high accuracy.
As before, our approach is to accurately estimate the quantity P = p11 −
p10 − p01 + p00. We have the following lemmas:
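Lemma 8. If all four of $g_{00}$, $g_{01}$, $g_{10}$ and $g_{11}$ are empty, then P equals
$\Pr[g_{1*} \wedge g_{*0} \wedge (\text{no other } g_{\cdot,\cdot})] + \Pr[g_{0*} \wedge g_{*1} \wedge (\text{no other } g_{\cdot,\cdot})]$
$- \Pr[g_{1*} \wedge g_{*1} \wedge (\text{no other } g_{\cdot,\cdot})] - \Pr[g_{0*} \wedge g_{*0} \wedge (\text{no other } g_{\cdot,\cdot})]$. (1)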
Lemma 9. If exactly one of g00, g01, g10 and g11 is nonempty (say g11), then
P equals (1) plus Pr[g11 ∧ g1∗ ∧ g∗0 ∧ ( no other g·,·)]+ Pr[g11 ∧ g0∗ ∧ g∗1 ∧
( no other g·,·)]− Pr[g11∧g1∗∧g∗1∧( no other g·,·)]− Pr[g11∧g0∗∧g∗0∧( no other
g·,·)]+ Pr[g11 ∧ g0∗ ∧ ( no other g·,·)]+ Pr[g11 ∧ g∗0 ∧ ( no other g·,·)]+ Pr[g11 ∧
( no other g·,·)].
Using the above two lemmas we can show that the value of P is a good indi-
cator for distinguishing between all four of g00, g01, g10, g11 being empty versus
exactly one of them being nonempty:
Lemma 10. For n sufficiently large and t ≥ 4, with probability at least
$1 - \delta'_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}}$ over a random draw of f from $D_n^{t,k}$, we have that: (i)
if $v_1$ and $v_2$ do not cooccur in any term of f then $P \le \frac{\alpha^2}{8t}$; (ii) if $v_1$ and $v_2$ do
cooccur in some term of f and exactly one of $g_{00}$, $g_{01}$, $g_{10}$ and $g_{11}$ is nonempty,
then $P \ge \frac{3\alpha^2}{16t}$.
The proof (omitted, as are the exact definitions of the various δ’s) uses analogues
of the properties established in Section 3.2 for nonmonotone DNF.
It is clear that an analogue of Lemma 10 holds for any pair of variables vi, vj
in place of v1, v2. Thus, for each pair of variables vi, vj, if we decide whether vi
and vj cooccur (negated or otherwise) in any term on the basis of whether Pij is
large or small, we will err only if two or more of g00, g01, g10, g11 are nonempty.
We now show that for $f \in D_n^{t,k}$, with very high probability there are not too
many pairs of variables (vi, vj) which cooccur (with any sign pattern) in at least
two terms of f. Note that this immediately bounds the number of pairs (vi, vj)
which have two or more of the corresponding g00, g01, g10, g11 nonempty.
Lemma 11. Let d > 0 and $f \in D_n^{t,k}$. The probability that more than
$(d + 1)t^2 k^4/n^2$ pairs of variables $(v_i, v_j)$ each cooccur in two or more terms of f is at
most $\exp(-d^2 t^3 k^4/n^4)$.
Taking $d = n^2/(t^{5/4} k^4)$ in the above lemma (note that d > 1 for n sufficiently
large since $t^{5/4} = O(n^{15/8})$), we have $(d + 1)t^2 k^4/n^2 \le 2t^{3/4}$ and the failure
probability is at most $\delta_{\mathrm{cooccur}} := \exp(-\sqrt{t}/k^4)$. The results of this section
(together with a standard analysis of error in estimating each $P_{ij}$) thus yield:
Theorem 4. For n sufficiently large and for any δ > 0, with probability at least
$1 - \delta_{\mathrm{cooccur}} - \delta'_{\mathrm{usat}} - \delta_{\mathrm{shared}} - \delta_{\mathrm{many}} - \delta$ over the random draw of f from $D_n^{t,k}$
and the choice of random examples, the above algorithm runs in $O(n^2 t^2 \log(n/\delta))$
time and outputs a list of pairs of variables $(v_i, v_j)$ such that: (i) if $(v_i, v_j)$ is in
the list then $v_i$ and $v_j$ cooccur in some term of f; and (ii) at most $N_0 = 2t^{3/4}$
pairs of variables $(v_i, v_j)$ which do cooccur in f are not on the list.
4.2 Reconstructing an Accurate DNF Hypothesis
It remains to construct a good hypothesis for the target DNF from a list of
pairwise cooccurrence relationships as provided by Theorem 4. As in the mono-
tone case, we consider the graph G with vertices v1, . . . , vn and edges for pre-
cisely those pairs of variables (vi, vj) which cooccur (with any sign pattern) in
some term of f. As before this graph is a union of t randomly chosen k-cliques
S1, . . . , St which correspond to the t terms in f, and as before we would like
to find all k-cliques in G. However, there are two differences now: the first is
that instead of having the true graph G, we instead have access only to a graph
G′ which is formed from G by deleting some set of at most N0 = 2t^{3/4} edges.
The second difference is that the final hypothesis must take the signs of literals
in each term into account. To handle these two differences, we use a different
reconstruction procedure than we used for monotone DNF in Section 3.4; this
reconstruction procedure only works for t = O(n^{3/2−γ}) where γ > 0.
Because of space constraints we do not describe the reconstruction procedure
here. All in all the following is our main learning result for nonmonotone DNF:
Theorem 5. Fix γ, α > 0 and C ≥ 2. Let (k, t) be a monotone α-interesting
pair. For f randomly chosen from D_n^{t,k}, with probability at least
1 − δcooccur − δ′usat − δshared − δmany − δclique − 1/n^{Ω(C)}
the above algorithm runs in Õ(n^2 t^2 + n^{O(log C)}) time and outputs a hypothesis h
whose error rate relative to f under the uniform distribution is at most 1/Ω(t^{1/4}).
It can be verified from the definitions of the various δ’s that for any t = ω(1) as
a function of n, the failure probability is o(1) and the accuracy is 1 − o(1).
5 Future Work
We can currently only learn random DNFs with o(n^{3/2}) terms (o(n^2) terms for
monotone DNF); can stronger results be obtained which hold for all polynomial-size
DNF? A natural approach here for learning n^c-term DNF might be to first
try to identify all c′-tuples of variables which cooccur in a term, where c′ is some
constant larger than c.
Acknowledgement
Avrim Blum suggested to one of us (JCJ) the basic strategy that learning mono-
tone DNF with respect to uniform might be reducible to finding the cooccurring
pairs of variables in the target function. This material is based upon work sup-
ported by the National Science Foundation under Grant No. CCR-0209064 (JCJ)
and CCF-0347282 (RAS).

References
1. H. Aizenstein and L. Pitt. On the learnability of disjunctive normal form formulas.
Machine Learning, 19:183–208, 1995.
2. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
3. D. Angluin and M. Kharitonov. When won’t membership queries help? J. Comput.
Syst. Sci., 50:336–355, 1995.
4. A. Blum. Learning a function of r relevant variables (open problem). In Proc. 16th
COLT, pages 731–733, 2003.
5. A. Blum. Machine learning: a tour through some favorite results, directions,
and open problems. FOCS 2003 tutorial slides, available at
http://www-2.cs.cmu.edu/~avrim/Talks/FOCS03/tutorial.ppt, 2003.
6. A. Blum, C. Burch, and J. Langford. On learning monotone boolean functions. In
Proc. 39th FOCS, pages 408–415, 1998.
7. A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly
learning DNF and characterizing statistical query learning using Fourier analysis.
In Proc. 26th STOC, pages 253–262, 1994.
8. B. Bollobas. Combinatorics: Set Systems, Hypergraphs, Families of Vectors and
Combinatorial Probability. Cambridge University Press, 1986.
9. M. Golea, M. Marchand, and T. Hancock. On learning μ-perceptron networks on
the uniform distribution. Neural Networks, 9:67–82, 1994.
10. T. Hancock. Learning kμ decision trees on the uniform distribution. In Proc. Sixth
COLT, pages 352–360, 1993.
11. J. Jackson. An efficient membership-query algorithm for learning DNF with respect
to the uniform distribution. J. Comput. Syst. Sci., 55:414–440, 1997.
12. J. Jackson, A. Klivans, and R. Servedio. Learnability beyond AC^0. In Proc. 34th
STOC, 2002.
13. J. Jackson and R. Servedio. Learning random log-depth decision trees under the
uniform distribution. In Proc. 16th COLT, pages 610–624, 2003.
14. J. Jackson and C. Tamon. Fourier analysis in machine learning. ICML/COLT
1997 tutorial slides, available at http://learningtheory.org/resources.html, 1997.
15. M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the
ACM, 45(6):983–1006, 1998.
16. M. Kearns, M. Li, L. Pitt, and L. Valiant. Recent results on Boolean concept
learning. In Proc. Fourth Int. Workshop on Mach. Learning, pages 337–352, 1987.
17. M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory.
MIT Press, Cambridge, MA, 1994.
18. A. Klivans, R. O’Donnell, and R. Servedio. Learning intersections and thresholds
of halfspaces. In Proc. 43rd FOCS, pages 177–186, 2002.
19. L. Kucera, A. Marchetti-Spaccamela, and M. Protassi. On learning monotone DNF
formulae under uniform distributions. Inform. and Comput., 110:84–95, 1994.
20. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics
1989, pages 148–188. London Mathematical Society Lecture Notes, 1989.
21. R. Servedio. On learning monotone DNF under product distributions. In Proc.
14th COLT, pages 473–489, 2001.
22. L. Valiant. A theory of the learnable. CACM, 27(11):1134–1142, 1984.
23. K. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial
time. In Proc. Third COLT, pages 314–326, 1990.

Derandomized Constructions
of k-Wise (Almost) Independent Permutations
Eyal Kaplan (1), Moni Naor (2), and Omer Reingold (2)
(1) Tel-Aviv University
kaplaney@post.tau.ac.il
(2) Department of Computer Science and Applied Math, Weizmann Institute of Science
{moni.naor,omer.reingold}@weizmann.ac.il
Abstract. Constructions of k-wise almost independent permutations have been
receiving a growing amount of attention in recent years. However, unlike the
case of k-wise independent functions, the size of previously constructed families
of such permutations is far from optimal. This paper gives a new method for
reducing the size of families given by previous constructions. Our method relies
on pseudorandom generators for space-bounded computations. In fact, all we
need is a generator that produces “pseudorandom walks” on undirected graphs
with a consistent labelling. One such generator is implied by Reingold’s
log-space algorithm for undirected connectivity [21, 22]. We obtain families of
k-wise almost independent permutations with an optimal description length, up
to a constant factor. More precisely, if the distance from uniform for any k-tuple
should be at most δ, then the size of the description of a permutation in the family
is O(kn + log(1/δ)).
1 Introduction
In explicit constructions of pseudorandom objects, we are interested in simulating a
large random object using a succinct one and would like to capture some essential
properties of the former. A natural way to phrase such a requirement is via limited
access. Suppose the object that we are interested in simulating is a random function
f : {0, 1}^n → {0, 1}^n and we want to come up with a small family of functions G
that simulates it. The k-wise independence requirement in this case is that a function g
chosen at random from G be completely indistinguishable from a function f chosen at
random from the set of all functions, for any process that receives the value of either f
or g at any k points. We can also relax the requirement and talk about almost k-wise
independence by requiring that the advantage of a distinguisher be limited by some δ.
Families of functions that are k-wise independent (or almost independent) were
constructed and applied extensively in the computer science literature (see [1, 14]).
Incumbent of the Judith Kleeman Professorial Chair. Research supported in part by a grant
from the Israel Science Foundation.
Incumbent of the Walter and Elise Haas Career Development Chair. Research supported by
US-Israel Binational Science Foundation Grant 2002246.

Suppose now that the object we are interested in constructing is a permutation, i.e.
a 1-1 function g : {0, 1}^n → {0, 1}^n, which is indistinguishable from a random
permutation for a process that examines at most k points (a variant also allows examining
the inverse). In other words, we are interested in families of permutations such that,
restricted to k inputs, their output is identical (or statistically close, up to distance δ) to
that of a random permutation. For k = 2 the set of linear permutations (ax + b where
a ≠ 0) over GF[2^n] constitutes such a family. Similarly, there is an algebraic trick
when k = 3 (Schulman, private communication). For k > 3 no explicit (non-trivial)
construction is known for k-wise exactly independent permutations.
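The k = 2 case can be checked exhaustively for a tiny field. The sketch below is an illustration only: it assumes n = 3 and represents GF[2^3] with the irreducible polynomial x^3 + x + 1 (both are choices made here, not taken from the text), and the helper names are hypothetical. It verifies that, for every pair of distinct inputs, the maps x ↦ ax + b with a ≠ 0 hit every ordered pair of distinct outputs equally often.

```python
from collections import Counter
from itertools import product

N_BITS = 3          # assumption for the example: work over GF(2^3)
MODULUS = 0b1011    # x^3 + x + 1, irreducible over GF(2) (choice made for this sketch)


def gf_mul(a, b, n_bits=N_BITS, modulus=MODULUS):
    """Multiply two field elements (carry-less product reduced modulo MODULUS)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> n_bits:      # degree reached n_bits: reduce
            a ^= modulus
    return result


def linear_family(n_bits=N_BITS):
    """All pairs (a, b) with a != 0; each describes the permutation x -> a*x + b."""
    size = 1 << n_bits
    return [(a, b) for a in range(1, size) for b in range(size)]


def apply_linear(perm, x):
    a, b = perm
    return gf_mul(a, x) ^ b   # addition in GF(2^n) is XOR


def is_pairwise_independent(n_bits=N_BITS):
    """Check that (f(x1), f(x2)) is uniform over ordered pairs of distinct outputs."""
    size = 1 << n_bits
    family = linear_family(n_bits)
    expected = {(y1, y2) for y1, y2 in product(range(size), repeat=2) if y1 != y2}
    for x1, x2 in product(range(size), repeat=2):
        if x1 == x2:
            continue
        counts = Counter((apply_linear(p, x1), apply_linear(p, x2)) for p in family)
        if set(counts) != expected or len(set(counts.values())) != 1:
            return False
    return True


if __name__ == "__main__":
    print(is_pairwise_independent())
```

Here the family has 7 · 8 = 56 members and there are 8 · 7 = 56 ordered pairs of distinct outputs, so each output pair is produced by exactly one member.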
Once we settle on k-wise almost independent permutations, with error parameter δ,
we can hope for permutations with description length O(kn + log(1/δ)) (see footnote 1); this is
what a random (non-explicit) construction gives (see Section 3.2). There are a number
of proposals in the literature for constructing k-wise almost independent permutations
(see Section 4), but the description length they obtain is in general significantly higher
than this asymptotically optimal value. This paper obtains the first construction of
k-wise almost independent permutations with description length O(kn + log(1/δ)), for
every value of k.
Motivation: Given the simplicity of the question, and given how fundamental k-wise
independent functions are, we feel that it is well motivated in its own right. Indeed, k-
wise independent permutations have been receiving a growing amount of attention with
various motivations and applications in mind (e.g. [5]). One motivation for this study,
is the relation between k-wise independent permutations and block ciphers [7, 15].
In block-ciphers, modelled by pseudorandom permutations, the distinguisher is not
limited by the number of calls to the permutations but rather by its computational power.
Still, the two notions are related (see [7, 10, 15]).
Our Technique and Main Results: We give a method for “derandomizing” essentially
all previous constructions of k-wise almost independent permutations. It is most ef-
fective, and easiest to describe for permutation families obtained by composition of
simpler permutations. As most previous constructions fall into this category, this is a
rather general method. In particular, based on any one of a few previous constructions,
we obtain k-wise almost independent permutations with optimal description length, up
to a constant factor.
Consider a family of permutations F, with rather small description length s. We
denote by F^t the family of permutations obtained by composing any t permutations
f1, f2, . . . , ft in F. Now assume that F^t is a family of k-wise almost independent
permutations. The description length of F^t is t·s, as we need to describe t independent
permutations from F. We will argue that such constructions can be derandomized in
the sense that it is sufficient to consider a subset of the t-tuples of F functions. This will
naturally reduce the overall description length.
Our first idea uses generators that fool bounded space computations for the task
of choosing the subset of F^t. Pseudorandomness for space-bounded computation has
been a very productive area, see [16, 17].
Such pseudorandomness has been used before in the context of combinatorial con-
structions where space is not an explicit issue by Indyk [8] and by Sivakumar [26].
Footnote 1. The lower bound of kn trivially follows as in the case of functions (simply since the output of
a random permutation on k fixed inputs has entropy close to kn). If for no other reason, log(1/δ)
bits are needed to reduce roundoff errors. This lower bound also follows for more significant
reasons, unless k-wise exactly independent permutations can be constructed.

The key observation is that if the composition of permutations, chosen according
to the generator, is not k-wise almost independent, then there is a distinguisher using
only space kn, which contradicts the security of the generator.
Given an “ideal” generator for space-bounded computations with optimal parameters,
we could expect the method above to give k-wise almost independent permutations
with description length O(nk + log(1/δ) + s + log t). Based on previous constructions
this implies description length O(nk + log(1/δ)) as desired. However, applying this
derandomization method with currently known generators (which are not optimal) implies
description length (nk + log(1/δ)) times poly-logarithmic factors.
This leads us to our second idea: to obtain families with description length
O(nk + log(1/δ)) we revise the above method to use a more restricted derandomization tool: we
use pseudorandom generators for walks on undirected labelled graphs, that is, walks
that are indistinguishable from a random walk on any ‘consistently labelled graph’,
provided the walk is sufficiently long. Such generators with sufficiently good parameters are
implied by the proof of Reingold [21] that undirected connectivity is in logspace, and were made
explicit by Reingold, Trevisan and Vadhan [22].
Adaptive vs. Static Distinguishers: A natural issue is whether the distinguisher, trying
to guess whether the permutation it has is random or from the family G, has to decide
on its k queries ahead of time (statically) or adaptively, as a function of the responses
it receives. Here we shall consider the static case for at least two reasons:
(i) static indistinguishability up to distance δ · 2^{−nk} implies adaptive indistinguishability
up to distance δ. (ii) A result of Maurer and Pietrzak [12] shows that composing two
independently chosen k-wise almost independent permutations in the static case yields
k-wise almost independent permutations with adaptive queries of similar parameters.
Related Work: There are several lines of constructions that are of particular relevance to
our work. We describe them in more detail in Section 4. The information is summarized
in Table 1.
Table 1. Summary of Results and Previous Work on k-wise δ-dependent Permutations

Family                                  Description Length               Range of Queries
Feistel (Luby-Rackoff), see footnote 2  nk + O(n)                        k < 2^{n/4 − O(1)}, δ = k^2/2^{n/2}
                                        O(nk · log(δ/δ0))                k < 2^{n/4 − O(1)}, any δ, δ0 = k^2/2^{n/2}
Simple 3-Bit Permutations [4, 6, 7]     O(n^2 k (nk + lg(1/δ)) lg(n))    k ≤ 2^n − 2
Thorp Shuffle [13, 15, 23]              O(n^45 k log(1/δ))               k ≤ 2^n
Non-Explicit Constructions:
  Probabilistic (Thm. 3)                O(nk + log(1/δ))                 k ≤ 2^n
  Sample space existence (Thm. 4)       O(nk)                            k ≤ 2^n

This Work (Theorem 10)                  O(nk + log(1/δ))                 k ≤ 2^n

Footnote 2. The first row is based on 4 rounds with the first and last being pair-wise independent [10, 15].
The second row is obtained by the composition theorem (Theorem 5). Analysis of related
constructions [12, 15, 19, 20] allows k that approaches k = 2^{n/2}, but does not go beyond.

Organization
In Section 2 we provide notation and some basic information regarding random walks
and the spectral gap of graphs. In Section 3 we define k-wise δ-dependent permutations,
argue the (non-constructive) existence of small families of such permutations and study
the composition of such permutations. In Section 4 we discuss some known families of
permutations. Section 5 describes our general construction of a permutation family, and
proves our main result. In Section 6 we describe possible extensions for future research.
2 Preliminaries and Notation
– Let Pn be the set of all permutations over {0, 1}^n. We will use N = 2^n.
– Let x and y be two bit strings of equal length; then x ⊕ y denotes their bit-by-bit
exclusive-or.
– For any f, g ∈ Pn denote by f ◦ g their composition (i.e., (f ◦ g)(x) = f(g(x))).
– For a set Ω, denote by D(Ω) the set of distributions over Ω. Denote by UΩ the
uniform distribution on the elements of Ω.
– Denote by [Nk] the set of all k-tuples of distinct n-bit strings.
2.1 Random Walks
A random walk on a graph starting at a vertex v is a sequence of vertices u0, u1, . . .
where u0 = v and for i > 0 the vertex ui is obtained by selecting an edge (ui−1, ui)
uniformly from the edges leaving ui−1. Regular, connected, undirected graphs, with
self-loops, have the property that a random walk on the graph (starting at an arbitrary
vertex) converges to the uniform distribution on the vertices. The rate of convergence is
governed by the second largest (in absolute value) eigenvalue of the graph. Below we
formalize these notions.
Definition 1 (Spectral Gap). Let G = (V, E) be a connected, d-regular undirected
graph on n vertices. The normalized adjacency matrix of G is its adjacency matrix
divided by d. Denote this matrix by M ∈ Mn(R). Denote by 1 = λ1 ≥ λ2 ≥ . . . ≥ λn
its eigenvalues. We denote by λ(G) the second largest eigenvalue in absolute value, namely
λ(G) := max{|λ2|, |λn|}. The spectral gap of G is defined by gap(G) := 1 − λ(G).
Definition 2 (Mixing Time). Let G = (V, E) be a connected, regular, undirected
graph with self-loops, on n vertices. Let M ∈ Mn(R) be the normalized adjacency
matrix of G. A random walk on this graph is an ergodic Markov chain, whose transition
matrix is M. Its stationary distribution π is the uniform distribution on the vertices.
For x ∈ V, define the mixing time of the walk starting from x by
τx(ε) = min{ n | ‖M^n 1x − π‖ ≤ ε }, where 1x is the distribution concentrated on x. The mixing
time of the walk is defined by τ(ε) = max_{x∈V} τx(ε).
We have the following claims, relating the mixing time of a walk with the spectral
gap of the graph.

Claim 1 [24] Let G = (V, E), M, π be as in Definition 2. Let ε > 0. Let λ be the
second largest eigenvalue of G. Then
(1/2) · (λ/(1 − λ)) · ln(1/(2ε)) ≤ τ(ε) ≤ (1/(1 − λ)) · ln(|V|/ε).
Usually, such a claim is used to bound the mixing time. However, we will be using
constructions with a proven mixing time. The construction itself may also provide a
bound on the spectral gap. In case it does not, we will be able to use Claim 1 to bound
the gap of the graph from below. A simple calculation using Claim 1 shows that
gap(G) = Ω(ln(1/(2ε))/τ(ε)).
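The relation between the spectral gap and the mixing time can be observed on a very small graph. The sketch below is illustrative only: it assumes the norm in Definition 2 is total variation distance (the text leaves the norm unspecified here), uses an 8-cycle with one self-loop per vertex as an example graph chosen for this sketch, and relies on the known closed form (1 + 2cos(2πj/m))/3 for the eigenvalues of that graph's normalized adjacency matrix. All function names are hypothetical.

```python
import math


def cycle_with_self_loops(m):
    """Neighbour lists of a 3-regular graph: an m-cycle plus one self-loop per vertex."""
    return [[(v - 1) % m, v, (v + 1) % m] for v in range(m)]


def step(dist, neighbours):
    """One step of the random walk: multiply the distribution by the transition matrix."""
    degree = len(neighbours[0])
    new = [0.0] * len(dist)
    for v, p in enumerate(dist):
        for u in neighbours[v]:
            new[u] += p / degree
    return new


def tv_to_uniform(dist):
    """Total variation distance to uniform (ASSUMED to be the norm of Definition 2)."""
    u = 1.0 / len(dist)
    return 0.5 * sum(abs(p - u) for p in dist)


def mixing_time(neighbours, eps):
    """tau(eps): worst case over start vertices of the first time the distance is <= eps."""
    m = len(neighbours)
    worst = 0
    for x in range(m):
        dist = [0.0] * m
        dist[x] = 1.0
        t = 0
        while tv_to_uniform(dist) > eps:
            dist = step(dist, neighbours)
            t += 1
        worst = max(worst, t)
    return worst


def second_eigenvalue_cycle(m):
    """lambda(G) for the graph above, from the closed-form spectrum of the cycle."""
    return max(abs((1 + 2 * math.cos(2 * math.pi * j / m)) / 3) for j in range(1, m))


def claim1_bounds(lam, eps, num_vertices):
    """Lower and upper bounds on tau(eps) as stated in Claim 1."""
    lower = 0.5 * lam / (1 - lam) * math.log(1 / (2 * eps))
    upper = math.log(num_vertices / eps) / (1 - lam)
    return lower, upper


if __name__ == "__main__":
    graph = cycle_with_self_loops(8)
    tau = mixing_time(graph, 0.25)
    lam = second_eigenvalue_cycle(8)
    print(tau, claim1_bounds(lam, 0.25, 8))
```

On this example the measured mixing time falls between the two bounds of Claim 1, which is all the sketch is meant to show.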
Another useful claim is the following.
Claim 2 [7] For ℓ ≥ 1, τ(2^{−ℓ−1}) ≤ ℓ · τ(1/4).
3 The Existence of k-Wise δ-Dependent Permutations
In this section we define k-wise δ-dependent permutations, discuss their existence, and
show that the distance parameter δ is reduced by the composition of such permutations.
For simplicity of presentation this paper concentrates on permutations over bit strings
(rather than considering more general domains).
3.1 Definitions
The output of a k-wise almost independent permutation on any k inputs is δ-close to
random, where “closeness” is measured by statistical variation distance between distri-
butions.
Definition 3. Let n, k ∈ N, and let F ⊆ Pn be a family of permutations. Let δ ≥ 0. The
family F is k-wise δ-dependent if for every k-tuple of distinct elements (x1, . . . , xk) ∈
[Nk], the distribution (f(x1), f(x2), . . . , f(xk)), for f ∈ F chosen uniformly at ran-
dom is δ-close to U[Nk]. We refer to a k-wise 0-dependent family of permutations as
k-wise independent.
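Definition 3 can be evaluated exactly for toy families by brute force. The following sketch is an illustration under stated assumptions: permutations are encoded as tuples over 0..N−1, statistical variation distance is taken to be half the L1 distance, and the function name `kwise_distance` is hypothetical. It returns the smallest δ for which a given family is k-wise δ-dependent.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def kwise_distance(family, domain_size, k):
    """Smallest delta such that `family` is k-wise delta-dependent (exact arithmetic).

    family: list of permutations, each a tuple f with f[x] the image of x.
    """
    tuples = list(permutations(range(domain_size), k))  # all k-tuples of distinct points
    uniform = Fraction(1, len(tuples))
    worst = Fraction(0)
    for xs in tuples:
        counts = Counter(tuple(f[x] for x in xs) for f in family)
        l1 = sum(abs(Fraction(counts.get(ys, 0), len(family)) - uniform) for ys in tuples)
        worst = max(worst, l1 / 2)   # variation distance = half the L1 distance
    return worst


N = 4  # domain {0,1}^2 encoded as the integers 0..3

# Toy family: x -> x XOR c. Uniform on single points, but far from uniform on pairs.
xor_family = [tuple(x ^ c for x in range(N)) for c in range(N)]

# The full symmetric group is k-wise independent for every k.
all_perms = list(permutations(range(N)))

if __name__ == "__main__":
    print(kwise_distance(xor_family, N, 1))
    print(kwise_distance(xor_family, N, 2))
    print(kwise_distance(all_perms, N, 2))
```

The XOR family is 1-wise independent yet only 2-wise (2/3)-dependent, since x1 ⊕ x2 is preserved by every member; this is the kind of gap that the constructions surveyed later are designed to close.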
We are mostly interested in explicit families of permutations, meaning that both
sampling uniformly at random from F and evaluating permutations from F can be
done in polynomial time. The parameters we will be interested in analyzing are the
following:
Description Length. The description length of a family F is the number of random
bits, used by the algorithm for sampling permutations uniformly at random from F.
Alternatively, we may consider the size of F, which is the number of permutations
in F, denoted |F|. In all of our applications, the description length of a family F
equals O(log(|F|)). By allowing F to be a multi-set we can assume without loss
of generality that the description length is exactly log(|F|).
Time Complexity. The time complexity of a family F is the running time of the algo-
rithm for evaluating permutations from F.

Our main goal would be to reduce the description length of constructions of k-wise
δ-dependent permutations. Still, we would take care to keep time complexity as efficient
as possible. See additional discussion in Section 6.
3.2 Non-explicit Constructions
We note the following non-explicit families of permutations. Our goal would be to ob-
tain families of size which is as close as possible to that obtained by the non-explicit
arguments below. The following theorem follows by a standard application of the prob-
abilistic method.
Theorem 3 (Non-explicit Construction). Let n ∈ N. For all 1 ≤ k ≤ 2^n and δ > 0,
there exists a k-wise δ-dependent family F ⊆ Pn, of size |F| = 2^{(2+ε)nk}/δ^2, for some
constant 0 < ε < 1.
The existence (even with a non-explicit construction) of an exactly k-wise independent family of
permutations is unknown. Nonetheless, using an approach due to Koller and Megiddo [9],
we can show that there exists a distribution on permutations which is k-wise independent
and has a small support. This follows by observing that such a requirement on
the permutations defines a set of (N choose k)^2 constraints, in the terminology of Koller and
Megiddo [9].
Theorem 4 (Existence of k-Wise Independent Distribution). There exists a distribution
on permutations which is k-wise independent (i.e. for any k points the value of the
chosen permutation is uniform in [Nk]) and the size of the support of the distribution is
at most 2^{2nk}.
3.3 Composition of Permutations
Some of the permutation families we will inspect require several compositions to get
a distribution close to uniform. In fact, as we argue below, composing permutations is
an effective method for reducing the distance parameter δ. This motivates the following
definition.
Definition 4. Let F ⊆ Pn. The tth power of F, denoted by F^t ⊆ Pn, is
{ f1 ◦ . . . ◦ ft | f1, . . . , ft ∈ F }.
Remark 1. Let F ⊆ Pn. Observe that |F^t| = |F|^t, and that the time complexity of F^t
is essentially t times the time complexity of F.
We now state a composition theorem, that is proven in the full version.
Theorem 5. 1. Let F be a k-wise δ-dependent family. Then, F^2 is a k-wise 2δ^2-dependent
family.
2. Let F1 and F2 be k-wise δ1-dependent and δ2-dependent families respectively.
Then, F1 ◦ F2 is a k-wise 2δ1δ2-dependent family.
Theorem 5 has the following corollary.
Corollary 1. Let F be a k-wise δ-dependent family. Then, for any ℓ ∈ N, F^ℓ is a
k-wise ((1/2)(2δ)^ℓ)-dependent family.
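The bound of Corollary 1 makes it easy to estimate how many compositions are needed to drive the distance below a target. The snippet below is a small numerical sketch of that calculation; the function names and the example values δ0 = 0.1 and target 10^−6 are chosen here purely for illustration.

```python
def composed_distance(delta, ell):
    """Distance bound (1/2) * (2*delta)**ell for the ell-th power, as in Corollary 1."""
    return 0.5 * (2.0 * delta) ** ell


def compositions_needed(delta0, target):
    """Smallest ell with (1/2) * (2*delta0)**ell <= target.

    Requires delta0 < 1/2; otherwise the bound does not shrink with ell.
    """
    if not 0 < delta0 < 0.5:
        raise ValueError("delta0 must lie strictly between 0 and 1/2")
    if target <= 0:
        raise ValueError("target must be positive")
    ell = 1
    while composed_distance(delta0, ell) > target:
        ell += 1
    return ell


if __name__ == "__main__":
    print(composed_distance(0.1, 2))        # bound for F^2, i.e. 2 * delta^2
    print(compositions_needed(0.1, 1e-6))   # number of compositions for target 1e-6
```

Since the number of compositions grows like log(1/δ)/log(1/(2δ0)), the description length of the composed family grows by the same factor, which is the overhead the derandomization of Section 5 is meant to avoid.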

4 Short Survey of Explicit Constructions
We now survey some known constructions yielding k-wise almost independent permu-
tations with reasonable parameters.
4.1 Feistel Based Constructions
In their famed work Luby and Rackoff [10] showed how to construct pseudorandom
permutations from pseudorandom functions. The construction is based on the Feistel
Permutation: For any function f : {0, 1}^{n/2} → {0, 1}^{n/2} the Feistel Permutation is
defined by (L, R) ↦ (R, L ⊕ f(R)), where |L| = |R| = n/2. The construction uses a
composition of several such permutations.
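A single Feistel round is a permutation regardless of the round function, because it can be undone by recomputing f(R). The sketch below illustrates this for n-bit strings packed into integers; the function names and the example round function are assumptions made for the illustration, not part of the constructions cited above.

```python
def feistel(x, f, half_bits):
    """One Feistel round on a 2*half_bits-bit value: (L, R) -> (R, L xor f(R))."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    return (right << half_bits) | (left ^ (f(right) & mask))


def feistel_inverse(y, f, half_bits):
    """Undo one round: from (R, L xor f(R)) recover (L, R)."""
    mask = (1 << half_bits) - 1
    right, mixed = y >> half_bits, y & mask
    left = mixed ^ (f(right) & mask)
    return (left << half_bits) | right


def feistel_network(x, round_functions, half_bits):
    """Compose several rounds, one per round function, applied in order."""
    for f in round_functions:
        x = feistel(x, f, half_bits)
    return x


if __name__ == "__main__":
    # Arbitrary (non-invertible) example round function on 2-bit halves.
    f = lambda r: 3 * r + 1
    images = [feistel(x, f, 2) for x in range(16)]
    print(sorted(images) == list(range(16)))   # the round is a bijection on 4-bit strings
```

The round function f need not be invertible; only its output is XORed into the other half, which is why arbitrary (pseudo)random functions can be plugged in.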
There are Feistel constructions of k-wise δ-dependent permutations, for k up to 2^{n/2}
(see Naor and Reingold [15], Patarin [18–20], and Maurer and Pietrzak [11]).
The Feistel approach yields succinct k-wise δ-dependent permutations as
long as k is not too large and δ is not too small, and is probably the method of choice for
this range. To reduce the dependency δ one can use Theorem 5 and obtain a permutation
with description size O(kn log(1/δ)) (or even O(k log(1/δ)) for certain ranges of k and
δ). The Feistel method is not known to be useful for k larger than 2^{n/2}.
4.2 Card Shuffling
Consider a process for shuffling cards. Each round (shuffle) in such a procedure selects
a permutation on the locations of the N cards of a deck (selected from some collection
of basic permutations). Starting at an arbitrary ordering of the cards, we are interested in
how long it takes to get the deck into a (close to) random position. In other words, a
card shuffling defines a Markov chain on the state of the deck, and the goal is to bound
its mixing time.
An “old” proposal by the second author [23, page 17], [15] for the construction
of k-wise almost independent permutations was to utilize “oblivious” card shuffling
procedures. Briefly, a shuffle is oblivious if the location of a card, after each round, is
easy to trace and is determined by only a few random bits, say O(1). An excellent
example is the Thorp Shuffle [27], defined below.
Definition 5 (Thorp Shuffle). Let n ∈ N. Given a deck of 2^n cards, one stage of
the shuffle is determined by 2^{n−1} bits that we will view as a random function
g : {0, 1}^{n−1} → {0, 1}. View the location of each card as an n-bit string according to
the lexical order. The card at location (σ, x), where σ ∈ {0, 1} and x ∈ {0, 1}^{n−1}, moves to
location (x, σ ⊕ g(x)).
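One stage of the shuffle in Definition 5 can be written in a few lines. The sketch below is an illustration with hypothetical function names; it encodes a location as an integer whose most significant bit is σ and whose remaining n−1 bits are x, and represents g as a list of 2^{n−1} bits.

```python
import random


def thorp_stage(n_bits, g):
    """One Thorp stage as a list: perm[old_location] = new_location.

    g is a list of 2**(n_bits - 1) bits, read as the function g of Definition 5.
    """
    low_mask = (1 << (n_bits - 1)) - 1
    perm = []
    for loc in range(1 << n_bits):
        sigma = loc >> (n_bits - 1)      # leading bit of the location
        x = loc & low_mask               # remaining n-1 bits
        perm.append((x << 1) | (sigma ^ g[x]))   # new location (x, sigma xor g(x))
    return perm


def random_stage(n_bits, rng):
    """A stage driven by a uniformly random g (rng is a random.Random instance)."""
    g = [rng.randrange(2) for _ in range(1 << (n_bits - 1))]
    return thorp_stage(n_bits, g)


def track_card(start, stages):
    """Follow a single card through a sequence of stages."""
    loc = start
    for perm in stages:
        loc = perm[loc]
    return loc


if __name__ == "__main__":
    rng = random.Random(0)
    stages = [random_stage(4, rng) for _ in range(10)]
    print(track_card(5, stages))
```

Tracking one card through a stage reads a single bit g(x), which is the obliviousness property used in the text: the final positions of k cards depend on at most k values of g per stage.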
Theorem 6. [13] The mixing time for the Thorp shuffle is O(n^{44}).
It can be seen that the Thorp Shuffle is oblivious. When using such a card shuffle to
construct a k-wise almost independent permutation, all we care for is the final locations
of k cards. If we replace the random function g by a k-wise independent function, then
this will not change the distribution on the k final locations.
It is also possible to utilize non-oblivious shuffles such as the riffle shuffle using
range-summable k-wise independent shuffles; details will be given in the full version.

4.3 Simple 3-Bit Permutations
A very intriguing method for generating k-wise δ-dependent permutation was explored
first by Gowers [6] and then (with some variation) by Hoory et al. [7] and Brodsky and
Hoory [4]. The idea is to pick a few bit positions (actually 3) and choose a permutation
on the resulting small cube. In the Hoory et al. variation only a single bit is changed as
a function of the other bits. This is reminiscent of a shuffle, but there is no chance that
the shuffle will converge in reasonable time (as we invest too few bits in each shuffle).
This approach is treated more formally in Section 5.4 and it works very well with the
derandomized walk approach, since the underlying set of permutations considered is the
simplest and hence the description length of simple permutations is quite short. What
this line of research shows is that a composition of not too many simple permutations
yields a k-wise almost independent permutation.
5 Main Results
In this section we give a method for reducing the description length of previous con-
structions of k-wise δ-dependent permutations.
5.1 Permutation Families and Random Walks on Graphs
We associate with a family F of permutations a graph as follows:
Definition 6 (Companion Graph). Let F ⊆ Pn be a family of permutations. For k ∈
N, define the companion (multi-)graph of F, GF,k = (V, E) by:
– V = [Nk].
– E = { (i, σ(i)) | i ∈ [Nk], σ ∈ F }.
– Each edge (i, σ(i)) ∈ E is labelled by σ.
All of our families of permutations of Section 4 are closed under taking an inverse of a
permutation and always include the identity permutation. We summarize the properties
of the companion graph in the following proposition:
Proposition 1. Let F ⊆ Pn be a family of permutations, which is closed under taking
an inverse and contains the identity permutation. Let k ∈ N. Then, the companion
graph GF,k, is an undirected, |F|-regular, consistently labelled graph, with self-loops.
Remark 2. A consistently labelled graph has the property that for any vertex w, any two
incoming edges to w are labelled with different labels.
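The properties listed in Proposition 1 can be verified mechanically on a toy family. The sketch below is illustrative only: the five-element family over a 4-point domain and all function names are assumptions made for the example. It builds the companion graph for k = 2 and checks regularity, consistent labelling, symmetry of the edge multiset, and the presence of self-loops.

```python
from collections import Counter
from itertools import permutations


def companion_graph(family, domain_size, k):
    """Build G_{F,k}: every vertex v has, for each index s, an edge v -> family[s](v) labelled s."""
    vertices = list(permutations(range(domain_size), k))   # k-tuples of distinct points
    edges = {v: [(s, tuple(f[x] for x in v)) for s, f in enumerate(family)]
             for v in vertices}
    return vertices, edges


def is_consistently_labelled(vertices, edges, family_size):
    """No vertex receives two incoming edges carrying the same label."""
    seen = set()
    for v in vertices:
        for s, w in edges[v]:
            if (w, s) in seen:
                return False
            seen.add((w, s))
    return len(seen) == len(vertices) * family_size


def is_undirected(vertices, edges):
    """The number of edges v -> w equals the number of edges w -> v for every pair."""
    counts = Counter((v, w) for v in vertices for _, w in edges[v])
    return all(counts[(v, w)] == counts[(w, v)] for (v, w) in counts)


N = 4
# Toy family, closed under inverses and containing the identity (index 0).
FAMILY = [
    tuple(range(N)),                     # identity
    tuple(x ^ 1 for x in range(N)),      # an involution
    tuple(x ^ 2 for x in range(N)),      # an involution
    tuple((x + 1) % N for x in range(N)),
    tuple((x + 3) % N for x in range(N)),   # inverse of the previous map
]

if __name__ == "__main__":
    vertices, edges = companion_graph(FAMILY, N, 2)
    print(len(vertices),
          is_consistently_labelled(vertices, edges, len(FAMILY)),
          is_undirected(vertices, edges))
```

Consistent labelling holds for any family of permutations, since each label acts as a bijection on k-tuples; closure under inverses is what makes the graph undirected, and the identity supplies the self-loops.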
Assume that F is such that F^t is a family of k-wise δ-dependent permutations.
This means that the distribution over the vertices we reach, by taking a walk of length
t, starting at any vertex of GF,k, is δ-close to uniform. Simply put, traversing an edge is the
same as applying the permutation that is the label of this edge. Taking t random edges
is the same as applying the composition of t randomly chosen permutations.
Derandomizing the family F^t will mean that instead of composing independently
chosen permutations from F, we will select the permutations with some dependencies.
Equivalently, we will take a pseudorandom walk instead of a random one. We will
use a pseudorandom generator to generate this walk. Such a generator was given by
Reingold, Trevisan and Vadhan [21, 22].

5.2 Pseudorandom Walk Generators
We now discuss generators for pseudorandom walks on graphs. We will refer to graphs
with the following parameters:
Definition 7 (Parameters for a Graph). Let G = (V, E) be a connected, undirected
d-regular graph, on m vertices. Then G is an (m, d, λ)-graph if λ(G) ≤ λ.
Definition 8 (Pseudorandom Walk). Let G = (V, E) be a d-regular graph where
each node labels its adjacent edges in [d]. Let A be a distribution over
a = a1, a2, . . . , aℓ ∈ [d]^ℓ.
We say that A is δ-pseudorandom for G, if for every u ∈ V, the distribution on the
possible end vertices of a walk in G, which starts from u and follows the edge labels in
a, is δ-close to uniform when a is distributed according to A.
Note that if G is an (m, d, λ) graph, λ is sufficiently smaller than 1 and the walk is
sufficiently long, then we expect a (truly) random walk to end in vertex that is close to
being uniformly distributed no matter where the walk started. We are now ready to state
the parameters of the best known construction of pseudorandom walk generators.
Theorem 7. [21, 22] (Pseudorandom Walk Generator) For every m, d ∈ N and δ, ε > 0,
there is a pseudorandom walk generator PRG = PRG_{m,d,δ,ε} : {0, 1}^r → [d]^ℓ,
with the following parameters:
– Seed length r = O(log(md/(εδ))).
– Walk length ℓ = poly(1/ε) · log(md/δ).
– Computable in space O(log(md/(εδ))) and time poly(1/ε, log(md/δ)).
such that for every consistently labelled (m, d, 1 − ε)-graph G, the output of PRG(Ur)
is δ-pseudorandom for G, where Ur is the uniform distribution on {0, 1}^r.
5.3 Derandomizing Compositions of Permutation Families
By Proposition 1, the companion graph GF,k is regular and consistently labelled. As
argued above, if F^t (for t not too large) is k-wise almost independent then the random
walk on GF,k has small mixing time. By Claim 1, this implies a bound on the eigenvalue
gap ε. Therefore, Theorem 7 gives a way to generate a pseudorandom walk for
GF,k with PRG_{m,d,δ,ε}, where m = |[Nk]| and d = |F|. The idea is to use each seed
s ∈ {0, 1}^r of the pseudorandom generator PRG to define a new permutation σs,
which is the composition of permutations from F. Theorem 8 formalizes this approach
(though we assume for simplicity here the bound on the eigenvalue gap, rather than
deducing it by Claim 1 as in the discussion above).
An advantage we have, which affects the parameters of our results (especially the
time complexity), is that the efficiency of the generator of [22] depends on the spectral
gap of the initial graph. Since we are using families of permutations for which the
companion graph is known to be of good expansion, we manage to achieve non-trivial
parameters in the families we construct.
The following theorem describes the family of permutations we achieve.

Theorem 8. Let F ⊆ Pn be a family of size d = |F|, and GF,k be its companion
graph. Suppose that gap(GF,k) = ε, where ε may be a function of n and k. Then, there
exists F′ ⊆ Pn, such that F′ is a k-wise δ-dependent family, with the following properties.
– The description length of F′ is O(nk + log(d/(εδ))).
– If the time complexity of any permutation in F is bounded by ξ(n, k), then the time
complexity of F′ is poly(1/ε, n, k, log(d/δ)) · ξ(n, k).
Proof. We apply Theorem 7 on the companion graph of $F$. Following Proposition 1 we
know that $G_{F,k}$ fits the requirements there. Let $r = O(\log(2^{nk} \cdot d/\delta))$ and
$\ell = \mathrm{poly}(1/\epsilon) \cdot \log(2^{nk} \cdot d/\delta)$ be as in Theorem 7. For a string $s \in \{0,1\}^r$,
we define $\sigma_s \in P_n$ as follows.
Let $w = \mathrm{PRG}_{2^{nk},d,\delta,\epsilon}(s) \in [d]^\ell$. Then $w = \tau_1, \tau_2, \dots, \tau_\ell$, where for all $1 \le i \le \ell$,
$\tau_i \in F$. We let $\sigma_s = \tau_\ell \circ \dots \circ \tau_1$.
Next define a permutation family $F' \subseteq P_n$ by
$$F' = \{ \sigma_s \mid s \in \{0,1\}^r \}.$$
We now show that $F'$ is a k-wise δ-dependent family. By Theorem 7, for any starting
vertex $u \in V(G_{F,k})$, the pseudorandom walk starting at $u$ and following the labels of
$\mathrm{PRG}_{2^{nk},d,\delta,\epsilon}(U_r)$ reaches a vertex that is δ-close to uniform. Observe that picking a
random $\sigma_s \in F'$ and applying it to any value $A \in V(G_{F,k}) = [N]^k$ is exactly the same as taking
a random walk on $G_{F,k}$ according to the output of $\mathrm{PRG}_{2^{nk},d,\delta,\epsilon}$ with a random seed $s$.
Therefore, the output of a uniform $\sigma_s$ on any such $A \in [N]^k$ is δ-close to uniform. We
can conclude that $F'$ is k-wise δ-dependent.
The description length of $F'$ is $r = O(\log(2^{nk} d/\delta)) = O(nk + \log(d/\delta))$. The
time complexity of $F'$ depends on the time complexity of running the generator, and of
running permutations from $F$. This can be bounded by $\mathrm{poly}(1/\epsilon, n, k, \log(d/\delta)) \cdot \xi(n,k)$.
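The composition step of the proof can be illustrated with a minimal Python sketch. It only shows how a walk $w = \tau_1, \dots, \tau_\ell$ is turned into $\sigma = \tau_\ell \circ \dots \circ \tau_1$; the generator of Theorem 7 is not implemented, so the walk labels are simply passed in by the caller. The names `compose_walk`, `family` and `walk` are illustrative assumptions, not notation from the paper.

```python
def compose_walk(family, walk):
    """Return sigma = tau_l o ... o tau_1 for the walk (tau_1, ..., tau_l).

    family: list of permutations, each a callable on the domain (assumed).
    walk:   sequence of indices into `family`; in the construction these
            labels would be the output of the generator on a seed s.
    """
    def sigma(x):
        for label in walk:        # tau_1 is applied first, tau_l last
            x = family[label](x)
        return x
    return sigma

# Toy family on Z_5 (hypothetical example, not one of the paper's families).
family = [lambda x: (x + 1) % 5, lambda x: (2 * x) % 5]
sigma = compose_walk(family, [0, 1, 1])
images = [sigma(x) for x in range(5)]
```

Evaluating `sigma` costs one evaluation of a member of `family` per walk step, which is the source of the factor $\xi(n,k)$ multiplied by the walk length in the time bound.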
5.4 Particular Derandomization – 3-Bit Permutations
We now provide a formal definition and analysis of simple 3-bit permutations, mentioned
in Section 4.3.
Definition 9 (Simple Permutations). [7] Let $w \le n$. For $i \in [n]$, $J = \{j_1, \dots, j_w\} \subseteq [n] \setminus \{i\}$,
and a function $f : \{0,1\}^w \to \{0,1\}$, denote by $\sigma_{i,J,f}$ the permutation
$$\sigma_{i,J,f}(x_1, \dots, x_n) \doteq (x_1, \dots, x_{i-1},\ x_i \oplus f(x_{j_1}, \dots, x_{j_w}),\ x_{i+1}, \dots, x_n).$$
The simple permutations family $F_w$ is defined by
$$F_w = \{ \sigma_{i,J,f} \mid i \in [n],\ J \subseteq [n] \setminus \{i\},\ |J| = w,\ f : \{0,1\}^w \to \{0,1\} \}.$$
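As an illustration of Definition 9, the following Python sketch builds a single simple permutation and enumerates its action on the cube for $n = 3$, $w = 2$. Indices are 0-based here, unlike the 1-based indices of the definition, and the choice of $f$ as the AND of the two control bits is an arbitrary example. The helper name `simple_permutation` is an assumption made for this sketch.

```python
from itertools import product

def simple_permutation(i, J, f):
    """Return sigma_{i,J,f}: XOR bit i of x with f applied to the bits indexed by J.

    i is a 0-based position, J a tuple of 0-based positions not containing i,
    and f maps a tuple of len(J) bits to a single bit in {0, 1}.
    """
    if i in J:
        raise ValueError("i must not belong to J")

    def sigma(x):
        y = list(x)
        y[i] ^= f(tuple(x[j] for j in J))
        return tuple(y)

    return sigma

# Example with w = 2: f is the AND of the two control bits.
n = 3
sigma = simple_permutation(0, (1, 2), lambda bits: bits[0] & bits[1])
cube = list(product((0, 1), repeat=n))
image = [sigma(x) for x in cube]
```

Since bit $i$ never feeds into $f$, applying $\sigma_{i,J,f}$ twice restores the input, which is one way to see that it is a permutation.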
Theorem 9. [4] For all $2 \le k \le 2^n - 2$, $F_2^t$ is k-wise δ-dependent for
$t = O(n^2 k (nk + \log(1/\delta)))$. Furthermore, $\mathrm{gap}(G_{F_2,k}) = \Omega(1/(n^2 k))$.
Evaluating $\sigma_{i,J,f} \in F_2$ takes $O(n)$ time. The size of $F_2$ is $O(n^3)$, and the size
of $F_2^t$ is $O(n^3)^t = n^{O(n^2 k (nk + \log(1/\delta)))}$. It follows that $F_2^t$ has description length
$O(n^2 k (nk + \log(1/\delta)) \log n)$ and time complexity $O(n^3 k (nk + \log(1/\delta)))$.
Combining Theorems 9 and 8 we obtain the main result of this paper:
Theorem 10. There exists $F \subseteq P_n$ such that $F$ is k-wise δ-dependent. $F$ has description
length $O(nk + \log(1/\delta))$ and time complexity $\mathrm{poly}(n, k)$.

Proof. Apply Theorem 8 with $d = |F_2| = O(n^3)$, $\epsilon = \Omega(1/(n^2 k))$ and $\xi(n,k) = O(n)$.
6 Discussion and Further Work
One issue that we have not resolved is coming up with k-wise permutations where the
time complexity of evaluating at a given point is small. Note that even for k-wise inde-
pendent functions this issue is not completely resolved; the basic construction based on
polynomials is expensive and some lower and upper bounds are given by Siegel [25]. In
general the transformation we propose via the random walks does not preserve the time
complexity of evaluating permutations in F: when the composed permutation is stored
in its succinct form we do not know how to evaluate it at a given point without first
‘decompressing’ and representing explicitly as a composition of permutations in F.
In order to maintain the complexity of evaluation, we need a generator with ‘random
access’ properties. In such a generator, evaluating the ith bit of its output, does not entail
computing all bits up to i. The Nisan generator [16] has some aspects of this nature, but
is sub-optimal. Also note an advantage of the general space bounded pseudorandomness
over the random walk pseudorandomness: the former preserves the number of rounds
whereas the latter may increase them.
One interesting question is whether it is possible to 'scale down' a construction for
k-wise dependent permutations on $n$ bits to one on $n' \le n$ bits. This is most relevant
in the computational pseudorandomness setting: is it possible to obtain from a block-cipher
on large blocks (e.g. 128 bits) a block-cipher on small blocks (e.g. 40 bits), while
maintaining the security of the former?
An issue that we did not explore so far is constructing k-wise independent permutations
over domains that are not powers of 2. This problem was raised by Bar-Noy and
S. Naor, inspired by the needs of [2]. Black and Rogaway [3] suggested several methods
for obtaining a pseudo-random permutation on domain size $M$ that is not a power
of 2 from a pseudo-random permutation on domain size $N$ that is a power of 2 (say
$N = 2^{\lceil \log M \rceil}$). The most relevant method for our purposes is the 'cycle walking' one,
where the idea is to construct a permutation on $[M]$ by iterating a permutation
on $[N]$ until it lands in the first $M$ values of $[N]$. It is possible to use this method in the
k-wise independent case, but it introduces an additional error. The derandomized walk
method is applicable for reducing the error with no significant penalty, since Theorem
8 does not require the domain size to be a power of 2. For details, see the full version.
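The cycle-walking idea described above can be sketched in a few lines of Python. This is only an illustration of the iteration itself, with a permutation given explicitly as a list over `range(N)` (0-based, so the "first M values" are `0, ..., M-1`); the function name `cycle_walk` and the sample permutation are assumptions of this sketch and say nothing about the error analysis.

```python
def cycle_walk(perm, M):
    """Turn a permutation of range(N) into a permutation of range(M), M <= N.

    perm[x] is the image of x. Starting from x < M, perm is applied
    repeatedly until the value falls below M again. The loop terminates
    because the cycle of x contains x itself, which is below M.
    """
    def restricted(x):
        if not 0 <= x < M:
            raise ValueError("x out of range")
        y = perm[x]
        while y >= M:
            y = perm[y]
        return y
    return restricted

# Example: a permutation of range(8) restricted to range(5).
perm = [3, 6, 0, 5, 1, 7, 2, 4]
small = cycle_walk(perm, 5)
images = [small(x) for x in range(5)]
```

The number of iterations depends on how long the walk stays outside the first $M$ values, which is why $N$ is usually chosen as the smallest power of 2 that is at least $M$.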
Finally, there is no strong reason to suppose that explicit small families (or distributions)
of exact k-wise independent permutations do not exist, and Theorem 4 hints at
their existence. So how about finding them?
Acknowledgments
The authors are grateful to Ronen Shaltiel for his invaluable collaboration during the
early stages of this work and thank Danny Harnik and Adam Smith for useful com-
ments.
References
1. N. Alon and J. Spencer, The Probabilistic Method, Wiley, 1992.

2. A. Bar-Noy, J. Naor and B. Schieber, Pushing Dependent Data in Clients-Providers-Servers
Systems, Wireless Networks 9(5), 2003, pp. 421-430.
3. J. Black and P. Rogaway, Ciphers with Arbitrary Finite Domains. Topics in Cryptology -
CT-RSA 2002, Lecture Notes in Computer Science, vol. 2271, Springer, 2002, 114–130.
4. A. Brodsky and S. Hoory, Simple Permutations Mix Even Better, Arxiv math.CO/0411098.
5. Y. Z. Ding, D. Harnik, A. Rosen and R. Shaltiel, Constant-Round Oblivious Transfer in
the Bounded Storage Model, First Theory of Cryptography Conference, TCC 2004, LNCS
vol. 2951, Springer, pp. 446-472
6. W. T. Gowers, An almost m-wise independent random permutation of the cube, Combina-
torics, Probability and Computing, vol. 5(2), 1996, pp. 119-130.
7. S. Hoory, A. Magen, S. Myers, and C. Rackoff, Simple permutations mix well, The 31st
International Colloquium on Automata, Languages and Programming (ICALP), 2004.
8. P. Indyk, Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream
Computation, FOCS 2000, pp. 189–197.
9. D. Koller and N. Megiddo, Constructing small sample spaces satisfying given constraints,
SIAM J. Discrete Math. , vol. 7(2), 1994, pp. 260-274.
10. M. Luby and C. Rackoff, How to construct pseudorandom permutations and pseudorandom
functions, SIAM J. Comput., vol. 17, 1988, pp. 373-386.
11. U. M. Maurer and K. Pietrzak, The Security of Many-Round Luby-Rackoff Pseudo-Random
Permutations, EUROCRYPT 2003, LNCS vol. 2656, Springer, pp. 544–561.
12. U. M. Maurer and K. Pietrzak, Composition of Random Systems: When Two Weak Make One
Strong, First Theory of Cryptography Conference, TCC 2004, LNCS vol. 2951, Springer,
pp. 410–427.
13. B. Morris, On the mixing time for the Thorp shuffle, STOC 2005, pp. 403–412.
14. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, New
York (NY), 1995.
15. M. Naor and O. Reingold, On the Construction of Pseudorandom Permutations: Luby-
Rackoff Revisited, J. of Cryptology, vol. 12(1), Springer-Verlag, 1999, pp. 29-66.
16. N. Nisan, Pseudorandom generators for space-bounded computation, Combinatorica 12(4),
1992, 449–461.
17. N. Nisan and D. Zuckerman, Randomness is Linear in Space, J. Comput. Syst. Sci. vol. 52(1),
1996, pp. 43–52.
18. J. Patarin, Improved security bounds for pseudorandom permutations, 4th ACM Conference
on Computer and Communications Security, 1997, pp. 142–150.
19. J. Patarin, Luby-Rackoff: 7 Rounds Are Enough for $2^{n(1-\epsilon)}$ Security, CRYPTO 2003,
pp. 513-529.
20. J. Patarin, Security of Random Feistel Schemes with 5 or More Rounds, CRYPTO 2004, pp.
106–122.
21. O. Reingold, Undirected ST-Connectivity in Log-Space, STOC 2005, pp. 376-385.
22. O. Reingold, L. Trevisan, S. Vadhan, Pseudorandom Walks in Biregular Graphs and the RL
vs. L Problem, ECCC, TR05-22, February 2005.
23. S. Rudich, Limits on the provable consequences of one-way functions, PhD Thesis, U. C.
Berkeley.
24. A. Sinclair, Improved bounds for mixing rates of Markov chains and multicommodity flow,
Combinatorics, Probability and Computing, vol. 1(4), 1992, pp. 351-370.
25. A. Siegel, On Universal Classes of Extremely Random Constant-Time Hash Functions,
SIAM Journal on Computing 33(3), 2004, pp. 505–543.
26. D. Sivakumar, Algorithmic derandomization via complexity theory, STOC 2002, pp. 619-626
27. E. Thorp, Nonrandom shuffling with applications to the game of Faro, Journal of the Ameri-
can Statistical Association, vol. 68, 1973, pp. 842-847.

Testing Periodicity
Oded Lachish and Ilan Newman
Haifa University, Haifa, 31905 Israel
{loded,ilan}@cs.haifa.ac.il
Abstract. A string $\alpha \in \Sigma^n$ is called p-periodic if for every $i, j \in \{1, \dots, n\}$
such that $i \equiv j \pmod p$, $\alpha_i = \alpha_j$, where $\alpha_i$ is the i-th place of α.
A string $\alpha \in \Sigma^n$ is said to be period(≤ g) if there exists $p \in \{1, \dots, g\}$
such that α is p-periodic.
An ε-property tester for period(≤ g) is a randomized algorithm that for
an input α distinguishes between the case that α is in period(≤ g) and the
case that one needs to change at least an ε-fraction of the letters of α so that
it will become period(≤ g). The complexity of the tester is the number
of letter-queries it makes to the input. We study here the complexity
of ε-testers for period(≤ g) when g varies in the range $1, \dots, n/2$. We
show that there exists a surprising exponential phase transition in the
query complexity around $g = \log n$. That is, for every $\delta > 0$ and for
each g such that $g \ge (\log n)^{1+\delta}$, the number of queries required and
sufficient for testing period(≤ g) is polynomial in g. On the other hand,
for each $g \le \frac{\log n}{4}$, the number of queries required and sufficient for testing
period(≤ g) is only poly-logarithmic in g.
We also prove an exact asymptotic bound for testing general periodicity.
Namely, that 1-sided error, non-adaptive ε-testing of periodicity
(period(≤ n/2)) is $\Theta(\sqrt{n \log n})$ queries.
1 Introduction
Periodicity in strings plays an important role in several branches of CS and
engineering applications. It is used as a measure of 'self similarity' in
many applications regarding string algorithms (e.g. pattern matching), computational
biology, data analysis and planning (e.g. analysis of stock prices, communication
patterns etc.), signal and image processing and others. On the other
hand, sources of very large streams of data are now common inputs for strategy-
planning or trend detection algorithms. Typically, such streams of data are either
too large to store entirely in the computer memory, or so large that even lin-
ear processing time is not feasible. Thus it would be of interest to develop very
fast (sub-linear time) algorithms that test whether a long sequence is periodic
or approximately periodic, and in particular, that test if it has a very short
period. This calls for algorithms in the framework of Combinatorial Property
Testing [1]. In this framework, introduced initially by Rubinfeld and Sudan [2]

This research was supported by THE ISRAEL SCIENCE FOUNDATION (grant
number 55/03).
and formalized by Goldreich et al. [1], one uses a randomized algorithm that
queries the input at very few locations and based on this, decides whether it has
a given property or it is ’far’ from having the property. Indeed related questions
to periodicity have already been investigated [3–5], although the focus here is
somewhat different.
In [6], the authors construct an algorithm that approximates, in a certain
sense, the DFT (Discrete Fourier Transform) of a finite sequence in sub-linear
time. This is quite related but not equivalent to testing how close a sequence is
to being periodic. In [3], the authors study some alternative parametric definitions
of periodicity that intend to 'capture the distance' of a sequence to being
periodic. They mainly relate the different definitions of periodicity. They also
show that there is a tolerant tester for periodicity. That is, they show a simple
algorithm that, given $0 \le \epsilon_1 < \epsilon_2 \le 1$, decides whether a sequence is $\epsilon_1$-close to
periodic or $\epsilon_2$-far from being periodic, using $O(\sqrt{n} \cdot \mathrm{poly}(\log n))$ queries.
There are other works on sequence sketching [4] etc., but none of those seems
to address periodicity testing directly.
The property of being periodic is formalized here in a very general form: Let
Σ be a finite alphabet. A string $\alpha \in \Sigma^n$ is said to be p-periodic for an integer
$p \in [n]$ if for every $i, j \in [n]$ with $i \equiv j \pmod p$, $\alpha_i = \alpha_j$, where $\alpha_i$ is the i-th
character of α. We say that α is in period(≤ g), where $g \in [n/2]$, if it is p-periodic
for some $p \le g$. We say that a string is periodic if it is in period(≤ n/2).
We study here the complexity of ε-testing the property period(≤ g) when
g varies in the range $1, \dots, n/2$. An ε-tester for period(≤ g) is a randomized
algorithm that for an input $\alpha \in \Sigma^n$ distinguishes between the case that α is in
period(≤ g) and the case that one needs to change at least an ε-fraction of the
letters of α so that it will be in period(≤ g). The complexity of the tester is the
number of letter-queries it makes to the input.
We show that there exists a surprising exponential phase transition in the
complexity of testing period(≤ g) around $g = \log n$. That is, for every $\delta > 0$
and for each g such that $g \ge (\log n)^{1+\delta}$, the number of queries required and
sufficient for testing period(≤ g) is polynomial in g. On the other hand, for each
$g \le \frac{\log n}{4}$, the number of queries required and sufficient for testing period(≤ g) is
only poly-logarithmic in g. We also settle the exact complexity of non-adaptive
1-sided error tests for general periodicity (that is, period(≤ n/2)). We show that
the exact complexity in this case is $\Theta(\sqrt{n \log n})$. The upper bound that we prove
is an improvement over the result of [3] and uses a construction of a small random
set $A \subseteq [n]$ of size $\sqrt{n \log n}$ for which the multi-set $A - A = \{a - b \mid a, b \in A\}$
contains at least $\log n$ copies of each member of $[n/2]$. This can be trivially done
with a set A of size $\sqrt{n} \log n$. We improve on the previous bound by giving
up the full independence of the samples, but still retain the property that the
$\log n$ copies of each number in $A - A$ behave as if they were not too
far from independent. A similar problem has occurred in various other situations
(e.g. [7, 8]) and thus could be interesting in its own right.
The rest of the paper is organized as follows. In Section 2 we introduce the
necessary notations and some very basic observations. Section 3 contains an
ε-test for period(≤ g) that uses $\Theta(\sqrt{g \log g})$ queries. In Section 4 we construct an
ε-test for period(≤ g), $g \le \frac{\log n}{4}$, that uses only $\tilde{O}((\log g)^6)$ queries. Sections 5 and
6 contain the corresponding lower bounds, thus showing the claimed phase transition.
Finally, the special case of $g = \Theta(n)$ is treated in Section 7, which contains
a proof that any 1-sided error non-adaptive tester for periodicity (period(≤ n/2))
requires $\Omega(\sqrt{n \log n})$ queries.
2 Preliminaries
In the following Σ is a fixed-size alphabet that contains 0, 1. For a string $\alpha \in \Sigma^n$
and an integer $i \in [n]$ (where $[n] = \{1, \dots, n\}$) we denote by $\alpha_i$ the i-th symbol of α,
that is $\alpha = \alpha_1 \dots \alpha_n$. Given a set $S \subseteq [n]$ such that $S = \{i_1, i_2, \dots, i_m\}$ and
$i_1 < i_2 < \dots < i_m$ we define $\alpha_S = \alpha_{i_1} \alpha_{i_2} \dots \alpha_{i_m}$.
Unless otherwise stated, all strings for which the length is not
specified are in $\Sigma^n$.
For two strings α, β we denote by dist(α, β) the Hamming distance between
α and β. Namely, $\mathrm{dist}(\alpha, \beta) = |\{i \mid \alpha_i \ne \beta_i\}|$. For a property $P \subseteq \Sigma^n$ and a
string $\alpha \in \Sigma^n$, $\mathrm{dist}(\alpha, P) = \min\{\mathrm{dist}(\alpha, \beta) \mid \beta \in P\}$ denotes the distance from
α to P. We say that α is ε-far from P if $\mathrm{dist}(\alpha, P) \ge \epsilon n$, otherwise we say that
α is ε-close to P.
Definition 1. For a string α and a subset $S \subseteq [n]$ we say that $\alpha_S$ is homogeneous
if for all $i, j \in S$, $\alpha_i = \alpha_j$.
Property Testing
The algorithms that we consider here are 'property-testers' [1, 9, 10].
An ε-test is a randomized algorithm that accesses the input string via a
'location-oracle' which it can query: a query specifies one place in
the string, and the answer is the value of the string at the queried location.
The complexity of the algorithm is the number of queries it makes in the worst
case. Such an algorithm is said to be an ε-test for a property $P \subseteq \Sigma^n$ if it
distinguishes with success probability at least 2/3 between the case that the
input string belongs to P and the case that it is ε-far from P.
Periodicity
Definition 2. A string $\alpha \in \Sigma^n$ has period p (denoted p-periodic) if $\alpha_i = \alpha_j$
for every $i, j \in [n]$ such that $i \equiv j \pmod p$.
Note that a string is homogeneous if and only if it is 1-periodic.
Definition 3. A string has the property period(≤ g) if it is p-periodic for some
$p \le g$. We say that it is periodic if it has the property period(≤ n/2).
Definition 4. A witness that a string α is not p-periodic, denoted as p-witness,
is an unordered pair $\{i, j\} \subseteq [n]$ such that $i \equiv j \pmod p$ and $\alpha_i \ne \alpha_j$.
According to the definition of periodicity, a string has a period $p \le n/2$ if and only
if there does not exist a p-witness.
In the same manner a witness for not having period(≤ g) is defined as follows:

Definition 5. A witness that a string is not in period(≤ g) is a set of integers
$Q \subseteq [n]$ such that for every $p \le g$ there are two integers $i, j \in Q$ that form a
p-witness.
Fact 1. A string has period(≤ g) if and only if there does not exist a witness
that the string is not in period(≤ g).
Fact 2. If a string $\alpha \in \Sigma^n$ does not have a period in $(g/2, g]$, where $g \le n/2$, then
it has no period of at most $g/2$. If a string $\alpha \in \Sigma^n$ is ε-far from having a period
in $(g/2, g]$, where $g \le n/2$, then it is ε-far from having the property period(≤ g).
Proof. Observe that if a string $\alpha \in \Sigma^n$ has period $p \le g/2$ then it also has period q
for every q that is a multiple of p. Note also that there must exist such a $q \in (g/2, g]$.

Definition 6. For $\alpha \in \Sigma^n$, $p \in [n]$ and $0 \le i \le p - 1$, let $Z(p, i) = \{j \mid j \equiv i \pmod p\}$.
We call $\alpha_{Z(p,i)}$ the i-th p-section of α.
The following obvious fact relates the distance of a string to p-periodic and
the homogeneity of its p-sections.
Fact 3. For each α, $\mathrm{dist}(\alpha, p\text{-periodic}) = \sum_{i=0}^{p-1} \mathrm{dist}(\alpha_{Z(p,i)}, \text{homogeneous})$.
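Fact 3 gives a direct way to compute the exact distance of a string to p-periodic: make each p-section homogeneous by keeping its most frequent letter. The Python sketch below does this and also takes the minimum over $p \le g$ to obtain the distance to period(≤ g). It reads the whole string, so it is a reference computation rather than a tester; positions are 0-based and the function names are assumptions of this sketch.

```python
from collections import Counter

def dist_to_p_periodic(alpha, p):
    """Hamming distance from alpha to the nearest p-periodic string (Fact 3).

    With 0-based positions the i-th p-section is alpha[i::p]. Each section
    is made homogeneous by keeping its most frequent letter.
    """
    total = 0
    for i in range(min(p, len(alpha))):
        section = alpha[i::p]
        total += len(section) - max(Counter(section).values())
    return total

def dist_to_period_at_most(alpha, g):
    """Distance from alpha to period(<= g): the minimum over p = 1, ..., g."""
    return min(dist_to_p_periodic(alpha, p) for p in range(1, g + 1))
```

For example, `"abababaa"` is at distance 1 from 2-periodic (change the last letter to `b`) and at distance 3 from 1-periodic.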
In further sections we use the following basic ε-test for p-periodic.
Algorithm p-Test
Input: a string $\alpha \in \Sigma^n$, the string length n, a period $p \in [n/2]$ and a
distance parameter $0 < \epsilon < 1$;
1. Select $1/\epsilon$ random unordered pairs (with repetitions) $\{i, j\} \subset [n]$ such that
$i \equiv j \pmod p$.
2. Reject if one of the selected pairs is a p-witness (namely a witness for being
non-p-periodic). Otherwise accept.
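A minimal Python sketch of Algorithm p-Test follows. To keep the pair sampling exactly uniform it lists all congruent pairs up front, which costs time but no queries; this enumeration is a simplification chosen for the sketch and not part of the algorithm as stated. Positions are 0-based, and the function name `p_test` and the `rng` parameter are assumptions.

```python
import math
import random

def p_test(alpha, p, eps, rng=random):
    """One-sided p-Test: return True (accept) unless a sampled pair is a p-witness.

    Requires 1 <= p <= len(alpha) // 2 so that every p-section has at
    least two positions.
    """
    n = len(alpha)
    # All unordered pairs {i, j} with i < j and i ≡ j (mod p).
    pairs = [(i, j) for i in range(n) for j in range(i + p, n, p)]
    for _ in range(math.ceil(1 / eps)):
        i, j = rng.choice(pairs)
        if alpha[i] != alpha[j]:   # two queries per sampled pair
            return False           # found a p-witness: reject
    return True
```

A p-periodic input has no p-witness, so it is accepted on every run; this is the 1-sided error property.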
Proposition 1. Algorithm p-Test is a 1-sided error, non-adaptive ε-test for
p-periodic. Its query complexity is $2/\epsilon$.
Proof. The query complexity is obvious. The test is 1-sided error since it rejects
only if it finds a p-witness. To estimate its error probability let α be such that
$\mathrm{dist}(\alpha, p\text{-periodic}) \ge \epsilon n$.
Assume that p divides n (the general case is essentially the same). With this
assumption $|Z(p, i)| = n/p = m$. For every $0 \le i \le p - 1$ set
$d_i = \mathrm{dist}(\alpha_{Z(p,i)}, \text{homogeneous})$. For fixed i and every $\sigma \in \Sigma$ let $n_\sigma$ be the
number of occurrences of σ in $Z(p, i)$. Assume also that we have renumbered the
letters in Σ so that $n_1 \ge n_2 \ge \dots \ge n_k$. Then $d_i = m - n_1$, as it is easy to see that
the closest homogeneous string is obtained by changing all letters different from
$\sigma_1$ to $\sigma_1$ (exactly $m - n_1$ letters). Hence the number of p-witnesses in $Z(p, i)$,
$W_i$, is $W_i = \frac{1}{2}\left( \left(\sum_j n_j\right)^2 - \sum_j n_j^2 \right)$. It is easy to see that $W_i \ge m \cdot d_i / 2$. Now, according
to Fact 3 we get that
$$\mathrm{dist}(\alpha, p\text{-periodic}) = \sum_{i=0}^{p-1} \mathrm{dist}(\alpha_{Z(p,i)}, \text{homogeneous}) = \sum_{i=0}^{p-1} d_i \le \frac{2}{m} \sum_i W_i \le \frac{2}{m} W,$$
where W is the total number of p-witnesses.
It follows that $W \ge \frac{m}{2} \mathrm{dist}(\alpha, p\text{-periodic}) \ge \frac{\epsilon n^2}{2p}$. Note however that the
total number of unordered pairs i, j such that $j \equiv i \pmod p$ is $\frac{n^2}{2p}$. Thus we
conclude that the probability that a random such pair is a p-witness is at least ε,
and the result follows.
We will need the following proposition in a number of lower bound proofs. Let
$m < n/8$, $S \subset [n]$, $|S| = m$ and let a be an assignment $a : S \to \{0, 1\}$. We define
the following distribution U/S on strings of length n. We set every letter
$\alpha_i = a(i)$ if $i \in S$. For all places $i \notin S$ we choose $\alpha_i$ to be '1' with probability
1/2 and '0' with probability 1/2, independently between different i's. Note that
U/S is just the uniform distribution on all binary strings conditioned on the
event that the projection on S is a. We have,
Proposition 2. Let m, S, a be as above and let G(g) be the event that a string
selected according to U/S is $\frac{1}{16}$-far from having period(≤ g). Then
$\mathrm{Prob}_{U/S}(G(g)) \ge 1 - \frac{1}{n}$.
Proof: We present the proof for $g = n/2$; this immediately implies the result
for every $g \le n/2$ by Fact 2. Let U be the uniform distribution over $\Sigma^n$. Again
according to Fact 2, it is enough to prove that the probability that a string
selected according to U/S is $\frac{1}{16}$-far from having a period in $\left(\frac{n}{4}, \frac{n}{2}\right]$ is at least
$1 - \frac{1}{n}$. We prove that for each $p \in \left(\frac{n}{4}, \frac{n}{2}\right]$ the probability that a string selected
according to U is $\frac{1}{16}$-close to being p-periodic is at most $\frac{4}{n^2}$, and therefore by
the union bound we are done.
Indeed let $p \in \left(\frac{n}{4}, \frac{n}{2}\right]$, and let α be a string selected according to U/S. Since
$p \in \left(\frac{n}{4}, \frac{n}{2}\right]$ there are at least $\frac{n}{4}$ p-sections. Because $m \le \frac{n}{8}$ and the size of each
p-section is at least 2, at least $\frac{3n}{16}$ of these p-sections contain at least 1 location
not in S. We call such a location a free location, and define B to be the set of
all integers x such that the x-th p-section contains a free location.
Then for every $i \in B$, $\mathrm{Prob}_{U/S}(\mathrm{dist}(\alpha_{Z(p,i)}, \text{homogeneous}) \ge 1) \ge \frac{1}{2}$ (the
free location is different from some other location in the p-section). Thus, since
the p-sections are mutually disjoint and $|B| \ge \frac{3n}{16}$, the Chernoff bound implies that
$$\mathrm{Prob}_{U/S}\left( \sum_{i \in B} \mathrm{dist}(\alpha_{Z(p,i)}, \text{homogeneous}) \le \frac{n}{16} \right) \le e^{-\frac{n}{24}} \le \frac{4}{n^2}.$$
3 Upper Bound for Testing period(≤ g)
In this section we construct a 1-sided error, non-adaptive ε-test for period(≤ g)
that uses $O(\sqrt{g \log g}/\epsilon^2)$ queries. The construction consists of two stages. We
first show a 1-sided error, non-adaptive algorithm, PT, for testing periodicity
(period(≤ n/2)), that uses $O(\sqrt{n \log n}/\epsilon^2)$ queries. Then we show how to use algorithm
PT in order to construct the claimed algorithm for testing period(≤ g).
The intuition behind the algorithm PT is simple. We first explain it by
hinting at a test for period(≤ n/2) that uses $O(\sqrt{n} \log n/\epsilon^2)$ queries. By Fact 2, α is
ε-far from period(≤ n/2) if and only if α is ε-far from being p-periodic for every
$p \in \left(\frac{n}{4}, \frac{n}{2}\right]$. We show that if α is ε-far from period(≤ n/2), we can find a p-witness
for each $p \in \left(\frac{n}{4}, \frac{n}{2}\right]$ with probability at least $1 - \frac{1}{n}$. This is sufficient since by the
union bound we will be done.
Suppose that we can construct a random set $Q \subseteq [n]$ of size $O(\sqrt{n} \log n)$
such that Q contains at least $\log n$ independent potential p-witnesses for each
$p \in \left(\frac{n}{4}, \frac{n}{2}\right]$. Then obviously we achieve the above goal. Indeed such a set Q can
be constructed, e.g., by choosing Q uniformly among all sets of the required
cardinality.
To reduce the size of Q, we construct such a random set Q of size $\sqrt{n \log n}$
with the above properties while giving up the independence between the $\log n$
pairs for each p. In turn we will have to show that the probability that none of them
is a p-witness is still low enough.
We use the following definition.
Definition 7. Let $\ell = \sqrt{n \log n}$ and let $J = \{I_i\}_{i=1}^{\ell}$ be the set of pairwise
disjoint intervals $I_i = \left( (i-1)\sqrt{\frac{n}{\log n}},\ i\sqrt{\frac{n}{\log n}} \right]$.
We now present the ε-test for period(≤ n/2).
Algorithm PT
Input: a string $\alpha \in \Sigma^n$, the string length n and a distance parameter
$0 < \epsilon < 1$;
Let $\ell$, J be as in Definition 7.
1. Select a set of integers $T = \{t_1, \dots, t_\ell\}$ by choosing one random point from
each $I \in J$.
2. Repeat the following independently $m = \frac{2^{10} \cdot \log n}{\epsilon^2}$ times: uniformly select
an interval $I \in J$. Let $J^*$ be the set of intervals that were selected and let
$H = \bigcup_{I \in J^*} I$.
3. Reject if for every $p \in \left(\frac{n}{4}, \frac{n}{2}\right]$ the set $H \cup T$ contains a p-witness. Otherwise
accept.
Theorem 1. Algorithm PT is a 1-sided error, non-adaptive ε-test for periodicity.
Its query complexity is $O(\sqrt{n \log n}/\epsilon^2)$.
The proof is omitted.
Theorem 2. For any $g \le n/2$ there is a 1-sided error, non-adaptive ε-test for
period(≤ g). Its query complexity is $O(\sqrt{g \log g}/\epsilon^2)$.

Proof Idea: Let $\alpha \in \Sigma^n$. We think of α as being composed of $\frac{n}{2g}$ pieces of length
$n' = 2g$ each. We now run the ε-test for periodicity for strings of length $n'$, and
for each query $q \in [n']$ we query q in one of the pieces, chosen randomly and
independently for each query. We avoid further details here.

4 Upper Bound for Testing period(≤ g), Where g ≤ (log n)/4
Let $g \le \frac{\log n}{4}$. We describe an algorithm for testing whether a string has period(≤ g)
that has query complexity poly(log g). The technicalities are somewhat involved;
however, the intuition is simple and motivated by the following reasoning:
Our goal is to find a witness for not being r-periodic for every $r \in [g]$. One
thing that we could do easily is to check whether α is q-periodic, where q is the
product of all numbers in [g] (this number would actually be too big, but
for understanding the following intuition this should suffice). Now if α is not
q-periodic then it is certainly not in period(≤ g) and we are done. Yet as far as
we know α may be far from having period(≤ g) but still be q-periodic. Our goal is to
show that if α is far from p-periodic, where $p \le g$, then there exists $q' > g$
such that α is far from $q'$-periodic and p divides $q'$. Indeed it follows from Lemma 3
that if α is far from p-periodic then there exists such a $q'$ as above. This together with
the following concept enables us to construct an ε-test.
Definition 8. For integers n, $g \in [n]$, a gcd-cover of [g] is a set $E \subseteq [n]$ such
that for every $\ell \le g$ there exists a subset $I \subseteq E$ that satisfies $\ell = \gcd(I)$.
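Definition 8 can be checked mechanically. The Python sketch below relies on a small observation: any subset of E whose gcd is $\ell$ consists of multiples of $\ell$, so it suffices to test whether the gcd of all multiples of $\ell$ in E equals $\ell$. The function name `is_gcd_cover` is an assumption, and the sketch says nothing about how to find a small cover.

```python
from functools import reduce
from math import gcd

def is_gcd_cover(E, g):
    """Check whether E is a gcd-cover of [g] = {1, ..., g}.

    For each l <= g, let I_l = {t in E : l divides t}. Every subset of E
    with gcd l lies inside I_l, so such a subset exists exactly when
    gcd(I_l) == l.
    """
    for l in range(1, g + 1):
        multiples = [t for t in E if t % l == 0]
        if not multiples or reduce(gcd, multiples) != l:
            return False
    return True
```

For instance, the four-element set `{4, 6, 10, 15}` is a gcd-cover of [6] but not of [7], which hints at why a cover can be much smaller than [g] itself.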
The importance of a gcd-cover of [g] is the following: We prove that if $\ell = \gcd(I)$
for an integer $\ell$ and a subset I, then the assumption that α is far from
being $\ell$-periodic implies that it is also sufficiently far from being t-periodic for
some $t \in I$. Using this, let E be a gcd-cover of [g]. Then we are going to query
for each $t \in E$ enough pairs $i, j \in [n]$ such that $i \equiv j \pmod t$. We say that such
pairs i, j cover t. Now for each $\ell \le g$ there is a set $I \subseteq E$ such that $\ell = \gcd(I)$.
The pairs of queries that cover I cover $\ell$. Thus if α is 'far' from being $\ell$-periodic
then, as $\gcd(I) = \ell$, there exists a period t, $t \in I$, such that the string is far
from t-periodic. We thus expect that this t will distinguish between strings that
are $\ell$-periodic and those that are $k\ell$-periodic but far from being $\ell$-periodic.
As the complexity of the test will depend crucially on the size of the gcd-cover,
we need to guarantee the existence of a small one. This is done in the
following Lemma, which also brings in an additional technical requirement.
Lemma 1. For all integers n and $g \in \left[\frac{\log n}{4}\right]$, there exists a gcd-cover of [g],
$E \subseteq [\sqrt{n}]$, of size $O((\log g)^3)$.
The proof of Lemma 1 is omitted.
A gcd-cover of [g] as in the Lemma is called an efficient gcd-cover of [g].
We next describe our algorithm.

Algorithm SPT
Input: a string $\alpha \in \Sigma^n$, a distance parameter $0 < \epsilon < 1$ and a threshold
period $g \le \frac{\log n}{4}$;
1. Set E to be an efficient gcd-cover of [g].
2. Let $M = \frac{|E| \cdot \log(8|E|)}{\epsilon^2}$. For each $\ell \in E$ select M pairs of integers uniformly
and with repetitions from the set $\{\{x, y\} \mid x \equiv y \bmod \ell \text{ and } x, y \in [n]\}$. Let
Q be the resulting (multi)set.
3. Reject if for each $\ell \in [g]$ the set Q contains a witness that α is not $\ell$-periodic.
Otherwise accept.
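The control flow of Algorithm SPT can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than part of the algorithm: the gcd-cover E is supplied by the caller (the construction of an efficient one is not reproduced), every member of E is assumed smaller than the string length, the logarithm in M is taken base 2, and pairs are sampled by listing them, which is only practical for short strings. Positions are 0-based.

```python
import math
import random

def spt(alpha, g, eps, E, rng=random):
    """Sketch of Algorithm SPT: return True to accept, False to reject."""
    n = len(alpha)
    M = math.ceil(len(E) * math.log2(8 * len(E)) / eps ** 2)
    bad_diffs = set()   # y - x over sampled pairs with alpha[x] != alpha[y]
    for t in E:
        # All unordered pairs {x, y} with x < y and x ≡ y (mod t).
        pairs = [(x, y) for x in range(n) for y in range(x + t, n, t)]
        for _ in range(M):
            x, y = rng.choice(pairs)
            if alpha[x] != alpha[y]:
                bad_diffs.add(y - x)
    # A sampled pair with l | (y - x) and alpha[x] != alpha[y] is an l-witness.
    # Reject only if such a witness was found for every l in [g].
    found_all = all(any(d % l == 0 for d in bad_diffs)
                    for l in range(1, g + 1))
    return not found_all
```

If α is p-periodic for some $p \le g$, no sampled pair can be a p-witness, so the sketch always accepts such inputs, matching the 1-sided error claim.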
Theorem 3. Algorithm SPT is a 1-sided error, non-adaptive ε-test for
period(≤ g), where $g \le \frac{\log n}{4}$. Its query complexity is $\tilde{O}((\log g)^6)$.
Proof: Since for every member of E we used M queries and $|E| = O((\log g)^3)$,
the estimate on the query complexity follows. The fact that the algorithm never
rejects an $\ell$-periodic string, for $\ell \le g$, is immediate, as the algorithm rejects only
when it finds a witness that the input string is not in period(≤ g).
In order to compute the success probability of algorithm SPT we first need
the following definition.
Definition 9. A string $\alpha \in \Sigma^n$ is called ε-bad if for every $\ell \le g$ and every
$S \subseteq E$, where $\ell = \gcd(S)$, there exists $s \in S$ for which α is $\frac{\epsilon}{2|E|}$-far from being
s-periodic.
Note: if α is ε-bad and s, $\ell$ are as in the definition, then a witness
showing that α is not s-periodic is also a witness that it is not $\ell$-periodic.
The proof of the Theorem now follows from Lemma 2 and Lemma 3.

Lemma 2. Algorithm SPT rejects every α that is ε-bad with probability at
least $\frac{3}{4}$.
Proof. Let $\alpha \in \Sigma^n$ be ε-bad. Let $B \subseteq E$ be the set that contains every $b \in E$
such that α is $\frac{\epsilon}{2|E|}$-far from being b-periodic. By Definition 9, for every $\ell \le g$
there exists a $b \in B$ such that $\ell \mid b$, and hence if we find a b-witness for each
$b \in B$, this is also a witness that α is not in period(≤ g). Thus, it is enough to
prove that for each $b \in B$, the probability that there is no b-witness in Q is at
most $\frac{1}{8|E|}$, as then by the union bound we are done.
Let $\delta = \frac{\epsilon}{2|E|}$ and let b be a member of B, namely α is δ-far from being
b-periodic. By Corollary 1, for random $x, y \in [n]$ such that $x \equiv y \pmod b$ we
have that $\mathrm{Prob}(\alpha_x \ne \alpha_y) \ge \delta$. Since M random pairs are being queried for each
$\ell \in E$ and in particular for b, the probability that a b-witness is not found is at
most $(1 - \delta)^M \le \frac{1}{8|E|}$.

Lemma 3. If a string $\alpha \in \Sigma^n$ is ε-far from having a short period (period(≤ g)), where
$g \le (\log n)/4$, then it is an ε-bad string.
The proof of Lemma 3 is omitted.

5 Lower Bound for Testing period(≤ g)
In this section we prove that for every $g \le n/2$, every adaptive, 2-sided error,
$\frac{1}{32}$-test for period(≤ g) uses $\Omega(\sqrt{g/(\log g \cdot \log n)})$ queries. We prove this for
Σ = {0, 1}. This implies the same bound for every alphabet that contains at
least two symbols.
Theorem 4. Any adaptive, 2-sided error, $\frac{1}{32}$-test for period(≤ g) uses
$\Omega(\sqrt{g/(\log g \cdot \log n)})$ queries.
Proof. Fix $g \le n/2$. We prove the theorem by using Yao's principle. That is,
we construct a distribution D over legitimate instances (strings that are either
in period(≤ g), or strings that are $\frac{1}{32}$-far from period(≤ g)) and prove that any
adaptive deterministic tester that uses $m = o(\sqrt{g/(\log g \cdot \log n)})$ queries gives
an incorrect answer with probability greater than $\frac{1}{3}$.
In order to define D we use auxiliary distributions $D_P$, $D_N$ and the following
notation. Let Primes = {p | p is a prime such that p ≤ g}. According to the
Prime Number Theorem [11–13], $|\mathrm{Primes}| = \Theta(\frac{g}{\log g})$.
We now define distributions $D_P$, $D_N$.
– $D_N$ is simply the uniform distribution over $\Sigma^n$.
– An instance α of length n is selected according to distribution $D_P$ as follows.
Uniformly select $p \in \mathrm{Primes}$, then uniformly select $\omega \in \Sigma^p$ and finally set
α to be the concatenation of ω to itself enough times until a total length of
n (possibly concatenating a prefix of ω at the end if p does not divide n).
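The two auxiliary distributions are easy to sample from. The Python sketch below does so for Σ = {0, 1}, using a simple sieve for the set Primes; the function names, the `rng` parameter and the choice to also return the selected prime are assumptions made for illustration.

```python
import random

def primes_up_to(g):
    """All primes p <= g (requires g >= 2), via a simple sieve."""
    sieve = [True] * (g + 1)
    sieve[0] = sieve[1] = False
    for q in range(2, int(g ** 0.5) + 1):
        if sieve[q]:
            for multiple in range(q * q, g + 1, q):
                sieve[multiple] = False
    return [p for p in range(2, g + 1) if sieve[p]]

def sample_DP(n, g, rng=random):
    """Sample from D_P: a random prime p <= g and a random word omega of
    length p, repeated (with a final prefix if needed) up to length n."""
    p = rng.choice(primes_up_to(g))
    omega = [rng.randrange(2) for _ in range(p)]
    return [omega[i % p] for i in range(n)], p

def sample_DN(n, rng=random):
    """Sample from D_N: a uniformly random binary string of length n."""
    return [rng.randrange(2) for _ in range(n)]
```

Every string produced by `sample_DP` is p-periodic for the returned prime, hence in period(≤ g).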
We next define distribution D. Let G be the event that α is 1
16 -far from having
period(≤ g). Let DN/G be the distribution DN given that event G is true. A
string α is selected according to distribution D by choosing one of the distri-
butions DP ,DN/G with equal probability and then selecting α according to the
distribution chosen. Namely D = 1
2 DP + 1
2 DN/G.
We don’t work directly with D but rather with a simpler distribution D
which approximates D well enough and is defined as follows D
= 1
2 DP + 1
2 DN .
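Let B be the event that the tester gives an incorrect answer. We prove that $\mathrm{Prob}_{D'}(B) \ge \frac{2}{5}$, where $D'$ is the simpler distribution defined above. This is indeed sufficient, as $\mathrm{Prob}_{D}(B) \ge \mathrm{Prob}_{D'}(B) - \mathrm{Prob}_{D_N}(\overline{G})$, where $\overline{G}$ is the complement of G. Using Proposition 2, $\mathrm{Prob}_{D}(B) \ge \frac{1-o(1)}{2} - \frac{1}{g} > \frac{1}{3}$.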
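The passage from the simpler distribution to D can be spelled out as follows. This is a sketch of the standard conditioning argument, not a quotation from the paper; it writes $D'$ for the mixture $\frac{1}{2} D_P + \frac{1}{2} D_N$ and $\overline{G}$ for the complement of G.

```latex
\begin{align*}
\Pr_{D_N}[B]
  &= \Pr_{D_N}[B \wedge G] + \Pr_{D_N}[B \wedge \overline{G}]
   \le \Pr_{D_N}[B \mid G] + \Pr_{D_N}[\overline{G}], \\
\Pr_{D'}[B]
  &= \tfrac{1}{2}\Pr_{D_P}[B] + \tfrac{1}{2}\Pr_{D_N}[B] \\
  &\le \tfrac{1}{2}\Pr_{D_P}[B] + \tfrac{1}{2}\Pr_{D_N \mid G}[B]
     + \tfrac{1}{2}\Pr_{D_N}[\overline{G}]
   = \Pr_{D}[B] + \tfrac{1}{2}\Pr_{D_N}[\overline{G}].
\end{align*}
```

Rearranging gives $\Pr_{D}[B] \ge \Pr_{D'}[B] - \frac{1}{2}\Pr_{D_N}[\overline{G}]$, which is slightly stronger than the inequality used in the text.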
We assume without loss of generality that for any string of length n the
tester uses the same number of queries. Hence we can view the tester as a full
binary decision tree of depth m which is labeled as follows. Each node of the tree
represents a query location, for each internal node one of the outgoing edges is
labeled by 1 and the other by 0, where 0, 1 represent the answers to the query,
and each leaf is labeled either by “accept” or by “reject” according to the decision
of the algorithm.
For each leaf $\ell$ in the tree we associate a pair $(Q_\ell, f_\ell)$, where $Q_\ell$ is the set of queries on the path from the root to the leaf $\ell$ and $f_\ell : Q_\ell \to \{0, 1\}$ maps each query to its answer, that is, to the label on the corresponding edge of the path from the root to $\ell$. Let L be the set of all leaves, let $L_0$ be the set of all leaves that are labeled by “reject” and let $L_1$ be the set of all leaves that are labeled by “accept”.
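The decision-tree view can be made concrete with a small sketch. It is an illustration only: the tester is modeled by two hypothetical callbacks, `next_query` and `decide`, and the toy tester at the bottom is invented for the example. The sketch assumes that the queries along any root-to-leaf path are distinct.

```python
from itertools import product


def enumerate_leaves(next_query, decide, m):
    """Return one (Q, f, accept) triple per leaf of the depth-m decision tree.

    next_query(answers) -> position queried after seeing the tuple `answers`
    decide(queries, answers) -> True for "accept", False for "reject"
    """
    leaves = []
    for bits in product((0, 1), repeat=m):
        # Follow the path determined by the answer sequence `bits`.
        queries = [next_query(bits[:depth]) for depth in range(m)]
        f = dict(zip(queries, bits))  # query position -> answer on this path
        leaves.append((queries, f, decide(queries, bits)))
    return leaves


def consistent(alpha, f):
    """The indicator h for a leaf: True iff alpha agrees with f on every query."""
    return all(alpha[q] == b for q, b in f.items())


# Hypothetical toy tester with m = 3 adaptive queries.
def toy_next_query(answers):
    return 2 * len(answers) + (answers[-1] if answers else 0)


def toy_decide(queries, answers):
    return len(set(answers)) == 1  # accept iff all answers are equal


leaves = enumerate_leaves(toy_next_query, toy_decide, 3)
alpha = [0, 1, 1, 0, 1, 0]
# A deterministic tester sends every input to exactly one leaf.
print(len(leaves), sum(consistent(alpha, f) for _, f, _ in leaves))
```

Because every input is consistent with exactly one leaf, the events "α is consistent with leaf ℓ" partition the input space, which is what allows the error probability to be summed leaf by leaf.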
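For a leaf $\ell$, let $h_\ell : \Sigma^n \to \{0, 1\}$ be the function that is 1 only on strings $\alpha \in \Sigma^n$ such that for every $q \in Q_\ell$ we have $\alpha_q = f_\ell(q)$. That is, $h_\ell(\alpha) = 1$ if and only if α is consistent with $(Q_\ell, f_\ell)$. Let $\mathrm{far} : \Sigma^n \to \{0, 1\}$ be the function that is 1 only on strings $\alpha \in \Sigma^n$ that are 1/16-far from period(≤ g). Thus $\mathrm{Prob}_{D'}(B)$, where $D' = \frac{1}{2} D_P + \frac{1}{2} D_N$, is at least

$$\frac{1}{2} \sum_{\ell \in L_0} \mathrm{Prob}_{D_P}[h_\ell(\alpha) = 1] + \frac{1}{2} \sum_{\ell \in L_1} \mathrm{Prob}_{D_N}[\mathrm{far}(\alpha) = 1 \wedge h_\ell(\alpha) = 1]. \qquad (1)$$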
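We will prove that for each leaf $\ell \in L_0$, $\mathrm{Prob}_{D_P}[h_\ell(\alpha) = 1] \ge \frac{1-o(1)}{2^m}$, and for each leaf $\ell \in L_1$, $\mathrm{Prob}_{D_N}[\mathrm{far}(\alpha) = 1 \wedge h_\ell(\alpha) = 1] \ge \frac{1-o(1)}{2^m}$. Since a full binary tree of depth m has $|L| = 2^m$ leaves, this implies that $\mathrm{Prob}_{D'}(B) \ge \frac{1}{2} \sum_{\ell \in L} \frac{1-o(1)}{2^m} \ge \frac{1-o(1)}{2}$.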
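Indeed, recall that a string is selected according to $D_P$ by first selecting $z \in$ Primes and then selecting a z-periodic string. For $Q \subset [n]$ with $|Q| = m$, let $A(Q)$ be the event that, for α selected according to $D_P$, there exist no distinct $j, k \in Q$ such that $j \equiv k \pmod z$. For any fixed $i < j \in Q$ there are at most $\log n$ prime divisors of $j - i$, and there are fewer than $m^2$ such pairs. Hence for each leaf $\ell \in L_0$, $\mathrm{Prob}_{D_P}[A(Q_\ell)] \ge 1 - \frac{m^2 \cdot \log n}{|\mathrm{Primes}|} \ge 1 - o(1)$.
Observe that for any fixed leaf $\ell$, if $A(Q_\ell)$ occurs then, according to the definition of $D_P$, for each $q \in Q_\ell$, $\alpha_q$ is selected uniformly and independently of $\alpha_{q'}$ for any other $q' \in Q_\ell$. Hence $\mathrm{Prob}_{D_P}[h_\ell(\alpha) = 1] \ge \mathrm{Prob}_{D_P}[h_\ell(\alpha) = 1 \wedge A(Q_\ell)] \ge \mathrm{Prob}_{D_P}[h_\ell(\alpha) = 1 \mid A(Q_\ell)] \cdot \mathrm{Prob}_{D_P}[A(Q_\ell)] \ge \frac{1}{2^m}(1 - o(1))$.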
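For small parameters the probability of the event A(Q) can be computed exactly and compared with the union bound used above. The sketch below is illustrative only: `prob_A` and `union_bound` are hypothetical helper names, and the logarithm is taken base 2, which is an assumption about the paper's convention (it is the base for which "at most log n prime divisors" holds directly).

```python
import math


def primes_up_to(g):
    """Sieve of Eratosthenes: all primes p <= g (assumes g >= 2)."""
    sieve = [True] * (g + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(g ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, g + 1, i):
                sieve[j] = False
    return [p for p, is_p in enumerate(sieve) if is_p]


def prob_A(Q, g):
    """Exact probability, over a uniform prime z <= g, that the positions
    in Q are pairwise distinct modulo z."""
    primes = primes_up_to(g)
    distinct = len(set(Q))
    good = [z for z in primes if len({q % z for q in Q}) == distinct]
    return len(good) / len(primes)


def union_bound(Q, g, n):
    """The lower bound 1 - m^2 * log n / |Primes| (log base 2 assumed)."""
    m = len(Q)
    return 1 - m * m * math.log2(n) / len(primes_up_to(g))


# Only z = 2 maps two of the positions 0, 1, 2 to the same residue.
print(prob_A([0, 1, 2], 10))  # 0.75
Q = [3, 17, 40, 77]
print(prob_A(Q, 5000) >= union_bound(Q, 5000, 10000))  # True
```

The bound is loose for small inputs (it can even be negative); it becomes meaningful only when $m^2 \log n$ is small compared with the number of primes up to g.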
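Observe that for each leaf $\ell \in L_1$,

$$\mathrm{Prob}_{D_N}[\mathrm{far}(\alpha) = 1 \wedge h_\ell(\alpha) = 1] \ge \mathrm{Prob}_{D_N}[\mathrm{far}(\alpha) = 1 \mid h_\ell(\alpha) = 1] \cdot \mathrm{Prob}_{D_N}[h_\ell(\alpha) = 1]. \qquad (2)$$

By the definition of $D_N$, $\mathrm{Prob}_{D_N}[h_\ell(\alpha) = 1] = \frac{1}{2^m}$. Also, using the definition just before Proposition 2, we see that $\mathrm{Prob}_{D_N}[\mathrm{far}(\alpha) = 1 \mid h_\ell(\alpha) = 1] = \mathrm{Prob}_{U(Q_\ell)}[\mathrm{far}(\alpha) = 1] \ge 1 - \frac{1}{n}$ (where the last inequality is by Proposition 2).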

6 Lower Bound for Testing period(≤ g), Where g ≤ (log n)/4
In this section we prove that every 2-sided error 1/16-test for period(≤ g), where $g \le \frac{\log n}{4}$, uses $\Omega((\log g)^{1/4})$ queries. We prove this for Σ = {0, 1}; this implies the same bound for every alphabet that contains at least two symbols. It also shows that the test presented in Section 4 cannot be dramatically improved.
Theorem 5. Any 2-sided error 1/16-test for period(≤ g), where $g \le \frac{\log n}{4}$, requires $\Omega((\log g)^{1/4})$ queries.
Proof. The proof is very similar to the proof of Theorem 4. We just describe here the two distributions $D_P$ and $D_N$, which are concentrated on positive inputs (those that have period(≤ g)) and negative inputs (those that are 1/16-far from being period(≤ g)), respectively.
Let $S = \{p_1, \dots, p_k\}$, where $k = 1 + \sqrt{\log g}$ and each $p_i$ is a prime such that $2^{\sqrt{\log g} - 0.9} \le p_i \le 2^{\sqrt{\log g}}$. According to the Prime Number Theorem [11–13], there exist at least $\frac{1}{8\sqrt{\log g}} \cdot 2^{\sqrt{\log g}} > 2 + \sqrt{\log g}$ such primes, hence the set S is well defined. Let $t = \prod_{p \in S} p$ and let $Z = \{ t/p \mid p \in S \}$. Note that $t > g$ while $z \le g$ for every $z \in Z$.
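The construction of S, t and Z can be checked numerically on a small instance. The sketch below is an illustration under explicit assumptions: logarithms are base 2, g is restricted to values of the form $2^{x^2}$ so that $\sqrt{\log g} = x$ is an integer, and the function name `build_S_t_Z` is hypothetical. The guarantee on the number of primes is asymptotic, so the function raises an error when the interval is too sparse.

```python
import math


def is_prime(x):
    """Trial division; adequate for the small primes used here."""
    return x >= 2 and all(x % d for d in range(2, math.isqrt(x) + 1))


def build_S_t_Z(g):
    """Return (S, t, Z) for g = 2**(x*x), following the construction in the text."""
    log_g = g.bit_length() - 1
    x = math.isqrt(log_g)
    if 2 ** log_g != g or x * x != log_g:
        raise ValueError("this sketch assumes g = 2**(x*x) for an integer x")
    k = 1 + x
    lo, hi = 2 ** (x - 0.9), 2 ** x
    candidates = [p for p in range(math.ceil(lo), hi + 1) if is_prime(p)]
    if len(candidates) < k:
        raise ValueError("g too small: not enough primes in the interval")
    S = candidates[:k]
    t = math.prod(S)
    Z = [t // p for p in S]
    return S, t, Z


g = 2 ** 36  # sqrt(log g) = 6, so k = 7
S, t, Z = build_S_t_Z(g)
print(S)                                # [37, 41, 43, 47, 53, 59, 61]
print(t > g, all(z <= g for z in Z))    # True True
```

Every z in Z divides t, so each positive instance is also t-periodic, just as the negative instances are by construction.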
We now define the distributions $D_P$, $D_N$ (we assume w.l.o.g. that t divides n).
– We use an auxiliary distribution U in order to define $D_N$. To select an instance according to U, select a random string $\omega \in \Sigma^t$ and then set $\alpha = (\omega)^{n/t}$. Let G be the event that α is 1/16-far from being period(≤ g). Then $D_N$ is defined as the distribution $U|G$, namely U given G.
– An instance α of length n is selected according to distribution $D_P$ as follows. Uniformly select $z \in Z$, uniformly select a string $\omega \in \Sigma^z$, and then set α to be the concatenation of ω with itself enough times to reach a total length of n.
We omit further details.

7 Lower Bound for Testing Periodicity
Let Σ = {0, 1}. We prove the following lower bound on the number of queries needed for testing period(≤ n/2) over Σ. This clearly implies the same lower bound for any alphabet that contains at least two letters. It also shows that the test presented in Section 3 is asymptotically optimal.
Theorem 6. Any non-adaptive 1-sided error 1/16-test for periodicity requires $\Omega(\sqrt{n \log n})$ queries.
Proof. A 1-sided error test rejects only when the input string is not periodic. Therefore, according to Corollary 1, such a test rejects only if the set of queries it uses contains a witness that the string is not periodic. To prove the theorem we use Yao's principle (the easy direction). Namely, we construct a distribution on 1/16-far inputs and show that any deterministic non-adaptive test that makes at most $\frac{\sqrt{n \log n}}{100}$ queries finds a witness for non-periodicity with probability at most 1/3.
Let U be the uniform distribution over $\Sigma^n$, and let G be the event that a string selected according to U is 1/16-far from being periodic. Let D be the distribution $U|G$ (U given G). Namely, D is uniform over the strings in $\Sigma^n$ that are 1/16-far from being periodic. Let B be the event that the set of queries Q contains a witness. By Proposition 2, $\mathrm{Prob}_U(G) \ge 1 - \frac{1}{n}$. Thus, it is enough to prove that $\mathrm{Prob}_U(B) \le \frac{1}{4}$, since $\mathrm{Prob}_D(B) \le \mathrm{Prob}_U(B) + \mathrm{Prob}_U(\overline{G}) \le \frac{1}{4} + \frac{1}{n}$, where $\overline{G}$ denotes the complement of G. The proof follows from Lemma 4 below.

Lemma 4. Let $Q \subseteq [n]$ be a set of $\frac{\sqrt{n \log n}}{100}$ queries. Then for α chosen according to U, the probability that Q contains a witness that α is not periodic is at most $\frac{1}{4}$.
The proof is omitted.
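To experiment with the event in Lemma 4, one needs a concrete notion of a witness. Corollary 1 is not reproduced in this section, so the sketch below rests on an assumed reading: a string of length n is periodic if it has some period p ≤ n/2, and a query set Q contains a witness of non-periodicity if for every such p there are two positions in Q that are congruent modulo p and carry different symbols. The function name `contains_witness` is hypothetical.

```python
def contains_witness(alpha, Q):
    """True iff, under the assumed definition, the positions in Q rule out
    every candidate period p <= n/2 of the string alpha."""
    n = len(alpha)
    for p in range(1, n // 2 + 1):
        first_seen = {}  # residue mod p -> first symbol observed in that class
        refuted = False
        for q in Q:
            r = q % p
            if r in first_seen and first_seen[r] != alpha[q]:
                refuted = True  # two queried positions in one class disagree
                break
            first_seen.setdefault(r, alpha[q])
        if not refuted:
            return False  # the queries are consistent with period p
    return True


periodic = [0, 1] * 4          # period 2
almost_zero = [0] * 7 + [1]    # no period <= 4
print(contains_witness(periodic, range(8)))      # False
print(contains_witness(almost_zero, range(8)))   # True
print(contains_witness(almost_zero, [0, 1, 2]))  # False
```

Under this reading, a 1-sided error test may reject only when `contains_witness` is true for its query set, which is why bounding the probability of that event suffices.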

References
1. O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to
learning and approximation. Journal of the ACM, 45:653–750, 1998.
2. R. Rubinfeld and M. Sudan. Robust characterization of polynomials with applications
to program testing. SIAM Journal on Computing, 25:252–271, 1996.
3. F. Ergun, S. Muthukrishnan, and C. Sahinalp. Sub-linear methods for detecting
periodic trends in data streams. In LATIN 2004, Proc. of the 6th Latin American
Symposium on Theoretical Informatics, pages 16–28, 2004.
4. P. Indyk, N. Koudas, and S. Muthukrishnan. Identifying representative trends
in massive time series data sets using sketches. In VLDB 2000, Proceedings of
26th International Conference on Very Large Data Bases, September 10-14, 2000,
Cairo, Egypt, pages 363–372. Morgan Kaufmann, 2000.
5. R. Krauthgamer and O. Sasson. Property testing of data dimensionality. In Proceedings
of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 18–
27. Society for Industrial and Applied Mathematics, 2003.
6. A. C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss. Near-optimal
sparse fourier representations via sampling. In STOC 2002, Proceedings of the
thirty-fourth annual ACM symposium on Theory of computing, pages 152–161,
2002.
7. Alex Samorodnitsky and Luca Trevisan. A PCP characterization of NP with
optimal amortized query complexity. In Proc. of the 32nd ACM STOC, pages 191–199, 2000.
8. Johan Håstad and Avi Wigderson. Simple analysis of graph tests for linearity and
PCP. Random Struct. Algorithms, 22(2):139–160, 2003.
9. E. Fischer. The art of uninformed decisions: A primer to property testing. The
computational complexity column of The Bulletin of the European Association for
Theoretical Computer Science, 75:97–126, 2001.
10. D. Ron. Property testing (a tutorial). In Handbook of Randomized Computing,
pages 597–649. Kluwer Press, 2001.
11. J. Hadamard. Sur la distribution des zéros de la fonction ζ(s) et ses conséquences
arithmétiques. Bull. Soc. Math. France, 24:199–220, 1896.
12. C. de la Vallée Poussin. Recherches analytiques sur la théorie des nombres premiers. Ann. Soc.
Sci. Bruxelles, 1897.
13. D.J. Newman. Simple analytic proof of the prime number theorem. Amer. Math.
Monthly, 87:693–696, 1980.

The Parity Problem in the Presence of Noise,
Decoding Random Linear Codes,
and the Subset Sum Problem
(Extended Abstract)
Vadim Lyubashevsky
University of California at San Diego, La Jolla CA 92093, USA
vlyubash@cs.ucsd.edu
Abstract. In [2], Blum et al. demonstrated the first sub-exponential algorithm for learning the parity function in the presence of noise. They solved the length-n parity problem in time $2^{O(n/\log n)}$, but it required the availability of $2^{O(n/\log n)}$ labeled examples. As an open problem, they asked whether there exists a $2^{o(n)}$ algorithm for the length-n parity problem that uses only poly(n) labeled examples. In this work, we provide a positive answer to this question. We show that there is an algorithm that solves the length-n parity problem in time $2^{O(n/\log\log n)}$ using $n^{1+\epsilon}$ labeled examples. This result immediately gives us a sub-exponential algorithm for decoding $n \times n^{1+\epsilon}$ random binary linear codes (i.e. codes where the messages are n bits and the codewords are $n^{1+\epsilon}$ bits) in the presence of random noise. We are also able to extend the same techniques to provide a sub-exponential algorithm for dense instances of the random subset sum problem.
1 Introduction
In the length-n parity problem with noise, there is an unknown vector $c \in \{0, 1\}^n$ that we are trying to learn. We are given access to an oracle that generates examples $a_i$ and labels $l_i$, where $a_i$ is uniformly distributed in $\{0, 1\}^n$ and $l_i$ equals $c \cdot a_i \pmod 2$ with probability $\frac{1}{2} + \eta$ and $1 - c \cdot a_i \pmod 2$ with probability $\frac{1}{2} - \eta$. The problem is to recover c. In [2], Blum, Kalai, and Wasserman demonstrated the first sub-exponential algorithm for solving this problem. They gave an algorithm that recovers c in time $2^{O(n/\log n)}$ using $2^{O(n/\log n)}$ labeled examples for values of η greater than $2^{-n^{\delta}}$ for any constant $\delta < 1$. An open problem was whether it was possible to have an algorithm with a sub-exponential running time when given access to only a polynomial number of labeled examples. In this work, we show that by having access to only $n^{1+\epsilon}$ labeled examples, we can recover c in time $2^{O(n/\log\log n)}$ for values of η greater than $2^{-(\log n)^{\delta}}$. So the penalty we pay for using fewer examples is both in the time and the error tolerance.
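The example oracle described above is easy to simulate. The following is a minimal sketch, assuming vectors are represented as Python lists of bits; the name `make_oracle` and its parameters are hypothetical and do not come from the paper.

```python
import random


def make_oracle(c, eta, rng):
    """Return a function that produces one noisy labeled example (a, l).

    a is uniform in {0,1}^n; l equals c.a mod 2 with probability 1/2 + eta
    and is flipped with probability 1/2 - eta.
    """
    n = len(c)

    def oracle():
        a = [rng.randrange(2) for _ in range(n)]
        parity = sum(ci * ai for ci, ai in zip(c, a)) % 2
        correct = rng.random() < 0.5 + eta
        return a, (parity if correct else 1 - parity)

    return oracle


rng = random.Random(0)
secret = [1, 0, 1, 1, 0]
oracle = make_oracle(secret, eta=0.1, rng=rng)
samples = [oracle() for _ in range(10000)]
agree = sum(l == sum(s * x for s, x in zip(secret, a)) % 2 for a, l in samples)
print(agree / len(samples))  # empirically close to 0.6
```

With η = 1/2 the labels are noise-free and c can be recovered by Gaussian elimination from about n examples; the difficulty of the problem comes entirely from η being small.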

Research supported in part by NSF grant CCR-0093029.
The parity problem in the presence of noise is equivalent to the problem of decoding random binary linear codes in the presence of random noise. Suppose that A is a random n × m boolean matrix, and let l = cA for some binary string c of length n. Now flip each bit of l with probability $\frac{1}{2} - \eta$, and call the resulting bit string $l'$. The goal is to recover c given the matrix A and the string $l'$. Notice that we can just view every column of A as an example $a_i$ and view the $i$th bit of $l'$ as the value of $c \cdot a_i \pmod 2$, which is correct with probability $\frac{1}{2} + \eta$. So this is exactly the length-n parity problem in the presence of noise where we are given m labeled examples. In this application, being able to solve the length-n parity problem with fewer examples is crucial because we don't really have an oracle that will provide us with as many labeled examples as we want. The number of labeled examples we get is exactly m, the number of columns of A. Our result for learning the parity function provides the first algorithm which can decode $n \times n^{1+\epsilon}$ random binary linear codes i
