STOC/FOCS/SODA: The Cage Match (with data!)

(Ed: This post, and the attendant web page and data, was initiated as a joint effort of Piotr Indyk and myself. Many others have helped improve the data, presentation and conclusions.)

Inspired by Michael Mitzenmacher’s flamebait post on SODA/STOC/FOCS, we decided to roll up our sleeves, and resolve the biggest outstanding issue in Theoretical Computer Science, namely the great “STOC/FOCS vs. SODA” debate (“P vs. NP” is a tainted second, sullied by all that money being offered to solve it). We have some interesting preliminary observations, and there are many interesting problems left open by our work ;)

The hypothesis:

There is a significant difference in citation patterns between STOC/FOCS and SODA

The plan:

First, we obtained the list of titles of conference papers appearing in STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately excluded 2007 because FOCS hasn’t happened yet. We got this list from DBLP (Note: DBLP does not make any distinction between actual papers and tutorials/invited articles; we decided to keep all titles because there weren’t that many tutorials/invited papers in any case).

For each title, we extracted the citation count from Google Scholar, using a process that we henceforth refer to as “The Extractor”. Life is too short to describe what “The Extractor” is. Suffices to say that its output, although not perfect, has been verified to be somewhat close to the true distribution (see below).

The results, and methodology, can be found at this link. The tables and graphs are quite self-explanatory. All source data used to generate the statistics are also available; you are welcome to download the data and make your own inferences. We’ll be happy to post new results here and at the webpage.

OBSERVATIONS:

The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with years 1999-2001 experiencing the highest variations. It is interesting to note that 1999 was the the year when SODA introduced four parallel sessions as well as the short paper option.

Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest citation counts per year for STOC (years 1997-2001) are all higher than the highest citation count for FOCS (year 2001). (Note: the highest cited STOC article in 2001 was Christos Papadimitriou’s tutorial paper on algorithms and game theory). The variation between SODA and STOC/FOCS in the 1999-2001 range shows up here too, between STOC and FOCS themselves. So maybe it’s just something weird going on these years. Who knows :)

Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.

There are other observations we might make, especially in regard to what happens outside the peak citations, but for that we need more reliable data. Which brings us to the next point.

VALIDATION OF THE DATA:

To make sure that the output makes sense, we performed a few “checks and balances”. In particular:

  • we sampled 10 random titles from each of FOCS, STOC and SODA for each of the 10 years, and for each title we checked the citation count by hand. Results: there were 7 mistakes in FOCS, 9 in STOC, and 11 in SODA, indicating a current error rate in the 9-10% range.
  • for each of FOCS, STOC, SODA, we verified (by hand) the values of the top 10 citation numbers, as reported by The Extractor
  • we compared our stats for the year 2000 with the stats obtained by Michael. The results are pretty close:

Median (Mike’s/Ours) Total (over all 10 years) (Mike’s/Ours)
FOCS 38/38 3551/3315
STOC 21/21 3393/2975
SODA 14/13 2578/2520

A CALL FOR HELP:
Warning: the data displayed here is KNOWN to contain errors (our estimate is that around 10% of citation counts are incorrect). We would very much appreciate any efforts to reduce the error rate. If you would like to help:

  1. choose a “random” conference/year pair (e.g., STOC 1997)
  2. check if this pair has been already claimed in this blog; if yes, go to (1)
  3. post a short message claiming your pair (e.g., “CLAIMING STOC 1997″) on the blog.
  4. follow the links to check the citations. For each incorrect citation, provide two lines: (1) paper title (2) a Google Scholar link to the correct paper
  5. Email the file to Suresh Venkatasubramanian.

(and yes, we know that this algorithm has halting problems). Many thanks in advance. Of course, feel free to send us individual corrections as well.

Citation guidelines: The world of automatic citation engines is obviously quite messy, and sometimes it is not immediately clear what is the “right” citation count of a paper. The common “difficult case” is when you see several (e.g., conference and journal) versions of the same paper. In this case, our suggestion is that you ADD all the citation counts that you see, and send us the links to ALL the papers that you accounted for.


Acknowledgement: Adam Buchsbaum for suggesting the idea of, and generating data for the short-long variants of SODA. David Johnson, Graham Cormode, and Sudipto Guha for bug fixes, useful suggestions and ideas for further plots (which we can look into after the data is cleaned up)

Published in: on July 25, 2007 at 4:20 am Comments (44)

SODA vs STOC/FOCS

Michael Mitzenmacher has an interesting post up comparing citation counts for papers at SODA/STOC/FOCS 2000. The results are quite stunning (and SODA does not fare well). For a better comparison, we’d need more data of course, but this is a reasonable starting point.

As my (minor) contribution to this endeavour, here are the paper titles from 1997-2006 (last 10 years) for STOC, FOCS, and SODA. There might be errors: this was all done using this script. Someone more proficient in Ruby than I might try to hack this neat script written by Mark Reid for stemming and plotting keywords trends in ICML papers; something I’ve been hoping to do for STOC/FOCS/SODA papers.

If there are any Google researchers reading this, is there some relatively painless way using the Google API to return citation counts from Google Scholar ? In fact, if people were even willing to do blocks of 10 papers each, we might harness the distributed power of the community to get the citation counts ! If you’re so inspired, email me stating which block of 10 (by line number, starting from 1) you’re planning to do and for which conference. Don’t let Michael’s effort go in vain !

There are 681 titles from FOCS, 821 from STOC, and 1076 from SODA.

Published in: on July 18, 2007 at 3:03 am Comments (11)

SODA 2008 Server Issues ?

Piotr asks if anyone else has been having problems with the SODA 2008 server shortly before 5pm ET (i.e around 90 minutes ago).

Update (7/707): At least one person thinks the server was on (and accepting submissions) for a few hours after the deadline. If this is true, this would be quite interesting, considering that I’ve never heard of this happening before (if the server did crash, it’s a rational response of course).

P.S Maybe the server was out watching fireworks.

Published in: on July 6, 2007 at 10:29 pm Comments (6)

SODA 2007: Day 2: Da Bidness

I am sorry to say that the business meeting was so dull we couldn’t even muster up the energy to argue about old controversies, let alone create new ones. Here are some of the highlights:

  • Attendance was 276, comparable to SODA 2004 in New Orleans. 96 students
  • 796 distinct accepted authors, from 29+ countries.
  • Milan Ružić was awarded the best student paper prize for “Making Deterministic Signatures Quickly“.
  • The domain gmail.com had a 44% acceptance rate, compared to yahoo.com’s 14% acceptance rate. Yahoo’s stock price fell 3% on hearing this news. Google’s stock price fell 5% on worries that researchers were wasting their time writing papers.
  • Manufactured abstract from a random selection of words in accepted paper abstracts:

We present arbitrary coinciding factors that are hierarchical and use predecessors as well as important jobs. We show there exist revenues that are treasure sales and independent for identical parametric desires.

  • Hal Gabow makes some attempts to shake things up with suggestions about pandering to the discrete math community, prompting this from Ian Munro:

“The notion of a discrete math community is an interesting one and somewhat perverse”

  • He tries to resurrect short papers, prompting this, from an unnamed PC member (but we know who you are:))

“Please don’t exhume this dead horse just to kick the rotting bones around. Again”

  • SODA 2008 is in San Francisco; Shang-Hua Teng is chair.
  • Major discussion on locations for SODA 2009. Three candidates emerge (NYC/Puerto Rico/Las Vegas). After an attempt at vote-rigging that would put Diebold to shame, the organizers do a straight vote and NYC wins !!

There will be no SODA Day 3 summary from me; I have to leave early to teach. The joys of professorhood !!

Published in: on January 9, 2007 at 9:39 am Comments (15)

SODA 2007: Day 2

David has been posting detailed impressions of specific papers. Do check them out.

He mentions the constant-degree expander result of Alon, Schwartz and Shapira. I was at that talk today morning, and it’s a very neat result. Expander graphs have a long and fascinating history (see the Linial-Wigderson course notes for details), and I won’t try to summarize it here. Basically, the goal for a long time has been to construct reasonably sparse, constant degree expanders that also have “simple” proofs of expansion. Many of the constructions to date were either non-constructive (like the random expanders) or had very sophisticated proofs of expansion. Luca Trevisan had two very detailed posts that give an idea of how complicated some of these arguments can be.

The main contribution of the Alon/Schwartz/Shapira paper is the explicit construction of an expander that is both small (constant degree) and which has a “simple” proof of expansion. I should mention that the celebrated zig-zag product of Reingold-Vadhan-Wigderson already does this. However, their proof (and basically all proofs of expansion) rely on the spectral analysis of the graphs, using the relation between expansion and the gap between the first and second eigenvalues of the adjacency matrix of a graph.

This paper uses a graph product called a replacement product, and presents an elementary (i.e combinatorial) proof that the replacement product of two expanders is also an expander. With that in hand, they construct three graphs, and with two replacement products, get a constant-degree expander.

The invited talk today was by Monika Henzinger, of Google and EPFL. This talk was the “applied” talk (SODA usually has one such); Monika talked about algorithmic success stories at Google, discussing PageRank (and HITS), detecting document duplicates, and load balancing queries on servers. Each of these topics deserves a post on their own (the work on detecting duplicates has some particularly elegant ideas), so I don’t want to go into detail here.

There’s a point worth making here. The success of PageRank and other methods for search is really a success story for algorithmic modelling, rather than algorithms per se.

What do I mean by this ? I mean that the key success of PageRank, to take an example, was the idea that pages could be viewed as nodes, and edges as transitions in a Markov chain, and that relevance (or PageRank) could be modelled as a kind of “return probability”. Of course, once you do this, all your theoretical machinery kicks in, and you can prove bounds on convergence, design fast algorithms, etc. But the main step was the modelling step, where you took the raw data (web pages) and viewed them in a certain abstract framework. For those of you who don’t remember this, the existing paradigm of search at the time was text-based IR, and Altavista was the main exemplar of this. What Google was proposing was a very different way of viewing documents and the problem of relevance.

This is a common, and yet widely misunderstood, aspect of doing “applied” algorithms. You can define all kinds of problems, and write all kinds of papers proving results about these problems. But the mathematical tools developed for solving these problems will always take a backseat to the essentially scientific question of whether the problems and models fit the data correctly or not.

There are many domains where this is not true; cryptography is one domain where provable guarantees are not just nice, but are the crucial element of a secure system. But the success of algorithms in Web search come not from knowing to simulate a Markov chain efficiently, but from realizing that web documents are essentially nodes in a gigantic graph, and that the problem of ranking pages can be translated to a mathematical abstraction on graphs. As an aside, one of the things that the Kleinberg/Tardos textbook appears to do well is walk students through the process of problem abstraction, via the extended real-world exercises.

Understanding this aspect of data modelling changes the questions somewhat. The issue now is not, “Is this the most efficient algorithm for the problem”, but rather, “Is this the right problem for the data” ? The first question will become relevant only when the second one is answered satistfactorily, more akin to a scientific discovery of truth than a mathematical one.

Outtakes:

  • (Thanks to Vijay Kumar) You could, at some point, buy a watch on Amazon.com for the heavily discounted (50% off) price of $499,999. The comments on the product page are hilarious.
  • What’s the title of Britney Spears’ only SODA paper ? “Stable Marriage is Hard”
Published in: on at 8:49 am Leave a Comment

SODA 2007: Day 1

I didn’t always know I wanted to do algorithms; in fact, I came to Stanford thinking I was more interested in AI. One of my most embarrassing moments in graduate school was when I went to talk to my advisor-to-be. He told me he worked in the design and analysis of algorithms, and I asked him if he did design or analysis.

(Note to graduate students everywhere; when someone tells you that no question is a stupid question, don’t believe it)

But after attending Philippe Flajolet’s talk today on “Analytic Combinatorics”, and after hearing Luc Devroye’s talk yesterday, I’m not so sure that my question was off the mark.

A bit of background. What we refer to as “the analysis of algorithms” is usually associated with Don Knuth and the Art of Computer Programming. It referred to the initial methods being developed to analyze structures like hash tables, search trees and the like. Most computer scientists have taken some kind of discrete math class, and have seen the Knuth-Graham-Patashnik “Concrete Mathematics” textbook, and much of the content (basic combinatorics, recurrence relations, generating functions, etc) was used for early algorithm analysis.

These methods were quite precise. It’s not uncommon to look back at papers from the 70s and see detailed constants in front of running times for algorithms; Bob Sedgewick mentioned one such calculation in his introduction to Flajolet’s talk. People didn’t use O() notation like a sledgehammer, the way we do today.

Over time, we became more comfortable with O() notation; algorithms became more sophisticated, and it became harder to figure out actual constants. It didn’t seem to matter as much. After all, when you were coming up with elaborately complicated algorithms that ran in exotic times like O(n^(11/7)), it hardly seemed to matter what the constant was. This was, and has continued to be, the Age of Design.

But all along, with people like Flajolet, Knuth, Sedgewick, and many many others, the work of really analyzing algorithms continued on. ANALCO is an offshoot of this long effort; a way to attract people working on some of the considerably hard problems in this area, while creating some cross-fertilization with the design crowd at SODA. Of course, by no means do I claim that algorithm designers don’t analyze; of course we do. But it’s fair to say that the sophisticated analysis methods and sophisticated design methods do appear to have diverged.

Which brings us to the talk today. The rough theme for his talk was an overview of how the combinatorial problem of counting structures (trees, steps in a linear probe, etc) can be transformed into a generating function, to which then the methods of real, and more recently, complex analysis can be applied. There’s some pretty heavy mathematical machinery being thrown out here: we saw large deviation theory in yesterday’s talk, for example, and there are things Flajolet talked about that I have only the barest inkling of.

Doing such analysis is hard; it’s not as if we’re suddenly going to abandon O() notation. But, as Piotr Indyk pointed out when we discussed this later, computers aren’t getting any faster, and data is getting larger and larger, and it’s more and more true that the actual constants in front of a running time matter, sometimes even more than the asymptotic bound. If more sophisticated analysis methods allow us to reveal algorithm running times more transparently, this also helps repair some of the “bad press” theoreticians can get with more applied folk.

So the analysis of algorithms takes on its original meaning again; there is a conference as well, now in its 13th year, and there’s an upcoming book by Flajolet and Sedgewick that covers much of the mathematics Flajolet refers to in his talk. I looked at it briefly (it’s 753 pages and counting!), and I hope that when it does come out, we learn more about how to use methods from analytic combinatorics to improve analysis techniques for even our run-of-the-mill algorithms.

Outtakes:

  • I’ve quite enjoyed the talks I’ve attended thus far. I haven’t written much about them, but that’s mostly due to laziness on my part. I’ve been quite torn having to navigate the multiple parallel sessions; human cloning, where art thou ?
  • [From a random research paper session] It’s funny to see a speaker struggling with their desire to display EVERY SINGLE SLIDE that they made, when faced with a remorseless clock ticking down to zero. Word of advice: no one really cares if you go through all your slides, or even flip thru them frantically while muttering very fast. They do care if you go over time and drag things ON and ON and ON.
  • Contrary to the general confusion being spread around, the wireless DOES work and it IS free.
  • I don’t like hotels with two towers; especially when I’m in one and the conference is in the other, and ESPECIALLY when the only connection between the two towers is at the lobby.
Published in: on January 8, 2007 at 1:18 am Comments (8)

SODA 2007: Day 0

Getting into New Orleans at 1 am because of “mechanical trouble” meant that I haven’t been at my best so far. But I’ve already heard one amazing talk today.

Luc Devroye gave the ANALCO plenary lecture on “Weighted Heights of Random Trees”, based on work with his students Erin McLeish and Nicolas Broutin. After having sat through many talks with titles like this, I generally approach them with great caution and with a clear escape route. But…

This was an amazing exposition of a topic that could have become dry and terse, and essentially incomprehensible, within a slide or two. He had jokes, (that were funny), a global plan for the material, enough technical material that I went away feeling like I’d learnt something, and intuition galore. And the work itself is very beautiful.

So what was it all about ? The problem is really quite simple to state. Suppose I give you a (random) weighted binary tree, where nodes attach to parents randomly, and edges may have weights chosen randomly. What is the maximum height of such a tree ?

The standard application for such a tool is in analyzing binary search trees. The height of a such a tree controls the running time of an algorithm that needs to use it. And there’s now a vast literature analyzing both the asymptotics of the height distribution (basically it’s sharply concentrated around 2 log n) and the specific constants (the maximum height of a random binary search tree is roughly 4.3 log n, and the minimum is around 0.37 log n).

The “master goal” that Devroye described in his talk was this: Suppose I have a general way of attaching nodes to parents (that leads to a general distribution on subtree sizes), and a general way of attaching weights to edges (rather than being deterministically 1 for binary search trees). Such a general model captures the analysis of tries (trees on strings that are very important in text searching), geometric search structures like k-d trees, and even restricted preferential attachment models in social network analysis (Think of the edges as hyperlinks, and the height of the tree as the diameter of a web tree).

Is there a generic theorem that can be applied to all of these different situations, so that you can plug in a set of distributions that describes your process, and out pops a bound on the height of your tree ? It turns out that you can (with some technical conditions). The method uses two-dimensional large-deviation theory: can you estimate the probability of a sum of random variables being bounded by some function, while at the same time ensuring that some other sum of random variables (that might depend slightly on the first) is also bounded ?

An example of a 1D large deviation result is of course a Chernoff bound. Devroye showed that a 2D large deviation bound for the height of such trees can be expressed in a similar form using the so-called Cramér exponent, something that will probably not be surprising to experts in large deviation theory. After that, the analysis for any tree process becomes a whole lot easier. You have to analyze the corresponding Cramér function for your distributions, and a bound (with a constant; no big-O nonsense here!) pops out.

He also talked about a neat extension of this method to analyzing the “skinnyness” of k-d tree decompositions, showing that for a kind of “relaxed” k-d tree construction, the skinniest cell can be extremely skinny (having a super-linear aspect ratio). It’s the kind of result that I imagine would be very difficult to prove without such a useful general theorem.

Published in: on January 6, 2007 at 8:44 pm Leave a Comment