Streaming Summer School Report

(Ed: This post is by Piotr Indyk, who is always willing to be your near neighbor)

Greetings from the Kingdom of Denmark! The country of Vikings, meatballs, and football teams that just refuse to win, has hosted the Summer School on Data Stream Algorithms last week (August 20-23). The school was organized under the banner of MADALGO, a new research center dedicated to MAssive DAta ALGOrithms, set up in Aarhus University. The inauguration ceremony for the center took place on August 24, with several people giving invited lectures.

Muthu (one of the invited lecturers) has covered the inauguration ceremony, so I will skip the detailed description. Suffices to say, it was a pleasure to see that the Danish Research Foundation (or as the locals like to say, Grundforskningsfond) is eager to support an algorithmic research center with a budget of roughly $10M over 5 years, while its US counterpart spends about $7M per year for the entire Theory of Computing program. Did I mention that the population of Denmark is roughly 2% of that of US ?

Anyway, back to the summer school. We had 70+ participants altogether, including 5 lecturers. The school covered the following topics:

  • The dynamic Sudipto Guha gave two lectures. The first lecture was on algorithms for clustering. Massive amounts of data were clustered, including metric data, graph data, and a few careless participants sitting in the first row. In the second lecture, Sudipto covered the “random stream model”, where the elements are assumed to be arriving in a random order, which circumvents the usual worst-case paranoia.
  • The twin duo of T.S. Jayram and Ravi Kumar covered lower bounds: communication complexity, information complexity, and generally “everything you wanted to know but were afraid to ask”. It was the first time I have seen the details of the linear-space lower bound for estimating the L_infty distance, and I am happy to report that I understood everything, or at least that is what I thought at the time. Jayram and Ravi have also occasionally ventured into the land of upper bounds, covering algorithms for the longest increasing sequences and probabilistic data streams.
  • The scholarly Martin Strauss gave an overview of the algorithms for finding frequent elements, heavy hitters (sometimes on steroids) and their more recent versions used in compressed sensing.
  • I have covered the basic upper bounds for the L_p norm/frequency moments estimation, as well as the algorithms for geometric data (clustering, MST, matching), notably those based on core-sets. The latter topic was originally supposed to be covered by Sariel Har-Peled; however, the dark forces highly enlighted and insightful geniuses of the INS [Sariel's corrections] have jeopardized his plans. I guess the force was not strong enough with this one…

We also had an open problem session. Some of the problems were copy-pasted from the “Kanpur list”, but a few new problems were posed as well. The list will be available shortly on the school website, so sharpen your pencils, prepare your napkins, pour some coffee, and … give all of this to your students!

The lecture slides are also available on-line. If you spot any typos, let the lecturers know.

Overall, I think the school has been a success, perhaps with the notable exception of the weather: it started to rain promptly after the school has began, and it stopped when the school has ended. One has to admire the timing though.

SOCG 2009 will be held in Aarhus. See you then!

(Ed: But what about the beer report ?)

Published in: on August 28, 2007 at 8:08 pm Comments (1)

STOC/FOCS/SODA: The Cage Match (with data!)

(Ed: This post, and the attendant web page and data, was initiated as a joint effort of Piotr Indyk and myself. Many others have helped improve the data, presentation and conclusions.)

Inspired by Michael Mitzenmacher’s flamebait post on SODA/STOC/FOCS, we decided to roll up our sleeves, and resolve the biggest outstanding issue in Theoretical Computer Science, namely the great “STOC/FOCS vs. SODA” debate (“P vs. NP” is a tainted second, sullied by all that money being offered to solve it). We have some interesting preliminary observations, and there are many interesting problems left open by our work ;)

The hypothesis:

There is a significant difference in citation patterns between STOC/FOCS and SODA

The plan:

First, we obtained the list of titles of conference papers appearing in STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately excluded 2007 because FOCS hasn’t happened yet. We got this list from DBLP (Note: DBLP does not make any distinction between actual papers and tutorials/invited articles; we decided to keep all titles because there weren’t that many tutorials/invited papers in any case).

For each title, we extracted the citation count from Google Scholar, using a process that we henceforth refer to as “The Extractor”. Life is too short to describe what “The Extractor” is. Suffices to say that its output, although not perfect, has been verified to be somewhat close to the true distribution (see below).

The results, and methodology, can be found at this link. The tables and graphs are quite self-explanatory. All source data used to generate the statistics are also available; you are welcome to download the data and make your own inferences. We’ll be happy to post new results here and at the webpage.

OBSERVATIONS:

The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with years 1999-2001 experiencing the highest variations. It is interesting to note that 1999 was the the year when SODA introduced four parallel sessions as well as the short paper option.

Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest citation counts per year for STOC (years 1997-2001) are all higher than the highest citation count for FOCS (year 2001). (Note: the highest cited STOC article in 2001 was Christos Papadimitriou’s tutorial paper on algorithms and game theory). The variation between SODA and STOC/FOCS in the 1999-2001 range shows up here too, between STOC and FOCS themselves. So maybe it’s just something weird going on these years. Who knows :)

Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.

There are other observations we might make, especially in regard to what happens outside the peak citations, but for that we need more reliable data. Which brings us to the next point.

VALIDATION OF THE DATA:

To make sure that the output makes sense, we performed a few “checks and balances”. In particular:

  • we sampled 10 random titles from each of FOCS, STOC and SODA for each of the 10 years, and for each title we checked the citation count by hand. Results: there were 7 mistakes in FOCS, 9 in STOC, and 11 in SODA, indicating a current error rate in the 9-10% range.
  • for each of FOCS, STOC, SODA, we verified (by hand) the values of the top 10 citation numbers, as reported by The Extractor
  • we compared our stats for the year 2000 with the stats obtained by Michael. The results are pretty close:

Median (Mike’s/Ours) Total (over all 10 years) (Mike’s/Ours)
FOCS 38/38 3551/3315
STOC 21/21 3393/2975
SODA 14/13 2578/2520

A CALL FOR HELP:
Warning: the data displayed here is KNOWN to contain errors (our estimate is that around 10% of citation counts are incorrect). We would very much appreciate any efforts to reduce the error rate. If you would like to help:

  1. choose a “random” conference/year pair (e.g., STOC 1997)
  2. check if this pair has been already claimed in this blog; if yes, go to (1)
  3. post a short message claiming your pair (e.g., “CLAIMING STOC 1997″) on the blog.
  4. follow the links to check the citations. For each incorrect citation, provide two lines: (1) paper title (2) a Google Scholar link to the correct paper
  5. Email the file to Suresh Venkatasubramanian.

(and yes, we know that this algorithm has halting problems). Many thanks in advance. Of course, feel free to send us individual corrections as well.

Citation guidelines: The world of automatic citation engines is obviously quite messy, and sometimes it is not immediately clear what is the “right” citation count of a paper. The common “difficult case” is when you see several (e.g., conference and journal) versions of the same paper. In this case, our suggestion is that you ADD all the citation counts that you see, and send us the links to ALL the papers that you accounted for.


Acknowledgement: Adam Buchsbaum for suggesting the idea of, and generating data for the short-long variants of SODA. David Johnson, Graham Cormode, and Sudipto Guha for bug fixes, useful suggestions and ideas for further plots (which we can look into after the data is cleaned up)

Published in: on July 25, 2007 at 4:20 am Comments (44)

Theoretician, comedy writer, sailboat racer, …

Since Bill brought up Jeff Westbrook (yes, people, I have co-written a paper with a Simpsons screenwriter, thank you very much), I thought that I should mention that he’s now, along with everything else, a sailboat racer !

He’s racing in a race called Transpac, from Long Beach, CA to Honolulu, HI. His boat is called the Peregrine, and it has a 4-man crew. You can track their position (they were leading when I last checked) at http://www.transpacificyc.org/ (go to “track charts,” then click the center logo. when the page loads, hit boat selector, pick division 6, then hit boat selector again, then wait). You can also visit the boat blog (bblog?) at http://peregrine-transpac.blogspot.com/.

One should note that Jeff, being the considerate soul that he is, signed off on revisions to pending papers before he left.

(Source: Adam Buchsbaum)

Published in: on July 18, 2007 at 3:36 am Leave a Comment

SODA vs STOC/FOCS

Michael Mitzenmacher has an interesting post up comparing citation counts for papers at SODA/STOC/FOCS 2000. The results are quite stunning (and SODA does not fare well). For a better comparison, we’d need more data of course, but this is a reasonable starting point.

As my (minor) contribution to this endeavour, here are the paper titles from 1997-2006 (last 10 years) for STOC, FOCS, and SODA. There might be errors: this was all done using this script. Someone more proficient in Ruby than I might try to hack this neat script written by Mark Reid for stemming and plotting keywords trends in ICML papers; something I’ve been hoping to do for STOC/FOCS/SODA papers.

If there are any Google researchers reading this, is there some relatively painless way using the Google API to return citation counts from Google Scholar ? In fact, if people were even willing to do blocks of 10 papers each, we might harness the distributed power of the community to get the citation counts ! If you’re so inspired, email me stating which block of 10 (by line number, starting from 1) you’re planning to do and for which conference. Don’t let Michael’s effort go in vain !

There are 681 titles from FOCS, 821 from STOC, and 1076 from SODA.

Published in: on at 3:03 am Comments (11)

My Biased Coin: A New Blog !!

It’s been a while since I’ve been able to announce a new CS blog. Please point your bookmarks/RSS newsreaders/browsers to Michael Mitzenmacher’s new blog, My Biased Coin. Michael has occasionally posted over here, and has often had comments and suggestions for me. When he’s not brokering peace agreements between power law distributions and log normal distributions, Michael corrupts young grad students at Harvard for a living, filling their minds with all kinds of random bits that are occasionally error corrected.

Published in: on June 8, 2007 at 2:48 pm Comments (2)

SoCG 2007: CUP takes over all of geometry…

The good news from today is that the Demaine-O’Rourke book on folding and linkages is finally out. Erik had a copy of the book at the conference, and it’s a handsome volume that will hopefully soon reside on my bookshelf.

The authors have a web page associated with the book, and it has a number of applets that go along with constructions from the text.

The publisher of this book is Cambridge University Press, which is great, because I love their fonts :) . CUP clearly has a plan for domination of the entire computational geometry catalog: along with the folding book, they are publishing:

Phew ! Looks like CUP will be making a tidy sum off of me. Now if only bloggers got review copies (hint hint hint).

Published in: on at 2:17 pm Comments (1)

SoCG 2007: "Magic Hausdorff Lions Have Happy Endings"

(It’s at the point now where people complain if the business meeting post is NOT up 5 minutes after the actual meeting. Sigh…)

Summary for people in a hurry:

  • Attendance this year was a record low
  • Next PC Chair: Monique Teillaud
  • SoCG 2008 will be at the University of Maryland (rather than Virginia: long story)
  • SoCG 2009 (by a 4-1 margin) will be in Aarhus, Denmark. BEER !!!!!!!!!

And now for the details:

Local News:
Otfriend Cheong was in charge of local arrangements. I have to say that the hotel we are at is quite amazing: It’s on the Bomun Lake, and right outside the hotel is this lake-side path that leads to all the restaurants one can imagine, a tourist village with acres of motorized scooters (!), and some not-disgusting coffee. The hotel facilities themselves are excellent (again, modulo the coffee); much better for a cheaper price in comparison to a US Hotel. And as I sit here writing this, I can hear the rehearsals for our Korean classical music concert tonight.

Unfortunately, there was no one here to enjoy it ! Attendance is down tremendously (111 people), a factor exacerbated by FCRC, being held in a day or two in the exotic locale of San Diego (fish tacos ! blech!), and other related conferences. The relative remoteness of Gyeongju played no small role in this, I imagine.

There was much discussion about whether we should continue to be sponsored by ACM or not (the main issue is cost, and logistics when organizing in a non-US location); no resolution on this point.

PC Stuff: (obvious caveat: I was on the PC this year)
Jeff Erickson presented the usual stats (45/139 papers accepted, out of 286 submitting authors, and with 115+ external reviews). It turns out that the formula for succes at SoCG this year involves

submitting seven papers, with 4 co-authors, from an email address in Pakistan.

The title of this post was composed from words in accepted papers.

Lots of other stats, nothing particularly enlightening.

The Near Future.
Monique Teillaud will chair the SoCG 2008 program committee. The committee has 17 people on it, an unusually large number. She’s hoping to get 4 reviews per paper, so maybe that’s why.

David Mount explained why we had to move from Alexandria, Virginia to UMD for SoCG 2008. Apparently the main hotel we would have used was bought out and is now much more expensive. Boo hoo. On the bright side, UMD will be a lot cheaper in terms of accomodation.

SoCG 2009.

We had two competing bids for SoCG 2009. Aarhus (Beer ! Lego ! Beer ! City life ! Beer!) and Portland (Roses ! Beer ! More Roses ! Powell’s Books ! Microbreweries!).

No, we are not a bunch of boozing alcoholics.

Lars did a bang up job with his Aarhus presentation. John Hershberger did a great presentation on Portland (a great city to visit, btw), but it was no contest. By a 4-1 margin, Aarhus it is !

Open Discussion.
It was getting rather late by the time we got to open discussions. Guenter Rote initiated the question, “Should we have a rebuttal process for reviewing papers”, in response to apparently some aggrieved set of rejected authors.

We had a heated discussion on this point, and although we ultimately went nowhere, it was felt that we should continue things electronically (i.e on blogs). I’ll post at greater length on this issue later, but to summarize the main points pro and con:

Pro:

  • Creates an improved sense of fairness
  • A safety net for potential errors

Cons:

  • Time waste for reviewers
  • Time waste for authors (!) (according to Jack Snoeyink), but I am not convinced that this is a valid argument
  • Won’t make a significant difference

I’m personally dubious whether there’s any measurable benefit to a rebuttal process, but I’m mildly in favor primarily because of the “placebo effect” it has.

Published in: on June 7, 2007 at 2:15 am Comments (6)

Creating a new publication model

I like to gripe about our current conference/journal system, and how ungainly a method it is for publishing and disseminating quality work. Not surprisingly, other areas of computer science appear to have similar issues as well.

Hal Daume, my colleage at the U. of U., has sparked off a rather interesting discussion with his post on NLPers about the need for a new open-access journal for NLP research. The original motivation for his post is partly NLP-specific, but the evolving discussion over at his wiki provides a closeup view of the process of designing a new publication model that sits somewhere between traditional journals and traditional conference.

It’s too early to tell what will come of this, but if their model gains some traction, it might provide a good exemplar for future reform in other disciplines in computer science.

Published in: on May 21, 2007 at 11:33 pm Comments (4)

Silvio Micali elected to the NAS

The National Academy of Sciences just announced 72 new members, among them Silvio Micali from MIT. For those who may not know, he’s famous for his work in pseudorandomness, cryptography, and interactive proof systems. Among his many awards are the Godel Prize in 1993 and being named a Fellow of the AAAS in 2003.

Published in: on May 5, 2007 at 2:47 am Comments (3)

The joy of AMS Notices.

For the past few months, I’ve been reading articles from the Notices of the AMS, something I never did earlier. This is a direct consequence of reading In Theory and Ars Mathematica, both of which will regularly highlight articles from the latest issue of the Notices.

I’m amazed at how many of the articles I find interesting enough to peruse, or even print out to read offline. Most of these are technical articles, on specific topics in mathematics. Quite often, there will be discussions on topics tangentially related to theoryCS or even things that I’m interested in. For example, the May issue of the Notices has an article titled, “How to generate random matrices from the classical compact groups“. Anyone who messes around in geometry and graphics will immediately see the connection to the standard problem of generating a random rotation, and in fact there’s a very interesting explanation of the group theory underlying the standard procedure for generating a random orthogonal matrix.

A second article with the whimsical title, “If Euclid had been Japanese” discusses the constructibility of points using folding operations rather than “Euclidean” constructions. Apart from the obvious origami connections, this is particularly interesting to me because I’ve been experimenting with a short presentation that tries to explain algorithms via origami (folds are like steps in a program, etc..).

I could go on like this. Every issue has at least one article that is both acecssible and interesting to me, and I’m not even a mathematician ! Why can’t we have such a delightful magazine for computer science, or even for theoryCS ?

SIGACT News does a valiant job. And it’s hard work to manage a newsletter, along with all one’s other activities. But it comes out far too rarely, and one doesn’t often find the kind of short vignettes that entertain and illuminate at the same time. I enjoy the algorithms column and the geometry column. I will make valiant attempts to read the complexity column, but it’s often too “structural complexity” for me. But in a way I am unable to articulate, I find SIGACT News somehow less stimulating than the Notices. I wonder if others share this view as well ?

Update (5/3/07): I think I just had an epiphany. The above post is the equivalent of my standing in front of the Model T, wondering why my horse-buggy can’t go any faster. In other words, we already have excellent publications that write expository surveys and cute vignettes that connect the community together. They’re called blogs !!

Published in: on May 3, 2007 at 7:57 am Comments (8)