STOC 2006 results out

List of papers here.

Some quick notes on seeing the papers:

More to (hopefully) come as I peruse the list.

Update: 1/31/06: Sariel points out two more interesting papers:

  • Searching Dynamic Point Sets in Spaces with Bounded Doubling Dimension - Richard Cole and Lee-Ad Gottlieb.
  • A Quasi-Polynomial Time Approximation Scheme for Minimum Weight Triangulation – Jan Remy and Angelika Steger.

    This one is particularly interesting in the light of the new NP-hardness result for MWT by Mulzer and Rote. A quick glance at the above paper indicates that they use a technique akin to the method used by Sanjeev Arora, and Joe Mitchell, to solve TSP in the plane. You divide the plane into small pieces, consider the different triangulations in each piece, and prove a “structure lemma” that describes how the interface of almost-optimal solutions look. Since the algorithm runs in quasi-polynomial time, I imagine the next step is to see if it can be be beaten down to a PTAS.

Credit goes where credit is due. The geometry community often moans and groans about their neglect at the hand of STOC/FOCS committees. This time, there is really nothing to complain about; a lot of nice geometry papers have made it in.

Categories
Published in: on January 31, 2006 at 6:58 am Comments (8)

Collaborative Mathematics and Kepler’s conjecture/theorem

If the 1990s was the decade of the web, the 2000s might very well be the decade of collaborative methods, whether it be social bookmarking, tagging, or recommendation systems. A collaborative system we possibly did not anticipate is combined theorem proving.

Kepler’s conjecture about the optimal packing of balls in 3D was finally proved by Thomas Hales in 1998. Or was it ? Since 1999, a 12-member review committee of the Annals of Mathematics has been going over the proof in order to verify it. The NYT reported nearly two years ago that there were problems in the verification process, and now we hear the details, on Thomas Hales’ Flyspeck page:

Robert MacPherson, editor of the Annals, wrote a report that states

“The news from the referees is bad, from my perspective. They have not been able to certify the correctness of the proof, and will not be able to certify it in the future, because they have run out of energy to devote to the problem. This is not what I had hoped for.”

Interestingly, Gabor Fejes Tóth, the chief referee, wrote a report certifying that he was “99%” confident of the proof, but no more, and the editor therefore concluded that

it’s not enough for complete certification as correct, for two reasons. First, rigor and certainty was what this particular problem was about in the first place. Second, there are not so many general principles and theoretical consistencies as there were, say, in the proof of Fermat, so as to make you convinced that even if there is a small error, it won’t affect the basic structure of the proof.”

Apparently, the Annals will publish a paper on this topic with a disclaimer stating that the computer parts of the proof have not been verified. In the meantime, (and here’s the collaborative aspect of this), Thomas Hales and others have undertaken a joint effort to rewrite his proof in a computer-verifiable form using OCAML. The project is called Flyspeck, and the page contains information on how to “participate” in the proof:

The first step is to learn the relevant background material:

  • Objective CAML,
  • HOL light, and
  • the idea of the proof of the Kepler Conjecture.

See the resources listed at the bottom of the page for more information about these topics.

The second step is to become an experienced user. In practical terms, this means being able to create your own proof in the HOL light system.

If you have successfully proved a non-trivial theorem in the HOL light system, then contact Thomas C. Hales (http://www.math.pitt.edu/~thales) if you wish to participate.

(HT: sigfpe)

Published in: on at 4:24 am Comments (2)

PIR and turning the tables on the NSA

My previous post on PIR generated some interesting comments on the practicality of such schemes and why Google isn’t already doing it. An angle to this that I should have mentioned earlier is the way that PIR-like methods can actually help government snoopers.

David Molnar had pointed to this a while back: it’s a paper by Ostrovsky and Skeith from CRYPTO 2005 titled ‘Private Searching on Streaming Data’. The premise is that you are the NSA or some other intelligence organization, and you want to run searches on various data streams without any adversary being able to detect what your searches are about (presumably so that they can’t game the system to avoid detection). This of course is the same as the PIR paradigm, except with the “good guys” and “bad guys” flipped around.

Categories
Published in: on January 30, 2006 at 5:13 am Comments (8)

Comments RSS feed

Since Blogger continues to ignore my plaintive requests for the most useful feature any blogging software can have, I made a hack based on suggestions made elsewhere.

If you’d like to subscribe to a comment RSS feed (if you posted a comment and want to know if there were any responses, this is particularly handy), throw this link into your blogreader. It’s not has snazzy as the kinds of feeds my wordpress brethren can generate (the feed is not separated by post, for example), but it’s handy. Of course, if you always read this blog directly on the web and/or don’t know what RSS is, ignore this message.

Categories
Published in: on January 26, 2006 at 8:41 pm Comments (16)

Food for thought.

sed -e ’s/physics/computerscience/’ I think it would actually be healthier for theoretical physics these days to take a look at how mathematicians operate, because mathematics has always been a less faddish subject. In mathematics there is much more of a culture where people spread out and devote their lives to thinking hard about something that interests them. There has always been much more of a culture in physics that you want to work on something where you can get results and produce a paper a few months from now. And when the problems are very hard and no one knows what to do, I think people need to be willing to dig in and spend years thinking about something different than what other people are thinking. And there really isn’t the kind of institutional support within the physics community for this kind of behavior, whereas there is in mathematics.

Hits close to home, methinks…

(Source: Discover Magazine interview with Not Even Wrong proprietor Peter Woit)

Categories:
Published in: on at 5:08 am Comments (4)

Private Information Retrieval

Google, MSN and Yahoo were recently served with subpoenas for search data from their databases. The background, from an article by Columbia professor Tim Wu in Slate:

Back in the 1990s, Congress passed a succession of laws designed to keep porn off the Internet. Those laws didn’t work—largely because the courts kept striking them down as violations of the First Amendment. Sick of losing and finding themselves back in the district court defending a reconfigured version of the anti-porn bill (now named the “Child Online Protection Act”), lawyers in the Justice Department decided to hire Berkeley professor Philip Stark. Stark’s assignment: Use statistics to show what everyone already knows—that there’s an awful lot of porn on the Internet.

Stark’s plan is to demand that Google and the other major search engines supply him, and thus the government, with a random selection of a million domain names available for search and a million sample user queries. He will then demonstrate how truly nasty the Internet is, and in particular, just how hard it is to block out stuff that’s “Harmful to Minors.” Faced with this third-party subpoena, Microsoft and Yahoo! agreed to supply Stark with this information, but Google refused, calling the subpoena request “overbroad, unduly burdensome, vague, and intended to harass.”

He goes on to make the point that the problem is not that Google et al were slapped with subpeonas, but that they had the data lying around in the first place. His suggestion: get rid of tell-tale information:

That’s why the public’s demand must be of Google—not the state. It should be that Google please stop keeping quite so much information attached to our IP addresses; please modify logging practices so that all identifying information is stripped. And please run history’s greatest “search and delete,” right now, and take out the IP addresses from every file that contains everyone’s last five years of searches.

Coincidentally, I was chatting with Piotr Indyk the other day at Penn, and he made the very valid point that in a sense, theoretical computer science is already on the case, and has been studying this problem for a very long time. The field of private information retrieval is precisely about this:

Can you make queries from a database without the database knowing what you want ?

It seems impossible (if you don’t already know the answer). But here’s a silly solution:

Ask for all the data from the database ! There’s no way the database will know what you really wanted.

The solution is indeed silly, but why exactly is it ? Well, because you’d need to extract all the n bits of information from the database [1] to get the answer bit [2] for one query So if we think of “stuff the database returns” as a bounded resource, can we do better ?

Well, as you might have guessed, if you want to be deterministic, then not really. But if you can toss coins when asking your queries, while still guaranteeing that the answer is always correct,
and if you assume that you are really querying multiple non-communicating copies of a database (not so unrealistic), then you can do a lot better. For example, if you wanted to know the value of a single bit (the “query”) of a string of n bits (the “database”), you could pick a random subset of the string, and ask the first database for the XOR of the bits. Then XOR the index you want with the set, and ask the second database for a similar XOR. You can XOR the two answer to get your bit, and the databases are none the wiser.

Now this really doesn’t save you much, since you’d send n bits in total (on average) to the servers, but the scheme can be generalized to k servers, in which case you’ll end up with a total transmission of something like k log k n^(1/log k), which is significantly sublinear.

There are many variations: how smart can the servers be at trying to decipher your answers ? (in the above example, even an all powerful server can’t do a thing) ? How much can the servers talk to each other ? what kinds of queries can you ask ? and so on.

One might argue that assuming the servers can’t communicate with each other is an unreasonable restriction. Not so really, especially if you consider that it might be in Google’s interests to facilitate this kind of querying [3]. Another interesting question that I don’t think has been studied would be: can you construct effective link graph structures and search results without needing to retain any information about the pages themselves ?

Piotr made the interesting point that there’s a lot of scope here for a company that wants to innovate in the search space: “Search with our engine! We have no clue what you’re looking for, and we’ll find it anyway!”. In fact, there’s a web page on practical PIR maintained at the Institut für Informatik at the Humboldt-Universität zu Berlin by Prof. Johann-Christoph Freytag and Dmitri Asonov.

Since the business of theoryCS these days is PR, it’s worth pointing out that privacy is rapidly becoming a major area of concern, and work springing out of theoretical computer science is ideally placed to have an impact here. Just sayin’….

I drew my sources from the wonderful survey on PIR written by Bill Gasarch (who’s current guest blogging at the Complexity Blog) for Lance Fortnow’s Complexity Column. He also has a companion web page, and the two are a valuable resource of information on this topic.

(HT: BoingBoing)


[1] Theoretical computer scientist love referring to all sizes as ‘n’. If there are more, we will often use ‘k’, ‘m’, and the often scorned ‘U”.


[2] The one thing we love more than calling things ‘n’ is designing questions with YES/NO answers.


[3] There is another post waiting to be written on why it’s NOT in Google’s best interests to help make private information retrieval a reailty. Such a post would have a heavy dose of phrases like “Web 2.0″, “AdSense”, “AdWords”, “revenue streams” and “targeted search” and would bore everyone to tears.

Categories:
Published in: on at 4:15 am Comments (18)

SODA Day III.

I will attend more talks next year…
I will attend more talks next year…
I wil attend more talks next year…

The morning session talks were moved from their original location because of flooding in the conference room. No, I am not making this up.

So we reach day 3, and by now most people have given up the pretence of attending talks. Alan Frieze’s invited lecture on random graphs was packed though. Most of you probably know what random graphs are: briefly, imagine a graph generated by adding an edge between any pair of vertices with probability p.

There are all kinds of interesting properties one can prove on random graphs. One of the most striking is a sharp “phase transition” that occurs in connectivity, going from a mostly disconnected graph to a connected one, as p gets large enough (slightly larger than log n/n). Random graphs have become a useful way of modelling the Web graph: for example, the preferential attachment process of Barabasi and Albert defines a random graph whose degree sequence follows a power-law, and that mimics many properties of the Web graph.

This was probably the most technical of the three talks, and had (at least hints of) some seriously cool methods in probabilistic analysis. I’ll link to the talk when it comes online.

Outtakes:

  • One of the “pitfalls” of being known as the geomblogger is that when I sit down with my laptop, the default assumption appears to be that I am blogging about the conference.

Categories:
Published in: on January 24, 2006 at 8:20 pm Leave a Comment

SODA Day II…

Today’s invited talk was by Princeton’s Larry Peterson. He spoke about the aging of the Internet, and how it can’t quite cater to the demands of today’s world. One of the challenges of adapting the network is the legacy problem: as he put it, it’s hard to convince Cisco to change the interpretation of one bit in their routers.

One of the issues that seems critically important in the internet of today is the issue of control. The original internet grew “under the radar”. Developed primarily by researchers and technologists, its governing principles were constrained by technical, rather than political, considerations. The next generation of the internet will not have this luxury, and the brouhaha over ICANN is a tiny indicator of the kinds of battles likely looming. Interesting times…

Jeff Erickson did an admirably thorough stream-of-consciousness rendering of the business meeting, which I won’t try to duplicate. There was no beer; absolutely none; nada; zilch; not a drop. Everything seemed pointless and arid after that….

I will say this: I continue to be amazed at the capacity of the theory community to repeat the same arguments on the same topics year after year, while presenting said arguments with the kind of gravitas that suggests deep thought and contemplation. Maybe that’s how we write so many papers. This year’s topic du jour was the always-delightful “submission size formatting” discussion (one of these days I should make separate pages for each of these annoying arguments, debunking all the annoying comments that everyone makes).

Am I irritated ? of course not; my mouth is strangely dry though. Oh yeah, there was no beer.

It was interesting to see Cliff Stein’s list of the most active topics at this conference; there were the usual suspects like approximation algorithms, graph algorithms, and computational geometry, but game theory made a (somewhat) surprising appearance in the top five. In retrospect, this is not surprising, especially when we consider things like the minimax theorem, and the fact that most of theoretical computer science consists of games played against an (often unbounded) adversary. The last few years have seen a steady increase in the number of game theory papers in STOC/FOCS/SODA, and this trend will likely continue.

Categories:
Published in: on at 5:23 am Comments (12)

SODA II: Predictions….

Today’s highlight was the invited talk by Rakesh Vohra titled ‘Predicting the Unpredictable’. It starts from the following very simple questions:

Given a generator that’s producing 0s and 1s, your goal is to predict the next bit. You don’t know anything about the generator, and you have to make some prediction. The measure of your performance is the asymptotic fraction of predictions you make that are correct.

An amazing result is that if the number of 1s in the output thus far is hn, you can come up with a randomized scheme that achieves max(hn, 1-hn) -eps. This result was proved in 1957 by James Hannan, and has apparently been reproved 13 times !

If you think of 1 as “it will rain tomorrow”, then you see why this is an important problem. Of course, weather forecasters usually predict with a probability: there is a 20% chance of rain tomorrow. How do we evaluate the effectiveness of a prediction in that case. The answer is quite elegant, and is called “calibration“. The idea is that you look at the subsequence of 0s and 1s where the forecaster predicted a 20% chance of rain, and check if over this sequence, there was indeed a 20% fraction of rain. Repeat for all probability predictions, weighting by the length of the subsequence.

A stunning result is that no matter what measure of discrepancy you use, and what kinds of subsequences you pick, it is possible to design a strategy that drives the calibration error to zero. In other words, a completely clueless weather forecaster might be a perfect predictor under this measure. Rather disturbing…

In the afternoon, there was an interesting talk by Seshadri Comandur on self-improving algorithms. The idea is that as an algorithm, you want to learn from the input, and then be able to generate answers that are optimal with respect to the distribution the input is being drawn from (think of this as a generalization of average-case analysis). Notice the similarities between this work and the invited talk; here also, you want to adapt your behaviour to the input so that you are optimal for the distribution it comes from.

Outtakes:

  • If you walk out of a skywalk, thru a parking lot, down the elevator, thru the parking lot, via tunnel, you can actually find somewhat passable Starbucks-branded coffee. Trust me, it’s worth the walk :)
  • Jeff Erickson once again has his business meeting drinking game. I have to say that it’s so complicated, I’ll need to print out a sheet with all the rules, and will probably forget them once I get drunk enough…
  • Maybe that’s the point…
  • Can you get drunk before the business meeting ? By my count David Eppstein needs to take 6 drinks for his hotel complaints, plus a special bonus for his innovative complaint about airplane noise at 1am. I myself am at about 3 drinks…

Categories:
Published in: on January 22, 2006 at 10:06 pm Comments (2)

SODA report: ALENEX…

Today is the pre-SODA workshop today; ALENEX (Experimental Algorithms) and ANALCO (Analytic Algorithms and Combinatorics). David Mount was the plenary speaker at ALENEX, and he gave what I thought was an excellent overview of the area of nearest neighbour methods.

As befits an “experimental” talk, it wasn’t your usual survey of techniques. He structured the talk around seven points (the “seven habits of highly effective nearest neighbour searching” as he put it), and skilfully weaved the known universe of NN methods around this structure.

The points: to summarize

  • Grids are good: quite good for many problems
  • Quad trees are better, if you need higher dimensions
  • kd-trees are even better, if you want the best bounds
  • In high dimensions, only locality sensitive hashing can save us.

The central theme was that there are a few basic ideas that work extremely well in near-neighbour searching, and depending on the kind of data you have (uniform, highly skewed, intrinsically high dimensional, ever-changing), some techniques are better suited than others.

Listening to his talk, I realized that we need something like this for clustering methods as well. Like the nearest-neighbour problem, clustering is a fundamental problem in a variety of domains (maybe it’s the 3rd-most important problem in all of science…. more on this later). Like NN, there are a huge number of papers out there on different ways of solving the problem, and like NN, what a practitioner really needs is a way of organizing and classifying the different clustering techniques.

Outtakes:

  • The hotel is in the middle of downtown, which apparently means ‘always down’. The closest restaurant to the hotel is a Burger King, and the next closest is a Checkers. Much anger will be vented at the business meeting on Monday.
  • People who haven’t arrived yet will be happy to know that the wireless in the conference area works fine, and the wireless ghetto is up and running.
  • The SODA proceedings is once again twice as heavy (and probably 5 times as thick) as my laptop, once again raising the question:
    Why can’t we get PDF copies of the papers before the conference, or even via password-protection at the conference itself.

    There is a bigger issue here about electronic proceedings that I don’t really care to get into; all I really want is the ability to read the papers without getting a sprained wrist (or neck).

  • Last year all the proceedings went for a tour of Minnesota, and we ended up downloading papers from a protected SIAM site. People didn’t seem to mind too much.
  • Did I mention that there are few good eating places near the hotel ? There is only one place that looks like it might have decent coffee, and it’s a bit of a hoof away.
  • In a talk early this morning, John Hershberger started his talk by asserting that determining the shape of points was “the most important problem in science”. This apparently stole Dave Mount’s thunder, who was then reduced to asserting that the nearest neighbour problem was “the second most important problem in science”.
  • I like the fact that ALENEX and ANALCO have joint registration, so that we can walk into talks in either workshop. There are a number of interesting talks at ANALCO that I will attempt to attend.

Categories:
Published in: on January 21, 2006 at 5:57 pm Comments (8)