From the article:
The CVM algorithm, named for its creators — Sourav Chakraborty of the Indian Statistical Institute, Vinodchandran Variyam of the University of Nebraska, Lincoln, and Kuldeep Meel of the University of Toronto — is a significant step toward solving what’s called the distinct elements problem, which computer scientists have grappled with for more than 40 years. It asks for a way to efficiently monitor a stream of elements — the total number of which may exceed available memory — and then estimate the number of unique elements.
“The new algorithm is astonishingly simple and easy to implement,” said Andrew McGregor of the University of Massachusetts, Amherst. “I wouldn’t be surprised if this became the default way the [distinct elements] problem is approached in practice.”
The paper is here and Donald Knuth wrote about it here.
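For anyone who wants to see the mechanics, here's a rough Python sketch of the buffer-plus-coin-flips idea the article describes. The function name and default buffer size are just illustrative, and the paper's low-probability failure case (a halving round that doesn't actually shrink the buffer) is left out:

    import random

    def cvm_estimate(stream, buffer_size=1000):
        """Estimate the number of distinct elements in `stream` using a
        fixed-size buffer and a shrinking sampling probability."""
        p = 1.0        # probability with which incoming items are kept
        kept = set()   # sampled elements; never holds more than buffer_size
        for item in stream:
            kept.discard(item)         # forget any earlier decision about this item
            if random.random() < p:
                kept.add(item)         # re-sample it at the current rate
            if len(kept) >= buffer_size:
                # Buffer is full: keep each element independently with
                # probability 1/2 and halve the rate for everything after.
                kept = {x for x in kept if random.random() < 0.5}
                p /= 2
        # Each surviving element stands in for roughly 1/p distinct values.
        return len(kept) / p

The memory cost stays bounded by buffer_size no matter how long the stream is, which is the whole point.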
That's really the kicker that limits its usefulness for many applications. Not to say there aren't some great ways it can be applied, but there's an old idiom in the USA: "Close Only Counts in Horseshoes and Hand Grenades."
This algorithm is useless if you need complete accuracy.
I don't understand why you say this.
Rather than being useless, this algorithm makes a good estimate (arguably, the best possible estimate) with tight and well-understood error bars, given memory constraints.
In particular, this algorithm is strictly better than deterministic (i.e., completely accurate) algorithms: in every case where a deterministic algorithm has enough space to compute the true answer, this algorithm would give the true answer as well.
“Of course,” Variyam said, “if the [memory] is so big that it fits all the words, then we can get 100% accuracy.”
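Both claims (tight estimates under a memory cap, and an exact answer when memory is big enough) are easy to check with the cvm_estimate sketch from the earlier comment; the stream contents and sizes here are made up:

    import random

    # Plenty of memory relative to the 4 distinct words: the sampling
    # probability never drops below 1, so the answer is exact.
    small = [random.choice(["alpha", "beta", "gamma", "delta"]) for _ in range(100_000)]
    print(cvm_estimate(small, buffer_size=1000))   # -> 4.0

    # 50,000 distinct values pushed through a 2,000-element buffer: the
    # answer is an estimate, but it typically lands within a few percent
    # of the true count.
    big = [random.randrange(50_000) for _ in range(1_000_000)]
    print(cvm_estimate(big, buffer_size=2000))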
This sort of estimate is used all the time for things such as the inexact number of rows in a database table - I think PostgreSQL is an example. Not exactly the same problem, but certainly close.
Edit: Another example is HyperLogLog for this type of probabilistic counting.
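For the PostgreSQL case, the planner's estimated row count lives in pg_class.reltuples (refreshed by VACUUM/ANALYZE), so it can be read without scanning the table. A small Python sketch, with the connection string and table name as placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection details
    with conn.cursor() as cur:
        # reltuples holds the planner's approximate row count; it can be
        # stale (or -1 if the table has never been analyzed), but it costs
        # nothing compared to a full COUNT(*).
        cur.execute(
            "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
            ("my_table",),
        )
        print(cur.fetchone()[0])

And on the HyperLogLog side, Redis exposes the same kind of approximate distinct counting directly through its PFADD/PFCOUNT commands.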
This is really interesting! I don't have an immediate application for it, but I'm definitely going to mentally file it away for later use.
I've always heard this from my dad as "Close only counts in Horseshoes, hand grenades, and H-Bombs" xD