Estimating the \(n^{th}\) percentile of a set

Here is an interesting idea that I had recently. This is just a high-level concept of how it would work; there are no proofs for error bounds or quality, and in fact there are a handful of orderings of sets which would produce terrible results.


To accurately compute the \(n^{th}\) percentile value of a given set of values, one ends up having to sort the values, which, if they are not integers, can take \(O(n \log n)\) time.  However, getting the value itself is then trivial, since it is simply a matter of going to the correct place in the sorted values.  I am thinking that one should be able to compute an estimate of the \(n^{th}\) percentile value for a randomly ordered set of elements in \(O(n)\) time.
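To make the sorted baseline concrete, here is a minimal Python sketch (the function name and the choice of clamping the index are my own, not anything fixed above):

```python
def exact_percentile(values, n):
    """Exact n-th percentile (0 <= n <= 1) by sorting: O(n log n)."""
    s = sorted(values)
    # Clamp the index so that n = 1.0 still maps to the last element.
    idx = min(int(n * len(s)), len(s) - 1)
    return s[idx]
```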

First the basic idea: imagine that you have a set of elements \(X = \{ x_i \}\).  If we had this set sorted as \(S\), then finding the \(n^{th}\) \((0 \le n \le 1)\) percentile of \(X\) would simply be \(s_{n * |S|}\).  This implies that we have \(n * |S|\) elements less than \(s_{n * |S|}\) and \(|S| * (1 - n)\) elements greater.  From this point we can imagine constructing two sets, \(\alpha = \{ x \in X : x < s_{n * |S|} \}\) and \(\beta = \{ x \in X : x > s_{n * |S|} \}\), which represent the elements less than and greater than the \(n^{th}\) percentile value.  This also means that \(\frac{|\alpha|}{|\alpha| + |\beta|} \approx n\).  Now, using this idea for \(\alpha\) and \(\beta\), we can attempt to construct these sets while iterating through \(X\), keeping a current estimate \(s\) of the value and tracking the elements currently in each set.  This essentially becomes: if \(\frac{|\alpha|}{|\alpha| + |\beta|} > n + \epsilon\), then take the current value of \(s\) and insert it into \(\beta\), then take the largest element out of \(\alpha\) and make it the new \(s\).  In the case of \(\frac{|\alpha|}{|\alpha| + |\beta|} < n - \epsilon\), we simply do the reverse by inserting the current value of \(s\) into \(\alpha\), then removing the smallest value from \(\beta\) and making it the new \(s\).
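Here is a rough Python sketch of that exact two-set process, with \(\alpha\) as a max-heap and \(\beta\) as a min-heap (the use of heapq, the variable names, and seeding \(s\) with the first element are my own assumptions):

```python
import heapq

def streaming_percentile(values, n, eps=0.0):
    """Exact version of the two-set idea: alpha is a max-heap of values below
    the current estimate s, beta a min-heap of values above it."""
    it = iter(values)
    s = next(it)            # initial estimate: the first element
    alpha = []              # max-heap via negation: elements below s
    beta = []               # min-heap: elements above (or equal to) s
    for x in it:
        if x < s:
            heapq.heappush(alpha, -x)
        else:
            heapq.heappush(beta, x)
        total = len(alpha) + len(beta)
        # Too many elements below s: lower the estimate.
        while alpha and len(alpha) / total > n + eps:
            heapq.heappush(beta, s)
            s = -heapq.heappop(alpha)
        # Too few elements below s: raise the estimate.
        while beta and len(alpha) / total < n - eps:
            heapq.heappush(alpha, -s)
            s = heapq.heappop(beta)
    return s
```

Each rebalancing step moves exactly one element between the two heaps, so the ratio \(\frac{|\alpha|}{|\alpha| + |\beta|}\) only changes by \(\frac{1}{|\alpha| + |\beta|}\) at a time.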

Now the problem has been reduced to splitting \(X\) into two different sets and keeping them sorted somehow, so that we can get and remove the largest/smallest elements.  However, this would give an exact answer for the \(n^{th}\) percentile.  Given that we only want an estimate, we can imagine capping the size of these sets at \(k\), where \(k\) is a small number such as \(2\).  Then, instead of tracking the elements themselves, we simply count the number of elements that are greater or less than the current value of \(s\).  Additionally, we have the sets track the \(k\) elements that are largest but still less than \(s\), and the \(k\) elements that are smallest but still greater than \(s\).  As we iterate through the set, we track the \(k\) values in \(\alpha, \beta\) and the sizes of \(\alpha, \beta\) accordingly, and when we want to change the value of \(s\) to keep \(\frac{|\alpha|}{|\alpha| + |\beta|} \approx n\), we just take the new value from \(\alpha\) or \(\beta\) respectively and update the cardinality of each set.
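A sketch of the capped version along the same lines: only the counts plus the \(k\) retained values nearest \(s\) on each side are kept, and when a side's buffer runs dry the estimate simply stays put until more elements arrive (the use of bisect, that buffer handling, and all of the names here are my own choices):

```python
import bisect

def approx_percentile(values, n, k=2, eps=0.01):
    """Approximate n-th percentile: keep only counts plus the k values
    closest to the current estimate s on each side."""
    it = iter(values)
    s = next(it)                # initial estimate: the first element
    below, above = [], []       # sorted buffers: k largest values < s, k smallest values > s
    n_below = n_above = 0       # counts of every value seen below / above s
    for x in it:
        if x < s:
            n_below += 1
            bisect.insort(below, x)
            if len(below) > k:
                below.pop(0)    # keep only the k largest
        else:
            n_above += 1
            bisect.insort(above, x)
            if len(above) > k:
                above.pop()     # keep only the k smallest
        total = n_below + n_above
        # Too large a fraction of the data is below s: lower the estimate.
        while below and n_below / total > n + eps:
            bisect.insort(above, s)
            above = above[:k]
            s = below.pop()     # largest retained value below s becomes the estimate
            n_below -= 1
            n_above += 1
        # Too small a fraction of the data is below s: raise the estimate.
        while above and n_below / total < n - eps:
            bisect.insort(below, s)
            below = below[-k:]
            s = above.pop(0)    # smallest retained value above s becomes the estimate
            n_below += 1
            n_above -= 1
    return s
```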

An additional extension to this algorithm: for \(n \approx .999\), the size of \(\beta\) would only be \(\frac{1}{1000}\) the size of the original data set.  Keeping track of only the largest \(.001\) of the data set would still not be linear in the size of the data set; however, it could take out a considerable chunk of the computation, depending on how large or small the value of \(n\) is.
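For example, with \(n\) close to \(1\), one could keep a bounded min-heap of just the top \((1 - n)\) fraction and read the percentile off its minimum; this sketch assumes the total number of elements is known (or can be estimated) up front:

```python
import heapq

def high_percentile(values, n):
    """n-th percentile for n near 1: track only the top (1 - n) fraction."""
    values = list(values)
    keep = max(1, int(round((1 - n) * len(values))))  # size of the top slice
    top = []                       # min-heap holding the `keep` largest values seen so far
    for x in values:
        if len(top) < keep:
            heapq.heappush(top, x)
        elif x > top[0]:
            heapq.heapreplace(top, x)  # evict the smallest of the retained top slice
    return top[0]
```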
