# Estimating the n-th percentile of a set

Here is an interesting idea that I had recently. This is just a high-level concept of how it would work; there are no proofs of error bounds or quality, and in fact there are orderings of the input that would produce terrible results.

To accurately compute the $$n^{th}$$ percentile value of a set of $$N$$ values, one ends up having to sort the values, which, unless they are integers amenable to something like a radix sort, takes $$O(N \log N)$$ time.  Once sorted, reading off the value itself is trivial, since it is simply a matter of indexing the correct place in the sorted order.  I am thinking that one should be able to compute an estimate of the $$n^{th}$$ percentile of a randomly ordered set of elements in $$O(N)$$ time.
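To make the baseline concrete, here is a minimal sketch of the sort-based approach (the function name and the index convention $$n \cdot (N - 1)$$, which keeps $$n = 1$$ in range, are my choices):

```python
def percentile_by_sorting(xs, n):
    """Exact n-th percentile (0 <= n <= 1) by sorting: O(N log N)."""
    s = sorted(xs)                   # the dominant cost
    return s[int(n * (len(s) - 1))]  # reading off the value is O(1)
```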

First, the basic idea.  Imagine that you have a set of elements $$X = \{ x_i \}$$.  If we had this set in sorted order as $$S$$, then finding the $$n^{th}$$ percentile $$(0 \le n \le 1)$$ of $$X$$ would simply be $$s_{n * |S|}$$.  This implies that we have $$n * |S|$$ elements less than $$s_{n * |S|}$$ and $$|S| * (1 - n)$$ elements greater.  From this point we can imagine constructing two sets, $$\alpha = \{ x \in X : x < s_{n * |S| } \}, \beta = \{ x \in X : x > s_{n * |S|} \}$$, which represent the elements less than and greater than the $$n^{th}$$ percentile value.  This also means that $$\frac{|\alpha|}{|\alpha| + |\beta|} \approx n$$.

Using this concept for $$\alpha$$ and $$\beta$$, we can attempt to construct these sets while iterating through $$X$$: keep a current estimate of the value $$s$$, and track the elements currently in each set.  If $$\frac{|\alpha|}{|\alpha| + |\beta|} > n + \epsilon$$, then take the current value of $$s$$ and insert it into $$\beta$$, then take the largest element out of $$\alpha$$ and set it equal to $$s$$.  In the case of $$\frac{|\alpha|}{|\alpha| + |\beta|} < n - \epsilon$$, we simply do the reverse by inserting the current value of $$s$$ into $$\alpha$$, and then removing the smallest value from $$\beta$$ and setting it equal to $$s$$.
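The rebalancing scheme above can be sketched with two heaps, a max-heap for $$\alpha$$ and a min-heap for $$\beta$$ (the function name and the use of `heapq` are my choices; this is the exact variant, before any size cap is applied):

```python
import heapq

def percentile_exact(xs, n):
    """Exact n-th percentile via the two-set scheme: alpha is a max-heap
    of values below s, beta is a min-heap of values at or above s."""
    it = iter(xs)
    s = next(it)          # current candidate for the percentile value
    alpha, beta = [], []  # alpha stores negated values (heapq is a min-heap)
    for x in it:
        if x < s:
            heapq.heappush(alpha, -x)
        else:
            heapq.heappush(beta, x)
        # Rebalance so that |alpha| / (|alpha| + |beta|) tracks n.
        target = int(n * (len(alpha) + len(beta)))
        while len(alpha) > target:   # too many elements below s: move s down
            heapq.heappush(beta, s)
            s = -heapq.heappop(alpha)
        while len(alpha) < target:   # too few elements below s: move s up
            heapq.heappush(alpha, -s)
            s = heapq.heappop(beta)
    return s
```

For distinct values this returns the same answer regardless of input order, since $$s$$ always sits between the maximum of $$\alpha$$ and the minimum of $$\beta$$; but the heap operations make it $$O(N \log N)$$ overall, which motivates the capped version below.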

Now the problem has been reduced to splitting $$X$$ into two different sets and keeping them sorted somehow, so that we can get and remove the largest/smallest elements.  However, this would give an exact answer for the $$n^{th}$$ percentile.  Given that we only want an estimate, we can imagine capping the size of these sets at $$k$$, where $$k$$ is a small number such as $$2$$.  Then instead of tracking the elements themselves, we simply count the number of elements that are greater or less than the current $$s$$ value.  Additionally, we have the capped sets track the $$k$$ elements that are largest but still less than $$s$$, and smallest but still greater than $$s$$.  As we iterate through the set, we update the $$k$$ tracked values and the sizes of $$\alpha, \beta$$ accordingly, and when we want to change the value of $$s$$ to keep $$\frac{|\alpha|}{|\alpha| + |\beta|} \approx n$$, we just take the new value from $$\alpha$$ or $$\beta$$ respectively and update the cardinality of each set.
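A sketch of the capped variant follows (the function name, the `bisect`-based bookkeeping, and the defaults for $$k$$ and $$\epsilon$$ are my choices; as noted above, there are no error bounds, and unlucky orderings can give poor results):

```python
import bisect

def percentile_estimate(xs, n, k=2, eps=0.01):
    """Streaming estimate of the n-th percentile using O(k) memory.

    alpha holds (at most) the k largest values seen below s, beta the k
    smallest values seen at or above s; n_alpha and n_beta count ALL
    elements that fell below / at-or-above s."""
    it = iter(xs)
    s = next(it)
    alpha, beta = [], []   # sorted lists, capped at k elements
    n_alpha = n_beta = 0
    for x in it:
        if x < s:
            n_alpha += 1
            bisect.insort(alpha, x)
            if len(alpha) > k:
                alpha.pop(0)           # forget the smallest tracked value
        else:
            n_beta += 1
            bisect.insort(beta, x)
            if len(beta) > k:
                beta.pop()             # forget the largest tracked value
        total = n_alpha + n_beta
        if n_alpha > (n + eps) * total and alpha:
            # Too large a fraction below s: move s down one tracked value.
            bisect.insort(beta, s)
            if len(beta) > k:
                beta.pop()
            s = alpha.pop()
            n_alpha -= 1
            n_beta += 1
        elif n_alpha < (n - eps) * total and beta:
            # Too small a fraction below s: move s up one tracked value.
            bisect.insort(alpha, s)
            if len(alpha) > k:
                alpha.pop(0)
            s = beta.pop(0)
            n_alpha += 1
            n_beta -= 1
    return s
```

Each step touches at most $$k$$ tracked values, so the whole pass is $$O(N)$$ for fixed $$k$$.  The estimate degrades when an ordering forces more consecutive rebalancing moves than the $$k$$ tracked values can supply, which is one source of the bad cases mentioned at the start.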

An additional extension to this algorithm: for $$n \approx .999$$, the size of $$\beta$$ would only be $$\frac{1}{1000}$$ the size of the original data set.  Keeping exact track of the largest $$.001$$ of the data would technically not be linear in the size of the data set, but it could still take out a considerable chunk of the computation, depending on how large or small the value of $$n$$ is.