## Estimating the $n^{th}$ percentile of a set

Here is an interesting idea I had recently. This is just a high-level concept of how it would work: there are no proofs of error bounds or quality, and in fact there are a handful of orderings of a set that would produce terrible results.

To compute the exact $n^{th}$ percentile value of a given set of values, one ends up having to sort the values, which, if they are not integers, takes $O(N \log N)$ time for $N$ values.  Once sorted, getting the value itself is trivial, since it is simply a matter of indexing into the sorted values.  I am thinking that one should be able to compute an estimate of the $n^{th}$ percentile of a randomly ordered set of elements in $O(N)$ time.
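For reference, the exact sort-based computation is only a couple of lines.  This is a minimal sketch; the function name and the clamping of the index at the top end are my own conventions:

```python
def percentile_exact(xs, n):
    """Exact n-th percentile (0 <= n <= 1) of xs by sorting: O(N log N)."""
    s = sorted(xs)
    return s[min(int(n * len(s)), len(s) - 1)]
```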

First, the basic idea.  Imagine that you have a set of elements $X = \{ x_i \}$, and let $S$ be this set in sorted order.  Finding the $n^{th}$ percentile ($0 \le n \le 1$) of $X$ then simply becomes reading off $s_{n * |S|}$.  This implies that we have $n * |S|$ elements less than $s_{n * |S|}$ and $|S| * (1 - n)$ elements greater.  From this point we can imagine constructing two sets, $\alpha = \{ x \in X : x < s_{n * |S|} \}$ and $\beta = \{ x \in X : x > s_{n * |S|} \}$, which represent the elements less than and greater than the $n^{th}$ percentile value.  This also means that $\frac{|\alpha|}{|\alpha| + |\beta|} \approx n$.

Using this characterization of $\alpha$ and $\beta$, we can attempt to construct these sets while iterating through $X$ by keeping a current estimate $s$ of the percentile value and tracking the elements currently in each set.  The update rule is: if $\frac{|\alpha|}{|\alpha| + |\beta|} > n + \epsilon$, take the current value of $s$ and insert it into $\beta$, then take the largest element out of $\alpha$ and make it the new $s$.  In the case of $\frac{|\alpha|}{|\alpha| + |\beta|} < n - \epsilon$, we simply do the reverse: insert the current value of $s$ into $\alpha$, then remove the smallest value from $\beta$ and make it the new $s$.
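To make the bookkeeping concrete, here is a minimal sketch of this exact two-set scheme in Python, using a pair of heaps from `heapq` for $\alpha$ and $\beta$.  The function name, seeding $s$ with the first element, and sending ties to $\beta$ are my own choices:

```python
import heapq

def percentile_two_heaps(xs, n, eps=0.0):
    """Track the n-th percentile of a stream exactly via the alpha/beta scheme.

    alpha holds elements below the estimate s (a max-heap, simulated by
    negating values); beta holds elements at or above it (a min-heap).
    """
    it = iter(xs)
    s = next(it)  # seed the estimate with the first element
    alpha, beta = [], []
    for x in it:
        if x < s:
            heapq.heappush(alpha, -x)
        else:
            heapq.heappush(beta, x)
        total = len(alpha) + len(beta)
        # Too large a fraction is below s: push s into beta, pull max(alpha).
        while alpha and len(alpha) / total > n + eps:
            heapq.heappush(beta, s)
            s = -heapq.heappop(alpha)
        # Too small a fraction is below s: push s into alpha, pull min(beta).
        while beta and len(alpha) / total < n - eps:
            heapq.heappush(alpha, -s)
            s = heapq.heappop(beta)
    return s
```

Note that each heap operation already costs $O(\log |X|)$, which is exactly what the capped version below gets rid of.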

Now the problem has been reduced to splitting $X$ into two sets and keeping them sorted somehow, so that the largest/smallest elements can be retrieved and removed.  As described, however, this gives an exact answer for the $n^{th}$ percentile.  Given that we only want an estimate, we can instead cap the size of these sets at $k$, where $k$ is a small number such as $2$.  Then, instead of tracking the elements themselves, we simply count the number of elements that are greater or less than the current $s$ value, and additionally keep two small buffers: the $k$ largest elements that are still less than $s$, and the $k$ smallest elements that are still greater than $s$.  As we iterate through the set, we update the $k$ buffered values and the sizes of $\alpha, \beta$ accordingly, and whenever we want to change the value of $s$ to keep $\frac{|\alpha|}{|\alpha| + |\beta|} \approx n$, we just take the new value from $\alpha$ or $\beta$ respectively and update the cardinality of each set.
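Here is a sketch of this capped version, keeping only two counters and two buffers of at most $k$ values each (the names, the tie-breaking, and the default $\epsilon$ are all illustrative choices, not tuned):

```python
import bisect

def percentile_stream(xs, n, k=2, eps=0.01):
    """Estimate the n-th percentile of a stream in O(|X|) time, O(k) space.

    below/above count the elements on each side of the estimate s; buf_lo
    holds the k largest elements seen below s, buf_hi the k smallest seen
    above it, so s has somewhere to step to when the ratio drifts.
    """
    it = iter(xs)
    s = next(it)             # seed the estimate with the first element
    below, above = 0, 0      # the cardinalities |alpha| and |beta|
    buf_lo, buf_hi = [], []  # sorted lists, each capped at k elements
    for x in it:
        if x < s:
            below += 1
            bisect.insort(buf_lo, x)
            del buf_lo[:-k]  # keep only the k largest elements below s
        else:
            above += 1
            bisect.insort(buf_hi, x)
            del buf_hi[k:]   # keep only the k smallest elements above s
        total = below + above
        if below / total > n + eps and buf_lo:
            # Step s down: the old s joins the beta side.
            bisect.insort(buf_hi, s)
            del buf_hi[k:]
            s = buf_lo.pop()
            below, above = below - 1, above + 1
        elif below / total < n - eps and buf_hi:
            # Step s up: the old s joins the alpha side.
            bisect.insort(buf_lo, s)
            del buf_lo[:-k]
            s = buf_hi.pop(0)
            below, above = below + 1, above - 1
    return s
```

On a uniformly shuffled input this has a reasonable chance of tracking the true percentile; a sorted or otherwise adversarial ordering is exactly the kind of case mentioned above where the buffers run dry and the result can be arbitrarily bad.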

An additional extension to this algorithm: for $n \approx .999$, the set $\beta$ would only be about $\frac{1}{1000}$ the size of the original data set.  Keeping track of the largest $.001$ of the data set exactly would mean the algorithm is no longer linear in the size of the data set, however it could still take out a considerable chunk of the computation compared to a full sort, depending on how large or small the value of $n$ is, while giving back an exact answer.
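A sketch of that extension for $n$ close to $1$, assuming the size of the set is known up front (the symmetric construction with a max-heap would handle $n$ close to $0$):

```python
import heapq

def percentile_topk(xs, n):
    """Exact n-th percentile for n near 1 via a size-capped min-heap.

    Only the top (1 - n) fraction of the data is retained (roughly 0.1%
    of the elements for n = 0.999), and the heap root is exactly the
    s_{n * |S|} from the discussion above.
    """
    m = max(1, len(xs) - int(n * len(xs)))  # how many top elements to keep
    top = []                                # min-heap of the m largest so far
    for x in xs:
        if len(top) < m:
            heapq.heappush(top, x)
        elif x > top[0]:
            heapq.heapreplace(top, x)       # evict the smallest of the top m
    return top[0]
```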