PhD Done!!!

My PhD on Declarative Programming Via Term Rewriting at Johns Hopkins University is done: https://matthewfl.com/phd

This research project was about developing a declarative, weighted logic programming language for machine learning, artificial intelligence, and natural language processing applications. The language is interesting because it allows programs that perform sophisticated probabilistic and symbolic reasoning to be expressed concisely, in a few lines of code. It accomplishes this by allowing the programmer to leave out as many details as possible about how the program is executed. This is similar to a database, which must automatically figure out how to retrieve data given a high-level declarative query.

To make this work, I created a relational algebra that is capable of representing entire programs. To make the system flexible enough to find a good execution strategy, I created a term rewriting approach that uses hundreds of rewrite rules to run the program. This is similar to rewriting the expression “2 + 3” as “5”: the two expressions are semantically equivalent, but the latter is simpler.
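
To make the idea concrete, here is a toy term-rewriting simplifier in Python. This is my own illustrative sketch, not the dissertation’s system (which rewrites relational-algebra expressions rather than arithmetic):

```python
# Toy term rewriter: terms are nested tuples like ("+", ("+", 2, 3), 0).
# Rules replace a term with a semantically equivalent but simpler one,
# repeating until no rule applies (a fixpoint).

def rewrite_step(term):
    """Try one pass of rewriting; return (new_term, changed?)."""
    if isinstance(term, tuple):
        op, *args = term
        # Simplify the children first (innermost-first strategy).
        new_args, changed = [], False
        for a in args:
            na, c = rewrite_step(a)
            new_args.append(na)
            changed |= c
        if changed:
            return (op, *new_args), True
        # Rule 1: constant folding, e.g. ("+", 2, 3) -> 5.
        if op == "+" and all(isinstance(a, (int, float)) for a in new_args):
            return sum(new_args), True
        # Rule 2: additive identity, ("+", x, 0) -> x.
        if op == "+" and new_args[-1] == 0:
            return new_args[0], True
    return term, False

def simplify(term):
    changed = True
    while changed:
        term, changed = rewrite_step(term)
    return term

print(simplify(("+", ("+", 2, 3), 0)))  # prints 5
```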

To make the term-rewriting relational algebra approach tractable, I additionally had to redesign many of the traditional techniques by which programming languages are implemented. For example, I created a new way to think about memoization (dynamic programming) so that it works with our system. Additionally, I created a (JIT) compiler for our term rewrite system, because the naive implementation was too slow for real-world use.
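
As a loose illustration of why memoization matters here (this is ordinary memoization, not the partial-memoization scheme developed in the dissertation): once terms are immutable values, simplification results can be cached and shared between repeated subterms:

```python
_memo = {}

def simplify_memo(term):
    """Cache simplified forms so shared subterms are rewritten only once."""
    if term not in _memo:
        _memo[term] = simplify(term)  # `simplify` from the sketch above
    return _memo[term]
```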

In the end, this was an interesting research project. However, I think the work was set a bit too firmly in the realm of symbolic systems for AI (the AI paradigm of yesteryear), so I do not know how applicable it is to the big neural models that dominate modern AI. Eventually, though, I do think this work may see some use. The reason is that while purely neural systems create really cool demonstrations, they also fabricate information. This creates an issue when these systems are deployed into applications, and that is a problem for their usability in industry. Hence, having a system that incorporates both weighted reasoning (necessary for neural networks) and symbolic reasoning into a single framework is a very powerful programming paradigm.

The dissertation document and recording of the defense are available at: https://matthewfl.com/phd

Dissertation Abstract

I present a new approach to implementing weighted logic programming languages. I first present a bag-relational algebra that is expressive enough to capture the desired denotational semantics, directly representing the recursive conjunctions, disjunctions, and aggregations that are specified by a source program. For the operational semantics, I develop a term-rewriting system that executes a program by simplifying its corresponding algebraic expression.

I have used this approach to create the first complete implementation of the Dyna programming language. A Dyna program consists of rules that define a potentially infinite and cyclic computation graph, which is queried to answer data-dependent questions. Dyna is a unified declarative framework for machine learning and artificial intelligence researchers that supports dynamic programming, constraint logic programming, reactive programming, and object-oriented programming. I have further modernized Dyna to support functional programming with lambda closures and embedded domain-specific languages.

The implementation includes a front-end that translates Dyna programs to bag-relational expressions, a Python API, hundreds of term rewriting rules, and a procedural engine for determining which rewrite rules to apply. The rewrite rules generalize techniques used in constraint logic programming. In practice, our system is usually able to provide simple answers to queries.

Mixing disparate programming paradigms is not without challenges. We had to rethink the classical techniques used to implement logic programming languages. This includes the development of a novel approach to memoization (dynamic programming) that supports partial memoization of fully or partially simplified algebraic expressions, which may contain delayed, unevaluated constraints. Furthermore, real-world Dyna programs require fast and efficient execution. For this reason, I present a novel approach that just-in-time (JIT) compiles sequences of term rewrites using a custom tracing JIT.

Theoretical online voting system

With the election a few days away, I found myself recently looking at the state of voting in America and contemplating that there is still no online voting system in place. The main argument against online or digital voting has been that it would be hard to verify, and that it would take a computer security expert to identify whether something had been tampered with.

Now, creating a system that is “provably incorruptible” would be very difficult, and it would be impractical to expect average poll workers to verify the correctness of such a system. However, there is probably a wide, unexplored range of systems that are better than our current system while still having some easy-to-verify properties. In this post, I attempt to sketch a voting system that is no worse than our current voting system with respect to voter fraud and ensuring that votes are counted.

First, let’s consider the state of our current voting system, specifically vote-by-mail. Step 1 is to go online and register your address along with some identifying voter information. At a later point in time, the state mails a ballot to your address containing a “voting key” that maps the various voting positions (who you would like for some office, or a position on a proposition) to a number in \([1, 200]\). To vote, you bubble in your chosen numbers, wrap your ballot in more paper called a “secrecy sleeve,” put this in another envelope, and mail it to the ballot-counting location. Presumably, once your ballot arrives, someone checks the identifying information on the mailing envelope to prevent duplication and then passes the ballot, still in its secrecy sleeve, to someone else who just counts the votes. This two-level operation prevents people from knowing who you voted for, assuming that the first poll worker doesn’t look at the ballot inside the secrecy sleeve. In terms of ensuring that your vote is counted, we then have to trust the second poll worker to count the votes correctly. We might use more than one person for this second step to prevent errors.

Now, in designing a new system, we have to consider what vulnerabilities exist in the current system, as those can still be tolerated in the new system:

  1. Trusting the United States Postal Service (USPS) to properly deliver mail — If your ballot never makes it back to the polling place, then it is essentially lost (there might be ways to detect that it is lost, but still no real/easy recourse for ensuring that it gets counted)
  2. The USPS needs to get your ballot to you in the first place — If the ballot is sent to the wrong address, it is possible that someone fills in the ballot in your name, forges your signature, and then mails it back in
  3. People are trusted to bubble in their choice correctly — E.g., they are at least able to understand that, given some “number” on a “ballot key,” they are supposed to transfer that number correctly to the ballot itself
  4. A malicious poll worker could prevent a vote that they didn’t agree with from being counted — Given that your vote is easily identifiable on the ballot, it is trivial for someone to reject all ballots with number 10 bubbled in (ideally, two or more people double-check that this does not happen)

Given this set of vulnerabilities in our current system, let’s now try to design a better system that allows for internet voting:

Our first steps would be very similar to the current voting system: someone goes online and registers with their mailing address, and the state then mails a “ballot key” to the provided address. The reason we would still require that something be mailed is that there is currently no good way to securely identify a citizen online; however, as in the current vote-by-mail system, it is acceptable to trust the USPS as a “broker of identities.”

Our vote-by-internet ballot key will be a bit different from existing ballots. Instead of each vote being represented by a number in \([1, 200]\), each vote is a number in \([0, 2^{256})\); and instead of a single number (say 10) representing a position on the ballot, each voter is given a unique number for each option in each position on the ballot. (A sample ballot key is at the end of this post.) We can then use a simple website to collect the keys that represent each person’s choices. Given that each voter has different codes generated for their ballot, we can use untrusted channels to communicate these codes to the vote-counting authority. Additionally, we do not have to worry about a poll worker “suppressing” a vote they disagree with, since the intermediate communication mechanisms do not even know which vote was cast; all they are responsible for is communicating some number to the voting authority. Even if a voter’s computer were infected with a virus, it would be unable to change the vote, since it only knows the key that was entered; the alternate keys exist only on the paper ballot key that was mailed to your address.
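
A minimal sketch of the key-generation and vote-recording side, under my own assumptions (all names and the data layout are illustrative, not a concrete proposal):

```python
import secrets

def make_ballot(voter_id, options):
    """Generate one random 256-bit key per ballot option.

    `printed` is what gets mailed to the voter; `lookup` stays with the
    state's vote-recording authority and is the only place a key can be
    translated back into an actual vote.
    """
    printed, lookup = {}, {}
    for option in options:
        key = secrets.token_hex(32)          # 32 bytes = 256 bits
        printed[option] = key
        lookup[key] = (voter_id, option)
    return printed, lookup

def record_vote(submitted_key, lookup, tally, already_voted):
    """Accept an opaque key from any channel (web, email, text message)."""
    if submitted_key not in lookup:
        return False                         # unknown or forged key
    voter_id, option = lookup[submitted_key]
    if voter_id in already_voted:
        return False                         # one ballot per voter
    already_voted.add(voter_id)
    tally[option] = tally.get(option, 0) + 1
    return True
```

Note that the website (or a virus on the voter’s machine) only ever sees `submitted_key`; the keys for the other options never leave the printed ballot.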

Some properties of this system:

  1. We are still trusting the USPS to properly identify people and communicate information with them securely. (Same as before)
  2. Submitting a vote for someone else still requires receiving or intercepting their ballot and “forging” a signature (Same as before)
  3. The intermediaries do not know your vote (better than before) — Your vote is now a number that is specific to you, so the only people who know the vote are whoever generated the “voting key” and whoever holds the voting key
    1. The intermediaries cannot suppress your vote based on who you voted for — They do not know who you voted for, so your vote cannot be suppressed for that reason
    2. Your vote cannot be changed after the fact — Changing your vote would require that the malicious intermediary have your “voting key book,” which was printed by the state and mailed by the USPS (which is a trusted medium)
    3. Your computer (now technically also an intermediary) cannot change your vote even if it is infected with a virus — Your computer does not know the alternate keys you were provided, since they were printed and mailed, so it cannot switch between them.
  4. The number that you have to enter is a lot longer (worse) — Currently you enter only a number in \([1, 200]\), whereas a 256-bit number is notably longer. Given how people are already used to entering 16-digit credit card numbers, this might not be such a big issue. We could even include checksums to limit erroneous entry (Bitcoin already uses a 32-bit checksum on all addresses); a sketch of such a checksum follows this list.
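
For instance, a Bitcoin-style checksum (Bitcoin’s Base58Check appends the first 4 bytes of a double SHA-256) could be added to each printed key, so a typo is caught locally before the key is ever submitted. A hedged sketch:

```python
import hashlib

def add_checksum(key: bytes) -> bytes:
    """Append a 4-byte (32-bit) double-SHA-256 checksum, as Bitcoin does."""
    check = hashlib.sha256(hashlib.sha256(key).digest()).digest()[:4]
    return key + check

def looks_valid(data: bytes) -> bool:
    """Cheap local check that catches most typos before submission."""
    key, check = data[:-4], data[-4:]
    return hashlib.sha256(hashlib.sha256(key).digest()).digest()[:4] == check
```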

Some might point out that one could set up false voting websites to confuse voters, or perform a DOS attack on the voting website. First, for false websites, we could follow the trend of some banking websites, where an image is displayed to assure you that you are on the correct website; alternatively, we might use a confirmation code that is sufficiently long to be difficult to counterfeit and easy to print on the “ballot key.” As for DOS attacks, we already know how to build systems that can deal with them. Additionally, if we have a confirmation-code system that confirms a vote has been recorded, then any mechanism that takes a voting key and returns the confirmation code is as good as any other. This means you could vote via email or even text message, which are “more difficult” to perform a DOS attack against, or allow third-party websites to spring up to collect votes, since they would still have to be backed by the state’s vote-recording authority.

Sample theoretical ballot key:

Politician         Office     Key                                        Confirmation code
T Sandwich         President  NMosFJizjPgUV2BKEhGE rjvUZzKZVAFCyqPy7w3t  FuT8VDz3z
Giant D            President  Tru4oZn9y3RMnxAsb7g 5Gqs7Fu13FX4ExaQSer6y  bFcCf4MJA
None of the above  President  LaGeinvoBUduEbovp5z JDQJ6DQEdgSqZWgXzArhi  xjzEahMdi

(These politician names are based on the current season of South Park.)

TL;DR: Online voting where your ballot is still mailed to you via USPS, the ballot contains keys that we consider “secure,” and you submit only the one key that corresponds to your vote.

Update / additional background info / other posts on this topic:

While the mathematical concepts in these schemes are sound, it would be difficult to convince the public at large. People would simply have to trust that someone had done the correct thing in designing the voting system. From an academic point of view, if these systems are implemented correctly, there wouldn’t even be a need for vote checkers, since the results would “have to be correct.”

Estimating the n percentile of a set

Here is an interesting idea that I had recently. This is just a high-level concept of how it would work; there are no proofs of error bounds or quality, and in fact there are a handful of input orderings that would produce terrible results.

To accurately compute the \(n^{th}\) percentile value of a given set of values, one ends up having to sort the values, which (if they are not integers) takes \(O(n \log n)\) time. However, getting the value itself is then trivial, since it is simply a matter of indexing into the sorted values. I am thinking that one should be able to compute an estimate of the \(n^{th}\) percentile of a randomly ordered set of elements in \(O(n)\) time.
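
For reference, the exact computation is just a sort plus an index; the question is whether we can avoid the sort:

```python
def exact_percentile(xs, n):
    """Exact n-th percentile (0 <= n <= 1) via sorting: O(n log n)."""
    s = sorted(xs)
    return s[int(n * (len(s) - 1))]
```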

First, the basic idea. Imagine that you have a set of elements \(X = \{ x_i \}\). If we had this set sorted as \(S\), then finding the \(n^{th}\) percentile \((0 \le n \le 1)\) of \(X\) would simply be \(s_{n \cdot |S|}\). This implies that there are \(n \cdot |S|\) elements less than \(s_{n \cdot |S|}\) and \(|S| \cdot (1 - n)\) elements greater. From this point we can imagine constructing two sets, \(\alpha = \{ x \in X : x < s_{n \cdot |S|} \}\) and \(\beta = \{ x \in X : x > s_{n \cdot |S|} \}\), which represent the elements less than and greater than the \(n^{th}\) percentile value. This also means that \(\frac{|\alpha|}{|\alpha| + |\beta|} \approx n\). Using this idea, we can attempt to construct \(\alpha\) and \(\beta\) while iterating through \(X\), by maintaining a current estimate \(s\) of the percentile value and tracking the elements currently in each set. This essentially becomes: if \(\frac{|\alpha|}{|\alpha| + |\beta|} > n + \epsilon\), insert the current value of \(s\) into \(\beta\), then remove the largest element of \(\alpha\) and make it the new \(s\). In the case of \(\frac{|\alpha|}{|\alpha| + |\beta|} < n - \epsilon\), we simply do the reverse: insert the current value of \(s\) into \(\alpha\), then remove the smallest value from \(\beta\) and make it the new \(s\).

Now the problem has been reduced to splitting \(X\) into two sets and keeping them sorted somehow, so that we can get and remove their largest/smallest elements. However, this would give an exact answer for the \(n^{th}\) percentile. Since we only want an estimate, we can instead cap the size of these sets at \(k\), where \(k\) is a small number such as \(2\). Then, instead of tracking the elements themselves, we simply count the number of elements that are greater or less than the current \(s\) value. Additionally, we keep small buffers of the \(k\) largest elements that are still less than \(s\), and the \(k\) smallest elements that are still greater than \(s\). As we iterate through the set, we track the \(k\) buffered values of \(\alpha\) and \(\beta\) and the sizes of \(\alpha\) and \(\beta\) accordingly; when we want to change the value of \(s\) to keep \(\frac{|\alpha|}{|\alpha| + |\beta|} \approx n\), we take the new value from \(\alpha\) or \(\beta\) as appropriate and update the cardinality of each set.
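
Here is a minimal Python sketch of the capped version just described (my own code; the buffer size \(k\), the tolerance \(\epsilon\), and all names are arbitrary choices):

```python
import bisect

def estimate_percentile(xs, n, k=2, eps=0.01):
    """One-pass estimate of the n-th percentile (0 <= n <= 1) of xs."""
    it = iter(xs)
    s = next(it)            # current estimate: seed with the first element
    below, above = [], []   # sorted buffers: k largest < s, k smallest > s
    n_below = n_above = 0   # |alpha| and |beta|

    for x in it:
        if x <= s:
            n_below += 1
            bisect.insort(below, x)
            if len(below) > k:
                below.pop(0)            # keep only the k largest below s
        else:
            n_above += 1
            bisect.insort(above, x)
            if len(above) > k:
                above.pop()             # keep only the k smallest above s

        ratio = n_below / (n_below + n_above)
        if ratio > n + eps and below:
            # Too many elements below s: shift the estimate down.
            bisect.insort(above, s)
            above = above[:k]
            s = below.pop()             # largest value still below s
            n_below -= 1
            n_above += 1
        elif ratio < n - eps and above:
            # Too many elements above s: shift the estimate up.
            bisect.insort(below, s)
            below = below[-k:]
            s = above.pop(0)            # smallest value still above s
            n_above -= 1
            n_below += 1
    return s

# Example: estimate_percentile([random.random() for _ in range(10**5)], 0.95)
# typically lands near 0.95 on uniform data; as noted above, adversarial
# orderings (e.g., already-sorted input) can still produce terrible results.
```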

An additional extension to this algorithm: for \(n \approx .999\), the set \(\beta\) would only be about \(\frac{1}{1000}\) the size of the original data set. Keeping track of the largest \(.001\) of the data set would not be linear in the size of the data set, but it could still take out a considerable chunk of the computation, depending on how large or small the value of \(n\) is.