# Probabilistic programming, natively

Quasilinear Research is pioneering a new way of programming.
By writing code for a machine that inherently knows how to make intelligent choices, powerful artificial intelligence becomes available in everyday programming languages.

## Probabilistic ISA

We provide a probabilistic extension to the RISC-V Instruction Set Architecture that offers new machine instructions for expressing nondeterministic/probabilistic choices under constraints/observations. Software programs exploiting these new instructions can express complex search and inference problems without committing to a particular search algorithm. The probabilistic semantics are available to high-level programming languages such as C/C++ and Python.

## Applications

Generate exercises for adaptive digital education that are guaranteed to practice certain skills. Generate puzzles, maps, and scenarios for games that are guaranteed to exercise key game mechanics or drive narrative progression. Recover most likely solutions and enumerate misconceptions. Use statistical user models in simulation to sample behaviors, allowing analytics-driven design prior to deployment.

## Technology

The radically new machine instructions are implemented, not in expensive physical hardware, but in software simulators that apply powerful combinatorial search and optimization algorithms behind the scenes. These systems implement advanced techniques from formal verification and probabilistic inference to sample feasible executions with strong assurances.

# Code Example

### Problem Definition

Suppose we want to generate simple linear equations to be included in a student's mathematics worksheet. We'll generate equations of the form a x + b = c where a and x are small positive numbers and b and c are positive and possibly bigger.

### C Source (what you write)

Let's use the C interface to implement our equation generator. The abstraction it provides is a random number generator with perfect foresight — it will never return values that lead to trouble later on. We'll ask it to uniformly sample values for a, x, b, and c filtered by the condition that the equation actually holds. Simply sampling random integer coefficients without constraint doesn't work because the equation is very likely to have a fractional solution (e.g., 2*x + 1 == 2 implies x == 0.5), a situation that might not be appropriate for students at their current level of knowledge.

```int main() {

if(oracle_acquire()) {                // Fork alternative executions here.

int a = oracle_uniform_int(1, 10);  // Draw integer values from the
int x = oracle_uniform_int(1, 10);  // bounded uniform distribution.
int b = oracle_uniform_int(1, 100);
int c = oracle_uniform_int(1, 100);

oracle_ensure( a*x + b == c );      // Prevent executions where condition does not hold.

oracle_release();                   // Mark this execution as successful.

printf("[a=%d]*[x=%d] + [b=%d] == [c=%d]\n", a, x, b, c);

} else {

printf("no solutions!\n");

}

}
```

### Compiled Machine Code (what gets simulated)

Here's a look at how the C compiler encodes the program above. Because our probabilistic extension follows the RISC-V extension guidelines, existing utilities such as compilers, debuggers, profilers, and disassemblers can read and write probabilistic executables without any modifications. Probabilistic instructions are shown here in bold. At this level, it starts to feel like we really have an Infinite Improbability Drive integrated as physical co-processor.

```0001a434 <main>:
1a434:       0000778b                custom0 a5,zero,zero,0   # oracle_acquire
1a438:       00079863                bnez    a5,1a448 <main+0x14>
1a43c:       0001b537                lui     a0,0x1b
1a440:       a4450513                addi    a0,a0,-1468 # 1aa44 <__clz_tab+0x124>
1a444:       f9df506f                j       103e0 <puts>
1a448:       00100793                li      a5,1
1a44c:       00a00613                li      a2,10
1a450:       02c7f58b                custom0 a1,a5,a2,1       # oracle_uniform_int
1a454:       02c7f60b                custom0 a2,a5,a2,1       # oracle_uniform_int
1a458:       06400713                li      a4,100
1a45c:       02e7f68b                custom0 a3,a5,a4,1       # oracle_uniform_int
1a460:       02e7f70b                custom0 a4,a5,a4,1       # oracle_uniform_int
1a464:       02c587b3                mul     a5,a1,a2
1a468:       00f687b3                add     a5,a3,a5
1a46c:       40f707b3                sub     a5,a4,a5
1a470:       0017b793                seqz    a5,a5
1a474:       0407f00b                custom0 zero,a5,zero,2   # oracle_ensure
1a478:       0600700b                custom0 zero,zero,zero,3 # oracle_release
1a47c:       0001b537                lui     a0,0x1b
1a480:       a2050513                addi    a0,a0,-1504 # 1aa20 <__clz_tab+0x100>
1a484:       e55f506f                j       102d8 <printf>
```

### Execution

The compiled program executes normally until the first special instruction is reached. At this point, various backend system are allowed to examine the processor state and contents of memory. Different backends will work in different ways, but all will produce sequences of data to be returned through calls like oracle_uniform_int that witness a safe execution path that passes through oracle_release without failing any oracle_ensure conditions. Once such witness values have been found, execution continues normally with full programmer observability.

From the end-user's (programmer's) perspective, the program simply only outputs valid results, drawing a new sample every time:

```\$ riscv64-unknown-elf-gcc -O3 example.c -o example
\$ file example
example: ELF 64-bit LSB executable, version 1 (SYSV), statically linked, not stripped
\$ spike \$INFERENCE_CONFIG pk ./example
[a=2]*[x=6] + [b=61] == [c=73]
\$ spike \$INFERENCE_CONFIG pk ./example
[a=9]*[x=6] + [b=14] == [c=68]
\$ spike \$INFERENCE_CONFIG pk ./example
[a=4]*[x=3] + [b=77] == [c=89]
```

### Inference

Let's consider two backend systems that are up to the task of sampling feasible executions. The first, based on Particle Filtering, is easier to understand because it's Rejection Sampling approach mimics how we, as humans, might try to draw samples by simply trying many guesses. Upon reaching oracle_acquire, we'll ask the operating system to fork a copy of the currently running process and set a different random seed. When the forked process needs to execute oracle_uniform_int, it simply uses existing random number generators. If the oracle_ensure check fails, the forked process exits without yielding a sample. If oracle_release is successfully encountered, the forked process relays the sequence of random values it used to the master process and then exits as well. Eventually, the master process will get the data it needs to safely continue execution where the programmer might be watching it in a debugger or waiting to capture its output in a file, and samples will have been drawn with the correct probability.

A refined variant of this approach (Particle Cascade) powers tools like Probabilistic C. A backend system based on this approach will be very fast to start up and will quickly return solutions if the search problem is easy: when a significant fraction of the potential executions are feasibile (oracle_ensure checks pass). For hard problems, where the chance of hitting a feasible execution via forward sampling is astronomically low, an approach that can exploit the deep structure of the search problem is necessary.

In the second approach, we'll use formal verification techniques to seek out those extremely rare solutions. Software model checkers, such as CBMC, work by transforming the program into a constraint satisfaction problem. High-performance solvers for these problems then either find a satisfying solution (the ingredients of a feasible execution trace) or formally prove that no solutions exist. In exchange for longer start-up times, these approaches provide the best available performance on hard search problems. Care is needed to ensure that the solutions resulting from this process arise with the correct distribution, however recent research on hashing-based approximate probabilistic inference provides a solid theoretical foundation here. For resource constrained environments that cannot host the solving process, this computation can be securely delegated to an external service in a new kind of cloud computing market.

Because our users will be expressing both easy and hard search problems — what matters is that the problems are easy for human programmers to express in high-level languages — Quasilinear Research provides several backend inference systems that make different design trade-offs. Newly developed backends will be compatible with existing compiled software that targets the fixed Probabilistic ISA. Defining search problems at the level of machine language allows third-party inference providers to exploit a white-box specification of the search problem without requiring users to reveal their private source code. Users may program in the high-level language of their choice, even if that language is internal to their organization (presuming it can be compiled for or interpreted by our probabilistic machine abstraction).

### Probabilistic Processors

Quasilinear Research is developing a probabilistic processor model, defined in the Chisel hardware description language, to capture the execution of user-level software in terms of the gate-level logic primitives understood by existing search and inference back-ends. Depending on the type of work performed in the user's probabilistic program, different probabilistic processor models will be appropriate (e.g., including or excluding hardware support for complex math operators, and varying the effective number of parallel hardware threads or harts).

# Contact

Adam M. Smith, PhD. Founder, Quasilinear Research. (adam@quasilinear.com)