Predicate Logic Solving: TPTP vs SMT

Many interesting problems can be presented declaratively using predicate logic. Real-life examples include scheduling, logistics, and software and electronic circuit verification. The problems are hard, and logic provides a way to state them declaratively: a logic solver takes the problem declaration as input and spits out the solution.

Examples of solvers are the famous Z3 and GKC. Z3 uses the SMT-LIB language to specify the problem. GKC is from a different family and uses the TPTP language. The languages are quite different, and it seems to me that SMT-LIB is geared towards propositional logic while TPTP is built for predicate logic. I couldn’t find any simple comparison, so I made one.

Our toy predicate logic problem is as follows. We have four people: Agnetha, Björn, Benny, and Anni-Frid (“Frida”), and one binary predicate knows(x,y). We know a priori that:

  • knows(Agnetha, Björn), that is, Agnetha knows Björn.
  • knows(Benny, Björn)
  • ∀x∀y(knows(x,y)→knows(y,x))
  • ∀y¬knows(Frida,y)

Can we say for sure that Agnetha doesn’t know Frida? That is, does the logical consequence S ⊨ ¬knows(Agnetha, Frida) hold? Simple, but how do we write the structure in SMT-LIB and TPTP?

First, TPTP. The syntax fits this kind of problem very nicely and is clear. Running: bin/gkc ./abba.tptp

fof(formula1,axiom, (knows(agnetha,bjorn))).
fof(formula2,axiom, (knows(benny,bjorn))).
fof(formula3,axiom, (! [X] : ! [Y] : ((~ knows(X,Y)) | knows(Y,X)))).
fof(formula4,axiom, (! [X] : ~ knows(frida, X))).
% Proof by contradiction
fof(formula5,conjecture, (~ knows(agnetha,frida))).

Next, the same in SMT-LIB. Run with bin/z3 -smt2 abba.smt

; https://smtlib.github.io/jSMTLIB/SMTLIBTutorial.pdf
;(set-logic AUFLIA)
(declare-sort A 0) ; A new sort for persons
(declare-fun knows (A A) Bool)
(declare-const agnetha A)
(declare-const bjorn A)
(declare-const benny A)
(declare-const frida A)
(assert (knows agnetha bjorn))
(assert (knows benny bjorn))
(assert (forall ((x A) (y A)) (=> (knows x y) (knows y x))))
(assert (forall ((x A)) (not (knows frida x))))
;(check-sat)
;(get-model)
; assert the negation of the conjecture
(assert (knows agnetha frida))
(check-sat)

There we have it, the same problem solved in both TPTP and SMT-LIB. In both cases the solver shows that the axioms together with the negated conjecture are unsatisfiable, so the consequence ¬knows(Agnetha, Frida) indeed holds.

This is interesting because the solvers are getting really capable nowadays. Geoff Sutcliffe has written a nice piece about Automated Theorem Proving and how it can be utilized to solve age-old problems.

ATP is thus a technology very suited to situations where a clear thinking domain expert can interact with a powerful tool, to solve interesting and deep problems.

Geoff Sutcliffe

Buying the More Expensive Option

After a visit to a bicycle shop I realised that I need to increase my budget. It makes sense to buy the more expensive bike, as it’s more fun, nicer to ride, and I totally need the fancy features, right?

With this premise I was happy to find Olof Hoverfält’s post about data-supported decision making. In this genius piece, Olof uses his wardrobe as a case example of value versus cost. Through meticulous data collection over three years (and counting) he is able to make informed statements about clothing categories, quality, pricing, value, cost of preference, and actual frequency of use. The significance of the post is that it explains important concepts of informed decision making in familiar terms and a relatable context.

Let’s take a look at some highlights.

Real cost. Expensive can be cheap and vice-versa. You can’t know the real cost of an item unless you know it all: purchase price, depreciation rate (or lifetime + value at divestment), actual frequency of use, and quality. A pair of shoes may cost a lot, but if they’re used daily during the looong winter and they can take it (durability), they turn out very cost-effective.

Category differences. There may be subtle differences between seemingly similar categories. Every item in the knitwear category is available for wearing every day (an unrestricted category). Underwear may spend days in the wash cycle after use, becoming available again only after a significant delay (a restricted category). The value of investments can’t be compared directly across categories because the competitive attributes are different.

Cost of quality. The definition of quality is not obvious. Durability is a factor, so it makes sense to buy cheap and durable items. But it makes no sense to buy cheap and durable items that are used very rarely. An expensive shirt may not be cost effective but has other attributes: nicer style and cut, better details and materials, etc. There is a cost related to perceived quality, and the cost can be quantified. In Olof’s post, the cost of “fancy shirts” is 500 euros per year.

Value of long term data. You’d think that after a year of daily tracking you’d have a pretty good data set for making informed decisions about something as simple as clothes. Not so: Olof’s analysis after a year is very different from after three years, and the results keep changing as more data flows in. The actual frequency of use may be very different from the estimate. In terms of data the saying holds: The best time to plant a tree is 20 years ago, the second best time is today.

The data collection template is available at https://hoverfalt.github.io/.

Bayesian Data Analysis of Capacity Factor

Stan is a platform for probabilistic programming. To demonstrate its features I did a data analysis of wind energy capacity factors in Finland. Wind energy is feasible in Finland, and we have quite high seasonal variance, so modeling wind data makes an interesting case. The case study walks through a Bayesian data analysis process from data and modeling through model diagnostics to conclusions.

Statistical modeling on a modern computing platform such as Stan lets you construct the model quite freely; you can all but ignore constraints such as conjugate priors. Stan’s implementation of Hamiltonian Monte Carlo can generate reliable estimates of very hard integrals, so you can pay attention to the model at hand instead of computational constraints.
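
To give an idea of how freely a model can be written down, below is a minimal sketch of fitting a beta distribution to monthly capacity factors with PyStan (assuming the PyStan 2.x interface). This is not the model from the report: the beta likelihood, the weakly informative priors, and the random placeholder data are assumptions made just for the illustration.

import numpy as np
import pystan  # assuming the PyStan 2.x interface

# A deliberately non-conjugate model: capacity factors in (0, 1) as Beta(a, b)
model_code = """
data {
  int<lower=1> N;
  vector<lower=0, upper=1>[N] cf;   // monthly capacity factors
}
parameters {
  real<lower=0> a;
  real<lower=0> b;
}
model {
  a ~ normal(2, 5);                 // weakly informative priors
  b ~ normal(2, 5);
  cf ~ beta(a, b);
}
"""

# Placeholder data standing in for three years of monthly capacity factors
cf = np.clip(np.random.rand(36), 0.01, 0.99)

sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data={"N": len(cf), "cf": cf}, iter=2000, chains=4)
print(fit)  # posterior summaries for a and b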

The full report is here: Wind Power Generation Efficiency and Seasonality. One reviewer of the report said it well: “In many cases, modeling itself only produces more development ideas than the answers themselves, which is also very evident in this work.”

Original data is from Fingrid (data.fingrid.fi, license CC 4.0 BY).

Covid-19 False Positives

Lab test false positive rates may feel counter-intuitive. Let’s take a closer look at the state-of-the-art Covid-19 real-time PCR test.

In Interpreting a covid-19 test result (Watson et al., The BMJ, May 2020) the authors say that the sensitivity of the test is between 71% and 98%, and the specificity around 95%.

The English statistics authority estimates that in August 2020 about 0.5‰ of the population had the virus. In Finland, THL estimates that there have been a bit under 8000 cases, which would be 1.5‰ of the population. Of these, most have already recovered, and the current prevalence is around 0.03‰, i.e. roughly an order of magnitude lower than in England.

What do the numbers mean in practice? If we pick a random person and the test shows a positive result, what is the probability that the person is actually healthy? Let T = positive test result, and V = has the virus. The BMJ article uses a sensitivity of 70% for real-life testing. Let’s be generous and say that 1‰ of the population has the virus. Then, according to Bayes’ theorem, there’s a 99% chance that the positive result is a false positive!

How about the opposite case? Pick a random person, and the test shows negative. What is the probability that the person has the virus anyway? It’s about 0.03%.
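
Both numbers can be checked with a few lines of Python, plugging the figures above (sensitivity 70%, specificity 95%, assumed prevalence 1‰) into Bayes’ theorem:

sens = 0.70   # P(T | V), real-life sensitivity from the BMJ article
spec = 0.95   # P(not T | not V), specificity
prev = 0.001  # P(V), assumed prevalence of 1 per mille

# P(T) by the law of total probability
p_pos = sens * prev + (1 - spec) * (1 - prev)
# P(not V | T): healthy despite a positive test
print("P(healthy | positive) = %.3f" % ((1 - spec) * (1 - prev) / p_pos))  # ~0.99

# P(not T) and P(V | not T): infected despite a negative test
p_neg = (1 - sens) * prev + spec * (1 - prev)
print("P(virus | negative) = %.5f" % ((1 - sens) * prev / p_neg))  # ~0.0003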

The key above is the “random person”. The calculations show that there’s no point in testing everyone. In reality, the tested patients are not picked randomly; they are, and should be, chosen based on their exposure to the virus and/or relevant symptoms.

Functional Programming in Elm—First Impressions

I built a serverless (progressive) web app, ppkk.fi, using Elm. Elm is a functional language that compiles to JavaScript, so you can make web apps and components to use on web sites. Elm is not an extension of JavaScript; it is a stand-alone programming language with its own compiler and standard libraries.

Getting started wasn’t easy. Was the steep learning curve worth it? At this point I’d say it was. A quick list of first impression pros and cons:

Pros

  • When it compiles, it works.
  • Elm-UI makes it possible to avoid complex CSS.
  • Forces good program structure.
  • Pure functions and static types make runtime errors go away.

Cons

  • S l o w to rapidly test new app features. Always have to tie up all loose ends.
  • Have to build up from low level. In JS you’d just npm install a package.

Elm’s pure functions and immutable values forced me to focus on program states, and write functions that are very explicit about the state changes. It was difficult to get right in the beginning. After some UML state diagramming I got the first version running. I quickly realized my first state structure was lacking and started to refactor the code. That was definitely Elm’s strength: refactoring was not the nightmare it often is with e.g. JavaScript. Elm’s compiler checked everything for me, pointed out errors, and I could trust that whenever compiler errors were fixed, I had a pretty solid program up and running again.

For example, I originally had a user data record that was passed to the functions building the user interface views. It worked but was awkward with side effects like saving to the database. So I tagged the data record with a custom type RemoteData that explicitly models writes and loads. With RemoteData User it was nice to build a user interface that doesn’t leave the user wondering whether something is happening or not.

It was even nicer to find out about phantom types. I could use a phantom type to restrict function parameters to e.g. only “write done” RemoteData User. So the compiler would check, at compile time, that a function is called only with Users whose data are safely in the database. Proper types and the compiler’s type checking would help me write an app with no runtime errors!

Conclusion: for a simple app Elm was a delightful experience. The result is fast and efficient, and I have a feeling that the code will be reasonably easy to maintain. Building the user interface with Elm-UI was especially great: I spent much less time than usual tweaking CSS.

-- 1st version, won't work with side effects
userUpdated : User.Model -> El.Element Msg
userUpdated user = 
  -- ...


-- 2nd version, with remote data
userUpdated : RemoteData User.Model -> El.Element Msg
userUpdated rUser =
  case rUser of
    Loading -> -- Show spinner
    Stored -> -- Show the updated view


-- 3rd version with phantom types: the type parameter only tags
-- the record's state and carries no data of its own
type ValidData state = ValidData User.Model
type Loading = Loading
type Saved = Saved

-- accepts only user data that has already been saved
userUpdated : ValidData Saved -> El.Element Msg
userUpdated (ValidData user) =
  -- Now we have a guaranteed valid user record
  -- ...

8 Requirements of Intelligence

What is intelligence, in the context of machine learning and AI? A classic from 1979, Hofstadter’s GEB, gives eight essential abilities for intelligence:

  1. to respond to situations very flexibly
  2. to take advantage of fortuitous circumstances
  3. to make sense out of ambiguous or contradictory messages
  4. to recognize the relative importance of different elements of a situation
  5. to find similarities between situations despite differences which may separate them
  6. to draw distinctions between situations despite similarities which may link them
  7. to synthesize new concepts by taking old concepts and putting them together in new ways
  8. to come up with ideas which are novel

It seems to me that the keyword is “flexibility”. Our world is complex, and a creature must be able to act in an infinite variety of circumstances. Sometimes a simple rule is enough, sometimes you need a combination of rules, and sometimes a totally new rule is required.

Hofstadter’s usage of the term stereotyped response got me thinking about Kahneman’s System 1 and 2. In those terms, it seems that System 1 covers all eight abilities. System 1 is fast thinking, applying a stereotypic solution to a situation. No actual reasoning or logical “thinking” is required to fulfil the requirements. However, the stereotypic solutions or rules must be flexible.

Two Views on Zoning Laws

By coincidence I read two views on zoning. This resonated with me because I hadn’t thought of zoning as being optional, that is, that there could be Western cities without proper zoning.

First, about San Francisco:

Imagine you’re searching for an apartment in San Francisco – arguably the most harrowing American city in which to do so. The booming tech sector and tight zoning laws limiting new construction have conspired to make the city just as expensive as New York, and by many accounts more competitive.

Second, about Houston:

Houston is the largest city in the United States without any appreciable zoning. While there is some small measure of zoning in the form of ordinances, deed restrictions, and land use regulations, real estate development in Houston is only constrained by the will and the pocketbook of real estate developers. […] This arrangement has made Houston a very sprawled-out and very automobile-dependent city.

Having lived in the capital area of Finland, which I surmise suffers from a similar effect as San Francisco, I wonder whether there are cities that strike the balance right. The housing costs are (too?) high, but I enjoy the beautiful, walkable city.

Visualizing Neural Net Using Occlusion

I got good results using an RNN for a text categorization task. Then I tried a 1D CNN for the same task. To my surprise, I got even better results, and the model was two orders of magnitude smaller. How can such a lightweight model perform so well?

Out of curiosity, and also to verify the results, I wanted to visualize what the neural network was learning. This being one-dimensional text data the convolution filter visualizations were not interesting. For image data, check out Understanding Neural Networks Through Deep Visualization.

I turned to occlusion, as described in Matthew Zeiler’s paper. To find out what a neural network is recognizing, we occlude parts of the input to the network. When the model’s output deviates, the occluded part is significant to the prediction. I haven’t seen occlusion used for text data but decided to give it a go.

Test case: categorize sports team names by age. For example:

 "Westend Old Boys" -> "adult";
 "Indians G Juniors" -> "children";
 "SV Alexanderplatz U7" -> "children";
 "Karlsruhe Erste Männer" -> "adult";

Occlusion Visualization

The model outputs the probability of a junior team. In Finland, D-boys are typically 13-14 years old. When the occluder slides over the characters that signify the age group, i.e. “_D_”, the probability drops drastically. On the other hand, occluding parts of the town name “Rosenburg” makes very little difference. It seems that the model is truly identifying the relevant region in the input data.
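
For reference, here is a minimal sketch of how the occlusion sweep itself can be implemented for character-level text. The predict_junior_probability argument is a hypothetical stand-in for the trained model’s forward pass; it is not code from the actual project.

import numpy as np

def occlusion_scores(predict_junior_probability, name, window=3, pad=" "):
  # Baseline prediction for the unmodified team name
  base = predict_junior_probability(name)
  scores = []
  for i in range(len(name) - window + 1):
    # Replace a window of characters with padding and re-run the model
    occluded = name[:i] + pad * window + name[i + window:]
    # A large drop means the occluded characters mattered for the prediction
    scores.append(base - predict_junior_probability(occluded))
  return np.array(scores)

# e.g. occlusion_scores(model_fn, "Rosenburg D Boys") should peak around " D "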

Why Do Loss and Accuracy Metrics Conflict?

A loss function is used to optimize a machine learning algorithm. An accuracy metric is used to measure the algorithm’s performance in an interpretable way. It goes against my intuition that these two sometimes conflict: loss gets better while accuracy gets worse, or vice versa.

I’m working on a classification problem and once again got these conflicting results on the validation set.

Loss vs Accuracy graph

Accuracy (orange) finally rises to a bit over 90%, while the loss (blue) drops nicely until epoch 537 and then starts deteriorating. Around epoch 50 there’s a strange drop in accuracy even though the loss is smoothly and quickly getting better.

My loss function here is categorical cross-entropy, which is used to predict class probabilities. The target values are one-hot encoded, so the loss is lowest when the model’s output is very close to 1 for the right category and very close to 0 for the other categories. The loss is a continuous value.

Accuracy, or more precisely categorical accuracy, assigns a discrete true or false value to each sample: does the highest predicted probability fall on the correct category?

It’s easy to construct a concrete test case showing conflicting values. Case 1 has less error but worse accuracy than case 2:

Loss vs Accuracy example spreadsheet
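
The spreadsheet itself is an image, so here is a small numpy reconstruction of the same kind of case; the numbers are illustrative, chosen only to produce the conflict, and are not the ones in the original sheet.

import numpy as np

def cce(truth, pred):
  # Mean categorical cross-entropy; truth is one-hot encoded
  return -np.mean(np.sum(truth * np.log(pred), axis=1))

def accuracy(truth, pred):
  # Categorical accuracy: fraction of samples where argmax matches
  return np.mean(np.argmax(truth, axis=1) == np.argmax(pred, axis=1))

truth = np.array([[1., 0., 0.]] * 3)

# Case 1: two very confident correct predictions, one narrowly wrong
case1 = np.array([[0.95, 0.03, 0.02],
                  [0.95, 0.03, 0.02],
                  [0.45, 0.54, 0.01]])
# Case 2: every prediction barely correct
case2 = np.array([[0.40, 0.30, 0.30]] * 3)

print(cce(truth, case1), accuracy(truth, case1))  # ~0.30 loss, accuracy 0.67
print(cce(truth, case2), accuracy(truth, case2))  # ~0.92 loss, accuracy 1.00

Case 1 has the clearly lower loss, yet case 2 gets every sample right.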

For reference, calculating categorical cross-entropy in Keras for one sample:

from keras import backend as K

truth = K.variable([[1., 0., 0.]])
prediction = K.variable([[.50, .25, .25]])
loss = K.eval(K.categorical_crossentropy(truth, prediction))

In what kind of situations does loss-vs-accuracy discrepancy occur?

  • When the predictions get more confident, loss gets better even though accuracy stays the same. The model is thus more robust, as there’s a wider margin between classes.
  • If the model becomes over-confident in its predictions, a single false prediction will increase the loss disproportionately compared to the (minor) drop in accuracy. An over-confident model can have good accuracy but bad loss. I’d assume over-confidence goes hand in hand with over-fitting.
  • Imbalanced distributions: if 90% of the samples are “apples”, the model gets a good accuracy score simply by predicting “apple” every time.

The accuracy metric is easier to interpret, at least for categorical training data. Accuracy, however, isn’t differentiable, so it can’t be used for back-propagation by the learning algorithm. We need a differentiable loss function to act as a good proxy for accuracy.

Spell Out Convolution 1D (in CNN’s)

I’m working on a text analysis problem and got slightly better results using a CNN than an RNN. The CNN is also (much) faster than a recurrent neural net. I wanted to tune it further but had difficulties understanding Conv1D at the nuts-and-bolts level. There are multiple great resources explaining 2D convolutions, see for example CS231n Convolutional Neural Networks for Visual Recognition, but I couldn’t find a really simple 1D example. So here it is.

1D Convolution in Numpy

import numpy as np
conv1d_filter = np.array([1,2])
data = np.array([0, 3, 4, 5])
result = []
for i in range(3):
 print(data[i:i+2], "*", conv1d_filter, "=", data[i:i+2] * conv1d_filter)
 result.append(np.sum(data[i:i+2] * conv1d_filter))
print("Conv1d output", result)

[0 3] * [1 2] = [0  6]
[3 4] * [1 2] = [3  8]
[4 5] * [1 2] = [4 10]

Conv1d output [6, 11, 14]

The input data is four items. The 1D convolution slides a size-two window across the data without padding, so the result is an array of three values. In Keras/TensorFlow terminology the input shape is (1, 4, 1), i.e. one sample of four items, each item having one channel (feature). The Convolution1D kernel shape is (2, 1, 1), i.e. one filter of size two over one channel.

1D Convolution Kernel Size 2

The Same 1D Convolution Using Keras

Set up a super simple model with some toy data. The convolution weights are initialized to random values. After fitting, the convolution weights should be the same as above, i.e. [1, 2].

from keras import backend as K
from keras.models import Sequential
from keras.optimizers import Adam
from keras.layers import Convolution1D
K.clear_session()
toyX = np.array([0, 3, 4, 5]).reshape(1,4,1)
toyY = np.array([6, 11, 14]).reshape(1,3,1)

toy = Sequential([
 Convolution1D(filters=1, kernel_size=2, strides=1, padding='valid', use_bias=False, input_shape=(4,1), name='c1d')
])
toy.compile(optimizer=Adam(lr=5e-2), loss='mae')
print("Initial random guess conv weights", toy.layers[0].get_weights()[0].reshape(2,))

Initial random guess conv weights [-0.99698746 -0.00943983]
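
As a sanity check on the shapes discussed above, the layer’s kernel weight tensor should come out as (kernel_size, channels, filters) = (2, 1, 1); this is an assumption based on how Keras lays out Conv1D weights, and it is consistent with the indexing used further down.

print(toy.layers[0].get_weights()[0].shape)  # expected: (2, 1, 1)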

Fit the model and print out the convolution layer weights on every 20th epoch.

for i in range(200):
  h = toy.fit(toyX, toyY, verbose=0)
  if i%20 == 0:
    print("{:3d} {} \t {}".format(i, toy.layers[0].get_weights()[0][:,0,0], h.history))

  0 [-0.15535446  0.85394686] 	 {'loss': [7.5967063903808594]}
 20 [ 0.84127212  1.85057342] 	 {'loss': [1.288176417350769]}
 40 [ 0.96166265  1.94913495] 	 {'loss': [0.14810483157634735]}
 60 [ 0.9652133   1.96624792] 	 {'loss': [0.21764929592609406]}
 80 [ 0.98313904  1.99099088] 	 {'loss': [0.0096222562715411186]}
100 [ 1.00850654  1.99999714] 	 {'loss': [0.015038172714412212]}
120 [ 1.00420749  1.99828601] 	 {'loss': [0.02622222900390625]}
140 [ 0.99179339  1.9930582 ] 	 {'loss': [0.040729362517595291]}
160 [ 1.00074089  2.00894833] 	 {'loss': [0.019978681579232216]}
180 [ 0.99800795  2.01140881] 	 {'loss': [0.056981723755598068]}

Looks good. The convolution weights gravitate towards the expected values.

1D Convolution and Channels

Let’s add another dimension: ‘channels’. In the beginning this confused me: why is it a 1D conv if the input data is 2D? In 2D convolutions (e.g. image classification CNNs) the channels are often the R, G, and B values of each pixel. In the 1D text case the channels could be e.g. the word embedding dimensions: a 300-dimensional word embedding would introduce 300 channels into the data, and the input shape for a single ten-word sentence would be (1, 10, 300).

K.clear_session()
toyX = np.array([[0, 0], [3, 6], [4, 7], [5, 8]]).reshape(1,4,2)
toyy = np.array([30, 57, 67]).reshape(1,3,1)
toy = Sequential([
 Convolution1D(filters=1, kernel_size=2, strides=1, padding='valid', use_bias=False, input_shape=(4,2), name='c1d')
])
toy.compile(optimizer=Adam(lr=5e-2), loss='mae')
print("Initial random guess conv weights", toy.layers[0].get_weights()[0].reshape(4,))

Initial random guess conv weights [-0.08896065 -0.1614058   0.04483104  0.11286306]

And fit the model. We are expecting the weights to be [[1, 3], [2, 4]].

# Expecting [1, 3], [2, 4]
for i in range(200):
  h = toy.fit(toyX, toyy, verbose=0)
  if i%20 == 0:
    print("{:3d} {} \t {}".format(i, toy.layers[0].get_weights()[0].reshape(4,), h.history))

  0 [-0.05175393 -0.12419909  0.08203775  0.15006977] 	 {'loss': [51.270969390869141]}
 20 [ 0.93240339  0.85995835  1.06619513  1.13422716] 	 {'loss': [34.110202789306641]}
 40 [ 1.94146633  1.8690213   2.07525849  2.14329076] 	 {'loss': [16.292699813842773]}
 60 [ 2.87350631  2.8022306   3.02816415  3.09959674] 	 {'loss': [2.602280855178833]}
 80 [ 2.46597505  2.39863443  2.96766996  3.09558153] 	 {'loss': [1.5677350759506226]}
100 [ 2.30635262  2.25579095  3.12806559  3.31454086] 	 {'loss': [0.59721755981445312]}
120 [ 2.15584421  2.15907145  3.18155575  3.42609954] 	 {'loss': [0.39315733313560486]}
140 [ 2.12784624  2.19897866  3.14164758  3.41657996] 	 {'loss': [0.31465086340904236]}
160 [ 2.08049321  2.22739816  3.12482786  3.44010139] 	 {'loss': [0.2942861020565033]}
180 [ 2.0404942   2.26718307  3.09787416  3.45212555] 	 {'loss': [0.28936195373535156]}
...
  n [ 0.61243659  3.15884042  2.47074366  3.76123118] 	 {'loss': [0.0091807050630450249]}

Converges slowly, and it looks like the optimizer found another solution that fits the problem: with four weights and only three target values the system is underdetermined, so several weight combinations produce the same output.

1D Convolution and Multiple Filters

Another dimension to consider is the number of filters that the Conv1D layer uses. Each filter creates a separate output. The neural net should learn to use one filter to recognize edges, another filter to recognize curves, and so on. Or that’s what they do in the case of images. This exercise resulted from me thinking that it would be nice to figure out what the filters recognize in 1D text data.

K.clear_session()
toyX = np.array([0, 3, 4, 5]).reshape(1,4,1)
toyy = np.array([[6, 12], [11, 25], [14, 32]]).reshape(1,3,2)
toy = Sequential([
 Convolution1D(filters=2, kernel_size=2, strides=1, padding='valid', use_bias=False, input_shape=(4,1), name='c1d')
])
toy.compile(optimizer=Adam(lr=5e-2), loss='mae')
print("Initial random guess conv weights", toy.layers[0].get_weights()[0].reshape(4,))

Initial random guess conv weights [-0.67918062  0.06785989 -0.33681798  0.25181985]

After fitting, the convolution weights should be [[1, 2], [3, 4]].

for i in range(200):
  h = toy.fit(toyX, toyy, verbose=0)
  if i%20 == 0:
    print("{:3d} {} \t {}".format(i, toy.layers[0].get_weights()[0][:,0,0], h.history))

  0 [-0.62918061 -0.286818  ] 	 {'loss': [17.549871444702148]}
 20 [ 0.36710593  0.70946872] 	 {'loss': [11.24349308013916]}
 40 [ 1.37513924  1.71750224] 	 {'loss': [4.8558430671691895]}
 60 [ 1.19629359  1.83141077] 	 {'loss': [1.5090690851211548]}
 80 [ 1.00554276  1.95577395] 	 {'loss': [0.55822056531906128]}
100 [ 0.97921425  2.001688  ] 	 {'loss': [0.18904542922973633]}
120 [ 1.01318741  2.00818276] 	 {'loss': [0.064717374742031097]}
140 [ 1.01650512  2.01256871] 	 {'loss': [0.085219539701938629]}
160 [ 0.986902    1.98773074] 	 {'loss': [0.022377887740731239]}
180 [ 0.98553228  1.99929678] 	 {'loss': [0.043018341064453125]}

Okay, looks like the first filter’s weights got pretty close to [1, 2]. How about the second filter?

# Feature 2 weights should be 3 and 4
toy.layers[0].get_weights()[0][:,0,1]

array([ 3.00007081,  3.98896456], dtype=float32)

Okay, looks like the simple exercise worked. Now back to the real work.