## Converting Boolean-Logic Decision Trees to Finite State Machines

### for simpler, high-performance detection of cybersecurity events

When analyzing cybersecurity events, the detection algorithm evaluates attributes against boolean expressions to determine whether the event belongs to a class. This article describes converting boolean expressions to finite state machines to permit simpler, high-performance evaluation.

The open-source project Cyberprobe features this implementation. Conversion of rules to finite state machine (FSM) and application of the rules in FSM form is implemented in Python. Cyberprobe supports the use of millions of rules, which can be applied at greater than 200k events/second on a single processor core.

**Problem**

Applying boolean logic criteria to events solves many scanning and detection problems. For instance, an event occurs that is generated from an interaction with a service under protection. The event has the following attributes:

- Source address:
`123.123.123.123:14001`

- Destination address:
`192.168.0.1:19001`

- URL:
`https://myservice.com/path1`

One or more boolean expressions for the class of thing I am trying to detect:

If TCP port number is 80 or 8080 AND IP address is 10.0.0.1 AND URL is http://www.example.com/malware.dat OR http://example.com/malware.dat …

The aim is to analyze a high-rate stream of such events against a large set of boolean expressions to classify the events.

The boolean expressions get unreadable quickly with English, which has no built-in operator precedence.

**Boolean expressions**

Boolean operators are represented as functions, and ** type:value** represents attribute type/value match terms.

and( or( tcp:80, tcp:8080 ), ipv4:10.0.0.1, or( url:http://www.example.com/malware.dat, url:http://example.com/malware.dat ) )

A boolean expression consists of a combination of, ** and(…)**,

**and**

`or(…)`

**functions, along with**

`not(…)`

**match terms. I am using**

`type:value`

**pairs for match terms as that is useful in the domain I’m working in, but we could just as easily use strings.**

`type:value`

**Input**

When evaluating the attributes of an event, attributes are ** type:value** pairs. e.g.

ipv4:123.123.123.123 tcp:14001 ipv4:192.168.0.1 tcp:19001 url:https://myservice.com/path1

## A basic evaluation algorithm

A simple approach for evaluation of a boolean expression using ** type:value** pair input is to represent the boolean expression as a tree, and then use

**pairs to trigger evaluation. Observations are stored in the tree.**

`type:value`

The rules for evaluating a boolean tree against an event are:

- For each
attribute, see if there is a corresponding`type:value`

term in the boolean tree. If it exists, set the term node as true, and evaluate the parent node.`type:value`

- When evaluating a parent
node, when any child is true, the**or**node is true, and its parent node is evaluated.**or** - When evaluating a parent
node, when ALL children are true, the**and**node is true, and its parent node is evaluated.**and** - When evaluating a parent
node, when the child node is true, the**not**node is false. Once evaluation of all attributes is complete, if a**not**node has not been deemed false because its child is false, then it is evaluated true, and it’s parent node is evaluated.**not**

That’s a straightforward algorithm; the point of this article is to provide an optimization.

There is a compromise here, the algorithm to convert the boolean tree to an FSM is compute intensive: it has complexity which is non-linear with the number of nodes: it is linear with the product of combination nodes (described below) and ** type:value** terms. In real-world scenarios, boolean expressions will be converted to FSM when the rule is parsed, thereafter the FSM can be used numerous times.

## Converting to an FSM

**Step 1: Identify the ‘basic states’**

In order to find the FSM, we look for all of the nodes in the boolean tree where state needs to be observed as evaluation proceeds. If you look at the example above, you can see that * or* nodes and

*nodes are different. A child of an*

**and***node when evaluated as true immediately results in its parent being true, so no state needs to be kept regarding the children of*

**or***nodes. Whereas, when a child of an*

**or***node is true this is something which may need to be stored for later evaluation to determine the point at which the*

**and***node can be evaluated true.*

**and**The evaluation of * not* nodes is also complicated: a

*node can be evaluated as true by virtue of its child maintaining a false evaluation for the duration of analysis.*

**not**The rules we state here are that some nodes in the boolean tree can be described as * basic states*:

- The root of a tree is inherently a
state, which means the boolean expression is true. This is a basic state.`hit`

- A
node is never a**not**.**basic state** - A child of an
node is a basic state unless it is a**and**node.**not** - A child of a
node is a basic state unless it is a**not**node itself.**not**

In the above example, the basic states are the two * or* nodes, and the

**node. All qualify under rule 3.**

`ip:10.0.0.1`

The implementation gives each state a state name which consists of the letter * s* plus a unique number, assigned in a depth-first walk. The example boolean tree with states is shown below; the three children of the

*node are given states, with the parent*

**and***node representing the hit state.*

**and****Step 2: Identify the ‘combination states’**

The basic states are nodes where partial state needs to be recorded. One node in an FSM represents *all* state at the same time i.e. all the valid basic state combinations. Hence the * combination states* set consists all combinations of

*. This includes the empty set, and a union of all states.*

**basic states**Combination states need to have a state name: in my implementation, I combine states to a name by ordering, separating state numbers with a hyphen preceded by ** s**. For example, a combination of states

**,**

`s4`

**,**

`s7`

**is called**

`s13`

**.**

`s4–7-13`

The empty set has a special name which we call ** init**. It represents the initial state of the FSM where no information is known.

There is a special state ** hit** which is used to describe any combination of basic states which include the root node evaluating to true. The combination of other states is ignored.

In the above example, the combination state set consists of:

: The empty set`init`

: The first`s3`

node:**or**: The`s4`

node`ip:10.0.0.1`

: The second`s7`

node**or**: The first`s3-4`

node and**or**`ip:10.0.0.1`

`s4-7`

The**:**node and the second`ip:10.0.0.1`

node**or**: The first and second`s3-7`

nodes**or**`hit`

the root node**:**

**Step 3: Find all match terms**

This is the set of all ** type:value** match nodes in the boolean expression tree.

**Step 4: Find all transitions**

This step is essentially about working out what all ** type:value** match nodes do to all combination states. There is a special match term,

**which is used to evaluate what happens to**

`end:`

*nodes when the list of terms is completed.*

**not**The algorithm is:

For every combination state: Work out the state name of that 'input' combination state For every match term: Given the input state What state results from evaluating that term as true? Work out the state name of that 'output' combination state Record a transition (input, match term, output) Given the input state What state results from evaluating end: as true? Work out the state name of that 'output' combination state Record a transition (input, end:, output)

For this analysis, when the whole boolean expression evaluates as true i.e. the root node of the boolean expression is true, we give that a special name ** hit**.

The result is a complete set of triples: * (input, term, output)*. If the input and output states are the same, we can ignore the transition so that the FSM only contains edges which change state.

At this point, the FSM has some inefficiencies: there may be areas of the FSM which it is not possible to navigate to from ** init**. This is addressed in the next step.

**Step 5: Remove invalid transitions**

Not all combination states can be reached from ** init**, and so some of the transitions discovered can be discarded as irrelevant.

We start by constructing a set of states which can navigate to ** hit**:

- Create a set containing only the combination state
.`hit`

- Iterate over the FSM adding all transitions for which there is a navigation to any state in the set.
- Repeat 2. until the full set of states is discovered.

At this point we know all states which can lead to ** hit**. However, there will be transitions which lead to states which are not in this set, and thus cannot ever travel to

**. So, the first simplification of invalid transitions is to reduce all transitions to states which are NOT in this set to the single state named**

`hit`

**.**

`fail`

There is a second simplification of the FSM: some of the states are not navigable from ** init**, and can be removed:

- Construct a set containing only
.`init`

- Iterate over the FSM finding all transitions for which there is a navigation from any state in the set.
- Repeat 2. until the set of states is discovered.

At this point we know areas of the FSM which are not reachable, and they can be removed.

**Resultant FSM**

The FSM of the above binary tree is depicted below. The ** init** state represents the initial FSM state. The

**state represents successful evaluation of the boolean expression as true. We have mentioned the**

`hit`

**state, which only occurs when**

`fail`

*expressions are used, which do not appear as a result of the boolean expression described as above. See below for an example.*

**not**### Using the FSM

Evaluation of a boolean expressions using the FSM is simple:

- The FSM starts in the
state.`init`

- As attributes are discovered, the
is compared to the transitions from the current state. If a transition exists, the FSM moves to a new state.`type:value`

- When the
state is achieved, that is the equivalent of the boolean expression evaluating to true.`hit`

- When the
state is achieved, no further attribute discovery is needed, and the evaluation can be fast-failed.`fail`

The fact that for each term a single FSM lookup is needed means that this approach has performance advantages.

## Example 2: Using not

For this example, the * not* node is used:

and( not( or( tcp:8081, tcp:8082 ) ), and( tcp:80, or( url:http://www.example.com/malware.dat, url:http://example.com/malware.dat ) ) )

It is interesting to view this graph before removal of invalid transitions and discovery of fail states. Some example artifacts:

- There is no transition to the combination state consisting only of state
. This is because it is not possible to arrive at this state without evaluating both`s9`

and`s5`

as true. There are transitions that lead from`s8`

, to hit, but there is no path which leads to the`s9`

state, so they can never be taken.`s9`

- State
is similarly not valid, if`s5-8`

and`s5`

are evaluated as true,`s8`

is also true. In both cases, the valid state for this condition would be`s9`

. This results in an unreachable part of the FSM with two nodes which is not connected with the rest of the FSM.`s5-8-9`

- Any state with
true cannot lead to`s3`

because the root`hit`

node is necessarily false. All nodes which include**and**can be replaced by the`s3`

state.`fail`

After removal of invalid transitions and mapping to fail states, the FSM is easier to understand:

This example illustrates the ** fail** state: once transitions lead to this state, it is not possible for further information to permit transition to the

**state. Discovering**

`hit`

**or**

`tcp:8081`

**to be present in any state causes a transition to the**

`tcp:8082`

**state. The fail state could be useful depending on your analysis strategy: it may be a point to fast-fail and shortcut further evaluation. This example also illustrates the special**

`fail`

**term which leads to**

`end:`

**.**

`hit`

## Example 3: More state

This example requires much more state to be kept as a result of all of the * and* conditions.

and( or( url:http://www.example.com/malware.dat, url:http://example.com/malware.dat ), ipv4:10.0.0.1, not( and( or( tcp:8081, tcp:8082 ), ipv4:10.0.0.2 ) ) )

The resultant FSM has many states as a result:

## Evaluating many rules concurrently

The FSM model lends itself to highly performant scanning using a large set of rules, each converted to an FSM using the approach described above. While it is theoretically possible to produce an uber-FSM from the individual FSMs, the size of the FSM rapidly becomes unwieldly. As there needs to be a single state in the uber-FSM for each combination of states in the set of contributary FSMs.

However, tracking a large number of FSM states concurrently could be compute-intensive.

A simple evaluation approach is to identify a set of * initiators*, which are the set of terms which lead away from the

**state in each FSM. If any of the initiators are detected while analyzing attributes, the corresponding FSM is activated and put on the set of FSMs which are being tracked for state changes in subsequent match term evaluation. This approach reduces the number of FSMs which need to be tracked. I find in practice, this results in a small set of FSMs used for evaluation.**

`init`

Using this approach, it is not appropriate to fast-fail an FSM and remove it from the set of tracked FSMs; FSMs must be tracked to the fail state, to prevent the FSM from subsequently being re-activated.

## Implementation: cyberprobe indicators

The cyberprobe project includes a means to write rules in JSON format. There are a number of utilities which parse the rule format and output FSM information e.g. ** indicators-show-fsm** takes a rule/indicator file and dumps out the FSMs of every rule in the file. This is output in human-readable form showing the state transitions:

[indicators]$indicators-show-fsm case1.json3ce77704-abe4–4527–84e6-ed6a745aebcf: URL of a page serving malware init — tcp:8080 -> s6 init — tcp:80 -> s6 init — url:http://example.org/malware.dat -> s3 init — url:http://www.example.org/malware.dat -> s3 s3 — tcp:8080 -> hit s3 — tcp:80 -> hit s6 — url:http://example.org/malware.dat -> hit s6 — url:http://www.example.org/malware.dat -> hit

** indicators-graph-fsm** is a utility which takes a rule/indicator file and a rule ID, and outputs a Graphviz format graph describing the FSM, which is how I generated the diagrams in this article:

[indicators]$indicators-graph-fsm case1.json \ 3ce77704-abe4–4527–84e6-ed6a745aebcf > graph.dot[indicators]$dot -Tpng graph.dot > graph.png

** indicators-dump-fsm** is a utility which takes an indicator file and outputs the FSM in a JSON form.

You can embed the FSM transformation in your code using the ** cyberprobe.fsm_extract** module:

#!/usr/bin/env python3

import sys import cyberprobe.fsm_extract as fsme from cyberprobe.logictree import And, Or, Not, Match

expression = And([ Or([ Match("tcp", "80"), Match("tcp", "8080") ]), Match("ipv4", "10.0.0.1"), Or([ Match("url", "http://www.example.com/malware.dat"), Match("url", "http:/example.com/malware.dat") ]) ])

fsm = fsme.extract(expression)

# Dump out FSM for v in fsm: for w in v[1]: print(" %s -- %s:%s -> %s" % (v[0], w[0], w[1], v[2]))

## A quick word on performance

To benchmark my algorithm, I have compared the performance of the FSM approach with the basic boolean-tree algorithm discussed at “A basic algorithm” above, coding both in Python. The plot below shows the number of rules in use as the x-axis, and the event handling rate as the y-axis. This is a way to show how the number of rules in use affects event throughput. The orange line represents the performance of the FSM. You can see that the number of rules in use has very little affect on algorithm throughput. Note logarithmic scale on the x-axis.

This was run on VirtualBox on my old MacBook, expect better performance from a cloud VM, for instance.

That’s a very quick look at performance, maybe I’ll do a follow-up article on performance later.

## Conclusion

In this article I have discussed:

- How boolean expressions can be represented as trees
- Mapping boolean expressions to finite state machines
- How the use of an FSM simplifies detection logic and enables performance advantages
- How to use multiple FSMs for evaluation of concurrent boolean expressions
- That the FSM approach is implemented in the open source cyberprobe project