# Controlling Rate-Limited Events - Implementation

This is the second of a two-part article. In the previous installment, we described how rate limiting is pervasive in low-latency trading and in technology in general. If you have not read it yet, I recommend doing so, as it provides context for what follows here.

In this article, I will discuss a few requirements for such a limiting algorithm and then present a practical implementation.

#### Requirements

The main requirements for such an algorithm and its companion data structures are:

- The algorithm needs to count the exact number of events per arbitrary time period. We cannot accept approximations, since the thresholds are typically set by regulation.
- It needs to be `O(1)` so that it behaves well during times of turmoil in the market, since that is when well-tuned algorithms shine. The algorithm and its companion data structure need to be fast, so they do not block the critical path of other trading components.
- The algorithm needs to have little impact on the memory cache; in other words, it must be memory efficient. An algorithm that thrashes the cache just to count a few events is not worth having in our system.
- The interface needs to be clean and easy to use. This is a matter of encapsulating the complexity well behind a simple, intuitive access point.
- This last one is a requirement not on the algorithm but on the interface: the client must update the data structure every time a new event enters the system or an event falls out of the rolling window.

#### The Trivial Solution

This is surely the worst approach, but I had to add it here because I have seen it used several times in real trading systems.

To set the stage, let's assume we represent time as a 64-bit unsigned integer, e.g., nanoseconds since the UTC epoch.

The solution goes like this: create a map, more specifically a multimap, keyed by time, with the value being the number of events at that point in time. We also define `interval`, the period of time over which we record and count events. With that, we can define the types for our container.

To add an event, we insert a (time, count) pair into the map. The printouts are for demonstration purposes only.

As events are added, we check the front of the map for expired entries, since the oldest timestamps sort first.

Then we traverse the map and sum up the number of events still outstanding in the time window.

A sample driver for the above class can be written as follows. In `main`, we create a `TimedMap` with an interval of 100 time units. We then add events at different times; it is important to note that these times must be monotonically increasing.

Running the above code produces:

```
Adding 1 events at 1000
Total:1
Adding 3 events at 1010
Total:4
Adding 1 events at 1050
Total:5
Adding 4 events at 1100
Total:9
Adding 1 events at 1200
Erasing 1 events 1000 at 1200
Erasing 3 events 1010 at 1200
Erasing 1 events 1050 at 1200
Total:5
Adding 1 events at 1300
Erasing 4 events 1100 at 1300
Total:2
```

#### A Better Solution

Right away, we can see a few problems with this approach. First, a multimap is inappropriate for this use: since events are inserted and removed in order, there is no need for a balanced tree; a list or a deque would do. Second, as events arrive already ordered, we do not need the `O(logN)` lookup of the current time in the container; we can simply access the head or the tail of such a list (or deque).

Another issue is that we recount the entire sample universe every time we add one sample. We could keep a running total and then incrementally change it as these events are added and removed.

With these three enhancements, we reach the following implementation.

We create a structure to hold the time and the number of events, and build a double-ended queue of them. The advantage of the deque is that it does not need to move existing elements as we push and pop at the ends.

Adding events is simple as we push them back into the deque.

Removing expired events is also simple: since the deque is already ordered, we pop expired entries from the front (the oldest).

This solution is optimal for the requirements stated. It is simple and efficient. Note that all internal operations are now `O(1)`.

Running this code produces the same output as the previous implementation, and it is just as correct.

```
Adding 1 events at 1000
Total:1
Adding 3 events at 1010
Total:4
Adding 1 events at 1050
Total:5
Adding 4 events at 1100
Total:9
Adding 1 events at 1200
Erasing 1 events 1000 at 1200
Erasing 3 events 1010 at 1200
Erasing 1 events 1050 at 1200
Total:5
Adding 1 events at 1300
Erasing 4 events 1100 at 1300
Total:2
```

There is still an issue: as time progresses without insertions, the count in the window should decay as events expire. However, since we only update the queue when inserting new items, the internal stats will not be refreshed. To correct this, we need to enhance the class with a `check(time)` call, which is trivial to implement.

However, the queue itself is `O(N)` in the number of events it holds. We need to make it memory efficient; that is, we need to avoid storing individual events.

The first idea that comes to mind for many is to use an exponential moving average (EMA). The problem with EMAs is that they can fail on both sides for this particular usage. They err on the conservative side by holding a lock (a signal that the rate is excessive) longer than necessary, hurting the underlying business logic. And they may not react fast enough to a sudden burst of events, exposing us to regulatory penalties. So let's forget about EMAs in this case. We need something smarter.

#### A Bounded Approximation

A clever way to achieve memory efficiency is to make one simplifying assumption: we accept a minimum time precision. This will not hurt correctness, as we can compensate by taking a conservative approach.

Say, for example, that we can only accept 300 events within a period of 1.0 seconds, and we decide that our time precision will be 0.1s. We can then transform this requirement such that we only accept 300 events in a period of 0.9 seconds. This guarantees that the liberties we will take in our algorithm will not affect the original requirement.

The way this idea works is to set up an array of `N` slots, each covering an interval `[t, t+T)`, where `T` is a fixed time period and `t` is a multiple of it, such that `t = index*T`. This allows us to locate the slot for a given timestamp by dividing the timestamp by the period to get an index, then taking that index modulo `N`:

```
index = t/T
slot = index % N
```

The algorithm starts with all slots zeroed. For the sake of visualization, let's assume we need to keep track of the number of messages every ten seconds, and we will use ten slots; therefore `N=10` and `T=1.0`. The vector behaves as a circular buffer: as we progress in time, we keep wrapping around to the start.

This is the initial state of our vector.

```
Total = 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
```

Assume then we take two events at 6.5s and six events at 8.8s.

```
Added two events at 6.5s. index= 6.5/1.0 = 6. slot = 6 % 10 = 6
Total = 2
| 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
Added six events at 8.8s. index= 8.8/1.0 = 8. slot = 8 % 10 = 8
Total = 8
| 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 6 | 0 |
```

Let’s then add one event at t=12.7s

```
Added one event at 12.7s. index= 12.7/1.0 = 12. slot = 12 % 10 = 2
Total = 9
| 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 6 | 0 |
```

and three events at t=17.0s.

```
Added three events at 17.0s. index= 17.0/1.0 = 17. slot = 17 % 10 = 7
Total = 12
| 0 | 0 | 1 | 0 | 0 | 0 | *2* | 3 | 6 | 0 |
```

Note that we must expire those two events we introduced at 6.5s (highlighted above). We do so by zeroing the slot and deducting that amount from the running total.

```
Expire 2 events at 6.5s. index=6.5/1.0=6. slot = 6 % 10 = 6
Total = 12-2 = 10
| 0 | 0 | 1 | 0 | 0 | 0 | *0* | 3 | 6 | 0 |
```

This way, we keep progressing, expiring slots and adding new events. The total is, at every step, exactly the sum of all events in the past period.

One further optimization follows from noting that when we jump forward by more than the length of the entire queue (ten seconds in this case), it is simpler to clean all the slots and zero the total in one step.

```
Jump to 32.0s. Old index=17, new index=32, difference = 32-17 = 15s
Cleanup entire queue
Total = 0
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
```

#### Implementation

We start by creating the internal data structures. We will have a vector of slots, which could also be made a `std::array` for maximum cache performance. We store the running total, as before, and we now also keep track of the last index used, so we know which slots to clean as we progress.

The constructor takes an expiration interval and a time precision, both date/time values or objects. From those, we compute the number of slots our vector needs.

The key to this algorithm is the advance-and-clear mechanism, which is a bit tricky to get right for a few reasons. First, we need to take care of big jumps. If we jump from slot 110 to slot 2000 with only 100 slots, a naive loop would sweep the same slots ten times over, a complete waste of time. Instead, we can clean all the slots once, zero the total, and move on; this is the first branch of the advance logic.

Second, we need to ensure we are clearing counts from the slots we step over as we move forward, without clearing the slot we are leaving. So, if we move from index 105 to index 110, we need to clean slots 106, 107, 108, 109, and 110. This logic is in the second branch, the clearing loop.

There are still optimizations to be made to this queue. Plenty of integer divisions exist in the design, particularly inside the clearing loop of `advance()`. Their cost can be mitigated in several different ways:

- Use floating point to represent time, as FP divisions are typically ten times faster than their integer counterparts.
- Compute the start and finish indices once, outside the clearing loop in `advance()`, and break the loop into two cases.
- Make the number of slots a power of two, in which case the modulo reduction becomes a simple bit mask. However, that is only possible if you agree to shrink the control window to the largest power-of-two slot count that fits the required window; in extreme cases, that would be nearly half the control window size.
- Make the number of slots a compile-time constant by turning the class into a template with the slot count as a template argument. The compiler can then optimize away the integer division using modular arithmetic (the Granlund-Montgomery method).

The rest of the algorithm is simpler. The `addEvent()` method advances if necessary and then adds the given number of events to the appropriate slot, updating the total.

#### Conclusion

The algorithm presented is fast, cheap, and cache-efficient, as its cost does not scale with the number of input events. It is also acceptable for mission-critical projects, since it only allocates memory at startup.