# Algorithmic Trading for Fixed Income Securities : Part – 1

Regarding my background, I have worked as a High Frequency Market Maker on the NSE (National Stock Exchange of India), MCX (multi-commodity exchange) and the CME (Chicago Mercentile Exchange) mainly in the Currency/Commodities F&O and thereby built a lot of strategies and Algorithms around it. Feel free to connect with me on Linkedin: My Linkedin Profile . For further queries you can reach me at info@alphaticks.com .

Bonds in general are more complicated and have more moving parts than Stocks/FX/Commodities as they behave like derivatives themselves with rates being the underlying (since they derive their value from the underlying benchmark curve in addition to their credit/riskiness and of course their terms of issuance). Therefore, it’s more important to understand the basic risk measures before moving over to the more complicated ones. Here I wouldn’t describe the simple measures like the duration, convexity,  interest rate / prepayment risk, instead I would strongly recommend that somebody with no Fixed Income background familiarize themselves with the basic concepts before moving on with the more complicated details that I’m going to talk about in this post.

Before even proceeding further with the idea of algorithmic trading in Fixed Income and their derivatives, as we could easily do with time-series analysis of the mean-reverting behavior of their relative spreads in FX/Stocks/Commodities derivatives, its more important to understand the not so simple spread measures of the Fixed Income instruments relative to their benchmark curve before we can even start doing the time-series analysis on them. The usual spread measure as we all know is the Z-spread or the Zero Volatility Spread which in general is a measure of the risk over the benchmark given that it lends itself to be a measure of the excess return over its risk-free counterpart. There are certain flaws about it, but we can ignore that as of now. To me, its simply a static parallel spread over the benchmark security with similar characteristics.

So what’s wrong with the Z-spread ? Like modified duration, it doesn’t account for the variability of the cash flows of the derivatives ! So how to measure the effective OAS (option adjusted spread) ?  Interest Rate Models.

A lot of Fixed Income fund managers do not understand this very important measure of credit spread because of several reasons. Some of which are:

1. Its not a straightforward calculation.
2. Its opinionated and can vary by at most a few hundred basis points between different models and the method of calculation. The major concern is that people rely too much on the accuracy of the OAS to 6 places of decimal. However, all that matters is to have a fair and reasonable estimate and the order of magnitude of this risk metric that the particular instrument is exposed to.

So let’s talk about these interest rate models and their applications in assessing Fixed Income and their Derivatives. In general I would describe them as probabilistic (need I say realistic) measures of the evolution of the interest rate upon which the future unknown cash flows of the instrument can be assigned a probability space. Once we have a probability measure of various occurrences, we can calculate the Expected Excess Spread over the benchmark which is nothing but the model implied OAS .  It’s not like saying whether the short term rates will absolutely go up/down and the rest of the yield curve would follow suit, but it’s more like they will go up by a certain probability say 0.7 and down by 0.3 (so 0.7 + 0.3 = 1.0). So, its not only important but necessary to have an unbiased view of the underlying benchmark to even think of constructing a model. Again, I would like to re-emphasize that a model doesn’t tell you the future, it only gives you the expected value. So needless to say that people who blame models in times of crisis are the same people who don’t quite understand it too well enough to use it, but unfortunately depend too much on it.

The model implied OAS can be calculated either using a lattice method or Monte Carlo simulation. First I will walk through the lattice approach. In further posts, I will talk more about simulations and its pros/cons. The two popular non-arbitrage stochastic short rate models that I’m going to start (and end) with are very popular industry standard models namely: BDT(Black Derman Toy) and BK (Black-Karasinsky) both of which are log normal. Once the lattice is constructed, it can be used to recursively value any kind of interest rate contingent claim such as embedded options, interest rate options, interest rate futures, vanilla / bermudan swaps, swaptions etc , some of which are similar to european/american/bermudan options but the others are more complicated than their equity/FX/commodity counterparts and cannot be evaluated using a closed form solution.

A very significant advantage of OAS is that this measure applies to a wide array of asset classes and so an entire portfolio’s aggregated OAS can be compared to any benchmark index’s OAS.

While I will do the model validation for both, I will also highlight that what in general are the reasons to choose one over the other in my case.

Black Derman Toy Model:

$d\ln(r) = [\theta_t + \frac{\sigma'_t}{\sigma_t}\ln(r)]dt + \sigma_t\, dW_t$

where,
$r\,$ = the instantaneous short rate at time t
$\theta_t\,$ = Expected drift in period t,t+ $\delta$
$\sigma_t\,$ = Short rate volatility or local volatility
$W_t\,$= a Standard Brownian motion under a risk-neutral probability measure;$dW_t\$,.

I won’t go into the details of explaining it since there are more deserving literature(s) for it.

Since I’m doing a simple one factor model with constant volatility, the $\sigma' _t$ = 0. Therefore, the discrete time representation of the above can be approximated as:

$r(t + \delta) = r(t) \exp^{\theta_t \delta + \sigma_t \sqrt \delta }$

where $\delta =$ 1 time step

Some important points to note here:

1. Time step can be chosen based on the frequency of the cash flows. More granularity in the discretization wouldn’t help much unless the rates interpolation is vital to valuation process.

2. With no prior information, I assume the probability of rates going up and down to be same as 0.5.

3. Solve for the drift $\theta_t$ at every step under the no-arbitrage and the re-combination condition and generate the rate lattice with the given term structure.

4. Use linear interpolation mechanism to get the instantaneous short rate (or forward rate for future t1 and t2)

So we are all set to calculate the OAS.

Term Structure: arrays of rates and maturities starting from 3 months upto 30 years.

Maturity —— Rates
1/12  ——  0.25/100.0
3/12  ——  0.29/100.0
6/12  ——  0.38/100.0
1.0     —— 0.55/100.0
2.0     —— 0.76/100.0
3.0     —— 0.91/100.0
5.0     ——  1.22/100.0
7.0     —— 1.51/100.0
10.0   —— 1.71/100.0
20.0    —— 2.14/100
30.0    ——  2.55/100.0

Volatility = 15% (flat volatility term structure)

Bond Characteristics:

Start with a simple bond: A 10 year bond with zero-coupon, semi-annual compounding with initial price of 61.027 (5% yield)

Balanced Tree: (the tree is meant to be interpreted as a triangle with say in the below example, 0.38 being the root, 0.644995 being the left node and 0.797415 being the right node. In the third row, 0.851222 is the re-combination node.

0.38
0.644995       0.797415
0.688518       0.851222      1.05238
0.760528       0.940249       1.16244       1.43714
0.711589       0.879746       1.08764       1.34466       1.66242
0.709424       0.877069       1.08433       1.34057       1.65736       2.04901
0.705541       0.872268       1.0784       1.33323       1.64829       2.0378       2.51935
………………………………………………………………………………………………………………………………………

Unfortunately, I can’t put the entire tree here, since it will be hard to understand. click the image below to enlarge for the highlighted full tree view.

And since the price at maturity is 100.0, we recursively discount the cash flows through backward induction and try to converge to the given price with the spread ‘x’ over the benchmark. When we converge successfully, the spread ‘x’ is the OAS.

Here it converges to the price : 61.027

Again, click on the image below and enlarge to see the price tree. Sorry again for the inconvenience but the limitations of WordPress.

Through backward induction from the redemption price, we calculate the OAS as 355.189 bps .

Its easily noticeable that the rates on the left side are continuously falling while on the right-most are continuously rising. This is not very realistic.

For Model validation purpose, let’s move forward and calculate the effective duration.

To calculate the effective duration, we need to bump the rate curve by say 10 basis points up and down and calculate the prices p1 and p2.

Effective Duration as we know is defined as $\frac {V_{-\Delta y}-V_{+\Delta y}}{2(V_0)\Delta y}$

Using Python / Matplotlib library , its easy to generate how the original/parallel bumped up / down yield curve looks like.

Note the red curve is the original curve and it’s bumped up (blue) and down (green) to get the price sensitivity w.r.t yield

Price with the yield curve bumped up by 30 bps: 60.5175

Price with the yield curve bumped down by 30 bps: 61.5297

Calculated Effective Duration: 8.29316 years.

Calculated Effective Convexity: 74.1095

Since the expected effective duration should be close to 10.0 years, it’s not quite right, which leads us to the point of trying out the Black Karasinsky Model which also has a mean reversion component to simulate more realistic movement of the short rate.

Black Karasinsky Model:

$d\ln(r) = [\theta_t-\phi_t \ln(r)] \, dt + \sigma_t\, dW_t$

Again,

$r\,$ = the instantaneous short rate at time t
$\theta_t\,$ = Expected drift in period t,t+ $\delta$
$\sigma_t\,$ = Short rate volatility or local volatility
$W_t\,$= a Standard Brownian motion under a risk-neutral probability measure;$dW_t\$
$\theta_t$= cap/floor on the rate evolution
$\phi_t$= speed of mean reversion

Again, the time discretized version for approximation would be:

$r(t + \delta) = r(t) \exp^{[\theta_t-\phi_t \ln(r)] \delta + \sigma_t \sqrt \delta }$

C++ code generated balanced Tree looks something like this.

0.38
0.438704 0.542375
0.506482 0.626169 0.77414
0.584735 0.722914 0.893746 1.10495
0.675082 0.834612 1.03184 1.27568 1.57713
0.709661 0.877362 1.08469 1.34102 1.65791 2.0497
0.706319 0.87323 1.07958 1.3347 1.65011 2.04004 2.52213
0.684033 0.845678 1.04552 1.29259 1.59804 1.97568 2.44255 3.01975
0.654734 0.809455 1.00074 1.23722 1.52959 1.89105 2.33793 2.89041 3.57344
0.620599 0.767254 0.948564 1.17272 1.44985 1.79246 2.21604 2.73972 3.38714 4.18756
…………………………

Click on the image below to view the entire tree:

And the corresponding Price Tree and fitted OAS:

The OAS calculated here is: 359.135 bps as opposed to 355.189 bps (calculated through BDT model) previously. Since this OAS causes more sensitivity, the effective duration would be slightly higher. As expected, it comes out to be: 10.2225 (~ 10.0) years which is reasonably acceptable for a zero coupon bond maturing in 10 years.

So far, so good. Lets make this bond say a 5% coupon bond, semi-annual frequency, with a price of par (yield = 5%) and do the calculation similarly. You would expect the OAS to be fairly close but the effective duration falling. Let’s check it out.

The OAS as expected comes out to be: 366.011 bps which is somewhat close to 359.135 bps (for zero coupon) because they have the same yields (5.0%) but periodic cash flows which can be re-invested. The effective duration comes out to be: 8.47687 years (< 10.0 years), since again, due to its periodic cash flows, it seems to be relatively less riskier than its zero counterpart.

Finally, let’s assess the OAS values for bullet VS callable bond.

Let’s say we have a 5% coupon, 10 year bond, priced at 115.0,

and we get the following measures:
OAS: 185.082
Effective Duration: 8.77167

which is reasonably fair.

What if it’s continuously callable at par after 5 years ? Because of this, it will be slightly cheaper. Let’s say it is: 113.5. What happens to the OAS and the effective duration now ?

OAS: 110.052
Effective Duration: 4.56597

Look at the highlighted region in the top right. It captures the low rates region of the model implied interest rate evolution. From American option point of view, its the efficient exercise region and for a Mortgage Backed Security (we will get into this later), its the prepayment region. The price on these nodes were greater than the call price (par or 100.0), so I set them explicitly to 100.0. This will reduce the expected excess spread due to the optionality in comparison to the Bullet Bond.

As can be seen, though the callable bond appears to be a good deal as its cheaper (than its bullet counterpart), the expected excess spread is far lesser by almost 75bps . It does get compensated to some extent due to less (almost half) duration risk but again, the reinvestment rate would be much lesser to mark a huge loss for a total duration of 8.77167 years (bullet bond’s duration).

The significance of effective duration is that it probabilistically captures the optionality of the embedded option as the price goes into the efficient exercise region and slowly shifts the effective duration, which can reduce hedging costs in the long run.

Another point to not miss is that as we move the price up, more and more nodes in the lattice will come in the efficient exercise region of the option. The OAS will further reduce and so will the effective duration.

While these calculations make some intuitive sense, there are various shortcomings to these short-rate models. We assumed a flat volatility structure which is not always the case. The local volatility might as well be stochastic and to capture that effect, we need a 2-factor model. Also, this model only captures the short (term) rate evolution but ignores the rest of the yield curve. There can certainly be more factors which can affect the term structure.

So, its clearly evident that a good grasp of these measures is necessary to apply algorithmic tradingstrategies on them. In further posts, we will focus on a few more similar measures and the strategies around them.

Hope you enjoyed the post ! Appreciations are welcome and criticisms are even more welcome !!!

DISCLAIMER:

1. Opinions expressed are solely my own and do not express the views or opinions any of my employers.

2. The information from the Site is based on financial models, and trading signals are generated mathematically. All of the calculations, signals, timing systems, and forecasts are the result of backtesting, and are therefore merely hypothetical. Trading signals or forecasts used to produce our results were derived from equations which were developed through hypothetical reasoning based on a variety of factors. Theoretical buy and sell methods were tested against the past to prove the profitability of those methods in the past. Performance generated through back testing has many and possibly serious limitations. We do not claim that the historical performance, signals or forecasts will be indicative of future results. There will be substantial and possibly extreme differences between historical performance and future performance. Past performance is no guarantee of future performance. There is no guarantee that out-of-sample performance will match that of prior in-sample performance. The website does not claim or warrant that its timing systems, signals, forecasts, opinions or analyses are consistent, logical or free from hindsight or other bias or that the data used to generate signals in the backtests was available to investors on the dates for which theoretical signals were generated.

References:

https://en.wikipedia.org/wiki/Black%E2%80%93Derman%E2%80%93Toy_model

https://en.wikipedia.org/wiki/Black%E2%80%93Karasinski_model

http://www.kalotay.com/sites/default/files/private/FAJ93.pdf

http://janroman.dhis.org/stud/AF1%20OAS.pdf

https://en.wikipedia.org/wiki/Root-finding_algorithm

https://en.wikipedia.org/wiki/Bond_duration

https://en.wikipedia.org/wiki/Bisection_method

# Time Series Analysis – Performance Benchmarks (I feel the need ! The need for Speed !!!)

Regarding my background, I have worked as a High Frequency Market Maker on the NSE (National Stock Exchange of India), MCX (multi-commodity exchange) and the CME (Chicago Mercentile Exchange) mainly in the Currency/Commodities F&O and thereby built a lot of strategies and Algorithms around it. I will also outline my experiences and the ways that I used to hack around it.
Feel free to connect with me on Linkedin: My Linkedin Profile . For further queries you can reach me at info@alphaticks.com .

In this post, we will compare some of the popular cutting-edge tools to see which can evaluate time series metrics faster cause as I said in the title:            “I feel the need ! The need for Speed !!!” . Whilst most of the HFT boxes are written in C++ into co-located servers (Please read my earlier post: Low Latency Systems [Decoded!]), which is usually the order routing part, the signal generation part is equally important which is the basis of these using these cutting-edge tools.

DATA : The data used here is a tick based time series data for EURUSD (currency futures) on a given date with 546072 time points. A quick snapshot of the file is as follows:

06/30/2013,16:00:36.831,1.30037,1.30074,SAXO,,CPH
06/30/2013,16:00:36.962,1.30019,1.30091,FXN,NAM,NYC
06/30/2013,16:00:36.962,1.3003,1.30082,FXN,,
06/30/2013,16:00:37.278,1.3003,1.30123,FXN,,
06/30/2013,16:00:39.152,1.3002,1.301,COMM,ASI,HKG
06/30/2013,16:00:43.155,1.301,1.3015,DABA,EUR,CPH

Platform/Architecture: Linux localhost.localdomain 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64  GNU/Linux (In Nutshell, Linux 64-bit, distribution: Redhat)

How are we going to do it ? We will do a simple test with basic computations like Mean, Volatility of the Bid-Ask spread for this asset class and record the (real)time taken to do it.

The execution time(s) will be calculated with the popular “time” command which is available on most of the flavors of Unix.

Real, User and Sys process time statistics:

Real is wall clock time – time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).

User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.

Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like ‘user’, this is only CPU time used by the process. See below for a brief description of kernel mode (also known as ‘supervisor’ mode) and the system call mechanism.

Real time is the amount of time, which one actually has to wait. Since we are considering the performance of the data structures and algorithms used by these languages, therefore, we would like to assess how much time they spend doing I/O for example. So I will consider the Real time to make judgements !

So, let’s get started ! The code snippets with their results and times of execution are given below with a screenshot of my Linux Terminal at the end of it.

1) PYTHON:

For this one, I used the Cython along with NumExpr (fast Vectorization Python Library) along with Pandas.

Cython code:

import pandas as pd
import numpy as np
cimport numpy as np
cimport cython
import numexpr as ne

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double f():
s = a.size
m = ne.evaluate(“(sum(a -b))/s”)
k = ne.evaluate(“(sqrt(sum((a – b – m) * (a -b – m))))/s”)
print m, k

f() #Call the Cythonized func
~

To compile this as a cython module:  python setup.py build_ext –inplace

This builds a libctest2.so as a python module which can be loaded using the import directive in the python shell as shown below !

[root@localhost cython]# time python <<< “import ctest2”
0.000195290209515 0.000137504095628

real    0m0.868s
user    0m0.787s
sys    0m0.084s
[root@localhost cython]#

Waiting Time: 0.868s (Hell of a Good Start !)

2) MYSQL: (MyISAM Engine)

+————————+————————+
+————————+————————+
| 0.00019528972468442314 | 0.00013750407297097067 |
+————————+————————+

real    0m0.701s
user    0m0.007s
sys    0m0.012s
[root@localhost l32]#

Waiting Time: 0.701s

3) R:

[root@localhost l32]# cat test2.R
library(data.table)
file = ‘EURUSD.csv’ # enable this line to read from file

stringsAsFactors=FALSE, colClasses=c(“Date”,”character”,”numeric”,”numeric”,”character”,”character”,”Character”) )

# Optionally parse date and time

# compute mean and sd
c(mean(trade$Ask – trade$Bid),sd(trade$Ask – trade$Bid)) # data.frame style
[root@localhost l32]#
[root@localhost l32]#
[root@localhost l32]# time Rscript test2.R
[1] 0.0001952902 0.0001375041

real    0m1.645s
user    0m1.580s
sys    0m0.064s
[root@localhost l32]#

Waiting Time: 1.645s

kdb+ : (Denied permission to share the benchmarks !)

The code:

[root@localhost l32]# time ./q <<< ‘trade:(“DSFFSSS”; enlist “,”) 0: EURUSD.csv; select avg (Ask – Bid), dev (Ask – Bid) from trade’
Mean          Std
————————-
0.0001952902 0.0001375041

So, here’re the ranks with decreasing order of performance (increasing order of time taken):
1. mysql => 0.701s
2. Python => 0.868s
3. R => 1.645s

Of course, I would like to mention one important point here, that the performance should also take into account the memory usage, which is definitely higher in some cases.

I am also assuming that since memory is cheaper nowadays (compared to earlier times), hence a lot of financial enterprises aren’t short of using huge amount of RAMs for their tick data infrastructures !

Why did I choose mean and standard deviation functions? Firstly they are simple and built-in functions compared to more complicated statistical metrics that I have done in my previous posts. Secondly, these functions force the tools to do the calculations from ground-up instead of blurting out pre-cached  results! Some of you may argue that I could have done more justice to how I create the data structures (tables) in mysql where I could have done indexing to make things faster, but remember I’m not doing any search/query functionality here. The algorithmic complexity of both mean and std is O(n)!

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” – Donald Knuth

Hope you have enjoyed the post ! Comments are welcome, have a nice day folks !

# Market Making Hacks! Part – 2

Regarding my background, I have worked as a High Frequency Market Maker on the NSE (National Stock Exchange of India), MCX (multi-commodity exchange) and the CME (Chicago Mercentile Exchange) mainly in the Currency and Commodities Futures and Options and thereby built a lot of strategies and Algorithms around it. I will also outline my experiences and the ways that I used to hack around it.

In the previous post, we looked at the basics Market Making strategy, but do you think it will suffice ? Just how much information can one instrument give? In the much earlier posts, we had stressed a lot on the relative pricing between two or more instruments and we would like to extend that methodology to our Market Making strategy to do optimal hedging of the exposed positions.

We have already seen that how a pair of stocks bear strong co-integration properties i.e. they have a common stochastic drift and due to this they have a long run equilibrium between their prices. We already did some tests to check the stationarity of the Pairs and found some interesting results. In the case of Mean reversion, we saw that the long run equilibrium can be affected by short run deviations and that’s exactly what we want to use in our Market Making strategy by quoting simultaneously on two or more instruments.

Since, I have limited High Frequency Data, I will be using two currency pairs with obvious correlations: EURUSD and EURAUD to see how it works out. Let’s start by plotting their Bid-Ask Time Series from their one day’s data:

A very important point to note here that previously (in our cointegration analysis of time series), we were considering daily closing prices from Yahoo Finance but here we were wrangling with High Frequency Data which is on a tick-to-tick basis. Know what I mean ? You will soon find out .

For EURUSD:
 Date Time Bid Ask X Y Z 0 07/01/2013 00:00:00.164 1.30229 1.30238 FXDC NAM NYC 1 07/01/2013 00:00:00.190 1.30229 1.30237 TDF NaN LAX 2 07/01/2013 00:00:00.240 1.30230 1.30240 DCFX ASI AKL 3 07/01/2013 00:00:00.252 1.30200 1.30280 COMM NaN HKG 4 07/01/2013 00:00:00.351 1.30227 1.30237 GOSP NaN SGP 5 07/01/2013 00:00:00.456 1.30227 1.30242 GACI NAM NYC 6 07/01/2013 00:00:00.540 1.30225 1.30244 GFT NaN MIC 7 07/01/2013 00:00:00.577 1.30233 1.30237 RTG EUR GVA 8 07/01/2013 00:00:01.014 1.30229 1.30244 GACI NAM NYC 

For EURAUD:
 Date Time Bid_ Ask_ X_ Y_ Z_ 0 07/01/2013 00:00:00.062 1.41865 1.41945 SAXO EUR CPH 1 07/01/2013 00:00:00.094 1.41895 1.41912 GOSP ASI SGP 2 07/01/2013 00:00:00.158 1.41857 1.41957 OTCX GLO NYC 3 07/01/2013 00:00:00.205 1.41895 1.41915 TDF NAM LAX 4 07/01/2013 00:00:00.220 1.41864 1.41944 SAXO EUR CPH 5 07/01/2013 00:00:00.225 1.41895 1.41915 TDF NAM LAX 6 07/01/2013 00:00:00.235 1.41896 1.41918 FXDC NaN NYC 7 07/01/2013 00:00:00.235 1.41900 1.41917 FXN NaN NaN 8 07/01/2013 00:00:00.387 1.41863 1.41943 SAXO EUR CPH 

Notice, the time stamps, they are not same ! Why ? Cause these instruments receive ticks at different time stamps. Thats what is more complicated about analysing the HFT data. So we need to somehow adjust these two time series before doing further analysis.

Let’s plot a part of their tick Time Series (because the data is simply huge):

So we see that part of their plot shows negative correlation which is also favourable to our strategy. We have to merge these two time series together so that they have the same time stamps. Pandas provides us with powerful data manipulation functions. (I’m sure you can do it in R too, but I haven’t tried this functionality in R. Will do that later and update the post for our R lovers 😉 ) All right, so we have a pandas.merge() function that can help us to do an outer join on the ‘Time’ column. And then, we gotta pad all the NAs with the previous values. Lets try that out. What do we get ?

 Date_x Time Bid Ask Date_y Bid_ Ask_ 0 NaN 00:00:00.062 NaN NaN 07/01/2013 1.41865 1.41945 1 NaN 00:00:00.094 NaN NaN 07/01/2013 1.41895 1.41912 2 NaN 00:00:00.158 NaN NaN 07/01/2013 1.41857 1.41957 3 07/01/2013 00:00:00.164 1.30229 1.30238 07/01/2013 1.41857 1.41957 4 07/01/2013 00:00:00.190 1.30229 1.30237 07/01/2013 1.41857 1.41957 5 07/01/2013 00:00:00.205 1.30229 1.30237 07/01/2013 1.41895 1.41915 6 07/01/2013 00:00:00.220 1.30229 1.30237 07/01/2013 1.41864 1.41944 

But we still got some NAs on the top. Lets get rid of them by using the function dropna() and finally,

 Date_x Time Bid Ask Date_y Bid_ Ask_ 3 07/01/2013 00:00:00.164 1.30229 1.30238 07/01/2013 1.41857 1.41957 4 07/01/2013 00:00:00.190 1.30229 1.30237 07/01/2013 1.41857 1.41957 5 07/01/2013 00:00:00.205 1.30229 1.30237 07/01/2013 1.41895 1.41915 6 07/01/2013 00:00:00.220 1.30229 1.30237 07/01/2013 1.41864 1.41944 7 07/01/2013 00:00:00.225 1.30229 1.30237 07/01/2013 1.41895 1.41915 8 07/01/2013 00:00:00.235 1.30229 1.30237 07/01/2013 1.41896 1.41918 9 07/01/2013 00:00:00.235 1.30229 1.30237 07/01/2013 1.41900 1.41917 10 07/01/2013 00:00:00.240 1.30230 1.30240 07/01/2013 1.41900 1.41917 11 07/01/2013 00:00:00.252 1.30200 1.30280 07/01/2013 1.41900 1.41917 12 07/01/2013 00:00:00.351 1.30227 1.30237 07/01/2013 1.41900 1.41917 13 07/01/2013 00:00:00.387 1.30227 1.30237 07/01/2013 1.41863 1.41943 14 07/01/2013 00:00:00.456 1.30227 1.30242 07/01/2013 1.41863 1.41943 15 07/01/2013 00:00:00.540 1.30225 1.30244 07/01/2013 1.41863 1.41943 16 07/01/2013 00:00:00.577 1.30233 1.30237 07/01/2013 1.41863 1.41943 17 07/01/2013 00:00:01.014 1.30229 1.30244 07/01/2013 1.41863 1.41943 

Looks good enough ? So, now we have two merged time series that we can do our analysis upon and make our strategy.
The first step is to run our regression on the Mid of the two series and run ADF test on the residuals:

The regression tests gives us:
intercept: 1.2955
coefficient: 0.0041

(-9.5166452644860904, 3.1441600258232791e-16, 1, 153700, {‘5%’: -2.8615588049938121, ‘1%’: -3.4303925465791547, ‘10%’: -2.5667800092275592}, -401164.72508536989). Cool !

Let’s plot the histogram of their residuals:

So, we know the Mean now i.e. whats the average deviation of EURUSD and EURAUD from eachother. For framing our strategy, we will assume that no statistical arbitrage exists between the two instruments.

Lets find what’s the usual spread between their “Mids” (Bid/2 + Offer/2) and get their mean and standard deviations which comes out to be:
Mean: 0.122787585718
Volatility: 0.0013481503038

The Mean will help us decide the benchmark and the standard deviation will decide our position taking. So lets say when the
| mid(EURAUD) – mid(EURUSD) | >= 0.1227 + 2*0.0013 (i.e. deviates in either directions by atleast 2 standard deviations), then we should be long one and short another and vice versa (depends on the direction of deviation).

So, the Pseudo code will be like this:

 if (mid(EURAUD) - mid(EURUSD)) > 0.1227 + 2*0.0013: goShort(EURAUD) goLong(EURUSD) else if (mid(EURAUD) - mid(EURUSD)) < -1 * ( 0.1227 + 2*0.0013): goShort(EURUSD) goLong(EURAUD) else if stop_loss(): 

And here’s the running Python Code. I did run a backtest for our illustrative purpose.
 print "Running the Strategy\n" for index, row in mergedSeries.iterrows(): if (row['midSprd'] > (0.1227 + 2*0.0013)): #if its greater than 2 stddev audpos = audpos - 1 # Go short on EURAUD usdpos = usdpos + 1 # Go long on EURUSD pnl = pnl + row['Bid_'] - row['Ask'] #Total Cost elif (row['midSprd'] < -1*(0.1227 + 2*0.0013)): #if its lesser than 2 stddev audpos = audpos + 1 # Go long EURAUD usdpos = usdpos - 1 # Go short EURUSD pnl = pnl + row['Bid'] + row['Ask_'] #Total Cost #Finally Close all open Positions at Market Price if usdpos > 0:usdval = usdpos * row['Bid'] elif usdpos < 0: usdval = -1 * (usdpos * row['Ask']) if audpos > 0: audval = audpos * row['Bid_'] elif audpos < 0: audval = -1 * (audpos * row['Ask_']) pnl = pnl + usdval + audval print "total PNL: ", pnl 

Running the Strategy
total PNL: 5860.06107

So we seem to be making a profit of 5860.06107 apparently. Thats not Bad !!! But no need to feel very excited about it because there were a lot of assumptions made in this example:

1. We always hit market orders (assuming there’s no slippage)
2. We kept accumulating positions and we closed them in the end (so no risk management)
3. We again closed all the positions at the end without having any slippage.
4. We did not consider the Order book (cause we did not have any!)
5. We didn’t care about how wide or narrow the spread was !
6. This was only a single day’s data so might not always work.
7. Plus there are many caveats that I will leave it for you to figure out.

But this was just an example of what were trying to achieve through the time series analysis of the tick-by-tick (HFT) data building upon our previous analysis. When we do the real Market Making, we would be needing the order book depth data to do a lot more and also would be a better option to backtest it on a simulator from direct Market Feeds to get better results. I know I have missed out a lot of things about this post, but the queries are always welcome from the curious readers.

Hope you have enjoyed the post. Comments are Most Welcome ! Have a Nice Day !!!

Alphaticks is a Non-Profit blog. If you enjoyed what you read, please support by donating something: [wp_paypal_payment]

# Portfolio Optimization with Mean Reverting Spreads Using Python

Regarding my background, I have worked as a High Frequency Market Maker on the NSE (National Stock Exchange of India), MCX (multi-commodity exchange) and the CME (Chicago Mercentile Exchange) mainly in the Currency/Commodities F&O and thereby built a lot of strategies and Algorithms around it. I will also outline my experiences and the ways that I used to hack around it. Feel free to connect with me on Linkedin: My Linkedin Profile . For further queries you can reach me at info@alphaticks.com .

To those of you who’ve been following up with the last post, its time to move beyond just a pair of stocks. To those who haven’t, I’d suggest you take a look at the previous post before proceeding if you are not familiar with Python/Pandas and cointegration analysis of the Time Series.

Moving on, in the last post, we did time series cointegration analysis on a pair of stocks and figured out that both had similar stochastic drift, i.e. the combination existed as a stationary time series (even if they don’t individually). To build upon that, we would be further exploring how can we construct a stationary portfolio out of the listed symbols on the NASDAQ and how to trade as and when the stationarity of the portfolio is violated so as to always achieve a market neutral condition (i.e. a zero-value portfolio).

To start with, we need all the symbols from NASDAQ to run our tests to figure out which ones we can consider to construct our portfolio. These are available on NASDAQ’s FTP server. Here are the links:

. (You can login as anonymous as they have a read-only permission for others). The format of the file is like this: AAL|American Airlines Group, Inc. – Common Stock|Q|N|N|100 . Note: We only need the “AAL” part from this to query the Yahoo Finance using our Python Script as we were doing before. So how to grab the first column delimited with a “|” (pipe). A quick perl script with power crunching Regular Expressions will do this trick for us. Like this:

At my Linux Terminal
 [root@localhost ticker]# echo "AAME|Atlantic American Corporation - Common Stock|G|N|N|100" | perl -wlp -e 's/(.*?)\|.*$/$1/' # => AAME 
(Piece of Cake 😉 )

So, lets assemble all the symbols in another flat file, one in every line. We have a total of 8138 symbols (out of which Yahoo Finance provides correct data for only 5290 symbols. I don’t know why) and all we want to do is find the stationary pairs amongst them to construct an altogether a stationary portfolio. Let’s categorize them in two ways. List of Pairs where the p-value (from ADF test) < 0.05 (strong stationarity. Know what I mean ? ) and where the 0.05 >= p-value > 0.10 (weak stationarity) cause we will have different strategies for taking positions in either of these cases. Lets first generate the list of pairs which is fairly easy using the python package itertools like this:

 list(itertools.combinations(symList, 2)) and we get a list of pairs => ('ADVS', 'MBWM'), ('ADVS', 'MCBC'), ('ADVS', 'MCBK'), ('ADVS', 'MCEP') `
…. (Cool? eh 😉 )

Proceeding ahead, lets grab their respective Time Series from Yahoo Finance and run our previous tests to see what we got. Note: We will be using 4 years daily closing prices (10-14) with approx 1006 Data Points. Running our python script for the tests, we obtain both less (<0.05) and more(<0.10), which are as follows. The symbols can be looked up in the Nasdaq files.

Some of the less Pairs (<0.05) reported:
Pairs: | P-Value
1. GAME|Shanda Games Limited Shares AND ACOR|Acorda Therapeutics, Inc | 0.000612
2. OIL|Barclays Bank Plc iPath Exchange Traded Notes Linked to Goldman Sachs Crude Oil Total Return AND ABCO|The Advisory Board Company | 0.00992
3. ABCB|Ameris Bancorp && DNB|Dun & Bradstreet Corporation | 0.00080

Let’s plot these Pairs Residuals to see how their Time Series look like:

So we do notice that in each of these category, the residuals explode at some point but they have a strong tendency to revert back towards their mean which is essentially what we are looking for.

Let’s come back to our strong stationary Pairs and plot their respective histograms to see how their distributions look like:

What do you notice ? That these all look very close to normal distribution, which verifies our tests.

If X and Y are independent random variables that are normally distributed (and therefore also jointly so), then their sum is also normally distributed. i.e., if

X ~ N(mu1, sigma1)
Y ~ N(mu2, sigma2)
Z=X+Y,
then

Z ~ N(mu1 + mu2, sigma1^2 + sigma2^2).

This means that the sum of two independent normally distributed random variables is normal, with its mean being the sum of the two means, and its variance being the sum of the two variances (i.e., the square of the standard deviation is the sum of the squares of the standard deviations).

This also satisfies that the entire residual (sum of all the residuals of the pairs) is stationary as well. Remember, that we did a Linear Regression of these pairs before getting the residuals, which gives us a clear relationship between the pairs in terms of coefficients. For example: x * StockA = y * stockB. Lets coin this terms as “Delta”. Another way of saying this can be Stock A = y/x Deltas of Stock B, or vice versa. We are doing this primarily to figure out how many deltas of deviations we get when the stationarity of the Portfolio is violated. An important thing to note is that these Deltas are relative not absolute (like the Greeks in the Black-Scholes Model). So how do we hedge ? The simplified answer is, with the pairs in which all the stocks are strongly cointegrated with eachother. For example, if you have pairs: (A, B), (C, D) and (X, Y), then ‘A’ should bear cointegration with [B, C, D, X, Y] and so on, so when a changes, it can be optimally hedged with either of them or a group of them which we intend to optimize (Although, I have not taken care of such pairs in my examples above, if you are a portfolio manager then you should !).

So how are we gonna find the optimum combination to Hedge our portfolio? The Answer: Linear Programming.
Again, Python comes to our rescue with this package called : A Linear Programming modeler

Using this, one can find the optimal hedging condition creating a multi-legged scenario where one can manage risk, control the exposed position and reduce cost/maximize Profitablity. We will explore more on this in our next post. So stay tuned !

Hope you enjoyed this post. Comments are welcome. Have a Nice Day!