
The IEX Order Book - Reading the Data

I recently gave a talk to the Python Frederick Meetup about Pandas and wanted to show some financial data, since that’s what I’ve been working with lately. Specifically, I wanted something that had real-time quotes and trades, but at best most sources give you the high, low, and median price aggregated over something like an hour or a day. There are a lot of sources of market data, and they have most definitely put a price on that data: a high price.

NYSE TAQ is about $1,000/month for the first 12 months and $500/month to go farther back in time. NASDAQ TotalView and similar products have a pretty complex fee structure, so I’m not entirely sure how much it costs. You can at least see the real-time data in your Robinhood (Gold account) or Charles Schwab account, but historical data is going to cost you.

Which brings us to the Investor’s Exchange, or IEX. Featured in the Michael Lewis book Flash Boys, they do their best to offer access to the market with protections in place that favor long-term over short-term strategies, keeping in mind that ‘long term’ might still mean holding a stock for only a millisecond. They’re pretty small compared to NYSE or NASDAQ, but they allow you to download pretty detailed historical data for free…as a dump of network packets.

What is an order book?

Why on earth do I want to see the order book in the first place? What is it? Every time a broker wants to make a trade, they put in an order (I want to buy 200 shares of stock A for $100 each). Until the order is canceled, or a seller is found who is willing to meet the buyer at their price, the order ‘rests on the book.’ Most heavily traded stocks do not have orders that rest long, unless the number of shares is exceptionally large or the price point is completely unreasonable. Here’s an extremely simplified view of what an order book might look like.

time   side   quantity   price
9:30   B      100        45.00
9:31   B      200        45.01
9:31   S      100        45.05
9:32   B       50        45.00
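
To make that concrete, here’s a minimal sketch of the toy book above as a DataFrame, with the best bid and offer pulled out (my own illustration, using the column names from the table):

#The toy book above as a DataFrame, just for illustration
import pandas as pd

toy_book = pd.DataFrame({'time': ['9:30', '9:31', '9:31', '9:32'],
                         'side': ['B', 'B', 'S', 'B'],
                         'quantity': [100, 200, 100, 50],
                         'price': [45.00, 45.01, 45.05, 45.00]})
#The best bid is the highest buy price; the best offer is the lowest sell price
print(toy_book.loc[toy_book['side'] == 'B', 'price'].max())  #45.01
print(toy_book.loc[toy_book['side'] == 'S', 'price'].min())  #45.05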

What do you get from the IEX feed?

In the DEEP data product, you get a number of message types; most notably, a message when there is a trade, and a message when there is a price level update. This isn’t quite the order book, in that many orders can be combined together into one price level update, but you do see how many shares have a bid (I will buy your stock) or offer (I want to sell you my stock) out at each price, which is pretty darn good for free. It’s important to note, again, that IEX is only a few percent of the total market volume, so you’re not seeing the whole market, but you get a neat view that you would typically have to pay a lot to see. Here, you only pay with parsing pain.
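
To illustrate what a price level update means (a toy sketch of my own, not the IEX wire format): each update carries an absolute size for one price, so applying it means replacing whatever size was there before, and a size of zero clears the level.

#Toy sketch of price level update semantics: sizes are absolute
book = {45.00: 100, 45.01: 200}  #price -> total shares at that level

def apply_update(book, price, size):
    if size > 0:
        book[price] = size       #replace the size, don't add to it
    else:
        book.pop(price, None)    #a size of zero removes the level

apply_update(book, 45.01, 300)   #book is now {45.0: 100, 45.01: 300}
apply_update(book, 45.00, 0)     #book is now {45.01: 300}

And here’s what the raw feed, and the parsing pain, actually looks like: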

!xxd -l 112 -s 20000 ~/Downloads/data_feeds_20210122_20210122_IEXTP1_DEEP1.0.pcap
00004e20: 7cb9 0500 ed64 2f58 e805 0000 e805 0000  |....d/X........
00004e30: 0100 5e57 1504 b859 9ff9 2d53 0800 4500  ..^W...Y..-S..E.
00004e40: 05da e340 4000 4011 9f90 17e2 9b84 e9d7  ...@@.@.........
00004e50: 1504 288a 288a 05c6 0402 0100 0480 0100  ..(.(...........
00004e60: 0000 0000 d948 9605 4200 fc42 0000 0000  .....H..B..B....
00004e70: 0000 1903 0000 0000 0000 f76d ff74 b88d  ...........m.t..
00004e80: 5c16 1600 4854 c51f ff74 b88d 5c16 4149  \...HT...t..\.AI

And that’s what you get: a lot of hex digits and a specifications document. Fortunately, there is a super cool person on the internet who made a parser that can get all this data into JSON format based on the specifications document.

#A lovely command line conversion of the binary to a json file
!pip install iex_parser
!iex_to_json -i iex_file.pcap.gz -o iex_file.json.gz -t 'GME' -s
#Peek at the start of the converted file (unzipped and renamed from the step above)
with open('iex_deep_quotes_and_trades.json') as file:
    line = file.readline()
    print(line[:500])
[{"type":"trade_report","event":null,"timestamp":"2021-01-22T13:02:31.215300+00:00","status":null,"symbol":"GME","detail":null,"halt_status":null,"reason":null,"flags":96.0,"size":39.0,"price":45.19,"trade_id":2067095.0,"side":null,"security_event":null},{"type":"trade_report","event":null,"timestamp":"2021-01-22T13:08:14.700160+00:00","status":null,"symbol":"GME","detail":null,"halt_status":null,"reason":null,"flags":96.0,"size":50.0,"price":44.87,"trade_id":2639914.0,"side":null,"security_even

It’s still hard to understand without reading the documentation and knowing a little about how order books work, but it’s a heck of a lot more readable, and easy to turn into a dataframe.

#Read in the json file; many json files that have a flat schema without
#a lot of nesting and variation can be read safely as records
import pandas as pd

json_df = pd.read_json('iex_deep_quotes_and_trades.json', orient='records')
display(json_df)
type event timestamp ... trade_id side security_event
0 trade_report NaN 2021-01-22 13:02:31.215300+00:00 ... 2.067095e+06 None NaN
1 trade_report NaN 2021-01-22 13:08:14.700160+00:00 ... 2.639914e+06 None NaN
2 trade_report NaN 2021-01-22 13:11:52.294756+00:00 ... 3.063945e+06 None NaN
3 trade_report NaN 2021-01-22 13:18:22.383301+00:00 ... 3.669247e+06 None NaN
4 trade_report NaN 2021-01-22 13:19:09.002873+00:00 ... 3.739332e+06 None NaN
... ... ... ... ... ... ... ...
490763 price_level_update NaN 2021-01-28 21:59:06.793899+00:00 ... NaN S NaN
490764 price_level_update NaN 2021-01-28 21:59:06.797557+00:00 ... NaN S NaN
490765 trade_report NaN 2021-01-28 21:59:08.241677+00:00 ... 3.027362e+09 None NaN
490766 price_level_update NaN 2021-01-28 21:59:08.241677+00:00 ... NaN S NaN
490767 price_level_update NaN 2021-01-28 22:00:00.018339+00:00 ... NaN S NaN

490768 rows × 14 columns

#see what the na situation is
json_df.isna().sum()
type                   0
event             490768
timestamp              0
status            490768
symbol                 0
                   ...  
size                   0
price                  0
trade_id          356056
side              134712
security_event    490768
Length: 14, dtype: int64
#get rid of columns that are entirely null
json_df = json_df.dropna(axis=1, how='all')
display(json_df)
type timestamp symbol ... price trade_id side
0 trade_report 2021-01-22 13:02:31.215300+00:00 GME ... 45.19 2.067095e+06 None
1 trade_report 2021-01-22 13:08:14.700160+00:00 GME ... 44.87 2.639914e+06 None
2 trade_report 2021-01-22 13:11:52.294756+00:00 GME ... 44.58 3.063945e+06 None
3 trade_report 2021-01-22 13:18:22.383301+00:00 GME ... 44.04 3.669247e+06 None
4 trade_report 2021-01-22 13:19:09.002873+00:00 GME ... 43.78 3.739332e+06 None
... ... ... ... ... ... ... ...
490763 price_level_update 2021-01-28 21:59:06.793899+00:00 GME ... 262.00 NaN S
490764 price_level_update 2021-01-28 21:59:06.797557+00:00 GME ... 262.00 NaN S
490765 trade_report 2021-01-28 21:59:08.241677+00:00 GME ... 262.00 3.027362e+09 None
490766 price_level_update 2021-01-28 21:59:08.241677+00:00 GME ... 262.00 NaN S
490767 price_level_update 2021-01-28 22:00:00.018339+00:00 GME ... 988.00 NaN S

490768 rows × 8 columns

#What are we left with?
json_df.isna().sum()
type              0
timestamp         0
symbol            0
flags             0
size              0
price             0
trade_id     356056
side         134712
dtype: int64
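Notice that those two counts sum to 490,768, the full row count: every trade report lacks a side, and every price level update lacks a trade_id. A quick groupby confirms the split, if you want to check:

#Confirm the NaNs split cleanly by message type
print(json_df.groupby('type')[['trade_id', 'side']]
      .agg(lambda s: s.isna().sum()))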
#What data types are we working with?
json_df.dtypes
type                      object
timestamp    datetime64[ns, UTC]
symbol                    object
flags                      int64
size                       int64
price                    float64
trade_id                 float64
side                      object
dtype: object
#The objects really should be strings
json_df = json_df.astype({'type':'string','symbol':'string','side':'string'})
json_df.dtypes
type                      string
timestamp    datetime64[ns, UTC]
symbol                    string
flags                      int64
size                       int64
price                    float64
trade_id                 float64
side                      string
dtype: object
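As an aside, on pandas 1.0 or later, convert_dtypes() will infer the string dtypes for you, though it also converts the numeric columns to their nullable counterparts, so check the result before relying on it:

#One-line alternative: let pandas infer nullable/string dtypes (pandas >= 1.0)
alt_df = json_df.convert_dtypes()
print(alt_df.dtypes)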
#Fill in nulls on the side column, since those may cause trouble plotting trades;
#create a date column for filtering purposes, and change from UTC to EST
json_df = json_df.fillna({'side':'X'})  #replace nulls

#date is taken before the timezone change, matching the UTC timestamps
json_df['date'] = json_df['timestamp'].dt.date

json_df['timestamp'] = (json_df['timestamp']
                        .dt.tz_convert('EST')
                        .dt.tz_localize(None))  #change to local time, drop tz info

display(json_df)
type timestamp symbol ... trade_id side date
0 trade_report 2021-01-22 08:02:31.215300 GME ... 2.067095e+06 X 2021-01-22
1 trade_report 2021-01-22 08:08:14.700160 GME ... 2.639914e+06 X 2021-01-22
2 trade_report 2021-01-22 08:11:52.294756 GME ... 3.063945e+06 X 2021-01-22
3 trade_report 2021-01-22 08:18:22.383301 GME ... 3.669247e+06 X 2021-01-22
4 trade_report 2021-01-22 08:19:09.002873 GME ... 3.739332e+06 X 2021-01-22
... ... ... ... ... ... ... ...
490763 price_level_update 2021-01-28 16:59:06.793899 GME ... NaN S 2021-01-28
490764 price_level_update 2021-01-28 16:59:06.797557 GME ... NaN S 2021-01-28
490765 trade_report 2021-01-28 16:59:08.241677 GME ... 3.027362e+09 X 2021-01-28
490766 price_level_update 2021-01-28 16:59:08.241677 GME ... NaN S 2021-01-28
490767 price_level_update 2021-01-28 17:00:00.018339 GME ... NaN S 2021-01-28

490768 rows × 9 columns

#Create subset that is just the order book updates
mask = json_df['type']=='price_level_update'
select_cols = ['timestamp','size','price','side']
sort_cols = ['timestamp','price']
order_df = json_df.loc[mask,select_cols].sort_values(sort_cols)

I’m not sure if this is the most effective way to accomplish the task, but I felt like making a class to do the data manipulation necessary to carry the information forward from one row to the next. The goal is for each timestamp to specify the correct number of shares at each price level, accumulating all the different price levels forward in time. If there were only a few price levels, breaking them out into their own columns and carrying each level’s last size forward would be the most effective approach (there’s a sketch of that below). However:

print('There are ',len(order_df.price.unique()),' price levels.')
There are  11848  price levels.
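For reference, here’s a sketch of what that column-per-level alternative would look like; since the updates carry absolute sizes, a forward fill of each level’s last reported size does the accumulating, but with 11,848 distinct prices the resulting frame would be enormous:

#Sketch: one column per (side, price) level, last reported size carried forward
#(only practical with a handful of levels; here it would mean ~11,848 columns)
wide = (order_df
        .pivot_table(index='timestamp', columns=['side', 'price'],
                     values='size', aggfunc='last')
        .ffill())

So instead, here’s the class: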
class PriceLevels(dict):
    """A dict of price -> resting size for one side of the book."""

    def add_or_discard(self, size, price, side, quote_side):
        #only apply updates meant for the side this book tracks
        if side == quote_side:
            if size > 0:
                self.update({price: size})  #sizes are absolute, so replace
            else:
                self.pop(price, None)       #a size of zero clears the level
        return self

    def get_bbo(self, side):
        #best bid is the highest buy price; best offer is the lowest sell price
        if len(self) == 0:
            return None
        return max(self.keys()) if side == 'B' else min(self.keys())

    def get_vwap(self):
        if len(self) == 0:
            return None
        return sum(k*v for k, v in self.items())/sum(self.values())

    def update_prices(self, size, price, side, quote_side):
        #apply the update in place, then hand back a snapshot
        self.add_or_discard(size, price, side, quote_side)
        return PriceLevels(self.copy())
#Some examples of how the object works
a = PriceLevels({45.45:100,50:100,55:50})
print('set price levels: ',a)
a.update_prices(100,46.05,'B','B')
print('update price level same side: ',a)
a.update_prices(100,46.10,'S','B')
print('update price level opposite side: ',a)
a.update_prices(0,50,'B','B')
print('remove price same side: ',a)
print('get the best bid: ',a.get_bbo('B'))
print('get vwap: ',a.get_vwap())
set price levels:  {45.45: 100, 50: 100, 55: 50}
update price level same side:  {45.45: 100, 50: 100, 55: 50, 46.05: 100}
update price level opposite side:  {45.45: 100, 50: 100, 55: 50, 46.05: 100}
remove price same side:  {45.45: 100, 55: 50, 46.05: 100}
get the best bid:  55
get vwap:  47.6
#For each timestamp, track the current size at each resting price level,
#the best bid and offer, and the VWAP (volume-weighted average price)
#of the buy orders, the sell orders, and all orders combined
bid = PriceLevels()
ofr = PriceLevels()
quotes = dict()
#iterrows() is slow, but fine at this scale; each row is one book update
for _, (timestamp, size, price, side) in order_df.iterrows():
    quotes[timestamp] = {'bid':bid.update_prices(size,price,side,'B'),
                         'ofr':ofr.update_prices(size,price,side,'S'),
                         'best_bid':bid.get_bbo('B'),
                         'bid_vwap':bid.get_vwap(),
                         'best_ofr':ofr.get_bbo('S'),
                         'ofr_vwap':ofr.get_vwap(),
                         'avg_vwap':PriceLevels({**bid,**ofr}).get_vwap(),
                         'date':timestamp.date()}
list(quotes.items())[500:504]
[(Timestamp('2021-01-22 09:31:48.205629'),
  {'bid': {24.98: 100},
   'ofr': {43.61: 300},
   'best_bid': 24.98,
   'bid_vwap': 24.98,
   'best_ofr': 43.61,
   'ofr_vwap': 43.61,
   'avg_vwap': 38.9525,
   'date': datetime.date(2021, 1, 22)}),
 (Timestamp('2021-01-22 09:31:48.320642'),
  {'bid': {24.98: 100},
   'ofr': {43.61: 300, 43.59: 300},
   'best_bid': 24.98,
   'bid_vwap': 24.98,
   'best_ofr': 43.59,
   'ofr_vwap': 43.6,
   'avg_vwap': 40.94,
   'date': datetime.date(2021, 1, 22)}),
 (Timestamp('2021-01-22 09:31:48.320661'),
  {'bid': {24.98: 100},
   'ofr': {43.59: 300},
   'best_bid': 24.98,
   'bid_vwap': 24.98,
   'best_ofr': 43.59,
   'ofr_vwap': 43.59,
   'avg_vwap': 38.93750000000001,
   'date': datetime.date(2021, 1, 22)}),
 (Timestamp('2021-01-22 09:31:48.325218'),
  {'bid': {24.98: 100},
   'ofr': {43.6: 300},
   'best_bid': 24.98,
   'bid_vwap': 24.98,
   'best_ofr': 43.6,
   'ofr_vwap': 43.6,
   'avg_vwap': 38.945,
   'date': datetime.date(2021, 1, 22)})]
#Create a dataframe of the current state of the order book at each timestamp
quote_df = (pd.DataFrame
            .from_dict(quotes,orient='index')
            .reset_index()
            .rename(columns={'index':'timestamp'})
            .dropna(subset=['best_bid','best_ofr'],how='all'))

display(quote_df)
timestamp bid ofr ... ofr_vwap avg_vwap date
0 2021-01-22 08:25:48.218283 {44.45: 1000} {} ... NaN 44.45000 2021-01-22
2 2021-01-22 08:44:23.578790 {44.04: 1000} {} ... NaN 44.04000 2021-01-22
4 2021-01-22 08:49:02.526710 {} {43.66: 100} ... 43.66000 43.66000 2021-01-22
6 2021-01-22 08:49:28.189795 {43.34: 100} {} ... NaN 43.34000 2021-01-22
8 2021-01-22 08:51:11.786607 {43.41: 100} {} ... NaN 43.41000 2021-01-22
... ... ... ... ... ... ... ...
342613 2021-01-28 16:56:26.486624 {} {988.0: 112} ... 988.00000 988.00000 2021-01-28
342614 2021-01-28 16:59:06.409716 {} {988.0: 112, 262.0: 100} ... 645.54717 645.54717 2021-01-28
342615 2021-01-28 16:59:06.793899 {} {988.0: 112} ... 988.00000 988.00000 2021-01-28
342616 2021-01-28 16:59:06.797557 {} {988.0: 112, 262.0: 100} ... 645.54717 645.54717 2021-01-28
342617 2021-01-28 16:59:08.241677 {} {988.0: 112} ... 988.00000 988.00000 2021-01-28

342023 rows × 9 columns

#Get just the trades from the original dataframe
trade_df = json_df.loc[json_df['type']=='trade_report',:]
trade_df
type timestamp symbol ... trade_id side date
0 trade_report 2021-01-22 08:02:31.215300 GME ... 2.067095e+06 X 2021-01-22
1 trade_report 2021-01-22 08:08:14.700160 GME ... 2.639914e+06 X 2021-01-22
2 trade_report 2021-01-22 08:11:52.294756 GME ... 3.063945e+06 X 2021-01-22
3 trade_report 2021-01-22 08:18:22.383301 GME ... 3.669247e+06 X 2021-01-22
4 trade_report 2021-01-22 08:19:09.002873 GME ... 3.739332e+06 X 2021-01-22
... ... ... ... ... ... ... ...
490754 trade_report 2021-01-28 16:54:05.085758 GME ... 3.026635e+09 X 2021-01-28
490755 trade_report 2021-01-28 16:54:05.592947 GME ... 3.026636e+09 X 2021-01-28
490759 trade_report 2021-01-28 16:56:26.486624 GME ... 3.026962e+09 X 2021-01-28
490762 trade_report 2021-01-28 16:59:06.793899 GME ... 3.027353e+09 X 2021-01-28
490765 trade_report 2021-01-28 16:59:08.241677 GME ... 3.027362e+09 X 2021-01-28

134712 rows × 9 columns

#Combine the two dataframes so that each trade report is associated 
#with a set of 'prevailing quotes.'
plot_df = (pd.concat([quote_df,trade_df],ignore_index=False)
             .fillna({'type':'quote','symbol':'GME'})
             .sort_values(['timestamp','type'])
             .drop('side',axis=1)
             .ffill()
             .dropna(subset=['best_bid','best_ofr'])
             .astype({'date':'string'}))

plot_df = plot_df.loc[plot_df['type']=='trade_report',:]

plot_df['midpt'] = (plot_df['best_bid']+plot_df['best_ofr'])/2

display(plot_df.loc[:,['timestamp','price','best_bid','best_ofr','midpt']])
timestamp price best_bid best_ofr midpt
27 2021-01-22 08:53:12.048708 43.85 43.55 43.66 43.605
28 2021-01-22 08:54:24.385311 44.09 43.55 43.66 43.605
31 2021-01-22 08:57:36.794024 44.22 43.55 44.30 43.925
32 2021-01-22 08:57:39.852032 44.20 43.55 44.30 43.925
33 2021-01-22 08:57:41.975903 44.28 43.55 44.30 43.925
... ... ... ... ... ...
490754 2021-01-28 16:54:05.085758 257.58 257.58 988.00 622.790
490755 2021-01-28 16:54:05.592947 257.58 257.58 988.00 622.790
490759 2021-01-28 16:56:26.486624 264.00 257.58 988.00 622.790
490762 2021-01-28 16:59:06.793899 262.00 257.58 988.00 622.790
490765 2021-01-28 16:59:08.241677 262.00 257.58 988.00 622.790

134697 rows × 5 columns
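With every trade now paired with its prevailing quote, plotting is straightforward. Here’s a minimal matplotlib sketch for one day (my own addition; the plots themselves aren’t reproduced in this post):

#Minimal sketch: trades against the prevailing best bid/offer for one day
import matplotlib.pyplot as plt

day = plot_df.loc[plot_df['date'] == '2021-01-22']
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(day['timestamp'], day['best_bid'], label='best bid')
ax.plot(day['timestamp'], day['best_ofr'], label='best offer')
ax.scatter(day['timestamp'], day['price'], s=2, color='k', label='trades')
ax.set_xlabel('time')
ax.set_ylabel('price ($)')
ax.legend()
plt.show()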

I’d like to make a special note that concatenating two dataframes, sorting them, and using the forward-fill option to fill in a prevailing value is a nice trick. The logically obvious approach, if you only wanted the trades, is to do some kind of time-range join, which would take waaaaay longer. Concatenating is fast, sorting is fast, forward filling is fast, and throwing away a ton of rows is fast. Joins are slow.
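
That said, pandas does ship a purpose-built tool for exactly this ‘prevailing value’ lookup, merge_asof, which pairs each row with the most recent match rather than doing a general join. A sketch using the frames above (both sides must be sorted on the key and share its dtype):

#Alternative: an as-of merge pairs each trade with the last quote at or before it
asof_df = pd.merge_asof(
    trade_df.sort_values('timestamp'),
    quote_df[['timestamp', 'best_bid', 'best_ofr']].sort_values('timestamp'),
    on='timestamp',
    direction='backward')  #take the most recent quote at or before each trade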
