The IEX Order Book - Reading the Data
I recently gave a talk to the Python Frederick Meetup about Pandas and wanted to show some financial data since that’s what I’ve been working with lately. Specifically, I wanted something that had real time quotes and trades, but at best most sources give you the high, low, and median price aggregated over something like an hour or a day. There are a lot of sources of market data, and they have most definitely put a price on that data, a high price.
NYSE TAQ is about $1,000/month for the most recent 12 months of data and $500/month to go farther back in time. NASDAQ TotalView and similar products have a pretty complex fee structure, so I’m not entirely sure how much they cost. You can at least see the real-time data in your Robinhood (Gold account) or Charles Schwab account, but historical data’s going to cost you.
Which brings us to the Investor’s Exchange, or IEX. Featured in the Michael Lewis book Flash Boys, they do their best to offer access to the market with protections in place that favor long-term over short-term strategies, keeping in mind ‘long term’ might still mean only holding a stock for a millisecond. They’re pretty small compared to NYSE or NASDAQ, but they allow you to download pretty detailed historical data for free…as a dump of network packets.
What is an order book?
Why on earth do I want to see the order book in the first place? What is it? Every time a broker wants to make a trade, they put in an order (I want to buy 200 shares of stock A for $100 each). Until the order is canceled, or a seller is found who is willing to meet the buyer at their price, the order ‘rests on the book.’ Heavily traded stocks rarely have orders that rest long, unless the number of shares is exceptionally large or the price point is completely unreasonable. Here’s an extremely simplified view of what an order book might look like.
time | side | quantity | price |
---|---|---|---|
9:30 | B | 100 | 45.00 |
9:31 | B | 200 | 45.01 |
9:31 | S | 100 | 45.05 |
9:32 | B | 50 | 45.00 |
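To make that concrete, here’s the same toy book held as two Python dicts mapping price to resting shares; this is a sketch of the idea, not how an exchange actually stores orders.
#The toy book above: buys keyed by price on one side, sells on the other
bids = {45.00: 100 + 50, 45.01: 200} #both 45.00 buys rest at the same level
offers = {45.05: 100}
best_bid, best_offer = max(bids), min(offers)
print(f'best bid {best_bid}, best offer {best_offer}, '
      f'spread {best_offer - best_bid:.2f}')
#best bid 45.01, best offer 45.05, spread 0.04
Nothing trades here: the highest buyer (45.01) is still below the lowest seller (45.05), so all four orders keep resting on the book.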
What do you get from the IEX feed?
In the DEEP data product, you get a number of message types, most notably trade reports (when there is a trade) and price level updates. This isn’t quite the full order book, in that many orders may be combined together to create one price level update, but you do see how many shares have a bid (I will buy your stock) or offer (I want to sell you my stock) out at each dollar amount, which is pretty darn good for free. It’s important to note again that IEX is only a few percent of total market volume, so you’re not seeing the whole market, but you get a neat view that you would typically have to pay a lot to see. Here, you only pay with parsing pain.
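Conceptually, each decoded message boils down to something like the dicts below. The values here are made up, but the field names match the parsed output you’ll see shortly; a price level update carries the level’s new total size, and a size of zero means the level is gone.
#Hypothetical decoded messages (made-up values, field names as in the parsed output below)
trade = {'type': 'trade_report', 'symbol': 'GME',
         'size': 100, 'price': 45.19} #100 shares changed hands at $45.19
update = {'type': 'price_level_update', 'symbol': 'GME', 'side': 'B',
          'size': 300, 'price': 45.00} #300 shares now bid at $45.00;
#a later update at the same side and price with size 0 clears the level
The actual feed, of course, looks nothing like that: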
!xxd -l 112 -s 20000 ~/Downloads/data_feeds_20210122_20210122_IEXTP1_DEEP1.0.pcap
00004e20: 7cb9 0500 ed64 2f58 e805 0000 e805 0000 |....d/X........
00004e30: 0100 5e57 1504 b859 9ff9 2d53 0800 4500 ..^W...Y..-S..E.
00004e40: 05da e340 4000 4011 9f90 17e2 9b84 e9d7 ...@@.@.........
00004e50: 1504 288a 288a 05c6 0402 0100 0480 0100 ..(.(...........
00004e60: 0000 0000 d948 9605 4200 fc42 0000 0000 .....H..B..B....
00004e70: 0000 1903 0000 0000 0000 f76d ff74 b88d ...........m.t..
00004e80: 5c16 1600 4854 c51f ff74 b88d 5c16 4149 \...HT...t..\.AI
And that’s what you get. A lot of hex digits and a specifications document. Fortunately, there is a super cool person on the internet who made a parser that can get all this data into a json format based on the specifications document.
#A lovely command line conversion of the binary to a json file
!pip install iex_parser
!iex_to_json -i iex_file.pcap.gz -o iex_file.json.gz -t 'GME' -s
import json
with open('iex_deep_quotes_and_trades.json') as file:
    line = file.readline()
print(line[:500])
[{"type":"trade_report","event":null,"timestamp":"2021-01-22T13:02:31.215300+00:00","status":null,"symbol":"GME","detail":null,"halt_status":null,"reason":null,"flags":96.0,"size":39.0,"price":45.19,"trade_id":2067095.0,"side":null,"security_event":null},{"type":"trade_report","event":null,"timestamp":"2021-01-22T13:08:14.700160+00:00","status":null,"symbol":"GME","detail":null,"halt_status":null,"reason":null,"flags":96.0,"size":50.0,"price":44.87,"trade_id":2639914.0,"side":null,"security_even
It’s still hard to understand without reading the documentation and a little knowledge of how order books work, but it’s a heck of a lot more readable, and easy to turn into a dataframe.
#Read in json file; many json files that have a schema without
#a lot of nesting and variation can be read safely as records
import pandas as pd
json_df = pd.read_json('iex_deep_quotes_and_trades.json', orient='records')
display(json_df)
type | event | timestamp | ... | trade_id | side | security_event | |
---|---|---|---|---|---|---|---|
0 | trade_report | NaN | 2021-01-22 13:02:31.215300+00:00 | ... | 2.067095e+06 | None | NaN |
1 | trade_report | NaN | 2021-01-22 13:08:14.700160+00:00 | ... | 2.639914e+06 | None | NaN |
2 | trade_report | NaN | 2021-01-22 13:11:52.294756+00:00 | ... | 3.063945e+06 | None | NaN |
3 | trade_report | NaN | 2021-01-22 13:18:22.383301+00:00 | ... | 3.669247e+06 | None | NaN |
4 | trade_report | NaN | 2021-01-22 13:19:09.002873+00:00 | ... | 3.739332e+06 | None | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
490763 | price_level_update | NaN | 2021-01-28 21:59:06.793899+00:00 | ... | NaN | S | NaN |
490764 | price_level_update | NaN | 2021-01-28 21:59:06.797557+00:00 | ... | NaN | S | NaN |
490765 | trade_report | NaN | 2021-01-28 21:59:08.241677+00:00 | ... | 3.027362e+09 | None | NaN |
490766 | price_level_update | NaN | 2021-01-28 21:59:08.241677+00:00 | ... | NaN | S | NaN |
490767 | price_level_update | NaN | 2021-01-28 22:00:00.018339+00:00 | ... | NaN | S | NaN |
490768 rows × 14 columns
#see what the na situation is
json_df.isna().sum()
type 0
event 490768
timestamp 0
status 490768
symbol 0
...
size 0
price 0
trade_id 356056
side 134712
security_event 490768
Length: 14, dtype: int64
#get rid of columns that are entirely null
json_df = json_df.dropna(axis=1, how='all')
display(json_df)
type | timestamp | symbol | ... | price | trade_id | side | |
---|---|---|---|---|---|---|---|
0 | trade_report | 2021-01-22 13:02:31.215300+00:00 | GME | ... | 45.19 | 2.067095e+06 | None |
1 | trade_report | 2021-01-22 13:08:14.700160+00:00 | GME | ... | 44.87 | 2.639914e+06 | None |
2 | trade_report | 2021-01-22 13:11:52.294756+00:00 | GME | ... | 44.58 | 3.063945e+06 | None |
3 | trade_report | 2021-01-22 13:18:22.383301+00:00 | GME | ... | 44.04 | 3.669247e+06 | None |
4 | trade_report | 2021-01-22 13:19:09.002873+00:00 | GME | ... | 43.78 | 3.739332e+06 | None |
... | ... | ... | ... | ... | ... | ... | ... |
490763 | price_level_update | 2021-01-28 21:59:06.793899+00:00 | GME | ... | 262.00 | NaN | S |
490764 | price_level_update | 2021-01-28 21:59:06.797557+00:00 | GME | ... | 262.00 | NaN | S |
490765 | trade_report | 2021-01-28 21:59:08.241677+00:00 | GME | ... | 262.00 | 3.027362e+09 | None |
490766 | price_level_update | 2021-01-28 21:59:08.241677+00:00 | GME | ... | 262.00 | NaN | S |
490767 | price_level_update | 2021-01-28 22:00:00.018339+00:00 | GME | ... | 988.00 | NaN | S |
490768 rows × 8 columns
#What are we left with?
json_df.isna().sum()
type 0
timestamp 0
symbol 0
flags 0
size 0
price 0
trade_id 356056
side 134712
dtype: int64
#What data types are we working with?
json_df.dtypes
type object
timestamp datetime64[ns, UTC]
symbol object
flags int64
size int64
price float64
trade_id float64
side object
dtype: object
#The objects really should be strings
json_df = json_df.astype({'type':'string','symbol':'string','side':'string'})
json_df.dtypes
type string
timestamp datetime64[ns, UTC]
symbol string
flags int64
size int64
price float64
trade_id float64
side string
dtype: object
#Fill in nulls on the side, since that may cause trouble plotting trades,
#create a date column for filtering purposes, and change from UTC to EST
json_df = json_df.fillna({'side':'X'}) #replace nulls
json_df['date'] = json_df['timestamp'].dt.date
json_df['timestamp'] = (json_df['timestamp']
                        .dt.tz_convert('EST')
                        .dt.tz_localize(None)) #change to local time
display(json_df)
type | timestamp | symbol | ... | trade_id | side | date | |
---|---|---|---|---|---|---|---|
0 | trade_report | 2021-01-22 08:02:31.215300 | GME | ... | 2.067095e+06 | X | 2021-01-22 |
1 | trade_report | 2021-01-22 08:08:14.700160 | GME | ... | 2.639914e+06 | X | 2021-01-22 |
2 | trade_report | 2021-01-22 08:11:52.294756 | GME | ... | 3.063945e+06 | X | 2021-01-22 |
3 | trade_report | 2021-01-22 08:18:22.383301 | GME | ... | 3.669247e+06 | X | 2021-01-22 |
4 | trade_report | 2021-01-22 08:19:09.002873 | GME | ... | 3.739332e+06 | X | 2021-01-22 |
... | ... | ... | ... | ... | ... | ... | ... |
490763 | price_level_update | 2021-01-28 16:59:06.793899 | GME | ... | NaN | S | 2021-01-28 |
490764 | price_level_update | 2021-01-28 16:59:06.797557 | GME | ... | NaN | S | 2021-01-28 |
490765 | trade_report | 2021-01-28 16:59:08.241677 | GME | ... | 3.027362e+09 | X | 2021-01-28 |
490766 | price_level_update | 2021-01-28 16:59:08.241677 | GME | ... | NaN | S | 2021-01-28 |
490767 | price_level_update | 2021-01-28 17:00:00.018339 | GME | ... | NaN | S | 2021-01-28 |
490768 rows × 9 columns
#Create subset that is just the order book updates
mask = json_df['type']=='price_level_update'
select_cols = ['timestamp','size','price','side']
sort_cols = ['timestamp','price']
order_df = json_df.loc[mask,select_cols].sort_values(sort_cols)
I’m not sure this is the most effective way to accomplish the task, but I felt like making a class to do the data manipulation necessary to carry the information forward from one row to the next. The goal is for each row to reflect the correct number of shares resting at every price level at that point in time, accumulating all the different price level updates forward in time. If there were only a few price levels, breaking them out into their own columns and filling each level’s size forward (sketched below) would be the most effective approach. However:
print(f'There are {order_df.price.nunique()} price levels.')
There are 11848 price levels.
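For the record, here’s roughly what that one-column-per-level approach could look like; a sketch only, not something I’d run against this data. Workable for a handful of levels, hopeless for nearly twelve thousand of them.
#A sketch of the rejected approach: one column per (side, price) level, where
#each update overwrites the level's size and a forward fill carries it onward
wide = (order_df
        .pivot_table(index='timestamp', columns=['side', 'price'],
                     values='size', aggfunc='last')
        .ffill())
#wide.loc[t] would then hold the whole book at time t, at the cost of an
#enormous, mostly-empty frame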
class PriceLevels(dict):
    #A dict of price -> resting shares for one side of the book
    def add_or_discard(self, size, price, side, quote_side):
        #only react to updates for this side of the book
        if side == quote_side:
            if size > 0:
                self.update({price: size}) #set the level's new total size
            else:
                self.pop(price, None) #size 0 means the level is gone
        return self
    def get_bbo(self, side):
        #the best bid is the highest price; the best offer is the lowest
        if len(self) == 0:
            return None
        return max(self) if side == 'B' else min(self)
    def get_vwap(self):
        #volume weighted average price of the resting shares
        if len(self) == 0:
            return None
        return sum(k*v for k, v in self.items())/sum(self.values())
    def update_prices(self, size, price, side, quote_side):
        self.add_or_discard(size, price, side, quote_side)
        return PriceLevels(self.copy()) #snapshot to store per timestamp
#Some examples of how the object works
a = PriceLevels({45.45: 100, 50: 100, 55: 50})
print('set price levels:', a)
a.update_prices(100, 46.05, 'B', 'B')
print('update price level same side:', a)
a.update_prices(100, 46.10, 'S', 'B')
print('update price level opposite side:', a)
a.update_prices(0, 50, 'B', 'B')
print('remove price same side:', a)
print('get the best bid:', a.get_bbo('B'))
print('get vwap:', a.get_vwap())
set price levels: {45.45: 100, 50: 100, 55: 50}
update price level same side: {45.45: 100, 50: 100, 55: 50, 46.05: 100}
update price level opposite side: {45.45: 100, 50: 100, 55: 50, 46.05: 100}
remove price same side: {45.45: 100, 55: 50, 46.05: 100}
get the best bid: 55
get vwap: 47.6
#For each timestamp, find the current size at each available price, the best bid and offer,
#as well as the VWAP (volume weighted average price) of the buy orders, sell orders, and all orders
bid = PriceLevels()
ofr = PriceLevels()
quotes = dict()
#Use iterrows() to carry the book state forward one update at a time
for row in order_df.iterrows():
    timestamp, size, price, side = row[1]
    quotes[timestamp] = {'bid': bid.update_prices(size, price, side, 'B'),
                         'ofr': ofr.update_prices(size, price, side, 'S'),
                         'best_bid': bid.get_bbo('B'),
                         'bid_vwap': bid.get_vwap(),
                         'best_ofr': ofr.get_bbo('S'),
                         'ofr_vwap': ofr.get_vwap(),
                         'avg_vwap': PriceLevels({**bid, **ofr}).get_vwap(),
                         'date': timestamp.date()}
list(quotes.items())[500:504]
[(Timestamp('2021-01-22 09:31:48.205629'),
{'bid': {24.98: 100},
'ofr': {43.61: 300},
'best_bid': 24.98,
'bid_vwap': 24.98,
'best_ofr': 43.61,
'ofr_vwap': 43.61,
'avg_vwap': 38.9525,
'date': datetime.date(2021, 1, 22)}),
(Timestamp('2021-01-22 09:31:48.320642'),
{'bid': {24.98: 100},
'ofr': {43.61: 300, 43.59: 300},
'best_bid': 24.98,
'bid_vwap': 24.98,
'best_ofr': 43.59,
'ofr_vwap': 43.6,
'avg_vwap': 40.94,
'date': datetime.date(2021, 1, 22)}),
(Timestamp('2021-01-22 09:31:48.320661'),
{'bid': {24.98: 100},
'ofr': {43.59: 300},
'best_bid': 24.98,
'bid_vwap': 24.98,
'best_ofr': 43.59,
'ofr_vwap': 43.59,
'avg_vwap': 38.93750000000001,
'date': datetime.date(2021, 1, 22)}),
(Timestamp('2021-01-22 09:31:48.325218'),
{'bid': {24.98: 100},
'ofr': {43.6: 300},
'best_bid': 24.98,
'bid_vwap': 24.98,
'best_ofr': 43.6,
'ofr_vwap': 43.6,
'avg_vwap': 38.945,
'date': datetime.date(2021, 1, 22)})]
#Create a dataframe of the current state of the order book at each timestamp
quote_df = (pd.DataFrame
            .from_dict(quotes, orient='index')
            .reset_index()
            .rename(columns={'index': 'timestamp'})
            .dropna(subset=['best_bid', 'best_ofr'], how='all'))
display(quote_df)
timestamp | bid | ofr | ... | ofr_vwap | avg_vwap | date | |
---|---|---|---|---|---|---|---|
0 | 2021-01-22 08:25:48.218283 | {44.45: 1000} | {} | ... | NaN | 44.45000 | 2021-01-22 |
2 | 2021-01-22 08:44:23.578790 | {44.04: 1000} | {} | ... | NaN | 44.04000 | 2021-01-22 |
4 | 2021-01-22 08:49:02.526710 | {} | {43.66: 100} | ... | 43.66000 | 43.66000 | 2021-01-22 |
6 | 2021-01-22 08:49:28.189795 | {43.34: 100} | {} | ... | NaN | 43.34000 | 2021-01-22 |
8 | 2021-01-22 08:51:11.786607 | {43.41: 100} | {} | ... | NaN | 43.41000 | 2021-01-22 |
... | ... | ... | ... | ... | ... | ... | ... |
342613 | 2021-01-28 16:56:26.486624 | {} | {988.0: 112} | ... | 988.00000 | 988.00000 | 2021-01-28 |
342614 | 2021-01-28 16:59:06.409716 | {} | {988.0: 112, 262.0: 100} | ... | 645.54717 | 645.54717 | 2021-01-28 |
342615 | 2021-01-28 16:59:06.793899 | {} | {988.0: 112} | ... | 988.00000 | 988.00000 | 2021-01-28 |
342616 | 2021-01-28 16:59:06.797557 | {} | {988.0: 112, 262.0: 100} | ... | 645.54717 | 645.54717 | 2021-01-28 |
342617 | 2021-01-28 16:59:08.241677 | {} | {988.0: 112} | ... | 988.00000 | 988.00000 | 2021-01-28 |
342023 rows × 9 columns
#Get just the trades from the original dataframe
trade_df = json_df.loc[json_df['type']=='trade_report',:]
trade_df
type | timestamp | symbol | ... | trade_id | side | date | |
---|---|---|---|---|---|---|---|
0 | trade_report | 2021-01-22 08:02:31.215300 | GME | ... | 2.067095e+06 | X | 2021-01-22 |
1 | trade_report | 2021-01-22 08:08:14.700160 | GME | ... | 2.639914e+06 | X | 2021-01-22 |
2 | trade_report | 2021-01-22 08:11:52.294756 | GME | ... | 3.063945e+06 | X | 2021-01-22 |
3 | trade_report | 2021-01-22 08:18:22.383301 | GME | ... | 3.669247e+06 | X | 2021-01-22 |
4 | trade_report | 2021-01-22 08:19:09.002873 | GME | ... | 3.739332e+06 | X | 2021-01-22 |
... | ... | ... | ... | ... | ... | ... | ... |
490754 | trade_report | 2021-01-28 16:54:05.085758 | GME | ... | 3.026635e+09 | X | 2021-01-28 |
490755 | trade_report | 2021-01-28 16:54:05.592947 | GME | ... | 3.026636e+09 | X | 2021-01-28 |
490759 | trade_report | 2021-01-28 16:56:26.486624 | GME | ... | 3.026962e+09 | X | 2021-01-28 |
490762 | trade_report | 2021-01-28 16:59:06.793899 | GME | ... | 3.027353e+09 | X | 2021-01-28 |
490765 | trade_report | 2021-01-28 16:59:08.241677 | GME | ... | 3.027362e+09 | X | 2021-01-28 |
134712 rows × 9 columns
#Combine the two dataframes so that each trade report is associated
#with a set of 'prevailing quotes.'
plot_df = (pd.concat([quote_df, trade_df], ignore_index=False)
           .fillna({'type': 'quote', 'symbol': 'GME'})
           .sort_values(['timestamp', 'type']) #'quote' sorts before 'trade_report'
           .drop('side', axis=1)
           .ffill() #each trade inherits the most recent quote row's values
           .dropna(subset=['best_bid', 'best_ofr'])
           .astype({'date': 'string'}))
plot_df = plot_df.loc[plot_df['type']=='trade_report', :]
plot_df['midpt'] = (plot_df['best_bid'] + plot_df['best_ofr'])/2
display(plot_df.loc[:, ['timestamp', 'price', 'best_bid', 'best_ofr', 'midpt']])
timestamp | price | best_bid | best_ofr | midpt | |
---|---|---|---|---|---|
27 | 2021-01-22 08:53:12.048708 | 43.85 | 43.55 | 43.66 | 43.605 |
28 | 2021-01-22 08:54:24.385311 | 44.09 | 43.55 | 43.66 | 43.605 |
31 | 2021-01-22 08:57:36.794024 | 44.22 | 43.55 | 44.30 | 43.925 |
32 | 2021-01-22 08:57:39.852032 | 44.20 | 43.55 | 44.30 | 43.925 |
33 | 2021-01-22 08:57:41.975903 | 44.28 | 43.55 | 44.30 | 43.925 |
... | ... | ... | ... | ... | ... |
490754 | 2021-01-28 16:54:05.085758 | 257.58 | 257.58 | 988.00 | 622.790 |
490755 | 2021-01-28 16:54:05.592947 | 257.58 | 257.58 | 988.00 | 622.790 |
490759 | 2021-01-28 16:56:26.486624 | 264.00 | 257.58 | 988.00 | 622.790 |
490762 | 2021-01-28 16:59:06.793899 | 262.00 | 257.58 | 988.00 | 622.790 |
490765 | 2021-01-28 16:59:08.241677 | 262.00 | 257.58 | 988.00 | 622.790 |
134697 rows × 5 columns
I’d like to make a special note that concatenating two dataframes, sorting them, and using forward fill to carry a prevailing value onto the rows below is a nice trick. The logically obvious approach, if you only wanted the trades, is some kind of time-range join, which would take waaaaaay longer. Concatenating is fast, sorting is fast, forward filling is fast, and throwing away a ton of rows is fast. Joins are slow.
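To see the pattern in isolation, here’s a minimal sketch on toy data (column and value names invented for the example):
import pandas as pd
#stack quotes and trades into one frame, sort by time, forward fill the
#quote columns, then keep only the trade rows
quotes = pd.DataFrame({'timestamp': [1, 3, 5], 'best_bid': [45.00, 45.01, 45.02]})
trades = pd.DataFrame({'timestamp': [2, 4], 'price': [45.05, 45.06], 'type': 'trade'})
merged = (pd.concat([quotes, trades])
          .fillna({'type': 'quote'})
          .sort_values('timestamp')
          .ffill()) #each trade inherits the quote row above it
print(merged.loc[merged['type']=='trade'])
The trade at timestamp 2 picks up the best bid from timestamp 1, and the trade at timestamp 4 picks up the best bid from timestamp 3, without ever computing a join.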