A Better Way To Load Data into Microsoft SQL Server from Pandas

July 13, 2016July 29, 2016 / pindash91

This post is not going to have much theory.

Here was my problem. Python and Pandas are excellent tools for munging data but if you want to store it long term a DataFrame is not the solution, especially if you need to do reporting. That’s why Edgar Codd discovered, and Michael Stonebreaker implemented, relational databases.

Obviously, if I had the choice I wouldn’t be using Microsoft SQL Server [MSS]. If I could, I would do this whole exercise in KDB, but KDB is expensive and not that good at storing long strings. Other relational databases might have better integration with Python, but at an enterprise MSS is the standard, and it supports all sorts of reporting.

So my task was to load a bunch of data about twenty thousand rows — in the long term we were going to load one hundred thousand rows an hour — into MSS.

Pandas is an amazing library built on top of numpy, a pretty fast C implementation of arrays.

Pandas has a built-in to_sql
method which allows anyone with a pyodbc engine to send their DataFrame into sql. Unfortunately, this method is really slow. It creates a transaction for every row. This means that every insert locks the table. This leads to poor performance (I got about 25 records a second.)

So I thought I would just use the pyodbc driver directly. After all, it has a special method for inserting many values called executemany. The MSS implementation of the pyodbc execute many also creates a transaction per row. So does pymssql.
I looked on stack overflow, but they pretty much recommended using bulk insert.Which is still the fastest way to copy data into MSS. But it has some serious drawbacks.

For one, bulk insert needs to have a way to access the created flat file. It works best if that access path is actually a local disk and not a network drive. It also is a very primitive tool, so there is very little that you can do in terms of rollback and it’s not easily monitored. Lastly, transferring flat files, means that you are doing data munging writing to disk, then copying to another remote disk then putting the data back in memory. It might be the fastest method, but all those operations have overhead and it creates a fragile pipeline.

So if I can’t do bulk insert, and I can’t use a library. I have to roll my own.

It’s actually pretty simple:

You create your connection, I did this with sqlalchemy but you can use whatever:


import sqlalchemy
import urllib
cxn_str = &quot;DRIVER={SQL Server Native Client 11.0};SERVER=server,port;DATABASE=mydb;UID=user;PWD=pwd&quot;
params = urllib.quote_plus(cxn_str)
engine = sqlalchemy.create_engine(&quot;mssql+pyodbc:///?odbc_connect=%s&quot; % params)
conn = engine.connect().connection
cursor = conn.cursor()

You take your df.
I’m going to assume that all of your values in the df are strings, if they are not.
And you have longs, you are going to need to do some munging to make sure that when we stringfy the values sql knows what’s up.
(so if you have a long, then you need to chop the L off because MSS doesn’t know that 12345678910111234L is a number. here is the regex to get rid of it: re.sub(r”(?<=\d)L”,”,s ) credit to my coworker who is a regex wiz.

We need to tuplify our records, Wes Mckinney says:

records = [tuple(x) for x in df.values]

Which is good enough for me. But we also need to stringfy them as well.

So that becomes:

records = [str(tuple(x)) for x in df.values]

MSS has a batch insert mode that supports up to 1000 rows at a time. Which means 1000 rows per transaction.

We need the sql script that does batch insertion: we’ll call that insert_

insert_ = &quot;&quot;&quot;

INSERT INTO mytable
(col1
,col2
,col3
...)
VALUES

&quot;&quot;&quot;

Since we can only insert 1000 rows. We need to break up the rows into 1000 row batches. So here is a way of batching the records.
My favorite solution comes from here.

def chunker(seq, size):
return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

Now all we need to do is have a way of creating the rows and batching.
Which becomes:

for batch in chunker(records, 1000):
rows = ','.join(batch)
insert_rows = insert_ + rows
cursor.execute(insert_rows)
conn.commit()

That’s it.
This solution got me inserting around 1300 rows a second. A 1.5 orders of magnitude increase.

There are further improvements for sure, but this will easily get me past a hundred thousand rows an hour.

18 thoughts on “A Better Way To Load Data into Microsoft SQL Server from Pandas”

Michael

April 27, 2017 at 10:37 pm

Thanks! I got about a 1.5x speed boost as well using your trick. I would like to note I used Python 3.6 so I switched from using ‘xrange’ to ‘range’ in chunker() and that forced me to change the below:
rows=’,’.join(batch) –> rows=str(batch).strip(‘[]’)

LikeLiked by 1 person

Reply
- pindash91
  
  April 28, 2017 at 12:07 am
  
  Glad it worked.
  
  LikeLike
  
  Reply
  - sergey (@sergei3000)
    
    June 29, 2017 at 4:57 pm
    
    “1.5x speed boost” is 10 times less than “1.5 orders of magnitude increase”.
    What does this depend on? I have the same problem, but only an order of magnitude speedup is worth trying in my case.
    
    LikeLike
- jballoonist
  
  February 1, 2018 at 6:28 pm
  
  Thanks to both of you for this. I had some other objects that weren’t strings which took some time to figure out, but this was extremely helpful, and obviously much faster than using the built in df.to_sql.
  
  – Rob
  
  LikeLike
  
  Reply
Vincent Castuche

April 3, 2018 at 12:02 pm

Hello,

You’re method seem perfect if you want to avoid BULK INSERT ! But i don’t understand one thing, where do you implement this code :

insert_ = “””
INSERT INTO mytable
(col1
,col2
,col3
…)
VALUES
“””

You put this in a stored procedure, in a py function ? “”” is an exemple or is the parameter where rows go ?

I really want to use your code but that part is dark to me.

thanks 🙂 🙂
Have a good day

LikeLike

Reply
pindash91

April 4, 2018 at 4:09 am

Hi Vincent, thanks for checking out my blog.
The insert is the MSS sql command that allows bulk inserts. So you are concatenating this string with the stringified records and executing that entire string as one transaction. As an example the fully formed string would look like:
INSERT INTO mytable (a,b,c) VALUES (1,2,3), (4,5,6), (7,8,9); more can be found here: https://stackoverflow.com/questions/6889065/inserting-multiple-rows-in-mysql

Hope that helps

LikeLike

Reply
Vincent Castuche

April 4, 2018 at 8:10 am

ok thanks I manage to use your solution but unfortunately performance are not crazy on my computer, I think because I have a lot of column (40 000 rows * 170 columns).

LikeLike

Reply
- pindash91
  
  April 4, 2018 at 10:06 am
  
  You might try experimenting with the batch size if you have a lot of data per row, than a smaller batch size could improve the performance.
  It also sounds like your data could use some normalization if you have 170 columns, or else you might want to switch to a column store. Since you are most likely not looking at all those columns each time.
  
  LikeLike
  
  Reply
Stuart

April 30, 2018 at 11:46 am

Hi Pindash,

Thanks for the post. I’m getting stuck early in the process. What does the ‘&quot’ part of the code do? It’s first used when you create the connection, then later with the INSERT.

Thanks,

LikeLike

Reply
- pindash91
  
  April 30, 2018 at 7:08 pm
  
  I believe that’s just a WordPress formatting issue.
  
  LikeLike
  
  Reply
- Lachlan Dickson
  
  May 21, 2018 at 6:09 am
  
  To clarify – the three ‘&quot’ symbols represent a triple quoted string.
  
  LikeLike
  
  Reply
Lachlan Dickson

May 21, 2018 at 2:10 am

Thankyou for writing this up – I implemented this method and achieved the same order of speed increase.

LikeLiked by 1 person

Reply
Darshith Badiyani

May 23, 2019 at 4:27 am

Hello! This is a brilliant solution. It almost worked for me. My data has an apostrophe and I am working on escaping it. But, overall it works. Good day.

LikeLike

Reply
- orfeu65Cristina
  
  July 4, 2019 at 3:35 pm
  
  Hi Darshith,
  I have the same problem, my data has single quotes, and the whole batch fails. Did you find a solution?
  
  LikeLike
  
  Reply
  - Darshith Badiyani
    
    July 8, 2019 at 9:24 am
    
    Hello! Yeah, I was able to find a workaround. We actually removed the apostrophe and were loading the data into the SQL server tables. As the data provided was not editable, we were removing the apostrophe in Python.
    
    LikeLike
Mitch Marek

June 5, 2019 at 2:54 pm

This is such a neat solution, thank you. It’s definitely sped things up. I was using

for index,row in df.iterrows():
cursor.execute(insert_,
row[‘col1’],
row[‘col2’],
row[‘col3’],
……
)
conn.commit()

but the speed your soluton achieves is a lifesaver. Thanks!

LikeLike

Reply
Andrew Stanton (@andrew_stanton)

July 14, 2019 at 4:51 am

FWIW, MSSQL has BULK INSERT and its not the standard INSERT clause that you are using with multiple VALUES tuples. There are a series of BULK specific operations that are hands down the fastest way to get data into SQL Server. Search for “BCP” or “Bulk Copy Program”.

MSSQL also hasn’t table locked (TABLOCK) on inserts (even in a TRANSACTION) for a long time (if ever), unless you are forcing it to by using TABLOCK or something. It will by default issue a ROWLOCK, which only blocks access to the row being inserted until its insertion is complete. If you are actually seeing TABLOCK, you probably have some setting enabled on the SQL server like TRANSACTION ISOLATION LEVEL SERIALIZABLE thats useful in some cases, but generally creates more problems than it solves.

If you want the fastest non-BCP option, I’ve found that its the use of a Table Valued Parameter with all your DataFrame info passed into a stored procedure that uses a MERGE statement to do the INSERT and UPDATE (or just insert) to place that incoming TVP data. Just make sure to use a HOLDLOCK on the target table as a MERGE is not an atomic operation and you will get races/deadlocks without it. Tweet at me if you cant find a good example, but be aware I dont check twitter often.

LikeLike

Reply
- pindash91
  
  July 14, 2019 at 6:32 pm
  
  I mention bulk insert in the article and point out the drawbacks of that approach.
  Secondly, usually in a large company you don’t get to have access to stored procedures or the DB settings those are in the hands of the DBA. If you do have control your solutions are indeed faster.
  
  LikeLike
  
  Reply

	pt.instafollowfast.c… on Backtracking In Q/ word l…
	Alex on The Is-Ought Distinction
	Fred Mleczko on How the Mighty have Fallen, or…
	pindash91 on A Better Way To Load Data into…
	Andrew Stanton (@and… on A Better Way To Load Data into…

iabdb

It's All Been Done Before.

A Better Way To Load Data into Microsoft SQL Server from Pandas

18 thoughts on “A Better Way To Load Data into Microsoft SQL Server from Pandas”

Leave a comment Cancel reply

Share this:

Related

18 thoughts on “A Better Way To Load Data into Microsoft SQL Server from Pandas”

Leave a comment Cancel reply