A Better Way To Load Data into Microsoft SQL Server from Pandas

This post is not going to have much theory.

Here was my problem. Python and Pandas are excellent tools for munging data, but if you want to store it long term, a DataFrame is not the solution, especially if you need to do reporting. That's why Edgar Codd invented, and Michael Stonebraker implemented, relational databases.

Obviously, if I had the choice I wouldn’t be using Microsoft SQL Server [MSS]. If I could, I would do this whole exercise in KDB, but KDB is expensive and not that good at storing long strings. Other relational databases might have better integration with Python, but at an enterprise MSS is the standard, and it supports all sorts of reporting.

So my task was to load a bunch of data, about twenty thousand rows (with one hundred thousand rows an hour planned in the long term), into MSS.

Pandas is an amazing library built on top of numpy, a pretty fast C implementation of arrays.

Pandas has a built-in to_sql method which lets anyone with a pyodbc engine send their DataFrame into SQL. Unfortunately, this method is really slow. It creates a transaction for every row, which means every insert locks the table. This leads to poor performance (I got about 25 records a second).
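For reference, that slow path is essentially a one-liner (a sketch, assuming the SQLAlchemy engine created below and a placeholder table name):

df.to_sql('mytable', engine, if_exists='append', index=False)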

So I thought I would just use the pyodbc driver directly. After all, it has a special method for inserting many values called executemany. But the MSS implementation of pyodbc's executemany also creates a transaction per row, and so does pymssql.
I looked on Stack Overflow, and the answers pretty much recommended using bulk insert, which is still the fastest way to copy data into MSS. But it has some serious drawbacks.

For one, bulk insert needs to be able to access the flat file you create, and it works best if that access path is a local disk rather than a network drive. It is also a very primitive tool, so there is very little you can do in terms of rollback, and it is not easily monitored. Lastly, transferring flat files means you are munging the data, writing it to disk, copying it to another (often remote) disk, and then reading it back into memory. It might be the fastest method, but all those operations have overhead and they make for a fragile pipeline.
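For comparison, the flat-file pipeline that bulk insert implies looks roughly like this (a sketch only: the file path and table name are placeholders, it assumes the pyodbc cursor and connection set up below, and the file must be readable from the SQL Server machine):

# Dump the DataFrame to a flat file the server can reach.
df.to_csv(r'\\shared\staging\mytable.csv', index=False, header=False)

# Then ask the server to pull the file in itself.
cursor.execute(r"""
    BULK INSERT mytable
    FROM '\\shared\staging\mytable.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
""")
conn.commit()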

So if I can't do bulk insert and I can't use a library, I have to roll my own.

It's actually pretty simple:

You create your connection. I did this with SQLAlchemy, but you can use whatever you like:


import sqlalchemy
import urllib

# Build an ODBC connection string and URL-encode it for SQLAlchemy.
# (On Python 3 this is urllib.parse.quote_plus instead of urllib.quote_plus.)
cxn_str = "DRIVER={SQL Server Native Client 11.0};SERVER=server,port;DATABASE=mydb;UID=user;PWD=pwd"
params = urllib.quote_plus(cxn_str)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)

# Grab the raw DBAPI connection and a cursor from the SQLAlchemy engine.
conn = engine.connect().connection
cursor = conn.cursor()

Now take your df. I'm going to assume that all of the values in the df are strings. If they are not, and you have longs, you are going to need to do some munging so that when we stringify the values SQL knows what's up. If you have a long, you need to chop the trailing L off, because MSS doesn't know that 12345678910111234L is a number. Here is the regex to get rid of it: re.sub(r"(?<=\d)L", "", s) (credit to my coworker, who is a regex wiz).
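As a rough sketch, stripping that suffix from a stringified record (the stringifying step comes next) looks like this; the values here are purely illustrative:

import re

s = "(12345678910111234L, 'foo')"   # a stringified record containing a Python 2 long
s = re.sub(r"(?<=\d)L", "", s)      # -> "(12345678910111234, 'foo')"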

We need to tuplify our records. Wes McKinney says:

records = [tuple(x) for x in df.values]

Which is good enough for me. But we also need to stringify them.

So that becomes:

records = [str(tuple(x)) for x in df.values]
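Each record is now a string that already reads like a SQL VALUES tuple, for example:

str(tuple(['a', 'b', 'c']))
# "('a', 'b', 'c')"

Python's repr() single-quotes strings the same way SQL quotes literals, which is what makes the concatenation below work (as long as the values don't themselves contain single quotes).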

MSS has a batch insert mode that supports up to 1000 rows at a time, which means 1000 rows per transaction.

We need the SQL script that does the batch insertion; we'll call it insert_:

insert_ = """
INSERT INTO mytable
(col1
,col2
,col3
...)
VALUES
"""
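Concatenated with a batch of stringified records, the full statement sent to the server ends up looking like this (illustrative column names and values):

INSERT INTO mytable
(col1
,col2
,col3)
VALUES
('a', 'b', 'c'),('d', 'e', 'f'),('g', 'h', 'i')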

Since we can only insert 1000 rows at a time, we need to break the records up into 1000-row batches. Here is a way of batching them; my favorite solution comes from here.

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))
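For example, with a batch size of 2:

list(chunker([1, 2, 3, 4, 5], 2))
# [[1, 2], [3, 4], [5]]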

Now all we need to do is build the rows for each batch and execute the insert, which becomes:

for batch in chunker(records, 1000):
    rows = ','.join(batch)
    insert_rows = insert_ + rows
    cursor.execute(insert_rows)
    conn.commit()

That's it.
This solution got me inserting around 1300 rows a second, an increase of more than 1.5 orders of magnitude.

There are further improvements for sure, but this will easily get me past a hundred thousand rows an hour.

18 thoughts on "A Better Way To Load Data into Microsoft SQL Server from Pandas"

  1. Thanks! I got about a 1.5x speed boost as well using your trick. I would like to note that I used Python 3.6, so I switched from 'xrange' to 'range' in chunker(), and that forced me to change the line below:
    rows = ','.join(batch) --> rows = str(batch).strip('[]')


  2. Hello,

    Your method seems perfect if you want to avoid BULK INSERT! But I don't understand one thing: where do you implement this code?

    insert_ = """
    INSERT INTO mytable
    (col1
    ,col2
    ,col3
    ...)
    VALUES
    """

    Do you put this in a stored procedure, or in a Python function? Is """ an example, or is it the parameter where the rows go?

    I really want to use your code but that part is unclear to me.

    Thanks 🙂 🙂
    Have a good day


  3. Hi Vincent, thanks for checking out my blog.
    The insert_ is the MSS SQL command that allows batch inserts. So you are concatenating this string with the stringified records and executing the entire string as one transaction. As an example, the fully formed string would look like:
    INSERT INTO mytable (a,b,c) VALUES (1,2,3), (4,5,6), (7,8,9); more can be found here: https://stackoverflow.com/questions/6889065/inserting-multiple-rows-in-mysql

    Hope that helps


  4. OK, thanks. I managed to use your solution, but unfortunately the performance is not great on my computer, I think because I have a lot of columns (40,000 rows × 170 columns).


    • You might try experimenting with the batch size: if you have a lot of data per row, then a smaller batch size could improve the performance.
      It also sounds like your data could use some normalization if you have 170 columns, or else you might want to switch to a column store, since you are most likely not looking at all those columns each time.


  5. Hi Pindash,

    Thanks for the post. I’m getting stuck early in the process. What does the ‘&quot’ part of the code do? It’s first used when you create the connection, then later with the INSERT.

    Thanks,


  6. This is such a neat solution, thank you. It's definitely sped things up. I was using

    for index, row in df.iterrows():
        cursor.execute(insert_,
                       row['col1'],
                       row['col2'],
                       row['col3'],
                       ......
                       )
        conn.commit()

    but the speed your solution achieves is a lifesaver. Thanks!


  7. FWIW, MSSQL has BULK INSERT, and it's not the standard INSERT clause that you are using with multiple VALUES tuples. There is a series of BULK-specific operations that are hands down the fastest way to get data into SQL Server. Search for "BCP" or "Bulk Copy Program".

    MSSQL also hasn't table-locked (TABLOCK) on inserts (even in a TRANSACTION) for a long time (if ever), unless you are forcing it to by using TABLOCK or something. It will by default issue a ROWLOCK, which only blocks access to the row being inserted until its insertion is complete. If you are actually seeing TABLOCK, you probably have some setting enabled on the SQL server, like TRANSACTION ISOLATION LEVEL SERIALIZABLE, that's useful in some cases but generally creates more problems than it solves.

    If you want the fastest non-BCP option, I've found that it's the use of a Table-Valued Parameter with all your DataFrame info passed into a stored procedure that uses a MERGE statement to do the INSERT and UPDATE (or just insert) to place that incoming TVP data. Just make sure to use a HOLDLOCK on the target table, as a MERGE is not an atomic operation and you will get races/deadlocks without it. Tweet at me if you can't find a good example, but be aware I don't check Twitter often.


    • I mention bulk insert in the article and point out the drawbacks of that approach.
      Secondly, in a large company you usually don't get access to stored procedures or the DB settings; those are in the hands of the DBA. If you do have control, your solutions are indeed faster.

