SQLAlchemy is a sophisticated database toolkit for Python. It has multiple layers that offer different trade-offs between control and ease of use. SQLAlchemy is designed for excellent performance, but it is not always obvious how to harness all of it. In this article I’ll explore what’s possible and demonstrate some methods to optimize database access when using SQLAlchemy.
Setup
Let’s set the stage for a few experiments. First you need to install SQLAlchemy, which is as easy as `pip install sqlalchemy`.
I’ll use a SQLite database with a very simple table called ‘person’. A person has an id and a name. That’s it. Here we go:
```python
import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
metadata = Base.metadata
session = scoped_session(sessionmaker())
engine = None


class Person(Base):
    __tablename__ = 'person'
    id = Column(Integer, primary_key=True)
    name = Column(String(512))
```
Now let’s define a function to initialize the database. It creates an engine and assigns it to the global variable `engine`, then configures the session and resets the database.
```python
def init_db(db_name='sqlite:///test.db'):
    global engine
    engine = create_engine(db_name, echo=False)
    session.remove()
    session.configure(bind=engine, autoflush=False, expire_on_commit=False)
    metadata.drop_all(engine)
    metadata.create_all(engine)
```
Armed with this init_db() function, let’s measure how long it takes to insert 100,000 records into the Person table using various methods.
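Each experiment below follows the same pattern: reset the database, time the work, and print the duration. As a side note, that pattern can be factored into a small context manager; this helper is my own sketch, not part of the original measurements:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    # Measure wall-clock time for the enclosed block and print it
    # in the same format used by the experiments below.
    start = time.time()
    yield
    duration = time.time() - start
    print('{} - total time: {:.2f} seconds'.format(label, duration))


# Example usage with a trivial workload:
with timed('sum 0..99999'):
    total = sum(range(100000))
```

Each experiment could then be wrapped in `with timed('SQLAlchemy ORM'):` instead of repeating the `start`/`duration` boilerplate.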
SQLAlchemy ORM
The SQLAlchemy ORM is the highest-level interface. It follows the data mapper pattern and creates a Person object for each row.
```python
init_db()
start = time.time()
for i in range(100000):
    p = Person(name='name-{}'.format(i))
    session.add(p)
    if i % 1000 == 0:
        session.flush()
session.commit()
duration = time.time() - start
print('SQLAlchemy ORM - total time: {:.2f} seconds'.format(duration))
```
SQLAlchemy ORM with pre-computed Primary Key
Here I still use the ORM, but instead of letting the database generate the primary key, I provide it myself.
```python
init_db()
start = time.time()
for i in range(100000):
    p = Person(id=i + 1, name='name-{}'.format(i))
    session.add(p)
    if i % 1000 == 0:
        session.flush()
session.commit()
duration = time.time() - start
print('SQLAlchemy ORM with PK - total time: {:.2f} seconds'.format(duration))
```
SQLAlchemy ORM with Bulk Insert
The ORM provides a special method, bulk_insert_mappings(), for doing bulk inserts. With bulk insert you don’t create a Person object for each row; instead you provide a list of dictionaries that contain each row’s columns.
```python
init_db()
start = time.time()
for chunk in range(0, 100000, 10000):
    # Insert in batches of 10,000 mappings.
    session.bulk_insert_mappings(
        Person,
        [
            dict(name='name-{}'.format(i))
            for i in range(chunk, chunk + 10000)
        ])
session.commit()
duration = time.time() - start
print('SA ORM Bulk insert - total time: {:.2f} seconds'.format(duration))
```
SQLAlchemy Core
SQLAlchemy Core is the low-level interface to SQLAlchemy’s goodies. You operate directly on the engine and the tables and execute commands. There are no sessions. Note that you’re still a little higher-level than straight SQL, which means your code stays cross-platform and is not tied to a specific database engine.
Let’s see how long it takes to populate the person table with 100,000 people.
```python
init_db()
start = time.time()
engine.execute(
    Person.__table__.insert(),
    [dict(name='name-{}'.format(i)) for i in range(100000)])
duration = time.time() - start
print('SQLAlchemy Core - total time: {:.2f} seconds'.format(duration))
```
Straight SQLite3
Finally, to get an overall sense of how much overhead SQLAlchemy imposes for all its abstraction, here is the same operation of inserting 100,000 rows into the ‘person’ table done directly using the sqlite3 package (part of Python’s standard library).
```python
conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS person')
c.execute(
    'CREATE TABLE person (id INTEGER NOT NULL, '
    ' name VARCHAR(255),'
    ' PRIMARY KEY (id))')
conn.commit()
start = time.time()
for i in range(100000):
    row = ('name-{}'.format(i),)
    c.execute('INSERT INTO person (name) VALUES (?)', row)
conn.commit()
duration = time.time() - start
print('Direct sqlite3 - total time: {:.2f} seconds'.format(duration))
```
Execution Time Comparison
Here are the runtimes of the different methods:
```
SA ORM - total time: 12.31 seconds
SA ORM with PK - total time: 7.57 seconds
SA ORM Bulk insert - total time: 0.78 seconds
SA Core - total time: 0.09 seconds
Direct sqlite3 - total time: 0.46 seconds
```
I ran the measurements multiple times and the results are very stable. Some of them are quite surprising. The ORM method is quite slow, at around 12 seconds. Surprisingly, providing the primary key yourself shaves off about 40%! The bulk insert method is clearly well optimized and takes just 0.78 seconds. But using Core directly takes only 0.09 seconds. That’s roughly 136 times faster than the ORM method!
Another small surprise is that using Core is faster than using sqlite3 directly. This is likely because the sqlite3 version issues a separate execute() call per row inside a Python loop, whereas Core hands the whole list of rows to the driver at once, which lets it use an executemany-style batch.
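One way to test that hypothesis is to batch the direct sqlite3 inserts with cursor.executemany(), which sends all the rows in a single call. The following is a minimal sketch of that variant; it uses an in-memory database rather than test.db so it stays self-contained and doesn’t interfere with the experiments above:

```python
import sqlite3
import time

# In-memory database so this sketch doesn't touch test.db.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute(
    'CREATE TABLE person (id INTEGER NOT NULL, '
    ' name VARCHAR(255),'
    ' PRIMARY KEY (id))')
conn.commit()

start = time.time()
# Build all the rows up front, then hand them to the driver in one call.
rows = [('name-{}'.format(i),) for i in range(100000)]
c.executemany('INSERT INTO person (name) VALUES (?)', rows)
conn.commit()
duration = time.time() - start
print('sqlite3 executemany - total time: {:.2f} seconds'.format(duration))
```

If the per-call overhead is indeed the culprit, this version should land much closer to the Core timing than the row-by-row loop does.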
Conclusion
The bottom line is that SQLAlchemy is a mature and well-designed framework for working with databases. It provides a lot of abstraction, but it is also very performance-aware and makes sure there is a way to squeeze the best performance out of your database. Profile your code, and if you discover slowness related to SQLAlchemy, start reading: you’re probably doing something wrong, and SQLAlchemy has a solution waiting for you.