Better SQLAlchemy

2021-02-16

Python Databases Development

Introduction

We heavily use SQLAlchemy behind the scenes at Tiqets . As one of the most powerful and well-featured ORMs out there, it works well for small and big projects alike. However, as projects start to grow, the abstractions provided by the helpful ORM affect performance. The Tiqets back-end is persisted in Postgres across hundreds of tables. Among all these tables and their relationships, it’s easy to end up writing inefficient ORM queries, or reinventing the wheel where the ORM could have been used in a cleaner way. In this article, I want to walk through some basic to advanced features of SQLAlchemy that I have come across and everyone might find useful, keeping these things in mind:

Use efficient code patterns to load data into Python, and checks to detect inefficient queries
Improve readability of code, by using less-common SQLAlchemy patterns

A note before starting off — Tiqets uses Flask-SQLAlchemy, so some code samples are specific to that extension, but the underlying concepts still apply to SQLAlchemy.

Understand .get, .first and .scalar

Always fetch single records referencing a primary key using .get. While both emit the same SQL queries, the advantage of .get is that if the object is already in the session .get will not make a new query to fetch it.

# Bad
Product.query.filter(Product.id == 1234).first()

# Good
Product.query.get(1234)

All three first, scalar, and one return one row. The difference is:

.first applies a LIMIT 1 to the query, and returns the single row.
.scalar fetches all rows matching the query, and throws an exception if the query finds more than one row.
.one works like scalar, except that it throws an exception if the query finds zero rows.

Here’s a handy cheat-sheet:

Use .get for queries involving the primary key.
Use .first only when you don’t actually care about other rows.
Use .one when you expect there to be exactly one row.
Use .scalar where you expect there to be at most one row.

Always fetch single records referencing a primary key using .get.

Specify an explicit loading strategy

One of the first things that trips newcomers to SQLAlchemy (or ORMs in general) is that relationships are loaded lazily by default. SQLAlchemy calls it the n+1 problem .

As an example, in Tiqets we have a following simplified relationship between Product and ProductImage:

class Product(db.Model):
	id = db.Column(db.Integer, primary_key=True)
	
class ProductImage(db.Model):
	id = db.Column(db.Integer, primary_key=True)
	product_id = db.Column(db.Integer, db.ForeignKey(Product.id), nullable=False)

To render a list of 50 products with their images, we’ll use a query similar to this:

[(p, p.images) for p in Product.query.limit(50)]

The list of images isn’t loaded by default, but loaded lazily the first time we access the attribute. In this case this happens by accessing p.images. This happens 50 times in the following list comprehension, necessitating a query to load the images for each product.

SQLAlchemy provides a few options for a loading strategy to use for relationships. For this particular query for 50 products with images, I ended up with these numbers:

Loading strategy	Number of queries	Time
default (lazyload)	51	71.6 ms ± 7.16 ms
subqueryload	2	15.1 ms ± 1.32 ms
joinedload	1	37 ms ± 1.98 ms
selectinload	2	12.7 ms ± 1.54 ms

Some takeaways from these numbers:

For 50 products, lazyload (one query for each list) produced 55 queries.
joinedload (one humongous query to load everything in one go using left outer joins) uses fewer queries, but takes longer due to the extra data being fetched and deduplicated by SQLAlchemy.
subqueryload is almost as fast as selectinload. Both loading strategies have a similar idea in the second query they use to retrieve the relationship objects: selectinload uses the primary keys from the first query to load relationships using an IN clause, subqueryload instead reuses the first query as a subquery to filter matching results.

These results were calculated in IPython using:

from sqlalchemy.orm import joinedload, subqueryload, selectinload

for fn in (subqueryload, joinedload, selectinload):
    %timeit -n 1 -r 100 [p.images for p in Product.query.limit(50).options(fn(Product.images))]

My personal preference is to explicitly specify a loading strategy wherever possible. selectinload performs well for many use cases, but there are many other scenarios: see the section What kind of Loading to use? in the SQLAlchemy documentation.

Loading strategies for debugging

Along with the four strategies, it’s helpful to use the noload and raiseload strategies. raiseload is particularly helpful if you don't want to load results from a specific expensive relationship and want to make sure some part of code in templates isn't referencing a lazily loaded property.

Fetch only the needed columns

Loading a table or relationship with all columns is the default behaviour in SQLAlchemy. But what if we only need a subset of the columns? This is particularly useful for fetching tables with a lot of columns when we only need a couple of them. Use with_entities to specify columns/models to fetch to pick and choose exactly what's needed from the database. For example:

# Fetching only Product and the name of the city the product is located in

Product
    .query
    .outerjoin(Product.city)
    .with_entities(Product, City.name)

Using contains_eager for manually loaded columns

Occasionally, there are one-to-many relationships in a table and the relationship is queried together with the parent model. This is where contains_eager comes in handy. While querying a relationship with manually specified columns, we can tell SQLAlchemy to use the results directly to populate the relationship.

Product.query.outerjoin(
    ProductImage
).options(
    contains_eager(Product.images)
)

This use case becomes interesting when paired with, for example, loading relationships excluding soft deletes. In that case, our query can look like this:

Product.query.outerjoin(
    ProductImage, ProductImage.is_deleted != True
	).options(
    contains_eager(Product.images)
)

In this case, Product.images represents a subset of the items in the database. Keep in mind, because this might be a subset of the real collection in the database, it cannot be used to modify the collection.

Get the parameterized SQL query for debugging

SQLAlchemy queries can be converted to raw SQL queries easily because they implement the __str__ method:

[Input]:
print(Product.query.filter(Product.id == 975585))


[Output]: 
SELECT tiqets.tiqets_products.productstatusid AS tiqets_tiqets_products_productstatusid, ...
FROM tiqets.tiqets_products
WHERE tiqets.tiqets_products.id = %(id_1)s

Notice that this is not really the raw SQL query, it's parametrized and can't be run directly against the database. But that’s possible by integrating with Postgres directly, which is up next.

Get the real SQL query

To log the real queries that Postgres is running:

ALTER DATABASE database_name SET log_statement = 'all';

Once any existing database connections need to be reinitialized by restarting the app, all queries will be logged in Postgres logs (within a Docker container, Postgres logs are sent to stdout).

Set up and enable Flask-DebugToolbar

If you are using Flask, the Flask-DebugToolbar is the perfect companion for development. The extension adds very useful on every HTML page including the number of database queries (and the queries), template information, and can also be used to profile Flask routes and Jinja templates.

Cache inefficient queries

When we have optimized our code and queries as much as possible, there are occasions when caching can help speed things up drastically. SQLAlchemy objects can be pickled and stored in Redis. At Tiqets, we created a dogpile cache to store cached queries in Redis. However, caching comes with its own caveats:

Cache invalidation is very difficult across a distributed system, especially in microservices. But let’s say we ignore that, and are willing to use cached results with a small delay until the cache expires
When caching a function that accepts parameters, we’ll be caching each combination of function name plus parameters separately. These cached results (if not invalidated), will be out of sync until the cache expires for every combination. This creates an unpredictable situation where the same function will return sometimes up-to-date information, and other times stale information, depending on the values of its parameters and how it has been executed before.

Using the SQLAlchemy session as a "cache"

Here’s the scenario: You load some products into your SQLAlchemy session. Later, in another part of the code within the same session, you only have the id for the objects fetched earlier. In that case, a Product.query.get(id) will retrieve the object from the session without an extra SQL query. This use case works well with Flask-SQLAlchemy which has a scoped session for each request.

Hybrid attributes

Hybrid properties behave differently at the class and instance level. A simple use case to demonstrate the repeating the same conditional logic twice (once for querying, and second for getting the value in Python) is like this:

class User(db.Model):
    first_name = db.Column(db.String)
    second_name = db.Column(db.String)
 
    @hybrid_property
    def full_name(self):
        return self.first_name + " " + self.second_name

In this example User.first_name (class level, notice the upper case U) can be used in SQLAlchemy expressions directly, and user.first_name (instance level, notice the lower case u) can be used on instance level. The SQLAlchemy documentation goes into a lot of detail explaining how hybrid properties can be used.

Ordering List

Ordering lists are a convenient abstraction to treat a list of ordered relationship objects as a Python list. Reshuffling the list in Python will keep the order/position attribute on the relationship instances in sync with the position in the list. It's useful for any kind of, user defined ordered listing of objects, e.g: a list of to-do items.