[TIL] PostgreSQL, NOT IN and a subquery

Yesterday we hit pretty interesting issue with our PostgreSQL when all of the sudden queries started to take up an unacceptable long time. It was interesting, because we did not change anything and there was no peak in incoming data. Yet system started to be really irresponsive and we had to investigate what happened. So, here’s a story.

First of all, let me give you a sneak peek of how database looks like. We have user profiles in one table, containing such data as gender, age, name org phone number. In many other tables there we have information about users activity, such as logging into wifi, coupon usage etc. For the purpose of this example let’s say we have this:

| id | name    | gender |
|----|---------|--------|
| 1  | John    | m      |
| 2  | Oona    | f      |
| 3  | Eve     | f      |


| user_id | coupon_id | used_at          |
|---------|-----------|------------------|
| 1       | 1         | 2018-04-01 10:01 |
| 2       | 1         | 2018-03-03 12:13 |
| 2       | 2         | 2018-04-02 09:00 |
| 3       | 3         | 2017-12-24 15:30 |

Bonus: you have about 15 million profile records and maybe 40 million coupon usage events.

One on things we need to query is: “find me all women that used less than 2 coupons in last month”. What we did was what we thought to be a standard approach:

SELECT * FROM profiles
WHERE gender = 'w'
AND id NOT IN (
  SELECT user_id FROM coupons
  WHERE used_at >= now() - interval '1 month'
  GROUP BY user_id
  HAVING count(user_id) >= 2
)

Of course, our query was “a bit” more complicated, but you get the idea. This worked pretty well. Until it stopped.

After some manual investigation (EXPLAIN ANALYZE gave me nothing of value) I found out that when a subquery returns more than some number between 60k and 61k ids (youe mileage will probably vary) it changes from acceptable 1.5 second to more than 300 second to perform a query. That was not good.

Solution

After doing some reading I realized that PostgreSQL is not really good at having many values in NOT IN clause. However, what it is really good at are JOINs. All I need was to figure out how to use joins in our query. Here’s what I ended up with:

SELECT * FROM profiles
LEFT JOIN (
  SELECT user_id FROM coupons
  WHERE used_at >= now() - interval '1 month'
  GROUP BY user_id
  HAVING count(user_id) >= 2
) sub ON sub.user_id = profiles.id
WHERE gender = 'w'
AND sub.user_id IS NULL

Left joining a subquery turns out to be pretty fast. To have NOT IN we need to check if there was nothing to join (so it will be null). This way we have stable less-than-two-seconds query times for all kind of intervals, be it one month or two years.

Related posts