Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select ... not in ... returns wrong results for nulls #5954

Closed
jycor opened this issue May 16, 2023 · 6 comments
Closed

select ... not in ... returns wrong results for nulls #5954

jycor opened this issue May 16, 2023 · 6 comments
Assignees
Labels
analyzer bug Something isn't working correctness We don't return the same result as MySQL sql Issue with SQL

Comments

@jycor
Copy link
Contributor

jycor commented May 16, 2023

Setup:

create table t1 (i int);
create table t2 (j int);
insert into t1 values (1), (null);
insert into t2 values (2), (null);

Dolt:

tmp> select * from t1 where i not in (select j from t2);
+------+
| i    |
+------+
| 1    |
| NULL |
+------+
2 rows in set (0.00 sec)

MySQL:

mysql> select * from t1 where i not in (select j from t2);
Empty set (0.0006 sec)

Plan:

tmp> explain select * from t1 where i not in (select j from t2);
+-----------------------------+
| plan                        |
+-----------------------------+
| AntiJoin                    |
|  ├─ (t1.i = scalarSubq0.j)  |
|  ├─ Table                   |
|  │   └─ name: t1            |
|  └─ TableAlias(scalarSubq0) |
|      └─ Table               |
|          ├─ name: t2        |
|          └─ columns: [j]    |
+-----------------------------+
8 rows in set (0.00 sec)

It seems like we should either avoid using the optimization in this case or fix the join conditions.

@timsehn timsehn added bug Something isn't working sql Issue with SQL analyzer labels May 16, 2023
@nicktobey
Copy link
Contributor

nicktobey commented May 16, 2023

Context for people like me who were confused by this:

In MySQL, A NOT IN B is true if and only if A != b is true for all b in B.

The trouble is that a != NULL is always NULL, not true. so A NOT IN (NULL) is always false.

Reasoning behind this: If you think of NULL as representing a missing or unknown value, then it's impossible to demonstrate that A is not in a set if the set contains NULL.

@nicktobey nicktobey self-assigned this May 16, 2023
@nicktobey
Copy link
Contributor

nicktobey commented May 16, 2023

It seems like we should either avoid using the optimization in this case or fix the join conditions.

To elaborate on this, it seems like the two possible fixes would be either:

  1. Only performing the optimization into an antijoin when the column selected from the right table is NOT NULL. This would provide correct behavior but prevent optimization of nullable columns, even if they don't contain NULL values in practice, or
  2. Change the logic when iterating over the rows of an AntiJoin node.

The trouble with solution 2 is that the other common way to write an anti join in mySql is select i from t1 left join t2 on t1.i = t2.j where t2.j is null;, and in this case dolt's behavior is correct: these two queries are not the same in mySql, because they behave differently in the presence of nulls in the right-hand table. Unless we have some way to distinguish between these two types of queries, we can't optimize both of them into AntiJoins.

Ironically, it looks like this other minimum query actually gets built as a LeftJoin, not an AntiJoin, so either it's being optimized by dolthub/go-mysql-server#1773 back into a LeftJoin, or it was never converted into an AntiJoin in the first place:

> explain select i from t1 left join t2 on t1.i = t2.j where t2.j is null;
+------------------------------+
| plan                         |
+------------------------------+
| Project                      |
|  ├─ columns: [t1.i]          |
|  └─ Filter                   |
|      ├─ t2.j IS NULL         |
|      └─ LeftOuterJoin        |
|          ├─ (t1.i = t2.j)    |
|          ├─ Table            |
|          │   ├─ name: t1     |
|          │   └─ columns: [i] |
|          └─ Table            |
|              ├─ name: t2     |
|              └─ columns: [j] |
+------------------------------+

(Update: I didn't have dolthub/go-mysql-server#1773 pulled locally yet, so it seems like we're not converting this example to an AntiJoin.)

So if we wanted to convert both into AntiJoins we'd need to keep some state that tells the builder how to handle NULLs.

@nicktobey
Copy link
Contributor

Some additional regression tests. It's important to note that we have the incorrect behavior when either side of the joins contain NULL:

Test 2

Setup:

create table t1 (i int);
create table t2 (j int);
insert into t1 values (1), (null);
insert into t2 values (null);

Dolt:

tmp> select * from t1 where i not in (select j from t2);
+------+
| i    |
+------+
| 1    |
| NULL |
+------+
2 rows in set (0.00 sec)

MySQL:

mysql> select * from t1 where i not in (select j from t2);
i
1

Test 3

Setup:

create table t1 (i int);
create table t2 (j int);
insert into t1 values (1);
insert into t2 values (2);

Dolt:

tmp>select * from tbl where null not in (select * from tbl2);
+---+
| i |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

MySQL:

mysql> select * from t1 where i not in (select j from t2);
Empty output

Plan:

explain select * from tbl where null not in (select * from tbl2);
+-----------------------------+
| plan                        |
+-----------------------------+
| AntiJoin                    |
|  ├─ (NULL = scalarSubq0.j)  |
|  ├─ Table                   |
|  │   └─ name: tbl           |
|  └─ TableAlias(scalarSubq0) |
|      └─ Table               |
|          ├─ name: tbl2      |
|          └─ columns: [j]    |
+-----------------------------+

@nicktobey
Copy link
Contributor

It's simple enough to change the transformJoinApply optimization to not apply AntiJoin if the expression on either side of the not in is nullable.

Unfortunately, this breaks a ton of our regression tests around AntiJoin. I'd want to look at performance metrics before deciding whether this is an acceptable workaround.

Tim linked me to our performance benchmarking at https://docs.dolthub.com/sql-reference/benchmarks/latency. I'm going to figure out exactly how much our performance hurts if we're only able to generate AntiJoins for nonnull columns.

@nicktobey
Copy link
Contributor

dolthub/go-mysql-server#1791 fixes this in the vast majority of cases.

There's still one corner case where we don't have MySQL compatibility:

  • If the left side of the NOT IN expression contains a NULL, AND
  • The right side of the expression is not empty, AND
  • The right side of the expression does not contain a NULL, AND
  • The planner converts the original expression into an AntiJoin, AND
  • The planner converts the AntiJoin into a LeftOuterJoin, AND
  • The planner converts the LeftOuterJoin into a LeftHashOuterJoin

Then MySQL evaluates the expression as NULL and we don't.

I decided to let this sit in the back of my mind for a little bit to see if there was an elegant solution, but at a certain point I just had to acknowledge that this is still closer to MySQL than we were before, document how it still differs, and ship it.

@timsehn timsehn added the correctness We don't return the same result as MySQL label Jul 7, 2023
@max-hoffman
Copy link
Contributor

At a glance it is hard to tell if the exception case listed above is still relevant. We have fixed a lot of issues related to this, but all in response to specific queries with regressions. I might close this for now, if we have a specific case for repro we can always make a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analyzer bug Something isn't working correctness We don't return the same result as MySQL sql Issue with SQL
Projects
None yet
Development

No branches or pull requests

4 participants