Emulating intersection in MySQL with self-join

Emulating intersection in MySQL with self-join

As you may know, MySQL does not support INTERSECT operation. Hence, to get the intersection between two result sets in the same table, we need to use self-join. In this tutorial, I cover the emulating intersection in MySQL with self-join.

But what the heck is self-join?

Essentially, self-join is a simple join where both sides of the JOIN is the same table, here.

To understand it better, let’s go through a simple example. Assume we have a table called item_rating which holds ratings of users to items. Like below,

+---------+---------+------+-----+---------+----------------+
| Field   | Type    | Null | Key | Default | Extra          |
+---------+---------+------+-----+---------+----------------+
| id      | int(11) | NO   | PRI | NULL    | auto_increment |
| user_id | int(11) | YES  | MUL | NULL    |                |
| item_id | int(11) | YES  |     | NULL    |                |
| rating  | int(11) | YES  |     | NULL    |                |
+---------+---------+------+-----+---------+----------------+
CREATE TABLE item_rating (id INTEGER AUTO_INCREMENT PRIMARY KEY, user_id INTEGER, item_id INTEGER, rating INTEGER,
CONSTRAINT uc_user_id_item_id UNIQUE(user_id, item_id));

Description of columns:

  • user_id: Id of a user in the user table.
  • item_id: Id of an item in the item table.
  • rating: Rating that a user given to an item. 1 means thumbs up. 0 means thumbs down, and -1 means the user has not given any rating.

Let’s fill the table with some dummy data as follows,

INSERT INTO item_rating (user_id, item_id, rating) VALUES(1, 1, 1);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(1, 2, 1);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(1, 3, -1);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(1, 4, 0);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(1, 5, 1);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(2, 2, 0);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(2, 3, 1);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(2, 4, -1);
INSERT INTO item_rating (user_id, item_id, rating) VALUES(2, 5, 1);

And if we do a select query we get this results,

mysql> SELECT * FROM item_rating;
+----+---------+---------+--------+
| id | user_id | item_id | rating |
+----+---------+---------+--------+
|  1 |       1 |       1 |      1 |
|  2 |       1 |       2 |      1 |
|  4 |       1 |       3 |     -1 |
|  5 |       1 |       4 |      0 |
|  6 |       1 |       5 |      1 |
|  7 |       2 |       2 |      0 |
|  8 |       2 |       3 |      1 |
|  9 |       2 |       4 |     -1 |
| 10 |       2 |       5 |      1 |
+----+---------+---------+--------+
9 rows in set (0.00 sec)

Assume we want to find a set of common items that rated by two users which essentially is finding the intersection of ratings on items between two users. In other words, we want to find a list of items that are rated by both user user_id=1, user user_id=2.

To achieve that in Microsoft SQL Server, one can use INTERSECT. But MySQL does not support it. Hence, instead we should rely on JOIN and construct a query like this:

mysql> SELECT A.user_id AS A_id, B.user_id AS B_id, A.rating AS A_rating, B.rating AS B_rating, A.item_id 
FROM item_rating A, item_rating B WHERE A.user_id = 1 AND B.user_id = 2 AND A.item_id = B.item_id AND A.rating <> -1  
AND B.rating <> -1 ORDER BY A.item_id;
+------+------+----------+----------+---------+
| A_id | B_id | A_rating | B_rating | item_id |
+------+------+----------+----------+---------+
|    1 |    2 |        1 |        0 |       2 |
|    1 |    2 |        1 |        1 |       5 |
+------+------+----------+----------+---------+
2 rows in set (0.00 sec)

As you can see we join table item_rating to itself but under two different WHERE condition. And as a result, we can get the intersection between user 1 and user 2.

Important note: even though MS SQL Server supports intersection, try to avoid it at all cost because of its inefficiency. Why? The intersection operates in such a way that at first two queries execute separately and results will be pulled from DB and then the intersection between two result sets will be calculated by the database engine. This is not much of concern for small tables, but when having gigantic result set, it will be super slow.

More database tutorials: https://geekyhacker.com/database/

Inline/featured images credits