Skip to content

Galera cluster exits abruptly during reindex immediately after upgrade to Magento 2.3.3 #25334

Closed
@andrzejwp

Description

@andrzejwp

Preconditions (*)

  1. Magento 2.3.3
  2. Galera cluster (3 nodes), reproducible at least in 2 different versions

Steps to reproduce (*)

Identified problem

Because of this commit the reindex that happens immediately after upgrade to Magento 2.3.3 causes Galera clusters to exit abruptly. The problem is somewhere between Magento and Galera, so this report will be filed to both parties.

Problem description

Magento uses a number of temporary table during its reindex procedure:

mysql> SELECT TABLE_NAME,ENGINE FROM information_schema.TABLES WHERE  ENGINE='MEMORY';
+---------------------------------------------+--------+
| TABLE_NAME                                  | ENGINE |
+---------------------------------------------+--------+
| catalog_category_product_index_tmp          | MEMORY |
| catalog_product_index_eav_decimal_tmp       | MEMORY |
| catalog_product_index_eav_tmp               | MEMORY |
| catalog_product_index_price_bundle_opt_tmp  | MEMORY |
| catalog_product_index_price_bundle_sel_tmp  | MEMORY |
| catalog_product_index_price_bundle_tmp      | MEMORY |
| catalog_product_index_price_cfg_opt_agr_tmp | MEMORY |
| catalog_product_index_price_cfg_opt_tmp     | MEMORY |
| catalog_product_index_price_downlod_tmp     | MEMORY |
| catalog_product_index_price_final_tmp       | MEMORY |
| catalog_product_index_price_opt_agr_tmp     | MEMORY |
| catalog_product_index_price_opt_tmp         | MEMORY |
| catalog_product_index_price_tmp             | MEMORY |
| cataloginventory_stock_status_tmp           | MEMORY |
+---------------------------------------------+--------+
14 rows in set (0,00 sec)

prior to Magento 2.3.3 these tables were using the MEMORY engine.
Per Galera's design there are some known limitations, importantly - the replication is supported only for the InnoDB engine. That means, specifically the content of tables based on MEMORY engine are not replicated.

During Magento upgrade the index tables are altered and the engine is changed from MEMORY to InnoDB. At this point the content of these tables becomes out-of-sync between the nodes of the Galera cluster and one of these nodes may have data in the table, while the other nodes will not.

First node:

mysql> select * from catalog_product_index_price_bundle_tmp;
+-----------+-------------------+------------+--------------+------------+---------------+--------------+------------+------------+------------+------------+------------+-----------+
| entity_id | customer_group_id | website_id | tax_class_id | price_type | special_price | tier_percent | orig_price | price      | min_price  | max_price  | tier_price | base_tier |
+-----------+-------------------+------------+--------------+------------+---------------+--------------+------------+------------+------------+------------+------------+-----------+
|       337 |                 0 |          1 |            2 |          1 |          NULL |         NULL |  37.950000 |  37.950000 |  37.950000 |  37.950000 |       NULL |      NULL |
|       337 |                 1 |          1 |            2 |          1 |          NULL |         NULL |  37.950000 |  37.950000 |  37.950000 |  37.950000 |       NULL |      NULL |

// ... REDACTED

|       456 |                 4 |          1 |            2 |          1 |          NULL |         NULL |  20.990000 |  20.990000 |  20.990000 |  20.990000 |       NULL |      NULL |
|       456 |                 6 |          1 |            2 |          1 |          NULL |         NULL |  20.990000 |  20.990000 |  20.990000 |  20.990000 |       NULL |      NULL |
+-----------+-------------------+------------+--------------+------------+---------------+--------------+------------+------------+------------+------------+------------+-----------+
60 rows in set (0,00 sec)

other nodes:

mysql> select * from catalog_product_index_price_bundle_tmp;
Empty set (0,00 sec)

at the same time, the engine has been upgraded to InnoDB which means any further changes to these tables will be replicated over the cluster.

mysql> select TABLE_NAME,ENGINE from information_schema.TABLES where TABLE_SCHEMA='ploom_hdls_test' and TABLE_NAME like 'catalog_%_tmp';
+---------------------------------------------+--------+
| TABLE_NAME                                  | ENGINE |
+---------------------------------------------+--------+
| catalog_category_product_index_tmp          | InnoDB |
| catalog_product_index_eav_decimal_tmp       | InnoDB |
| catalog_product_index_eav_tmp               | InnoDB |
| catalog_product_index_price_bundle_opt_tmp  | InnoDB |
| catalog_product_index_price_bundle_sel_tmp  | InnoDB |
| catalog_product_index_price_bundle_tmp      | InnoDB |
| catalog_product_index_price_cfg_opt_agr_tmp | InnoDB |
| catalog_product_index_price_cfg_opt_tmp     | InnoDB |
| catalog_product_index_price_downlod_tmp     | InnoDB |
| catalog_product_index_price_final_tmp       | InnoDB |
| catalog_product_index_price_opt_agr_tmp     | InnoDB |
| catalog_product_index_price_opt_tmp         | InnoDB |
| catalog_product_index_price_tmp             | InnoDB |
| cataloginventory_stock_status_tmp           | InnoDB |
+---------------------------------------------+--------+
14 rows in set (0,01 sec)

In such a case, part of the standard reindex procedure, which sends a DELETE DML is propagated over the cluster and results in termination of all-but-one of the Galera nodes:

The command issued during reindex:

mysql> delete from catalog_product_index_price_bundle_tmp;
Query OK, 60 rows affected (0,01 sec)

Result on remaining Galera nodes:

2019-10-28 10:04:17 41267 [ERROR] Slave SQL: Could not execute Delete_rows event on table catalog_product_index_price_bundle_tmp; Can't find record in 'catalog_product_index_price_bundle_tmp', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 3577, Error_code: 1032
2019-10-28 10:04:17 41267 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 498863560
2019-10-28 10:04:17 41267 [ERROR] WSREP: Failed to apply trx: source: 9f112bd7-f7c1-11e9-a173-f2aa3d9f6709 version: 4 local: 0 state: APPLYING flags: 1 conn_id: 166495 trx_id: 8030458603 seqnos (l: 613988, g: 498863560, s: 498863559, d: 498863504, ts: 56153475565059237)
2019-10-28 10:04:17 41267 [ERROR] WSREP: Failed to apply trx 498863560 4 times
2019-10-28 10:04:17 41267 [ERROR] WSREP: Node consistency compromised, aborting...
2019-10-28 10:04:17 41267 [Note] WSREP: Closing send monitor...
2019-10-28 10:04:17 41267 [Note] WSREP: Closed send monitor.
2019-10-28 10:04:17 41267 [Note] WSREP: gcomm: terminating thread
2019-10-28 10:04:17 41267 [Note] WSREP: gcomm: joining thread
2019-10-28 10:04:17 41267 [Note] WSREP: gcomm: closing backend
2019-10-28 10:04:18 41267 [Note] WSREP: view(view_id(NON_PRIM,67dbaea5,16) memb {
	cb60a6a6,0
} joined {
} left {
} partitioned {
	67dbaea5,0
	9f112bd7,0
})

Summary

The problem is partly inside Galera (at least Server version: 5.6.45 MySQL Community Server (GPL), wsrep_25.27), because it allows to end-up in a situation like this and partly inside Magento upgrade, because it does not empty the temp tables, which would prevent this from happening as well.

Percona XDB is already preventing such thing from happening:

mysql> INSERT INTO fail_test values (1);
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (fail_test.fail_test) that resides in non-transactional storage engine with pxc_strict_mode = ENFORCING or MASTER
mysql>

tested in version Server version: 5.7.27-30-57 Percona XtraDB Cluster (GPL), Release rel30, Revision 64987d4, WSREP version 31.39, wsrep_31.39

Minimal reproducible case

  1. Create MEMORY based table inside cluster
  2. Insert data into the table
  3. Convert table to InnoDB
  4. Delete data in the table
CREATE TABLE `mem_based_table` (field1 int) Engine='MEMORY';
INSERT INTO `mem_based_table` values (1);
ALTER TABLE `mem_based_table` Engine='InnoDB'
DELETE FROM `mem_based_table`;

This has been successfuly replicated on:

  • MySQL 5.6.45 wsrep_25.27
  • MariaDB 10.1.23-MariaDB-1~jessie

For example with the following environment

node1:
    image: hauptmedia/mariadb:10.1
    hostname: node1
    ports:
      - 13306:3306
    environment:
      - MYSQL_ROOT_PASSWORD=test
      - REPLICATION_PASSWORD=test
      - MYSQL_DATABASE=maria
      - MYSQL_USER=maria
      - MYSQL_PASSWORD=test
      - GALERA=On
      - NODE_NAME=node1
      - CLUSTER_NAME=maria_cluster
      - CLUSTER_ADDRESS=gcomm://
    command: --wsrep-new-cluster

node2:
    image: hauptmedia/mariadb:10.1
    hostname: node2
    links:
      - node1
    ports:
      - 23306:3306
    environment:
      - REPLICATION_PASSWORD=test
      - GALERA=On
      - NODE_NAME=node2
      - CLUSTER_NAME=maria_cluster
      - CLUSTER_ADDRESS=gcomm://node1
      
node3:
    image: hauptmedia/mariadb:10.1
    hostname: node3
    links:
      - node1
    ports:
      - 33306:3306
    environment:
      - REPLICATION_PASSWORD=test
      - GALERA=On
      - NODE_NAME=node3
      - CLUSTER_NAME=maria_cluster
      - CLUSTER_ADDRESS=gcomm://node1

after executing the set of commands mentioned above, we get:

$ docker-compose ps
         Name                       Command                State                             Ports                        
--------------------------------------------------------------------------------------------------------------------------
galera-cluster_node1_1   /entrypoint.sh --wsrep-new ...   Up         0.0.0.0:13306->3306/tcp, 4444/tcp, 4567/tcp, 4568/tcp
galera-cluster_node2_1   /entrypoint.sh mysqld            Exit 139                                                        
galera-cluster_node3_1   /entrypoint.sh mysqld            Exit 139     

Possible workaround

Empty the temporary tables before the upgrade to Magento 2.3.3, while still using the MEMORY engine.

Expected result (*)

  1. Successful reindex operation

Actual result (*)

  1. All but one node of the galera cluster fail.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions