Synchronous Log Shipping Replication

      Takahiro Itagaki and Masao Fujii
     NTT Open Source Software Center

          PGCon 2008
Agenda

• Introduction: What is this?
        – Background
        – Compare with other replication solutions
• Details: How it works
        – Struggles in development
• Demo
• Future work: Where are we going?
• Conclusion



Copyright © 2008 NTT, Inc. All Rights Reserved.      2
What is this?
• Successor to warm-standby servers
        – Replication system using WAL shipping
                  • uses the Point-in-Time Recovery mechanism
        – However, no data loss after failover, thanks to
          synchronous log shipping

          [Diagram: Active Server ships WAL to Standby Server]

• Based on PostgreSQL 8.2, with a patch and
  several scripts
        – Patch: adds two processes to postgres
        – Scripts: management commands
Warm-Standby Servers (v8.2~)
      Active Server (ACT)                                      Standby Server (SBY)

  1. Commit
  2. Flush WAL to disk
  3. (Return)
  4. archive_command ships the finished WAL segment
     – segments are sent only after commits
  5. The standby redoes the shipped segments

  The last segment is not available on the standby server
  if the active crashes before archiving it. At failover we
  must wait to remount the active's storage on the standby
  server, or wait for the active to reboot.
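The data-loss window above can be sketched as a toy model (illustrative names, a segment shrunk to 4 records instead of 16 MB, not the deck's actual scripts): commits reach the standby only when archive_command ships a full segment, so the partial segment is lost on a crash.

```python
# Toy model of warm standby: WAL ships per segment via archive_command,
# so commits in the current, unarchived segment are lost at failover.
SEGMENT_SIZE = 4  # records per WAL segment (real segments are 16MB)

class WarmStandby:
    def __init__(self):
        self.active_wal = []   # records flushed on the active
        self.shipped = []      # segments archived to the standby

    def commit(self, record):
        self.active_wal.append(record)      # steps 1-3: flush locally, return
        if len(self.active_wal) % SEGMENT_SIZE == 0:
            seg = self.active_wal[-SEGMENT_SIZE:]
            self.shipped.append(seg)        # step 4: ship the full segment

    def failover(self):
        # The standby can only redo shipped segments; the rest is lost.
        return [r for seg in self.shipped for r in seg]

ws = WarmStandby()
for i in range(6):
    ws.commit(i)
survived = ws.failover()   # records 4 and 5 were never archived
```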
Synchronous Log Shipping Servers
      Active Server (ACT)                                      Standby Server (SBY)

  1. Commit
  2. Flush WAL to disk
  3. Send WAL records – WAL entries are sent record by record,
     before the commit returns
  4. (Return)

  Segments are formed from the received records on the
  standby server.

  At failover, we can start the standby server after redoing
  the remaining segments; it has already received all
  transaction logs.
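The synchronous variant, in the same toy style as before (a sketch with illustrative names, not the actual patch): the commit returns only after the standby holds the record, so failover loses nothing.

```python
class SyncLogShipping:
    """Toy model of the synchronous commit path: flush locally, send the
    WAL record to the standby, return only after the standby has it."""
    def __init__(self):
        self.active_wal = []
        self.standby_wal = []

    def commit(self, record):
        self.active_wal.append(record)    # step 2: flush WAL to disk
        self.standby_wal.append(record)   # step 3: send record, wait for ack
        return "committed"                # step 4: return to the client

    def failover(self):
        return list(self.standby_wal)     # every acked commit is replayable

s = SyncLogShipping()
for i in range(6):
    s.commit(i)
survivors = s.failover()
```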
Background: Why new solution?
• We have many migration projects from Oracle
  and compete with it using postgres.
        – So, we want postgres to be SUPERIOR TO ORACLE!
• Our activity in PostgreSQL 8.3
        – Performance stability
                  • Smoothed checkpoint
        – Usability: easy-to-tune server parameters
                  • Multiple autovacuum workers
                  • JIT bgwriter – automatic tuning of bgwriter

• Where is the alternative to RAC?
        – Oracle Real Application Clusters
Background: Alternatives to RAC
• Oracle RAC is a multi-purpose solution
        – … but we don’t need all of its properties.
• In our use:
        –     No downtime                         <-   Very Important
        –     No data loss                        <-   Very Important
        –     Automatic failover                  <-   Important
        –     Update performance                  <-   Important
        –     Inexpensive hardware                <-   Important
        –     Performance scalability             <-   Not important
• Goal
        – Minimize system downtime
        – Minimize the performance impact on update-heavy workloads
Compare with other replication solutions

                 No data  No SQL                     Performance  Update       How to
                 loss     restriction  Failover      scalability  performance  copy?
  Log Shipping   OK       OK           Auto, Fast    No           Good         Log
  warm-standby   NG       OK           Manual        No           Async        Log
  Slony-I        NG       OK           Manual        Good         Async        Trigger
  pgpool-II      OK       NG           Auto, hard    Good         Medium       SQL
                                       to re-attach
  Shared Disks   OK       OK           Auto, Slow    No           Good         Disk

• Log shipping is excellent except for performance scalability.
• Also, re-attaching a repaired server is simple.
       – Just the same as the normal hot-backup procedure
                 • Copy the active server’s data to the standby and wait for WAL replay.
       – No service stop while re-attaching

Compare downtime with shared disks
 • Cold standby with shared disks is an alternative solution,
    – but it takes a long time to fail over under a heavily-updated load.
    – Log shipping saves the time for mounting disks and recovery.

 Shared disk system:
   Crash! -> 20 sec to umount and remount the shared disks
          -> 60 ~ 180 sec (*) to recover from the last checkpoint

 Log-shipping system:
   Crash! -> 10 sec to detect server down
          -> 5 sec to recover the last segment
          -> OK, the service is restarted!

 (*) Measured in PostgreSQL 8.2. 8.3 would take less time
     because of less I/O during recovery.
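The slide's timings sum up as follows (only the steps shown on the slide are modeled; the slide lists no detection time for the shared-disk case, so none is assumed here):

```python
# Downtime arithmetic straight from the slide's numbers (seconds).
def shared_disk_downtime(recover_s):
    umount_remount = 20                 # remount shared disks on the standby
    return umount_remount + recover_s   # recover_s: 60-180s from last checkpoint

def log_shipping_downtime():
    detect, redo_last_segment = 10, 5
    return detect + redo_last_segment

best, worst = shared_disk_downtime(60), shared_disk_downtime(180)
total = log_shipping_downtime()
```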
Advantages and Disadvantages
• Advantages
        – Synchronous
                  • No data loss on failover
        – Log-based (physically identical structure)
                  • No functional restrictions on SQL
                  • Simple, robust, and easy to set up
        – Shared-nothing
                  • No single point of failure
                  • No need for expensive shared disks
        – Automatic fast failover (within 15 seconds)
                  • “Automatic” is essential to avoid waiting for human operations
        – Low impact on update performance (less than 7%)
• Disadvantages
        – No performance scalability (for now)
        – Physical replication: cannot be used for upgrades.

Where is it used?
• Interactive teleconference management package
        – Commercial service in operation
        – Manages conference booking and file transfer
        – Log shipping is an optional module for users requiring
          high availability

          [Diagram: replicated servers connected to Internet
           networks via the Communicator package]
How it works
System overview
•    Based on PostgreSQL 8.2; 8.3 port under way
•    WALSender
        –     New child process of the postmaster
        –     Reads WAL from wal_buffers and sends it to WALReceiver
•    WALReceiver
        –     New daemon that receives WAL
        –     Writes WAL to disk and communicates with the startup process
•    Using Heartbeat 2.1
        –     Open source high-availability software; manages the resources via a resource agent (RA)
        –     Heartbeat provides a virtual IP (VIP)

     [Diagram: two nodes, active and standby. On the active node,
      postgres runs in normal mode with the new child process
      WALSender, which reads wal_buffers and ships WAL to the standby.
      On the standby node, postgres runs in continuous recovery mode
      with the new daemon WALReceiver feeding the startup process.
      Heartbeat with its RA runs on both nodes and holds the VIP.]
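The two added processes can be sketched as a pair of threads with a queue standing in for the TCP connection (names are illustrative; the real processes share wal_buffers and a socket):

```python
import queue
import threading

link = queue.Queue()   # stands in for the WALSender -> WALReceiver connection
standby_wal = []

def wal_receiver():
    """Standby side: receive each record, flush it, ack via task_done()."""
    while True:
        rec = link.get()
        if rec is None:          # shutdown marker
            break
        standby_wal.append(rec)  # flush to the standby's WAL disk
        link.task_done()

rx = threading.Thread(target=wal_receiver)
rx.start()

def wal_sender(wal_buffers):
    """Active side: read records from wal_buffers and ship them."""
    for rec in wal_buffers:
        link.put(rec)
    link.join()                  # wait until the receiver flushed everything
    link.put(None)

wal_sender(["rec1", "rec2", "rec3"])
rx.join()
```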
WALSender
        Active

        postgres            walbuffers            WALSender

  1. Update  -> XLogInsert() inserts WAL records into walbuffers
  2. Commit  -> XLogWrite() flushes WAL to disk
  3. (Changed) We changed XLogWrite() to also request WALSender
     to transfer the WAL
  4. WALSender reads the records from walbuffers and sends them
     to the standby (Send / Recv)
  5. After the transfer finishes, the commit returns
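The patched commit path above, as a sketch (class and method names mirror the slide but are illustrative, not the actual C code): XLogInsert puts the record in wal_buffers, and the changed XLogWrite flushes it and asks WALSender to ship it before the commit returns.

```python
class ActiveNode:
    """Sketch of the patched active-side commit path."""
    def __init__(self, standby_disk):
        self.wal_buffers = []
        self.wal_disk = []
        self.standby_disk = standby_disk

    def xlog_insert(self, record):
        """Called by UPDATE etc.: insert WAL into wal_buffers."""
        self.wal_buffers.append(record)

    def xlog_write(self):
        """Called at COMMIT: flush WAL, then (the change) request transfer."""
        self.wal_disk.extend(self.wal_buffers)
        acked = self.walsender_send(self.wal_buffers)  # new: ship before return
        self.wal_buffers.clear()
        return acked                                   # commit returns after ack

    def walsender_send(self, records):
        self.standby_disk.extend(records)  # WALReceiver flushes and acks
        return True

standby_disk = []
node = ActiveNode(standby_disk)
node.xlog_insert("update t1")
committed = node.xlog_write()
```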
WALReceiver
       Standby

       WALReceiver            WAL Disk            startup

  1. WALReceiver receives WAL from WALSender (Recv / Send)
     and flushes it to disk
  2. WALReceiver informs the startup process of the latest LSN
  3. (Changed) The startup process reads WAL up to the latest
     LSN and replays it. We changed ReadRecord() so that the
     startup process communicates with WALReceiver and replays
     each WAL record as it arrives.
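The standby side, in the same sketch style (illustrative names; the real coordination happens between the WALReceiver daemon and the startup process's modified ReadRecord()):

```python
class StandbyNode:
    """Sketch of the standby side: WALReceiver flushes each record and
    advertises the newest LSN; the startup process replays up to it."""
    def __init__(self):
        self.wal_disk = []    # (lsn, record) pairs flushed by WALReceiver
        self.latest_lsn = 0   # highest LSN that is safe to replay
        self.replayed = []

    def receive(self, lsn, record):
        self.wal_disk.append((lsn, record))  # flush to the WAL disk
        self.latest_lsn = lsn                # inform the startup process

    def replay_available(self):
        """Startup-process loop: replay every record up to latest_lsn."""
        for lsn, record in self.wal_disk:
            if lsn <= self.latest_lsn and (lsn, record) not in self.replayed:
                self.replayed.append((lsn, record))

sby = StandbyNode()
sby.receive(1, "insert")
sby.receive(2, "commit")
sby.replay_available()
```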
Why replay by each WAL record?
• Minimize downtime
        – Our replicator replays record by record, so the standby only
          has to replay a few records at failover; warm-standby replays
          segment by segment, so it may have to replay most of the
          latest segment.
• Shorter delay in read-only queries (at the standby)

                                       Our replicator   Warm-Standby
  Replay by each                       WAL record       WAL segment
  To be replayed at failover           a few records    the latest segment
  Delay in read-only queries           shorter          longer

  [Diagram: WAL blocks in segment1 and segment2. Our replicator can
   replay everything received so far; in this example, warm-standby
   would have to replay most of segment2 at failover.]
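The failover backlog difference can be put in numbers with a toy segment size (real segments are 16 MB; the function below is a simplification that treats the backlog as whatever has not yet reached a replay-unit boundary):

```python
SEGMENT = 2048  # records per segment in this toy model

def failover_backlog(total_records, replay_unit):
    """Records still unreplayed at failover when the standby replays
    only at replay_unit boundaries (1 = per record, SEGMENT = per segment)."""
    return total_records % replay_unit

per_record = failover_backlog(5000, 1)         # record-level: nothing pending
per_segment = failover_backlog(5000, SEGMENT)  # most of the last segment
```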
Heartbeat and resource agent
• Heartbeat needs a resource agent (RA) to manage
  PostgreSQL (with WALSender) and WALReceiver as a
  resource
• The RA is an executable providing the following features:

        Feature   Description
        start     start the resources as standby
        promote   change the status from standby to active
        demote    change the status from active to standby
        stop      stop the resources
        monitor   check whether the resource is running normally

  [Diagram: Heartbeat invokes the RA (start / stop / promote /
   demote / monitor), which controls the resources: PostgreSQL
   (WALSender) and WALReceiver.]


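An RA is just an executable dispatching on an action name, which can be sketched as below (a hypothetical skeleton; the deck's actual scripts are not shown, and a real RA would start/stop the processes and report health through its exit code):

```python
import sys

def handle(action):
    """Dispatch a Heartbeat RA action; returns (exit_code, message)."""
    actions = {
        "start":   "started as standby",   # start PostgreSQL + WALReceiver
        "promote": "promoted to active",   # standby -> active
        "demote":  "demoted to standby",   # active -> standby
        "stop":    "stopped",              # stop the resources
        "monitor": "running",              # health check
    }
    if action not in actions:
        return (1, "unknown action")
    return (0, actions[action])

if __name__ == "__main__":
    # A real RA would exit with rc so Heartbeat can read the result.
    rc, msg = handle(sys.argv[1] if len(sys.argv) > 1 else "monitor")
    print(msg)
```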
Failover
• Failover occurs when heartbeat detects that the active
  node is not running normally
• After failover, clients can restart transactions only by
  reconnecting to virtual IP provided by Heartbeat
  [Diagram: Heartbeat on the standby detects that the active node is
  down and sends a request to the startup process.]

     At failover, Heartbeat requests the startup
     process to finish WAL replay.

     We changed ReadRecord() to deal with
     this request.



Copyright © 2008 NTT, Inc. All Rights Reserved.                                            36
Failover
• Failover occurs when Heartbeat detects that the active
  node is not running normally
• After failover, clients can restart their transactions simply by
  reconnecting to the virtual IP provided by Heartbeat
  [Diagram: same sequence as the previous slide. After finishing WAL
  replay, the standby becomes active.]




Copyright © 2008 NTT, Inc. All Rights Reserved.                                            37
Struggles in development
Downtime caused by the standby going down
• The active going down triggers a failover and causes downtime
• Additionally, the standby going down might also cause downtime
      – WALSender waits for a response from the standby after sending WAL
      – So, when the standby goes down, WALSender stays blocked until it
        detects the failure
      – i.e. WALSender keeps waiting for a response that never comes

• How to detect the failure
      – A timeout notification is needed
      – TCP keepalive, but it occasionally doesn't work on Linux (a Linux bug!?)
      – Our own timeout mechanism
  [Diagram: postgres commits and requests WALSender to send WAL; the
  standby goes down before responding; WALSender waits for a response
  that never comes, so it stays blocked and the commit never returns.]

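The blocking problem comes down to an unbounded wait on the socket, and the "original timeout" bounds that wait. A rough sketch of the idea in Python (the real WALSender is C code inside the backend; the single-byte ack, function name, and timeout value here are our own illustration):

```python
import socket

def send_wal_with_timeout(sock, wal_bytes, timeout=5.0):
    """Send WAL and wait for the standby's ack, but never block forever.

    If the standby dies silently, recv() raises socket.timeout and the
    caller can detach the standby instead of blocking commits.
    """
    sock.settimeout(timeout)        # bound every blocking call below
    try:
        sock.sendall(wal_bytes)
        ack = sock.recv(1)          # wait for the standby's response
        return ack == b"A"          # hypothetical one-byte ack protocol
    except socket.timeout:
        return False                # standby presumed down
```

With this shape, a commit waits at most `timeout` seconds for a dead standby, instead of hanging until TCP keepalive (eventually, if ever) fires.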
Copyright © 2008 NTT, Inc. All Rights Reserved.                                                            39
Downtime caused by clients
• Even if the database finishes a failover immediately,
  downtime might still be long for client-side reasons
      – Clients wait for a response from the database
      – So, when a failover occurs, unless clients detect it, they
        can't reconnect to the new active and restart their transactions
      – i.e. clients keep waiting for a response that never comes

• How to detect the failover
      – A timeout notification is needed
      – Keepalive
              • Our setKeepAlive patch was accepted into JDBC 8.4dev
      – Socket timeout
      – Query timeout

  [Diagram: a client blocked on the failed active must time out before
  it can reconnect to the new active. We want to implement these
  timeouts!!]


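On the client side the same idea applies at connection setup. A sketch of enabling TCP keepalive plus a socket timeout on a raw client socket (the knob values are illustrative; a real client would set the equivalent driver options, e.g. setKeepAlive in JDBC):

```python
import socket

def make_db_socket(keepalive_idle=10, socket_timeout=30.0):
    """Create a client socket that can notice a dead server.

    - SO_KEEPALIVE makes the kernel probe an idle connection.
    - settimeout() bounds every blocking read, acting as a crude
      query timeout.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knob; guarded because it doesn't exist everywhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepalive_idle)
    s.settimeout(socket_timeout)
    return s
```

Either mechanism alone can miss cases (keepalive only probes idle connections; a per-read timeout fires on slow queries too), which is why the slide lists several detection methods side by side.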
Copyright © 2008 NTT, Inc. All Rights Reserved.                                                   40
Split-brain
• High-availability clusters must be able to handle split-
  brain
• Split-brain causes data inconsistency
      – Both nodes become active and provide the virtual IP
      – So, clients might update each node inconsistently

• Our replicator also risks split-brain unless the standby
  can distinguish a network failure from the active going down

  [Diagram: left, the network between a still-running active and the
  standby fails; right, the active node itself goes down.]

  If a network failure is mis-detected as the active going down, the
  standby becomes active even though the other active is still running
  normally. This is the split-brain scenario.

  If the active going down is mis-detected as a network failure, a
  failover doesn't start even though the active really is down. This
  scenario is also a problem, though split-brain doesn't occur.
Copyright © 2008 NTT, Inc. All Rights Reserved.                                                    41
Split-brain
• How to distinguish
      –       By combining the following solutions

1. Redundant network between the two nodes
      –       The standby can distinguish the two cases unless all networks fail

2. STONITH (Shoot The Other Node In The Head)
      –       Heartbeat's default solution for avoiding split-brain
      –       STONITH always forcibly turns off the old active when activating
              the standby
      –       Split-brain doesn't occur because there is always exactly one
              active node

  [Diagram: when activating the standby, STONITH forcibly turns off the
  other node.]
Copyright © 2008 NTT, Inc. All Rights Reserved.                                42
What delays the activation of the standby
• In order to activate the standby immediately, recovery
  time at failover must be short!!

• In 8.2, recovery is very slow
      – A lot of WAL needing replay at failover might have accumulated
      – Another problem: a disk-full failure might happen

• In 8.3, recovery is fast☺
      – Because unnecessary reads are avoided
      – But two problems still remain




Copyright © 2008 NTT, Inc. All Rights Reserved.                     43
What delays the activation of the standby
1. Checkpoints during recovery
      –         A checkpoint took 1 min or more (in the worst case) and
                occupied 21% of recovery time
      –         What is worse, WAL replay is blocked during a checkpoint
              •       Because the startup process alone performs both
                      checkpoints and WAL replay
      -> Checkpoints delay recovery...

•        [Just an idea] Run bgwriter during recovery
      –         Leave checkpoints to bgwriter, and let the startup process
                concentrate on WAL replay




Copyright © 2008 NTT, Inc. All Rights Reserved.                                           44
What delays the activation of the standby
2. Checkpoint at the end of recovery
      – Activation of the standby is blocked during this checkpoint
      -> Downtime might take 1 min or more...

•        [Just an idea] Skip the checkpoint at the end of recovery
      –         But, does postgres work correctly if it fails before at least
                one checkpoint has completed after recovery?
      –         We have to reconsider why a checkpoint is needed at the end of
                recovery

!!! Of course, because recovery is a critical part of a DBMS,
     more careful investigation is needed to realize these
     ideas



Copyright © 2008 NTT, Inc. All Rights Reserved.                                    45
How we choose the node with the later LSN

• When starting both nodes, we should synchronize
  from the node with the later LSN to the other
      – But, it's unreliable to depend on server logs (e.g. the Heartbeat
        log) or human memory to choose that node

• We choose the node from the WAL itself, which is most reliable
      – Find the latest LSN in each node's WAL files using our own tool
        (similar to xlogdump) and compare them

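Once each node's latest LSN has been extracted with the xlogdump-like tool, the comparison itself is simple; a sketch in Python (the `logid/offset` hex text format and the parsing are our assumptions about such a tool's output):

```python
def parse_lsn(text):
    """Parse an LSN printed as 'logid/offset' in hex, e.g. '0/12AB0458'."""
    logid, offset = text.split("/")
    return (int(logid, 16), int(offset, 16))

def later_node(lsn_node0, lsn_node1):
    """Return which node holds the later WAL position.

    Python tuples compare lexicographically, which matches the
    (logid, offset) ordering of WAL positions.
    """
    a, b = parse_lsn(lsn_node0), parse_lsn(lsn_node1)
    if a == b:
        return "equal"
    return "node0" if a > b else "node1"
```

The node reported here becomes the synchronization source; the other node is overwritten from its backup.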



Copyright © 2008 NTT, Inc. All Rights Reserved.                                46
Bottleneck
• Bad performance just after failover
      – No free space map (FSM)
      – Few commit hint bits are set in heap tuples
      – Few dead hint bits are set in indexes




Copyright © 2008 NTT, Inc. All Rights Reserved.      47
Demo
Demo
Environment
• 2 nodes (Node0, Node1), 1 client

How to watch
• There are two kinds of terminals
• The terminal at the top of the screen displays the cluster status
      – the node shown with 3 lines is active
      – the node shown with 1 line is standby
      – a node with no lines has not started yet
• The other terminal is for operations on
      – Client
      – Node0
      – Node1

Copyright © 2008 NTT, Inc. All Rights Reserved.                               49
Demo
Operation
1. Start only node0, as the active

2. createdb and pgbench -i (from the client)

3. Take an online backup

4. Copy the backup from node0 to node1

5. pgbench -c2 -t2000

6. Start node1 as the standby during pgbench ->
   synchronization starts

7. killall -9 postgres (on the active node0) -> a failover occurs

Copyright © 2008 NTT, Inc. All Rights Reserved.               50
Future work
- Where are we going? -
Where are we going?
• We’re thinking of making it Open Source Software.
        – To be a multi-purpose replication framework
        – Collaborators welcome.
• TODO items
        – For 8.4 development
                  • Re-implement WAL-Sender and WAL-Receiver as extensions
                    using two new hooks
                  • Xlogdump to be an official contrib module
        – For performance
                  • Improve checkpointing during recovery
                  • Handling un-logged operations
        – For usability
                   • Improve detection of server failure in client libraries
                   • Automatic retry of aborted transactions in client libraries


Copyright © 2008 NTT, Inc. All Rights Reserved.                                  52
For 8.4 : WAL-writing Hook
• Purpose
        – Make WALSender one of several general extensions
                  • WALSender sends WAL records before commits
• Proposal
        – Introduce a “WAL-subscriber model”
        – A “WAL-writing Hook” enables extensions to replace or filter
          WAL records just before they are written to disk
• Other extensions using this hook
        – A “software RAID” WAL writer for redundancy
                  • Writes WAL into two files for durability (it might be
                    paranoia…)
        – A filter that builds a bitmap for partial backup
                  • Writes changed pages into on-disk bitmaps
        – …

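The "WAL-subscriber model" can be pictured as a chain of callbacks sitting in front of the ordinary durable write. A toy sketch of that idea (the hook names and signatures are ours, not the proposed C API):

```python
# Toy model of a WAL-writing hook chain: each subscriber sees the WAL
# bytes just before they would hit disk, and may observe or replace them.

class WalWriter:
    def __init__(self):
        self.hooks = []          # called in registration order
        self.disk = bytearray()  # stands in for the WAL file

    def register(self, hook):
        self.hooks.append(hook)

    def write(self, record):
        for hook in self.hooks:
            record = hook(record)   # a hook may filter/replace the record
        self.disk += record         # the ordinary durable write

# Example subscribers in the spirit of the slide:
sent = []                         # a WALSender-like hook: ship each record
def wal_sender_hook(rec):
    sent.append(bytes(rec))       # (would send over the network)
    return rec

mirror = bytearray()              # "software RAID": keep a second copy
def mirror_hook(rec):
    mirror.extend(rec)
    return rec
```

Because every subscriber sits on the same write path, WAL shipping, mirroring, and partial-backup bitmaps all become plug-ins rather than core patches, which is the point of the proposal.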
Copyright © 2008 NTT, Inc. All Rights Reserved.                               53
For 8.4 : WAL-reading Hook
• Purpose
        – Make WALReceiver one of several general extensions
                  • WALReceiver redoes WAL record by record, not segment by
                    segment
• Proposal
        – A “WAL-reading Hook” enables extensions to filter WAL records
          while they are read during recovery
• Other extensions using this hook
        – A read-ahead WAL reader
                  • Reads a whole segment at once and pre-fetches required
                    pages that are not full-page-writes and not in shared
                    buffers
        – …



Copyright © 2008 NTT, Inc. All Rights Reserved.                                54
Future work : Multiple Configurations
 • Supports several synchronization modes
         – One configuration does not fit all,
           but one framework could fit many uses!

                                                Before/After Commit in ACT
          No.  Configuration                 Send     Flush    Flush    Redo
                                             to SBY   in ACT   in SBY   in SBY
           1   Speed                         After    After    After    After
           2   Speed + Durability            After    Before   After    After
           3   HA + Speed                    Before   After    After    After
   Now ->  4   HA + Durability               Before   Before   After    After
           5   HA + More durability          Before   Before   Before   After
           6   Synchronous Reads in SBY      Before   Before   Before   Before



  Copyright © 2008 NTT, Inc. All Rights Reserved.                                      55
Future work : Horizontal scalability
• Horizontal scalability is not our primary goal, but
  matters for potential users.
• The Postgres TODO item “Allow a warm standby system to
  also allow read-only statements” helps us.
• NOTE: We need to support 3 or more servers
  if we need both scalability and availability.

           2 servers:   2 × 50%   ->   1 × 100%

           3 servers:   3 × 66%   ->   2 × 100%


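The arithmetic behind that note: if the surviving servers must absorb the whole cluster's load after one failure, each of N servers can safely run at only (N-1)/N utilization, so two servers buy no capacity over one. A worked check:

```python
def safe_utilization(n_servers):
    """Max per-server load such that the surviving n-1 servers can
    absorb the whole cluster's work after one server fails."""
    assert n_servers >= 2
    return (n_servers - 1) / n_servers

def usable_capacity(n_servers):
    """Total usable capacity, in units of one fully-loaded server."""
    return n_servers * safe_utilization(n_servers)
```

With 2 servers each runs at 50%, giving one server's worth of capacity; with 3 servers each runs at ~66%, giving two servers' worth, matching the figures on the slide.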
Copyright © 2008 NTT, Inc. All Rights Reserved.                        56
Conclusion
• Synchronous log-shipping is the best for HA.
        – A direction for future warm-standby
        – Less downtime, no data loss, and automatic failover.
• There remains room for improvement.
        – Minimizing downtime, and performance scalability.
        – Improvements to recovery also help log-shipping.


• We’ve shown the requirements, advantages, and
  remaining tasks.
        – It has potential for improvement, but requires some
          work to become a more useful solution
        – We’ll make it open source! Collaborators welcome!
Copyright © 2008 NTT, Inc. All Rights Reserved.                  57
Fin.



Contact
  itagaki.takahiro@oss.ntt.co.jp
  fujii.masao@oss.ntt.co.jp

More Related Content

PDF
Demonstrating vMotion capabilities with Oracle RAC on VMware vSphere
PPT
Design and implementation of a reliable and cost-effective cloud computing in...
PPTX
Deploying Maximum HA Architecture With PostgreSQL
PDF
Large customers want postgresql too !!
PDF
What's LUM Got To Do with It: Deployment Considerations for Linux User Manage...
KEY
Oracle ASM 11g - The Evolution
PDF
Migrating Novell GroupWise to Linux
PDF
Presentation implementing oracle asm successfully
Demonstrating vMotion capabilities with Oracle RAC on VMware vSphere
Design and implementation of a reliable and cost-effective cloud computing in...
Deploying Maximum HA Architecture With PostgreSQL
Large customers want postgresql too !!
What's LUM Got To Do with It: Deployment Considerations for Linux User Manage...
Oracle ASM 11g - The Evolution
Migrating Novell GroupWise to Linux
Presentation implementing oracle asm successfully

What's hot (20)

PDF
VMware Performance for Gurus - A Tutorial
PPTX
Virtualizacao de Servidores - Windows
PDF
Oracle 11g R2 RAC setup on rhel 5.0
PDF
The Magic of Hot Streaming Replication, Bruce Momjian
PDF
Xen community update
PDF
KVM Tuning @ eBay
PDF
XS Japan 2008 Oracle VM English
PDF
Docking postgres
PDF
Comandos
PDF
Amazon EC2 in der Praxis
PDF
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
PDF
Xen ATG case study
PDF
PVOps Update
PDF
XS Boston 2008 VT-D PCI
PDF
XS Japan 2008 Citrix English
PDF
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
PPTX
High Availability Solutions in SQL 2012
PDF
090507.New Replication Features(2)
PDF
Atril-Déjà Vu Tea mserver 2 general presentation
PDF
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
VMware Performance for Gurus - A Tutorial
Virtualizacao de Servidores - Windows
Oracle 11g R2 RAC setup on rhel 5.0
The Magic of Hot Streaming Replication, Bruce Momjian
Xen community update
KVM Tuning @ eBay
XS Japan 2008 Oracle VM English
Docking postgres
Comandos
Amazon EC2 in der Praxis
VMworld 2013: Capacity Jail Break: vSphere 5 Space Reclamation Nuts and Bolts
Xen ATG case study
PVOps Update
XS Boston 2008 VT-D PCI
XS Japan 2008 Citrix English
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
High Availability Solutions in SQL 2012
090507.New Replication Features(2)
Atril-Déjà Vu Tea mserver 2 general presentation
HA Clustering of PostgreSQL(replication)@2012.9.29 PG Study.
Ad

Similar to Synchronous Log Shipping Replication (20)

PDF
Built-in Replication in PostgreSQL
PPTX
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
PDF
PostgreSQL Scaling And Failover
PDF
Dataguard physical stand by setup
PPTX
Sql server 2012 ha and dr sql saturday boston
PPTX
Sql Server 2012 HA and DR -- SQL Saturday Richmond
PPTX
Sql server 2012 ha and dr sql saturday tampa
PPTX
Sql server 2012 - always on deep dive - bob duffy
PPTX
Sql server 2012 ha and dr sql saturday dc
PDF
Load balancing and failover options
DOCX
Oracle 12c far sync standby instance
PDF
MySQL for Large Scale Social Games
PPT
Oracle dataguard overview
PPTX
Oracle WebLogic Server 12c: Seamless Oracle Database Integration (with NEC, O...
PPTX
SQL 2012 AlwaysOn Availability Groups (AOAGs) for SharePoint Farms - Norcall ...
DOC
Dr queries
PDF
Replication Tips & Trick for SMUG
PDF
The Magic of Hot Streaming Replication (Bruce Momjian)
PDF
SQL Server Clustering and High Availability
PDF
WalB: Block-level WAL. Concept.
Built-in Replication in PostgreSQL
Always On - Wydajność i bezpieczeństwo naszych danych - High Availability SQL...
PostgreSQL Scaling And Failover
Dataguard physical stand by setup
Sql server 2012 ha and dr sql saturday boston
Sql Server 2012 HA and DR -- SQL Saturday Richmond
Sql server 2012 ha and dr sql saturday tampa
Sql server 2012 - always on deep dive - bob duffy
Sql server 2012 ha and dr sql saturday dc
Load balancing and failover options
Oracle 12c far sync standby instance
MySQL for Large Scale Social Games
Oracle dataguard overview
Oracle WebLogic Server 12c: Seamless Oracle Database Integration (with NEC, O...
SQL 2012 AlwaysOn Availability Groups (AOAGs) for SharePoint Farms - Norcall ...
Dr queries
Replication Tips & Trick for SMUG
The Magic of Hot Streaming Replication (Bruce Momjian)
SQL Server Clustering and High Availability
WalB: Block-level WAL. Concept.
Ad

More from elliando dias (20)

PDF
Clojurescript slides
PDF
Why you should be excited about ClojureScript
PDF
Functional Programming with Immutable Data Structures
PPT
Nomenclatura e peças de container
PDF
Geometria Projetiva
PDF
Polyglot and Poly-paradigm Programming for Better Agility
PDF
Javascript Libraries
PDF
How to Make an Eight Bit Computer and Save the World!
PDF
Ragel talk
PDF
A Practical Guide to Connecting Hardware to the Web
PDF
Introdução ao Arduino
PDF
Minicurso arduino
PDF
Incanter Data Sorcery
PDF
PDF
Fab.in.a.box - Fab Academy: Machine Design
PDF
The Digital Revolution: Machines that makes
PDF
Hadoop + Clojure
PDF
Hadoop - Simple. Scalable.
PDF
Hadoop and Hive Development at Facebook
PDF
Multi-core Parallelization in Clojure - a Case Study
Clojurescript slides
Why you should be excited about ClojureScript
Functional Programming with Immutable Data Structures
Nomenclatura e peças de container
Geometria Projetiva
Polyglot and Poly-paradigm Programming for Better Agility
Javascript Libraries
How to Make an Eight Bit Computer and Save the World!
Ragel talk
A Practical Guide to Connecting Hardware to the Web
Introdução ao Arduino
Minicurso arduino
Incanter Data Sorcery
Fab.in.a.box - Fab Academy: Machine Design
The Digital Revolution: Machines that makes
Hadoop + Clojure
Hadoop - Simple. Scalable.
Hadoop and Hive Development at Facebook
Multi-core Parallelization in Clojure - a Case Study

Recently uploaded (20)

PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
August Patch Tuesday
PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
Tartificialntelligence_presentation.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
A Presentation on Artificial Intelligence
PPTX
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PPTX
TLE Review Electricity (Electricity).pptx
PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
Spectroscopy.pptx food analysis technology
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
cloud_computing_Infrastucture_as_cloud_p
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPT
Teaching material agriculture food technology
PDF
Getting Started with Data Integration: FME Form 101
PPTX
1. Introduction to Computer Programming.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
August Patch Tuesday
Group 1 Presentation -Planning and Decision Making .pptx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Tartificialntelligence_presentation.pptx
Encapsulation_ Review paper, used for researhc scholars
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
A Presentation on Artificial Intelligence
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
Univ-Connecticut-ChatGPT-Presentaion.pdf
TLE Review Electricity (Electricity).pptx
A comparative analysis of optical character recognition models for extracting...
Spectroscopy.pptx food analysis technology
Advanced methodologies resolving dimensionality complications for autism neur...
cloud_computing_Infrastucture_as_cloud_p
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Teaching material agriculture food technology
Getting Started with Data Integration: FME Form 101
1. Introduction to Computer Programming.pptx

Synchronous Log Shipping Replication

  • 1. Synchronous Log Shipping Replication Takahiro Itagaki and Masao Fujii NTT Open Source Software Center PGCon 2008
  • 2. Agenda • Introduction: What is this? – Background – Compare with other replication solutions • Details: How it works – Struggles in development • Demo • Future work: Where are we going? • Conclusion Copyright © 2008 NTT, Inc. All Rights Reserved. 2
  • 4. What is this? • Successor of warm-standby servers – Replication system using WAL shipping. • using Point-in-Time Recovery mechanism – However, no data loss after failover because of synchronous log-shipping. WAL Active Server Standby Server • Based on PostgreSQL 8.2 with a patch and including several scripts – Patch: Add two processes into postgres – Scripts: For management commands Copyright © 2008 NTT, Inc. All Rights Reserved. 4
  • 5. Warm-Standby Servers (v8.2~) Active Server (ACT) Standby Server (SBY) Commit 1 2 Flush WAL to disk (Return) 3 Failover Crash! WAL seg Sent after commits 4 archive_command WAL seg Redo The last segment is not We need to wait for remounting available in the standby server active’s storage on the standby if the active crashes before server, or we wait the active’s archiving it. reboot. Copyright © 2008 NTT, Inc. All Rights Reserved. 5
  • 6. Synchronous Log Shipping Servers Active Server (ACT) Standby Server (SBY) WAL entries are sent before returning from Commit 1 commits by records. Segments are formed 2 Flush WAL to disk from records in the WAL records standby server. 3 Send WAL records (Return) 4 WAL seg Failover Crash! Redo We can start the standby server after redoing remaining segments; We’ve received all transaction logs already in it. Copyright © 2008 NTT, Inc. All Rights Reserved. 6
  • 7. Background: Why new solution? • We have many migration projects from Oracles and compete with them with postgres. – So, we hope postgres to be SUPERIOR TO ORACLE! • Our activity in PostgreSQL 8.3 – Performance stability • Smoothed checkpoint – Usability; Ease to tune server parameters • Multiple autovacuum workers • JIT bgwriter – automatic tuning of bgwriter • Where are alternatives of RAC? – Oracle Real Application Clusters Copyright © 2008 NTT, Inc. All Rights Reserved. 7
  • 8. Background: Alternatives of RAC • Oracle RAC is a multi-purpose solution – … but we don’t need all of the properties. • In our use: – No downtime <- Very Important – No data loss <- Very Important – Automatic failover <- Important – Performance in updates <- Important – Inexpensive hardware <- Important – Performance scalability <- Not important • Goal – Minimizing system downtime – Minimizing performance impact in updated-workloads Copyright © 2008 NTT, Inc. All Rights Reserved. 8
  • 9. Compare with other replication solutions No data No SQL Performance Update How to Failover scalability performance loss restriction copy? Log Shipping OK OK Auto, Fast No Good Log warm-standby NG OK Manual No Async Slony-I NG OK Manual Good Async Trigger Auto, pgpool-II OK NG Hard to Good Medium SQL re-attach Shared Disks OK OK Auto, Slow No Good Disk • Log Shipping is excellent except performance scalability. • Also, Re-attaching a repaired server is simple. – Just same as normal hot-backup procedure • Copy active server’s data into standby and just wait for WAL replay. – No service stop during re-attaching Copyright © 2008 NTT, Inc. All Rights Reserved. 9
  • 10. Compare downtime with shared disks • Cold standby with shared disks is an alternative solution – but it takes long time to failover in heavily-updated load. – Log-shipping saves time for mounting disks and recovery. Shared disk system Crash! 20 sec 60 ~ 180 sec (*) to umount to recover Log-shipping system and remount from the last shared disks checkpoint Crash! Ok, the service is restarted! 5 sec 10 sec to recover (*) Measured in PostgreSQL 8.2. to detect the last 8.3 would take less time because server down segement of less i/o during recovery. Copyright © 2008 NTT, Inc. All Rights Reserved. 10
  • 11. Advantages and Disadvantages • Advantages – Synchronous • No data loss on failover – Log-based (Physically same structure) • No functional restrictions in SQL • Simple, Robust, and Easy to setup – Shared-nothing • No Single Point of Failure • No need for expensive shared disks – Automatic Fast Failover (within 15 seconds) • “Automatic” is essential not to wait human operations – Less impact against update performance (less than 7%) • Disadvantages – No performance scalability (for now) – Physical replication. Cannot use for upgrading purposes. Copyright © 2008 NTT, Inc. All Rights Reserved. 11
  • 12. Where is it used? • Interactive teleconference management package – Commercial service in active – Manage conference booking and file transfer – Log-shipping is an optional module for users requiring high availability Internet networks Communicator Copyright © 2008 NTT, Inc. All Rights Reserved. 12
  • 14. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 14
  • 15. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 In our replicator, there are two – Open source high-availability software manages the resources via resource agent(RA) – nodes, active and standby Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 15
  • 16. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA In the active node, postgres is PostgreSQL PostgreSQL DB running in normal mode with newDB child process WALSender WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 16
  • 17. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA In the standby node, postgres is running PostgreSQL PostgreSQL in continuous recovery mode with new DB DB daemon WALReceiver WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 17
  • 18. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB In order to manage these resources, WAL postgres startup WAL there is heartbeat in both nodes wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 18
  • 19. System overview • Based on PostgreSQL 8.2, 8.3(under porting) • WALSender – New child process of postmaster – Reads WAL from walbuffers and sends WAL to WALReceiver • WALReceiver – New daemon to receive WAL – Writes WAL to disk and communicates with startup process • Using Heartbeat 2.1 – Open source high-availability software manages the resources via resource agent(RA) – Heartbeat provides a virtual IP(VIP) Active Standby Heartbeat Heartbeat VIP RA RA PostgreSQL PostgreSQL DB DB WAL postgres startup WAL wal buffers WAL WALSender WALReceiver Copyright © 2008 NTT, Inc. All Rights Reserved. 19
WALSender

[Diagram] The commit path on the active node (postgres, the WAL buffers, and WALSender):
1. An update command triggers XLogInsert(), which inserts WAL into the WAL buffers.
2. The commit command triggers XLogWrite(), which flushes the WAL to disk.
3. We changed XLogWrite() to request WALSender to transfer the WAL.
4. WALSender reads the WAL from the WAL buffers and transfers it to the standby.
5. Only after the transfer finishes does the commit command return.
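The synchronous commit path above can be sketched as a minimal Python simulation (illustrative only, not NTT's actual C implementation; names like `SyncWalSender` and `wal_buffers` are invented for this sketch):

```python
import queue
import threading

# Sketch of the synchronous commit path: the committing backend flushes WAL
# locally, hands it to a sender thread, and blocks until the standby has
# received it. All names here are illustrative, not the real C symbols.

class SyncWalSender:
    def __init__(self, send_to_standby):
        self._send = send_to_standby          # stands in for the network call
        self._requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            wal_records, done = self._requests.get()
            self._send(wal_records)           # blocks until the standby has it
            done.set()                        # wake the committing backend

    def request_transfer(self, wal_records):
        done = threading.Event()
        self._requests.put((wal_records, done))
        done.wait()                           # commit waits here

standby_disk = []                             # stands in for the standby's WAL
sender = SyncWalSender(standby_disk.extend)

wal_buffers = []

def xlog_insert(record):
    wal_buffers.append(record)                # update path: WAL into buffers

def commit():
    # flush WAL to local disk here (omitted), then ship it synchronously
    sender.request_transfer(list(wal_buffers))
    wal_buffers.clear()

xlog_insert("UPDATE t ...")
commit()
print(standby_disk)  # ['UPDATE t ...'] -- on the standby before commit returns
```

The key property is step 5 of the slide: `commit()` cannot return until the transfer has completed, so a crash of the active immediately after commit loses no acknowledged transaction.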
WALReceiver

[Diagram] The receive path on the standby node (WALReceiver, the WAL on disk, and the startup process):
1. WALReceiver receives WAL from WALSender and flushes it to disk.
2. WALReceiver informs the startup process of the latest LSN.
3. The startup process reads WAL up to the latest LSN and replays it. We changed ReadRecord() so that the startup process communicates with WALReceiver and replays each WAL record as it arrives.
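The standby-side flow above can be sketched the same way (an illustrative Python model, not the actual C code; `wal_on_disk` and `latest_lsn` are invented stand-ins):

```python
import threading

# Sketch of the standby side: WALReceiver appends incoming records to the
# WAL on disk and advertises the latest LSN; the startup process replays
# record by record up to that LSN instead of waiting for a full segment.

wal_on_disk = []          # stands in for WAL files on the standby
latest_lsn = 0            # highest LSN WALReceiver has flushed
replayed = []             # what the startup process has applied
lock = threading.Condition()

def wal_receiver(record):
    global latest_lsn
    with lock:
        wal_on_disk.append(record)   # flush the received WAL to disk
        latest_lsn = len(wal_on_disk)
        lock.notify()                # inform the startup process

def startup_replay(until_lsn):
    # ReadRecord()-style loop: wait until WAL up to until_lsn is on disk,
    # then replay one record at a time.
    with lock:
        while latest_lsn < until_lsn:
            lock.wait()
    while len(replayed) < until_lsn:
        replayed.append(wal_on_disk[len(replayed)])

wal_receiver("record-1")
wal_receiver("record-2")
startup_replay(2)
print(replayed)  # ['record-1', 'record-2']
```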
Why replay by each WAL record?

• Minimize downtime
• Shorter delay in read-only queries (at the standby)

                                        Our replicator      Warm-Standby
  Replay by                             each WAL record     each WAL segment
  Needed to be replayed at failover     a few records       the latest one segment
  Delay in read-only queries            shorter             longer

In our replicator, because replay is per WAL record, the standby only has to replay a few records at failover, and the delay in read-only queries is shorter. In warm-standby, because replay is per WAL segment, the standby has to replay the latest full segment: in the diagram's example, warm-standby needed to replay most of 'segment2' at failover. Therefore, we implemented replay by each WAL record.

[Diagram: segment1 and segment2 as rows of WAL blocks, distinguishing WAL that can be replayed now from WAL that must still be replayed at failover.]
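The comparison above can be put in rough numbers. A toy calculation (the 16MB segment size is PostgreSQL's default; the average record size is an assumption, not a measurement from the talk):

```python
# Toy model of the failover backlog under two replay granularities.
# With per-segment replay, WAL in the current, partially written segment
# cannot be replayed until the segment completes; with per-record replay
# the backlog is at most one record.

SEGMENT = 16 * 1024 * 1024               # PostgreSQL's default WAL segment size
RECORD = 128                             # assumed average WAL record size

def failover_backlog(total_wal_bytes, granularity):
    """WAL bytes shipped but not yet replayable when the active crashes."""
    return total_wal_bytes % granularity

wal_written = 40 * 1024 * 1024 + 12345   # 2.5 segments plus a little
print(failover_backlog(wal_written, SEGMENT))   # ~8MB to replay per-segment
print(failover_backlog(wal_written, RECORD))    # under 128 bytes per-record
```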
Heartbeat and resource agent

• Heartbeat needs a resource agent (RA) to manage PostgreSQL (with WALSender) and WALReceiver as a resource
• An RA is an executable providing the following features:

  Feature    Description
  start      start the resources as standby
  promote    change the status from standby to active
  demote     change the status from active to standby
  stop       stop the resources
  monitor    check if the resource is running normally

[Diagram] Heartbeat invokes the RA (start / promote / demote / stop / monitor), and the RA controls the resources: PostgreSQL (with WALSender) and WALReceiver.
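The RA interface in the table above is just a dispatch over five action names. A hypothetical Python stand-in (real Heartbeat RAs are usually shell scripts following the OCF convention; every body below is a placeholder, not the real agent):

```python
# Hypothetical stand-in for the resource agent: each action returns an
# OCF-style exit code (0 = success). A real RA would actually start and
# stop postgres (with WALSender) and WALReceiver.

def start():    return 0   # start the resources as standby
def promote():  return 0   # finish WAL replay, become active
def demote():   return 0   # go back to standby
def stop():     return 0   # stop the resources
def monitor():  return 0   # check that the resource is running normally

ACTIONS = {"start": start, "promote": promote, "demote": demote,
           "stop": stop, "monitor": monitor}

def run_action(argv):
    """Dispatch the action name Heartbeat passes on the command line."""
    action = argv[1] if len(argv) > 1 else "monitor"
    return ACTIONS.get(action, monitor)()

print(run_action(["ra", "promote"]))  # 0
```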
Failover

• Failover occurs when Heartbeat detects that the active node is not running normally
• After failover, clients can restart transactions simply by reconnecting to the virtual IP provided by Heartbeat

[Diagram] At failover, Heartbeat requests the startup process to finish WAL replay; we changed ReadRecord() to handle this request. After finishing WAL replay, the standby becomes active.
Downtime caused by the standby down

• The active going down triggers a failover and causes downtime
• Additionally, the standby going down might also cause downtime
        – WALSender waits for the response from the standby after sending WAL
        – So, when the standby goes down, WALSender stays blocked until it detects the failure
        – i.e. WALSender keeps waiting for a response that never comes
• How to detect
        – A timeout notification is needed for detection
        – TCP keepalive, but it occasionally doesn't work on Linux (a Linux bug!?)
        – Our own timeout

[Diagram] postgres issues a commit and requests WALSender to send the WAL; the standby goes down; WALSender waits, and the commit is blocked.
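The "our own timeout" idea can be sketched with a plain socket timeout. A minimal illustration (the address, port, and one-byte ack are invented; this is not the replicator's actual wire protocol):

```python
import socket

# Detect a dead standby with an explicit timeout instead of TCP keepalive:
# if the standby does not acknowledge within the deadline, report failure
# rather than blocking the committing backend forever.

def send_wal(records: bytes, addr, timeout=3.0):
    """Return True only if the standby acknowledged within `timeout` seconds."""
    try:
        with socket.create_connection(addr, timeout=timeout) as s:
            s.settimeout(timeout)         # bound recv as well as connect
            s.sendall(records)
            return s.recv(1) == b"A"      # wait for the standby's ack byte
    except OSError:                       # timed out, refused, unreachable...
        return False                      # treat the standby as down

# With no standby listening, the sender detects the failure instead of
# blocking forever:
print(send_wal(b"wal", ("127.0.0.1", 1), timeout=0.3))  # False
```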
Downtime caused by clients

• Even if the database finishes a failover immediately, downtime might still be prolonged for client-side reasons
        – Clients wait for the response from the database
        – So, when a failover occurs, clients can't reconnect to the new active and restart their transactions until they detect the failover
        – i.e. clients keep waiting for a response that never comes
• How to detect
        – A timeout notification is needed for detection
        – Keepalive
                • Our setKeepAlive patch was accepted in JDBC 8.4dev
        – Socket timeout
        – Query timeout
                • We want to implement these timeouts!!

[Diagram: a client connected to the active; the active goes down; the standby becomes the new active.]
Split-brain

• High-availability clusters must be able to handle split-brain
• Split-brain causes data inconsistency
        – Both nodes are active and provide the virtual IP
        – So, clients might update each node inconsistently
• Our replicator can also fall into split-brain unless the standby can distinguish a network failure from the active going down

[Diagram] If a network failure is mis-detected as the active going down, the standby becomes active even though the other active is still running normally: this is the split-brain scenario. Conversely, if the active going down is mis-detected as a network failure, failover doesn't start even though the active is down: this is also a problem, though split-brain doesn't occur.
Split-brain

• How to distinguish: combine the following solutions
        1. Redundant network between the two nodes
                – The standby can distinguish the two cases unless all the networks fail
        2. STONITH (Shoot The Other Node In The Head)
                – Heartbeat's default solution for avoiding split-brain
                – STONITH always forcibly turns off the old active when activating the standby
                – Split-brain cannot occur because there is never more than one active node

[Diagram: STONITH turns off the old active node.]
What delays the activation of the standby

• To activate the standby immediately, recovery time at failover must be short!!
• In 8.2, recovery is very slow
        – A lot of WAL needing replay at failover may have accumulated
        – Another problem: a disk-full failure might happen
• In 8.3, recovery is fast ☺
        – Because unnecessary reads are avoided
        – But two problems remain
What delays the activation of the standby

1. Checkpoint during recovery
        – It took 1 minute or more (in the worst case) and occupied 21% of recovery time
        – What is worse, WAL replay is blocked during the checkpoint
                • Because the startup process alone performs both checkpoints and WAL replay
        -> Checkpoints delay recovery...
• [Just an idea] bgwriter during recovery
        – Leave checkpoints to bgwriter, so the startup process can concentrate on WAL replay
What delays the activation of the standby

2. Checkpoint at the end of recovery
        – Activation of the standby is blocked during this checkpoint
        -> Downtime might be 1 minute or more...
• [Just an idea] Skip the checkpoint at the end of recovery
        – But would postgres work correctly if it crashed before the first checkpoint after recovery?
        – We have to reconsider why a checkpoint is needed at the end of recovery

!!! Of course, because recovery is a critical part of a DBMS, more careful investigation is needed to realize these ideas
How we choose the node with the later LSN

• When starting both nodes, we should synchronize from the node with the later LSN to the other
        – But it's unreliable to depend on server logs (e.g. the heartbeat log) or on human memory to choose that node
• We choose the node from the WAL itself, which is the most reliable source
        – Find the latest LSN in each node's WAL files with our own xlogdump-like tool and compare them
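The comparison step can be sketched as follows, assuming LSNs printed in PostgreSQL's usual "hi/lo" hex notation (the xlogdump-like tool's real output format may differ):

```python
# Sketch of comparing the latest LSNs found in each node's WAL files.
# Assumes the "XXXXXXXX/XXXXXXXX" hex notation PostgreSQL uses for
# WAL positions; the node dict maps node name -> latest LSN found.

def parse_lsn(text: str) -> int:
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def node_with_later_lsn(latest: dict) -> str:
    """Pick the node whose WAL ends at the highest LSN: it is the sync source."""
    return max(latest, key=lambda node: parse_lsn(latest[node]))

print(node_with_later_lsn({"node0": "0/1A2B3C4", "node1": "0/1A2B400"}))  # node1
```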
Bottleneck

• Bad performance just after failover, because the new active starts with:
        – No FSM (free space map)
        – Few commit hint bits set in heap tuples
        – Few dead hint bits set in indexes
Demo
Demo Environment

• 2 nodes, 1 client

How to watch
• There are two kinds of terminals
• The terminal at the top of the screen displays the cluster status; the node shown with
        – 3 lines is active
        – 1 line is standby
        – no lines is not started yet
• The other terminals are for operation: Client, Node0, Node1

[Diagram: the client above Node0 (active) and Node1 (standby).]
Demo Operation

1. Start only node0 as the active
2. createdb and pgbench -i (from the client)
3. Online backup
4. Copy the backup from node0 to node1
5. pgbench -c2 -t2000
6. Start node1 as the standby during pgbench
        -> synchronization starts
7. killall -9 postgres (on active node0)
        -> failover occurs
Future work: Where are we going?
Where are we going?

• We're thinking of making it Open Source Software
        – To be a multi-purpose replication framework
        – Collaborators welcome
• TODO items
        – For 8.4 development
                • Re-implement WAL-Sender and WAL-Receiver as extensions using two new hooks
                • Make xlogdump an official contrib module
        – For performance
                • Improve checkpointing during recovery
                • Handle un-logged operations
        – For usability
                • Improve detection of server down in the client library
                • Automatically retry aborted transactions in the client library
For 8.4: WAL-writing Hook

• Purpose
        – Make WAL-Sender one of the general extensions
                • WAL-Sender sends WAL records before commits
• Proposal
        – Introduce a "WAL-subscriber model"
        – A "WAL-writing Hook" enables extensions to replace or filter WAL records just before they are written to disk
• Other extensions using this hook
        – A "Software RAID" WAL writer for redundancy
                • Writes WAL into two files for durability (it might be paranoia…)
        – A filter that makes a bitmap for partial backup
                • Writes changed pages into on-disk bitmaps
        – …
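The subscriber model proposed above can be sketched in Python (the real proposal is a C hook point inside the WAL-write path; the two subscribers below mirror the slide's examples and are purely illustrative):

```python
# Sketch of the "WAL-subscriber model": a WAL-writing hook lets registered
# subscribers observe, replace, or filter each WAL chunk just before it is
# written to disk.

wal_write_hooks = []                 # subscribers, called in registration order

def register_wal_write_hook(fn):
    wal_write_hooks.append(fn)

disk, mirror, shipped = [], [], []

def xlog_write(chunk):
    for hook in wal_write_hooks:
        chunk = hook(chunk)          # a hook may observe or replace the chunk
    disk.append(chunk)               # the normal write path

def software_raid(chunk):            # "Software RAID" WAL writer
    mirror.append(chunk)             # duplicate WAL into a second file
    return chunk

def wal_sender_hook(chunk):          # WAL-Sender as just another subscriber
    shipped.append(chunk)            # ship the chunk to the standby
    return chunk

register_wal_write_hook(software_raid)
register_wal_write_hook(wal_sender_hook)
xlog_write(b"wal-chunk-1")
print(disk == mirror == shipped)     # True
```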
For 8.4: WAL-reading Hook

• Purpose
        – Make WAL-Receiver one of the general extensions
                • WAL-Receiver redoes each record, not each segment
• Proposal
        – A "WAL-reading Hook" enables extensions to filter WAL records while they are read during recovery
• Other extensions using this hook
        – A read-ahead WAL reader
                • Reads a segment at once and pre-fetches required pages that are not full-page writes and not in shared buffers
        – …
Future work: Multiple Configurations

• Support several synchronization modes
        – One configuration does not fit all, but one framework could fit many uses!

  (Each cell: Before/After the commit in ACT)
  No.  Configuration               Send to SBY  Flush in ACT  Flush in SBY  Redo in SBY
  1    Speed                       After        After         After         After
  2    Speed + Durability          After        Before        After         After
  3    HA + Speed (← Now)          Before       After         After         After
  4    HA + Durability             Before       Before        After         After
  5    HA + More durability        Before       Before        Before        After
  6    Synchronous Reads in SBY    Before       Before        Before        Before
Future work: Horizontal scalability

• Horizontal scalability is not our primary goal, but it matters for potential users
• The Postgres TODO item "Allow a warm standby system to also allow read-only statements" helps us
• NOTE: We need 3 or more servers if we need both scalability and availability
        – 2 servers: 2 × 50% load each normally -> 1 × 100% after a failure
        – 3 servers: 3 × 66% load each normally -> 2 × 100% after a failure
Conclusion

• Synchronous log-shipping is the best approach for HA
        – A direction for the future of warm-standby
        – Less downtime, no data loss, and automatic failover
• There remains room for improvement
        – Minimizing downtime and improving performance scalability
        – Improvements to recovery also help log-shipping
• We've shown the requirements, advantages, and remaining tasks
        – It has potential for improvement, but needs some work to become a more useful solution
        – We'll make it open source! Collaborators welcome!