SpecPaxos replica crashes upon roll backs and when merging logs #4

@ramanala

Hello, I have been experimenting with the SpecPaxos implementation. The scenario is simple: I run five replicas on a single machine (listening on different ports), with five clients sending requests in a closed loop. I understand that for SpecPaxos to deliver high throughput and low latency, the network needs to provide ordered delivery (at least most of the time). Otherwise there will be many conflicts, leading to many rollbacks that hurt performance, but the system must still keep making progress.

However, in the above scenario, I see that the replicas start to crash after a while. Once two replicas crash (in a five-node cluster), the clients block indefinitely.

Details:

I compiled with the paranoid flag on.
Here is how I start the servers:

./bench/replica -c ./conf -i 0 -m spec >rep0 2>&1 &
./bench/replica -c ./conf -i 1 -m spec >rep1 2>&1 &
./bench/replica -c ./conf -i 2 -m spec >rep2 2>&1 &
./bench/replica -c ./conf -i 3 -m spec >rep3 2>&1 &
./bench/replica -c ./conf -i 4 -m spec >rep4 2>&1 &

Here is how I start the clients:

./bench/client -c ./conf -n 1000 -m spec >cli-0 2>&1 &
./bench/client -c ./conf -n 1000 -m spec >cli-1 2>&1 &
./bench/client -c ./conf -n 1000 -m spec >cli-2 2>&1 &
./bench/client -c ./conf -n 1000 -m spec >cli-3 2>&1 &
./bench/client -c ./conf -n 1000 -m spec >cli-4 2>&1 &
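For anyone reproducing this, the ten commands above can be generated by a small loop. This is only a sketch: the binary paths and flags are copied from the commands above, while `NUM_NODES`, `CONF`, `REQUESTS`, `DRY_RUN` and the helper functions are names made up for the script, not options of the SpecPaxos tools. By default it only prints the commands; setting `DRY_RUN=0` is assumed to launch them from the repository root.

```shell
#!/bin/sh
# Sketch: start N replicas and N clients with the same flags as the
# hand-written commands. All variable and function names here are
# hypothetical helpers for this script only.
NUM_NODES=${NUM_NODES:-5}
CONF=${CONF:-./conf}
REQUESTS=${REQUESTS:-1000}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, 0 = actually launch them

replica_cmd() {  # $1 = replica index
    echo "./bench/replica -c $CONF -i $1 -m spec"
}

client_cmd() {
    echo "./bench/client -c $CONF -n $REQUESTS -m spec"
}

launch() {  # $1 = command line, $2 = log file
    if [ "$DRY_RUN" = "1" ]; then
        echo "$1 >$2 2>&1 &"
    else
        # Run in the background, sending stdout and stderr to the log file.
        sh -c "$1" >"$2" 2>&1 &
    fi
}

# Replicas are indexed 0..NUM_NODES-1 and log to rep0, rep1, ...
i=0
while [ "$i" -lt "$NUM_NODES" ]; do
    launch "$(replica_cmd "$i")" "rep$i"
    i=$((i + 1))
done

# Clients take no index; each one logs to its own cli-<n> file.
i=0
while [ "$i" -lt "$NUM_NODES" ]; do
    launch "$(client_cmd)" "cli-$i"
    i=$((i + 1))
done
```

Running it with the defaults prints the same ten command lines as listed above, which makes it easy to check the flags before launching anything.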

Here is the stack trace of a replica that is crashing:

20190907-154417-2122 17865 * MergeLogs (replica.cc:820): [2] Merging 3 logs
20190907-154417-2124 17865 PANIC MergeLogs (replica.cc:1060): Assertion `newEntry.viewstamp.view == entry.view()' failed
20190907-154417-2124 17865 ! Backtrace (message.cc:169): Backtrace:
20190907-154417-2128 17865 ! Backtrace (message.cc:220): 0: _Z6_Panicv+0x9 [0x440314]
20190907-154417-2130 17865 ! Backtrace (message.cc:220): 1: _ZN9specpaxos4spec11SpecReplica9MergeLogsEmmRKSt3mapIiNS0_5proto19DoViewChangeMessageESt4lessIiESaISt4pairIKiS4_EEERSt6vectorINS_3Log8LogEntryESaISG_EE+0x1a19 [0x40bc9f]
20190907-154417-2132 17865 ! Backtrace (message.cc:220): 2: _ZN9specpaxos4spec11SpecReplica18HandleDoViewChangeERK16TransportAddressRKNS0_5proto19DoViewChangeMessageE+0x965 [0x40e5a3]
20190907-154417-2134 17865 ! Backtrace (message.cc:220): 3: _ZN9specpaxos4spec11SpecReplica14ReceiveMessageERK16TransportAddressRKSsS6_+0x5bd [0x4077c7]
20190907-154417-2136 17865 ! Backtrace (message.cc:220): 4: _ZN12UDPTransport10OnReadableEi+0xb84 [0x4489fc]
20190907-154417-2138 17865 ! Backtrace (message.cc:220): 5: _ZN12UDPTransport14SocketCallbackEisPv+0x39 [0x448f2b]
20190907-154417-2140 17865 ! Backtrace (message.cc:220): 6: event_base_loop+0x754 [0x7f684a341f24]
20190907-154417-2142 17865 ! Backtrace (message.cc:220): 7: _ZN12UDPTransport3RunEv+0x1f [0x447bff]
20190907-154417-2144 17865 ! Backtrace (message.cc:220): 8: main+0x94f [0x40610f]
20190907-154417-2146 17865 ! Backtrace (message.cc:220): 9: __libc_start_main+0xf5 [0x7f684938af45]
20190907-154417-2148 17865 ! Backtrace (message.cc:220): 10: _start+0x29 [0x4056c9]
20190907-154417-2150 17865 ! Backtrace (message.cc:220): 11: ???+0x29 [0x29]

I can attach the full logs if needed. Ideally, the replicas should not crash but resolve the conflicts and make progress.
