Guugll Search: Stellar Switches To Centralized System

The Ripple paper and the code are not the same.

The Ripple code causes each node to determine “quorum” from the nodes it heard from in its last ledger close, not from its total UNL. This is likely what causes the forks, and this code is live in both consensus systems.

I’m not trying to get into a back-and-forth of blame with Ripple Inc. We have an obligation to make clear to people what we have seen and what we believe the issues are. When we first reached out to Prof. Mazieres, it was to get an independent third-party review of the Ripple algorithm from a respected computer scientist, so that we could be certain it worked. This is what should be done for any complicated algorithm. The Bitcoin paper has been rigorously reviewed and has generally passed all such review on a technical level. We wanted the same for Ripple. Unfortunately for us, the algorithm did not pass Prof. Mazieres’s review, and we do not know of any distributed-systems expert outside Ripple Inc. who has reviewed the algorithm and thinks it works.

We’ve seen the nodes tend to fall out of sync since at least September. The network would split three or four ways and then come back together, but it did so relatively quickly and without loss. Last week’s fork was another instance of this, except that this time the ledger was not able to come back together quickly.

Let’s review the only commit that could arguably change consensus.

https://github.com/stellar/stellard/commit/067d7158720331937fc782cbb230e8d422cd7341

It is the only change we made that could affect consensus, and at the time of the fork it was deployed on only a minority of validators.

This simple change only makes a node stop waiting when it realizes it is far behind the rest of the network. Waiting longer would not accomplish anything useful, because:

* the majority of the network has already moved on to a different consensus phase.

* updateposition was already called, so the instance has already learned what it can from its peers.

* Waiting longer for other positions actually increases the chance of divergence (at best, some of the positions the instance sees come from the majority; otherwise they are just noise). David Schwartz says as much here: “This may mean we occasionally are forked from the main ledger chain, but that’s perfectly fine.” https://github.com/stellar/stellard/pull/176#issuecomment-64780903

* this partition will hit a timeout anyway, which causes it to advance (see below) before it even sees that the majority network is proposing again.
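The stop-waiting rule can be pictured as a small decision function. This is a hypothetical sketch, not the code from the commit. `RoundView`, `shouldStopWaiting`, and the criterion "a strict majority of trusted peers is already ahead" are assumptions chosen to mirror the bullets above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the rule described above. These names do not come
// from the stellard commit.
struct RoundView {
    std::size_t trustedPeers;   // size of this node's UNL, excluding itself
    std::size_t peersMovedOn;   // peers already seen working on a later round
    bool positionsUpdated;      // has the node already updated its position?
};

// Returns true when waiting longer cannot help: the node has already learned
// what it can from its peers, and a strict majority of them is already ahead.
bool shouldStopWaiting(const RoundView& v) {
    assert(v.peersMovedOn <= v.trustedPeers);
    bool majorityAhead = 2 * v.peersMovedOn > v.trustedPeers;
    return v.positionsUpdated && majorityAhead;
}
```

With 6 trusted peers, a node that has already updated its position stops waiting once 4 of those peers are ahead. It keeps waiting at 3, and it also keeps waiting if it has not yet updated its position.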

We were running 7 validating nodes, all connected to each other, and each node’s validation_quorum was set to 4. According to our log files, the system falling out of sync despite this was most likely triggered by the existing Ripple code below.
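With 7 validators and validation_quorum set to 4, the quorum is a strict majority. Assume each validator signs at most one ledger per sequence number. Then two disjoint groups cannot both reach quorum, because 4 + 4 = 8 exceeds 7, so any two quorums share at least one validator. The sketch below shows the arithmetic; the function names are illustrative.

```cpp
#include <cassert>

// Smallest number of validators that any two quorums of size q must share
// in a network of n validators (pigeonhole: q + q - n, floored at zero).
int minQuorumOverlap(int n, int q) {
    assert(n > 0 && q > 0 && q <= n);
    int overlap = 2 * q - n;
    return overlap > 0 ? overlap : 0;
}

// Two non-overlapping groups can both reach quorum only if the
// required overlap is zero.
bool disjointQuorumsPossible(int n, int q) {
    return minQuorumOverlap(n, q) == 0;
}
```

For 7 validators with a quorum of 4, the required overlap is 1, so disjoint quorums are impossible. Lowering the quorum to 3 would allow them. This fits the point above: the network fell out of sync regardless of the quorum setting.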

LedgerTiming.cpp:157 (in stellard), LedgerTiming.cpp:121 (in rippled):

if (currentAgreeTime

This code ignores the number of participants when, for whatever reason, the node has missed proposals from its peers. That contradicts the Ripple paper. Pointing to the paper therefore does not explain the issues, because the code and the paper do not match.
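The quoted line is truncated, so it cannot be reproduced here. The sketch below is a simplified, hypothetical model of the behavior described above. The node waits for more proposers only until a timeout; after that, it measures agreement solely among the peers it actually heard from. The function name, parameter names, and the 3/4 and 80% thresholds are assumptions for illustration, not a verbatim copy of LedgerTiming.cpp.

```cpp
#include <cassert>

// Simplified, hypothetical model of the consensus check described above.
// Thresholds (3/4 participation, >80% agreement) are illustrative assumptions.
bool haveConsensusModel(int previousProposers,   // peers proposing in the last round
                        int currentProposers,    // peers heard from so far this round
                        int currentAgree,        // of those, how many agree with us
                        int previousAgreeTimeMs, // how long the last round took
                        int currentAgreeTimeMs,  // how long this round has run
                        int minConsensusMs) {
    assert(currentAgree >= 0 && currentAgree <= currentProposers);
    if (currentAgreeTimeMs <= minConsensusMs)
        return false;

    // Too few of last round's proposers are present: wait a bit longer...
    if (currentProposers < previousProposers * 3 / 4 &&
        currentAgreeTimeMs < previousAgreeTimeMs + minConsensusMs)
        return false;

    // ...but once that wait expires, participation is no longer checked.
    // Agreement is measured only against the peers actually heard from,
    // plus this node itself.
    return (currentAgree + 1) * 100 / (currentProposers + 1) > 80;
}
```

Consider a node that heard from only 1 of its 6 peers in a slow round, where that one peer agrees with it. In this model, the node declares consensus once the timeout expires, since (1 + 1) * 100 / (1 + 1) = 100 > 80.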

When a fork like this occurs, the minority partition doesn’t have enough validations to take over the network. It still closes ledgers, however, while the majority network carries on with fully validated ledgers. At some later point, the majority network has a glitch that causes some of its participants to fail to rejoin consensus. Once those nodes catch up, they may join the wrong network. In this case, that happens if the former majority network does not look like a majority from the point of view of their last closed ledger (LCL).

What happens next is the interesting part. This new majority network has been closing ledgers for some time without validating them, and it may now have enough participants to cross the validation threshold. When that happens, we end up with gaps in history. From that fork’s point of view, the last fully validated ledger dates back to the time the fork occurred.
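Here is a hypothetical illustration of the resulting gap. Suppose the last ledger validated before the fork was 100. The minority partition then closed 101 through 120 without validating them, and crossed the threshold at 121. The sequence numbers and the helper `findValidationGaps` are made up for illustration. The sketch only walks one fork's validated sequence numbers and reports the unvalidated runs between them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: given the ledger sequence numbers one fork has fully
// validated, in ascending order, return each inclusive run of sequence
// numbers that was never validated on that fork.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
findValidationGaps(const std::vector<std::uint32_t>& validatedSeqs) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> gaps;
    for (std::size_t i = 1; i < validatedSeqs.size(); ++i) {
        std::uint32_t prev = validatedSeqs[i - 1];
        std::uint32_t cur = validatedSeqs[i];
        assert(cur > prev);  // input must be strictly ascending
        if (cur - prev > 1)
            gaps.emplace_back(prev + 1, cur - 1);
    }
    return gaps;
}
```

For the validated sequence {100, 121, 122}, it reports a single gap covering 101 through 120.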

The main misunderstanding about how this code works comes from the fact that "previousProposers" is not the UNL. It is only the subset of the UNL that participated in the last consensus round, which, after a timeout, can be anywhere between 0 and the actual size of the UNL.
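The effect of that baseline can be shown with a hypothetical 3/4 participation rule; the ratio and the function names are assumptions for illustration. Suppose a node trusts 6 peers, and its last round timed out after hearing from only 2 of them.

```cpp
#include <cassert>

// Hypothetical 3/4 participation rule with two possible baselines: the full
// UNL, or whoever happened to propose in the previous round.
int requiredFromUnl(int unlSize) {
    assert(unlSize >= 0);
    return unlSize * 3 / 4;
}

int requiredFromPreviousProposers(int previousProposers, int unlSize) {
    // previousProposers can be anything from 0 up to the UNL size,
    // depending on how the last round ended.
    assert(previousProposers >= 0 && previousProposers <= unlSize);
    return previousProposers * 3 / 4;
}
```

Measured against the UNL, the node would want 4 proposers. Measured against previousProposers, it is satisfied by 1, and after a round with no proposers at all, by 0.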

source: https://stellarverse.org/forum/index.php/topic,10.15.html

Monday, March 16, 2015