Benefits of LTOR in block entropy encoding, or:
ISPs hate him! Learn how to make your block 75% Xthinner with this one weird trick!
We can encode the transaction set for a block efficiently by only specifying one or two bytes per transaction for most transactions. With LTOR, we can omit the first few leading bytes of the TXID because they are the same as neighboring transactions in the block, and for all orders we can omit most of the trailing bytes because they specify more information than is needed to disambiguate between potentially matching mempool transactions. A specific encoding scheme called Xthinner is presented herein.
Note: In this article, I will be using the terms “LTOR” and “CTOR” to refer to different aspects of the Bitcoin ABC proposal. “LTOR” refers to the lexical transaction ordering rule — that is, the requirement that transactions be sorted numerically (or alphabetically in hex) by TXID. “CTOR” refers to a canonical transaction ordering rule, and could be any rule that fully specifies how any set of transactions must be ordered. Gavin’s order, for example, is a CTOR but is not LTOR; on the other hand, LTOR is necessarily a CTOR. This usage differs from the definitions used by Vermorel et al (2018), but only because their definitions and usage are wrong.
As I discussed in my previous article, block propagation performance is important. Current block propagation techniques are able to send blocks at 1 MB/s or less of uncompressed block size, using about 40 kB/s of network bandwidth. This slow speed the result of the TCP congestion control algorithms, and is not related to the network pipe size of the nodes in the network. In order to improve block propagation, we need to improve the algorithms we use, not the hardware. With current block propagation speeds, we would likely start to see significant incentives encouraging mining pool centralization at 10–20 MB blocks. It would be better if we could avoid this centralization.
One of the technologies that has recently been proposed to help with block propagation performance is the canonical transaction ordering rule, or CTOR. If we have a CTOR, then there is only one valid ordering for any given set of transactions in a block. This reduces the information-theory entropy of blocks, and should allow a CTOR-aware coding algorithm to transmit blocks using less data.
During the BCH stress test, we saw that about 86% of the bytes that Graphene sent when transmitting blocks was used to encode the order of transactions. If we had a canonical transaction order, this data could be discarded, as the block recipient could algorithmically recompute the correct order for the transactions.
Graphene is a really cool and an extremely efficient algorithm for coding blocks when it works. However, it is fundamentally a probabilistic algorithm. It uses IBLTs (invertible bloom lookup tables) to encode the transaction ID data, and the ability of a recipient to decode the IBLT depends on the size of the IBLT encoded by the sender, the amount of synchrony between the sender’s and the recipient’s mempools, and some degree of luck. When Graphene fails, it fails utterly, and retransmission attempts have to start from scratch. The current alpha implementations of Graphene was able to get a 41% success rate over two days during vswr’s tests. While it’s expected that this will improve as development continues for Graphene, it will never reach 100%, so an efficient and reliable fallback method is desirable.
This got me to thinking: If canonically-ordered blocks have lower entropy, then it should be possible to improve upon the encoding used in Xthin and Compact Blocks. How would this savings manifest? While trying to find a way to take advantage of this entropy reduction, I came up with an encoding that is about 4x as efficient as Xthin/CB with CTOR, and 2x as efficient without CTOR.
Since this encoding scheme is inspired by and partially derivative of Xthin, I call this encoding scheme Xthinner. In Chinese contexts, the recommended spelling is “Xthin二” . Note that the Chinese character “二” means “two” in English, but is transliterated in pinyin as “ èr”.
The problem and its solution
(Note: all “block” TXIDs in this article come from this 21.3 MB stress test BCH block, and all “mempool” TXIDs come from 3 neighboring blocks.)
The fundamental problem that we need to solve in block propagation is specifying enough information about each of the transactions in a block for the recipient of the message to be able to reconstruct the transaction list (or set). In order to do that, we need to be able to uniquely identify each transaction in the block, and we want to do that with as few bits as possible. Let’s say we are trying to uniquely identify a transaction in our block with the following TXID:
We can identify this transaction using only the first few bytes of its TXID. But how many bytes do we need? If we sort our mempool by TXID, the neighboring transactions in our mempool will show us how many initial bytes (or bits) match between neighbors and which bits are unique:
In this case, to uniquely specify our transaction with an integer number of bytes, we would need to transmit
00 06 40, as the 40 is needed to disambiguate between our target transaction and the
00 06 4d... transaction. This allows us to reduce our encoding for this transaction to only 3 bytes, plus a bit of flow control information to specify the encoding length. In comparison, Xthin always transmits the first 8 bytes of a TXID, so we already have a large improvement over that.
Our second optimization comes from viewing the transaction in the context of a block sorted by TXID. The first 10 transactions for our block might look like this:
All of these TXIDs begin with the byte
00, and (for example) five in a row use
04 as the second byte. We don’t need to transmit those repeating bytes for each transaction. Instead, we can choose to only transmit information on which bytes are changing from transaction to transaction. For our
00 06 40... transaction, we are only able to omit the first byte, as the
06 is not the same as the previous TXID, so we need to encode the two bytes
06 40 for this transaction. (On average, we can do a little better than this, as most transactions only need one byte each to be uniquely specified. However, adding in flow control and error correcting information means our final performance will be close to two bytes per transaction.)
So how do we deal with the flow control issue? If we have a stream of bytes corresponding to the bytes needed to uniquely specify each transaction, how do we know which is the last byte for each transaction? And how do we know how many bytes are identical to the previous transaction? We could encode this information as a
(uint_8 first_byte_index, uint_8 last_byte_index) pair, but that would use another two full bytes, and would roughly double our typical encoding size. Instead, we can define a stack-based state machine, and encode our flow control information as one-bit commands to our state machine.
The full algorithm in four parts
For encoding, we use a stack-based state machine that follows a four-stage cycle. At the end of each cycle, the bytes on the stack represent enough of the TXID’s initial bytes in order to be able to uniquely identify that TXID in mempool. At the start of each cycle, we’re left with the initial bytes for the previous transaction. Sample code for this process can be viewed here. Our four stages are as follows:
- We can pop 1 or more bytes off the stack as necessary to disambiguate from the previous block transaction.
- We can push 1 or more bytes onto the stack as necessary to disambiguate from neighboring mempool transactions.
- We commit to a transaction using the prefix encoded on the stack.
- We accumulate data into one or more checksum bytes for error detection and correction.
We will always need to pop and push at least 1 byte per transaction; otherwise, encoding ambiguities would result. Each pop or push after the first can be encoded as a 1-bit command, where a
0indicates that we’re ready to move onto the next stage of the state machine and a
1 indicates that we still need to do at least one more pop or push in the current stage. Let’s go through an example. Let’s say the following list of hashes are the TXIDs in our mempool, where the bolded TXIDs are the ones in our block:
Our state machine starts with an empty stack.
1.1 (Round 1 stage 1) We don’t need to pop off any bytes to disambiguate from the previous transaction (because there was none), so we encode a
0 into our state machine pop command list.
1.2 Our transaction needs
00 02 87 to disambiguate it from its neighbors, and our stack currently empty, so we need to push a total of three bytes onto the stack.
pushbytes.extend([0x00, 0x02, 0x87]) The first push is always needed, so we only need to add two
1 values to our push command list.
pushes.extend([1, 1, 0])
1.3 Our stack is now
00 02 87. That’s enough information to uniquely identify this transaction as long as the recipient of this message has that transaction in their mempool and does not have a conflicting transaction with the same prefix.
1.4 We’ll talk about the checksumming later.
2.1 We have
00 02 87 on our stack, and our next transaction begins with
00 04 37, so we need to pop off two bytes. The first pop is automatic, so we need to tell our state machine to pop a second byte before going to the next stage.
2.2 We now have
00 on our stack. We need
00 04 37 to disambiguate from the mempool transaction beginning with
00 04 41, so we push two bytes:
pushbytes.extend([0x04, 0x37]). The first push is automatic, so we only need to tell our state machine about the second push before switching to the next tx.
2.3 This leaves
00 04 37 on our stack to uniquely identify our transaction.
3.1 Our next tx begins with
00 04 43. Only the last byte differs from our stack value, and that pop happens automatically, so we tell our state machine to perform no additional pops.
3.2 We now have
00 04 on our stack. We need one more byte
43 to disambiguate from the nearest mempool neighbor,
00 04 41, so we only need our one automatic push for that.
3.3 This leaves
00 04 43 on our stack to uniquely identify our transaction.
And so on.
Our encoded data for the first three transactions from part (a) was this:
pops = [0, 1, 0, 0]
pushes = [1, 1, 0, 1, 0, 0]
pushbytes = [0x00, 0x02, 0x87, 0x04, 0x37, 0x43]
To decode, we run our state machine with that data.
1.1 Our stack is empty to start with. We can’t do our automatic first pop, so we skip that step. The next bit in our pop command list is
0, so we do not do any additional pops. Our stack remains empty.
1.2 We do an automatic push of
pushes list has two
1 values followed by a
0, so we do two additional pushes of
02 87 to finish this stage.
1.3 Our stack is now
00 02 87. We check our mempool for transactions beginning with those bytes, and hopefully only find one such transaction, which we append to our block.
2.1 The next two values in our
pops list are
1, 0, so we do our automatic pop plus one more, leaving us with only
00 on our stack.
2.2 Our next values in
1, 0, so we do our automatic push plus one more. This pushes
04 37 onto our existing
2.3 We now have
00 04 37 on our stack. We check our mempool for transactions beginning with those bytes, and hopefully only find one such transaction, which we append to our block.
3.1 The next value in our
pops list is
0, so we only do our automatic pop, leaving us with
00 04 on our stack.
3.2 The next value in our
pushes list is
0, so we only do our automatic push. Our next value in
43, so we push that onto our stack.
3.3 We now have
00 04 43 on our stack. We check our mempool for transactions beginning with those bytes, and hopefully only find one such transaction, which we append to our block.
(c) Potential errors
Decoding errors can come in three types:
- The recipient does not have any transactions in their mempool whose TXID begins with the specified prefix. (missing-tx error)
- The recipient has more than one transaction in their mempool whose TXID begins with the specified prefix. (prefix-collision error)
- The recipient does not have the correct transaction in their mempool for the specified TXID prefix, but has an incorrect transaction with the same prefix. (Combined missing-tx and prefix-collision)
Error types #1 and #2 are trivial to detect. To detect error type #3, we can use checksums. Once errors have been detected, they can be resolved by asking the sender for the full TXIDs for the relevant block indices or ranges. Type #2 errors can often also be resolved by using checksums to disambiguate between different transactions.
In case our recipient does have a conflicting transaction, we can encode some additional checksum data so that the recipient can narrow down approximately where the decode error occurred. We can encode checksums at multiple levels in order to provide good location-specificity for typical errors, good overall sensitivity for rare multiple-collision errors, and low total checksum size. My suggested checksum scheme is to use a level 0 checksum for each group of 8 transactions (i.e. tx[0:8], tx[8:16], etc.), a level 1 checksum for each group of 64 transactions (i.e. tx[0:64], tx[64:128], etc.), a level 2 checksum for each group of 256 transactions, and a level 3 checksum for each group of 1024 transactions. For each level and for each peer connection, we pick a different (random) byte position from the TXID to use in our checksums. This gives a total of 4 bytes of checksum coverage for any given transaction, so if there’s a type #3 error, the probability of that error being undetected is 1 in 2³². If the decode error is undetected, the decoded block will fail the merkle root hash check and will DoS ban the peer that sent that message. However, since other peers will use different checksum byte indices, that same error will have a 1/2³² = 99.999999977% chance of being detected on every other p2p connection that the recipient has. This checksum scheme uses 1.164 bits per transaction for large blocks
Unlike plain Xthin, Xthinner is essentially immune to adversarial cases. Our adversarial cases can be divided into the following categories:
- Our adversary pre-fills our mempools with transactions that are difficult to encode, thereby bloating Xthinner messages.
- Our adversary waits until a block has been mined and encoded into an Xthinner message, and then injects conflicting transactions into the recipient’s mempool before they have had a chance to download and decode the Xthinner message.
Neither one of those two scenarios is particularly troublesome.
- Want to expend 256^n computation in order to increase the Xthinner message size for a transaction by n bytes? Okay, bring it on. Signature operations with libsecp256k1 happen at a rate of roughly 10,000–20,000 per second per core, so if an attacker tried to make transactions that took 6 bytes instead of 2 bytes each to encode, they would be able to spit out about 1 transaction every 3 seconds per CPU core, and they would still still have to pay transaction fees.
- A savvy adversary can also spy a Xthinner block message in flight, and pollute the mempool of the recipient with conflicting (#2 or possibly #3) transactions before the Xthinner message transfer finishes, but this just forces one extra round trip (~0.2 sec) and 32 bytes of traffic to fetch the full TXIDs for the conflicting positions. This attack is difficult to pull off, and will be limited to a small number of transactions (i.e. whatever you can get into the recipient’s mempool while the block is in flight). If the attacker is able to inject 100 conflicting transactions, that might make the recipient request another 3.2 kB of TXIDs over the network, which should be tolerable. If this attack actually gets performed in the wild, there’s a straightforward defense, too: the recipient could also attempt decoding the Xthinner message preferentially or exclusively using transactions that it knew about before it started downloading the block.
To run my xthinner-python test code, enter the following commands on a *nix machine:
git clone https://github.com/jtoomim/xthinner-python.git
python xthinner.py # for just the summary info
python xthinner.py -d | less # for full debugging output
Here is sample output:
$ python xthinner.py
95860 tx in block, 176671 tx in mempool (54%)
Sorting into LTOR took 0.031356 seconds
Encoding finished! 140752 pops, 140754 pushes, 140754 pushbytes, 13950 checksum bytes
Encoding took 0.283996 seconds
Encoding is 189901 bytes total, or 15.84820 bits per tx
Decoding took 0.256313 seconds
Does decoded match original block? True
The above scenario is for a 21.3 MB LTOR block that uses 54% of a 39.4 MB mempool. It does not include retransmissions due to missing transactions in the recipient’s mempool. If the 21.3 MB block comprises 100% of a 21.3 MB mempool, then the encoding size falls from 15.85 bits/tx to 14.15 bits/tx.
The proof-of-concept python implementation of the encoding scheme takes only 284 ms to encode a 21.3 MB block on my laptop, and 256 ms to decode. A proper C++ implementation will likely be 10x to 100x faster than that. Without any parallelization, encoding and decoding a 1 GB block will likely take less than 1 second in C++.
Are bytes optimal?
The decision to use a state machine that pushes full bytes onto the stack was completely arbitrary and driven by programming laziness. I have not tested any other push sizes. It’s quite possible that better encoding efficiency could be achieved with a smaller (or possibly larger) push size, like perhaps 4 bits or 6 bits. Using byte-sized pushes is probably good enough for now, though.
Operation without LTOR
Xthinner does not require LTOR, but it works better with it than with other orderings. If LTOR is not used, a separate step will be required before encoding and after decoding to change the transaction order to and from the LTOR order. If a CTOR other than LTOR is used, these steps can be performed without any additional encoding data overhead.
If LTOR is required (i.e. a CTOR), then the transactions will need to be sorted into LTOR once by the miner who creates the block. If LTOR is not required, then each node will need to sort the block into LTOR before each Xthinner hop, and will need to unsort from LTOR back into the true block order.
However, if a CTOR is not used, like with the current standard TTOR, then the encoder will need to separately encode order information. This can take up to (n log_2(n)) bits for a block with n transactions, or about 2 bytes per transaction for a 15 MB block. This roughly doubles the total encoding size. It is also possible to use specialized order compression schemes optimized for typical block transaction orders (i.e. sorted mostly by feerate) to reduce the size cost of order information by e.g. 72%. This could bring the order encoding information down to about 5 bits, for roughly 21 bits per transaction total on average for blocks with cooperative miners. Such an algorithm could run on BTC. Choosing to not use a CTOR also exposes the network to the transaction-reordering version of the outcast attack (concept — second half of post; math).
Xthinner isn’t just for blocks anymore. The same technique can be used to compress INV messages down to a few bytes per transaction. INV messages are the way that nodes announce to each other that they have one or more new transactions in their inventory. Currently, they require that each node send or receive 32 bytes (plus up to 95 bytes of TCP/IP and p2p overhead) to each of that node’s peers. For a node with 50 peers, that can result in 3.2 kB of download and upload traffic per transaction just to announce that a transaction is available for download, without even sending the transaction itself. With Xthinner encoding, this could be reduced to around 2–4 bytes per transaction, albeit with the same 95 bytes of TCP/IP and p2p overhead. Batching multiple INVs into a single IP packet can reduce the overhead, but bandwidth savings on INVs will be minimal unless typical INV batch sizes are at least 5 transactions per INV.
While Xthinner encoding and decoding on a single core is probably fast enough for the reasonable future, there might eventually come a time at which parallelization is desirable. Fortunately, it’s easy to do. You can split a block up into n contiguous chunks of size 1/n and encode each chunk independently and in parallel. Decoding can only happen in parallel if the encoding was split into chunks. Splitting a block into chunks adds a small amount of overhead, as you’re discarding the contents of the stack at the end of one chunk and re-specifying it at the beginning of the next chunk. Other than that, the overhead for each chunk should be minimal, and will mostly consist of 2 or 4 bytes for encoding the chunk size and about 13 bytes for encoding the checksum scheme parameters.
For an old and partially out-of-date primer on the Blocktorrent concept, please read this post. Warning: the rest of this section is rather technical, and won’t be of interest to most casual readers. The tl;dr is that Xthinner can make Blocktorrent both more efficient and easier to implement.
Instead of using Xthinner to encode a set of transactions which can be Merkle hashed down into the Merkle root hash for a block, we can instead use Xthinner to encode a set of transactions that Merkle hash down to one of the block Merkle tree’s internal hashes. The packet can also include the remaining Merkle hashes (in their full 256 bit glory) to allow each packet to be independently verified as being an error-free chunk of the true block. For example, splitting the block into 512-transaction chunks would give a chunk encoding size of roughly 1014 bytes. A single UDP/IP packet could contain 512 Xthinner-encoded TXIDs plus up to 12 Merkle path hashes while still fitting into a 1500 byte MTU, allowing each UDP/IP packet to be independently verified back to the Merkle root for blocks with up to (2¹¹ * 2⁹) = 1,048,576 transactions. For larger blocks, 256-transaction chunks can allow up to 27 Merkle path hashes, allowing for blocks with up to 2³⁴ (17 billion) transactions. Alternately, UDP supports datagrams larger than the MTU (up to 64 kiB in size). Oversize datagrams suffer a bit more from packet loss, since the whole datagram gets lost if a single packet is lost, but for a two-packet datagram, it shouldn’t be too bad.
If we can independently verify each 256- or 512-tx chunk as correctly corresponding to a Merkle root hash for which we have a block header with valid proof of work, then we can trustlessly forward each chunk to all of our peers before waiting for the rest of the block without any subsequent DoS risk. That is very cool.
My previous 2016 attempt at implementing Blocktorrent ran into issues from excessive code complexity. This complexity was the result of needing to be efficient in what data was included in each packet. If each transaction takes 8 bytes to encode, then a packet could only contain 128 transactions plus 11 Merkle path hashes, which would limit it to 131,072-transaction blocks. Dropping that down to 64-tx packets would allow 2³² transaction blocks, but would use up to 40% of the total bandwidth for repeated transmission of Merkle path information. To try to solve this problem, I spent a lot of time writing code to allow nodes to model their peer’s block state and efficiently transmit state information. This sorta worked, but it was a huge time sink. By using Xthinner, we can get more transactions into each packet, thereby enabling larger blocks with greater efficiency while still keeping each packet fully independent of all others, thereby eliminating the need for that state modeling and transmitting code. Xthinner makes Blocktorrent much simpler, more efficient, and more feasible.