LIP: <LIP number>
Title: Use base32 encoding of long hash of pubkey plus checksum for address
Author: Andrea Kendziorra <andreas.kendziorra@lightcurve.io>
Discussions-To: https://research.lisk.com/t/use-base32-encoding-of-long-hash-of-public-key-plus-checksum-for-address/
Type: Standards Track
Created: <YYYY-MM-DD>
Updated: <YYYY-MM-DD>
Abstract
This LIP proposes a new format for addresses (also referred to as Lisk ID) to replace the current format in the Lisk protocol. The proposed address format uses 160 bits of the SHA-256 hash of the public key and a 30-bit checksum generated by the BCH code proposed in BIP 173. To increase the user-friendliness, the address is encoded in a customized base32 format.
The new address format makes the “registration process” (performing at least one transaction per account to associate a public key to an address) obsolete. Moreover, the collision resistance gets significantly increased, and accidentally sending funds to a wrong address becomes almost impossible due to the checksum.
Copyright
This LIP is licensed under the Creative Commons Zero 1.0 Universal.
Motivation
The current address system was designed with the aim to be convenient for users when reading, typing, inserting or communicating an address. This was done by choosing a short length. In fact, the addresses are generated by taking the first 64 bits of the SHA-256 hash of the public key (plus a few small modifications). Despite the advantages, the address system has at least three significant shortcomings. First, an account holder is currently strongly advised to perform at least one transaction from an account to associate a public key with the address. Otherwise, an adversary could find another key pair that yields the same address and use this one to spend funds from this account.
Secondly, the addresses do not have a checksum or any other error detection mechanism. Hence, mistyping a single character of the address for a balance transaction results in sending funds to a different account (with very high probability, no user has a corresponding key pair for this account).
The third disadvantage is the low resistance against collisions of addresses. If the number of created Lisk accounts will rise significantly in the future, the probability for a collision could become non-negligible.
Therefore, a new address system is proposed that eliminates the mentioned issues. The first and the third issue get resolved by using a 160-bit hash of the public key. The second one by using a checksum. Since the overall bit length of the hash value plus checksum is rather large, a base32 encoding is used. This results in shorter addresses than in hexadecimal representation while still being human-friendly.
Specification
Address Computation
The address for an account, given by its public key, is generated as follows:
- Compute
SHA-256(pubkey),
where pubkey
is the public key of the account, and truncate the output to its first 160 bits.
- Compute the checksum of the output of step 1 using the BCH code described below.
- Concatenate the output of step 1 and the output of step 2.
- Encode the output of step 3 into a specified base32 format (see below for details).
- Add the prefix “lsk” to the output of step 4.
The output of step 5 is the address for the given account. An example of an address computation can be found in the appendix. In the following subsections, some details of the steps above get clarified.
Step 2: BCH checksum
The checksum is computed using the BCH code proposed in BIP 173. In this section, we only provide example Javascript code that computes and validates the checksum correctly (based on the code from bech32). For a more formal description, see the appendix.
Creating checksum
The checksum can be computed by the function createChecksum
below. For this, the output of step 1 has to be split into 5-bit strings, and each bit string has to be converted into an integer. Hence, the input for createChecksum
is a list of 32 integers between 0 and 31. The output of createChecksum
is a sequence of 6 integers between 0 and 31, where each integer represents a 5-bit string. Thus, the checksum consists of 30 bits.
function polymod (values) {
var GENERATOR = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3];
var chk = 1;
for (var p = 0; p < values.length; ++p) {
var top = chk >> 25;
chk = (chk & 0x1ffffff) << 5 ^ values[p];
for (var i = 0; i < 5; ++i) {
if ((top >> i) & 1) {
chk ^= GENERATOR[i];
}
}
}
return chk;
}
function createChecksum (data) {
var values = data.concat([0, 0, 0, 0, 0, 0]);
var mod = polymod(values) ^ 1;
var ret = [];
for (var p = 0; p < 6; ++p) {
ret.push((mod >> 5 * (5 - p)) & 31);
}
return ret;
}
Validating checksum
To verify if a code word has a correct checksum, the function verifyChecksum
below can be used.
function verifyChecksum (codeword) {
return polymod(codeword) === 1;
}
As for the checksum creation, the input for verifyChecksum
must be a sequence of integers between 0 and 31. The 6 most right integers of the codeword represent the checksum.
Step 3: Concatenation
The output of step 1 is a 160-bit string, and the output of step 2 a 30-bit string. In step 3, the output of step 2 is appended to the right of the output of step 1. The output of step 3 has a length of 190 bits.
Step 4: Base32
The base32 encoding in step 4 is a customized version of base32. It uses lower case letters and digits. The alphabet consists of the following characters:
abcdefghjkmnopqrstuvwxyz23456789
The characters “i”, “l” (lowercase L), “0” (zero) and “1” are excluded.
In the following subsections, the encoding and decoding for bit lengths that are multiples of 5 are described. Note that the encoding and decoding could easily be extended to arbitrary bit lengths.
Encoding
First, the bit string is split into substrings of 5 bits and the substrings are processed from left to right. Each substring is mapped to a character from the alphabet set according to the following mapping f
, where each 5-bit string is interpreted as an integer:
x |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
f(x) |
z |
x |
v |
c |
p |
m |
b |
n |
3 |
4 |
6 |
5 |
o |
9 |
7 |
8 |
x |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
31 |
f(x) |
u |
y |
r |
t |
k |
q |
e |
w |
2 |
a |
d |
s |
j |
h |
f |
g |
In the case of step 4 of the address computation, the input string consisting of 190 bits is split into 38 substrings consisting of 5 bits, and each substring gets converted to a character by the mapping f
.
Decoding
Each character of the encoded string gets converted to a 5-bit string by the inverse of the mapping f
above. The resulting bit string represents the output. In our specific case, the decoder has to convert 38 characters into a sequence of 38 integers between 0 and 31.
Step 5: Prefix
The output of step 4 is a string of length 38. It is appended by the prefix “lsk” from the left, yielding a string of length 41.
Usage of the Addresses in the Protocol
Addresses have to be used for the recipientID
property of balance transactions. Further usages are optional.
Address Format Validation
The following steps have to be done to validate if a given address addr
satisfies the proposed address format. Note that these steps validate only the format. It does not guarantee that any user or even a specific user is in possession of a key pair that corresponds to this address. The validation consists of:
- Verify that
addr
is a string of length 41.
- Verify that the substring of the first 3 characters of
addr
equals “lsk”.
- Let
substr
be the substring of the last 38 characters of addr
. Verify that every character in substr
is an element of the used base32 alphabet. Uppercase letters are not allowed.
- Decode
substr
into sequence of 38 integers between 0 and 31, denoted by intseq
, by using the inverse of the mapping f
above. Verify that verifyChecksum(intseq)
returns True
.
If any of these validation steps fails, then the address does not have a valid format.
Address Format Validation in Transaction Validation
When a transaction object contains an address, the address format has to be validated during the transaction validation. If the format validation fails, the transaction has to be rejected. The procedure described above can be used for the format validation. Of course, a node may also keep track of all validated addresses and may validate the format of an address by a lookup.
Address Format Validation in User Interfaces
This subsection does not contain rules for the protocol, but recommendations for the design of user interfaces to make use of all capabilities of the new address format. The recommendations are:
- When entering an address into an input field, letters should be automatically converted to lowercase letters.
- Before sending out a transaction that contains an address data field, the address format should be verified. If the validation fails, the transaction should not be sent out.
Serialization
Whenever an address value has to be serialized, the 20 bytes of the public key hash are taken, i.e. the bytes which are used as the input for the checksum computation. This applies for the serialization of the recipientID
property of balance transfer transaction and for any other property of a transaction (or block) that contains and address.
Converting Existing Accounts
We distinguish between accounts with and without associated public key. Note that an account has an associated key if and only if an outgoing transaction from this account is included in the blockchain.
Accounts with Public Key
For accounts with an associated public key, the new address has to be computed and stored such that the account is accessible via the new address.
Accounts without Public Key
Funds that were sent to an address of the current format and for which no public key was associated before the implementation of the new address system are not automatically accessible. To make them accessible, a special transaction, called reclaim transaction, has to be included in the blockchain. The set of addresses of the current format without associated public key and for which no reclaim transaction was included in the blockchain have to be persisted along with their balances. In the following, we denote this set of addresses by unregisteredAddresses
.
Reclaim Transaction
Type
The type of the reclaim transaction is the lowest positive integer that is not used for any other transaction type on the Lisk blockchain (8 at the time of writing).
Schema
The asset
property of a reclaim transaction needs to contain the amount
property which is of type string. The value of asset.amount
must represent a decimal number and this number must be equal to the balance associated to the address of the current format in Beddows.
A reclaim transaction is of the following form:
{
"type": <type>, // at the time of writing, the value must be 8
"senderPublicKey": <Public key of the sender>,
...
"asset" : {
"amount" : <amount in Beddows>
}
}
Validity
A reclaim transaction, tx
, must fulfil the following points to be valid:
- Let
old_addr
be the address according to the old format for tx.senderPublicKey
. Then, old_addr
must be contained in unregisteredAddresses
.
- Let
old_balance
be the sum of all funds send to old_addr
in Beddows. Then tx.asset
must contain the property amount
, and the number represented by tx.asset.amount
must be equal to old_balance
.
- Let
new_addr
be the address of the new format for tx.senderPublicKey
and let new_balance
be the balance of new_addr
. If there exists no account for new_addr
, then new_balance
is 0. Then tx.fee
must be smaller or equal than old_balance + new_balance
.
State Changes
We use the same notation as in the Validity section. If the transaction tx
is included in the blockchain, the following state changes apply:
-
old_addr
is removed from unregisteredAddresses
.
- If no account for
new_addr
exists, the account is created.
- If no public key is associated to
new_addr
, the public key tx.senderPublicKey
is associated to new_addr
.
- The balance for
new_addr
is set to old_balance + new_balance - tx.fee
.
Serialization
The asset property of a reclaim transaction tx
needs to be serialized to an array of 8 bytes containing the big endian encoding of tx.asset.amount
interpreted as a 64-bit signed integer.
Rationale
Hashing Step
To keep the length of the addresses at a reasonable level, a hash of the public key with a length smaller than 256 bits is used instead of the public key itself. When choosing the hash length, one crucial requirement has to be kept in mind: For a given address, it should be computationally infeasible to find a key pair (public key and private key) that yields this address. This requirement prevents that an attacker can find a key pair for an address for which no public key is yet associated to (a public key gets associated to an address the first time a transaction outgoing from this account is included in a block). If an attacker were able to find such a key pair, they could transfer the funds from this account. In the current Lisk protocol, it is computationally very expensive to find a key pair for a given address, but it is not infeasible. For an account with no associated public key, such an attack may be economically profitable if there are enough funds in the account. Therefore, it is currently recommended to do at least one transaction per account such that the address is associated to the account holder’s public key. This process is sometimes referred to as the “address registration” process. However, the address registration process is an additional step required from the user, and it is highly desired to make it obsolete.
The length of the hash output determines the resistance against the mentioned attack. Although a security level of 128 bits is considered to be sufficient for the next years (see the security recommendations from ECRYPT [1, Chapter 2] and NIST [2, Section 5.6.2]), we choose a hash output length of 160 bits which provides 160-bit resistance against such attacks. Besides the strong pre-image resistance, this output length also provides sufficient collision resistance, namely 80-bit resistance. Note that collisions do not provide any value to an attacker. Therefore, the hash function does not need to be resistant against brute force attacks for collisions. However, collisions of addresses could result in “accidental” losses of money: If a user creates a new account by creating a key pair and provides the resulting address to another user in order to receive a payment, and the address is already associated to another public key, then the user will never be able to spend the funds send to this address. Only the key pair associated with the address can be used to spend the funds. With the chosen hash length, the probability for an address collision becomes negligible, even for very large numbers of addresses. For example, the probability of having an address collision after 1015 addresses is smaller than 10-18.
Why SHA-256?
SHA-256 is the currently used hash function in the Lisk protocol. To keep the protocol simple, we propose to not introduce a new hash function and to use SHA-256 here as well.
Checksum
Several checksum options were considered and are discussed in the following.
Hash Functions
A very common choice in blockchain projects is to use a truncated hash value, where a length of 32 bits is typically used. However, hash functions do not provide any error detection guarantees and were discarded for this reason.
Reed-Solomon Codes
Some versions of shortened Reed-Solomon codes were investigated.
The first one uses the alphabet size 32, which is very suitable for using a base32 encoding. That means, each character of an address (represented by 5 bits) corresponds to one symbol of a codeword. Since a Reed-Solomon code with alphabet size q allows only a maximum block length of q-1, the maximum block length for q=32 is 31. Hence, a codeword would consist of maximum 31*5 = 165 bits. Since this is too short for a 160-bit input plus an adequate checksum, the input has to be split into parts which results in two checksums. More precisely, the input (the 160-bit hash value) is split into two inputs of 80 bits, and both inputs are encoded with a [31, 28, 4]32 Reed-Solomon code, where the output is shortened. Hence, there is a 3 symbol (15 bits) checksum for each input. The distance of the code, d=4, guarantees error detection for up to 3 errors in the address, which is less than for the chosen BCH code. Moreover, the error detection capabilities are very poor for more than 3 errors. For example, the false positive rate (false positive means to not detect errors) for 4 errors in an address is ~3*10-6.
The second considered Reed-Solomon code is the one with the parameters [1023, 1020, 3]1024. The checksum consists of 3 symbols with 10 bits each. One symbol of a codeword always represents 2 characters in the base32-encoded address. The distance of the code, d=4, guarantees detection of up to 3 errors in the base32-encoded address. This is again worse than for the chose BCH code. Moreover, the error detection capabilities for more errors are worse than the ones for the BCH code.
Other alphabet size were disregarded due to the worse error detection guarantees in the base32 encoding (unless one uses more than 32 bits for the checksum).
Cyclic Redundancy Check Codes
Several cyclic redundancy check (CRC) codes were considered, mainly 32-bit CRCs. The best performing CRC code is the one by Castagnoli (also referred to as CRC-32/4 in [3]). It’s error detection capabilities for more than 4 errors in the base32-encoded addresses are better than for the chosen BCH code. However, the error detection guarantee is worse. The code guarantees detection of up to 7 bit inversions which results in an error detection guarantee of up to one character in base32 encoding. Moreover, the overall length of the address increases by one with a 32-bit checksum. Therefore, this CRC code was not selected.
BCH Codes
BCH codes have many free parameters which results in many possible choices. Finding a suitable choice of parameters requires an extensive computational search. We simply use the result presented in BIP 173 without performing the search ourselves. This code is well suited for a base32 encoding. It provides detection guarantee of up to 4 characters in base32 encoding for the input length of 38 characters, has good error detection capabilities for more than 4 errors and is relatively efficient. Moreover, the authors of BIP 173 claim that it was optimized for detecting low numbers of bit inversions which fits well together with the base32 encoding proposed here (see
below for the reasoning of the specific encoding function).
Base32 Encoding
Why Base32?
A base32 encoding is clearly preferable over the decimal or hexadecimal system as it is shorter (~20% shorter than hexadecimal). Base64 or any other encodings using a larger alphabet were excluded due to the usage of special characters that can have disadvantages when double-clicking on an address. Any alphabet size that is not a power of 2 was excluded as well, as it disallows to map consecutive strings of bits to characters of the alphabet. Hence, a single mistyped character in an address using such an alphabet could result in many inverted bits which makes it unsuitable for our checksum choice. Base58 is an example of such an alphabet. Hence, a base32 encoding is the preferred choice.
Although there exist already some versions of base32, for instance RFC 4648 and z-base-32, none of them are satisfactory. This is in most cases due to the selection of the alphabet. The base32 encoding that matches our requirements the most (but not sufficiently) is the one proposed in BIP 173. Due to the unsatisfactory design of the existing base32 versions, a custom version is defined in this protocol.
The Alphabet
The alphabet should not contain mixed case letters to ease the communication, for instance, via phone. Lowercase letters are preferred over uppercase letter as they are easier to read. Using an alphabet size of 32 allows to exclude 4 characters from the set of all lowercase letters and all digits. The characters “1”, “i”, “l” (lowercase L), “0” (zero) and “o” are for many fonts the most ambiguous ones. Restricting the alphabet to lowercase letters makes “o” and “i” less ambiguous than the other ones, which led to the decision in keeping one of the two (“o” is part of the alphabet).
The Encoding Function
The encoding function which maps each 5-bit string to a character of the alphabet was chosen in the way to have few bit inversions for some error types that are expected to be common. As the checksum is optimized for detecting low numbers of bit inversions, the probability of detecting more than 4 errors of these types increases. The types of error considered are
- mistyping a character by a left or right adjacent character on the keyboard, and
- mistyping a character by a similar looking character.
Errors of the first kind were given a priority. The 32-bit Gray code was used to ensure that two elements of the alphabet that are in the same row and adjacent on a QWERTY keyboard differ in exactly one bit. In fact, all possible permutations of the contiguous rows of the QWERTY keyboard consisting of elements from the alphabet, i.e. the rows
23456789, qwertyu, op, asdfghjk, zxcvbnm
fulfil this condition when their concatenation is mapped to the Gray code. Moreover, using the reversed rows and rotations of the concatenations fulfils the same condition. Hence, all these different orderings were considered, and the one that is the best with respect to the second kind of error was chosen. More precisely, the bit representations of any of the pairs†
(b, 6), (g, q), (g, 9), (q, 9), (s, 5), (v, u), (z, 2), (q, p), (r, v)
should differ in at most two bits, and the number of pairs differing in two bits should be as small as possible. The lowest number of pairs differing in two bits is 7. In fact, there are 40 different orderings that fulfil this number. The encoding function f
defined in the
specification is just a random choice from these 40 possibilities.
Prefix
A prefix is added to the address to minimize the risk that a Lisk address gets confused with an address of another blockchain project. This can be helpful, for instance, when users use applications/clients that handle several cryptocurrencies including Lisk.
Converting Existing Accounts
The usage of a reclaim transaction for accessing funds send to “unregistered” addresses was preferred over solutions that automatically merge account information of old and new addresses (e.g., as in this post) for the following architectural reason: It allows that the SDK does eventually not need to implement the current address format. In contrast, an automatic merge solution that is part of the transaction verification is hard to extract from general transaction verification logic, and therefore hard to remove from the SDK. Moreover, an automatic merge solution requires additional computational steps whenever a transaction contains a public key that was not contained in the database before.
Amount Property in Reclaim Transactions
One may argue that the amount property of a reclaim transaction is not required to process the transaction. This is true. However, the undo step in case of reverting a block would be very expensive because one could not easily distinguish between the following two cases:
- The account received balance transfers only to the old address.
- The account received balance transfers to the old and the new address.
Backwards Compatibility
This change introduces a hard fork because
- balance transactions using the current address format get rejected by nodes following the proposed rules, and vice versa, and
- reclaim transactions get rejected by nodes following the current protocol.
References
[1] M. Abdalla et. al, “Algorithms, Key Size and Protocols Report (2018)”, ECRYPT-CSA, Feb. 2018, http://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf
[2] E. Barker, “Recommendation for key management”, NIST Special Publication 800-57 Part 1 Rev. 4, Jan. 2016, https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-57pt1r4.pdf
[3] G. Castagnoli, S. Brauer, M. Herrmann, “Optimization of cyclic redundancy-check codes with 24 and 32 parity bits”, IEEE Transactions on Communications, Vol. 41 (6), pp. 883 - 892, Jun. 1993
Appendix
Examples
Address Computation
In the following, the address for the public key
pubkey = 0x0eb0a6d7b862dc35c856c02c47fde3b4f60f2f3571a888b9a8ca7540c6793243
gets computed.
- Let
H' = SHA-256(pubkey) = 0xc247a42e09e6aafd818821f75b2f5b0de47c8235b580881bd7750c9365993d25
and let H
be the first 160 bits of H'
, i.e.,
H = 0xc247a42e09e6aafd818821f75b2f5b0de47c8235.
The representation of H
as a sequence of 5-bit integers is
H_5bit = [24 9 3 26 8 11 16 9 28 26 21 15 27 0 12 8 4 7 27 21 22 11 26 27 1 23 18 7 25 0 17 21].
- Let
CS
be the checksum of H
, i.e.,
CS = createChecksum(H_5bit) = [21 21 31 11 22 16].
- Let
HCS
be the concatenation of H
and CS
in 5-bit representation, i.e.
HCS = [24 9 3 26 8 11 16 9 28 26 21 15 27 0 12 8 4 7 27 21 22 11 26 27 1 23 18 7 25 0 17 21 21 21 31 11 22 16]
- Let
B
be the base32 encoding of HCS
, i.e.
B = 24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu.
- Adding the prefix “lsk” to
B
yields the address
lsk24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu.
Valid Addresses
We list some example addresses for a given public key. Note that the provided public keys are not necessarily valid public keys (curve points for which there exists a private key).
pubkey = 0x0000000000000000000000000000000000000000000000000000000000000000
address = lskoaknq582o6fw7sp82bm2hnj7pzp47mpmbmux2g
pubkey = 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
address = lskqf5xbhu874yqg89k449zk2fctj46fona9bafgr
pubkey = 0x00ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
address = lskamc9kfzenupkgexyxsf4qz9fv8mo9432of9p5j
pubkey = 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff00
address = lsk6xevdsz3dpqfsx2u6mg3jx9zk8xqdozvn7x5ur
pubkey = 0x749b88baa787e83b5a06bbfee95002ac5a9925dcaeea28262e683498147b8fce
address = lskxwnb4ubt93gz49w3of855yy9uzntddyndahm6s
pubkey = 0xe877c250d725ac7ca6c5d9d83b39645de7eebe6c4ae6e076da7163434d297c7d
address = lskzkfw7ofgp3uusknbetemrey4aeatgf2ntbhcds
Invalid Addresses
In the following, we list some invalid addresses and for each address at least one reason for the invalidity.
24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu # missing prefix
lsk24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5e # incorrect length (length 40 instead of 41)
lsk24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5euu # incorrect length (length 42 instead of 41)
LSK24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu # incorrect prefix
tsk24cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu # incorrect prefix
lsk24cd35u4jdq8sz03pnsqe5dsxwrnazyqqqg5eu # invalid character (contains a zero)
lsk24Cd35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu # invalid character (contains an upper case 'C' instead of lower case 'c')
LSK24CD35U4JDQ8SZO3PNSQE5DSXWRNAZYQQQG5EU # invalid characters (all letters in upper case)
lsk24dc35u4jdq8szo3pnsqe5dsxwrnazyqqqg5eu # invalid checksum due to character swap
Formal Description of the BCH Code
The specification section provides example code for creating and validating the checksum, but not a formal description. This shall be done here (see also the documentation of the Bitcoin implementation of bech32 for some details; for example, the representation of the generator polynomial as an array of 32 bit numbers is explained there) .
Let F32 be the finite field with 32 elements. We use the representation F2[x]/(x5+x3+1) for F32, i.e., every element in F32 is interpreted as a polynomial with coefficients in F2 modulo x5+x3+1. Each polynomial is identified by the decimal number whose bit representation matches with the polynomial. For example, the decimal number 23 represents the polynomial x4+x2+x+1 because the bit representation of 23 equals 10111. Let g(x) be the polynomial with coefficients in F32 of the following form:
g(x) = x6 + 29x5 + 22x4 + 20x3 + 21x2 + 29x + 18.
g(x) is the generator polynomial of the BCH code. Every input for the checksum is interpreted as a polynomial with coefficients in F32 too. For example, the input
d = 0xc247a42e09e6aafd818821f75b2f5b0de47c8235
is interpreted as the polynomial
d(x) = 24x31 + 9x30 + 3x29 + … + 25x3 + 0x2 + 17x + 21
where each coefficient represents 5 bits of d (see the
example above for all coefficients). To achieve that a codeword can be represented as the concatenation of the input and the checksum, systematic encoding is used. In this specific case, the checksum polynomial, chk(x), can be computed by
chk(x) = - (d(x) * x6 mod g(x))
for an input polynomial d(x). Then, the whole codeword, represented as a polynomial c(x), has the form
c(x) = d(x) * x6 + chk(x).
A polynomial p(x) with coefficients in F32 is then a valid codeword if and only if
0 = p(x) mod g(x)
holds. However, two more tweaks were used for the design of the bech32 checksum. Both are discussed in the following subsections.
Adding a Leading Term to the Input Polynomial
A leading term is added to the input polynomial in bech32. More precisely, if k is the length of the input sequence of 5-bit integers, then let
d’(x) = xk + d(x).
The polynomial d’(x) is then used for the checksum computation instead of d(x). This ensures that the checksum differs if some zeros are added to the left of the input sequence of 5-bit integers. E.g., the inputs
[5, 21, 31]
and [0, 0, 5, 21, 31]
yield different checksums in bech32. Here, this property is not needed due the fixed length input, but was kept to be compatible with bech32. Since the input sequence has always length 32, we always have
d’(x) = x32 + d(x)
in our case.
Codewords Are not Multiples of Generator Polynomial
The second tweak is to add 1 to d’(x) * x6 when computing the checksum, i.e., the checksum is computed as
chk’(x) = - (d’(x) * x6 + 1 mod g(x)).
Then, a polynomial p(x) with coefficients in F32 is a valid codeword if and only if
-1 = p(x) mod g(x)
holds (note that 1=-1 holds in F32). This avoids that appending some zeros to the right of a codeword yields again a valid codeword. If the checksum were computed by
chk(x) = - (d(x) * x6 mod g(x)),
then appending n zeros to the right of a valid codeword would yield again a valid codeword. This is because
0 = c(x) mod g(x) implies 0 = c(x) * xn mod g(x).
This property is not strictly required for this proposal due to the fixed length of codewords, but was kept to be compatible with bech32.
Notes
[†]:
Note that this list of pairs of similar looking characters is based on subjective perception since no scientific result is known to the author.