In this email, I would like to propose a new address system for Lisk. This proposal poses a solution for the roadmap objective “Replace address system”.
A full LIP draft is also in preparation and will be posted soon. Currently, some computations for some checksum alternatives are still ongoing. If the alternatives turn out to perform better than the checksum proposed here, this proposal may change slightly. Therefore, the full LIP draft will not be posted before the computation is finalized.
The proposed address format uses a 160-bit hash value of the public key using SHAKE256 and a checksum generated by a 32-bit cyclic redundancy check code. To increase the user-friendliness, the address is encoded in a customized base32 format.
With this proposal, the following gets achieved:
the “registration process” (performing at least one transaction per
account to associate a public key to an address) becomes obsolete
the collision resistance gets significantly increased
accidentally sending funds to a wrong address becomes almost impossible
# The Address Computation
The address for an account, given by its public key, is generated as follows:
Compute SHAKE256(pubkey, 160), where pubkey is the public key of the
Compute the checksum CRC-32C of the output of step 1.
Concatenate the output of step 1 and the output of step 2.
Encode the output of step 4 into a specified base32 format (see
below for details).
Add the prefix “lsk” to the output of step 4.
The output of step 5 yields the address for pubkey.
In the following, some details of the five steps above get clarified.
## Step 2: CRC-32C
The checksum used in step 2 is the 32-bit cyclic redundancy check code by Castagnoli. It is defined by the generator polynomial
The hexadecimal representation of this polynomial is 0x11edc6f41.
## Step 3: Concatenation
In step 3, the output of step 2 is appended to the right of the output of step 1. The output of step 3 has a length of 192 bits.
## Step 4: Base32
The base32 encoding in step 4 is a customized version of base32. It uses lower case letters and digits. The alphabet consists of the following characters:
The characters “i”, “l” (lowercase L), “0” (zero) and “1” are excluded.
First, the input bit string is filled with zeros on the right until the length is a multiple of five. If the length of the input bit string is a multiple of five, no zeros are appended. Then, the bit string is split into substrings of 5 bits and the substrings are processed from left to right. Each substring is mapped to a character from the alphabet set according to the following mapping f, where each 5-bit string is interpreted as an integer:
In the specific case of step 4 of the address computation, the input string consisting of 192 bits is filled with three zeros on the right. Then, it is split into 39 substrings consisting of 5 bits. Finally, each substring gets converted to a character by the mapping f.
#### Example of Base32 Encoding
Consider the following 192-bit input string in hexadecimal representation:
The bit representation (grouped in 5-bit substrings) is as follows:
00111 11110 10011 11000 11101 11111 10000 10111 10101 00110 11000 11110 11000 00001 01010 00101 00001 11000 01010 10110 11011 00010 00001 01100 10001 00010 10111 10110 11101 10101 10001 10111 01100 11011 01000 11001 11000 10010 10
The string gets appended by three zeros from the right to have a length that is a multiple of five:
00111 11110 10011 11000 11101 11111 10000 10111 10101 00110 11000 11110 11000 00001 01010 00101 00001 11000 01010 10110 11011 00010 00001 01100 10001 00010 10111 10110 11101 10101 10001 10111 01100 11011 01000 11001 11000 10010 10000
If each 5-bit substring is interpreted as an integer, the sequence looks a follows:
7 30 19 24 29 31 16 23 21 6 24 30 24 1 10 5 1 24 10 22 27 2 1 12 17 2 23 22 29 21 17 23 12 27 8 25 24 18 16
Mapping each element to an alphabet element by using the function f defined in the above results in the following base32 string:
For decoding, the decoder needs to know the expected bit length n of the output. Each character of the encoded string gets converted to a 5-bit string by the inverse of the mapping f above. The first n bits from the left of the resulting bit string represent the output.
In our specific case, the decoder has to convert 39 characters into a 195-bit string. The first 192 bits from the left represent the output. If the 3 most right bits are not zero, then the input was not valid.
## Example of an Address Computation
In the following, the address for the public key pubkey = 0x0eb0a6d7b862dc35c856c02c47fde3b4f60f2f3571a888b9a8ca7540c6793243
Let H := SHAKE256( pubkey, 160 ) =
Let CS := CRC-32C( H ) = 0x66d19c4a.
Let HCS be the concatenation of H and CS, i.e. HCS =
Let B be the base32 encoding of HCS, i.e. B =
Adding the prefix “lsk” to B yields the address
# Justification for this Proposal
## The Hashing Step
We choose a hash output length of 160 bits which provides 160-bit resistance against pre-image attacks, and, therefore, 160-bit resistance against brute-force attacks for finding keys for unregistered addresses.
Besides the strong pre-image resistance, this output length also provides sufficient collision resistance, namely 80-bit resistance. Note that collisions do not provide any value to an attacker. Therefore, the hash function does not need to be resistant against brute force attacks for collisions. However, collisions of addresses could result in “accidental” losses of money: If a user creates a new account by creating a key pair and provides the resulting address to another user in order to receive a payment, and the address is already associated to another public key, then the user will never be able to spend the funds send to this address.
Several checksum options were considered. A very common choice in blockchain projects is to use a truncated hash value, where a length of 32 bits is typically used. However, hash functions do not provide any error detection guarantees and are even slower to compute than a CRC checksum.
Some versions of shortened Reed-Solomon codes were investigated. Although some of them provide higher distances, their overall error detection capabilities were worse than the ones of CRC-32C (we conducted a sampling study introducing random errors and checking for detection success). Therefore, Reed-Solomon codes were disregarded.
BCH codes pose another alternative. They are, for example, used in the Bitcoin improvement proposal 173. BCH codes have, however, many free parameters which results in many possible choices. Finding a suitable choice of parameters requires an extensive computational search. Therefore, they were not considered any further except for the BCH code proposed in BIP 173. Some computations are currently ongoing to compare its error detection capabilities to the ones from CRC-32C (false positives rates for number of errors above the distance of the code).
Among the set of cyclic redundancy codes, the search was restricted to codes with a 32-bit checksum. One may argue that using a 32-bit checksum is a suboptimal choice since 3 bits of the base32-encoded address (using 195 bits) are not used for the checksum. In fact, one could use a 35-bit CRC code without increasing the overall length of the base32-encoded address. However, 35-bit CRC codes are not well studied in contrast to 32-bit CRC codes (see, for instance, the research of Philip Koopman). Hence, finding a 35-bit CRC code that outperforms CRC-32C for 160-bit inputs would require significant research. Among the set of 32-bit CRC codes, CRC-32C provided the best properties for our setting.
### Details of CRC-32C
For 160-bit inputs, CRC-32C has the distance d=8 which guarantees error detection for up to 7 bit inversions. Moreover, it provides error detection guarantee for any odd number of bit inversions and 32-bit burst error detection. The probability for not detecting an 8-bit error is ~5*10-10.
When using a base32 encoding, this checksum provides error detection guarantee for one mistyped character because one mistyped character can result in up to 5 bit inversions. Due to the burst error detection guarantee, swapping two adjacent characters of an address is guaranteed to be detected as well.
## Base32 Encoding
### Why Base32?
A base32 encoding is clearly preferable over the decimal or hexadecimal system as it is shorter (~20% shorter than hexadecimal). Base64 or any other encodings using a larger alphabet were excluded due to the usage of special characters that can have disadvantages when double-clicking on an address. Any alphabet size that is not a power of 2 was excluded as well, as it disallows to map consecutive strings of bits to characters of the alphabet. Hence, a single mistyped character in an address using such an alphabet could result in many inverted bits which makes it unsuitable for our checksum choice. Base58 is an example of such an alphabet. Hence, a base32 encoding is the preferred choice.
Although there exist already some versions of base32, for instance RFC 4648 and z-base-32, none of them are satisfactory. Therefore, a custom version is defined in this proposal.
### The Alphabet
The alphabet should not contain mixed case letters to ease the communication, for instance, via phone. Lowercase letters are preferred over uppercase letter as they are easier to read. Using an alphabet size of 32 allows to exclude 4 characters from the set of all lowercase letters and all digits. The characters “1”, “i”, “l” (lowercase L), “0” (zero) and “o” are for many fonts the most ambiguous ones. Restricting the alphabet to lowercase letters makes “o” and “i” less ambiguous than the other ones, which led to the decision in keeping one of the two (“o” is part of the alphabet).
### The Encoding Function
The encoding function which maps each 5-bit string to a character of the alphabet was chosen in the way to have few bit inversions for common errors. Consequently, the number of bit inversions introduced by a several common errors may be smaller than the distance of the checksum code in which case the error detection is guaranteed. The common errors considered are
mistyping a character by a left or right adjacent character on the
mistyping a character by a similar looking character.
Errors of the first kind were given a priority. The 32-bit Gray code was used to ensure that two elements of the alphabet that are in the same row and adjacent on a QWERTY keyboard differ in exactly one bit. In fact, all possible permutations of the contiguous rows of the QWERTY keyboard consisting of elements from the alphabet, i.e. the rows
23456789 qwertyu op asdfghjk zxcvbnm,
fulfil this condition when their concatenation is mapped to the Gray code. Moreover, using the reversed rows and rotations of the concatenations fulfils the same condition. Hence, all these different orderings were considered, and the one that is the best with respect to the second kind of error was chosen. More precisely, the bit representations of any of the pairs
(b, 6) (g, q) (g, 9) (q, 9) (s, 5) (v, u) (z, 2) (q, p) (r, v)
should differ in at most two bits, and the number of pairs differing in two bits should be as small as possible. The lowest number of pairs differing in two bits is 7. In fact, there are 40 different orderings that fulfil this number. The encoding function f defined above is just a random choice from these 40 possibilities.
With this encoding function, it is guaranteed, for example, that introducing seven errors of the first kind (mistyping by an adjacent character) into an address gets detected. Moreover, it is guaranteed to detect up to three errors of the second kind (mistyping by a similar looking character) incorporated in an address.