Conversion numbers to octal representation

Author:	Wojciech Muła
Added on:	2014-10-02
Updated:	2014-10-07 (core i7 results); 2014-10-05 (lookup table methods)

Contents

Introduction
Algorithms
Experiments
- Results from Pentium M
- Results from Core i7
Sample code

Introduction

Conversion to octal isn't very computer-friendly, each digit occupy 3 bits, it isn't power of two. The smallest number of bits to convert worth to consider is 12 (3 nibbles and 4 digits):

+--------------+
|dddc ccbb baaa|
+--------------+
8..11 4..7 0..3

Then output is a 32-bit word:

+--------+--------+--------+--------+
|00000ddd|00000ccc|00000bbb|00000aaa|
+--------+--------+--------+--------+

Conversion to ASCII require single add of 0x30303030 (0x30 = ord('0')).

Algorithms

Naive version

Naive conversion consists three basic steps:

isolate octal digits, i.e. 3-bit fields;
shift digits to the desired positions in the output word;
finally merge fields in a one word and convert to ASCII.

Sample implementation:

#define packed32(x) ((uint8_t)(x) * (uint32_t)(0x01010101))

uint32_t expand_12bits_naive(uint32_t x) {
    const uint32_t a = (x & 0x007) << 24;
    const uint32_t b = (x & 0x038) << (16 - 3);
    const uint32_t c = (x & 0x1c0) << (8 - 6);
    const uint32_t d = (x & 0xe00) >> 9;

    return a | b | c | d;
}

uint32_t to_oct_naive(uint32_t x) {
    return expand_12bits_naive(x) + packed32('0');
}

Lookup tables

I think using lookup tables aren't good solution. First: shifting on current CPUs is very fast, second: LUT requires memory and pollutes cache. The results from my old Pentium M shows that LUT are the fastest, but I bet that PDEP or SSE2 version win on Core i7 .

Two implementations of LUT were tested:

Single, big table indexed by 12 bits (4096 entries). Entry of a table contains of course 4 bytes converted to ASCII.
Two small tables indexed by lower 8 bits and higer 4 bits of 12-bit word (256 and 16 elements respectively). Tables contain partial results of expand_12bits operations.

Variant 1 sample code:

uint32_t lookup[1 << 12];

void prepare_single_table() {
    for (uint32_t x=0; x <= 0xfff; x++) {
        lookup[x] = to_oct_naive(x);
    }
}

uint32_t to_oct_single_table(uint32_t x) {
    return lookup[x]; // assert x <= 0xfff;
}

Variant 2 sample code:

uint32_t lookup_nibbles01[256];
uint32_t lookup_nibble2[16];

void prepare_single_table() {
    for (uint32_t i=0; i <= 256; i++) {
        lookup_nibbles01[i] = expand_12bits_naive(i);
    }
    for (uint32_t i=0; i <= 16; i++) {
        lookup_nibble2[i] = expand_12bits_naive(i << 8);
    }
}

uint32_t to_oct_two_tables(uint32_t x) {
    const uint32_t expanded = lookup_nibbles01[x & 0xff] | lookup_nibble2[(x >> 8) & 0xf];

    return expanded + packed32('0');
}

Variant 2 requires less memory than variant 1 (only 6%), but performs 4 operations to calculate result (2 bit-and, 1 shift and add) and issues 2 memory fetches.

Improved naive version

Using single multiplication we can perform two shifts and one bit-or. Input word have to be split to two subwords: containing digits a, c and b, d.

For example digits a, c are converted in following steps:

isolate digits (bit-and):
```
ac = x & 0x1c7
```

+--------+--------+--------+--------+
|00000000|00000000|0000000c|cc000aaa|
+--------+--------+--------+--------+

the digit a is on the correct position, but c have to be shifted 15 bits to the left. This could be saved as:
```
expanded_ac = (ac << 15) + (ac << 0)
```
Or using multiplication:
```
expanded_ac = ac * ((1 << 15) + 1)
```
Layout of expanded_ac:

+--------+--------+--------+--------+
|00000000|00000ccc|000aaa0c|cc000aaa|
+--------+--------+--------+--------+

Mask out garbage (bits marked with red).

Downside of this method is that digits are generated in reversed order, so the instruction bswap have to be used on the result.

Sample implementation:

uint32_t expand_12bits_mul(uint32_t x) {

    uint32_t ac = x & 0x1c7; // ac = 000ccc000aaa
    uint32_t bd = x & 0xe38; // bd = ddd000bbb000

    uint32_t expanded_ac = (ac * ((1 << 0) + (1 << 10))) & 0x00070007;
    uint32_t expanded_bd = (bd * ((1 << 5) + (1 << 15))) & 0x07000700;

    return expanded_ac | expanded_bd;
}


uint32_t to_oct_mul(uint32_t x) {
    return bswap(expand_12bits_mul(x) + packed32('0'));
}

PDEP version

PDEP is a new instruction introduced in BMI2 extension. Please take a look at my previous article where this instructions is discussed in details. PDEP makes conversion easy and very fast — code have just three instructions (pdep, add, bswap).

Sample implementation:

uint32_t to_oct_pdep(uint32_t x) {
    const uint32_t result = pdep(x, 0x07070707);

    return bswap(result + packed32('0'));
}

SSE2 version

SSE2 does in parallel the same bit-and & shifts operations as the naive version.

Variable shifts are possible with multiplication, instruction PMULHUW multiplies 16-bit words and store higher word of a result. When multiplier is in form (1 << n) then result of this instruction is x >> (16 - n).

Unfortunately a multiplier is always greater than 1, because of that input arguments have to be modified (in step 2) in order to get the correct results.

The SSE2 procedure contains following steps:

Clone two 12-bit words (we'll use dot instead of zero).

+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|....dddcccbbbaaa|....dddcccbbbaaa|....dddcccbbbaaa|....dddcccbbbaaa|....ggghhhfffeee|....ggghhhfffeee|....ggghhhfffeee|....ggghhhfffeee|
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

Shift left by 1 bit all words. This can be done using single PADDWD instruction.

+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|...dddcccbbbaaa.|...dddcccbbbaaa.|...dddcccbbbaaa.|...dddcccbbbaaa.|...ggghhhfffeee.|...ggghhhfffeee.|...ggghhhfffeee.|...ggghhhfffeee.|
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

Mask octal digits (PAND).

+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|...ddd..........|......ccc.......|.........bbb....|............aaa.|...ggg..........|......hhh.......|.........fff....|............eee.|
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

Parallel shift right (PMULHUW):
- d and g must be shifted 10 bits right,
- c and h must be shifted 7 bits right,
- b and f must be shifted 4 bits right,
- a and e must be shifted 1 bit right.
Multipliers are (respectively):
- 1 << (16 - 10)
- 1 << (16 - 7)
- 1 << (16 - 12)
- 1 << (16 - 15)

+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
|.............ddd|.............ccc|.............bbb|.............aaa|.............ggg|.............hhh|.............fff|.............eee|
+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

pack to bytes and convert to ASCII (PACKUSWB & PADDB).

Experiments

Converter naive implements naive algorithm,
converter mul implements improved naive,
converter SSE2 is a SSE implementation of naive,
converter pdep use PDEP instruction,
converter 1 LUT is variant 1 from lookup tables,
converter 2 LUTs is variant 2 from lookup tables.

Results from Pentium M

convert	time [s]	speedup
naive	0.2296	1.00	██████████████████
mul	0.3846	0.60	███████████
SSE2	0.2629	0.87	████████████████
1 LUT	0.1122	2.05	██████████████████████████████████████
2 LUTs	0.1286	1.79	█████████████████████████████████

Notes:

The SSE2 procedure is slower than scalar one.
Simple shifts and bit-or are faster than multiplication.
LUT versions are the fastest, even version with two lookup-tables is very fast.

Results from Core i7

Statistics from 20 samples.

convert	time [s]			speedup (avg)
convert	min	max	avg	speedup (avg)
naive	0.1630	0.2040	0.1841	1.00	████████████████████
mul	0.1630	0.2200	0.1777	1.04	████████████████████
SSE2	0.2170	0.2570	0.2318	0.79	███████████████
pdep	0.1360	0.1650	0.1491	1.23	████████████████████████
1 LUT	0.1350	0.1680	0.1452	1.27	█████████████████████████
2 LUTs	0.1350	0.1770	0.1468	1.25	█████████████████████████

Notes:

The SSE2 is the slowest. This is big surprise
PDEP and lookup methods has simiar speedup. PDEP is better as doesn't require additional memory. On CPUs without BMI2 extension method with two lookup-tables would be good choice.
Multiplication is not faster than 2 shifts and bit-or. Another surprise.

Sample code

Full implementation and program used in experiments are available on github.