# SSE: conversion integers to decimal representation

Author: Wojciech Muła 2011-10-21 2018-03-25 (updated test results, link to Milo's benchmarks)

# Introduction

With SSE2 instructions it's possible to convert up to four numbers in range 0..9999_9999 and get 32 decimal digits results. This texts describe code for two numbers (suitable for 64-bit conversions) and for one number (suitable for 32-bit conversions).

The outline of algorithm 1 has been posted by Piotr Wyderski on the usenet group pl.comp.lang.c, I merely implemented it. The main idea is to perform in parallel divisions & modulo by 108 (for 64-bit numbers), then 104, 102 and finally 101.

I've developed the algorithm 2, converting just a single 8-digit number. First division & modulo by 108 is performed, and an input vector is formed [abcd, abcd, abcd, abcd, efgh, efgh, efgh, efgh]. Then this vector is divided by vector [103, 102, 101, 100, 103, 102, 101, 100] yielding vector [a, ab, abc, abcd, e, ef, efg, efgh]. Obtaining an 8 digits result from this vector requires some shuffling, multiply by 10 and a subtract.

# SSE details

## Division

SSE doesn't provide an integer division, but division by a constant can be achieved using multiplication. This is well known and widespread technique, see for example Division by Invariant Integers using Multiplication by Torbjörn Granlund & Peter L. Montgomery, it's also described in Hackers Delight and Reciprocal Multiplication.

The division algorithm follows scheme:

```k   := 10^n
div := (x * MC) >> shift
mod := x - div * k
```

There are two multiplication, one subtract and (sometimes optional) one right shift. The constant MC is calculated with following formula:

```MC := (2^d - k + correction)/k
shift := d
```

Where d >= word bits and word bits = {8, 16, 32, 64, ...}. The value of correction depends on divisor and number of bits, please check one of the notes source for details.

## Variable shifts

Instruction pmulhuw does in parallel (word_x[i] * word_y[i]) >> 16.

If one component has form 1 << n[i] (n[i] = 0 .. 15), then the result of the instruction is word_x[i] >> (16 - n[i]), i.e. a word shifted right by specified amount from 1 to 16. This is used in algorithm 2.

(Likewise pmullw can be used to perform shifts left.)

Both algorithms require this. Removing leading zeros is similar procedure to strchr:

```pcmpeqb     packed_byte('0'), xmmreg
pmovmskb    xmmreg, reg
not         reg               -- invert bits
or          \$0x8000, reg      -- do not trim '0' if val was 0
bsf         reg, reg          -- reg contains position of first non-zero digit
```

On a CPU with SSE4.2 this could be done with single instruction.

# Algorithm 1

Division by 108 is not suitable for SSE code, because require a 64-bit multiplication. In a 64-bit code such multiplication is available with CPU instructions, but in a 32-bit code require 4 multiplications and a long add — this would kill performance.

Division by 104 require a 32-bit multiplication. SSE have pmuludq that does 2 multiplications and store full a 64-bit result.

Division by 102 require a 16-bit multiplication. SSE have pmulhuw that does 8 multiplications and store the high 16-bit of result; likewise pmullw stores the low 16-bit of result. This perfectly fits our needs.

Division by 101 require an 8-bit multiplication, but SSE does not provide any instruction that vertically multiply bytes. We have to use a 16-bit multiplication.

Code for conversion 64-bit numbers consist following steps:

```[ ____ ____ | ____ ____ | 9765 4321 | 1234 5678 ]       -- packed dword (_ = 0)
```
2. Then divmod 32-bit numbers by 104 (pmuludq), results are 32-bit numbers:

```[ ____ 9765 | ____ 4321 | ____ 1234 | ____ 5678 ]       -- packed dword
```
3. Then divmod 32-bit numbers by 102 (pmulhuw and pmullw), results are 16-bit numbers:

```[ __97 | __65 | __43 | __21 | __12 | __34 | __56 | __78 ]       -- packed word
```
4. Then divmod 32-bit numbers by 101 (again pmulhuw and pmullw), results are 16-bit numbers:

```[ ___7 | ___5 | ___3 | ___1 | ___2 | ___4 | ___6 | ___8 ]       -- packed word
[ ___9 | ___6 | ___4 | ___2 | ___1 | ___3 | ___5 | ___7 ]       -- packed word
```
5. Finally join decimal digits, convert to ASCII and remove leading zeros.

# Algorithm 2

This algorithm can convert only a single number from range 0..9999_9999. The main idea is to divide 4-digits numbers by 103, 102 and 101 at once.

1. Divide & modulo by 104 (pmuludq):

```v0 = [ 0000 | abcd | 0000 | efgh ] -- packed dword
```
2. Populate words:

```v1 = [ abcd | abcd | abcd | abcd | efgh | efgh | efgh | efgh ] -- packed word
```
3. Multiply v1 by 4 with psllw (safe, because 9999 ⋅ 4 < 216 − 1):

```v2 = [ 4*abcd | 4*abcd | 4*abcd | 4*abcd | 4*efgh | 4*efgh | 4*efgh | 4*efgh ]
```
4. Divide by 103, 102, 101, and 100 (single pmulhuw):

```v3 = [ a * 2^(7 + 2) | ab * 2^(3 + 2) | abc * 2^(1 + 2) | 2*abcd | <likwise digits efgh> ]
```
5. Shift right by 9, 5, 3 and 1 bits each 4-words group (single pmulhuw, see subsection below):

```v4 = [    a |   ab |  abc | abcd |    e |   ef |  efg | efgh ]
```
6. Calculate modulo:

```v5 = packed_qword(v4) >> 16

v6 = packed_word(v5) * 10 =
= [    0 |   a0 |  ab0 | abc0 |    0 |   e0 |  ef0 | efg0 ]

v7 = v4 - v6 =
= [    a |    b |    c |    d |    e |    f |    g |    h ]
```
7. Pack words to bytes (packuswb), convert to ASCII and remove leading zeros.

Ad 3. With pmuluhw we cannot divide by 1, the minimum is 2. Because pmulhuw is invoked two times (for divide and shift), we have to multiply input by 4, to finally get the abcd & efgh after step 5.

# Real-word requirements & limitations

1. Signed numbers conversion requires to put a sign char in the front of string. This have to be done with CPU (scalar) instructions.
2. Vectorized code generate 8 or 16 digits at single run. The longest 32-bit number has 10 decimal digits, and 64-bit — 20 digits, thus 2 or 4 digits are not converted in SSE code. Using SSE code to obtain them wouldn't be wise decision.

# Implementation

Sample program provides following functions:

• utoa64_sse — algortihm 1: described here;
• utoa32_sse — algorithm 1: implementation for 32-bit numbers;
• utoa32_sse_2 — algorithm 2
• utoa32 & uta64 — plain C implementations; faster than sprintf("%d", &x).

Please read the source header to find how to compile. Program can be used to measure speed of the selected procedure and to print results, thus allow to verify correctness.

# Tests results

The best times are considered. For larger number speedup is around 3 times, while for short numbers time is comparable to the plain C code, and for very short numbers is much worse.

## Core2 Duo E8200 (2Gz)

Tests were run on my Linux box with Core2 Duo E8200 (2GHz); gcc 4.6.1 (-O3 flag) was used to compile the program.

### Algorithm 1 — utoa32_sse

number range iterations C [s] SSE [s] speedup
0 .. 99 1_000_000 0.492 0.880 56%
0 .. 9999 10_000 1.076 0.884 122%
0 .. 9999_9999 1 2.480 0.872 284%
1000_0000 .. 9999_9999 1 2.276 0.784 290%

### Algorithm 2 — utoa32_sse_2

number range iterations C [s] SSE [s] speedup
0 .. 99 1_000_000 0.492 0.808 61%
0 .. 9999 10_000 1.076 0.812 132%
0 .. 9999_9999 1 2.480 0.728 340%
1000_0000 .. 9999_9999 1 2.276 0.812 280%

## Westmere i5 M540 @ 2.53GHz Compilation with GCC 7.3.0.

### Algorithm 1 — utoa32_sse

number range iterations C [ms] SSE [ms] speedup
0 .. 99 100_000 62 98 0.62
0 .. 9999 1_000 106 98 1.08
0 .. 9999_9999 1 2393 881 2.71
1000_0000 .. 9999_9999 1 2197 881 2.49

### Algorithm 2 — utoa32_sse_2

number range iterations C [ms] SSE [ms] speedup
0 .. 99 100_000 62 78 0.79
0 .. 9999 1_000 106 76 1.34
0 .. 9999_9999 1 2393 783 3.05
1000_0000 .. 9999_9999 1 2197 704 3.12

## Core i7-6700 CPU @ 3.40GHz Compilation with GCC 7.3.0.

### Algorithm 1 — utoa32_sse

number range iterations C [ms] SSE [ms] speedup
0 .. 99 100_000 31 154 0.20
0 .. 9999 1_000 51 155 0.33
0 .. 9999_9999 1 1077 1389 0.78
1000_0000 .. 9999_9999 1 986 1533 0.64

### Algorithm 2 — utoa32_sse_2

number range iterations C [ms] SSE [ms] speedup
0 .. 99 100_000 31 123 0.25
0 .. 9999 1_000 51 122 0.42
0 .. 9999_9999 1 1077 1221 0.88
1000_0000 .. 9999_9999 1 986 1099 0.90

# See also Milo Yip compared different itoa and dtoa implementations on Core i7, including the algorithm 2 which use SSE2 instructions.

Results for itoa are interesting: SSE2 version is not as good as it seemed to be. Tricky branchlut algorithm is only 10% slower, moreover is perfectly portable. One obvious drawback of this method is using lookup-table - in real environment where is a big pressure on cache, memory access could be a bottleneck.

• 2011-11-06