<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
 <title>Wojciech Muła --- website</title>
 <description>SIMD, AVX, AVX2, AVX512, optimization, algorithms, data structures</description>
 <link>http://0x80.pl</link>
 <atom:link href="http://0x80.pl/feed.xml" rel="self" type="application/rss+xml" />
 <lastBuildDate>Thu, 08 Jan 2026 20:26:48 +0100</lastBuildDate>
 <item>
  <title>Rust is perfectly imperfect</title>
  <link>http://0x80.pl/notesen/2026-01-08-imperfect-rust.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2026-01-08-imperfect-rust.html</guid>
  <pubDate>Thu, 08 Jan 2026 12:00:00 +0100</pubDate>
  <description>
&lt;img alt="2026-01-08-imperfect-rust/2Q6A1876_small.jpg" class="align-center" src="2026-01-08-imperfect-rust/2Q6A1876_small.jpg" style="width: 50%;" /&gt;
&lt;p&gt;My sampling of the real world is performed via X, formerly Twitter. And as of
the end of 2025, it appears that Rust is the most hated language ever. I
noticed that the C and C++ programmers constitute the majority of Rust critics.&lt;/p&gt;
&lt;p&gt;Despite the criticism, Rust has momentum: more and more companies are
adopting Rust, either creating new software with it or rewriting their
existing software. There are also sporadic moves in the opposite direction,
that is, going back to C or C++.&lt;/p&gt;
&lt;p&gt;I was thinking about these strong, anti-Rust sentiments and — more importantly
— why people want to use Rust. The anti-Rust sentiments seem to me to be more
of a social problem caused by individuals, and since I am rarely involved in
&amp;quot;community&amp;quot; life and seldom read any comments, I don’t feel like I have
anything interesting to add. Let me share my thoughts about the latter issue.&lt;/p&gt;
&lt;p&gt;From the perspective of programming language theory, Rust added &lt;strong&gt;nothing&lt;/strong&gt;.
Literally, there’s nothing new in it. Of course, someone might shout “THE
BORROW CHECKER”, but the borrow checker is just a more advanced liveness
analysis.&lt;/p&gt;
&lt;p&gt;Rust did something that many other programming languages tried in the past but
failed to deliver. &lt;strong&gt;It is easy to use&lt;/strong&gt; thanks to &lt;strong&gt;its consistency and
predictability&lt;/strong&gt;. And this was achieved by Rust as a whole, not only by the
language itself.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;First of all, installing the toolchain (the compiler, linker, standard
library, documentation and package manager) is seamless with
&lt;tt class="docutils literal"&gt;rustup&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;The package manager, cargo, makes it easy to tame dependencies. And it works
with third-party packages as well as local ones. From the programmer's point of
view there’s no difference whether they use “local” or “remote” packages. It’s
also easy to point to the path where a given package is located, so once the
dependencies are in place, an internet connection is not needed.&lt;/li&gt;
&lt;li&gt;The standard library provides sufficiently powerful entities to produce a
decent program, although it is not as rich as Python's. But all basic
numeric data types come with a handful of useful methods and constants. Strings
are UTF-8 encoded by default, also with a rich set of methods. There are
also popular containers: resizable vectors, maps and sets (both key-ordered and
hash-based). There are interfaces to the underlying OS: we can work with the file
system, network, threading and timers.&lt;/li&gt;
&lt;li&gt;The language comes with a small, strong type system. There are arrays,
tuples, structures and, the most important element of the type system,
algebraic data types, called for an unknown reason “enums”. Algebraic data
types remove the need to introduce hierarchies of types in 99% of cases:
complex structures are naturally expressed with “enums”. Even using interfaces
(called “traits” in Rust) is not needed. Apart from libraries, when we work on
regular software, we usually know all types in advance, or the rate of
introducing new types is relatively slow.&lt;/li&gt;
&lt;li&gt;The language supports two special kinds of enums: &lt;tt class="docutils literal"&gt;Option&lt;/tt&gt; (value or
null/nil/none) and &lt;tt class="docutils literal"&gt;Result&lt;/tt&gt; (value or error). The types come with a couple of
methods that ease their usage, for instance returning a default value when
an option is none. The standard library uses these enums everywhere, and their
usage is ubiquitous and feels natural.&lt;/li&gt;
&lt;li&gt;Rust allows us to generate code based on annotations; this is a kind of
compile-time reflection. These code generators are written in Rust. This
feature allows programmers to extend their types with extra methods or
implement some boring parts of code. Think about serialization, for instance.&lt;/li&gt;
&lt;li&gt;Generics, called “templates” in C++, are built into the language and
using them is mostly no different from using regular types and functions.&lt;/li&gt;
&lt;li&gt;The language introduces traits, called “interfaces” in Java, that
enable two important programming constructs: abstract data types and
restrictions on generic arguments, similar to C++’s concepts. When we need to
interface with code defined by other programmers, we use traits. Traits are also
practical: they not only define an abstract API, but can also provide
default implementations. Oh, and traits can extend other traits.&lt;/li&gt;
&lt;li&gt;Feature flags are built into the whole ecosystem.&lt;/li&gt;
&lt;li&gt;The tools for documenting code are built in. There’s no difference between
the documentation generated for the standard library and for any other package.&lt;/li&gt;
&lt;li&gt;Rust is also aware of situations where we don’t work under any OS
(so-called “bare metal”) or &lt;strong&gt;we are&lt;/strong&gt; an OS. Then a single declaration, no_std,
disables linking with the standard library, yet still provides some
OS-independent types and functions via the “core” crate.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These are the major factors that make the Rust ecosystem composable. I’m not really
convinced that Rust’s safety is the primary criterion for choosing the language.
For sure, Rust makes it hard to introduce memory-related bugs, but there is
still a huge area where bugs are possible. Well, if people really needed safety
in their systems, they would use functional languages, theorem provers and other
sophisticated tools or methods to assure software is safe against all possible
attacks and malfunctions. The C and C++ world already has quite a nice
set of tools that help catch different kinds of bugs, for instance fuzzers and
runtime checkers like address sanitizers or valgrind.&lt;/p&gt;
&lt;p&gt;Rust, similarly to functional languages, allows a programmer to focus on their
problem: if your code compiles, it will likely work.&lt;/p&gt;
&lt;p&gt;But in my opinion Rust brings back the joy of programming in an imperative paradigm.&lt;/p&gt;
&lt;p&gt;Thank you for reading.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Acknowledgements&lt;/strong&gt;: I’d like to thank &lt;em&gt;Daniel Lemire&lt;/em&gt;, &lt;em&gt;Dan Shechter&lt;/em&gt; and
&lt;em&gt;Romek Kurc&lt;/em&gt; for reading the draft of this text and sharing their corrections
and thoughts.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Change case of UTF-32-encoded strings</title>
  <link>http://0x80.pl/notesen/2025-02-02-utf32-change-case.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-02-02-utf32-change-case.html</guid>
  <pubDate>Sun, 02 Feb 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Changing letter case comes in handy in many situations, for
instance input normalization, parsing, and other text-related tasks.&lt;/p&gt;
&lt;p&gt;When we are dealing with ASCII, such conversion is straightforward. We check
if a character code lies in the range 'a' .. 'z' (or 'A' .. 'Z') and when it
is true, we toggle bit 5 (mask 0x20).&lt;/p&gt;
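&lt;p&gt;As a minimal scalar sketch of the ASCII rule (the function names
&lt;tt class="docutils literal"&gt;ascii_to_upper&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;ascii_string_to_upper&lt;/tt&gt; are made up for this
illustration and are not part of any library), the conversion could look like this:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cstddef&gt;

// Hypothetical helper: converts a single ASCII character to uppercase.
// Lowercase and uppercase Latin letters differ only in bit 5 (mask 0x20).
inline char ascii_to_upper(char c) {
    if (c &gt;= 'a' &amp;&amp; c &lt;= 'z') {
        return static_cast&lt;char&gt;(c ^ 0x20);
    }
    return c; // not a lowercase letter: leave unchanged
}

// Hypothetical helper: converts n ASCII characters in place.
inline void ascii_string_to_upper(char* s, std::size_t n) {
    for (std::size_t i = 0; i &lt; n; i++) {
        s[i] = ascii_to_upper(s[i]);
    }
}
&lt;/pre&gt;
&lt;p&gt;The range check matters: toggling the bit unconditionally would also alter
digits, punctuation and letters that are already uppercase.&lt;/p&gt;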
&lt;p&gt;But in the case of &lt;a class="reference external" href="https://home.unicode.org/"&gt;Unicode-encoded&lt;/a&gt; strings it is not that simple. There are
many code points having upper- or lowercase counterparts, and they are not
placed in any regular way in the Unicode code space. Although we may identify
shorter or longer ranges of such codes, similarly to Latin letters in ASCII,
this does not help much. See &lt;a class="reference internal" href="#appendix-a"&gt;appendixes A&lt;/a&gt;, &lt;a class="reference internal" href="#appendix-b"&gt;B&lt;/a&gt; and &lt;a class="reference internal" href="#appendix-c"&gt;C&lt;/a&gt;, where we visualize
how these characters are located.&lt;/p&gt;
&lt;p&gt;Additionally, there are cases where the uppercase or lowercase version of a given
character is not a single character, but a string. These strings are short: they have
two or three Unicode code points. For example, the uppercase of &amp;quot;Latin small ligature ffi&amp;quot;
(ﬃ) is the three-character string &amp;quot;FFI&amp;quot;.&lt;/p&gt;
&lt;table border="1" class="col2right col3right col4right col5right docutils"&gt;
&lt;caption&gt;Number of characters subject to case change (Unicode 15.0.0)&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class="head"&gt;1 code point&lt;/th&gt;
&lt;th class="head"&gt;2 code points&lt;/th&gt;
&lt;th class="head"&gt;3 code points&lt;/th&gt;
&lt;th class="head"&gt;total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;uppercase&lt;/td&gt;
&lt;td&gt;1423&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;1525&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;lowercase&lt;/td&gt;
&lt;td&gt;1432&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1433&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table border="1" class="col2right col3right col4right col5right docutils"&gt;
&lt;caption&gt;Number of characters subject to case change (Unicode 13.0.0)&lt;/caption&gt;
&lt;colgroup&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;col width="20%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class="head"&gt;1 code point&lt;/th&gt;
&lt;th class="head"&gt;2 code points&lt;/th&gt;
&lt;th class="head"&gt;3 code points&lt;/th&gt;
&lt;th class="head"&gt;total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;uppercase&lt;/td&gt;
&lt;td&gt;1383&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;1485&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;lowercase&lt;/td&gt;
&lt;td&gt;1392&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1393&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is one thing that makes our task easier: while the Unicode codes span the range
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;0..0x10ffff&lt;/span&gt;&lt;/tt&gt;, we can check that only characters up to &lt;tt class="docutils literal"&gt;0x1ffff&lt;/tt&gt; &lt;strong&gt;may&lt;/strong&gt; have
different codes due to case change.&lt;/p&gt;
&lt;p&gt;An outline of the case-change algorithm for UTF-32-encoded strings can be written as
follows.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;utf32_uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="c1"&gt;// fetch the character code
&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_uppercase_variant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uppercase_letters_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="c1"&gt;// the majority: single character
&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

                &lt;/span&gt;&lt;span class="c1"&gt;// rare cases: strings
&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

                &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="c1"&gt;// no upper
&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;src_code&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>LoongArch64 subjective highlights</title>
  <link>http://0x80.pl/notesen/2025-01-21-loongarch64-highlights.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-21-loongarch64-highlights.html</guid>
  <pubDate>Sun, 19 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;I recently got back to work on &lt;a class="reference external" href="https://github.com/simdutf/simdutf"&gt;simdutf&lt;/a&gt;, and noticed that the library gained
support for &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Loongson"&gt;LoongArch64&lt;/a&gt;. This is a custom design and
&lt;strong&gt;custom ISA&lt;/strong&gt; by Loongson from China. They provide documentation for the scalar
ISA, but not for the vector extensions. Despite that, GCC, binutils, QEMU and
other tools already support the ISA. Luckily for us, &lt;strong&gt;Jiajie Chen&lt;/strong&gt; did an
impressive job of reverse engineering the vector instructions and published the results
online as &lt;a class="reference external" href="https://jia.je/unofficial-loongarch-intrinsics-guide/"&gt;The Unofficial LoongArch Intrinsics Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LoongArch comes with two vector extensions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;LSX, having 128-bit vector registers,&lt;/li&gt;
&lt;li&gt;LASX, having 256-bit vector registers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These extensions are similar; in particular, most instructions present in LSX
also exist in LASX. According to the Wikipedia entry, the ISA is a mixture of
RISC-V and MIPS.&lt;/p&gt;
&lt;p&gt;The ISA supports both integer and floating point instructions. There's support
for 8-bit, 16-bit, 32-bit, 64-bit and also 128-bit integers. Floating point
instructions cover single precision, double precision and half precision
numbers.&lt;/p&gt;
&lt;p&gt;Comparisons yield byte-masks, similarly to SSE.&lt;/p&gt;
&lt;p&gt;Integer instructions are defined for most integer types; this makes
the ISA regular.&lt;/p&gt;
&lt;p&gt;My impression is that the ISA is well designed, but I have not vectorized
any code for that architecture. Below is the list of features I found
interesting while browsing the intrinsics guide.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD binary heap operations</title>
  <link>http://0x80.pl/notesen/2025-01-18-simd-heap.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-18-simd-heap.html</guid>
  <pubDate>Sat, 18 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;A &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Binary_heap"&gt;binary heap&lt;/a&gt; is a binary tree data structure having some interesting
properties. One of them is an array-friendly memory layout, achieved by building
an (almost) complete binary tree.&lt;/p&gt;
&lt;p&gt;A binary heap keeps at index 0 the maximum value, or the minimum one depending
on convention; let's stick to &lt;em&gt;maximum heaps&lt;/em&gt;. There is exactly one invariant:
a child node, if it exists, keeps a value not greater than its parent node's value. For comparison,
in the case of binary search trees, we have stricter rules: the left child keeps
a smaller value than the parent's value, and the right child keeps a bigger value.&lt;/p&gt;
&lt;p&gt;A non-root node stored at index &lt;em&gt;i&lt;/em&gt; has its parent node at index
&lt;span class="math"&gt;floor[(&lt;i&gt;i&lt;/i&gt; &amp;minus; 1)/2]&lt;/span&gt;, and its child nodes at indices &lt;span class="math"&gt;2 &amp;sdot; &lt;i&gt;i&lt;/i&gt; + 1&lt;/span&gt;
and &lt;span class="math"&gt;2 &amp;sdot; &lt;i&gt;i&lt;/i&gt; + 2&lt;/span&gt;.&lt;/p&gt;
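&lt;p&gt;The index arithmetic can be written as three tiny helpers. This is only a
sketch: the function names are hypothetical and a zero-based array is assumed.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cstddef&gt;

// Hypothetical helpers for a heap stored in a zero-based array.

// Index of the parent node; valid only for a non-root node (i &gt; 0).
// Unsigned integer division rounds down, which gives the floor.
constexpr std::size_t heap_parent(std::size_t i) {
    return (i - 1) / 2;
}

// Index of the left child of the node at index i.
constexpr std::size_t heap_left_child(std::size_t i) {
    return 2 * i + 1;
}

// Index of the right child of the node at index i.
constexpr std::size_t heap_right_child(std::size_t i) {
    return 2 * i + 2;
}
&lt;/pre&gt;
&lt;p&gt;Note that a child index may point past the end of the array; the caller has
to compare it with the heap size before use.&lt;/p&gt;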
&lt;p&gt;In this text we cover two procedures related to heaps:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;is_heap&lt;/tt&gt;: checks if an array is a proper heap,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;push_heap&lt;/tt&gt;: adds a new element to the heap.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The procedure &lt;tt class="docutils literal"&gt;is_heap&lt;/tt&gt; is vectorizable and using SIMD instructions pays
off. We also show that it is possible to define this function using
&lt;em&gt;forward iterators&lt;/em&gt; rather than random access iterators, which the C++ standard requires.&lt;/p&gt;
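&lt;p&gt;For reference, a scalar check that only ever moves forward through the array
can be sketched as below. This is an illustration under assumptions, not the
vectorized implementation: the name &lt;tt class="docutils literal"&gt;is_max_heap&lt;/tt&gt; and the 32-bit integer
element type are made up here.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Scalar reference sketch: checks the max-heap invariant.
// Both positions only move forward; the parent position advances
// once for every two children visited.
inline bool is_max_heap(const std::int32_t* data, std::size_t n) {
    std::size_t parent = 0;
    for (std::size_t child = 1; child &lt; n; child++) {
        if (data[parent] &lt; data[child]) {
            return false; // the child is greater than its parent
        }
        if (child % 2 == 0) {
            parent++; // the right child was just checked
        }
    }
    return true;
}
&lt;/pre&gt;
&lt;p&gt;Because neither position ever steps back, the same loop can be expressed
with a pair of forward iterators instead of indices.&lt;/p&gt;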
&lt;p&gt;The procedure &lt;tt class="docutils literal"&gt;push_heap&lt;/tt&gt; can be expressed with &lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/VPGATHERDD_VPGATHERQD.html"&gt;gather&lt;/a&gt; and
&lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/VPSCATTERDD_VPSCATTERDQ_VPSCATTERQD_VPSCATTERQQ.html"&gt;scatter&lt;/a&gt; instructions, but the performance is terrible. For the sake
of completeness, we show the AVX-512 implementation.&lt;/p&gt;
&lt;p&gt;There is also one more crucial method for heaps: removing the maximum value,
&lt;tt class="docutils literal"&gt;pop_heap&lt;/tt&gt;. However, it is difficult to vectorize, and the results of
vectorization would likely be even worse than in the case of &lt;tt class="docutils literal"&gt;push_heap&lt;/tt&gt;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: printing u64 as binary</title>
  <link>http://0x80.pl/notesen/2025-01-18-avx512-print-bin.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-18-avx512-print-bin.html</guid>
  <pubDate>Sat, 18 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;Printing 64-bit numbers in binary format can be done nicely with AVX-512 instructions.
First, we replicate each byte of the number across a separate 64-bit word of an
AVX-512 register:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
    ┌───┬───┬───┬───┬───┬───┬───┬───┐
x = │ &lt;span style="color: blue; font-weight: bold"&gt;h&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;g&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;f&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;e&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;d&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;c&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;b&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;a&lt;/span&gt; │
    └───┴───┴───┴───┴───┴───┴───┴───┘
      &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;│&lt;/span&gt;                   &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;│&lt;/span&gt;
      &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;│&lt;/span&gt;                   &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;└──────────────────┐&lt;/span&gt;
      &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;│&lt;/span&gt;                   &lt;span style="color: gray"&gt;└──────────┐&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;
      &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;└──────────┐&lt;/span&gt;                   &lt;span style="color: gray"&gt;│&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;
      &lt;span style="color: gray"&gt;└──┐&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;                   &lt;span style="color: gray"&gt;│&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;
         &lt;span style="color: gray"&gt;│&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;                   &lt;span style="color: gray"&gt;│&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;
         &lt;span style="color: gray"&gt;├─╴┈┈┈╶─┐&lt;/span&gt;   &lt;span style="color: gray"&gt;├─╴┈┈┈╶─┐&lt;/span&gt;           &lt;span style="color: gray"&gt;├─╴┈┈┈╶─┐&lt;/span&gt;   &lt;span style="color: gray"&gt;├─╴┈┈┈╶─┐&lt;/span&gt;
         &lt;span style="color: gray"&gt;│&lt;/span&gt;       &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;│&lt;/span&gt;       &lt;span style="color: gray"&gt;│&lt;/span&gt;           &lt;span style="color: gray"&gt;│&lt;/span&gt;       &lt;span style="color: gray"&gt;│&lt;/span&gt;   &lt;span style="color: gray"&gt;│&lt;/span&gt;       &lt;span style="color: gray"&gt;│&lt;/span&gt;
         &lt;span style="color: gray"&gt;▼&lt;/span&gt;       &lt;span style="color: gray"&gt;▼&lt;/span&gt;   &lt;span style="color: gray"&gt;▼&lt;/span&gt;       &lt;span style="color: gray"&gt;▼&lt;/span&gt;           &lt;span style="color: gray"&gt;▼&lt;/span&gt;       &lt;span style="color: gray"&gt;▼&lt;/span&gt;   &lt;span style="color: gray"&gt;▼&lt;/span&gt;       &lt;span style="color: gray"&gt;▼&lt;/span&gt;
       ┌───┬┈┈┈┬───┬───┬┈┈┈┬───┬┈┈┈┈┈┈┈┬───┬┈┈┈┬───┬───┬┈┈┈┬───┐
zmm0 = │ &lt;span style="color: blue; font-weight: bold"&gt;h&lt;/span&gt; │   │ &lt;span style="color: blue; font-weight: bold"&gt;h&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;g&lt;/span&gt; │   │ &lt;span style="color: blue; font-weight: bold"&gt;g&lt;/span&gt; │       │ &lt;span style="color: blue; font-weight: bold"&gt;b&lt;/span&gt; │   │ &lt;span style="color: blue; font-weight: bold"&gt;b&lt;/span&gt; │ &lt;span style="color: blue; font-weight: bold"&gt;a&lt;/span&gt; │   │ &lt;span style="color: blue; font-weight: bold"&gt;a&lt;/span&gt; │
       └───┴┈┈┈┴───┴───┴┈┈┈┴───┴┈┈┈┈┈┈┈┴───┴┈┈┈┴───┴───┴┈┈┈┴───┘

       │           │                               │           │
       ╰─ word 7 ╶─╯                               ╰─ word 0 ╶─╯&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, in each byte of every 64-bit word we isolate the i-th bit, where &lt;strong&gt;i&lt;/strong&gt; is the byte's position within that 64-bit word.&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
     ┈┈┈┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬┈┈┈
zmm0    │ &lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;1010100 │ 0&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;010100 │ 01&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;10100 │ 010&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0100 │ 0101&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;100 │ 01010&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;00 │ 010101&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;0 │ 0101010&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt; │
     ┈┈┈┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴┈┈┈
     ┈┈┈┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬┈┈┈
zmm1    │ &lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0000000 │ 0&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;000000 │ 00&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;00000 │ 000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0000 │ 0000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;000 │ 00000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;00 │ 000000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0 │ 0000000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt; │
     ┈┈┈┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴┈┈┈

zmm0 &amp;amp; zmm1 =

     ┈┈┈┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬┈┈┈
        │ &lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;0000000 │ 0&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;000000 │ 00&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;00000 │ 000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0000 │ 0000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;000 │ 00000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;00 │ 000000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;0 │ 0000000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt; │
     ┈┈┈┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴┈┈┈&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, we convert non-zero bytes into ASCII '1' (0x31) and zero bytes into ASCII '0' (0x30). This particular
operation can be done in two different ways:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;converting non-zero bytes to the value 1 using the &lt;tt class="docutils literal"&gt;min&lt;/tt&gt; operation and performing a non-masked addition:
&lt;tt class="docutils literal"&gt;byte[i] = min(byte[i], 1) + 0x30&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;building a bitmask in a k-register and using standard masked instructions, like:
&lt;tt class="docutils literal"&gt;byte[i] = mask[i] ? 0x31 : 0x30&lt;/tt&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two methods do not differ in performance; the first one simply does not use mask registers.&lt;/p&gt;
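&lt;p&gt;The last two steps can be modelled on plain bytes. The Python sketch below is only a scalar illustration, not the AVX-512 code: the helper names &lt;tt class="docutils literal"&gt;isolate_bits&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;to_ascii_min&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;to_ascii_select&lt;/tt&gt; are made up for this example, and the byte order simply follows the diagram from left to right.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
# Scalar model of the last two steps; all names are hypothetical.
# One single-bit mask per byte position, as in zmm1 on the diagram.
WEIGHTS = (128, 64, 32, 16, 8, 4, 2, 1)


def isolate_bits(byte):
    """Keep exactly one bit of `byte` in each of the eight byte positions."""
    return [(byte // weight) % 2 * weight for weight in WEIGHTS]


def to_ascii_min(masked):
    """Variant 1: clamp non-zero bytes to 1, then add ASCII '0' (0x30)."""
    return bytes(min(b, 1) + 0x30 for b in masked)


def to_ascii_select(masked):
    """Variant 2: pick 0x31 or 0x30 per byte, like a masked operation."""
    return bytes(0x31 if b != 0 else 0x30 for b in masked)


masked = isolate_bits(0b01010100)
print(to_ascii_min(masked), to_ascii_select(masked))
```
&lt;/pre&gt;
&lt;p&gt;Both variants yield the same bytes for every input, which matches the remark that the choice only affects whether mask registers are used.&lt;/p&gt;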
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Drawing trees</title>
  <link>http://0x80.pl/notesen/2025-01-12-drawing-trees.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-12-drawing-trees.html</guid>
  <pubDate>Sun, 12 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;We have a tree of any degree and depth. Each node is assigned the bounding box
of its graphical representation.&lt;/p&gt;
&lt;p&gt;We want to draw such a data structure, taking into account the geometry of the nodes.&lt;/p&gt;
&lt;img alt="2025-01-12-drawing-trees/screen.png" src="2025-01-12-drawing-trees/screen.png" /&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Building full-text search in Javascript</title>
  <link>http://0x80.pl/notesen/2025-01-07-js-search.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-07-js-search.html</guid>
  <pubDate>Tue, 07 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;I dedicated the last few days of 2024 to refreshing my website. The project
started around 2002, when the Internet was not widespread and there was no GitHub,
Wikipedia or anything we know right now. Thus the website also served as
a hosting platform for my open-source software.&lt;/p&gt;
&lt;p&gt;I created custom Python software to maintain both the articles and software.
In the meantime things evolved. I started to write my texts in English and
publish them more in a blog style (although I'm not a fan of the term &amp;quot;blog&amp;quot;),
and GitHub made it easy to distribute software. At some point my
fancy system for a static website became more cumbersome than helpful.&lt;/p&gt;
&lt;p&gt;The decision was simple: drop the old build system, create a new one,
uncomplicated and tailored to my current needs &amp;mdash; focus only on publishing
articles. A wise decision I made 20 years ago was picking &lt;a class="reference external" href="https://docutils.sourceforge.io/rst.html"&gt;reStructuredText&lt;/a&gt;
to write texts. I prefer it over Markdown. Not to mention that reST is
easy to extend, which I found extremely &lt;a class="reference external" href="/roles.html"&gt;handy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Long story short, the new system allowed me to introduce tags, to maintain
texts in draft mode and, last but not least, to let &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Make_(software)"&gt;make&lt;/a&gt;
perform all boring tasks.&lt;/p&gt;
&lt;p&gt;But with the new build system, a new idea appeared: &amp;quot;how about searching?&amp;quot;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD parallel bits deposit/extract</title>
  <link>http://0x80.pl/notesen/2025-01-05-simd-pdep-pext.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-05-simd-pdep-pext.html</guid>
  <pubDate>Sun, 05 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set"&gt;BMI2 extension&lt;/a&gt; introduced two
complementary instructions: parallel bits deposit (&lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/PDEP.html"&gt;PDEP&lt;/a&gt;) and parallel
bits extract (&lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/PEXT.html"&gt;PEXT&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;PDEP&lt;/tt&gt; scatters a contiguous set of bits to the positions denoted by the
mask. &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; does the opposite: it gathers/compresses the selected bits into
a contiguous word.&lt;/p&gt;
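&lt;p&gt;To make the deposit direction concrete, here is a minimal scalar model in Python. It is a sketch and not one of the SIMD approaches discussed in the article; the function name &lt;tt class="docutils literal"&gt;pdep&lt;/tt&gt; and the sample values are chosen for illustration only. Plain integer arithmetic is used, so the model is not tied to any word width.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def pdep(value, mask):
    """Scalar model of PDEP: spread the low bits of `value` over the set bits of `mask`."""
    result = 0
    weight = 1                    # weight of the mask position being examined
    while mask > 0:
        if mask % 2 == 1:         # this position receives the next bit of `value`
            result += (value % 2) * weight
            value //= 2           # consume one source bit
        mask //= 2
        weight *= 2
    return result


# Three source bits (1, 0, 1) land at mask positions 4, 3 and 1.
print(format(pdep(0b101, 0b11010), "05b"))
```
&lt;/pre&gt;
&lt;p&gt;Positions where the mask is zero always stay zero in the result, and source bits beyond the number of set mask bits are ignored.&lt;/p&gt;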
&lt;p&gt;SIMD instruction sets do not directly support this kind of operation. There is
&lt;a class="reference external" href="https://www.felixcloutier.com//x86/gf2p8affineqb"&gt;GF2P8AFFINEQB&lt;/a&gt; in AVX-512, which allows arbitrary bit shuffling at
the &lt;strong&gt;byte level&lt;/strong&gt; (see &lt;a class="reference external" href="/notesen/2020-01-19-avx512-galois-field-for-bit-shuffling.html"&gt;Use AVX512 Galois field affine transformation for bit shuffling&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In this text we show approaches suitable for implementing &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;PDEP&lt;/tt&gt;
for wider element widths on any SIMD ISA.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Dividing unsigned 16-bit numbers</title>
  <link>http://0x80.pl/notesen/2025-01-03-uint16-division.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2025-01-03-uint16-division.html</guid>
  <pubDate>Fri, 03 Jan 2025 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is a follow-up text for &lt;a class="reference external" href="/notesen/2024-12-21-uint8-division.html"&gt;Dividing unsigned 8-bit numbers&lt;/a&gt;. We checked
whether dividing 16-bit unsigned numbers is also feasible with SIMD instructions.&lt;/p&gt;
&lt;p&gt;Apart from the obvious path, where we use floating-point division (&lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/DIVPS.html"&gt;DIVPS&lt;/a&gt;),
8-bit numbers could also utilize the approximate reciprocal instruction &lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/RCPPS.html"&gt;RCPPS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Unfortunately, for 16-bit numbers the latter instruction cannot be used directly.
To properly divide 16-bit integers we need to perform a single step of the
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Newton-Raphson_algorithm"&gt;Newton-Raphson algorithm&lt;/a&gt;, which kills performance.&lt;/p&gt;
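&lt;p&gt;The effect of a single Newton-Raphson step can be seen in a few lines of Python. This is a numeric sketch only, not the SIMD implementation: it assumes a starting estimate of the reciprocal that is off by a relative error of about 2&lt;sup&gt;-12&lt;/sup&gt; (roughly the precision documented for &lt;tt class="docutils literal"&gt;RCPPS&lt;/tt&gt;), and the function names are hypothetical.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def newton_raphson_step(d, x):
    """One Newton-Raphson refinement of x, an estimate of 1/d."""
    return x * (2.0 - d * x)


def relative_error(d, x):
    """Relative error of x as an estimate of 1/d."""
    return abs(d * x - 1.0)


d = 40000.0                            # a divisor from the 16-bit range
x0 = (1.0 / d) * (1.0 + 2.0 ** -12)    # assumed coarse estimate, off by about 2**-12
x1 = newton_raphson_step(d, x0)

print(relative_error(d, x0), relative_error(d, x1))
```
&lt;/pre&gt;
&lt;p&gt;One step roughly squares the relative error, so an estimate good to about 12 bits becomes good to about 24 bits, at the price of two extra multiplications and a subtraction per element.&lt;/p&gt;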
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Dividing unsigned 8-bit numbers</title>
  <link>http://0x80.pl/notesen/2024-12-21-uint8-division.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2024-12-21-uint8-division.html</guid>
  <pubDate>Sat, 21 Dec 2024 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Division is quite an expensive operation. For instance, latency of the 32-bit
division varies between 10 and 15 cycles on the Cannon Lake CPU, and for Zen4
this range is from 9 to 14 cycles. The latency of 32-bit multiplication is
3 or 4 cycles on both CPU models.&lt;/p&gt;
&lt;p&gt;None of the commonly used SIMD ISAs (SSE, AVX, AVX-512, ARM Neon, ARM SVE) provides
integer division; only the &lt;a class="reference external" href="2024-11-09-riscv-vector-extension.html"&gt;RISC-V Vector Extension&lt;/a&gt; does. However, all these
ISAs have floating-point division.&lt;/p&gt;
&lt;p&gt;In this text we present two approaches to achieve a SIMD-ized division of 8-bit
unsigned numbers:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;using floating point division,&lt;/li&gt;
&lt;li&gt;using the long division algorithm.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We try to vectorize the following C++ procedure. The procedure cannot assume
anything about the divisors; in particular, it cannot assume that they are all equal. Thus, it is not
possible to employ &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant"&gt;division by a constant&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scalar_div_u8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Compilers cannot vectorize it. For example, GCC 14.1.0 produces the following assembly
(with code-alignment padding stripped):&lt;/p&gt;
&lt;pre class="code literal-block"&gt;
2b40:       48 85 c9                test   %rcx,%rcx
2b43:       74 30                   je     2b75 &amp;lt;_Z13scalar_div_u8PKhS0_Phm+0x35&amp;gt;
2b45:       45 31 c0                xor    %r8d,%r8d
2b60:       42 0f b6 04 07          movzbl (%rdi,%r8,1),%eax
2b65:       42 f6 34 06             divb   (%rsi,%r8,1)
2b69:       42 88 04 02             mov    %al,(%rdx,%r8,1)
2b6d:       49 ff c0                inc    %r8
2b70:       4c 39 c1                cmp    %r8,%rcx
2b73:       75 eb                   jne    2b60 &amp;lt;_Z13scalar_div_u8PKhS0_Phm+0x20&amp;gt;
2b75:       c3                      ret
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Myriad sequences of RISC-V code</title>
  <link>http://0x80.pl/notesen/2024-11-11-myriad-riscv-sequence.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2024-11-11-myriad-riscv-sequence.html</guid>
  <pubDate>Mon, 11 Nov 2024 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="myriad-sequences"&gt;
&lt;h1&gt;Myriad sequences&lt;/h1&gt;
&lt;p&gt;The RISC-V assembler defines the pseudo-instruction &lt;tt class="docutils literal"&gt;li&lt;/tt&gt; that loads an immediate
into a register. Unlike other pseudo-instructions, which have one or a few expansions,
&lt;tt class="docutils literal"&gt;li&lt;/tt&gt; explodes into &amp;mdash; as the spec says &amp;mdash; &lt;em&gt;myriad sequences&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;RISC-V instructions are 32 bits wide, so it's impossible to encode 64-bit immediates.
It's impossible to encode 32-bit immediates too, as we need to have some
spare bits for the opcode itself (instruction + destination).&lt;/p&gt;
&lt;p&gt;Assemblers have quite a complex job to do, as they can only use a single
register &amp;mdash; the &lt;tt class="docutils literal"&gt;li&lt;/tt&gt; argument; compilers have more freedom.&lt;/p&gt;
&lt;p&gt;RISC-V comes with three instructions that are used to fill registers with
the given value:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;ADDI rd, rs1, imm12&lt;/tt&gt; &amp;mdash; that adds a &lt;strong&gt;sign-extended&lt;/strong&gt; 12-bit immediate
to register &lt;tt class="docutils literal"&gt;rs1&lt;/tt&gt; and stores result in &lt;tt class="docutils literal"&gt;rd&lt;/tt&gt;;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;ADDIW rd, rs1, imm12&lt;/tt&gt; &amp;mdash; likewise, but defined for RV64, i.e., the CPUs
with 64-bit registers;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;LUI rd, imm20&lt;/tt&gt; &amp;mdash; that stores a &lt;strong&gt;sign-extended&lt;/strong&gt; 20-bit immediate shifted
left by 12 positions in &lt;tt class="docutils literal"&gt;rd&lt;/tt&gt;; alternatively: a 32-bit immediate with the lowest
12 bits cleared.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For easy cases we can pick a single instruction from the above list.
But when a constant falls outside these ranges, more instructions have to be used.&lt;/p&gt;
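&lt;p&gt;As an illustration of the simplest multi-instruction case, the Python sketch below splits a signed 32-bit constant into the two immediates of a &lt;tt class="docutils literal"&gt;LUI&lt;/tt&gt; + &lt;tt class="docutils literal"&gt;ADDI&lt;/tt&gt; pair. It models RV32 only (registers wrap modulo 2&lt;sup&gt;32&lt;/sup&gt;); the function names are hypothetical and this is not the assembler's actual algorithm.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def split_li32(value):
    """Split a signed 32-bit constant into (hi20, lo12) for LUI + ADDI."""
    if not (value >= -2**31 and 2**31 > value):
        raise ValueError("expected a signed 32-bit constant")
    # ADDI sign-extends its 12-bit immediate, so the upper part is rounded
    # to the nearest multiple of 4096 rather than truncated.
    upper = (value + 0x800) >> 12     # arithmetic shift: floor division by 4096
    lo12 = value - upper * 4096       # always in the range -2048..2047
    hi20 = upper % 2**20              # the 20-bit field encoded in LUI
    return hi20, lo12


def emulate_lui_addi_rv32(hi20, lo12):
    """Value left in a 32-bit register after LUI rd, hi20; ADDI rd, rd, lo12."""
    result = (hi20 * 4096 + lo12) % 2**32
    return result - 2**32 if result >= 2**31 else result


hi20, lo12 = split_li32(0x12345FFF)
print(hex(hi20), lo12, hex(emulate_lui_addi_rv32(hi20, lo12)))
```
&lt;/pre&gt;
&lt;p&gt;The rounding is the subtle part: for &lt;tt class="docutils literal"&gt;0x12345FFF&lt;/tt&gt; the upper field becomes &lt;tt class="docutils literal"&gt;0x12346&lt;/tt&gt; and the low immediate is -1, because the low 12 bits would otherwise be interpreted as a negative number.&lt;/p&gt;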
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>RISC-V Vector Extension overview</title>
  <link>http://0x80.pl/notesen/2024-11-09-riscv-vector-extension.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2024-11-09-riscv-vector-extension.html</guid>
  <pubDate>Sat, 09 Nov 2024 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The goal of this text is to provide an overview of the &lt;a class="reference external" href="https://github.com/riscv/riscv-v-spec/tree/master"&gt;RISC-V Vector extension&lt;/a&gt;
(&lt;strong&gt;RVV&lt;/strong&gt;), and compare it &amp;mdash; when applicable &amp;mdash; with widespread SIMD vector
instruction sets: &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions"&gt;SSE&lt;/a&gt;, &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Advanced_Vector_Extensions"&gt;AVX&lt;/a&gt;,
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/AVX-512"&gt;AVX-512&lt;/a&gt;, &lt;a class="reference external" href="http://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(NEON)"&gt;ARM Neon&lt;/a&gt;
and &lt;a class="reference external" href="http://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE)"&gt;SVE&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The RISC-V architecture defines four basic modes (32-bit, 32-bit for embedded
systems, 64-bit, 128-bit) and &lt;a class="reference external" href="https://en.wikichip.org/wiki/risc-v/standard_extensions"&gt;several extensions&lt;/a&gt;. For instance, the support for
single precision floating-point numbers is added by the F extension.&lt;/p&gt;
&lt;p&gt;The vector extension is quite a huge addition. It adds 302 instructions plus
four highly configurable load &amp;amp; store operations.  The RVV instructions can be
split into three groups:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;related to masks,&lt;/li&gt;
&lt;li&gt;integer operations,&lt;/li&gt;
&lt;li&gt;and floating-point operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a CPU does not support floating-point instructions, it still may provide
the integer subset.&lt;/p&gt;
&lt;p&gt;RVV introduces 32 vector registers &lt;tt class="docutils literal"&gt;v0&lt;/tt&gt;, ..., &lt;tt class="docutils literal"&gt;v31&lt;/tt&gt;, a &lt;strong&gt;concept&lt;/strong&gt; of mask
(similar to AVX-512), and nine control registers.&lt;/p&gt;
&lt;p&gt;Unlike other SIMD ISAs, RVV does not explicitly define the size of a vector register.
It is &lt;strong&gt;an implementation parameter&lt;/strong&gt; (called &lt;tt class="docutils literal"&gt;VLEN&lt;/tt&gt;): the size has to be
a power of two, but not greater than &lt;span class="math"&gt;2&lt;sup&gt;16&lt;/sup&gt;&lt;/span&gt; bits. Likewise, the maximum vector
element size is &lt;strong&gt;an implementation parameter&lt;/strong&gt; (called &lt;tt class="docutils literal"&gt;ELEN&lt;/tt&gt;, also a power
of two and not less than 8 bits). For example, a 32-bit CPU might not support
vectors of 64-bit values.&lt;/p&gt;
&lt;p&gt;But generally, we may expect that a decent 64-bit CPU would support 8-, 16-, 32-
and 64-bit elements, interpreted as integers or floats.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Simple suggestions using popcount</title>
  <link>http://0x80.pl/notesen/2023-11-20-popcount-suggestions.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-11-20-popcount-suggestions.html</guid>
  <pubDate>Mon, 20 Nov 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;One great feature that some CLI applications and compilers recently gained is
providing suggestions in the case of misspelled arguments or options. For
instance, Python suggests possible method/field names, like:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
&amp;gt;&amp;gt;&amp;gt; 'lower'.is_upper()
Traceback (most recent call last):
  File &amp;quot;&amp;lt;stdin&amp;gt;&amp;quot;, line 1, in &amp;lt;module&amp;gt;
AttributeError: 'str' object has no attribute 'is_upper'. Did you mean: 'isupper'?
&lt;/pre&gt;
&lt;p&gt;If we want to include a similar feature in our program, then a quite obvious
solution is to use &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;Levenshtein distance&lt;/a&gt;. Its basic implementation
is simple and short. If we want suggestions for a larger corpus, we may use
tries to speed up matching &amp;mdash; see: &lt;a class="reference external" href="http://stevehanov.ca/blog/?id=114"&gt;Fast and Easy
Levenshtein distance using a Trie&lt;/a&gt; by &lt;strong&gt;Steve Hanov&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Despite that, I was curious whether a simpler algorithm would do the job. We may assume
that our users know what they are doing.  Their inputs will be sane, with the
exception of minor mistakes, like omitting a single letter, swapping two
adjacent letters or maybe hitting a wrong key from time to time.&lt;/p&gt;
&lt;p&gt;I made two additional assumptions: we search for suggestions in small
sets, and we use only ASCII letters (128 possible bytes).&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX-512 conflict detection without resolving conflicts</title>
  <link>http://0x80.pl/notesen/2023-05-06-avx512-conflict-detection.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-05-06-avx512-conflict-detection.html</guid>
  <pubDate>Sat, 06 May 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;One of the hardest problems in &lt;a class="reference external" href="http://en.wikipedia.org/wiki/SIMD"&gt;SIMD&lt;/a&gt; is dealing with non-contiguous
data accesses, which are pretty common. Data structures based on
indices, like graphs or trees, are a good example. CPU vendors
introduced instructions &lt;strong&gt;GATHER&lt;/strong&gt; and &lt;strong&gt;SCATTER&lt;/strong&gt; to address these needs.
A gather instruction builds a SIMD vector from N values loaded
from N addresses. A scatter instruction stores N values from a SIMD
vector at N addresses. Both instructions allow &lt;strong&gt;repeated&lt;/strong&gt; indices.&lt;/p&gt;
&lt;p&gt;Repeated indices are the real issue if an algorithm uses
the &lt;strong&gt;SCATTER&lt;/strong&gt; &amp;mdash; that is, it either sets or updates values. Then
we need to define how to handle repeated stores.
To solve that particular problem &lt;a class="reference external" href="http://en.wikipedia.org/wiki/AVX-512"&gt;AVX-512&lt;/a&gt; introduced
a complex instruction called &lt;strong&gt;conflict detection&lt;/strong&gt;. The instruction
builds a vector containing masks that mark repeated values in the
input vector.&lt;/p&gt;
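&lt;p&gt;What the conflict detection instruction computes can be modelled in scalar code. The Python sketch below reflects our reading of the instruction for 32-bit elements (&lt;tt class="docutils literal"&gt;VPCONFLICTD&lt;/tt&gt;): element &lt;em&gt;i&lt;/em&gt; of the result has bit &lt;em&gt;j&lt;/em&gt; set when an earlier element &lt;em&gt;j&lt;/em&gt; holds the same value. The function name is hypothetical, and the quadratic loop is only a model, not how one would implement it.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def conflict_masks(values):
    """For each element, build a bit mask of the earlier positions holding the same value."""
    masks = []
    for i, current in enumerate(values):
        mask = 0
        for j in range(i):                 # only elements before position i
            if values[j] == current:
                mask += 2 ** j             # set bit j
        masks.append(mask)
    return masks


indices = [7, 3, 7, 7, 3]
print([format(m, "05b") for m in conflict_masks(indices)])
```
&lt;/pre&gt;
&lt;p&gt;A zero mask means that the element is the first occurrence of its value, so a vector of all-zero masks signals that a scatter with these indices has no repeated stores.&lt;/p&gt;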
&lt;p&gt;Intel proposed a pattern that uses the gather, scatter and conflict
detection instructions to efficiently handle repeated indices. It
is described in the &lt;a class="reference external" href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html"&gt;freely available&lt;/a&gt; &amp;quot;Intel® 64 and IA-32 Architectures
Optimization Reference Manual&amp;quot;, in chapter &amp;quot;18.16.1 Vectorization with
Conflict Detection&amp;quot;.  The problem of calculating a &lt;a class="reference external" href="http://en.wikipedia.org/wiki/histogram"&gt;histogram&lt;/a&gt;
is used there as an example.&lt;/p&gt;
&lt;p&gt;The core of Intel's approach is a &lt;em&gt;conflict resolution loop&lt;/em&gt; in
which the repeated values are aggregated into a single element.
The number of iterations varies, and &lt;strong&gt;depends on data&lt;/strong&gt;: it
is 0 to 4, when we process 16-item vectors (32-bit elements).&lt;/p&gt;
&lt;p&gt;We propose a modified approach, that avoids any additional looping
at the cost of additional storage. It is &lt;strong&gt;1.4 times&lt;/strong&gt; faster than
the Intel algorithm when the input size is larger than 100,000
items.&lt;/p&gt;
&lt;p&gt;The text contains a recap of AVX-512 instructions, a detailed
overview of the Intel algorithm, the presentation of our procedure,
and evaluation results.&lt;/p&gt;
&lt;p&gt;All source code is available.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Modern perfect hashing for strings</title>
  <link>http://0x80.pl/notesen/2023-04-30-lookup-in-strings.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-04-30-lookup-in-strings.html</guid>
  <pubDate>Sun, 30 Apr 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Looking up a string in a static set of strings is a common problem we encounter
when parsing any textual format. Such sets are often keywords of a programming
language or protocol.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="2022-01-29-http-verb-parse.html"&gt;Parsing HTTP verbs&lt;/a&gt; turned out to be the fastest when we use a compile-time
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/trie"&gt;trie&lt;/a&gt;: a series of nested switch statements. I could not believe that
a &lt;a class="reference external" href="http://en.wikipedia.org/wiki/perfect_hash_function"&gt;perfect hash function&lt;/a&gt; would not be better, and that led to a novel hashing
approach that is based on the instruction &lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/PEXT.html"&gt;PEXT&lt;/a&gt; (Parallel Bits Extract).&lt;/p&gt;
&lt;p&gt;Briefly, when constructing a perfect hash function, we are looking for the
smallest set of input bytes that can then be the input for some function that
combines them into a single value. The instruction &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; allows us to quickly
construct any n-bit subword from a 64-bit word; the latency of the instruction is
3 CPU cycles on the current processors. This allows us to extend the scheme to
look for &lt;strong&gt;the smallest subset of bits&lt;/strong&gt;. This n-bit word is then the input
for a function that translates the word into the desired value.&lt;/p&gt;
&lt;p&gt;Instead of something like:&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;// read bytes at indices a, b, c&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;// and push forward&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;hash_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;We have:&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;// read bytes at indices d and e, and form a temp value&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// from the temp value (bytes d and e) extract crucial bits&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;PEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;precomputed_mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;hash_uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Please note that depending on the set of strings, the number of bytes
read in both schemes can vary. It is not a rule that a bit-level
hash function touches fewer bytes than a byte-level hash function.&lt;/p&gt;
&lt;p&gt;Apart from the above hashing scheme, this text also describes
constructing a compile-time hash table and a compile-time switch.&lt;/p&gt;
&lt;p&gt;All source code is available.&lt;/p&gt;
&lt;div class="section" id="pext-recap"&gt;
&lt;h2&gt;PEXT recap&lt;/h2&gt;
&lt;p&gt;The instruction &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; gets two arguments: the input word and
the input mask. Bits from the input word for which the input
mask is 1 are copied to the output. For example:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
word:   0010101011010111
mask:   0011100100100010
masked: __101__0__0___1_
PEXT:   __________101001
&lt;/pre&gt;
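&lt;p&gt;The same behaviour can be expressed as a short scalar model. The Python function below is a sketch for illustration (the name &lt;tt class="docutils literal"&gt;pext&lt;/tt&gt; is ours, and real code would use the hardware instruction); it walks the mask from the least significant bit and packs the selected bits of the word next to each other.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def pext(word, mask):
    """Scalar model of PEXT: pack the bits of `word` selected by `mask` into the low bits."""
    result = 0
    out_weight = 1                # weight of the next free output bit
    while mask > 0:
        if mask % 2 == 1:         # this position is selected by the mask
            result += (word % 2) * out_weight
            out_weight *= 2
        word //= 2
        mask //= 2
    return result


# The word and mask from the example above.
print(format(pext(0b0010101011010111, 0b0011100100100010), "06b"))
```
&lt;/pre&gt;
&lt;p&gt;The number of meaningful output bits equals the number of set bits in the mask, which is why a mask selecting n bits yields an n-bit value usable as a table index.&lt;/p&gt;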
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD-ized faster parse of IPv4 addresses</title>
  <link>http://0x80.pl/notesen/2023-04-09-faster-parse-ipv4.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-04-09-faster-parse-ipv4.html</guid>
  <pubDate>Sun, 09 Apr 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction-1"&gt;
&lt;span id="introduction"&gt;&lt;/span&gt;&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;As a recap, an &lt;a class="reference external" href="http://en.wikipedia.org/wiki/IP_address"&gt;IPv4 address&lt;/a&gt; written in the textual form consists of four
decimal numbers separated by the dot character. Each number represents an
octet (byte), that is, a value in the range from 0 to 255. Here are some examples:
&amp;quot;10.1.1.12&amp;quot;, &amp;quot;127.0.0.1&amp;quot;, &amp;quot;255.255.255.0&amp;quot;. An IPv4 address is stored in
the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Endianness"&gt;big-endian byte order&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Parsing IPv4 addresses seems to be a trivial task. For example, the Go
standard library package &lt;a class="reference external" href="https://pkg.go.dev/net/netip"&gt;netip&lt;/a&gt; implements parsing with full validation in 35 lines.
The &lt;tt class="docutils literal"&gt;inet_pton&lt;/tt&gt; specialisation for IPv4 addresses, which can be found in &lt;a class="reference external" href="https://www.gnu.org/software/libc/"&gt;Glibc&lt;/a&gt;,
spans approximately 40 lines of plain C code. For completeness, their full sources are
included in the &lt;a class="reference internal" href="#appendix"&gt;appendix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These two procedures share the same schema of parsing and validation:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Read the input string byte by byte.&lt;/li&gt;
&lt;li&gt;When the current byte is an ASCII digit ('0' ... '9'), append it to the
current octet (multiply the octet by 10 and add the digit). If the octet value
becomes larger than 255 or a leading zero is detected, report an error.&lt;/li&gt;
&lt;li&gt;When the current byte is the dot ('.'), check if we read at least
one digit. If not, it's an error (for example &amp;quot;.1.1.20&amp;quot; or &amp;quot;192..0.12&amp;quot;).&lt;/li&gt;
&lt;li&gt;When the current byte is not a digit or the dot, report an error.&lt;/li&gt;
&lt;li&gt;At the very end, check if exactly four octets were read.&lt;/li&gt;
&lt;/ol&gt;
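&lt;p&gt;The schema above could be sketched in C++ as follows. This is neither the Go
nor the Glibc code: &lt;tt class="docutils literal"&gt;parse_ipv4_scalar&lt;/tt&gt; is a hypothetical name, and
returning the address as a 32-bit integer with the first octet in the most
significant byte is an assumption made only for this example.&lt;/p&gt;

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Illustrative byte-by-byte parser (hypothetical helper, C++17).
// Returns the address with the first octet in the most significant byte,
// or std::nullopt when the input is not a valid dotted-decimal address.
std::optional<uint32_t> parse_ipv4_scalar(std::string_view s) {
    uint32_t result = 0;
    uint32_t octet = 0;
    int digits = 0; // digits seen in the current octet
    int octets = 0; // octets completed so far (one per dot)

    for (const char c : s) {
        if (c >= '0' && c <= '9') {
            if (digits > 0 && octet == 0) {
                return std::nullopt; // leading zero, e.g. "01"
            }
            octet = octet * 10 + static_cast<uint32_t>(c - '0');
            if (octet > 255) {
                return std::nullopt; // octet out of range
            }
            ++digits;
        } else if (c == '.') {
            if (digits == 0 || octets == 3) {
                return std::nullopt; // empty octet or too many dots
            }
            result = (result << 8) | octet;
            octet = 0;
            digits = 0;
            ++octets;
        } else {
            return std::nullopt; // neither a digit nor the dot
        }
    }

    // Exactly four octets are required, and the last one must not be empty.
    if (digits == 0 || octets != 3) {
        return std::nullopt;
    }
    return (result << 8) | octet;
}
```

&lt;p&gt;Every input byte goes through a couple of data-dependent branches, which is
what the SIMD variants try to avoid.&lt;/p&gt;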
&lt;p&gt;I bet 99.999% of IPv4 parsers ever written use a similar schema. And it is pretty
hard to see any obvious inefficiency in this approach. Additionally, when we take
into account that compilers are getting smarter and smarter, we may assume
that a compiler would do a decent job for us.&lt;/p&gt;
&lt;p&gt;However, it's possible to make the conversion faster using &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Single_instruction,_multiple_data"&gt;SIMD instructions&lt;/a&gt;.
The best solution is &lt;strong&gt;two to three times faster&lt;/strong&gt; than a reference scalar procedure.&lt;/p&gt;
&lt;p&gt;Actual C++ code snippets with &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions"&gt;SSE&lt;/a&gt;
instructions are used to illustrate the solution. The full source code is &lt;a class="reference internal" href="#source"&gt;available&lt;/a&gt;;
it includes a &lt;a class="reference external" href="http://en.wikipedia.org/wiki/SWAR"&gt;SWAR&lt;/a&gt; implementation and also different SSE variants.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SWAR find any byte from set</title>
  <link>http://0x80.pl/notesen/2023-03-06-swar-find-any.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-03-06-swar-find-any.html</guid>
  <pubDate>Mon, 06 Mar 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When I was browsing the source code of the &lt;a class="reference external" href="https://github.com/ada-url/ada"&gt;Ada&lt;/a&gt; project (&lt;em&gt;WHATWG-compliant
and fast URL parser written in modern C++&lt;/em&gt;), the following procedure
caught my attention:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;ada_really_inline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;find_authority_delimiter_special&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string_view&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;noexcept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x0101010101010101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="mh"&gt;0x8080808080808080&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index_of_first_set_byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((((&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x101010101010101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x101010101010101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x101010101010101&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'&amp;#64;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'/'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'?'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'\\'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;swap_bytes_if_big_endian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index_of_first_set_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;swap_bytes_if_big_endian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index_of_first_set_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The above procedure finds the position of the first occurrence of a char
from the set &lt;tt class="docutils literal"&gt;&amp;#64;&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;/&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;?&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;\&lt;/tt&gt;. It returns the length
of the input string if nothing was found.&lt;/p&gt;
&lt;p&gt;The procedure uses &lt;a class="reference external" href="http://en.wikipedia.org/wiki/SWAR"&gt;SWAR&lt;/a&gt; techniques: it processes several bytes at once,
taking advantage of the fact that current CPUs operate on 64-bit values.
The procedure implementation comes from &lt;tt class="docutils literal"&gt;src/helpers.cpp&lt;/tt&gt;, and more
functions from that file follow exactly the same SWAR approach.&lt;/p&gt;
&lt;p&gt;These two functions are crucial:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;has_zero_byte&lt;/tt&gt; is non-zero if a multi-byte word has at least one zero byte;
note that the procedure also keeps only the most significant bit of each byte.&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;index_of_first_set_byte&lt;/tt&gt; returns the index of the first non-zero byte; it relies on
the fact that it is called on a word formed only from the bytes 0x00 and 0x80.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The pattern used is quite straightforward. If we bit-xor the input bytes with a
word filled with one of the bytes from the set, then the result has a zero byte
wherever that byte was present. We then check if at least one result of the bit-xor
has a zero byte and, if so, we look for its position.&lt;/p&gt;
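&lt;p&gt;The pattern can be sketched for a single searched byte. The two helpers are
the lambdas from the listing above rewritten as free functions;
&lt;tt class="docutils literal"&gt;find_byte_in_8&lt;/tt&gt; is a hypothetical name, and the sketch assumes a
little-endian host and at least eight readable bytes of input.&lt;/p&gt;

```cpp
#include <cstdint>
#include <cstring>

// Sets the most significant bit of a result byte when the corresponding
// byte of v is zero. Bytes above the first zero byte may be flagged
// spuriously (borrow propagation); that is harmless here, because only
// the lowest flag is used.
uint64_t has_zero_byte(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

// Index of the lowest byte having its 0x80 bit set; v must be non-zero.
// (v - 1) turns the lowest 0x80 and everything below it into bytes with
// the lowest bit set; the multiplication sums those bits in the top byte.
int index_of_first_set_byte(uint64_t v) {
    const uint64_t ones = 0x0101010101010101ULL;
    return static_cast<int>((((v - 1) & ones) * ones) >> 56) - 1;
}

// Hypothetical helper: position of `needle` within the first 8 bytes of
// `text`, or 8 when it is absent. Assumes a little-endian host, so that
// byte 0 of the loaded word is text[0].
int find_byte_in_8(const char* text, char needle) {
    uint64_t word;
    std::memcpy(&word, text, sizeof(word));

    // Broadcast the needle to all eight bytes and xor: matching bytes
    // become zero.
    const uint64_t pattern = 0x0101010101010101ULL * static_cast<uint8_t>(needle);
    const uint64_t flags = has_zero_byte(word ^ pattern);

    return flags ? index_of_first_set_byte(flags) : 8;
}
```

&lt;p&gt;Searching for a set of bytes, as the original procedure does, means
repeating the xor and &lt;tt class="docutils literal"&gt;has_zero_byte&lt;/tt&gt; steps for each byte of the set
and combining the flags with bit-or.&lt;/p&gt;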
&lt;p&gt;While the production code processes multi-word inputs, let's focus on a basic
building block that processes a single 64-bit word.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_authority_delimiter_special_reference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;noexcept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x0101010101010101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="mh"&gt;0x8080808080808080&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index_of_first_set_byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((((&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x101010101010101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x101010101010101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x101010101010101&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'&amp;#64;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'/'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'?'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'\\'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xor4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_zero_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xor4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index_of_first_set_byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The following assembly is produced by GCC 12.2.0 from Debian for the IceLake Server architecture
(&lt;tt class="docutils literal"&gt;gcc &lt;span class="pre"&gt;-O3&lt;/span&gt; &lt;span class="pre"&gt;-march=icelake-server&lt;/span&gt;&lt;/tt&gt;).&lt;/p&gt;
&lt;pre class="code literal-block"&gt;
movabs $0x2f2f2f2f2f2f2f2f,%rax
movabs $0xfefefefefefefeff,%rsi
xor    %rdi,%rax
movabs $0xd0d0d0d0d0d0d0d0,%rcx
xor    %rdi,%rcx
add    %rsi,%rax
and    %rcx,%rax
movabs $0x4040404040404040,%rcx
mov    %rdi,%rdx
xor    %rdi,%rcx
movabs $0xbfbfbfbfbfbfbfbf,%rdi
xor    %rdx,%rdi
add    %rsi,%rcx
and    %rdi,%rcx
or     %rcx,%rax
movabs $0x3f3f3f3f3f3f3f3f,%rcx
xor    %rdx,%rcx
movabs $0xc0c0c0c0c0c0c0c0,%rdi
add    %rsi,%rcx
xor    %rdx,%rdi
and    %rdi,%rcx
or     %rcx,%rax
movabs $0x5c5c5c5c5c5c5c5c,%rcx
xor    %rdx,%rcx
add    %rsi,%rcx
movabs $0xa3a3a3a3a3a3a3a3,%rsi
xor    %rsi,%rdx
and    %rcx,%rdx
or     %rdx,%rax
movabs $0x8080808080808080,%rdx
and    %rdx,%rax
je     &amp;lt;_Z42find_authority_delimiter_special_referencem+0xc0&amp;gt;
movabs $0x101010101010101,%rdx
dec    %rax
and    %rdx,%rax
imul   %rdx,%rax
shr    $0x38,%rax
dec    %eax
ret
mov    $0xffffffff,%eax
ret
&lt;/pre&gt;
&lt;p&gt;The assembly contains:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;11 x constants,&lt;/li&gt;
&lt;li&gt;8 x xor,&lt;/li&gt;
&lt;li&gt;6 x and,&lt;/li&gt;
&lt;li&gt;4 x add,&lt;/li&gt;
&lt;li&gt;3 x or,&lt;/li&gt;
&lt;li&gt;1 x multiplication (&lt;tt class="docutils literal"&gt;imul&lt;/tt&gt;),&lt;/li&gt;
&lt;li&gt;1 x shift right,&lt;/li&gt;
&lt;li&gt;1 x branch.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: finding first byte in lanes</title>
  <link>http://0x80.pl/notesen/2023-02-06-avx512-find-first-byte-in-lane.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-02-06-avx512-find-first-byte-in-lane.html</guid>
  <pubDate>Mon, 06 Feb 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The problem is defined as follows: we have separate lanes (32-bit or 64-bit)
and want to find the position of the first occurrence of the given byte
in each lane.&lt;/p&gt;
&lt;p&gt;For example, when we look for byte &lt;tt class="docutils literal"&gt;0xaa&lt;/tt&gt; in 32-bit lanes:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
   lane 0        lane 1        lane 2       lane 3  ...
[00|aa|aa|11] [aa|aa|aa|aa] [aa|11|11|22] [11|22|33|44]
       ^^               ^^   ^^
  position 1    position 0   position 3    position 4 (not found)
&lt;/pre&gt;
&lt;p&gt;The result should be a vector of &lt;tt class="docutils literal"&gt;uint32 = {1, 0, 3, 4, &lt;span class="pre"&gt;...}&lt;/span&gt;&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;With AVX512, an obvious solution would be to produce a bitmask from
a byte-level comparison and then do some permutations to convert
parts of the bitmask into 32-bit values.&lt;/p&gt;
&lt;p&gt;While it's feasible, I want to show a method that uses a trick
from my previous article &lt;a class="reference external" href="2023-01-31-avx512-bsf.html"&gt;AVX512: count trailing zeros&lt;/a&gt;.&lt;/p&gt;
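&lt;p&gt;To pin down the expected result, here is a minimal scalar sketch in Go of the
problem as stated above. It is only a reference model, not the AVX512 method this
article is about. The names &lt;tt class="docutils literal"&gt;firstByteInLane&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;firstByteInLanes&lt;/tt&gt; are hypothetical, and positions are assumed to be
counted from the least significant byte, as in the diagram.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

// firstByteInLane returns the position (0..3) of the first byte equal to
// needle in a 32-bit lane, counting from the least significant byte.
// It returns 4 when the lane does not contain the byte.
func firstByteInLane(lane uint32, needle byte) uint32 {
    for pos := uint32(0); pos != 4; pos++ {
        if byte(lane>>(8*pos)) == needle {
            return pos
        }
    }
    return 4
}

// firstByteInLanes applies firstByteInLane to every lane of the input.
func firstByteInLanes(lanes []uint32, needle byte) []uint32 {
    out := make([]uint32, len(lanes))
    for i, lane := range lanes {
        out[i] = firstByteInLane(lane, needle)
    }
    return out
}
&lt;/pre&gt;
&lt;p&gt;For the lanes &lt;tt class="docutils literal"&gt;0x00aaaa11, 0xaaaaaaaa, 0xaa111122, 0x11223344&lt;/tt&gt; and
the byte &lt;tt class="docutils literal"&gt;0xaa&lt;/tt&gt; this model yields &lt;tt class="docutils literal"&gt;{1, 0, 3, 4}&lt;/tt&gt;,
matching the example.&lt;/p&gt;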
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Finding lowest common ancestor of two nodes</title>
  <link>http://0x80.pl/notesen/2023-02-05-tree-lca.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-02-05-tree-lca.html</guid>
  <pubDate>Sun, 05 Feb 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;There are several approaches to finding the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Lowest_common_ancestor"&gt;lowest common ancestor&lt;/a&gt; (LCA).&lt;/p&gt;
&lt;p&gt;The algorithm shown here does not need extra memory.  It assumes
that we can get the parent node of a given node in constant time.&lt;/p&gt;
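&lt;p&gt;As an illustration of what is possible under these two constraints, below is a
minimal sketch in Go of one common parent-pointer approach: equalize the depths of
both nodes, then walk up in lockstep. It is not necessarily the algorithm the
article goes on to describe. The type &lt;tt class="docutils literal"&gt;Node&lt;/tt&gt; and the function names
are hypothetical, and both arguments are assumed to be non-nil.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

// Node is a hypothetical tree node; only the parent link is needed.
type Node struct {
    Parent *Node // nil for the root
}

// depth counts the edges between n and the root of its tree.
func depth(n *Node) int {
    d := 0
    for n.Parent != nil {
        n = n.Parent
        d++
    }
    return d
}

// lowestCommonAncestor returns the deepest node that is an ancestor of both
// a and b (a node counts as its own ancestor). It returns nil when the nodes
// belong to different trees. It uses only a constant amount of extra memory.
func lowestCommonAncestor(a, b *Node) *Node {
    da, db := depth(a), depth(b)
    // Lift the deeper node until both are on the same level.
    for da > db {
        a = a.Parent
        da--
    }
    for db > da {
        b = b.Parent
        db--
    }
    // Walk up in lockstep until the two paths meet.
    for a != b {
        a = a.Parent
        b = b.Parent
    }
    return a
}
&lt;/pre&gt;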
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Faster fractional exponents</title>
  <link>http://0x80.pl/notesen/2023-02-05-fraction-pow.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-02-05-fraction-pow.html</guid>
  <pubDate>Sun, 05 Feb 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;A well-known method of calculating integer powers is based on the binary
representation of the exponent. Let's consider a simple example. The exponent
is &lt;tt class="docutils literal"&gt;y = 9 = 0b1001&lt;/tt&gt;; its value can be expressed as
&lt;span class="math"&gt;1 &amp;sdot; 2&lt;sup&gt;0&lt;/sup&gt; + 0 &amp;sdot; 2&lt;sup&gt;1&lt;/sup&gt; + 0 &amp;sdot; 2&lt;sup&gt;2&lt;/sup&gt; + 1 &amp;sdot; 2&lt;sup&gt;3&lt;/sup&gt;&lt;/span&gt;; after constant folding
it simplifies to &lt;span class="math"&gt;2&lt;sup&gt;0&lt;/sup&gt; + 2&lt;sup&gt;3&lt;/sup&gt; = 1 + 8 = 9&lt;/span&gt;. Thus &lt;span class="math"&gt;&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;9&lt;/sup&gt;&lt;/span&gt; can
be expanded into &lt;span class="math"&gt;&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;sup&gt;0&lt;/sup&gt; + 2&lt;sup&gt;3&lt;/sup&gt;&lt;/sup&gt; = &lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;sup&gt;0&lt;/sup&gt;&lt;/sup&gt; &amp;sdot; &lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;sup&gt;3&lt;/sup&gt;&lt;/sup&gt; = &lt;i&gt;x&lt;/i&gt;&lt;sup&gt;1&lt;/sup&gt; &amp;sdot; &lt;i&gt;x&lt;/i&gt;&lt;sup&gt;8&lt;/sup&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The main observation is that the product contains &lt;span class="math"&gt;&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;sup&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sup&gt;&lt;/sup&gt;&lt;/span&gt; only
if the i-th bit of the exponent is 1.&lt;/p&gt;
&lt;p&gt;An algorithm utilizing this observation is quite simple:&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;powint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;// x^0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// tmp is x^{2^i}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="c1"&gt;// i-th bit set?&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;// result times x^{2^i}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="c1"&gt;// square in each iteration&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="c1"&gt;// scan bits starting from the least significant one&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Exactly the same scheme can be used if the exponent is fractional;
for simplicity let's assume the exponent is positive and less than 1.&lt;/p&gt;
&lt;p&gt;We again use the binary representation of the exponent; we just need to remember
that the weights of bits are now fractions &lt;span class="math"&gt;2&lt;sup&gt; &amp;minus; &lt;i&gt;i&lt;/i&gt;&lt;/sup&gt;&lt;/span&gt;, that is
1/2, 1/4, 1/8, 1/16, 1/32, and so on. The factors of the product are therefore
&lt;span class="math"&gt;&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;sup&gt; &amp;minus; &lt;i&gt;i&lt;/i&gt;&lt;/sup&gt;&lt;/sup&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The algorithm is almost identical: we scan bits starting from the
most significant one, so the bit weights halve at each step.
The value &lt;span class="math"&gt;&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;&amp;frac12;&lt;/sup&gt;&lt;/span&gt; is &lt;strong&gt;a square root&lt;/strong&gt;, and each following
factor is the square root of the previous one.&lt;/p&gt;
&lt;p&gt;Let's assume that we have a fraction expressed as a &lt;tt class="docutils literal"&gt;uint64&lt;/tt&gt;,
where the decimal dot is &lt;strong&gt;before&lt;/strong&gt; the most significant bit. For
instance, the fraction &lt;span class="math"&gt;0.1010111&lt;sub&gt;2&lt;/sub&gt;&lt;/span&gt; would have the following
representation as &lt;tt class="docutils literal"&gt;uint64&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;decimal dot
|
[10101110|00000000|00000000|00000000|00000000|00000000|00000000|00000000]
 ||│                                                                   |
 ││└- bit 61, weight 1/8                                           bit 0
 │└─╴ bit 62, weight 1/4
 └──╴ bit 63, weight 1/2&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the next section we will see how to convert a normalized
float into that representation.&lt;/p&gt;
&lt;p&gt;The algorithm is:&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;powfracaux&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fraction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="c1"&gt;// res = 2^0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nx"&gt;sq&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fraction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nx"&gt;sq&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="c1"&gt;// sq = x^(1/2^i)&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fraction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// i-th bit set (MSB)&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;sq&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;// update result&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nx"&gt;fraction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
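&lt;p&gt;To try the routine out, a fraction has to be put into the fixed-point form first.
The sketch below is an assumption made for illustration, not the conversion
promised for the next section: the hypothetical helper &lt;tt class="docutils literal"&gt;fractionBits&lt;/tt&gt;
simply scales the exponent by 2&lt;sup&gt;64&lt;/sup&gt; with &lt;tt class="docutils literal"&gt;math.Ldexp&lt;/tt&gt;.
The loop is repeated under the hypothetical name &lt;tt class="docutils literal"&gt;powfracSketch&lt;/tt&gt; only so that
the example compiles on its own; it assumes a non-negative base and an exponent in the range [0, 1).&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

import "math"

// fractionBits converts y, assumed to lie in the half-open range [0, 1),
// into a 64-bit fixed-point number with the dot before bit 63.
// Scaling by a power of two is exact for float64 values.
func fractionBits(y float64) uint64 {
    return uint64(math.Ldexp(y, 64))
}

// powfracSketch computes x raised to the fixed-point fraction, using the
// same square-root scan as the function above.
func powfracSketch(x float64, fraction uint64) float64 {
    res := 1.0 // x^0
    sq := x
    for fraction != 0 {
        sq = math.Sqrt(sq) // sq = x^(1/2^i)
        if fraction>>63 == 1 { // most significant bit set?
            res *= sq
        }
        fraction *= 2 // shift left by one bit; the top bit is dropped
    }
    return res
}
&lt;/pre&gt;
&lt;p&gt;For the fraction from the diagram, &lt;tt class="docutils literal"&gt;fractionBits(0.6796875)&lt;/tt&gt; gives
&lt;tt class="docutils literal"&gt;0xAE00000000000000&lt;/tt&gt;, and &lt;tt class="docutils literal"&gt;powfracSketch(2, fractionBits(0.75))&lt;/tt&gt;
should agree with &lt;tt class="docutils literal"&gt;math.Pow(2, 0.75)&lt;/tt&gt; up to rounding errors.&lt;/p&gt;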
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Converting binary fraction to ratio</title>
  <link>http://0x80.pl/notesen/2023-02-05-float-to-ratio.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-02-05-float-to-ratio.html</guid>
  <pubDate>Sun, 05 Feb 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Suppose we have a binary fraction that is positive and less than 1:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
0.&lt;span style="font-weight: bold"&gt;1&lt;/span&gt;0&lt;span style="font-weight: bold"&gt;1&lt;/span&gt;0&lt;span style="font-weight: bold"&gt;11&lt;/span&gt;000┈┈┈ = 0.671875
  | | |│
  │ │ │└─╴ 1/2^6
  │ │ └──╴ 1/2^5
  │ └────╴ 1/2^3
  └──────╴ 1/2^1&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We want to express it as a ratio of two integer numbers.&lt;/p&gt;
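&lt;p&gt;One straightforward way to obtain such a ratio is sketched below in Go. It is an
illustration only and not necessarily the method developed in the article. The name
&lt;tt class="docutils literal"&gt;toRatio&lt;/tt&gt; is hypothetical, and the input is assumed to lie in the range [0, 1)
and to have at most 63 binary digits after the dot, so that the denominator fits in a
&lt;tt class="docutils literal"&gt;uint64&lt;/tt&gt;.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

import "math"

// toRatio expresses f as num/den, where den is a power of two.
// Because the loop stops at the first integer value, the ratio is
// already in lowest terms.
func toRatio(f float64) (num, den uint64) {
    den = 1
    // Doubling a float64 is exact, so each step just moves the
    // binary dot one digit to the right.
    for f != math.Floor(f) {
        f *= 2
        den *= 2
    }
    return uint64(f), den
}
&lt;/pre&gt;
&lt;p&gt;For the fraction above, &lt;tt class="docutils literal"&gt;toRatio(0.671875)&lt;/tt&gt; returns 43 and 64, since
1/2 + 1/8 + 1/32 + 1/64 = 43/64.&lt;/p&gt;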
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: count trailing zeros</title>
  <link>http://0x80.pl/notesen/2023-01-31-avx512-bsf.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-01-31-avx512-bsf.html</guid>
  <pubDate>Tue, 31 Jan 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;AVX512 lacks an instruction for counting &lt;strong&gt;trailing zeros&lt;/strong&gt;; it only supports counting leading
zeros, via the instruction &lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/VPLZCNTD_Q.html"&gt;VPLZCNTD&lt;/a&gt;, in 32- and 64-bit words. For scalar code there is the
instruction &lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/BSF.html"&gt;BSF&lt;/a&gt; (Bit Scan Forward).&lt;/p&gt;
&lt;p&gt;To recall how counting trailing zeros is supposed to work, let's look at
a sample 32-bit word:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
 bit 31                            bit 0
 |                                     |
[0000|0000|0001|0111|1000|0011|1100|0000]
                                 ^^ ^^^^
                            6 trailing zeros
&lt;/pre&gt;
&lt;p&gt;An obvious solution would be reversing the bits in a word and then using the
&lt;tt class="docutils literal"&gt;VPLZCNT&lt;/tt&gt; instruction. Source code for this approach is shown in a later
section. Basically, with &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; we can reverse the order of bytes. Then, with
two additional invocations of that instruction, we swap the order of bits
within bytes.&lt;/p&gt;
&lt;p&gt;However, a really clever solution uses &lt;a class="reference external" href="https://www.chessprogramming.org/Population_Count"&gt;population count&lt;/a&gt;;
it is explained in the next section.&lt;/p&gt;
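&lt;p&gt;The reverse-then-count idea can be checked with a scalar sketch in Go before
moving to vectors. This is only a model of the "obvious" approach built on the standard
&lt;tt class="docutils literal"&gt;math/bits&lt;/tt&gt; package, not AVX512 code; the function name
&lt;tt class="docutils literal"&gt;trailingZerosViaReverse&lt;/tt&gt; is hypothetical.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

import "math/bits"

// trailingZerosViaReverse counts the trailing zeros of x by reversing the
// bit order and then counting the leading zeros of the result.
// For x == 0 it returns 32.
func trailingZerosViaReverse(x uint32) int {
    return bits.LeadingZeros32(bits.Reverse32(x))
}
&lt;/pre&gt;
&lt;p&gt;For the sample word shown earlier, &lt;tt class="docutils literal"&gt;trailingZerosViaReverse(0x001783c0)&lt;/tt&gt; returns 6.&lt;/p&gt;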
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: check if value belongs to a set</title>
  <link>http://0x80.pl/notesen/2023-01-21-avx512-any-eq.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-01-21-avx512-any-eq.html</guid>
  <pubDate>Sat, 21 Jan 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;We want to check if a value belongs to a set. More formally, we
want to evaluate the following expression: &lt;cite&gt;(x == word_0) or (x == word_1)
or ... or (x == word_n)&lt;/cite&gt;, where &lt;cite&gt;x&lt;/cite&gt; is a vector of words, and &lt;cite&gt;word_i&lt;/cite&gt; is
a constant vector.&lt;/p&gt;
&lt;p&gt;For a four-element set, a naive version of the AVX512 assembly code is:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
VPCMPD.BCST   $0, (AX),   Z1, K1        // K1 = Z1 == word_0
VPCMPD.BCST   $0, 4(AX),  Z1, K2        // K2 = Z1 == word_1
VPCMPD.BCST   $0, 8(AX),  Z1, K3        // K3 = Z1 == word_2
VPCMPD.BCST   $0, 12(AX), Z1, K4        // K4 = Z1 == word_3
KORW          K1, K2, K1                // K1 = K1 | K2
KORW          K3, K4, K3                // K3 = K3 | K4
KORW          K1, K3, K1                // K1 = K1 | K3
&lt;/pre&gt;
&lt;p&gt;The above code tests a vector register &lt;tt class="docutils literal"&gt;Z1&lt;/tt&gt; against constant values stored in
an array pointed to by &lt;tt class="docutils literal"&gt;AX&lt;/tt&gt;, and stores the result in the mask register &lt;tt class="docutils literal"&gt;K1&lt;/tt&gt;.&lt;/p&gt;
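&lt;p&gt;For reference, the same computation can be written as scalar Go. This is a
minimal model that only spells out what each bit of &lt;tt class="docutils literal"&gt;K1&lt;/tt&gt; means; the name
&lt;tt class="docutils literal"&gt;anyEq&lt;/tt&gt; is hypothetical and sixteen 32-bit lanes are assumed.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

// anyEq models the seven instructions in scalar code: bit i of the
// result is set when lane i of x equals any of the four words.
func anyEq(x [16]uint32, words [4]uint32) uint16 {
    var mask uint16
    bit := uint16(1) // mask bit of the current lane
    for _, v := range x {
        for _, w := range words {
            if v == w {
                mask |= bit
                break
            }
        }
        bit *= 2 // move on to the next lane
    }
    return mask
}
&lt;/pre&gt;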
&lt;p&gt;The tool &lt;a class="reference external" href="https://uica.uops.info/"&gt;uICA&lt;/a&gt; reports the following timings (for Skylake-X):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
Throughput (in cycles per iteration): 4.00
Bottlenecks: Decoder, Ports

┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7 │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                     1                          1                      │ vpcmpeqd k1, zmm1, dword ptr [rax]{1to16}
│  2                    │   2    │   2   │                              1                 1                      │ vpcmpeqd k2, zmm1, dword ptr [rax+0x4]{1to16}
│  2                    │   2    │   2   │                     1                          1                      │ vpcmpeqd k3, zmm1, dword ptr [rax+0x8]{1to16}
│  2                    │   2    │   2   │                              1                 1                      │ vpcmpeqd k4, zmm1, dword ptr [rax+0xc]{1to16}
│  1                    │   1    │   1   │   1                                                                   │ korw k1, k2, k1
│  1                    │   1    │   1   │   1                                                                   │ korw k3, k4, k3
│  1                    │   1    │   1   │   1                                                                   │ korw k1, k3, k1
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│  11                   │   11   │  11   │   3                 2        2                 4                      │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: generating constants</title>
  <link>http://0x80.pl/notesen/2023-01-19-avx512-consts.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-01-19-avx512-consts.html</guid>
  <pubDate>Thu, 19 Jan 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;AVX512 code often needs constants that repeat in 64-, 32-, 16- or 8-bit lanes.
Instead of pre-computing such constants as whole 64-byte arrays, we can reduce
memory consumption by using explicit or implicit broadcasts.&lt;/p&gt;
&lt;p&gt;While broadcasts have high throughput, their latencies are quite high. According
to &lt;a class="reference external" href="https://uops.info/table.html?search=broadcastq%20(zmm%2C%20R64&amp;amp;cb_lat=on&amp;amp;cb_tp=on&amp;amp;cb_SKX=on&amp;amp;cb_CNL=on&amp;amp;cb_ICL=on&amp;amp;cb_ADLP=on&amp;amp;cb_measurements=on&amp;amp;cb_doc=on&amp;amp;cb_avx512=on"&gt;uops.info&lt;/a&gt;, latencies are 5 cycles (on Skylake-X, Cannon Lake, Ice Lake,
Alder Lake-P) when broadcasting from either a 32-bit or 64-bit register.
When the source is a memory location, latencies are even higher.&lt;/p&gt;
&lt;p&gt;When an AVX512 procedure is quite short, or often loads different constants
from memory, broadcast latencies might become visible. To overcome this
problem, we might &lt;strong&gt;compute&lt;/strong&gt; some values using a few cheap instructions.&lt;/p&gt;
&lt;p&gt;We can quickly fill an AVX512 register with all zero bits (with XOR) or all ones (with the ternary
logic instruction) and then use shifts, bit operations and other instructions
to construct the desired value.&lt;/p&gt;
&lt;p&gt;This article shows some ways to calculate different simple constants.
Examples are focused mostly on 32-bit values, although in most cases we
might generalize the methods to other lane widths.&lt;/p&gt;
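&lt;p&gt;The idea of deriving constants from an all-ones value can be tried out on a single
32-bit lane in scalar Go. The sketch below only models the arithmetic, not the AVX512
instructions themselves; the names &lt;tt class="docutils literal"&gt;allOnes&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;lowBitsMask&lt;/tt&gt;
and &lt;tt class="docutils literal"&gt;one&lt;/tt&gt; are hypothetical.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

// allOnes models a lane filled with ones.
const allOnes = ^uint32(0)

// lowBitsMask returns a lane with the n lowest bits set, for n in 0..32,
// using a single logical shift right of the all-ones value.
// In Go a shift by 32 or more yields 0, so lowBitsMask(0) is 0.
func lowBitsMask(n uint) uint32 {
    return allOnes >> (32 - n)
}

// one returns the constant 1, again derived from all ones.
func one() uint32 {
    return allOnes >> 31
}
&lt;/pre&gt;
&lt;p&gt;For example, &lt;tt class="docutils literal"&gt;lowBitsMask(16)&lt;/tt&gt; is &lt;tt class="docutils literal"&gt;0x0000ffff&lt;/tt&gt;.&lt;/p&gt;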
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: histogram of sixteen nibbles</title>
  <link>http://0x80.pl/notesen/2023-01-06-avx512-popcount-4bit.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2023-01-06-avx512-popcount-4bit.html</guid>
  <pubDate>Fri, 06 Jan 2023 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;At &lt;a class="reference external" href="https://sneller.io"&gt;Sneller&lt;/a&gt; we had the following problem: there are sixteen 4-bit values,
and we need a histogram for that limited set.&lt;/p&gt;
&lt;p&gt;Since the input set has a fixed size of 64 bits, the problem is not
as difficult as the generic case.&lt;/p&gt;
&lt;p&gt;The basic problem we solve is counting how many times a given nibble value occurs in the
64-bit set.  Then, the solution to the initial problem is performing the
basic step for each possible nibble value, and it is done in parallel.&lt;/p&gt;
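&lt;p&gt;As a reference for the expected result, here is a minimal scalar sketch in Go.
It is a plain loop, not the parallel AVX512 procedure; the name
&lt;tt class="docutils literal"&gt;nibbleHistogram&lt;/tt&gt; is hypothetical.&lt;/p&gt;
&lt;pre class="code go literal-block"&gt;
package main

// nibbleHistogram counts how many times each 4-bit value occurs among the
// sixteen nibbles of v; hist[k] is the number of nibbles equal to k.
func nibbleHistogram(v uint64) [16]uint8 {
    var hist [16]uint8
    for i := uint(0); i != 16; i++ {
        nibble := (v >> (4 * i)) % 16 // extract the i-th nibble
        hist[nibble]++
    }
    return hist
}
&lt;/pre&gt;
&lt;p&gt;For the input &lt;tt class="docutils literal"&gt;0x0123456789abcdef&lt;/tt&gt; every counter equals 1, and for the input 0
only &lt;tt class="docutils literal"&gt;hist[0]&lt;/tt&gt; is non-zero and equals 16.&lt;/p&gt;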
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Faster hack</title>
  <link>http://0x80.pl/notesen/2022-01-31-faster-hack.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2022-01-31-faster-hack.html</guid>
  <pubDate>Mon, 31 Jan 2022 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The other day I came across the following line:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;11124811248484&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="sc"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;My first reaction was &amp;quot;WTF&amp;quot;, but later I realized that such hackish code
has to be a response to poor compiler optimizations. And taking into account
that the code might be quite old, this is a perfect solution. Only
a bit unreadable.&lt;/p&gt;
&lt;p&gt;In fact we have something like &lt;tt class="docutils literal"&gt;len * coefficient(type) &amp;gt; 4&lt;/tt&gt;, where
&lt;tt class="docutils literal"&gt;coefficient&lt;/tt&gt; is a value from the set {1, 2, 4, 8}.&lt;/p&gt;
&lt;p&gt;I want to show that this problem can be solved not only more cleanly but also
faster.&lt;/p&gt;
&lt;p&gt;First, let's examine the compiler output for the original expression. We
assume both variables are unsigned integers.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fun1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;11124811248484&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="sc"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;And the output from &lt;tt class="docutils literal"&gt;gcc &lt;span class="pre"&gt;-O3&lt;/span&gt; &lt;span class="pre"&gt;-march=skylake&lt;/span&gt; &lt;span class="pre"&gt;-s&lt;/span&gt;&lt;/tt&gt;, GCC version 10.2.1:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
    cmpl    $13, %esi
    ja      .L2
    movl    %esi, %esi
    leaq    .LC0(%rip), %rax        # coef = &amp;quot;11124811248484&amp;quot;[type] if type &amp;lt; 14
    movsbl  (%rax,%rsi), %eax
    subl    $48, %eax               # coef -= '0'
    imull   %eax, %edi              # edi = len * coef
.L2:
    cmpl    $4, %edi                # len * coef &amp;gt; 4
    seta    %al
    ret
&lt;/pre&gt;
&lt;p&gt;It's worth noting that the compiler knows that for &lt;tt class="docutils literal"&gt;type &amp;gt;= 14&lt;/tt&gt; we
always fetch the value &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;&amp;quot;11124811248484&amp;quot;[0]&lt;/span&gt; - '0'&lt;/tt&gt;, which equals one. Thus,
it skips the lookup and the multiplication and jumps straight to the comparison.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Optimization #1&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We may omit the subtraction of the constant &lt;tt class="docutils literal"&gt;'0'&lt;/tt&gt; if we store the values
directly, using hex escape sequences instead of ASCII digits. Older compilers
may require octal escapes instead. Either way, we avoid a subtraction.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fun2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\x01\x01\x01\x02\x04\x08\x01\x01\x02\x04\x08\x04\x08\x04&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The assembly code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
    cmpl    $13, %esi
    ja      .L5
    movl    %esi, %esi
    leaq    .LC1(%rip), %rax
    movsbl  (%rax,%rsi), %eax
    imull   %eax, %edi
.L5:
    cmpl    $4, %edi
    seta    %al
    ret
&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Optimization #2&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We removed one instruction. Can we do better? The comparison &lt;tt class="docutils literal"&gt;type &amp;lt; 14&lt;/tt&gt;
cannot be avoided, but the constants are {1, 2, 4, 8}. All are powers of two,
thus we can replace the multiplication with a left shift just by
storing the exponents {0, 1, 2, 3} instead of the constants.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fun3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\x00\x00\x00\x01\x02\x03\x00\x00\x01\x02\x03\x02\x03\x02&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The assembly code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
    cmpl    $13, %esi
    ja      .L7
    movl    %esi, %esi
    leaq    .LC2(%rip), %rax
    movsbl  (%rax,%rsi), %eax
    shlx    %eax, %edi, %edi
.L7:
    cmpl    $4, %edi
    seta    %al
    ret
&lt;/pre&gt;
&lt;p&gt;Since we are targeting Skylake, the compiler was able to use the instruction
&lt;tt class="docutils literal"&gt;SHLX&lt;/tt&gt; from the BMI2 extension.  The instruction performs a left shift, but
does not modify the CPU flags register. Because of that, it does not create
any implicit dependencies between instructions through the flags.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Optimization #3&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Seems we reached the end? Not really. We are calculating four cases:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
len * 1 &amp;gt; 4
len * 2 &amp;gt; 4
len * 4 &amp;gt; 4
len * 8 &amp;gt; 4
&lt;/pre&gt;
&lt;p&gt;Remembering that &lt;tt class="docutils literal"&gt;len&lt;/tt&gt; is an unsigned integer, and assuming the
multiplication does not overflow, we may rewrite the expressions:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
len &amp;gt; 4     # 5 * 1 &amp;gt; 4, but 4 * 1 is not
len &amp;gt; 2     # 3 * 2 &amp;gt; 4, but 2 * 2 is not
len &amp;gt; 1     # 2 * 4 &amp;gt; 4, but 1 * 4 is not
len &amp;gt; 0     # 1 * 8 &amp;gt; 4, but 0 * 8 is not
&lt;/pre&gt;
&lt;p&gt;We just figured out the largest value of &lt;tt class="docutils literal"&gt;len&lt;/tt&gt; for which the expression &lt;tt class="docutils literal"&gt;len
* coefficient &amp;gt; 4&lt;/tt&gt; is still false. As a result, we may get rid of the
multiplication/shift and compare &lt;tt class="docutils literal"&gt;len&lt;/tt&gt; with that bound directly.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fun4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\x04\x04\x04\x02\x01\x00\x04\x04\x02\x01\x00\x01\x00\x01&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The compiler output:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
    movl    $4, %eax            # bound = lookup[0]
    cmpl    $13, %esi           # type &amp;gt;= 14
    ja      .L9
    movl    %esi, %esi
    leaq    .LC3(%rip), %rax
    movsbl  (%rax,%rsi), %eax   # bound = lookup[type]
.L9:
    cmpl    %eax, %edi          # len &amp;gt; bound
    seta    %al
    ret
&lt;/pre&gt;
&lt;p&gt;Conclusions:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;All in all, the trick with subscripting a string is neat. This is
what a compiler &lt;strong&gt;may&lt;/strong&gt; do underneath for &lt;tt class="docutils literal"&gt;switch&lt;/tt&gt; statements,
but it is not guaranteed.&lt;/li&gt;
&lt;li&gt;Clang uses conditional moves instead of jumps.&lt;/li&gt;
&lt;/ol&gt;
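&lt;p&gt;Rewrites of this kind are easy to get subtly wrong, so a brute-force comparison against the original expression is a cheap sanity check. Below is a minimal sketch that is not part of the original experiments. The names &lt;tt class="docutils literal"&gt;reference&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;candidate&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;mismatches&lt;/tt&gt; are made up for illustration, the threshold table is derived here as the largest &lt;tt class="docutils literal"&gt;len&lt;/tt&gt; for which &lt;tt class="docutils literal"&gt;len * coef &amp;gt; 4&lt;/tt&gt; is false, and only small values of &lt;tt class="docutils literal"&gt;len&lt;/tt&gt; are checked, so the multiplication never wraps around.&lt;/p&gt;

```cpp
// Reference: the original expression with the string-subscript lookup.
constexpr bool reference(unsigned len, unsigned type) {
    return len * ("11124811248484"[type >= 14 ? 0 : type] - '0') > 4;
}

// Candidate: compare len against a per-type threshold. The table is an
// assumption of this sketch: for each coefficient it holds the largest len
// for which len * coef > 4 is still false (1 -> 4, 2 -> 2, 4 -> 1, 8 -> 0).
constexpr bool candidate(unsigned len, unsigned type) {
    if (type >= 14) {
        type = 0;
    }
    const unsigned threshold =
        "\x04\x04\x04\x02\x01\x00\x04\x04\x02\x01\x00\x01\x00\x01"[type];
    return len > threshold;
}

// Count the inputs (len in 0..max_len, type in 0..max_type) on which the
// two functions disagree.
constexpr unsigned mismatches(unsigned max_len, unsigned max_type) {
    unsigned count = 0;
    for (unsigned type = 0; type != max_type + 1; ++type) {
        for (unsigned len = 0; len != max_len + 1; ++len) {
            if (reference(len, type) != candidate(len, type)) {
                ++count;
            }
        }
    }
    return count;
}

// Evaluated by the compiler; a wrong table entry breaks the build.
static_assert(mismatches(64, 20) == 0,
              "threshold form must match the reference");
```

&lt;p&gt;The check needs C++14 or newer, because of the loops inside the &lt;tt class="docutils literal"&gt;constexpr&lt;/tt&gt; function. It says nothing about very large &lt;tt class="docutils literal"&gt;len&lt;/tt&gt; values, where the unsigned multiplication in the reference wraps around.&lt;/p&gt;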
  </description>
 </item>
 <item>
  <title>Fast parsing HTTP verbs</title>
  <link>http://0x80.pl/notesen/2022-01-29-http-verb-parse.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2022-01-29-http-verb-parse.html</guid>
  <pubDate>Sat, 29 Jan 2022 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When we started to use the &lt;a class="reference external" href="https://www.boost.org/doc/libs/1_75_0/libs/beast/doc/html/index.html"&gt;boost::beast&lt;/a&gt; library at work, obviously I downloaded
its source code. I can't say I am good at navigating the Boost libraries; I
was just opening random files while waiting for &lt;a class="reference external" href="https://xkcd.com/303/"&gt;compilation completion&lt;/a&gt;. My
attention was caught by the procedure &lt;tt class="docutils literal"&gt;string_to_verb&lt;/tt&gt;, which translates an
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol"&gt;HTTP verb&lt;/a&gt; into a number, in fact an enum.
The most common verbs are &lt;tt class="docutils literal"&gt;GET&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;POST&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;PUT&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;DELETE&lt;/tt&gt;; however, in
total there are 33 verbs, and the longest ones have 12 characters.&lt;/p&gt;
&lt;p&gt;Why did the Boost implementation seem odd to me? It's basically a hardcoded
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Trie"&gt;Trie&lt;/a&gt;. There is a main &lt;tt class="docutils literal"&gt;switch&lt;/tt&gt; statement that selects a subtree based
on the &lt;strong&gt;first character&lt;/strong&gt;. Then:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;When there is exactly one verb starting with a given letter, only the equality
of the suffix is checked. For example &amp;quot;BIND&amp;quot; or &amp;quot;TRACE&amp;quot; are handled like
this.&lt;/li&gt;
&lt;li&gt;When more words start with the same letter but they don't share
any longer prefix, there's a plain ladder of &lt;tt class="docutils literal"&gt;if&lt;/tt&gt; statements checking equality of
the suffix. For example the pairs of verbs &amp;quot;LINK&amp;quot; and &amp;quot;LOCK&amp;quot; or &amp;quot;SEARCH&amp;quot;
and &amp;quot;SUBSCRIBE&amp;quot; are matched in this way.&lt;/li&gt;
&lt;li&gt;Otherwise, there are more switches resolving the subsequent letters.
This is how a group of verbs &amp;quot;PATCH&amp;quot;, &amp;quot;POST&amp;quot;, &amp;quot;PROPFIND&amp;quot;, &amp;quot;PROPPATCH&amp;quot;,
&amp;quot;PURGE&amp;quot; and &amp;quot;PUT&amp;quot; is matched.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks like a lot of branches, and we do know that branches might be a
source of performance problems. A mispredicted branch costs roughly 15-20 CPU
cycles on modern x86 cores. I thought it would be good to check whether other solutions would be better.&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The first solution is to match not a single-letter prefix, but to take
into account four or eight characters at a time. It costs exactly one
load and one comparison. The only drawback is that strings shorter than
4 chars have to be padded with zeros.&lt;/li&gt;
&lt;li&gt;Since the set of verbs is small and given statically, we may build
a minimal perfect hash function (MPH). &lt;a class="reference external" href="https://www.gnu.org/software/gperf/"&gt;GNU gperf&lt;/a&gt; can be used to
generate a C++ program implementing an MPH. The major drawback of
gperf is that it generates an &amp;quot;exists&amp;quot; function, while we need a &amp;quot;lookup&amp;quot;.
I had to manually edit the generated program.&lt;/li&gt;
&lt;/ol&gt;
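&lt;p&gt;The first idea can be sketched as follows. This is only an illustration under several assumptions: the enum &lt;tt class="docutils literal"&gt;Verb&lt;/tt&gt; and the functions &lt;tt class="docutils literal"&gt;pack4&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;parse_verb&lt;/tt&gt; are hypothetical names (not the boost::beast API), only four short verbs are handled, and the four characters are packed arithmetically rather than fetched with a single load, so the same function can also compute the &lt;tt class="docutils literal"&gt;case&lt;/tt&gt; labels at compile time. A tuned version would presumably replace the packing loop with one unaligned 32-bit load.&lt;/p&gt;

```cpp
// Hypothetical result type; a real parser would list all 33 verbs.
enum class Verb { unknown, get, put, post, head };

// Pack the first four characters of s into one 32-bit word. Positions past
// the end of the string are filled with zeros (the padding mentioned above).
constexpr unsigned pack4(const char* s, unsigned len) {
    unsigned word = 0;
    for (unsigned i = 0; i != 4; ++i) {
        const unsigned char byte = (len > i) ? s[i] : '\0';
        word = word * 256 + byte;
    }
    return word;
}

// One switch on the packed prefix replaces the per-letter trie walk.
// The length check rejects inputs that merely start with a known verb.
constexpr Verb parse_verb(const char* s, unsigned len) {
    switch (pack4(s, len)) {
        case pack4("GET", 3):  return len == 3 ? Verb::get : Verb::unknown;
        case pack4("PUT", 3):  return len == 3 ? Verb::put : Verb::unknown;
        case pack4("POST", 4): return len == 4 ? Verb::post : Verb::unknown;
        case pack4("HEAD", 4): return len == 4 ? Verb::head : Verb::unknown;
        default:               return Verb::unknown;
    }
}

static_assert(parse_verb("GET", 3) == Verb::get, "GET is recognised");
static_assert(parse_verb("POSTAL", 6) == Verb::unknown, "prefix only");
```

&lt;p&gt;Verbs longer than four characters would additionally need a comparison of the remaining suffix after the prefix matched; that part is left out of the sketch.&lt;/p&gt;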
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>DDoS-ed by a service running on AWS</title>
  <link>http://0x80.pl/notesen/2022-01-29-ddos.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2022-01-29-ddos.html</guid>
  <pubDate>Sat, 29 Jan 2022 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Every year in January I download the Apache logs for my home page.  I run a
simple script on the logs to see which articles were popular last year.
Sometimes I'm surprised that some old stuff is still being read.&lt;/p&gt;
&lt;p&gt;The script counts the total number of visits and the number of unique
visitors.  And this year something strange happened. Below is the head
of the list:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
total  uniq
651522 7864   /notesen/2021-12-22-test-and-clear-bit.html
9008   5142   /articles/simd-strfind.html
5478   3521   /notesen/2019-01-07-cpp-read-file.html
4414   3348   /notesen/2021-01-18-autovectorization-gcc-clang.html
3773   2008   /articles/simd-parsing-int-sequences.html
3270   1897   /notesen/2018-10-03-simd-index-of-min.html
2948   1870   /notesen/2019-02-02-autovectorization-gcc-clang.html
3539   1825   /articles/index.html
3843   1617   /articles/sse-popcount.html
1908   1287   /notesen/2021-03-11-any-word-is-zero.html
1982   1244   /notesen/2016-01-12-sse-base64-encoding.html
2202   1229   /articles/simd-byte-lookup.html
2060   1220   /notesen/2018-10-24-sse-sumbytes.html
2072   1202   /notesen/2021-02-02-all-bytes-in-reg-are-equal.html
1998   1103   /articles/avx512-ternary-functions.html
&lt;/pre&gt;
&lt;p&gt;The trend is clear: a thousand visits is normal, a few thousand means something
very popular. But &lt;a class="reference external" href="2021-12-22-test-and-clear-bit.html"&gt;the top article&lt;/a&gt; got visited more than 650 thousand times! It was
posted on &lt;a class="reference external" href="https://news.ycombinator.com/item?id=29654056"&gt;Hacker News&lt;/a&gt;, but never hit its main page (it gained only ~30
upvotes).&lt;/p&gt;
&lt;p&gt;Below is a daily number of visits:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
/notesen/2021-12-22-test-and-clear-bit.html
2021-12-22:     360
2021-12-23:   69468 =============================
2021-12-24:  141027 ============================================================
2021-12-25:  138335 ==========================================================
2021-12-26:  137856 ==========================================================
2021-12-27:  136672 ==========================================================
2021-12-28:   27736 ===========
2021-12-29:      37
2021-12-30:      17
2021-12-31:      14
total: 651522
&lt;/pre&gt;
&lt;p&gt;For four whole days, roughly 1.6 times per second, somebody performed an HTTP GET on this very
URL.&lt;/p&gt;
&lt;p&gt;I collected the IPs visiting the article. Almost all of the IPs are registered
by &lt;strong&gt;Amazon Web Services&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Below is the detailed histogram of visits per IP. An eye-catching pattern is
that the same number of visits is shared by whole groups of different IPs, and
almost all of these counts are multiples of 954. It looks like
somebody made a silly synchronisation mistake and let the workers pick the same
job over and over.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
count    IP
5724     44.201.39.42
5724     3.94.193.105
4770     44.201.33.147
3816     54.91.111.64
3816     54.84.75.180
3816     54.221.21.59
3816     54.196.220.147
3816     54.175.245.212
3816     44.192.16.27
3816     3.87.247.68
3816     3.80.213.10
3816     3.237.21.87
3816     3.236.190.167
3816     3.236.104.169
3816     3.235.181.101
3816     18.204.227.120
2862     54.87.77.182
2862     54.85.161.10
2862     54.221.43.86
2862     54.209.115.198
2862     54.164.131.116
2862     54.161.64.49
2862     52.91.89.169
2862     52.87.191.130
2862     52.21.158.202
2862     44.200.72.132
2862     44.200.26.192
2862     44.200.187.40
2862     44.200.143.177
2862     44.197.188.220
2862     3.90.183.142
2862     3.88.173.14
2862     3.85.125.218
2862     3.84.155.160
2862     34.239.146.120
2862     34.228.40.63
2862     3.239.96.83
2862     3.239.219.229
2862     3.236.241.160
2862     3.236.207.16
2862     3.236.141.1
2862     3.236.131.194
2862     3.236.119.211
2862     3.233.219.239
2862     3.230.127.79
2862     3.227.254.57
2862     3.227.235.211
2862     3.227.11.80
2862     3.226.76.105
2862     3.220.231.2
2862     3.216.79.148
2862     18.233.148.19
2862     18.215.147.137
2862     18.214.37.67
2862     18.207.252.143
2862     18.205.159.11
2862     100.26.61.210
2862     100.26.185.13
1909     3.236.27.213
1908     54.242.255.17
1908     54.242.228.33
1908     54.242.163.248
1908     54.236.36.193
1908     54.205.82.147
1908     54.175.4.154
1908     54.174.240.173
1908     54.174.150.40
1908     54.172.41.64
1908     54.167.8.206
1908     54.167.39.17
1908     54.166.44.219
1908     54.161.206.139
1908     54.160.234.43
1908     54.159.84.255
1908     54.147.117.226
1908     50.19.161.70
1908     44.201.62.226
1908     44.201.58.131
1908     44.200.86.209
1908     44.200.65.76
1908     44.200.40.176
1908     44.200.239.183
1908     44.200.230.223
1908     44.200.227.211
1908     44.200.205.85
1908     44.200.188.45
1908     44.200.173.245
1908     44.200.171.203
1908     44.199.232.199
1908     44.197.133.168
1908     44.195.38.186
1908     44.193.83.155
1908     44.192.82.83
1908     44.192.58.57
1908     44.192.38.32
1908     44.192.131.229
1908     3.95.182.93
1908     3.95.179.4
1908     3.92.88.127
1908     3.92.190.61
1908     3.91.63.171
1908     3.91.132.53
1908     3.89.228.171
1908     3.89.141.217
1908     3.88.7.101
1908     3.88.66.222
1908     3.88.40.185
1908     3.87.80.192
1908     3.86.216.188
1908     3.85.59.14
1908     3.85.224.21
1908     3.85.162.102
1908     3.85.13.203
1908     3.84.19.250
1908     3.83.133.228
1908     3.80.68.17
1908     35.175.110.3
1908     35.174.4.162
1908     35.173.1.105
1908     35.171.146.181
1908     35.170.198.64
1908     35.168.58.80
1908     35.168.23.194
1908     35.153.39.124
1908     34.238.193.239
1908     34.236.171.38
1908     34.236.144.152
1908     34.231.21.251
1908     34.230.79.25
1908     34.229.193.190
1908     34.228.141.13
1908     34.227.111.192
1908     34.205.4.165
1908     34.201.24.65
1908     34.201.21.89
1908     34.200.237.89
1908     3.239.98.150
1908     3.239.27.216
1908     3.239.236.9
1908     3.239.195.119
1908     3.238.30.227
1908     3.238.28.93
1908     3.238.239.14
1908     3.238.233.181
1908     3.238.149.106
1908     3.237.30.128
1908     3.237.28.58
1908     3.237.197.197
1908     3.236.88.202
1908     3.236.29.198
1908     3.236.172.81
1908     3.236.113.37
1908     3.235.232.232
1908     3.234.141.7
1908     3.231.60.92
1908     3.231.219.59
1908     3.231.208.60
1908     3.227.252.65
1908     3.221.155.113
1908     3.210.205.79
1908     3.209.12.110
1908     23.20.132.70
1908     184.73.60.65
1908     184.73.57.197
1908     18.234.236.146
1908     18.234.111.35
1908     18.213.118.61
1908     18.212.159.84
1908     18.209.237.95
1908     18.208.249.16
1908     18.208.209.143
1908     18.207.104.102
1908     18.207.100.231
1908     18.206.228.51
1908     18.205.238.85
1908     18.205.107.9
1908     18.204.35.218
1908     18.204.215.122
1908     18.204.17.190
1908     100.26.147.44
1908     100.24.45.26
1908     100.24.18.194
 955     44.200.107.134
 954     54.91.114.177
 954     54.90.80.165
 954     54.90.201.186
 954     54.90.124.119
 954     54.89.101.235
 954     54.88.23.164
 954     54.86.217.46
 954     54.86.115.15
 954     54.83.241.217
 954     54.82.90.177
 954     54.82.164.242
 954     54.242.227.105
 954     54.242.140.162
 954     54.236.217.111
 954     54.227.40.231
 954     54.226.62.158
 954     54.226.122.88
 954     54.226.0.34
 954     54.221.180.78
 954     54.211.152.69
 954     54.209.207.116
 954     54.209.189.47
 954     54.208.171.144
 954     54.175.53.23
 954     54.175.195.174
 954     54.174.44.66
 954     54.166.7.228
 954     54.166.159.234
 954     54.163.187.47
 954     54.160.0.105
 954     54.157.193.120
 954     54.152.146.248
 954     54.145.24.108
 954     54.145.223.209
 954     52.91.93.152
 954     52.91.61.246
 954     52.91.226.148
 954     52.91.196.215
 954     52.90.54.249
 954     52.90.3.62
 954     52.90.192.144
 954     52.87.250.143
 954     52.87.249.92
 954     52.72.121.162
 954     52.55.65.234
 954     52.23.167.2
 954     52.23.158.119
 954     52.207.62.132
 954     52.207.250.70
 954     52.207.228.111
 954     52.202.9.106
 954     50.19.41.32
 954     44.201.7.198
 954     44.201.5.78
 954     44.201.56.67
 954     44.201.34.223
 954     44.201.31.199
 954     44.201.3.1
 954     44.201.18.138
 954     44.201.10.172
 954     44.201.0.117
 954     44.200.53.196
 954     44.200.50.199
 954     44.200.40.24
 954     44.200.39.160
 954     44.200.32.176
 954     44.200.251.208
 954     44.200.228.186
 954     44.200.218.184
 954     44.200.195.52
 954     44.200.144.107
 954     44.200.143.106
 954     44.200.137.243
 954     44.200.112.188
 954     44.199.250.90
 954     44.199.236.15
 954     44.198.59.240
 954     44.197.250.137
 954     44.197.245.138
 954     44.193.2.184
 954     44.192.84.100
 954     44.192.76.80
 954     44.192.57.38
 954     3.94.202.83
 954     3.93.81.203
 954     3.92.45.247
 954     3.92.217.211
 954     3.92.207.122
 954     3.92.132.193
 954     3.91.55.178
 954     3.91.45.166
 954     3.91.219.210
 954     3.89.55.35
 954     3.89.223.189
 954     3.89.101.5
 954     3.88.173.142
 954     3.87.219.112
 954     3.87.201.92
 954     3.87.152.148
 954     3.85.241.232
 954     3.85.13.102
 954     3.84.30.31
 954     3.83.161.181
 954     3.82.139.5
 954     3.80.79.219
 954     3.80.5.219
 954     3.80.220.39
 954     3.80.118.234
 954     35.175.200.233
 954     35.173.48.63
 954     35.173.47.212
 954     35.173.250.34
 954     35.173.134.240
 954     35.172.219.110
 954     35.172.217.161
 954     35.170.73.138
 954     35.170.64.84
 954     34.239.175.219
 954     34.239.130.54
 954     34.239.125.189
 954     34.238.83.186
 954     34.238.247.118
 954     34.238.151.112
 954     34.236.143.245
 954     34.234.223.149
 954     34.230.80.248
 954     34.229.175.140
 954     34.229.172.210
 954     34.229.168.118
 954     34.228.143.140
 954     34.227.11.197
 954     34.227.110.19
 954     34.224.41.142
 954     34.224.21.214
 954     34.207.93.237
 954     34.204.200.137
 954     34.204.189.163
 954     34.204.186.183
 954     34.203.245.120
 954     34.203.200.103
 954     34.203.199.145
 954     34.203.14.139
 954     34.200.222.190
 954     3.239.79.175
 954     3.239.76.99
 954     3.239.51.238
 954     3.239.119.61
 954     3.239.116.62
 954     3.239.11.11
 954     3.238.249.185
 954     3.238.244.136
 954     3.238.23.31
 954     3.238.228.216
 954     3.238.221.170
 954     3.238.194.160
 954     3.238.177.225
 954     3.238.150.175
 954     3.238.13.205
 954     3.238.113.12
 954     3.237.238.142
 954     3.237.10.223
 954     3.237.10.2
 954     3.237.10.145
 954     3.236.80.163
 954     3.236.54.153
 954     3.236.16.119
 954     3.236.142.65
 954     3.236.121.4
 954     3.236.120.156
 954     3.236.115.165
 954     3.236.101.79
 954     3.235.226.5
 954     3.235.199.76
 954     3.235.193.114
 954     3.235.192.58
 954     3.235.182.96
 954     3.235.178.120
 954     3.234.244.141
 954     3.234.239.170
 954     3.233.241.219
 954     3.233.234.70
 954     3.231.21.251
 954     3.230.159.197
 954     3.227.242.133
 954     3.227.11.186
 954     3.224.147.147
 954     3.223.3.71
 954     3.223.195.119
 954     3.219.217.120
 954     3.215.22.124
 954     3.209.10.196
 954     3.208.15.210
 954     184.72.114.127
 954     18.234.227.246
 954     18.234.139.137
 954     18.232.54.171
 954     18.232.177.100
 954     18.215.63.197
 954     18.215.189.70
 954     18.215.163.85
 954     18.215.159.83
 954     18.214.26.230
 954     18.212.242.2
 954     18.209.59.182
 954     18.208.195.208
 954     18.208.148.189
 954     18.207.241.17
 954     18.207.160.105
 954     18.207.143.8
 954     18.207.129.148
 954     18.206.254.83
 954     18.206.16.17
 954     18.205.105.108
 954     107.22.155.204
 954     107.22.134.56
 954     107.21.166.150
 954     100.27.49.48
 954     100.27.21.180
 954     100.26.35.56
 954     100.26.185.73
 954     100.24.25.254
 954     100.24.21.216
 954     100.24.209.103
 240     34.226.188.26
 238     3.218.212.35
 228     158.69.195.206
 173     35.203.245.218
 124     136.243.70.68
&lt;/pre&gt;
&lt;p&gt;The end. :)&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>AVX512VBMI2 and packed varuint format</title>
  <link>http://0x80.pl/notesen/2022-01-24-avx512vbmi2-varuint.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2022-01-24-avx512vbmi2-varuint.html</guid>
  <pubDate>Mon, 24 Jan 2022 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The quite popular &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Variable-length_quantity"&gt;varuint format&lt;/a&gt; stores
an arbitrary integer as a sequence of bytes. Each byte stores seven bits
of information, and the most significant bit indicates whether the given byte is
the last one.&lt;/p&gt;
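&lt;p&gt;A scalar decoder for this format might look like the sketch below. It rests on assumptions that the text above does not fix: the least significant seven bits come first, a set most significant bit means that more bytes follow, the input is well-formed, and the value fits in 64 bits. The name &lt;tt class="docutils literal"&gt;decode_varuint&lt;/tt&gt; is hypothetical.&lt;/p&gt;

```cpp
// Decodes one varuint starting at input[0] and stores it in *value.
// Returns the number of bytes consumed. There is no bounds checking in
// this sketch: the input is assumed to be well-formed.
unsigned decode_varuint(const unsigned char* input, unsigned long long* value) {
    unsigned long long result = 0;
    unsigned long long scale = 1;  // 128^k for the k-th byte
    unsigned pos = 0;
    bool more = true;
    while (more) {
        const unsigned byte = input[pos];
        pos += 1;
        result += (byte % 128) * scale;  // the low seven bits are the payload
        scale *= 128;
        more = (byte >= 128);            // assumed: MSB set = more bytes follow
    }
    *value = result;
    return pos;
}
```

&lt;p&gt;Every byte has to be inspected before the position of the next number is known, which is the data dependency that makes this format slow to decode.&lt;/p&gt;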
&lt;p&gt;Decoding such numbers is quite easy, but it is not fast. This is the reason why
Google came up with their &lt;strong&gt;packed varint&lt;/strong&gt; format, which stores four numbers
(from 1 to 4 bytes each).  In this format control bits and data bits are
separated. The control bits are grouped into a single byte: four pairs of bits
encode the lengths of the four numbers.&lt;/p&gt;
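&lt;p&gt;For comparison, here is a scalar sketch of decoding one group in the packed format. The exact bit layout is an assumption of this example: pair number i (bits 2i and 2i+1 of the control byte) holds the length of the i-th number minus one, and the data bytes are stored in little-endian order. The name &lt;tt class="docutils literal"&gt;decode_group&lt;/tt&gt; is hypothetical.&lt;/p&gt;

```cpp
// Decodes one group: a control byte followed by 4..16 data bytes.
// Writes four 32-bit numbers to output[0..3] and returns the number of
// input bytes consumed, including the control byte.
unsigned decode_group(const unsigned char* input, unsigned* output) {
    const unsigned control = input[0];
    unsigned pos = 1;
    for (unsigned i = 0; i != 4; ++i) {
        // Assumed layout: two bits per number, storing length - 1.
        const unsigned length = (control >> (2 * i)) % 4 + 1;
        unsigned value = 0;
        unsigned scale = 1;  // 256^j for the j-th byte (little-endian)
        for (unsigned j = 0; j != length; ++j) {
            value += input[pos + j] * scale;
            scale *= 256;
        }
        output[i] = value;
        pos += length;
    }
    return pos;
}
```

&lt;p&gt;All four lengths are known as soon as the control byte is read, so the positions of the numbers no longer depend on the data bytes themselves.&lt;/p&gt;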
&lt;p&gt;Handling this format is way easier and is &lt;strong&gt;vectorizable&lt;/strong&gt;. The control
byte is used to fetch a shuffle pattern, which is then passed to &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt;.
Then, this single instruction expands 4-16 data bytes into four
32-bit numbers. Details are shown in the next section.&lt;/p&gt;
&lt;p&gt;The packed format can be slightly modified to utilize the instruction
&lt;tt class="docutils literal"&gt;VPEXPANDB&lt;/tt&gt; defined in AVX512VBMI2. The instruction expands bytes according
to an AVX512 write mask &amp;mdash; it's exactly what the &lt;strong&gt;packed varint&lt;/strong&gt; format
needs.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Parsing hex numbers with validation</title>
  <link>http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html</guid>
  <pubDate>Mon, 17 Jan 2022 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Non-validating parsing of a hexadecimal string can be vectorized, as shown in &lt;a class="reference external" href="2014-10-22-sse-convert-hex-to-ascii.html"&gt;Using SSE
to convert from hexadecimal ASCII to number&lt;/a&gt;. This text shows two approaches
that are both validating and vectorized.&lt;/p&gt;
&lt;p&gt;The parsing algorithms convert 16-byte inputs. They consist of two major parts:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Validate and convert ASCII digits and letters into
4-bit values (in the range 0..15) stored in separate bytes.&lt;/li&gt;
&lt;li&gt;Merge the 4-bit words into a contiguous 64-bit word.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Below are the valid ASCII hexadecimal digits:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0&lt;/tt&gt; ... &lt;tt class="docutils literal"&gt;9&lt;/tt&gt; &amp;mdash; 0x30 ... 0x39,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;a&lt;/tt&gt; ... &lt;tt class="docutils literal"&gt;f&lt;/tt&gt; &amp;mdash; 0x61 ... 0x66,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;A&lt;/tt&gt; ... &lt;tt class="docutils literal"&gt;F&lt;/tt&gt; &amp;mdash; 0x41 ... 0x46.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can note that:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Valid characters have higher nibbles equal to 0x3, 0x4 or 0x6.&lt;/li&gt;
&lt;li&gt;The value of a decimal digit is equal to its lower nibble.&lt;/li&gt;
&lt;li&gt;The value of a hex letter (a value above 9) is equal to its lower nibble plus 9.&lt;/li&gt;
&lt;/ol&gt;
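&lt;p&gt;The three observations can be turned into a scalar, validating conversion of a single character. This is only a sketch for reference, not one of the vectorized approaches; the name &lt;tt class="docutils literal"&gt;hex_digit_value&lt;/tt&gt; and the &lt;tt class="docutils literal"&gt;-1&lt;/tt&gt; error value are assumptions made for the example.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Converts one ASCII character into its 4-bit value (0..15).
// Returns -1 for characters that are not hexadecimal digits.
int hex_digit_value(uint8_t c) {
    const uint8_t hi = c >> 4;
    const uint8_t lo = c & 0x0f;

    switch (hi) {
        case 0x3: // '0' ... '9': the value is the lower nibble
            return (lo <= 9) ? lo : -1;
        case 0x4: // 'A' ... 'F'
        case 0x6: // 'a' ... 'f': the value is the lower nibble plus 9
            return (lo >= 1 && lo <= 6) ? lo + 9 : -1;
        default:  // any other higher nibble is invalid
            return -1;
    }
}

// Usage examples.
void hex_digit_value_examples() {
    assert(hex_digit_value('7') == 7);
    assert(hex_digit_value('c') == 12);
    assert(hex_digit_value('C') == 12);
    assert(hex_digit_value('g') == -1);
}
```

&lt;p&gt;Note that checking the higher nibble alone is not enough: the lower nibble must be range-checked too, otherwise characters like &lt;tt class="docutils literal"&gt;:&lt;/tt&gt; (0x3a) or &lt;tt class="docutils literal"&gt;G&lt;/tt&gt; (0x47) would pass.&lt;/p&gt;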
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Bit test and reset vs compilers</title>
  <link>http://0x80.pl/notesen/2021-12-22-test-and-clear-bit.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-12-22-test-and-clear-bit.html</guid>
  <pubDate>Wed, 22 Dec 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;The problem: there is a bitmask (16-, 32-, 64-bit). We need to scan it
backward, starting from the most significant bit. We finish at the least
significant set bit. For instance, we scan bits from 15 down to 11 in a 16-bit mask
&lt;tt class="docutils literal"&gt;0b1100'1000'0000'0000&lt;/tt&gt;.  Depending on the bit's value we perform different
tasks.&lt;/p&gt;
&lt;p&gt;Since x86 has the &lt;a class="reference external" href="https://hjlebbink.github.io/x86doc/./html/BTR.html"&gt;BTR&lt;/a&gt; instruction, it was obvious to me that I should use the
bit-test-and-reset idiom. Thus my initial code was straightforward.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;loop_v1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_and_clear_bit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;func_true&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;func_false&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The function &lt;tt class="docutils literal"&gt;test_and_clear_bit&lt;/tt&gt; wraps the &lt;strong&gt;BTR&lt;/strong&gt; instruction. Below is
an example of how this function behaves.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
uint16_t w = 0b0000'0010'1110'0100;
bool b;

// bit #1 is zero:
// - b == false
// - w == 0b0000'0010'1110'0100 &amp;mdash; unchanged
b = test_and_clear_bit(w, 1);

// bit #2 is set:
// - b == true
// - w == 0b0000'0010'1110'0000
b = test_and_clear_bit(w, 2);

// bit #9 is set:
// - b == true
// - w == 0b0000'0000'1110'0000
b = test_and_clear_bit(w, 9);
&lt;/pre&gt;
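&lt;p&gt;The wrapper itself is not shown here. A portable stand-in with the same observable behaviour could look like the sketch below; it is an assumption made for illustration (plain C++ instead of the instruction) and not necessarily how the original wrapper is written.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a BTR wrapper (an assumption, not the author's code):
// returns the previous value of bit `index` and clears that bit in `word`.
template <typename T>
bool test_and_clear_bit(T& word, unsigned index) {
    assert(index < 8 * sizeof(T));

    const T bit = T(T(1) << index);
    const bool was_set = (word & bit) != 0;
    word = T(word & ~bit);   // a no-op if the bit was already zero

    return was_set;
}
```

&lt;p&gt;Whether a compiler turns such code into a single &lt;strong&gt;BTR&lt;/strong&gt; depends on the compiler and its version.&lt;/p&gt;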
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Conversion uint32 into decimal without division nor multiplication</title>
  <link>http://0x80.pl/notesen/2021-11-23-uint-to-ascii.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-11-23-uint-to-ascii.html</guid>
  <pubDate>Tue, 23 Nov 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is a follow-up to Daniel Lemire's &lt;a class="reference external" href="https://lemire.me/blog/2021/11/18/converting-integers-to-fix-digit-representations-quickly/"&gt;Converting integers to fix-digit representations quickly&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The method described here &lt;strong&gt;does not&lt;/strong&gt; use multiplication or division
instructions.  It relies only on addition and byte-level comparison.
It's weird and slow, though.&lt;/p&gt;
&lt;p&gt;The main idea is to work directly on the BCD representation. First,
we pre-calculate BCD images (16-byte arrays) for individual bytes of
a 32-bit number. The following values are considered:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x00&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x01&lt;/tt&gt;, ..., &lt;tt class="docutils literal"&gt;0xff&lt;/tt&gt;;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x0000&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x0100&lt;/tt&gt;, ..., &lt;tt class="docutils literal"&gt;0xff00&lt;/tt&gt;;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x000000&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x010000&lt;/tt&gt;, ..., &lt;tt class="docutils literal"&gt;0xff0000&lt;/tt&gt;;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x00000000&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x01000000&lt;/tt&gt;, ..., &lt;tt class="docutils literal"&gt;0xff000000&lt;/tt&gt;;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then, when converting a number, we fetch the BCD images and add
them together.&lt;/p&gt;
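&lt;p&gt;The first two steps can be sketched as follows. The helper names &lt;tt class="docutils literal"&gt;make_bcd_image&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;sum_bcd_images&lt;/tt&gt; are hypothetical, and for brevity the images are computed on the fly rather than fetched from pre-calculated tables; division appears only in the part that would be done ahead of time.&lt;/p&gt;

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using BcdImage = std::array<uint8_t, 16>;

// Builds the BCD image of `value`: one decimal digit per byte, with the
// least significant digit at index 15. Division is used here only because
// the images would be pre-calculated once, outside of the conversion itself.
BcdImage make_bcd_image(uint32_t value) {
    BcdImage image{};
    for (int i = 15; i >= 0 && value != 0; i--) {
        image[i] = uint8_t(value % 10);
        value /= 10;
    }
    return image;
}

// Adds the images of the four bytes of `x`. The result may contain
// bytes greater than 9 and still needs the fix-up step.
BcdImage sum_bcd_images(uint32_t x) {
    BcdImage sum{};
    for (int k = 0; k < 4; k++) {
        // the k-th byte of x, kept at its original position
        const uint32_t part = x & (uint32_t(0xff) << (8 * k));
        const BcdImage image = make_bcd_image(part);
        for (int i = 0; i < 16; i++) {
            sum[i] = uint8_t(sum[i] + image[i]);
            assert(sum[i] <= 4 * 9); // four digits, each at most 9
        }
    }
    return sum;
}
```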
&lt;p&gt;The next step of the algorithm is fixing up the sum, as some bytes
might be greater than 9. After this step all bytes are in the range
0 .. 9.&lt;/p&gt;
&lt;p&gt;The last step is a simple conversion into ASCII by adding &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;ord('0')&lt;/span&gt; = 0x30&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;A sample scalar implementation is shown below.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;itoa_divless&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;union&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// 1. obtain BCD representation of all bytes
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lookup0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lookup1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lookup2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte2&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lookup2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte2&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lookup3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte3&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;qword&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lookup3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte3&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// 2. fixup BCD &amp;amp; store result
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sc"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;div class="section" id="example"&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;Let &lt;tt class="docutils literal"&gt;x = 20211121 = 0x13465b1&lt;/tt&gt;. We split the value into separate
bytes &lt;tt class="docutils literal"&gt;0xb1&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x65&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x34&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;0x01&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;Then, for each byte, we fetch the appropriate BCD image:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0xb1&lt;/tt&gt; =&amp;gt; &lt;tt class="docutils literal"&gt;[ 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 7| 7]&lt;/tt&gt; (177)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x65&lt;/tt&gt; =&amp;gt; &lt;tt class="docutils literal"&gt;[ 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 2| 5| 8| 5| 6]&lt;/tt&gt; (101 * 256 = 25'856)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x34&lt;/tt&gt; =&amp;gt; &lt;tt class="docutils literal"&gt;[ 0| 0| 0| 0| 0| 0| 0| 0| 0| 3| 4| 0| 7| 8| 7| 2]&lt;/tt&gt; (52 * 65536 = 3'407'872)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0x01&lt;/tt&gt; =&amp;gt; &lt;tt class="docutils literal"&gt;[ 0| 0| 0| 0| 0| 0| 0| 0| 1| 6| 7| 7| 7| 2| 1| 6]&lt;/tt&gt; (1 * 16777216 = 16'777'216)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This step requires six 64-bit loads. For bytes #0 and #1 the higher 8 bytes
of the BCD image are always zero. For bytes #2 and #3 all 16 bytes of the images are
required.&lt;/p&gt;
&lt;p&gt;Once we have all the BCD images, we simply add them together. We have four
inputs in which no byte exceeds 9, so each byte of the sum is at most 36 and no
carry crosses a byte boundary. Thus it's safe to perform 64-bit additions.&lt;/p&gt;
&lt;p&gt;For our sample data we have:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
[ 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 7| 7]
[ 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 2| 5| 8| 5| 6] +
[ 0| 0| 0| 0| 0| 0| 0| 0| 0| 3| 4| 0| 7| 8| 7| 2] +
[ 0| 0| 0| 0| 0| 0| 0| 0| 1| 6| 7| 7| 7| 2| 1| 6] +
--------------------------------------------------
[ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|11| 9|19|19|20|21]
&lt;/pre&gt;
&lt;p&gt;Some bytes are greater than 9, so we need to fix them up:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
t0 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|11| 9|19|19|20|21]
t1 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|11| 9|19|19|22| 1] -- carry 2 from #0 to #1
t2 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|11| 9|19|21| 2| 1] -- carry 2 from #1 to #2
t3 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|11| 9|21| 1| 2| 1] -- carry 2 from #2 to #3
t4 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|11|11| 1| 1| 2| 1] -- carry 2 from #3 to #4
t5 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1| 9|12| 1| 1| 1| 2| 1] -- carry 1 from #4 to #5
t6 = [ 0| 0| 0| 0| 0| 0| 0| 0| 1|10| 2| 1| 1| 1| 2| 1] -- carry 1 from #5 to #6
t7 = [ 0| 0| 0| 0| 0| 0| 0| 0| 2| 0| 2| 1| 1| 1| 2| 1] -- carry 1 from #6 to #7
&lt;/pre&gt;
&lt;p&gt;The carry value between bytes never exceeds 3. Since we have four inputs, the
maximum value of the byte at position 0 is 4*9 = 36, which yields a carry of at most 3.
Any subsequent byte, including the incoming carry, cannot be greater than
4*9 + 3 = 39, so its carry is also at most 3.&lt;/p&gt;
&lt;p&gt;This means that the carry value can be obtained with a series of comparisons.&lt;/p&gt;
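&lt;p&gt;A minimal sketch of that idea: each comparison against 10, 20 and 30 contributes 0 or 1, and the sum of these results is the carry. The function name &lt;tt class="docutils literal"&gt;fixup_bcd&lt;/tt&gt; is made up for this example, and the code stores plain digits rather than ASCII.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Fixes up a sum of BCD images in place: on input each byte is at most 36
// (four digits added together), on output each byte is a digit 0..9.
// Index 15 holds the least significant digit.
void fixup_bcd(uint8_t bytes[16]) {
    uint8_t carry = 0;
    for (int i = 15; i >= 0; i--) {
        const uint8_t byte = uint8_t(bytes[i] + carry);
        assert(byte <= 39);

        // Each comparison yields 0 or 1, so their sum is the carry (0..3).
        carry = uint8_t((byte >= 10) + (byte >= 20) + (byte >= 30));

        // Subtract 10 once per satisfied comparison; no multiplication needed.
        uint8_t digit = byte;
        if (byte >= 10) digit -= 10;
        if (byte >= 20) digit -= 10;
        if (byte >= 30) digit -= 10;

        bytes[i] = digit;
    }
}
```

&lt;p&gt;Applied to the sum from the example, the procedure yields the digits of 20211121.&lt;/p&gt;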
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>How to check if any word is zero</title>
  <link>http://0x80.pl/notesen/2021-03-11-any-word-is-zero.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-03-11-any-word-is-zero.html</guid>
  <pubDate>Thu, 11 Mar 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;We want to check if at least one integer value is zero. In other words,
we are evaluating the following expression (for reasonably small N):&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;xN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;For a three-argument expression GCC 10.2 produces (with &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-O3&lt;/span&gt;&lt;/tt&gt; switch)
the following x86 code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
testl   %esi, %esi  ; esi == 0?
sete    %al         ; al = 1 if the above condition is true, 0 otherwise
testl   %edx, %edx
sete    %dl
orl     %edx, %eax  ; plain `or`
testl   %edi, %edi
sete    %dl
orl     %edx, %eax
&lt;/pre&gt;
&lt;p&gt;Clang 11.0 generates almost identical code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
testl   %edi, %edi
sete    %al
testl   %esi, %esi
sete    %cl
orb     %al, %cl
testl   %edx, %edx
sete    %al
orb     %cl, %al
&lt;/pre&gt;
&lt;p&gt;ICC and MSVC add some jumps, but generally they also use the same basic building block:
&lt;tt class="docutils literal"&gt;test&lt;/tt&gt; followed by a conditional set &lt;tt class="docutils literal"&gt;sete&lt;/tt&gt;.&lt;/p&gt;
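&lt;p&gt;For reference, the three-argument case can be written as a complete function. The sketch below is an assumption about the shape of the compiled source (the function names and the &lt;tt class="docutils literal"&gt;uint32_t&lt;/tt&gt; argument type are chosen for illustration). The second variant spells out what the listings do: every comparison is evaluated and the 0/1 results are merged with a plain bitwise or.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// The expression from the text, for three arguments.
bool any_zero(uint32_t x0, uint32_t x1, uint32_t x2) {
    return x0 == 0 or x1 == 0 or x2 == 0;
}

// The same result without short-circuit evaluation: each comparison
// produces 0 or 1 (like test + sete) and the results are or-ed together.
bool any_zero_bitwise(uint32_t x0, uint32_t x1, uint32_t x2) {
    return (int(x0 == 0) | int(x1 == 0) | int(x2 == 0)) != 0;
}

// Usage examples: both variants always agree.
void any_zero_examples() {
    assert(any_zero(1, 0, 3) and any_zero_bitwise(1, 0, 3));
    assert(not any_zero(1, 2, 3) and not any_zero_bitwise(1, 2, 3));
}
```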
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Autovectorization status in MSVC in 2021</title>
  <link>http://0x80.pl/notesen/2021-02-17-autovectorization-msvc.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-02-17-autovectorization-msvc.html</guid>
  <pubDate>Wed, 17 Feb 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This year I re-checked the status of autovectorization in the &lt;a class="reference external" href="2021-01-18-autovectorization-gcc-clang.html"&gt;latest GCC and
Clang&lt;/a&gt;. MSVC was omitted because I didn't see any new version of this
compiler on &lt;a class="reference external" href="https://godbolt.org"&gt;godbolt&lt;/a&gt;. More precisely, I didn't believe that there was a
difference between versions 19.28 and 19.16 (the latter was tested &lt;a class="reference external" href="2019-02-02-autovectorization-gcc-clang.html"&gt;two years
ago&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://twitter.com/HaroldAptroot/status/1351316270101233664"&gt;Harold Aptroot&lt;/a&gt; pointed out that there are some differences in code
generated for the AVX2 target. Additionally, in 2020 MSVC started to
support &lt;a class="reference external" href="https://devblogs.microsoft.com/cppblog/avx-512-auto-vectorization-in-msvc"&gt;AVX512&lt;/a&gt;. These two reasons forced me to recheck MSVC too.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Counting byte in byte stream with AVX512BW instructions</title>
  <link>http://0x80.pl/notesen/2021-02-14-avx512bw-count-bytes.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-02-14-avx512bw-count-bytes.html</guid>
  <pubDate>Sun, 14 Feb 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is a follow-up to the article &lt;a class="reference external" href="2019-01-29-simd-count-byte.html"&gt;SIMDized counting byte in byte stream&lt;/a&gt;. In
this article only AVX512BW variants are discussed. Performance is analyzed only
for the Skylake-X CPU.&lt;/p&gt;
&lt;p&gt;We want to count how many times a given byte appears in a byte stream.
The following C++ code shows the naive algorithm:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;countbyte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>How to detect if all bytes in SIMD register are the same?</title>
  <link>http://0x80.pl/notesen/2021-02-02-all-bytes-in-reg-are-equal.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-02-02-all-bytes-in-reg-are-equal.html</guid>
  <pubDate>Tue, 02 Feb 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;We want to detect whether all bytes stored in a SIMD register (SSE, AVX2, AVX512,
Neon etc.) are the same. For example, take a byte layout in an SSE register like
this:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
[42|42|42|42|42|42|42|42|42|42|42|42|42|42|42|42]
&lt;/pre&gt;
&lt;p&gt;Here all bytes are equal to 42. In the following layout, not all bytes have the same
value:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
[42|42|42|42|42|42|42|42|42|42|42|42|03|42|42|42]
&lt;/pre&gt;
&lt;p&gt;An algorithm that uses only basic vector operations:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;broadcast the 0th byte of register into a new vector:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
input     = [42|42|42|42|42|42|42|42|42|42|42|42|03|42|42|42]
broadcast = [42|42|42|42|42|42|42|42|42|42|42|42|42|42|42|42]
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;perform a vector-wide compare for equality:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
cmp       = (input == broadcast)
          = [ff|ff|ff|ff|ff|ff|ff|ff|ff|ff|ff|ff|00|ff|ff|ff]
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;check whether all elements of &lt;tt class="docutils literal"&gt;cmp&lt;/tt&gt; vector are &amp;quot;true&amp;quot;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Depending on the SIMD flavour, these simple steps may not be that simple.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Autovectorization status in GCC &amp; Clang in 2021</title>
  <link>http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clang.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clang.html</guid>
  <pubDate>Mon, 18 Jan 2021 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Almost two years ago I did an &lt;a class="reference external" href="2019-02-02-autovectorization-gcc-clang.html"&gt;in-depth comparison&lt;/a&gt; of autovectorization
abilities of popular compilers: &lt;strong&gt;GCC&lt;/strong&gt;, &lt;strong&gt;clang&lt;/strong&gt;, &lt;strong&gt;ICC&lt;/strong&gt; and &lt;strong&gt;MSVC&lt;/strong&gt;.  In
this text only &lt;strong&gt;GCC&lt;/strong&gt; and &lt;strong&gt;clang&lt;/strong&gt; are considered, as I don't see any new
versions of ICC or MSVC on &lt;a class="reference external" href="https://godbolt.org"&gt;godbolt.org&lt;/a&gt; (drop me a line if I got lost in the
multitude of compiler versions). &lt;strong&gt;Update 2021-02-17&lt;/strong&gt;: &lt;a class="reference external" href="2021-02-17-autovectorization-msvc.html"&gt;MSVC 19.28 status&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The question is: &amp;quot;what has changed between GCC 9 and GCC 10, and between clang
9 and clang 11?&amp;quot;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Use AVX512 to calculate binomial coefficient</title>
  <link>http://0x80.pl/notesen/2020-03-21-avx512-binomial-coefficient.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2020-03-21-avx512-binomial-coefficient.html</guid>
  <pubDate>Sat, 21 Mar 2020 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The value of &lt;a class="reference external" href="http://en.wikipedia.org/wiki/binomial_coefficient"&gt;binomial coefficient&lt;/a&gt; &lt;span class="math"&gt;&lt;i&gt;k&lt;/i&gt;&lt;i&gt;over&lt;/i&gt;&lt;i&gt;n&lt;/i&gt;&lt;/span&gt; can be expressed as
&lt;span class="math"&gt;&lt;i&gt;n&lt;/i&gt;!/(&lt;i&gt;k&lt;/i&gt;! &amp;sdot; (&lt;i&gt;n&lt;/i&gt; &amp;minus; &lt;i&gt;k&lt;/i&gt;)!)&lt;/span&gt;. This can be simplified to
&lt;span class="math"&gt;[(&lt;i&gt;n&lt;/i&gt; &amp;minus; &lt;i&gt;p&lt;/i&gt;) &amp;sdot; (&lt;i&gt;n&lt;/i&gt; &amp;minus; &lt;i&gt;p&lt;/i&gt; + 1) &amp;sdot; &amp;hellip; &amp;sdot; (&lt;i&gt;n&lt;/i&gt;)]/&lt;i&gt;p&lt;/i&gt;!&lt;/span&gt;, where &lt;span class="math"&gt;&lt;i&gt;p&lt;/i&gt; = max(&lt;i&gt;k&lt;/i&gt;, &lt;i&gt;n&lt;/i&gt; &amp;minus; &lt;i&gt;k&lt;/i&gt;)&lt;/span&gt;.
Daniel Lemire showed in article &lt;a class="reference external" href="https://lemire.me/blog/2020/02/26/fast-divisionless-computation-of-binomial-coefficients/"&gt;Fast divisionless computation of binomial
coefficients&lt;/a&gt; how efficiently evaluate the latter expression.&lt;/p&gt;
&lt;p&gt;Can SIMD instructions be utilized to get binomial coefficients? I wish I could
write &amp;quot;yes, they can&amp;quot;, but the answer is not optimistic.  SIMD instructions can
be used to perform several multiplications in parallel; however, it's a
quest to properly set up registers and deal with the varying number of arguments,
which depends on &lt;span class="math"&gt;&lt;i&gt;n&lt;/i&gt;&lt;/span&gt; and &lt;span class="math"&gt;&lt;i&gt;k&lt;/i&gt;&lt;/span&gt;.  That's the first option, which I
didn't check.&lt;/p&gt;
&lt;p&gt;What I checked and described here is utilization of AVX512 with a different
numeric system. An important fact is that calculation of binomial coefficients
involves only multiplication and (integer) division.&lt;/p&gt;
&lt;p&gt;We know that every natural number can be &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Factorization"&gt;factorized&lt;/a&gt;,
i.e. represented as a product of prime numbers raised to some non-negative
powers.  For instance, 20200320 is equal to &lt;span class="math"&gt;2&lt;sup&gt;7&lt;/sup&gt; &amp;sdot; 3&lt;sup&gt;3&lt;/sup&gt; &amp;sdot; 5 &amp;sdot; 7 &amp;sdot; 167&lt;/span&gt;,
similarly 7812 is &lt;span class="math"&gt;2&lt;sup&gt;2&lt;/sup&gt; &amp;sdot; 3&lt;sup&gt;2&lt;/sup&gt; &amp;sdot; 7 &amp;sdot; 31&lt;/span&gt;.  Now a product of two factored
numbers can be calculated by adding the exponents of corresponding primes. For
example &lt;span class="math"&gt;20200320 &amp;sdot; 7812 = 2&lt;sup&gt;(7 + 2)&lt;/sup&gt; &amp;sdot; 3&lt;sup&gt;(3 + 2)&lt;/sup&gt; &amp;sdot; 5 &amp;sdot; 7&lt;sup&gt;(1 + 1)&lt;/sup&gt; &amp;sdot; 31 &amp;sdot; 167 = 2&lt;sup&gt;9&lt;/sup&gt; &amp;sdot; 3&lt;sup&gt;5&lt;/sup&gt; &amp;sdot; 5 &amp;sdot; 7&lt;sup&gt;2&lt;/sup&gt; &amp;sdot; 31 &amp;sdot; 167&lt;/span&gt;.  Likewise, division requires subtraction
of exponents.&lt;/p&gt;
&lt;p&gt;The core idea is to represent input integers as &lt;strong&gt;factored numbers&lt;/strong&gt;. We set up
a fixed number of primes and then operate only on exponents. We make sure
that exponents fit in 8-bit values, thus in the case of AVX512 a factored number
is a product of up to 64 prime powers.  With such a representation, the vector addition and subtraction
instructions are sufficient to calculate binomial coefficients.&lt;/p&gt;
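&lt;p&gt;A small worked example may help here. The numbers and the four-prime
basis below are picked purely for illustration and are not taken from the
implementation. To get &lt;span class="math"&gt;10!/(3! &amp;sdot; 7!) = (8 &amp;sdot; 9 &amp;sdot; 10)/3!&lt;/span&gt;, each number
is written as a vector of exponents, the vectors of the numerator are added, and
the vector of the denominator is subtracted:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
primes        = [2, 3, 5, 7]

8             = [3, 0, 0, 0]
9             = [0, 2, 0, 0]
10            = [1, 0, 1, 0]
8 * 9 * 10    = [4, 2, 1, 0]    (sum of the three vectors above)

3! = 6        = [1, 1, 0, 0]

result        = [3, 1, 1, 0]    (difference) = 2^3 * 3 * 5 = 120
&lt;/pre&gt;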
&lt;p&gt;So far everything might sound nice, but unfortunately there are some serious
problems:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Factorization is not cheap, so we must cache exponents. Since we must
cache intermediate values anyway, why not pre-calculate binomial coefficients?&lt;/li&gt;
&lt;li&gt;Getting back from the factorized representation is also not cheap. It
requires multiplication and calculating integer powers (also multiplication).&lt;/li&gt;
&lt;li&gt;To my surprise, the range of inputs covered by the SIMD-ized algorithm
is smaller than that of a scalar version. I had supposed that cancellation of
primes present in both the numerator and the denominator would be beneficial.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Use AVX512 Galois field affine transformation for bit shuffling</title>
  <link>http://0x80.pl/notesen/2020-01-19-avx512-galois-field-for-bit-shuffling.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2020-01-19-avx512-galois-field-for-bit-shuffling.html</guid>
  <pubDate>Sun, 19 Jan 2020 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This article was inspired by Geoff Langdale's text &lt;a class="reference external" href="https://branchfree.org/2019/05/29/why-ice-lake-is-important-a-bit-bashers-perspective/"&gt;Why Ice Lake is
Important (a bit-basher’s perspective)&lt;/a&gt;. I'm also grateful to &lt;a class="reference external" href="https://github.com/zwegner/"&gt;Zach Wegner&lt;/a&gt;
for an inspiring discussion.&lt;/p&gt;
&lt;p&gt;The AVX512 extension GFNI adds three instructions related to &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Finite_field"&gt;Galois fields&lt;/a&gt;:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;VGF2P8MULB&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_gf2p8mul_epi8&lt;/tt&gt;) &amp;mdash; multiply 8-bit integers
in the field &lt;span class="math"&gt;GF(2&lt;sup&gt;8&lt;/sup&gt;)&lt;/span&gt;;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;VGF2P8AFFINEINVQB&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_gf2p8affineinv_epi64_epi8&lt;/tt&gt;) &amp;mdash; inverse
affine transformation in the field &lt;span class="math"&gt;GF(2&lt;sup&gt;8&lt;/sup&gt;)&lt;/span&gt;;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;VGF2P8AFFINEQB&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_gf2p8affine_epi64_epi8&lt;/tt&gt;) &amp;mdash; affine transformation
in the field &lt;span class="math"&gt;GF(2&lt;sup&gt;8&lt;/sup&gt;)&lt;/span&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While the first two instructions perform quite specific algorithms, the third
one is the most generic and promising.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512 8-bit positional population count procedure</title>
  <link>http://0x80.pl/notesen/2019-12-31-avx512-pospopcnt-8bit.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-12-31-avx512-pospopcnt-8bit.html</guid>
  <pubDate>Tue, 31 Dec 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Positional population count&lt;/strong&gt; (pospopcnt) is a procedure that calculates a
&lt;em&gt;histogram&lt;/em&gt; of the bits set at each position of a byte, word, double word, etc.
over a larger stream of such entities.&lt;/p&gt;
&lt;p&gt;This is a very naive implementation of 8-bit pospopcnt:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;pospopcnt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x01&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;For example, [3, 3, 2, 4, 1, 2, 3, 1] is the pospopcnt result for the
following five bytes (the first element counts the least significant bit):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0b0110'1001&lt;/tt&gt;,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0b1100'1000&lt;/tt&gt;,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0b0000'1111&lt;/tt&gt;,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0b0001'0011&lt;/tt&gt;,&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;0b0110'1110&lt;/tt&gt;.&lt;/li&gt;
&lt;/ul&gt;
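&lt;p&gt;The result can be verified by hand. In the sketch below, which is only an
illustration and not part of any procedure, each input byte is written out bit
by bit, from bit 7 down to bit 0, and the columns are summed. Reading the sums
from bit 0 upwards gives the histogram quoted above:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
bit position   7  6  5  4  3  2  1  0

0b0110'1001    0  1  1  0  1  0  0  1
0b1100'1000    1  1  0  0  1  0  0  0
0b0000'1111    0  0  0  0  1  1  1  1
0b0001'0011    0  0  0  1  0  0  1  1
0b0110'1110    0  1  1  0  1  1  1  0

sum            1  3  2  1  4  2  3  3
&lt;/pre&gt;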
&lt;p&gt;A few 16-bit pospopcnt procedures were described in the article &lt;a class="reference external" href="https://arxiv.org/abs/1911.02696"&gt;Efficient
Computation of Positional Population Counts Using SIMD Instructions&lt;/a&gt; by
&lt;a class="reference external" href="http://marcusklarqvist.com/"&gt;Marcus D. R. Klarqvist&lt;/a&gt;, &lt;a class="reference external" href="http://lemire.me"&gt;Daniel Lemire&lt;/a&gt; and me. The &lt;a class="reference external" href="https://github.com/mklarqvist/positional-popcount"&gt;library&lt;/a&gt; maintained
by Marcus provides pospopcnt procedures for 8, 16 and 32-bit words.&lt;/p&gt;
&lt;p&gt;This article shows a neat utilization of the &lt;strong&gt;SAD&lt;/strong&gt; (sum of absolute differences) instruction to calculate
8-bit pospopcnt. It's not the fastest one, but I really like the whole
algorithm.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMDization of switch statements</title>
  <link>http://0x80.pl/notesen/2019-02-03-simd-switch-implementation.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-02-03-simd-switch-implementation.html</guid>
  <pubDate>Sun, 03 Feb 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;There are two main purposes of a &lt;tt class="docutils literal"&gt;switch&lt;/tt&gt; statement:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Express a simple function that translates from one set of values into another,
like getting a string representation of enum values.&lt;/li&gt;
&lt;li&gt;Dispatch different code sequences based on the switch argument, as an alternative
to an &amp;quot;if-ladder&amp;quot;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Compilers usually transform &lt;tt class="docutils literal"&gt;switch&lt;/tt&gt; statements using the following approaches:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Binary search on constant keys: a compiler emits a series of comparisons
and jumps interleaved with &lt;tt class="docutils literal"&gt;case&lt;/tt&gt; code.&lt;/li&gt;
&lt;li&gt;When the key values span a small range (even a non-contiguous one), the values
are used to index a lookup table of jump targets.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Of course, a compiler might optimize some specific cases. For some neat
examples, look at &lt;a class="reference external" href="https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-switch-conversion.cc"&gt;tree-switch-conversion.cc&lt;/a&gt; from GCC.&lt;/p&gt;
&lt;p&gt;However, switch statements can also be expressed with SIMD instructions. Vector
instructions are used to translate the argument value into the ordinal
number of the matching case. Then, this index is used either to 1) fetch a value from a
precalculated table, or 2) get a jump target (the address) which is used to
dispatch code fragments.&lt;/p&gt;
&lt;div class="section" id="example-of-binary-search-code"&gt;
&lt;span id="binary-search"&gt;&lt;/span&gt;&lt;h2&gt;Example of binary search code&lt;/h2&gt;
&lt;p&gt;The following C++ code is compiled by GCC 7.3.0 into a binary search procedure.&lt;/p&gt;
&lt;table width="100%"&gt;
&lt;tr&gt;&lt;td width="50%" valign="top"&gt;&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Colour&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;RED&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00ff0000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;GREEN&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x0000ff00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;BLUE&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x000000ff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;WHITE&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00ffffff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;GRAY0&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00333333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;GRAY1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00aaaaaa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;GRAY2&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00dddddd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;BLACK&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00000000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Colour&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;RED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;GREEN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;BLUE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;WHITE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;GRAY0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;GRAY1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;GRAY2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Colour&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;BLACK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/td&gt;&lt;td width="50%" valign="top"&gt;&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nl"&gt;_Z7palette6Colour:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3355443&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;je&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L3&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;jbe&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nv"&gt;.L24&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;14540253&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;je&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L21&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;jbe&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nv"&gt;.L25&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;xorl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;16711680&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;je&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L21&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;16777215&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;jne&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nv"&gt;.L2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L21:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L24:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;je&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L21&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;65280&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;je&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L21&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;testl&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;je&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L26&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L2:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L25:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;11184810&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;jne&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nv"&gt;.L2&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L3:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L26:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;
&lt;/pre&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;
&lt;div class="section" id="example-of-jump-lookup"&gt;
&lt;span id="target-lookup"&gt;&lt;/span&gt;&lt;h2&gt;Example of jump lookup&lt;/h2&gt;
&lt;table width="100%"&gt;
&lt;tr&gt;&lt;td width="50%" valign="top"&gt;&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;cstdio&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;code_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;zero&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;one&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;two&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;three&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;four&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;five&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/td&gt;&lt;td width="50%" valign="top"&gt;&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nl"&gt;.LC0:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;zero&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.LC1:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;one&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.LC2:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;two&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.LC3:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;three&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.LC4:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;four&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.LC5:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;five&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.text&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;_Z10code_blocki:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;cmpl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ja&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;.L13&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdx&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;edi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;subq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movslq&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;jmp&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;*%&lt;/span&gt;&lt;span class="nb"&gt;rax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.section&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nv"&gt;.rodata&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L4:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L10&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L10&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L6&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L10&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L10&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L7&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L8&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L10&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L10&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.long&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nv"&gt;.L9&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nv"&gt;.L4&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;.text&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L9:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.LC5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;puts&amp;#64;PLT&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L11:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L3:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.LC0&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;puts&amp;#64;PLT&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L5:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.LC1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;puts&amp;#64;PLT&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L6:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.LC2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;puts&amp;#64;PLT&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L7:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.LC3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;puts&amp;#64;PLT&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L8:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;leaq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;.LC4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nv"&gt;rip&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rdi&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;puts&amp;#64;PLT&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;addq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rsp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L10:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;jmp&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nv"&gt;.L11&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;.L13:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;movl&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;eax&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;ret&lt;/span&gt;
&lt;/pre&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Malloc internal memory fragmentation footprint</title>
  <link>http://0x80.pl/notesen/2019-02-03-malloc-internal-memory-fragmentation.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-02-03-malloc-internal-memory-fragmentation.html</guid>
  <pubDate>Sun, 03 Feb 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When we allocate memory using &lt;tt class="docutils literal"&gt;malloc&lt;/tt&gt; or another interface, such as operator
&lt;tt class="docutils literal"&gt;new&lt;/tt&gt; in C++, we get a pointer and a promise that nobody else will acquire
the same memory area. But underneath, more memory is needed. For instance,
the allocator has to keep the size of each block to implement &lt;tt class="docutils literal"&gt;realloc&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;More importantly, the allocator is unlikely to allocate the exact number of
bytes we requested; rather, it will round the size up.  &lt;strong&gt;Internal memory
fragmentation&lt;/strong&gt; is what we call this extra memory, which is allocated but
not legally available to the program.&lt;/p&gt;
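&lt;p&gt;To make the rounding concrete, here is a minimal sketch of a toy model. It does not describe how any particular &lt;tt class="docutils literal"&gt;malloc&lt;/tt&gt; works: the 16-byte granule and the names &lt;tt class="docutils literal"&gt;GRANULE&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;round_up&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;internal_waste&lt;/tt&gt; are assumptions made up for illustration. Real allocators use their own size classes and also keep per-block metadata.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
```cpp
// Hypothetical model: the allocator hands out blocks whose size is
// a multiple of GRANULE bytes. The value 16 is an assumption, not a
// property of any specific malloc implementation.
constexpr unsigned long GRANULE = 16;

// Smallest multiple of GRANULE that can hold `requested` bytes.
constexpr unsigned long round_up(unsigned long requested) {
    return (requested + GRANULE - 1) / GRANULE * GRANULE;
}

// Bytes that are allocated but were never requested by the program.
constexpr unsigned long internal_waste(unsigned long requested) {
    return round_up(requested) - requested;
}
```
&lt;/pre&gt;
&lt;p&gt;In this model a request for 20 bytes occupies 32 bytes, so 12 bytes are lost, while a request for exactly 32 bytes loses nothing.&lt;/p&gt;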
&lt;p&gt;This text shows the effects of internal memory fragmentation in several synthetic
scenarios.  However, memory fragmentation is a real problem. I came across
this issue when I was developing the &lt;a class="reference external" href="https://pypi.org/project/pyahocorasick/"&gt;pyahocorasick&lt;/a&gt; module, which is built around
multi-way trees, i.e. &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Trie"&gt;tries&lt;/a&gt;. To build a trie we need to allocate a large
number of quite small fixed-size nodes associated with (also rather small)
edge structures of variable size. It turned out that although the theoretical size
of all structures was smaller than the memory available in my system,
&lt;tt class="docutils literal"&gt;malloc&lt;/tt&gt; reported that no memory was available.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Auto-vectorization status in GCC, Clang, ICC and MSVC</title>
  <link>http://0x80.pl/notesen/2019-02-02-autovectorization-gcc-clang.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-02-02-autovectorization-gcc-clang.html</guid>
  <pubDate>Sat, 02 Feb 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Update &lt;strong&gt;2021-02-17&lt;/strong&gt;: please check the newest status of &lt;a class="reference external" href="2021-01-18-autovectorization-gcc-clang.html"&gt;GCC &amp;amp; Clang&lt;/a&gt;, and &lt;a class="reference external" href="2021-02-17-autovectorization-msvc.html"&gt;MSVC&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The term &amp;quot;auto-vectorization&amp;quot; means the ability of a compiler to &lt;strong&gt;transform&lt;/strong&gt;
a given scalar algorithm into a vectorized one, i.e. to express the dominating
operation(s) using SIMD instructions.&lt;/p&gt;
&lt;p&gt;I'm sure nobody would dispute that auto-vectorization is as important as the scalar
optimizations performed by compilers. Nowadays vectorization &lt;strong&gt;is&lt;/strong&gt; a must.&lt;/p&gt;
&lt;p&gt;From what I can gather, one of the first commonly used C/C++ compilers that
automatically vectorized code was the Intel compiler; over time, luckily for us,
&lt;strong&gt;GCC&lt;/strong&gt; and &lt;strong&gt;Clang&lt;/strong&gt; caught up.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMDized counting byte in byte stream</title>
  <link>http://0x80.pl/notesen/2019-01-29-simd-count-byte.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-01-29-simd-count-byte.html</guid>
  <pubDate>Tue, 29 Jan 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;span id="scalar"&gt;&lt;/span&gt;&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;We want to count how many times a given byte occurs in a byte stream;
here is a C program doing this:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;stddef.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;countbyte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;div class="section" id="gcc-vectorization"&gt;
&lt;span id="gcc"&gt;&lt;/span&gt;&lt;h2&gt;GCC vectorization&lt;/h2&gt;
&lt;p&gt;The current GCC vectorization algorithm is able to handle the presented
procedure, but its output is not optimal. For an AVX2 target GCC keeps &lt;strong&gt;four
64-bit sub-counters&lt;/strong&gt; which are updated in every iteration and then added
together at the end.&lt;/p&gt;
&lt;p&gt;In a single iteration 32 bytes are loaded and then compared with a vector filled
with the given byte:&lt;/p&gt;
&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nf"&gt;vpcmpeqb&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;rax&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The result of this operation is a vector of bytes, each of them either all ones (0xff)
or zero. Then the 0xff bytes are converted to the value 1 by a bitwise AND operation:&lt;/p&gt;
&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nf"&gt;vpand&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;; ymm6 = _mm256_set1_epi8(0x01)&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The last step of the algorithm is widening these 8-bit numbers to 64-bit values
and updating the mentioned counters.&lt;/p&gt;
&lt;p&gt;The conversion is done gradually: first from 8-bit numbers to 16-bit ones, then
from 16-bit to 32-bit, and finally from 32-bit to 64-bit values.  This
conversion must be done for each 4-byte subarray of the input register, which means
that the following code is repeated &lt;strong&gt;eight times&lt;/strong&gt;:&lt;/p&gt;
&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nf"&gt;vpmovzxbw&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;; 8-bit -&amp;gt; 16-bit numbers&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;; 16-bit -&amp;gt; 32-bit numbers&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;; 32-bit -&amp;gt; 64-bit numbers&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;; update the sub-counters&lt;/span&gt;
&lt;/pre&gt;
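&lt;p&gt;The data flow described above can be modelled in a few lines of Python. This is
only a sketch of the scheme, not of the generated code: the name
&lt;tt class="docutils literal"&gt;countbyte_model&lt;/tt&gt; is made up, and the mapping of byte positions to
sub-counters (position modulo four) is an assumption about how the widened
values line up with the four 64-bit lanes.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def countbyte_model(data, b, lanes=4, width=32):
    """Count occurrences of byte value `b` using `lanes` separate sub-counters."""
    counters = [0] * lanes
    full = len(data) - len(data) % width      # bytes covered by whole 32-byte chunks
    for start in range(0, full, width):
        chunk = data[start:start + width]
        for i, value in enumerate(chunk):
            # compare (0 or 1), widen, and add to the lane this byte falls into
            counters[i % lanes] += int(value == b)
    # bytes that do not fill a whole chunk are handled by a scalar loop
    tail = sum(1 for value in data[full:] if value == b)
    return sum(counters) + tail


sample = b"abracadabra" * 10
print(countbyte_model(sample, ord("a")), sample.count(b"a"))
```
&lt;/pre&gt;
&lt;p&gt;Both printed numbers should agree, since the sub-counters are only summed at
the very end.&lt;/p&gt;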
&lt;p&gt;Below is the full disassembly from GCC 9 (snapshot from 2019-01-17, options:
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-O3&lt;/span&gt; &lt;span class="pre"&gt;-march=cannonlake&lt;/span&gt;&lt;/tt&gt;).&lt;/p&gt;
&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nf"&gt;vpmovzxbw&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxbw&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxdq&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddq&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm5&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>std::function and overloaded functions</title>
  <link>http://0x80.pl/notesen/2019-01-23-std-function-problems.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-01-23-std-function-problems.html</guid>
  <pubDate>Wed, 23 Jan 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="overloaded-functions"&gt;
&lt;h1&gt;Overloaded functions&lt;/h1&gt;
&lt;p&gt;Let us consider this simple use case, where we want to invoke a function:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;invoke_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Everything works fine when a callback is a lambda.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;){};&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;invoke_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;When we have &lt;strong&gt;overloaded functions&lt;/strong&gt;, there are problems, as the compiler
is not able to select a proper overload.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;overloaded_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;overloaded_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;// ...
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;invoke_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;overloaded_function&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The above code does not compile; the error reported by GCC is:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
error: cannot resolve overloaded function ‘overloaded_function’
based on conversion to type ‘std::function&amp;lt;void(int, int)&amp;gt;’
&lt;/pre&gt;
&lt;p&gt;To make this compile we need to insert a weird cast to &lt;strong&gt;pointer to
function&lt;/strong&gt;. As far as I know, &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::function&lt;/span&gt;&lt;/tt&gt; does not expose
any member type that could be used for this, so retyping the whole function type is required.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;invoke_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;overloaded_function&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;//                           ^^^           ^^^&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;This is pretty verbose.&lt;/p&gt;
&lt;p&gt;I ended up using plain pointers to functions in the signature of
&lt;tt class="docutils literal"&gt;invoke_callback&lt;/tt&gt;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>pyahocorasick stabilisation story</title>
  <link>http://0x80.pl/notesen/2019-01-08-pyahocorasick-debugging.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-01-08-pyahocorasick-debugging.html</guid>
  <pubDate>Tue, 08 Jan 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/WojciechMula/pyahocorasick"&gt;pyahocorasick&lt;/a&gt; is a Python module I started in 2011. At that time I was
interested in &lt;a class="reference external" href="http://en.wikipedia.org/wiki/String_(computer_science)#String_processing_algorithms"&gt;stringology&lt;/a&gt;
and &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Aho-Corasick%20algorithm"&gt;the Aho-Corasick algorithm&lt;/a&gt; appeared to be
quite challenging. It was a sufficient reason to program it. However, I also
decided that the result shouldn't be another proof-of-concept that nobody
except me would use. Since I like Python, I chose the form of a C extension,
which nicely combines a friendly Python API with an efficient C implementation.&lt;/p&gt;
&lt;p&gt;Fast forward: the module has gained a few users worldwide. Maybe this
is not the most popular package on &lt;a class="reference external" href="https://pypi.org/project/pyahocorasick/"&gt;PyPI&lt;/a&gt;, but people keep installing it.
Many users contributed to the code, documentation and infrastructure, or
reported bugs and helped with debugging. &lt;a class="reference external" href="https://github.com/pombredanne"&gt;Philippe Ombredanne&lt;/a&gt; helped
a lot with different aspects of development; without him the project
wouldn't be so great.&lt;/p&gt;
&lt;p&gt;This text is a result of recent work on stabilising the module, which
was triggered by fixing &lt;a class="reference external" href="https://github.com/WojciechMula/pyahocorasick/issues/50"&gt;a long-standing bug&lt;/a&gt;. The bug had been driving me
crazy for more than a year. I want to show what means were used to
eliminate this and many other bugs, and also how the code quality
improved as a side effect. I hope some of you will find inspiration
or a solution here.&lt;/p&gt;
&lt;div class="section" id="the-bug"&gt;
&lt;h2&gt;The bug&lt;/h2&gt;
&lt;p&gt;Before we start I have to describe the bug, so that nobody repeats my stupid
mistake.&lt;/p&gt;
&lt;p&gt;The bug was caused by misuse of the Python C API function &lt;a class="reference external" href="https://docs.python.org/3/c-api/arg.html"&gt;Py_BuildValue&lt;/a&gt;, which is
used by the pickling mechanism. Basically, pickling used to be done as a simple
memory dump: the module created a &lt;strong&gt;single memory area&lt;/strong&gt; filled with some
binary data.&lt;/p&gt;
&lt;p&gt;The invocation &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;Py_BuildValue(&amp;quot;y#&amp;quot;,&lt;/span&gt; ptr, size)&lt;/tt&gt; constructs a &lt;a class="reference external" href="https://docs.python.org/3/c-api/bytes.html"&gt;bytes&lt;/a&gt; object
holding a copy of the memory pointed to by &lt;tt class="docutils literal"&gt;ptr&lt;/tt&gt;, of the given size. The problem is
that this format string expects the size to be of type &lt;tt class="docutils literal"&gt;int&lt;/tt&gt;. I wrongly assumed that on
64-bit machines &lt;tt class="docutils literal"&gt;int&lt;/tt&gt; is a 64-bit number. That's not true: on common 64-bit platforms &lt;tt class="docutils literal"&gt;int&lt;/tt&gt; has only 32
bits.  Because of that, when the size of the memory area exceeded 2GB,
strange things happened, as shown in the table below.&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="50%" /&gt;
&lt;col width="50%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;64-bit size&lt;/th&gt;
&lt;th class="head"&gt;outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;range 0 to &lt;tt class="docutils literal"&gt;0x7fffffff&lt;/tt&gt; (up to 2GB)&lt;/td&gt;
&lt;td&gt;no errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;from &lt;tt class="docutils literal"&gt;0x80000000&lt;/tt&gt; to &lt;tt class="docutils literal"&gt;0xffffffff&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;int&lt;/tt&gt; is negative, an empty buffer is created, but &lt;strong&gt;no error is reported&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4GB or more, with bit 31 (the sign bit of the 32-bit value) equal to zero&lt;/td&gt;
&lt;td&gt;a buffer of &lt;tt class="docutils literal"&gt;size &amp;amp; 0x7fffffff&lt;/tt&gt; bytes is created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4GB or more, with bit 31 (the sign bit of the 32-bit value) equal to one&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;int&lt;/tt&gt; is negative, an empty buffer is created, but &lt;strong&gt;no error is reported&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
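&lt;p&gt;The rows of the table can be reproduced with a small Python sketch. It only models the narrowing of a 64-bit size to a 32-bit &lt;tt class="docutils literal"&gt;int&lt;/tt&gt;; the helper name &lt;tt class="docutils literal"&gt;as_c_int&lt;/tt&gt; is made up for this illustration, and &lt;tt class="docutils literal"&gt;ctypes.c_int32&lt;/tt&gt; stands in for the C type. It does not call &lt;tt class="docutils literal"&gt;Py_BuildValue&lt;/tt&gt; itself.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
import ctypes


def as_c_int(size):
    """Return the value a 32-bit C int holds after being handed `size`.

    Hypothetical helper for illustration only. ctypes integer types do no
    overflow checking, so the bits above the low 32 are silently dropped.
    """
    return ctypes.c_int32(size).value


# One sample size per row of the table.
samples = [0x7fffffff, 0x80000000, 0xffffffff, 0x100000005, 0x180000005]
narrowed = {hex(size): as_c_int(size) for size in samples}

# 0x80000000 turns into a negative number, while 0x100000005 turns into 5:
# only the low 31 bits survive when the sign bit of the 32-bit value is zero.
```
&lt;/pre&gt;
&lt;p&gt;A negative result corresponds to the rows where an empty buffer was created without any error.&lt;/p&gt;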
&lt;p&gt;The solution was pretty simple: a huge memory area is split into several
smaller regions, and the list of these regions is pickled. The size of a single
region is limited to a few megabytes, so it never gets close to the 2GB
boundary (although the data in total can still be larger than 2GB).&lt;/p&gt;
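&lt;p&gt;The idea can be sketched in pure Python as follows. This is not the module's actual C code: the names &lt;tt class="docutils literal"&gt;split_regions&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;join_regions&lt;/tt&gt; and the 4 MB limit are assumptions made for the illustration.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
import pickle

REGION_SIZE = 4 * 1024 * 1024  # assumed limit: 4 MB per region


def split_regions(data, region_size=REGION_SIZE):
    """Cut one big buffer into a list of bytes objects of bounded size."""
    view = memoryview(data)  # slicing a memoryview avoids extra copies
    return [bytes(view[start:start + region_size])
            for start in range(0, len(view), region_size)]


def join_regions(regions):
    """Rebuild the original buffer from the list of regions."""
    return b"".join(regions)


# Round trip with a tiny region size so the example stays small.
payload = bytes(range(256)) * 10
regions = split_regions(payload, region_size=1000)
restored = join_regions(pickle.loads(pickle.dumps(regions)))
```
&lt;/pre&gt;
&lt;p&gt;Each pickled item stays far below the 2GB boundary, no matter how large the whole payload is.&lt;/p&gt;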
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>C++ --- how to read a file into a string</title>
  <link>http://0x80.pl/notesen/2019-01-07-cpp-read-file.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-01-07-cpp-read-file.html</guid>
  <pubDate>Mon, 07 Jan 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;To my surprise, I quite often need to read the whole contents of a file into
a string. Sometimes it's easier to generate data with an external program,
sometimes unit tests need to read a generated file, etc.&lt;/p&gt;
&lt;p&gt;The signature of such a loader function is:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;load_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;In C++, the official way to deal with files is streams. There are at least two
methods to load data into a string:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;use &lt;a class="reference external" href="https://en.cppreference.com/w/cpp/iterator/istreambuf_iterator"&gt;streambuf iterators&lt;/a&gt; to construct a string,&lt;/li&gt;
&lt;li&gt;use an auxiliary &lt;a class="reference external" href="https://en.cppreference.com/w/cpp/io/basic_stringstream"&gt;string stream&lt;/a&gt; to handle a &lt;a class="reference external" href="https://en.cppreference.com/w/cpp/io/basic_streambuf"&gt;streambuf&lt;/a&gt; object.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;load1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ifstream&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;istreambuf_iterator&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;istreambuf_iterator&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;load2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ostringstream&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ifstream&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rdbuf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Both functions do their jobs, but are reportedly slow. Since C++ still exposes
the good old C API, i.e. &lt;tt class="docutils literal"&gt;fread&lt;/tt&gt; (libc) or &lt;tt class="docutils literal"&gt;read&lt;/tt&gt; (POSIX), I compared the performance
of all solutions. Although the C solution using &lt;tt class="docutils literal"&gt;fread&lt;/tt&gt;, shown
below, is much longer than its C++ counterparts, its performance is significantly
better than anything based on C++ streams. The performance of &lt;tt class="docutils literal"&gt;read&lt;/tt&gt; is almost
identical to that of &lt;tt class="docutils literal"&gt;fread&lt;/tt&gt;; the differences are negligible.&lt;/p&gt;
&lt;p&gt;Of course, the performance boost depends heavily on the machine type, hard drive,
etc., but clearly the overhead of C++ streams is really huge compared to libc
and POSIX calls.&lt;/p&gt;
&lt;p&gt;Implementation using &lt;tt class="docutils literal"&gt;fread&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;load3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;close_file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[](&lt;/span&gt;&lt;span class="kt"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;&lt;span class="n"&gt;fclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);};&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;holder&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unique_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;decltype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_str&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;close_file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;holder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;holder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// in C++17 following lines can be folded into std::filesystem::file_size invocation
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fseek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SEEK_END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ftell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fseek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SEEK_SET&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// C++17 defines .data() which returns a non-const pointer
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;fread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Implementation using &lt;tt class="docutils literal"&gt;read&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;load4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_str&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;O_RDONLY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;stat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
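    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fstat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;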

    &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;st_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;st_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512VBMI --- remove spaces from text</title>
  <link>http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html</guid>
  <pubDate>Sat, 05 Jan 2019 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Removing spaces from a string is a common task in text processing. Instead
of removing a single character, we often want to remove all white-space
characters, punctuation characters, etc.&lt;/p&gt;
&lt;p&gt;In this article I show an AVX512VBMI implementation. The algorithm is
not faster than the scalar code in all cases, but in many it can be
significantly faster and, more importantly, it performs better in tests
on real-world texts.&lt;/p&gt;
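&lt;p&gt;For reference, the task itself can be stated as a few lines of scalar Python. This only pins down what is meant by removing spaces; it is not the scalar baseline measured in the article, and the name &lt;tt class="docutils literal"&gt;despace&lt;/tt&gt; and the set of removed characters are assumptions.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
WHITESPACE = b" \t\r\n"  # assumed set of characters to drop


def despace(data, unwanted=WHITESPACE):
    """Return a copy of `data` without the bytes listed in `unwanted`."""
    # Iterating over bytes yields integers; `in` checks membership by value.
    return bytes(byte for byte in data if byte not in unwanted)


cleaned = despace(b"a b\tc\r\n")
```
&lt;/pre&gt;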
&lt;p&gt;Update &lt;strong&gt;2019-01-13&lt;/strong&gt;: this article popped up on Twitter and &lt;a class="reference external" href="https://news.ycombinator.com/item?id=18834741"&gt;Hacker News&lt;/a&gt;,
where it provoked an incredibly fruitful discussion. &lt;a class="reference external" href="https://twitter.com/trav_downs/status/1081760561082392576"&gt;Travis Downs&lt;/a&gt;
noticed that branch mispredictions can be compensated for by unrolling the loop
in the initial algorithm. &lt;strong&gt;Zach Wegner&lt;/strong&gt; came up with an algorithm
which works in constant time by using the &lt;strong&gt;PEXT&lt;/strong&gt; instruction. &lt;strong&gt;Michael
Howard&lt;/strong&gt; shared his scalar and AVX2 variants of the &amp;quot;despacing&amp;quot; procedure.
I'd like to thank all the people who discussed this topic, both on HN and Twitter.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Python --- file modification time perils</title>
  <link>http://0x80.pl/notesen/2018-11-24-python-stat-float.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-11-24-python-stat-float.html</guid>
  <pubDate>Sat, 24 Nov 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;I needed to copy a file to another directory whenever it changed. The easiest
way to do this is to check the file's modification time, a number of seconds
since the epoch:&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_mtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;st_mtime&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;It's not the most reliable way, but in my case it was good enough, until
I noticed that sometimes files didn't get updated.&lt;/p&gt;
&lt;p&gt;I figured out that for some reason &lt;tt class="docutils literal"&gt;get_mtime&lt;/tt&gt; returned an integer value, while
the file system was able to deal with a resolution finer than a second; the system
command &lt;tt class="docutils literal"&gt;stat&lt;/tt&gt; printed microseconds.&lt;/p&gt;
&lt;p&gt;The culprit was a setting of the &lt;tt class="docutils literal"&gt;os&lt;/tt&gt; module. It is possible to select at
runtime, via &lt;tt class="docutils literal"&gt;os.stat_float_times(boolean)&lt;/tt&gt;, whether the module reports times
as integers or floats. For an unknown reason my Python installation defaulted
to integers. Thus it was possible to have two files with different modification
times having the same integer parts.&lt;/p&gt;
&lt;p&gt;Finally, I forced float times everywhere (&lt;tt class="docutils literal"&gt;os.stat_float_times(True)&lt;/tt&gt;).&lt;/p&gt;
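&lt;p&gt;On current Python versions the situation is different: &lt;tt class="docutils literal"&gt;os.stat_float_times&lt;/tt&gt; was removed in Python 3.7, and &lt;tt class="docutils literal"&gt;st_mtime_ns&lt;/tt&gt; reports the time as an integer number of nanoseconds. The sketch below assumes such a version; the helper names and the two sample timestamps are made up to show why whole seconds are not enough.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
import os

NS_PER_SECOND = 1_000_000_000


def get_mtime_ns(path):
    """Modification time as integer nanoseconds since the epoch."""
    return os.stat(path).st_mtime_ns


def whole_seconds(mtime_ns):
    """The value an integer-seconds timestamp would have reported."""
    return mtime_ns // NS_PER_SECOND


# Two hypothetical timestamps half a second apart, within the same second.
older = 1_543_057_200 * NS_PER_SECOND + 250_000_000
newer = 1_543_057_200 * NS_PER_SECOND + 750_000_000

same_when_truncated = whole_seconds(older) == whole_seconds(newer)
differ_at_full_resolution = older != newer
```
&lt;/pre&gt;
&lt;p&gt;Comparing the nanosecond values tells the two modifications apart, while comparing whole seconds does not.&lt;/p&gt;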
  </description>
 </item>
 <item>
  <title>SIMDized sum of all bytes in the array --- part 2: signed bytes</title>
  <link>http://0x80.pl/notesen/2018-11-18-sse-sumbytes-part2.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-11-18-sse-sumbytes-part2.html</guid>
  <pubDate>Sun, 18 Nov 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction-1"&gt;
&lt;span id="introduction"&gt;&lt;/span&gt;&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is the second part of &lt;a class="reference external" href="2018-10-24-sse-sumbytes.html"&gt;SIMDized sum of all bytes in the array&lt;/a&gt;. The
first part describes summing unsigned bytes; here we're going to experiment
with summing signed bytes.&lt;/p&gt;
&lt;p&gt;The baseline C implementation is:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sumbytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;And the C++ implementation:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;numeric&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sumbytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;accumulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;div class="section" id="algorithm-used-by-gcc"&gt;
&lt;h2&gt;Algorithm used by GCC&lt;/h2&gt;
&lt;p&gt;Below is the assembly code of the main loop compiled for Skylake by
GCC 7.3.0 with flags &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-O3&lt;/span&gt; &lt;span class="pre"&gt;-march=skylake&lt;/span&gt;&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nf"&gt;vpmovsxbw&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovsxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovsxbw&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovsxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovsxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovsxwd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The approach used here by GCC is exactly the same as for summing unsigned
bytes. There are multiple 32-bit sub-accumulators in a single register, i.e.
eight in the case of AVX2 (four in SSE code), which are added together at
the end, forming the scalar result.&lt;/p&gt;
&lt;p&gt;To get 32-bit values, there is a two-step cast from &lt;tt class="docutils literal"&gt;int8_t&lt;/tt&gt; to &lt;tt class="docutils literal"&gt;int32_t&lt;/tt&gt;:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;First extend a vector of &lt;tt class="docutils literal"&gt;int8_t&lt;/tt&gt; into two vectors of &lt;tt class="docutils literal"&gt;int16_t&lt;/tt&gt;
numbers (&lt;tt class="docutils literal"&gt;VPMOVSXBW&lt;/tt&gt;).&lt;/li&gt;
&lt;li&gt;Then, get four vectors of &lt;tt class="docutils literal"&gt;int32_t&lt;/tt&gt; from the vectors obtained in the
previous step (&lt;tt class="docutils literal"&gt;VPMOVSXWD&lt;/tt&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The cast instruction &lt;tt class="docutils literal"&gt;VPMOVSX&lt;/tt&gt; extends the lower part of a register, in this
case the lower half. This is the reason why extractions of the upper halves
(&lt;tt class="docutils literal"&gt;VEXTRACTI128&lt;/tt&gt;) are needed.&lt;/p&gt;
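&lt;p&gt;The scheme can be modelled with plain scalar code. The sketch below is only an
illustration, not the code GCC emits: the name &lt;tt class="docutils literal"&gt;sumbytes_signed_model&lt;/tt&gt; is made up,
and the eight-element array stands in for the eight 32-bit lanes of one AVX2
register.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Scalar model (an illustration, not GCC's output) of summing signed bytes
// with eight independent 32-bit sub-accumulators.
int32_t sumbytes_signed_model(const int8_t* array, size_t size) {
    assert(array != nullptr || size == 0);

    int32_t acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    size_t i = 0;

    // Main loop: each lane accumulates every eighth byte.
    for (/**/; i + 8 &lt;= size; i += 8) {
        for (size_t lane = 0; lane &lt; 8; lane++) {
            const int16_t word  = int16_t(array[i + lane]); // step 1, cf. VPMOVSXBW
            const int32_t dword = int32_t(word);            // step 2, cf. VPMOVSXWD
            acc[lane] += dword;                             // cf. VPADDD
        }
    }

    // The sub-accumulators are added together at the end...
    int32_t result = 0;
    for (size_t lane = 0; lane &lt; 8; lane++)
        result += acc[lane];

    // ...and the remaining 0..7 bytes are handled one by one.
    for (/**/; i &lt; size; i++)
        result += int32_t(array[i]);

    return result;
}
&lt;/pre&gt;
&lt;p&gt;The two casts are redundant in scalar code; they are kept separate only to
mirror the two sign-extending instructions.&lt;/p&gt;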
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>How many uops are there?</title>
  <link>http://0x80.pl/notesen/2018-11-18-skylakex-uops.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-11-18-skylakex-uops.html</guid>
  <pubDate>Sun, 18 Nov 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The current Intel CPUs translate instructions into so-called uops (micro-ops),
which are a kind of internal ISA. For simple operations, like addition or
bitops, the translation is one-to-one, i.e. there's exactly one uop for a given
instruction.  When an instruction gets a memory argument we usually get
two uops: one for the load, another for the actual operation; please note that most
instructions have many forms, usually &lt;tt class="docutils literal"&gt;reg, reg&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;reg, mem&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;I was curious how it looks in the case of SIMD instructions. I used data from
&lt;a class="reference external" href="http://uops.info"&gt;uops.info&lt;/a&gt;, and picked the recent SkylakeX architecture; results are from
IACA 3.0.&lt;/p&gt;
&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;About 90% of SIMD instructions translate into just one or two uops, i.e.
directly or almost directly. It means they're likely supported by dedicated circuits.&lt;/li&gt;
&lt;li&gt;AVX512 &lt;tt class="docutils literal"&gt;scatter&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;gather&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;conflict&lt;/tt&gt; instructions seem not to
be backed by hardware.&lt;/li&gt;
&lt;li&gt;STNI is very dead.&lt;/li&gt;
&lt;/ul&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="25%" /&gt;
&lt;col width="25%" /&gt;
&lt;col width="25%" /&gt;
&lt;col width="25%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;uops&lt;/th&gt;
&lt;th class="head"&gt;number of CPU instructions&lt;/th&gt;
&lt;th class="head"&gt;% of all instructions&lt;/th&gt;
&lt;th class="head"&gt;CPU instructions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.17&lt;/td&gt;
&lt;td&gt;vgatherdps, vgatherdps, vgatherqps, vpgatherdd, vpgatherdd, vpgatherqq, vpscatterqd, vscatterqps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1752&lt;/td&gt;
&lt;td&gt;36.17&lt;/td&gt;
&lt;td&gt;&lt;em&gt;too many, omitted&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2616&lt;/td&gt;
&lt;td&gt;54.00&lt;/td&gt;
&lt;td&gt;&lt;em&gt;too many, omitted&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;234&lt;/td&gt;
&lt;td&gt;4.83&lt;/td&gt;
&lt;td&gt;&lt;em&gt;too many, omitted&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;140&lt;/td&gt;
&lt;td&gt;2.89&lt;/td&gt;
&lt;td&gt;&lt;em&gt;too many, omitted&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;dpps, vdpps, vdpps, vgatherdpd, vgatherdpd, vgatherdpd, vgatherdpd,
vgatherdpd, vgatherdps, vgatherdps, vgatherdps, vgatherqpd, vgatherqpd,
vgatherqpd, vgatherqpd, vgatherqpd, vgatherqps, vgatherqps, vgatherqps,
vgatherqps, vmovdqu8, vpgatherdd, vpgatherdd, vpgatherdd, vpgatherdq,
vpgatherdq, vpgatherdq, vpgatherdq, vpgatherdq, vpgatherqd, vpgatherqd,
vpgatherqd, vpgatherqd, vpgatherqd, vpgatherqq, vpgatherqq, vpgatherqq,
vpgatherqq&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;vpscatterdq, vpscatterqq, vscatterdpd, vscatterqpd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0.21&lt;/td&gt;
&lt;td&gt;pcmpestri, rex.w pcmpestri, rex.w vpcmpestri, vpcmpestri, vpconflictd,
vpconflictd, vpscatterqd, vpscatterqd, vscatterqps, vscatterqps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.17&lt;/td&gt;
&lt;td&gt;pcmpestri, pcmpestrm, rex.w pcmpestri, rex.w pcmpestrm, rex.w vpcmpestri,
rex.w vpcmpestrm, vpcmpestri, vpcmpestrm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;pcmpestrm, rex.w pcmpestrm, rex.w vpcmpestrm, vpcmpestrm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;td&gt;vaeskeygenassist, vaeskeygenassist, vpscatterdq, vpscatterqq, vscatterdpd, vscatterqpd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;vpscatterdd, vscatterdps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;vpconflictd, vpconflictq&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;vpconflictq, vpconflictq&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;vzeroall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;vpscatterdq, vpscatterqq, vscatterdpd, vscatterqpd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;vpscatterdd, vscatterdps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;vpconflictd, vpconflictq&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;vpconflictd, vpconflictd, vpconflictq, vpconflictq&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;vpconflictd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;vpconflictd, vpconflictd, vpscatterdd, vscatterdps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Scripts used to collect the data are &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/uops-histogram"&gt;available&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>A short report from code::dive 2018</title>
  <link>http://0x80.pl/notesen/2018-11-15-code-dive-2018.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-11-15-code-dive-2018.html</guid>
  <pubDate>Sun, 18 Nov 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In November this year there was a new edition of &lt;a class="reference external" href="https://codedive.pl/"&gt;code::dive&lt;/a&gt;, an IT conference
in Wrocław, Poland.  Practically nothing has changed from &lt;a class="reference external" href="2017-11-26-code-dive-2017.html"&gt;the previous
edition&lt;/a&gt;: it is still great.  The place is perfect, it's in a huge cinema
located in the city center. There were free snacks and water, and amazing coffee
at a decent price (the only downside was the huge queues). As always, there were a
lot of interesting talks; a new thing was the &amp;quot;lightning talks&amp;quot;, run during the lunch
break.&lt;/p&gt;
&lt;p&gt;This edition was a bit different, though. Although Nokia still sponsors the
conference, the organizers asked participants to buy tickets (25 złotych, 5
euro) and then gave all the income to the Polish Association of the Blind; I totally
love this approach.  BTW, approximately 38,000 złotych were collected.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;&lt;em&gt;Disclaimer&lt;/em&gt;: I'm a Nokia employee right now, but I am writing this text in my spare
time. My employer kindly let me attend the conference during working
hours.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;p&gt;The talks I attended:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&amp;quot;The Hitchhiker's Guide to Faster Builds&amp;quot; (two parts) by &lt;strong&gt;Viktor Kirilov&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Clean code in Go&amp;quot; by &lt;strong&gt;Mateusz Dymiński&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;GoLand Tips &amp;amp; Tricks&amp;quot; by &lt;strong&gt;Florin Pățan&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Taming dynamic memory - An introduction to custom allocators in C++&amp;quot; by &lt;strong&gt;Andreas Weis&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Python as C++’s limiting case&amp;quot; by &lt;strong&gt;Brandon Rhodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;How to do practical Data Science? From real-world examples to recommendations&amp;quot; by &lt;strong&gt;Artur Suchwałko&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;C/C++ vs Security!&amp;quot; by &lt;strong&gt;Gynvael Coldwind&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;A trusted trip in the cloud &amp;mdash; working with trusted hardware in practice&amp;quot; by &lt;strong&gt;Gabriela Limonta&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Augmented Reality - The State of Play&amp;quot; by &lt;strong&gt;Rafał Legiędź&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Why algebraic data types are important&amp;quot; by &lt;strong&gt;Bartosz Milewski&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Speeding up multiple vector operations using SIMD</title>
  <link>http://0x80.pl/notesen/2018-11-14-simd-multiple-vector-ops.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-11-14-simd-multiple-vector-ops.html</guid>
  <pubDate>Wed, 14 Nov 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;One step of the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/K-means"&gt;k-means&lt;/a&gt; algorithm is calculating the distance between
all &lt;strong&gt;centroids&lt;/strong&gt; and all &lt;strong&gt;samples&lt;/strong&gt;. Then the centroids are recalculated
and the samples re-assigned. Both centroids and samples are &lt;strong&gt;vectors of
fixed size&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I was curious how SIMD might help in this task (or similar ones).&lt;/p&gt;
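&lt;p&gt;As a point of reference, the distance step might look as follows in scalar
code. This is a minimal sketch under assumptions that are mine: the squared
Euclidean distance is used, all vectors are stored one after another in flat
arrays, and the name &lt;tt class="docutils literal"&gt;all_distances&lt;/tt&gt; is hypothetical.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Assumption: squared Euclidean distance; every centroid and every sample
// has exactly `dim` components and vectors are stored back to back.
// Returns a k x n matrix in row-major order (k centroids, n samples).
std::vector&lt;float&gt; all_distances(const std::vector&lt;float&gt;&amp; centroids,
                                 const std::vector&lt;float&gt;&amp; samples,
                                 size_t dim) {
    assert(dim &gt; 0);
    assert(centroids.size() % dim == 0);
    assert(samples.size() % dim == 0);

    const size_t k = centroids.size() / dim;
    const size_t n = samples.size() / dim;

    std::vector&lt;float&gt; dist(k * n);
    for (size_t c = 0; c &lt; k; c++) {
        for (size_t s = 0; s &lt; n; s++) {
            float sum = 0.0f;
            for (size_t i = 0; i &lt; dim; i++) {
                const float d = centroids[c * dim + i] - samples[s * dim + i];
                sum += d * d;
            }
            dist[c * n + s] = sum;
        }
    }

    return dist;
}
&lt;/pre&gt;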
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD --- why you shouldn't use static vector constants</title>
  <link>http://0x80.pl/notesen/2018-10-28-cpp-static-vectors.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-10-28-cpp-static-vectors.html</guid>
  <pubDate>Sun, 28 Oct 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When working with SSE/AVX2/AVX512 it's virtually impossible not to use some vector
constants, which are defined by &lt;tt class="docutils literal"&gt;_mm_set_epi32&lt;/tt&gt; or similar intrinsic functions.&lt;/p&gt;
&lt;p&gt;If your program is written in C++, &lt;strong&gt;NEVER EVER&lt;/strong&gt; use &lt;tt class="docutils literal"&gt;static const&lt;/tt&gt; for such
constants. Why? From what I can gather, a compiler treats vector types not as PODs
(&lt;em&gt;Plain-Old-Data&lt;/em&gt;), but as fully-featured classes that have to be constructed
and destructed by some additional code.&lt;/p&gt;
&lt;p&gt;I checked this on GCC 7.3.0 from Debian, and then confirmed on GCC 8.2.0 and
Clang 7.0.0 on &lt;a class="reference external" href="https://godbolt.org/"&gt;godbolt.org&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMDized sum of all bytes in the array</title>
  <link>http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html</guid>
  <pubDate>Wed, 24 Oct 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction-1"&gt;
&lt;span id="introduction"&gt;&lt;/span&gt;&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;I was curious how GCC vectorizes a function that sums bytes from an array.
Below is a loop-based implementation.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sumbytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The same algorithm can be expressed with the following C++ code.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;numeric&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sumbytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;accumulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;When I saw the assembly generated by GCC I was sure that it's possible to make
it better and faster.  This text summarizes my findings.&lt;/p&gt;
&lt;p&gt;I focus solely on Skylake performance and AVX2 code. The &lt;a class="reference internal" href="#sources"&gt;sources&lt;/a&gt; also
contain implementations of SSE procedures, and the &lt;a class="reference internal" href="#experiments"&gt;experiments&lt;/a&gt; include timings
from an older CPU.&lt;/p&gt;
&lt;div class="section" id="algorithm-used-by-gcc"&gt;
&lt;h2&gt;Algorithm used by GCC&lt;/h2&gt;
&lt;p&gt;Below is the assembly code of the main loop compiled for Skylake by
GCC 7.3.0 with flags &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-O3&lt;/span&gt; &lt;span class="pre"&gt;-march=skylake&lt;/span&gt;&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="code nasm literal-block"&gt;
&lt;span class="nf"&gt;vpmovzxbw&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxbw&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;$&lt;/span&gt;&lt;span class="mh"&gt;0x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpmovzxwd&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;vpaddd&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nb"&gt;ymm3&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;GCC nicely vectorized the algorithm: it keeps multiple 32-bit sub-accumulators
in a single register, i.e. eight in the case of AVX2 (four in SSE code).  These 32-bit
numbers are added together at the end, forming the scalar result.&lt;/p&gt;
&lt;p&gt;Now, let's look at how the type casting is done. Although AVX2 has the
instruction &lt;tt class="docutils literal"&gt;VPMOVZXBD&lt;/tt&gt;, which converts directly from &lt;tt class="docutils literal"&gt;uint8_t&lt;/tt&gt; to
&lt;tt class="docutils literal"&gt;uint32_t&lt;/tt&gt; (intrinsic &lt;tt class="docutils literal"&gt;_mm256_cvtepu8_epi32&lt;/tt&gt;), the compiler does the
conversion in two steps:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;First, it extends a vector of &lt;tt class="docutils literal"&gt;uint8_t&lt;/tt&gt; into two vectors of &lt;tt class="docutils literal"&gt;uint16_t&lt;/tt&gt;
numbers (&lt;tt class="docutils literal"&gt;VPMOVZXBW&lt;/tt&gt;).&lt;/li&gt;
&lt;li&gt;Then, it gets four vectors of &lt;tt class="docutils literal"&gt;uint32_t&lt;/tt&gt; from the vectors obtained in the
previous step (&lt;tt class="docutils literal"&gt;VPMOVZXWD&lt;/tt&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The cast instruction &lt;tt class="docutils literal"&gt;VPMOVZX&lt;/tt&gt; extends the lower part of a register, in this
case the lower half. This is the reason why extractions of the upper halves
(&lt;tt class="docutils literal"&gt;VEXTRACTI128&lt;/tt&gt;) are needed.&lt;/p&gt;
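&lt;p&gt;How one vector of bytes becomes two vectors of 16-bit numbers and then four
vectors of 32-bit numbers can be followed in a scalar model of a single 32-byte
block. It is a sketch for illustration only: the names &lt;tt class="docutils literal"&gt;widen_block&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;sum_block&lt;/tt&gt; are made up, and plain arrays stand in for AVX2 registers.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Scalar model of widening one 32-byte block (one AVX2 register):
// 32 x uint8_t become 2 x 16 x uint16_t, then 4 x 8 x uint32_t.
void widen_block(const uint8_t* bytes, uint32_t dwords[4][8]) {
    assert(bytes != nullptr);

    uint16_t words[2][16];

    // Step 1, cf. VPMOVZXBW: a 16-byte half of the input is zero-extended
    // to sixteen 16-bit numbers; half == 1 is what VEXTRACTI128 fetches.
    for (size_t half = 0; half &lt; 2; half++)
        for (size_t i = 0; i &lt; 16; i++)
            words[half][i] = uint16_t(bytes[16 * half + i]);

    // Step 2, cf. VPMOVZXWD: a half of each 16-bit vector is zero-extended
    // to eight 32-bit numbers.
    for (size_t v = 0; v &lt; 2; v++)
        for (size_t half = 0; half &lt; 2; half++)
            for (size_t i = 0; i &lt; 8; i++)
                dwords[2 * v + half][i] = uint32_t(words[v][8 * half + i]);
}

// Adds the four 32-bit vectors lane by lane (cf. VPADDD) and then sums
// the eight lanes into a scalar.
uint32_t sum_block(const uint8_t* bytes) {
    uint32_t dwords[4][8];
    widen_block(bytes, dwords);

    uint32_t acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (size_t v = 0; v &lt; 4; v++)
        for (size_t lane = 0; lane &lt; 8; lane++)
            acc[lane] += dwords[v][lane];

    uint32_t result = 0;
    for (size_t lane = 0; lane &lt; 8; lane++)
        result += acc[lane];

    return result;
}
&lt;/pre&gt;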
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMDized check which bytes are in a set</title>
  <link>http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html</guid>
  <pubDate>Thu, 18 Oct 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The problem is defined as follows: there's &lt;strong&gt;a stream of bytes&lt;/strong&gt; and we want to
get a byte-mask (or a bit-mask) that indicates which bytes are in &lt;strong&gt;the
predefined set&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Thanks to SIMD instructions this task can be performed faster than with scalar code.
Jobs like input validation or parsing (for instance of CSV files) might benefit
from a vectorized approach.&lt;/p&gt;
&lt;p&gt;In this text I show several SIMD methods:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;The universal algorithm&lt;/strong&gt; that can handle arbitrary sets (from 1 to 255
elements) with a few instructions.&lt;/li&gt;
&lt;li&gt;Several specialized algorithms that handle small sets with particular
properties. They require fewer instructions than the universal algorithm.
However, the algorithms are rather meant for compilers/code generators, where
we can statically determine the best code sequence for a predefined set.&lt;/li&gt;
&lt;li&gt;For the sake of completeness I describe &lt;a class="reference internal" href="#basicmethods"&gt;basic SIMD methods&lt;/a&gt;. If the set has
a few elements then no fancy algorithm is needed. Likewise, if the set
can be represented as a union of ranges, the code is also not complicated.&lt;/li&gt;
&lt;/ul&gt;
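&lt;p&gt;For reference, the problem statement itself can be written down as scalar
code. The sketch below is a baseline, not one of the SIMD methods; the names
&lt;tt class="docutils literal"&gt;build_table&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;in_set_bytemask&lt;/tt&gt; are hypothetical, and encoding the
byte-mask as &lt;tt class="docutils literal"&gt;0xff&lt;/tt&gt;/&lt;tt class="docutils literal"&gt;0x00&lt;/tt&gt; is my assumption.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Marks the members of the set in a 256-entry table:
// table[b] is 0xff when byte b belongs to the set and 0x00 otherwise.
void build_table(const uint8_t* set, size_t set_size, uint8_t* table) {
    assert(set != nullptr || set_size == 0);
    assert(table != nullptr);

    for (size_t b = 0; b &lt; 256; b++)
        table[b] = 0x00;

    for (size_t i = 0; i &lt; set_size; i++)
        table[set[i]] = 0xff;
}

// Scalar reference: writes one mask byte per input byte.
void in_set_bytemask(const uint8_t* input, size_t size,
                     const uint8_t* table, uint8_t* mask) {
    for (size_t i = 0; i &lt; size; i++)
        mask[i] = table[input[i]];
}
&lt;/pre&gt;
&lt;p&gt;The table is built once per set, so the per-byte cost is a single lookup.&lt;/p&gt;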
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Finding index of the minimum value using SIMD instructions</title>
  <link>http://0x80.pl/notesen/2018-10-03-simd-index-of-min.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-10-03-simd-index-of-min.html</guid>
  <pubDate>Wed, 03 Oct 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The goal is to find the first index of the minimum value in a non-empty
sequence.&lt;/p&gt;
&lt;p&gt;The following C++ code shows the idea.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min_index_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minindex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minvalue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minvalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;minvalue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;minindex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minindex&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The C++ standard library lets us express the same algorithm in one line.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;algorithm&amp;gt;&lt;/span&gt;&lt;span class="c1"&gt; // for std::min_element&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;iterator&amp;gt;&lt;/span&gt;&lt;span class="c1"&gt;  // for std::distance&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;min_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Current compiler versions (GCC 8.2, clang 7.0.0) are not able to
autovectorize this code. However, they do autovectorize finding the minimum
value itself, i.e. a statement like &lt;tt class="docutils literal"&gt;return &lt;span class="pre"&gt;*std::min_element(v.begin(),&lt;/span&gt; &lt;span class="pre"&gt;v.end())&lt;/span&gt;&lt;/tt&gt;.&lt;/p&gt;
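&lt;p&gt;Since the value-only search is the part compilers handle well, one way to picture the problem is to split it into two passes: find the minimum value first, then locate its first occurrence. The sketch below only illustrates that idea; the name &lt;tt class="docutils literal"&gt;min_index_two_pass&lt;/tt&gt; is hypothetical, it is not the SIMD method this article develops, and no claim is made here about its speed.&lt;/p&gt;

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical two-pass variant (an illustration, not the article's SIMD code).
std::size_t min_index_two_pass(const std::vector<int32_t>& v) {
    assert(!v.empty());

    // Pass 1: only the minimum value is needed here.
    const int32_t minvalue = *std::min_element(v.begin(), v.end());

    // Pass 2: std::find stops at the first element equal to the minimum,
    // so ties resolve to the lowest index, like in the scalar loop.
    const auto it = std::find(v.begin(), v.end(), minvalue);
    return static_cast<std::size_t>(std::distance(v.begin(), it));
}
```

&lt;p&gt;The second pass touches at most the whole input again, so this trades extra memory traffic for a simpler first loop.&lt;/p&gt;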
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512 mask registers support in compilers</title>
  <link>http://0x80.pl/notesen/2018-05-18-avx512-ktest-in-compilers.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-05-18-avx512-ktest-in-compilers.html</guid>
  <pubDate>Fri, 18 May 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/AVX-512"&gt;AVX-512&lt;/a&gt; introduced a set of 64-bit &lt;strong&gt;mask registers&lt;/strong&gt;, named
&lt;tt class="docutils literal"&gt;k0&lt;/tt&gt; ... &lt;tt class="docutils literal"&gt;k7&lt;/tt&gt; in assembler.  A mask can be used to:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Conditionally update elements in a destination register; it's an incredibly
powerful feature, as virtually all vector instructions support it.&lt;/li&gt;
&lt;li&gt;Hold the result of vector comparison.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The latter is also useful, as there is an instruction, &lt;tt class="docutils literal"&gt;ktest&lt;/tt&gt;, that updates
the flags register, &lt;tt class="docutils literal"&gt;EFLAGS&lt;/tt&gt;. Prior to AVX512 an extra instruction, like
&lt;tt class="docutils literal"&gt;pmovmskb&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;ptest&lt;/tt&gt; (SSE 4.1), had to be used in order to alter
control flow based on vector contents.&lt;/p&gt;
&lt;p&gt;There are four variants of &lt;tt class="docutils literal"&gt;ktest kx, ky&lt;/tt&gt; that operate on 8, 16, 32 or 64
bits of mask registers, but basically they perform the same operation:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
ZF := (kx AND ky) == 0
CF := (kx AND NOT ky) == 0
&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2018-05-22&lt;/strong&gt; update: unfortunately the instruction is not available in
AVX512F; 8- and 16-bit variants are available in AVX512DQ, 32- and 64-bit
in AVX512BW.&lt;/p&gt;
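&lt;p&gt;The flag formulas given earlier can be modelled with plain integer operations. The following is a minimal scalar sketch of the 16-bit variant that follows those two formulas literally; the type &lt;tt class="docutils literal"&gt;KtestFlags&lt;/tt&gt; and the function &lt;tt class="docutils literal"&gt;ktest16_model&lt;/tt&gt; are hypothetical names used only for illustration, and no AVX512 instruction is executed.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the two flags that ktest updates.
struct KtestFlags {
    bool zf;
    bool cf;
};

// Scalar model of the 16-bit variant, following the formulas in the text:
//   ZF := (kx AND ky) == 0
//   CF := (kx AND NOT ky) == 0
KtestFlags ktest16_model(uint16_t kx, uint16_t ky) {
    const uint16_t common = static_cast<uint16_t>(kx & ky);
    const uint16_t only_x = static_cast<uint16_t>(kx & ~ky);
    return KtestFlags{common == 0, only_x == 0};
}
```

&lt;p&gt;In this model, testing a mask against itself sets ZF exactly when the mask is zero, which is the typical way to branch on &amp;quot;no element matched&amp;quot;.&lt;/p&gt;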
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512 implementation of JPEG zigzag transformation</title>
  <link>http://0x80.pl/notesen/2018-05-13-avx512-jpeg-zigzag-transform.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-05-13-avx512-jpeg-zigzag-transform.html</guid>
  <pubDate>Sun, 13 May 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;One of the steps in &lt;a class="reference external" href="http://en.wikipedia.org/wiki/JPEG#Entropy_coding"&gt;JPEG compression&lt;/a&gt; is the &lt;strong&gt;zigzag transformation&lt;/strong&gt;, which
linearises pixels from an 8x8 block into a 64-byte array. It's possible to vectorize
this transformation. This short text shows an &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions"&gt;SSE&lt;/a&gt; implementation, then its
translation into the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/AVX-512"&gt;AVX512BW&lt;/a&gt; instruction set, and finally AVX512VBMI code.&lt;/p&gt;
&lt;p&gt;The order of items in a block after transformation is shown below:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
[  0  1  5  6 14 15 27 28 ]
[  2  4  7 13 16 26 29 42 ]
[  3  8 12 17 25 30 41 43 ]
[  9 11 18 24 31 40 44 53 ]
[ 10 19 23 32 39 45 52 54 ]
[ 20 22 33 38 46 51 55 60 ]
[ 21 34 37 47 50 56 59 61 ]
[ 35 36 48 49 57 58 62 63 ]
&lt;/pre&gt;
&lt;img alt="2018-05-13-avx512-jpeg-zigzag-transform/zigzag.png" class="align-center" src="2018-05-13-avx512-jpeg-zigzag-transform/zigzag.png" /&gt;
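&lt;p&gt;As a point of reference for the vectorized versions, the transformation can be written as a simple table-driven scalar loop. This is a minimal sketch, assuming the table above gives, for each input item in row-major order, its position in the output array; the names &lt;tt class="docutils literal"&gt;zigzag_position&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;zigzag_scalar&lt;/tt&gt; are hypothetical.&lt;/p&gt;

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Output position of each input item, copied row by row from the table above.
constexpr std::array<uint8_t, 64> zigzag_position = {
     0,  1,  5,  6, 14, 15, 27, 28,
     2,  4,  7, 13, 16, 26, 29, 42,
     3,  8, 12, 17, 25, 30, 41, 43,
     9, 11, 18, 24, 31, 40, 44, 53,
    10, 19, 23, 32, 39, 45, 52, 54,
    20, 22, 33, 38, 46, 51, 55, 60,
    21, 34, 37, 47, 50, 56, 59, 61,
    35, 36, 48, 49, 57, 58, 62, 63,
};

// Scalar reference: moves the item at row-major index i to its zigzag position.
std::array<uint8_t, 64> zigzag_scalar(const std::array<uint8_t, 64>& block) {
    std::array<uint8_t, 64> out{};
    for (std::size_t i = 0; i < 64; i++) {
        out[zigzag_position[i]] = block[i];
    }
    return out;
}
```

&lt;p&gt;Because the table is a permutation of 0..63, every output byte is written exactly once, which is what makes the transformation expressible as byte shuffles.&lt;/p&gt;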
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Be careful with directory_iterator</title>
  <link>http://0x80.pl/notesen/2018-04-28-be-careful-with-dir-iterator.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-04-28-be-careful-with-dir-iterator.html</guid>
  <pubDate>Sat, 28 Apr 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;C++17 finally introduced the &lt;a class="reference external" href="http://en.cppreference.com/w/cpp/header/filesystem"&gt;filesystem&lt;/a&gt; library, which is pretty nice.&lt;/p&gt;
&lt;p&gt;This text shows a caveat my colleague bumped into recently. He wanted to perform
a set of different operations on files from a directory; it was something like:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;filesystem&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;filesystem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Foo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;perform_impl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_iterator&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_iterator&lt;/span&gt;&lt;span class="p"&gt;{});&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;private&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;perform_impl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_iterator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_iterator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_iterator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;do_foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_iterator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;do_bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;do_foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_entry&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;de&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;do_bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;directory_entry&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;de&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;In method &lt;tt class="docutils literal"&gt;perform_impl&lt;/tt&gt; we iterate twice over the range defined by two
&lt;a class="reference external" href="http://en.cppreference.com/w/cpp/filesystem/directory_iterator"&gt;directory_iterators&lt;/a&gt;.  Well, we suppose so. Although iterator &lt;tt class="docutils literal"&gt;first&lt;/tt&gt; is
copied, the copy operation is peculiar.  It doesn't copy &lt;strong&gt;the state of
the iterator&lt;/strong&gt;; we merely get a new &amp;quot;handle&amp;quot; to an existing, single state.
The standard libraries from GCC and Clang keep a &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::shared_ptr&lt;/span&gt;&lt;/tt&gt;, which holds
an instance of an internal class responsible for iterating.&lt;/p&gt;
&lt;p&gt;What does it mean? When the first loop finishes, &lt;tt class="docutils literal"&gt;first == end&lt;/tt&gt;. Thus, the
second loop never runs.&lt;/p&gt;
&lt;p&gt;In my opinion this behaviour is counter-intuitive. If the copy operator doesn't
really make a copy, it should be disabled in the API (it can be done with &lt;tt class="docutils literal"&gt;=
delete&lt;/tt&gt; put next to the operator declaration). People would be forced to pass
the iterator by reference and, thanks to that, would be aware of the iterator
traits.&lt;/p&gt;
&lt;p&gt;A funny side effect of the current behaviour is that even iterators
passed by const reference change their visible state.&lt;/p&gt;
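&lt;p&gt;One way to sidestep the problem is to pass the path around and construct a fresh iterator for every pass, instead of reusing a copied one. The following is a minimal sketch under that assumption; the helper &lt;tt class="docutils literal"&gt;count_entries_twice&lt;/tt&gt; is a hypothetical name and is not part of the class shown above.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical helper: walks a directory twice and reports how many
// entries each pass has seen.
std::pair<std::size_t, std::size_t> count_entries_twice(const fs::path& dir) {
    assert(fs::is_directory(dir));

    std::size_t first_pass = 0;
    for (const fs::directory_entry& de : fs::directory_iterator{dir}) {
        (void)de;  // a real program would call something like do_foo(de)
        first_pass++;
    }

    // A new iterator built from the path starts again from the beginning;
    // it does not share state with the iterator used in the first loop.
    std::size_t second_pass = 0;
    for (const fs::directory_entry& de : fs::directory_iterator{dir}) {
        (void)de;  // ... and here something like do_bar(de)
        second_pass++;
    }

    return {first_pass, second_pass};
}
```

&lt;p&gt;The directory is read from the filesystem twice, so if its contents change between the passes the two loops may see different entries; collecting the entries into a container once is an alternative when that matters.&lt;/p&gt;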
  </description>
 </item>
 <item>
  <title>Parsing series of integers with SIMD</title>
  <link>http://0x80.pl/notesen/2018-04-19-simd-parsing-int-sequences.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-04-19-simd-parsing-int-sequences.html</guid>
  <pubDate>Thu, 19 Apr 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;&lt;a class="toc-backref" href="#toc-entry-1"&gt;1&amp;nbsp;&amp;nbsp;&amp;nbsp;Introduction&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;While conversion from a string into an integer value is feasible with SIMD
instructions, this application is impractical. For typical cases, when a single
value is parsed, scalar procedures &amp;mdash; like the standard &lt;tt class="docutils literal"&gt;atoi&lt;/tt&gt; or
&lt;tt class="docutils literal"&gt;strtol&lt;/tt&gt; &amp;mdash; are faster than any fancy SSE code.&lt;/p&gt;
&lt;p&gt;However, SIMD procedures can be really fast and convert several numbers
&lt;strong&gt;in parallel&lt;/strong&gt;. There is only one &amp;quot;but&amp;quot;: the input data has to be regular and valid,
i.e. the input string must contain only ASCII digits. Recently, I updated
the article about &lt;a class="reference internal" href="#internal-links"&gt;SSE parsing&lt;/a&gt; with the benchmark results.  The
speed-ups are really impressive; for example, the SSSE3 parser is 7 to 9 times
faster than naive scalar code.&lt;/p&gt;
&lt;p&gt;The obvious question is: how can these powerful SIMD procedures be used to
convert real data? By &lt;em&gt;real&lt;/em&gt; I mean possibly broken inputs that contain series
of numbers of different lengths separated by characters from a predefined set.&lt;/p&gt;
&lt;p&gt;In this text I try to answer that question; major contributions of this
article are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Methods to efficiently &lt;strong&gt;parse and validate&lt;/strong&gt; such strings using SSE
instructions. There is a special variant that handles only unsigned
numbers and also a fully featured variant for signed numbers.&lt;/li&gt;
&lt;li&gt;Boosting the SSE procedure with AVX2 or AVX512 instructions when
possible.&lt;/li&gt;
&lt;li&gt;A way to combine some of the SIMD techniques with scalar conversions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The article starts with unsigned conversion, because it is easier than the signed
one. The signed conversion shares the core idea; it just adds some extra steps.&lt;/p&gt;
&lt;p&gt;The text is accompanied by BSD-licensed software that includes fully
functional implementations alongside programs that validate and benchmark
the procedures.&lt;/p&gt;
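&lt;p&gt;To make the task concrete, here is a minimal scalar sketch of the kind of parser discussed: unsigned numbers of varying length, separated by characters from a given set, with anything else rejected. It is an illustration written for this introduction, not one of the procedures from the accompanying software; the function name &lt;tt class="docutils literal"&gt;parse_unsigned_scalar&lt;/tt&gt;, the use of exceptions and the 32-bit result type are all assumptions.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical scalar baseline: digits build numbers, separators end them,
// and any other character makes the whole input invalid.
std::vector<uint32_t> parse_unsigned_scalar(const std::string& input,
                                            const std::string& separators) {
    std::vector<uint32_t> result;
    uint64_t value = 0;      // wider than the result, so overflow is detectable
    bool in_number = false;

    for (const char c : input) {
        if (c >= '0' && c <= '9') {
            value = 10 * value + static_cast<uint64_t>(c - '0');
            if (value > std::numeric_limits<uint32_t>::max()) {
                throw std::out_of_range("number does not fit in 32 bits");
            }
            in_number = true;
        } else if (separators.find(c) != std::string::npos) {
            if (in_number) {
                result.push_back(static_cast<uint32_t>(value));
                value = 0;
                in_number = false;
            }
        } else {
            throw std::invalid_argument("unexpected character in input");
        }
    }

    if (in_number) {
        result.push_back(static_cast<uint32_t>(value));
    }
    return result;
}
```

&lt;p&gt;The loop handles one character per iteration and branches on its class; the SIMD approach aims to classify and convert many characters at once.&lt;/p&gt;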
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="auto-toc simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction" id="toc-entry-1"&gt;1&amp;nbsp;&amp;nbsp;&amp;nbsp;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#parser-specification" id="toc-entry-2"&gt;2&amp;nbsp;&amp;nbsp;&amp;nbsp;Parser specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-conversion-capabilities" id="toc-entry-3"&gt;3&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE conversion capabilities&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#range-checking" id="toc-entry-4"&gt;3.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Range checking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sample-implementation" id="toc-entry-5"&gt;3.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Sample implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#parsing-and-conversions-of-unsigned-numbers" id="toc-entry-6"&gt;4&amp;nbsp;&amp;nbsp;&amp;nbsp;Parsing and conversions of unsigned numbers&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm-overview" id="toc-entry-7"&gt;4.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Algorithm overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#normalizing-input" id="toc-entry-8"&gt;4.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Normalizing input&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#precalculating-data" id="toc-entry-9"&gt;4.3&amp;nbsp;&amp;nbsp;&amp;nbsp;Precalculating data&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#example" id="toc-entry-10"&gt;4.3.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-algorithm-outline" id="toc-entry-11"&gt;4.4&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE algorithm outline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#detecting-invalid-inputs" id="toc-entry-12"&gt;4.5&amp;nbsp;&amp;nbsp;&amp;nbsp;Detecting invalid inputs&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#detecting-digits-1" id="toc-entry-13"&gt;4.5.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Detecting digits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#detecting-characters-from-set" id="toc-entry-14"&gt;4.5.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Detecting characters from set&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-avx2" id="toc-entry-15"&gt;4.5.2.1&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE &amp;amp; AVX2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse4-2" id="toc-entry-16"&gt;4.5.2.2&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE4.2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#caveats" id="toc-entry-17"&gt;4.6&amp;nbsp;&amp;nbsp;&amp;nbsp;Caveats&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#parsing-and-conversion-of-signed-numbers" id="toc-entry-18"&gt;5&amp;nbsp;&amp;nbsp;&amp;nbsp;Parsing and conversion of signed numbers&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm" id="toc-entry-19"&gt;5.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#implementation-note" id="toc-entry-20"&gt;5.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Implementation note&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#detecting-invalid-inputs-1" id="toc-entry-21"&gt;5.3&amp;nbsp;&amp;nbsp;&amp;nbsp;Detecting invalid inputs&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#avx512vbmi" id="toc-entry-22"&gt;5.3.1&amp;nbsp;&amp;nbsp;&amp;nbsp;AVX512VBMI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-algorithm-outline-1" id="toc-entry-23"&gt;5.4&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE algorithm outline&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#processing-larger-inputs" id="toc-entry-24"&gt;6&amp;nbsp;&amp;nbsp;&amp;nbsp;Processing larger inputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#scalar-hybrid-1" id="toc-entry-25"&gt;7&amp;nbsp;&amp;nbsp;&amp;nbsp;Scalar hybrid&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#example-1" id="toc-entry-26"&gt;7.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#appendix-a-conversion-of-three-digit-numbers" id="toc-entry-27"&gt;8&amp;nbsp;&amp;nbsp;&amp;nbsp;Appendix A &amp;mdash; conversion of three-digit numbers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#appendix-b-conversion-of-two-four-digit-numbers" id="toc-entry-28"&gt;9&amp;nbsp;&amp;nbsp;&amp;nbsp;Appendix B &amp;mdash; conversion of two four-digit numbers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#reference-scalar-procedures" id="toc-entry-29"&gt;10&amp;nbsp;&amp;nbsp;&amp;nbsp;Reference scalar procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#experiments" id="toc-entry-30"&gt;11&amp;nbsp;&amp;nbsp;&amp;nbsp;Experiments&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-conversion-execution-statistics-1" id="toc-entry-31"&gt;11.1&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE conversion &amp;mdash; execution statistics&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#overview" id="toc-entry-32"&gt;11.1.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-routines-calls" id="toc-entry-33"&gt;11.1.2&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE routines calls&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-conversion-runtime-analysis" id="toc-entry-34"&gt;11.2&amp;nbsp;&amp;nbsp;&amp;nbsp;SSE conversion &amp;mdash; runtime analysis&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#input-size-4-096-bytes" id="toc-entry-35"&gt;11.2.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Input size 4,096 bytes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#input-size-65-536-bytes" id="toc-entry-36"&gt;11.2.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Input size 65,536 bytes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#performance-comparison" id="toc-entry-37"&gt;11.3&amp;nbsp;&amp;nbsp;&amp;nbsp;Performance comparison&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tested-procedures" id="toc-entry-38"&gt;11.3.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Tested procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#tests-setup" id="toc-entry-39"&gt;11.3.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Tests setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#core-i7-results-1" id="toc-entry-40"&gt;11.3.3&amp;nbsp;&amp;nbsp;&amp;nbsp;Core i7 results&lt;/a&gt;&lt;ul class="auto-toc"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#input-size-4096-bytes" id="toc-entry-41"&gt;11.3.3.1&amp;nbsp;&amp;nbsp;&amp;nbsp;Input size 4096 bytes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#input-size-65536-bytes" id="toc-entry-42"&gt;11.3.3.2&amp;nbsp;&amp;nbsp;&amp;nbsp;Input size 65536 bytes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#conclusions" id="toc-entry-43"&gt;12&amp;nbsp;&amp;nbsp;&amp;nbsp;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#acknowledgements" id="toc-entry-44"&gt;13&amp;nbsp;&amp;nbsp;&amp;nbsp;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#see-also" id="toc-entry-45"&gt;14&amp;nbsp;&amp;nbsp;&amp;nbsp;See also&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#source-code" id="toc-entry-46"&gt;15&amp;nbsp;&amp;nbsp;&amp;nbsp;Source code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Accidental recursion</title>
  <link>http://0x80.pl/notesen/2018-04-14-accidental-recursion.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-04-14-accidental-recursion.html</guid>
  <pubDate>Sat, 14 Apr 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Another bug I bumped into. There is the enum:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;And the associated &lt;tt class="docutils literal"&gt;ostream&lt;/tt&gt; operator, which is merely a switch; all
operators for enums look like this.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ostream&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;operator&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ostream&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Color&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;red&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;Color&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;green&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;invalid &amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The author clearly wanted to handle all possible cases, including invalid enum
values that might appear due to memory corruption or simply as a result of
invalid casting. Sadly, the handler fails in exactly such an erroneous case:
the expression &lt;tt class="docutils literal"&gt;s &amp;lt;&amp;lt; c&lt;/tt&gt; in the &lt;tt class="docutils literal"&gt;default&lt;/tt&gt; case calls the very same
operator again, so there is &lt;strong&gt;an infinite recursion&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The correct solution is to cast the value to the underlying type of the enumeration and print it as a number:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;unknown &amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;underlying_type&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Color&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;::&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;This is not the nicest code in the world, but it does the job well.&lt;/p&gt;
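&lt;p&gt;For reference, a minimal self-contained sketch of the corrected operator might look as follows. The helper &lt;tt class="docutils literal"&gt;to_string&lt;/tt&gt; is hypothetical and exists only to make the output easy to inspect.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;ostream&amp;gt;
#include &amp;lt;sstream&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;type_traits&amp;gt;

enum class Color {
    red,
    green
};

std::ostream&amp;amp; operator&amp;lt;&amp;lt;(std::ostream&amp;amp; s, const Color c) {
    switch (c) {
        case Color::red:
            s &amp;lt;&amp;lt; &amp;quot;red&amp;quot;;
            break;

        case Color::green:
            s &amp;lt;&amp;lt; &amp;quot;green&amp;quot;;
            break;

        default:
            // Cast to the underlying integer type, so the built-in integer
            // overload is selected instead of this operator.
            s &amp;lt;&amp;lt; &amp;quot;unknown &amp;quot; &amp;lt;&amp;lt; static_cast&amp;lt;std::underlying_type&amp;lt;Color&amp;gt;::type&amp;gt;(c);
            break;
    }

    return s;
}

// Hypothetical helper (not part of the original code): formats a Color
// through the operator above and returns the text.
std::string to_string(const Color c) {
    std::ostringstream ss;
    ss &amp;lt;&amp;lt; c;
    return ss.str();
}
&lt;/pre&gt;
&lt;p&gt;With this version, &lt;tt class="docutils literal"&gt;to_string(static_cast&amp;lt;Color&amp;gt;(42))&lt;/tt&gt; yields &amp;quot;unknown 42&amp;quot; instead of overflowing the stack; the value 42 is just an arbitrary out-of-range example.&lt;/p&gt;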
  </description>
 </item>
 <item>
  <title>Is sorted using SIMD instructions</title>
  <link>http://0x80.pl/notesen/2018-04-11-simd-is-sorted.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-04-11-simd-is-sorted.html</guid>
  <pubDate>Wed, 11 Apr 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Recently, I came across a function that checks whether an array is sorted, i.e.
whether there is no element greater than its successor. Below is
a sample implementation:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is_sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;I was sure that such a trivial loop would be autovectorized by all decent compilers.
I checked this on &lt;a class="reference external" href="https://godbolt.org/"&gt;Compiler Explorer&lt;/a&gt; and, to my surprise, &lt;strong&gt;none of the compilers&lt;/strong&gt;
does it. This is the state for GCC 7.3 (and the upcoming GCC 8.0), clang 6.0 and ICC 19.&lt;/p&gt;
&lt;p&gt;This text explores possible vectorization schemas.&lt;/p&gt;
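&lt;p&gt;To give a flavour of what such a schema can look like, below is a minimal sketch of one possible approach. It assumes an x86 target with SSE2; the function name &lt;tt class="docutils literal"&gt;is_sorted_sse2&lt;/tt&gt; is made up for this illustration. The idea is to load four elements and their four successors, and compare all four pairs at once.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;emmintrin.h&amp;gt; // SSE2 intrinsics

bool is_sorted_sse2(const int32_t* input, size_t n) {
    if (n &amp;lt; 2) {
        return true;
    }

    size_t i = 0;
    // Each iteration checks four adjacent pairs:
    // (input[i], input[i+1]) ... (input[i+3], input[i+4]),
    // so index i + 4 has to be valid.
    while (i + 4 &amp;lt; n) {
        const __m128i curr = _mm_loadu_si128(reinterpret_cast&amp;lt;const __m128i*&amp;gt;(input + i));
        const __m128i next = _mm_loadu_si128(reinterpret_cast&amp;lt;const __m128i*&amp;gt;(input + i + 1));

        // A lane is all ones where input[i + k] &amp;gt; input[i + k + 1].
        const __m128i gt = _mm_cmpgt_epi32(curr, next);
        if (_mm_movemask_epi8(gt) != 0) {
            return false;
        }

        i += 4;
    }

    // Scalar tail for the remaining pairs.
    for (; i + 1 &amp;lt; n; i++) {
        if (input[i] &amp;gt; input[i + 1])
            return false;
    }

    return true;
}
&lt;/pre&gt;
&lt;p&gt;The unaligned loads overlap by three elements, which keeps the code simple at the cost of reading most of the data twice.&lt;/p&gt;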
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>When lock does not lock --- C++ story</title>
  <link>http://0x80.pl/notesen/2018-03-26-when-lock-does-not-lock.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-03-26-when-lock-does-not-lock.html</guid>
  <pubDate>Mon, 26 Mar 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;A few days ago I compiled a fresh GCC 8.0 from trunk and then built our
product with it, just to see what we would have to fix in the future. And I found a
nasty mistake.&lt;/p&gt;
&lt;p&gt;Let's start with a class that perhaps every project has.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_mutex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_lock&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;At first glance &lt;tt class="docutils literal"&gt;get()&lt;/tt&gt; protects the shared resource; thanks to &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Resource_acquisition_is_initialization"&gt;RAII&lt;/a&gt; we
don't have to care about locking and unlocking our &lt;tt class="docutils literal"&gt;mutex&lt;/tt&gt;, as
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::shared_lock&lt;/span&gt;&lt;/tt&gt; takes care of this.&lt;/p&gt;
&lt;p&gt;However, the code doesn't work that way. Indeed, the line&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_lock&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;declares a variable of type &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::shared_lock&lt;/span&gt;&lt;/tt&gt;. But the name of the variable is &lt;tt class="docutils literal"&gt;mutex&lt;/tt&gt;:
it shadows the class member, and the lock is default-constructed with &lt;strong&gt;no associated mutex&lt;/strong&gt;. Thus, the lock
doesn't lock anything.&lt;/p&gt;
&lt;p&gt;Of course, the correct declaration should be:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_lock&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_mutex&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Can we detect this kind of mistake? Yes, GCC has the &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-Wshadow&lt;/span&gt;&lt;/tt&gt; flag that
warns when &amp;quot;a local variable or type declaration shadows another variable,
parameter, type, or class member (in C++), or whenever a built-in function
is shadowed&amp;quot;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
file.cpp: warning: declaration of ‘mutex’ shadows a member of ‘Resource’ [-Wshadow]
     std::shared_lock&amp;lt;std::shared_mutex&amp;gt;(mutex);
&lt;/pre&gt;
&lt;p&gt;However, I found the mistake thanks to the more aggressive &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-Wparentheses&lt;/span&gt;&lt;/tt&gt; flag
in GCC 8:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
file.cpp: warning: unnecessary parentheses in declaration of ‘mutex’ [-Wparentheses]
     std::shared_lock&amp;lt;std::shared_mutex&amp;gt;(mutex);
&lt;/pre&gt;
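&lt;p&gt;For completeness, here is a minimal sketch of the class with a properly named lock. The members &lt;tt class="docutils literal"&gt;set()&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;get_holds_lock()&lt;/tt&gt; are hypothetical additions for illustration only, and the &lt;tt class="docutils literal"&gt;mutable&lt;/tt&gt; qualifier is likewise an assumption made so that the readers can be &lt;tt class="docutils literal"&gt;const&lt;/tt&gt;.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;mutex&amp;gt;
#include &amp;lt;shared_mutex&amp;gt;

class Resource {
    int resource = 0;
    mutable std::shared_mutex mutex;
public:
    int get() const {
        // Named object: the shared lock is held until the end of the scope.
        std::shared_lock&amp;lt;std::shared_mutex&amp;gt; lock(mutex);
        return resource;
    }

    // Hypothetical writer, shown only as the exclusive counterpart.
    void set(int value) {
        std::unique_lock&amp;lt;std::shared_mutex&amp;gt; lock(mutex);
        resource = value;
    }

    // Hypothetical helper: reports whether the named lock owns the mutex.
    bool get_holds_lock() const {
        std::shared_lock&amp;lt;std::shared_mutex&amp;gt; lock(mutex);
        return lock.owns_lock();
    }
};
&lt;/pre&gt;
&lt;p&gt;Note that switching to braces without a name, i.e. &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::shared_lock&amp;lt;std::shared_mutex&amp;gt;{mutex};&lt;/span&gt;&lt;/tt&gt;, would not help either: that creates a temporary which is destroyed, and thus unlocked, at the end of the statement.&lt;/p&gt;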
  </description>
 </item>
 <item>
  <title>An awful part of C++17</title>
  <link>http://0x80.pl/notesen/2018-03-16-awful-part-of-cpp.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-03-16-awful-part-of-cpp.html</guid>
  <pubDate>Fri, 16 Mar 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="the-current-state"&gt;
&lt;h1&gt;The current state&lt;/h1&gt;
&lt;p&gt;I was really happy when I saw that C++17 finally introduced standard functions to
parse integers and floats. It is a group of functions &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::from_chars&lt;/span&gt;&lt;/tt&gt;
defined in the header &lt;a class="reference external" href="http://en.cppreference.com/w/cpp/header/charconv"&gt;charconv&lt;/a&gt;. Unfortunately, it was a fleeting moment of
happiness. The proposed API quickly turned out to be awful. Let's look at how the
integer parser is defined (the floating-point parsers are similar):&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;from_chars_result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;errc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;from_chars_result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;from_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                             &lt;/span&gt;&lt;span class="cm"&gt;/*integer type*/&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The API resembles good old C, with one important exception: it's not good at all.
How is one supposed to use it in C++?&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;charconv&amp;gt;&lt;/span&gt;&lt;span class="c1"&gt; // from_chars&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;// ...
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;from_chars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_error_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Yes, to get a char pointer from a string iterator one needs to write
&lt;tt class="docutils literal"&gt;&amp;amp;*it&lt;/tt&gt;. The alternative invocation is &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::from_chars(input.c_str(),&lt;/span&gt;
input.c_str() + &lt;span class="pre"&gt;input.size(),&lt;/span&gt; &lt;span class="pre"&gt;...)&lt;/span&gt;&lt;/tt&gt;. Both are ugly, aren't they?&lt;/p&gt;
&lt;p&gt;To make things funnier, &lt;tt class="docutils literal"&gt;from_chars&lt;/tt&gt; recognizes the minus character,
but not the plus character. Yes, it's not a mistake: you can convert the
string &amp;quot;-42&amp;quot;, but for &amp;quot;+42&amp;quot; you'll get an error.&lt;/p&gt;
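&lt;p&gt;A small sketch that makes the sign handling easy to check is shown below. The wrapper &lt;tt class="docutils literal"&gt;parse_long&lt;/tt&gt; is a hypothetical helper, not a part of the standard library; it also insists that the whole input is consumed, which is an extra assumption of this example.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;charconv&amp;gt;     // std::from_chars
#include &amp;lt;string&amp;gt;
#include &amp;lt;system_error&amp;gt; // std::errc

// Hypothetical helper: parses the whole string as a base-10 long.
// Returns true only if parsing succeeded and consumed every character.
bool parse_long(const std::string&amp;amp; input, long&amp;amp; result) {
    const char* first = input.data();
    const char* last = input.data() + input.size();

    const auto ret = std::from_chars(first, last, result);

    // std::errc is a scoped enum, so it has to be compared explicitly
    // with the value-initialized (&amp;quot;no error&amp;quot;) code.
    return ret.ec == std::errc() &amp;amp;&amp;amp; ret.ptr == last;
}
&lt;/pre&gt;
&lt;p&gt;With this helper, &amp;quot;-42&amp;quot; and &amp;quot;42&amp;quot; are accepted, while &amp;quot;+42&amp;quot; is rejected, because &lt;tt class="docutils literal"&gt;from_chars&lt;/tt&gt; reports &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::errc::invalid_argument&lt;/span&gt;&lt;/tt&gt; for it.&lt;/p&gt;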
&lt;p&gt;Compare this with &lt;tt class="docutils literal"&gt;strtol&lt;/tt&gt;, which was defined several decades ago:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;cstdio&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;cstdlib&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;// ...
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;strtol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;strerror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Intersection of ordered sets</title>
  <link>http://0x80.pl/notesen/2018-03-14-set-intersection.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-03-14-set-intersection.html</guid>
  <pubDate>Wed, 14 Mar 2018 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The intersection of two sets represented by sorted collections (like lists or
arrays) can be done in linear time. If we label with &lt;span class="math"&gt;&lt;i&gt;k&lt;/i&gt;&lt;/span&gt; the size of one
collection, and with &lt;span class="math"&gt;&lt;i&gt;n&lt;/i&gt;&lt;/span&gt; the size of another collection, then the
complexity of intersection is &lt;span class="math"&gt;O(&lt;i&gt;n&lt;/i&gt; + &lt;i&gt;k&lt;/i&gt;)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Below is a naive C++ implementation; the C++ standard library comes with
&lt;a class="reference external" href="http://en.cppreference.com/w/cpp/algorithm/set_intersection"&gt;std::set_intersection&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;INSERTER&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;custom_set_intersection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;INSERTER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="c1"&gt;// A[i] == B[j]
&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;

            &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;If the sizes of the collections differ greatly, we can use one of the two
approaches described below. I use the terms &amp;quot;smaller&amp;quot;
(&lt;span class="math"&gt;&lt;i&gt;k&lt;/i&gt;&lt;/span&gt; items) and &amp;quot;larger&amp;quot; collection/set (&lt;span class="math"&gt;&lt;i&gt;n&lt;/i&gt;&lt;/span&gt; items).&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE/AVX: absolute value of difference of unsigned integers</title>
  <link>http://0x80.pl/notesen/2018-03-11-sse-abs-unsigned.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-03-11-sse-abs-unsigned.html</guid>
  <pubDate>Sun, 11 Mar 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;With signed numbers it is really easy. We can subtract two numbers (instruction
&lt;tt class="docutils literal"&gt;psub&lt;/tt&gt;) and then calculate &lt;tt class="docutils literal"&gt;abs&lt;/tt&gt; directly, as Intel's SIMD instruction
sets, starting with SSSE3, support the &lt;tt class="docutils literal"&gt;abs&lt;/tt&gt; operation.&lt;/p&gt;
&lt;p&gt;To calculate the absolute value of the difference of two unsigned numbers we can use
&lt;strong&gt;saturated arithmetic&lt;/strong&gt;. The saturated subtract (instructions &lt;tt class="docutils literal"&gt;psubusX&lt;/tt&gt;) is
equivalent to &lt;cite&gt;max(a - b, 0)&lt;/cite&gt;; that is, whenever the subtraction would yield a
negative result, the final result is zero.&lt;/p&gt;
&lt;p&gt;We need to calculate two saturated subtracts, one for &lt;cite&gt;a - b&lt;/cite&gt;, another for &lt;cite&gt;b -
a&lt;/cite&gt;; then merge them with bitwise or -- it's safe, because at least one of the two
results is always zero.&lt;/p&gt;
&lt;p&gt;Below is an SSE implementation; full source code is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/blob/master/sse/simd-abs-sub-uint.c"&gt;available&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;abs_sub_epu8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ab&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_mm_subs_epu8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ba&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_mm_subs_epu8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_mm_or_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ba&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
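&lt;p&gt;To see why the bitwise or is safe, a single 8-bit lane can be modelled in plain C. This is only an illustrative sketch: the helper names &lt;tt class="docutils literal"&gt;subs_u8&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;abs_sub_u8&lt;/tt&gt; are made up here and are not part of the linked source code.&lt;/p&gt;

```c
/* Scalar model of one lane of psubusb: max(a - b, 0). Hypothetical helper. */
static unsigned char subs_u8(unsigned char a, unsigned char b) {
    return (a > b) ? (unsigned char)(a - b) : 0;
}

/* |a - b| for unsigned bytes: at most one of the two terms is non-zero,
   so or-ing them is the same as picking the non-zero one. */
unsigned char abs_sub_u8(unsigned char a, unsigned char b) {
    const unsigned char ab = subs_u8(a, b);
    const unsigned char ba = subs_u8(b, a);

    return (unsigned char)(ab | ba);
}
```

&lt;p&gt;For example, with a = 10 and b = 200 the first subtraction saturates to 0, the second one gives 190, and the bitwise or of 0 and 190 is 190.&lt;/p&gt;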
  </description>
 </item>
 <item>
  <title>Is power of two --- BMI1 version</title>
  <link>http://0x80.pl/notesen/2018-03-11-is-power-of-two-bmi1.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2018-03-11-is-power-of-two-bmi1.html</guid>
  <pubDate>Sun, 11 Mar 2018 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;To check if a number is a power of two, the instruction BLSR from the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#BMI1_(Bit_Manipulation_Instruction_Set_1)"&gt;BMI1&lt;/a&gt; extension
can be used. The instruction resets the lowest set bit of a number, i.e. calculates
&lt;tt class="docutils literal"&gt;(x - 1) and x&lt;/tt&gt;. A sample C procedure that uses the bit trick:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is_power_of_two&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;If a number has exactly one bit set then BLSR yields &lt;strong&gt;zero&lt;/strong&gt;.  However, when
the input of BLSR is zero, the instruction also yields zero.  Fortunately, BLSR
sets CPU flags in the following way:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;ZF is set when result is zero,&lt;/li&gt;
&lt;li&gt;CF is set when input is zero (note that if CF is set, ZF is also set).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to that we can properly handle all cases. Below is the assembly code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
blsr %eax, %eax

// result = (ZF == 1) and (CF == 0)
setz %al                // al = ZF
sbb  $0, %al            // al = ZF - CF
movzx %al, %eax         // cast
&lt;/pre&gt;
&lt;p&gt;A sample program is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/bit-twiddling"&gt;available&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>A short report from code::dive 2017</title>
  <link>http://0x80.pl/notesen/2017-11-26-code-dive-2017.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2017-11-26-code-dive-2017.html</guid>
  <pubDate>Sun, 26 Nov 2017 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="fpga-for-a-software-developer"&gt;
&lt;h1&gt;FPGA for a software developer&lt;/h1&gt;
&lt;p&gt;It was one of my favourite talks. Miodrag presented the architecture of a typical
FPGA and showed a number of open source tools that enable us to play with
FPGAs. He also went through the core parts of Verilog, using a simple
8-bit CPU as an example. The CPU had a fully functional design -- it had an ALU and a
control unit, and it was able to decode variable-length opcodes and execute them.
At the end the speaker ran a sample program on the CPU implemented on real,
cheap FPGA hardware. Wow!&lt;/p&gt;
&lt;p&gt;Although I already knew some bits of FPGA and Verilog, the presentation
gave me a grasp of the whole stack needed to use the programmable arrays in
practice.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>ARM Neon and Base64 encoding &amp; decoding</title>
  <link>http://0x80.pl/notesen/2017-01-07-base64-simd-neon.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2017-01-07-base64-simd-neon.html</guid>
  <pubDate>Sat, 07 Jan 2017 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Base64 algorithms were a subject of my interest last year and I proposed a few
different SIMD approaches for both encoding and decoding. This text sums up my
experiments with an ARM Neon implementation. For the algorithms' details please refer
to &lt;a class="reference external" href="2016-01-12-sse-base64-encoding.html"&gt;Base64 encoding with SIMD instructions&lt;/a&gt; and &lt;a class="reference external" href="2016-01-17-sse-base64-decoding.html"&gt;Base64 decoding with SIMD
instructions&lt;/a&gt;.&lt;/p&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction" id="toc-entry-1"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#arm-neon" id="toc-entry-2"&gt;ARM Neon&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#load-and-store" id="toc-entry-3"&gt;Load and store&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#conditions" id="toc-entry-4"&gt;Conditions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#bit-operations" id="toc-entry-5"&gt;Bit operations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#base64-encoding" id="toc-entry-6"&gt;Base64 encoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#base64-decoding" id="toc-entry-7"&gt;Base64 decoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#performance-results" id="toc-entry-8"&gt;Performance results&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#encoding-quadwords" id="toc-entry-9"&gt;Encoding (quadwords)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#decoding-doublewords" id="toc-entry-10"&gt;Decoding (doublewords)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#see-also" id="toc-entry-11"&gt;See also&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#source-code" id="toc-entry-12"&gt;Source code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512 --- first bit set in a large array</title>
  <link>http://0x80.pl/notesen/2016-12-22-avx512-sparse-bfs.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-12-22-avx512-sparse-bfs.html</guid>
  <pubDate>Thu, 22 Dec 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;Given a sparse array of 64-bit values, we want to find the first set
bit.&lt;/p&gt;
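&lt;p&gt;As a point of reference, the task can be written as a plain scalar loop that skips zero words. This is only a baseline sketch, not the AVX512 method: the function name &lt;tt class="docutils literal"&gt;first_bit_set&lt;/tt&gt; is hypothetical, the bit numbering (starting from the least significant bit of the first word) is an assumption, and &lt;tt class="docutils literal"&gt;__builtin_ctzll&lt;/tt&gt; is a GCC/Clang builtin.&lt;/p&gt;

```c
/* Scalar baseline: index of the first set bit in an array of 64-bit words,
   or -1 when all words are zero. Bits are numbered from the least
   significant bit of words[0]; this numbering is an assumption. */
long long first_bit_set(const unsigned long long *words, unsigned long n) {
    for (unsigned long i = 0; i != n; ++i) {
        if (words[i] != 0) {
            /* __builtin_ctzll (GCC/Clang) counts trailing zero bits */
            return (long long)i * 64 + __builtin_ctzll(words[i]);
        }
    }

    return -1;
}
```

&lt;p&gt;In a sparse array almost every iteration only tests a word against zero, which is the part a vectorized version can speed up.&lt;/p&gt;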
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SWAR check if all chars are digits</title>
  <link>http://0x80.pl/notesen/2016-12-21-swar-digits-validate.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-12-21-swar-digits-validate.html</guid>
  <pubDate>Wed, 21 Dec 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;We have a string and want to check if all its characters are ASCII digits.&lt;/p&gt;
&lt;p&gt;It's a remnant of my experiments in number parsing: I was curious if separating
validation from actual conversion would be profitable.  The answer is no in a
generic case. For really long numbers there might be some improvement, but in
reality inputs are usually short.&lt;/p&gt;
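&lt;p&gt;For reference, the check itself can be written as a plain scalar loop; any SWAR version has to give the same answers while processing several characters per step. The sketch below is such a baseline only, and the function name &lt;tt class="docutils literal"&gt;all_digits&lt;/tt&gt; is made up for this illustration.&lt;/p&gt;

```c
/* Scalar baseline: returns 1 when all n characters of s are ASCII digits
   (an empty input counts as valid), 0 otherwise. */
int all_digits(const char *s, unsigned long n) {
    for (unsigned long i = 0; i != n; ++i) {
        /* after subtracting '0' and wrapping to an unsigned byte, every
           non-digit character ends up above 9, so one comparison suffices */
        const unsigned char d = (unsigned char)(s[i] - '0');
        if (d > 9) {
            return 0;
        }
    }

    return 1;
}
```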
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Population count using XOP instructions</title>
  <link>http://0x80.pl/notesen/2016-12-16-xop-popcnt.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-12-16-xop-popcnt.html</guid>
  <pubDate>Fri, 16 Dec 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/XOP_instruction_set"&gt;AMD XOP&lt;/a&gt; defines the instruction &lt;tt class="docutils literal"&gt;VPPERM&lt;/tt&gt;, which does a lookup in a pair of SSE
registers. The instruction is similar to &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt;, but apart from the wider,
5-bit index, it can perform several extra operations selected by the
higher 3 bits.&lt;/p&gt;
&lt;p&gt;I showed that &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; can be used to &lt;a class="reference external" href="2008-05-24-sse-popcount.html"&gt;implement population count
procedure&lt;/a&gt;. With 4-bit indices such a procedure is straightforward. We split
a byte vector into two vectors holding the lower and higher nibbles, and invoke
the instruction twice, getting the popcount for both nibbles. Next these popcounts
are added together, forming 8-bit counters, which are added in the end.&lt;/p&gt;
&lt;p&gt;A similar procedure can be built around &lt;tt class="docutils literal"&gt;VPPERM&lt;/tt&gt;. However, to fully utilize
5-bit indices a slightly different strategy is needed. We process two vectors
in one step, treating byte pairs as 16-bit words. We call &lt;tt class="docutils literal"&gt;VPPERM&lt;/tt&gt; to
calculate the popcount of three 5-bit fields; the one remaining bit of the 16-bit word
is counted separately.&lt;/p&gt;
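&lt;p&gt;The nibble-lookup idea behind the &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; procedure can be sketched in scalar C, one byte at a time. This is only a model of what the vector code does for all bytes in parallel; the names &lt;tt class="docutils literal"&gt;nibble_popcount&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;popcount_bytes&lt;/tt&gt; are made up for this illustration.&lt;/p&gt;

```c
/* Popcount of every 4-bit value; the vector version would keep this
   16-entry table in a register and index it with PSHUFB. */
static const unsigned char nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model: sum of set bits in n bytes, two table lookups per byte. */
unsigned popcount_bytes(const unsigned char *data, unsigned long n) {
    unsigned total = 0;

    for (unsigned long i = 0; i != n; ++i) {
        const unsigned lo = data[i] % 16;   /* lower nibble  */
        const unsigned hi = data[i] / 16;   /* higher nibble */

        total += nibble_popcount[lo] + nibble_popcount[hi];
    }

    return total;
}
```

&lt;p&gt;With a 5-bit index the same table would need 32 entries, which is why a lookup in a pair of registers is required.&lt;/p&gt;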
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD-friendly algorithms for substring searching</title>
  <link>http://0x80.pl/notesen/2016-11-28-simd-strfind.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-11-28-simd-strfind.html</guid>
  <pubDate>Mon, 28 Nov 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Popular programming languages provide methods or functions which locate a
substring in a given string. In C it is the function &lt;tt class="docutils literal"&gt;strstr&lt;/tt&gt;, the C++
class &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::string&lt;/span&gt;&lt;/tt&gt; has the method &lt;tt class="docutils literal"&gt;find&lt;/tt&gt;, Python's &lt;tt class="docutils literal"&gt;str&lt;/tt&gt; has the methods
&lt;tt class="docutils literal"&gt;find&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;index&lt;/tt&gt;, and so on. All these APIs were designed for
&lt;strong&gt;one-shot searches&lt;/strong&gt;.  Over the past decades several algorithms solving this
problem have been designed; an excellent page by &lt;strong&gt;Christian Charras&lt;/strong&gt; and
&lt;strong&gt;Thierry Lecroq&lt;/strong&gt; &lt;a class="reference external" href="http://www-igm.univ-mlv.fr/~lecroq/string/"&gt;lists most of them&lt;/a&gt; (if not all). Basically these
algorithms can be split into two major categories: (1) based on a
Deterministic Finite Automaton, like Knuth-Morris-Pratt, Boyer-Moore, etc.,
and (2) based on a simple comparison, like the Karp-Rabin algorithm.&lt;/p&gt;
&lt;p&gt;The main problem with these standard algorithms is a silent assumption
that comparing a pair of characters, looking up in an extra table and
conditions are cheap, while comparing two substrings is expensive.&lt;/p&gt;
&lt;p&gt;But current desktop CPUs do not meet this assumption, in particular:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;There is no difference in comparing one, two, four or 8 bytes on a 64-bit
CPU.  When a processor supports SIMD instructions, then comparing vectors
(that is, 16, 32 or even 64 bytes) is as cheap as comparing a single byte.&lt;/li&gt;
&lt;li&gt;Thus comparing short sequences of chars can be faster than fancy algorithms
which avoid such comparisons.&lt;/li&gt;
&lt;li&gt;Looking up in a table costs one memory fetch, so at least an L1 cache round trip
(~3 cycles). Reading char-by-char also costs as many cycles.&lt;/li&gt;
&lt;li&gt;Mispredicted jumps cost several cycles of penalty (~10-20 cycles).&lt;/li&gt;
&lt;li&gt;There is a short chain of dependencies: read a char, compare it, conditionally
jump, which makes it hard to utilize the out-of-order execution capabilities present
in a CPU.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction" id="toc-entry-1"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#solution" id="toc-entry-2"&gt;Solution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm-1-generic-simd" id="toc-entry-3"&gt;Algorithm 1: Generic SIMD&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm" id="toc-entry-4"&gt;Algorithm&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#example" id="toc-entry-5"&gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#first-and-last" id="toc-entry-6"&gt;First and last?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#implementation" id="toc-entry-7"&gt;Implementation&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-avx2" id="toc-entry-8"&gt;SSE &amp;amp; AVX2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#swar" id="toc-entry-9"&gt;SWAR&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#avx512f" id="toc-entry-10"&gt;AVX512F&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#arm-neon-32-bit-code" id="toc-entry-11"&gt;ARM Neon (32 bit code)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#aarch64-64-bit-code" id="toc-entry-12"&gt;AArch64 (64 bit code)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm-2-sse-specific-mpsadbw" id="toc-entry-13"&gt;Algorithm 2: SSE-specific (MPSADBW)&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm-1" id="toc-entry-14"&gt;Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#implementation-1" id="toc-entry-15"&gt;Implementation&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse" id="toc-entry-16"&gt;SSE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#avx512f-1" id="toc-entry-17"&gt;AVX512F&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm-3-sse4-2-specific-pcmpestrm" id="toc-entry-18"&gt;Algorithm 3: SSE4.2-specific (PCMPESTRM)&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#algorithm-2" id="toc-entry-19"&gt;Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#implementation-2" id="toc-entry-20"&gt;Implementation&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#sse-1" id="toc-entry-21"&gt;SSE&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#performance-results" id="toc-entry-22"&gt;Performance results&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#x64-computers" id="toc-entry-23"&gt;x64 computers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#arm-computers" id="toc-entry-24"&gt;ARM computers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#conclusions-and-remarks" id="toc-entry-25"&gt;Conclusions and remarks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#acknowledgments" id="toc-entry-26"&gt;Acknowledgments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#source-code" id="toc-entry-27"&gt;Source code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#history" id="toc-entry-28"&gt;History&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>What does AVX512 conflict detection do?</title>
  <link>http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html</guid>
  <pubDate>Sun, 23 Oct 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="avx512cd"&gt;
&lt;h1&gt;AVX512CD&lt;/h1&gt;
&lt;p&gt;AVX512CD, or &lt;strong&gt;conflict detection&lt;/strong&gt;, is a subset of AVX512 introducing
the following instructions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;broadcast bit-mask to byte-mask
(&lt;tt class="docutils literal"&gt;vpbroadcastmw2d&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;vpbroadcastmb2q&lt;/tt&gt;);&lt;/li&gt;
&lt;li&gt;parallel lzcnt, i.e. counting leading zeros
(&lt;tt class="docutils literal"&gt;vplzcntd&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;vplzcntq&lt;/tt&gt;);&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;conflict detection&lt;/strong&gt;
(&lt;tt class="docutils literal"&gt;vpconflictd&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;vpconflictq&lt;/tt&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first two are not very interesting. Converting a bit-mask into a
byte-mask in fact saves one move (&lt;tt class="docutils literal"&gt;vmovdqa32&lt;/tt&gt; with zero mask and a
constant), so it isn't too innovative. I can't find any real usage for
lzcnt. Well, I wrote parallel popcount using this instruction (yes, 16
while-loops...), but it was just to show that I could.&lt;/p&gt;
&lt;p&gt;In my opinion the conflict detection instructions are the most interesting.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Detecting bit patterns with series of zeros followed by ones</title>
  <link>http://0x80.pl/notesen/2016-10-16-detecting-bit-pattern.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-10-16-detecting-bit-pattern.html</guid>
  <pubDate>Sun, 16 Oct 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;We want to detect if a bit pattern is a sequence of zeros followed by ones,
reading from the least significant bit. The table below lists all 8-bit patterns having this form.&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="69%" /&gt;
&lt;col width="31%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;bin&lt;/th&gt;
&lt;th class="head"&gt;hex&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;1000_0000&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1100_0000&lt;/td&gt;
&lt;td&gt;c0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1110_0000&lt;/td&gt;
&lt;td&gt;e0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1111_0000&lt;/td&gt;
&lt;td&gt;f0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1111_1000&lt;/td&gt;
&lt;td&gt;f8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1111_1100&lt;/td&gt;
&lt;td&gt;fc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1111_1110&lt;/td&gt;
&lt;td&gt;fe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1111_1111&lt;/td&gt;
&lt;td&gt;ff&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
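&lt;p&gt;One possible scalar test for the patterns listed in the table is sketched below. It is an illustration under stated assumptions, not necessarily the method developed in the article: the function name &lt;tt class="docutils literal"&gt;is_ones_then_zeros_u8&lt;/tt&gt; is hypothetical and the input is assumed to be an 8-bit value.&lt;/p&gt;

```c
/* Returns 1 when the 8-bit value x consists of ones in the most significant
   bits followed only by zeros (80, c0, ..., fe, ff in hex), 0 otherwise.
   Hypothetical helper, not taken from the article. */
int is_ones_then_zeros_u8(unsigned x) {
    if (x == 0 || x > 0xff) {
        return 0;
    }

    /* x - 1 turns the trailing zeros into ones and clears the lowest one;
       or-ing it with x fills the whole byte only for a valid pattern. */
    return (x | (x - 1)) == 0xff;
}
```

&lt;p&gt;For instance, for x = e0 we get x - 1 = df and the or yields ff, while for x = a0 we get 9f and the or yields bf, so the value is rejected.&lt;/p&gt;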
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Byte-wise alignr in AVX512F</title>
  <link>http://0x80.pl/notesen/2016-10-16-avx512-byte-alignr.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-10-16-avx512-byte-alignr.html</guid>
  <pubDate>Sun, 16 Oct 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The instruction &lt;tt class="docutils literal"&gt;alignr&lt;/tt&gt; in Intel SIMD builds a new vector from a
subrange of two concatenated vectors; its downside is accepting only
compile-time constants.  AVX512F lacks byte-wise instructions; the
available variant of &lt;tt class="docutils literal"&gt;alignr&lt;/tt&gt; works at the level of 32-bit words.&lt;/p&gt;
&lt;p&gt;Byte-wise &lt;tt class="docutils literal"&gt;alignr&lt;/tt&gt; is viable in AVX512F, using techniques used to
handle so-called long numbers. We can do shifts at 32-bit word
granularity using &lt;tt class="docutils literal"&gt;valignd&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_alignr_epi32&lt;/tt&gt;), then a byte-wise
shift inside each 32-bit word is possible. To perform the latter shift
we need bytes from the next 32-bit word.&lt;/p&gt;
&lt;p&gt;This forces us to build two vectors, having the current and next words at
corresponding positions. Then these words are shifted accordingly and
finally merged into one 32-bit word.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>GNU std::string::find is very slow</title>
  <link>http://0x80.pl/notesen/2016-10-08-slow-std-string-find.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-10-08-slow-std-string-find.html</guid>
  <pubDate>Sat, 08 Oct 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="see-also"&gt;
&lt;h1&gt;See also&lt;/h1&gt;
&lt;p&gt;The problem has &lt;a class="reference external" href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66414"&gt;already been reported&lt;/a&gt; (bug 66414).&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Sorting an AVX512 register</title>
  <link>http://0x80.pl/notesen/2016-10-08-avx512-sort-register.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-10-08-avx512-sort-register.html</guid>
  <pubDate>Sat, 08 Oct 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The presented method sorts a whole AVX512 register or its subrange; it is
a variant of &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Counting_sort"&gt;counting sort&lt;/a&gt;. The time complexity is linear; moreover, the method
works entirely on registers, with no extra memory operations. It may also
be easily extended to sorting more than one register.&lt;/p&gt;
&lt;p&gt;The method is suitable for sorting 32- and 64-bit integers, and also floating
point numbers, both single and double precision.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512F base64 coding and decoding</title>
  <link>http://0x80.pl/notesen/2016-09-17-avx512-foundation-base64.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-09-17-avx512-foundation-base64.html</guid>
  <pubDate>Sat, 17 Sep 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Both base64 coding and decoding algorithms can be vectorized, i.e.  SIMD
instructions can be utilized, gaining significant speed-up over plain, scalar
versions.  I've shown different vectorization approaches in a series of
articles:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference external" href="2016-01-12-sse-base64-encoding.html"&gt;Base64 encoding with SIMD instructions&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="2016-01-17-sse-base64-decoding.html"&gt;Base64 decoding with SIMD instructions&lt;/a&gt; and&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="2016-04-03-avx512-base64.html"&gt;Base64 encoding &amp;amp; decoding using AVX512BW instructions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/AVX-512"&gt;AVX-512&lt;/a&gt; is a recent extension to Intel's ISA; unfortunately the
extension is split into several subextensions. In the last article from the
list I described usage of the subextension AVX512BW (&lt;strong&gt;Byte-Word&lt;/strong&gt;); at the
moment of writing both articles AVX512BW was not available.&lt;/p&gt;
&lt;p&gt;However, in 2016 processors with the subextension
AVX512F (&lt;strong&gt;Foundation&lt;/strong&gt;) appeared on the market. Among the many advantages of AVX512F there is one
serious problem: the lack of instructions working at the byte and word level.
The minimum vector element size is 32 bits.&lt;/p&gt;
&lt;p&gt;This article is a study of implementing base64 algorithms with the foundation
instructions of AVX512F. The major contributions are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The new, binary search vectorized lookup for base64 encoding.&lt;/li&gt;
&lt;li&gt;Evidence that &lt;a class="reference external" href="http://en.wikipedia.org/wiki/SWAR"&gt;SWAR&lt;/a&gt; techniques, even if they do not seem promising at
first glance, are beneficial.&lt;/li&gt;
&lt;li&gt;Use of &lt;a class="reference external" href="2015-03-22-avx512-ternary-functions.html"&gt;a ternary logic instruction&lt;/a&gt; makes code simpler and faster.&lt;/li&gt;
&lt;li&gt;Evaluation of gather instruction in the context of lookup-based methods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measurements from a real machine&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The text is split into four parts:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;description of SWAR techniques required for algorithms;&lt;/li&gt;
&lt;li&gt;details of base64 encoding;&lt;/li&gt;
&lt;li&gt;details of base64 decoding;&lt;/li&gt;
&lt;li&gt;experiment results and final remarks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As I don't want to repeat myself too much, please refer to the linked
articles for other algorithms for SSE and AVX2 and their analysis.&lt;/p&gt;
&lt;p&gt;2016-12-18 note: in the initial version of this text I wrongly assumed
the order of input words; &lt;strong&gt;Alfred Klomp&lt;/strong&gt; noted that the standard imposes
a specific order. Today's change fixes this error.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;div class="contents topic" id="table-of-contents"&gt;
&lt;p class="topic-title"&gt;Table of contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction" id="toc-entry-1"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#before-we-start-swar-within-an-avx512-register" id="toc-entry-2"&gt;Before we start: SWAR within an AVX512 register&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#unsigned-compare-with-constant" id="toc-entry-3"&gt;Unsigned compare with constant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#building-a-mask" id="toc-entry-4"&gt;Building a mask&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#adding-modulo-256" id="toc-entry-5"&gt;Adding modulo 256&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#c-notes" id="toc-entry-6"&gt;C++ notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#base64-encoding" id="toc-entry-7"&gt;Base64 encoding&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#loading-data" id="toc-entry-8"&gt;Loading data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#gather-based" id="toc-entry-9"&gt;Gather-based&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#vectorized-approach" id="toc-entry-10"&gt;Vectorized approach&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#unpacking-6-bit-words-into-bytes" id="toc-entry-11"&gt;Unpacking 6-bit words into bytes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#vectorized-lookup-incremental" id="toc-entry-12"&gt;Vectorized lookup &amp;mdash; incremental&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction-1" id="toc-entry-13"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#swar-notes" id="toc-entry-14"&gt;SWAR notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pseudocode" id="toc-entry-15"&gt;Pseudocode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#c-implementation" id="toc-entry-16"&gt;C++ implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#improved-c-implementation" id="toc-entry-17"&gt;Improved C++ implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#vectorized-lookup-binary-search" id="toc-entry-18"&gt;Vectorized lookup &amp;mdash; binary search&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction-2" id="toc-entry-19"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#c-implementation-1" id="toc-entry-20"&gt;C++ implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#storing-data" id="toc-entry-21"&gt;Storing data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#base64-decoding" id="toc-entry-22"&gt;Base64 decoding&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#loading-data-1" id="toc-entry-23"&gt;Loading data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#vectorized-lookup" id="toc-entry-24"&gt;Vectorized lookup&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#introduction-3" id="toc-entry-25"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#swar-notes-1" id="toc-entry-26"&gt;SWAR notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#pseudocode-1" id="toc-entry-27"&gt;Pseudocode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#c-implementation-2" id="toc-entry-28"&gt;C++ implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#packing-data" id="toc-entry-29"&gt;Packing data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#storing-data-1" id="toc-entry-30"&gt;Storing data&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#scatter-based" id="toc-entry-31"&gt;Scatter-based&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#vectorized-approach-1" id="toc-entry-32"&gt;Vectorized approach&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#base64-coding-and-decoding-using-gather" id="toc-entry-33"&gt;Base64 coding and decoding using gather&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#experiment-results" id="toc-entry-34"&gt;Experiment results&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="reference internal" href="#encoding" id="toc-entry-35"&gt;Encoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#decoding" id="toc-entry-36"&gt;Decoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#conclusions" id="toc-entry-37"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#further-work" id="toc-entry-38"&gt;Further work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#acknowledgments" id="toc-entry-39"&gt;Acknowledgments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#source-code" id="toc-entry-40"&gt;Source code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#changes" id="toc-entry-41"&gt;Changes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD bit mask</title>
  <link>http://0x80.pl/notesen/2016-09-14-simd-bit-mask.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-09-14-simd-bit-mask.html</guid>
  <pubDate>Wed, 14 Sep 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="problem"&gt;
&lt;h1&gt;Problem&lt;/h1&gt;
&lt;p&gt;Given a SIMD register (128, 256 or 512 bits wide), we want to set all
&lt;strong&gt;bits&lt;/strong&gt; above a given position &lt;tt class="docutils literal"&gt;k&lt;/tt&gt;, where &lt;tt class="docutils literal"&gt;k&lt;/tt&gt; ranges from 0 to the
register's width.&lt;/p&gt;
&lt;p&gt;Of course a lookup table could be used, but that's not an interesting solution (well, maybe a
little).&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Building a bitmask</title>
  <link>http://0x80.pl/notesen/2016-09-14-building-bitmask.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-09-14-building-bitmask.html</guid>
  <pubDate>Wed, 14 Sep 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="the-problem"&gt;
&lt;h1&gt;The problem&lt;/h1&gt;
&lt;p&gt;There is an array of 32-bit integers and a key &amp;mdash; a specific value. The
result has to be a bit vector with bits set at the positions where the
array items are equal to the key. Pseudocode:&lt;/p&gt;
&lt;pre class="code ada literal-block"&gt;
&lt;span class="c1"&gt;-- n - array size
&lt;/span&gt;&lt;span class="kr"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;..&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="kr"&gt;loop&lt;/span&gt;
    &lt;span class="kr"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kr"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="kr"&gt;then&lt;/span&gt;
        &lt;span class="n"&gt;bitvector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="kr"&gt;else&lt;/span&gt;
        &lt;span class="n"&gt;bitvector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="kr"&gt;end&lt;/span&gt; &lt;span class="nf"&gt;for&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;A C++ interface:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;bitmask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bitvector&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
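&lt;p&gt;For reference, a plain scalar version of this interface could look like the sketch
below. The bit layout is an assumption, as the pseudocode does not fix it: bit &lt;tt class="docutils literal"&gt;i&lt;/tt&gt; is stored in
byte &lt;tt class="docutils literal"&gt;i / 8&lt;/tt&gt; at position &lt;tt class="docutils literal"&gt;i % 8&lt;/tt&gt;, and the caller provides a buffer of at least
&lt;tt class="docutils literal"&gt;(n + 7) / 8&lt;/tt&gt; bytes.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;

// Scalar reference: bit i of the result is 1 when array[i] == key.
// Assumption: bit i lives in byte i / 8 at position i % 8 (least
// significant bit first); bitvector has room for (n + 7) / 8 bytes.
void bitmask(const uint32_t* array, size_t n, const uint32_t key, uint8_t* bitvector) {
    // clear the whole output first, including the unused tail bits
    for (size_t i = 0; i &amp;lt; (n + 7) / 8; i++) {
        bitvector[i] = 0;
    }

    // then set a bit for every item equal to the key
    for (size_t i = 0; i &amp;lt; n; i++) {
        if (array[i] == key) {
            bitvector[i / 8] |= uint8_t(1u &amp;lt;&amp;lt; (i % 8));
        }
    }
}
&lt;/pre&gt;
&lt;p&gt;Any faster variant can be validated against such a reference.&lt;/p&gt;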
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Base64 encoding &amp; decoding using AVX512BW instructions</title>
  <link>http://0x80.pl/notesen/2016-04-03-avx512-base64.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-04-03-avx512-base64.html</guid>
  <pubDate>Sun, 03 Apr 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The SIMD versions of base64 conversion algorithms were described in
&lt;a class="reference external" href="2016-01-12-sse-base64-encoding.html"&gt;Base64 encoding with SIMD instructions&lt;/a&gt; and &lt;a class="reference external" href="2016-01-17-sse-base64-decoding.html"&gt;Base64 decoding with
SIMD instructions&lt;/a&gt;.  I also described an implementation of both encoding and
decoding using &lt;a class="reference external" href="2016-09-17-avx512-foundation-base64.html"&gt;AVX512F (Foundation)&lt;/a&gt; instructions.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/AVX-512"&gt;AVX512BW (Byte &amp;amp; Word)&lt;/a&gt; comes with a great number of new instructions;
the following ones can help with base64-related problems:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpshufb&lt;/tt&gt; (intrinsic &lt;tt class="docutils literal"&gt;_mm512_shuffle_epi8&lt;/tt&gt;) &amp;mdash; does a lookup
within 128-bit lanes, which is sufficient for the base64 algorithms;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpermd&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_permutexvar_epi32&lt;/tt&gt;) &amp;mdash; moves 32-bit
words &lt;strong&gt;across&lt;/strong&gt; the 128-bit lanes;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpsllvw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_sllv_epi16&lt;/tt&gt;) and &lt;tt class="docutils literal"&gt;vpsrlvw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_srlv_epi16&lt;/tt&gt;)
--- shifts individual 16-bit words by a variable amount, saved in a ZMM
register.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The extension &lt;strong&gt;AVX512VBMI&lt;/strong&gt; adds even more powerful instructions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpermb&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_permutexvar_epi8&lt;/tt&gt;) &amp;mdash; does a lookup in a 64-byte table
(a ZMM register). Unlike &lt;tt class="docutils literal"&gt;pshufb&lt;/tt&gt; it doesn't destroy the lookup register;&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpermi2b&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_permutex2var_epi8&lt;/tt&gt;) &amp;mdash; does a lookup in a 128-byte
table formed by two ZMM registers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The extension &lt;strong&gt;AVX512VBMI&lt;/strong&gt; also adds one more, really nice instruction:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpmultishiftqb&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm512_multishift_epi64_epi8&lt;/tt&gt;) &amp;mdash; for each byte of the result selects
an arbitrary (unaligned) 8-bit field from the corresponding 64-bit word.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2018-04-18&lt;/strong&gt;: In the earlier versions of this text I wrongly assumed that
instructions &lt;tt class="docutils literal"&gt;vpermb&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;vpermi2b&lt;/tt&gt; are part of AVX512BW. Sorry for
that.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Implementing byte-wise lookup table with PSHUFB</title>
  <link>http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html</guid>
  <pubDate>Sun, 13 Mar 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In articles about base64 &lt;a class="reference external" href="2016-01-12-sse-base64-encoding.html"&gt;encoding&lt;/a&gt; and &lt;a class="reference external" href="2016-01-17-sse-base64-decoding.html"&gt;decoding&lt;/a&gt; I showed how to implement
a SIMD version of a lookup table using basic vector instructions. This text
describes another technique which employs my favourite instruction &lt;tt class="docutils literal"&gt;pshufb&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;The task is defined as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;input range (0..255) is split into several subranges;&lt;/li&gt;
&lt;li&gt;for each subrange a predefined value is assigned.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now, for an input value the proper subrange is determined, and the value associated
with that subrange is returned. Of course, everything is done in parallel.&lt;/p&gt;
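&lt;p&gt;Written as plain scalar code, one input byte at a time, the task could look like
the sketch below; a SIMD version has to compute the same function for every byte of
a vector. The subranges and their values are made up purely for illustration.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;

// One subrange [lo, hi] (both ends inclusive) with its associated value.
struct Subrange {
    uint8_t lo;
    uint8_t hi;
    uint8_t value;
};

// Hypothetical table, chosen only for illustration; it covers all of 0..255.
static const Subrange subranges[] = {
    {  0,  31, 0xa0},
    { 32, 127, 0xb1},
    {128, 255, 0xc2},
};

// Scalar reference: find the subrange containing x and return its value.
uint8_t lookup(uint8_t x) {
    for (const Subrange&amp;amp; r : subranges) {
        if (x &amp;gt;= r.lo &amp;amp;&amp;amp; x &amp;lt;= r.hi) {
            return r.value;
        }
    }
    return 0; // not reached while the table covers the whole input range
}
&lt;/pre&gt;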
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Base64 decoding with SIMD instructions</title>
  <link>http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html</guid>
  <pubDate>Sun, 17 Jan 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The surprisingly good results of &lt;a class="reference external" href="2016-01-12-sse-base64-encoding.html"&gt;base64 encoding with SIMD instructions&lt;/a&gt; prompted
me to check the opposite algorithm, i.e. the decoding. The decoding is slightly more
complicated as it has to check the input's validity.&lt;/p&gt;
&lt;p&gt;A decoder must also consider the character '=' at the end of input, but since it's
done once, I didn't bother with this in the sample code.&lt;/p&gt;
&lt;p&gt;2016-12-18 note: in the initial version of this text I wrongly assumed
the order of input words; &lt;strong&gt;Alfred Klomp&lt;/strong&gt; noted that the standard imposes
a specific order. Today's change fixes this error.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Base64 encoding with SIMD instructions</title>
  <link>http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html</guid>
  <pubDate>Tue, 12 Jan 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;I had supposed that SIMD-ization of a base64 encoder is not worth bothering
with, and &lt;strong&gt;I was wrong&lt;/strong&gt;. When compared to scalar code, SSE code is
&lt;strong&gt;2 times&lt;/strong&gt; faster on Core i5 (Westmere), and around &lt;strong&gt;2.5 times&lt;/strong&gt; faster
on Core i7 (Haswell &amp;amp; Skylake).  AVX2 code is &lt;strong&gt;3.5 times&lt;/strong&gt; faster on
Core i7 (Skylake).&lt;/p&gt;
&lt;p&gt;2016-12-18 note: in the initial version of this text I wrongly assumed
the order of input words; &lt;strong&gt;Alfred Klomp&lt;/strong&gt; noted that the standard imposes
a specific order. Today's change fixes this error.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Speeding up letter case conversion</title>
  <link>http://0x80.pl/notesen/2016-01-06-swar-swap-case.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2016-01-06-swar-swap-case.html</guid>
  <pubDate>Wed, 06 Jan 2016 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The aim of this text is to show how a simple procedure which changes the case
of letters can be rewritten as a SWAR version, gaining a significant boost.
The article explores the &amp;quot;to lower case&amp;quot; method, however the opposite
conversion is very easy to derive.&lt;/p&gt;
&lt;p&gt;To be honest I have no idea if changing letter case is a crucial task in
any problem. My knowledge and experience suggest that the answer is
&amp;quot;no&amp;quot;, but who knows.&lt;/p&gt;
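&lt;p&gt;To give a flavour of the approach, here is a minimal SWAR sketch that lowers eight
ASCII characters packed in one 64-bit word. It is a commonly used formulation and
not necessarily the exact variant developed in the article; the function name is
made up.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;

// Converts eight ASCII characters packed in a 64-bit word to lower case.
// Bytes outside 'A'..'Z', including non-ASCII bytes (0x80 and above),
// are left unchanged.
uint64_t to_lower_swar(uint64_t x) {
    const uint64_t ones = 0x0101010101010101ULL;

    // Work on the lower 7 bits of each byte, so the additions below
    // never carry into a neighbouring byte.
    const uint64_t heptets = x &amp;amp; (0x7f * ones);

    // Bit 7 of a byte gets set when the byte is at least 'A' ...
    const uint64_t ge_A = heptets + (0x80 - 'A') * ones;
    // ... and here when the byte is greater than 'Z'.
    const uint64_t gt_Z = heptets + (0x7f - 'Z') * ones;

    // Bit 7 stays set only for ASCII bytes in the range 'A'..'Z'.
    const uint64_t is_upper = (ge_A ^ gt_Z) &amp;amp; ~x &amp;amp; (0x80 * ones);

    // Move the flag from bit 7 to bit 5 (0x20), the ASCII case bit.
    return x | (is_upper &amp;gt;&amp;gt; 2);
}
&lt;/pre&gt;
&lt;p&gt;The function has no branches and no per-character loop, which is where the
speedup over a byte-by-byte procedure is expected to come from.&lt;/p&gt;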
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Fast conversion of floating-point values to string</title>
  <link>http://0x80.pl/notesen/2015-12-29-float-to-string.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-12-29-float-to-string.html</guid>
  <pubDate>Tue, 29 Dec 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The conversion of floating-point numbers to a string representation is not an
easy task. Such a procedure must deal with various special FP values, perform
proper rounding and so on. The paper &lt;a class="reference external" href="http://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf"&gt;Printing Floating-Point Numbers Quickly
and Accurately&lt;/a&gt; [PDF] by Robert G. Burger &amp;amp; R. Kent Dybvig describes a
procedure which solves the problem correctly.&lt;/p&gt;
&lt;p&gt;However, in some applications (mostly logging, debugging) rounding and accuracy
are not as important as the speed. Sometimes we simply want to know if a number
was 1000.5 or 0.5 and even if we read &amp;quot;0.499999&amp;quot; nothing wrong would happen.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Base64 encoding --- implementation study</title>
  <link>http://0x80.pl/notesen/2015-12-27-base64-encoding.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-12-27-base64-encoding.html</guid>
  <pubDate>Sun, 27 Dec 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In a basic step of the Base64 encoding three bytes are processed, producing
four output bytes. The step consists of the following stages:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;load 3 bytes;&lt;/li&gt;
&lt;li&gt;split these 3 bytes to four 6-bit indices;&lt;/li&gt;
&lt;li&gt;translate the indices using a lookup table;&lt;/li&gt;
&lt;li&gt;save four bytes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In this text different &lt;strong&gt;implementations&lt;/strong&gt; of the procedure are examined.&lt;/p&gt;
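&lt;p&gt;As a baseline, the four stages can be sketched in straightforward scalar code.
The function and table names are illustrative only; the bit order follows the base64
standard, where the first input byte supplies the most significant bits.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;

static const char base64_alphabet[] =
    &amp;quot;ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/&amp;quot;;

// Encodes exactly three input bytes into four output characters.
void encode_step(const uint8_t* input, char* output) {
    // 1. load 3 bytes into one 24-bit word, first byte in the highest bits
    const uint32_t word = (uint32_t(input[0]) &amp;lt;&amp;lt; 16)
                        | (uint32_t(input[1]) &amp;lt;&amp;lt; 8)
                        |  uint32_t(input[2]);

    // 2. split the word into four 6-bit indices
    const uint32_t a = (word &amp;gt;&amp;gt; 18) &amp;amp; 0x3f;
    const uint32_t b = (word &amp;gt;&amp;gt; 12) &amp;amp; 0x3f;
    const uint32_t c = (word &amp;gt;&amp;gt; 6) &amp;amp; 0x3f;
    const uint32_t d = word &amp;amp; 0x3f;

    // 3. translate the indices using a lookup table
    // 4. save four bytes
    output[0] = base64_alphabet[a];
    output[1] = base64_alphabet[b];
    output[2] = base64_alphabet[c];
    output[3] = base64_alphabet[d];
}
&lt;/pre&gt;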
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Benefits from the obsession</title>
  <link>http://0x80.pl/notesen/2015-12-13-obsession.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-12-13-obsession.html</guid>
  <pubDate>Sun, 13 Dec 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="case-1"&gt;
&lt;h1&gt;Case 1&lt;/h1&gt;
&lt;p&gt;The other day I saw this innocent-looking code:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;decode_base64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// some validation
&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// decoding stuff
&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The size of &lt;tt class="docutils literal"&gt;result&lt;/tt&gt; is three fourths of the &lt;tt class="docutils literal"&gt;base64&lt;/tt&gt;'s size, so it is
obviously smaller than the maximum value of &lt;tt class="docutils literal"&gt;size_t&lt;/tt&gt; which is used to
store the string size. However, an overflow can occur during the multiplication by
3, as soon as the input size exceeds one third of the maximum &lt;tt class="docutils literal"&gt;size_t&lt;/tt&gt; value.
On 64-bit machines the size of base64 has to be really huge to trigger
the error &amp;mdash; it is about 6.15e+18. But on 32-bit machines it's &amp;quot;merely&amp;quot; about 1.43 GB
(try to imagine a base64-encoded movie sent via e-mail...) Regardless of the CPU
architecture, the problem still exists. And the solution is not very
complicated.&lt;/p&gt;
&lt;p&gt;Expression &lt;tt class="docutils literal"&gt;3/4 * x&lt;/tt&gt; is equivalent to &lt;tt class="docutils literal"&gt;x - 1/4 * x&lt;/tt&gt;. Dividing &lt;tt class="docutils literal"&gt;x&lt;/tt&gt; by 4
never causes an overflow, but since we operate on integers these two
expressions are not equal: the quotient &lt;tt class="docutils literal"&gt;x/4&lt;/tt&gt; is rounded down, while here it has to be
rounded up. The correction is the conditional expression &lt;tt class="docutils literal"&gt;(x % 4 != 0) ? 1 : 0&lt;/tt&gt;,
which is added to the quotient and therefore subtracted from &lt;tt class="docutils literal"&gt;x&lt;/tt&gt;.
Thus the final expression is:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;For example, for &lt;tt class="docutils literal"&gt;x = 7&lt;/tt&gt; the original expression yields 21/4 = 5, and the new one
yields 7 - 1 - 1 = 5. The expression requires: 2 subtractions, 1 right shift, 1 bit mask
(for the remainder), 1 comparison and 1 conditional expression. And it's perfectly safe.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Implicit conversion --- the enemy</title>
  <link>http://0x80.pl/notesen/2015-11-28-implicit-conversion.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-11-28-implicit-conversion.html</guid>
  <pubDate>Sat, 28 Nov 2015 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;I wrote:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
result += string_utils::pad_left(string, '0');
&lt;/pre&gt;
&lt;p&gt;I forgot that the signature of &lt;tt class="docutils literal"&gt;pad_left&lt;/tt&gt; is &lt;tt class="docutils literal"&gt;string, int, char&lt;/tt&gt; and that the char parameter has a default value. My mistake, without a doubt.&lt;/p&gt;
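&lt;p&gt;To make the effect concrete, here is a hypothetical re-creation of such a function.
Only the parameter list follows the description above; the body and the default fill
character (a space) are assumptions.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;string&amp;gt;

// Hypothetical stand-in: pads s on the left with fill up to width characters.
std::string pad_left(const std::string&amp;amp; s, int width, char fill = ' ') {
    if (width &amp;lt;= 0 || s.size() &amp;gt;= static_cast&amp;lt;std::size_t&amp;gt;(width)) {
        return s; // nothing to pad
    }
    return std::string(width - s.size(), fill) + s;
}
&lt;/pre&gt;
&lt;p&gt;With this definition the call &lt;tt class="docutils literal"&gt;pad_left(s, '0')&lt;/tt&gt; compiles without a warning:
the character &lt;tt class="docutils literal"&gt;'0'&lt;/tt&gt; is silently converted to the integer 48 (in ASCII) and used as the
width, while the fill character stays at its default.&lt;/p&gt;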
&lt;p&gt;This is another example of the dark side of implicit conversions. C++ converts between characters and integers seamlessly. These two beasts are distinct by nature. Of course characters &lt;strong&gt;are represented&lt;/strong&gt; by numbers, however that's an implementation detail.&lt;/p&gt;
&lt;p&gt;One can say: you made a mistake and now blame the language. No, I blame the language's design. I'm afraid that we will end up with something like &lt;tt class="docutils literal"&gt;Integer&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;int&lt;/tt&gt; to overcome such problems.&lt;/p&gt;
&lt;p&gt;Lesson learned: never use default parameters in public API (surprise!)&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Another C++ nasty feature</title>
  <link>http://0x80.pl/notesen/2015-11-22-another-cpp-nasty-feature.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-11-22-another-cpp-nasty-feature.html</guid>
  <pubDate>Sun, 22 Nov 2015 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;I'm fond of C++ weirdness, really. This language is full of traps, and it shocks
me once in a while.&lt;/p&gt;
&lt;p&gt;Let's look at this piece of code, a part of a larger module:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
void validate_date() {

    // ...

    boost::optional&amp;lt;unsigned&amp;gt; clock_hour;
    boost::optional&amp;lt;unsigned&amp;gt; am_pm_clock;

    // ... fill these fields

    if (some sanity check failed) {

        report_error(&amp;quot;user has entered wrong time: %d %s&amp;quot;,
            *clock_hour
            *am_pm_clock ? &amp;quot;AM&amp;quot; : &amp;quot;PM&amp;quot;);
    }
}
&lt;/pre&gt;
&lt;p&gt;We would expect that in case of an error the following line will be reported: &amp;quot;user
has entered wrong time: 123 PM&amp;quot;. Obvious. But please look closer at the code, do
you see any mistake? There is one... dirty... hard to notice. I'll give you a minute.&lt;/p&gt;
&lt;p&gt;So, the mistake is the &lt;strong&gt;lack of a comma&lt;/strong&gt; between expressions &lt;tt class="docutils literal"&gt;*clock_hour&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;*am_pm_clock&lt;/tt&gt;. However, the code is valid! It compiles! And it took me a little
longer than a minute to understand what happened. The explanation is:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;*clock_hour&lt;/tt&gt; evaluates to an expression of type &lt;tt class="docutils literal"&gt;unsigned&lt;/tt&gt;;&lt;/li&gt;
&lt;li&gt;then the compiler sees &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; - a multiplication operator;&lt;/li&gt;
&lt;li&gt;so the compiler checks if multiplication of &lt;tt class="docutils literal"&gt;unsigned&lt;/tt&gt; (on the left side)
with &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;boost::optional&amp;lt;unsigned&amp;gt;&lt;/span&gt;&lt;/tt&gt; (on the right side) is possible;&lt;/li&gt;
&lt;li&gt;it is, because &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;boost::optional&amp;lt;T&amp;gt;&lt;/span&gt;&lt;/tt&gt; has a conversion operator to type &lt;tt class="docutils literal"&gt;T&lt;/tt&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can rewrite the whole expression, now it should be clear:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
((*clock_hour) * unsigned(am_pm_clock)) ? &amp;quot;AM&amp;quot; : &amp;quot;PM&amp;quot;
&lt;/pre&gt;
&lt;p&gt;As a result, besides the format string, the method is called with a single argument of type &lt;tt class="docutils literal"&gt;const char*&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;It's bizarre, it's terrible. A language should help a programmer. In my opinion
implicit conversions are the worst feature of C++.&lt;/p&gt;
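&lt;p&gt;The parsing can be reproduced without Boost. The sketch below uses a made-up wrapper
type that, like the optional described above, offers both a dereference operator and
an implicit conversion to the wrapped type; all names are hypothetical.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
// Hypothetical stand-in for an optional-like wrapper of an unsigned value.
struct Wrapped {
    unsigned value;

    unsigned operator*() const { return value; }  // explicit access, like *opt
    operator unsigned() const { return value; }   // implicit conversion
};

// The comma between the two expressions is missing on purpose, so the body
// is parsed as ((*clock_hour) * unsigned(am_pm_clock)) ? &amp;quot;AM&amp;quot; : &amp;quot;PM&amp;quot;.
const char* describe(const Wrapped&amp;amp; clock_hour, const Wrapped&amp;amp; am_pm_clock) {
    return *clock_hour
           *am_pm_clock ? &amp;quot;AM&amp;quot; : &amp;quot;PM&amp;quot;;
}
&lt;/pre&gt;
&lt;p&gt;The result depends only on whether the product of the two values is zero, which is
hardly what the layout of the code suggests.&lt;/p&gt;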
  </description>
 </item>
 <item>
  <title>Short report from code::dive 2015</title>
  <link>http://0x80.pl/notesen/2015-11-15-code-dive.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-11-15-code-dive.html</guid>
  <pubDate>Sun, 15 Nov 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="writing-fast-code"&gt;
&lt;h1&gt;Writing Fast Code&lt;/h1&gt;
&lt;p&gt;Andrei performed a nice show, however some parts were... hm, confusing. For
example he claimed that code &lt;tt class="docutils literal"&gt;x = x/10&lt;/tt&gt; emits a division instruction. This is not
true, all compilers run in so-called &amp;quot;release mode&amp;quot; will emit a multiplication by
the reciprocal of the constant. Check this for yourself. Another big
misunderstanding of the speaker was the cause of slow write operations on
modern hardware. He claimed that after a write request a CPU loads a line of
cache, then modifies its contents according to the request, and finally the CPU
writes the cache line back to the memory. No. It simply doesn't work like this.
The slowdown is caused mostly by the multicore architecture and the required
synchronization among cache subsystems. But this is not very important, I think
people should remember a simple fact: fewer writes means faster programs.&lt;/p&gt;
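&lt;p&gt;The reciprocal trick can also be written by hand. The sketch below covers the
32-bit unsigned case; the constant is 2&lt;sup&gt;35&lt;/sup&gt;/10 rounded up, and the instructions a given
compiler actually emits may differ in detail.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;

// Computes x / 10 for a 32-bit unsigned value without a division.
// 0xCCCCCCCD equals 2^35 / 10 rounded up; the product needs 64 bits.
uint32_t div10(uint32_t x) {
    return uint32_t((uint64_t(x) * 0xCCCCCCCDu) &amp;gt;&amp;gt; 35);
}
&lt;/pre&gt;
&lt;p&gt;The rounding error introduced by the constant is below 1/40 for every 32-bit input,
which is too small to change the integer part of the quotient.&lt;/p&gt;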
&lt;p&gt;&lt;strong&gt;Update 2015-11-22&lt;/strong&gt;: my friend Piotr reminded me of another funny fact,
a speculation on why division of floats is faster than division of integers.
Andrei claimed that the reason is... the exponential format of floats. It's just
a subtraction of exponents, a division of mantissas and voilà. I'm pretty
sure that it's not the real reason.&lt;/p&gt;
&lt;p&gt;Andrei showed how he had optimized a procedure converting a number to an ASCII
representation. He used a few tricks, and one of them is worth mentioning. He minimized
the number of &amp;quot;real&amp;quot; conversions by introducing a specialized path for smaller
values. Do the same in your program, analyze your data and use an &lt;tt class="docutils literal"&gt;if&lt;/tt&gt; instruction
to select a fast path. It usually works. Andrei gained a speedup of 3-5 times without big effort.&lt;/p&gt;
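&lt;p&gt;A minimal illustration of the fast-path idea is shown below. This is not Andrei's
code; the function names and the single-digit threshold are assumptions chosen to
keep the example short.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;string&amp;gt;

// Generic path: produces digits starting from the least significant one.
static std::string to_ascii_generic(uint32_t v) {
    char buf[10];                 // a 32-bit value has at most 10 digits
    int pos = 10;
    do {
        buf[--pos] = char('0' + v % 10);
        v /= 10;
    } while (v != 0);
    return std::string(buf + pos, buf + 10);
}

std::string to_ascii(uint32_t v) {
    if (v &amp;lt; 10) {                 // fast path: a single digit, no loop
        return std::string(1, char('0' + v));
    }
    return to_ascii_generic(v);
}
&lt;/pre&gt;
&lt;p&gt;Whether such a path pays off depends entirely on the data: it helps only when small
values are common.&lt;/p&gt;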
&lt;p&gt;From the perspective of a programmer who has never worked on code optimization
Andrei's advice was very valuable. For example: never measure the time of a debug build.
Compare your program with good, standard &amp;amp; proven existing solutions. Your optimization
of one module could have a negative impact on the whole application. When you measure
time, run the tests many times and take the minimum measurement. Pretty obvious, but
precious for newbies (the conference was full of students from local universities).&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Boolean function for the rescue</title>
  <link>http://0x80.pl/notesen/2015-10-25-boolean-functions.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-10-25-boolean-functions.html</guid>
  <pubDate>Sun, 25 Oct 2015 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The problem is defined as follows: a set of features is saved using bit-sets (usually large),
and there is a list/map/whatever of sets containing features of different objects. We have
to find which features are unique.&lt;/p&gt;
&lt;p&gt;The naive solution is to use nested loops:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FEATURES_COUNT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Not good: the method &lt;tt class="docutils literal"&gt;is_set&lt;/tt&gt; of the bit-set is called &lt;tt class="docutils literal"&gt;size * list.size()&lt;/tt&gt; times. Even
if a compiler is able to inline the call and use simple bit-test instructions, it's
still too expensive. Bit-set implementations typically use arrays of integers to store the
data; thanks to that, bit operations (and, or, xor, etc.) are executed very fast. We
can exploit this with boolean functions.&lt;/p&gt;
&lt;p&gt;Each feature could be described as:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;non-existing (count is 0),&lt;/li&gt;
&lt;li&gt;unique (count is 1),&lt;/li&gt;
&lt;li&gt;non-unique (count greater than 1).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now these states are encoded using two bits:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="15%" /&gt;
&lt;col width="15%" /&gt;
&lt;col width="70%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;H&lt;/th&gt;
&lt;th class="head"&gt;L&lt;/th&gt;
&lt;th class="head"&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;non-existing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;unique&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;non-unique&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Then we define a transition table. For example, if a feature is present and the current value
is 'unique' then the next value is 'non-unique' (the 5th row).&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="43%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;H&lt;/th&gt;
&lt;th class="head"&gt;L&lt;/th&gt;
&lt;th class="head"&gt;feature&lt;/th&gt;
&lt;th class="head"&gt;H'&lt;/th&gt;
&lt;th class="head"&gt;L'&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Boolean expressions are: &lt;tt class="docutils literal"&gt;L' = L or feature&lt;/tt&gt;; &lt;tt class="docutils literal"&gt;H' = H or (feature and L)&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;We also need to get a single bit-set from H and L at the end:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="21%" /&gt;
&lt;col width="21%" /&gt;
&lt;col width="57%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;H&lt;/th&gt;
&lt;th class="head"&gt;L&lt;/th&gt;
&lt;th class="head"&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The final boolean expression is: &lt;tt class="docutils literal"&gt;result = L and not H&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;Now we can rewrite the code:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;unique_checker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;unique_checker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// we suppose that bit_set overloads bit-operator
&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;finalize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_unique2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;unique_checker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;checker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FEATURES_COUNT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;checker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;checker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finalize&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;I really like this approach.&lt;/p&gt;
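&lt;p&gt;The two boolean expressions can be checked on plain 64-bit words. This is only a sketch:
the function &lt;tt class="docutils literal"&gt;unique_bits&lt;/tt&gt; is a hypothetical stand-in for a real bit-set class and
handles at most 64 features. Note that H has to be updated before L, because it needs the old value of L.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

// Returns a word in which bit i is set when exactly one input word has bit i set.
uint64_t unique_bits(const std::vector&amp;lt;uint64_t&amp;gt;&amp;amp; words) {
    uint64_t L = 0; // feature seen at least once
    uint64_t H = 0; // feature seen at least twice

    for (const uint64_t w : words) {
        H |= w &amp;amp; L; // H' = H or (feature and L), uses the old L
        L |= w;         // L' = L or feature
    }

    return L &amp;amp; ~H;  // result = L and not H
}
&lt;/pre&gt;
&lt;p&gt;For the inputs &lt;tt class="docutils literal"&gt;0x3&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x6&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;0x8&lt;/tt&gt; only bit 1 is shared, so the function returns &lt;tt class="docutils literal"&gt;0xd&lt;/tt&gt;.&lt;/p&gt;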
  </description>
 </item>
 <item>
  <title>Tricky mistake</title>
  <link>http://0x80.pl/notesen/2015-05-25-tricky-mistake.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-05-25-tricky-mistake.html</guid>
  <pubDate>Mon, 25 May 2015 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;A programmer wrote:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
class container;

class IndexOutOfBounds {
public:
    IndexOutOfBounds(const std::string&amp;amp; msg);
};

void container::remove(int index) {

    if (index &amp;lt; 0 || index &amp;gt;= size()) {
        throw new IndexOutOfBounds(&amp;quot;Invalid index: &amp;quot; + index);
    }

    // the rest of method
}
&lt;/pre&gt;
&lt;p&gt;Do you see the mistake? The programmer thought that the expression &lt;tt class="docutils literal"&gt;&amp;quot;Invalid index: &amp;quot; + index&lt;/tt&gt;
evaluates to &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::string(&amp;quot;Invalid&lt;/span&gt; index: 5&amp;quot;)&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;In fact the type of the expression &lt;tt class="docutils literal"&gt;&amp;quot;Invalid index: &amp;quot;&lt;/tt&gt; is &lt;tt class="docutils literal"&gt;const char[16]&lt;/tt&gt; (15 characters plus the terminating zero), so &lt;tt class="docutils literal"&gt;const char[16] + integer&lt;/tt&gt;
is pointer arithmetic and results in, more or less, &lt;tt class="docutils literal"&gt;const char*&lt;/tt&gt;. For &lt;tt class="docutils literal"&gt;index&lt;/tt&gt; in range [0, 15] the exception will
carry the tail of the message; for example, when &lt;tt class="docutils literal"&gt;index=10&lt;/tt&gt; the message assigned to
the exception object will be &lt;tt class="docutils literal"&gt;&amp;quot;dex: &amp;quot;&lt;/tt&gt;. For indexes greater than 15 or less than 0
the behaviour is undefined and the program will &lt;strong&gt;likely crash&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This is why I hate C++: the language has many dark corners, stupid conventions, implicit
conversions, not to mention UB (&amp;quot;just&amp;quot; 150 UBs, if you're curious).&lt;/p&gt;
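&lt;p&gt;One possible way to build the intended message is to convert the number explicitly with
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::to_string&lt;/span&gt;&lt;/tt&gt; (available since C++11). The helper name below is hypothetical, it is not
part of the original code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;string&amp;gt;

// Hypothetical helper: builds the text the programmer wanted to get.
std::string invalid_index_message(int index) {
    // the right operand is a std::string, so operator+ concatenates
    // instead of doing pointer arithmetic on the literal
    return &amp;quot;Invalid index: &amp;quot; + std::to_string(index);
}
&lt;/pre&gt;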
  </description>
 </item>
 <item>
  <title>Speeding up bit-parallel population count</title>
  <link>http://0x80.pl/notesen/2015-04-13-faster-popcount-for-large-data.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-04-13-faster-popcount-for-large-data.html</guid>
  <pubDate>Mon, 13 Apr 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This &lt;a class="reference external" href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel"&gt;well-known method&lt;/a&gt; requires a logarithmic number of steps in terms of
the word width. For example, the algorithm run on a 64-bit word executes 6 steps:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
const uint64_t t1 = fetch data

const uint64_t t2 = (t1 &amp;amp; 0x5555555555555555llu) + ((t1 &amp;gt;&amp;gt;  1) &amp;amp; 0x5555555555555555llu);
const uint64_t t3 = (t2 &amp;amp; 0x3333333333333333llu) + ((t2 &amp;gt;&amp;gt;  2) &amp;amp; 0x3333333333333333llu);
const uint64_t t4 = (t3 &amp;amp; 0x0f0f0f0f0f0f0f0fllu) + ((t3 &amp;gt;&amp;gt;  4) &amp;amp; 0x0f0f0f0f0f0f0f0fllu);
const uint64_t t5 = (t4 &amp;amp; 0x00ff00ff00ff00ffllu) + ((t4 &amp;gt;&amp;gt;  8) &amp;amp; 0x00ff00ff00ff00ffllu);
const uint64_t t6 = (t5 &amp;amp; 0x0000ffff0000ffffllu) + ((t5 &amp;gt;&amp;gt; 16) &amp;amp; 0x0000ffff0000ffffllu);
const uint64_t t7 = (t6 &amp;amp; 0x00000000ffffffffllu) + ((t6 &amp;gt;&amp;gt; 32) &amp;amp; 0x00000000ffffffffllu);
&lt;/pre&gt;
&lt;p&gt;In each step &lt;tt class="docutils literal"&gt;k&lt;/tt&gt;-bit fields are summed together; we start from 1-bit
fields, then 2, 4, 8, 16 and finally 32 bits. A single step requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;2 bit-ands;&lt;/li&gt;
&lt;li&gt;shift right by constant amount;&lt;/li&gt;
&lt;li&gt;addition.&lt;/li&gt;
&lt;/ul&gt;
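&lt;p&gt;The same six steps can be written as a loop, which makes the cost easy to count: 6 steps
times 4 operations gives 24 operations per 64-bit word. This is just a sketch, the name
&lt;tt class="docutils literal"&gt;popcount64&lt;/tt&gt; is chosen for this note:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;

// Bit-parallel population count of a 64-bit word.
uint64_t popcount64(uint64_t x) {
    // masks[s] selects every other field of width 2^s bits
    static const uint64_t masks[6] = {
        0x5555555555555555ull, 0x3333333333333333ull, 0x0f0f0f0f0f0f0f0full,
        0x00ff00ff00ff00ffull, 0x0000ffff0000ffffull, 0x00000000ffffffffull
    };

    for (int step = 0; step &amp;lt; 6; step++) {
        const int shift = 1 &amp;lt;&amp;lt; step; // 1, 2, 4, 8, 16, 32
        // two bit-ands, one shift and one addition
        x = (x &amp;amp; masks[step]) + ((x &amp;gt;&amp;gt; shift) &amp;amp; masks[step]);
    }

    return x;
}
&lt;/pre&gt;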
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD-ized searching in unique constant dictionary</title>
  <link>http://0x80.pl/notesen/2015-04-08-simd-search.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-04-08-simd-search.html</guid>
  <pubDate>Wed, 08 Apr 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The problem: there is an &lt;strong&gt;ordered dictionary&lt;/strong&gt; containing only
&lt;strong&gt;unique&lt;/strong&gt; keys. The dictionary is read only, and keys are 32-bit (SSE) or
64-bit (AVX2).&lt;/p&gt;
&lt;p&gt;The obvious solution is to use &lt;a class="reference external" href="http://en.wikipedia.org/wiki/binary_search"&gt;binary search&lt;/a&gt;. Keys can be
stored in contiguous memory; thanks to that there is no internal
fragmentation, and the data has good cache locality. And of course indexing the
keys is done in constant time (in terms of computational complexity) or
with a single memory fetch (in terms of hardware).&lt;/p&gt;
&lt;p&gt;The time complexity of binary search is &lt;span class="math"&gt;O(log&lt;sub&gt;2&lt;/sub&gt;(&lt;i&gt;n&lt;/i&gt;))&lt;/span&gt;, i.e. for one
million elements a single lookup takes up to 20 operations. A single
operation is fetching a value and comparing it with the given key.&lt;/p&gt;
&lt;p&gt;Another algorithm is &lt;a class="reference external" href="http://en.wikipedia.org/wiki/linear_search"&gt;linear search&lt;/a&gt;, which seems to be suitable
for &lt;strong&gt;small dictionaries&lt;/strong&gt;. Linear search can be easily SIMD-ized.&lt;/p&gt;
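&lt;p&gt;A minimal sketch of SIMD-ized linear search for 32-bit keys, assuming an x86 CPU with SSE2.
The function name, the unaligned loads and the scalar tail loop are choices made for this sketch.
Because the keys are unique, at most one lane can match in a single comparison.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;emmintrin.h&amp;gt; // SSE2

// Returns the index of key, or -1 if the key is not present.
std::ptrdiff_t simd_linear_search(const uint32_t* keys, std::size_t n, uint32_t key) {
    const __m128i needle = _mm_set1_epi32(static_cast&amp;lt;int&amp;gt;(key));

    std::size_t i = 0;
    for (; i + 4 &amp;lt;= n; i += 4) {
        // compare four keys at once
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast&amp;lt;const __m128i*&amp;gt;(keys + i));
        const __m128i eq    = _mm_cmpeq_epi32(chunk, needle);
        // one bit per 32-bit lane
        const int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
        if (mask != 0) {
            for (int k = 0; k &amp;lt; 4; k++) {
                if (mask &amp;amp; (1 &amp;lt;&amp;lt; k)) {
                    return static_cast&amp;lt;std::ptrdiff_t&amp;gt;(i + k);
                }
            }
        }
    }

    // scalar tail: fewer than four keys left
    for (; i &amp;lt; n; i++) {
        if (keys[i] == key) {
            return static_cast&amp;lt;std::ptrdiff_t&amp;gt;(i);
        }
    }

    return -1;
}
&lt;/pre&gt;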
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD: detecting a bit pattern</title>
  <link>http://0x80.pl/notesen/2015-03-22-simd-pattern.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-03-22-simd-pattern.html</guid>
  <pubDate>Sun, 22 Mar 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The problem: there are 64-bit values with some data bits and some
metadata bits; the metadata includes a k-bit field describing a &amp;quot;type&amp;quot;
(&lt;tt class="docutils literal"&gt;k &amp;gt;= 0&lt;/tt&gt;). The type field is located in the lower 32 bits.&lt;/p&gt;
&lt;p&gt;The procedure processes two &amp;quot;types&amp;quot;, one denoted with the code 3 and another
with 5. When all items are of type 3 we can use a fast AVX2 path.
If there are any items of type 5, we have to call an additional function (a
virtual method to be precise). Pseudocode:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
for (size_t i = 0; i &amp;lt; size; i += 8) {

    // two AVX registers, or 8 elements, are processed in a single loop

    __m256i A_lo = {A64[i + 0], A64[i + 1], A64[i + 2], A64[i + 3]};
    __m256i A_hi = {A64[i + 4], A64[i + 5], A64[i + 6], A64[i + 7]};

    __m256i B_lo = {B64[i + 0], B64[i + 1], B64[i + 2], B64[i + 3]};
    __m256i B_hi = {B64[i + 4], B64[i + 5], B64[i + 6], B64[i + 7]};

    if (any element of A or B vector has type 5) { // ***

        // slow path
        for (int k = 0; k &amp;lt; 4; k++) {
            result[i + k]     = function(A_lo[k], B_lo[k]);
            result[i + k + 4] = function(A_hi[k], B_hi[k]);
        }
    } else {

        // fast path
        ...
    }

    // further processing
}
&lt;/pre&gt;
&lt;p&gt;We have to fill in the condition of the &lt;strong&gt;if-statement&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Compiler warnings are your future errors</title>
  <link>http://0x80.pl/notesen/2015-03-22-compiler-warnings.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-03-22-compiler-warnings.html</guid>
  <pubDate>Sun, 22 Mar 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="lesson-learned"&gt;
&lt;h1&gt;Lesson learned&lt;/h1&gt;
&lt;p&gt;Because we always have very large build logs, I didn't notice the new warning.&lt;/p&gt;
&lt;p&gt;In order to prevent such errors in the future I've written a script that
extracts all warnings from the logs and prints them in an easy-to-read
form. I also fight with warnings in my so-called spare time.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>AVX512: ternary functions evaluation</title>
  <link>http://0x80.pl/notesen/2015-03-22-avx512-ternary-functions.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-03-22-avx512-ternary-functions.html</guid>
  <pubDate>Sun, 22 Mar 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Intel's version of SIMD offers the following 2-argument (binary) boolean
functions: &lt;strong&gt;and&lt;/strong&gt;, &lt;strong&gt;or&lt;/strong&gt;, &lt;strong&gt;xor&lt;/strong&gt;, &lt;strong&gt;and not&lt;/strong&gt;. There is no single-argument
&lt;strong&gt;not&lt;/strong&gt;; this function can be expressed with &lt;tt class="docutils literal"&gt;xor reg, ones&lt;/tt&gt;,
however it requires an additional, pre-set register.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AVX512F&lt;/strong&gt; will come with a very interesting instruction called
&lt;tt class="docutils literal"&gt;vpternlog&lt;/tt&gt;. There are two variants of the instruction, operating on
packed 32-bit (&lt;tt class="docutils literal"&gt;vpternlogd&lt;/tt&gt;) or 64-bit elements (&lt;tt class="docutils literal"&gt;vpternlogq&lt;/tt&gt;);
however, they do exactly the same thing, that is, they evaluate a 3-argument
(&lt;em&gt;ternary&lt;/em&gt;) boolean function on each bit of the arguments. The function
is given as a truth table.&lt;/p&gt;
&lt;p&gt;The pattern of a truth table:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="47%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head" colspan="3"&gt;inputs&lt;/th&gt;
&lt;th class="head" rowspan="2"&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;th class="head"&gt;A&lt;/th&gt;
&lt;th class="head"&gt;B&lt;/th&gt;
&lt;th class="head"&gt;C&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;c&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;d&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;f&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;g&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;h&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A programmer supplies only the result column, i.e. defines the values of bits
&lt;strong&gt;a&lt;/strong&gt; through &lt;strong&gt;h&lt;/strong&gt;; together they form a single 8-bit value.&lt;/p&gt;
&lt;p&gt;Depending on the function's complexity, a single &lt;tt class="docutils literal"&gt;vpternlog&lt;/tt&gt; instruction can
replace from one up to &lt;strong&gt;eight&lt;/strong&gt; SIMD instructions.&lt;/p&gt;
&lt;p&gt;According to &lt;a class="reference external" href="https://www.agner.org/optimize/#manuals"&gt;Agner Fog's documentation&lt;/a&gt;, on SkylakeX &lt;tt class="docutils literal"&gt;vpternlog&lt;/tt&gt; has
1 cycle latency and 0.5 cycle reciprocal throughput (there are two execution
units able to handle the instruction). It's pretty fast.&lt;/p&gt;
&lt;p&gt;The ternary logic function is available as the intrinsic function
&lt;tt class="docutils literal"&gt;_mm512_ternarylogic_epi32(a, b, c, imm8)&lt;/tt&gt;, where the argument &lt;tt class="docutils literal"&gt;a&lt;/tt&gt;
supplies the most significant bit of the truth-table index, and &lt;tt class="docutils literal"&gt;c&lt;/tt&gt; the least significant bit.&lt;/p&gt;
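&lt;p&gt;To make the encoding concrete, here is a scalar model of the instruction for a single 32-bit
word. It is only a sketch: the names &lt;tt class="docutils literal"&gt;ternary_logic&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;kA&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;kB&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;kC&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;select_imm8&lt;/tt&gt; are hypothetical and no AVX512 hardware is needed to run it. The three constants
are simply the input columns of the truth table read as 8-bit numbers, so evaluating an
expression on them yields the 8-bit table for that expression.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;cstdint&amp;gt;

// Scalar model: for each bit position build the 3-bit index from a, b, c
// and pick the corresponding bit of imm8.
uint32_t ternary_logic(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t result = 0;
    for (int bit = 0; bit &amp;lt; 32; bit++) {
        const unsigned index = (((a &amp;gt;&amp;gt; bit) &amp;amp; 1u) &amp;lt;&amp;lt; 2)
                             | (((b &amp;gt;&amp;gt; bit) &amp;amp; 1u) &amp;lt;&amp;lt; 1)
                             |  ((c &amp;gt;&amp;gt; bit) &amp;amp; 1u);
        result |= static_cast&amp;lt;uint32_t&amp;gt;((imm8 &amp;gt;&amp;gt; index) &amp;amp; 1u) &amp;lt;&amp;lt; bit;
    }
    return result;
}

// Columns A, B and C of the truth table as 8-bit values (bit 0 is row 000).
constexpr uint8_t kA = 0xf0;
constexpr uint8_t kB = 0xcc;
constexpr uint8_t kC = 0xaa;

// Truth table of the bitwise select: a ? b : c
constexpr uint8_t select_imm8 = (kA &amp;amp; kB) | (~kA &amp;amp; kC);
&lt;/pre&gt;
&lt;p&gt;With these definitions &lt;tt class="docutils literal"&gt;select_imm8&lt;/tt&gt; equals &lt;tt class="docutils literal"&gt;0xca&lt;/tt&gt;, and a three-way xor
(&lt;tt class="docutils literal"&gt;kA ^ kB ^ kC&lt;/tt&gt;) gives &lt;tt class="docutils literal"&gt;0x96&lt;/tt&gt;.&lt;/p&gt;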
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE/AVX2: Generating mask where n leading (trailing) bytes are set</title>
  <link>http://0x80.pl/notesen/2015-03-21-sse-generating-mask.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-03-21-sse-generating-mask.html</guid>
  <pubDate>Sat, 21 Mar 2015 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Informal specification:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask_lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="cm"&gt;/*
        __m128i result = 0;
        for (unsigned int i=0; i &amp;lt; n; i++) {
            result.byte[i] = 0xff;
        }
    */&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="c1"&gt;// ...
&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kr"&gt;__m128i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask_higher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;mask_lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Not everything in AVX2 is 256-bit</title>
  <link>http://0x80.pl/notesen/2015-03-21-avx2-is-not-256-bit.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2015-03-21-avx2-is-not-256-bit.html</guid>
  <pubDate>Sat, 21 Mar 2015 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;&lt;strong&gt;AVX2&lt;/strong&gt; added support for 256-bit arguments to many operations on
packed integers, although not all. Some instructions accept 256-bit
registers, but operate on &lt;strong&gt;128-bit lanes&lt;/strong&gt; rather than the whole register.&lt;/p&gt;
&lt;p&gt;There are three major groups of instructions:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;packing (narrowing conversion),&lt;/li&gt;
&lt;li&gt;unpacking (interleave),&lt;/li&gt;
&lt;li&gt;and permutations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Below is a full list of instructions (with intrinsics):&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpalignr&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_alignr_epi8&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpslldq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_bslli_epi128&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpsrldq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_bsrli_epi128&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vmpsadbw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_mpsadbw_epu8&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpacksswb&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_packs_epi16&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpackssdw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_packs_epi32&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpackuswb&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_packus_epi16&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpackusdw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_packus_epi32&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vperm2i128&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_permute2x128_si256&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpermq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_permute4x64_epi64&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpermpd&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_permute4x64_pd&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpshufd&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_shuffle_epi32&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpshufb&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_shuffle_epi8&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpshufhw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_shufflehi_epi16&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpshuflw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_shufflelo_epi16&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpslldq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_slli_si256&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpsrldq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_srli_si256&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpckhwd&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpackhi_epi16&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpckhdq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpackhi_epi32&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpckhqdq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpackhi_epi64&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpckhbw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpackhi_epi8&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpcklwd&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpacklo_epi16&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpckldq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpacklo_epi32&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpcklqdq&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpacklo_epi64&lt;/tt&gt;)&lt;/li&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;vpunpcklbw&lt;/tt&gt; (&lt;tt class="docutils literal"&gt;_mm256_unpacklo_epi8&lt;/tt&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For me the most surprising are the packing instructions (&lt;tt class="docutils literal"&gt;vpack*&lt;/tt&gt;), as they
require additional shuffling (before or after the instruction) if we
want to keep the order of values. In some cases the order is crucial.&lt;/p&gt;
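&lt;p&gt;The ordering problem can be illustrated with a scalar model. The sketch below is not intrinsic code: it only mimics, under the assumption that I read the lane semantics correctly, what the 256-bit &lt;tt class="docutils literal"&gt;vpackusdw&lt;/tt&gt; does to element order, and what a following &lt;tt class="docutils literal"&gt;vpermq&lt;/tt&gt; with immediate 0xD8 would do to repair it. The function names &lt;tt class="docutils literal"&gt;pack_epi32_in_lanes&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;restore_order&lt;/tt&gt; are hypothetical.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturate a signed 32-bit value to the unsigned 16-bit range. */
static uint16_t saturate_u16(int32_t v) {
    if (v < 0) return 0;
    if (v > 65535) return 65535;
    return (uint16_t)v;
}

/* Scalar model of the 256-bit vpackusdw: each 128-bit lane is packed
   on its own, so the output order is a0..a3, b0..b3, a4..a7, b4..b7. */
void pack_epi32_in_lanes(const int32_t a[8], const int32_t b[8],
                         uint16_t out[16]) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t lane = 0; lane < 2; lane++) {
        for (size_t i = 0; i < 4; i++) {
            out[8 * lane + i]     = saturate_u16(a[4 * lane + i]);
            out[8 * lane + 4 + i] = saturate_u16(b[4 * lane + i]);
        }
    }
}

/* Model of the fix-up step: reorder the four 64-bit quarters as
   0, 2, 1, 3 (the selection encoded by the immediate 0xD8). */
void restore_order(const uint16_t in[16], uint16_t out[16]) {
    static const size_t quarter[4] = {0, 2, 1, 3};
    assert(in != NULL && out != NULL && in != out);
    for (size_t q = 0; q < 4; q++) {
        for (size_t i = 0; i < 4; i++) {
            out[4 * q + i] = in[4 * quarter[q] + i];
        }
    }
}
```

&lt;p&gt;After &lt;tt class="docutils literal"&gt;restore_order&lt;/tt&gt; the output holds all values of &lt;tt class="docutils literal"&gt;a&lt;/tt&gt; followed by all values of &lt;tt class="docutils literal"&gt;b&lt;/tt&gt;, which is the order a true 256-bit pack would give.&lt;/p&gt;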
  </description>
 </item>
 <item>
  <title>Using SSE to convert from hexadecimal ASCII to number</title>
  <link>http://0x80.pl/notesen/2014-10-22-sse-convert-hex-to-ascii.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-22-sse-convert-hex-to-ascii.html</guid>
  <pubDate>Wed, 22 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The SSE procedure can convert 16- and 32-digit inputs, producing
8- and 16-byte results.&lt;/p&gt;
&lt;p&gt;To get the result bytes in the correct order, the input characters have to be reversed.
With SSSE3 this can be done with &lt;tt class="docutils literal"&gt;pshufb&lt;/tt&gt;, but in earlier versions
of SSE it is quite hard. When byte shuffling is not available,
the reversal can be done on the result word using the &lt;tt class="docutils literal"&gt;bswap&lt;/tt&gt;
instruction.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Parsing decimal numbers --- part 2: SSE</title>
  <link>http://0x80.pl/notesen/2014-10-15-parsing-decimal-numbers-part-2-sse.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-15-parsing-decimal-numbers-part-2-sse.html</guid>
  <pubDate>Wed, 15 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="pmaddwd-details"&gt;
&lt;h1&gt;PMADDWD details&lt;/h1&gt;
&lt;p&gt;This instruction is specialized and I guess it isn't used often.
&lt;tt class="docutils literal"&gt;PMADDWD&lt;/tt&gt; operates on &lt;strong&gt;packed words&lt;/strong&gt; and performs the following
algorithm:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
temp = array of 8 signed dwords

-- 1. multiply 16-bit signed numbers vertically
src = packed_word(...)
dst = packed_word(...)

for i in 0 .. 7 loop

    -- temp[i] is a 32-bit signed number

    temp[i] := signed_multiplication(src[i], dst[i])

end loop

-- 2. add adjacent 32-bit words of temp and save result to dst
--    now dst has type packed_dword
for i in 0 .. 3 loop

    dst[i] := temp[2 * i] + temp[2 * i + 1]

end loop
&lt;/pre&gt;
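&lt;p&gt;The pseudocode above can be sketched in plain C as a scalar model. This is only an illustration of the arithmetic, not how the instruction would be used in practice; the name &lt;tt class="docutils literal"&gt;pmaddwd_model&lt;/tt&gt; and the separate output array are assumptions made for the example.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PMADDWD on 128-bit operands: eight signed 16-bit
   products are formed, then adjacent pairs are summed into four
   32-bit results. */
void pmaddwd_model(const int16_t src[8], const int16_t dst[8],
                   int32_t out[4]) {
    assert(src != NULL && dst != NULL && out != NULL);

    int32_t temp[8];

    /* 1. multiply 16-bit signed numbers vertically; a 16x16-bit
          product always fits in 32 bits */
    for (size_t i = 0; i < 8; i++) {
        temp[i] = (int32_t)src[i] * (int32_t)dst[i];
    }

    /* 2. add adjacent products; the addition is done on unsigned
          values so that the single corner case (all four inputs
          equal to -32768) wraps around instead of being undefined
          behaviour in C */
    for (size_t i = 0; i < 4; i++) {
        out[i] = (int32_t)((uint32_t)temp[2 * i] + (uint32_t)temp[2 * i + 1]);
    }
}
```

&lt;p&gt;For decimal parsing the interesting case is one operand holding digits and the other holding weights such as 10 and 1, so that each 32-bit result is a two-digit number.&lt;/p&gt;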
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Parsing decimal numbers --- part 1: SWAR</title>
  <link>http://0x80.pl/notesen/2014-10-12-parsing-decimal-numbers-part-1-swar.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-12-parsing-decimal-numbers-part-1-swar.html</guid>
  <pubDate>Sun, 12 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Naive parsing of unsigned decimal numbers can be coded as:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
uint32_t parse(const char* s) {

    uint32_t result = 0;

    while (*s) {
        const uint8_t digit = *s++;

        result = result * 10 + (digit - '0');
    }

    return result;
}
&lt;/pre&gt;
&lt;p&gt;The procedure &lt;tt class="docutils literal"&gt;parse&lt;/tt&gt; neither checks the validity of the string nor
checks for overflow; these problems won't be discussed in
this text.&lt;/p&gt;
&lt;p&gt;Processing a single character requires just 3 operations:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;subtraction,&lt;/li&gt;
&lt;li&gt;addition,&lt;/li&gt;
&lt;li&gt;multiplication by 10.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On x86 multiplication by 10 is cheap; multiplication
&lt;tt class="docutils literal"&gt;x * 10&lt;/tt&gt; is equivalent to &lt;tt class="docutils literal"&gt;(x &amp;lt;&amp;lt; 3) + (x &amp;lt;&amp;lt; 1)&lt;/tt&gt;. The shift
by 3 (multiplication by 8) can be coded with &lt;tt class="docutils literal"&gt;LEA&lt;/tt&gt; (it's executed by the addressing
unit, not the ALU), and the shift by 1 is a simple addition
or another &lt;tt class="docutils literal"&gt;LEA&lt;/tt&gt;.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Using PEXT to convert from hexadecimal ASCII to number</title>
  <link>http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html</guid>
  <pubDate>Thu, 09 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Naive conversion from an ASCII digit to a value could be coded with
a switch statement or a lookup table; too simple, right? This is
another note where I try to exploit &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; (&lt;em&gt;parallel extract&lt;/em&gt;),
a new instruction introduced in the &lt;strong&gt;BMI2&lt;/strong&gt; extension (Bit Manipulation
Instructions). Along the way I present a nice branchless algorithm
to convert an ASCII letter to a number.&lt;/p&gt;
&lt;p&gt;Let's see which codes are assigned to hexadecimal digits:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="18%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;digit&lt;/th&gt;
&lt;th class="head"&gt;code&lt;/th&gt;
&lt;th class="head"&gt;value&lt;/th&gt;
&lt;th class="head"&gt;letter&lt;/th&gt;
&lt;th class="head"&gt;code&lt;/th&gt;
&lt;th class="head"&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;'0'&lt;/td&gt;
&lt;td&gt;0x30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;'a'&lt;/td&gt;
&lt;td&gt;0x61&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'1'&lt;/td&gt;
&lt;td&gt;0x31&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;'b'&lt;/td&gt;
&lt;td&gt;0x62&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'2'&lt;/td&gt;
&lt;td&gt;0x32&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;'c'&lt;/td&gt;
&lt;td&gt;0x63&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'3'&lt;/td&gt;
&lt;td&gt;0x33&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;'d'&lt;/td&gt;
&lt;td&gt;0x64&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'4'&lt;/td&gt;
&lt;td&gt;0x34&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;'e'&lt;/td&gt;
&lt;td&gt;0x65&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'5'&lt;/td&gt;
&lt;td&gt;0x35&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;'f'&lt;/td&gt;
&lt;td&gt;0x66&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'6'&lt;/td&gt;
&lt;td&gt;0x36&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;'A'&lt;/td&gt;
&lt;td&gt;0x41&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'7'&lt;/td&gt;
&lt;td&gt;0x37&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;'B'&lt;/td&gt;
&lt;td&gt;0x42&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'8'&lt;/td&gt;
&lt;td&gt;0x38&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;'C'&lt;/td&gt;
&lt;td&gt;0x43&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'9'&lt;/td&gt;
&lt;td&gt;0x39&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;'D'&lt;/td&gt;
&lt;td&gt;0x44&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan="3" rowspan="2"&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;'E'&lt;/td&gt;
&lt;td&gt;0x45&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;'F'&lt;/td&gt;
&lt;td&gt;0x46&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;for digits the value is equal to the lower nibble;&lt;/li&gt;
&lt;li&gt;for letters the value is equal to the lower nibble plus 9;&lt;/li&gt;
&lt;li&gt;both lowercase and uppercase letters have the 7th bit set (mask 0x40), while digits do not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Branchless code converting single letter to numeric value:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
uint8_t hex_letter2value(uint8_t chr) {

    const uint8_t letter = chr &amp;amp; 0x40;
    const uint8_t shift  = letter &amp;gt;&amp;gt; 3 | letter &amp;gt;&amp;gt; 6; // 9 if chr is letter, 0 otherwise

    // this sum is safe -- if shift = 9, then max value in lower
    // nibble is 6, and there won't be an overflow
    const uint8_t adjusted = chr + shift;

    return adjusted &amp;amp; 0xf;
}
&lt;/pre&gt;
&lt;p&gt;Following operations are performed:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;2 bit-and,&lt;/li&gt;
&lt;li&gt;2 shifts,&lt;/li&gt;
&lt;li&gt;1 bit-or,&lt;/li&gt;
&lt;li&gt;1 addition.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a single letter it's too expensive; fortunately this algorithm can be easily
translated to SIMD and SWAR code (I hope a SIMD version will appear soon).&lt;/p&gt;
&lt;p&gt;32-bit SWAR version:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#define packed32(b) (uint8_t(b) * 0x01010101)

uint16_t four_letters_to_value(char* str) {

    const uint32_t input = bswap(*(uint32_t*)str); // bswap is required

    const uint32_t letter = input &amp;amp; packed32(0x40);
    const uint32_t shift  = letter &amp;gt;&amp;gt; 3 | letter &amp;gt;&amp;gt; 6;

    const uint32_t adjusted = input + shift;

    // for example, for the input "bad7":
    //     adjusted &amp;amp; 0x0f0f0f0f = 0x0b0a0d07
    //     pext result           = 0x0000bad7

    return pext(adjusted, 0x0f0f0f0f);
}
&lt;/pre&gt;
&lt;p&gt;Much better; now the conversion of 4 letters (or 8, when operating
on 64-bit words) requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;1 byte swap,&lt;/li&gt;
&lt;li&gt;2 bit-and,&lt;/li&gt;
&lt;li&gt;2 shifts,&lt;/li&gt;
&lt;li&gt;1 bit-or,&lt;/li&gt;
&lt;li&gt;1 addition,&lt;/li&gt;
&lt;li&gt;1 pext.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sample implementation is available at &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/conv_from_hex"&gt;github&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Using PEXT to convert from binary ASCII to number</title>
  <link>http://0x80.pl/notesen/2014-10-06-pext-convert-ascii-bin-to-num.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-06-pext-convert-ascii-bin-to-num.html</guid>
  <pubDate>Mon, 06 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Suppose we have a string containing ASCII zeros and ones, for example
&amp;quot;11100100&amp;quot;, and we want to interpret this text as a binary number and
get its value (&lt;tt class="docutils literal"&gt;0xe4&lt;/tt&gt;).&lt;/p&gt;
&lt;p&gt;The new instruction &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; from &lt;strong&gt;BMI2&lt;/strong&gt; (&lt;em&gt;Bit Manipulation Instructions&lt;/em&gt;)
is perfect for this task. &lt;tt class="docutils literal"&gt;PEXT&lt;/tt&gt; (parallel extract) forms a word
from the source bits selected by a mask, for example (32-bit arguments):&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
         MSB                               LSB
         ┌────────┬────────┬────────┬────────┐
src    = │&lt;span style="color: red"&gt;0&lt;/span&gt;01010&lt;span style="color: red"&gt;10&lt;/span&gt;│&lt;span style="color: red"&gt;1&lt;/span&gt;110110&lt;span style="color: red"&gt;1&lt;/span&gt;│&lt;span style="color: red"&gt;0001&lt;/span&gt;1011│11110&lt;span style="color: red"&gt;000&lt;/span&gt;│
         └────────┴────────┴────────┴────────┘
         ┌────────┬────────┬────────┬────────┐
mask   = │&lt;span style="font-weight: bold"&gt;1&lt;/span&gt;00000&lt;span style="font-weight: bold"&gt;11&lt;/span&gt;│&lt;span style="font-weight: bold"&gt;1&lt;/span&gt;000000&lt;span style="font-weight: bold"&gt;1&lt;/span&gt;│&lt;span style="font-weight: bold"&gt;1111&lt;/span&gt;0000│00000&lt;span style="font-weight: bold"&gt;111&lt;/span&gt;│
         └────────┴────────┴────────┴────────┘

         ┌────────┬────────┬────────┬────────┐
result = │00000000│00000000│0000&lt;span style="color: red; font-weight: bold"&gt;0101&lt;/span&gt;│&lt;span style="color: red; font-weight: bold"&gt;10001000&lt;/span&gt;│
         └────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is exactly what the conversion needs &amp;mdash; since the code of ASCII '0' is 0x30
and '1' is 0x31, we need to extract the lowest bit of each byte (assuming, of course,
that the input is valid).&lt;/p&gt;
&lt;p&gt;The example string &amp;quot;11100100&amp;quot; is encoded as 0x3131313030313030 (first character in the most significant byte):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
src  = 0x3131313030313030 // 64-bit word
mask = 0x0101010101010101 // 64-bit word

result = pext(src, mask)
&lt;/pre&gt;
&lt;p&gt;The value of result is &lt;tt class="docutils literal"&gt;0xe4&lt;/tt&gt; = &lt;tt class="docutils literal"&gt;0b11100100&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;Working example is available at &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/pext_soft_emu"&gt;github&lt;/a&gt; (see &lt;tt class="docutils literal"&gt;parse_string.c&lt;/tt&gt;).&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Conversion numbers to octal representation</title>
  <link>http://0x80.pl/notesen/2014-10-02-convert-to-oct.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-02-convert-to-oct.html</guid>
  <pubDate>Thu, 02 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Conversion to octal isn't very computer-friendly: each digit occupies
3 bits, and 3 isn't a power of two. The smallest number of bits worth
considering is 12 (3 nibbles, 4 octal digits):&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌──────────────┐
│&lt;span style="color: red"&gt;ddd&lt;/span&gt;&lt;span style="color: blue"&gt;c&lt;/span&gt; &lt;span style="color: blue"&gt;cc&lt;/span&gt;&lt;span style="color: green"&gt;bb&lt;/span&gt; &lt;span style="color: green"&gt;b&lt;/span&gt;&lt;span style="color: magenta"&gt;aaa&lt;/span&gt;│
└──────────────┘
8┈┈11 4┈┈7 0┈┈3&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then output is a 32-bit word:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌────────┬────────┬────────┬────────┐
│&lt;span style="color: gray"&gt;00000&lt;/span&gt;&lt;span style="color: red"&gt;ddd&lt;/span&gt;│&lt;span style="color: gray"&gt;00000&lt;/span&gt;&lt;span style="color: blue"&gt;ccc&lt;/span&gt;│&lt;span style="color: gray"&gt;00000&lt;/span&gt;&lt;span style="color: green"&gt;bbb&lt;/span&gt;│&lt;span style="color: gray"&gt;00000&lt;/span&gt;&lt;span style="color: magenta"&gt;aaa&lt;/span&gt;│
└────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Conversion to ASCII requires a single addition of 0x30303030 (0x30 = ord('0')).&lt;/p&gt;
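&lt;p&gt;The two diagrams can be turned into a short C sketch that spreads the four 3-bit fields with masks and shifts. This is only one straightforward way to do it, shown under the assumption that the input fits in 12 bits; the name &lt;tt class="docutils literal"&gt;to_octal_ascii12&lt;/tt&gt; is hypothetical.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Converts a 12-bit number into four ASCII octal digits packed in a
   32-bit word; the most significant digit ends up in the highest byte. */
uint32_t to_octal_ascii12(uint32_t x) {
    assert(x < (1u << 12));

    const uint32_t spread =
          (x & 0x007u)            /* aaa: bits 0..2  stay in byte 0  */
        | ((x & 0x038u) << 5)     /* bbb: bits 3..5  move to byte 1  */
        | ((x & 0x1c0u) << 10)    /* ccc: bits 6..8  move to byte 2  */
        | ((x & 0xe00u) << 15);   /* ddd: bits 9..11 move to byte 3  */

    return spread + 0x30303030u;  /* add ord('0') to every byte */
}
```

&lt;p&gt;For the octal number 7654 the function returns 0x37363534, that is the characters '7', '6', '5', '4' from the highest byte down.&lt;/p&gt;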
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Determining if an integer is a power of 2 --- part 2</title>
  <link>http://0x80.pl/notesen/2014-10-01-is-pow-2.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-01-is-pow-2.html</guid>
  <pubDate>Wed, 01 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The idea is simple:&lt;/p&gt;
&lt;ol class="arabic simple" start="0"&gt;
&lt;li&gt;&lt;strong&gt;precondition&lt;/strong&gt;: x is not zero;&lt;/li&gt;
&lt;li&gt;isolate lowest bit set: &lt;tt class="docutils literal"&gt;x &amp;amp; &lt;span class="pre"&gt;-x&lt;/span&gt;&lt;/tt&gt;;&lt;/li&gt;
&lt;li&gt;check if this number is equal to x: &lt;tt class="docutils literal"&gt;x == (x &amp;amp; &lt;span class="pre"&gt;-x)&lt;/span&gt;&lt;/tt&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Sample code:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;isolate_lowest_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is_power_of_two_non_zero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;isolate_lowest_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is_power_of_two&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is_power_of_two_non_zero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;GCC 4.8.2 generates the following code for &lt;tt class="docutils literal"&gt;is_power_of_two&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
8b 54 24 04     mov    0x4(%esp),%edx
31 c0           xor    %eax,%eax
85 d2           test   %edx,%edx
74 0e           je     8048488 &amp;lt;is_power_of_two+0x18&amp;gt;
89 d0           mov    %edx,%eax
f7 d8           neg    %eax
21 d0           and    %edx,%eax
39 c2           cmp    %eax,%edx
0f 94 c0        sete   %al
0f b6 c0        movzbl %al,%eax
f3 c3           repz ret
&lt;/pre&gt;
&lt;p&gt;Unfortunately, the compiler inserted a jump. But when we are sure that
the argument is non-zero, only 5 basic instructions are required
to perform this check.&lt;/p&gt;
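&lt;p&gt;A minimal sketch of one way to avoid the jump, assuming GCC or Clang: the short-circuiting logical AND is replaced by a bitwise AND of the two comparison results, so both sides are always evaluated. The name &lt;tt class="docutils literal"&gt;is_power_of_two_branchless&lt;/tt&gt; is hypothetical, and whether a given compiler really emits branch-free code still has to be checked in its output.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Keeps only the lowest set bit of x (0 stays 0).
uint32_t isolate_lowest_set(uint32_t x) {
    return x & (0u - x);  // 0u - x is the unsigned two's complement negation
}

// Valid only for x != 0; the precondition is checked in debug builds.
int is_power_of_two_non_zero(uint32_t x) {
    assert(x != 0);
    return isolate_lowest_set(x) == x;
}

// Hypothetical variant: bitwise & instead of &&, so there is no
// short-circuit evaluation that would have to be expressed as a jump.
int is_power_of_two_branchless(uint32_t x) {
    return (x != 0) & (isolate_lowest_set(x) == x);
}
```

&lt;p&gt;For zero the first comparison yields 0, so the result is 0 regardless of the second comparison.&lt;/p&gt;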
  </description>
 </item>
 <item>
  <title>Conditionally fill word (for limited set of input values)</title>
  <link>http://0x80.pl/notesen/2014-10-01-conditionally-fill-word.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-10-01-conditionally-fill-word.html</guid>
  <pubDate>Wed, 01 Oct 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;&amp;quot;Limited&amp;quot; means a value where at most one bit is set, i.e. the values
are zero and all powers of two.  Specification:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;if x is zero then the result is also zero,&lt;/li&gt;
&lt;li&gt;if x is a power of two then the result is a word full of ones.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Naive implementation:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fill_word_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xffffffff&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;GCC 4.8.2 produces the following code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
89 c2        mov    %eax,%edx
f7 da        neg    %edx
21 c2        and    %eax,%edx
39 c2        cmp    %eax,%edx
0f 94 c0     sete   %al
0f b6 c0     movzbl %al,%eax
&lt;/pre&gt;
&lt;p&gt;Not bad, but this can be done in a much simpler way. We know that the value is zero or
has exactly one bit set. First we have to copy this bit to the highest
position, and then fill the whole word with the MSB using an arithmetic shift right.&lt;/p&gt;
&lt;p&gt;Copying the bit can be done with a single operation, &lt;strong&gt;arithmetic negation&lt;/strong&gt;:&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="50%" /&gt;
&lt;col width="50%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;x&lt;/th&gt;
&lt;th class="head"&gt;-x&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;00000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000001&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffffff&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000002&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffffffe&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000004&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffffffc&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000008&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffffff8&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000010&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffffff0&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000020&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffffe0&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000040&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffffc0&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000080&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffff80&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000100&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffff00&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000200&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffffe00&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000400&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffffc00&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00000800&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffff800&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00001000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffff000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00002000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffe000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00004000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffffc000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00008000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffff8000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00010000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffff0000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00020000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffe0000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00040000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fffc0000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00080000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fff80000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00100000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fff00000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00200000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffe00000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00400000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ffc00000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;00800000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ff800000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;01000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;ff000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;02000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fe000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;04000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;fc000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;08000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;f8000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;10000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;f0000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;20000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;e0000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;40000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;c0000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;tt class="docutils literal"&gt;80000000&lt;/tt&gt;&lt;/td&gt;
&lt;td&gt;&lt;tt class="docutils literal"&gt;80000000&lt;/tt&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now the procedure can be written as:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fill_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;int32_t&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The compilation result:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
f7 d8        neg    %eax
c1 f8 1f     sar    $0x1f,%eax
&lt;/pre&gt;
&lt;p&gt;Just two simple instructions.&lt;/p&gt;
&lt;p&gt;2018-03-11: GCC 7.2 still compiles &lt;tt class="docutils literal"&gt;fill_word_naive&lt;/tt&gt; the same way, but clang 6.0
produces the final, two-instruction sequence.&lt;/p&gt;
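&lt;p&gt;The negate-and-shift trick can be checked over the whole limited input set with a short sketch. It assumes GCC or Clang, where a right shift of a negative signed value is an arithmetic shift; the helper &lt;tt class="docutils literal"&gt;count_mismatches&lt;/tt&gt; is hypothetical and only illustrates the check.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Returns 0xffffffff for a power of two and 0 for zero.
// Precondition: x has at most one bit set.
uint32_t fill_word(uint32_t x) {
    assert((x & (x - 1)) == 0);
    // Negating 2^k sets bits k..31, so bit 31 is set for every non-zero input.
    const int32_t negated = (int32_t)(0u - x);
    // Assumption: >> on a negative signed value is an arithmetic shift
    // (true for GCC and Clang; implementation-defined before C++20).
    return (uint32_t)(negated >> 31);
}

// Counts the limited inputs (zero and 2^0 .. 2^31) for which fill_word
// disagrees with the expected result of the conditional expression.
int count_mismatches() {
    int mismatches = 0;
    if (fill_word(0) != 0) mismatches++;
    for (int k = 0; k < 32; k++) {
        const uint32_t x = UINT32_C(1) << k;
        if (fill_word(x) != UINT32_C(0xffffffff)) mismatches++;
    }
    return mismatches;
}
```

&lt;p&gt;The precondition matters: for an input such as 0x80000001 the negation clears the highest bit, so the trick does not extend to arbitrary values.&lt;/p&gt;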
  </description>
 </item>
 <item>
  <title>Small win over compiler</title>
  <link>http://0x80.pl/notesen/2014-09-30-win-over-compiler.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-09-30-win-over-compiler.html</guid>
  <pubDate>Tue, 30 Sep 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;There are some places where a low-level programmer can beat a compiler.
Consider this simple code:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;bsr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;// xor, because this builtin returns 31 - bsr(x)
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;__builtin_clz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bsr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Function &lt;tt class="docutils literal"&gt;min1&lt;/tt&gt; is compiled to (GCC 4.8 with flag &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-O3&lt;/span&gt;&lt;/tt&gt;):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
min1:
    movl    4(%esp), %edx
    movl    $1, %eax
    testl   %edx, %edx
    je  .L3
    bsrl    %edx, %eax
    addl    $1, %eax
.L3:
    rep ret
&lt;/pre&gt;
&lt;p&gt;There is a conditional jump, which is not very good. When we rewrite
the function as:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bsr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The result is this nice branchless code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
min2:
    movl    4(%esp), %eax
    orl $1, %eax
    bsrl    %eax, %eax
    addl    $1, %eax
    ret
&lt;/pre&gt;
&lt;p&gt;Conclusion: it's worth checking the compiler output. Sometimes.&lt;/p&gt;
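&lt;p&gt;Why the rewrite is safe can be shown with a small sketch, assuming the GCC/Clang builtin &lt;tt class="docutils literal"&gt;__builtin_clz&lt;/tt&gt;. The helper &lt;tt class="docutils literal"&gt;versions_agree&lt;/tt&gt; is hypothetical; it compares both versions on zero, on every power of two and on every value one below a power of two.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Index of the highest set bit; __builtin_clz is a GCC/Clang builtin
// and is undefined for 0, hence the assertion.
uint32_t bsr(uint32_t x) {
    assert(x != 0);
    return (uint32_t)__builtin_clz(x) ^ 31;
}

// Branching version from the text.
uint32_t min1(uint32_t x) {
    return (x != 0) ? bsr(x) + 1 : 1;
}

// Branchless version: x | 1 is never zero, and setting bit 0 cannot move
// the highest set bit of a non-zero x.
uint32_t min2(uint32_t x) {
    return bsr(x | 1) + 1;
}

// Hypothetical helper: true when both versions agree on 0, on every power
// of two, and on every value of the form 2^k - 1.
bool versions_agree() {
    if (min1(0) != min2(0)) return false;
    for (int k = 0; k < 32; k++) {
        const uint32_t p = UINT32_C(1) << k;
        if (min1(p) != min2(p)) return false;
        if (min1(p - 1) != min2(p - 1)) return false;
    }
    return true;
}
```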
  </description>
 </item>
 <item>
  <title>Interpolation search revisited</title>
  <link>http://0x80.pl/notesen/2014-09-25-interpolation-search.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-09-25-interpolation-search.html</guid>
  <pubDate>Thu, 25 Sep 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Interpolation_search"&gt;Interpolation search&lt;/a&gt; is a modification of binary search, where
the index of the &amp;quot;middle&amp;quot; key is obtained by linear interpolation of the values
at the start and end of the processed range:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
a := start
b := end
key := searched key

while a &amp;lt;= b loop
    t := (key - array[a])/(array[b] - array[a])
    c := a + floor(t * (b - a))

    -- in binary search just: c := (a + b)/2

    if key = array[c] then
        return c
    else if key &amp;lt; array[c] then
        b := c - 1
    else
        a := c + 1
    endif

end loop
&lt;/pre&gt;
&lt;p&gt;The clear advantage over basic binary search is the complexity &lt;span class="math"&gt;O(log log &lt;i&gt;n&lt;/i&gt;)&lt;/span&gt;.  When the size of the array is 1 million, the average number of
comparisons in binary search is &lt;span class="math"&gt;log&lt;sub&gt;2&lt;/sub&gt;&lt;i&gt;n&lt;/i&gt; = 20&lt;/span&gt;. For
interpolation search it's &lt;span class="math"&gt;log&lt;sub&gt;2&lt;/sub&gt;log&lt;sub&gt;2&lt;/sub&gt;&lt;i&gt;n&lt;/i&gt; = 4.3&lt;/span&gt;, i.e. about 4.6 times
fewer comparisons.&lt;/p&gt;
&lt;p&gt;However, this property holds only when the distribution of keys is
&lt;strong&gt;uniform&lt;/strong&gt;. I guess this is the reason why the algorithm is not well
known: enforcing a uniform distribution on real data is hard.  Also,
calculating the index &lt;tt class="docutils literal"&gt;c&lt;/tt&gt; is more computationally expensive.&lt;/p&gt;
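&lt;p&gt;The pseudocode above can be sketched in C++ as follows. This is a minimal illustration, not a tuned implementation: the function name, the &lt;tt class="docutils literal"&gt;int32_t&lt;/tt&gt; key type and the floating-point interpolation are assumptions. Two guards are added that the pseudocode leaves out: the key is kept within the current range, and a range whose both ends hold the same value is handled separately to avoid division by zero.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Returns the index of key in the ascending array of n elements, or -1.
long interpolation_search(const int32_t* array, long n, int32_t key) {
    assert(n >= 0);
    long a = 0;
    long b = n - 1;

    // Extra guard (not in the pseudocode): keep key inside
    // array[a] .. array[b], so that t stays within 0 .. 1.
    while (a <= b && key >= array[a] && key <= array[b]) {
        if (array[a] == array[b]) {
            return a;  // every key in the range equals the searched key
        }

        const double t = ((double)key - array[a]) / ((double)array[b] - array[a]);
        const long c = a + (long)(t * (double)(b - a));  // truncation == floor, as t >= 0

        if (array[c] == key) {
            return c;
        } else if (key < array[c]) {
            b = c - 1;
        } else {
            a = c + 1;
        }
    }

    return -1;
}
```

&lt;p&gt;The division and the conversions between integers and floating point in every iteration are exactly the extra cost of computing &lt;tt class="docutils literal"&gt;c&lt;/tt&gt; mentioned above.&lt;/p&gt;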
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Software emulation of PDEP</title>
  <link>http://0x80.pl/notesen/2014-09-23-pdep-soft-emu.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-09-23-pdep-soft-emu.html</guid>
  <pubDate>Tue, 23 Sep 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;PDEP&lt;/strong&gt; is a new instruction from BMI2 (&lt;em&gt;Bit Manipulation Instructions&lt;/em&gt;).
The pseudocode for the 32-bit &lt;tt class="docutils literal"&gt;PDEP&lt;/tt&gt; variant:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
src    - a source word
mask   - a mask

result := 0
k := 0
for i := 0 .. 31 loop
    if mask.bit[i] then
        result.bit[i] := src.bit[k]
        k := k + 1
    end if
end loop
&lt;/pre&gt;
&lt;p&gt;Quite complicated, but it is really fast: latency is just 3 cycles and
throughput is one instruction per cycle. I've shown how to use this instruction
to convert &lt;a class="reference external" href="2014-09-11-convert-to-bin.html"&gt;integers to binary&lt;/a&gt; and &lt;a class="reference external" href="2014-09-21-convert-to-hex.html"&gt;hexadecimal&lt;/a&gt; representation.&lt;/p&gt;
&lt;p&gt;I was wondering how this algorithm would run on old hardware.&lt;/p&gt;
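&lt;p&gt;As a baseline, the pseudocode can be transcribed directly into C++. This is only a bit-by-bit sketch under the hypothetical name &lt;tt class="docutils literal"&gt;pdep_soft&lt;/tt&gt;; it is not claimed to be the fastest software emulation.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Bit-by-bit software emulation of the 32-bit PDEP: consecutive low bits
// of src are deposited at the positions of the set bits of mask.
uint32_t pdep_soft(uint32_t src, uint32_t mask) {
    uint32_t result = 0;
    unsigned k = 0;  // index of the next unused bit of src

    for (unsigned i = 0; i < 32; i++) {
        if (mask & (UINT32_C(1) << i)) {
            assert(k < 32);  // k counts the mask bits seen so far
            if (src & (UINT32_C(1) << k)) {
                result |= UINT32_C(1) << i;
            }
            k++;
        }
    }

    return result;
}
```

&lt;p&gt;For example, with the mask 0xf0 the four lowest bits of the source land in bits 4 .. 7 of the result.&lt;/p&gt;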
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Conversion numbers to hexadecimal representation</title>
  <link>http://0x80.pl/notesen/2014-09-21-convert-to-hex.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-09-21-convert-to-hex.html</guid>
  <pubDate>Sun, 21 Sep 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="branchless-converting-single-nibble"&gt;
&lt;h1&gt;Branchless converting single nibble&lt;/h1&gt;
&lt;p&gt;For a nibble (0..15) stored in a byte, the conversion could be coded this way:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
char nibble_to_hex(uint8_t byte) {
    assert(byte &amp;gt;= 0 &amp;amp;&amp;amp; byte &amp;lt;= 15);

    char c = byte + '0';
    if (byte &amp;gt; 9)
        c += 'a' - '0' - 10;

    return c;
}
&lt;/pre&gt;
&lt;p&gt;If a nibble is greater than 9, then the resulting character has to be an ASCII letter
'a' .. 'f' (or 'A' .. 'F'). It's done with a simple correction of
the code; the value of the correction &lt;tt class="docutils literal"&gt;'a' - 10 - '0'&lt;/tt&gt; is 39, and for
uppercase letters it is 7.&lt;/p&gt;
&lt;p&gt;The condition has to be replaced by a &lt;strong&gt;branchless expression&lt;/strong&gt;. First
the code is changed to:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
char nibble_to_hex2(uint8_t byte) {
    assert(byte &amp;gt;= 0 &amp;amp;&amp;amp; byte &amp;lt;= 15);

    const char corr = 'a' - '0' - 10;
    const char c    = byte + '0';

    uint8_t mask    = (byte &amp;gt; 9) ? 0xff : 0x00;

    return c + (mask &amp;amp; corr);
}
&lt;/pre&gt;
&lt;p&gt;We're a bit closer. Now the question is: how to get a mask from the condition
&lt;tt class="docutils literal"&gt;byte &amp;gt; 9&lt;/tt&gt;? Let's examine a simple addition: &lt;tt class="docutils literal"&gt;128 - 10 + x&lt;/tt&gt;.
For values &lt;tt class="docutils literal"&gt;x = 0 .. 9&lt;/tt&gt; the result is in the range &lt;tt class="docutils literal"&gt;128 - 10&lt;/tt&gt; .. &lt;tt class="docutils literal"&gt;128 - 1&lt;/tt&gt;
and for values &lt;tt class="docutils literal"&gt;x = 10 .. 15&lt;/tt&gt; the result is in the range &lt;tt class="docutils literal"&gt;128&lt;/tt&gt; .. &lt;tt class="docutils literal"&gt;128 + 5&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Observation&lt;/strong&gt;: for x in the range &lt;tt class="docutils literal"&gt;10 .. 15&lt;/tt&gt; the result has the highest bit set,
otherwise it's clear. In other words, after masking with 0x80 we get 0x80 or 0x00 depending on the
condition, and now the mask can be calculated as:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
uint8_t tmp  = 128 - 10 + byte;
uint8_t msb  = tmp &amp;amp; 0x80;

uint8_t mask = msb - (msb &amp;gt;&amp;gt; 7) | msb; // 3 operations
&lt;/pre&gt;
&lt;p&gt;Since the correction's value is 39 or 7, i.e. less than 128, the mask can
be calculated in a simpler way, yielding the values 0x7f/0x00:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
uint8_t mask = msb - (msb &amp;gt;&amp;gt; 7); // 2 operations
&lt;/pre&gt;
&lt;p&gt;The final version:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
char nibble_to_hex3(uint8_t byte) {
    assert(byte &amp;gt;= 0 &amp;amp;&amp;amp; byte &amp;lt;= 15);

    const char corr = 'a' - '0' - 10;
    const char c    = byte + '0';

    uint8_t tmp  = 128 - 10 + byte;
    uint8_t msb  = tmp &amp;amp; 0x80;

    uint8_t mask = msb - (msb &amp;gt;&amp;gt; 7); // 0x7f or 0x00

    return c + (mask &amp;amp; corr);
}
&lt;/pre&gt;
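&lt;p&gt;A compilable sketch of the branchless nibble conversion, together with a hypothetical helper &lt;tt class="docutils literal"&gt;u32_to_hex&lt;/tt&gt; that applies it to all eight nibbles of a 32-bit word. The helper's name and its output-buffer convention are assumptions made for illustration.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Branchless conversion of a nibble (0..15) to a lowercase hex digit.
char nibble_to_hex(uint8_t nibble) {
    assert(nibble <= 15);

    const uint8_t corr = 'a' - '0' - 10;               // 39
    const uint8_t tmp  = (uint8_t)(128 - 10 + nibble); // bit 7 set iff nibble > 9
    const uint8_t msb  = tmp & 0x80;                   // 0x80 or 0x00
    const uint8_t mask = (uint8_t)(msb - (msb >> 7));  // 0x7f or 0x00

    return (char)(nibble + '0' + (mask & corr));
}

// Hypothetical helper: writes the 8 hex digits of x, most significant
// first, followed by a terminating zero; out must hold 9 chars.
void u32_to_hex(uint32_t x, char* out) {
    for (int i = 0; i < 8; i++) {
        out[i] = nibble_to_hex((uint8_t)((x >> (28 - 4 * i)) & 0xf));
    }
    out[8] = '\0';
}
```

&lt;p&gt;For the nibble 10 the intermediate values are tmp = 128, msb = 0x80 and mask = 0x7f, so the correction 39 is added and the result is 'a'.&lt;/p&gt;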
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Conversion numbers to binary representation</title>
  <link>http://0x80.pl/notesen/2014-09-11-convert-to-bin.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-09-11-convert-to-bin.html</guid>
  <pubDate>Thu, 11 Sep 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="converting-byte-swar-version-1"&gt;
&lt;h1&gt;Converting byte &amp;mdash; SWAR version 1&lt;/h1&gt;
&lt;div class="section" id="algorithm"&gt;
&lt;h2&gt;Algorithm&lt;/h2&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;Populate byte &lt;strong&gt;v&lt;/strong&gt; (bits: &lt;cite&gt;abcdefgh&lt;/cite&gt;) in a 64-bit word:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x0101010101010101&lt;/span&gt;
&lt;/pre&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Isolate one bit per byte:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x8040201008040201&lt;/span&gt;
&lt;/pre&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│&lt;span style="color: blue; font-weight: bold"&gt;a&lt;/span&gt;&lt;span style="color: gray"&gt;┈┈┈┈┈┈┈&lt;/span&gt;│&lt;span style="color: gray"&gt;.&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;b&lt;/span&gt;&lt;span style="color: gray"&gt;┈┈┈┈┈┈&lt;/span&gt;│&lt;span style="color: gray"&gt;┈┈&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;c&lt;/span&gt;&lt;span style="color: gray"&gt;┈┈┈┈┈&lt;/span&gt;│&lt;span style="color: gray"&gt;┈┈┈&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;d&lt;/span&gt;&lt;span style="color: gray"&gt;┈┈┈┈&lt;/span&gt;│&lt;span style="color: gray"&gt;┈┈┈┈&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;e&lt;/span&gt;&lt;span style="color: gray"&gt;┈┈┈&lt;/span&gt;│&lt;span style="color: gray"&gt;┈┈┈┈┈&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;f&lt;/span&gt;&lt;span style="color: gray"&gt;┈┈&lt;/span&gt;│&lt;span style="color: gray"&gt;┈┈┈┈┈┈&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;g&lt;/span&gt;&lt;span style="color: gray"&gt;.&lt;/span&gt;│&lt;span style="color: gray"&gt;┈┈┈┈┈┈┈&lt;/span&gt;&lt;span style="color: blue; font-weight: bold"&gt;h&lt;/span&gt;│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Clone each bit to the highest position:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00406070787c7e7f&lt;/span&gt;
&lt;/pre&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
     ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
r2 = │&lt;span style="color: blue; font-weight: bold"&gt;a&lt;/span&gt;┈┈┈┈┈┈┈│.&lt;span style="color: blue; font-weight: bold"&gt;b&lt;/span&gt;┈┈┈┈┈┈│┈┈&lt;span style="color: blue; font-weight: bold"&gt;c&lt;/span&gt;┈┈┈┈┈│┈┈┈&lt;span style="color: blue; font-weight: bold"&gt;d&lt;/span&gt;┈┈┈┈│┈┈┈┈&lt;span style="color: blue; font-weight: bold"&gt;e&lt;/span&gt;┈┈┈│┈┈┈┈┈&lt;span style="color: blue; font-weight: bold"&gt;f&lt;/span&gt;┈┈│┈┈┈┈┈┈&lt;span style="color: blue; font-weight: bold"&gt;g&lt;/span&gt;.│┈┈┈┈┈┈┈&lt;span style="color: blue; font-weight: bold"&gt;h&lt;/span&gt;│
     └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘

     ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
   + │&lt;span style="color: gray"&gt;00000000&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;1&lt;/span&gt;&lt;span style="color: gray"&gt;000000&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;11&lt;/span&gt;&lt;span style="color: gray"&gt;00000&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;111&lt;/span&gt;&lt;span style="color: gray"&gt;0000&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;1111&lt;/span&gt;&lt;span style="color: gray"&gt;000&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;11111&lt;/span&gt;&lt;span style="color: gray"&gt;00&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;111111&lt;/span&gt;&lt;span style="color: gray"&gt;0&lt;/span&gt;│&lt;span style="color: gray"&gt;0&lt;/span&gt;&lt;span style="font-weight: bold"&gt;1111111&lt;/span&gt;│
     └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘

     ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
r3 = │&lt;span style="color: blue; font-weight: bold"&gt;a&lt;/span&gt;┈┈┈┈┈┈┈│&lt;span style="color: blue; font-weight: bold"&gt;b&lt;/span&gt;&lt;span style="font-weight: bold"&gt;x&lt;/span&gt;┈┈┈┈┈┈│&lt;span style="color: blue; font-weight: bold"&gt;c&lt;/span&gt;&lt;span style="font-weight: bold"&gt;xx&lt;/span&gt;┈┈┈┈┈│&lt;span style="color: blue; font-weight: bold"&gt;d&lt;/span&gt;&lt;span style="font-weight: bold"&gt;xxx&lt;/span&gt;┈┈┈┈│&lt;span style="color: blue; font-weight: bold"&gt;e&lt;/span&gt;&lt;span style="font-weight: bold"&gt;xxxx&lt;/span&gt;┈┈┈│&lt;span style="color: blue; font-weight: bold"&gt;f&lt;/span&gt;&lt;span style="font-weight: bold"&gt;xxxxx&lt;/span&gt;┈┈│&lt;span style="color: blue; font-weight: bold"&gt;g&lt;/span&gt;&lt;span style="font-weight: bold"&gt;xxxxxx&lt;/span&gt;.│&lt;span style="color: blue; font-weight: bold"&gt;h&lt;/span&gt;&lt;span style="font-weight: bold"&gt;xxxxxxx&lt;/span&gt;│
     └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Move 7th bits to the 0th position and mask the lowest bits:&lt;/p&gt;
&lt;pre class="code literal-block"&gt;
const uint64_t r4 = (r3 &amp;gt;&amp;gt; 7) &amp;amp; 0x0101010101010101
&lt;/pre&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;a&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;b&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;c&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;d&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;e&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;f&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;g&lt;/span&gt;│┈┈┈┈┈┈┈&lt;span style="font-weight: bold"&gt;h&lt;/span&gt;│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;Finally convert to ASCII:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
const uint64_t result = 0x3030303030303030 + r4;  // ord('0') == 0x30
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div class="section" id="notes"&gt;
&lt;h2&gt;Notes&lt;/h2&gt;
&lt;p&gt;Sample implementation:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;convert_swar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x0101010101010101&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x8040201008040201&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x00406070787c7e7f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x0101010101010101&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x3030303030303030&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;r4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// ord('0') == 0x30
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;This algorithm requires:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;1 multiplication,&lt;/li&gt;
&lt;li&gt;1 right shift,&lt;/li&gt;
&lt;li&gt;2 bit-ands,&lt;/li&gt;
&lt;li&gt;2 additions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Multiplication and shift are the slowest operations.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>C++ bitset vs array</title>
  <link>http://0x80.pl/notesen/2014-03-22-cpp-bitset-vs-byteset.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-22-cpp-bitset-vs-byteset.html</guid>
  <pubDate>Sat, 22 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The C++ bitset conserves memory, but at the cost of access speed. The bitset must
be slower than a set represented as a plain old array, at least when the sets
are small (say a few hundred elements).&lt;/p&gt;
&lt;p&gt;Let's look at these simple functions:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="c1"&gt;// set_test.cpp
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;bitset&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;typedef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte_set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;any_in_byteset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byte_set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0u&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="k"&gt;typedef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;bitset&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;any_in_bitset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0u&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The file was compiled with &lt;tt class="docutils literal"&gt;g++ &lt;span class="pre"&gt;-std=c++11&lt;/span&gt; &lt;span class="pre"&gt;-O3&lt;/span&gt; set_test.cpp&lt;/tt&gt;. The assembly
code of the core loop of &lt;tt class="docutils literal"&gt;any_in_byteset&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
28: 0f b6 10              movzbl (%eax),%edx
2b: 83 c0 01              add    $0x1,%eax
2e: 80 3c 11 00           cmpb   $0x0,(%ecx,%edx,1)
32: 75 0c                 jne    40
34: 39 d8                 cmp    %ebx,%eax
36: 75 f0                 jne    28
&lt;/pre&gt;
&lt;p&gt;The statement &lt;tt class="docutils literal"&gt;if &lt;span class="pre"&gt;(set[data[i]])&lt;/span&gt; return true&lt;/tt&gt; corresponds to the instructions at 28, 2e and 32,
i.e. a load from memory, a compare and a jump. The instructions at 2b, 34 and 36
handle the for loop.&lt;/p&gt;
&lt;p&gt;Now look at the assembly code of &lt;tt class="docutils literal"&gt;any_in_bitset&lt;/tt&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
5f: 0f b6 13              movzbl (%ebx),%edx
62: b8 01 00 00 00        mov    $0x1,%eax
67: 89 d1                 mov    %edx,%ecx
69: 83 e1 1f              and    $0x1f,%ecx
6c: c1 ea 05              shr    $0x5,%edx
6f: d3 e0                 shl    %cl,%eax
71: 85 44 94 18           test   %eax,0x18(%esp,%edx,4)
75: 75 39                 jne    b0
&lt;/pre&gt;
&lt;p&gt;All these instructions implement the if statement! Again, we have a load
from memory (5f), but checking which bit is set requires much more work.
The input (&lt;tt class="docutils literal"&gt;edx&lt;/tt&gt;) is split into the lower part, i.e. the bit number (67, 69), and
the higher part, i.e. the word index (6c). The last step is to check whether the bit is
set in the word. GCC used a variable left shift (6f), but x86 has the &lt;tt class="docutils literal"&gt;BT&lt;/tt&gt;
instruction, so perfect code would need two instructions fewer.&lt;/p&gt;
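&lt;p&gt;The split into a word index and a bit number can also be written by hand. The sketch below is only an illustration: it assumes 32-bit words, like the assembly above, and &lt;tt class="docutils literal"&gt;test_bit&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;kWordBits&lt;/tt&gt; are hypothetical names, not part of &lt;tt class="docutils literal"&gt;std::bitset&lt;/tt&gt;. Division and remainder by a power of two usually compile to the same shift and mask.&lt;/p&gt;

```cpp
// Hypothetical helper, not part of std::bitset: tests bit number `v`
// in a set stored as an array of 32-bit words.
// Assumes `unsigned` is 32 bits wide, as on the x86 target above.
constexpr unsigned kWordBits = 32;

bool test_bit(const unsigned* words, const unsigned char v) {
    const unsigned word_index = v / kWordBits;  // higher part, the shr by 5
    const unsigned bit_number = v % kWordBits;  // lower part, the and with 0x1f

    // move the wanted bit down to position 0 and check whether it is set
    return (words[word_index] >> bit_number) % 2 != 0;
}
```

&lt;p&gt;The byteset lookup needs none of this arithmetic: it is a single indexed load followed by a compare.&lt;/p&gt;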
&lt;p&gt;As we can see, a simple access to the bitset is much more complicated
than a plain memory fetch from the byteset. For small sets the memory fetches are
well cached and the smaller number of instructions improves performance. For
really large sets cache misses would kill performance, and then the bitset is
a much better choice.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Quick and dirty ad-hoc git hosting</title>
  <link>http://0x80.pl/notesen/2014-03-19-quick-git-server.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-19-quick-git-server.html</guid>
  <pubDate>Wed, 19 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Recently I needed to synchronize my local repository with a remote machine,
just for full backup. It's really simple if you have standard Linux tools
(Cygwin works too, of course).&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;in a working directory run:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ pwd
/home/foo/project
#         ^^^^^^^
$ git update-server-info
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;in the parent directory start an HTTP server:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ cd ..
$ pwd
/home/foo
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;on a remote machine clone/pull/whatever:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ git clone http://your_ip:8000/project/.git
                                ^^^^^^^
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
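&lt;p&gt;Step 1 has to be executed manually whenever the local repository changes. With Python 3 the server module is called &lt;tt class="docutils literal"&gt;http.server&lt;/tt&gt; instead of &lt;tt class="docutils literal"&gt;SimpleHTTPServer&lt;/tt&gt;.&lt;/p&gt;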
  </description>
 </item>
 <item>
  <title>Is const-correctness paranoia good?</title>
  <link>http://0x80.pl/notesen/2014-03-19-const-correctness.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-19-const-correctness.html</guid>
  <pubDate>Wed, 19 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Yes, definitely. Let's look at this simple example:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ cat test.cpp
int test(int x) {
 if (x = 1)
  return 42;
 else
  return 0;
}
$ g++ -c test.cpp
$ g++ -c -Wall test.cpp
int test(int x) {
 if (x = 1)
  return 42;
 else
  return 0;
}
&lt;/pre&gt;
&lt;p&gt;Only when we turn the warnings on does the compiler tell us about a possible error.
Making the parameter const turns the mistake into a compile error:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ cat test2.cpp
int test(const int x) {
 if (x = 1)
  return 42;
 else
  return 0;
}
$ g++ -c test2.cpp
test2.cpp: In function ‘int test(int)’:
test2.cpp:2:8: error: assignment of read-only parameter ‘x’
  if (x = 1)
        ^
&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;All&lt;/strong&gt; input parameters should be const; all write-once variables serving as
parameters for some computations should also be const.&lt;/p&gt;
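&lt;p&gt;A small illustration of that rule follows. The function &lt;tt class="docutils literal"&gt;clamp_percent&lt;/tt&gt; and its variables are made up for this example; the point is only that every input parameter and every write-once local is declared const.&lt;/p&gt;

```cpp
// Hypothetical example: the parameter and the write-once locals are all
// const, so an accidental assignment in a condition fails to compile.
int clamp_percent(const int value) {
    const int lower = 0;
    const int upper = 100;

    if (value > upper)   // writing `value = upper` here would be an error
        return upper;
    if (lower > value)
        return lower;

    return value;
}
```

&lt;p&gt;Callers are not affected: const on a by-value parameter does not change the function's signature, which is why the error message above still says &lt;tt class="docutils literal"&gt;int test(int)&lt;/tt&gt;.&lt;/p&gt;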
  </description>
 </item>
 <item>
  <title>Scalar version of SSE move mask instruction</title>
  <link>http://0x80.pl/notesen/2014-03-16-scalar-sse-movmask.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-16-scalar-sse-movmask.html</guid>
  <pubDate>Sun, 16 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="sample-implementation"&gt;
&lt;h1&gt;Sample implementation&lt;/h1&gt;
&lt;p&gt;C function for 32-bit numbers:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
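&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;assert.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;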
&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;movmask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7f7f7f7f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x02040810&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;GCC generated the best possible code:&lt;/p&gt;
&lt;p&gt;Disassembly of section .text:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
   0: b8 10 08 04 02        mov    $0x2040810,%eax
   5: f7 64 24 04           mull   0x4(%esp)
   9: 0f b6 c2              movzbl %dl,%eax
   c: c3                    ret
&lt;/pre&gt;
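&lt;p&gt;The 32-bit trick can be checked by brute force, because only the 16 combinations of the four most significant bits matter. The sketch below is not the author's proof (that one is written in Python and linked at the end of this note). All function names are hypothetical, the assert from the original is left out, and &lt;tt class="docutils literal"&gt;unsigned&lt;/tt&gt; is assumed to be 32 bits wide.&lt;/p&gt;

```cpp
// Reference version: gathers the most significant bit of each byte one
// at a time. Hypothetical name, used only for this check.
unsigned movmask_naive(const unsigned input) {
    return ((input >> 7)  % 2) * 1
         | ((input >> 15) % 2) * 2
         | ((input >> 23) % 2) * 4
         | ((input >> 31) % 2) * 8;
}

// Multiplication-based version with the same constant as above.
// Assumes the low 7 bits of every input byte are already cleared.
unsigned movmask_mul(const unsigned input) {
    const unsigned long long product = 0x02040810ull * input;
    return (product >> 32) % 256;   // same as masking with 0xff
}

// Returns true when both versions agree on all 16 combinations of MSBs.
bool movmask_versions_agree() {
    for (unsigned bits = 0; bits != 16; ++bits) {
        // spread the 4 bits to positions 7, 15, 23 and 31
        const unsigned input = (bits % 2)       * 0x00000080u
                             + ((bits / 2) % 2) * 0x00008000u
                             + ((bits / 4) % 2) * 0x00800000u
                             + ((bits / 8) % 2) * 0x80000000u;

        if (movmask_naive(input) != bits)
            return false;
        if (movmask_mul(input) != bits)
            return false;
    }
    return true;
}
```

&lt;p&gt;The multiplier moves bit 7 to bit 32, bit 15 to bit 33, bit 23 to bit 34 and bit 31 to bit 35. All other partial products land on distinct bits outside that range, so no carry disturbs the result.&lt;/p&gt;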
&lt;p&gt;C function for 64-bit numbers (the type &lt;tt class="docutils literal"&gt;__int128&lt;/tt&gt; is a &lt;a class="reference external" href="https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html"&gt;GCC extension&lt;/a&gt;):&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;movmask64_unsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7f7f7f7f7f7f7f7flu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1l&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;__int128&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;__int128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;mult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;__int128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0xff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The disassembly shows that GCC again generated the shortest possible code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
48 bf 00 81 40 20 10 08 04 02    movabs $0x204081020408100,%rdi
48 f7 e7                         mul    %rdi
0f b6 c2                         movzbl %dl,%eax
&lt;/pre&gt;
&lt;p&gt;Full source code is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/movmask"&gt;available&lt;/a&gt;, including the &lt;strong&gt;proof&lt;/strong&gt; written in Python.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SIMD-friendly Rabin-Karp modification</title>
  <link>http://0x80.pl/notesen/2014-03-11-simd-friendly-karp-rabin.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-11-simd-friendly-karp-rabin.html</guid>
  <pubDate>Tue, 11 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="algorithm"&gt;
&lt;h1&gt;Algorithm&lt;/h1&gt;
&lt;p&gt;The &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Rabin-Karp_algorithm"&gt;Rabin-Karp algorithm&lt;/a&gt; uses a weak hash function to locate possible
substring positions. This modification merely tests equality of the first and
the last char of the searched substring; such comparisons can be done
very fast &lt;strong&gt;in parallel&lt;/strong&gt;, even without SIMD instructions.&lt;/p&gt;
&lt;p&gt;Let &lt;tt class="docutils literal"&gt;packed_byte(x)&lt;/tt&gt; be a function that fills a CPU register with the byte
&lt;tt class="docutils literal"&gt;x&lt;/tt&gt;, for example on a 32-bit architecture:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
// packed_byte(0xab) = 0xabababab
uint32_t packed_byte(uint8_t byte) {
        return 0x01010101 * byte;
}
&lt;/pre&gt;
&lt;p&gt;In a single iteration two registers are filled with parts of the string:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
const size_t n = strlen(string);
const size_t k = strlen(substring);

const uint32_t first = packed_byte(substring[0]);
const uint32_t last  = packed_byte(substring[k - 1]);

for (size_t i = 0; i + k &amp;lt;= n; i += 4) {
        // 4 bytes starting at the candidate positions i .. i + 3
        const uint32_t block_first = string[i         .. i + 4];
        // 4 bytes where the last chars of those candidates would be
        const uint32_t block_last  = string[i + k - 1 .. i + k + 3];

        ...
}
&lt;/pre&gt;
&lt;p&gt;Then the parallel comparison is done with a simple xor:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
const uint32_t first_zeros = block_first ^ first;
const uint32_t last_zeros  = block_last  ^ last;
&lt;/pre&gt;
&lt;p&gt;Zero bytes in &lt;tt class="docutils literal"&gt;first_zeros&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;last_zeros&lt;/tt&gt; indicate equality of chars.
A candidate position needs a zero byte at the same index in both words, so an additional bit-or is
needed:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
const uint32_t zeros = first_zeros | last_zeros;
&lt;/pre&gt;
&lt;p&gt;Getting zeros requires only:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;two memory fetches,&lt;/li&gt;
&lt;li&gt;two bit-xor,&lt;/li&gt;
&lt;li&gt;one bit-or.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now we have to check whether &lt;tt class="docutils literal"&gt;zeros&lt;/tt&gt; has any zero byte and, if so, iterate through
its bytes and perform a full comparison at every position marked by a zero byte:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
if (has_zero_byte(zeros)) {
        for (int j = 0; j &amp;lt; 4; j++) {
                if (is_zero(zeros, j) &amp;amp;&amp;amp; memcmp(&amp;amp;string[i + j], substring, k) == 0) {
                        return i + j;
                }
        }
}
&lt;/pre&gt;
&lt;p&gt;The function &lt;tt class="docutils literal"&gt;has_zero_byte(uint32_t word)&lt;/tt&gt; could be implemented using
the algorithm from Bit Twiddling Hacks &amp;mdash; &lt;a class="reference external" href="http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord"&gt;Determine if a word has a zero
byte&lt;/a&gt;. The function &lt;tt class="docutils literal"&gt;is_zero(uint32_t word, int k)&lt;/tt&gt; may use results from
&lt;tt class="docutils literal"&gt;has_zero_byte&lt;/tt&gt; (warning: this method has a &lt;a class="reference external" href="http://wmula.blogspot.com/2014/03/mask-for-zeronon-zero-bytes.html"&gt;drawback&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The worst-case complexity is O(n*k), where n and k are the lengths of the string
and the substring. However, the method minimizes memory fetches during a string
scan, comparisons are performed in parallel, and no preprocessing is required
(except for two &lt;tt class="docutils literal"&gt;packed_byte&lt;/tt&gt; calls before the main loop).&lt;/p&gt;
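&lt;p&gt;Putting the fragments above together, a minimal self-contained sketch might look as follows. The names &lt;tt class="docutils literal"&gt;swar_find&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;load32&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;zero_byte_mask&lt;/tt&gt; are hypothetical and do not come from the original text. The sketch assumes unaligned loads done with &lt;tt class="docutils literal"&gt;memcpy&lt;/tt&gt;, uses an exact zero-byte mask in place of &lt;tt class="docutils literal"&gt;has_zero_byte&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;is_zero&lt;/tt&gt;, and adds a plain byte-wise tail loop so that no read goes past the end of the string:&lt;/p&gt;
<![CDATA[
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill all four bytes of a 32-bit word with the same value. */
static uint32_t packed_byte(uint8_t byte) {
    return UINT32_C(0x01010101) * byte;
}

/* Exact mask: 0x80 in every byte of v that equals zero, 0x00 elsewhere. */
static uint32_t zero_byte_mask(uint32_t v) {
    const uint32_t nonzero_7bit = (v & UINT32_C(0x7f7f7f7f)) + UINT32_C(0x7f7f7f7f);
    return ~(v | nonzero_7bit) & UINT32_C(0x80808080);
}

/* Unaligned 4-byte load; memcpy keeps it well defined in C. */
static uint32_t load32(const char *p) {
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Hypothetical name. Returns the index of the first occurrence, or -1. */
ptrdiff_t swar_find(const char *string, const char *substring) {
    const size_t n = strlen(string);
    const size_t k = strlen(substring);
    if (k == 0)
        return 0;
    if (k > n)
        return -1;

    const uint32_t first = packed_byte((uint8_t)substring[0]);
    const uint32_t last  = packed_byte((uint8_t)substring[k - 1]);
    const size_t positions = n - k + 1; /* number of valid start positions */

    size_t i = 0;
    /* Block loop: i .. i+3 are all valid starts, so both loads stay in bounds. */
    for (; i + 4 <= positions; i += 4) {
        const uint32_t first_zeros = load32(string + i) ^ first;
        const uint32_t last_zeros  = load32(string + i + k - 1) ^ last;
        const uint32_t mask = zero_byte_mask(first_zeros | last_zeros);
        if (mask == 0)
            continue;

        /* Copy the mask back to bytes, so byte j matches string[i + j]
           regardless of the machine's endianness. */
        uint8_t marks[4];
        memcpy(marks, &mask, sizeof marks);
        for (size_t j = 0; j < 4; j++) {
            if (marks[j] != 0 && memcmp(string + i + j, substring, k) == 0)
                return (ptrdiff_t)(i + j);
        }
    }

    /* Tail: fewer than four start positions are left. */
    for (; i < positions; i++) {
        if (memcmp(string + i, substring, k) == 0)
            return (ptrdiff_t)i;
    }
    return -1;
}
```
]]>
&lt;p&gt;The tail loop is a design choice of this sketch only; a padded buffer or a masked final load would be alternatives.&lt;/p&gt;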
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>C++ standard inaccuracy</title>
  <link>http://0x80.pl/notesen/2014-03-11-cpp-standard.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-11-cpp-standard.html</guid>
  <pubDate>Tue, 11 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;First we read:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;21.4.1 basic_string general requirements [string.require]&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;3 No erase() or pop_back() member function shall throw any exceptions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... a few pages later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;21.4.6.5 basic_string::erase [string::erase]&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;basic_string&amp;lt;charT,traits,
Allocator&amp;gt;&amp;amp; erase(size_type pos = 0, size_type n = npos);&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;2 &lt;strong&gt;Throws&lt;/strong&gt;: out_of_range if pos &amp;gt; size().&lt;/p&gt;
&lt;/blockquote&gt;
  </description>
 </item>
 <item>
  <title>Integer log 10 of an unsigned integer --- SIMD version</title>
  <link>http://0x80.pl/notesen/2014-03-09-simd-int-log-10.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-09-simd-int-log-10.html</guid>
  <pubDate>Sun, 09 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;A fast way to calculate &lt;span class="math"&gt;floor(log&lt;sub&gt;10&lt;/sub&gt;&lt;i&gt;x&lt;/i&gt;)&lt;/span&gt; of an unsigned number is
described on &lt;a class="reference external" href="http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10"&gt;Bit Twiddling Hacks&lt;/a&gt;; this text shows a SIMD solution
for 32-bit numbers.&lt;/p&gt;
&lt;p&gt;Algorithm:&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;populate the value in XMM registers. Since x is compared with nine powers
of 10 (the result for a 32-bit number is at most 9), we need three registers:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
movd   %eax, %xmm0          // xmm0 = packed_dword(0, 0, 0, x)
pshufd $0, %xmm0, %xmm0     // xmm0 = packed_dword(x, x, x, x)
movapd %xmm0, %xmm1
movapd %xmm0, %xmm2
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;compare these numbers with sequence of powers of 10:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
// powers_a = packed_dword(10^1 - 1, 10^2 - 1, 10^3 - 1, 10^4 - 1)
// powers_b = packed_dword(10^5 - 1, 10^6 - 1, 10^7 - 1, 10^8 - 1)
// powers_c = packed_dword(10^9 - 1, 0, 0, 0)
pcmpgtd powers_a, %xmm0
pcmpgtd powers_b, %xmm1
pcmpgtd powers_c, %xmm2
&lt;/pre&gt;
&lt;p&gt;the results of the comparisons are 0 (false) or -1 (true), for example:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
xmm0 = packed_dword(-1, -1, -1, -1)
xmm1 = packed_dword( 0, 0, -1, -1)
xmm2 = packed_dword( 0, 0, 0, 0)
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;calculate sum of all dwords:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
psrld $31, %xmm0       // xmm0 = packed_dword( 1, 1, 1, 1) - convert -1 to 1
psubd %xmm1, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)
psubd %xmm2, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)

// convert packed_dword to packed_word
pxor %xmm1, %xmm1
packssdw %xmm1, %xmm0 // xmm0 = packed_word(0, 0, 0, 0, 1, 1, 2, 2)

// max value of word in xmm0 is 3, so higher
// bytes are always zero
psadbw %xmm1, %xmm0   // xmm0 = packed_qword(0, 6)
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;save the result, i.e. the lowest dword:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
movd %xmm0, %eax      // eax = 6
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
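&lt;p&gt;As a cross-check of the steps above, here is a scalar model of the same idea: count how many of the thresholds 10^k - 1 (k = 1..9) are exceeded by x. The function name &lt;tt class="docutils literal"&gt;int_log10_model&lt;/tt&gt; and the flat &lt;tt class="docutils literal"&gt;thresholds&lt;/tt&gt; array are assumptions made for illustration; this is not the SIMD code itself, only a reference that such code could be tested against:&lt;/p&gt;
<![CDATA[
```c
#include <stdint.h>

/* Same values as powers_a, powers_b and the first lane of powers_c:
   10^k - 1 for k = 1..9. */
static const uint32_t thresholds[9] = {
    9u, 99u, 999u, 9999u,
    99999u, 999999u, 9999999u, 99999999u,
    999999999u
};

/* Hypothetical name. Each comparison contributes 0 or 1, just as each
   SIMD lane contributes 0 or -1 before the lanes are summed. */
unsigned int_log10_model(uint32_t x) {
    unsigned count = 0;
    for (int k = 0; k < 9; k++)
        count += (x > thresholds[k]);
    return count;
}
```
]]>
&lt;p&gt;Note that &lt;tt class="docutils literal"&gt;pcmpgtd&lt;/tt&gt; compares signed dwords, while this model uses unsigned comparisons, so the two can only be expected to agree for x below 2^31.&lt;/p&gt;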
&lt;p&gt;Sample program is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/decimal-digits-count"&gt;available&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Mask for zero/non-zero bytes</title>
  <link>http://0x80.pl/notesen/2014-03-09-mask-zero-nonzero-bytes.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-09-mask-zero-nonzero-bytes.html</guid>
  <pubDate>Sun, 09 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The description of &lt;a class="reference external" href="http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord"&gt;Determine if a word has a zero byte&lt;/a&gt; from &amp;quot;Bit
Twiddling Hacks&amp;quot; says about &lt;tt class="docutils literal"&gt;haszero(v)&lt;/tt&gt;: &amp;quot;&lt;em&gt;the result is the high
bits set where the bytes in v were zero&lt;/em&gt;&amp;quot;.&lt;/p&gt;
&lt;p&gt;Unfortunately this is not true. The high bits are also set for bytes equal to 1
that lie directly above a zero byte (the borrow from the subtraction propagates
through them), e.g. &lt;tt class="docutils literal"&gt;haszero(0xff010100) = 0x00808080&lt;/tt&gt;. Of course the result
is still valid (non-zero if there was any zero byte), but if we want to
iterate over all zeros or find &lt;strong&gt;the last&lt;/strong&gt; zero index, this could be a
problem.&lt;/p&gt;
&lt;p&gt;It's possible to create an exact mask:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nonzeromask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="c1"&gt;// MSB are set if any of 7 lowest bits are set
&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nonzero_7bit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7f7f7f7f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x7f7f7f7f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nonzero_7bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x80808080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;zeromask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="c1"&gt;// negate MSBs
&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonzeromask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x80808080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Function &lt;tt class="docutils literal"&gt;nonzeromask&lt;/tt&gt; requires four simple instructions, and
&lt;tt class="docutils literal"&gt;zeromask&lt;/tt&gt; one additional xor.&lt;/p&gt;
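&lt;p&gt;The difference can be checked with a small sketch. It repeats &lt;tt class="docutils literal"&gt;nonzeromask&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;zeromask&lt;/tt&gt; from above, writes &lt;tt class="docutils literal"&gt;haszero&lt;/tt&gt; as a function following the Bit Twiddling Hacks formula, and adds a helper &lt;tt class="docutils literal"&gt;last_zero_byte&lt;/tt&gt;, a hypothetical name introduced here only to show the use case where the exact mask matters:&lt;/p&gt;
<![CDATA[
```c
#include <stdint.h>

/* haszero(v) as given in Bit Twiddling Hacks, written as a function. */
uint32_t haszero(uint32_t v) {
    return (v - UINT32_C(0x01010101)) & ~v & UINT32_C(0x80808080);
}

/* MSB of each byte is set if the byte is non-zero. */
uint32_t nonzeromask(uint32_t v) {
    const uint32_t nonzero_7bit = (v & UINT32_C(0x7f7f7f7f)) + UINT32_C(0x7f7f7f7f);
    return (v | nonzero_7bit) & UINT32_C(0x80808080);
}

/* MSB of each byte is set if the byte is zero. */
uint32_t zeromask(uint32_t v) {
    return nonzeromask(v) ^ UINT32_C(0x80808080);
}

/* Hypothetical helper: index of the most significant zero byte
   (0 = least significant byte), or -1 if there is none. */
int last_zero_byte(uint32_t v) {
    const uint32_t mask = zeromask(v);
    for (int j = 3; j >= 0; j--) {
        if (mask & (UINT32_C(0x80) << (8 * j)))
            return j;
    }
    return -1;
}
```
]]>
&lt;p&gt;For &lt;tt class="docutils literal"&gt;0xff010100&lt;/tt&gt; the exact mask marks only the lowest byte, whereas the result of &lt;tt class="docutils literal"&gt;haszero&lt;/tt&gt; would also point at the two bytes equal to 1.&lt;/p&gt;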
  </description>
 </item>
 <item>
  <title>GCC --- asm goto</title>
  <link>http://0x80.pl/notesen/2014-03-09-asmgoto.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-09-asmgoto.html</guid>
  <pubDate>Sun, 09 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Starting with GCC 4.5, the asm statement has a new form: &lt;strong&gt;asm goto&lt;/strong&gt;.
A programmer can use any label from a C/C++ program; however, an output from
this block is not allowed.&lt;/p&gt;
&lt;p&gt;Using an asm block in the old form requires more instructions and additional
storage:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;asm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bt %2, %%eax  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;setc %%al  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;movzx %%al, %%eax &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;=a&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;r&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;cc&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bit_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="k"&gt;goto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_bit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The above code is compiled to:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
80483f6:       0f a3 d8                bt     %ebx,%eax
80483f9:       0f 92 c0                setb   %al
80483fc:       0f b6 c0                movzbl %al,%eax
80483ff:       85 c0                   test   %eax,%eax
8048401:       74 16                   je     8048419
&lt;/pre&gt;
&lt;p&gt;With asm goto the same task can be written in a shorter and simpler way:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;asm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;goto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bt %1, %0 &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;jc %l[has_bit] &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;

       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cm"&gt;/* no output */&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;r&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;r&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;cc&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has_bit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// name of label
&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Complete demo is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/gcc-asmgoto"&gt;available&lt;/a&gt;. See also GCC documentation about &lt;a class="reference external" href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html"&gt;Extended
Asm&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Slow-paths in GNU libc strstr</title>
  <link>http://0x80.pl/notesen/2014-03-03-slow-strstr.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-03-03-slow-strstr.html</guid>
  <pubDate>Mon, 03 Mar 2014 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;I've observed that some patterns passed to &lt;tt class="docutils literal"&gt;strstr&lt;/tt&gt; cause a significant
slowdown.&lt;/p&gt;
&lt;p&gt;Sample program &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/kill-gnulib-strstr"&gt;kill-strstr.c&lt;/a&gt; executes &lt;tt class="docutils literal"&gt;strstr(data, pattern)&lt;/tt&gt;,
where &lt;tt class="docutils literal"&gt;data&lt;/tt&gt; is a large string (16MB) filled with the character '?'. Patterns
are read from the command line.&lt;/p&gt;
&lt;p&gt;On my machine the following times were recorded:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
1. searching string 'johndoe'...
                time: 0.032
2. searching string '??????????????????a'...
                time: 0.050
3. searching string '??????????????????????????????a'...
                time: 0.049
4. searching string '???????????????????????????????a'...
                time: 0.274
5. searching string '??????????????????????????????a?'...
                time: 0.356
6. searching string '??????????????????????????????a??????????????????????????????'...
                time: 0.396
&lt;/pre&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;A slowdown is visible in case 4 (more than 5 times slower than pattern 3).
The pattern has 32 characters, all of them '?' except the last one.&lt;/li&gt;
&lt;li&gt;An even bigger slowdown occurs in case 5 (7 times slower). This pattern
also contains 32 chars, but the single letter 'a' is in the last-but-one
position.&lt;/li&gt;
&lt;li&gt;A similar slowdown occurs in case 6 (about 8 times slower). In this
pattern the single letter 'a' is surrounded by thirty '?'.&lt;/li&gt;
&lt;/ol&gt;
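&lt;p&gt;A rough way to repeat such a measurement is sketched below. This is not the linked &lt;tt class="docutils literal"&gt;kill-strstr.c&lt;/tt&gt;; the function name &lt;tt class="docutils literal"&gt;time_strstr&lt;/tt&gt; and its parameters are assumptions made for illustration, and timing with &lt;tt class="docutils literal"&gt;clock()&lt;/tt&gt; is only one of several possible choices. Results will depend on the libc version and the machine:&lt;/p&gt;
<![CDATA[
```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: fills a buffer of `size` bytes with '?', runs
   strstr(data, pattern) once and returns the CPU time in seconds.
   Returns -1.0 if the buffer cannot be allocated.
   *found is set to 1 if the pattern was found, 0 otherwise. */
double time_strstr(size_t size, const char *pattern, int *found) {
    char *data = malloc(size + 1);
    if (data == NULL)
        return -1.0;

    memset(data, '?', size);
    data[size] = '\0';

    const clock_t start = clock();
    const char *hit = strstr(data, pattern);
    const clock_t stop = clock();

    *found = (hit != NULL);
    free(data);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```
]]>
&lt;p&gt;Calling it with a size of 16 MB and the patterns listed above should give numbers comparable in spirit to the table, although the exact ratios are not guaranteed to reproduce.&lt;/p&gt;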
  </description>
 </item>
 <item>
  <title>Penalties of errors in SSE floating point calculations</title>
  <link>http://0x80.pl/notesen/2014-01-26-sse-penalties-of-errors.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-01-26-sse-penalties-of-errors.html</guid>
  <pubDate>Sun, 26 Jan 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;SSE provides a little-known control register called &lt;strong&gt;MXCSR&lt;/strong&gt;. This
register plays three roles:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Controls calculations:&lt;ol class="loweralpha"&gt;
&lt;li&gt;flag &lt;strong&gt;flush to zero&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;flag &lt;strong&gt;denormals are zeros&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;rounding mode (not covered in this text).&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Allows masking/unmasking of floating-point exceptions.&lt;/li&gt;
&lt;li&gt;Stores information about floating-point errors &amp;mdash; these flags are
sticky, i.e. a programmer is responsible for clearing them.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Possible errors in SSE floating point calculations are:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;division by zero,&lt;/li&gt;
&lt;li&gt;underflow,&lt;/li&gt;
&lt;li&gt;overflow,&lt;/li&gt;
&lt;li&gt;operations on denormalized values,&lt;/li&gt;
&lt;li&gt;invalid operations, like the square root of a negative number or division of zero by zero.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>x86 - ISA where 80% of instructions are unimportant</title>
  <link>http://0x80.pl/notesen/2014-01-01-instruction-utilization.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2014-01-01-instruction-utilization.html</guid>
  <pubDate>Wed, 01 Jan 2014 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="detailed-results"&gt;
&lt;h1&gt;Detailed results&lt;/h1&gt;
&lt;p&gt;Whole table as &lt;a class="reference external" href="https://github.com/WojciechMula/toys/blob/master/count_instructions/debian-386.txt"&gt;txt file&lt;/a&gt;.&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;instruction&lt;/th&gt;
&lt;th class="head"&gt;count&lt;/th&gt;
&lt;th class="head"&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;mov&lt;/td&gt;
&lt;td&gt;5934098&lt;/td&gt;
&lt;td&gt;37.63%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;call&lt;/td&gt;
&lt;td&gt;1414355&lt;/td&gt;
&lt;td&gt;8.97%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;lea&lt;/td&gt;
&lt;td&gt;1071501&lt;/td&gt;
&lt;td&gt;6.79%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movl&lt;/td&gt;
&lt;td&gt;760677&lt;/td&gt;
&lt;td&gt;4.82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;push&lt;/td&gt;
&lt;td&gt;655921&lt;/td&gt;
&lt;td&gt;4.16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jmp&lt;/td&gt;
&lt;td&gt;611540&lt;/td&gt;
&lt;td&gt;3.88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;add&lt;/td&gt;
&lt;td&gt;560517&lt;/td&gt;
&lt;td&gt;3.55%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;je&lt;/td&gt;
&lt;td&gt;490250&lt;/td&gt;
&lt;td&gt;3.11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;test&lt;/td&gt;
&lt;td&gt;475899&lt;/td&gt;
&lt;td&gt;3.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;pop&lt;/td&gt;
&lt;td&gt;441608&lt;/td&gt;
&lt;td&gt;2.80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sub&lt;/td&gt;
&lt;td&gt;366228&lt;/td&gt;
&lt;td&gt;2.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;cmp&lt;/td&gt;
&lt;td&gt;326379&lt;/td&gt;
&lt;td&gt;2.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jne&lt;/td&gt;
&lt;td&gt;264110&lt;/td&gt;
&lt;td&gt;1.67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;nop&lt;/td&gt;
&lt;td&gt;242356&lt;/td&gt;
&lt;td&gt;1.54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ret&lt;/td&gt;
&lt;td&gt;238569&lt;/td&gt;
&lt;td&gt;1.51%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;xor&lt;/td&gt;
&lt;td&gt;148194&lt;/td&gt;
&lt;td&gt;0.94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movzbl&lt;/td&gt;
&lt;td&gt;122730&lt;/td&gt;
&lt;td&gt;0.78%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;and&lt;/td&gt;
&lt;td&gt;88863&lt;/td&gt;
&lt;td&gt;0.56%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;xchg&lt;/td&gt;
&lt;td&gt;66885&lt;/td&gt;
&lt;td&gt;0.42%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;cmpl&lt;/td&gt;
&lt;td&gt;64907&lt;/td&gt;
&lt;td&gt;0.41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movzwl&lt;/td&gt;
&lt;td&gt;64589&lt;/td&gt;
&lt;td&gt;0.41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movb&lt;/td&gt;
&lt;td&gt;57247&lt;/td&gt;
&lt;td&gt;0.36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;or&lt;/td&gt;
&lt;td&gt;52138&lt;/td&gt;
&lt;td&gt;0.33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;shl&lt;/td&gt;
&lt;td&gt;50908&lt;/td&gt;
&lt;td&gt;0.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;cmpb&lt;/td&gt;
&lt;td&gt;50152&lt;/td&gt;
&lt;td&gt;0.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jle&lt;/td&gt;
&lt;td&gt;41083&lt;/td&gt;
&lt;td&gt;0.26%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;leave&lt;/td&gt;
&lt;td&gt;39923&lt;/td&gt;
&lt;td&gt;0.25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fldl&lt;/td&gt;
&lt;td&gt;37428&lt;/td&gt;
&lt;td&gt;0.24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fstpl&lt;/td&gt;
&lt;td&gt;37368&lt;/td&gt;
&lt;td&gt;0.24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;shr&lt;/td&gt;
&lt;td&gt;36503&lt;/td&gt;
&lt;td&gt;0.23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jbe&lt;/td&gt;
&lt;td&gt;32866&lt;/td&gt;
&lt;td&gt;0.21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ja&lt;/td&gt;
&lt;td&gt;32333&lt;/td&gt;
&lt;td&gt;0.21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sar&lt;/td&gt;
&lt;td&gt;30917&lt;/td&gt;
&lt;td&gt;0.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;flds&lt;/td&gt;
&lt;td&gt;29672&lt;/td&gt;
&lt;td&gt;0.19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;subl&lt;/td&gt;
&lt;td&gt;27636&lt;/td&gt;
&lt;td&gt;0.18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;setne&lt;/td&gt;
&lt;td&gt;27626&lt;/td&gt;
&lt;td&gt;0.18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;testb&lt;/td&gt;
&lt;td&gt;27420&lt;/td&gt;
&lt;td&gt;0.17%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;addl&lt;/td&gt;
&lt;td&gt;25906&lt;/td&gt;
&lt;td&gt;0.16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;imul&lt;/td&gt;
&lt;td&gt;25569&lt;/td&gt;
&lt;td&gt;0.16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jg&lt;/td&gt;
&lt;td&gt;24796&lt;/td&gt;
&lt;td&gt;0.16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fstp&lt;/td&gt;
&lt;td&gt;24349&lt;/td&gt;
&lt;td&gt;0.15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fxch&lt;/td&gt;
&lt;td&gt;23464&lt;/td&gt;
&lt;td&gt;0.15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;js&lt;/td&gt;
&lt;td&gt;21550&lt;/td&gt;
&lt;td&gt;0.14%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fstps&lt;/td&gt;
&lt;td&gt;21248&lt;/td&gt;
&lt;td&gt;0.13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sbb&lt;/td&gt;
&lt;td&gt;16607&lt;/td&gt;
&lt;td&gt;0.11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;inc&lt;/td&gt;
&lt;td&gt;16200&lt;/td&gt;
&lt;td&gt;0.10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;lock&lt;/td&gt;
&lt;td&gt;16049&lt;/td&gt;
&lt;td&gt;0.10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jae&lt;/td&gt;
&lt;td&gt;14825&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sahf&lt;/td&gt;
&lt;td&gt;14765&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;dec&lt;/td&gt;
&lt;td&gt;14276&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fnstsw&lt;/td&gt;
&lt;td&gt;14026&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;sete&lt;/td&gt;
&lt;td&gt;13902&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movw&lt;/td&gt;
&lt;td&gt;13895&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;adc&lt;/td&gt;
&lt;td&gt;13640&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jb&lt;/td&gt;
&lt;td&gt;12467&lt;/td&gt;
&lt;td&gt;0.08%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jl&lt;/td&gt;
&lt;td&gt;11700&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;repz&lt;/td&gt;
&lt;td&gt;11178&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fldcw&lt;/td&gt;
&lt;td&gt;11110&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jge&lt;/td&gt;
&lt;td&gt;11019&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movswl&lt;/td&gt;
&lt;td&gt;10816&lt;/td&gt;
&lt;td&gt;0.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fildl&lt;/td&gt;
&lt;td&gt;8852&lt;/td&gt;
&lt;td&gt;0.06%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;cmpw&lt;/td&gt;
&lt;td&gt;7601&lt;/td&gt;
&lt;td&gt;0.05%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jns&lt;/td&gt;
&lt;td&gt;7490&lt;/td&gt;
&lt;td&gt;0.05%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fldz&lt;/td&gt;
&lt;td&gt;7331&lt;/td&gt;
&lt;td&gt;0.05%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fmul&lt;/td&gt;
&lt;td&gt;7229&lt;/td&gt;
&lt;td&gt;0.05%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;out&lt;/td&gt;
&lt;td&gt;7203&lt;/td&gt;
&lt;td&gt;0.05%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;not&lt;/td&gt;
&lt;td&gt;7028&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movsbl&lt;/td&gt;
&lt;td&gt;6720&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;in&lt;/td&gt;
&lt;td&gt;6503&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fld&lt;/td&gt;
&lt;td&gt;6309&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;faddp&lt;/td&gt;
&lt;td&gt;6254&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fstl&lt;/td&gt;
&lt;td&gt;5760&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fucom&lt;/td&gt;
&lt;td&gt;5753&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;neg&lt;/td&gt;
&lt;td&gt;5725&lt;/td&gt;
&lt;td&gt;0.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fucompp&lt;/td&gt;
&lt;td&gt;5354&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;rep&lt;/td&gt;
&lt;td&gt;5059&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fmuls&lt;/td&gt;
&lt;td&gt;5039&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;pushl&lt;/td&gt;
&lt;td&gt;4430&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jp&lt;/td&gt;
&lt;td&gt;4424&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fnstcw&lt;/td&gt;
&lt;td&gt;4400&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fld1&lt;/td&gt;
&lt;td&gt;4176&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fmulp&lt;/td&gt;
&lt;td&gt;4133&lt;/td&gt;
&lt;td&gt;0.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;orl&lt;/td&gt;
&lt;td&gt;3927&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fadds&lt;/td&gt;
&lt;td&gt;3789&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;movq&lt;/td&gt;
&lt;td&gt;3779&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fistpl&lt;/td&gt;
&lt;td&gt;3709&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;cltd&lt;/td&gt;
&lt;td&gt;3597&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fmull&lt;/td&gt;
&lt;td&gt;3313&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;stos&lt;/td&gt;
&lt;td&gt;3298&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;lret&lt;/td&gt;
&lt;td&gt;3183&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;scas&lt;/td&gt;
&lt;td&gt;3103&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;lods&lt;/td&gt;
&lt;td&gt;3066&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;cwtl&lt;/td&gt;
&lt;td&gt;3064&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fadd&lt;/td&gt;
&lt;td&gt;2852&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fucomp&lt;/td&gt;
&lt;td&gt;2678&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;orb&lt;/td&gt;
&lt;td&gt;2481&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;fildll&lt;/td&gt;
&lt;td&gt;2418&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;andl&lt;/td&gt;
&lt;td&gt;2379&lt;/td&gt;
&lt;td&gt;0.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;setb&lt;/td&gt;
&lt;td&gt;2337&lt;/td&gt;
&lt;td&gt;0.01%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;andb&lt;/td&gt;
&lt;td&gt;2263&lt;/td&gt;
&lt;td&gt;0.01%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;552 rows more...&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;td&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>I accidentally created an infinite loop</title>
  <link>http://0x80.pl/notesen/2013-12-30-infinite-loop.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-12-30-infinite-loop.html</guid>
  <pubDate>Mon, 30 Dec 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;I needed to iterate through all values of a 32-bit unsigned integer, so I
wrote:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UINT32_MAX&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="c1"&gt;// whatever
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;Is it OK? No, because the value of &lt;tt class="docutils literal"&gt;uint32_t&lt;/tt&gt; will never exceed
&lt;tt class="docutils literal"&gt;UINT32_MAX = 0xffffffff&lt;/tt&gt;, so the loop condition is always true. Of course we can use larger types, like
&lt;tt class="docutils literal"&gt;uint64_t&lt;/tt&gt;, but on 32-bit machines this requires some additional
instructions. For example, gcc 4.7 compiled the following code:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;loop1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="p"&gt;)())&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UINT32_MAX&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;to:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
00000000 :
   0:   57                      push   %edi
   1:   bf 01 00 00 00          mov    $0x1,%edi
   6:   56                      push   %esi
   7:   31 f6                   xor    %esi,%esi
   9:   53                      push   %ebx
   a:   8b 5c 24 10             mov    0x10(%esp),%ebx
   e:   66 90                   xchg   %ax,%ax
  10:   ff d3                   call   *%ebx
  12:   83 c6 ff                add    $0xffffffff,%esi
  15:   83 d7 ff                adc    $0xffffffff,%edi
  18:   89 f8                   mov    %edi,%eax
  1a:   09 f0                   or     %esi,%eax
  1c:   75 f2                   jne    10
  1e:   5b                      pop    %ebx
  1f:   5e                      pop    %esi
  20:   5f                      pop    %edi
  21:   c3                      ret
&lt;/pre&gt;
&lt;p&gt;TBH, I have no idea why such a weird sequence has been generated
(&lt;tt class="docutils literal"&gt;add&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;adc&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;or&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;jne&lt;/tt&gt;). The simplest portable solution
is to detect the wrap-around of the 32-bit value after the increment:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="c1"&gt;// loop body
&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// wrap-around
&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;In assembly it's even simpler, because the CPU sets the carry flag:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
         xor %eax, %eax
loop:
         ; loop body

         add $1, %eax
         jnc loop
&lt;/pre&gt;
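&lt;p&gt;The wrap-around test can also be written as a &lt;tt class="docutils literal"&gt;do ... while&lt;/tt&gt; loop. The sketch below is only an illustration: it uses a 16-bit counter so that the whole range can be checked quickly, and the helper name &lt;tt class="docutils literal"&gt;count_all_u16&lt;/tt&gt; is made up for this example.&lt;/p&gt;
<![CDATA[
```c
#include <stdint.h>

/* Visits every value of a 16-bit unsigned counter exactly once and
   returns the number of iterations that were performed. */
uint32_t count_all_u16(void) {
    uint32_t iterations = 0;
    uint16_t i = 0;

    do {
        iterations += 1;    /* stands in for the loop body */
        i += 1;             /* unsigned result is reduced modulo 2^16 */
    } while (i != 0);       /* zero after the increment means wrap-around */

    return iterations;
}
```
]]>
&lt;p&gt;Replacing &lt;tt class="docutils literal"&gt;uint16_t&lt;/tt&gt; with &lt;tt class="docutils literal"&gt;uint32_t&lt;/tt&gt; gives the loop over all 32-bit values.&lt;/p&gt;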
  </description>
 </item>
 <item>
  <title>Calculate floor value without FPU/SSE instruction</title>
  <link>http://0x80.pl/notesen/2013-12-29-calculate-floor-without-fpu.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-12-29-calculate-floor-without-fpu.html</guid>
  <pubDate>Sun, 29 Dec 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The presented algorithm works for any normalized floating-point
value; the examples are given for double-precision (64-bit) numbers.&lt;/p&gt;
&lt;p&gt;The layout of value 8192.625:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt; S  exp + bias  fraction
┌─┬───────────┬────────────────────────────────────────────────────┐
│0│10000001100│0000000000000101000000000000000000000000000000000000│
└─┴───────────┴────────────────────────────────────────────────────┘
63 62       52 51                                                  0&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The exponent value is 13, thus the bits after the decimal dot occupy the lowest 52 - 13 = 39 bits of the fraction (bits 0 .. 38):&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌─┬───────────┬────────────────────────────────────────────────────┐
│0│10000001100│0000000000000&lt;span style="color: blue; font-weight: bold"&gt;101000000000000000000000000000000000000&lt;/span&gt;│
└─┴───────────┴────────────────────────────────────────────────────┘
                            │                                     │
                            ╰─────╴ bits after decimal dot ╶──────╯&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To calculate the floor of the value, the bits after the decimal dot have to be cleared.
This operation doesn't alter the exponent, so only a single bit-and is
required, and no extra calculations are needed.&lt;/p&gt;
&lt;p&gt;To be precise, the value of &lt;tt class="docutils literal"&gt;dot_position := 52 - exponent&lt;/tt&gt; decides what
has to be done:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;If &lt;tt class="docutils literal"&gt;dot_position &amp;gt; 52&lt;/tt&gt; then the value is less than 1.0, i.e. &lt;tt class="docutils literal"&gt;floor(x) = 0&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;If &lt;tt class="docutils literal"&gt;dot_position &amp;lt;= 0&lt;/tt&gt; then the value has no fractional part (it is
at least &lt;span class="math"&gt;2&lt;sup&gt;52&lt;/sup&gt;&lt;/span&gt;), i.e. &lt;tt class="docutils literal"&gt;floor(x) = x&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;If &lt;tt class="docutils literal"&gt;0 &amp;lt; dot_position &amp;lt;= 52&lt;/tt&gt; then the value has a fractional part and
the lowest &lt;tt class="docutils literal"&gt;dot_position&lt;/tt&gt; bits have to be cleared.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The number of operations:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Extracting exponent value requires: 1 shift, 1 bit-and, 1 subtract.&lt;/li&gt;
&lt;li&gt;Determining which case has to be selected requires up to 4
comparisons.&lt;/li&gt;
&lt;li&gt;When clearing bits is needed, building the bit-mask requires: 1 shift,
1 subtract, 1 bit negation, 1 bit-and.&lt;/li&gt;
&lt;/ul&gt;
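&lt;p&gt;The three cases and the mask construction can be sketched in C as below. This is only an illustration, not the sample program: the name &lt;tt class="docutils literal"&gt;floor_bits&lt;/tt&gt; is made up, the input is assumed to be a non-negative IEEE 754 double (for negative inputs clearing the bits rounds towards zero), and &lt;tt class="docutils literal"&gt;memcpy&lt;/tt&gt; is used only to access the raw bits.&lt;/p&gt;
<![CDATA[
```c
#include <stdint.h>
#include <string.h>

/* Floor of a non-negative double computed with integer operations only. */
double floor_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* raw 64-bit pattern */

    /* 1 shift, 1 bit-and, 1 subtract: unbiased exponent */
    const int exponent = (int)((bits >> 52) & 0x7ff) - 1023;
    const int dot_position = 52 - exponent;

    if (dot_position > 52) {
        return 0.0;                          /* value below 1.0 */
    }
    if (dot_position <= 0) {
        return x;                            /* no fraction bits stored */
    }

    /* 1 shift, 1 subtract, 1 negation, 1 bit-and: clear the low bits */
    const uint64_t mask = ~((UINT64_C(1) << dot_position) - 1);
    bits &= mask;

    memcpy(&x, &bits, sizeof x);
    return x;
}
```
]]>
&lt;p&gt;For the value from the diagram, &lt;tt class="docutils literal"&gt;floor_bits(8192.625)&lt;/tt&gt; clears the lowest 39 bits and yields 8192.0.&lt;/p&gt;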
&lt;p&gt;Sample program is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/floor"&gt;available&lt;/a&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ ./demo 123.75 0.012 120000000000000 0.99999999 99.999
floor(123.75000000) = 123.00000000
floor(0.01200000) = 0.00000000
floor(120000000000000.00000000) = 120000000000000.00000000
floor(0.99999999) = 0.00000000
floor(99.99900000) = 99.00000000
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>Convert float to int without FPU/SSE</title>
  <link>http://0x80.pl/notesen/2013-12-27-convert-float-to-integer.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-12-27-convert-float-to-integer.html</guid>
  <pubDate>Fri, 27 Dec 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;This short article shows how a normalized floating-point value can be
safely converted to an integer value without the assistance of FPU/SSE. Only
basic bit and arithmetic operations are used. In the worst case the following
operations are performed:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;6 comparisons,&lt;/li&gt;
&lt;li&gt;2 subtracts,&lt;/li&gt;
&lt;li&gt;1 and,&lt;/li&gt;
&lt;li&gt;1 or,&lt;/li&gt;
&lt;li&gt;2 shifts (one with variable, one with constant amount).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The floating-point value is calculated as: &lt;span class="math"&gt;( &amp;minus; 1)&lt;sup&gt;&lt;i&gt;sign&lt;/i&gt;&lt;/sup&gt; &amp;sdot; (1 + &lt;i&gt;fraction&lt;/i&gt;) &amp;sdot; 2&lt;sup&gt;&lt;i&gt;exponent&lt;/i&gt;&lt;/sup&gt;&lt;/span&gt;.  The fraction part is in the range [0, 1). For 32-bit
values &lt;strong&gt;sign&lt;/strong&gt; has 1 bit, &lt;strong&gt;exponent&lt;/strong&gt; has 8 bits, &lt;strong&gt;fraction&lt;/strong&gt; has
23 bits, and &lt;strong&gt;bias&lt;/strong&gt; has the value 127; the sum &lt;strong&gt;exponent + bias&lt;/strong&gt; is stored as
an unsigned number.&lt;/p&gt;
&lt;p&gt;The layout of binary word:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;┌─┬────────┬───────────────────────┐
│S│exp+bias│        fraction       │
└─┴────────┴───────────────────────┘
31 30    23 22                     0&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Let's clear the fields &lt;strong&gt;exponent + bias&lt;/strong&gt; and &lt;strong&gt;sign&lt;/strong&gt;, and restore the implicit integer 1 at bit 23 (the 24th bit):&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌─┬────────┬───────────────────────┐
│0│0000000&lt;span style="font-weight: bold; color: blue"&gt;1&lt;/span&gt;│&lt;span style="font-weight: bold; color: blue"&gt;XXXXXXXXXXXXXXXXXXXXXXX&lt;/span&gt;│
└─┴────────┴───────────────────────┘
31 30    23 22                     0&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The value of such a 32-bit word treated as an unsigned integer is &lt;span class="math"&gt;(1 + &lt;i&gt;fraction&lt;/i&gt;) &amp;sdot; 2&lt;sup&gt;23&lt;/sup&gt;&lt;/span&gt;. To calculate the result this word has to be shifted left
or right depending on the value and sign of &lt;tt class="docutils literal"&gt;shift := exponent - 23&lt;/tt&gt;;
only a few cases have to be considered:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;If &lt;strong&gt;shift&lt;/strong&gt; is negative, then the word must be shifted right. The number of
significant bits is 24, so if &lt;span class="math"&gt;&lt;i&gt;shift&lt;/i&gt; &amp;le;  &amp;minus; 24&lt;/span&gt; the result is always zero.&lt;/li&gt;
&lt;li&gt;If &lt;strong&gt;shift&lt;/strong&gt; is positive, then the word must be shifted left. Since the
destination is a 32-bit signed value, the maximum shift is 31 - 24 = 7 bits;
when the shift is greater than 7, an overflow occurs.&lt;/li&gt;
&lt;li&gt;If &lt;span class="math"&gt; &amp;minus; 24 &amp;lt; &lt;i&gt;shift&lt;/i&gt; &amp;lt; 7&lt;/span&gt; then the number can be safely shifted. When
&lt;span class="math"&gt;&lt;i&gt;shift&lt;/i&gt; = 7&lt;/span&gt;, the result has exactly 31 significant bits, thus a range
check is required: for positive numbers (sign = 0) the maximum value is
&lt;span class="math"&gt;2&lt;sup&gt;31&lt;/sup&gt; &amp;minus; 1&lt;/span&gt; and for negative numbers the maximum magnitude is &lt;span class="math"&gt;2&lt;sup&gt;31&lt;/sup&gt;&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
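&lt;p&gt;The cases above can be sketched in C as follows. This is a simplified illustration rather than the sample program: the name &lt;tt class="docutils literal"&gt;float_to_int32&lt;/tt&gt; and its error reporting through a &lt;tt class="docutils literal"&gt;bool&lt;/tt&gt; are assumptions, the result is truncated towards zero, and every shift above 7 is reported as an overflow (which conservatively also rejects &lt;span class="math"&gt; &amp;minus; 2&lt;sup&gt;31&lt;/sup&gt;&lt;/span&gt;). A 24-bit word shifted left by at most 7 bits stays below &lt;span class="math"&gt;2&lt;sup&gt;31&lt;/sup&gt;&lt;/span&gt;, so the sketch needs no further range check.&lt;/p&gt;
<![CDATA[
```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Converts a 32-bit float to int32_t using only integer operations.
   Returns false when the value does not fit (also for NaN and infinity). */
bool float_to_int32(float x, int32_t *out) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                  /* raw 32-bit pattern */

    const bool negative = (bits >> 31) != 0;
    const int exponent = (int)((bits >> 23) & 0xff) - 127;
    const int shift = exponent - 23;

    /* clear sign and exponent fields, restore the implicit 1 at bit 23 */
    uint32_t word = (bits & 0x007fffff) | 0x00800000;

    if (shift <= -24) {          /* all 24 significant bits are shifted out */
        *out = 0;
        return true;
    }
    if (shift > 7) {             /* magnitude is 2^31 or more */
        return false;
    }

    word = (shift < 0) ? (word >> -shift) : (word << shift);
    *out = negative ? -(int32_t)word : (int32_t)word;
    return true;
}
```
]]>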
&lt;p&gt;Sample program is &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/float2int"&gt;available&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>fopen a directory</title>
  <link>http://0x80.pl/notesen/2013-12-25-fopen-directory.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-12-25-fopen-directory.html</guid>
  <pubDate>Wed, 25 Dec 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;It's not clear how the function &lt;tt class="docutils literal"&gt;fopen&lt;/tt&gt; should behave when applied to a directory;
the manual pages don't say anything about this. So our common sense
fails, at least when we use the standard library shipped with GCC, because
&lt;tt class="docutils literal"&gt;fopen&lt;/tt&gt; returns a valid handle.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://stackoverflow.com/questions/18192998/plain-c-opening-a-directory-with-fopen"&gt;A discussion on stackoverflow&lt;/a&gt; pointed out that &lt;tt class="docutils literal"&gt;fseek&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;ftell&lt;/tt&gt;
would fail. But on my system that's not true: &lt;tt class="docutils literal"&gt;fseek(f, 0, SEEK_END)&lt;/tt&gt; succeeds and a following &lt;tt class="docutils literal"&gt;ftell(f)&lt;/tt&gt;
returns the size of the opened directory.&lt;/p&gt;
&lt;p&gt;Only when we try to read data using &lt;tt class="docutils literal"&gt;fread&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;fgetc&lt;/tt&gt; is the errno variable
set to the &lt;strong&gt;EISDIR&lt;/strong&gt; error code.&lt;/p&gt;
&lt;p&gt;Here is the output from a simple &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/fopen_directory"&gt;test program&lt;/a&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ ./a.out ~
testing '/home/wojtek'...
fopen: Success [errno=0]
fseek: Success [errno=0]
fseek result: 0
ftell: Success [errno=0]
ftell result: 24576
feof: Success [errno=0]
feof result: 0 (EOF=no)
fgetc: Is a directory [errno=21]
fgetc result: -1 (EOF=yes)
fread: Is a directory [errno=21]
fread result: 0
&lt;/pre&gt;
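&lt;p&gt;A program that wants to detect this situation can check errno after the first read. The helper below is a hypothetical sketch (the name &lt;tt class="docutils literal"&gt;errno_after_read&lt;/tt&gt; is not part of the test program), and its result depends on the C library: on a system behaving like the one above it would return &lt;strong&gt;EISDIR&lt;/strong&gt; for a directory.&lt;/p&gt;
<![CDATA[
```c
#include <errno.h>
#include <stdio.h>

/* Opens path for reading and tries to read one byte.
   Returns -1 when fopen fails, 0 when the read succeeded or hit a
   plain end of file, and otherwise the errno value left by fgetc. */
int errno_after_read(const char *path) {
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }

    errno = 0;                          /* fgetc does not reset errno itself */
    const int c = fgetc(f);
    const int err = (c == EOF) ? errno : 0;

    fclose(f);
    return err;
}
```
]]>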
  </description>
 </item>
 <item>
  <title>x86 extensions are useless</title>
  <link>http://0x80.pl/notesen/2013-12-12-instructions-usage.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-12-12-instructions-usage.html</guid>
  <pubDate>Thu, 12 Dec 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Intel announced a new &lt;a class="reference external" href="http://software.intel.com/en-us/articles/intel-sha-extensions"&gt;extension to SSE&lt;/a&gt;: instructions accelerating
the calculation of SHA-1 and SHA-256 hashes. As with everything else added
recently to the x86 ISA, these new instructions address special cases of
&amp;quot;something&amp;quot;. The number of instructions, encoding modes, etc. is increasing,
but they do not help in general.&lt;/p&gt;
&lt;p&gt;Let's see what &lt;tt class="docutils literal"&gt;sha1msg1 xmm1, xmm2&lt;/tt&gt; does (the arguments are packed dwords):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
result[0] := xmm1[0] xor xmm1[2]
result[1] := xmm1[1] xor xmm1[3]
result[2] := xmm1[2] xor xmm2[0]
result[3] := xmm1[3] xor xmm2[1]
&lt;/pre&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The logical operation &amp;quot;xor&amp;quot; is hardcoded. Why can't we use &amp;quot;or&amp;quot;, &amp;quot;and&amp;quot;,
&amp;quot;not and&amp;quot;? These operations are already present in the ISA.&lt;/li&gt;
&lt;li&gt;Indices to &lt;tt class="docutils literal"&gt;xmm1&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;xmm2&lt;/tt&gt; are hardcoded too. The instruction
&lt;tt class="docutils literal"&gt;pshufd&lt;/tt&gt; accepts an immediate argument (1 byte) to select a permutation;
why couldn't &lt;tt class="docutils literal"&gt;sha1msg1&lt;/tt&gt; be fed with 2 bytes allowing a programmer to select
any permutation of the arguments?&lt;/li&gt;
&lt;li&gt;Sources of operands are also hardcoded. Why not use another immediate (1 byte)
to select sources, for example &lt;tt class="docutils literal"&gt;00b = xmm1/xmm1&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;01b = xmm1/xmm2&lt;/tt&gt;,
&lt;tt class="docutils literal"&gt;10b = xmm2/xmm1&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;11b = xmm2/xmm2&lt;/tt&gt;?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Such a generic instruction would be written as &lt;tt class="docutils literal"&gt;generic_op xmm1, xmm2,
imm_1, imm_2, imm_3&lt;/tt&gt; and would execute the following algorithm:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
for i := 0 to 3 do
        arg1_index := imm_1[2*i:2*i + 1]
        arg2_index := imm_2[2*i:2*i + 1]

        if imm_3[2*i + 1] = 1 then
                arg1 := xmm2
        else
                arg1 := xmm1
        end if

        if imm_3[2*i] = 1 then
                arg2 := xmm2
        else
                arg2 := xmm1
        end if

        result[i] := arg1[arg1_index] op arg2[arg2_index]
end for
&lt;/pre&gt;
&lt;p&gt;Then &lt;tt class="docutils literal"&gt;sha1msg1&lt;/tt&gt; is just a special case:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
generic_xor xmm1, xmm2, 0b11100100, 0b01001110, 0b01010000
&lt;/pre&gt;
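&lt;p&gt;The immediates can be checked with a small software model. The function &lt;tt class="docutils literal"&gt;generic_xor&lt;/tt&gt; below is a hypothetical C model of the proposed pseudo-instruction with the operation fixed to xor, not real hardware behaviour; it assumes that each 2-bit field of &lt;tt class="docutils literal"&gt;imm_3&lt;/tt&gt; uses the source codes listed above (&lt;tt class="docutils literal"&gt;00b = xmm1/xmm1&lt;/tt&gt; up to &lt;tt class="docutils literal"&gt;11b = xmm2/xmm2&lt;/tt&gt;).&lt;/p&gt;
<![CDATA[
```c
#include <stdint.h>

/* Software model of the proposed generic instruction, with op = xor.
   Each immediate holds four 2-bit fields, field i at bits 2*i .. 2*i+1. */
void generic_xor(const uint32_t xmm1[4], const uint32_t xmm2[4],
                 uint8_t imm_1, uint8_t imm_2, uint8_t imm_3,
                 uint32_t result[4]) {
    for (int i = 0; i < 4; i++) {
        const int arg1_index = (imm_1 >> (2 * i)) & 3;
        const int arg2_index = (imm_2 >> (2 * i)) & 3;
        const int sources    = (imm_3 >> (2 * i)) & 3;

        /* high bit of the field picks the first source, low bit the second */
        const uint32_t *arg1 = (sources & 2) ? xmm2 : xmm1;
        const uint32_t *arg2 = (sources & 1) ? xmm2 : xmm1;

        result[i] = arg1[arg1_index] ^ arg2[arg2_index];
    }
}
```
]]>
&lt;p&gt;Calling it with the immediates &lt;tt class="docutils literal"&gt;0xE4&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;0x4E&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;0x50&lt;/tt&gt; (the three binary constants above) reproduces the four xor equations given for &lt;tt class="docutils literal"&gt;sha1msg1&lt;/tt&gt;.&lt;/p&gt;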
&lt;p&gt;Maybe this example is &amp;quot;too generic&amp;quot;, too complex, and would be hard to
implement in hardware. I just wanted to show that we get shiny new
instructions that are useful in only a few cases. Compilers can vectorize loops and make
use of SSE, but SHA is used in drivers and the OS, and is encapsulated in
libraries, so &lt;tt class="docutils literal"&gt;sha1msg1&lt;/tt&gt; and friends will never appear in ordinary
programs.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Problems with PDO for PostgreSQL on 32-bit machines</title>
  <link>http://0x80.pl/notesen/2013-12-07-postgresql-pdo.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-12-07-postgresql-pdo.html</guid>
  <pubDate>Sat, 07 Dec 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The size of an integer in PHP &lt;a class="reference external" href="http://www.php.net/manual/en/language.types.integer.php"&gt;depends on the machine&lt;/a&gt;: it has 32 bits on 32-bit
architectures and 64 bits on 64-bit architectures.&lt;/p&gt;
&lt;p&gt;On 32-bit machines &lt;a class="reference external" href="http://php.net/manual/en/ref.pdo-pgsql.php"&gt;PDO for PostgreSQL&lt;/a&gt; always converts &lt;strong&gt;bigint&lt;/strong&gt;
numbers returned by a server to strings. It never casts them to integers, even if
the bigint value would fit in a 32-bit signed integer.&lt;/p&gt;
&lt;p&gt;The bigint type is returned, for example, by the &lt;tt class="docutils literal"&gt;COUNT&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;SUM&lt;/tt&gt; functions.&lt;/p&gt;
&lt;p&gt;On 64-bit machines there is no such problem, because a PHP integer has the
same size as bigint.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Encoding array of unsigned integers</title>
  <link>http://0x80.pl/notesen/2013-11-23-integer-sequence-encoding.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-11-23-integer-sequence-encoding.html</guid>
  <pubDate>Sat, 23 Nov 2013 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;Russ Cox&lt;/strong&gt; wrote &lt;a class="reference external" href="http://swtch.com/~rsc/regexp/regexp4.html"&gt;a very interesting article&lt;/a&gt; about the algorithms behind
the Google Code Search service. In short: files are indexed with trigrams, the query
string is split into trigrams, and then the index is used to limit the number
of searched files.&lt;/p&gt;
&lt;p&gt;Code search manages a reverse index: there is a file list, and for
each trigram there is a list of file ids. I was interested in how the list
of file ids is stored on disk. In Russ's implementation such a
list is built as follows:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;ids are sorted in ascending order;&lt;/li&gt;
&lt;li&gt;differences between current and previous id are encoded using
&lt;strong&gt;variable length integers&lt;/strong&gt; (&lt;tt class="docutils literal"&gt;varint&lt;/tt&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The variable length integer is a byte-oriented encoding, similar to UTF-8, where
each byte holds 7 bits of data and 1 bit (MSB) is used as the end-of-word marker.
Thus smaller values require fewer bytes: integers less than 128 occupy 1 byte,
integers less than 16384 occupy 2 bytes, and so on.&lt;/p&gt;
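&lt;p&gt;A minimal C sketch of this scheme is shown below. It is not Russ Cox's code and does not claim to match the on-disk format of Code Search: the function names are made up, and it assumes the common convention (used for example by Go's &lt;tt class="docutils literal"&gt;encoding/binary&lt;/tt&gt; package) in which a set MSB means that more bytes follow, so the last byte of a value has the MSB cleared.&lt;/p&gt;
<![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Writes value as a varint (7 data bits per byte, low bits first).
   Returns the number of bytes written; out needs room for 5 bytes. */
size_t varint_encode(uint32_t value, uint8_t *out) {
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value | 0x80);   /* MSB set: more bytes follow */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;                /* MSB clear: last byte */
    return n;
}

/* Encodes an ascending list of ids as varint-coded differences between
   the current and the previous id. Returns the total size in bytes. */
size_t encode_ids(const uint32_t *ids, size_t count, uint8_t *out) {
    size_t n = 0;
    uint32_t previous = 0;
    for (size_t i = 0; i < count; i++) {
        n += varint_encode(ids[i] - previous, out + n);
        previous = ids[i];
    }
    return n;
}
```
]]>
&lt;p&gt;For example, the ids 5, 130, 131 and 20000 give the differences 5, 125, 1 and 19869, which take 1 + 1 + 1 + 3 = 6 bytes instead of 16 bytes for four plain 32-bit integers.&lt;/p&gt;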
&lt;p&gt;Important: lists of file ids are encoded &lt;strong&gt;independently&lt;/strong&gt;; because
of that it is not possible to use any common attributes to increase
compression.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>FBSTP --- the most complex instruction in x86 ISA</title>
  <link>http://0x80.pl/notesen/2013-11-07-fbstp.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-11-07-fbstp.html</guid>
  <pubDate>Thu, 07 Nov 2013 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="sample-sources"&gt;
&lt;h1&gt;Sample sources&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/fbstp"&gt;Sources&lt;/a&gt; are available at github.&lt;/p&gt;
&lt;p&gt;The program &lt;tt class="docutils literal"&gt;fbst_tests.c&lt;/tt&gt; converts a number to a string, removes leading
zeros, and detects errors:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ ./test 0 12 5671245 -143433 334535 4543985349054 999999999999999999999
printf =&amp;gt; 0.000000
FBSTP  =&amp;gt; 0
printf =&amp;gt; 12.000000
FBSTP  =&amp;gt; 12
printf =&amp;gt; 5671245.000000
FBSTP  =&amp;gt; 5671245
printf =&amp;gt; -143433.000000
FBSTP  =&amp;gt; -143433
printf =&amp;gt; 334535.000000
FBSTP  =&amp;gt; 334535
printf =&amp;gt; 4543985349054.000000
FBSTP  =&amp;gt; 4543985349054
printf =&amp;gt; 10000000000000000000000.000000
FBSTP  =&amp;gt; NaN/overflow
&lt;/pre&gt;
&lt;p&gt;The program &lt;tt class="docutils literal"&gt;fbst_speed.c&lt;/tt&gt; compares the instruction &lt;tt class="docutils literal"&gt;FBSTP&lt;/tt&gt; with a simple
implementation of &lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt;. No formatting, BCD to ASCII
conversion, etc. is performed. Numbers from 1 to 10,000,000 are converted.&lt;/p&gt;
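&lt;p&gt;For reference, a simple &lt;tt class="docutils literal"&gt;itoa&lt;/tt&gt; of the kind used as a baseline might look like the sketch below. It is an assumption about what such a routine does, not the code from &lt;tt class="docutils literal"&gt;fbst_speed.c&lt;/tt&gt;, and the name &lt;tt class="docutils literal"&gt;simple_itoa&lt;/tt&gt; is made up.&lt;/p&gt;
<![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal representation of value into out (NUL-terminated)
   and returns the number of digits; out needs room for 11 chars. */
size_t simple_itoa(uint32_t value, char *out) {
    char digits[10];                 /* a 32-bit value has at most 10 digits */
    size_t n = 0;

    do {                             /* produces digits from the lowest one */
        digits[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    for (size_t i = 0; i < n; i++) { /* copy in reverse, highest digit first */
        out[i] = digits[n - 1 - i];
    }
    out[n] = '\0';
    return n;
}
```
]]>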
&lt;p&gt;Results from quite old Pentium M:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
FBSTP...
... 2.285 s
simple itoa...
... 0.589 s
&lt;/pre&gt;
&lt;p&gt;and recent Core i7:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
FBSTP...
... 2.165 s
simple itoa...
... 0.419 s
&lt;/pre&gt;
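&lt;p&gt;There is practically no difference between the two CPUs! &lt;tt class="docutils literal"&gt;FBSTP&lt;/tt&gt; is just 5% faster on the Core i7, and in both cases it is several times slower than the simple itoa.&lt;/p&gt;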
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Short story about PostgreSQL SUM function</title>
  <link>http://0x80.pl/notesen/2013-11-04-postresql-sum.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-11-04-postresql-sum.html</guid>
  <pubDate>Mon, 04 Nov 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Here is a simple PostgreSQL type:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
CREATE TYPE foo_t AS (
        id    integer,
        total bigint
);
&lt;/pre&gt;
&lt;p&gt;and a simple query wrapped in a stored procedure:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
CREATE FUNCTION group_foo()
        RETURNS SETOF foo_t
        LANGUAGE &amp;quot;SQL&amp;quot;
AS $$
        SELECT id, SUM(some_column) FROM some_table GROUP BY id;
$$;
&lt;/pre&gt;
&lt;p&gt;Now, we want to sum everything:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
CREATE FUNCTION total_foo()
        RETURNS bigint -- same as foo_t.total
        LANGUAGE &amp;quot;SQL&amp;quot;
AS $$
        SELECT SUM(total) FROM group_foo();
$$;
&lt;/pre&gt;
&lt;p&gt;And we get an error about a type mismatch!&lt;/p&gt;
&lt;p&gt;This is caused by the SUM function: in PostgreSQL there are many variants of
this function, as the db engine supports function name overloading (sounds
familiar to C++ guys). There are the following variants in PostgreSQL 9.1:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ \df sum
                                                 List of functions
   Schema   | Name | Result data type | Argument data types | Type
------------+------+------------------+---------------------+------
 pg_catalog | sum  | numeric          | bigint              | agg
 pg_catalog | sum  | double precision | double precision    | agg
 pg_catalog | sum  | bigint           | integer             | agg
 pg_catalog | sum  | interval         | interval            | agg
 pg_catalog | sum  | money            | money               | agg
 pg_catalog | sum  | numeric          | numeric             | agg
 pg_catalog | sum  | real             | real                | agg
 pg_catalog | sum  | bigint           | smallint            | agg
&lt;/pre&gt;
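&lt;p&gt;Smaller types are promoted: from integer we get bigint, from bigint we get numeric, and so on. Here &lt;tt class="docutils literal"&gt;SUM(total)&lt;/tt&gt; takes a bigint and therefore returns numeric, which does not match the declared bigint result.&lt;/p&gt;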
  </description>
 </item>
 <item>
  <title>PostgreSQL --- faster reads from static tables</title>
  <link>http://0x80.pl/notesen/2013-11-02-postgres-reorder-tuples.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-11-02-postgres-reorder-tuples.html</guid>
  <pubDate>Sat, 02 Nov 2013 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&amp;quot;Static table&amp;quot; means that the data changes rarely; in my case it's a cache
for a reporting system. The table has no foreign keys, constraints, etc.
This cache is fed by a nightly cron job with a query looking like this:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
TRUNCATE reports_cache;
INSERT INTO reports_cache
        (... list of columns ...)
        (SELECT ... FROM reports_cache_data(... some parameters ...));
&lt;/pre&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;reports_cache_data&lt;/tt&gt; is a stored procedure that performs all the boring
transformations: it contains many joins, filters out certain rows, etc. The
cache table contains a lot of records.&lt;/p&gt;
&lt;p&gt;The main column used by all reports is the &amp;quot;creation date&amp;quot;; range filtering
on this column appears in virtually all queries. For example, an aggregating
query looks like this:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
SELECT
        idx1,
        count(counter1) as quantity1_count
FROM
        reports_cache
WHERE
        date BETWEEN '2013-01-01' AND '2013-01-31'
GROUP BY
        idx1;
&lt;/pre&gt;
&lt;p&gt;Of course there is an index on the &lt;tt class="docutils literal"&gt;date&lt;/tt&gt; field.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>PostgreSQL: printf in PL/pgSQL</title>
  <link>http://0x80.pl/notesen/2013-10-06-plpgsql-printf.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-10-06-plpgsql-printf.html</guid>
  <pubDate>Sun, 06 Oct 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The PostgreSQL wiki has an &lt;a class="reference external" href="http://wiki.postgresql.org/wiki/Sprintf"&gt;entry about sprintf&lt;/a&gt;; it is quite a simple approach
(and isn't marked as &lt;strong&gt;immutable&lt;/strong&gt;). The main drawback is iterating over all
chars of the format string. Here is a version that uses &lt;tt class="docutils literal"&gt;strpos&lt;/tt&gt; to locate % in the format
string, and it's around 2 times faster:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
CREATE OR REPLACE FUNCTION printf2(fmt text, variadic args anyarray) RETURNS text
LANGUAGE plpgsql IMMUTABLE AS $$
   DECLARE
      argcnt  int  := 1;
      head    text := '';     -- result
      tail    text := fmt;    -- unprocessed part
      k       int;
   BEGIN
      LOOP
         k := strpos(tail, '%');
         IF k = 0 THEN
            -- no more '%'
            head := head || tail;
            EXIT;
         ELSE
            IF substring(tail, k+1, 1) = '%' THEN
               -- escape sequence '%%'
               head := head || substring(tail, 1, k);
               tail := substring(tail, k+2);
            ELSE
               -- insert argument
               head := head || substring(tail, 1, k-1) || COALESCE(args[argcnt]::text, '');
               tail := substring(tail, k+1);
               argcnt := argcnt + 1;
            END IF;
         END IF;
      END LOOP;

      RETURN head;
END;
$$;
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>SSE: trie lookup speedup</title>
  <link>http://0x80.pl/notesen/2013-09-30-sse-trie.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-09-30-sse-trie.html</guid>
  <pubDate>Mon, 30 Sep 2013 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;A trie is a multiway tree where each edge is labelled by a letter;
such trees are used as dictionaries. Lookup takes &lt;span class="math"&gt;O(&lt;i&gt;k&lt;/i&gt;)&lt;/span&gt; time,
where &lt;span class="math"&gt;&lt;i&gt;k&lt;/i&gt;&lt;/span&gt; is the string length.&lt;/p&gt;
&lt;p&gt;A trie node is defined in C:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="k"&gt;typedef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;TrieNode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;eow&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// end of word marker
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="c1"&gt;// children count
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;letter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="c1"&gt;// list of edge labels
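&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;TrieNode&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// pointers to children nodes
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TrieNode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;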
&lt;/pre&gt;
&lt;p&gt;The lookup procedure is simple:&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;trie_lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TrieNode&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;TrieNode&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="c1"&gt;// go through edge labelled by i-th letter
&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trie_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;// we visited 'size' nodes, check if last
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// visited node is located at end-of-word
&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;eow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
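&lt;p&gt;The function &lt;tt class="docutils literal"&gt;trie_next&lt;/tt&gt; returns the child node reached through the edge labelled with the given letter, or &lt;tt class="docutils literal"&gt;NULL&lt;/tt&gt; if there is no such edge.&lt;/p&gt;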
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="n"&gt;TrieNode&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;trie_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TrieNode&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;letter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;The implementation of this procedure determines the overall search
and insertion time.&lt;/p&gt;
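&lt;p&gt;As an illustration, here is a minimal Python sketch of the lookup built on a linear-search &lt;tt class="docutils literal"&gt;trie_next&lt;/tt&gt;. It mirrors the parallel &lt;tt class="docutils literal"&gt;letter&lt;/tt&gt;/&lt;tt class="docutils literal"&gt;children&lt;/tt&gt; arrays of the node structure; linear search is only one possible choice, and the helper &lt;tt class="docutils literal"&gt;trie_insert&lt;/tt&gt; is a hypothetical addition that exists just to build test data.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
class TrieNode:
    def __init__(self):
        self.eow = False     # True if a word ends in this node
        self.letter = []     # edge labels
        self.children = []   # children[i] is reached through letter[i]


def trie_next(node, letter):
    """Linear search over the edge labels: O(number of children)."""
    for label, child in zip(node.letter, node.children):
        if label == letter:
            return child
    return None


def trie_insert(root, word):
    """Hypothetical helper: add the missing nodes along the path of `word`."""
    node = root
    for letter in word:
        child = trie_next(node, letter)
        if child is None:
            child = TrieNode()
            node.letter.append(letter)
            node.children.append(child)
        node = child
    node.eow = True


def trie_lookup(root, word):
    node = root
    for letter in word:
        node = trie_next(node, letter)
        if node is None:
            return False
    return node.eow
```
&lt;/pre&gt;
&lt;p&gt;With this variant both lookup and insertion cost grow with the number of children per node, which is why the choice of &lt;tt class="docutils literal"&gt;trie_next&lt;/tt&gt; matters.&lt;/p&gt;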
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Detecting intersection of convex polygons in 2D</title>
  <link>http://0x80.pl/notesen/2013-09-15-convex-polygon-intersection.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-09-15-convex-polygon-intersection.html</guid>
  <pubDate>Sun, 15 Sep 2013 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Detecting the intersection of convex polygons is a common task in a wide range
of applications. The method based on the &lt;strong&gt;separating axis theorem&lt;/strong&gt; (SAT) is widely used and
is considered the easiest and the fastest.&lt;/p&gt;
&lt;p&gt;This article presents a quite naive algorithm that, in terms of processing
polygon vertices, is better than SAT &amp;mdash; in the worst case it requires fewer
additions &amp;amp; multiplications.  However, in practice SAT is faster.&lt;/p&gt;
&lt;p&gt;I believe that the approach has already been discussed somewhere, but a quick search didn't
return anything.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>PHP quirk</title>
  <link>http://0x80.pl/notesen/2013-09-01-php-quirk.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2013-09-01-php-quirk.html</guid>
  <pubDate>Sun, 01 Sep 2013 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;PHP is a very funny language. Here is a simple class without a constructor:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
class Foo
{
}

$foo = new Foo($random, $names, $are, $not, $detected);
echo &amp;quot;ok!\n&amp;quot;;
&lt;/pre&gt;
&lt;p&gt;One might assume that the interpreter will detect the undeclared variables but, as
their names state, this doesn't happen (PHP versions 5.3..5.5):&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ php foo1.php
ok!
&lt;/pre&gt;
&lt;p&gt;When the class Foo has a constructor:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
class Foo
{
        public function __construct() { }
}

$foo = new Foo($random, $names, $are, $not, $detected);
&lt;/pre&gt;
&lt;p&gt;everything works as expected:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
$ php foo2.php
PHP Notice:  Undefined variable: random in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: names in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: are in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: not in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: detected in /home/wojtek/foo2.php on line 10
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>Average of two unsigned integers</title>
  <link>http://0x80.pl/notesen/2012-07-02-average-unsigned-ints.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2012-07-02-average-unsigned-ints.html</guid>
  <pubDate>Mon, 02 Jul 2012 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="sample-program"&gt;
&lt;h1&gt;Sample program&lt;/h1&gt;
&lt;p&gt;The following Python script verifies both methods exhaustively (for 8-bit integers):&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
&lt;span class="n"&gt;bits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;    &lt;span class="c1"&gt;# unsigned width&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;MSB&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bits&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="n"&gt;LSB_carry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;LSB_carry&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MSB&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="n"&gt;MSB_carry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MSB_carry&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bits&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;            &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;            &lt;span class="n"&gt;r1&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;safe1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;            &lt;span class="n"&gt;r2&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;safe2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;            &lt;span class="n"&gt;r3&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;safe3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;            &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;            &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;            &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;r3&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Speeding up LIKE '%text%' queries (at least in PostgreSQL)</title>
  <link>http://0x80.pl/notesen/2012-05-25-sql-ngram-index.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2012-05-25-sql-ngram-index.html</guid>
  <pubDate>Fri, 25 May 2012 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;When PostgreSQL executes a query containing &lt;tt class="docutils literal"&gt;like '%text%'&lt;/tt&gt;,
it can't use any standard index (I guess no existing
SQL server has such a feature).&lt;/p&gt;
&lt;p&gt;But it's possible to speed up such queries with a custom index
based on n-grams. An n-gram is a subword consisting of N successive characters.
For example, all 3-grams (trigrams) of the word &amp;quot;bicycle&amp;quot; are: &amp;quot;bic&amp;quot;,
&amp;quot;icy&amp;quot;, &amp;quot;cyc&amp;quot;, &amp;quot;ycl&amp;quot;, &amp;quot;cle&amp;quot;. Likewise, all 4-grams of this word are: &amp;quot;bicy&amp;quot;,
&amp;quot;icyc&amp;quot;, &amp;quot;cycl&amp;quot;, &amp;quot;ycle&amp;quot;.&lt;/p&gt;
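&lt;p&gt;The definition can be written down in a few lines of Python. This is only a sketch of the splitting step; the function name &lt;tt class="docutils literal"&gt;ngrams&lt;/tt&gt; is an assumption made for the example and is not part of any PostgreSQL API.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def ngrams(word, n):
    """Return all substrings of `word` made of n successive characters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(ngrams("bicycle", 3))   # ['bic', 'icy', 'cyc', 'ycl', 'cle']
print(ngrams("bicycle", 4))   # ['bicy', 'icyc', 'cycl', 'ycle']
```
&lt;/pre&gt;
&lt;p&gt;A word of length L yields L - N + 1 n-grams, so words shorter than N produce none.&lt;/p&gt;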
&lt;p&gt;This article shows two approaches:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;An n-gram index as a regular table &amp;mdash; may be applicable
for engines other than PostgreSQL;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.postgresql.org/docs/9.1/static/textsearch-intro.html"&gt;Full Text Search&lt;/a&gt; indexes, GIN and GiST, available in PostgreSQL
since 8.4. Note that the two index types have different properties
in terms of update and read times.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE: conversion integers to decimal representation</title>
  <link>http://0x80.pl/notesen/2011-10-21-sse-itoa.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-10-21-sse-itoa.html</guid>
  <pubDate>Fri, 21 Oct 2011 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;With SSE2 instructions it's possible to convert up to four numbers
in the range 0..9999_9999 at once and get 32 decimal digits as a result. This text
describes code for two numbers (suitable for 64-bit conversions)
and for one number (suitable for 32-bit conversions).&lt;/p&gt;
&lt;p&gt;The outline of algorithm 1 has been posted by &lt;a class="reference external" href="http://42.pl/na/j5aji9$oe4$1&amp;#64;news.task.gda.pl"&gt;Piotr Wyderski&lt;/a&gt;
on the Usenet group &lt;tt class="docutils literal"&gt;pl.comp.lang.c&lt;/tt&gt;; I merely implemented it.
The main idea is to perform, in &lt;strong&gt;parallel&lt;/strong&gt;, divisions &amp;amp; modulo operations
by &lt;span class="math"&gt;10&lt;sup&gt;8&lt;/sup&gt;&lt;/span&gt; (for 64-bit numbers), then &lt;span class="math"&gt;10&lt;sup&gt;4&lt;/sup&gt;&lt;/span&gt;,
&lt;span class="math"&gt;10&lt;sup&gt;2&lt;/sup&gt;&lt;/span&gt; and finally &lt;span class="math"&gt;10&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;I've developed algorithm 2, which converts just a single 8-digit number.
First, division &amp;amp; modulo by &lt;span class="math"&gt;10&lt;sup&gt;4&lt;/sup&gt;&lt;/span&gt; is performed, and an input
vector is formed: &lt;span class="math"&gt;[&lt;i&gt;abcd&lt;/i&gt;, &lt;i&gt;abcd&lt;/i&gt;, &lt;i&gt;abcd&lt;/i&gt;, &lt;i&gt;abcd&lt;/i&gt;, &lt;i&gt;efgh&lt;/i&gt;, &lt;i&gt;efgh&lt;/i&gt;, &lt;i&gt;efgh&lt;/i&gt;, &lt;i&gt;efgh&lt;/i&gt;]&lt;/span&gt;.
Then this vector is divided by the vector &lt;span class="math"&gt;[10&lt;sup&gt;3&lt;/sup&gt;, 10&lt;sup&gt;2&lt;/sup&gt;, 10&lt;sup&gt;1&lt;/sup&gt;, 10&lt;sup&gt;0&lt;/sup&gt;, 10&lt;sup&gt;3&lt;/sup&gt;, 10&lt;sup&gt;2&lt;/sup&gt;, 10&lt;sup&gt;1&lt;/sup&gt;, 10&lt;sup&gt;0&lt;/sup&gt;]&lt;/span&gt;, yielding the vector &lt;span class="math"&gt;[&lt;i&gt;a&lt;/i&gt;, &lt;i&gt;ab&lt;/i&gt;, &lt;i&gt;abc&lt;/i&gt;, &lt;i&gt;abcd&lt;/i&gt;, &lt;i&gt;e&lt;/i&gt;, &lt;i&gt;ef&lt;/i&gt;, &lt;i&gt;efg&lt;/i&gt;, &lt;i&gt;efgh&lt;/i&gt;]&lt;/span&gt;. Obtaining the 8-digit result from this vector requires
some shuffling, a multiplication by 10 and a subtraction.&lt;/p&gt;
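&lt;p&gt;The arithmetic of algorithm 2 can be checked with a scalar model in Python. This is only a sketch, not the SSE code: the function name &lt;tt class="docutils literal"&gt;digits8&lt;/tt&gt; is hypothetical, plain lists stand in for vector registers, and the exact arrangement of the shuffled vector is my assumption about how the shuffle, the multiplication by 10 and the subtraction fit together.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def digits8(x):
    """Scalar model of the vector steps for one number below 10**8."""
    assert x in range(10 ** 8)

    # split into the upper and lower four digits
    abcd, efgh = divmod(x, 10 ** 4)

    # input vector and the vector of divisors
    vec = [abcd] * 4 + [efgh] * 4
    divisors = [1000, 100, 10, 1] * 2

    # element-wise division: [a, ab, abc, abcd, e, ef, efg, efgh]
    prefixes = [v // d for v, d in zip(vec, divisors)]

    # "shuffle": move each prefix one lane to the right inside its half,
    # giving [0, a, ab, abc, 0, e, ef, efg]
    shifted = [0] + prefixes[0:3] + [0] + prefixes[4:7]

    # multiply by 10 and subtract: every lane keeps only its last digit
    return [p - 10 * s for p, s in zip(prefixes, shifted)]


print(digits8(12345678))   # [1, 2, 3, 4, 5, 6, 7, 8]
```
&lt;/pre&gt;
&lt;p&gt;Each lane holds a prefix of the number, so subtracting ten times the shorter prefix leaves exactly one decimal digit per lane.&lt;/p&gt;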
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Traversing DAGs</title>
  <link>http://0x80.pl/notesen/2011-04-11-traversing-dags.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-04-11-traversing-dags.html</guid>
  <pubDate>Mon, 11 Apr 2011 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;If a DAG has a single connected component, the simplest traversal
method is depth-first search, which can easily be implemented
recursively (using an implicit stack).&lt;/p&gt;
&lt;pre class="literal-block"&gt;
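#include &lt;vector&gt;

struct DAGNode {
        // user data
        std::vector&lt;DAGNode*&gt; connected;  // outgoing edges
        bool    visited;        // after construction = false
};

void DFS_aux(DAGNode* node, const bool val) {
        if (node-&amp;gt;visited != val) {
                // visit node

                node-&amp;gt;visited = val;
                for (DAGNode* n : node-&amp;gt;connected)
                        DFS_aux(n, val);
        }
}

void DFS(DAGNode* node) {
        static bool val = true;  // marker value used by the current traversal

        DFS_aux(node, val);
        val = !val;              // the next traversal uses the opposite marker
}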
&lt;/pre&gt;
&lt;p&gt;On every call of &lt;tt class="docutils literal"&gt;DFS()&lt;/tt&gt; the variable &lt;tt class="docutils literal"&gt;val&lt;/tt&gt; is flipped,
so the &lt;tt class="docutils literal"&gt;visited&lt;/tt&gt; member is marked alternately with
&lt;tt class="docutils literal"&gt;true&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;false&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;There is just &lt;strong&gt;one problem&lt;/strong&gt; &amp;mdash; what if a traversal
stops before visiting all nodes? Of course, in such a
situation we have to visit the DAG twice: the first pass resets
(possibly many times) the &lt;tt class="docutils literal"&gt;visited&lt;/tt&gt; member to &lt;tt class="docutils literal"&gt;false&lt;/tt&gt;, and
the second pass visits each node once.&lt;/p&gt;
&lt;p&gt;But usually &lt;tt class="docutils literal"&gt;bool&lt;/tt&gt; has at least 8 bits, so numbers can
be used instead of the boolean values 0 and 1. On each call of
&lt;tt class="docutils literal"&gt;DFS()&lt;/tt&gt; a reference number is incremented; thanks to that, even
if the previous call stopped in the middle, the procedure works
correctly.&lt;/p&gt;
&lt;p&gt;The only moment when the &lt;tt class="docutils literal"&gt;visited&lt;/tt&gt; markers have to be cleared
is when the reference number wraps around to zero. This happens every
255 calls if 8-bit values are used (the value 0 is reserved for unvisited nodes); for wider counters (16, 32 bits)
the period is correspondingly longer.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
// the visited member and the val parameter of DFS_aux are now unsigned int
void DFS(DAGNode* node) {
        static unsigned int val = 1;
        if (val == 0) {
                // the counter wrapped: set visited member of all nodes to 0
                val += 1;
        }

        DFS_aux(node, val);
        val += 1;
}
&lt;/pre&gt;
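&lt;p&gt;The counter-based marking can be tried out with a small Python model. It is only a sketch: the class &lt;tt class="docutils literal"&gt;Traversal&lt;/tt&gt;, its &lt;tt class="docutils literal"&gt;limit&lt;/tt&gt; parameter (used to simulate a traversal that stops in the middle) and the explicit stack are assumptions made for the example, not part of the code above.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
class DAGNode:
    def __init__(self, name):
        self.name = name
        self.connected = []   # outgoing edges
        self.visited = 0      # number of the last traversal; 0 = never


class Traversal:
    """Hypothetical helper: DFS marked with a wrapping counter."""

    def __init__(self, all_nodes, bits=8):
        self.all_nodes = all_nodes
        self.modulus = 2 ** bits   # the counter wraps like an unsigned int
        self.val = 1

    def dfs(self, start, limit=None):
        """Return names of visited nodes; stop early after `limit` nodes."""
        if self.val == 0:
            # the counter wrapped: clear all markers, skip the reserved 0
            for node in self.all_nodes:
                node.visited = 0
            self.val = 1

        order = []
        stack = [start]
        while stack:
            if limit is not None and len(order) == limit:
                break   # simulate a traversal interrupted in the middle
            node = stack.pop()
            if node.visited != self.val:
                node.visited = self.val
                order.append(node.name)
                stack.extend(reversed(node.connected))

        self.val = (self.val + 1) % self.modulus
        return order


# diamond-shaped DAG: a to b and c, both b and c to d
a, b, c, d = (DAGNode(name) for name in "abcd")
a.connected = [b, c]
b.connected = [d]
c.connected = [d]

walker = Traversal([a, b, c, d])
print(walker.dfs(a, limit=2))   # ['a', 'b']
print(walker.dfs(a))            # ['a', 'b', 'd', 'c']
```
&lt;/pre&gt;
&lt;p&gt;The second call visits every node once even though the first one was interrupted, because the stale markers simply differ from the new counter value.&lt;/p&gt;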
  </description>
 </item>
 <item>
  <title>DAWG as dictionary? Yes!</title>
  <link>http://0x80.pl/notesen/2011-04-09-dawg-as-dictionary.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-04-09-dawg-as-dictionary.html</guid>
  <pubDate>Sat, 09 Apr 2011 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;If you read the Wikipedia entry about &lt;a class="reference external" href="http://en.wikipedia.org/wiki/DAWG"&gt;DAWG&lt;/a&gt;, you will find the following
sentence:&lt;/p&gt;
&lt;blockquote&gt;
Because the terminal nodes of a DAWG can be reached by multiple
paths, a DAWG is not suitable for storing auxiliary information
relating to each path, e.g. a word's frequency in the English
language. A trie would be more useful in such a case.&lt;/blockquote&gt;
&lt;p&gt;This isn't true!&lt;/p&gt;
&lt;p&gt;There is a quite simple algorithm that performs two-way minimal
perfect hashing (MPH), i.e. it converts any path representing a word to a
unique number, or back &amp;mdash; a number to a path (word). Values lie in the range
1 .. &lt;em&gt;n&lt;/em&gt;, where &lt;em&gt;n&lt;/em&gt; is the number of distinct words saved in a DAWG.&lt;/p&gt;
&lt;p&gt;The algorithm is described in &lt;em&gt;Applications of Finite Automata Representing
Large Vocabularies&lt;/em&gt;, by &lt;strong&gt;Claudio Lucchesi&lt;/strong&gt; and &lt;strong&gt;Tomasz Kowaltowski&lt;/strong&gt;
(a preprint is freely available online).&lt;/p&gt;
&lt;p&gt;The main part of the algorithm is assigning to each node the number of
words reachable from that node; this can easily be done in one pass. Then
these numbers are used to perform perfect hashing. The hashing algorithm is
fast and simple, and translating the pseudocode presented in the paper is
straightforward.&lt;/p&gt;
&lt;p&gt;The algorithm requires additional memory for the number stored in each node and a table
of size &lt;em&gt;n&lt;/em&gt; to implement dictionary lookups.&lt;/p&gt;
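&lt;p&gt;The idea can be sketched in Python as follows. This is my own minimal illustration, not a translation of the paper's pseudocode and not the pyDAWG implementation; the names &lt;tt class="docutils literal"&gt;Node&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;count_words&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;word_to_index&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;index_to_word&lt;/tt&gt; are assumptions, and the tiny automaton is built by hand.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
class Node:
    def __init__(self, final=False, children=None):
        self.final = final               # True if a word ends here
        self.children = children or {}   # edge label to Node
        self.count = None                # words reachable from this node


def count_words(node):
    """Store in each node the number of words reachable from it."""
    if node.count is None:               # shared nodes are counted once
        node.count = int(node.final)
        for child in node.children.values():
            node.count += count_words(child)
    return node.count


def word_to_index(root, word):
    """Return the 1-based rank of `word`, or None if it is not stored."""
    index = 0
    node = root
    for letter in word:
        if letter not in node.children:
            return None
        if node.final:
            index += 1                   # a proper prefix is a smaller word
        for label in sorted(node.children):
            if label == letter:
                break
            index += node.children[label].count   # skip smaller branches
        node = node.children[letter]
    return index + 1 if node.final else None


def index_to_word(root, index):
    """Inverse of word_to_index for indices 1 .. root.count."""
    if index not in range(1, root.count + 1):
        return None
    letters = []
    node = root
    while True:
        if node.final:
            index -= 1
            if index == 0:
                return "".join(letters)
        for label in sorted(node.children):
            child = node.children[label]
            if index in range(1, child.count + 1):
                letters.append(label)
                node = child
                break
            index -= child.count


# A tiny hand-built DAWG for: cat, cats, rat, rats.
# The edges "c" and "r" lead to the same (shared) node.
n4 = Node(final=True)
n3 = Node(final=True, children={"s": n4})
n2 = Node(children={"t": n3})
n1 = Node(children={"a": n2})
root = Node(children={"c": n1, "r": n1})
count_words(root)

print(word_to_index(root, "rat"))   # 3
print(index_to_word(root, 4))       # rats
```
&lt;/pre&gt;
&lt;p&gt;Although "cat" and "rat" end in the same node, they get different numbers, because the number depends on the path and not on the terminal node. The number can then index an ordinary table holding the auxiliary data.&lt;/p&gt;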
&lt;p&gt;I've updated &lt;a class="reference external" href="https://github.com/WojciechMula/pydawg"&gt;pyDAWG&lt;/a&gt; to support MPH.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Python: C extensions --- sequence-like object</title>
  <link>http://0x80.pl/notesen/2011-04-08-sequence-like-object.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-04-08-sequence-like-object.html</guid>
  <pubDate>Fri, 08 Apr 2011 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;If a class has to support the standard &lt;tt class="docutils literal"&gt;len()&lt;/tt&gt; function or the &lt;tt class="docutils literal"&gt;in&lt;/tt&gt; operator,
then it must be sequence-like. This requires a variable of type
&lt;tt class="docutils literal"&gt;PySequenceMethods&lt;/tt&gt; that stores the addresses of the proper functions.
Finally, the address of this structure has to be assigned to the &lt;tt class="docutils literal"&gt;tp_as_sequence&lt;/tt&gt;
member of the main &lt;tt class="docutils literal"&gt;PyTypeObject&lt;/tt&gt; variable.&lt;/p&gt;
&lt;p&gt;Here is sample code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
static PySequenceMethods class_seq;

static PyTypeObject class_type_dsc = {
        ...
};

Py_ssize_t
classmeth_len(PyObject* self) {
        if (not error)
                return sequence_size;
        else
                return -1;
}

int
classmeth_contains(PyObject* self, PyObject* value) {
        if (not error) {
                if (value in self)
                        return 1;
                else
                        return 0;
        }
        else
                return -1;
}


PyMODINIT_FUNC
PyInit_module() {
        class_seq.sq_length   = classmeth_len;
        class_seq.sq_contains = classmeth_contains;

        class_type_dsc.tp_as_sequence = &amp;amp;class_seq;

        ...
}
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>Efficient trie representation</title>
  <link>http://0x80.pl/notesen/2011-03-26-trie-representation.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-03-26-trie-representation.html</guid>
  <pubDate>Sat, 26 Mar 2011 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Each node of a trie stores two kinds of data: user data and trie-related
data. Let's say the alphabet has at most 256 letters, so a letter can be
stored in one byte.&lt;/p&gt;
&lt;p&gt;While maintaining the trie structure, the following issues appear:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;strong&gt;Amount of memory&lt;/strong&gt; required to store nodes and edges.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal memory fragmentation&lt;/strong&gt; in the underlying dynamic
memory allocation routines (&lt;tt class="docutils literal"&gt;malloc&lt;/tt&gt;/&lt;tt class="docutils literal"&gt;free&lt;/tt&gt;), which
makes the real size of the trie larger. If small objects are
allocated/reallocated, fragmentation can be significant.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic allocation/reallocation&lt;/strong&gt; scatters data over the heap,
making cache misses noticeable. Using arrays may improve
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/memory_locality"&gt;memory locality&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data alignment&lt;/strong&gt; is important on modern CPUs, because it makes
reading and writing memory faster. The node structure
could be packed to occupy the smallest possible amount of memory at
the cost of speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time required to retrieve a child node&lt;/strong&gt; depending on
a letter (edge label):&lt;ul&gt;
&lt;li&gt;constant (arrays indexed directly by letter),&lt;/li&gt;
&lt;li&gt;logarithmic (sorted arrays), or&lt;/li&gt;
&lt;li&gt;linear (unsorted arrays, lists).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
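&lt;p&gt;The three retrieval costs listed above can be illustrated with a short Python sketch. The mapping of each cost to a concrete layout (a 256-slot table, a sorted array searched with bisection, an unsorted array scanned linearly) is my reading of the list, and all function names are hypothetical; letters are represented as byte values.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
from bisect import bisect_left


def next_direct(table, letter):
    """Constant time: `table` has one slot per possible letter (256 here)."""
    return table[letter]


def next_sorted(labels, children, letter):
    """Logarithmic time: binary search in labels kept in sorted order."""
    i = bisect_left(labels, letter)
    if i != len(labels) and labels[i] == letter:
        return children[i]
    return None


def next_linear(labels, children, letter):
    """Linear time: scan labels stored in arbitrary order."""
    for label, child in zip(labels, children):
        if label == letter:
            return child
    return None


# three outgoing edges labelled 'a', 'c' and 't'; strings stand in for nodes
labels = [ord("a"), ord("c"), ord("t")]
children = ["node-a", "node-c", "node-t"]

table = [None] * 256
for label, child in zip(labels, children):
    table[label] = child

print(next_direct(table, ord("c")))              # node-c
print(next_sorted(labels, children, ord("t")))   # node-t
print(next_linear(labels, children, ord("b")))   # None
```
&lt;/pre&gt;
&lt;p&gt;The direct table is the fastest but reserves 256 slots in every node, while the two array variants store only the existing edges; this is the memory versus time trade-off behind the list.&lt;/p&gt;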
&lt;p&gt;In the experiments, Polish, English, French and German word lists were
used. User data consists of two pointers, i.e. an additional 8 bytes per
node.&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;col width="33%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;list&lt;/th&gt;
&lt;th class="head"&gt;words&lt;/th&gt;
&lt;th class="head"&gt;trie nodes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;english&lt;/td&gt;
&lt;td&gt;138 622&lt;/td&gt;
&lt;td&gt;312 855&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;german&lt;/td&gt;
&lt;td&gt;162 032&lt;/td&gt;
&lt;td&gt;610 470&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;french&lt;/td&gt;
&lt;td&gt;629 420&lt;/td&gt;
&lt;td&gt;1 297 080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;polish&lt;/td&gt;
&lt;td&gt;3 588 729&lt;/td&gt;
&lt;td&gt;5 933 658&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The sample program was linked against GNU libc, and the procedure &lt;tt class="docutils literal"&gt;malloc_stats&lt;/tt&gt;
from &lt;tt class="docutils literal"&gt;malloc.h&lt;/tt&gt; was used to obtain statistics about the real memory usage.&lt;/p&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Python: test if object is iterable</title>
  <link>http://0x80.pl/notesen/2011-02-26-python-is-iterable.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-02-26-python-is-iterable.html</guid>
  <pubDate>Sat, 26 Feb 2011 12:00:00 +0100</pubDate>
  <description>
&lt;pre class="code python literal-block"&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;isiterable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;                &lt;span class="nb"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>Traversing tree without stack</title>
  <link>http://0x80.pl/notesen/2011-02-17-traversing-trees.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2011-02-17-traversing-trees.html</guid>
  <pubDate>Thu, 17 Feb 2011 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Obvious traversal algorithms require &lt;span class="math"&gt;O(log&lt;i&gt;n&lt;/i&gt;)&lt;/span&gt; memory, i.e.
an explicit or an implicit stack or a queue.&lt;/p&gt;
&lt;p&gt;The iterative algorithm described here performs a depth-first search and
requires &lt;span class="math"&gt;O(1)&lt;/span&gt; memory. In technical terms, only a reference (or
a pointer) to one tree node is needed. This node, called &lt;tt class="docutils literal"&gt;p&lt;/tt&gt;, is
the node processed in the previous step of the algorithm.&lt;/p&gt;
&lt;p&gt;The following properties of a tree node &lt;tt class="docutils literal"&gt;x&lt;/tt&gt; are needed:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;it is possible to check if a node &lt;tt class="docutils literal"&gt;y&lt;/tt&gt; is a child
(&lt;tt class="docutils literal"&gt;x.is_child(y)&lt;/tt&gt;);&lt;/li&gt;
&lt;li&gt;it is possible to check if a node &lt;tt class="docutils literal"&gt;y&lt;/tt&gt; is a parent
(&lt;tt class="docutils literal"&gt;y = x.parent()&lt;/tt&gt;);&lt;/li&gt;
&lt;li&gt;it is possible to get the first child node
(&lt;tt class="docutils literal"&gt;x.first_child()&lt;/tt&gt;);&lt;/li&gt;
&lt;li&gt;it is possible to get the next sibling child node if
another child &lt;tt class="docutils literal"&gt;y&lt;/tt&gt; is given
(&lt;tt class="docutils literal"&gt;x.next_sibling(y)&lt;/tt&gt;);&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For binary trees all of these functions are quite simple. The main
disadvantage is that each node has to remember its parent.&lt;/p&gt;
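&lt;p&gt;The idea can be sketched in Python as follows. This is only an illustration under assumptions: the &lt;tt class="docutils literal"&gt;Node&lt;/tt&gt; class, its &lt;tt class="docutils literal"&gt;parent&lt;/tt&gt; attribute and the &lt;tt class="docutils literal"&gt;traverse&lt;/tt&gt; generator are hypothetical names, and the child/parent checks from the list above are expressed as identity comparisons with the previously processed node.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
class Node:
    """Tree node that remembers its parent (hypothetical helper class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def first_child(self):
        return self.children[0] if self.children else None

    def next_sibling(self, child):
        # Index of the child that follows `child`, if there is one.
        i = self.children.index(child) + 1
        return self.children[i] if i != len(self.children) else None


def traverse(root):
    """Yield nodes depth-first (pre-order) without a stack."""
    stop = root.parent
    prev, node = stop, root
    while node is not stop:
        if prev is node.parent:
            # Arrived from the parent: first visit, so descend.
            yield node
            nxt = node.first_child()
        else:
            # Came back from child `prev`: continue with its sibling.
            nxt = node.next_sibling(prev)
        if nxt is None:
            nxt = node.parent  # subtree finished, go back up
        prev, node = node, nxt


tree = Node("root", [Node("a", [Node("c"), Node("d")]), Node("b")])
print([n.name for n in traverse(tree)])
```
&lt;/pre&gt;
&lt;p&gt;Only &lt;tt class="docutils literal"&gt;prev&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;node&lt;/tt&gt; are kept between steps; the direction of the next move follows from whether the previous node was the parent or a child of the current one.&lt;/p&gt;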
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Branchless set mask if value greater or how to print hex values</title>
  <link>http://0x80.pl/notesen/2010-06-09-brancheless-hex-print.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-06-09-brancheless-hex-print.html</guid>
  <pubDate>Wed, 09 Jun 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Suppose we need to get a mask when a nonnegative argument is greater than
some constant value; in other words, we want to evaluate the following
expression:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
if x &amp;gt; const_n then
   mask := 0xffffffff;
else
   mask := 0x00000000;
&lt;/pre&gt;
&lt;p&gt;Portable branchless solution:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;choose a magic number &lt;tt class="docutils literal"&gt;M := (1 &amp;lt;&amp;lt; k) - 1 - n&lt;/tt&gt;, where &lt;tt class="docutils literal"&gt;k&lt;/tt&gt; is a bit position,
for example 31 if we operate on 32-bit words&lt;/li&gt;
&lt;li&gt;calculate &lt;tt class="docutils literal"&gt;R := x + M&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;k-th bit of &lt;tt class="docutils literal"&gt;R&lt;/tt&gt; is set if &lt;tt class="docutils literal"&gt;x &amp;gt; n&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;fill mask with this bit - see note &lt;a class="reference external" href="http://wmula.blogspot.com/2010/04/fill-word-with-selected-bit.html"&gt;Fill word with selected bit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key to understanding this trick is the binary form of M:
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;0111..1111zzzz&lt;/span&gt;&lt;/tt&gt;, where each &lt;tt class="docutils literal"&gt;z&lt;/tt&gt; is 0 or 1 depending on the value of &lt;tt class="docutils literal"&gt;n&lt;/tt&gt;. When
&lt;tt class="docutils literal"&gt;x&lt;/tt&gt; is greater than &lt;tt class="docutils literal"&gt;n&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;x + M&lt;/tt&gt; has the form &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;1000..000zzzz&lt;/span&gt;&lt;/tt&gt;,
because the carry bit propagates through the series of ones to the k-th position of
the result.&lt;/p&gt;
&lt;p&gt;Real world example &amp;mdash; branchless converting hex digit to ASCII
(&lt;tt class="docutils literal"&gt;M=0x7ffffff6&lt;/tt&gt; for &lt;tt class="docutils literal"&gt;k=31&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;n=9&lt;/tt&gt;).&lt;/p&gt;
&lt;pre class="literal-block"&gt;
; input:    eax - hex digit
; output:   eax - ASCII letter (0-9, A-F or a-f)
; destroys: ebx

        andl $0xf, %eax
        leal 0x7ffffff6(%eax), %ebx     ; MSB(ebx)=1 when eax &amp;gt;= 10
        sarl $31, %ebx                  ; ebx - mask
        andl  $7, %ebx                  ; ebx = 7 when eax &amp;gt;= 10 (for A-F letters)
        ;andl $39, %ebx                 ; ebx = 39 when eax &amp;gt;= 10 (for a-f letters)
        leal '0'(%eax, %ebx), %eax      ; eax = '0' + eax + ebx =&amp;gt; ASCII letter
&lt;/pre&gt;
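&lt;p&gt;The same arithmetic can be checked with a small Python sketch. It assumes 32-bit words, k=31 and nonnegative inputs below 2**31; the names &lt;tt class="docutils literal"&gt;greater_mask&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;hex_digit&lt;/tt&gt; are made up for this illustration, and multiplying by the 0/1 flag stands in for the bit-fill and AND steps of the assembly version.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
K = 31  # bit position used as the flag (32-bit words)


def greater_mask(x, n):
    """Return 0xffffffff when x exceeds n, else 0 (x and n in range(2**K))."""
    m = 2**K - 1 - n          # magic number M: 0111..1111 minus n
    flag = (x + m) >> K       # the carry reaches bit K only when x exceeds n
    return flag * 0xFFFFFFFF  # fill the whole word with the flag


def hex_digit(d):
    """Map 0..15 to '0'..'9', 'A'..'F' without a comparison."""
    flag = (d + 0x7FFFFFF6) >> K  # M for k=31 and n=9; 1 when d is 10 or more
    return chr(ord("0") + d + 7 * flag)


print(hex(greater_mask(10, 9)), hex(greater_mask(9, 9)))
print("".join(hex_digit(d) for d in range(16)))
```
&lt;/pre&gt;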
&lt;p&gt;It is also possible to convert 4 hex digits in parallel using a similar
algorithm, but the input data has to be prepared correctly. Moreover,
generating the mask requires 3 instructions and one extra register (in the scalar
version just one arithmetic shift). I guess it won't be fast on x86;
maybe this approach would be good for SIMD code, where similar code
transforms more bytes at once.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
; input: eax - four hex digits in form [0a0b0c0d]
; output: eax - four ascii letters
; destroys: ebx, ecx

        leal 0x76767676(%eax), %ebx        ; MSB of each byte is set when corresponding eax byte is &amp;gt;= 10
                                           ; (here: 0x7f - 9 = 0x76)
        andl $0x80808080, %ebx
        movl %ebx, %ecx
        shrl    $7, %ebx
        subl %ebx, %ecx                    ; ecx - byte-wise mask
        ;andl $0x07070707, %ecx            ; for ASCII letters A-F
        andl $0x27272727, %ecx             ; for ASCII letters a-f
        leal 0x30303030(%eax, %ecx), %eax  ; eax - four ascii letters
&lt;/pre&gt;
&lt;p&gt;See also: SSSE3: &lt;a class="reference external" href="2008-05-24-sse-popcount.html"&gt;printing hex values&lt;/a&gt; (weird use of PSHUFB instruction)&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Speedup reversing table of bytes</title>
  <link>http://0x80.pl/notesen/2010-05-01-reverse-array-of-bytes.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-05-01-reverse-array-of-bytes.html</guid>
  <pubDate>Sat, 01 May 2010 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="tests"&gt;
&lt;h1&gt;Tests&lt;/h1&gt;
&lt;p&gt;Two test scenarios were considered:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The table size is hardware friendly, i.e. it is a multiple of the implementation's
base step; the address of the table is also aligned:&lt;ul&gt;
&lt;li&gt;all procedures use &lt;tt class="docutils literal"&gt;MOVUPS&lt;/tt&gt;,&lt;/li&gt;
&lt;li&gt;all procedures use &lt;tt class="docutils literal"&gt;MOVAPS&lt;/tt&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The size is not hardware friendly and address is not aligned.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The procedures were tested on the following computers:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;recent Core 2 Duo E8200,&lt;/li&gt;
&lt;li&gt;quite old Pentium M (instruction &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; not available).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Quick results discussion:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;As always, the speedup depends on the table size: for larger tables
the speedup is also larger. Max speedup:&lt;ul&gt;
&lt;li&gt;Core 2:&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;15.5 times&lt;/strong&gt; &amp;mdash; &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; unrolled &amp;amp; &lt;tt class="docutils literal"&gt;MOVAPS&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;3.5 times &amp;mdash; &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; &amp;amp; &lt;tt class="docutils literal"&gt;MOVUPS&lt;/tt&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Pentium M:&lt;ul&gt;
&lt;li&gt;4 times &amp;mdash; &lt;tt class="docutils literal"&gt;BSWAP&lt;/tt&gt; unrolled&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Unaligned memory access kills performance; the results clearly
show this behaviour.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="section" id="results-for-core-2"&gt;
&lt;h2&gt;Results for Core 2&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="2010-05-01-reverse-array-of-bytes/results-core2.txt"&gt;Results&lt;/a&gt;&lt;/p&gt;
&lt;img alt="Core 2 - tables aligned, use MOVUPS for memory transfers" class="align-center" src="2010-05-01-reverse-array-of-bytes/core2-aligned.png" /&gt;
&lt;p&gt;&lt;a class="reference external" href="2010-05-01-reverse-array-of-bytes/results-core2-movaps.txt"&gt;Results&lt;/a&gt;&lt;/p&gt;
&lt;img alt="Core 2 - tables aligned, use MOVAPS for memory transfers" class="align-center" src="2010-05-01-reverse-array-of-bytes/core2-aligned-movaps.png" /&gt;
&lt;p&gt;&lt;a class="reference external" href="2010-05-01-reverse-array-of-bytes/results-core2.txt"&gt;Results&lt;/a&gt;&lt;/p&gt;
&lt;img alt="Core 2 - tables unaligned" class="align-center" src="2010-05-01-reverse-array-of-bytes/core2-unaligned.png" /&gt;
&lt;/div&gt;
&lt;div class="section" id="results-for-pentium-m"&gt;
&lt;h2&gt;Results for Pentium M&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="2010-05-01-reverse-array-of-bytes/results-pentiumm.txt"&gt;Results&lt;/a&gt;&lt;/p&gt;
&lt;img alt="Pentium M - tables aligned, use MOVUPS for memory transfers" class="align-center" src="2010-05-01-reverse-array-of-bytes/pentiumm-aligned.png" /&gt;
&lt;p&gt;&lt;a class="reference external" href="2010-05-01-reverse-array-of-bytes/results-pentiumm-movaps.txt"&gt;Results&lt;/a&gt;&lt;/p&gt;
&lt;img alt="Pentium M - tables aligned, use MOVAPS for memory transfers" class="align-center" src="2010-05-01-reverse-array-of-bytes/pentiumm-aligned-movaps.png" /&gt;
&lt;p&gt;&lt;a class="reference external" href="2010-05-01-reverse-array-of-bytes/results-pentiumm.txt"&gt;Results&lt;/a&gt;&lt;/p&gt;
&lt;img alt="Pentium M - tables unaligned" class="align-center" src="2010-05-01-reverse-array-of-bytes/pentiumm-unaligned.png" /&gt;
&lt;/div&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Determining if an integer is a power of 2</title>
  <link>http://0x80.pl/notesen/2010-04-11-is-pow2.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-04-11-is-pow2.html</guid>
  <pubDate>Sun, 11 Apr 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Method from &lt;a class="reference external" href="http://graphics.stanford.edu/%7Eseander/bithacks.html"&gt;Bit Twiddling Hacks&lt;/a&gt;: &lt;tt class="docutils literal"&gt;(x != 0) &amp;amp;&amp;amp; ((x &amp;amp; &lt;span class="pre"&gt;(x-1))&lt;/span&gt; == 0)&lt;/tt&gt;.
GCC compiles this to the following code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
; input/output: eax
; destroys: ebx

        test    %eax,  %eax     ; x == 0?
        jz      1f

        leal -1(%eax), %ebx     ; ebx := x-1
        test    %eax,  %ebx     ; ZF  := (eax &amp;amp; ebx == 0)

        setz     %al
        movzx    %al, %eax       ; eax := ZF
1:
&lt;/pre&gt;
&lt;p&gt;We can also use the &lt;tt class="docutils literal"&gt;BSF&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;BSR&lt;/tt&gt; instructions, which determine the position of the first and the last set bit, respectively. If a number is a power of 2, then just one bit is set, and thus these positions are equal. &lt;tt class="docutils literal"&gt;BSx&lt;/tt&gt; also sets the &lt;tt class="docutils literal"&gt;ZF&lt;/tt&gt; flag if the input value is zero.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
; input/output: eax
; destroys: ebx, edx

        bsf     %eax, %ebx      ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
        jz      1f
        bsr     %eax, %edx      ; edx := MSB's position

        cmp     %ebx, %edx      ; ZF  := (ebx = edx)

        setz    %al
        movzx   %al, %eax       ; eax := ZF
1:
&lt;/pre&gt;
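&lt;p&gt;The BSF/BSR variant can be modelled in Python for nonnegative integers. This is a sketch only: &lt;tt class="docutils literal"&gt;bsf&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;bsr&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;is_pow2&lt;/tt&gt; are illustrative helpers built on &lt;tt class="docutils literal"&gt;int.bit_length&lt;/tt&gt;, not wrappers around the CPU instructions.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def bsr(x):
    """Position of the highest set bit (x must be positive)."""
    return x.bit_length() - 1


def bsf(x):
    """Position of the lowest set bit (x must be positive)."""
    # x ^ (x - 1) keeps the lowest set bit and every bit below it.
    return (x ^ (x - 1)).bit_length() - 1


def is_pow2(x):
    # Zero has no set bit at all, so it is rejected first (the jz in the listing).
    return x != 0 and bsf(x) == bsr(x)


print([x for x in range(20) if is_pow2(x)])
```
&lt;/pre&gt;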
  </description>
 </item>
 <item>
  <title>Branchless conditional exchange</title>
  <link>http://0x80.pl/notesen/2010-04-08-branchless-xchg.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-04-08-branchless-xchg.html</guid>
  <pubDate>Thu, 08 Apr 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Suppose we have to exchange (or just move) two registers A and B:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;C := A xor B&lt;/li&gt;
&lt;li&gt;C := 0 if condition is not true&lt;/li&gt;
&lt;li&gt;A := A xor C&lt;/li&gt;
&lt;li&gt;B := B xor C&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If C is 0, then A and B are left unchanged; otherwise A and B are swapped.
If only a conditional move from B to A is needed, then step 4 has
to be skipped.&lt;/p&gt;
&lt;p&gt;Here is sample x86 code, where the condition is the value of CF:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
sbb edx, edx ; part of step 2. - edx = 0xffffffff if CF=1, 0x00000000 otherwise
mov ecx, eax
xor ecx, ebx ; step 1
and ecx, edx ; completed step 2. - now C is 0 or (A xor B)
xor eax, ecx ; step 3
xor ebx, ecx ; step 4
&lt;/pre&gt;
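&lt;p&gt;The four steps can also be written as a small Python sketch. The function names are hypothetical, and step 2 is realised here by multiplying with a 0/1 flag instead of ANDing with an all-ones or all-zeros mask as the x86 code does.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def cond_swap(a, b, cond):
    """Return (b, a) when cond is true, (a, b) otherwise, without branching."""
    c = a ^ b             # step 1
    c = c * bool(cond)    # step 2: C becomes 0 when the condition is false
    return a ^ c, b ^ c   # steps 3 and 4


def cond_move(a, b, cond):
    """Return b when cond is true, a otherwise (step 4 is skipped)."""
    return a ^ ((a ^ b) * bool(cond))


print(cond_swap(3, 5, True), cond_swap(3, 5, False))
print(cond_move(3, 5, True), cond_move(3, 5, False))
```
&lt;/pre&gt;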
&lt;p&gt;Branchless moves are possible on Pentium Pro and later CPUs with the cmovcc instructions.&lt;/p&gt;
&lt;p&gt;See also &lt;a class="reference external" href="http://en.wikipedia.org/wiki/XOR_linked_list"&gt;XOR linked list&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>STL: map with string as key --- access speedup</title>
  <link>http://0x80.pl/notesen/2010-04-03-stl-map-of-strings.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-04-03-stl-map-of-strings.html</guid>
  <pubDate>Sat, 03 Apr 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The idea is quite simple: we do not have a single &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::map&amp;lt;string,&lt;/span&gt; something&amp;gt;&lt;/tt&gt;,
but a vector of maps, indexed in O(1) time; each map stores keys sharing
certain properties. Drawback: additional memory.&lt;/p&gt;
&lt;p&gt;I've tested the following grouping schemes:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;the length of string,&lt;/li&gt;
&lt;li&gt;the first letter of string (one level trie),&lt;/li&gt;
&lt;li&gt;both length and the first letter.&lt;/li&gt;
&lt;/ol&gt;
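&lt;p&gt;A rough Python sketch of the third scheme follows. It only illustrates the indexing: &lt;tt class="docutils literal"&gt;GroupedMap&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;MAX_LEN&lt;/tt&gt; are hypothetical, and a Python &lt;tt class="docutils literal"&gt;dict&lt;/tt&gt; (a hash table) stands in for the tree-based map, so it says nothing about the timings.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
MAX_LEN = 32  # longer keys share the last length bucket (an arbitrary choice)


class GroupedMap:
    """Vector of maps indexed by (key length, first character)."""

    def __init__(self):
        # One small map per group; this is the additional memory cost.
        self.groups = [{} for _ in range((MAX_LEN + 1) * 256)]

    def _group(self, key):
        data = key.encode("utf-8")
        first = data[0] if data else 0
        # O(1) index computed from the two grouping properties.
        return self.groups[min(len(data), MAX_LEN) * 256 + first]

    def insert(self, key, value):
        self._group(key)[key] = value

    def find(self, key):
        return self._group(key).get(key)


words = "sing o goddess the anger of achilles".split()
index = GroupedMap()
for position, word in enumerate(words):
    index.insert(word, position)
print(index.find("achilles"), index.find("hector"))
```
&lt;/pre&gt;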
&lt;p&gt;The third is the fastest: it needs around &lt;strong&gt;60%&lt;/strong&gt; of the time taken by plain &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;std::map&lt;/span&gt;&lt;/tt&gt; from GCC
(red-black tree).&lt;/p&gt;
&lt;p&gt;Tests: my program reads plain text (I've used &lt;em&gt;The Iliad&lt;/em&gt; from &lt;a class="reference external" href="http://gutenberg.org"&gt;http://gutenberg.org&lt;/a&gt;),
the text is split into words (~190000) and then each word is inserted into a dictionary
(~28000 distinct words); then the same words are searched for in the dictionaries.
The table below summarizes the results on my computer (gcc 4.3.4 from Cygwin).&lt;/p&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="18%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;col width="14%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head" rowspan="2"&gt;data struct&lt;/th&gt;
&lt;th class="head" colspan="3"&gt;running time [ms]&lt;/th&gt;
&lt;th class="head" colspan="3"&gt;relative time [%]&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;th class="head"&gt;min&lt;/th&gt;
&lt;th class="head"&gt;avg&lt;/th&gt;
&lt;th class="head"&gt;max&lt;/th&gt;
&lt;th class="head"&gt;min&lt;/th&gt;
&lt;th class="head"&gt;avg&lt;/th&gt;
&lt;th class="head"&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td colspan="7"&gt;&lt;em&gt;inserting&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;std::map&lt;/td&gt;
&lt;td&gt;269&lt;/td&gt;
&lt;td&gt;287&lt;/td&gt;
&lt;td&gt;355&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;first char&lt;/td&gt;
&lt;td&gt;218&lt;/td&gt;
&lt;td&gt;241&lt;/td&gt;
&lt;td&gt;395&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;length&lt;/td&gt;
&lt;td&gt;218&lt;/td&gt;
&lt;td&gt;240&lt;/td&gt;
&lt;td&gt;345&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;len./char&lt;/td&gt;
&lt;td&gt;165&lt;/td&gt;
&lt;td&gt;172&lt;/td&gt;
&lt;td&gt;207&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;61&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan="7"&gt;&lt;em&gt;searching&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;std::map&lt;/td&gt;
&lt;td&gt;295&lt;/td&gt;
&lt;td&gt;322&lt;/td&gt;
&lt;td&gt;483&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;first char&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;263&lt;/td&gt;
&lt;td&gt;460&lt;/td&gt;
&lt;td&gt;82&lt;/td&gt;
&lt;td&gt;82&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;length&lt;/td&gt;
&lt;td&gt;238&lt;/td&gt;
&lt;td&gt;248&lt;/td&gt;
&lt;td&gt;292&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;len./char&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;td&gt;190&lt;/td&gt;
&lt;td&gt;241&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;62&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Download &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/stdmap-speedup"&gt;test program&lt;/a&gt;.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>Fill word with selected bit</title>
  <link>http://0x80.pl/notesen/2010-04-01-clone-bit.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-04-01-clone-bit.html</guid>
  <pubDate>Thu, 01 Apr 2010 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="the-most-general-algorithm"&gt;
&lt;h1&gt;The most general algorithm&lt;/h1&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;mask bit:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
[10111010] =&amp;gt; [00010000]
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;clone word:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
a=[00010000], b=[00010000]
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;shift bit in first word to MSB, and to LSB in second word:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
a=[10000000], b=[00000001]
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;subtract c = a - b:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
c=[01111111]
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;add missing MSB &lt;strong&gt;c = c OR a&lt;/strong&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
c=[11111111]
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fill1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>Branchless signum</title>
  <link>http://0x80.pl/notesen/2010-04-01-branchless-signum.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-04-01-branchless-signum.html</guid>
  <pubDate>Thu, 01 Apr 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Problem: calculate value of &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Sign_function"&gt;sign(x)&lt;/a&gt;:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;-1 when x &amp;lt; 0,&lt;/li&gt;
&lt;li&gt;0 when x = 0,&lt;/li&gt;
&lt;li&gt;+1 when x &amp;gt; 0.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My solution does not involve any hardware-specific things like ALU flags or
special instructions, just plain AND, ADD, OR and shifts.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
; input: eax = X

movl %eax, %ebx
sarl $31, %eax  // eax = -1 if X less than zero, 0 otherwise

andl $0x7fffffff, %ebx
addl $0x7fffffff, %ebx // MSB is set if any lower bits were set
shrl $31, %ebx  // ebx = +1 if X greater than zero, 0 otherwise

orl %ebx, %eax  // eax = result
&lt;/pre&gt;
&lt;p&gt;C99 implementation:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
int32_t sign(int32_t x) {
        /* unsigned arithmetic: the addition may carry into bit 31 */
        uint32_t y = ((uint32_t)x &amp;amp; 0x7fffffff) + 0x7fffffff;
        return (x &amp;gt;&amp;gt; 31) | (int32_t)(y &amp;gt;&amp;gt; 31);
}
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>Transpose bits in byte using SIMD instructions</title>
  <link>http://0x80.pl/notesen/2010-03-31-simd-transpose-bits.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-03-31-simd-transpose-bits.html</guid>
  <pubDate>Wed, 31 Mar 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The method presented here can produce any bit permutation; transposition
is just one of the possible operations. A lookup-based approach would be
faster, but the algorithm is worth (re)showing.&lt;/p&gt;
&lt;p&gt;Algorithm outline for an 8-byte vector (with SSE instructions it is possible
to perform 2 operations in parallel):&lt;/p&gt;
&lt;ol class="arabic"&gt;
&lt;li&gt;&lt;p class="first"&gt;fill vector with given byte:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
[11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001]
    ▲        ▲        ▲        ▲        ▲        ▲        ▲        ▲
    │        │        │        │        │        │        │        │
[11010001] ╶─┴────────┴────────┴────────┴────────┴────────┴────────┘&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;leave one bit per byte:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
[&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0000000|0&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;000000|00&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;00000|000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0000|0000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;000|00000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;00|000000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;0|0000000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;]&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;perform desired transposition (&amp;quot;move&amp;quot; bits around):&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
[0000000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;|000000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0|00000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;00|0000&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;000|000&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;0000|00&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;00000|0&lt;span style="color: blue; font-weight: bold"&gt;0&lt;/span&gt;000000|&lt;span style="color: blue; font-weight: bold"&gt;1&lt;/span&gt;0000000]&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p class="first"&gt;perform horizontal OR of all bytes:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;[10001011]&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;
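&lt;p&gt;A scalar Python model of the four steps may help to follow the outline. It is a sketch, not SIMD code: &lt;tt class="docutils literal"&gt;permute_bits&lt;/tt&gt; and the &lt;tt class="docutils literal"&gt;destination&lt;/tt&gt; list are hypothetical, bit 0 is taken to be the least significant bit, and the list must be a permutation of 0..7 so that the final sum equals the horizontal OR.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
def permute_bits(byte, destination):
    """Move bit i of `byte` to position destination[i] (bit 0 is the LSB)."""
    # Steps 1-2: one copy of the byte per lane, one bit kept in each lane.
    lanes = [(byte >> i) % 2 for i in range(8)]
    # Step 3: place every kept bit at its new position.
    moved = [bit * 2**dst for bit, dst in zip(lanes, destination)]
    # Step 4: the positions are distinct, so the sum equals the horizontal OR.
    return sum(moved)


reverse = [7 - i for i in range(8)]
print(format(permute_bits(0b11010001, reverse), "08b"))
```
&lt;/pre&gt;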
&lt;p&gt;Here is my &lt;a class="reference external" href="/articles/snippets.html#transpozycja-bitow-update"&gt;old MMX code&lt;/a&gt; (Polish text); the SSE/SSE5 implementation details are given below.&lt;/p&gt;
&lt;p&gt;Ad 1. A series of punpcklbw/punpcklwd/shufps, or pshufb if the CPU supports
SSSE3.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
# 1.1
movd       %eax, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0
shufps    $0x00, %xmm0, %xmm0

# 1.2
pxor      %xmm1, %xmm1
movd       %eax, %xmm0
pshufb    %xmm1, %xmm0
&lt;/pre&gt;
&lt;p&gt;Ad 2. Simple pand with mask packed_qword(0x8040201008040201).&lt;/p&gt;
&lt;pre class="literal-block"&gt;
pand  MASK1, %xmm0
&lt;/pre&gt;
&lt;p&gt;Ad 3. If only plain SSE instructions are supported, this step requires some
work. First, each bit is expanded to fill the whole byte (using
&lt;tt class="docutils literal"&gt;pcmpeqb&lt;/tt&gt;, which gives a negated result), then the bits at the desired positions are masked.&lt;/p&gt;
&lt;p&gt;SSE5 has a powerful instruction, &lt;tt class="docutils literal"&gt;protb&lt;/tt&gt;, that can rotate
each byte by an independent amount, so in this case just one
instruction is needed.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
# SSE
pcmpeqb %xmm1, %xmm0
pandn   MASK2, %xmm0    # pandn - to negate

# SSE5
protb    ROT, %xmm0, %xmm0
&lt;/pre&gt;
&lt;p&gt;Ad 4. Since the bits are placed at distinct positions, we can use the
&lt;tt class="docutils literal"&gt;psadbw&lt;/tt&gt; instruction, which calculates horizontal sums of byte-wise absolute
differences between two registers (separately for the low and high register
halves). If one register is full of zeros, we get the sum of bytes from
the other register.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
psadbw  %xmm1, %xmm0
movd    %xmm0, %eax
&lt;/pre&gt;
&lt;p&gt;Depending on instruction set, three (SSE) or two (SSE5) additional tables are needed.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>PostgreSQL: get selected rows with given order</title>
  <link>http://0x80.pl/notesen/2010-03-30-postgresq-get-rows-in-order.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2010-03-30-postgresq-get-rows-in-order.html</guid>
  <pubDate>Tue, 30 Mar 2010 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Suppose that a database stores some kind of a dictionary and a user picks some
items, but wants to keep the order. For example the dictionary has entries with
id=0..10, and the user picked 9, 2, 4 and 0. This simple query does the
job:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
SELECT dictionary.* FROM dictionary
INNER JOIN (SELECT (ARRAY[9,2,4,0])[i] AS index, i AS ord
            FROM generate_series(1, 4) AS i) AS foo
ON dictionary.id = foo.index ORDER BY foo.ord;
&lt;/pre&gt;
  </description>
 </item>
 <item>
  <title>Join locate databases</title>
  <link>http://0x80.pl/notesen/2008-12-03-join-locate.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-12-03-join-locate.html</guid>
  <pubDate>Wed, 03 Dec 2008 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;man locatedb says: &lt;em&gt;&amp;quot;Databases can not be concatenated together, even if
the first (dummy) entry is trimmed from all but the first database. This
is because the offset-differential count in the first entry of the
second and following databases will be wrong&amp;quot;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is true if we follow the man page authors literally, but concatenation is possible
without re-encoding any database.&lt;/p&gt;
&lt;p&gt;For details about the compression scheme please refer to
&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Incremental_coding"&gt;Wikipedia&lt;/a&gt;; the file format is described in
&lt;tt class="docutils literal"&gt;man locatedb&lt;/tt&gt;. In short: compression is based on common prefix
elimination in a sequence of strings. When a string shares a prefix
with the previous string, we store the pair (length of prefix, rest of string).
For example, if the previous string is &amp;quot;aaabbb&amp;quot; and the current one is &amp;quot;aaabcd&amp;quot;, then
the output is (4, &amp;quot;cd&amp;quot;), where 4 is the length of the common prefix &amp;quot;aaab&amp;quot;. Locate
files also store differences between prefix lengths; for example (4,
&amp;quot;...&amp;quot;), (5, &amp;quot;...&amp;quot;), (2, &amp;quot;...&amp;quot;) is encoded as (4, &amp;quot;...&amp;quot;), (5-4=1, &amp;quot;...&amp;quot;),
(2-5=-3, &amp;quot;...&amp;quot;). This is the reason why we can't simply join database
files.&lt;/p&gt;
&lt;p&gt;However, joining locate files isn't very complicated and, as I previously
stated, does not require re-encoding the databases. We have to set the diff value
of the first entry of the appended file to the negated length of the common
prefix of the last entry of the first file.&lt;/p&gt;
&lt;p&gt;For example when the first file contains three entries (0, &amp;quot;...&amp;quot;), (10,
&amp;quot;...&amp;quot;), (-2, &amp;quot;...&amp;quot;), then last length is 0+10-2 = 8. The second file
contains (0, &amp;quot;...&amp;quot;), (5, &amp;quot;...&amp;quot;). After join: (0, &amp;quot;...&amp;quot;), (10, &amp;quot;...&amp;quot;),
(-2, &amp;quot;...&amp;quot;), (&lt;strong&gt;-8&lt;/strong&gt;, &amp;quot;...&amp;quot;), (5, &amp;quot;...&amp;quot;).&lt;/p&gt;
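&lt;p&gt;The arithmetic of this example can be replayed with a short Python sketch. It models entries as plain (diff, suffix) pairs and ignores the on-disk format and the dummy header entry; &lt;tt class="docutils literal"&gt;prefix_lengths&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;join&lt;/tt&gt; are hypothetical helpers, not functions of the utility mentioned in this note.&lt;/p&gt;
&lt;pre class="code python literal-block"&gt;
```python
from itertools import accumulate


def prefix_lengths(entries):
    """Decode the differential counts back into absolute prefix lengths."""
    return list(accumulate(diff for diff, _ in entries))


def join(first, second):
    """Append `second` to `first`; entries are (diff, suffix) pairs."""
    last_len = prefix_lengths(first)[-1]
    diff, suffix = second[0]
    # Cancel the prefix length left over from the last entry of `first`.
    return first + [(diff - last_len, suffix)] + second[1:]


first = [(0, "..."), (10, "..."), (-2, "...")]
second = [(0, "..."), (5, "...")]
joined = join(first, second)
print([diff for diff, _ in joined])
print(prefix_lengths(joined))
```
&lt;/pre&gt;
&lt;p&gt;After the join, the decoded prefix length drops back to zero at the first entry of the appended file, which is what a decoder expects at that point.&lt;/p&gt;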
&lt;p&gt;Some time ago I wrote a Python &lt;a class="reference external" href="https://github.com/WojciechMula/locatedb"&gt;utility/library&lt;/a&gt;, and I have now extended it to
perform this task. Implementation details:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;To obtain the length L of the common prefix of the last entry, the first database is
decoded in dry mode (no results are saved).&lt;/li&gt;
&lt;li&gt;Then the first file is simply copied to the output file.&lt;/li&gt;
&lt;li&gt;Before copying the second database, skip the first dummy entry (diff=0,
string=&amp;quot;LOCATE02&amp;quot;) and skip the diff=0 of the second entry; this needs only a simple
file seek. Then save diff=-L, and finally copy the rest of the second database
file.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I've tested joined database with native Linux locate (under Cygwin) and
didn't notice any problems.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>SSE4.1: PHMINPOSUW --- insertion sort</title>
  <link>http://0x80.pl/notesen/2008-08-03-sse4-insertionsort.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-08-03-sse4-insertionsort.html</guid>
  <pubDate>Sun, 03 Aug 2008 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;An unusual application of the PHMINPOSUW instruction as the key part
of an insertion sort for 8-element tables. I guess it won't find
any practical use.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/WojciechMula/toys/blob/master/sse/sse4-insertionsort.c"&gt;Implementation&lt;/a&gt;:&lt;/p&gt;
&lt;pre class="code literal-block"&gt;
typedef uint16_t table[8];

table max[8] = {
    {0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000},
    {0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000},
    {0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000},
    {0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000},
    {0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000},
    {0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000},
    {0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000},
    {0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff}
};

void sse4_sort(table T) {
    uint32_t dummy;

    __asm__ volatile (
    &amp;quot;       movdqu (%%eax), %%xmm0          \n&amp;quot;
    &amp;quot;       xor %%ecx, %%ecx                \n&amp;quot;     // i = 0
    &amp;quot;1:                                     \n&amp;quot;
    &amp;quot;       phminposuw %%xmm0, %%xmm1       \n&amp;quot;     // find min, and its index j
    &amp;quot;       movd %%xmm1, %%edx              \n&amp;quot;
    &amp;quot;       movw   %%dx, (%%eax, %%ecx, 2)  \n&amp;quot;     // save min at i-th position
    &amp;quot;                                       \n&amp;quot;
    &amp;quot;       shrl   $16, %%edx               \n&amp;quot;
    &amp;quot;       shll    $4, %%edx               \n&amp;quot;
    &amp;quot;                                       \n&amp;quot;
    &amp;quot;       por  max(%%edx), %%xmm0         \n&amp;quot;     // set max at position j
    &amp;quot;                                       \n&amp;quot;
    &amp;quot;       addl    $1, %%ecx               \n&amp;quot;     // i += 1
    &amp;quot;       cmp     $8, %%ecx               \n&amp;quot;
    &amp;quot;       jl      1b                      \n&amp;quot;

    :
    : &amp;quot;a&amp;quot; (T)
    : &amp;quot;ecx&amp;quot;, &amp;quot;edx&amp;quot;, &amp;quot;memory&amp;quot;, &amp;quot;cc&amp;quot;
    );
}
&lt;/pre&gt;
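&lt;p&gt;The loop of the assembly code can be modeled in plain C, which may make the idea easier
to follow. This is a sketch, not part of the linked implementation: the names
&lt;tt class="docutils literal"&gt;phminposuw_model&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;sort8_model&lt;/tt&gt; are hypothetical,
and the instruction is emulated with a scalar loop.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

/* Scalar model of PHMINPOSUW: returns the minimum of 8 unsigned words
 * and stores the index of its first occurrence in *index. */
static uint16_t phminposuw_model(const uint16_t v[8], int *index)
{
    uint16_t min = v[0];
    int i;

    *index = 0;
    for (i = 1; i &amp;lt; 8; i++) {
        if (v[i] &amp;lt; min) {
            min = v[i];
            *index = i;
        }
    }
    return min;
}

/* Same loop as in the assembly code: take the minimum, store it at
 * position i, then overwrite its old slot with 0xffff (the maximum). */
void sort8_model(uint16_t T[8])
{
    uint16_t work[8];
    int i, j;

    for (i = 0; i &amp;lt; 8; i++)
        work[i] = T[i];

    for (i = 0; i &amp;lt; 8; i++) {
        T[i] = phminposuw_model(work, &amp;amp;j);
        work[j] = 0xffff;
    }
}
&lt;/pre&gt;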
  </description>
 </item>
 <item>
  <title>SSSE3: PMADDUBSW and image crossfading</title>
  <link>http://0x80.pl/notesen/2008-06-21-sse4-crossfading.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-21-sse4-crossfading.html</guid>
  <pubDate>Sat, 21 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Image crossfading is a kind of alpha blending where a final pixel is
the result of linear interpolation of pixels from two images:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
result_pixel = pixel1 * alpha + pixel2 * (1 - alpha)
&lt;/pre&gt;
&lt;p&gt;where alpha lies in the range [0, 1].  Of course, when operating on &amp;quot;pixels&amp;quot;,
color components are considered; the components are unsigned bytes.&lt;/p&gt;
&lt;p&gt;SSSE3 introduced the instruction &lt;tt class="docutils literal"&gt;PMADDUBSW&lt;/tt&gt;.  This instruction multiplies
a destination vector of &lt;strong&gt;unsigned&lt;/strong&gt; &lt;strong&gt;bytes&lt;/strong&gt; by a source vector of
&lt;strong&gt;signed&lt;/strong&gt; &lt;strong&gt;bytes&lt;/strong&gt; &amp;mdash; the result is a vector of signed words.  Then
adjacent words are added with &lt;strong&gt;signed&lt;/strong&gt; saturation (the same operation
as &lt;tt class="docutils literal"&gt;PHADDSW&lt;/tt&gt;).&lt;/p&gt;
&lt;p&gt;This is exactly what crossfading needs.&lt;/p&gt;
&lt;p&gt;The obvious drawback is that the instruction operates on signed values.
Because &lt;tt class="docutils literal"&gt;alpha&lt;/tt&gt; must be positive, this reduces the resolution of alpha from 8
to 7 bits.  (was: &lt;em&gt;Because multiplication results are signed and then added,
the sum must not be greater than 32767 &amp;mdash; this requirement reduces
resolution by another bit.  Finally alpha must lie in range [0..63].&lt;/em&gt;)
&lt;a class="reference external" href="https://github.com/radioneko"&gt;Dmitry Petrov&lt;/a&gt; pointed out that alpha can be a 7-bit value, as such
a value never causes an overflow. Let's assume that both &lt;tt class="docutils literal"&gt;pixel1&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;pixel2&lt;/tt&gt; have the maximum value, and check whether the following inequality is true:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
(1) 255 * alpha + 255 * (127 - alpha) &amp;lt; 2^15 - 1
(2)                         255 * 127 &amp;lt; 2^15 - 1
(3)                             32385 &amp;lt; 32767
&lt;/pre&gt;
&lt;p&gt;Obviously the inequality is true.&lt;/p&gt;
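&lt;p&gt;A scalar C model of a single &lt;tt class="docutils literal"&gt;PMADDUBSW&lt;/tt&gt; lane shows how the 7-bit alpha is used.
This is a sketch under stated assumptions: the function names are hypothetical, and the final
division by 127 is just one possible way to normalize the result, picked here for clarity.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

/* Scalar model of one PMADDUBSW lane: two unsigned bytes are multiplied
 * by two signed bytes and the products are added with signed saturation. */
int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1)
{
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;

    if (sum &amp;gt; 32767)  sum = 32767;
    if (sum &amp;lt; -32768) sum = -32768;
    return (int16_t)sum;
}

/* Crossfade of a single color component; alpha is in the range [0, 127]. */
uint8_t crossfade(uint8_t pixel1, uint8_t pixel2, uint8_t alpha)
{
    int16_t sum = pmaddubsw_lane(pixel1, pixel2,
                                 (int8_t)alpha, (int8_t)(127 - alpha));

    return (uint8_t)(sum / 127);  /* normalization assumed by this sketch */
}
&lt;/pre&gt;
&lt;p&gt;For two components equal to 255 the intermediate sum is 32385 for every alpha,
so the saturation never triggers.&lt;/p&gt;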
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE: conversion uint32 to float</title>
  <link>http://0x80.pl/notesen/2008-06-18-sse-uint32-to-float.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-18-sse-uint32-to-float.html</guid>
  <pubDate>Wed, 18 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;There is no such instruction &amp;mdash; &lt;tt class="docutils literal"&gt;CVTDQ2PS&lt;/tt&gt; converts signed 32-bit
ints.  Solution: first zero the MSB; such a number is never negative in two's complement,
so the mentioned instruction can be used.  Then add &lt;span class="math"&gt;2&lt;sup&gt;31&lt;/sup&gt;&lt;/span&gt; if the MSB
was set.&lt;/p&gt;
&lt;pre class="code cpp literal-block"&gt;
&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;CONST&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;SIMD_ALIGN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;packed_float&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)((&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cm"&gt;/* 2^31 */&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MASK_0_30&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SIMD_ALIGN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;packed_dword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0x7fffffff&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MASK_31&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;SIMD_ALIGN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;packed_dword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0x80000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;convert_uint32_float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;__asm__&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;volatile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;movdqu   (%%eax), %%xmm0  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;movdqa    %%xmm0, %%xmm1  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pand   MASK_0_30, %%xmm0  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// xmm0 - mask MSB bit - never less than zero in two's complement
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;cvtdq2ps  %%xmm0, %%xmm0  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// convert this value to float
&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;psrad        $32, %%xmm1  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// populate MSB in higher word (enough to mask CONST)
&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pand       CONST, %%xmm1  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// xmm1 = MSB set ? float(2^31) : float(0)
&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;addps     %%xmm1, %%xmm0  &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// add 2^31 if MSB set
&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;movdqu    %%xmm0, (%%ebx) &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="cm"&gt;/* no output */&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;b&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;See &lt;a class="reference external" href="https://github.com/WojciechMula/toys/tree/master/sse-uint32-float"&gt;a sample implementation&lt;/a&gt;.&lt;/p&gt;
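&lt;p&gt;The same idea for a single value can be written in portable C. This is a scalar sketch,
not the linked implementation; the function name &lt;tt class="docutils literal"&gt;uint32_to_float&lt;/tt&gt; is hypothetical.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

/* The low 31 bits are converted as a signed (never negative) integer,
 * then 2^31 is added if the MSB was set. */
float uint32_to_float(uint32_t x)
{
    int32_t low   = (int32_t)(x &amp;amp; 0x7fffffffu);   /* MSB cleared */
    float   value = (float)low;                   /* signed conversion */

    if (x &amp;amp; 0x80000000u)
        value += 2147483648.0f;                   /* 2^31 */

    return value;
}
&lt;/pre&gt;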
  </description>
 </item>
 <item>
  <title>Floating point tricks</title>
  <link>http://0x80.pl/notesen/2008-06-15-fptricks.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-15-fptricks.html</guid>
  <pubDate>Sun, 15 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="converting-float-to-int"&gt;
&lt;h1&gt;Converting float to int&lt;/h1&gt;
&lt;p&gt;A few years ago I developed a method that does not need any floating-point
operations &amp;mdash; the &lt;a class="reference external" href="/articles/snippets.html#konwersja-float-na-int"&gt;description&lt;/a&gt; is written in Polish, but the sample code
should be easy to understand.  In short, the mantissa is completed with
the implicit bit 23 (or 52) and treated as a natural number.  Then this number
is shifted left or right to place the binary point at position 0 &amp;mdash; the shift amount
depends on the exponent value.&lt;/p&gt;
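&lt;p&gt;A minimal sketch of this integer-only method for single-precision floats could look as follows.
It is not the code from the linked description: the function name is hypothetical, an IEEE 754
&lt;tt class="docutils literal"&gt;float&lt;/tt&gt; is assumed, and infinities, NaNs and values that do not fit in
&lt;tt class="docutils literal"&gt;int32_t&lt;/tt&gt; are not handled.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

/* Truncating float -&amp;gt; int conversion that uses integer operations only. */
int32_t float_to_int_bitwise(float x)
{
    uint32_t bits;
    memcpy(&amp;amp;bits, &amp;amp;x, sizeof bits);

    int      negative = (int)(bits &amp;gt;&amp;gt; 31);
    int      exponent = (int)((bits &amp;gt;&amp;gt; 23) &amp;amp; 0xff) - 127;
    uint32_t mantissa = (bits &amp;amp; 0x007fffffu) | 0x00800000u; /* implicit bit 23 */
    uint32_t value;

    if (exponent &amp;lt; 0)
        return 0;                       /* |x| is less than 1 */

    if (exponent &amp;gt; 30)
        return 0;                       /* out of range, not handled here */

    if (exponent &amp;gt;= 23)
        value = mantissa &amp;lt;&amp;lt; (exponent - 23);
    else
        value = mantissa &amp;gt;&amp;gt; (23 - exponent);

    return negative ? -(int32_t)value : (int32_t)value;
}
&lt;/pre&gt;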
&lt;p&gt;Another method uses floating-point operations and is limited to
positive numbers less than &lt;span class="math"&gt;2&lt;sup&gt;23&lt;/sup&gt;&lt;/span&gt; for floats (and &lt;span class="math"&gt;2&lt;sup&gt;52&lt;/sup&gt;&lt;/span&gt;
for doubles).&lt;/p&gt;
&lt;p&gt;When the value &lt;span class="math"&gt;2&lt;sup&gt;23&lt;/sup&gt;&lt;/span&gt; is added to such a float, just the 23
most significant bits are stored &amp;mdash; the fraction bits are shifted out.&lt;/p&gt;
&lt;p&gt;Let's see an example: the number 7.25 (111.01 in binary) has the following floating-point
representation:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌─┬────────┬───────────────────────┐
│0│10000001│&lt;span style="color: blue; font-weight: bold"&gt;1101&lt;/span&gt;0000000000000000000│
└─┴────────┴───────────────────────┘
 S exp+127    normalized mantissa&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After adding &lt;span class="math"&gt;2&lt;sup&gt;23&lt;/sup&gt;&lt;/span&gt;:&lt;/p&gt;
&lt;div class="asciidiag"&gt;&lt;pre class="asciidiag"&gt;
┌─┬────────┬───────────────────────┐
│0│10010110│00000000000000000000&lt;span style="color: blue; font-weight: bold"&gt;111&lt;/span&gt;│
└─┴────────┴───────────────────────┘
 S exp+127    normalized mantissa&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The mantissa field &lt;strong&gt;treated as a natural number&lt;/strong&gt; contains the integer part of the number.&lt;/p&gt;
&lt;p&gt;Because addition is used, the result is rounded or truncated, depending
on the current FPU rounding settings.  When a bare bit shift is used instead of
addition (as in the method mentioned earlier), the number is always truncated.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: this method could be used to get a fixed-point value, just a smaller constant
is needed: &lt;span class="math"&gt;2&lt;sup&gt;23 &amp;minus; &lt;i&gt;fraction bits&lt;/i&gt;&lt;/sup&gt;&lt;/span&gt;, but this also limits the maximum value
of the float.&lt;/p&gt;
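&lt;p&gt;Both the integer and the fixed-point variant can be sketched for single-precision floats in one
function. The name &lt;tt class="docutils literal"&gt;float_to_fixed&lt;/tt&gt; and its parameters are assumptions of this
sketch; an IEEE 754 &lt;tt class="docutils literal"&gt;float&lt;/tt&gt; is assumed, and the results in the remark after the
code additionally assume the default round-to-nearest mode.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

/* Converts a non-negative float to fixed point with `fraction_bits`
 * fractional bits by adding 2^(23 - fraction_bits) and reading the
 * mantissa field.  Valid for 0 &amp;lt;= x &amp;lt; 2^(23 - fraction_bits);
 * fraction_bits = 0 gives a plain integer. */
uint32_t float_to_fixed(float x, int fraction_bits)
{
    const float C = (float)(1u &amp;lt;&amp;lt; (23 - fraction_bits));
    volatile float sum = x + C;   /* volatile: force rounding to float */
    float    tmp = sum;
    uint32_t bits;

    memcpy(&amp;amp;bits, &amp;amp;tmp, sizeof bits);
    return bits &amp;amp; 0x007fffffu;    /* the 23-bit mantissa field */
}
&lt;/pre&gt;
&lt;p&gt;For 7.25 the function returns 7 with no fraction bits and 116 (7.25 * 16) with 4 fraction bits.&lt;/p&gt;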
&lt;p&gt;Implementation from sample program &lt;a class="reference external" href="https://github.com/WojciechMula/toys/blob/master/floating-point/float2int.c"&gt;float2int.c&lt;/a&gt;:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
void convert_simple() {
        double C = (1ll &amp;lt;&amp;lt; 52);
        union {
                double  val;
                int64_t bin;
        } tmp;

        int i;
        for (i=0; i &amp;lt; SIZE; i++) {
                tmp.val  = in[i] + C;
                tmp.bin  = tmp.bin &amp;amp; 0x000fffffffffffffll;
                out_2[i] = tmp.bin;
        }
}
&lt;/pre&gt;
&lt;p&gt;However, this method is slower than ordinary FPU instructions, e.g.:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
fldl    (%eax)
fistpl  (%ebx)  (or fisttpl (%ebx) on a CPU with SSE3)
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>RDTSC on Core2</title>
  <link>http://0x80.pl/notesen/2008-06-08-rdtsc-on-core2.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-08-rdtsc-on-core2.html</guid>
  <pubDate>Sun, 08 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;The counter read by RDTSC is incremented on bus-clock cycles, and the increment is multiplied by
the core-clock/bus-clock ratio.  From the programmer's point of view, the RDTSC counter is
incremented by a value greater than 1; for example, on a C2D E8200 it is 8.&lt;/p&gt;
&lt;p&gt;The latency of RDTSC on Pentium 4 is about 60-120 cycles, on AMD CPUs
around 6 cycles.&lt;/p&gt;
  </description>
 </item>
 <item>
  <title>PABSQ --- absolute value of two signed 64-bit numbers</title>
  <link>http://0x80.pl/notesen/2008-06-08-pabsq.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-08-pabsq.html</guid>
  <pubDate>Sun, 08 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;p&gt;Branch-less x86 code:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
movl  %eax, %ebx
sarl   $31, %ebx        ; fill ebx with sign bit (the shift count is taken modulo 32)
xorl  %ebx, %eax        ; invert bits of eax (if negative)
subl  %ebx, %eax        ; increment eax by 1 (if negative)
&lt;/pre&gt;
&lt;p&gt;SSE2:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
pshufd $0b11110101, %xmm0, %xmm1        ; populate dwords 3 and 1
psrad   $32, %xmm1      ; fill quad words with sign bit
pxor  %xmm1, %xmm0      ; negate (if negative)
psubq %xmm1, %xmm0      ; increment (if negative)
&lt;/pre&gt;
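&lt;p&gt;The same xor/subtract idiom for one 64-bit value, written in C. This is a sketch: the function
name &lt;tt class="docutils literal"&gt;abs64&lt;/tt&gt; is hypothetical, and it assumes that the compiler implements the right
shift of a negative value as an arithmetic shift.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

int64_t abs64(int64_t x)
{
    /* All ones if x is negative, zero otherwise.  Right-shifting a negative
     * value is implementation-defined in C; an arithmetic shift is assumed. */
    uint64_t mask = (uint64_t)(x &amp;gt;&amp;gt; 63);

    /* xor inverts the bits, subtracting the mask (i.e. -1) adds one */
    return (int64_t)(((uint64_t)x ^ mask) - mask);
}
&lt;/pre&gt;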
  </description>
 </item>
 <item>
  <title>GCC asm constraints</title>
  <link>http://0x80.pl/notesen/2008-06-07-gcc-asm-constraints.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-07-gcc-asm-constraints.html</guid>
  <pubDate>Sat, 07 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="read-write-variables"&gt;
&lt;h1&gt;Read-write variables&lt;/h1&gt;
&lt;pre class="literal-block"&gt;
asm(
        &amp;quot;...&amp;quot;
        : &amp;quot;+a&amp;quot; (var)
);
&lt;/pre&gt;
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSSE3/SSE4: alpha blending --- operator over</title>
  <link>http://0x80.pl/notesen/2008-06-03-sse4-alphaover.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-03-sse4-alphaover.html</guid>
  <pubDate>Tue, 03 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/Alpha_blending"&gt;Alpha blending&lt;/a&gt; refers to many different operations.  This note
describes results for the &lt;strong&gt;over&lt;/strong&gt; operator that works on RGBA pixels with
premultiplied alpha.&lt;/p&gt;
&lt;p&gt;Basic formula:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
background = (alpha * foreground) + background
&lt;/pre&gt;
&lt;p&gt;where &lt;tt class="docutils literal"&gt;alpha&lt;/tt&gt; is in the range &lt;tt class="docutils literal"&gt;[0 .. 255]&lt;/tt&gt;, and &lt;tt class="docutils literal"&gt;+&lt;/tt&gt; denotes &lt;em&gt;add with
saturation&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The reference implementation in C-like pseudocode:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
Rf =  foreground &amp;amp; 0xff
Gf = (foreground &amp;gt;&amp;gt;  8) &amp;amp; 0xff
Bf = (foreground &amp;gt;&amp;gt; 16) &amp;amp; 0xff
Af = (foreground &amp;gt;&amp;gt; 24) &amp;amp; 0xff

Rb =  background &amp;amp; 0xff
Gb = (background &amp;gt;&amp;gt;  8) &amp;amp; 0xff
Bb = (background &amp;gt;&amp;gt; 16) &amp;amp; 0xff

R = (Rf * Af)/256 + Rb
G = (Gf * Af)/256 + Gb
B = (Bf * Af)/256 + Bb

if (R &amp;gt; 255) R = 255
if (G &amp;gt; 255) G = 255
if (B &amp;gt; 255) B = 255

background = R | (G &amp;lt;&amp;lt; 8) | (B &amp;lt;&amp;lt; 16)
&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: dividing by 256 never yields the component value 255 &amp;mdash; to obtain the correct
range some additional operations are needed.  Probably no one will notice
the difference.&lt;/p&gt;
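&lt;p&gt;The reference steps can be put into a compilable C function. This is a sketch that follows the
listing above literally (including the division by 256); the names &lt;tt class="docutils literal"&gt;blend_over&lt;/tt&gt; and
&lt;tt class="docutils literal"&gt;saturate255&lt;/tt&gt; are hypothetical.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t saturate255(uint32_t x)
{
    return x &amp;gt; 255 ? 255 : x;
}

/* Blends one RGBA foreground pixel over a background pixel: scale by
 * alpha, divide by 256, add with saturation. */
uint32_t blend_over(uint32_t foreground, uint32_t background)
{
    uint32_t Rf =  foreground        &amp;amp; 0xff;
    uint32_t Gf = (foreground &amp;gt;&amp;gt;  8) &amp;amp; 0xff;
    uint32_t Bf = (foreground &amp;gt;&amp;gt; 16) &amp;amp; 0xff;
    uint32_t Af = (foreground &amp;gt;&amp;gt; 24) &amp;amp; 0xff;

    uint32_t Rb =  background        &amp;amp; 0xff;
    uint32_t Gb = (background &amp;gt;&amp;gt;  8) &amp;amp; 0xff;
    uint32_t Bb = (background &amp;gt;&amp;gt; 16) &amp;amp; 0xff;

    uint32_t R = saturate255((Rf * Af) / 256 + Rb);
    uint32_t G = saturate255((Gf * Af) / 256 + Gb);
    uint32_t B = saturate255((Bf * Af) / 256 + Bb);

    return R | (G &amp;lt;&amp;lt; 8) | (B &amp;lt;&amp;lt; 16);
}
&lt;/pre&gt;
&lt;p&gt;With a fully opaque, maximal red component over a black background the result is 254,
which illustrates the remark about the division by 256.&lt;/p&gt;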
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE4: greater/less or equal relations for unsigned bytes/words</title>
  <link>http://0x80.pl/notesen/2008-06-02-sse4-unsigned-gtlt.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-02-sse4-unsigned-gtlt.html</guid>
  <pubDate>Mon, 02 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="proof"&gt;
&lt;h1&gt;Proof&lt;/h1&gt;
&lt;p&gt;In the proof we consider three cases: x &amp;lt; y, x = y and x &amp;gt; y.&lt;/p&gt;
&lt;p&gt;The second and the last column (i.e. the left and the right side of the equivalence)
are the same in every case, for both relations. QED&lt;/p&gt;
&lt;div class="section" id="greater-or-equal"&gt;
&lt;h2&gt;Greater or equal&lt;/h2&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="23%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="25%" /&gt;
&lt;col width="34%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;case&lt;/th&gt;
&lt;th class="head"&gt;x &amp;lt;= y&lt;/th&gt;
&lt;th class="head"&gt;min(x, y)&lt;/th&gt;
&lt;th class="head"&gt;min(x, y) = x&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;x &amp;lt; y&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;x = y&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;x &amp;gt; y&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;td&gt;y&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div class="section" id="less-or-equal"&gt;
&lt;h2&gt;Less or equal&lt;/h2&gt;
&lt;table border="1" class="docutils"&gt;
&lt;colgroup&gt;
&lt;col width="23%" /&gt;
&lt;col width="18%" /&gt;
&lt;col width="25%" /&gt;
&lt;col width="34%" /&gt;
&lt;/colgroup&gt;
&lt;thead valign="bottom"&gt;
&lt;tr&gt;&lt;th class="head"&gt;case&lt;/th&gt;
&lt;th class="head"&gt;x &amp;gt;= y&lt;/th&gt;
&lt;th class="head"&gt;max(x, y)&lt;/th&gt;
&lt;th class="head"&gt;max(x, y) = x&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td&gt;x &amp;lt; y&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;td&gt;y&lt;/td&gt;
&lt;td&gt;false&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;x = y&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;x &amp;gt; y&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;x&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
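&lt;p&gt;Both equivalences from the tables can also be checked exhaustively for bytes with a few lines
of C. This is a scalar sketch with hypothetical function names; in SIMD code the min/max and
equality operations would map to instructions such as &lt;tt class="docutils literal"&gt;PMINUB&lt;/tt&gt;,
&lt;tt class="docutils literal"&gt;PMAXUB&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;PCMPEQB&lt;/tt&gt;.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

static uint8_t min_u8(uint8_t x, uint8_t y) { return x &amp;lt; y ? x : y; }
static uint8_t max_u8(uint8_t x, uint8_t y) { return x &amp;gt; y ? x : y; }

/* Unsigned comparisons expressed with min/max and equality only. */
int less_or_equal(uint8_t x, uint8_t y)    { return min_u8(x, y) == x; }
int greater_or_equal(uint8_t x, uint8_t y) { return max_u8(x, y) == x; }

/* Checks both equivalences for all pairs of bytes; returns 1 on success. */
int check_all_bytes(void)
{
    unsigned x, y;

    for (x = 0; x &amp;lt; 256; x++) {
        for (y = 0; y &amp;lt; 256; y++) {
            if (less_or_equal((uint8_t)x, (uint8_t)y) != (x &amp;lt;= y))
                return 0;
            if (greater_or_equal((uint8_t)x, (uint8_t)y) != (x &amp;gt;= y))
                return 0;
        }
    }
    return 1;
}
&lt;/pre&gt;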
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>16bpp/15bpp to 32bpp pixel conversions --- different methods</title>
  <link>http://0x80.pl/notesen/2008-06-01-sse-pix16to32bpp.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-01-sse-pix16to32bpp.html</guid>
  <pubDate>Sun, 01 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Basically, this kind of conversion needs the following steps:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;extract components R, G and B (using bitwise and)&lt;/li&gt;
&lt;li&gt;extend components from 6 or 5 bits to 8 bits (shift left)&lt;/li&gt;
&lt;li&gt;place components at desired places in a 32-bit word (shift, bitwise or)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="literal-block"&gt;
R =  (pixel16 and 0x001f) shl 3
G = ((pixel16 and 0x07e0) shr 5) shl 2
B = ((pixel16 and 0xf800) shr 11) shl 3

pixel32 = R or (G shl 8) or (B shl 16)
&lt;/pre&gt;
&lt;p&gt;Since there aren't many pixel values (32 or 64 thousand), lookup tables can be used.
The first approach is to use one big table indexed by pixels treated as natural
numbers: this table has size 65536 * 4 bytes = 262144 bytes.  Just one memory
access is needed to get a 32bpp pixel; however, the table is big, and even if
it fits in the L2 cache, the memory latency kills performance.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
pixel32 = LUT[pixel16]
&lt;/pre&gt;
&lt;p&gt;Another approach needs two tables indexed by the lower and the higher byte of
a pixel; the final pixel is the result of a bitwise or.  These tables have a total size of
2 * 256 * 4 bytes = 2048 bytes &amp;mdash; they fit perfectly in the L1 cache.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
pixel32 = LUT_hi[pixel16 shr 8] or LUT_lo[pixel16 and 0xff]
&lt;/pre&gt;
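&lt;p&gt;A sketch in C of how the two small tables could be built and used for 16bpp (5:6:5) pixels.
The function names are hypothetical; the direct conversion follows the three steps listed at the
beginning (extract, extend to 8 bits, place), and each table entry is simply the conversion of
a pixel that has only one of its bytes set.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t LUT_lo[256];
static uint32_t LUT_hi[256];

/* Direct conversion of a 5:6:5 pixel: extract, extend to 8 bits, place. */
uint32_t convert_direct(uint16_t pixel16)
{
    uint32_t R =  (pixel16 &amp;amp; 0x001fu)        &amp;lt;&amp;lt; 3;
    uint32_t G = ((pixel16 &amp;amp; 0x07e0u) &amp;gt;&amp;gt; 5)  &amp;lt;&amp;lt; 2;
    uint32_t B = ((pixel16 &amp;amp; 0xf800u) &amp;gt;&amp;gt; 11) &amp;lt;&amp;lt; 3;

    return R | (G &amp;lt;&amp;lt; 8) | (B &amp;lt;&amp;lt; 16);
}

/* Each table holds the contribution of one byte of the 16-bit pixel. */
void init_tables(void)
{
    unsigned i;

    for (i = 0; i &amp;lt; 256; i++) {
        LUT_lo[i] = convert_direct((uint16_t)i);
        LUT_hi[i] = convert_direct((uint16_t)(i &amp;lt;&amp;lt; 8));
    }
}

uint32_t convert_lut(uint16_t pixel16)
{
    return LUT_hi[pixel16 &amp;gt;&amp;gt; 8] | LUT_lo[pixel16 &amp;amp; 0xff];
}
&lt;/pre&gt;
&lt;p&gt;Splitting works because every step is a mask or a shift, so the bits coming from the lower
and the higher byte never overlap in the result.&lt;/p&gt;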
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE: modify 32bpp images with lookup tables</title>
  <link>http://0x80.pl/notesen/2008-06-01-sse-lookup32bpp.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-06-01-sse-lookup32bpp.html</guid>
  <pubDate>Sun, 01 Jun 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="x86-code"&gt;
&lt;h1&gt;x86 code&lt;/h1&gt;
&lt;p&gt;The x86 code is a base for further improvements.  If a pixel is loaded into an
x86 register, the following code can be used to extract all RGBA
components:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
movl  (%%esi), %%eax    ; eax - pixel

movzbl  %%al, %%ebx     ; R
movzbl  %%ah, %%ecx     ; G
shrl     $16, %%eax
movzbl  %%al, %%edx     ; B
movzbl  %%ah, %%eax     ; A

movl    LUT_R(,%%ebx,4), %%ebx
orl     LUT_G(,%%ecx,4), %%ebx
orl     LUT_B(,%%edx,4), %%ebx
orl     LUT_A(,%%eax,4), %%ebx ; ebx - transformed_pixel

movl    %%ebx, (%%edi)
&lt;/pre&gt;
&lt;p&gt;Code that works with RGB pixels is of course shorter:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
movl  (%%esi), %%eax    ; eax - pixel

movzbl  %%al, %%ebx     ; R
movzbl  %%ah, %%ecx     ; G
shrl     $16, %%eax
movzbl  %%al, %%edx     ; B

movl    LUT_R(,%%ebx,4), %%ebx
orl     LUT_G(,%%ecx,4), %%ebx
orl     LUT_B(,%%edx,4), %%ebx ; ebx - transformed_pixel

movl    %%ebx, (%%edi)
&lt;/pre&gt;
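&lt;p&gt;For comparison, the RGBA variant in C. This is a sketch: the table contents (inverting the color
components while keeping alpha) are just an assumed example transformation, and each table entry is
expected to hold the transformed component already shifted to its final position.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stdint.h&amp;gt;

/* Each entry holds the transformed component at its final bit position,
 * so the four lookups can simply be or-ed together. */
static uint32_t LUT_R[256], LUT_G[256], LUT_B[256], LUT_A[256];

/* Example transformation (an assumption of this sketch): invert the
 * color components and keep alpha unchanged. */
void init_invert_tables(void)
{
    uint32_t i;

    for (i = 0; i &amp;lt; 256; i++) {
        LUT_R[i] = (255 - i);
        LUT_G[i] = (255 - i) &amp;lt;&amp;lt; 8;
        LUT_B[i] = (255 - i) &amp;lt;&amp;lt; 16;
        LUT_A[i] = i &amp;lt;&amp;lt; 24;
    }
}

uint32_t transform_pixel(uint32_t pixel)
{
    uint32_t R =  pixel        &amp;amp; 0xff;
    uint32_t G = (pixel &amp;gt;&amp;gt;  8) &amp;amp; 0xff;
    uint32_t B = (pixel &amp;gt;&amp;gt; 16) &amp;amp; 0xff;
    uint32_t A = (pixel &amp;gt;&amp;gt; 24) &amp;amp; 0xff;

    return LUT_R[R] | LUT_G[G] | LUT_B[B] | LUT_A[A];
}
&lt;/pre&gt;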
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSE4 string search --- modification of Karp-Rabin algorithm</title>
  <link>http://0x80.pl/notesen/2008-05-27-sse4-substring-locate.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-05-27-sse4-substring-locate.html</guid>
  <pubDate>Tue, 27 May 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;String search is a common task in text processing.  There are many
algorithms that try to minimize the number of exact substring comparisons.&lt;/p&gt;
&lt;p&gt;One of them is the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Karp-Rabin"&gt;Karp-Rabin algorithm&lt;/a&gt; &amp;mdash; char-wise
comparison is performed only when the values of a hash function calculated for
a part of the text and for the substring are equal.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="http://en.wikipedia.org/wiki/SSE4"&gt;SSE4&lt;/a&gt; introduced a complex instruction, &lt;tt class="docutils literal"&gt;MPSADBW&lt;/tt&gt;, which
calculates eight Manhattan distances (L1) between a given 4-byte vector and
8 subsequent vectors; if the distance is zero, then the vectors are equal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The idea of the modification is to use equality of the 4-byte substring's prefix
instead of hash values equality.&lt;/strong&gt; &lt;tt class="docutils literal"&gt;MPSADBW&lt;/tt&gt; is fast: it has a latency of 4
cycles and a throughput of 2 cycles.  Even if the latency is not compensated,
the overall performance is very promising &amp;mdash; 0.5 cycle per one comparison of 4-byte
vectors.&lt;/p&gt;
&lt;p&gt;Unfortunately there are three disadvantages:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Searching for a substring shorter than 4 chars needs some additional
work.&lt;/li&gt;
&lt;li&gt;A hash is calculated for the whole substring, while &lt;tt class="docutils literal"&gt;MPSADBW&lt;/tt&gt; considers
just a 4-byte prefix, thus the number of false-positive alarms
could be greater.&lt;/li&gt;
&lt;li&gt;At least the length of the substring must be known.  In the sample
application the text length is also given &amp;mdash; this makes the program
shorter and faster.&lt;/li&gt;
&lt;/ol&gt;
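&lt;p&gt;The search scheme described above can be modeled in scalar C. This is only a sketch of the idea,
not the sample application: the function names are hypothetical, and the eight distances that
&lt;tt class="docutils literal"&gt;MPSADBW&lt;/tt&gt; yields at once are computed here one at a time.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

/* Scalar model of one MPSADBW result: the sum of absolute differences
 * (L1 distance) between two 4-byte vectors. */
static unsigned sad4(const unsigned char *a, const unsigned char *b)
{
    unsigned sum = 0;
    int i;

    for (i = 0; i &amp;lt; 4; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Returns the index of the first occurrence of `sub` (sub_len &amp;gt;= 4)
 * in `text`, or -1 if there is none. */
long find_substring(const char *text, size_t text_len,
                    const char *sub, size_t sub_len)
{
    size_t i;

    if (sub_len &amp;lt; 4 || sub_len &amp;gt; text_len)
        return -1;  /* shorter substrings need extra work, not shown */

    for (i = 0; i + sub_len &amp;lt;= text_len; i++) {
        const unsigned char *t = (const unsigned char *)text + i;

        /* Distance 0 means the 4-byte prefixes are equal; only then
         * the remaining characters are compared exactly. */
        if (sad4(t, (const unsigned char *)sub) == 0 &amp;amp;&amp;amp;
            memcmp(t + 4, sub + 4, sub_len - 4) == 0)
            return (long)i;
    }
    return -1;
}
&lt;/pre&gt;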
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSSE3: fast popcount</title>
  <link>http://0x80.pl/notesen/2008-05-24-sse-popcount.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-05-24-sse-popcount.html</guid>
  <pubDate>Sat, 24 May 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="introduction"&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Population count is the procedure of counting the number of ones in a bit string.
Intel introduced the instruction &lt;tt class="docutils literal"&gt;popcnt&lt;/tt&gt; with the &lt;a class="reference external" href="http://en.wikipedia.org/wiki/SSE4"&gt;SSE4.2&lt;/a&gt; instruction
set. The instruction operates on 32 or 64-bit words.&lt;/p&gt;
&lt;p&gt;However, &lt;a class="reference external" href="http://en.wikipedia.org/wiki/SSSE3"&gt;SSSE3&lt;/a&gt; has a powerful instruction, &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt;.  This instruction
can be used to perform a &lt;strong&gt;parallel&lt;/strong&gt; 16-way lookup; the LUT has 16 entries and is
stored in an XMM register, the indexes are the 4 lower bits of each byte stored in
another XMM register.&lt;/p&gt;
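&lt;p&gt;A scalar C model of the lookup-based counting, as a sketch: the table and function names are
hypothetical, and the lookups that &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; would perform for 16 bytes in
parallel are done here one byte at a time.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

/* 16-entry LUT: the number of ones in each 4-bit value. */
static const uint8_t POPCOUNT_4BIT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Counts the ones in a buffer with two 16-way lookups per byte. */
size_t popcount_lookup(const uint8_t *data, size_t size)
{
    size_t result = 0;
    size_t i;

    for (i = 0; i &amp;lt; size; i++) {
        result += POPCOUNT_4BIT[data[i] &amp;amp; 0x0f];   /* lower nibble  */
        result += POPCOUNT_4BIT[data[i] &amp;gt;&amp;gt; 4];     /* higher nibble */
    }
    return result;
}
&lt;/pre&gt;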
&lt;/div&gt;
  </description>
 </item>
 <item>
  <title>SSSE3: printing hex values</title>
  <link>http://0x80.pl/notesen/2008-04-29-sse-hexprint.html</link>
  <guid isPermaLink="true">http://0x80.pl/notesen/2008-04-29-sse-hexprint.html</guid>
  <pubDate>Tue, 29 Apr 2008 12:00:00 +0100</pubDate>
  <description>
&lt;div class="section" id="simd-algorithm"&gt;
&lt;h1&gt;SIMD algorithm&lt;/h1&gt;
&lt;p&gt;The instruction &lt;tt class="docutils literal"&gt;PSHUFB&lt;/tt&gt; does a &lt;strong&gt;parallel&lt;/strong&gt; lookup from a 16-byte array
stored in an XMM register &amp;mdash; this is exactly what binary to hex conversion
needs.&lt;/p&gt;
&lt;p&gt;Code snippet showing the idea:&lt;/p&gt;
&lt;pre class="literal-block"&gt;
movdqa    (%eax), %xmm0 ; xmm0 = {0xba, 0xdc, 0xaf, 0xe8, ...}
movdqa     %xmm0, %xmm1 ; xmm1 -- bits 4..7 shifted 4 positions right
psrlw         $4, %xmm1 ; xmm1 = {0xcb, 0x0d, 0x8a, 0x0e, ...}
punpcklbw  %xmm0, %xmm1 ; xmm1 = {0xcb, 0xba, 0x0d, 0xdc, 0x8a, 0xaf, 0x0e, 0xe8, ...}
                        ; MASK = packed_byte(0x0f)
pand        MASK, %xmm1 ; xmm1 = {0x0b, 0x0a, 0x0d, 0x0c, 0x0a, 0x0f, 0x0e, 0x08, ...}
                        ;      -- bits 0..3
movdqa HEXDIGITS, %xmm0 ; HEXDIGITS = {'0', '1', '2', '3', ..., 'a', 'b', 'c', 'd', 'e', 'f'}
pshufb     %xmm1, %xmm0 ; xmm0 = {'b', 'a', 'd', 'c', 'a', 'f', 'e', '8', ...}
&lt;/pre&gt;
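&lt;p&gt;The result of the snippet can be cross-checked with a scalar C version. This is a sketch with
a hypothetical function name; it does one byte per iteration, while the SIMD code handles
8 input bytes at once.&lt;/p&gt;
&lt;pre class="literal-block"&gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

static const char HEXDIGITS[] = &amp;quot;0123456789abcdef&amp;quot;;

/* Writes 2 * size hex digits and a terminating NUL to `out`. */
void to_hex(const uint8_t *data, size_t size, char *out)
{
    size_t i;

    for (i = 0; i &amp;lt; size; i++) {
        out[2 * i]     = HEXDIGITS[data[i] &amp;gt;&amp;gt; 4];    /* bits 4..7 */
        out[2 * i + 1] = HEXDIGITS[data[i] &amp;amp; 0x0f];  /* bits 0..3 */
    }
    out[2 * size] = '\0';
}
&lt;/pre&gt;
&lt;p&gt;For the bytes 0xba, 0xdc, 0xaf, 0xe8 it produces the string badcafe8, the same digits
as in the last comment of the snippet.&lt;/p&gt;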
&lt;/div&gt;
  </description>
 </item>
 </channel>
</rss>
