ARM64 Assembly with Swift and Xcode

Here is a popular interview question:

Given an unsigned integer, how would you count the number of set bits (that is, bits = 1)?

This count is otherwise known as the Hamming weight, or population count, and there are clever tricks for computing it in only a few lines of C. You could also use a loop that shifts the value and tests one bit at a time.
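A minimal Swift sketch of the shift-and-compare loop follows. The name popcountLoop is made up for illustration and is not the function from the app.

```swift
// Hypothetical helper: counts set bits by testing the lowest bit,
// then shifting the value right until nothing is left.
func popcountLoop(_ value: UInt64) -> Int {
    var remaining = value
    var count = 0
    while remaining != 0 {
        if remaining & 1 == 1 {
            count += 1
        }
        remaining >>= 1
    }
    return count
}
```

The loop runs once per bit position up to the highest set bit, which is the cost the single instruction avoids. Swift's standard library exposes the same count as the nonzeroBitCount property on integers.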

There is actually a single CPU instruction that implements this entire function:

CNT (vector)

Population count per byte.


CNT Vd.T, Vn.T

This is an ARM64 (AArch64) assembly instruction that performs the population count operation in one line of code. A similar instruction exists in most CPU instruction sets. iPhones use ARM-family chips. For example, my iPhone 6 uses an Apple A8, a System-on-a-Chip (SoC) built around an ARM64 CPU.

This inspired me to write an iPhone app that utilizes the CNT instruction, which means I get to write a little bit of ARM assembly code and deploy it on the iPhone. The code is on GitHub. This is what the finished product looks like:

The app is written in Swift but uses C and Assembly for the bit-counting operation. Calling an ARM64 function from Swift is a lot like calling a C function from Swift, so I wrote a C popcount() function to use with the simulator (it compiles into x86 assembly) and an ARM64 popcount() function to deploy on the iPhone itself. The function is defined in the two files popcount.c and popcount.s. They use the preprocessor flags #ifdef __arm64__ and #ifndef __arm64__ to tell the compiler which code goes with which platform.

Both files utilize the same header to declare the function. You can create a header called popcount.h using the File menu. But this is a Swift project, so we need a bridging header to make the functions callable from Swift. A bridging header is created automatically if you carefully follow the prompts when creating a C file, but you can create one manually if need be.

A function written in assembly code comes with a lot of boilerplate code for setting up the stack, returning results, aligning memory, and much more. Fortunately, there is a way to get a working example that can be used as a kind of template. The compiler will show you generated assembly if you use the right options. Take this simple Swift function:
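A minimal sketch of such a function is below, assuming it lives in the SimpleFunction.swift file named in the compile command. The function name matches the addTwoNumbers label discussed later, but the parameter names and the unlabeled-argument style are guesses.

```swift
// SimpleFunction.swift (assumed contents)
// Adds two integers; the body compiles down to a single add instruction
// plus the overflow check that Swift inserts for the + operator.
func addTwoNumbers(_ a: Int, _ b: Int) -> Int {
    return a + b
}
```

A function this small is a convenient template because the generated assembly has almost nothing in it besides the calling-convention boilerplate.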

Compiling this code manually will show the assembly code generated by the swiftc compiler. I used this command: xcrun -sdk iphoneos swiftc -emit-assembly -target arm64-apple-ios11.0 SimpleFunction.swift. It produces a very large assembly file, not all of which needs to be replicated in the popcount.s code. The code starting at the function label and ending with the ret statement is useful, because we can use it to create our own function in assembly.

The addTwoNumbers function is actually contained in the assembly code labelled __T014SimpleFunction13addTwoNumbersS2i_SitF. The name has been mangled to make it unique and identifiable across potentially many compiled objects. The single line adds x0, x0, x1 actually contains the entire body of this simple function in one instruction. Here is a breakdown of the instruction:

  • adds is an opcode, a single basic instruction in assembly. Assembly opcodes typically perform a simple math operation on one of a few built-in variables. They may also move a word of data in or out of memory. This opcode adds two numbers together; the trailing s means it also sets the condition flags, which lets the code that follows check for overflow.
  • x0 is the destination variable for the operation, where the results are written. These variables are called registers in assembly, and they live on the CPU outside of memory. There are only 31 general-purpose registers on modern ARM chips, so assembly programs spend a lot of code moving data between memory and these local registers.
  • x0, x1 are the source registers. Under the ARM64 calling convention, the caller passes the first eight integer arguments in registers x0 through x7, so both parameters are already in registers when the function starts, and the result goes back to the caller in x0.

There are a lot of other ARM64 commands, and a lot of details needed to understand function calls. But I won’t need to dive this deep to complete my task.

I need to replace this adds command with cnt. The cnt command is actually part of the ARM64 SIMD instruction set, and uses a separate set of registers from the general-purpose x0-x30 registers. I need a three-line program: one line to move the function parameter into a SIMD register, one to perform the count, and one to move the result back to a GP register. These three lines accomplish this:

Breaking down the assembly syntax again:

  • dup duplicates an element (copies it). We are copying from a general-purpose register, x0, into a specialized SIMD register, v0.2d.
  • v0.2d is a 128-bit SIMD register. x0 is a 64-bit register, so the .2d modifier instructs the assembler to duplicate x0 and put the two 64-bit copies into v0. This isn’t necessary for our purpose, but it is useful for other algorithms.
  • cnt executes the population count, replacing the bit pattern with the number counted. For example, 01101101 would become 00000101 (101=5, because there were five bits set in the data before the operation).
  • v0.16b indicates the same register as before, except this time the SIMD register is being divided into 16 lanes, each 1 byte in size. Each of the 16 8-bit lanes inside the vector is treated as a separate register, and they are all processed in parallel. Vector instructions perform one operation on many pieces of data in parallel, hence SIMD: Single Instruction, Multiple Data.
  • fmov moves data between registers again, this time putting the results back in x0. The f here actually makes this a floating-point instruction, because FP and SIMD instructions share the same registers in ARM64. But it doesn’t make this a floating-point number.
  • v0.d[1] is once again the same SIMD register, but with a different way of splitting up the data elements. The .d[1] indicates that a doubleword (64 bits) is being copied out of element 1 of the SIMD register. This puts the results into our GP register.

Note: CNT only works in 8-bit chunks due to hardware limitations (more bits = more wires = less room to implement other features). If I wanted to support values larger than 255, I could add an addv command, which adds the 8-bit chunks together. There are many more ARM topics I have not explored.
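To make the note above concrete, here is a rough Swift emulation of what happens to one 64-bit lane. It is only a model of the behavior, not code from the app, and the function names perByteCounts and totalCount are invented for illustration.

```swift
// Models CNT on one 64-bit lane: element i of the result is the number
// of set bits in byte i of the input (byte 0 is the least significant).
func perByteCounts(_ value: UInt64) -> [UInt8] {
    return (0..<8).map { (byteIndex: Int) -> UInt8 in
        let byte = UInt8(truncatingIfNeeded: value >> (byteIndex * 8))
        return UInt8(byte.nonzeroBitCount)
    }
}

// Models the extra horizontal add (the job addv would do):
// sum the eight per-byte counts into one total.
func totalCount(_ value: UInt64) -> Int {
    return perByteCounts(value).reduce(0) { $0 + Int($1) }
}
```

For the input 0x01FF, the per-byte counts are 8 for the low byte and 1 for the next byte, so reading only the low byte would report 8 while the summed total is 9.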

The resulting assembly code reports the population count, and the Swift code displays the results to the user:

If you’re new to assembly language, it is worth noting some oddities of the assembly language environment. Assembly language gives you complete low-level control over the behavior of your CPU, but this means that understanding the details of CPU architecture is essential to writing efficient assembly. Instructions that rely on shared resources can create delays. Memory loads take several instruction cycles to complete.

Compilers are pretty good at allocating registers and scheduling instructions. It may be worth writing code in C and using the generated assembly as a starting point to see how it can be done. There is a lot more to writing effective assembly than just knowing the syntax of the language.

You can find lots of resources for learning ARM64 assembly. A crash course in ARM64 assembly might be a good place to start, and the ARMv8 Instruction Set Overview is both authoritative and exhaustive. GitHub member Richard Ross has written an entire iOS app in assembly.

You can practice ARM64 using QEMU, which, although it requires a lot of effort to set up, will let you analyze assembly instructions one by one as they execute.

Swift Fun with ArraySlice

There is a lovely article by Luna An describing ArraySlice objects in more detail than I do here. It covers Swift 3 at the moment, and you should note that Swift 4 adds one-sided ranges, so you can create slices à la [..<count].

I found a good use for an ArraySlice while trying to find quartiles in a set of data.

Finding the quartiles is essentially the same problem as finding the median of three different data sets: the original input set, and the lower and upper halves of that set after removing the original median element, if it exists (that is, when the count is odd).

Here, I use findMedian to perform all three tasks. I found that I had to do a bit of extra work because the array slice is not indexed starting at zero. I wonder why they chose to implement slices in this way?
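A minimal sketch of the approach is below. It is not the original listing: findMedian here is a reimplementation written for illustration, and quartiles(of:) is a hypothetical wrapper. The extra work mentioned above shows up as the startIndex offset, because a slice keeps the indices of the array it came from.

```swift
// Median of an already-sorted slice. Positions are computed relative to
// startIndex because the slice shares indices with its parent array.
func findMedian(_ sorted: ArraySlice<Double>) -> Double? {
    guard !sorted.isEmpty else { return nil }
    let mid = sorted.startIndex + sorted.count / 2
    if sorted.count % 2 == 0 {
        return (sorted[mid - 1] + sorted[mid]) / 2
    }
    return sorted[mid]
}

// Hypothetical wrapper: three medians over three views of one sorted array.
func quartiles(of data: [Double]) -> (q1: Double, q2: Double, q3: Double)? {
    let sorted = data.sorted()
    let half = sorted.count / 2
    // For an odd count, skip over the median element itself.
    let upperStart = sorted.count % 2 == 0 ? half : half + 1
    guard let q1 = findMedian(sorted[..<half]),
          let q2 = findMedian(sorted[...]),
          let q3 = findMedian(sorted[upperStart...]) else {
        return nil
    }
    return (q1, q2, q3)
}
```

With the input [1, 2, 3, 4, 5, 6, 7], the upper slice starts at index 4 rather than 0, so indexing it from zero would trap at run time.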


Command Design Pattern in Swift

It is the summer of Swift.

I was perusing some wonderful design patterns from Oktawian Chojnacki, and I decided to play around with the Command pattern. The Command pattern represents commands as objects that sit between a Caller, which holds references to the commands, and a Receiver, which the commands reference.

I’m reminded of a game:


Here, each command object operates on a Robo, telling it to make a single move. The commands are collected into a program.

The first draft of this code had the Commands store their target Robo in a property. The problem was that my program would accept commands for any Robo, when only one Robo belongs to the program. My solution was to give control of the command target to the program itself.

Of course, now the commands are little more than glorified functions, which can be stored in arrays in Swift anyway.
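A minimal sketch of that end state, with commands stored as plain functions, might look like this. The Robo fields, the Command typealias, and the Program type are all assumptions made for illustration and do not come from the original code.

```swift
// Hypothetical receiver: a robot with a position on a grid.
struct Robo {
    var x = 0
    var y = 0
}

// A command is just a function that moves whichever Robo it is handed.
typealias Command = (inout Robo) -> Void

let moveUp: Command = { $0.y += 1 }
let moveRight: Command = { $0.x += 1 }

// The program owns the single target Robo and supplies it to each command,
// so a command can never be aimed at some other Robo.
struct Program {
    var robo = Robo()
    var commands: [Command] = []

    mutating func run() {
        for command in commands {
            command(&robo)
        }
    }
}

var program = Program()
program.commands = [moveUp, moveUp, moveRight]
program.run()
```

After run() the Robo has moved two steps up and one step right. Keeping the target out of the commands is what lets them shrink to closures.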

Swift Dictionary Reduce

Why is it so hard to find examples of Swift dictionary reduce operations? Examples for arrays abound:

See this if you’re looking for a detailed HowTo.

But dictionary is the more advanced problem. How do I reduce when I have keys and values?

The key here is to note that a dictionary yields the contents of its sequence in tuples containing pairs of values. We still have our familiar $0, $1 arguments from reduce as above, where $0 is the partial sum and $1 is the individual element value. But now $0 and $1 are both tuples, and you get to their contents through $0.0, $0.1, $1.0, and $1.1.

This example concatenates the strings and adds the integers, and the two separate data types just happen to use the same symbol for the two operations.
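A minimal sketch of such a reduction is below, using a made-up inventory dictionary. The initial value ("", 0) is the starting tuple, so $0 is the running (String, Int) pair and $1 is the current (key, value) pair.

```swift
// Hypothetical data: names mapped to quantities.
let inventory = ["apple": 3, "banana": 5, "cherry": 7]

// Concatenate the keys and add the values in one pass.
// $0.0 / $0.1 are the partial string and partial sum;
// $1.0 / $1.1 are the current key and value.
let combined: (String, Int) = inventory.reduce(("", 0)) {
    ($0.0 + $1.0, $0.1 + $1.1)
}
```

The integer half of the result is always 15 here. The string half contains all three keys, but a dictionary does not promise any particular iteration order, so the order of the concatenation can differ between runs.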

Book Critic

I don’t like to step on anyone’s work. Having taught university computer science for a decade, I can tell you that it is difficult to present technical material in a way that is both correct and engaging. My students’ surveys remind me every semester that 100% satisfaction is elusive. But I want to critique the writing style of a book on Swift that I just browsed at Barnes and Noble. If you are easily tripped up on vague or ambiguous descriptions, please know that we share this obsessive trait.

See the section subtitled Building Blocks of Swift, in bold? It establishes the very basic programming concept of named storage, called variables in most material even if their values never change. The unchanging named values are also called constants. In Swift, these named values are declared with the syntax var variableName=value for variables and with let constantName=value for constants.

These are my complaints:

  • The first paragraph explains variables and constants, but only gives the keyword var for variables. What about let for constants? This leaves me hanging.
  • The sample code below P1 is obviously not illustrative of the example described, even though it prefaces the code with “as shown in this example:”.
  • The sample code below P1 contains errors! The reassignment of the constant will break at compile time. Syntax errors, while useful for examples, must be flagged clearly so that the reader does not struggle trying to figure out the code!
  • The last example in that section says that you can omit the var keyword, yet we clearly see the var keyword in the example code given. It would be clearer to say that we can omit additional var keywords when declaring multiple variables.

So, buyer beware.

CERT, C, and Swift

Swift does not do implicit type conversions between integers of different sizes and signedness. By contrast, Java does implicit conversion, but with less risk of unexpected effects due to the lack of unsigned integer types. C, that dinosaur, has a strange and secret show behind every elementary school math operation.

Why is type conversion confusing? I’ll use C as my example. CERT has some of the best low-level down-and-dirty descriptions of the language, so I’ll be borrowing heavily from their examples.

We’ll start with an easy one: Promotion. During arithmetic, operands narrower than int are automatically promoted to int or unsigned int (32 bits on most platforms), which keeps small intermediate results from overflowing. It is curious that they don’t promote to something bigger (long), but x86 assembly uses 32-bit registers most of the time, so it’s a natural fit.

If we run this in the debugger, using the lldb command type format add --format "unsigned dec" "unsigned char" for clarity, we see the following:

(lldb) frame variable

(unsigned char) appleCharOp1 = 80

(unsigned char) appleCharOp2 = 70

(unsigned char) appleCharOp3 = 100

(unsigned char) appleCharResult = 56

(unsigned char) appleCharWrongResult = 2

The correct answer is the result of being able to hold the intermediate product, 5600, in a 32-bit variable. If you truncate the intermediate product to 8 bits, you get 0xE0, which is 224. Dividing this by 100 gives you 2 after truncation.
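The same arithmetic can be replayed in Swift, where every widening step has to be written out. This is only a sketch that reuses the three operand values from the debugger output. It is not a translation of the C listing, and the variable names are made up.

```swift
let op1: UInt8 = 80
let op2: UInt8 = 70
let op3: UInt8 = 100

// Widen first, which is what C's integer promotion does implicitly:
// 80 * 70 = 5600, and 5600 / 100 = 56.
let promoted = Int(op1) * Int(op2) / Int(op3)

// Keep the product in 8 bits instead: 5600 wraps to 0xE0 (224),
// and 224 / 100 truncates to 2.
let (truncated, didWrap) = op1.multipliedReportingOverflow(by: op2)
let wrongResult = truncated / op3
```

Note that a plain op1 * op2 on UInt8 values would trap at run time in Swift instead of wrapping, which is why the sketch uses multipliedReportingOverflow(by:) to observe the truncated value.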

Mind you, the end result is that you get the expected answer as long as you don’t stick a weird cast in the middle like I did there. But here is an example where you get the wrong answer without any extra work:

Results are

(unsigned int) cantaloupeInt1 = 1073741823

(unsigned int) cantaloupeInt2 = 536870911

(unsigned int) cantaloupeInt3 = 268435455

(unsigned long long) cantaloupeLongLong1 = 10

That’s the wrong answer. You get the right answer if you cast each term to (unsigned long long).

There are many other factors to C integer conversion, such as precision and rank, but these examples suffice to show why it’s a tricky subject.

Enter Swift’s design choice. If you try the same basic code in Swift:

It simply refuses to compile. The operations on the right are legal, but the assignment is not.
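A minimal sketch of the kind of code Swift rejects, and of the explicit version it accepts, is below. The multiplication is chosen here purely for illustration and may not be the operation used in the C example. The two operand values are borrowed from the debugger output above.

```swift
let first: UInt32 = 1_073_741_823
let second: UInt32 = 536_870_911

// Rejected by the compiler: a UInt32 result cannot be stored in a UInt64
// without an explicit conversion.
// let wide: UInt64 = first * second

// Accepted: convert each term before the arithmetic happens.
let wide = UInt64(first) * UInt64(second)

// What a 32-bit multiply would have kept, observed without trapping.
let (wrapped, overflowed) = first.multipliedReportingOverflow(by: second)
```

Converting each operand first mirrors the C fix of casting each term to unsigned long long. The difference is that Swift will not let you forget it.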

Swift is Here to Stay

There are a few pockets of the internet complaining that Swift is not ready for production yet. I dive into these half expecting to see complaints about the changes in Swift 2.2, the rapidly changing standards, etc. Instead, those professional developers who are complaining are warning about lack of binary compatibility between old and new frameworks.

I am not deterred. I’m into Swift now.

This post is short on content. The plan is to follow Apple’s guide to grok basic syntax, followed by Ray’s guide on style to get the latest and greatest best-practices.

After that, the pickin’s get slim because I’m jonesing for Metal tutorials. Good news is that Jacob Bandes says that it makes more sense than OpenGL or CUDA. We shall see, we shall see.