Data Engineering

I’ve been teaching a course on database management systems for about six years now. We cover relational theory, SQL, normalization, MySQL, data modeling, Python, Jupyter, Docker, and PySpark. It’s a great class, and a lot of my students find their way into wonderful careers because of the skills they pick up.

Personally and professionally, I’ve always focused on the low-level, the technical. Even when I write code, I’m more fascinated by the syntax and nuances of the language than by the frameworks and toolkits that enable useful features. My interest in data science follows the same pattern: I focus most on the data scraping, the data management, the code and performance issues.

But this video of David McCandless inspired me! What a beautiful display of data and information. This is how data science can be used to drive decisions and spur us to action.

Beyond beauty, David’s presentation focuses on the context of data and the meaning behind the shapes and numbers. He provides useful comparisons, and rightly calls out the media for almost completely failing to do so.

How much is a billion dollars? What does it mean that a volcano spewed tons of CO2 into the atmosphere? Context and comparison can provide answers and motivate appropriate responses. When the news consumer has an understand of quantities and relationships, the news itself takes on a whole new, deeper meaning.

Data dashboards like datastudio.google.com and tableau allow the to assemble widgets into live presentations that change as new data comes in. Graphical libraries like D3 let the user create beautiful visualizations of data.

Data is coming to life!

Hire a Senior Developer

Well, folks, this isn’t the post I wanted to be writing.

My employer, Nyla Technology Solutions, is laying off me and two of my coworkers. I am now job-hunting with immediate effect, so to speak.

As a senior software developer, I bring over 25 years of experience in software development and nearly 15 years of experience in teaching graduate and undergraduate-level computer science courses at the university level. I love code, and I love to talk code.

I work and live in southern Virginia, but I am open to remote work with some travel.

Please find me on LinkedIn if you have any interest or if you just want to connect.

And wish me luck!

P.S. I’m attaching more links for the interested head-hunter.

See that pretty resume? My work history and education boiled down to one page, complete with infographics! Connect with me on LinkedIn or email me at mike@mikejfromva.com. I’ll be happy to provide one for interested recruiters.

I’m active on StackOverflow and HackerRank as well. If I experience any down time between jobs, I’ll start creating some open-source projects on GitHub and posting the tutorials here. 😄

— Mike J (from VA)

iOS Dev Camp DC 2019

Tomorrow morning I brave Washington D.C. traffic to get to iOS Dev Camp DC 2019. And I’m really looking forward to it!

This conference is always a lot of fun. The food is great, the talks are interesting, and the swag is top-shelf. But the thing I really love about iOS Dev Camp DC is the people. We are a smallish group of iOS enthusiasts from all over the country, and every year we have a great time. My friends know to DM me on Twitter so we can get together at the conference.

So what are you waiting for? Buy your ticket already. It’s $50 for the whole day, and the proceeds go to Women Who Code. If not, at least get on Twitter and follow @iosdevcampdc to keep up with what’s happening. Oh no…. #DoraMovie is trending. ABORT MISSION ABORT ABORT ABORT

Jobs Available in VA

I have recently learned about a few companies in Chesapeake and Newport News that are looking for tech talent (JEE, .NET, Java, or DevOps). If you are interested in learning more about permanent, full-time positions in southern VA, please send me a DM on any of my social accounts. I will be happy to forward your info to the appropriate recruiters.

Cracking into Rabin Karp

So the Jetpack plugin for WordPress added GIF blocks.

Solving coding puzzles can be fun and rewarding. Try HackerRank or CodeSignal. But solving them on a whiteboard during a job interview can be… more stressful than fun. The book Cracking the Coding Interview comes with high praise and recommendations from some of the industry’s top tech recruiters.

I’ve been on both sides of the tech interview process, and writing code without an IDE is something that every developer should at least try at some point in their career. I recently tackled a basic problem using Java that once stumped me.

A long time ago, a job interviewer asked me to create a string searching algorithm. Given strings A and B, simply find A if it occurs in B. It is always good to start by stating the basic facts and showing that the problem can be solved, even if the naive algorithm is algorithmically complex. I managed to implement a naive string search. Naive search is simple:

  • Iterate over B using the index bindex
    • Iterate over A using the index aindex
      • Compare the character A[aindex] to the character B[aindex+bindex]
      • If they are the same, proceed with the inner loop (over A)
      • If they differ, break out of the inner loop (over A)
    • If A’s loop completes, the string search has found a match

The big problem with this algorithm is that it works in O(A_{length} * B_{length}) time complexity. When the interviewer asked me to write an optimized version… I froze. I never did get the job.

I recently read about an optimized string search algorithm, and the old pain of defeat flared up again. I had to implement Rabin Karp.

Rabin Karp optimizes the inner loop in the following way:

  • Create a hash-code for A. For example:
for (int index = 0; index < pattern.length(); index += 1) {
	searchHashAccumulator = (searchHashAccumulator + search.charAt(index)) % hashIndex;
}
  • Create a hash-code array of length B_length
  • Iterate over B using the index bindex
    • Compute a cumulative hash for each character in B by adding the character at bindex to the cumulative hash value.
    • Store this hash in the array
for (int index = 1; index <= search.length() - pattern.length(); index += 1) {
	// add new character
	searchHashAccumulator += search.charAt(index + pattern.length() - 1);
	// remove old character
	searchHashAccumulator -= search.charAt(index - 1);
	// add padding to protect against negative numbers
	searchHashAccumulator += hashIndex;
	// modulo
	searchHashAccumulator %= hashIndex;
	searchHashes[index] = searchHashAccumulator;
}
  • Iterate over B as before, bindex blah blah blah
    • This time, before executing the inner loop over A, check the hash code of A with the appropriate hash code for B at position bindex
    • If they don’t match, short-circuit the inner loop. We never execute it.
    • If they do match, go ahead and run through A as before to verify a match was found.
search: for (int searchIndex = 0; searchIndex < search.length() - pattern.length() + 1; searchIndex += 1) {
	if (patternHash != searchHashes[searchIndex]) {
		System.out.println("Optimize!");
		continue;
	}
	for (int patternIndex = 0; patternIndex < pattern.length(); patternIndex += 1) {
		if (pattern.charAt(patternIndex) != search.charAt(searchIndex + patternIndex))
			continue search;
	}
	return searchIndex;
}

Our hash code will sometimes match even when the string is not found (false positive), but it usually won’t. If we assume this is the case, then the inner loop basically executes only a few times. This makes our new time complexity O(B_{length} + A_{length}) and adds a O(B_{length}) memory complexity to the mix. All in all, a lot better than the performance of the naive approach.

I implemented it with some JUnit test code on github. I plan to add code to this repo whenever I get the urge to try one of these coding puzzles to keep my coding tools sharp.

There is No Cloud…

There is no cloud. It's just someone else's computer.
Thanks to Twitter user Juan Pernia for the image.

I signed up for Amazon Web Services around a year ago. I remember playing around and being impressed. But I never did much more with it.

This semester I am once again teaching a Database Administration course at Christopher Newport University. One of my students shared with me that his summer job heavily utilized Amazon Elastic Compute Cloud, and I decided to take another look into it. AWS enables you to easily deploy a server to support a web application, run data analysis, or even support your own multiplayer video game. And you never have to worry about backups or hardware failures! EC2 is a rich software deployment platform, however, and the array of choices when bootstrapping a new server can be overwhelming to a newcomer.

I needed a quick and dirty solution to a specific problem: I use Docker to teach my database course. It allows the students to deploy containers (virtual machines with fixed software configurations) to host their own relational database (PostgreSQL) interface software (PgAdmin). Later in the semester, they use Docker to deploy a web server hosting Jupyter to enable them to get experience with Python, Jupyter, and Spark. PySpark and The PySpark Cookbook have helped to provide me with a way to introduce the students to basic data analytics in just a few weeks.

Problem: Docker is straight-up painful to install on a student’s laptop running Windows, and it does not reflect a realistic database server at that. I decided to look into web services as a solution, and took a look at Google Cloud, Microsoft Azure, Amazon Web Services, and others. I was surprised to run into an ad for Vultr while searching for resources. They are not one of the big names, so I decided to check them out.

I was delighted to find out that there are $5 and $50 one-month free trials available. The basic Docker container is only $5 per month, so I can essentially try everything I want for free. The startup could not be easier, in part because you are only prompted to select your application, not trying to estimate your performance and capacity needs. There is a built-in option for starting a docker host.

Once setup is complete, I was given an IP address for my server and account details for it. It’s amazing, but that’s really all there is to it! Here is what my first login looked like:

Michaels-MacBook-Pro-3:~ michaeljohnson$ ssh root@11.22.33.44
# NOT ACTUAL IP ADDRESS
root@11.22.33.44's password: # ENTERED PASSWORD HERE
Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-137-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

89 packages can be updated.
56 updates are security updates.

New release '18.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Last login: Thu Feb 14 16:45:47 2019 from 174.226.24.25
root@dockertest:~# docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     false
root@dockertest:~# 

By next semester, my students will all be subscribing to a web services provider as part of their materials for the database course.

New Hosting / New WordPress

I’d like to share my #UnpopularOpinion and say that GoDaddy is an excellent hosting provider with impressive tech support. I’ve always been quick to call and ask them questions, and they always explain everything without talking down to me. They know web developers.

I recently took advantage of a sale on GoDaddy’s Deluxe Hosting and secured three years worth of hosting for as many web sites as I like. About six months after making the payment, I finally set about the task of migrating my web sites over to a new server.

My first view of the new hosting service came through the new-hosting-wizard. You’re basically forced to pick one of your domains to migrate before you can access cPanel Admin, the main gateway to hosting management on GoDaddy. In addition to the wizard and cPanel, GoDaddy supplies a minimal management panel (confusingly called cPanel) under My Account / Hosting / cPanel. The wizard does little more than redirect DNS to the new server, so I was going to need to move all of my hosting files and web server configuration over manually or using GoDaddy tools.

I’m into doing things the hard way and keeping control, so I decided to ignore all high-level tools in cPanel and move my site files over using Secure Shell. The minimal-management panel has a block dedicated to ssh, and it was here I learned that GoDaddy picked an obtuse username and default password for me to manage cPanel Admin or connect with SSH. I’ll never need the password since cPanel Admin is automatically logged in when I log into godaddy.com. cPanel Admin has an SSH Access widget that will allow you to manage SSH keys. You can use the Import Key action from here to add an authorized key for the management account. I added the public key from my already-generated ssh profile on my Macbook Pro, and I was set with unfettered encrypted access to the server shell itself. More on this in a minute.

If you haven’t played with ssh, you really should. Yes, that’s a link to a nuclear physics experimenter’s guide to using ssh to access a particle accelerator from off site. I have… a colorful background.

In a nutshell, ssh uses something called Public Key Infrastructure to enable a two-key system where one key (a large number encoded into characters kept in file) can be used to encrypt data that the other key can decrypt. Each key can encrypt or decrypt whatever the other key can decrypt or encrypt. One key is labeled public and released while the other key is labeled private and kept secret. This gives us the ability to authenticate (use our private key to encrypt something so that the public key can be used to decrypt it) and to transfer our data in secret (use our public key to encrypt our data and only our private key and decrypt it). In practice, SSH actually uses this secret transfer to pass another temporary secret key for faster encryption.

But here is how easy it is to copy a site:

tar cvf - . | ssh foo@myhost.com 'cd public_html; tar xvf -'

This command works if you have all of your site files in your current working directory on your local machine. I keep a working copy of all of my source files local and use git to revision them. I don’t have a fancy upload tool – I just use git to tell me which files have been updated and ssh to transfer them.

I opted not to buy any email plans. My whole show runs on the cheap. You can set up a free email account through CPanel that forwards any email aimed at your domain to any account. I just forward them all to my GMail account. This tutorial saved me a lot of stress.

Finally, every web site needs to upgrade to HTTPS. Chrome and other browsers are flagging insecure sites. ZeroSSL has a very easy set of online tools to generate your certificates. The only catch is that you need to renew them manually every three months. This tutorial helped me make sense of the various key components. Once you understand which fields go where, it’s just a matter of finding the right panel and pasting in the right value. When it’s time to renew, this video shows what keys go where.

That’s all done! WordPress has been upgraded, and I’m ready for another year of (hopefully more frequently) blogging. The key takeaway here is that you can really get most things for free (email redirection, SSL) and the rest for cheap (domains about $15/year each, web hosting $60/year for unlimited sites). All this if you’re willing to do the work.

ARM64 Assembly with Swift and XCode

Here is a popular interview question:

Given an unsigned integer, how would you count the number of set bits (that is, bits = 1)?

This is otherwise known as a Hamming Weight, and there are clever tricks for solving this problem using only a few lines of C. You could also use a loop with shift and compare:

There is actually a single CPU instruction that implements this entire function:

CNT (vector)

Population count per byte.

Syntax

CNT Vd.T, Vn.T

This is an instruction, in ARM64 Assembly (AArch64), that performs the population count operation in one line of code. A similar instruction exists in most CPU assembly languages. iPhones use ARM-family chips. For example, my iPhone 6 uses an Apple A8, a System-on-a-Chip (SoC) built around an ARM64 CPU.

This inspired me to write an iPhone app that utilizes the CNT instruction, which means I get to write a little bit of ARM assembly code and deploy it on the iPhone. The code is on Github. This is what the finished product looks like:

The app is written in Swift but uses C and Assembly for the bit-counting operation. Calling ARM64 functions from Swift is a lot like calling a C function from Swift, so I wrote a C popcount() function to use with the simulator (compiles into x86 assembly) and an ARM64 popcount() function to deploy on the iPhone itself. The function is defined in the two files popcount.c and popcount.s. They use the preprocessor flags #ifdef __arm64__ and ifndef __arm64__ to tell the compiler which code goes with which platform.

Both files utilize the same header to declare the function. You can create a header called popcount.h using the File menu. But this is a Swift project, so we need a bridging header to make the functions callable from Swift. A bridging header is created automatically if you carefully follow the prompts when creating a C file, but you can create one manually if need be.

A function written in assembly code comes with a lot of boilerplate code for setting up the stack, returning results, aligning memory, and much more. Fortunately, there is a way to get a working example that can be used as a kind of template. The compiler will show you generated assembly if you use the right options. Take this simple Swift function:

Compiling this code manually will show the assembly code generated by the swiftc compiler. I used this command: xcrun -sdk iphoneos swiftc -emit-assembly -target arm64-apple-ios11.0 SimpleFunction.swift. It produces a very large assembly file, not all of which needs to be replicated in the popcount.s code. The code starting at the function label and ending with the ret statement is useful, because we can use it to create our own function in assembly.

The addTwoNumbers function is actually contained in the assembly code labelled __T014SimpleFunction13addTwoNumbersS2i_SitF. The name has been mangled to make it unique and identifiable across potentially many compiled objects. The single line adds x0, x0, x1  actually contains the entire body of this simple function in one command. Here is a breakdown of the command:

  • adds is an opcode, a single basic instruction in assembly. Assembly opcodes typically perform a simple math operation on one of a few built-in variables. They may also move a word of data in or out of memory. This opcode adds two numbers together, respecting the sign bit.
  • x0 is the destination variable for the operation, where the results are written. These variables are called registers in assembly, and they live on the CPU outside of memory. There are only 31 general-purpose registers on modern ARM chips, so assembly programs spend a lot of code moving data between memory and these local registers.
  • x0, x1 are the source registers. The code that came before this command takes care of pulling the values off of the call stack (placed there by the calling function) and putting them into registers.

There are a lot of other ARM64 commands, and a lot of details needed to understand function calls. But I won’t need to dive this deep to complete my task.

I need to replace this adds command with cnt. The cnt command is actually part of the ARM64 SIMD instruction set, and uses a separate set of registers from the general-purpose x0-x31 registers. I need a three-line program: one line to move the function parameter into a SIMD register, one to perform the count, and one to move the result back to a GP register. These three lines accomplish this:

Breaking down the assembly syntax again:

  • dup duplicates an element (copies it). We are copying from a general-purpose register, x0, into a specialized SIMD register, v0.2d.
  • v0.2d is a 128-bit SIMD register. x0 is a 64-bit register, so the .2d modifier instructs the assembler to duplicate x0 and put the two 64-bit copies into v0. This isn’t necessary for our purpose, but it is useful for other algorithms.
  • cnt executes the population count, replacing the bit pattern with the number counted. For example, 01101101 would become 00000101 (101=5, because there were five bits set in the data before the operation).
  • v0.16b indicates the same register as before, except this time the SIMD register is being divided into 16 lanes, each 1-byte in size. Each of the 16 8-bit lanes inside the vector is treated as a separate register and they are all executed in parallel. Vector instructions like to perform one operation on many pieces of data in parallel, hence SIMD: Same Instruction, Multiple Data elements.
  • fmov moves data between registers again, this time putting the results back in x0. The f here actually makes this a floating-point instruction, because FP and SIMD instructions share the same registers in ARM64. But it doesn’t make this a floating-point number.
  • v0.d[1] is once again the same SIMD register, but with a different way of splitting up the data elements. The d indicates that a double-word (64 bits) is being copied out of element 1 of the SIMD register. This puts the results into our GP register.

Note: CNT only works in 8-bit chunks due to hardware limitations (more bits = more wires = less room to implement other features). If I wanted to support values larger than 255, I could add an addv command, which adds the 8-bit chunks together. There are many more ARM topics I have not explored.

The resulting assembly code reports the population count, and the Swift code displays the results to the user:

If you’re new to assembly language, it is worth noting some oddities of the assembly language environment. Assembly language gives you complete low-level control over the behavior of your CPU, but this means that understanding the details of CPU architecture is essential to writing efficient assembly. Instructions that rely on shared resources can create delays. Memory loads take several instruction cycles to complete.

Compilers are pretty good at allocating registers and scheduling instructions. It may be worth writing code in C and using the generated assembly as a starting point to see how it can be done. There is a lot more to writing effective assembly than just knowing the syntax of the language.

You can find lots of resources for learning ARM64 assembly. A crash course in ARM64 assembly might be a good place to start, and the ARMv8 Instruction Set Overview is both authoritative and exhaustive. Github member Richard Ross has written an entire iOS app in assembly.

You can practice ARM64 using QEMU, which, although it requires a lot of effort to set up, will let you analyze assembly instructions one-by-one as the execute.

Swift BigInteger?

I recently ran across the need for a large integer in Swift – something that needs more than the 64 bits provided by Int64. This is not the first time. I have played around with Swift Decimal in the past (NSDecimal, NSDecimalNumber) only to find they come up short in precisely representing a rational number. Python, for example, makes it trivial to find the Nth decimal place of any rational number by storing the repeating decimal pattern under the hood and making that info available to you.

This poor substitute of mine only supports the binary OR and POPULATION operations and has no string representation. But it worked for what I needed.

 

The Gap

So I’ve been silent here for the better part of a year, and I just wanted to say why: I made  a major career change and started a new job. I write Android apps for a tech company here in Williamsburg, VA now. The job is great, but I have been very busy. I intend to continue posting about iOS, Math, Python, AI, Databases, and now Android.

Cheers!